# Exercises - Machine Learning for logistic regression

In [4]:
import numpy as np
import matplotlib.pyplot as plt

## Sonar dataset

In this exercise, you will solve a binary classification problem using logistic regression. Each sample has 60 features corresponding to sonar measurements, and a binary label that indicates whether the sample is a rock (0) or a mine (1).

In [5]:
def sigmoid(x):
    return 1/(1 + np.exp(-x))

In [6]:
data = np.loadtxt("data/logistic_regression/sonar.csv", delimiter = ",")
X = data[:, :-1]
y = data[:, -1]

In [7]:
# Shuffle the rows of the dataset in place.
# Note: X and y above are views of `data`, so they are shuffled as well.
np.random.shuffle(data)
num_features = X.shape[1]
X_shuffled = data[:,:num_features]
y_shuffled = data[:,num_features]

1. Standardize each feature of the shuffled dataset using its mean and standard deviation computed over the samples. Split the dataset into training (80%) and test (20%) sets.

2. Use the _LogisticRegression_ class from the example (toy dataset) to build and train a model. Adjust the number of epochs and learning rate. Plot the cost vs epochs.

3. Count the misclassifications (number of wrong classifications) on the training and the test sets. Compute the accuracy (number of correct classifications / number of samples) on the training and the test sets.
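
Steps 1 and 3 above can be sketched as follows. This is only a minimal illustration on synthetic data, not a solution tied to the sonar file: the helper names `standardize`, `train_test_split`, `misclassifications` and `accuracy` are hypothetical, and the demo array shape is an assumption. The statistics are computed over all samples, as the exercise asks; computing them on the training set only is a common alternative.

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    return (X - mean) / std

def train_test_split(X, y, train_fraction=0.8):
    """Split already-shuffled data into a training part and a test part."""
    n_train = int(train_fraction * X.shape[0])
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]

def misclassifications(y_true, y_pred):
    """Number of samples whose predicted label differs from the true label."""
    return int(np.sum(np.ravel(y_true) != np.ravel(y_pred)))

def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return 1.0 - misclassifications(y_true, y_pred) / np.size(y_true)

# Synthetic stand-in for the shuffled data (assumption, not the real file).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=3.0, scale=2.0, size=(208, 60))
y_demo = rng.integers(0, 2, size=208).astype(float)

X_std = standardize(X_demo)
X_tr, y_tr, X_te, y_te = train_test_split(X_std, y_demo)
print(X_tr.shape, X_te.shape)
```

Predicted labels for `accuracy` would typically come from thresholding the model's sigmoid output at 0.5.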

## Microchips dataset

In this exercise, you will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance. Suppose that you have the results of two tests for some microchips. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.

In [8]:
# Load the dataset
data = np.loadtxt("data/logistic_regression/microchip.txt", delimiter = ",")
X = data[:, :-1]
y = data[:, -1]

In [9]:
# Auxiliary functions
def sigmoid(x):
    return 1/(1 + np.exp(-x))

def mapFeature(X1, X2):
    degree = 6
    out = np.ones((len(X1),1))
    for i in range(1,degree+1):
        for j in range(0,i+1):
            prod = (X1**(i-j))*(X2**j)
            out = np.append(out, prod.reshape(-1,1), axis=1)
    return out

def plotDecisionBoundary(model):
    x1s = np.linspace(-1.,1.,50)
    x2s = np.linspace(-1.,1.,50)
    xs, ys = np.meshgrid(x1s, x2s, indexing='ij')
    zs = np.zeros((len(x1s),len(x2s)))
    # 28 = number of columns returned by mapFeature for degree 6
    Xs_ = np.zeros((len(x1s), len(x2s), 28))
    for i in range(len(xs)):
        Xs_[i,:,:] = mapFeature(xs[i,:], ys[i,:])
        zs[i,:] = (Xs_[i,:,:] @ model.weights).ravel()

    plt.contour(xs, ys, zs, [0.])
    plt.scatter(X[:,0], X[:,1], c=y)
    plt.show()
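
The value 28 used in `plotDecisionBoundary` follows from the feature mapping: a polynomial expansion of two variables up to degree `d` has `(d + 1)(d + 2) / 2` monomials, including the constant term. The sketch below checks this with a hypothetical variant `map_feature` that collects the columns in a list instead of calling `np.append` repeatedly; it is meant as an illustration, not a replacement for the function above.

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """Polynomial features x1^(i-j) * x2^j for 0 <= j <= i <= degree."""
    x1 = np.asarray(x1, dtype=float).ravel()
    x2 = np.asarray(x2, dtype=float).ravel()
    columns = [np.ones_like(x1)]  # constant (bias) column
    for i in range(1, degree + 1):
        for j in range(i + 1):
            columns.append(x1 ** (i - j) * x2 ** j)
    return np.column_stack(columns)

def num_mapped_features(degree):
    """Number of monomials in two variables up to the given degree."""
    return (degree + 1) * (degree + 2) // 2

demo = map_feature([0.5, -1.0], [2.0, 3.0])
print(demo.shape, num_mapped_features(6))
```

The column order matches `mapFeature`: the constant, then `x1`, `x2`, `x1**2`, `x1*x2`, `x2**2`, and so on.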

In [10]:
# Extend the features of the dataset up to sixth powers
X_ = mapFeature(X[:,0], X[:,1])
num_features = X_.shape[1]

# Split the dataset into training, validation and test sets (0.8-0.1-0.1)
X_train = X_[:94,:]
y_train = y[:94]
X_val = X_[94:106,:]
y_val = y[94:106]
X_test = X_[106:,:]
y_test = y[106:]

y_train = y_train.reshape(-1,1)
y_val = y_val.reshape(-1,1)
y_test = y_test.reshape(-1,1)

1. Plot the dataset.

2. Implement the _LogisticRegression_ class with Tikhonov regularization. Add a method called `accuracy` that computes the prediction accuracy on a given dataset.

3. Train a Logistic Regression model on the training set for 10000 epochs with `reg_param = 0.` and `learning_rate = 10.` and evaluate its accuracy on the training, validation and test sets.

4. Plot the cost (training) vs epochs and the decision boundary (use the auxiliary function defined above).

5. Perform a grid search to find the "best" regularization parameter in the interval (0.,2.), while keeping fixed the learning rate and the number of epochs. Re-train the model using the best value and the training+validation set.

6. Plot the decision boundary of the fitted model and evaluate the accuracy on the test set.
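
One possible shape for steps 2, 3 and 5 is sketched below. It is a hedged illustration, not the class from the toy-dataset example: the class name `RegularizedLogisticRegression`, the function `grid_search`, the constructor signature, and the choice not to penalize the bias weight are all assumptions. Only the `weights` attribute (a column vector) and the `accuracy` method are taken from the text above. The demo uses synthetic data, and the candidate grid of five values is arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class RegularizedLogisticRegression:
    """Hypothetical sketch of logistic regression with an L2 (Tikhonov) penalty."""

    def __init__(self, num_features, learning_rate=0.1, reg_param=0.0):
        self.weights = np.zeros((num_features, 1))
        self.learning_rate = learning_rate
        self.reg_param = reg_param

    def predict_proba(self, X):
        return sigmoid(X @ self.weights)

    def cost(self, X, y):
        p = np.clip(self.predict_proba(X), 1e-12, 1 - 1e-12)  # avoid log(0)
        cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        # Assumption: the bias weight (index 0, column of ones) is not penalized.
        penalty = self.reg_param / (2 * len(y)) * np.sum(self.weights[1:] ** 2)
        return cross_entropy + penalty

    def fit(self, X, y, epochs=1000):
        """Batch gradient descent; returns the cost after each epoch."""
        costs = []
        for _ in range(epochs):
            grad = X.T @ (self.predict_proba(X) - y) / len(y)
            grad[1:] += self.reg_param / len(y) * self.weights[1:]
            self.weights -= self.learning_rate * grad
            costs.append(self.cost(X, y))
        return costs

    def accuracy(self, X, y):
        predictions = (self.predict_proba(X) >= 0.5).astype(float)
        return float(np.mean(predictions == y))

def grid_search(X_train, y_train, X_val, y_val, reg_params, learning_rate, epochs):
    """Return the candidate with the highest validation accuracy."""
    best_param, best_acc = None, -1.0
    for reg_param in reg_params:
        model = RegularizedLogisticRegression(X_train.shape[1], learning_rate, reg_param)
        model.fit(X_train, y_train, epochs)
        acc = model.accuracy(X_val, y_val)
        if acc > best_acc:
            best_param, best_acc = reg_param, acc
    return best_param, best_acc

# Synthetic, linearly separable stand-in for the real data (assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
X_demo = np.hstack([np.ones((200, 1)), features])  # column of ones = bias
y_demo = (features.sum(axis=1) > 0).astype(float).reshape(-1, 1)

X_tr, y_tr = X_demo[:160], y_demo[:160]
X_val, y_val = X_demo[160:180], y_demo[160:180]
X_te, y_te = X_demo[180:], y_demo[180:]

model = RegularizedLogisticRegression(3, learning_rate=0.5, reg_param=0.0)
costs = model.fit(X_tr, y_tr, epochs=200)

reg_params = np.linspace(0.0, 2.0, 5)
best_param, best_acc = grid_search(X_tr, y_tr, X_val, y_val, reg_params, 0.5, 200)

# Re-train on training + validation data with the selected parameter.
final_model = RegularizedLogisticRegression(3, learning_rate=0.5, reg_param=best_param)
final_model.fit(np.vstack([X_tr, X_val]), np.vstack([y_tr, y_val]), epochs=200)
print(best_param, final_model.accuracy(X_te, y_te))
```

Selecting the parameter on the validation set and reporting accuracy only on the untouched test set keeps the final estimate independent of the tuning step. Ties in validation accuracy are resolved here in favor of the smallest candidate, which is a design choice rather than a requirement.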