In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Even though it is possible to write machine learning models from scratch, this can be a tedious task.

Fortunately, many machine learning libraries are available in various languages. In this workshop we use Python and the scikit-learn library, which is widely used in both academic research and industry.


<h2> Linear Regression </h2>

The first machine learning algorithm we will use is linear regression. Linear regression has a simple model that can be written as:

$y = ax + b$, where $y$ represents the labels -- in this case continuous -- and $x$ represents the features.

The model has two parameters: $a$ and $b$. Our goal, which is one of the main goals of most machine learning algorithms, is to estimate these parameters from the data such that the model predicts the labels from the features as accurately as possible.

In [None]:
# Reading and visualizing the data
features = np.loadtxt('data/features_linear_regression.txt')[:,np.newaxis]
labels = np.loadtxt('data/labels_linear_regression.txt')
nsamples = features.size
print('Number of samples: {}'.format(nsamples))
# Plotting
plt.scatter(features, labels, color='b')
plt.grid(True)
plt.xlabel('features')
plt.ylabel('labels')
plt.show()

In [None]:
# we first import the necessary library: 
from sklearn import linear_model

In scikit-learn, machine learning models are implemented as objects. Different algorithms are created and trained using very similar syntax. Here we first create a linear regression object and then train it with the data.

To train the model we will use the data we just read from the txt files. 

In [None]:
# We create an object that can do linear regression
regr = linear_model.LinearRegression()
# We use the data to estimate its parameters with the fit function
regr.fit(features, labels)

Internally, the fit function solved the least-squares problem:

$\arg\min_{a,b} \sum_{n=1}^{N} ( y_n - (ax_n+b) )^2$, where $N$ is the number of training samples.
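
For intuition, the same least-squares problem can be solved directly with NumPy. The following is a minimal sketch on synthetic data: the true parameter values 2 and 1, the noise level, and the names `a_hat` and `b_hat` are assumptions made for illustration, and this is not necessarily how scikit-learn implements the fit function internally.

```python
import numpy as np

# Synthetic data (hypothetical, not the workshop files): y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 30.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

# Design matrix with a column of ones so that the intercept b is estimated too
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [a, b] ~= y
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
print("a = {:.3f}, b = {:.3f}".format(a_hat, b_hat))
```

The estimated values should be close to the parameters used to generate the data, but not identical, because of the noise.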

We can now look at the linear model it found by plotting the line with the estimated parameters. The learned parameters are stored in the linear regression object "regr" that we created (as "regr.coef_" and "regr.intercept_"):

In [None]:
# Plotting
plt.scatter(features, labels,color='b')
plt.grid(True)
plt.xlabel('features')
plt.ylabel('labels')
x = np.asarray([[0], [30]])
plt.plot(x, regr.predict(x), 'r', linewidth=2.5)
plt.show()

Prediction for unseen data is also performed using the object "regr". In fact, we already used its function "regr.predict" when plotting the line.

Let us read some unseen data from file, predict the labels and plot them. 

In [None]:
test_features = np.loadtxt('data/test_features_linear_regression.txt')[:,np.newaxis]
print("Test sample's features:\n {}".format(test_features))
# We use the predict function of the object to predict for a new set of samples
test_predict = regr.predict(test_features)
print("Predicted labels:\n {}".format(test_predict))

# Plotting
plt.scatter(test_features, test_predict, color='g', s=100)
plt.grid(True)
plt.xlabel('features')
plt.ylabel('labels')
x = np.asarray([[0], [30]])
plt.plot(x, regr.predict(x), 'r', linewidth=2.5)
plt.show()

Models are rarely perfect. We have already seen in the training data that there is a discrepancy between the model line and the labels. Such a discrepancy will also exist in the test set.

Let us now read the "true" labels of the test set and visualize the difference with the model predictions. 

In [None]:
test_labels = np.loadtxt('data/test_labels_linear_regression.txt')

# Plotting
plt.scatter(test_features, test_predict, color='g', s=100)
plt.scatter(test_features, test_labels, color='k', s=100)
plt.grid(True)
plt.xlabel('features')
plt.ylabel('labels')
x = np.asarray([[0], [30]])
plt.plot(x, regr.predict(x), 'r', linewidth=2.5)
plt.show()

We can also quantify the discrepancy between model prediction and "true" labels using the same cost function as we used in the training part: 

$\sum_{n=1}^{N} ( y_n - (ax_n+b) )^2$, where the sum now runs over the $N$ test samples.

In [None]:
total_test_error = np.sum((test_labels - test_predict)**2)
mean_squared_error = np.mean((test_labels - test_predict)**2)
root_mean_squared_error = np.sqrt(np.mean((test_labels - test_predict)**2))
print("Total test error: {}".format(total_test_error))
print("Mean squared error: {}".format(mean_squared_error))
print("Root mean squared error (RMSE): {}".format(root_mean_squared_error))
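
To see how the three quantities relate, here is a small worked example that can be checked by hand. The three labels and predictions are made-up values for illustration, not taken from the workshop data.

```python
import numpy as np

# Hypothetical true labels and predictions for three test samples
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residuals = y_true - y_pred          # [1, 0, -2]
sse = np.sum(residuals**2)           # total squared error: 1 + 0 + 4 = 5
mse = np.mean(residuals**2)          # mean squared error: 5 / 3
rmse = np.sqrt(mse)                  # root mean squared error

print("SSE: {:.3f}, MSE: {:.3f}, RMSE: {:.3f}".format(sse, mse, rmse))
# prints: SSE: 5.000, MSE: 1.667, RMSE: 1.291
```

The total error grows with the number of samples, whereas the mean squared error and the RMSE do not. The RMSE has the additional advantage of being in the same units as the labels.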

<h2> Exercise 2:</h2>
In this small exercise, you will
- Read features and labels from txt files and visualize: 
The feature file : "data/ex2_features_regression.txt".
The label file : "data/ex2_labels_regression.txt"

- Fit a linear regression model to the data. Please name the linear regression object differently, e.g. regr_ex

- Visualize the learned model 

- Read features of test samples from another txt file and predict labels: 
The feature file : "data/ex2_test_features_regression.txt"

- Read "true" labels of the test samples from the last file (data/ex2_test_labels_regression.txt), compare the predicted values with real labels and compute RMSE. 

In [None]:
# TODO


<h2> Binary Classification with Logistic Regression </h2>

The second task we will focus on is binary classification and we will use Logistic Regression for this task. 
Logistic regression also has a very simple model:

$y = \sigma(ax + b)$, where $y$ represents the labels -- in this case binary -- $x$ represents the features and $\sigma(\cdot)$ represents the sigmoid function

$\sigma(w) = \frac{1}{1 + e^{-w}}$

The predictions are considered as probabilities, i.e. $p(y=1|x) = \sigma(ax+b)$
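
As a quick illustration of how the sigmoid turns the linear term into a probability, consider the sketch below. The parameter values $a = 1.5$ and $b = -3$ and the function name `sigmoid` are assumptions chosen only for this example.

```python
import numpy as np

def sigmoid(w):
    """Logistic sigmoid: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-w))

# Hypothetical parameters of a one-feature model, chosen only for illustration
a, b = 1.5, -3.0
x = np.array([0.0, 2.0, 4.0])

p = sigmoid(a * x + b)   # p(y=1|x) for each sample
print(p)
```

For $x = 2$ the linear term is exactly zero, so the predicted probability is 0.5. Smaller values of $x$ give probabilities below 0.5 and larger values give probabilities above 0.5.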

In the same formulation we can also consider multiple features, e.g. $x_1$ and $x_2$. In this case the only difference is that the product $ax$ becomes a dot product between two vectors: $a\cdot x = a_1x_1 + a_2x_2$. The model becomes:

$y = \sigma(a\cdot x + b)$

Like linear regression, logistic regression has the parameters $a$ and $b$. When there are multiple features, $a$ is a vector with one parameter per feature.
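
The multi-feature model can be sketched in a few lines of NumPy. The parameter values, the three samples, and the 0.5 threshold used to turn probabilities into labels are assumptions made for illustration.

```python
import numpy as np

def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

# Hypothetical parameters: one weight per feature plus an intercept
a = np.array([1.0, -2.0])
b = 0.5

# Three samples with two features each (rows are samples)
X = np.array([[ 1.0, 0.0],
              [ 0.0, 1.0],
              [-0.5, 0.0]])

scores = X @ a + b                 # a . x + b for every row: [1.5, -1.5, 0.0]
probs = sigmoid(scores)            # p(y=1|x) for every row
pred = (probs > 0.5).astype(int)   # assumed decision rule: predict 1 if p > 0.5
print(scores, probs, pred)
```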

Let us focus on a specific dataset: 

In [None]:
# Reading data
features = pd.read_csv('data/features_linear_classification.csv')
labels = pd.read_csv('data/labels_linear_classification.csv')

# Plotting
pos_rows = labels['0'] > 0
neg_rows = labels['0'] <= 0
plt.plot(features.feature1[pos_rows],features.feature2[pos_rows],'+',markersize=10,mew=2)
plt.plot(features.feature1[neg_rows],features.feature2[neg_rows],'_',markersize=10,mew=2)
plt.grid(True)
plt.xlabel('feature1', fontsize=16)
plt.ylabel('feature2', fontsize=16)
plt.show()

# converting Pandas data frame into numpy arrays: 
features = features.values
labels = labels['0'].values

Logistic regression is in the same module as linear regression in the scikit-learn package. We directly create the necessary object and train the model with the available data: 

In [None]:
# We create an object that can do logistic regression
clas = linear_model.LogisticRegression()
# We use the data to estimate its parameters with the fit function
clas.fit(features, labels)

There are several things to note here. 

First, creating and training the logistic regression object is done in exactly the same way as for linear regression. This holds for almost all models in the scikit-learn package. More importantly, it also holds conceptually for almost all machine learning algorithms: once you have the data, you determine the model parameters that best predict the labels from the features in the training data.
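
The shared interface can be sketched as follows, fitting both model types on the same toy features. The synthetic data and the names `reg` and `clf` are assumptions made for illustration.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)

# Hypothetical toy data: 40 samples with 2 features each
X = rng.normal(size=(40, 2))
y_cont = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5   # continuous labels
y_bin = (y_cont > 0).astype(int)                # binary labels (0 or 1)

# Both models are created, trained and used with the same three steps
reg = linear_model.LinearRegression()
reg.fit(X, y_cont)
print("Regression predictions:", reg.predict(X[:3]))

clf = linear_model.LogisticRegression()
clf.fit(X, y_bin)
print("Classification predictions:", clf.predict(X[:3]))
```

Only the type of labels and the type of predictions differ: continuous values for the regression model and class labels for the classifier.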

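The differences are under the hood. The models differ, as we have seen earlier, and so do the cost functions. The main cost function minimized for logistic regression is the cross-entropy (by default, scikit-learn's LogisticRegression also adds an L2 regularization term on $a$):

$\arg\min_{a,b} -\sum_{n=1}^{N} \left[ y_n \ln \hat{y}_n + (1 - y_n)\ln (1 - \hat{y}_n) \right]$, where $\hat{y}_n=\sigma(a\cdot x_n + b)$, the labels are coded as $y_n \in \{0, 1\}$, and $N$ is the number of training samples.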
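
To get a feeling for the cross-entropy, the sketch below evaluates it for a handful of made-up labels and predicted probabilities. It uses the usual sign convention, in which the cross-entropy is the negative of the summed log-probabilities and therefore a non-negative quantity to be minimized. All values are assumptions for illustration.

```python
import numpy as np

# Hypothetical labels (coded as 0/1) and predicted probabilities p(y=1|x)
y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.6])

# Cross-entropy: negative log-likelihood of the labels under the predictions
cross_entropy = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print("Cross-entropy: {:.4f}".format(cross_entropy))

# A confident but wrong prediction for the third sample is penalized heavily
y_hat_bad = np.array([0.9, 0.2, 0.01])
cross_entropy_bad = -np.sum(y * np.log(y_hat_bad) + (1 - y) * np.log(1 - y_hat_bad))
print("Cross-entropy with one confident mistake: {:.4f}".format(cross_entropy_bad))
```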


We can now also visualize the learned model. To do this we will look at the decision boundary that was learned.

In [None]:
# Plotting
pos_rows = labels > 0
neg_rows = labels <= 0
plt.plot(features[pos_rows,0],features[pos_rows,1],'+',markersize=10,mew=2)
plt.plot(features[neg_rows,0],features[neg_rows,1],'_',markersize=10,mew=2)
plt.grid(True)
plt.xlabel('feature1', fontsize=16)
plt.ylabel('feature2', fontsize=16)

x = np.asarray([[-20], [25]])
# The coefficients a are stored in clas.coef_ and the intercept b in clas.intercept_.
# The decision boundary is where a1*x1 + a2*x2 + b = 0, i.e. x2 = -(a1/a2)*x1 - b/a2
m = -clas.coef_[0,0] / clas.coef_[0,1]
b = -clas.intercept_[0] / clas.coef_[0,1]
plt.plot(x[:,0], m*x[:,0] + b, 'r--', linewidth=2)
plt.show()
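
The decision boundary is the set of points where the model is undecided, i.e. where $p(y=1|x) = 0.5$, which happens exactly when $a_1x_1 + a_2x_2 + b = 0$. Solving for $x_2$ gives the line $x_2 = -(a_1/a_2)x_1 - b/a_2$. The sketch below checks this numerically with made-up parameter values. In the workshop code the parameters would come from "clas.coef_" and "clas.intercept_".

```python
import numpy as np

def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

# Hypothetical learned parameters, chosen only for illustration
a = np.array([1.0, 2.0])
b = -4.0

# Boundary: a[0]*x1 + a[1]*x2 + b = 0  =>  x2 = -(a[0]/a[1])*x1 - b/a[1]
slope = -a[0] / a[1]
offset = -b / a[1]

x1 = np.array([-20.0, 0.0, 25.0])
x2 = slope * x1 + offset

# Every point on the boundary has predicted probability 0.5
points = np.column_stack([x1, x2])
print(sigmoid(points @ a + b))
```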

As before, let us now read the features of the test samples and predict their labels.

In [None]:
test_features = np.loadtxt('data/test_features_linear_classification.txt')
print("Test sample's features:\n {}".format(test_features))
# We use the predict function of the object to predict for a new set of samples
test_predict = clas.predict(test_features)
print("Predicted labels:\n {}".format(test_predict))

# Plotting
x = np.asarray([[-20], [25]])
# Decision boundary: x2 = -(a1/a2)*x1 - b/a2
m = -clas.coef_[0,0] / clas.coef_[0,1]
b = -clas.intercept_[0] / clas.coef_[0,1]
plt.plot(x[:,0], m*x[:,0] + b, 'r--', linewidth=2)
plt.grid(True)
plt.xlabel('feature1', fontsize=16)
plt.ylabel('feature2', fontsize=16)
plt.plot(test_features[:,0],test_features[:,1],'p',markersize=10,mew=2)
plt.show()

There is an important point to notice: prediction is performed in exactly the same way as in the linear regression case, by calling the predict function of the trained object.

As in the linear regression case -- and as in most cases -- models are not perfect and will make errors when predicting. We can see this by looking at the "true" labels of the test samples. Let us visualize the errors first and then quantify them in terms of "classification accuracy".

In [None]:
test_labels = np.loadtxt('data/test_labels_linear_classification.txt')
# Plotting
x = np.asarray([[-20], [25]])
# Decision boundary: x2 = -(a1/a2)*x1 - b/a2
m = -clas.coef_[0,0] / clas.coef_[0,1]
b = -clas.intercept_[0] / clas.coef_[0,1]
plt.plot(x[:,0], m*x[:,0] + b, 'r--', linewidth=2)
plt.grid(True)
plt.xlabel('feature1', fontsize=16)
plt.ylabel('feature2', fontsize=16)
# getting the indices of the correct and wrong predictions by comparing with the true labels: 
correct_predictions = np.where(test_labels == test_predict)[0]
wrong_predictions = np.where(test_labels != test_predict)[0]
plt.plot(test_features[correct_predictions,0],test_features[correct_predictions,1],'pg',markersize=10,mew=2)
plt.plot(test_features[wrong_predictions,0],test_features[wrong_predictions,1],'pr',markersize=10,mew=2)
plt.show()

We see that there are three test samples where logistic regression made the wrong prediction. We can quantify this using different measures:

In [None]:
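n_test = test_features.shape[0]
clas_accuracy = np.sum(test_predict == test_labels) / n_test
# Rates are normalized by the size of the relevant true class:
# false positives by the number of negative samples, false negatives by the number of positive samples
test_neg = test_labels <= 0
test_pos = test_labels > 0
clas_fpr = np.sum(test_predict[test_neg] > test_labels[test_neg]) / np.sum(test_neg)
clas_fnr = np.sum(test_predict[test_pos] < test_labels[test_pos]) / np.sum(test_pos)
print('Classification accuracy: {}'.format(clas_accuracy))
print('False positive rate: {}'.format(clas_fpr))
print('False negative rate: {}'.format(clas_fnr))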
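
These quantities are easiest to understand through the four possible outcomes of a binary prediction. The sketch below counts them for six made-up samples and uses the common convention in which the false positive rate is normalized by the number of truly negative samples and the false negative rate by the number of truly positive samples. The labels, their 0/1 coding, and the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical true labels and predictions, coded as 0 (negative) and 1 (positive)
y_true = np.array([1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

accuracy = (tp + tn) / y_true.size
fpr = fp / (fp + tn)   # fraction of negative samples predicted as positive
fnr = fn / (fn + tp)   # fraction of positive samples predicted as negative

print("TP={}, TN={}, FP={}, FN={}".format(tp, tn, fp, fn))
print("Accuracy: {:.3f}, FPR: {:.3f}, FNR: {:.3f}".format(accuracy, fpr, fnr))
```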

<h2> Exercise 3:</h2>
In this small exercise, you will
- Read features and labels from txt files and visualize: 
The feature file : "data/ex3_features_classification.txt".
The label file : "data/ex3_labels_classification.txt"

- Fit a logistic regression model to the data. Please name the classification object differently, e.g. clas_ex

- Visualize the boundary - here you can directly copy the form we used above.

- Read features of test samples from another txt file and predict labels: 
The feature file : "data/ex3_test_features_classification.txt"

- Read "true" labels of the test samples from the last file (data/ex3_test_labels_classification.txt), compare the predicted values with real labels and compute classification accuracy and false positive rate. 

In [None]:
# TODO
