***
<h2> <u>Goals of this notebook</u> </h2>

* Now that we understand how to load and visualize data, let's fit some simple machine learning models to data.
* Specifically, we will look at one model for regression: '**Linear regression**' and one model for binary classification: '**Logistic regression**'.

***
<h2> <u>What am I supposed to do?</u> </h2>

* As in the first notebook, the code in the first few cells is already written.
* Go through this code and make sure you understand it. Feel free to insert new code cells in between and print intermediate values to better understand what is going on.
* Towards the end, some blocks are left empty for you to fill in.

***
<h2> <u>Some tips</u> </h2>

* The `print` function is your friend. Once you read anything into a variable `v`, you can call `print(v)` to see the contents of `v`. Use this to understand the data inside `v`. Further, depending on the type of `v`, you can call `print(v.shape)`, `print(len(v))` or `print(v.size)`. Use `print` extensively to understand the provided code as well as to debug your own!
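
* As a minimal sketch of this tip, the cell below builds a small NumPy array (the values are made up purely for illustration) and inspects it with `print`:

```python
import numpy as np

# A small example array (hypothetical data, only for illustration)
v = np.arange(6).reshape(3, 2)

print(v)        # the contents of v
print(v.shape)  # number of rows and columns
print(len(v))   # length of the first dimension (number of rows)
print(v.size)   # total number of elements
```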

***
***

<h2> <u>Import required modules</u> </h2>


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

***
***

Even though it is possible to write machine learning models from scratch, this can be a tedious task. Fortunately, there are many machine learning libraries available in various languages. In this workshop, we use Python and the scikit-learn library, which is widely used in both academic research and industry.


<a href="https://scikit-learn.org/stable/modules/classes.html#">scikit-learn API</a>

***
***

<h2> <u>Linear Regression</u> </h2>

* The first machine learning algorithm we will use is linear regression - that is, we will use a linear model to do a regression task.
* Linear regression has a simple model that can be written as: $y = ax + b$, where $y$ represents the labels (in this case, continuous) and $x$ represents the inputs / features.
* The model has two parameters: $a$ and $b$. Our goal (which is one of the main goals in most machine learning algorithms) is to estimate these parameters using "training data", such that the model is able to predict labels from features of "test data" as accurately as possible.

***
* First, we will read and visualize the data as we did in the 01a notebook.

<h2> <u>Mount Google drive folder</u> </h2>


In [None]:
# Mount Google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd /content/drive/My Drive/ML_workshop

In [None]:
ls

In [None]:
# read and visualize the data
features = np.loadtxt('machine_learning/data/features_linear_regression.txt')[:,np.newaxis]
labels = np.loadtxt('machine_learning/data/labels_linear_regression.txt')
nsamples = features.shape[0]
print('Number of samples: {}'.format(nsamples))
print(features.shape)

# plot
plt.scatter(features, labels, color='b')
plt.grid(True)
plt.xlabel('features')
plt.ylabel('labels')
plt.show()


***
* Next, we import the necessary module, containing the machine learning models

In [None]:
from sklearn import linear_model

***
* In scikit-learn, machine learning tools are designed as objects. Different algorithms are created and trained using similar syntax.

* In the following cell, we first create a linear regression object. 

* Next, we train it using the training data read from the txt files.

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression">LinearRegression documentation</a>


In [None]:
# first, we create an object that can do linear regression
regr = linear_model.LinearRegression()

# now, we train the model - that is, we use the training data to estimate the model parameters.
regr.fit(features, labels)

***
* Wow, that was fast!
* Internally, the fit function solves the following optimization problem:
$\arg\min_{a,b} \sum_{n=1}^{N} ( y_n - (ax_n+b) )^2$
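
* To make this optimization problem concrete, here is a minimal sketch that solves the same least-squares problem directly with NumPy. The data are synthetic and the names `demo_x`, `demo_y`, `a_hat` and `b_hat` are made up for this illustration; it is not necessarily how scikit-learn implements `fit` internally.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
demo_x = np.linspace(0, 30, 50)
demo_y = 2.0 * demo_x + 1.0 + rng.normal(scale=0.5, size=demo_x.size)

# Design matrix with a column of ones, so that A @ [a, b] equals a*x + b
A = np.column_stack([demo_x, np.ones_like(demo_x)])

# Least-squares solution: the [a, b] minimizing sum_n (y_n - (a*x_n + b))^2
(a_hat, b_hat), *_ = np.linalg.lstsq(A, demo_y, rcond=None)
print("a = {:.3f}, b = {:.3f}".format(a_hat, b_hat))
```

* The estimated values should come out close to the slope and intercept used to generate the data.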

***
* We can now look at the optimal linear model by plotting the line with the determined parameters.
* The learned parameters are saved in the linear regression object we created, "regr":

In [None]:
# plot the trained model
plt.scatter(features, labels,color='b')
plt.grid(True)
plt.xlabel('features')
plt.ylabel('labels')
x = np.asarray([[0], [30]])
plt.plot(x, regr.predict(x), 'r', linewidth=2.5)
plt.show()
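
***
* The learned parameters can also be read directly from a fitted object through its `coef_` and `intercept_` attributes. The sketch below uses synthetic data and the made-up names `demo_features`, `demo_labels` and `demo_regr`, so that it does not overwrite the variables of this notebook:

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (hypothetical): points exactly on the line y = 3x - 2
demo_features = np.arange(10, dtype=float)[:, np.newaxis]
demo_labels = 3.0 * demo_features[:, 0] - 2.0

demo_regr = linear_model.LinearRegression()
demo_regr.fit(demo_features, demo_labels)

# coef_ holds the slope a (one entry per feature), intercept_ holds b
print("a =", demo_regr.coef_[0])
print("b =", demo_regr.intercept_)
```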

***
* The red line in the above plot indicates the trained linear model, while the blue points indicate the training data points. Would you say that the model does a good job of modeling the pattern in the training data?

***
* Next, let's see how good the model is at predicting the labels for new test points that were not seen during training. After all, this is the behaviour that matters most from a practical point of view!

* Prediction for unseen data is performed using the object "regr" as well. (In fact, we already used this function, "regr.predict", when plotting the line.)

* First, let us read some unseen data from file, predict the labels and plot them. 

In [None]:
# read test data inputs / features
test_features = np.loadtxt('machine_learning/data/test_features_linear_regression.txt')[:,np.newaxis]
print("Test sample's features:\n {}".format(test_features))

# use the predict function of the object to predict for a new set of samples
test_predict = regr.predict(test_features)
print("Predicted labels:\n {}".format(test_predict))

# plot the predicted labels for the test features
plt.scatter(test_features, test_predict, color='g', s=100)
plt.grid(True)
plt.xlabel('features')
plt.ylabel('labels')
x = np.asarray([[0], [30]])
plt.plot(x, regr.predict(x), 'r', linewidth=2.5)
plt.show()

***
* Models are often not perfect.
* We have already seen in the training data that there is a discrepancy between the model line and the labels of the data.
* Such a discrepancy will also exist in the test set. 
* Let us now read the "true" labels of the test set and visualize the difference with the model predictions. 

In [None]:
# read the true labels of the test data
test_labels = np.loadtxt('machine_learning/data/test_labels_linear_regression.txt')

# plot the true labels along with the predicted labels to visualize their discrepancy
plt.scatter(test_features, test_predict, color='g', s=100)
plt.scatter(test_features, test_labels, color='k', s=100)
plt.grid(True)
plt.xlabel('features')
plt.ylabel('labels')
x = np.asarray([[0], [30]])
plt.plot(x, regr.predict(x), 'r', linewidth=2.5)
plt.show()

***
* We can also quantify the discrepancy between the model predictions and the "true" labels using the same cost function as in the training part, now summed over the $N$ test samples:
$\sum_{n=1}^{N} ( y_n - (ax_n+b) )^2$

In [None]:
total_test_error = np.sum((test_labels - test_predict)**2)
mean_squared_error = np.mean((test_labels - test_predict)**2)
root_mean_squared_error = np.sqrt(np.mean((test_labels - test_predict)**2))
print("Total test error: {}".format(total_test_error))
print("Mean squared error: {}".format(mean_squared_error))
print("Root mean squared error (RMSE): {}".format(root_mean_squared_error))
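
***
* As a cross-check, scikit-learn provides the same mean squared error in `sklearn.metrics`. The sketch below compares it with the NumPy expression on a few made-up values (`demo_true` and `demo_pred` are hypothetical). The function is imported under the alias `sk_mse` so that it does not clash with the variable `mean_squared_error` defined above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error as sk_mse

# Hypothetical true labels and predictions, only for illustration
demo_true = np.array([1.0, 2.0, 3.0, 4.0])
demo_pred = np.array([1.5, 2.0, 2.5, 5.0])

mse_numpy = np.mean((demo_true - demo_pred) ** 2)
mse_sklearn = sk_mse(demo_true, demo_pred)
rmse = np.sqrt(mse_sklearn)

print("MSE (NumPy): {}".format(mse_numpy))
print("MSE (scikit-learn): {}".format(mse_sklearn))
print("RMSE: {}".format(rmse))
```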

***
* Phew! So finally we reach the end of the toy linear regression example. To summarize, we did the following steps:
   * read the training features and their corresponding labels.
   * visualized the training data.
   * created an instance of the machine learning model to be fit to the training data.
   * solved an optimization problem to fit the model to the training data.
   * visualized the trained model and its performance on the training data.
   * read the features of the test data and used the trained model to predict labels for these new test features, which were not part of the training dataset.
   * compared the predictions with the true labels of the test features.
***
* If you would like to, scroll back to the top of the notebook and ensure that you understand where and how each of these steps is being done.

***
***
<h2> <u> Exercise 2:</u></h2>

* In this exercise, you will
  * Read new training data: features (from data/ex2_features_regression.txt) and labels (from data/ex2_labels_regression.txt), and visualize them.
  * Fit a linear regression model to the training data. Please give the linear regression object a different name, e.g. regr_ex.
  * Visualize the trained model.
  * Read the features of the test samples (from data/ex2_test_features_regression.txt) and predict their labels.
  * Read the "true" labels of the test samples (from data/ex2_test_labels_regression.txt), compare the predicted values with the true labels and compute the RMSE.

In [None]:
# TODO


* Would you say that the linear model does a good job in this new regression task that you solved in the exercise?
* If not, how else could the data be modelled? With a quadratic function, perhaps? Which parts of the code would have to change in order to do this? You need not implement this, but it would be instructive to figure out which parts of the code need to change.
***
***
***

<h2> <u>Binary Classification with Logistic Regression</u> </h2>

* Okay, let's now digress from regression!
* The second task we will focus on is binary classification.
* We will use Logistic Regression for this task. 

***
* Logistic regression also has a simple model: $y = \sigma(ax + b)$, where $y$ represents the labels (in this case, binary), $x$ represents the features and $\sigma(\cdot)$ represents the sigmoid function: $\sigma(w) = \frac{1}{1 + e^{-w}}$. 

* The output of the <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid function</a> lies strictly between 0 and 1, i.e. in the open interval (0, 1). Thus, the predictions are interpreted as probabilities, i.e. $p(y=1|x) = \sigma(ax+b)$.
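
* A minimal sketch of the sigmoid function in NumPy, evaluated at a few made-up inputs, shows how large negative inputs map close to 0, an input of 0 maps to 0.5, and large positive inputs map close to 1 (the name `demo_w` is chosen only for this illustration):

```python
import numpy as np

def sigmoid(w):
    """Sigmoid function: 1 / (1 + exp(-w))."""
    return 1.0 / (1.0 + np.exp(-w))

# Hypothetical inputs, only for illustration
demo_w = np.array([-5.0, 0.0, 5.0])
print(sigmoid(demo_w))
```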

***
* In the same formulation we can also consider the case where each input data point has multiple features, e.g. $x_1$ and $x_2$.
* In this case, the only difference is that the product $ax$ becomes a dot product: $a\cdot x = a_1x_1 + a_2x_2$. Accordingly, the model becomes: $y = \sigma(a\cdot x + b)$.
* Logistic regression also has two parameters, $a$ and $b$. The parameter $a$ is now a vector with one entry per feature of a data point.
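
* As a small worked example of the two-feature model, the sketch below evaluates $\sigma(a\cdot x + b)$ for one data point. The parameter values `demo_a`, `demo_b` and the point `demo_point` are made up for illustration; they are not learned from any data.

```python
import numpy as np

def sigmoid(w):
    """Sigmoid function: 1 / (1 + exp(-w))."""
    return 1.0 / (1.0 + np.exp(-w))

# Hypothetical parameters: one weight per feature, plus an offset
demo_a = np.array([0.5, -1.0])
demo_b = 0.25

# One data point with two features (x_1, x_2)
demo_point = np.array([2.0, 1.0])

# a . x = a_1*x_1 + a_2*x_2, then add b and apply the sigmoid
demo_p = sigmoid(np.dot(demo_a, demo_point) + demo_b)
print("p(y=1|x) = {:.4f}".format(demo_p))
```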

***
* Let us focus on a specific dataset: 

In [None]:
# read features and corresponding labels
features = pd.read_csv('machine_learning/data/features_linear_classification.csv')
labels = pd.read_csv('machine_learning/data/labels_linear_classification.csv')

# visualize the read data
pos_rows = labels['0'] > 0
neg_rows = labels['0'] <= 0
plt.plot(features.feature1[pos_rows],features.feature2[pos_rows],'+',markersize=10,mew=2)
plt.plot(features.feature1[neg_rows],features.feature2[neg_rows],'_',markersize=10,mew=2)
plt.grid(True)
plt.xlabel('feature1',fontsize=16)
plt.ylabel('feature2',fontsize=16)
plt.show()

# convert Pandas dataframe into numpy arrays: 
features = features.values
labels = labels['0'].values

***
* Logistic regression is in the same module as linear regression in the scikit-learn package.
* We create the necessary object and train the model with the available data.

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression">LogisticRegression documentation</a>

In [None]:
# We create an object that can do logistic regression
clas = linear_model.LogisticRegression()
# We use the data to estimate its parameters with the fit function
clas.fit(features, labels)

***
* There are several things to note here: 

  * First, the creation of the logistic regression object and training is done exactly the same way as in linear regression. This extends to almost all models in the scikit-learn package. **More importantly, this also extends to almost all machine learning algorithms conceptually. Once you have the data, you determine the parameters of the model that best predicts labels from features in the training data.**

  * The differences are under the hood:
    * Models differ: so far, we have only seen linear models. There are other more complex models, as we will see in the next notebook.
    * Cost functions (which are optimized to obtain the model parameters) differ.
   
***
* The main cost function that is minimized for logistic regression is the **cross-entropy**:

$\arg\min_{a,b} -\sum_{n=1}^{N} \left[ y_n \ln \hat{y}_n + (1 - y_n)\ln (1 - \hat{y}_n) \right]$, where $\hat{y}_n=\sigma(a\cdot x_n + b)$.
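
* To get a feeling for this cost, the sketch below computes the cross-entropy, written with the usual leading minus sign so that smaller values are better, for two sets of made-up predicted probabilities. It assumes labels in {0, 1}; the function name `cross_entropy` and the arrays are hypothetical.

```python
import numpy as np

def cross_entropy(y_true, y_prob):
    """Sum over samples of -[y ln(p) + (1 - y) ln(1 - p)], for y in {0, 1}."""
    eps = 1e-12  # avoid log(0) for probabilities of exactly 0 or 1
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Hypothetical labels and predicted probabilities p(y=1|x)
demo_y = np.array([1, 0, 1, 0])
good_prob = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
bad_prob = np.array([0.4, 0.6, 0.3, 0.7])   # leaning the wrong way

print("Cross-entropy (good predictions): {:.4f}".format(cross_entropy(demo_y, good_prob)))
print("Cross-entropy (bad predictions): {:.4f}".format(cross_entropy(demo_y, bad_prob)))
```

* Predictions that agree with the labels give a lower cost than predictions that lean towards the wrong class.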

***
* Let's visualize the learned model. To do this, we will look at the decision boundary between the two categories. 


In [None]:
# plot the training data
pos_rows = labels > 0
neg_rows = labels <= 0
plt.plot(features[pos_rows,0],features[pos_rows,1],'+',markersize=10,mew=2)
plt.plot(features[neg_rows,0],features[neg_rows,1],'_',markersize=10,mew=2)
plt.grid(True)
plt.xlabel('feature1',fontsize=16)
plt.ylabel('feature2',fontsize=16)

# overlay the decision boundary
x = np.asarray([[-20], [25]])
# the decision boundary is where a1*x1 + a2*x2 + b = 0, i.e. x2 = -(a1/a2)*x1 - b/a2;
# the coefficients a and the intercept b are saved in the "clas" object
m = clas.coef_[0,0] / clas.coef_[0,1]
b = -clas.intercept_[0] / clas.coef_[0,1]
plt.plot(x[:,0], b - m*x[:,0], 'r--', linewidth=2)
plt.show()
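
***
* The decision boundary is the set of points where the model is undecided, i.e. where $a_1x_1 + a_2x_2 + b = 0$. Solving for $x_2$ gives the line $x_2 = -\frac{a_1}{a_2}x_1 - \frac{b}{a_2}$. The sketch below checks this on synthetic data: points on that line should have a decision value of (numerically) zero. All `demo_` names and the data are made up for this illustration.

```python
import numpy as np
from sklearn import linear_model

# Synthetic two-class data (hypothetical): two Gaussian blobs
rng = np.random.default_rng(0)
demo_pos = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))
demo_neg = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(50, 2))
demo_features = np.vstack([demo_pos, demo_neg])
demo_labels = np.concatenate([np.ones(50), np.zeros(50)])

demo_clas = linear_model.LogisticRegression()
demo_clas.fit(demo_features, demo_labels)

# Learned parameters: a = (a1, a2) and offset b
a1, a2 = demo_clas.coef_[0]
b0 = demo_clas.intercept_[0]

# Points on the line x2 = -(a1/a2)*x1 - b/a2
x1 = np.array([-20.0, 25.0])
x2 = -(a1 / a2) * x1 - b0 / a2
boundary_points = np.column_stack([x1, x2])

# decision_function returns a.x + b, which should vanish on the boundary
print(demo_clas.decision_function(boundary_points))
```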

***
* As before, let us now read features of test samples and perform prediction.

In [None]:
# read test features
test_features = np.loadtxt('machine_learning/data/test_features_linear_classification.txt')
print("Test sample's features:\n {}".format(test_features))

# use the predict function of the object to predict for a new set of samples
test_predict = clas.predict(test_features)
print("Predicted labels:\n {}".format(test_predict))

# plot
x = np.asarray([[-20], [25]])
# decision boundary: x2 = -(a1/a2)*x1 - b/a2
m = clas.coef_[0,0] / clas.coef_[0,1]
b = -clas.intercept_[0] / clas.coef_[0,1]
plt.plot(x[:,0], b - m*x[:,0], 'r--', linewidth=2)
plt.grid(True)
plt.xlabel('feature1',fontsize=16)
plt.ylabel('feature2',fontsize=16)
plt.plot(test_features[:,0],test_features[:,1],'p',markersize=10,mew=2)
plt.show()
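
***
* Besides hard labels, a fitted logistic regression object can also return the predicted probabilities through `predict_proba`. The sketch below uses a tiny synthetic one-feature dataset (all `demo_` names and values are made up) and shows that the predicted label is the class with the highest probability.

```python
import numpy as np
from sklearn import linear_model

# Tiny synthetic dataset (hypothetical): one feature, two classes
demo_features = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
demo_labels = np.array([0, 0, 0, 1, 1, 1])

demo_clas = linear_model.LogisticRegression()
demo_clas.fit(demo_features, demo_labels)

demo_test = np.array([[-2.5], [0.5], [2.5]])

# One row per test sample, one column per class (ordered as in demo_clas.classes_)
demo_proba = demo_clas.predict_proba(demo_test)
demo_pred = demo_clas.predict(demo_test)

print("Class order: {}".format(demo_clas.classes_))
print("Predicted probabilities:\n {}".format(demo_proba))
print("Predicted labels: {}".format(demo_pred))
```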

***
* There is an important point to notice: **Prediction is performed in the same way as in the linear regression case.**

* As in the linear regression case (and as in most cases), models are not perfect and will make errors when predicting.

* We can visualize this by looking at the "true" labels of the test samples.

* Let us visualize this first and then quantify the error in terms of "classification accuracy".

In [None]:
# read true labels
test_labels = np.loadtxt('machine_learning/data/test_labels_linear_classification.txt')

# plot
x = np.asarray([[-20], [25]])
# decision boundary: x2 = -(a1/a2)*x1 - b/a2
m = clas.coef_[0,0] / clas.coef_[0,1]
b = -clas.intercept_[0] / clas.coef_[0,1]
plt.plot(x[:,0], b - m*x[:,0], 'r--', linewidth=2)
plt.grid(True)
plt.xlabel('feature1',fontsize=16)
plt.ylabel('feature2',fontsize=16)

# get the indices of the correct and wrong predictions by comparing with the true labels: 
correct_predictions = np.where(test_labels == test_predict)[0]
wrong_predictions = np.where(test_labels != test_predict)[0]
plt.plot(test_features[correct_predictions,0],test_features[correct_predictions,1],'pg',markersize=10,mew=2)
plt.plot(test_features[wrong_predictions,0],test_features[wrong_predictions,1],'pr',markersize=10,mew=2)
plt.show()

***
* In the above plot, the test points that were correctly classified are shown in green, while the red points indicate the misclassified test points.
* We see that there are three test samples where logistic regression made the wrong prediction.
* We can quantify this using different quantities: 

In [None]:
n_test = test_labels.shape[0]
class_accuracy = np.sum(test_predict == test_labels) / n_test
# false positives and false negatives, each as a fraction of all test samples
class_fps = np.sum(test_predict > test_labels) / n_test
class_fns = np.sum(test_predict < test_labels) / n_test
print('Classification accuracy: {}'.format(class_accuracy))
print('False positives (fraction of test samples): {}'.format(class_fps))
print('False negatives (fraction of test samples): {}'.format(class_fns))
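
***
* A common convention defines the false positive rate as the number of false positives divided by the number of truly negative samples, and the false negative rate as the number of false negatives divided by the number of truly positive samples. This differs from dividing by the total number of test samples. The sketch below computes these rates with `confusion_matrix` on made-up labels (`demo_true` and `demo_pred` are hypothetical, and labels in {0, 1} are assumed).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, only for illustration
demo_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
demo_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels ordered as [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(demo_true, demo_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
false_positive_rate = fp / (fp + tn)  # among truly negative samples
false_negative_rate = fn / (fn + tp)  # among truly positive samples

print('Classification accuracy: {}'.format(accuracy))
print('False positive rate: {}'.format(false_positive_rate))
print('False negative rate: {}'.format(false_negative_rate))
```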