***
<h2> <u>Goals of this notebook</u> </h2>

* So far, we looked at linear models for regression and classification.
* In this notebook, we consider **more complicated models**!

*** 
<h2> <u>What am I supposed to do?</u> </h2>

* **The code in all the cells in this notebook is already written!**
* So sit back and relax! Simply go through the notebook, execute the cells and try to understand what is going on. Feel free to insert new code cells in between and print stuff in order to better understand what is going on.

***
<h2> <u>Import required modules</u> </h2>


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

<h2> <u>Mount Google drive folder</u> </h2>





In [None]:
# Mount Google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd /content/drive/My Drive/ML_workshop

In [None]:
ls

***
Let us first define some helper functions.

Function for data visualization: 

In [None]:
def VisualizeBinaryClassificationData(features, labels):
    # function to visualize binary classification data. 
    # function only visualizes the first two features in 2D cartesian grid. 
    # Samples with label = 1 are depicted with "+" and those with label = 0 with "-"
    plt.figure()
    pos_rows = labels > 0
    neg_rows = labels <= 0
    plt.plot(features[pos_rows,0],features[pos_rows,1],'+',markersize=10,mew=2)
    plt.plot(features[neg_rows,0],features[neg_rows,1],'_',markersize=10,mew=2)
    plt.grid(True)
    plt.xlabel('feature1',fontsize=16), plt.ylabel('feature2',fontsize=16)
    plt.show()

***
Function for visualization of decision boundaries: 

In [None]:
def VisBinClassDecisionBoundaries(model, xlims=[-30,30], ylims=[-30,30], h=0.05, features=None, 
                                  labels=None, alpha=0.25, title=None):
    # function visualizes decision boundaries using color plots
    # model is the classification model that can be any model in the scikit-learn package
    
    # creating meshgrid for different values of features
    xx, yy = np.meshgrid(np.arange(xlims[0], xlims[1], h), np.arange(ylims[0], ylims[1], h))
    # extracting predictions at different points in the mesh
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # plotting the mesh
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired, alpha=alpha)
    plt.grid(True)
    
    # if the samples are given plot them on the same plot
    if (features is not None) and (labels is not None): 
        pos_rows = labels > 0
        neg_rows = labels <= 0
        plt.plot(features[pos_rows,0],features[pos_rows,1],'k+',markersize=10,mew=2)
        plt.plot(features[neg_rows,0],features[neg_rows,1],'r_',markersize=10,mew=2)
        plt.grid(True)
        
    plt.xlabel('feature1',fontsize=16), plt.ylabel('feature2',fontsize=16)
    if title is not None: 
        plt.title(title, fontsize=16)
    plt.show() 

***
* So far, we have only seen regression of continuous labels with linear models and binary classification with logistic regression.

* In this section, we will first see different models for the binary classification problem. 

* Let us start once more by reading training data and visualizing the samples. 

* **Note:** For the problem we have been working on so far, the feature dimension is only two. This allows for easy visualization of the samples and the class separation. In more complicated problems with a higher number of feature dimensions, such visualization may not be feasible. In these cases, we might have to use basic one-dimensional histograms and visualize the data one feature at a time.
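
As a rough illustration of the univariate approach mentioned in the note, the sketch below draws one histogram per feature with the two classes overlaid. The data is synthetic, and the names `demo_features`, `demo_labels` and `plot_feature_histograms` are hypothetical; they are not part of the workshop material.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (an assumption): 200 samples, 5 features, binary labels
rng = np.random.default_rng(0)
demo_labels = rng.integers(0, 2, size=200)
# shift the samples with label = 1 so that the classes differ in every feature
demo_features = rng.normal(size=(200, 5)) + demo_labels[:, None]

def plot_feature_histograms(features, labels, bins=20):
    # hypothetical helper: one subplot per feature, both classes overlaid
    num_features = features.shape[1]
    fig, axes = plt.subplots(1, num_features,
                             figsize=(3 * num_features, 3), squeeze=False)
    for j, ax in enumerate(axes[0]):
        ax.hist(features[labels > 0, j], bins=bins, alpha=0.5, label='label = 1')
        ax.hist(features[labels <= 0, j], bins=bins, alpha=0.5, label='label = 0')
        ax.set_xlabel('feature{}'.format(j + 1))
    axes[0, 0].legend()
    return fig

# in a notebook the figure is displayed at the end of the cell
fig = plot_feature_histograms(demo_features, demo_labels)
```

Features whose two histograms barely overlap are likely to be informative on their own; strongly overlapping histograms do not rule out that a feature is useful in combination with others.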

In [None]:
# Reading data
features = pd.read_csv('machine_learning/data/features_linear_classification.csv')
labels = pd.read_csv('machine_learning/data/labels_linear_classification.csv')

# converting Pandas data frame into numpy arrays: 
features = features.values
labels = labels['0'].values

# Plotting
VisualizeBinaryClassificationData(features, labels)

***
* In the previous notebook, we have seen how to apply logistic regression. Let us apply it again for the sake of comparison.

In [None]:
# We import the necessary modules to perform logistic regression
from sklearn import linear_model

# We create an object that can do logistic regression
logr = linear_model.LogisticRegression()

# We use the data to estimate its parameters with the fit function
logr.fit(features, labels)

# visualize the decision boundary of the logistic regression model
VisBinClassDecisionBoundaries(logr, features=features, labels=labels, title='Logistic Regression')

***
* This plot shows the regions where the logistic regression classifier will assign label = 1 and label = 0. 
* Note that this is a linear model and therefore, the decision boundary is linear.
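
To see why the boundary is a straight line, the sketch below reads the learned weights from a fitted model and applies the rule "predict label = 1 when $w \cdot x + b > 0$" by hand. It uses synthetic data and a separate model object (`demo_logr`, a hypothetical name) so that it does not interfere with the cells of this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption): two features, noisy linear rule
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
demo_labels = (demo_features[:, 0] + demo_features[:, 1] + noise > 0).astype(int)

demo_logr = LogisticRegression()
demo_logr.fit(demo_features, demo_labels)

# learned parameters: one weight per feature and one intercept
w = demo_logr.coef_[0]
b = demo_logr.intercept_[0]

# manual decision rule: label = 1 on one side of the line w.x + b = 0
scores = demo_features @ w + b
manual_preds = (scores > 0).astype(int)

agrees = np.array_equal(manual_preds, demo_logr.predict(demo_features))
print("w = {}, b = {}".format(w, b))
print("Manual rule matches predict(): {}".format(agrees))
```

The set of points where $w \cdot x + b = 0$ is a line in two dimensions, which is the boundary between the two colored regions in the plot.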


* Now, let us examine other classifier models with different complexities and see the corresponding decision boundaries.
* Some of the models are optimized in a stochastic manner, so they may give slightly different results when run multiple times (unless `random_state` is fixed, as in the neural network cells below). You can check this behaviour by running each cell two or three times in succession.
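
The effect of the random seed can be checked directly. The following sketch, on synthetic data with hypothetical names, trains a random forest twice with the same `random_state` and once with a different one, and compares the predictions on a set of random query points.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption): label depends on the sign of x1 * x2
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(300, 2))
demo_labels = (demo_features[:, 0] * demo_features[:, 1] > 0).astype(int)

# random query points at which the fitted models are compared
query_points = rng.uniform(-3, 3, size=(2000, 2))

def fit_and_predict(seed):
    # hypothetical helper: train a forest with the given seed and predict
    forest = RandomForestClassifier(n_estimators=50, random_state=seed)
    forest.fit(demo_features, demo_labels)
    return forest.predict(query_points)

same_seed = np.array_equal(fit_and_predict(1), fit_and_predict(1))
num_diff = int(np.sum(fit_and_predict(1) != fit_and_predict(2)))

print("Identical predictions with the same seed: {}".format(same_seed))
print("Query points predicted differently with another seed: {}".format(num_diff))
```

Fixing `random_state` makes a run reproducible; leaving it unset is what produces the run-to-run variation described above.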

***

<h3> Support Vector Machines (SVM) - linear </h3>

In [None]:
from sklearn import svm
svml = svm.SVC(kernel='linear')
svml.fit(features, labels)
VisBinClassDecisionBoundaries(svml, features=features, labels=labels, title='SVM Linear')

<h3> Support Vector Machines (SVM) - polynomial kernel </h3>

In [None]:
svmp = svm.SVC(kernel='poly')
svmp.fit(features, labels)
VisBinClassDecisionBoundaries(svmp, features=features, labels=labels, title='SVM Polynomial')

<h3> Decision Trees </h3>

In [None]:
from sklearn import tree
dtree = tree.DecisionTreeClassifier()
dtree.fit(features,labels)
VisBinClassDecisionBoundaries(dtree, features=features, labels=labels, title='Decision Tree')

<h3> Random Decision Forests </h3>

In [None]:
from sklearn.ensemble import RandomForestClassifier
randfor = RandomForestClassifier(n_estimators=50)
randfor.fit(features,labels)
VisBinClassDecisionBoundaries(randfor, features=features, labels=labels, title='Random Forest')

<h3> Neural Network with 1 hidden layer</h3>

In [None]:
from sklearn.neural_network import MLPClassifier
# (2,) is a one-element tuple: a single hidden layer with 2 units
nn1 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(2,), random_state=1)
nn1.fit(features, labels)
VisBinClassDecisionBoundaries(nn1, features=features, labels=labels, title='Neural Network - 1 hidden layer')

<h3> Neural Network with 5 hidden layers </h3>

In [None]:
nn2 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(30,30,30,30,30), random_state=1)
nn2.fit(features, labels)
VisBinClassDecisionBoundaries(nn2, features=features, labels=labels, title='Neural Network - 5 hidden layers')

***

* In a practical situation, the selection of which model to use is a design choice. 
* In general, simpler models are less likely to reach a very small training error. Their advantage is that they are also less likely to 'overfit' to the training data, and hence more likely to generalize to unseen test data. The general rule of thumb is to follow the spirit of <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam's razor</a>: that is, to choose the simplest model that provides satisfactory performance.
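
The trade-off between training error and overfitting can be made concrete with a small experiment. The sketch below is an illustration on synthetic, noisy data (all names are hypothetical): it compares a very simple decision tree with an unrestricted one on the training half and on a held-out half of the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption): linear rule corrupted by noise
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(400, 2))
noise = rng.normal(scale=1.0, size=400)
demo_labels = (demo_features[:, 0] + demo_features[:, 1] + noise > 0).astype(int)

# hold out half of the samples to measure performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    demo_features, demo_labels, test_size=0.5, random_state=0)

results = {}
for name, max_depth in [("depth-1 tree", 1), ("unrestricted tree", None)]:
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    # store (training accuracy, held-out accuracy) for each model
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print("{}: training accuracy = {:.2f}, held-out accuracy = {:.2f}".format(
        name, *results[name]))
```

The unrestricted tree memorizes the training half, noise included, so its training accuracy says little about how it behaves on the held-out half.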

<h2> Cross validation </h2>

Keeping two or three separate datasets (training, test and, if needed, validation) may not be feasible in applications where the available data is limited. In these cases, splitting an already small sample into training and test sets can leave very few training samples and cause problems in the estimation of the model parameters. 

To address small sample sizes, an imperfect but widely used approach is cross-validation. In cross-validation, the data is divided into several partitions (folds); with $K$ partitions, the procedure is called $K$-fold cross-validation. The model is trained and its prediction accuracy estimated $K$ times. Each time, $K-1$ partitions of the dataset are used for training and the remaining partition is used to compute the prediction accuracy. In the next round, another partition is set aside. At the end, each sample has been used exactly once as a test sample. The final generalization accuracy is computed as the average over the individual runs. 
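
To make the partitioning concrete, here is a minimal NumPy-only sketch of the index bookkeeping for $K = 5$ folds on 10 samples. It only builds and prints the index sets; no model is trained, and the variable names are hypothetical.

```python
import numpy as np

num_samples = 10
K = 5

# shuffle the sample indices and cut them into K partitions
rng = np.random.default_rng(0)
shuffled = rng.permutation(num_samples)
folds = np.array_split(shuffled, K)

# count how often each sample ends up in a test partition
test_counts = np.zeros(num_samples, dtype=int)

for k in range(K):
    test_idx = folds[k]
    # all remaining K-1 partitions together form the training set
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    test_counts[test_idx] += 1
    # a model would be fit on train_idx and evaluated on test_idx here
    print("Round {}: train on {}, test on {}".format(
        k, np.sort(train_idx), np.sort(test_idx)))

print("Times each sample was used for testing: {}".format(test_counts))
```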

In scikit-learn this is already implemented. Let us see how it is used. 

In [None]:
# Reading data
training_features = np.loadtxt('machine_learning/data/train_genacc_features.txt')
training_labels = np.loadtxt('machine_learning/data/train_genacc_labels.txt')

# Plotting
VisualizeBinaryClassificationData(training_features, training_labels)


In [None]:
# We import the necessary modules to perform logistic regression
from sklearn import linear_model
# We create an object that can do logistic regression
logr = linear_model.LogisticRegression()
# We use the data to estimate its parameters with the fit function
logr.fit(training_features, training_labels)

# Import the necessary models for MLP classifier
from sklearn.neural_network import MLPClassifier
# Create an object that will do the classification
nn2 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(3,3), random_state=1)
# We use the training data to estimate the parameters. 
nn2.fit(training_features, training_labels)


In [None]:
# import the required class to perform 5-fold stratified cross-validation.
# In stratified K-fold cross-validation, the class proportions in each fold
# are (approximately) the same as in the entire dataset.
from sklearn.model_selection import StratifiedKFold
# creating an object to create partitions for the 5 fold cross validation
skf = StratifiedKFold(n_splits=5)

# in this for loop we go over different partitions. 
# "skf" object produces indices of samples for different folds
for trainind, testind in skf.split(training_features, training_labels):
    print("Training indices: {}".format(trainind))
    print("Test indices: {}".format(testind))
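
The stratification described in the comments above can be checked on a small example. The sketch below uses made-up labels with 30% positive samples (an assumption chosen so that the counts divide evenly) and prints the fraction of positive samples in the test partition of each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels (an assumption): 70 samples of class 0, 30 of class 1
demo_labels = np.array([0] * 70 + [1] * 30)
# placeholder features; only the labels matter for how the folds are built
demo_features = np.zeros((100, 2))

demo_skf = StratifiedKFold(n_splits=5)

fold_ratios = []
for demo_trainind, demo_testind in demo_skf.split(demo_features, demo_labels):
    # fraction of label = 1 samples in the test partition of this fold
    fold_ratios.append(np.mean(demo_labels[demo_testind]))

print("Fraction of label = 1 in the whole dataset: {}".format(np.mean(demo_labels)))
print("Fraction of label = 1 in each test fold: {}".format(fold_ratios))
```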


Let us now run the entire cross-validation experiment to estimate generalization accuracy and compare it with the training accuracy we computed above:

In [None]:
# import the required class to perform 5-fold stratified cross-validation.
# In stratified K-fold cross-validation, the class proportions in each fold
# are (approximately) the same as in the entire dataset.
from sklearn.model_selection import StratifiedKFold

# import the required function to compute classification accuracy
from sklearn.metrics import accuracy_score

# creating an object to create partitions for the 5 fold cross validation
numFolds = 5
skf = StratifiedKFold(n_splits=numFolds)

# creating a vector to hold accuracies of different folds: 
acc_vec_logr = np.zeros(numFolds)
acc_vec_nn2  = np.zeros(numFolds)
# in this for loop we go over different partitions. 
n = 0
for trainind, testind in skf.split(training_features, training_labels):
    # training both classification models using the training partitions of the dataset. 
    logr.fit(training_features[trainind,:], training_labels[trainind])
    nn2.fit(training_features[trainind,:], training_labels[trainind])
    
    # predictions in the test partition of each fold
    preds_cv_logr = logr.predict(training_features[testind,:])
    preds_cv_nn2 = nn2.predict(training_features[testind,:])
    
    # computing accuracy for the test partitions
    acc_vec_logr[n] = accuracy_score(training_labels[testind], preds_cv_logr)
    acc_vec_nn2[n]  = accuracy_score(training_labels[testind], preds_cv_nn2)
    n += 1

print("Accuracies at different folds:")
print("=============================")
print("Logistic Regression: {}".format(acc_vec_logr))
print("Neural Networks with 2 HL: {}".format(acc_vec_nn2))
print("\n")
print("Generalization accuracy estimates:")
print("=============================")
print("Logistic Regression: {}".format(np.mean(acc_vec_logr)))
print("Neural Networks with 2 HL: {}".format(np.mean(acc_vec_nn2)))

<b>Note:</b> Comparing the generalization accuracy estimated with 5-fold cross-validation to the training set accuracy shows the difference between these two approaches. 5-fold cross-validation is not a perfect estimation technique either; however, this time it gives an estimate much closer to the accuracy on the separate test set than the prediction accuracy on the training set does. 

<b>Note:</b> Cross-validation is an experiment to estimate generalization accuracy. It is a strategy to make the best use of the available data. When creating the final model, the learning algorithm is trained with all available training data and then shipped. 
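
A possible way to put this last note into practice is sketched below, on synthetic data and with hypothetical names. scikit-learn's `cross_val_score` runs the cross-validation loop in one call and returns one accuracy per fold; the final model is then fit once more on all available samples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption): noisy linear rule in two features
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
demo_labels = (demo_features[:, 0] + demo_features[:, 1] + noise > 0).astype(int)

final_model = LogisticRegression()

# Step 1: estimate generalization accuracy with 5-fold stratified CV.
# cross_val_score trains internal copies; final_model itself stays unfitted.
cv_scores = cross_val_score(final_model, demo_features, demo_labels,
                            cv=StratifiedKFold(n_splits=5), scoring='accuracy')
print("Accuracy per fold: {}".format(cv_scores))
print("Generalization accuracy estimate: {}".format(np.mean(cv_scores)))

# Step 2: train the model that would actually be shipped on all the data
final_model.fit(demo_features, demo_labels)
```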