In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
def VisualizeBinaryClassificationData(features, labels):
    # Visualizes binary classification data.
    # Only the first two features are shown, on a 2D Cartesian grid.
    # Samples with label = 1 are drawn as "+" and those with label = 0 as "-".
    plt.figure()
    pos_rows = labels > 0
    neg_rows = labels <= 0
    plt.plot(features[pos_rows,0],features[pos_rows,1],'+',markersize=10,mew=2)
    plt.plot(features[neg_rows,0],features[neg_rows,1],'_',markersize=10,mew=2)
    plt.grid(True)
    plt.xlabel('feature1',fontsize=16)
    plt.ylabel('feature2',fontsize=16)
    plt.show()

So far, we have only seen regression of continuous labels with linear models and binary classification with logistic regression. In this section, we first look at other models for the binary classification problem.

Let us start once more by reading training data and visualizing the samples. 

<b>Note:</b> For the problem we have been working on so far, the feature dimension is only two. This allows for easy visualization of the samples and the class separation. In more complicated problems with a higher number of feature dimensions, such visualization may not be feasible. In these cases we might have to use basic one-dimensional histograms and visualize the data one feature at a time.
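
As a rough sketch of the univariate approach mentioned in the note, the cell below computes one histogram per class for each feature. The data is synthetic and only stands in for a higher-dimensional problem, and the helper name class_histograms is made up for this illustration.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the course data):
# 200 samples with 5 features, where only feature 0 separates the classes.
rng = np.random.default_rng(0)
demo_labels = rng.integers(0, 2, size=200)
demo_features = rng.normal(size=(200, 5))
demo_features[:, 0] += 4.0 * demo_labels  # shift class 1 along feature 0

def class_histograms(features, labels, feature_index, bins=10):
    """Return shared bin edges and per-class counts for one feature."""
    column = features[:, feature_index]
    # Use the same bin edges for both classes so the counts are comparable
    edges = np.histogram_bin_edges(column, bins=bins)
    pos_counts, _ = np.histogram(column[labels > 0], bins=edges)
    neg_counts, _ = np.histogram(column[labels <= 0], bins=edges)
    return edges, pos_counts, neg_counts

for j in range(demo_features.shape[1]):
    edges, pos_counts, neg_counts = class_histograms(demo_features, demo_labels, j)
    print("feature", j, "label=1:", pos_counts, "label=0:", neg_counts)
```

A feature whose two histograms barely overlap is likely to be useful for separating the classes. To draw the histograms instead of printing the counts, plt.hist can be called on the same two subsets of the column.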

In [None]:
# Reading data
features = pd.read_csv('data/features_linear_classification.csv')
labels = pd.read_csv('data/labels_linear_classification.csv')

# converting Pandas data frame into numpy arrays: 
features = features.values
labels = labels['0'].values

# Plotting
VisualizeBinaryClassificationData(features, labels)

In the previous section we have seen how to apply logistic regression. Let us apply it again for the sake of comparison. We will also see another way to visualize the classification results: plotting decision boundaries.

In [None]:
# We import the necessary modules to perform logistic regression
from sklearn import linear_model
# We create an object that can do logistic regression
logr = linear_model.LogisticRegression()
# We use the data to estimate its parameters with the fit function
logr.fit(features, labels)

In [None]:
def VisBinClassDecisionBoundaries(model, xlims=(-30,30), ylims=(-30,30), h=0.05, features=None,
                                  labels=None, alpha=0.25, title=None):
    # Visualizes decision boundaries using color plots.
    # model can be any fitted scikit-learn classifier trained on two features.
    
    # creating meshgrid for different values of features
    xx, yy = np.meshgrid(np.arange(xlims[0], xlims[1], h), np.arange(ylims[0], ylims[1], h))
    # extracting predictions at different points in the mesh
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # plotting the mesh
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired, alpha=alpha)
    plt.grid()
    
    # if the samples are given plot them on the same plot
    if (features is not None) and (labels is not None): 
        pos_rows = labels > 0
        neg_rows = labels <= 0
        plt.plot(features[pos_rows,0],features[pos_rows,1],'k+',markersize=10,mew=2)
        plt.plot(features[neg_rows,0],features[neg_rows,1],'r_',markersize=10,mew=2)
        plt.grid(True)

    plt.xlabel('feature1',fontsize=16)
    plt.ylabel('feature2',fontsize=16)
    if title is not None: 
        plt.title(title, fontsize=16)
    plt.show() 

In [None]:
VisBinClassDecisionBoundaries(logr, features=features, labels=labels, title='Logistic Regression')

This plot shows the regions where the classifier will assign label = 1 and label = 0. 
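
For logistic regression the two regions are separated by a straight line, because the model assigns label = 1 exactly where the linear score w . x + b is positive. The sketch below checks this on synthetic data (an assumption, not the course data); the names demo_logr, w and b are chosen only for this illustration.

```python
import numpy as np
from sklearn import linear_model

# Synthetic two-feature data (an assumption, not the course data)
rng = np.random.default_rng(0)
demo_features = np.vstack([rng.normal(-5.0, 2.0, size=(50, 2)),
                           rng.normal(5.0, 2.0, size=(50, 2))])
demo_labels = np.concatenate([np.zeros(50), np.ones(50)])

demo_logr = linear_model.LogisticRegression()
demo_logr.fit(demo_features, demo_labels)

# For a linear model the two regions meet where w . x + b = 0
w = demo_logr.coef_[0]
b = demo_logr.intercept_[0]

# Apply the rule "label 1 where the score is positive" by hand
test_points = rng.uniform(-30, 30, size=(1000, 2))
scores = test_points @ w + b
manual_prediction = (scores > 0).astype(float)
agreement = np.mean(manual_prediction == demo_logr.predict(test_points))
print("weights:", w, "intercept:", b)
print("fraction of points where the manual rule matches predict():", agreement)
```

The same check does not carry over to the non-linear models below, whose boundaries are generally not straight lines.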

Let us examine different classifier models and see their behavior. 

<h3> Support Vector Machines (SVM) - linear </h3>

In [None]:
from sklearn import svm
svml = svm.SVC(kernel='linear')
svml.fit(features, labels)
VisBinClassDecisionBoundaries(svml, features=features, labels=labels, title='SVM Linear')

<h3> Support Vector Machines (SVM) - polynomial kernel </h3>

In [None]:
svmp = svm.SVC(kernel='poly')
svmp.fit(features, labels)
VisBinClassDecisionBoundaries(svmp, features=features, labels=labels, title='SVM Polynomial')

<h3> Decision Trees </h3>

In [None]:
from sklearn import tree
dtree = tree.DecisionTreeClassifier()
dtree.fit(features,labels)
VisBinClassDecisionBoundaries(dtree, features=features, labels=labels, title='Decision Tree')

<h3> Random Decision Forests </h3>

In [None]:
from sklearn.ensemble import RandomForestClassifier
randfor = RandomForestClassifier(n_estimators=50)
randfor.fit(features,labels)
VisBinClassDecisionBoundaries(randfor, features=features, labels=labels, title='Random Forest')

<h3> Neural Networks with 1 hidden layer</h3>

In [None]:
from sklearn.neural_network import MLPClassifier
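# hidden_layer_sizes is a tuple with one entry per hidden layer; (2,) means one layer with 2 nodes
nn1 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(2,), random_state=1)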
nn1.fit(features, labels)
VisBinClassDecisionBoundaries(nn1, features=features, labels=labels, title='Neural Network - 1 HL')

<h3> Neural Networks with 2 hidden layers </h3>

In [None]:
nn2 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(3,3), random_state=1)
nn2.fit(features, labels)
VisBinClassDecisionBoundaries(nn2, features=features, labels=labels, title='Neural Network - 2 HL')

<h2> Exercise 4:</h2>
In this small exercise, you will apply different classification models to the data in "data/ex4_features_classification.txt" and "data/ex4_labels_classification.txt". The goal is to use the data to train different models, visualize decision boundaries and observe how things change with different parameter settings. Please use the functions we already used above. 

To this end, choose a model, look up its description in the scikit-learn documentation and choose a set of parameters. Modify the parameters and observe how the classifier's decision boundary changes.

As an example, you can choose a random forest model and change the number of estimators (number of trees, n_estimators) and the depth of the trees (max_depth). These parameters are explained in the scikit-learn documentation of the model.

Another example can be to choose a neural network model and change the number of layers and number of nodes in the hidden layers. 

<b>Note:</b> The files are txt files. Please read them with np.loadtxt and not with pd.read_csv.

<b>Note :</b> If you choose to work with SVM try the 'rbf' kernel as well.
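
As a starting point, here is a minimal sketch of the workflow: read txt data with np.loadtxt, then fit the same model with several values of one parameter. The in-memory strings are hypothetical stand-ins for the exercise files, whose actual layout may differ, and the gamma values are arbitrary choices for illustration.

```python
import io
import numpy as np
from sklearn import svm

# Hypothetical in-memory stand-ins for the exercise files; the real files may
# use a different layout (for example another delimiter).
features_txt = io.StringIO("0.0 0.0\n1.0 1.0\n0.0 1.0\n1.0 0.0\n")
labels_txt = io.StringIO("0\n0\n1\n1\n")

ex_features = np.loadtxt(features_txt)   # shape (n_samples, n_features)
ex_labels = np.loadtxt(labels_txt)       # shape (n_samples,)
print(ex_features.shape, ex_labels.shape)

# Fit the same model type with several values of one parameter
for gamma in [0.1, 1.0, 10.0]:
    model = svm.SVC(kernel='rbf', gamma=gamma)
    model.fit(ex_features, ex_labels)
    print("gamma =", gamma, "training accuracy:", model.score(ex_features, ex_labels))
```

With the real data, calling VisBinClassDecisionBoundaries inside the loop shows how the boundary changes with each setting. The xlims and ylims arguments may need adjusting to the range of the exercise data.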

In [None]:
# TODO


<h3> Different models - Regression example </h3> 

Just as in the classification task, you can also use different methods for the regression task. Let us quickly look at a regression problem and try decision trees. Specifically, let us revisit the data from the regression exercise you worked on earlier today.

In [None]:
# Reading and visualizing the data
features = np.loadtxt('data/ex2_features_regression.txt')[:,np.newaxis]
labels = np.loadtxt('data/ex2_labels_regression.txt')

# Plotting
plt.scatter(features, labels, color='b')
plt.grid(True)
plt.xlabel('features')
plt.ylabel('labels')

# We create an object that can do decision tree regression
rtree = tree.DecisionTreeRegressor()
# We use the data to estimate its parameters with the fit function
rtree.fit(features, labels)

# Plotting
x = np.arange(0, 30., 0.05)[:,np.newaxis]
plt.plot(x, rtree.predict(x), 'r', linewidth=2.5)

# Reading features of the test samples and predicting with the learned model: 
test_features = np.loadtxt('data/ex2_test_features_regression.txt')[:,np.newaxis]
print("Test sample's features:\n {}".format(test_features))
test_predict = rtree.predict(test_features)
print("Predicted labels:\n {}".format(test_predict))

# Reading true labels and computing RMSE
test_labels = np.loadtxt('data/ex2_test_labels_regression.txt')
root_mean_squared_error = np.sqrt(np.mean((test_labels - test_predict)**2))
print("Root mean squared error (RMSE): {}".format(root_mean_squared_error))

# Plotting
plt.scatter(test_features, test_predict, color='g', s=100)
plt.scatter(test_features, test_labels, color='k', s=100)
plt.show() # showing everything on the screen

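Note that the model captures the non-linearity in the data. It is possibly not the cleanest model, however: the fitted curve is piecewise constant and "wiggles" quite a bit, which suggests that the tree is also following noise in the training samples.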
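
One way to reduce the wiggling is to limit the depth of the tree with the max_depth parameter of DecisionTreeRegressor. The sketch below compares an unrestricted tree with a shallow one by counting how many distinct values each predicts. It uses synthetic data (an assumption, not the exercise data), and max_depth=3 is an arbitrary choice.

```python
import numpy as np
from sklearn import tree

# Synthetic noisy 1D data (an assumption, not the exercise data)
rng = np.random.default_rng(0)
demo_x = np.sort(rng.uniform(0, 30, size=100))[:, np.newaxis]
demo_y = np.sin(demo_x[:, 0] / 5.0) + rng.normal(scale=0.2, size=100)

deep_tree = tree.DecisionTreeRegressor()                # no depth limit
shallow_tree = tree.DecisionTreeRegressor(max_depth=3)  # at most 2**3 = 8 leaves
deep_tree.fit(demo_x, demo_y)
shallow_tree.fit(demo_x, demo_y)

# Each leaf predicts one constant, so fewer leaves means a coarser, smoother curve
grid = np.arange(0, 30., 0.05)[:, np.newaxis]
deep_levels = np.unique(deep_tree.predict(grid)).size
shallow_levels = np.unique(shallow_tree.predict(grid)).size
print("distinct predicted values, unrestricted:", deep_levels)
print("distinct predicted values, max_depth=3:", shallow_levels)
```

A shallower tree is smoother but may miss real structure in the data, so it is worth comparing the test RMSE for a few depths rather than trusting one setting.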

<h3> Exercise 5 - optional:</h3>
Try out different regression models on the same data. One option is a random forest regressor.

In [None]:
# TODO
