## Exercise 9: Classification with Scikit-Learn

In [3]:
import numpy as np
import pandas as pd

### Task 1: No Free Lunch

In this task we consider the Breast Cancer Wisconsin dataset, which is included in the ```sklearn.datasets``` module. The aim is to classify whether a tumor is malignant or benign, using features computed from a digitized image of a breast mass sample. More information on the dataset can be found below; feel free to explore the dataset further yourself.

In [4]:
from sklearn import datasets
dta = datasets.load_breast_cancer()
print(dta.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [5]:
X = dta.data
y = dta.target
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

We now want to apply three different classifiers in sklearn, namely a logistic regression classifier ```LogisticRegression```, a support vector classifier ```SVC```, and a Nearest Neighbors classifier ```KNeighborsClassifier```, to perform classification on this dataset. The corresponding algorithms are imported in the cell below.

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

#### a) Applying Default Settings

Split the data into a training and a test set, using a relative test set size of 30%. Afterwards, fit each of the three algorithms with their default parameters on the training data and check their accuracy and AUC score on the test data. Which algorithm appears to work best?

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

In [33]:
#LogisticRegression
clf = LogisticRegression(fit_intercept=True, max_iter=10000).fit(X_train, y_train)
print(clf.coef_)
print(clf.intercept_)

[[ 0.76985545  0.21492188 -0.37055614  0.02851438 -0.15509722 -0.14215331
  -0.41688855 -0.25143529 -0.20250415 -0.01826685 -0.00260859  1.55957726
   0.25416689 -0.11050661 -0.02400645  0.0906373  -0.00404184 -0.02743262
  -0.02139792  0.01723522  0.32367358 -0.49731579 -0.08725047 -0.01377542
  -0.30869984 -0.3439006  -1.04332174 -0.46895181 -0.58820337 -0.0594195 ]]
[31.65698928]


In [34]:
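#Model evaluation
# Note: for 0/1 labels the MSE equals the misclassification rate (1 - accuracy);
# R^2 is a regression metric and is not the AUC score asked for in the task.
from sklearn.metrics import mean_squared_error, r2_score
y_pred = clf.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))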

0.017543859649122806
0.9235469448584203


In [40]:
#SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
print(svm_model.get_params())

{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}


In [42]:
#SVM evaluation
y_pred = svm_model.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))

0.03508771929824561
0.8470938897168405


In [44]:
#K-nearest neighbors
Knei_model = KNeighborsClassifier()
Knei_model.fit(X_train, y_train)

KNeighborsClassifier()

In [46]:
#K-nearest neighbors evaluation
y_pred = Knei_model.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))

0.05263157894736842
0.7706408345752609
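
The task asks for accuracy and AUC, whereas the cells above report MSE and $R^2$. The following is a minimal sketch of how the two requested metrics could be computed for all three default models; the dictionary names ```models``` and ```results``` are introduced here for illustration. It assumes the same split as above (```test_size=0.3```, ```random_state=5```) and uses ```decision_function``` as the continuous score where available, falling back to the class-1 probability otherwise.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5
)

# max_iter is raised as in the cell above so that the solver converges
models = {
    "LogisticRegression": LogisticRegression(max_iter=10000),
    "SVC": SVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # AUC needs a continuous score rather than hard 0/1 predictions
    if hasattr(model, "decision_function"):
        y_score = model.decision_function(X_test)
    else:
        y_score = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
        "mse": mean_squared_error(y_test, y_pred),
    }
    print(name, results[name])
```

Since the labels are 0 and 1, the MSE of the hard predictions is exactly one minus the accuracy, so the MSE values printed above can be read as error rates.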


#### b) Tweaking Parameters

We now want to check whether we can further improve the performance of our algorithms. To that end, we tweak the following parameters of the three given algorithms:
* ```LogisticRegression```: the regularization parameter ```C``` and the ```penalty``` function
* ```SVC```: the ```kernel``` function and the regularization parameter ```C```
* ```KNeighborsClassifier```: the neighborhood size ```n_neighbors```

Conduct your own parameter search and determine the parameter setting which yields the best cross-validated accuracy score. Is there a clear best algorithm?
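
One possible way to run such a search is ```GridSearchCV```, which scores every parameter combination by cross-validation on the training data. The sketch below is not the only option and makes several assumptions that are not part of the exercise: the parameter grids are arbitrary examples, the features are standardized in a ```Pipeline``` (mainly to keep the linear SVM fast), and the ```liblinear``` solver is chosen because it supports both the ```l1``` and the ```l2``` penalty.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5
)

# Example grids (assumed values); the "clf__" prefix addresses the pipeline step
searches = {
    "LogisticRegression": (
        LogisticRegression(solver="liblinear", max_iter=10000),
        {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]},
    ),
    "SVC": (
        SVC(),
        {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
    ),
    "KNeighborsClassifier": (
        KNeighborsClassifier(),
        {"clf__n_neighbors": [3, 5, 9, 15, 25]},
    ),
}

best = {}
for name, (estimator, grid) in searches.items():
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    # 5-fold cross-validated accuracy on the training set only
    search = GridSearchCV(pipe, grid, scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    best[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 4))
```

Keeping the test set out of the search means it can still serve as an unbiased check of the selected models afterwards.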

In [48]:
clf = LogisticRegression(penalty='l2', C=0.4, fit_intercept=True, max_iter=10000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))

0.017543859649122806
0.9235469448584203


In [73]:
#SVM
svm_model = SVC(C=100, kernel='linear')
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))

0.023391812865497075
0.8980625931445604


In [72]:
#K-nearest neighbors
Knei_model = KNeighborsClassifier(n_neighbors=15)
Knei_model.fit(X_train, y_train)
y_pred = Knei_model.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))

0.029239766081871343
0.8725782414307005


### Task 2: Multi-Class Classification

In the lecture we have focused on the binary case in classification. However, in many applications there are more than two classes to predict.  
One standard strategy to extend binary classification methods to the multiclass case is the so-called _one-vs-all_ approach, which requires a classifier that can produce a confidence score for its predictions rather than only the crisp predictions themselves. In logistic regression, for instance, such a confidence score is given by the class probabilities.  
Assuming that there are $L$ classes $C_1,\dots, C_L$, in the one-vs-all approach one fits $L$ binary models, where the goal of the $l$-th model is to predict whether a sample belongs to class $C_l$ or not. After fitting the $l$-th model, the confidence score of each sample belonging to class $C_l$ is computed and stored. In the end, the class that achieved the highest confidence score among all $L$ classes is chosen as the final prediction.
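
The decision rule can also be written as a formula. The symbol $s_l(x)$ is notation introduced here (not taken from the lecture) for the confidence score that the $l$-th binary model assigns to a sample $x$:

$$
\hat{y}(x) = C_{l^*}, \qquad l^* = \underset{l \in \{1,\dots,L\}}{\operatorname{arg\,max}} \; s_l(x).
$$

For logistic regression, $s_l(x)$ is the probability, estimated by the $l$-th model, that $x$ belongs to class $C_l$.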

In this task, we are going to implement this approach and apply it to a student evaluation dataset, aiming to predict which of three course instructors has taught a class. The dataset is loaded in the cell below, and documented on http://archive.ics.uci.edu/ml/datasets/turkiye+student+evaluation. Feel free to explore the data beforehand. Note that we drop the class (lecture) identifier, since this uniquely determines the instructors.

In [75]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00262/turkiye-student-evaluation_generic.csv"

df = pd.read_csv(url,
                 index_col = False,
                 sep = ',',
                 skipinitialspace = True)

# drop class for training purposes - this would uniquely identify instructors
# (the column nb.repeat is dropped as well)
df = df.drop(columns = ['class','nb.repeat'])
df  # use df.head() to show only the first rows

Unnamed: 0,instr,attendance,difficulty,Q1,Q2,Q3,Q4,Q5,Q6,Q7,...,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28
0,1,0,4,3,3,3,3,3,3,3,...,3,3,3,3,3,3,3,3,3,3
1,1,1,3,3,3,3,3,3,3,3,...,3,3,3,3,3,3,3,3,3,3
2,1,2,4,5,5,5,5,5,5,5,...,5,5,5,5,5,5,5,5,5,5
3,1,1,3,3,3,3,3,3,3,3,...,3,3,3,3,3,3,3,3,3,3
4,1,0,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5815,3,0,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
5816,3,3,4,4,4,4,4,4,4,4,...,5,5,5,5,4,5,5,5,5,5
5817,3,0,4,5,5,5,5,5,5,5,...,5,5,5,5,5,5,5,5,5,5
5818,3,1,2,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


#### a) Implementing Multiclass-Classification with Logistic Regression

Implement a function ```mc_predict()``` that uses the one-vs-all approach to perform multiclass classification based on logistic regression. To do so, you have to fit multiple binary models. Use the function signature given in the cell below, i.e. take as input both the training data and a test set to predict on after fitting the models. You may use sklearn to fit the binary models.  
Apply your function to the student evaluation data, and compute the accuracy on both the training and the test set!

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def mc_predict(X_train,y_train,X_test):
    """
    :param X_train: two dimensional numpy array describing the feature matrix to train on
    :param y_train: one dimensional numpy array or list representing the class vector used in training
    :param X_test: two dimensional numpy array describing the feature matrix to predict on
    :return: one dimensional numpy array representing the predictions on test data
    """
    # your code here
    raise NotImplementedError
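
One possible way to fill in the stub is sketched below; it is an illustration of the one-vs-all idea, not a reference solution. To stay self-contained and independent of the download above, the usage example runs on the iris data from ```sklearn.datasets``` instead of the student evaluation data, and ```max_iter=1000``` is an assumed setting to help the solver converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def mc_predict(X_train, y_train, X_test):
    """One-vs-all prediction with one binary logistic regression per class."""
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    scores = np.empty((X_test.shape[0], classes.size))
    for j, c in enumerate(classes):
        # Binary target: 1 if the sample belongs to class c, 0 otherwise
        binary_target = (y_train == c).astype(int)
        model = LogisticRegression(max_iter=1000).fit(X_train, binary_target)
        # Confidence score: estimated probability of belonging to class c
        scores[:, j] = model.predict_proba(X_test)[:, 1]
    # Pick, per sample, the class with the highest confidence score
    return classes[np.argmax(scores, axis=1)]


# Usage on a stand-in dataset (iris, three classes)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
train_acc = accuracy_score(y_train, mc_predict(X_train, y_train, X_train))
test_pred = mc_predict(X_train, y_train, X_test)
test_acc = accuracy_score(y_test, test_pred)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

For the student evaluation data, the same function could be called with the ```instr``` column as the class vector and the remaining columns as features.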

#### b) Comparison to Scikit-learn

Note that logistic regression in sklearn also supports the multiclass case, as it has built-in one-vs-all functionality. Compare your predictions and accuracy scores to those resulting from the built-in functions.
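
How ```LogisticRegression``` itself handles several classes depends on the sklearn version and its settings, so a version-independent way to get one-vs-all behaviour is to wrap the binary model in ```OneVsRestClassifier``` from ```sklearn.multiclass```. The sketch below uses the iris data as a stand-in (an assumption made so the block runs without the download) and ```max_iter=1000``` as an assumed solver setting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fits one binary logistic regression per class and predicts the class
# whose model returns the highest score
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)

ovr_train_acc = accuracy_score(y_train, ovr.predict(X_train))
ovr_test_acc = accuracy_score(y_test, ovr.predict(X_test))
print("number of binary models:", len(ovr.estimators_))
print("train accuracy:", ovr_train_acc)
print("test accuracy:", ovr_test_acc)
```

Comparing ```ovr.predict(X_test)``` element-wise with the output of a hand-written one-vs-all function is a direct way to check whether both make the same predictions.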

### Task 3: The Ensemble Effect: Decision Trees and Random Forests

Unlike logistic regression, decision trees (and thus also random forests) naturally support multiclass classification without having to employ a one-vs-all strategy.

#### a) Classification with Decision Trees

Apply the decision tree classifier from sklearn to the student evaluation data and compare the accuracies on training and test data with those resulting from the logistic regression. What do you observe?

**Answer:** The decision tree model appears to work very well on the training data, but does not generalize that well. In particular, a single tree by itself is outperformed by the logistic regression models that we used above.
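
The gap between training and test accuracy can be illustrated with a short sketch. It uses synthetic three-class data from ```make_classification``` as a stand-in (an assumption, with arbitrarily chosen sizes), so the numbers it prints say nothing about the student evaluation data; it only shows how the comparison could be set up.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with three classes (sizes are arbitrary)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

tree_train = accuracy_score(y_train, tree.predict(X_train))
tree_test = accuracy_score(y_test, tree.predict(X_test))
logreg_train = accuracy_score(y_train, logreg.predict(X_train))
logreg_test = accuracy_score(y_test, logreg.predict(X_test))

print("tree   train/test:", tree_train, tree_test)
print("logreg train/test:", logreg_train, logreg_test)
```

An unpruned tree keeps splitting until its leaves are pure, which is why its training accuracy tends to be far more optimistic than its test accuracy.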

#### b) Growing a Forest

For all $n\in\{3,\dots,100\}$, use sklearn to fit a random forest with $n$ trees on the student evaluation data. For each model, compute the accuracy on the test set, and plot the number of trees against the resulting accuracies.
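
A minimal sketch of such a loop is given below, again on synthetic stand-in data from ```make_classification``` (an assumption, so that the block runs without the download). Instead of fitting 98 independent forests, it uses the ```warm_start``` option of ```RandomForestClassifier``` to add trees to one growing forest; this is a shortcut chosen here for speed, and it means the forest with $n$ trees contains all trees of the forest with $n-1$ trees.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three classes (sizes are arbitrary)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree_counts = list(range(3, 101))
accuracies = []

# warm_start=True keeps the already fitted trees and only adds new ones
rf = RandomForestClassifier(
    n_estimators=tree_counts[0], warm_start=True, random_state=0
)
for n in tree_counts:
    rf.set_params(n_estimators=n)
    rf.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, rf.predict(X_test)))

# Plot the number of trees against the test accuracy
fig, ax = plt.subplots()
ax.plot(tree_counts, accuracies)
ax.set_xlabel("number of trees")
ax.set_ylabel("test accuracy")
ax.set_title("Random forest: test accuracy vs. number of trees")
```

To fit fully independent forests instead, a new ```RandomForestClassifier(n_estimators=n)``` could be created inside the loop, at the cost of a longer runtime.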