# Classification algorithms

The goal of this notebook is to manipulate some of the classification algorithms of [scikit-learn](http://scikit-learn.org/stable/documentation.html). 

This notebook was created by [Chloé-Agathe Azencott](http://cazencott.info).

This notebook was created using
* python 3.4.3
* numpy 1.15.0
* matplotlib 2.2.2
* scikit-learn 0.19.2

You can check your version of Python by running
```python
import sys
print(sys.version)
```

and the version of any module by running
```python
import <module name>
print(<module name>.__version__)
```

### Loading the data science libraries

In [None]:
%pylab inline
import pandas as pd

## 1. Predicting mushroom edibility

We will work with a data set that describes mushrooms according to the shape of their cap and stalk, their odor, the type of their veil, etc. This data set also contains information on whether a mushroom is edible or not, and that is what we will try to predict.

Data are available as `data/mushrooms.csv`. Let us load them in a pandas DataFrame called `df`.

In [None]:
df = pd.read_csv('data/mushrooms.csv', 
                 #index_col=0 # the first column is the IDs of the samples
                )

Let us look at the first few lines of df

In [None]:
df.head()

## 2. Preparing the data

## 2.1 Converting the features into numerical values

As you can see, the features are encoded as _letters_. Each letter corresponds to a category. For example, for the `cap shape` feature, `b` corresponds to a bell cap, `c` to a conical cap, `f` to a flat cap, `k` to a knobbed cap, `s` to a sunken cap, and `x` to a convex cap. For more details about their meaning, you can consult [the documentation of the data set](https://archive.ics.uci.edu/ml/datasets/Mushroom).

In order to work with this data, we need to convert the categorical attributes into numerical values. Here we will simply convert each letter to an integer between 0 and the number of categories minus one, using scikit-learn's [preprocessing.LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
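
As a small illustration of this conversion, here is a sketch on the six cap-shape letters listed above (a toy list, not the actual data set). `LabelEncoder` sorts the categories and uses each category's position as its code.

```python
from sklearn.preprocessing import LabelEncoder

# The six cap-shape letters listed above, in arbitrary order (toy input)
letters = ['x', 's', 'c', 'b', 'f', 'k']

encoder = LabelEncoder()
codes = encoder.fit_transform(letters)

# classes_ holds the sorted categories; a letter's code is its position there
print(list(encoder.classes_))
print(list(codes))
```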

This encoding is not necessarily the best: for example, an algorithm that uses the Euclidean distance will consider that a convex cap (`x` converted to 5) is closer to a sunken cap (`s` converted to 4) than to a conical cap (`c` converted to 1), although the letters carry no such ordering. The [one-hot encoding](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features) is a good alternative. However, it has the drawback of increasing the number of features, and of creating correlated features.
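
For comparison, here is a minimal sketch of a one-hot encoding using `pd.get_dummies`. The toy data frame and its column name `cap_shape` are assumptions made for illustration; they are not taken from `data/mushrooms.csv`.

```python
import pandas as pd

# Toy column of cap-shape letters (hypothetical column name, illustrative values)
toy = pd.DataFrame({'cap_shape': ['x', 's', 'c', 'x', 'b']})

# One binary column per category: no artificial ordering between letters,
# but one feature becomes as many features as there are categories
one_hot = pd.get_dummies(toy, columns=['cap_shape'])

print(one_hot.columns.tolist())
print(one_hot.shape)
```

Each row contains exactly one active column, which is also why the new columns are correlated: knowing all but one of them determines the last.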

In [None]:
from sklearn.preprocessing import LabelEncoder

labelencoder=LabelEncoder()
for col in df.columns:
    df[col] = labelencoder.fit_transform(df[col])

In [None]:
df.head()

## 2.2 Split the data into a train set and a test set

Let us first extract the design matrix `X` and the label vector `y` from the data frame.

The first column, `class`, is our target variable. 

In [None]:
X = np.array(df.iloc[:, 1:]) # exclude first column
y = np.array(df.iloc[:, 0])  # first column only

We will use 70% of the samples of the data for training our models, and the remaining samples to evaluate them.

Let us create the numpy arrays `Xtrain`, `Xtest` and the vectors `ytrain`, `ytest` that we will work with, using scikit-learn's [model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.
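
To see what `train_test_split` returns, here is a sketch on toy arrays (not the mushroom data). The `random_state` argument is added here only to make the toy split reproducible.

```python
import numpy as np
from sklearn import model_selection

X_toy = np.arange(40).reshape(20, 2)  # 20 samples, 2 features
y_toy = np.arange(20) % 2             # alternating binary labels

Xtr, Xte, ytr, yte = model_selection.train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0)

# The rows of X and the entries of y are split consistently
print(Xtr.shape, Xte.shape, ytr.shape, yte.shape)
```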

In [None]:
from sklearn import model_selection
Xtrain, Xtest, ytrain, ytest = model_selection.train_test_split(X, y, test_size=0.3)

__Question :__ How many training and test samples do we have? How many features?

In [None]:
# TODO

__Answer:__ 

## 2.3 Model selection by cross-validation
Remember, using the test data to select our best model is likely to result in _overfitting_. 

We will therefore start by splitting the training data into 5 folds of cross-validation. Scikit-learn also does that for us.

scikit-learn model selection: http://scikit-learn.org/stable/model_selection.html#model-selection
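
Here is a sketch of what a 5-fold split looks like on a toy array of 10 samples (not the mushroom data). Each sample ends up in the test part of exactly one fold. The `random_state` argument is only there to make the toy example reproducible.

```python
import numpy as np
from sklearn import model_selection

X_toy = np.arange(20).reshape(10, 2)  # 10 samples, 2 features

kf_toy = model_selection.KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(kf_toy.split(X_toy))     # list of (train indices, test indices)

# Collect the test indices of all folds: every sample appears exactly once
all_test = np.sort(np.concatenate([te for _, te in folds]))
print(all_test)
```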

In [None]:
# Create a KFold object
kf = model_selection.KFold(n_splits=5,  # 5 folds
                           shuffle=True # shuffle the samples before splitting 
                          )

# Use kf to split Xtrain into 5 folds. 
# kf.split returns an iterator (that would be consumed after one iteration). 
# We transform it into a list (on which we can iterate as often as we want).
kf_indices = list(kf.split(Xtrain))

In [None]:
for i, (tr, te) in enumerate(kf_indices):
    print("Training data for fold %d:" % i, tr)
    print("Test data for fold %d:" % i, te)

# 3. Linear models

We will use scikit-learn to train a few [linear models](http://scikit-learn.org/stable/modules/linear_model.html#linear-model) on `(Xtrain, ytrain)`.

In [None]:
from sklearn import linear_model

## 3.1 Logistic regression
How does a logistic regression perform on this data? We'll evaluate it by cross-validation, using the cross-validation folds we have generated above.

But wait! What will our criterion be? There are several you can choose from on http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter. In this lab, we will use the [F1 score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).
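
As a reminder, the F1 score is the harmonic mean of precision and recall. The sketch below checks this on toy labels (made up for illustration) against scikit-learn's `f1_score`.

```python
import numpy as np
from sklearn import metrics

# Toy labels: 3 true positives, 1 false negative, 2 false positives
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 1])

precision = metrics.precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = metrics.recall_score(y_true, y_pred)        # TP / (TP + FN)

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
print(f1_manual, metrics.f1_score(y_true, y_pred))
```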

__Question :__ What other criterion or criteria could you use?

__Answer:__ 

Let us create a logistic regression model. We will use [scikit-learn's implementation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression), which includes a regularization parameter.

__Question:__ What is the relationship between the regularization parameter `C` and the regularization parameter `alpha` of the [ridge regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) ?

__Answer:__

In [None]:
model = linear_model.LogisticRegression(C=1e10) # use a very large C, i.e. a very weak
                                                # penalty, so as not to regularize

scikit-learn allows us to directly compute the cross-validated F1 score (on the train set) of this model:

In [None]:
f1 = model_selection.cross_val_score(model, 
                                     Xtrain, y=ytrain, 
                                     scoring='f1', 
                                     cv=kf_indices)

print(["%.3f" % value for value in f1])

__Question:__ What is the average f1 score of the logistic regression over all folds?

In [None]:
# TODO

What are the weights that a logistic regression assigns to each variable on the train set? To determine this, we will train a logistic regression on the entire training data.

In [None]:
# Train a logistic regression on the entire train set
model.fit(Xtrain, ytrain)

# Create a figure
fig = plt.figure(figsize=(12, 6))

# Plot, for each feature, its coefficient in the model
num_features = Xtrain.shape[1]
plt.scatter(range(num_features), model.coef_)

plt.xlabel('Feature', fontsize=14)
feature_names = list(df.columns[1:])
tmp = plt.xticks(range(num_features), feature_names, 
                 rotation=90, fontsize=14)
tmp = plt.ylabel('Weight', fontsize=14)

tmp = plt.title('Logistic regression weights', fontsize=16)

__Question:__ Which feature do you think is the most important to predict edibility? Which one is the least important?

In [None]:
# TODO

__Answer:__

## 3.2 Ridge logistic regression

Can we improve performance using regularization? By default, the logistic regression implemented in scikit-learn uses an l2 (i.e. ridge) penalty. Let us now try to find the optimal value of the regularization parameter `C`.

Let us create 50 values of `C` for testing, equally spaced (in log scale) between $10^{-2}$ and $10^3$:

In [None]:
cvalues = np.logspace(-2, 3, 50)

f1_per_c = [] # will store the f1 values for all 50 values of C
weights_per_c = [] #  will store the weights associated with each feature 
                       # for all 50 values of C
for cval in cvalues:
    # Create a logistic regression ridge model
    model = linear_model.LogisticRegression(C=cval)
    
    # Compute the model's cross-validated performance
    f1 = model_selection.cross_val_score(model, 
                                         Xtrain, y=ytrain, 
                                         scoring='f1', 
                                         cv=kf_indices)
    f1_per_c.append(f1)
    
    # Train the model on the entire training set and store the regression coefficients
    model.fit(Xtrain, ytrain)
    weights_per_c.append(model.coef_)
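
A side note on `np.logspace`, used above to build the grid: its first two arguments are _exponents_ (base 10), not the end values themselves. A quick sketch with toy exponents:

```python
import numpy as np

grid = np.logspace(0, 3, 4)  # exponents 0, 1, 2, 3
print(grid)

# Consecutive values have a constant ratio (here 10), not a constant difference
ratios = grid[1:] / grid[:-1]
print(ratios)
```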

### Evolution of the weights assigned to each feature with C

In [None]:
# Reshape weights_per_c into a (number of values of C, number of features) array
weights_per_c = np.array(weights_per_c)
weights_per_c = weights_per_c.reshape(len(cvalues), X.shape[1])

In [None]:
# Plot, for each feature, the regression weight as a function of C
fig = plt.figure(figsize=(8, 8))

lines = plt.plot(cvalues, weights_per_c)
plt.xscale('log')
tmp = plt.legend(lines, feature_names, frameon=False, 
                 loc=(1, 0), fontsize=14)

tmp = plt.xlabel('C', fontsize=14)
tmp = plt.ylabel('Weight', fontsize=14)

tmp = plt.title('Logistic ridge regression weights', fontsize=16)

__Question:__ Is this consistent with what we observed for the non-regularized logistic regression? 

__Answer:__

#### Evolution of the F1 score with C

__Question:__ Plot the f1 score (averaged across the 5 folds) as a function of C.

In [None]:
# TODO

__Question:__ What is the optimal value of C and what is the corresponding average cross-validated f1? Use `np.argmax`.

In [None]:
# TODO

__Answer:__

### Grid search of the optimal parameter

Scikit-learn's [model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) allows us to automatically select the hyperparameters of a learning algorithm that are optimal, in cross-validation, on our training set.

In [None]:
from sklearn import model_selection

In [None]:
# Define a grid of parameter values:
param_grid = {'C': cvalues}

# Define a model that will do a grid search:
model = model_selection.GridSearchCV(linear_model.LogisticRegression(),
                                     param_grid, 
                                     scoring='f1', 
                                     cv=kf_indices)

# Train the model
model.fit(Xtrain, ytrain)

# Best parameter
print(model.best_params_)

__Question:__ Are these results consistent with prior observations?

__Answer:__

The model with optimal parameters has already been trained on the entire training set and is accessible via `model.best_estimator_`
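
Here is a sketch of how the fitted grid search can be used afterwards. It runs on synthetic data generated with `make_classification` (a stand-in, not the mushroom data) and on a small, arbitrary grid of `C` values.

```python
from sklearn import datasets, linear_model, model_selection

# Synthetic binary classification data (illustrative stand-in)
X_toy, y_toy = datasets.make_classification(n_samples=200, n_features=5,
                                            random_state=0)

search = model_selection.GridSearchCV(linear_model.LogisticRegression(),
                                      {'C': [0.01, 1.0, 100.0]},
                                      scoring='f1', cv=5)
search.fit(X_toy, y_toy)

# best_estimator_ has been refit on all of X_toy (refit=True is the default),
# so it can be used directly for prediction or inspection
best = search.best_estimator_
print(search.best_params_)
print(best.predict(X_toy[:5]))
```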

__Question:__ How does the regularized logistic ridge regression compare to the vanilla logistic regression? Are you surprised?

__Answer:__

# 4. Non-linear models
Can a non-linear model improve the performance on this data? Our performance is already very good!

## 4.1 Decision trees

Let us start with a simple non-linear model: a decision tree. Decision trees are implemented in scikit-learn's [tree.DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

In [None]:
from sklearn import tree

We create a DT model:

In [None]:
model = tree.DecisionTreeClassifier()

and compute the cross-validated F1 score (on the train set) of this model:

In [None]:
f1 = model_selection.cross_val_score(model, 
                                     Xtrain, y=ytrain, 
                                     scoring='f1', 
                                     cv=kf_indices)

print(["%.3f" % value for value in f1])

__Question:__ What is the average f1 score of the decision tree over all folds? How does it compare to the linear model?

In [None]:
# TODO

### Understanding the Decision Tree

Wow! Our decision tree seems to be *a perfect model*! We can look at its structure in more detail, using the code provided in [scikit-learn's documentation](http://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html).

In [None]:
# Train our model on the entire train set:
model.fit(Xtrain, ytrain)

n_nodes = model.tree_.node_count
children_left = model.tree_.children_left
children_right = model.tree_.children_right
feature = model.tree_.feature
threshold = model.tree_.threshold

node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:"
      % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (node_depth[i] * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))
print()

We can also look at the **importance** of each feature, which is the (normalized) total reduction in the Gini criterion brought by this feature:

In [None]:
# Create a figure
fig = plt.figure(figsize=(12, 6))

# Plot, for each feature, its importance in the model
plt.scatter(range(num_features), model.feature_importances_)

plt.xlabel('Feature', fontsize=14)
tmp = plt.xticks(range(num_features), feature_names, 
                 rotation=90, fontsize=14)
tmp = plt.ylabel('Importance', fontsize=14)

tmp = plt.title('DT importance', fontsize=16)
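
To make the notion of importance more concrete, here is a sketch on synthetic data (generated with `make_classification`, not the mushroom data). In scikit-learn, the importances of a fitted tree are non-negative and normalized to sum to 1, and a feature that is never used in a split gets an importance of 0.

```python
from sklearn import datasets, tree

# Synthetic data: 6 features, of which only 2 are informative (illustrative)
X_toy, y_toy = datasets.make_classification(n_samples=200, n_features=6,
                                            n_informative=2, n_redundant=0,
                                            random_state=0)

dt = tree.DecisionTreeClassifier(random_state=0)
dt.fit(X_toy, y_toy)

importances = dt.feature_importances_
print(importances)
print(importances.sum())
```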

__Question:__ Which features are the most important? Which ones are not used by the model? How does this compare to the linear models we trained before?

In [None]:
sorted_indices = np.argsort(model.feature_importances_)
print([feature_names[ix] for ix in sorted_indices[-5:]])
print(model.feature_importances_[sorted_indices[-5:]])

__Answer:__

## Making the problem harder
Because this data set is an "easy" one (we achieved very good performance with a linear model, and perfection with a decision tree), it is difficult to use it to illustrate the differences between several non-linear models.

In what follows, we make the problem harder by introducing noise in the labeled data: we will randomly flip some of the labels.

In [None]:
# make a copy of ytrain
ytrain_hard = np.array(ytrain)

# randomly flip some of the labels:
# define which samples to flip
where_to_flip = np.random.binomial(1, 0.05, size=ytrain.shape)
print("Randomly flipping %d labels" % np.sum(where_to_flip))

# flip ytrain where where_to_flip is not zero
ytrain_hard = np.where(where_to_flip == 0, ytrain, 1-ytrain)

### Decision trees on the more difficult data set

Let us now work with this harder training data.

In [None]:
model = tree.DecisionTreeClassifier()

and compute the cross-validated F1 score (on the train set) of a decision tree:

In [None]:
f1 = model_selection.cross_val_score(model, 
                                     Xtrain, y=ytrain_hard, 
                                     scoring='f1', 
                                     cv=kf_indices)

print(["%.3f" % value for value in f1])

__Question:__ What is the average f1 score of this decision tree over the 5 folds?

In [None]:
# TODO
print("%.3f" % np.mean(f1))

Let us repeat the experiment once more:

In [None]:
model2 = tree.DecisionTreeClassifier()
f1_2 = model_selection.cross_val_score(model2, 
                                     Xtrain, y=ytrain_hard, 
                                     scoring='f1', 
                                     cv=kf_indices)

print(["%.3f" % value for value in f1_2])

__Question:__ Do you observe the same f1 scores on each fold as previously? Is this surprising?

__Answer:__

### Linear model on the more difficult data set

__Question:__ Is a linear model better or worse than a decision tree on this more difficult data?

In [None]:
# TODO

__Question:__ Compare, on the same plot, the feature importance given by each of the decision trees and the feature weights in a linear model. (You may need to re-scale the feature weights.) Do the three models give the same importance to the same features?

In [None]:
# TODO

## 4.2 Random forest

Can an ensemble method improve the performance of the decision tree on the difficult data set? We will use the random forest implementation in scikit-learn's [ensemble.RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
from sklearn import ensemble

An important hyperparameter of a random forest is the number of trees it contains. We will therefore vary this number.

In [None]:
ntrees_values = [10, 20, 50, 100, 300, 500, 2000]

f1_per_ntrees = [] # will store the f1 values for all values of ntrees
importance_per_ntrees = [] #  will store the importance of each feature 
                          # for all values of ntrees
    
for ntrees in ntrees_values:
    # Create a random forest model
    model = ensemble.RandomForestClassifier(n_estimators=ntrees)
    
    # Compute the model's cross-validated performance
    f1 = model_selection.cross_val_score(model, 
                                         Xtrain, y=ytrain_hard, 
                                         scoring='f1', 
                                         cv=kf_indices)
    f1_per_ntrees.append(f1)
    
    # Train the model on the entire training set and store the feature importances
    model.fit(Xtrain, ytrain_hard)
    importance_per_ntrees.append(model.feature_importances_)

### F1 Score

__Question:__ Which number of trees yields the optimal cross-validated f1 score? What is this score? How does it compare to the scores of the previous models?

In [None]:
# TODO

### Feature Importance

Let us now look at the feature importance.

In [None]:
# Reshape importance_per_ntree into a (number of values of ntree, number of features) array
importance_per_ntrees = np.array(importance_per_ntrees)
importance_per_ntrees = importance_per_ntrees.reshape(len(ntrees_values), X.shape[1])

In [None]:
# Plot, for each feature, the importance as a function of the number of trees
fig = plt.figure(figsize=(8, 8))

lines = plt.plot(ntrees_values, importance_per_ntrees)
plt.xscale('log')
tmp = plt.legend(lines, feature_names, frameon=False, 
                 loc=(1, 0), fontsize=14)

tmp = plt.xlabel('Number of trees', fontsize=14)
tmp = plt.ylabel('Importance', fontsize=14)

tmp = plt.title('Random forest importance', fontsize=16)

__Question:__ Which features do you now think are the most important to predict edibility?

## 4.3 Support Vector Machines

SVM classifiers are implemented in scikit-learn's [svm.SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

Let us check the cross-validated performance of an SVM with RBF kernel on the training data.
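
As a reminder, the RBF kernel between two samples $x$ and $x'$ is $\exp(-\gamma \|x - x'\|^2)$, where $\gamma$ is the `gamma` hyperparameter of `svm.SVC`. The sketch below computes it by hand on random toy points and compares the result with scikit-learn's `rbf_kernel`. The points and the value of `gamma` are arbitrary.

```python
import numpy as np
from sklearn.metrics import pairwise

rng = np.random.RandomState(0)
A = rng.rand(4, 3)  # 4 toy samples with 3 features
gamma = 0.1

# Squared Euclidean distance between every pair of samples
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)

# RBF kernel computed by hand
K_manual = np.exp(-gamma * sq_dists)

# Same kernel computed by scikit-learn
K_sklearn = pairwise.rbf_kernel(A, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))
```

A larger `gamma` makes the kernel decay faster with distance, so each training sample influences a smaller neighborhood.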

In [None]:
from sklearn import svm

In [None]:
print(np.logspace(-1, 2, 5))

In [None]:
print(np.linspace(0.01, 0.5, 5))

In [None]:
# Define a grid of parameter values:
param_grid = {'C': np.logspace(-1, 2, 5), 
              'gamma': np.linspace(0.01, 0.5, 5)}

# Define a model that will do a grid search:
model = model_selection.GridSearchCV(svm.SVC(kernel='rbf'),
                                     param_grid, 
                                     scoring='f1', 
                                     cv=kf_indices)

# Train the model
model.fit(Xtrain, ytrain)

# Best parameters
print("Best parameters:", model.best_params_)

# Best F1 score
print("Best F1 score: %.3f" % model.best_score_)

__Question:__ How does the SVM compare to previous models?

__Answer:__

# 5. Final model

__Question:__ Based on the above analyses, which model do you finally choose as the most performant (on the difficult data)? What is its confusion matrix __on the test set?__ 
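
As a reminder of how to read a confusion matrix in scikit-learn, here is a sketch on toy labels (made up for illustration, not predictions of any model above). Rows correspond to the true classes and columns to the predicted classes.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1])

# For binary labels 0/1 the layout is [[TN, FP], [FN, TP]]
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
```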

In [None]:
# TODO

__Answer:__