# Model Comparison Lab

In this lab we will compare the performance of all the models we have learned about so far, using the car evaluation dataset.

## 1. Prepare the data

The [car evaluation dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/car/) is in the assets/datasets folder. By now you should be very familiar with this dataset.

1. Load the data into a pandas dataframe
- Encode the categorical features properly: define a map that preserves the scale (assigning smaller numbers to words indicating smaller quantities)
- Separate features from target into X and y
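
The scale-preserving encoding described above can be sketched on a toy frame. The frame, the `ordinal_maps` dictionary and the `encode_ordinal` helper are illustrative assumptions, not part of the lab's dataset or solution; the sketch mainly shows that `Series.map` returns NaN for any value missing from the map, which is worth checking for.

```python
import pandas as pd

# Toy frame with the same kind of ordered categories as the car data.
toy = pd.DataFrame({
    "safety": ["low", "med", "high", "med"],
    "lug_boot": ["small", "big", "med", "small"],
})

# Hypothetical maps: words for smaller quantities get smaller integers.
ordinal_maps = {
    "safety": {"low": 1, "med": 2, "high": 3},
    "lug_boot": {"small": 1, "med": 2, "big": 3},
}

def encode_ordinal(frame, maps):
    """Return a copy of `frame` with each mapped column replaced by integer codes."""
    encoded = frame.copy()
    for column, mapping in maps.items():
        encoded[column] = frame[column].map(mapping)
        # Series.map yields NaN for any value that is missing from the mapping.
        if encoded[column].isnull().any():
            raise ValueError("unmapped value in column {!r}".format(column))
    return encoded

encoded_toy = encode_ordinal(toy, ordinal_maps)
print(encoded_toy)
```

Working on a copy also means the cell can be re-run safely, whereas mapping a column in place a second time would turn every value into NaN.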

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# sklearn.cross_validation was replaced by sklearn.model_selection (scikit-learn 0.18+)
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [2]:
#Load the data into a pandas dataframe
df = pd.read_csv('../../assets/datasets/car.csv')
df.head()

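| Unnamed: 0 | buying | maint | doors | persons | lug_boot | safety | acceptability |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |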


In [3]:
# # Encode labels
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = df['acceptability']
le.fit(y)
y = le.transform(y)

# # Encode categorical features to booleans
# X = pd.get_dummies(df[['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']])

In [10]:
# Ordinal maps: smaller numbers for words indicating smaller quantities
map_buying = {'vhigh': 4, 'high': 3, 'med': 2, 'low': 1}
map_maint = {'vhigh': 4, 'high': 3, 'med': 2, 'low': 1}
map_doors = {'5more': 4, '4': 3, '3': 2, '2': 1}
map_persons = {'more': 3, '4': 2, '2': 1}
map_lug_boot = {'big': 3, 'med': 2, 'small': 1}
map_safety = {'high': 3, 'med': 2, 'low': 1}

df.maint = df.maint.map(map_maint)
df.buying = df.buying.map(map_buying)
df.doors = df.doors.map(map_doors)
df.persons = df.persons.map(map_persons)
df.lug_boot = df.lug_boot.map(map_lug_boot)
df.safety = df.safety.map(map_safety)
df.head()

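| Unnamed: 0 | buying | maint | doors | persons | lug_boot | safety | acceptability |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 4 | 1 | 1 | 1 | 1 | unacc |
| 1 | 4 | 4 | 1 | 1 | 1 | 2 | unacc |
| 2 | 4 | 4 | 1 | 1 | 1 | 3 | unacc |
| 3 | 4 | 4 | 1 | 1 | 2 | 1 | unacc |
| 4 | 4 | 4 | 1 | 1 | 2 | 2 | unacc |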


## 2. Useful preparation

Since we will compare several models, let's write a couple of helper functions.

1. Separate X and y into a train and test set, using a 30% test set and random state = 42
    - make sure that the data is shuffled and stratified
2. Define a function called `evaluate_model` that trains the model on the train set, tests it on the test set, and calculates:
    - accuracy score
    - confusion matrix
    - classification report
3. Initialize a global dictionary to store the various models for later retrieval
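
To see what "stratified" buys in the split requested above, here is a minimal sketch on synthetic labels. The arrays `X_demo` and `y_demo` are made-up stand-ins for the car data, and the import assumes a scikit-learn version that ships `sklearn.model_selection`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 70 samples of class 0 and 30 of class 1.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

# shuffle=True is the default; stratify keeps the class mix in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Share of class 1 in each part; both should stay close to the original 30%.
train_share = y_tr.mean()
test_share = y_te.mean()
print(len(y_tr), len(y_te), train_share, test_share)
```

Without `stratify`, a small class such as `vgood` could end up over- or under-represented in the test set purely by chance.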


In [11]:
# Separate X and y into a train and test set, using a 30% test set and random_state=42.
# Make sure that the data is shuffled and stratified.
# cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

In [12]:
X = df.drop('acceptability', axis=1)
y = df.acceptability

In [13]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

In [14]:
# Define evaluate_model: cross-validate the model, fit it on the train set,
# then report accuracy, confusion matrix and classification report on the test set.

def evaluate_model(model, name):
    # 3-fold cross-validation on the full dataset
    s = cross_val_score(model, X, y, cv=3, n_jobs=-1)
    print("{} Cross Val Score:\t{:0.3} ± {:0.3}".format(name, s.mean().round(3), s.std().round(3)))

    # Fit on the train split and predict the held-out test split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("Accuracy:  {}".format(accuracy_score(y_test, y_pred)))

    # Fix the class order so that rows and columns line up with their labels
    labels = ['unacc', 'acc', 'good', 'vgood']
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    print(pd.DataFrame(cm, index=['True ' + l for l in labels],
                       columns=['Pred ' + l for l in labels]))
    print(classification_report(y_test, y_pred))

In [15]:
#Initialize a global dictionary to store the various models for later retrieval
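
The cell above is left empty in this notebook. One possible shape for such a dictionary is sketched here; the names `all_models` and `register_model` are hypothetical, and the scores are placeholder numbers rather than results from this lab.

```python
# Hypothetical registry: maps a model name to the estimator and its test score.
all_models = {}

def register_model(name, model, score):
    """Store a model and its score under `name` for later comparison."""
    all_models[name] = {"model": model, "score": score}
    return all_models[name]

# Usage with placeholder objects (a real call would pass a fitted estimator
# and the accuracy measured on the test set).
register_model("knn", object(), 0.5)
register_model("tree", object(), 0.75)

# Model names sorted from best to worst score.
ranking = sorted(all_models, key=lambda n: all_models[n]["score"], reverse=True)
print(ranking)
```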

## 3.a KNN

Let's start with `KNeighborsClassifier`.

1. Initialize a KNN model
- Evaluate its performance with the function you previously defined
- Find the optimal value of K using grid search
    - Be careful about how you perform the cross-validation in the grid search
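
One way to be careful with the cross-validation is to pass an explicit shuffled, stratified splitter to the grid search instead of a bare integer, so the folds do not depend on the row order of the file. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the car data; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the car data (assumption: any small classification set works).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42
)

# Shuffled, stratified folds: each fold keeps the class proportions and
# is drawn from the whole dataset rather than from one contiguous block.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

param_grid = {"n_neighbors": list(range(1, 11))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The car dataset is stored in a systematic order, so unshuffled folds can each see a different slice of the feature space, which is one plausible reason for the large spread in the cross-validation scores reported in this notebook.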

In [18]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

evaluate_model(knn, 'knn')

knn Cross Val Score:	0.739 ± 0.123
Accuracy:  0.946050096339
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         355         8          0           0
True acc            12       103          0           0
True good            0         2         19           0
True vgood           0         4          2          14
             precision    recall  f1-score   support

        acc       0.88      0.90      0.89       115
       good       0.90      0.90      0.90        21
      unacc       0.97      0.98      0.97       363
      vgood       1.00      0.70      0.82        20

avg / total       0.95      0.95      0.95       519



In [19]:
# sklearn.grid_search was replaced by sklearn.model_selection (scikit-learn 0.18+)
from sklearn.model_selection import GridSearchCV
parameters = {"n_neighbors": [1,2,3,4,5,6,7,8,9,10]}

gs = GridSearchCV(knn, parameters, cv=5, n_jobs=4)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_



0.773148148148


{'n_neighbors': 9}

In [20]:
knn = KNeighborsClassifier(n_neighbors=9)

evaluate_model(knn, 'knn with grid Search')

knn with grid Search Cross Val Score:	0.775 ± 0.103
Accuracy:  0.942196531792
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         357         6          0           0
True acc            13       102          0           0
True good            2         4         15           0
True vgood           1         3          1          15
             precision    recall  f1-score   support

        acc       0.89      0.89      0.89       115
       good       0.94      0.71      0.81        21
      unacc       0.96      0.98      0.97       363
      vgood       1.00      0.75      0.86        20

avg / total       0.94      0.94      0.94       519



## 3.b Bagging + KNN

Now that we have found the optimal K, let's wrap `KNeighborsClassifier` in a BaggingClassifier and see if the score improves.

1. Wrap the KNN model in a Bagging Classifier
- Evaluate performance
- Do a grid search only on the bagging classifier params
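
A grid search restricted to the bagging parameters might look like the sketch below. It assumes synthetic data in place of the car dataset and fixes K at 9 only because that is the value the earlier search reported; the grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the car data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

# K is fixed up front, so the search below cannot change it.
base_knn = KNeighborsClassifier(n_neighbors=9)
bagged_knn = BaggingClassifier(base_knn, random_state=0)

# Only BaggingClassifier's own parameters appear in the grid.
bagging_grid = {
    "n_estimators": [3, 5, 10],
    "max_features": [2, 4, 6],
    "bootstrap": [True, False],
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
bag_search = GridSearchCV(bagged_knn, bagging_grid, cv=cv)
bag_search.fit(X_demo, y_demo)

print(bag_search.best_params_)
```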

In [21]:
from sklearn.ensemble import BaggingClassifier

knn = KNeighborsClassifier()
Bknn = BaggingClassifier(knn)

evaluate_model(Bknn, 'knn with bagging')

knn with bagging Cross Val Score:	0.742 ± 0.128
Accuracy:  0.940269749518
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         354         9          0           0
True acc            12       102          1           0
True good            2         2         16           1
True vgood           0         2          2          16
             precision    recall  f1-score   support

        acc       0.89      0.89      0.89       115
       good       0.84      0.76      0.80        21
      unacc       0.96      0.98      0.97       363
      vgood       0.94      0.80      0.86        20

avg / total       0.94      0.94      0.94       519



In [22]:
parameters = {"n_estimators": [1, 3, 5, 7, 9, 11],
              'max_features': [1, 2, 3, 4, 5],
              "bootstrap": [True, False],
              "bootstrap_features": [True, False]}
gs = GridSearchCV(Bknn, parameters, cv=5, n_jobs=1)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_

0.77025462963


{'bootstrap': True,
 'bootstrap_features': False,
 'max_features': 4,
 'n_estimators': 3}

In [23]:
Bknn = BaggingClassifier(knn,bootstrap = True,
 bootstrap_features = False,
 max_features = 4,
 n_estimators = 3)

evaluate_model(Bknn, 'knn with bagging and best params')

knn with bagging and best params Cross Val Score:	0.738 ± 0.015
Accuracy:  0.761078998073
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         353        10          0           0
True acc            68        40          4           3
True good            5        15          1           0
True vgood           7        12          0           1
             precision    recall  f1-score   support

        acc       0.52      0.35      0.42       115
       good       0.20      0.05      0.08        21
      unacc       0.82      0.97      0.89       363
      vgood       0.25      0.05      0.08        20

avg / total       0.70      0.76      0.72       519



## 4. Logistic Regression

Let's see if logistic regression performs better.

1. Initialize LR and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

In [25]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

evaluate_model(lr, 'Log reg')

Log reg Cross Val Score:	0.707 ± 0.075
Accuracy:  0.782273603083
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         351         9          0           3
True acc            64        49          2           0
True good            4        10          3           4
True vgood           0        17          0           3
             precision    recall  f1-score   support

        acc       0.58      0.43      0.49       115
       good       0.60      0.14      0.23        21
      unacc       0.84      0.97      0.90       363
      vgood       0.30      0.15      0.20        20

avg / total       0.75      0.78      0.75       519



In [26]:
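# On recent scikit-learn versions the 'l1' penalty needs a solver that supports it
# (for example solver='saga'); older versions defaulted to 'liblinear', which does.
parameters = {
    'penalty': ['l1', 'l2'],
    'C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}

# Search over the `lr` instance defined above
gs = GridSearchCV(lr, parameters, cv=5, n_jobs=1)
gs.fit(X, y)
print(gs.best_score_)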
gs.best_params_

0.754050925926


{'C': 1.0, 'penalty': 'l1'}

In [27]:
lr = LogisticRegression(C= 1.0, penalty='l1')
blr = BaggingClassifier(lr)
evaluate_model(blr, 'Bagging Logistic Regression')

Bagging Logistic Regression Cross Val Score:	0.707 ± 0.101
Accuracy:  0.803468208092
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         350        12          1           0
True acc            59        54          2           0
True good            5        10          6           0
True vgood           0        13          0           7
             precision    recall  f1-score   support

        acc       0.61      0.47      0.53       115
       good       0.67      0.29      0.40        21
      unacc       0.85      0.96      0.90       363
      vgood       1.00      0.35      0.52        20

avg / total       0.79      0.80      0.78       519



## 5. Decision Trees

Let's see if Decision Trees perform better.

1. Initialize DT and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

In [29]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
evaluate_model(dt, "Decision Tree")

Decision Tree Cross Val Score:	0.818 ± 0.013
Accuracy:  0.982658959538
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         360         3          0           0
True acc             3       110          2           0
True good            0         0         21           0
True vgood           0         1          0          19
             precision    recall  f1-score   support

        acc       0.96      0.96      0.96       115
       good       0.91      1.00      0.95        21
      unacc       0.99      0.99      0.99       363
      vgood       1.00      0.95      0.97        20

avg / total       0.98      0.98      0.98       519



In [30]:
parameters = {
    'criterion': ['gini', 'entropy'],
    'max_features': [None, 1, 2, 3,4,5],
    'max_depth': [None, 4, 5, 6, 7, 8, 9],
    'max_leaf_nodes': [None, 4, 5, 6, 7, 8, 9]
}
# Search over the `dt` instance defined above
gs = GridSearchCV(dt, parameters, cv=5, n_jobs=1)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_

0.857060185185


{'criterion': 'entropy',
 'max_depth': 8,
 'max_features': 4,
 'max_leaf_nodes': None}

In [31]:
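dt = DecisionTreeClassifier()
bdt = BaggingClassifier(dt)
# Evaluate the bagged ensemble `bdt`. The output recorded below came from a run
# that passed the single tree `dt`, so it repeats the plain decision tree scores.
evaluate_model(bdt, 'Bagging Decision Tree')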

Bagging Decision Tree Cross Val Score:	0.815 ± 0.01
Accuracy:  0.982658959538
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         360         3          0           0
True acc             3       110          2           0
True good            0         0         21           0
True vgood           0         1          0          19
             precision    recall  f1-score   support

        acc       0.96      0.96      0.96       115
       good       0.91      1.00      0.95        21
      unacc       0.99      0.99      0.99       363
      vgood       1.00      0.95      0.97        20

avg / total       0.98      0.98      0.98       519



## 6. Support Vector Machines

Let's see if SVMs perform better.

1. Initialize SVM and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

In [32]:
from sklearn.svm import SVC
sv = SVC()
evaluate_model(sv, "Support vector machine")

Support vector machine Cross Val Score:	0.764 ± 0.113
Accuracy:  0.957610789981
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         352        11          0           0
True acc             7       108          0           0
True good            0         2         19           0
True vgood           0         1          1          18
             precision    recall  f1-score   support

        acc       0.89      0.94      0.91       115
       good       0.95      0.90      0.93        21
      unacc       0.98      0.97      0.98       363
      vgood       1.00      0.90      0.95        20

avg / total       0.96      0.96      0.96       519



In [33]:
bsv = BaggingClassifier(sv)
evaluate_model(bsv, "Bagging Support vector machine")

Bagging Support vector machine Cross Val Score:	0.767 ± 0.11
Accuracy:  0.946050096339
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         352        11          0           0
True acc             9       105          0           1
True good            0         4         17           0
True vgood           0         1          2          17
             precision    recall  f1-score   support

        acc       0.87      0.91      0.89       115
       good       0.89      0.81      0.85        21
      unacc       0.98      0.97      0.97       363
      vgood       0.94      0.85      0.89        20

avg / total       0.95      0.95      0.95       519



In [34]:
parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

gs = GridSearchCV(sv, parameters, n_jobs=2)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_

0.707175925926


{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
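
The grid in the search above is a list of two dictionaries, and each dictionary is expanded on its own, so `gamma` is only combined with the `rbf` kernel. A small sketch with `ParameterGrid` makes the candidate count explicit; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

svm_grid = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

# Each dict is expanded separately: 1 * 2 * 4 rbf settings plus 1 * 4 linear ones.
candidates = list(ParameterGrid(svm_grid))
n_rbf = sum(1 for c in candidates if c["kernel"] == "rbf")
n_linear = sum(1 for c in candidates if c["kernel"] == "linear")

print(len(candidates), n_rbf, n_linear)  # 12 8 4
```

Putting everything in a single dictionary would instead try every `gamma` with the linear kernel too, which adds fits without changing the linear results.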

In [38]:
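# Note: C=10000 differs from the grid-search result above (C=10), and `degree`
# is ignored by the 'rbf' kernel (it only applies to 'poly').
sv = SVC(gamma=0.001, C=10000, kernel='rbf', degree=5)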

bsv = BaggingClassifier(sv)
bsv.fit(X_train,y_train)

bsv.score(X_test,y_test)

0.94219653179190754

## 7. Random Forest & Extra Trees

Let's see if Random Forest and Extra Trees perform better.

1. Initialize RF and ET and test on Train/Test set
- Find optimal params with Grid Search

In [39]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
evaluate_model(rfc, "Random Forest")

Random Forest Cross Val Score:	0.819 ± 0.029
Accuracy:  0.957610789981
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         352        11          0           0
True acc             4       111          0           0
True good            0         2         19           0
True vgood           0         3          2          15
             precision    recall  f1-score   support

        acc       0.87      0.97      0.92       115
       good       0.90      0.90      0.90        21
      unacc       0.99      0.97      0.98       363
      vgood       1.00      0.75      0.86        20

avg / total       0.96      0.96      0.96       519



In [41]:
param = {
    'bootstrap': [True, False],
    'criterion':['gini', 'entropy'],
    'max_features': [1, 2, 3, 4,5,6],
    'max_depth': [None, 3, 5, 7, 9, 11,13, 15,17]
}

gs = GridSearchCV(rfc, param,cv=5, n_jobs=2)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_

0.864583333333


{'bootstrap': True, 'criterion': 'entropy', 'max_depth': 11, 'max_features': 6}

In [42]:
rfc = RandomForestClassifier(bootstrap = True,
 criterion= 'entropy',
 max_depth = 11,
 max_features = 6)

evaluate_model(rfc, "Random Forest")

Random Forest Cross Val Score:	0.833 ± 0.054
Accuracy:  0.97880539499
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         359         4          0           0
True acc             6       108          1           0
True good            0         0         21           0
True vgood           0         0          0          20
             precision    recall  f1-score   support

        acc       0.96      0.94      0.95       115
       good       0.95      1.00      0.98        21
      unacc       0.98      0.99      0.99       363
      vgood       1.00      1.00      1.00        20

avg / total       0.98      0.98      0.98       519



In [43]:
from sklearn.ensemble import ExtraTreesClassifier
etc = ExtraTreesClassifier()
evaluate_model(etc, "Extra Trees Classifier")

Extra Trees Classifier Cross Val Score:	0.855 ± 0.001
Accuracy:  0.957610789981
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         350        13          0           0
True acc             4       109          1           1
True good            0         2         19           0
True vgood           0         1          0          19
             precision    recall  f1-score   support

        acc       0.87      0.95      0.91       115
       good       0.95      0.90      0.93        21
      unacc       0.99      0.96      0.98       363
      vgood       0.95      0.95      0.95        20

avg / total       0.96      0.96      0.96       519



In [45]:
param = {
    'bootstrap': [True, False],
    'criterion':['gini', 'entropy'],
    'max_features': [1, 2, 3, 4,5,6],
    'max_depth': [None, 3, 5, 7, 9,11,13, 15,17]
}

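# Search over the Extra Trees model `etc`. The scores recorded below came from a
# search that was run on `rfc`.
gs = GridSearchCV(etc, param, cv=5, n_jobs=2)
gs.fit(X, y)
print(gs.best_score_)
gs.best_params_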

0.869212962963


{'bootstrap': True,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 3}

In [47]:
etc = ExtraTreesClassifier(bootstrap = True,max_depth= None,
 criterion= 'entropy',
 max_features = 6)

evaluate_model(etc, "Extra Trees Classifier")

Extra Trees Classifier Cross Val Score:	0.835 ± 0.037
Accuracy:  0.974951830443
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         356         7          0           0
True acc             3       111          0           1
True good            0         1         20           0
True vgood           0         0          1          19
             precision    recall  f1-score   support

        acc       0.93      0.97      0.95       115
       good       0.95      0.95      0.95        21
      unacc       0.99      0.98      0.99       363
      vgood       0.95      0.95      0.95        20

avg / total       0.98      0.97      0.98       519



## 8. Model comparison

Let's compare the scores of the various models.

1. Do a bar chart of the scores of the best models. Which model wins on the train/test split?
- Re-test all the models using 3-fold stratified, shuffled cross-validation
- Do a bar chart with error bars of the average cross-validation scores. Is the winner the same?
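
The re-test and the error-bar chart could be sketched as follows. The data is synthetic and the three models in `models` are an arbitrary selection, so the numbers it prints say nothing about the car dataset; swap in the encoded `X`, `y` and the tuned models to answer the question.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption: replace with the encoded car data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=42
)

# Hypothetical selection of models to compare.
models = {
    "knn": KNeighborsClassifier(n_neighbors=9),
    "tree": DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# The same 3-fold stratified, shuffled splitter is used for every model.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
rows = []
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    rows.append({"model": name, "mean": scores.mean(), "std": scores.std()})
summary = pd.DataFrame(rows).set_index("model")

# Bar height is the mean accuracy; the error bar is one standard deviation.
fig, ax = plt.subplots()
ax.bar(list(summary.index), summary["mean"], yerr=summary["std"], capsize=4)
ax.set_ylabel("cross-validated accuracy")
ax.set_ylim(0, 1)

print(summary)
```

Sharing one splitter across models means every model is scored on identical folds, so differences between bars reflect the models rather than the splits.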


In [49]:
print("The best model is RandomForestClassifier with Parameters from GridSearch\n")
evaluate_model(rfc, "Random Forest")

The best model is RandomForestClassifier with Parameters from GridSearch

Random Forest Cross Val Score:	0.829 ± 0.053
Accuracy:  0.986512524085
            Pred unacc  Pred acc  Pred good  Pred vgood
True unacc         360         3          0           0
True acc             0       115          0           0
True good            0         1         20           0
True vgood           0         1          2          17
             precision    recall  f1-score   support

        acc       0.96      1.00      0.98       115
       good       0.91      0.95      0.93        21
      unacc       1.00      0.99      1.00       363
      vgood       1.00      0.85      0.92        20

avg / total       0.99      0.99      0.99       519



## Bonus

We have encoded the data using a map that preserves the scale.
Would our results have changed if we had used `pd.get_dummies` or `OneHotEncoder` to encode the categorical data as binary variables instead?

1. Repeat the analysis for this scenario. Is it better?
- Experiment with other models or other parameters. Can you beat your classmates' best score?
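
As a starting point for the bonus, here is a minimal sketch of the binary encoding on a toy frame. The frame is made up for illustration; on the real data the same call would be applied to the six raw feature columns before any ordinal mapping.

```python
import pandas as pd

# Toy frame with two categorical features of three levels each.
toy = pd.DataFrame({
    "safety": ["low", "med", "high", "med"],
    "lug_boot": ["small", "big", "med", "small"],
})

# One binary column per (feature, level) pair; no ordering is implied.
one_hot = pd.get_dummies(toy, columns=["safety", "lug_boot"])

print(one_hot.columns.tolist())
print(one_hot.shape)  # (4, 6)
```

Each row has exactly one active column per original feature, so the encoding trades the ordinal information for more, independent columns.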