# Model Comparison Lab

In this lab we will compare the performance of all the models we have learned about so far, using the car evaluation dataset.

## 1. Prepare the data

The [car evaluation dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/car/) is in the assets/datasets folder. By now you should be very familiar with this dataset.

1. Load the data into a pandas dataframe
- Encode the categorical features properly: define a map that preserves the scale (assigning smaller numbers to words indicating smaller quantities)
- Separate features from target into X and y

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
%matplotlib inline

In [None]:
df = pd.read_csv('./../../assets/datasets/car.csv')
df.head()

In [None]:
df['safety'].value_counts()

In [None]:
map_maint = {'vhigh': 4,
       'high': 3,
       'med': 2,
       'low':1}
map_doors = {'5more': 4,
       '4': 3,
       '3': 2,
       '2':1}
map_persons = {'more': 3,
       '4': 2,
       '2':1}
map_lug_boot = {'big': 3,
       'med': 2,
       'small':1}
map_safety = {'high': 3,
       'med': 2,
       'low':1}


# Series.map looks each value up in the dict (values missing from the dict become NaN)
X = pd.DataFrame()
X['maint'] = df['maint'].map(map_maint)
X['buying'] = df['buying'].map(map_maint)  # buying uses the same levels as maint
X['doors'] = df['doors'].map(map_doors)
X['persons'] = df['persons'].map(map_persons)
X['lug_boot'] = df['lug_boot'].map(map_lug_boot)
X['safety'] = df['safety'].map(map_safety)
X.head()

In [None]:
# X = pd.get_dummies(df.drop('acceptability', axis=1))
le = LabelEncoder()
y = le.fit_transform(df['acceptability'])

X.head()

## 2. Useful preparation

Since we will compare several models, let's write a couple of helper functions.

1. Separate X and y between a train and test set, using 30% test set, random state = 42
    - make sure that the data is shuffled and stratified
2. Define a function called `evaluate_model` that trains the model on the train set, tests it on the test set, and reports:
    - accuracy score
    - confusion matrix
    - classification report
3. Initialize a global dictionary to store the various models for later retrieval
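
To see what "shuffled and stratified" buys you, here is a minimal sketch on toy labels. The arrays `X_toy` and `y_toy` are made up for illustration and are not part of the car dataset. It checks that each class keeps the same share in the train and test splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumed for illustration): 70% class 0, 20% class 1, 10% class 2.
y_toy = np.array([0] * 70 + [1] * 20 + [2] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, shuffle=True, stratify=y_toy, random_state=42
)

# With stratify=y_toy, the class proportions in each split follow the full data.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train shares:", train_share)
print("test shares: ", test_share)
```

Without `stratify`, a rare class such as `good` or `vgood` could end up under-represented in the test set purely by chance.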


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

In [None]:
def evaluate_model(model):
    # fit on the train set, then predict on the held-out test set
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('Accuracy score:', model.score(X_test, y_test))
    # sklearn metrics expect (y_true, y_pred): rows are actual classes, columns are predictions
    con = pd.DataFrame(confusion_matrix(y_test, y_pred),
                       index=['acc', 'good', 'unacc', 'vgood'],
                       columns=['pred_acc', 'pred_good', 'pred_unacc', 'pred_vgood'])
    print(con)
    print(classification_report(y_test, y_pred))
    return model

models = {}

## 3.a KNN

Let's start with `KNeighborsClassifier`.

1. Initialize a KNN model
- Evaluate its performance with the function you previously defined
- Find the optimal value of K using grid search
    - Be careful about how you perform the cross validation in the grid search
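
One way to read the warning about cross validation: tune only on the training split, and use folds that are shuffled and stratified. The sketch below illustrates this on synthetic stand-in data from `make_classification`. The names `X_demo`, `y_demo`, `cv` and `search` are assumptions for the example, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; the lab itself uses the car dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Shuffled, stratified folds: every fold keeps roughly the same class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(1, 20))}, cv=cv)
search.fit(X_tr, y_tr)  # only the training split is used for tuning

print("best k:", search.best_params_["n_neighbors"])
print("mean CV accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(search.score(X_te, y_te), 3))
```

Passing an integer such as `cv=5` to `GridSearchCV` with a classifier also stratifies the folds, but does not shuffle them. An explicit `StratifiedKFold` makes both choices visible.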

In [None]:
models['KNN'] = evaluate_model(KNeighborsClassifier())

In [None]:
pg = {'n_neighbors': list(range(1, 20))}
gs = GridSearchCV(models['KNN'], param_grid=pg, cv=5)
gs.fit(X_train, y_train)
gs.best_params_

## 3.b Bagging + KNN

Now that we have found the optimal K, let's wrap `KNeighborsClassifier` in a BaggingClassifier and see if the score improves.

1. Wrap the KNN model in a Bagging Classifier
- Evaluate performance
- Do a grid search only on the bagging classifier params

In [None]:
models['bagging_knn'] = evaluate_model(BaggingClassifier(models['KNN'].set_params(n_neighbors=5)))

In [None]:
pg = {'n_estimators': [4, 8, 10, 12, 15],
     'bootstrap': [True, False],
     'bootstrap_features': [True, False]}
def grid_search(model, pg):
    gs = GridSearchCV(model, param_grid=pg, cv=3)
    gs.fit(X_train, y_train)
    print(gs.best_params_)
grid_search(models['bagging_knn'], pg)

In [None]:
evaluate_model(models['bagging_knn'].set_params(bootstrap=True, bootstrap_features=False, n_estimators=15))

## 4. Logistic Regression

Let's see if logistic regression performs better.

1. Initialize LR and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

In [None]:
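# 'saga' supports both l1 and l2 penalties (the default 'lbfgs' solver does not support l1)
models['logreg'] = evaluate_model(LogisticRegression(solver='saga', max_iter=5000))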

In [None]:
pg = {
    'penalty': ['l1', 'l2'],
    'C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
}
grid_search(models['logreg'], pg)

In [None]:
models['bagging_logreg'] = evaluate_model(BaggingClassifier(models['logreg'].set_params(penalty='l1', C=100.0)))

In [None]:
pg = {'n_estimators': [4, 8, 10, 12, 15],
     'bootstrap': [True, False],
     'bootstrap_features': [True, False]}
grid_search(models['bagging_logreg'], pg)

In [None]:
evaluate_model(models['bagging_logreg'].set_params(n_estimators=10, bootstrap=True, bootstrap_features=False))

## 5. Decision Trees

Let's see if Decision Trees perform better.

1. Initialize DT and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

In [None]:
models['dec_tree'] = evaluate_model(DecisionTreeClassifier())

In [None]:
pg = {
    'criterion': ['gini', 'entropy'],
    'max_features': [None, 1, 2, 3],
    'max_depth': [None, 4, 5, 6, 7, 8, 9],
    'max_leaf_nodes': [None, 4, 5, 6, 7, 8, 9]
}
grid_search(models['dec_tree'], pg)

In [None]:
models['dec_tree'] = evaluate_model(models['dec_tree'].set_params(criterion='entropy'))

In [None]:
models['bagging_dec_tree'] = evaluate_model(BaggingClassifier(models['dec_tree']))

In [None]:
pg = {'n_estimators': [4, 8, 10, 12, 15],
     'bootstrap': [True, False],
     'bootstrap_features': [True, False]}
grid_search(models['bagging_dec_tree'], pg)

In [None]:
models['bagging_dec_tree'] = evaluate_model(BaggingClassifier(models['dec_tree'], n_estimators=4, 
                                                              bootstrap=False, bootstrap_features=False))

## 6. Support Vector Machines

Let's see if SVMs perform better.

1. Initialize SVM and test on Train/Test set
- Find optimal params with Grid Search
- See if Bagging improves the score

In [None]:
models['svc'] = evaluate_model(SVC())

In [None]:
pg = {
    'C': [1.0, 10.0, 100.0, 1000.0, 10000.0],
    'kernel': ['rbf', 'linear', 'poly'],
    'degree': [1, 2, 3, 4, 5]
}
grid_search(models['svc'], pg)

In [None]:
models['svc'] = evaluate_model(SVC(kernel='poly', C=1000.0, degree=3))

In [None]:
models['bagging_svc'] = evaluate_model(BaggingClassifier(models['svc']))

In [None]:
pg = {'n_estimators': [4, 8, 10, 12, 15],
     'bootstrap': [True, False],
     'bootstrap_features': [True, False]}
grid_search(models['bagging_svc'], pg)

In [None]:
models['bagging_svc'] = evaluate_model(BaggingClassifier(models['svc'], n_estimators=4, 
                                                              bootstrap=False, bootstrap_features=False))

## 7. Random Forest & Extra Trees

Let's see if Random Forest and Extra Trees perform better.

1. Initialize RF and ET and test on Train/Test set
- Find optimal params with Grid Search

In [None]:
models['rand_for'] = evaluate_model(RandomForestClassifier())

In [None]:
pg = {
    'bootstrap': [True, False],
    'criterion':['gini', 'entropy'],
    'max_depth': [None, 3, 5, 7, 9, 15]
}
grid_search(models['rand_for'], pg)

In [None]:
models['rand_for'] = evaluate_model(RandomForestClassifier(bootstrap=False, criterion='entropy'))

In [None]:
models['et'] = evaluate_model(ExtraTreesClassifier())

In [None]:
pg = {
    'bootstrap': [True, False],
    'criterion':['gini', 'entropy'],
    'max_depth': [None, 3, 5, 7, 9, 15]
}
grid_search(models['et'], pg)

In [None]:
models['et'] = evaluate_model(ExtraTreesClassifier(bootstrap=False, criterion='entropy', max_depth=15))

## 8. Model comparison

Let's compare the scores of the various models.

1. Draw a bar chart of the scores of the best models. Which model wins on the train/test split?
- Re-test all the models using a 3-fold stratified shuffled cross validation
- Draw a bar chart with error bars of the average cross validation scores. Is the winner the same?
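
The cross-validated comparison could be sketched as follows. This is a minimal illustration under assumptions, not a reference solution. It uses synthetic data and a small hypothetical `demo_models` dictionary. In the lab you would loop over your own `models` dictionary with `X` and `y` instead.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; replace with X, y from the lab).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)

# Hypothetical stand-in for the lab's `models` dictionary.
demo_models = {
    "KNN": KNeighborsClassifier(),
    "dec_tree": DecisionTreeClassifier(random_state=42),
}

# 3-fold stratified, shuffled cross validation; the same folds are used for every model.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_means, cv_stds = {}, {}
for name, model in demo_models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    cv_means[name] = scores.mean()
    cv_stds[name] = scores.std()

# Bar chart of mean accuracy, with the standard deviation across folds as error bars.
names = list(cv_means)
fig, ax = plt.subplots()
ax.bar(names, [cv_means[n] for n in names], yerr=[cv_stds[n] for n in names], capsize=4)
ax.set_ylabel("accuracy")
ax.set_title("3-fold stratified CV (mean +/- std)")
```

If the error bars of two models overlap, the ranking from a single train/test split may not be meaningful.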


## Bonus

We have encoded the data using a map that preserves the scale.
Would our results have changed if we had instead encoded the categorical data as binary variables using `pd.get_dummies` or `OneHotEncoder`?

1. Repeat the analysis for this scenario. Is it better?
- Experiment with other models or other parameters. Can you beat your classmates' best score?
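
As a starting point for the bonus, this small sketch contrasts the two encodings on a toy frame. The rows are made up for illustration; only the column names and levels come from the dataset.

```python
import pandas as pd

# Toy rows using two of the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "safety": ["low", "med", "high", "med"],
    "lug_boot": ["small", "big", "med", "small"],
})

# Ordinal encoding: one integer column per feature, order preserved.
ordinal = pd.DataFrame({
    "safety": toy["safety"].map({"low": 1, "med": 2, "high": 3}),
    "lug_boot": toy["lug_boot"].map({"small": 1, "med": 2, "big": 3}),
})

# One-hot encoding: one binary column per (feature, level) pair, no order implied.
one_hot = pd.get_dummies(toy)

print(ordinal.shape)
print(one_hot.shape)
print(sorted(one_hot.columns))
```

The one-hot version has more columns and discards the ordering. Whether that helps or hurts is likely to depend on the model, which is what the bonus asks you to test.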