# Model Comparison Lab

In this lab we will compare the performance of all the models we have learned about so far, using the car evaluation dataset.

## 1. Prepare the data

The [car evaluation dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/car/) is in the assets/datasets folder. By now you should be very familiar with this dataset.

1. Load the data into a pandas dataframe
2. Encode the categorical features properly: define a map that preserves the scale (assigning smaller numbers to words indicating smaller quantities)
3. Separate features from target into X and y
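A dictionary-based mapping is one way to express the scale-preserving encoding described above. The sketch below runs on a tiny hand-made frame rather than the real `car.csv`, and the names `LEVELS`, `ORDINAL_MAPS` and `encode_ordinal` are illustrative assumptions, not part of the lab materials.

```python
import pandas as pd

# Ordered category -> integer maps; words for smaller quantities get smaller numbers.
LEVELS = {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}
ORDINAL_MAPS = {
    'buying': LEVELS,
    'maint': LEVELS,
    'doors': {'2': 2, '3': 3, '4': 4, '5more': 5},
    'persons': {'2': 2, '4': 4, 'more': 5},
    'lug_boot': {'small': 1, 'med': 2, 'big': 3},
    'safety': {'low': 1, 'med': 2, 'high': 3},
}

def encode_ordinal(frame):
    """Return a copy of frame with every known column mapped to integers."""
    encoded = frame.copy()
    for column, mapping in ORDINAL_MAPS.items():
        if column in encoded.columns:
            # astype(str) guards against pandas having parsed '2' as the int 2.
            encoded[column] = encoded[column].astype(str).map(mapping)
    return encoded

# Hypothetical two-row sample, only used to demonstrate the mapping.
toy = pd.DataFrame({
    'buying': ['vhigh', 'low'],
    'maint': ['med', 'high'],
    'doors': ['2', '5more'],
    'persons': ['more', '4'],
    'lug_boot': ['small', 'big'],
    'safety': ['low', 'high'],
})
encoded_toy = encode_ordinal(toy)
print(encoded_toy)
```

Any value missing from a map becomes `NaN`, which makes unexpected categories easy to spot with `encoded_toy.isnull().sum()`.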

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

%matplotlib inline

In [29]:
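def valueChange(x):
    # Map ordered category words to integers that preserve their scale.
    if x == 'med':
        return 2
    elif x == 'low' or x == 'small':
        return 1
    elif x == 'high' or x == 'big':
        return 3
    else:
        # Only 'vhigh' (buying, maint) remains.
        return 4

def moreChange(x):
    # '5more' (doors) and 'more' (persons) become 5; other values are cast to int
    # so the column does not end up mixing strings and integers.
    if 'more' in str(x):
        return 5
    else:
        return int(x)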

In [30]:
df = pd.read_csv('/Users/michael/DSI-projects/week-06/3.4-lab-feature-importance/assets/datasets/car.csv')
df['buying'] = df['buying'].apply(valueChange)
df['maint'] = df['maint'].apply(valueChange)
df['lug_boot'] = df['lug_boot'].apply(valueChange)
df['safety'] = df['safety'].apply(valueChange)
df['doors'] = df['doors'].apply(moreChange)
df['persons'] = df['persons'].apply(moreChange)

df.head()

   buying  maint  doors  persons  lug_boot  safety  acceptability
0       4      4      2        2         1       1          unacc
1       4      4      2        2         1       2          unacc
2       4      4      2        2         1       3          unacc
3       4      4      2        2         2       1          unacc
4       4      4      2        2         2       2          unacc


In [35]:
X = df[list(df.columns[0:6])]
y = df['acceptability']

In [47]:
y.value_counts()

unacc    1210
acc       384
good       69
vgood      65
Name: acceptability, dtype: int64

## 2. Useful preparation

Since we will compare several models, let's write a couple of helper functions.

1. Separate X and y between a train and test set, using 30% test set, random state = 42
    - make sure that the data is shuffled and stratified
2. Define a function called `evaluate_model` that trains the model on the train set, tests it on the test set, and calculates:
    - accuracy score
    - confusion matrix
    - classification report
3. Initialize a global dictionary to store the various models for later retrieval
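The three steps above could be sketched as follows. This is a minimal illustration on synthetic data from `make_classification` (an assumption standing in for the encoded car features), and the names `all_models` and `evaluate_and_store` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_classes=4, random_state=42)

# shuffle=True is the default; stratify=y keeps class proportions equal in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

all_models = {}  # global registry: name -> fitted model and its test accuracy

def evaluate_and_store(name, model):
    """Fit on the train set, report on the test set, and record the result."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    score = accuracy_score(y_te, pred)
    print(confusion_matrix(y_te, pred))
    print(classification_report(y_te, pred))
    print('accuracy score: ' + str(score))
    all_models[name] = {'model': model, 'score': score}
    return score

lr_score = evaluate_and_store('logreg', LogisticRegression(max_iter=1000))
```

Passing the model in as an argument lets the same function be reused for every classifier in the later sections, and the dictionary keeps the fitted models available for the final comparison.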


In [36]:
# Note: stratify=y is not passed here, so this split is shuffled but not stratified.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [37]:
X_train.head()

      buying  maint  doors  persons  lug_boot  safety
1178       2      2      5        4         3       3
585        3      3      3        5         1       1
1552       1      2      3        4         2       2
1169       2      2      5        2         3       3
1033       2      3      4        2         3       2


In [52]:
def evaluate_model(model=None):
    # Train on the train set, then report test-set metrics.
    # Falls back to logistic regression when no model is passed.
    if model is None:
        model = LogisticRegression(solver='liblinear')
    model.fit(X_train, y_train)
    ypred = model.predict(X_test)
    model_cm = confusion_matrix(y_test, ypred, labels=model.classes_)
    model_cm = pd.DataFrame(model_cm, columns=model.classes_, index=model.classes_)
    print(model_cm)
    print(classification_report(y_test, ypred, labels=model.classes_))
    print('accuracy score: ' + str(accuracy_score(y_test, ypred)))

In [53]:
evaluate_model()

       acc  good  unacc  vgood
acc     62     4     52      0
good    17     1      1      0
unacc   12     2    341      3
vgood   21     0      0      3
             precision    recall  f1-score   support

        acc       0.55      0.53      0.54       118
       good       0.14      0.05      0.08        19
      unacc       0.87      0.95      0.91       358
      vgood       0.50      0.12      0.20        24

avg / total       0.75      0.78      0.76       519

accuracy score: 0.784200385356


## 3.a KNN

Let's start with `KNeighborsClassifier`.

1. Initialize a KNN model
2. Evaluate its performance with the function you previously defined
3. Find the optimal value of K using grid search
    - Be careful about how you perform the cross validation in the grid search
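One possible reading of the cross-validation warning: the rows of the car data appear sorted by feature value (see `df.head()`), so folds built without shuffling may not be representative. The sketch below passes an explicitly shuffled `StratifiedKFold` to `GridSearchCV`; it uses synthetic data as a stand-in, and the range of K values is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Shuffle inside the CV splitter so that row order cannot bias the folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
knn_grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 16, 2))},
    cv=cv,
    scoring='accuracy',
)
knn_grid.fit(X_tr, y_tr)  # the search only ever sees the training set

best_k = knn_grid.best_params_['n_neighbors']
knn_test_score = knn_grid.score(X_te, y_te)  # best model, scored on held-out data
print(best_k, knn_grid.best_score_, knn_test_score)
```

Fitting the search on the training split only keeps the test set untouched, so the final score is not inflated by the choice of K.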

## 3.b Bagging + KNN

Now that we have found the optimal K, let's wrap `KNeighborsClassifier` in a BaggingClassifier and see if the score improves.

1. Wrap the KNN model in a Bagging Classifier
2. Evaluate performance
3. Do a grid search only on the bagging classifier params
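A minimal sketch of these steps, assuming K has already been fixed (5 here is a placeholder, not a result from the lab). Only parameters of `BaggingClassifier` itself appear in the grid, so the wrapped KNN stays unchanged. The data is again synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# The base estimator is passed positionally; K=5 is a placeholder value.
bagged_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5), random_state=42)

# Grid over bagging parameters only: ensemble size and the fraction of
# rows / columns each base model is trained on.
bagging_grid = GridSearchCV(
    bagged_knn,
    param_grid={
        'n_estimators': [10, 20],
        'max_samples': [0.5, 1.0],
        'max_features': [0.5, 1.0],
    },
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
bagging_grid.fit(X_tr, y_tr)
bagged_knn_test_score = bagging_grid.score(X_te, y_te)
print(bagging_grid.best_params_, bagged_knn_test_score)
```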

## 4. Logistic Regression

Let's see if logistic regression performs better.

1. Initialize LR and test on Train/Test set
2. Find optimal params with Grid Search
3. See if Bagging improves the score
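The sketch below tunes the regularization strength `C` and then bags a fresh copy of the tuned model. It runs on synthetic data, and the grid values and `n_estimators=20` are arbitrary assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Smaller C means stronger regularization.
lr_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                       param_grid={'C': [0.01, 0.1, 1.0, 10.0]}, cv=cv)
lr_grid.fit(X_tr, y_tr)
lr_test_score = lr_grid.score(X_te, y_te)

# Bag a fresh copy of the tuned model and compare on the same test set.
best_c = lr_grid.best_params_['C']
bagged_lr = BaggingClassifier(LogisticRegression(C=best_c, max_iter=1000),
                              n_estimators=20, random_state=42)
bagged_lr.fit(X_tr, y_tr)
bagged_lr_test_score = bagged_lr.score(X_te, y_te)
print(best_c, lr_test_score, bagged_lr_test_score)
```

Bagging mostly helps high-variance models, so a stable linear model may gain little from it; comparing the two scores is the point of the exercise.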

## 5. Decision Trees

Let's see if Decision Trees perform better.

1. Initialize DT and test on Train/Test set
2. Find optimal params with Grid Search
3. See if Bagging improves the score
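For decision trees, depth and leaf size are typical parameters to search over. The following is a sketch on synthetic data; the grid values are assumptions, and `clone` is used so that the bagged ensemble starts from unfitted copies of the tuned tree.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

tree_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={
        'max_depth': [2, 4, 6, None],     # None lets the tree grow fully
        'min_samples_leaf': [1, 5],
        'criterion': ['gini', 'entropy'],
    },
    cv=cv,
)
tree_grid.fit(X_tr, y_tr)
best_tree = tree_grid.best_estimator_
tree_test_score = best_tree.score(X_te, y_te)

# Bag unfitted copies of the tuned tree and compare on the same test set.
bagged_tree = BaggingClassifier(clone(best_tree), n_estimators=25, random_state=42)
bagged_tree.fit(X_tr, y_tr)
bagged_tree_test_score = bagged_tree.score(X_te, y_te)
print(tree_grid.best_params_, tree_test_score, bagged_tree_test_score)
```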

## 6. Support Vector Machines

Let's see if SVMs perform better.

1. Initialize SVM and test on Train/Test set
2. Find optimal params with Grid Search
3. See if Bagging improves the score
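With `SVC`, the kernel, `C` and `gamma` are the usual parameters to tune. The sketch below uses a list of two grids so that `gamma` is only searched for the RBF kernel; the data is synthetic and the candidate values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Two sub-grids: gamma has no effect on the linear kernel, so skip it there.
svm_param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1.0, 10.0]},
    {'kernel': ['rbf'], 'C': [0.1, 1.0, 10.0], 'gamma': ['scale', 0.1]},
]
svm_grid = GridSearchCV(SVC(), param_grid=svm_param_grid, cv=cv)
svm_grid.fit(X_tr, y_tr)
svm_test_score = svm_grid.score(X_te, y_te)

# Bag SVCs configured with the best parameters found above.
bagged_svm = BaggingClassifier(SVC(**svm_grid.best_params_),
                               n_estimators=10, random_state=42)
bagged_svm.fit(X_tr, y_tr)
bagged_svm_test_score = bagged_svm.score(X_te, y_te)
print(svm_grid.best_params_, svm_test_score, bagged_svm_test_score)
```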

## 7. Random Forest & Extra Trees

Let's see if Random Forest and Extra Trees perform better.

1. Initialize RF and ET and test on Train/Test set
2. Find optimal params with Grid Search
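Random Forest and Extra Trees share most of their parameters, so one grid can serve both. This sketch loops over the two estimators on synthetic data; the grid values and the `forest_results` dictionary are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# One grid for both ensembles; max_features=None uses every feature at each split.
forest_grid = {
    'n_estimators': [25, 50],
    'max_depth': [None, 5],
    'max_features': ['sqrt', None],
}

forest_results = {}
for name, estimator in [
    ('random_forest', RandomForestClassifier(random_state=42)),
    ('extra_trees', ExtraTreesClassifier(random_state=42)),
]:
    search = GridSearchCV(estimator, forest_grid, cv=cv)
    search.fit(X_tr, y_tr)
    forest_results[name] = {
        'params': search.best_params_,
        'test_score': search.score(X_te, y_te),
    }

print(forest_results)
```

Both models are already bagged ensembles of trees, which is presumably why this section has no separate bagging step.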

## 8. Model comparison

Let's compare the scores of the various models.

1. Make a bar chart of the scores of the best models. Which model wins on the train/test split?
2. Re-test all the models using 3-fold stratified, shuffled cross validation
3. Make a bar chart with error bars of the average cross-validation scores. Is the winner the same?
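The cross-validated comparison could look roughly like this. The sketch uses default or lightly configured models on synthetic data rather than the tuned models from the earlier sections, and the `candidates` dictionary is a hypothetical stand-in for the global model store.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded car data (assumption: 6 features, 4 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=4, random_state=42)

# Hypothetical model store; in the lab these would be the tuned models.
candidates = {
    'knn': KNeighborsClassifier(),
    'logreg': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(random_state=42),
    'forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

# 3-fold, stratified and shuffled, as the exercise asks.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_means, cv_stds = {}, {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    cv_means[name] = scores.mean()
    cv_stds[name] = scores.std()

# Bar height is the mean fold accuracy; the error bar is one standard deviation.
names = list(candidates)
fig, ax = plt.subplots()
ax.bar(names, [cv_means[n] for n in names],
       yerr=[cv_stds[n] for n in names], capsize=4)
ax.set_ylabel('3-fold CV accuracy')
ax.set_title('Model comparison')
```

If the error bars of two models overlap, the difference between them on a single train/test split is probably not meaningful.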


## Bonus

We have encoded the data using a map that preserves the scale.
Would our results have changed if we had used `pd.get_dummies` or `OneHotEncoder` to encode the categorical data as binary variables instead?

1. Repeat the analysis for this scenario. Is it better?
2. Experiment with other models or other parameters. Can you beat your classmates' best score?
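As a starting point for the bonus, `pd.get_dummies` turns each category into its own binary column. The sketch below shows this on a hypothetical three-row sample with two of the car features, not on the real dataset.

```python
import pandas as pd

# Hypothetical sample holding the raw (unencoded) category strings.
toy = pd.DataFrame({
    'buying': ['vhigh', 'low', 'med'],
    'safety': ['low', 'high', 'med'],
})

# One binary column per (feature, category) pair; the ordering between
# categories is no longer encoded.
one_hot = pd.get_dummies(toy, columns=['buying', 'safety'])
print(one_hot.columns.tolist())
print(one_hot)
```

The resulting frame can be passed to the same train/test split and evaluation helpers as the ordinal version, which makes the two encodings directly comparable.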