## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully the account has a lot more money in it when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [43]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [3]:
df=pd.read_csv("401ksubs.csv")

In [4]:
df

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.170,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.230,0,1,35,1,154.000,1,0,3749.1130,1225
2,0,12.858,1,0,44,2,0.000,0,0,165.3282,1936
3,0,98.880,1,1,44,2,21.800,0,0,9777.2540,1936
4,0,22.614,0,0,53,1,18.450,0,0,511.3930,2809
...,...,...,...,...,...,...,...,...,...,...,...
9270,0,58.428,1,0,33,4,-1.200,0,0,3413.8310,1089
9271,0,24.546,0,1,37,3,2.000,0,0,602.5061,1369
9272,0,38.550,1,0,33,3,-13.600,0,1,1486.1020,1089
9273,0,34.410,1,0,57,3,3.550,0,0,1184.0480,3249


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

#1) Whether the employer offers a matching contribution  
#2) Whether the person favours a tax break now or later

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

This is akin to racial profiling.  
Ideally, race should have no bearing on a person's income and financial decisions, and hence should not be included.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

In [6]:
df1=df.copy()

In [8]:
df1=df1.drop(columns=['e401k','p401k','pira'])
df1

Unnamed: 0,inc,marr,male,age,fsize,nettfa,incsq,agesq
0,13.170,0,0,40,1,4.575,173.4489,1600
1,61.230,0,1,35,1,154.000,3749.1130,1225
2,12.858,1,0,44,2,0.000,165.3282,1936
3,98.880,1,1,44,2,21.800,9777.2540,1936
4,22.614,0,0,53,1,18.450,511.3930,2809
...,...,...,...,...,...,...,...,...
9270,58.428,1,0,33,4,-1.200,3413.8310,1089
9271,24.546,0,1,37,3,2.000,602.5061,1369
9272,38.550,1,0,33,3,-13.600,1486.1020,1089
9273,34.410,1,0,57,3,3.550,1184.0480,3249


`incsq` should clearly not be used: it is just `inc` squared, so taking `sqrt()` of `incsq` gives the answer directly.
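As a quick sanity check of this leakage, the sketch below takes the first five `inc` and `incsq` values from the dataframe preview above and confirms that the square root of `incsq` reproduces `inc` almost exactly. The variable names are illustrative only.

```python
import numpy as np

# First five (inc, incsq) pairs copied from the dataframe preview
inc = np.array([13.170, 61.230, 12.858, 98.880, 22.614])
incsq = np.array([173.4489, 3749.1130, 165.3282, 9777.2540, 511.3930])

# Recover income from its square
recovered = np.sqrt(incsq)

# Largest absolute difference between recovered and actual income
max_abs_error = np.max(np.abs(recovered - inc))
print(max_abs_error)
```

Any model that receives `incsq` as a feature can learn this mapping, so its score says little about how well the other features predict income.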

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

`incsq` and `agesq` are engineered features.  
This is probably because income and age may have non-linear (quadratic) relationships with the other variables; a squared term lets a linear model capture an effect that grows faster or slower than a straight line.
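To illustrate why a squared term can help, here is a minimal sketch on synthetic data (not the 401k data). It assumes a hypothetical outcome that rises and then falls with age, and compares a linear regression on `age` alone with one on `age` and `age ** 2`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic ages and a hypothetical outcome that peaks around age 45
age = rng.uniform(25, 64, size=500)
outcome = -0.05 * (age - 45) ** 2 + 30 + rng.normal(0, 1, size=500)

# Feature matrices: age only, and age plus its square
X_linear = age.reshape(-1, 1)
X_squared = np.column_stack([age, age ** 2])

# In-sample R^2 for each model
r2_linear = LinearRegression().fit(X_linear, outcome).score(X_linear, outcome)
r2_squared = LinearRegression().fit(X_squared, outcome).score(X_squared, outcome)
print(r2_linear, r2_squared)
```

The model with the squared term can follow the curve, while the model with `age` alone can only fit a straight line through it.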

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

The description of `inc` should not say `inc^2`. It should just mention <i>income</i>.<br>
The same goes for `age`, which should not say `age^2`.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

What do you mean by <i>modelling tactic</i>?

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [15]:
X=df1.drop(columns=['inc'])
y=df1['inc']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [19]:
def create_pipeline_reg(regressor):
    return Pipeline([
        ('scaler',StandardScaler()),
        ('regressor',regressor)
    ])

In [26]:
regressors_name=['LinReg','KNN','DecTree','BaggedDecTrees','RandForest','Adaboost','SVR']
regressors_instance=[LinearRegression(),KNeighborsRegressor(),DecisionTreeRegressor(),
                    BaggingRegressor(),RandomForestRegressor(),AdaBoostRegressor(),SVR()]
regressors_score=[]

In [27]:
for i in range(len(regressors_name)):
    pipe=create_pipeline_reg(regressors_instance[i])
    pipe.fit(X_train,y_train)
    regressors_score.append(pipe.score(X_test,y_test))

In [41]:
reg_scores=pd.DataFrame([regressors_score],columns=regressors_name).T
# Pipeline.score() on a regressor returns R^2 (coefficient of determination), not accuracy
reg_scores.columns=['R2 score']
reg_scores.sort_values(by="R2 score",ascending=False,inplace=True)
reg_scores

Unnamed: 0,R2 score
RandForest,0.999996
BaggedDecTrees,0.999989
DecTree,0.999985
Adaboost,0.990326
KNN,0.963919
LinReg,0.905139
SVR,0.903933


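It appears random forest does the best! Note, though, that `incsq` is still among the features, so scores this close to 1 most likely reflect the tree-based models recovering `inc` from `incsq`.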

##### 9. What is bootstrapping?

Bootstrapping creates a new sample by randomly sampling, with replacement, from the original sample; each bootstrap sample is usually the same size as the original.
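A minimal sketch of bootstrapping with NumPy follows. It uses the first five `inc` values from the preview purely as a toy sample; the number of resamples (2000) is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy sample: first five inc values from the dataframe preview
sample = np.array([13.170, 61.230, 12.858, 98.880, 22.614])

# One bootstrap sample: same size as the original, drawn with replacement
boot = rng.choice(sample, size=sample.size, replace=True)

# Repeating the resampling approximates the sampling distribution of the mean
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])
print(boot, boot_means.std())
```

Because the draw is with replacement, some observations typically appear more than once in `boot` and others not at all.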

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

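A decision tree is just one tree. When grown deep, it is susceptible to high variance (overfitting).  
A set of bagged decision trees fits many trees, each on a bootstrapped sample of the training data, and then aggregates their predictions (averaging for regression, voting for classification).  
Bagging reduces variance while leaving bias roughly unchanged.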
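The contrast between one tree and bagged trees can be sketched by hand on synthetic data. The helper names `fit_bagged_trees` and `predict_bagged` are hypothetical, written only for this illustration; in practice `BaggingRegressor` does the same job.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic noisy sine data (not the 401k data)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

def fit_bagged_trees(X, y, n_trees=50):
    """Fit each tree on its own bootstrap sample of the rows."""
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    """Aggregate by averaging the individual tree predictions."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)

trees = fit_bagged_trees(X, y)
X_new = np.linspace(0, 10, 200).reshape(-1, 1)
truth = np.sin(X_new[:, 0])

# Compare one fully grown tree with the bagged ensemble against the noiseless curve
single_pred = DecisionTreeRegressor().fit(X, y).predict(X_new)
bagged_pred = predict_bagged(trees, X_new)
single_mse = np.mean((single_pred - truth) ** 2)
bagged_mse = np.mean((bagged_pred - truth) ** 2)
print(single_mse, bagged_mse)
```

The single tree chases the noise in individual points, while averaging many bootstrapped trees smooths that noise out.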

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

Bagged decision trees fit each tree on a bootstrapped sample and aggregate the results.  
A random forest adds a layer of randomness: at every split, only a random subset of the features is considered.  
This prevents the trees from relying on the same one (or few) features all the time, so the trees are less correlated with each other.
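In scikit-learn the difference comes down to the `max_features` setting of the underlying trees. The sketch below uses synthetic data and an arbitrary choice of `max_features=3`. One caveat worth knowing: `RandomForestRegressor` considers all features at each split by default, so a default random forest regressor behaves much like bagged trees unless `max_features` is lowered.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with 8 features (not the 401k data)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Bagged trees: bootstrap the rows, every split may use any feature
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
bagged.fit(X, y)

# Random forest: bootstrap the rows AND consider only 3 random features per split
forest = RandomForestRegressor(n_estimators=50, max_features=3, random_state=0)
forest.fit(X, y)

# Inspect the per-split feature limit on one tree from each ensemble
print(bagged.estimators_[0].max_features, forest.estimators_[0].max_features)
```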

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

Avoiding over-reliance on specific features decorrelates the trees, so averaging them reduces variance further.  
This means lower variance at the cost of slightly higher bias.  
As a result, a RF usually generalises better than a set of bagged decision trees, which is why it can be superior.

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [52]:
regressors_RMSE_train=[]
regressors_RMSE_test=[]
for i in range(len(regressors_name)):
    pipe=create_pipeline_reg(regressors_instance[i])
    pipe.fit(X_train,y_train)
    # RMSE = sqrt(MSE); taking the root explicitly avoids the `squared` argument, which recent scikit-learn releases deprecate
    regressors_RMSE_train.append(mean_squared_error(y_train,pipe.predict(X_train))**0.5)
    regressors_RMSE_test.append(mean_squared_error(y_test,pipe.predict(X_test))**0.5)

In [54]:
reg_RMSE=pd.DataFrame([regressors_RMSE_train,regressors_RMSE_test],columns=regressors_name).T
reg_RMSE.columns=['RMSE train score','RMSE test score']
reg_RMSE.sort_values(by="RMSE test score",ascending=True,inplace=True)
reg_RMSE

Unnamed: 0,RMSE train score,RMSE test score
RandForest,0.06925209,0.051365
DecTree,5.872712e-16,0.088696
BaggedDecTrees,0.1149761,0.102959
Adaboost,2.265848,2.251459
KNN,3.395318,4.474329
LinReg,7.82277,7.25492
SVR,8.116213,7.300913


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

Most regressors have train and test RMSE of the same order of magnitude; the exception is the single Decision Tree.  
Its train RMSE is on the order of $10^{-16}$ (essentially zero), but its test RMSE is on the order of $10^{-2}$.  
So the single Decision Tree has clearly overfitted. KNN also shows a milder gap (train 3.40 vs. test 4.47).
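One rough way to rank the models by train/test gap is the ratio of test RMSE to train RMSE, computed here from the values in the table above. Using this ratio as an overfitting indicator is a heuristic assumed for illustration, not a formal test.

```python
# (train RMSE, test RMSE) copied from the results table
rmse = {
    "RandForest": (0.06925209, 0.051365),
    "DecTree": (5.872712e-16, 0.088696),
    "BaggedDecTrees": (0.1149761, 0.102959),
    "Adaboost": (2.265848, 2.251459),
    "KNN": (3.395318, 4.474329),
    "LinReg": (7.82277, 7.25492),
    "SVR": (8.116213, 7.300913),
}

# Ratio well above 1 means the model does much worse on unseen data
gap = {name: test / train for name, (train, test) in rmse.items()}

# Models ordered from largest to smallest test/train ratio
ranked = sorted(gap, key=gap.get, reverse=True)
print(ranked)
```

A ratio near or below 1 (as for most of the models here) gives no sign of overfitting on this split.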

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I'd pick the Random Forest.  
Simply because its random selection of features at every split reduces over-reliance on any single feature.  
This allows the model to generalise well.  
In addition, the score from `pipe.score()` (which is $R^2$ for a regressor) and the RMSE both agree that RF is the best.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

We could tune several hyperparameters, such as `max_depth`, `min_samples_split`, and `min_samples_leaf`.  
These control the complexity of each tree and therefore how much it overfits.
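A minimal sketch of such tuning with `GridSearchCV` is shown below, on synthetic data. The grid values, the number of trees, and the 3-fold cross-validation are arbitrary choices for illustration, not recommendations for the 401k data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (not the 401k data)
X, y = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)

# Small illustrative grid over two tree-complexity hyperparameters
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence the negated RMSE
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_params, best_rmse)
```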

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

In [56]:
df2=df.copy()

By common sense, if one has participated in a 401k, i.e. `p401k == 1`, then they must have been eligible, so `e401k` must be 1 too.  
Therefore, although it is a good predictor, it is meaningless in practice: we're trying to estimate eligibility before participation is known.
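The implication "participation means eligibility" can be checked with a cross-tabulation. The sketch below uses only the first five rows from the dataframe preview, so it demonstrates the check rather than proving anything about the full dataset; on the real data the same lines would be run against `df`.

```python
import pandas as pd

# First five rows of the dataframe preview (only the two columns needed here)
preview = pd.DataFrame({"e401k": [0, 1, 0, 0, 0], "p401k": [0, 1, 0, 0, 0]})

# Cross-tabulate participation (rows) against eligibility (columns)
table = pd.crosstab(preview["p401k"], preview["e401k"])

# Participants who are NOT eligible; zero if p401k == 1 always implies e401k == 1
ineligible_participants = int(
    ((preview["p401k"] == 1) & (preview["e401k"] == 0)).sum()
)
print(table)
print(ineligible_participants)
```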

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

What is a <i>modelling tactic</i>?

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [60]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

In [61]:
X=df2.drop(columns=['e401k'])
y=df2['e401k']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=234)

In [79]:
y.value_counts()/y.shape[0]

0    0.607871
1    0.392129
Name: e401k, dtype: float64

About 39% are 1 and 61% are 0, which is still considered reasonably balanced.

In [82]:
def create_pipeline_class(classifier):
    return Pipeline([
        ('scaler',StandardScaler()),
        ('classifier',classifier)
    ])

In [83]:
classifiers_name=['LogReg','KNN','DecTree','BaggedDecTrees','RandForest','Adaboost','SVC']
classifiers_instance=[LogisticRegression(),KNeighborsClassifier(),DecisionTreeClassifier(),
                     BaggingClassifier(),RandomForestClassifier(),AdaBoostClassifier(),SVC()]
classifiers_TP=[]
classifiers_TN=[]
classifiers_FP=[]
classifiers_FN=[]

In [84]:
for i in range(len(classifiers_name)):
    pipe=create_pipeline_class(classifiers_instance[i])
    pipe.fit(X_train,y_train)
    tn, fp, fn, tp = confusion_matrix(y_test,pipe.predict(X_test)).ravel()
    classifiers_TP.append(tp)
    classifiers_TN.append(tn)
    classifiers_FP.append(fp)
    classifiers_FN.append(fn)

In [85]:
cls_acc_scores=pd.DataFrame([classifiers_TP,classifiers_TN,classifiers_FP,classifiers_FN],
                            columns=classifiers_name).T
cls_acc_scores.columns=['TP','TN','FP','FN']
cls_acc_scores['precision']=cls_acc_scores['TP']/(cls_acc_scores['TP']+cls_acc_scores['FP'])
cls_acc_scores['recall']=cls_acc_scores['TP']/(cls_acc_scores['TP']+cls_acc_scores['FN'])
cls_acc_scores['F1']=2*(cls_acc_scores['precision']*cls_acc_scores['recall'])/(cls_acc_scores['precision']+cls_acc_scores['recall'])
cls_acc_scores.sort_values(by="recall",ascending=False,inplace=True)
cls_acc_scores

Unnamed: 0,TP,TN,FP,FN,precision,recall,F1
DecTree,540,951,188,176,0.741758,0.75419,0.747922
KNN,509,1098,41,207,0.925455,0.710894,0.804107
BaggedDecTrees,503,1110,29,213,0.945489,0.702514,0.80609
RandForest,500,1120,19,216,0.963391,0.698324,0.809717
LogReg,497,1139,0,219,1.0,0.694134,0.819456
Adaboost,497,1138,1,219,0.997992,0.694134,0.818781
SVC,497,1139,0,219,1.0,0.694134,0.819456


## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

FP = people who are predicted to be eligible for a 401k but turn out not to be eligible.  
FN = people who are predicted not to be eligible, but actually are.

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

I'd rather minimise FN.  
It's okay for FP to be high; the worst case is that they get rejected when they apply.  
However, a FN carries an opportunity cost if those eligible people never apply.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

We should optimise recall (also known as sensitivity), since it is the metric that penalises false negatives.

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

The F1 score is the harmonic mean of precision and recall, so it gives equal weight to the two and, in effect, balances FP against FN.
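As a worked example, the three metrics can be recomputed by hand from the KNN row of the confusion-matrix table above (TP=509, TN=1098, FP=41, FN=207).

```python
# Confusion-matrix counts for KNN, copied from the results table
tp, tn, fp, fn = 509, 1098, 41, 207

# Precision: of everyone predicted eligible, the share who really are
precision = tp / (tp + fp)

# Recall: of everyone actually eligible, the share the model found
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 6), round(recall, 6), round(f1, 6))
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot reach a high F1 by excelling at precision alone or recall alone.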

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [89]:
cls_f1_train=[]
cls_f1_test=[]
for i in range(len(classifiers_name)):
    pipe=create_pipeline_class(classifiers_instance[i])
    pipe.fit(X_train,y_train)
    # Score the same fitted pipeline on the train split, then on the test split
    for X_part,y_part,f1_list in [(X_train,y_train,cls_f1_train),(X_test,y_test,cls_f1_test)]:
        tn, fp, fn, tp = confusion_matrix(y_part,pipe.predict(X_part)).ravel()
        precision=tp/(tp+fp)
        recall=tp/(tp+fn)
        f1=2*(precision*recall)/(precision+recall)
        f1_list.append(f1)

[0.8283192940232652,
 0.8522271076276047,
 1.0,
 0.9758318739054291,
 1.0,
 0.8298638911128903,
 0.8292585170340682]

In [93]:
cls_f1=pd.DataFrame([cls_f1_train,cls_f1_test],columns=classifiers_name).T
cls_f1.columns=['F1_train','F1_test']
cls_f1.sort_values(by="F1_test",ascending=False,inplace=True)
cls_f1

Unnamed: 0,F1_train,F1_test
LogReg,0.828319,0.819456
SVC,0.829259,0.819456
Adaboost,0.829864,0.818781
RandForest,1.0,0.80937
BaggedDecTrees,0.975832,0.807968
KNN,0.852227,0.804107
DecTree,1.0,0.751753


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

Yes. Random Forest, Bagged Decision Trees, and the single Decision Tree all overfitted badly: their training F1 is at or near 1.0, while their test F1 is only about 0.75 to 0.81.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I'd pick KNN, based on the recall metric (it ranks second, after the single Decision Tree, which overfits too badly to use).  
It's still important to reduce FN, hence I choose KNN.  
Comparing its F1 train and F1 test, there's slight overfitting, but it is not too severe. I'd say it's acceptable to use KNN.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

Tune `n_neighbors` (the number of neighbours) in the KNN algorithm.
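A minimal sketch of that tuning follows, reusing the scaler-plus-classifier pipeline layout from earlier and scoring on recall. The data is synthetic, and the candidate values for `n_neighbors` and the 5-fold cross-validation are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly a 60/40 class split (not the 401k data)
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.6, 0.4], random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])

# Step name + double underscore + parameter name targets the KNN step
param_grid = {"classifier__n_neighbors": [3, 5, 11, 21]}

# Optimise recall, since false negatives are the bigger concern here
search = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
search.fit(X, y)

best_k = search.best_params_["classifier__n_neighbors"]
print(best_k, search.best_score_)
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling.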

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.