## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and, hopefully, grows into a lot more money by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

In [100]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import re
from time import sleep

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import roc_curve, confusion_matrix
from sklearn.utils.multiclass import unique_labels
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

%matplotlib inline

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [2]:
df = pd.read_csv('401ksubs.csv')

In [3]:
df.head()

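|   | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |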


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

Due to historical systemic racism that favors white people over people of color, past data would make for a training set that is heavily biased in favor of white people. A model trained on it would learn to target customers by race and would reinforce that existing inequity.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

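Gender and marital status. Both can allow societal bias to adversely affect our model. We also would not use `e401k`, `p401k`, or `pira` (per the note above), or `incsq`, which is computed directly from `inc`.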

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

In [6]:
df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')
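
A quick check, using only the five rows printed by `df.head()` above, suggests that `incsq` and `agesq` are simply the squares of `inc` and `age`. This is a sketch rather than a test of the full dataset: the arrays are typed in by hand from that output, and the tolerance is an assumption chosen to absorb the rounding in the printed values.

```python
import numpy as np

# Values copied by hand from the df.head() output above.
inc = np.array([13.17, 61.23, 12.858, 98.88, 22.614])
incsq = np.array([173.4489, 3749.113, 165.3282, 9777.254, 511.393])
age = np.array([40, 35, 44, 44, 53])
agesq = np.array([1600, 1225, 1936, 1936, 2809])

# The printed incomes are rounded, so compare with a relative tolerance.
inc_is_squared = np.allclose(inc ** 2, incsq, rtol=1e-5)
# Ages are integers, so an exact comparison is fine.
age_is_squared = np.array_equal(age ** 2, agesq)

print(inc_is_squared, age_is_squared)
```

On the real dataframe, the same comparison could be run on `df['inc']` and `df['incsq']` directly.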

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

`agesq` is assigned the wrong data type; it should be a float, not an int.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.
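
One way to honor this restriction is to drop the off-limits columns before building the feature matrix. The sketch below works on the column names from the `df.columns` output above, with no rows. Dropping `incsq` as well is my own assumption: since it is the square of the target, leaving it in would let a model recover `inc` almost exactly. The names `frame`, `X_reg`, and `y_reg` are hypothetical.

```python
import pandas as pd

# Column names copied from the df.columns output above; no rows are needed
# to show which features survive.
columns = ["e401k", "inc", "marr", "male", "age", "fsize", "nettfa",
           "p401k", "pira", "incsq", "agesq"]
frame = pd.DataFrame(columns=columns)

target = "inc"
off_limits = ["e401k", "p401k", "pira"]  # excluded by the note above
derived_from_target = ["incsq"]          # assumption: inc**2 would leak the target

X_reg = frame.drop(columns=[target] + off_limits + derived_from_target)
y_reg = frame[target]

print(list(X_reg.columns))
# ['marr', 'male', 'age', 'fsize', 'nettfa', 'agesq']
```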

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

Multiple linear regression is suitable for this task because we are using many predictors to predict a continuous value, and its coefficients are easy to interpret. K-nearest neighbors can also be used: it is a supervised method that predicts the average target of the closest training points, although it needs scaled features and says little about which features matter. Decision trees and random forests could work because they can capture non-linear relationships and interactions that a linear regression model might miss. A support vector regressor could also work well, but it is hard to interpret, so we would not easily be able to determine from it which features are most important to income.

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
- a multiple linear regression model
- a k-nearest neighbors model
- a decision tree
- a set of bagged decision trees
- a random forest
- an AdaBoost model
- a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!
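
A minimal sketch of that idea: keep the estimators in a dictionary and fit each one on the same split. The data here is a synthetic stand-in from `make_regression` (an assumption, so the block runs on its own), and the names `regressors` and `reg_scores` are hypothetical. K-nearest neighbors and the support vector regressor are wrapped in a pipeline with `StandardScaler` because both depend on distances between points.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; swap in the real X and y).
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

regressors = {
    "linear regression": LinearRegression(),
    # Distance-based models get scaled features.
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "decision tree": DecisionTreeRegressor(random_state=42),
    # BaggingRegressor bags decision trees by default.
    "bagged trees": BaggingRegressor(random_state=42),
    "random forest": RandomForestRegressor(random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "support vector regressor": make_pipeline(StandardScaler(), SVR()),
}

# Fit every model on the same training rows and record (train R^2, test R^2).
reg_scores = {}
for name, model in regressors.items():
    model.fit(X_tr, y_tr)
    reg_scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name:>25}: train R^2 = {reg_scores[name][0]:.3f}, "
          f"test R^2 = {reg_scores[name][1]:.3f}")
```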

In [65]:
X = df.drop(columns=['inc'])

In [66]:
y = df['inc']

In [67]:
X.shape

(9275, 10)

In [68]:
y.shape

(9275,)

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [86]:
lr = LinearRegression()

In [87]:
cross_val_score(lr, X_train, y_train, cv=3)

array([0.90115947, 0.90514157, 0.90143968])

In [104]:
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [79]:
lr.score(X_train, y_train)

0.9034469047706878

In [80]:
lr.score(X_test, y_test)

0.9118802283684437

In [89]:
rf = RandomForestRegressor()

In [93]:
cross_val_score(rf, X_train, y_train, cv=3)

array([0.99986588, 0.9999685 , 0.99980847])

In [90]:
rf.fit(X_train, y_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [91]:
rf.score(X_train, y_train)

0.9999839169321548

In [92]:
rf.score(X_test, y_test)

0.9999852432163561

In [94]:
dt = DecisionTreeRegressor()

In [95]:
cross_val_score(dt, X_train, y_train, cv=3)



array([0.99935995, 0.99997248, 0.99992951])

In [96]:
dt.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [97]:
dt.score(X_train, y_train)

1.0

In [98]:
dt.score(X_test, y_test)

0.999967099052897

##### 9. What is bootstrapping?

Bootstrapping is a statistical method that makes inferences by repeatedly drawing random samples, with replacement, from the data we already have; each resample is typically the same size as the original sample. It is generally used when resources are limited and we cannot collect as much data as we would like. It relies on the assumption that the observed sample is representative of the distribution it came from, so resampling from it approximates sampling from that distribution.
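
As a rough illustration, here is a bootstrap estimate of a 95% interval for a mean. The sample is made up (50 hypothetical incomes in $1000s), and the percentile interval is just one of several ways to turn bootstrap resamples into an interval.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed sample of 50 incomes in $1000s (made up for illustration).
sample = rng.normal(loc=40.0, scale=12.0, size=50)

n_resamples = 1000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw a resample of the same size as the original, with replacement.
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile interval: the middle 95% of the bootstrapped means.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```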

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

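A single decision tree is fit once on all of the training data. A set of bagged decision trees is an ensemble in which each tree is fit on a different bootstrap sample, that is, rows drawn with replacement from the training data (typically as many rows as the training set has). The trees' predictions are then averaged (for regression) or put to a vote (for classification).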
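
To make the contrast concrete, here is a hand-rolled sketch of bagging on synthetic data. In practice `BaggingRegressor` does this for us; the loop below only spells out the mechanics, and the names `single_tree` and `bagged_trees` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=42)

# A single tree sees every training row exactly once.
single_tree = DecisionTreeRegressor(random_state=42).fit(X_demo, y_demo)

# Bagging: fit each tree on its own bootstrap sample of the rows.
n_trees = 25
bagged_trees = []
for _ in range(n_trees):
    # Sample row indices with replacement, as many as there are rows.
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(random_state=42).fit(X_demo[rows], y_demo[rows])
    bagged_trees.append(tree)

# The ensemble prediction is the average of the individual trees' predictions.
bagged_pred = np.mean([tree.predict(X_demo) for tree in bagged_trees], axis=0)
print(bagged_pred.shape)
```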

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

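A random forest is a set of bagged decision trees with one extra source of randomness: at each split, a tree may only consider a random subset of the features. Bagged decision trees consider every feature at every split.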

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

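Because bagged trees all see every feature, they tend to split on the same strong predictors and end up highly correlated, so averaging them removes less variance. By restricting each split to a random subset of the features, a random forest decorrelates its trees. Averaging many less-correlated trees lowers variance, which helps avoid overfitting, at the cost of a small increase in bias.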
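
In scikit-learn the difference comes down to how many features each split may consider. The sketch below sets up both ensembles on synthetic data; the choice of `max_features=3` (a third of the nine features) and `n_estimators=50` are assumptions for illustration, and which ensemble scores higher will depend on the data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=42)

# Bagged trees: bootstrap the rows, but every split may use every feature.
bagged = BaggingRegressor(n_estimators=50, random_state=42)

# Random forest: bootstrap the rows AND consider only 3 randomly chosen
# features at each split, which decorrelates the trees.
forest = RandomForestRegressor(n_estimators=50, max_features=3, random_state=42)

bagged_cv = cross_val_score(bagged, X_demo, y_demo, cv=3)
forest_cv = cross_val_score(forest, X_demo, y_demo, cv=3)
print(f"bagged trees  mean CV R^2: {bagged_cv.mean():.3f}")
print(f"random forest mean CV R^2: {forest_cv.mean():.3f}")
```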

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [29]:
from sklearn.metrics import mean_squared_error

In [105]:
y_pred_lr = lr.predict(X_test)

In [106]:
mean_squared_error(y_test, y_pred_lr)

52.53910169167566

In [107]:
lr.score(X_train, y_train)

0.9034469047706878

In [108]:
lr.score(X_test, y_test)

0.9118802283684437

In [109]:
y_pred_rf = rf.predict(X_test)

In [110]:
mean_squared_error(y_test, y_pred_rf)

0.00879834504959043

In [111]:
rf.score(X_train, y_train)

0.9999839169321548

In [112]:
rf.score(X_test, y_test)

0.9999852432163561

In [113]:
y_pred_dt = dt.predict(X_test)

In [114]:
mean_squared_error(y_test, y_pred_dt)

0.01961632643380773

In [115]:
dt.score(X_train, y_train)

1.0

In [116]:
dt.score(X_test, y_test)

0.999967099052897
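
Note that `mean_squared_error` returns MSE, so the values above are in squared units. RMSE is its square root, which puts the error back in the units of the target. Below is a small sketch with a hypothetical helper, `rmse`, shown on synthetic data; the same helper could be called with any of the fitted models and either split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse(model, X, y):
    """Root mean squared error of a fitted model's predictions on (X, y)."""
    return np.sqrt(mean_squared_error(y, model.predict(X)))


# Synthetic stand-in data (an assumption; swap in the real split).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_lr = LinearRegression().fit(X_tr, y_tr)
train_rmse = rmse(demo_lr, X_tr, y_tr)
test_rmse = rmse(demo_lr, X_te, y_te)
print(f"train RMSE = {train_rmse:.3f}, test RMSE = {test_rmse:.3f}")
```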

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

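There is little evidence of overfitting. The cells above report R² and test MSE rather than RMSE, but every model scores about the same on training and testing data: linear regression scores 0.903 and 0.912, the random forest scores above 0.9999 on both, and the decision tree scores 1.0 and 0.99997. The decision tree's perfect training score is the closest thing to a warning sign. The near-perfect tree scores more likely reflect that `incsq`, the square of the target, was left in `X`.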

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I'd pick the random forest because it has the lowest test error (a test MSE of about 0.009, versus about 0.020 for the decision tree and 52.5 for linear regression), and its R² on the testing data is essentially identical to its R² on the training data.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

I would run a grid search to tune the random forest's hyperparameters, such as `n_estimators`, `max_depth`, and `min_samples_leaf`.
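
A sketch of what that grid search might look like, on synthetic data. The grid values are deliberately small and are assumptions chosen so the example runs quickly; a real search would cover a wider range.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

# A deliberately small, illustrative grid; the values are assumptions.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the negation
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # flip the sign back to get cross-validated RMSE
```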

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

In [117]:
X = df.drop(columns=['e401k'])
y = df['e401k']

In [118]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

Logistic regression, k-nearest neighbors, decision trees, random forests, and support vector machines.

Logistic regression is appropriate for this problem because we are using multiple predictors to predict one of two possible outcomes. K-nearest neighbors is also a supervised classifier and could be used here, provided the features are scaled so that no single variable dominates the distance calculation. Decision trees and random forests would also work as classifiers. Support vector machines handle binary classification like this one well, though they also need scaled features and are harder to interpret.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
- a logistic regression model
- a k-nearest neighbors model
- a decision tree
- a set of bagged decision trees
- a random forest
- an AdaBoost model
- a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [119]:
logreg = LogisticRegression()

In [121]:
cross_val_score(logreg, X_train, y_train, cv=3)



array([0.88663793, 0.88308887, 0.88006903])

In [151]:
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [152]:
logreg.score(X_train, y_train)

0.8834100057504313

In [153]:
logreg.score(X_test, y_test)

0.8861578266494179

In [125]:
rm = RandomForestClassifier()

In [126]:
cross_val_score(rm, X_train, y_train, cv=3)



array([0.87974138, 0.86971527, 0.8718723 ])

In [127]:
rm.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [128]:
rm.score(X_train, y_train)

0.9823174238067856

In [129]:
rm.score(X_test, y_test)

0.8732212160413971

In [130]:
dt = DecisionTreeClassifier()

In [132]:
cross_val_score(dt, X_train, y_train, cv=3)

array([0.80474138, 0.79637619, 0.81363244])

In [133]:
dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [134]:
dt.score(X_train, y_train)

1.0

In [135]:
dt.score(X_test, y_test)

0.7981888745148771

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

A false positive is predicting that someone is eligible for a 401(k) when they really are not. A false negative is predicting that someone is ineligible for a 401(k) when they really are eligible.
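
Both counts can be read off a confusion matrix. The labels below are hypothetical, made up so the counts are easy to verify by hand.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

print(f"false positives (predicted eligible, actually not): {fp}")
print(f"false negatives (predicted not eligible, actually eligible): {fn}")
```

Here the second person is a false positive and the third is a false negative, so each count is 1.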

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

From a business perspective, I would rather minimize false positives, so that the company spends as little effort as possible on people who turn out not to be eligible for a 401(k).

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

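We would optimize specificity, the share of truly ineligible people we correctly classify as ineligible, since it rises as false positives fall. Precision also penalizes false positives. Sensitivity, by contrast, targets false negatives.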

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

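The f1-score is the harmonic mean of precision and recall, so it is only high when both false positives and false negatives are kept low. It ignores true negatives, and therefore specificity.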
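
For reference, here is how these metrics fall out of the four confusion-matrix counts. The counts are hypothetical, made up for illustration.

```python
# Hypothetical confusion-matrix counts, made up for illustration.
tp, fp, fn, tn = 40, 10, 20, 30

sensitivity = tp / (tp + fn)  # recall: share of eligible people we catch
specificity = tn / (tn + fp)  # share of ineligible people we correctly rule out
precision = tp / (tp + fp)    # share of predicted-eligible people who really are

# F1 is the harmonic mean of precision and recall; tn never enters it.
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(round(sensitivity, 3), round(specificity, 3), round(precision, 3), round(f1, 3))
# 0.667 0.75 0.8 0.727
```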

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [139]:
from sklearn.metrics import f1_score

In [142]:
y_pred_log = logreg.predict(X_test)

In [148]:
f1_score(y_test, y_pred_log)

0.8274509803921568

In [149]:
y_pred_rm = rm.predict(X_test)

In [150]:
f1_score(y_test, y_pred_rm)

0.8136882129277566
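
The cells above compute f1-score on the testing data only. To compare training and testing f1-score for every model in the list from question 19, one option is a loop like the sketch below. It runs on synthetic data from `make_classification` (an assumption, so the block is self-contained), and the names `classifiers` and `f1_results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; swap in the real X and y).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, n_informative=4,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

classifiers = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "decision tree": DecisionTreeClassifier(random_state=42),
    # BaggingClassifier bags decision trees by default.
    "bagged trees": BaggingClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "support vector classifier": make_pipeline(StandardScaler(), SVC()),
}

# Record (train f1, test f1) for each model; a large gap suggests overfitting.
f1_results = {}
for name, model in classifiers.items():
    model.fit(X_tr, y_tr)
    f1_results[name] = (
        f1_score(y_tr, model.predict(X_tr)),
        f1_score(y_te, model.predict(X_te)),
    )
    print(f"{name:>26}: train f1 = {f1_results[name][0]:.3f}, "
          f"test f1 = {f1_results[name][1]:.3f}")
```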

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

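There is overfitting in the decision tree and the random forest. The cells above only compute f1-score on the testing data, but accuracy tells the story: the decision tree scores 1.0 on training data versus 0.798 on testing data, and the random forest scores 0.982 versus 0.873. Logistic regression scores about the same on both (0.883 and 0.886).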

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I would choose logistic regression because it has the highest test f1-score of the models I evaluated (0.827, versus 0.814 for the random forest), and it is not overfit.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

I would scale the input features and perform feature engineering to find interaction terms between the categorical variables.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

For regression, the best model is the random forest. For classification, the best model is logistic regression.