In [107]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import (
    AdaBoostClassifier, 
    BaggingClassifier, 
    RandomForestClassifier,
    BaggingRegressor, 
    RandomForestRegressor, 
    AdaBoostRegressor)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, so the balance has hopefully grown considerably by the time you retire.
- These accounts differ from regular bank accounts in that they carry certain tax benefits. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [3]:
df = pd.read_csv('401ksubs.csv')

In [4]:
df.shape

(9275, 11)

In [5]:
df.head()

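| | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |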


In [9]:
df.isnull().sum()

e401k     0
inc       0
marr      0
male      0
age       0
fsize     0
nettfa    0
p401k     0
pira      0
incsq     0
agesq     0
dtype: int64

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

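`incsq`, plus `e401k`, `p401k`, and `pira`.
`inc` is the target itself, and `incsq` is engineered directly from it, so using it as a feature would leak the target into the model. The note above also asks us to pretend we do not have access to `e401k`, `p401k`, and `pira` when predicting `inc`.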

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!
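
As a quick sanity check on question 5, the sketch below confirms that the two engineered columns are simply squares of existing columns. It uses only the first three rows shown in the `df.head()` output, typed in by hand; the `sample` frame and the tolerance are assumptions made for illustration, not part of the lab.

```python
import numpy as np
import pandas as pd

# First three rows of df.head() from this notebook, typed in by hand.
sample = pd.DataFrame({
    'inc':   [13.17, 61.23, 12.858],
    'incsq': [173.4489, 3749.113, 165.3282],
    'age':   [40, 35, 44],
    'agesq': [1600, 1225, 1936],
})

# The engineered columns should equal the square of their source columns.
# A relative tolerance absorbs the rounding visible in the printed values.
inc_matches = np.allclose(sample['inc'] ** 2, sample['incsq'], rtol=1e-4)
age_matches = np.allclose(sample['age'] ** 2, sample['agesq'])
print(inc_matches, age_matches)
```

Squared terms like these let a linear model capture a curved relationship, which is one plausible reason subject-matter experts would add them.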

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

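Target: Income

1. Linear Regression            --> Appropriate: the target is continuous, and the coefficients show the effect of each feature.
2. Logistic Regression          --> Not a regression tactic despite its name: it predicts a categorical target, so it does not belong on this list.
3. DecisionTree, RF, Bagging    --> Appropriate: they handle a continuous target, and feature importances from the fitted trees show which features (columns) matter.
4. KNN                          --> Can predict a continuous target, but not appropriate here: it is non-parametric and exposes no coefficients or feature importances, so it cannot tell us which features best predict income.
5. ADA                          --> Appropriate: feature importances are available, as with the tree-based models above.

A complete answer should:
1. State the regression models.
2. Note the key differences between them.
3. Explain how to interpret the key features (coefficients).

Non-parametric: the model cannot be written as an equation such as y = mx + c.

Ensemble models can be used to find feature importances.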
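
To make the distinction between coefficients and feature importances concrete, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the chosen effect sizes are hypothetical; the lab's dataframe is not used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
# Hypothetical target: column 0 matters most, column 2 not at all.
y_demo = 5 * X_demo[:, 0] + 1 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

lr_demo = LinearRegression().fit(X_demo, y_demo)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Coefficients are signed effects per unit of the feature;
# importances are non-negative shares that sum to 1.
print('coefficients:       ', lr_demo.coef_.round(2))
print('feature importances:', rf_demo.feature_importances_.round(2))
```

Both views should rank column 0 first here, but only the coefficients carry a direction and a unit.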

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [79]:
X = df.drop(columns=['inc', 'incsq', 'e401k', 'p401k', 'pira'])
y = df['inc']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=.2,
                                                    random_state=42)

In [91]:
# 1. Linreg

lr = LinearRegression()
ss = StandardScaler()

pipe_lr = Pipeline([
    ('ss', ss),
    ('lr', lr)
])

pipe_lr_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False]
}

gs_lr = GridSearchCV(
    estimator=pipe_lr,
    param_grid=pipe_lr_params,
    cv=5
)

gs_lr.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('lr', LinearRegression())]),
             param_grid={'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [97]:
# 2. KNN Reg

ss = StandardScaler()
knn = KNeighborsRegressor()

pipe_knn = Pipeline([
    ('ss', ss),
    ('knn', knn)
])

pipe_knn_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
    'knn__n_neighbors': [5, 7, 9, 11],
#     'knn__n_jobs': [-1],
    'knn__metric': ['euclidean', 'minkowski', 'manhattan']
}

gs_knn = GridSearchCV(
    estimator=pipe_knn,
    param_grid=pipe_knn_params,
    cv=5
)

gs_knn.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('knn', KNeighborsRegressor())]),
             param_grid={'knn__metric': ['euclidean', 'minkowski', 'manhattan'],
                         'knn__n_neighbors': [5, 7, 9, 11],
                         'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [90]:
pipe_knn.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler()), ('knn', KNeighborsClassifier())],
 'verbose': False,
 'ss': StandardScaler(),
 'knn': KNeighborsClassifier(),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'knn__algorithm': 'auto',
 'knn__leaf_size': 30,
 'knn__metric': 'minkowski',
 'knn__metric_params': None,
 'knn__n_jobs': None,
 'knn__n_neighbors': 5,
 'knn__p': 2,
 'knn__weights': 'uniform'}

In [104]:
# 3. DTree Reg
ss = StandardScaler()
tree = DecisionTreeRegressor()

pipe_tree = Pipeline([
    ('ss', ss),
    ('tree', tree)
])

pipe_tree_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
    'tree__max_depth': [None, 1, 3, 5],
    'tree__max_features': [None, 1, 3, 5],
    'tree__random_state': [42],
}

gs_tree = GridSearchCV(
    estimator=pipe_tree,
    param_grid=pipe_tree_params,
    cv=5
)

gs_tree.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('tree', DecisionTreeRegressor())]),
             param_grid={'ss__with_mean': [True, False],
                         'ss__with_std': [True, False],
                         'tree__max_depth': [None, 1, 3, 5],
                         'tree__max_features': [None, 1, 3, 5],
                         'tree__random_state': [42]})

In [100]:
pipe_tree.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler()), ('tree', DecisionTreeClassifier())],
 'verbose': False,
 'ss': StandardScaler(),
 'tree': DecisionTreeClassifier(),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'tree__ccp_alpha': 0.0,
 'tree__class_weight': None,
 'tree__criterion': 'gini',
 'tree__max_depth': None,
 'tree__max_features': None,
 'tree__max_leaf_nodes': None,
 'tree__min_impurity_decrease': 0.0,
 'tree__min_impurity_split': None,
 'tree__min_samples_leaf': 1,
 'tree__min_samples_split': 2,
 'tree__min_weight_fraction_leaf': 0.0,
 'tree__random_state': None,
 'tree__splitter': 'best'}

In [129]:
# 4. Bagging Reg
ss = StandardScaler()
bag = BaggingRegressor()

pipe_bag = Pipeline([
    ('ss', ss),
    ('bag', bag)
])

pipe_bag_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
    'bag__n_estimators': [3, 5, 10],
    'bag__max_features': [1, 3, 5],
    'bag__max_samples': [0.5, 0.75, 1.0],
    'bag__random_state': [42],
    'bag__n_jobs': [-1]
}

gs_bag = GridSearchCV(
    estimator=pipe_bag,
    param_grid=pipe_bag_params,
    cv=5
)

gs_bag.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('bag', BaggingRegressor())]),
             param_grid={'bag__max_features': [1, 3, 5],
                         'bag__max_samples': [0.5, 0.75, 1.0],
                         'bag__n_estimators': [3, 5, 10], 'bag__n_jobs': [-1],
                         'bag__random_state': [42],
                         'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [110]:
pipe_bag.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler()), ('bag', BaggingRegressor())],
 'verbose': False,
 'ss': StandardScaler(),
 'bag': BaggingRegressor(),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'bag__base_estimator': None,
 'bag__bootstrap': True,
 'bag__bootstrap_features': False,
 'bag__max_features': 1.0,
 'bag__max_samples': 1.0,
 'bag__n_estimators': 10,
 'bag__n_jobs': None,
 'bag__oob_score': False,
 'bag__random_state': None,
 'bag__verbose': 0,
 'bag__warm_start': False}

In [159]:
# 5. Random Forest Reg
ss = StandardScaler()
rf = RandomForestRegressor()

pipe_rf = Pipeline([
    ('ss', ss),
    ('rf', rf)
])

pipe_rf_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
    'rf__max_depth': [None, 2, 4, 6],
    'rf__n_jobs': [2, 4],
    'rf__n_estimators': [50, 75, 100],
    'rf__max_features': [None, 4, 6],
}

gs_pipe = GridSearchCV(
    estimator=pipe_rf,
    param_grid=pipe_rf_params,
    cv=5
)

gs_pipe.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('rf', RandomForestRegressor())]),
             param_grid={'rf__max_depth': [None, 2, 4, 6],
                         'rf__max_features': [None, 4, 6],
                         'rf__n_estimators': [50, 75, 100],
                         'rf__n_jobs': [2, 4], 'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [134]:
pipe_rf.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler()), ('rf', RandomForestRegressor())],
 'verbose': False,
 'ss': StandardScaler(),
 'rf': RandomForestRegressor(),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'rf__bootstrap': True,
 'rf__ccp_alpha': 0.0,
 'rf__criterion': 'mse',
 'rf__max_depth': None,
 'rf__max_features': 'auto',
 'rf__max_leaf_nodes': None,
 'rf__max_samples': None,
 'rf__min_impurity_decrease': 0.0,
 'rf__min_impurity_split': None,
 'rf__min_samples_leaf': 1,
 'rf__min_samples_split': 2,
 'rf__min_weight_fraction_leaf': 0.0,
 'rf__n_estimators': 100,
 'rf__n_jobs': None,
 'rf__oob_score': False,
 'rf__random_state': None,
 'rf__verbose': 0,
 'rf__warm_start': False}

In [163]:
# 6. ADA Boost Reg
ss = StandardScaler()
ada = AdaBoostRegressor()

pipe_ada = Pipeline([
    ('ss', ss),
    ('ada', ada)
])

pipe_ada_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
    'ada__n_estimators': [50, 100]
}

gs_ada = GridSearchCV(
    estimator=pipe_ada,
    param_grid=pipe_ada_params,
    cv=5
)

gs_ada.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('ada', AdaBoostRegressor())]),
             param_grid={'ada__n_estimators': [50, 100],
                         'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [132]:
pipe_ada.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler()), ('ada', AdaBoostRegressor())],
 'verbose': False,
 'ss': StandardScaler(),
 'ada': AdaBoostRegressor(),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'ada__base_estimator': None,
 'ada__learning_rate': 1.0,
 'ada__loss': 'linear',
 'ada__n_estimators': 50,
 'ada__random_state': None}
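
Question 8 also asks for a support vector regressor, which is not fit above. The following is a minimal sketch of how one could be set up in the same pipeline-plus-grid-search style. It runs on synthetic data so that it is self-contained; the names `pipe_svr` and `gs_svr` and the grid values are assumptions, not the author's choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# SVR is sensitive to feature scale, so the scaler stays on in this pipeline.
pipe_svr = Pipeline([
    ('ss', StandardScaler()),
    ('svr', SVR())
])

pipe_svr_params = {
    'svr__C': [0.1, 1, 10],
    'svr__kernel': ['linear', 'rbf']
}

gs_svr = GridSearchCV(estimator=pipe_svr, param_grid=pipe_svr_params, cv=5)
gs_svr.fit(Xd_train, yd_train)

print(gs_svr.best_params_)
print(round(gs_svr.score(Xd_test, yd_test), 3))
```

On the real data, `X_train` and `y_train` from the split above would replace the synthetic arrays.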

##### 9. What is bootstrapping?
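
As an illustration of the term, the sketch below draws bootstrap samples with NumPy: each sample has the same size as the original data and is drawn with replacement. The ten-element `data` array is a hypothetical stand-in for a dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # hypothetical "dataset" of 10 observations

# One bootstrap sample: same size as the data, drawn WITH replacement,
# so some observations repeat and others are left out.
boot = rng.choice(data, size=len(data), replace=True)
print(boot)

# Repeating the draw many times gives a distribution for a statistic.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
])
print(boot_means.mean(), boot_means.std())
```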

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.
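
One way to see the mechanical difference is to fit both ensembles side by side. This is a sketch on synthetic data, with hypothetical names and settings: the bagged trees may consider every feature at every split, while the random forest is restricted to a random subset of features per split.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 6))
y_demo = 4 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(size=400)

# Bagged trees: bootstrap the rows; each tree sees all features at each split.
bag_demo = BaggingRegressor(n_estimators=50, random_state=42)
# Random forest: bootstrap the rows AND consider only 2 random features per split.
rf_demo = RandomForestRegressor(n_estimators=50, max_features=2, random_state=42)

cv_means = {}
for name, model in [('bagging', bag_demo), ('random forest', rf_demo)]:
    cv_means[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(name, round(cv_means[name], 3))
```

Which ensemble scores higher depends on the data, so treat the printed numbers as an illustration of the setup rather than a general result.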

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [92]:
print(f'LinReg R^2 Train:   {round(gs_lr.score(X_train, y_train),3)}')
print(f'LinReg R^2 Test:    {round(gs_lr.score(X_test, y_test),3)}')
print(f'LinReg RMSE Train:  {round((metrics.mean_squared_error(y_train, gs_lr.predict(X_train))**0.5),3)}')
print(f'LinReg RMSE Test:   {round((metrics.mean_squared_error(y_test, gs_lr.predict(X_test))**0.5),3)}')
print(f'LinReg Best Params: {gs_lr.best_params_}')

LinReg R^2 Train:   0.293
LinReg R^2 Test:    0.275
LinReg RMSE Train:  20.164
LinReg RMSE Test:   20.897
LinReg Best Params: {'ss__with_mean': False, 'ss__with_std': False}


In [98]:
print(f'KNN R^2 Train:   {round(gs_knn.score(X_train, y_train),3)}')
print(f'KNN R^2 Test:    {round(gs_knn.score(X_test, y_test),3)}')
print(f'KNN RMSE Train:  {round((metrics.mean_squared_error(y_train, gs_knn.predict(X_train))**0.5),3)}')
print(f'KNN RMSE Test:   {round((metrics.mean_squared_error(y_test, gs_knn.predict(X_test))**0.5),3)}')
print(f'KNN Best Params: {gs_knn.best_params_}')

KNN R^2 Train:   0.447
KNN R^2 Test:    0.367
KNN RMSE Train:  17.833
KNN RMSE Test:   19.531
KNN Best Params: {'knn__metric': 'euclidean', 'knn__n_neighbors': 11, 'ss__with_mean': True, 'ss__with_std': True}


In [105]:
print(f'Tree R^2 Train:   {round(gs_tree.score(X_train, y_train),3)}')
print(f'Tree R^2 Test:    {round(gs_tree.score(X_test, y_test),3)}')
print(f'Tree RMSE Train:  {round((metrics.mean_squared_error(y_train, gs_tree.predict(X_train))**0.5),3)}')
print(f'Tree RMSE Test:   {round((metrics.mean_squared_error(y_test, gs_tree.predict(X_test))**0.5),3)}')
print(f'Tree Best Params: {gs_tree.best_params_}')

Tree R^2 Train:   0.419
Tree R^2 Test:    0.382
Tree RMSE Train:  18.274
Tree RMSE Test:   19.298
Tree Best Params: {'ss__with_mean': True, 'ss__with_std': True, 'tree__max_depth': 5, 'tree__max_features': None, 'tree__random_state': 42}


In [130]:
print(f'Bag R^2 Train:   {round(gs_bag.score(X_train, y_train),3)}')
print(f'Bag R^2 Test:    {round(gs_bag.score(X_test, y_test),3)}')
print(f'Bag RMSE Train:  {round((metrics.mean_squared_error(y_train, gs_bag.predict(X_train))**0.5),3)}')
print(f'Bag RMSE Test:   {round((metrics.mean_squared_error(y_test, gs_bag.predict(X_test))**0.5),3)}')
print(f'Bag Best Params: {gs_bag.best_params_}')

Bag R^2 Train:   0.61
Bag R^2 Test:    0.303
Bag RMSE Train:  14.974
Bag RMSE Test:   20.495
Bag Best Params: {'bag__max_features': 3, 'bag__max_samples': 0.5, 'bag__n_estimators': 10, 'bag__n_jobs': -1, 'bag__random_state': 42, 'ss__with_mean': False, 'ss__with_std': False}


In [161]:
# Note: this grid search was named gs_pipe by mistake; gs_rf would have been the consistent name.
# It was not rerun under the new name because the search takes a long time to fit.

print(f'rf R^2 Train:   {round(gs_pipe.score(X_train, y_train),3)}')
print(f'rf R^2 Test:    {round(gs_pipe.score(X_test, y_test),3)}')
print(f'rf RMSE Train:  {round((metrics.mean_squared_error(y_train, gs_pipe.predict(X_train))**0.5),3)}')
print(f'rf RMSE Test:   {round((metrics.mean_squared_error(y_test, gs_pipe.predict(X_test))**0.5),3)}')
print(f'rf Best Params: {gs_pipe.best_params_}')

rf R^2 Train:   0.451
rf R^2 Test:    0.405
rf RMSE Train:  17.757
rf RMSE Test:   18.926
rf Best Params: {'rf__max_depth': 6, 'rf__max_features': 4, 'rf__n_estimators': 50, 'rf__n_jobs': 2, 'ss__with_mean': True, 'ss__with_std': False}


In [164]:
print(f'ada R^2 Train:   {round(gs_ada.score(X_train, y_train),3)}')
print(f'ada R^2 Test:    {round(gs_ada.score(X_test, y_test),3)}')
print(f'ada RMSE Train:  {round((metrics.mean_squared_error(y_train, gs_ada.predict(X_train))**0.5),3)}')
print(f'ada RMSE Test:   {round((metrics.mean_squared_error(y_test, gs_ada.predict(X_test))**0.5),3)}')
print(f'ada Best Params: {gs_ada.best_params_}')

ada R^2 Train:   0.204
ada R^2 Test:    0.167
ada RMSE Train:  21.385
ada RMSE Test:   22.398
ada Best Params: {'ada__n_estimators': 50, 'ss__with_mean': False, 'ss__with_std': False}
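
To compare the six models at a glance, the RMSE values printed above can be collected into one table. The sketch below copies those numbers by hand and adds a train/test gap column; the `rmse` frame and the `gap` column name are assumptions introduced for illustration.

```python
import pandas as pd

# RMSE values copied from the printed outputs above.
rmse = pd.DataFrame(
    {
        'train_rmse': [20.164, 17.833, 18.274, 14.974, 17.757, 21.385],
        'test_rmse':  [20.897, 19.531, 19.298, 20.495, 18.926, 22.398],
    },
    index=['LinReg', 'KNN', 'Tree', 'Bag', 'rf', 'ada'],
)

# A large positive gap (test much worse than train) suggests overfitting.
rmse['gap'] = rmse['test_rmse'] - rmse['train_rmse']
print(rmse.sort_values('gap', ascending=False))
```

Sorting by `gap` puts the model with the largest train/test discrepancy first, which is a useful starting point for question 14.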


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [175]:
X = df.drop(columns=['e401k', 'p401k'])
y = df['e401k']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42,
                                                    stratify=y
                                                   )

In [194]:
# 1. Logistic Regression Model

log = LogisticRegression()
ss = StandardScaler()

pipe_log_model = Pipeline([
    ('ss', ss),
    ('log', log)
])

pipe_log_model_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
#     'log__l1_ratio': [0, 0.25, 0.5, 0.75, 1],
    'log__max_iter': [1000, 2000, 3000],
#     'log__solver': ['sag']
}

gs_log_model = GridSearchCV(
    estimator=pipe_log_model,
    param_grid=pipe_log_model_params,
    cv=5
)

gs_log_model.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('log', LogisticRegression())]),
             param_grid={'log__max_iter': [1000, 2000, 3000],
                         'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [186]:
pipe_log_model.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler()), ('log', LogisticRegression())],
 'verbose': False,
 'ss': StandardScaler(),
 'log': LogisticRegression(),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'log__C': 1.0,
 'log__class_weight': None,
 'log__dual': False,
 'log__fit_intercept': True,
 'log__intercept_scaling': 1,
 'log__l1_ratio': None,
 'log__max_iter': 100,
 'log__multi_class': 'auto',
 'log__n_jobs': None,
 'log__penalty': 'l2',
 'log__random_state': None,
 'log__solver': 'lbfgs',
 'log__tol': 0.0001,
 'log__verbose': 0,
 'log__warm_start': False}

In [195]:
# 2. KNN Model

ss = StandardScaler()
knn = KNeighborsClassifier()

pipe_knn_model = Pipeline([
    ('ss', ss),
    ('knn', knn)
])

pipe_knn_model_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
}

# Use the classifier pipeline and its own grid; pipe_knn and pipe_knn_params
# belong to the regression problem above.
gs_knn_model = GridSearchCV(
    estimator=pipe_knn_model,
    param_grid=pipe_knn_model_params,
    cv=5
)

gs_knn_model.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [196]:
pipe_knn_model.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler()), ('knn', KNeighborsClassifier())],
 'verbose': False,
 'ss': StandardScaler(),
 'knn': KNeighborsClassifier(),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'knn__algorithm': 'auto',
 'knn__leaf_size': 30,
 'knn__metric': 'minkowski',
 'knn__metric_params': None,
 'knn__n_jobs': None,
 'knn__n_neighbors': 5,
 'knn__p': 2,
 'knn__weights': 'uniform'}

In [197]:
# 3. Decision Tree Model

ss = StandardScaler()
tree = DecisionTreeClassifier()

pipe_tree_model = Pipeline([
    ('ss', ss),
    ('tree', tree)
])

pipe_tree_model_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
}

# Use the classifier pipeline and its own grid; pipe_tree and pipe_tree_params
# belong to the regression problem above.
gs_tree_model = GridSearchCV(
    estimator=pipe_tree_model,
    param_grid=pipe_tree_model_params,
    cv=5
)

gs_tree_model.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('tree', DecisionTreeClassifier())]),
             param_grid={'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [198]:
pipe_tree_model.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler()), ('tree', DecisionTreeClassifier())],
 'verbose': False,
 'ss': StandardScaler(),
 'tree': DecisionTreeClassifier(),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'tree__ccp_alpha': 0.0,
 'tree__class_weight': None,
 'tree__criterion': 'gini',
 'tree__max_depth': None,
 'tree__max_features': None,
 'tree__max_leaf_nodes': None,
 'tree__min_impurity_decrease': 0.0,
 'tree__min_impurity_split': None,
 'tree__min_samples_leaf': 1,
 'tree__min_samples_split': 2,
 'tree__min_weight_fraction_leaf': 0.0,
 'tree__random_state': None,
 'tree__splitter': 'best'}

In [201]:
# 4. Bagging Model

ss = StandardScaler()
bag = BaggingClassifier()

pipe_bag_model = Pipeline([
    ('ss', ss),
    ('bag', bag)
])

pipe_bag_model_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
}

# Use the classifier pipeline and its own grid; pipe_bag and pipe_bag_params
# belong to the regression problem above.
gs_bag_model = GridSearchCV(
    estimator=pipe_bag_model,
    param_grid=pipe_bag_model_params,
    cv=5
)

gs_bag_model.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('bag', BaggingClassifier())]),
             param_grid={'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [202]:
pipe_bag_model.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler()), ('bag', BaggingClassifier())],
 'verbose': False,
 'ss': StandardScaler(),
 'bag': BaggingClassifier(),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'bag__base_estimator': None,
 'bag__bootstrap': True,
 'bag__bootstrap_features': False,
 'bag__max_features': 1.0,
 'bag__max_samples': 1.0,
 'bag__n_estimators': 10,
 'bag__n_jobs': None,
 'bag__oob_score': False,
 'bag__random_state': None,
 'bag__verbose': 0,
 'bag__warm_start': False}

In [203]:
# 5. Random Forest Model

ss = StandardScaler()
rf = RandomForestClassifier()

pipe_rf_model = Pipeline([
    ('ss', ss),
    ('rf', rf)
])

pipe_rf_model_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
}

# Use the classifier pipeline and its own grid; pipe_rf and pipe_rf_params
# belong to the regression problem above.
gs_rf_model = GridSearchCV(
    estimator=pipe_rf_model,
    param_grid=pipe_rf_model_params,
    cv=5
)

gs_rf_model.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('rf', RandomForestClassifier())]),
             param_grid={'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [204]:
gs_rf_model.get_params()

{'cv': 5,
 'error_score': nan,
 'estimator__memory': None,
 'estimator__steps': [('ss', StandardScaler()),
  ('rf', RandomForestClassifier())],
 'estimator__verbose': False,
 'estimator__ss': StandardScaler(),
 'estimator__rf': RandomForestClassifier(),
 'estimator__ss__copy': True,
 'estimator__ss__with_mean': True,
 'estimator__ss__with_std': True,
 'estimator__rf__bootstrap': True,
 'estimator__rf__ccp_alpha': 0.0,
 'estimator__rf__class_weight': None,
 'estimator__rf__criterion': 'gini',
 'estimator__rf__max_depth': None,
 'estimator__rf__max_features': 'auto',
 'estimator__rf__max_leaf_nodes': None,
 'estimator__rf__max_samples': None,
 'estimator__rf__min_impurity_decrease': 0.0,
 'estimator__rf__min_impurity_split': None,
 'estimator__rf__min_samples_leaf': 1,
 'estimator__rf__min_samples_split': 2,
 'estimator__rf__min_weight_fraction_leaf': 0.0,
 'estimator__rf__n_estimators': 100,
 'estimator__rf__n_jobs': None,
 'estimator__rf__oob_score': False,
 'estimator__rf__random_state': None,
 'estimator__rf__verbose': 0,
 

In [206]:
# 6. ADA Boost Model

ss = StandardScaler()
ada = AdaBoostClassifier()

pipe_ada_model = Pipeline([
    ('ss', ss),
    ('ada', ada)
])

pipe_ada_model_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
    'ada__n_estimators': [50, 100]
}

gs_ada_model = GridSearchCV(
    estimator=pipe_ada_model,
    param_grid=pipe_ada_model_params,
    cv=5
)

gs_ada_model.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('ada', AdaBoostClassifier())]),
             param_grid={'ada__n_estimators': [50, 100],
                         'ss__with_mean': [True, False],
                         'ss__with_std': [True, False]})

In [207]:
gs_ada_model.get_params()

{'cv': 5,
 'error_score': nan,
 'estimator__memory': None,
 'estimator__steps': [('ss', StandardScaler()), ('ada', AdaBoostClassifier())],
 'estimator__verbose': False,
 'estimator__ss': StandardScaler(),
 'estimator__ada': AdaBoostClassifier(),
 'estimator__ss__copy': True,
 'estimator__ss__with_mean': True,
 'estimator__ss__with_std': True,
 'estimator__ada__algorithm': 'SAMME.R',
 'estimator__ada__base_estimator': None,
 'estimator__ada__learning_rate': 1.0,
 'estimator__ada__n_estimators': 50,
 'estimator__ada__random_state': None,
 'estimator': Pipeline(steps=[('ss', StandardScaler()), ('ada', AdaBoostClassifier())]),
 'n_jobs': None,
 'param_grid': {'ss__with_mean': [True, False],
  'ss__with_std': [True, False],
  'ada__n_estimators': [50, 100]},
 'pre_dispatch': '2*n_jobs',
 'refit': True,
 'return_train_score': False,
 'scoring': None,
 'verbose': 0}

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?
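
The four outcomes can be read directly off a confusion matrix. The sketch below uses a handful of hypothetical labels (1 = eligible for a 401(k), 0 = not eligible), not predictions from the models above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f'False positives (predicted eligible, actually not): {fp}')
print(f'False negatives (predicted not eligible, actually eligible): {fn}')
```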

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?
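
For reference, precision penalizes false positives and recall penalizes false negatives. The sketch below computes both on the same kind of hypothetical labels (1 = eligible); which one to optimize depends on the answer to question 21.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(f'Precision: {precision:.3f}')
print(f'Recall:    {recall:.3f}')
```

To tune for one of these, `GridSearchCV` accepts `scoring='precision'` or `scoring='recall'` in place of its default score.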

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.
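
A minimal sketch of how this evaluation could look is given below. It runs on synthetic data from `make_classification` with two stand-in models, so the printed scores say nothing about the lab's dataset; the `models` and `f1_results` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

models = {
    'logreg': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(random_state=42),
}

# Store (train F1, test F1) for each model so the two can be compared.
f1_results = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    f1_results[name] = (
        f1_score(yd_train, model.predict(Xd_train)),
        f1_score(yd_test, model.predict(Xd_test)),
    )
    print(f'{name} F1 Train: {f1_results[name][0]:.3f}  Test: {f1_results[name][1]:.3f}')
```

With the fitted grid searches above, the same loop would work by passing each `gs_*_model` object and the real `X_train`, `X_test`, `y_train`, and `y_test`.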

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.