## Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, so the account hopefully holds a lot more money by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### Step 2: Gather the data.

##### 1. Read in the data from the repository 

In [1]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('./401ksubs.csv')

In [5]:
df.head(1)

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

**Answer:** 

Whether or not they have children, the person's age, and maybe whether they have a college degree.

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

**Answer:** 

This would be unethical because it could lead to discrimination, and it could also carry legal consequences.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) from the dataset would we reasonably <u>not</u> use? Why?

In [6]:
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


**Answer:** 

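I would not use `incsq` (income squared), because it is computed directly from the target `inc`; giving it to the model leaks the answer rather than providing new information. I would also be cautious with `agesq` (age squared), since it is derived entirely from `age` and adds no new information on its own.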
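A quick way to see how `incsq` relates to the target is to square `inc` for the five rows printed above and compare. This is a minimal sketch that hard-codes those displayed values rather than reading the CSV; the tolerance is an assumption to allow for rounding in the display.

```python
# (inc, incsq) pairs copied from the df.head() output above
rows = [
    (13.17, 173.4489),
    (61.23, 3749.113),
    (12.858, 165.3282),
    (98.88, 9777.254),
    (22.614, 511.393),
]

# Assumed tolerance: the displayed values are rounded to about 7 significant digits
TOL = 1e-3

# True where squaring inc reproduces the stored incsq value
matches = [abs(inc ** 2 - incsq) < TOL for inc, incsq in rows]
print(matches)
```

If every comparison holds, `incsq` is a deterministic function of `inc`, so a model that receives `incsq` can recover income almost exactly.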

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

**Answer:** 

`incsq` and `agesq` were created through feature engineering. SMEs may have done this so that income and age can have a non-linear (curved) effect, which could help distinguish who qualifies from who does not.

##### 6. Looking at the data dictionary, two variable descriptions appear to contain an error. What is this error, and what do you think the correct value would be?

**Answer:** 

The first is `inc` (income): the description says the income is squared, which doesn't seem right. Also, if we are reporting income, its units should match `nettfa` (net total financial assets, in $1000s): since that is reported in thousands, income should be described the same way.

## Step 4: Model the data. (Part 1: Regression Problem)

Note:
- Problem: What features best predict one's income?
- When predicting `inc`, you should ***pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.***

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Friday of Week 7). 

***Hint: There should be at least 6 different models on this list***

**Answer:**
    
    - Linear Regression / used for predicting one's income because its predictions and coefficients can be interpreted very easily
    - Ridge Regression / used for its predictions and coefficients. This is not only simple to understand, but the coefficients are regularized, which can improve the performance of the model
    - Lasso Regression / same as Ridge, but it regularizes the coefficients more harshly (it can shrink them all the way to zero), which could also improve the predictive performance of the model
    - ElasticNet Regression / this combines the Ridge and Lasso penalties, which can again strengthen the performance of the model
    - Polynomial Regression / models a non-linear relationship between the independent and dependent variables
    - Logistic Regression / despite its name, this predicts the category of an outcome rather than a continuous value, so it is not appropriate for predicting income

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline or master function to try each modeling technique, but you are not required to do so!
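One way to set up such a pipeline or master function is sketched here. It runs on synthetic stand-in data from `make_regression` (an assumption; the lab's own `X` and `y` would be swapped in), and `fit_and_score` is a hypothetical helper name, not part of scikit-learn. Putting `StandardScaler` inside the pipeline means the scaler is refit on each cross-validation fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: replace with the lab's X and y)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)


def fit_and_score(model, X_tr, X_te, y_tr, y_te):
    """Hypothetical helper: scale, fit, and return CV / train / test R^2."""
    pipe = make_pipeline(StandardScaler(), model)
    cv_r2 = cross_val_score(pipe, X_tr, y_tr, cv=5).mean()
    pipe.fit(X_tr, y_tr)
    return {
        "cv_r2": cv_r2,
        "train_r2": pipe.score(X_tr, y_tr),
        "test_r2": pipe.score(X_te, y_te),
    }


# The same split is reused for every model so the scores are comparable
models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "tree": DecisionTreeRegressor(random_state=42),
}
results = {
    name: fit_and_score(model, X_tr, X_te, y_tr, y_te)
    for name, model in models.items()
}
for name, scores in results.items():
    print(name, {k: round(v, 4) for k, v in scores.items()})
```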

In [67]:
#Import your libraries for models and preprocessing/model selection

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import BaggingRegressor, BaggingClassifier, RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_squared_error, f1_score
from sklearn import svm
from math import sqrt

***Key model preparation questions:***
- A. What variables am I including or excluding as predictors of income?
- B. How large should we make our testing dataset given the number of predictors/total rows? (There is no one answer here; we are more interested in how you explain your reasoning.)
- C. What other preprocessing steps should you take before running your models?

**Answer**:



***Set up your training and testing data and do any preprocessing steps required for modeling***

In [8]:
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [11]:
# Not using e401k, p401k, or pira.
# Caution: incsq is inc squared, so including it gives the models the target itself.
features = ['marr', 'male','age','incsq', 'agesq', 'fsize', 'nettfa']

X = df[features]
y = df['inc']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [13]:
ss = StandardScaler()

In [14]:
ss.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [15]:
X_train_sc = ss.transform(X_train)

In [16]:
X_test_sc = ss.transform(X_test)

### Linear Regression

In [17]:
lr = LinearRegression()

In [18]:
lr.fit(X_train_sc, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [19]:
cross_val_score(lr, X_train_sc, y_train).mean()

0.8934438475524992

In [20]:
lr.score(X_train_sc, y_train)

0.8948254673897013

In [21]:
lr.score(X_test_sc, y_test)

0.9055024120733456

***

### K Nearest Neighbors

In [22]:
knn = KNeighborsRegressor()

In [23]:
knn.fit(X_train_sc, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [24]:
cross_val_score(knn, X_train_sc, y_train).mean()

0.9623823509317297

In [25]:
knn.score(X_train_sc, y_train)

0.9795704183969427

In [26]:
knn.score(X_test_sc, y_test)

0.9728418554699854

***

### Decision Tree

In [27]:
dt = DecisionTreeRegressor()

In [28]:
dt.fit(X_train_sc, y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [29]:
cross_val_score(dt, X_train_sc, y_train).mean()

0.9997841756511345

In [30]:
dt.score(X_train_sc, y_train)

1.0

In [31]:
dt.score(X_test_sc, y_test)

0.9999718806020502

### Bagged Decision Tree

In [32]:
bag = BaggingRegressor()

In [33]:
bag.fit(X_train_sc, y_train)

BaggingRegressor(base_estimator=None, bootstrap=True, bootstrap_features=False,
                 max_features=1.0, max_samples=1.0, n_estimators=10,
                 n_jobs=None, oob_score=False, random_state=None, verbose=0,
                 warm_start=False)

In [34]:
cross_val_score(bag, X_train_sc, y_train).mean()

0.999951509433892

In [35]:
bag.score(X_train_sc, y_train)

0.9999754504101072

In [36]:
bag.score(X_test_sc, y_test)

0.9999763328212158

### Random Forests

In [37]:
rf = RandomForestRegressor()

In [38]:
rf.fit(X_train_sc, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [39]:
cross_val_score(rf, X_train_sc, y_train).mean()

0.9999265068693463

In [40]:
rf.score(X_train_sc, y_train)

0.9999892883691393

***

### Support Vector Machine

In [41]:
svr = svm.SVR()

In [42]:
svr.fit(X_train_sc, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [43]:
cross_val_score(svr, X_train_sc, y_train).mean()

0.8679106944982079

In [44]:
svr.score(X_train_sc, y_train)

0.8838450999760973

In [45]:
svr.score(X_test_sc, y_test)

0.8796789233869663

***

### AdaBoost Regressor

In [49]:
from sklearn.ensemble import AdaBoostRegressor

In [50]:
ada = AdaBoostRegressor()

In [51]:
ada.fit(X_train_sc, y_train)

AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                  n_estimators=50, random_state=None)

In [52]:
cross_val_score(ada, X_train_sc, y_train).mean()

0.9904071726108867

In [53]:
ada.score(X_train_sc, y_train)

0.9904607816324499

In [54]:
ada.score(X_test_sc, y_test)

0.9908924457332933

***

### Gradient Boosting Regressor

In [55]:
from sklearn.ensemble import GradientBoostingRegressor

#from sklearn: GB builds an additive model in a forward stage-wise fashion; 
#it allows for the optimization of arbitrary differentiable loss functions. 
#In each stage a regression tree is fit on the negative gradient of the given loss function.

In [58]:
gra = GradientBoostingRegressor(random_state=0)

In [59]:
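# NOTE: this fits on the unscaled X_train, but the scores below use the scaled
# X_train_sc / X_test_sc. That mismatch is the likely cause of the negative R^2
# values; gra.fit(X_train_sc, y_train) would keep the inputs consistent.
gra.fit(X_train, y_train)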

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=0, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [60]:
cross_val_score(gra, X_train_sc, y_train).mean()

0.9998784041847119

In [61]:
gra.score(X_train_sc, y_train)

-1.4721741227825524

In [62]:
gra.score(X_test_sc, y_test)

-1.4142480502387045
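The negative scores above are consistent with a model that was fit on unscaled features and then scored on scaled ones, which is what the cells above do. The sketch below reproduces that effect on synthetic data (an assumption, not the lab's dataset): the same estimator is scored once with mismatched inputs and once with consistent inputs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: one feature on a scale of roughly 20 to 80
rng = np.random.default_rng(0)
X_demo = rng.uniform(20, 80, size=(400, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr_sc, X_te_sc = scaler.transform(X_tr), scaler.transform(X_te)

# Mismatch: fit on unscaled features, score on scaled features
mismatched = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
r2_mismatched = mismatched.score(X_te_sc, y_te)

# Consistent: fit and score on scaled features
consistent = GradientBoostingRegressor(random_state=0).fit(X_tr_sc, y_tr)
r2_consistent = consistent.score(X_te_sc, y_te)

print("mismatched R^2:", r2_mismatched)
print("consistent R^2:", r2_consistent)
```

Because the trees learned split thresholds on the original scale, every scaled input falls on the same side of those thresholds and receives nearly the same prediction, which is worse than predicting the mean.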

In [64]:
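# Running just to learn more. The higher the value, the more important the feature.
# The importance of a feature is computed as the normalized total reduction of the
# split criterion brought by that feature (impurity-based importance).
# Values follow the order of `features`, so the fourth entry is incsq.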

gra.feature_importances_

array([4.35287918e-09, 0.00000000e+00, 2.40246510e-07, 9.99997241e-01,
       4.90292242e-07, 5.30189438e-08, 1.97076923e-06])

##### 9. What is bootstrapping?

**Answer:** 

Bootstrapping is random sampling with replacement from the original data, typically drawing a sample of the same size as the original.
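As a small illustration, the sketch below draws one bootstrap sample from the five `inc` values shown in `df.head()`. The seed and the variable names are assumptions chosen for the example.

```python
import random

random.seed(42)  # assumed seed, only for repeatability

# inc values copied from the df.head() output earlier in the notebook
data = [13.17, 61.23, 12.858, 98.88, 22.614]

# Sample with replacement, same size as the original
boot = random.choices(data, k=len(data))

# Values never drawn are "out of bag" for this sample
out_of_bag = [x for x in data if x not in boot]

print("bootstrap sample:", boot)
print("out of bag:", out_of_bag)
```

Because sampling is with replacement, some values can appear more than once while others are left out entirely.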

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

**Answer:** 

A single decision tree is fit once to the full training set. A set of bagged decision trees instead repeatedly draws a random sample of the rows with replacement (a bootstrap sample), fits a decision tree to each sample, and averages the trees' predictions (or takes a majority vote for classification) to form its final prediction. Bagging is an ensemble method: averaging many high-variance trees reduces the variance of the model.

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

**Answer:** 

Both methods fit each tree to a bootstrap sample of the rows. The difference is that a random forest also considers only a random subset of the features at each split, so its trees are less correlated with one another than bagged trees are.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

**Answer:** 

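A random forest might be superior to a set of bagged decision trees because restricting the features available at each split makes the trees less correlated, so averaging them removes more variance. This comes at the cost of slightly greater bias, but the trade usually improves the model overall.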
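In scikit-learn, the contrast between the two ensembles shows up mainly in `max_features`. The sketch below uses synthetic data from `make_regression` (an assumption), and `max_features=2` is an arbitrary illustrative choice rather than a recommended setting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# Synthetic stand-in data with 6 features
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=5.0, random_state=0
)

# Bagging: each tree sees a bootstrap sample of rows and may split on any feature
bag_demo = BaggingRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Random forest: bootstrap rows AND only 2 randomly chosen candidate features per split
rf_demo = RandomForestRegressor(
    n_estimators=25, max_features=2, random_state=0
).fit(X_demo, y_demo)

print("bagged trees:", len(bag_demo.estimators_))
print("forest trees:", len(rf_demo.estimators_))
print("per-split features in a bagged tree:", bag_demo.estimators_[0].max_features)
print("per-split features in the forest:", rf_demo.max_features)
```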

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.
Hint: This is another spot where using a single scoring function for each model may come in handy!

| Model             | Train RMSE | Test RMSE |
|-------------------|------------|-----------|
| Linear Regression | 7.7762     | 7.5061    |
| KNN               | 3.4272     | 4.0240    |
| Decision Tree     | 6.8056e-16 | 0.1295    |
| Bagged DT         | 0.1188     | 0.1188    |
| Random Forest     | 0.0785     | 0.0736    |
| SVM               | 8.1721     | 8.4698    |
| AdaBoost          | 2.3419     | 2.3303    |
| GradientBoost     | 37.7010    | 37.9398   |
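A single scoring function of the kind the hint mentions could look like the sketch below. `rmse`, `rmse_report`, and `MeanModel` are hypothetical names introduced for illustration; `MeanModel` is only a stand-in for any fitted estimator that exposes `.predict`.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def rmse_report(models, X_train, y_train, X_test, y_test):
    """Hypothetical helper: {name: (train_rmse, test_rmse)} for fitted models."""
    return {
        name: (
            rmse(y_train, model.predict(X_train)),
            rmse(y_test, model.predict(X_test)),
        )
        for name, model in models.items()
    }


class MeanModel:
    """Stand-in model that always predicts a fixed value."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


# Tiny illustrative data (not from the lab)
X_train_demo, y_train_demo = [[1], [2], [3]], [10.0, 20.0, 30.0]
X_test_demo, y_test_demo = [[4], [5]], [20.0, 20.0]

report = rmse_report(
    {"mean": MeanModel(20.0)},
    X_train_demo, y_train_demo, X_test_demo, y_test_demo,
)
print(report)
```

Passing the notebook's fitted models in a dictionary would fill in both columns of the table with one call instead of repeating the same four cells per model.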

### Linear regression

In [73]:
lr_predics_train = lr.predict(X_train_sc)

In [74]:
lr_rms_train = sqrt(mean_squared_error(y_train, lr_predics_train))
lr_rms_train
#reminder - RMSE is a measure of how spread out these residuals are

7.77621651619324

In [75]:
lr_predics_test = lr.predict(X_test_sc)

In [76]:
lr_rms_test = sqrt(mean_squared_error(y_test, lr_predics_test))
lr_rms_test

7.506111330932503

### K Nearest Neighbors

In [77]:
knn_predics_train = knn.predict(X_train_sc)

In [79]:
knn_rms_train = sqrt(mean_squared_error(y_train, knn_predics_train))
knn_rms_train

3.4272263258974696

In [81]:
knn_predics_test = knn.predict(X_test_sc)

In [82]:
knn_rms_test = sqrt(mean_squared_error(y_test, knn_predics_test))
knn_rms_test

4.02396956785657

### Decision Tree

In [83]:
dt_predics_train = dt.predict(X_train_sc)

In [89]:
dt_rmse_train = sqrt(mean_squared_error(y_train, dt_predics_train))
dt_rmse_train

6.80555106201231e-16

In [90]:
dt_predics_test = dt.predict(X_test_sc)

In [91]:
dt_rmse_test = sqrt(mean_squared_error(y_test, dt_predics_test))
dt_rmse_test

0.12948147826076004

### Bagged Decision Tree

In [92]:
bag_predics_train = bag.predict(X_train_sc)

In [96]:
bag_rmse_train = sqrt(mean_squared_error(y_train, bag_predics_train))
bag_rmse_train

0.11880512339431498

In [97]:
bag_predics_test = bag.predict(X_test_sc)

In [98]:
bag_rmse_test = sqrt(mean_squared_error(y_test, bag_predics_test))
bag_rmse_test

0.11878945388933172

### Random Forests

In [99]:
rf_predics_train = rf.predict(X_train_sc)

In [100]:
rf_rmse_train = sqrt(mean_squared_error(y_train, rf_predics_train))
rf_rmse_train

0.07847672086922934

In [101]:
rf_predics_test = rf.predict(X_test_sc)

In [102]:
rf_rmse_test = sqrt(mean_squared_error(y_test, rf_predics_test))
rf_rmse_test

0.07356755139489976

### Support Vector Machine

In [103]:
svr_predics_train = svr.predict(X_train_sc)

In [104]:
svr_rmse_train = sqrt(mean_squared_error(y_train, svr_predics_train))
svr_rmse_train

8.172065081121346

In [105]:
svr_predics_test = svr.predict(X_test_sc)

In [106]:
svr_rmse_test = sqrt(mean_squared_error(y_test, svr_predics_test))
svr_rmse_test

8.46984562245298

### AdaBoost

In [108]:
ada_predics_train = ada.predict(X_train_sc)

In [109]:
ada_rmse_train = sqrt(mean_squared_error(y_train, ada_predics_train))
ada_rmse_train

2.3419059190065097

In [110]:
ada_predics_test = ada.predict(X_test_sc)

In [111]:
ada_rms_test = sqrt(mean_squared_error(y_test, ada_predics_test))
ada_rms_test

2.330266135362324

### GradientBoost

In [112]:
gra_predics_train = gra.predict(X_train_sc)

In [113]:
gra_rmse_train = sqrt(mean_squared_error(y_train, gra_predics_train))
gra_rmse_train

37.70097458638795

In [114]:
gra_predics_test = gra.predict(X_test_sc)

In [115]:
gra_rms_test = sqrt(mean_squared_error(y_test, gra_predics_test))
gra_rms_test

37.93984782983282

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

**Answer:** 

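My Decision Tree model shows the clearest sign of overfitting: its training RMSE is essentially zero (about 6.8e-16) while its testing RMSE is 0.1295. KNN also has a modest gap (3.4272 on training vs. 4.0240 on testing). The other models have similar training and testing RMSE.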

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

**Answer:** 

If I had to pick one model, I would pick the Bagged Decision Tree model. Although it did not have the best RMSE on the testing data (the Random Forest's was lower), the gap between its training and testing RMSE is among the smallest of all the models (both are about 0.119). This allows me to be relatively confident that the model will work well on unseen data. The Linear Regression and SVM models also show little sign of overfitting, but their errors are far larger (testing RMSE of about 7.5 and 8.5). Therefore, I am more comfortable selecting the Bagged Decision Tree model.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [116]:
df.head(2)

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225


**Answer:** 

(1) Maybe try cubing instead of squaring and see what effect it has on the models.

(2) Use polynomial features.

(3) Turn the age column into a categorical variable and then create dummy variables for different age ranges.
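As a sketch of the polynomial-features idea, `PolynomialFeatures` from scikit-learn expands each row into its squares and pairwise products. The two-column input below is made up for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One illustrative row with two features: x0 = 2, x1 = 3
X_demo = np.array([[2.0, 3.0]])

# degree=2 without the bias column gives x0, x1, x0^2, x0*x1, x1^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_demo)

print(X_poly)
```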

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

**Answer:** 

Our target variable is whether or not someone is eligible for a 401k, and `p401k` records whether or not someone currently participates in one. Anyone who participates must be eligible, so for those people `p401k` is almost the same as the target. Including `p401k` in my model would therefore be almost like training the model with the target variable included, which would give misleadingly good scores rather than a model that is useful when participation is unknown.

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

**Answer:**

    - Logistic Regression / appropriate since its coefficients can be interpreted 
    - K-Nearest Neighbors / appropriate as it can be used for classification purposes
    - Decision Trees / can be used for classification purposes 
    - Bagged Decision Trees / can be used for classification purposes
    - Random Forest / can be used for classification purposes 
    

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a $k$-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [21]:
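# Import libraries needed for models
# (the other classifiers were imported earlier; AdaBoostClassifier was not)
from sklearn.ensemble import AdaBoostClassifier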


***Set up testing and training data, do any preprocessing steps***

In [118]:
X = df.drop(columns=["e401k", "p401k"])
y = df["e401k"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = 0.2,
    random_state = 42
)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

***Run each model!***

### Logistic Regression

In [119]:
logreg = LogisticRegression()

In [120]:
logreg.fit(X_train_sc, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [121]:
cross_val_score(logreg, X_train_sc, y_train).mean()


0.6529649595687331

In [122]:
logreg.score(X_train_sc, y_train)

0.6540431266846362

In [123]:
logreg.score(X_test_sc, y_test)

0.663611859838275

### K Nearest Neighbors

In [124]:
knn = KNeighborsClassifier()

In [125]:
knn.fit(X_train_sc, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [126]:
cross_val_score(knn, X_train_sc, y_train).mean()

0.6243935309973045

In [127]:
knn.score(X_train_sc, y_train)

0.7504043126684636

In [128]:
knn.score(X_test_sc, y_test)

0.6393530997304582

### Decision Tree

In [129]:
dt = DecisionTreeClassifier()

In [130]:
dt.fit(X_train_sc, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [131]:
cross_val_score(dt, X_train_sc, y_train).mean()

0.5973045822102425

In [132]:
dt.score(X_train_sc, y_train)

1.0

In [134]:
dt.score(X_test_sc, y_test)

0.5892183288409704

### Bagged Decision Tree

In [135]:
bag = BaggingClassifier()

In [136]:
bag.fit(X_train_sc, y_train)

BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=10,
                  n_jobs=None, oob_score=False, random_state=None, verbose=0,
                  warm_start=False)

In [137]:
cross_val_score(bag, X_train_sc, y_train).mean()

0.646765498652291

In [138]:
bag.score(X_train_sc, y_train)

0.9760107816711591

In [139]:
bag.score(X_test_sc, y_test)

0.645822102425876

### Random Forests

In [140]:
rf = RandomForestClassifier()

In [141]:
rf.fit(X_train_sc, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [142]:
cross_val_score(rf, X_train_sc, y_train).mean()

0.6673854447439354

In [143]:
rf.score(X_train_sc, y_train)

1.0

In [144]:
rf.score(X_test_sc, y_test)

0.6652291105121294

### Support Vector Machine

In [145]:
svc = svm.SVC()

In [146]:
svc.fit(X_train_sc, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [147]:
cross_val_score(svc, X_train_sc, y_train).mean()

0.6675202156334231

In [148]:
svc.score(X_train_sc, y_train)

0.6830188679245283

In [149]:
svc.score(X_test_sc, y_test)

0.6738544474393531

### AdaBoost

In [150]:
ada = AdaBoostClassifier()

In [151]:
ada.fit(X_train_sc, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=None)

In [152]:
cross_val_score(ada, X_train_sc, y_train).mean()

0.6820754716981133

In [153]:
ada.score(X_train_sc, y_train)

0.6894878706199461

In [154]:
ada.score(X_test_sc, y_test)

0.6911051212938005

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

**Answer:**
- False positives: people the model predicts are eligible for a 401k but who actually are not.
- False negatives: people the model predicts are not eligible for a 401k but who actually are.

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

**Answer:** 

We would want to minimize false positives, assuming that the cost to the financial services company that I am working for is greater if they offer a 401k to someone who is not actually eligible for one than if they fail to offer a 401k to someone who is eligible.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

**Answer:** 

To minimize false positives, we would optimize specificity, the share of actual negatives that the model correctly predicts as negative: TN / (TN + FP).
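Specificity can be computed directly from the true and predicted labels. The function name and the six example labels below are hypothetical, chosen only to show the arithmetic, with 1 standing for "eligible".

```python
def specificity(y_true, y_pred):
    """Share of actual negatives (0) that were predicted negative."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tn + fp == 0:
        raise ValueError("y_true contains no negative cases")
    return tn / (tn + fp)


# Illustrative labels: four actual negatives, two actual positives
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 0]

# 3 true negatives and 1 false positive
spec = specificity(y_true_demo, y_pred_demo)
print(spec)
```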

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to use the ROC AUC Score to evaluate our models.


<br>
<center> <b> Receiver Operating Characteristic Curve </b> </center>
<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2018/08/ROC-Curve-Plot-for-a-No-Skill-Classifier-and-a-Logistic-Regression-Model.png" width = 500>

Recall back in Week 4 we talked about the ROC AUC score as a useful way to give an overall score of a binary classification model. The values range between 0 and 1, but your scores need to be above 0.5 to indicate your model is useful (0.5 indicates your classes completely overlap). The closer to 1 your score, the stronger your model.

By computing the scores on our train and test data, we can check for overfitting or underfitting the same way we do with an accuracy score, or with R-squared for regression models.
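For intuition, ROC AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counting as half. The plain-Python sketch below computes it that way; the function name is hypothetical and the four labels and scores are illustrative. In practice, `sklearn.metrics.roc_auc_score` computes the same quantity for binary labels, and it should be given predicted probabilities or decision scores rather than hard 0/1 predictions.

```python
def roc_auc(y_true, scores):
    """Pairwise ROC AUC: P(positive score > negative score), ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Illustrative labels and model scores
y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly
auc = roc_auc(y_true_demo, scores_demo)
print(auc)
```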

##### 24. Using the AUC-score, evaluate each of the models you fit on both the training and testing data.

In [155]:
from sklearn.metrics import f1_score

| Model               | Train F1 | Test F1 |
|---------------------|----------|---------|
| Logistic Regression | 0.4728   | 0.4774  |
| KNN                 | 0.6515   | 0.4966  |
| Decision Tree       | 1.0000   | 0.4686  |
| Bagged DT           | 0.9687   | 0.4839  |
| Random Forest       | 1.0000   | 0.5306  |
| SVM                 | 0.4707   | 0.4535  |
| AdaBoost            | 0.5621   | 0.5688  |

### Logistic Regression 

In [156]:
logreg_predics_train = logreg.predict(X_train_sc)

In [157]:
logreg_f1_train = f1_score(y_train, logreg_predics_train)

In [158]:
logreg_f1_train

0.4727870199219552

In [159]:
logreg_predics_test = logreg.predict(X_test_sc)

In [160]:
logreg_f1_test = f1_score(y_test, logreg_predics_test)
logreg_f1_test

0.4773869346733668

### KNN

In [161]:
knn_predics_train = knn.predict(X_train_sc)

In [162]:
knn_f1_train = f1_score(y_train, knn_predics_train)
knn_f1_train

0.6514866390666164

In [163]:
knn_predics_test = knn.predict(X_test_sc)

In [164]:
knn_f1_test = f1_score(y_test, knn_predics_test)
knn_f1_test

0.49661399548532736

### Decision Tree

In [165]:
dt_predics_train = dt.predict(X_train_sc)

In [166]:
dt_f1_train = f1_score(y_train, dt_predics_train)
dt_f1_train

1.0

In [167]:
dt_predics_test = dt.predict(X_test_sc)

In [168]:
dt_f1_test = f1_score(y_test, dt_predics_test)
dt_f1_test

0.4686192468619247

### Bagged Decision Tree

In [169]:
bag_predics_train = bag.predict(X_train_sc)

In [170]:
bag_f1_train = f1_score(y_train, bag_predics_train)
bag_f1_train

0.9686619718309859

In [171]:
bag_predics_test = bag.predict(X_test_sc)

In [172]:
bag_f1_test = f1_score(y_test, bag_predics_test)
bag_f1_test

0.4838963079340141

### Random Forests

In [173]:
rf_predics_train = rf.predict(X_train_sc)

In [175]:
rf_f1_train = f1_score(y_train, rf_predics_train)
rf_f1_train

1.0

In [176]:
rf_predics_test = rf.predict(X_test_sc)

In [177]:
rf_f1_test = f1_score(y_test, rf_predics_test)
rf_f1_test

0.5306122448979591

### Support Vector Machine

In [178]:
svc_predics_train = svc.predict(X_train_sc)

In [179]:
svc_f1_train = f1_score(y_train, svc_predics_train)
svc_f1_train

0.4707470747074707

In [180]:
svc_predics_test = svc.predict(X_test_sc)

In [181]:
svc_f1_test = f1_score(y_test, svc_predics_test)
svc_f1_test

0.45347786811201446

### AdaBoost

In [182]:
ada_predics_train = ada.predict(X_train_sc)

In [183]:
ada_f1_train = f1_score(y_train, ada_predics_train)
ada_f1_train

0.5621436716077537

In [184]:
ada_predics_test = ada.predict(X_test_sc)

In [185]:
ada_f1_test = f1_score(y_test, ada_predics_test)
ada_f1_test

0.5688487584650113

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

**Answer:**

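Based on the training f1-scores and the testing f1-scores, there is strong evidence that the Decision Tree (1.0 vs. 0.4686), Bagged Decision Trees (0.9687 vs. 0.4839), and Random Forest (1.0 vs. 0.5306) models are overfit. The KNN model is also overfit (0.6515 vs. 0.4966), and the Support Vector Machine model shows slight evidence of overfitting (0.4707 vs. 0.4535). Finally, the Logistic Regression and AdaBoost models do not show evidence of being overfit, since their testing scores are slightly higher than their training scores.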

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

**Answer:** 

If I had to pick one classification model, it would be the AdaBoost model: it has the strongest testing f1-score (0.5688), and its training f1-score (0.5621) is nearly identical, so it shows no evidence of being overfit.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

**Answer:** 

(1) I would run a grid search on my models to see whether, and by how much, tuning the models' hyperparameters could improve their performance.

(2) I would spend some time on polynomial features, and maybe on creating some additional features.
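A grid search over AdaBoost could be sketched as follows. The data comes from `make_classification` as a stand-in (an assumption), and the grid values are arbitrary illustrative choices rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: replace with X_train_sc, y_train)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Small illustrative grid; a real search would likely cover more values
param_grid = {
    "n_estimators": [25, 50],
    "learning_rate": [0.5, 1.0],
}

# Score each combination by cross-validated F1, matching the metric used above
grid = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=3,
)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(grid.best_score_)
```

Fitting the search on the training split only, and scoring the chosen model once on the test split, keeps the test data out of the tuning process.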

## Step 6: Answer the problem.

***Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.***

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.