In [386]:
# Credit to David Lee for assistance on this lab

## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully the account holds a lot more money when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [6]:
import pandas as pd
import numpy as np

In [7]:
data = pd.read_csv('../6.01-lab-supervised-learning-models-master/401ksubs.csv')

In [8]:
data.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

**Answer:**
- Education level, for example whether or not the person has a college degree.
- Employment status: whether or not the person currently has a job (Yes/No).

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

**Answer:** This would be unethical because if a particular race happens to have lower incomes in the data, possibly just because of how the data was gathered, the model will learn an inherent bias against that race and use it to decide who gets targeted.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) in our dataset would we reasonably not use? Why?

**Answer:** We should not use `incsq`, because it is computed directly from `inc` and would leak the target into the features. Per the note above, `e401k`, `p401k`, and `pira` are also off limits, and `Unnamed: 0` is only a row index.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

**Answer:** The features income squared (`incsq`) and age squared (`agesq`). SMEs may have added them so that a linear model can capture a curved relationship, for example an outcome that rises with age and then levels off or falls.
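
To illustrate why a squared term can help, here is a minimal sketch on synthetic, made-up data (not the 401k dataset). The coefficients and the inverted-U shape are assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, made-up data: the outcome rises with age, then falls off (an inverted U).
rng = np.random.default_rng(42)
age = rng.uniform(25, 64, size=500)
income = 4.0 * age - 0.045 * age**2 + rng.normal(0, 2, size=500)

# Model 1: age only (a straight line).
X_linear = age.reshape(-1, 1)
r2_linear = LinearRegression().fit(X_linear, income).score(X_linear, income)

# Model 2: age and age squared (a parabola, still linear in its coefficients).
X_quad = np.column_stack([age, age**2])
r2_quad = LinearRegression().fit(X_quad, income).score(X_quad, income)

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_quad:.3f}")
```

The model stays a linear regression; the squared column simply gives it a way to bend.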

##### 6. Looking at the data dictionary, two variable descriptions appear to be errors. What are these errors, and what do you think the correct value would be, looking at the data?

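**Answer:** The income and age features are described as squared. These features should not have the `^2` description: looking at the data, `inc` and `age` hold the unsquared values, while `incsq` and `agesq` hold the squares.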

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all models/modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6).

**Answer:** Linear Regression, LASSO Linear Regression, Ridge Linear Regression, Support Vector Machines, Stats Model, Random Forest, Bagging

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above. You will be asked to evaluate your models later in Step 5:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!
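
One possible way to try several techniques on the same split is to loop over a dictionary of estimators. This is only a sketch: the data comes from `make_regression` as a stand-in, and the names `candidates` and `scores` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Stand-in data (an assumption); in the lab this would be the 401k features and `inc`.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42, train_size=0.70)

# Hypothetical subset of the models listed above.
candidates = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=42),
}

# Fit every model on the same training rows and record train/test R^2.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train R^2 = {scores[name][0]:.3f}, test R^2 = {scores[name][1]:.3f}")
```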

In [374]:
# imports
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, RandomForestRegressor, AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.svm import SVR, SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, f1_score

In [13]:
data.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [97]:
X = data[['marr', 'male', 'fsize', 'nettfa', 'agesq']]
y = data['inc']

In [98]:
X.shape

(9275, 5)

In [99]:
y.shape

(9275,)

In [100]:
sc = StandardScaler()

In [106]:
sc.fit(X)
X = sc.transform(X)

In [107]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=.70)
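
A caveat on the cells above: the scaler is fit on every row before the split, so the test rows influence the scaling. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below uses made-up data and hypothetical names (`X_raw`, `y_raw`, `pipe`); it is not the code that produced the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the lab's feature matrix and target.
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=50, scale=10, size=(200, 5))
y_raw = X_raw[:, 0] * 0.5 + rng.normal(size=200)

# Split the raw data first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, random_state=42, train_size=0.70)

# ... then let the pipeline fit the scaler on the training rows only.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
pipe.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name.
scaler = pipe.named_steps["standardscaler"]
print("Scaler means come from the training rows:", np.allclose(scaler.mean_, X_tr.mean(axis=0)))
print("Test R^2:", round(pipe.score(X_te, y_te), 3))
```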

In [108]:
X_train

array([[ 0.76870611, -0.50689781,  1.38615651, -0.48008893, -0.47415415],
       [ 0.76870611, -0.50689781,  0.73074248, -0.2434581 , -0.85937053],
       [-1.30088727, -0.50689781, -0.58008558, -0.19746088, -0.92971439],
       ...,
       [-1.30088727, -0.50689781,  0.07532845, -0.29817947, -1.30488164],
       [-1.30088727,  1.97278423,  0.73074248, -0.34655316, -1.18875844],
       [ 0.76870611, -0.50689781,  0.07532845,  2.78419699, -0.12578456]])

### **Multi Linear Regression Model**

In [109]:
lr = LinearRegression()

In [110]:
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [111]:
lr.score(X_train, y_train)

0.28001416105468424

In [271]:
lr_predict_train = lr.predict(X_train)

In [210]:
lr_predict = lr.predict(X_test)

### **K-Nearest Neighbor**

In [114]:
knn = KNeighborsRegressor()

In [115]:
knn.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [116]:
knn.score(X_train, y_train)

0.5258221016886651

In [117]:
knn.score(X_test, y_test)

0.3155249824655206

In [270]:
knn_predict_train = knn.predict(X_train)

In [209]:
knn_predict = knn.predict(X_test)

### **Decision Tree**

In [118]:
dt = DecisionTreeRegressor(max_depth=4)

In [119]:
dt.fit(X_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=4,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [120]:
dt.score(X_train, y_train)

0.398548797293796

In [121]:
dt.score(X_test, y_test)

0.3679313150122422

In [273]:
dt_predict_train = dt.predict(X_train)

In [208]:
dt_predict = dt.predict(X_test)

### **Bagged Decision Trees**

In [138]:
br = BaggingRegressor(n_estimators=20, random_state=42)

In [139]:
br.fit(X_train, y_train)

BaggingRegressor(base_estimator=None, bootstrap=True, bootstrap_features=False,
                 max_features=1.0, max_samples=1.0, n_estimators=20,
                 n_jobs=None, oob_score=False, random_state=42, verbose=0,
                 warm_start=False)

In [140]:
br.score(X_train, y_train)

0.8848342521590344

In [141]:
br.score(X_test, y_test)

0.28713671693979825

In [268]:
br_predict_train = br.predict(X_train)

In [269]:
br_predict = br.predict(X_test)

### **Random Forest**

In [158]:
rf = RandomForestRegressor(n_estimators=50, max_depth=7)

In [159]:
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=7, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=50, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [160]:
rf.score(X_train, y_train)

0.49400433541970135

In [161]:
rf.score(X_test, y_test)

0.3963682576370157

In [266]:
rf_predict_train = rf.predict(X_train)

In [204]:
rf_predict = rf.predict(X_test)

### **Adaboost Model**

In [174]:
ada = AdaBoostRegressor(base_estimator=rf, learning_rate=0.1, random_state=42)

In [175]:
ada.fit(X_train, y_train)

AdaBoostRegressor(base_estimator=RandomForestRegressor(bootstrap=True,
                                                       ccp_alpha=0.0,
                                                       criterion='mse',
                                                       max_depth=7,
                                                       max_features='auto',
                                                       max_leaf_nodes=None,
                                                       max_samples=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                                                       min_weight_fraction_leaf=0.0,
                                                       n_estimators=50,
                        

In [176]:
ada.score(X_train, y_train)

0.5000419982470464

In [177]:
ada.score(X_test, y_test)

0.38207143038590474

In [264]:
ada_predict_train = ada.predict(X_train)

In [203]:
ada_predict = ada.predict(X_test)

### **Support Vector Regressor**

In [198]:
svm = SVR(C = 5)

In [199]:
svm.fit(X_train, y_train)

SVR(C=5, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [200]:
svm.score(X_train, y_train)

0.34793449719199654

In [201]:
svm.score(X_test, y_test)

0.34405737922661095

In [263]:
svm_predict_train = svm.predict(X_train)

In [202]:
svm_predict = svm.predict(X_test)

##### 9. What is bootstrapping?

**Answer:** Bootstrapping is drawing random samples with replacement from your data, each typically the same size as the original dataset, so that a model or statistic can be fit on each resample.
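
As a small illustration, the sketch below resamples the five `inc` values shown by `data.head()`. The number of resamples (1000) is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
# The five `inc` values shown by data.head() earlier in the notebook.
values = np.array([13.17, 61.23, 12.858, 98.88, 22.614])

# One bootstrap sample: same size as the original, drawn with replacement,
# so some values can appear more than once and others not at all.
sample = rng.choice(values, size=len(values), replace=True)
print(sample)

# Repeating the resampling many times gives a distribution for a statistic such as the mean.
boot_means = np.array([
    rng.choice(values, size=len(values), replace=True).mean()
    for _ in range(1000)
])
print("Mean of the bootstrap means:", round(boot_means.mean(), 2))
```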

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

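**Answer:** A single decision tree is fit once on the training data. A set of bagged decision trees fits many trees, each on a different bootstrap sample of the training data, and then averages their predictions (or takes a majority vote for classification). Averaging gives the bagged model lower variance, so it is less prone to overfitting than a single tree grown the same way.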
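
Bagging can be written out by hand to make the idea concrete. This is a sketch on stand-in data from `make_regression`, not a replacement for `BaggingRegressor`; the choice of 20 trees is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Stand-in data (an assumption, not the 401k dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(20):
    # Bootstrap: draw row indices with replacement, same size as the data.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeRegressor().fit(X_demo[idx], y_demo[idx]))

# Bagged prediction: average the individual trees' predictions.
per_tree = np.array([tree.predict(X_demo[:3]) for tree in trees])
bagged = per_tree.mean(axis=0)
print("One tree:  ", per_tree[0].round(1))
print("Bagged avg:", bagged.round(1))
```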

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

**Answer:**  
Source: https://stats.stackexchange.com/questions/264129/what-is-the-difference-between-bagging-and-random-forest-if-only-one-explanatory
- The fundamental difference is that in Random forests, only a subset of features are selected at random out of the total and the best split feature from the subset is used to split each node in a tree, unlike in bagging where all features are considered for splitting a node.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

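**Answer:** Bagged decision trees consider every feature at every split, so the trees tend to split on the same strong features and end up highly correlated. A random forest considers only a random subset of features at each split, which decorrelates the trees. Averaging less-correlated trees reduces variance more, usually at the cost of only a small increase in bias.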
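
One way to look at this is to measure how correlated the individual trees' predictions are. The sketch below uses stand-in data and a hypothetical helper, `mean_tree_correlation`. Setting `max_features=None` makes the forest behave like bagged trees. One would typically expect the feature-subsampled forest to show the lower correlation, though that is not guaranteed on every dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption, not the 401k dataset).
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, n_informative=4, noise=20.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)


def mean_tree_correlation(forest, X):
    """Average pairwise correlation between the individual trees' predictions."""
    preds = np.array([tree.predict(X) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    upper = corr[np.triu_indices_from(corr, k=1)]
    return upper.mean()


# max_features=None: every split sees all features, which mimics bagged trees.
bagged_like = RandomForestRegressor(n_estimators=30, max_features=None, random_state=42)
bagged_like.fit(X_tr, y_tr)

# max_features="sqrt": every split sees only a random subset of the features.
forest = RandomForestRegressor(n_estimators=30, max_features="sqrt", random_state=42)
forest.fit(X_tr, y_tr)

corr_bagged = mean_tree_correlation(bagged_like, X_te)
corr_forest = mean_tree_correlation(forest, X_te)
print(f"Mean tree-to-tree correlation, all features:  {corr_bagged:.3f}")
print(f"Mean tree-to-tree correlation, sqrt features: {corr_forest:.3f}")
```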

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [274]:
preds = [lr_predict, knn_predict, dt_predict, br_predict, rf_predict, ada_predict, svm_predict]

In [275]:
preds_train = [lr_predict_train, knn_predict_train, dt_predict_train, br_predict_train, rf_predict_train, ada_predict_train, svm_predict_train]

In [276]:
models = ['Linear Regression', 'KNN', 'Decision Tree', 'Bagged Decision Trees', 'Random Forest', 'Adaboost Model', 'Support Vector Regressor']

In [287]:
# Report the training and testing RMSE for each model, in the order the models were fit
for name, train_pred, test_pred in zip(models, preds_train, preds):
    print(f"The Training RMSE for the {name} is: {round(mean_squared_error(y_train, train_pred, squared=False), 2)}")
    print(f"The Testing RMSE for the {name} is: {round(mean_squared_error(y_test, test_pred, squared=False), 2)}")

The Training RMSE for the Linear Regression is: 20.26
The Testing RMSE for the Linear Regression is: 21.62
The Training RMSE for the KNN is: 16.44
The Testing RMSE for the KNN is: 20.34
The Training RMSE for the Decision Tree is: 18.52
The Testing RMSE for the Decision Tree is: 19.54
The Training RMSE for the Bagged Decision Trees is: 8.1
The Testing RMSE for the Bagged Decision Trees is: 20.75
The Training RMSE for the Random Forest is: 16.98
The Testing RMSE for the Random Forest is: 19.1
The Training RMSE for the Adaboost Model is: 16.88
The Testing RMSE for the Adaboost Model is: 19.32
The Training RMSE for the Support Vector Regressor is: 19.28
The Testing RMSE for the Support Vector Regressor is: 19.91
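
The `squared=False` argument depends on the installed scikit-learn version, and newer releases have moved away from it in favor of a dedicated `root_mean_squared_error` function. A version-independent sketch is to take the square root of the MSE directly. The helper name `rmse` and the demo values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    """Root mean squared error, computed without the `squared` keyword."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Tiny made-up example: the errors are 1, -1, and 2, so MSE = (1 + 1 + 4) / 3 = 2.
y_true_demo = [10.0, 20.0, 30.0]
y_pred_demo = [11.0, 19.0, 32.0]
print(round(rmse(y_true_demo, y_pred_demo), 4))  # 1.4142
```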


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

**Answer:** Yes, a few models overfit: KNN, Bagged Decision Trees, Random Forest, and Adaboost. The Bagged Decision Trees were the clearest case, with a training RMSE of 8.1 against a testing RMSE of 20.75.
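
To compare the models at a glance, the RMSE values printed above can be put in a table with the train/test gap. This is a sketch: the numbers are copied by hand from the output, and the name `rmse_table` is hypothetical.

```python
import pandas as pd

# Training and testing RMSE values copied from the output above.
rmse_table = pd.DataFrame(
    {
        "train_rmse": [20.26, 16.44, 18.52, 8.10, 16.98, 16.88, 19.28],
        "test_rmse": [21.62, 20.34, 19.54, 20.75, 19.10, 19.32, 19.91],
    },
    index=[
        "Linear Regression", "KNN", "Decision Tree", "Bagged Decision Trees",
        "Random Forest", "Adaboost Model", "Support Vector Regressor",
    ],
)

# A large positive gap (test error well above training error) suggests overfitting.
rmse_table["gap"] = (rmse_table["test_rmse"] - rmse_table["train_rmse"]).round(2)
print(rmse_table.sort_values("gap", ascending=False))
```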

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

**Answer:** I would select the Support Vector Regressor because it had a competitive testing RMSE (19.91) and the smallest gap between training and testing RMSE (19.28 vs. 19.91), which suggests the least overfitting.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

**Answer:** First I would check for outliers in the dataset. Secondly I would perform a gridsearch to find the best parameters for this model. Lastly I would collect more data.
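
A grid search over the support vector regressor could look roughly like the sketch below. The data is a stand-in from `make_regression`, and the grid values are hypothetical, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Stand-in data (an assumption); swap in the lab's features and `inc`.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42, train_size=0.70)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([("sc", StandardScaler()), ("svr", SVR())])

# Hypothetical grid; the values are illustrative only.
param_grid = {
    "svr__C": [1, 5, 10],
    "svr__epsilon": [0.1, 0.5],
    "svr__kernel": ["rbf", "linear"],
}

# The scorer is negated RMSE, so values closer to zero are better.
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_root_mean_squared_error")
grid.fit(X_tr, y_tr)

print("Best parameters:", grid.best_params_)
print("Cross-validated RMSE:", round(-grid.best_score_, 2))
print("Test RMSE:", round(-grid.score(X_te, y_te), 2))
```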

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

**Answer:** It would be bad to use this in the model because a value of 1 means the person already participates in a 401k, so they must be eligible. The feature leaks the answer and would dominate the classification models.

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6).

**Answer:** Logistic Regression, KNN, Decision Trees, Bagged Decision Trees, Random Forests, SVMs, Adaboost, Extra Trees

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above. You will be asked to evaluate your models later in Step 5:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [288]:
data.head(1)

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600


In [291]:
X2 = data[['marr', 'male', 'fsize', 'nettfa', 'agesq', 'incsq']]
y2 = data['e401k']

In [293]:
X2.shape

(9275, 6)

In [294]:
y2.shape

(9275,)

In [295]:
sc = StandardScaler()

In [297]:
sc.fit(X2)
X2 = sc.transform(X2)

In [298]:
X_train, X_test, y_train, y_test = train_test_split(X2, y2, random_state=42, train_size=.70)

### **Logistic Regression**

In [302]:
logreg = LogisticRegression()

In [303]:
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [305]:
logreg.score(X_train, y_train)

0.6395563770794824

In [377]:
logreg_predict_train = logreg.predict(X_train)

In [378]:
logreg_predict = logreg.predict(X_test)

### **K-Nearest Neighbor**

In [312]:
knn_c = KNeighborsClassifier()

In [313]:
knn_c.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [314]:
knn_c.score(X_train, y_train)

0.7556993222427604

In [315]:
knn_c.score(X_test, y_test)

0.6399568810636004

In [318]:
knn_c_predict_train = knn_c.predict(X_train)

In [319]:
knn_c_predict = knn_c.predict(X_test)

### **Decision Tree**

In [321]:
dt_c = DecisionTreeClassifier(max_depth=4)

In [322]:
dt_c.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=4, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [323]:
dt_c.score(X_train, y_train)

0.6905422057917436

In [324]:
dt_c.score(X_test, y_test)

0.6769673014732304

In [325]:
dt_c_predict_train = dt_c.predict(X_train)

In [326]:
dt_c_predict = dt_c.predict(X_test)

### **Bagged Decision Trees**

In [327]:
br_c = BaggingClassifier(n_estimators=20, random_state=42)

In [328]:
br_c.fit(X_train, y_train)

BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=20,
                  n_jobs=None, oob_score=False, random_state=42, verbose=0,
                  warm_start=False)

In [329]:
br_c.score(X_train, y_train)

0.9930683918669131

In [330]:
br_c.score(X_test, y_test)

0.6374416097736256

In [334]:
br_c_predict_train = br_c.predict(X_train)

In [335]:
br_c_predict = br_c.predict(X_test)

### **Random Forest**

In [336]:
rf_c = RandomForestClassifier(n_estimators=50, max_depth=7)

In [337]:
rf_c.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=7, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [338]:
rf_c.score(X_train, y_train)

0.7165742452248922

In [339]:
rf_c.score(X_test, y_test)

0.6816385195831837

In [340]:
rf_c_predict_train = rf_c.predict(X_train)

In [341]:
rf_c_predict = rf_c.predict(X_test)

### **Adaboost Model**

In [357]:
ada_c = AdaBoostClassifier(base_estimator=rf_c, learning_rate=0.1, random_state=42)

In [358]:
ada_c.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=RandomForestClassifier(bootstrap=True,
                                                         ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=7,
                                                         max_features='auto',
                                                         max_leaf_nodes=None,
                                                         max_samples=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                          

In [359]:
ada_c.score(X_train, y_train)

0.7731053604436229

In [360]:
ada_c.score(X_test, y_test)

0.68199784405318

In [361]:
ada_c_predict_train = ada_c.predict(X_train)

In [362]:
ada_c_predict = ada_c.predict(X_test)

### **Support Vector Classifier**

In [367]:
svm_c = SVC(C = 5)

In [368]:
svm_c.fit(X_train, y_train)

SVC(C=5, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [369]:
svm_c.score(X_train, y_train)

0.6825323475046211

In [370]:
svm_c.score(X_test, y_test)

0.6798418972332015

In [372]:
svm_c_predict_train = svm_c.predict(X_train)

In [373]:
svm_c_predict = svm_c.predict(X_test)

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

**Answer:** Our false positives would be people we label as eligible for a 401k when they actually are not eligible. Our false negatives would be people we label as not eligible when they actually are eligible.
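
The two error types can be read straight off a confusion matrix. The labels below are made up for illustration (1 = eligible, 0 = not eligible).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 1, 0, 0, 0, 1]

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"False positives (predicted eligible, actually not): {fp}")
print(f"False negatives (predicted not eligible, actually eligible): {fn}")
```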

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

**Answer:** We would rather minimize false positives because we want to take on as little risk as we can. It would be bad for business if people who aren't eligible for a 401k were treated as able to open one.

##### 22. Suppose we wanted to optimize for (minimize) the answer you provided in problem 21. Which metric would we optimize (maximize) in this case?

**Answer:** To minimize false positives, we would maximize specificity (the true negative rate). Precision also rewards having fewer false positives.

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

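**Answer:** F1 is appropriate because it is the harmonic mean of precision and recall, so it is penalized by both false positives and false negatives. It focuses on the positive class: of the people we predicted as eligible, how many really are (precision), and of the people who really are eligible, how many did we find (recall).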
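
As a quick check of that definition, the sketch below computes f1 by hand from precision and recall and compares it with `f1_score`. The labels are made up for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0]

precision = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
recall = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)

# Harmonic mean of precision and recall.
f1_by_hand = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
print(f"f1 by hand = {f1_by_hand:.3f}, f1_score = {f1_score(y_true_demo, y_pred_demo):.3f}")
```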

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [385]:
print(f"The training f1 score for the Logistic Regression is: {round(f1_score(y_train, logreg_predict_train), 2)}")
print(f"The testing f1 score for the Logistic Regression is: {round(f1_score(y_test, logreg_predict), 2)}")
print(f"The training f1 score for the KNN is: {round(f1_score(y_train, knn_c_predict_train), 2)}")
print(f"The testing f1 score for the KNN is: {round(f1_score(y_test, knn_c_predict), 2)}")
print(f"The training f1 score for the Decision Tree is: {round(f1_score(y_train, dt_c_predict_train), 2)}")
print(f"The testing f1 score for the Decision Tree is: {round(f1_score(y_test, dt_c_predict), 2)}")
print(f"The training f1 score for the Bagged Decision Trees is: {round(f1_score(y_train, br_c_predict_train), 2)}")
print(f"The testing f1 score for the Bagged Decision Trees is: {round(f1_score(y_test, br_c_predict), 2)}")
print(f"The training f1 score for the Random Forest is: {round(f1_score(y_train, rf_c_predict_train), 3)}")
print(f"The testing f1 score for the Random Forest is: {round(f1_score(y_test, rf_c_predict), 2)}")
print(f"The training f1 score for the Adaboost Model is: {round(f1_score(y_train, ada_c_predict_train), 2)}")
print(f"The testing f1 score for the Adaboost Model is: {round(f1_score(y_test, ada_c_predict), 2)}")
print(f"The training f1 score for the Support Vector Classifier is: {round(f1_score(y_train, svm_c_predict_train), 2)}")
print(f"The testing f1 score for the Support Vector Classifier is: {round(f1_score(y_test, svm_c_predict), 2)}")

The training f1 score for the Logistic Regression is: 0.31
The testing f1 score for the Logistic Regression is: 0.31
The training f1 score for the KNN is: 0.66
The testing f1 score for the KNN is: 0.49
The training f1 score for the Decision Tree is: 0.56
The testing f1 score for the Decision Tree is: 0.53
The training f1 score for the Bagged Decision Trees is: 0.99
The testing f1 score for the Bagged Decision Trees is: 0.48
The training f1 score for the Random Forest is: 0.601
The testing f1 score for the Random Forest is: 0.54
The training f1 score for the Adaboost Model is: 0.68
The testing f1 score for the Adaboost Model is: 0.54
The training f1 score for the Support Vector Classifier is: 0.46
The testing f1 score for the Support Vector Classifier is: 0.45


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

**Answer:** Yes, there was evidence of overfitting in several models: KNN, Bagged Decision Trees, Random Forest, and Adaboost. The Bagged Decision Trees were the most extreme, with a training f1-score of 0.99 against a testing f1-score of 0.48.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

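**Answer:** I would pick the Random Forest model because it tied for the highest testing f1-score (0.54) without too much variance: its training f1-score (0.60) is closer to its testing score than Adaboost's (0.68).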

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

**Answer:** I would perform a grid search to find the best hyperparameters and do more feature engineering. If possible, I would also like to collect more data.

## Step 6: Answer the problem. [BONUS] 

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.