<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Supervised Learning Model Comparison

---

### Let us begin...

Recall the `data science process`.
   1. Define the problem.
   2. Gather the data.
   3. Explore the data.
   4. Model the data.
   5. Evaluate the model.
   6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, so the balance has hopefully grown a lot by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

#### NOTE: When predicting `inc`, you should pretend that you do not have access to the `e401k`, `p401k`, and `pira` variables.

#### When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data.

In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn import datasets
from sklearn.inspection import DecisionBoundaryDisplay

from sklearn.metrics import accuracy_score, precision_score, recall_score, mean_squared_error

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, BaggingClassifier, \
RandomForestClassifier, AdaBoostClassifier
from sklearn.compose import ColumnTransformer

np.random.seed(42)   # seed for reproducibility

In [2]:
df = pd.read_csv('401ksubs.csv')

In [3]:
df.dtypes

e401k       int64
inc       float64
marr        int64
male        int64
age         int64
fsize       int64
nettfa    float64
p401k       int64
pira        int64
incsq     float64
agesq       int64
dtype: object

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

In [4]:
# To predict income, it would be helpful to also have:
# career category,
# years of work experience,
# level of education,
# employment type (full-time, part-time, freelance, contract)

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

In [5]:
# Race is sensitive personal data.
# Using it to decide who is targeted can lead to discrimination and bias.
# It may also be illegal.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

In [6]:
# incsq (the square of income)
# If we know this value, inc (income) can be recovered by taking the square root, so it leaks the target.

# nettfa: net total fin. assets, $1000
# It may be collinear with other variables, as nettfa is a measure of wealth
# and resembles a target variable more than a feature.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs (Subject Matter Experts) might have done this!

In [7]:
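# agesq (the square of age): As a person ages and gains experience,
# their career progresses, but income may not grow linearly with age.
# agesq was likely created so that a linear model can capture this non-linear relationship.

# incsq (the square of income): by the same logic, the effect of income on outcomes such as
# net financial assets or 401(k) eligibility may not be linear.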
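To see why a squared term can help, here is a minimal sketch on synthetic data. The numbers are made up for illustration and are not taken from `401ksubs.csv`: income is simulated to rise and then fall with age, and a linear model that sees both `age` and `age ** 2` should fit that curve better than one that sees `age` alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: income peaks around age 50
age = rng.uniform(25, 65, size=500)
inc = 60 - 0.05 * (age - 50) ** 2 + rng.normal(0, 2, size=500)

X_linear = age.reshape(-1, 1)                    # age only
X_quadratic = np.column_stack([age, age ** 2])   # age and its square

# In-sample R^2 of each fit
r2_linear = LinearRegression().fit(X_linear, inc).score(X_linear, inc)
r2_quadratic = LinearRegression().fit(X_quadratic, inc).score(X_quadratic, inc)

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_quadratic:.3f}")
```

The model is still linear in its coefficients; the squared column simply gives it a curved shape to work with.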

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

In [8]:
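# inc is described as inc^2, but it should be described as income.
# The listing below shows the same kind of error for age, which is described as age^2.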

# e401k           byte   %9.0g                  =1 if eligble for 401(k)
# inc             float  %9.0g                  inc^2
# marr            byte   %9.0g                  =1 if married
# male            byte   %9.0g                  =1 if male respondent
# age             byte   %9.0g                  age^2
# fsize           byte   %9.0g                  family size
# nettfa          float  %9.0g                  net total fin. assets, $1000
# p401k           byte   %9.0g                  =1 if participate in 401(k)
# pira            byte   %9.0g                  =1 if have IRA
# incsq           float  %9.0g                  inc^2
# agesq           int    %9.0g                  age^2

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend that you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

In [9]:
# Linear Regression - appropriate if the relationship between the features and income looks roughly linear
# Logistic Regression - not appropriate; it is for classification problems
# Decision Tree Regressor - appropriate if we do not need coefficients for interpretability
# Gradient Boosting - will try
# Gradient Descent - an optimization method for fitting models rather than a model itself

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [30]:
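# Select the features and target, then create the train/test split
# train_test_split has no random_state here, so the split relies on np.random.seed(42) set above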

X = df[['marr','male','age','agesq','fsize']]
y = df['inc']

X_train, X_test, y_train, y_test = train_test_split(X
                                                    , y
                                                    , test_size = 0.2
                                                   )

In [31]:
y_train.describe()

count    7420.000000
mean       39.243698
std        24.163015
min        10.008000
25%        21.643500
50%        33.241500
75%        50.250000
max       199.041000
Name: inc, dtype: float64

In [12]:
df['inc'].isnull().sum()

0

In [13]:
# Dictionary of Pipelines
# We scale only the continuous columns, not the binary ones

# List of continuous columns
continuous_cols = ['age','agesq','fsize']

# ColumnTransformer scales only the continuous columns and leaves the other columns untouched
preprocessor = ColumnTransformer(transformers = [('scaler', StandardScaler(), continuous_cols)], remainder = 'passthrough')

# Pipeline
pipelines = { 'LR': Pipeline([('preprocessor', preprocessor ), ('regressor', LinearRegression())])
             , 'KNN': Pipeline([('preprocessor', preprocessor ), ('regressor', KNeighborsRegressor())])
             , 'DCT': Pipeline([('preprocessor', preprocessor ), ('regressor', DecisionTreeRegressor())])
             , 'BAG': Pipeline([('preprocessor', preprocessor ), ('regressor', BaggingRegressor(DecisionTreeRegressor()))])
             , 'RF': Pipeline([('preprocessor', preprocessor ), ('regressor', RandomForestRegressor())])
             , 'ADA': Pipeline([('preprocessor', preprocessor ), ('regressor', AdaBoostRegressor())])
            }

In [14]:
# Define parameter grids for each model
param_grids = {
    'LR': {}
    , 'KNN': {'regressor__n_neighbors': [1, 3, 5, 7, 9]}
    , 'DCT': {'regressor__max_depth': [None, 10, 20, 30]}
    , 'BAG': {'regressor__n_estimators': [10, 50, 100], 'regressor__random_state': [42]}
    , 'RF': {'regressor__n_estimators': [10, 50, 100], 'regressor__max_depth': [None, 10, 20], 'regressor__random_state': [42]}
    , 'ADA': {'regressor__n_estimators': [10, 50, 100], 'regressor__learning_rate': [0.001, 0.01, 0.1, 1], 'regressor__random_state': [42]}
}

In [15]:
# Iterate through each model and its parameter grid
# Print each step for checking
# GridSearchCV reports RMSE as a negative number ('neg_root_mean_squared_error'); we take the absolute value to get the positive RMSE


for model_name, pipeline in pipelines.items():
    
    print(f"Running GridSearch for {model_name}...")
   
    grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grids[model_name], cv = 5, scoring = 'neg_root_mean_squared_error')
    grid_search.fit(X_train, y_train)
     
    print(f"Best parameters for {model_name}: {grid_search.best_params_}")
    print(f"Best score for {model_name}: {abs(grid_search.best_score_)}\n")

Running GridSearch for LR...
Best parameters for LR: {}
Best score for LR: 21.69155441448897

Running GridSearch for KNN...
Best parameters for KNN: {'regressor__n_neighbors': 9}
Best score for KNN: 22.6713908103253

Running GridSearch for DCT...
Best parameters for DCT: {'regressor__max_depth': 10}
Best score for DCT: 22.40769851124346

Running GridSearch for BAG...
Best parameters for BAG: {'regressor__n_estimators': 100, 'regressor__random_state': 42}
Best score for BAG: 22.512692300715177

Running GridSearch for RF...
Best parameters for RF: {'regressor__max_depth': 10, 'regressor__n_estimators': 50, 'regressor__random_state': 42}
Best score for RF: 22.017088665535045

Running GridSearch for ADA...
Best parameters for ADA: {'regressor__learning_rate': 0.001, 'regressor__n_estimators': 50, 'regressor__random_state': 42}
Best score for ADA: 21.804819237041198



##### 9. What is bootstrapping?

In [16]:
# Bootstrapping is a method that repeatedly resamples the data with replacement, where each resample has the same size as the original.
# A statistic (or a model) is then computed on each resample.
# Aggregating the predictions of models fit on the resamples (e.g., by taking the mean) is bagging, which builds on bootstrapping.
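As a minimal sketch of the resampling step, the example below bootstraps the mean of a hypothetical income sample. The data is simulated, not the lab data, and the choice of 1000 resamples is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample of 200 incomes (in $1000s); not the lab data
sample = rng.gamma(shape=2.0, scale=20.0, size=200)

n_resamples = 1000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw a resample of the same size as the original, with replacement
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# The spread of the bootstrap means estimates the uncertainty of the sample mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {sample.mean():.2f}")
print(f"95% bootstrap interval for the mean: ({lower:.2f}, {upper:.2f})")
```

Bagging applies the same resampling idea to models: each tree is fit on one resample, and the predictions are then combined.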

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

In [17]:
# A decision tree is a single predictive model. It is fast to fit but tends to overfit.

# A set of bagged decision trees fits multiple trees, each on a bootstrap sample of the training data.
# The final result aggregates the predictions from all the trees (the mean for regression, a vote for classification), which helps reduce overfitting.

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

In [18]:
# Every tree in a set of bagged decision trees can consider all features at every split, which can lead to correlation among the trees.
# In a random forest, on the other hand, each split can only consider a random subset of the features,
# which reduces the correlation between trees. The final result is again obtained by aggregating the predictions from all trees.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

In [19]:
# From 11.
# The correlation among trees in a random forest is lower than in a bagged decision tree model, 
# leading to less variance in the random forest. 
# However, because a random forest uses fewer features for each split, it may introduce some bias. 
# Still, this trade-off between reduced variance and slightly increased bias often results in better overall performance in a random forest model.
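The variance argument can be made precise with a standard result, stated here under the simplifying assumption that the $B$ trees $T_1, \dots, T_B$ each have variance $\sigma^2$ and share a common pairwise correlation $\rho$:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding more trees shrinks only the second term. The first term is set by the correlation between trees, which is the quantity that the random feature subsets in a random forest are meant to lower.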

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [20]:
# Same as question 8, but also keep the best estimator and print the training and testing scores

for model_name, pipeline in pipelines.items():
    print(f"Running GridSearch for {model_name}...")
   
    grid_search = GridSearchCV(estimator=pipeline, param_grid = param_grids[model_name], cv=5, scoring = 'neg_root_mean_squared_error')
    grid_search.fit(X_train, y_train)
    
    best_model = grid_search.best_estimator_      # Best model from Grid Search

    # Calculate RMSE for training data
    y_train_pred = best_model.predict(X_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    
    # Calculate RMSE for testing data
    y_test_pred = best_model.predict(X_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    
    print(f"Best parameters for {model_name}: {grid_search.best_params_}")
    print(f"Best RMSE score: {abs(grid_search.best_score_)}")
    print(f"Training RMSE for {model_name}: {train_rmse}")
    print(f"Testing RMSE for {model_name}: {test_rmse}\n")

Running GridSearch for LR...
Best parameters for LR: {}
Best RMSE score: 21.69155441448897
Training RMSE for LR: 21.672411703221243
Testing RMSE for LR: 22.417156974691792

Running GridSearch for KNN...
Best parameters for KNN: {'regressor__n_neighbors': 9}
Best RMSE score: 22.6713908103253
Training RMSE for KNN: 21.4750048255395
Testing RMSE for KNN: 23.28834102056521

Running GridSearch for DCT...
Best parameters for DCT: {'regressor__max_depth': 10}
Best RMSE score: 22.40704524073252
Training RMSE for DCT: 20.91330712825238
Testing RMSE for DCT: 22.718558013822282

Running GridSearch for BAG...
Best parameters for BAG: {'regressor__n_estimators': 100, 'regressor__random_state': 42}
Best RMSE score: 22.512692300715177
Training RMSE for BAG: 20.45619559842459
Testing RMSE for BAG: 22.9823100539742

Running GridSearch for RF...
Best parameters for RF: {'regressor__max_depth': 10, 'regressor__n_estimators': 50, 'regressor__random_state': 42}
Best RMSE score: 22.017088665535045
Training 

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

In [21]:
# Linear Regression and ADA have a smaller difference in RMSE between training and testing (about 0.7 for LR),
# while the others show a larger difference (about 1.8 to 2.5 for KNN, DCT, and BAG), which suggests overfitting.
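The gaps can be tabulated directly from the printed results. This is a small sketch that copies the training and testing RMSE values from the output of question 13; RF and ADA are left out because their values are not shown in full there.

```python
# (train RMSE, test RMSE) copied from the output of question 13
reported_rmse = {
    'LR':  (21.672411703221243, 22.417156974691792),
    'KNN': (21.4750048255395, 23.28834102056521),
    'DCT': (20.91330712825238, 22.718558013822282),
    'BAG': (20.45619559842459, 22.9823100539742),
}

# A larger positive gap means the model does worse on unseen data than on training data
gaps = {name: test - train for name, (train, test) in reported_rmse.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: test RMSE exceeds train RMSE by {gap:.2f}")
```

A gap alone does not say how large is too large, so it is best read alongside the cross-validated RMSE for each model.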

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [22]:
# I would choose linear regression because it has the lowest cross-validated RMSE (about 21.69) of all the models.
# Additionally, linear regression is simple and easy to interpret.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [23]:
# 1) Feature Engineering: When using R^2 as the scoring metric,
#    the scores were very low (below 0.2).
#    This means the model did not capture the patterns, so adding interaction terms between features may be useful.
# 2) Hyperparameter Tuning: The parameter grids we searched may not contain the best settings.

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

In [24]:
# Anyone who participates in a 401(k) (p401k = 1) must be eligible for one (e401k = 1),
# so p401k leaks the target into the features. It is also unlikely to be known for potential customers we have not reached yet.
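One way to check this relationship is a cross-tabulation of the two columns. The sketch below uses a small made-up frame called `toy` that follows the rule "participation requires eligibility"; running the same two steps on `df` would show whether the lab data follows it as well.

```python
import pandas as pd

# Toy rows (made up) that respect the rule "participation requires eligibility"
toy = pd.DataFrame({
    'e401k': [1, 1, 1, 0, 0, 1, 0, 1],
    'p401k': [1, 0, 1, 0, 0, 1, 0, 0],
})

# Rows are eligibility, columns are participation
table = pd.crosstab(toy['e401k'], toy['p401k'])
print(table)

# If this count is zero, p401k == 1 always implies e401k == 1
participants_not_eligible = int(((toy['p401k'] == 1) & (toy['e401k'] == 0)).sum())
print(participants_not_eligible)
```

An empty cell for "participates but not eligible" means a model could read the positive class straight off `p401k`, which inflates its scores without telling us anything useful about new customers.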

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

In [25]:
# Logistic Regression
# KNN
# Decision Tree
# Random Forests
# Adaboost

# All of these models are suitable for this classification problem

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [44]:
# Ignore warnings raised while grid-searching Logistic Regression
# FitFailedWarning: a penalty and solver combination in the grid is incompatible
# UserWarning: includes ConvergenceWarning, raised when a solver does not converge

import warnings
from sklearn.exceptions import FitFailedWarning
warnings.filterwarnings('ignore', category = FitFailedWarning)
warnings.filterwarnings('ignore', category = UserWarning)

# Features and target (p401k is dropped along with the target)
X = df.drop(columns=['e401k', 'p401k'])
y = df['e401k']

X_train, X_test, y_train, y_test = train_test_split(X
                                                    , y
                                                    , test_size = 0.2
                                                    , stratify = y
                                                   )

# List of columns to scale (inc, incsq, and nettfa are also continuous but pass through unscaled here)
continuous_cols = ['age', 'agesq', 'fsize']

# ColumnTransformer for scaling only continuous columns
preprocessor = ColumnTransformer(transformers=[('scaler', StandardScaler(), continuous_cols)], remainder = 'passthrough')

# Define pipelines
pipelines = {'LR': Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
             , 'KNN': Pipeline([('preprocessor', preprocessor), ('classifier', KNeighborsClassifier())])
             , 'DCT': Pipeline([('preprocessor', preprocessor), ('classifier', DecisionTreeClassifier())])
             , 'BAG': Pipeline([('preprocessor', preprocessor), ('classifier', BaggingClassifier(DecisionTreeClassifier()))])
             , 'RF': Pipeline([('preprocessor', preprocessor), ('classifier', RandomForestClassifier())])
             , 'ADA': Pipeline([('preprocessor', preprocessor), ('classifier', AdaBoostClassifier(algorithm = 'SAMME'))])
}

# Define parameter grids
param_grids = {
    'LR': {'classifier__C': [0.01, 0.1, 1, 10, 100]
           , 'classifier__penalty': [None,'l1', 'l2']
           , 'classifier__solver': ['lbfgs','liblinear']
           , 'classifier__max_iter': [100, 1000, 10000]
          }
    , 'KNN': {'classifier__n_neighbors': [1, 3, 5, 7, 9]
              , 'classifier__metric': ['euclidean', 'manhattan']
             }
    , 'DCT': {'classifier__max_depth': [None, 10, 20, 30]
              , 'classifier__min_samples_split': [2, 10]
             }
    , 'BAG': {'classifier__n_estimators': [10, 50, 100]
             }
    , 'RF': {'classifier__n_estimators': [10, 50, 100]
             , 'classifier__max_depth': [None, 10, 20]
             , 'classifier__min_samples_split': [2, 10]
            }
    , 'ADA': {'classifier__n_estimators': [10, 50, 100]
              , 'classifier__learning_rate': [0.001, 0.01, 0.1, 1]
              , 'classifier__estimator': [DecisionTreeClassifier(max_depth = 3), DecisionTreeClassifier(max_depth = 5)]
             }
}

# Iterate through each model and its parameter grid
for model_name, pipeline in pipelines.items():
    print(f"Running GridSearch for {model_name}...")

    grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grids[model_name], cv = 5, scoring = 'accuracy', n_jobs = -1)
    grid_search.fit(X_train, y_train)

    print(f"Best parameters for {model_name}: {grid_search.best_params_}")
    print(f"Best score for {model_name}: {grid_search.best_score_}\n")


Running GridSearch for LR...
Best parameters for LR: {'classifier__C': 100, 'classifier__max_iter': 100, 'classifier__penalty': 'l2', 'classifier__solver': 'liblinear'}
Best score for LR: 0.655256064690027

Running GridSearch for KNN...
Best parameters for KNN: {'classifier__metric': 'manhattan', 'classifier__n_neighbors': 9}
Best score for KNN: 0.6470350404312668

Running GridSearch for DCT...
Best parameters for DCT: {'classifier__max_depth': 10, 'classifier__min_samples_split': 10}
Best score for DCT: 0.6520215633423181

Running GridSearch for BAG...
Best parameters for BAG: {'classifier__n_estimators': 100}
Best score for BAG: 0.6557951482479785

Running GridSearch for RF...
Best parameters for RF: {'classifier__max_depth': 10, 'classifier__min_samples_split': 10, 'classifier__n_estimators': 50}
Best score for RF: 0.6838274932614554

Running GridSearch for ADA...
Best parameters for ADA: {'classifier__estimator': DecisionTreeClassifier(max_depth=5), 'classifier__learning_rate': 0.0

In [49]:
y_test.value_counts(normalize = True)

e401k
0    0.608086
1    0.391914
Name: proportion, dtype: float64
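These class proportions give a useful reference point: a model that always predicts the majority class (not eligible) would already reach about 0.608 accuracy on this test set, so the cross-validated accuracies of roughly 0.65 to 0.68 reported above are a modest improvement. A minimal sketch of such a baseline, using made-up labels with a 60/40 split rather than the lab data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (made up) with roughly the 60/40 split seen above
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((y_toy.size, 1))   # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Fitting the same `DummyClassifier` on `X_train` and scoring it on `X_test` would give the baseline for the actual split.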

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

In [None]:
# False Positive
# We predict a person is eligible, but the person is actually not eligible
# False Negative
# We predict a person is not eligible, but the person is actually eligible

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

In [None]:
# We would rather minimize false negatives.
# Eligible people whom we fail to identify may risk not meeting their retirement goals,
# which could lead to other serious problems, such as homelessness.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

In [45]:
# Recall = TP / (TP + FN)
# As FN decreases, recall approaches 1 (perfect).

##### 23. Suppose that instead of optimizing for the metric in problem 22, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

In [None]:
# f1-score is the harmonic mean of Precision and Recall: F1 = 2 * Precision * Recall / (Precision + Recall)
# Precision = TP / (TP + FP)  and Recall = TP / (TP + FN)

# Because it combines both, this score balances false positives and false negatives.
# If FP and FN are both low, the f1-score approaches 1.
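As a small worked example, the sketch below computes precision, recall, and f1-score by hand from a confusion matrix, so the result can be compared with `f1_score`. The labels and predictions are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels and predictions (made up); 1 = eligible for a 401(k)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1_manual:.3f}")
print(f"f1 from sklearn: {f1_score(y_true, y_pred):.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high f1-score by doing well on only one of precision or recall.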

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [47]:
from sklearn.metrics import f1_score

# Iterate through each model and its parameter grid
for model_name, pipeline in pipelines.items():
    print(f"Running GridSearch for {model_name}...")

    # Perform GridSearchCV to find the best parameters
    grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grids[model_name], cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)

    # Best parameters and score for the current model
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_

    # Print the best parameters and score
    print(f"Best parameters for {model_name}: {best_params}")
    print(f"Best score for {model_name}: {best_score}")

    # Predict on both the training set and the test set
    y_train_pred = best_model.predict(X_train)
    y_test_pred = best_model.predict(X_test)

    # Calculate F1-score for the training set
    f1_train = f1_score(y_train, y_train_pred)
    print(f"F1-score for {model_name} on training set: {f1_train}")

    # Calculate F1-score for the test set
    f1_test = f1_score(y_test, y_test_pred)
    print(f"F1-score for {model_name} on test set: {f1_test}\n")


Running GridSearch for LR...
Best parameters for LR: {'classifier__C': 100, 'classifier__max_iter': 100, 'classifier__penalty': 'l2', 'classifier__solver': 'liblinear'}
Best score for LR: 0.655256064690027
F1-score for LR on training set: 0.47297577138123836
F1-score for LR on test set: 0.44966442953020136

Running GridSearch for KNN...
Best parameters for KNN: {'classifier__metric': 'manhattan', 'classifier__n_neighbors': 9}
Best score for KNN: 0.6470350404312668
F1-score for KNN on training set: 0.6005719733079123
F1-score for KNN on test set: 0.4839467501957713

Running GridSearch for DCT...
Best parameters for DCT: {'classifier__max_depth': 10, 'classifier__min_samples_split': 10}
Best score for DCT: 0.6498652291105121
F1-score for DCT on training set: 0.668141592920354
F1-score for DCT on test set: 0.5362210604929052

Running GridSearch for BAG...
Best parameters for BAG: {'classifier__n_estimators': 100}
Best score for BAG: 0.65633423180593
F1-score for BAG on training set: 1.0
F

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

In [None]:
# BAG performs perfectly on the training set (f1-score = 1 means no false positives and no false negatives),
# but its f1-score on the testing set is only 0.50. This means it is overfit.

# The other models appear to be overfit too, except ADA and Logistic Regression.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [None]:
# From 25, the models that are not overfit are ADA and Logistic Regression.
# But the training score of Logistic Regression is lower than that of ADA, which suggests more bias.

# Thus, our chosen model is ADA.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [None]:
# 1. Hyperparameter Tuning
# 2. Feature Engineering
# 3. Try another ensemble technique like GB, XGB, or stacking
# 4. SMOTE, though I think it is not needed here, since the proportion of the positive and negative classes is about 40/60.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

In [None]:
# Regression
# We first tried R^2 as the scoring metric, but the results were poor: all scores were below 0.2.
# The original features alone do not seem to work well. We probably need some feature engineering before modeling.

# Classification
# In the cell below, we check the feature importances from the first tree of the tuned ADA model.
# The key feature for predicting e401k is 'nettfa'.

In [56]:
# Best ADA parameters from the grid search in Q19
best_ada = Pipeline([('preprocessor', preprocessor)
                     , ('classifier', AdaBoostClassifier(estimator = DecisionTreeClassifier(max_depth=5)
                                                         , learning_rate = 0.01
                                                         , n_estimators = 50
                                                         , algorithm = 'SAMME'
                                                        )
                       )
                    ])

best_ada.fit(X_train, y_train)    # fit

# Feature importances of the first base tree in the ensemble
base_estimators = best_ada.named_steps['classifier'].estimators_
feature_importance = base_estimators[0].feature_importances_

# Show feature importance
# The preprocessor outputs the scaled columns first, followed by the passthrough columns,
# so the names must follow that order rather than the order of X_train.columns
feature_names = continuous_cols + [col for col in X_train.columns if col not in continuous_cols]
importance_df = pd.DataFrame({'Feature': feature_names
                              , 'Importance': feature_importance
                             }).sort_values(by = 'Importance', ascending = False)

print(importance_df)

  Feature  Importance
6  nettfa    0.655679
8   incsq    0.207367
7    pira    0.059440
3     inc    0.034935
0     age    0.034444
2   fsize    0.008135
1   agesq    0.000000
4    marr    0.000000
5    male    0.000000
