# Phase 3 Review - Predictive Classification Workflow

## Students Will Be Able To
- Understand the overall process to solve a predictive classification problem
- Understand and implement multiple classification algorithms
- Implement cross-validation techniques
- Handle class imbalance using SMOTE
- Use Regularization to perform feature selection
- Perform GridSearch to determine optimal hyperparameter combinations
- Create Pipelines to streamline the modeling process

## Business and Data Understanding

This dataset was downloaded from [Kaggle](https://www.kaggle.com/uciml/adult-census-income) and contains information on adult incomes. We are trying to predict whether or not an individual's yearly salary was greater than \$50,000 (binary classification). The column `salary` will be either a 0 (\$50,000 or less) or a 1 (greater than \$50,000). The metric we will be using is accuracy.

## Tasks

### Data Preparation

#### Train-Test Split

We will be using cross-validation for the duration of this notebook. Please perform two train-test splits: first split the entire dataframe into train and test sets, then split the *train* data into *training* and *validation* sets. Use `random_state=2021` and `test_size=.15` in both splits for reproducibility. We will be using the train and validation sets for the majority of this notebook. **The test set should be left alone until the very end**.

#### Preprocessing

Please perform the standard data preprocessing steps on the training and validation data:
- Check for missing data and impute if necessary (or drop)
- Scale numerical data
- OneHotEncode categorical data

### Modeling

#### Baseline Logistic Regression

Create a `LogisticRegression` model and fit it on the preprocessed training data. Check the performance of the model on the training and validation data. 

Please plot a confusion matrix of the model's predictions and compare it to the previous performance metric. What might be causing the accuracy score to be misleading? (*HINT*: Check the value counts of your target variable)

#### Second Logistic Regression

Please use SMOTE ([documentation here](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html)) to adjust the imbalance of target classes. You can use your preprocessed training data at this step. Once you have resampled your training data, please fit another Logistic Regression model and check its performance using the training and validation data. Plot another confusion matrix and explain whether or not resampling helped improve the performance of your model.

#### Third LogisticRegression

Please create a third and final LogisticRegression model and adjust at least one hyperparameter related to the regularization of the model. Fit the model on the preprocessed, resampled training data. Check the performance on the training and validation data.

Inspect the coefficients of this third model and report the 5 features with the largest coefficients and the 5 features with the lowest coefficients. ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html))

#### DecisionTreeClassifier

Create a `DecisionTreeClassifier` using hyperparameters of your choosing. Please fit the model on the training data and check its performance on the train and validation sets.

#### GridSearch RandomForest

For your final model, please use `GridSearchCV` ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)) to determine the optimal hyperparameter combination for a `RandomForestClassifier`.

Data has been imported for you. Please perform the following steps:

- Preprocess
    - Train-validation-test split
    - Impute missing values
    - Scale numerical features
    - OHE categorical
- Basic Logistic Regression model
    - Show features with the 5 highest and 5 lowest coefficient values
- Use SMOTE to resample

In [1]:
# Run this cell without changes

# Basic imports
import pandas as pd
import numpy as np

# Read data into dataframe and remove whitespace from columns
df = pd.read_csv('data/adult.csv')
df.columns = [col.strip() for col in df.columns]

# Replace '?' with np.nan
df = df.replace(' ?', np.nan)

# Convert salary column to 0 and 1
df['salary'] = [1 if s==' >50K.' else 0 for s in df['salary']]

# Drop fnlwgt column
df = df.drop('fnlwgt', axis=1)

# Print shape and head
print('Shape: ', df.shape)
df.head()

Shape:  (16281, 14)


| | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | 0 |
| 1 | 38 | Private | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | 0 |
| 2 | 28 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | 1 |
| 3 | 44 | Private | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | 1 |
| 4 | 18 | NaN | Some-college | 10 | Never-married | NaN | Own-child | White | Female | 0 | 0 | 30 | United-States | 0 |


### Choose a Metric

In the cell below, please indicate which classification metric you think we should use by assigning it to the variable `metric`. Remember we want to *minimize false positives*. 

In [22]:
# Your code here
metric = 'precision'
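
Precision is the share of predicted positives that are truly positive, TP / (TP + FP), so it rises as false positives fall. The sketch below checks this on made-up labels; `y_true` and `y_pred` are illustrative arrays, not values from the census data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision computed by hand from the confusion matrix
manual_precision = tp / (tp + fp)

# The same quantity from scikit-learn
sklearn_precision = precision_score(y_true, y_pred)

print('Manual precision: ', manual_precision)
print('sklearn precision:', sklearn_precision)
```

Every extra false positive increases the denominator without touching the numerator, which is why precision is a reasonable choice when false positives are the costly error.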

### Import Required Packages

In [23]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector, make_column_transformer

## Train-Test Split

Before performing any preprocessing or modeling, we need to perform a train-test split. We will cover two forms of cross-validation in this notebook, the second being one you are already familiar with: `cross_val_score`.

For now, we conduct a more manual cross-validation by performing a double train-test split. The first will split the data into training and testing sets, the second will split the *training data* into a training and *validation set*. **We will not touch the test set until we have decided upon a final model.**

In the cell below, separate your features and target variable and perform the first train-test split. Please set `random_state=2021` and `test_size=.15`

In [80]:
# Separate target and features
X = df.drop('salary', axis=1)
y = df['salary']

# Initial train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.15, random_state=2021)

Now we want to create a *validation* set. 

In the cell below, please perform your second data split using `X_train` and `y_train` and setting `random_state=2021` and `test_size=.15`.

In [81]:
# Train-validation split
X_t, X_val, y_t, y_val = train_test_split(X_train, y_train, test_size=.15, random_state=2021)
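
Applying `test_size=.15` twice leaves roughly 0.85 × 0.85 ≈ 72% of the rows for training, about 13% for validation, and 15% for testing. The sketch below confirms these proportions on placeholder arrays with the same row count as `df`; the names `X_demo`, `y_demo`, and the other `_demo` variables are illustrative and hold arbitrary values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as df (16281); values are arbitrary
X_demo = np.arange(16281).reshape(-1, 1)
y_demo = np.zeros(16281, dtype=int)

# First split: hold out the test set
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=.15, random_state=2021
)

# Second split: carve a validation set out of the remaining training data
X_t_demo, X_val_demo, y_t_demo, y_val_demo = train_test_split(
    X_tr_demo, y_tr_demo, test_size=.15, random_state=2021
)

n = len(X_demo)
for name, part in [('train', X_t_demo), ('validation', X_val_demo), ('test', X_te_demo)]:
    print(f'{name}: {len(part)} rows ({len(part) / n:.1%})')
```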

## Preprocessing

We have successfully split our data; the next step is to prepare it for modeling.

Please perform the following steps:
- Check for missing values and impute if necessary (or drop)
- Scale the numerical data
- OneHotEncode categorical data

For the sake of this example, use `select_dtypes()` to determine what is numeric and what is categorical.

Pay attention to the data types of the missing columns. You may need to adjust your imputer `strategy`.

At this time, please **do not touch `X_test` and `y_test`**. You should use `X_val` and `y_val` in their place.

Functions:

- Impute
- Scale
- OHE
- Combine
- data_preprocessing

In [82]:
X_t.isna().sum()

age                 0
workclass         691
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        694
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    198
dtype: int64

In [83]:
X_val.isna().sum()

age                 0
workclass         129
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        129
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country     32
dtype: int64

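`X_t` and `X_val` have missing values in the same three columns (`workclass`, `occupation`, and `native-country`), all of which are categorical. Below I create a `SimpleImputer` with `strategy='most_frequent'` and impute the missing values in each.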

In [84]:
# Instantiate SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')

In [85]:
# Fit on X_t
imputer.fit(X_t)

# Transform X_t and X_val
X_t_imputed = pd.DataFrame(imputer.transform(X_t), columns=X_t.columns)
X_val_imputed = pd.DataFrame(imputer.transform(X_val), columns=X_val.columns)

In [86]:
# Select Numeric and Categorical data
numeric = X.select_dtypes('number')
categorical = X.select_dtypes('object')

In [107]:
X_t_numeric = X_t_imputed[numeric.columns]
X_t_categorical = X_t_imputed[categorical.columns]

X_val_numeric = X_val_imputed[numeric.columns]
X_val_categorical = X_val_imputed[categorical.columns]

In [108]:
# Instantiate StandardScaler
scaler = StandardScaler()

In [109]:
# Fit Scaler on numeric X_t
scaler.fit(X_t_numeric)

StandardScaler()

In [110]:
# Transform numeric X_t and X_val
X_t_scaled = pd.DataFrame(scaler.transform(X_t_numeric), columns=numeric.columns)
X_val_scaled = pd.DataFrame(scaler.transform(X_val_numeric), columns=numeric.columns)

In [111]:
# Instantiate OHE
# (scikit-learn 1.2+ renames the `sparse` parameter to `sparse_output`)
ohe = OneHotEncoder(sparse=False)

In [112]:
# Fit on categorical data
ohe.fit(X_t_categorical)

OneHotEncoder(sparse=False)

In [113]:
# Transform X_t and X_val categorical
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
X_t_encoded = pd.DataFrame(ohe.transform(X_t_categorical), columns=ohe.get_feature_names())
X_val_encoded = pd.DataFrame(ohe.transform(X_val_categorical), columns=ohe.get_feature_names())

In [114]:
X_t_df = pd.concat([X_t_scaled, X_t_encoded], axis=1)
X_val_df = pd.concat([X_val_scaled, X_val_encoded], axis=1)

In [115]:
X_t_df.dtypes

age                    float64
fnlwgt                 float64
education-num          float64
capital-gain           float64
capital-loss           float64
                        ...   
x7_ Thailand           float64
x7_ Trinadad&Tobago    float64
x7_ United-States      float64
x7_ Vietnam            float64
x7_ Yugoslavia         float64
Length: 104, dtype: object

In [116]:
X_val_df

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | x0_ Federal-gov | x0_ Local-gov | x0_ Never-worked | x0_ Private | ... | x7_ Portugal | x7_ Puerto-Rico | x7_ Scotland | x7_ South | x7_ Taiwan | x7_ Thailand | x7_ Trinadad&Tobago | x7_ United-States | x7_ Vietnam | x7_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.492522 | 0.969261 | -0.029882 | -0.14322 | -0.219136 | -0.429349 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | -1.067515 | 2.016254 | -0.029882 | -0.14322 | -0.219136 | 0.609927 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.519951 | -0.718296 | 1.130096 | -0.14322 | -0.219136 | 0.769816 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.082470 | -0.336004 | -0.416541 | -0.14322 | -0.219136 | 0.370094 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | -0.205026 | 0.013635 | -0.029882 | -0.14322 | -0.219136 | -0.189516 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2071 | -0.276900 | -0.470936 | -0.416541 | -0.14322 | -0.219136 | -0.909015 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2072 | -0.492522 | -0.163156 | -0.029882 | -0.14322 | -0.219136 | -0.029627 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2073 | -0.276900 | -1.514995 | -1.189860 | -0.14322 | -0.219136 | -2.747736 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2074 | 0.801210 | 0.037913 | 1.130096 | -0.14322 | 3.911634 | 0.609927 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
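
The manual impute, scale, encode, and concatenate steps above can also be bundled into a single object using the `make_pipeline`, `make_column_transformer`, and `make_column_selector` helpers imported earlier. The following is only a sketch under stated assumptions: `toy_train` and `toy_val` are tiny made-up frames standing in for `X_t` and `X_val`, and `handle_unknown='ignore'` is an added choice that the notebook's own encoder does not use.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frames standing in for X_t and X_val (values are illustrative)
toy_train = pd.DataFrame({
    'age': [25, 38, 28, 44],
    'hours-per-week': [40, 50, 40, 30],
    'workclass': pd.Series(['Private', 'Private', np.nan, 'Local-gov'], dtype=object),
})
toy_val = pd.DataFrame({
    'age': [18, 60],
    'hours-per-week': [30, 45],
    # 'Federal-gov' never appears in toy_train
    'workclass': pd.Series([np.nan, 'Federal-gov'], dtype=object),
})

# Numeric columns: scale only
numeric_steps = make_pipeline(StandardScaler())

# Categorical columns: impute the most frequent value, then one-hot encode
categorical_steps = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore'),  # assumption: encode unseen categories as all zeros
)

preprocessor = make_column_transformer(
    (numeric_steps, make_column_selector(dtype_include='number')),
    (categorical_steps, make_column_selector(dtype_exclude='number')),
)

# Fit on the training frame only, then reuse the fitted steps on validation data
train_arr = preprocessor.fit_transform(toy_train)
val_arr = preprocessor.transform(toy_val)
print(train_arr.shape, val_arr.shape)
```

Because every step is fitted inside `fit_transform` on the training frame, the validation data is transformed with training statistics only, which mirrors the fit-on-`X_t`, transform-both pattern used in the cells above.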


## Modeling

Let's start with a basic `LogisticRegression` model. Set `solver='liblinear'`.

In [117]:
logreg = LogisticRegression(solver='liblinear')

In [118]:
# fit on X_t_df
logreg.fit(X_t_df, y_t)

LogisticRegression(solver='liblinear')

Now let's make some predictions on the training and validation data and calculate the performance metric for each.

In [119]:
train_preds = logreg.predict(X_t_df)
val_preds = logreg.predict(X_val_df)

In [120]:
from sklearn.metrics import precision_score

In [121]:
print('Training precision: ', precision_score(y_t, train_preds))
print('Validation precision: ', precision_score(y_val, val_preds))

Training precision:  0.7359855334538878
Validation precision:  0.7566844919786097


Validation precision is slightly higher than training precision, so the model does not appear to be overfitting; if anything, it may be underfit. Let's check the class balance.

In [122]:
y.value_counts()

0    12435
1     3846
Name: salary, dtype: int64

The positive class is underrepresented: only about 24% of the rows (3,846 of 16,281) have `salary` equal to 1. Let's use SMOTE to balance this data a little more.

In [127]:
from imblearn.over_sampling import SMOTENC

In [128]:
sm = SMOTENC(categorical_features=categorical.columns, random_state=42)

In [129]:
X_res, y_res = sm.fit_resample(X, y)

ValueError: Input contains NaN