# Modeling

Notebook for training and evaluating our models.

In [1]:
#SET RANDOM SEED HERE
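
The seed cell above is still a placeholder. A minimal sketch of one way to fill it, assuming a shared constant: the name `SEED`, the helper `set_seed`, and the value 1 are assumptions, chosen to match the `random_state=1` used later in this notebook.

```python
import random

import numpy as np

SEED = 1  # assumed value; matches random_state=1 used elsewhere in this notebook


def set_seed(seed: int = SEED) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


set_seed()
first_draw = np.random.rand(3)
set_seed()
second_draw = np.random.rand(3)  # identical to first_draw after re-seeding
```

scikit-learn objects called with `random_state=None` fall back to NumPy's global state, so passing `random_state` explicitly, as the cells below do, remains the more robust option.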

In [2]:
#imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from pathlib import Path

#utils.py imports
from projecttools.utils import featureEngineeringKavinV1
from projecttools.utils import model_eval

#### Training-Testing Data Split

Used across all versions.

In [3]:
#data read + process
mycwd = os.getcwd()
os.chdir("..")
df = pd.read_csv("data/" + "adult.data", 
            index_col=False, 
            names=['age', 
                   'workclass', 
                   'fnlwgt', 
                   'education', 
                   'education-num', 
                   'marital-status', 
                   'occupation', 
                   'relationship', 
                   'race', 
                   'sex', 
                   'capital-gain', 
                   'capital-loss',
                   'hours-per-week',
                   'native-country',
                   'income'])
os.chdir(mycwd)

In [4]:
#features: every column except the last one (income)
training = df.iloc[:, :-1]
#labels: the income column only
testing = df.iloc[: , -1]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(training, testing, test_size=0.2, random_state=1)
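
As a rough illustration of what this split returns, here is a sketch on a made-up frame. The toy data and the `stratify` argument are assumptions that the notebook does not use; stratifying is shown only as an option for keeping the class ratio identical in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 80 low-income rows and 20 high-income rows.
toy = pd.DataFrame({
    "age": list(range(100)),
    "income": [" <=50K"] * 80 + [" >50K"] * 20,
})
features = toy.iloc[:, :-1]
labels = toy.iloc[:, -1]

# test_size=0.2 holds out 20% of the rows; stratify keeps the 80/20 class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=1, stratify=labels
)
```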

Now, let us save these splits as CSV files for reproducibility and access in other notebooks.

In [6]:
mycwd = os.getcwd()
mycwd

'/Users/kavin/hw07-group24/notebooks'

In [7]:
#all four files go into the same folder
out_dir = Path(mycwd) / "TrainingTestingData" / "v1(Kavin)"

#X_train
X_train.to_csv(out_dir / "X_train.csv")

#X_test
X_test.to_csv(out_dir / "X_test.csv")

#y_train
y_train.to_csv(out_dir / "y_train.csv")

#y_test
y_test.to_csv(out_dir / "y_test.csv")
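
Because `to_csv` also writes the row index, another notebook has to account for it when loading the files. A minimal sketch of a round trip, using a temporary folder and made-up rows rather than the real paths:

```python
import tempfile
from pathlib import Path

import pandas as pd

X = pd.DataFrame({"age": [39, 50]}, index=[7, 3])
y = pd.Series([" <=50K", " >50K"], index=[7, 3], name="income")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    X.to_csv(out_dir / "X_train.csv")
    y.to_csv(out_dir / "y_train.csv")

    # index_col=0 restores the saved row index instead of adding an extra unnamed column.
    X_back = pd.read_csv(out_dir / "X_train.csv", index_col=0)
    # squeeze("columns") turns the single-column frame back into a Series.
    y_back = pd.read_csv(out_dir / "y_train.csv", index_col=0).squeeze("columns")
```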

### Modeling Version 1 -- Kavin

I will first conduct my data processing using the feature engineering/processing function written in the Version 1 (Kavin) section of the FeatureEngineering notebook.

In [8]:
#data read + process
mycwd = os.getcwd()
os.chdir("..")
df = pd.read_csv("data/" + "adult.data", 
            index_col=False, 
            names=['age', 
                   'workclass', 
                   'fnlwgt', 
                   'education', 
                   'education-num', 
                   'marital-status', 
                   'occupation', 
                   'relationship', 
                   'race', 
                   'sex', 
                   'capital-gain', 
                   'capital-loss',
                   'hours-per-week',
                   'native-country',
                   'income'])
os.chdir(mycwd)

#### Feature Engineered data using utils.py function

In [9]:
fe_df = featureEngineeringKavinV1(df)
training = fe_df.iloc[:, :-1]
testing = fe_df.iloc[: , -1]

KeyError: 'capital-gain log transformed'
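
The `KeyError` says that a column named `capital-gain log transformed` was looked up before it existed in the frame. The internals of `featureEngineeringKavinV1` are not shown in this notebook, so the following is only a hypothetical sketch of how such a column could be created before it is read; the use of `np.log1p` is an assumption suggested by the column name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"capital-gain": [0, 2174, 14084]})

# Hypothetical fix: create the transformed column before any code reads it.
# log1p computes log(1 + x), which stays finite for the zero values.
toy["capital-gain log transformed"] = np.log1p(toy["capital-gain"])
```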

#### Training-Testing Data Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(training, testing, test_size=0.2, random_state=1)

Now, let us save these splits as CSV files for reproducibility and access in other notebooks.

In [None]:
mycwd = os.getcwd()
mycwd

In [None]:
#all four files go into the same folder
out_dir = Path(mycwd) / "TrainingTestingData" / "v1(Kavin)"

#X_train
X_train.to_csv(out_dir / "X_train.csv")

#X_test
X_test.to_csv(out_dir / "X_test.csv")

#y_train
y_train.to_csv(out_dir / "y_train.csv")

#y_test
y_test.to_csv(out_dir / "y_test.csv")

#### Modeling

##### We will use a Random Forest model here.

In [None]:
from sklearn.ensemble import RandomForestClassifier

#note: despite its name, `regressor` holds a classifier
regressor = RandomForestClassifier(n_estimators=20, random_state=1)
regressor.fit(X_train, y_train)
y_pred_train = regressor.predict(X_train)

In [None]:
#Training Prediction
y_pred_train

In [None]:
#Testing Prediction
y_pred_test = regressor.predict(X_test)
y_pred_test

#### Testing

Now, let us test our model.

We will use the `model_eval` function from utils.py, which reports several performance metrics for both the training and testing predictions.

In [None]:
model_eval(y_train, y_test, y_pred_train, y_pred_test)
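
The body of `model_eval` lives in utils.py and is not shown here. As a rough idea of what a function with this call signature could compute, here is a hypothetical stand-in: `model_eval_sketch`, its `pos_label` default, and the toy labels are all assumptions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def model_eval_sketch(y_train, y_test, y_pred_train, y_pred_test, pos_label=" >50K"):
    """Hypothetical stand-in for model_eval: one row of metrics per dataset."""
    rows = {}
    for name, y_true, y_pred in [("train", y_train, y_pred_train),
                                 ("test", y_test, y_pred_test)]:
        rows[name] = {
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, pos_label=pos_label),
            "recall": recall_score(y_true, y_pred, pos_label=pos_label),
            "f1": f1_score(y_true, y_pred, pos_label=pos_label),
        }
    # Transpose so that each dataset is a row and each metric is a column.
    return pd.DataFrame(rows).T


y_true = [" >50K", " <=50K", " <=50K", " >50K"]
y_pred = [" >50K", " <=50K", " >50K", " >50K"]
# Argument order: train labels, test labels, train predictions, test predictions.
report = model_eval_sketch(y_true, y_true, y_true, y_pred)
```

Which class counts as positive changes precision, recall, and F1, so it is worth checking how the real helper handles the string labels.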

##### Results on the testing dataset

Precision and recall are both fairly high at ~0.9, and F1 is also ~0.9, which is expected because F1 is computed from precision and recall.

However, accuracy is a bit lower at ~0.85. This tells me the model performs acceptably but could be better, and I suspect some overfitting is occurring.
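
For reference, F1 is the harmonic mean of precision $P$ and recall $R$, so when both sit near 0.9 the F1 score lands near 0.9 as well:

$$
F_1 = \frac{2PR}{P + R}, \qquad P = R = 0.9 \;\Rightarrow\; F_1 = \frac{2 \cdot 0.81}{1.8} = 0.9
$$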

##### Results on the training dataset

Accuracy, precision, recall, and F1 are all extremely high at ~0.98, well above the testing scores, which suggests the model is overfitting the training data. I could remove some of the overlapping features or tune the hyperparameters of the random forest classifier to reduce this.
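
One way to explore the hyperparameter idea is to compare the train/test gap of an unconstrained forest with a constrained one. This is only a sketch on synthetic data: the values `max_depth=5` and `min_samples_leaf=10` are illustrative assumptions, not tuned settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# By default the trees grow until their leaves are pure, which tends to memorize the training set.
loose = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_tr, y_tr)
# Capping the depth and requiring larger leaves limits how closely each tree can fit.
tight = RandomForestClassifier(
    n_estimators=20, max_depth=5, min_samples_leaf=10, random_state=1
).fit(X_tr, y_tr)

# Gap between training and testing accuracy; a smaller gap suggests less overfitting.
gaps = {
    name: model.score(X_tr, y_tr) - model.score(X_te, y_te)
    for name, model in [("loose", loose), ("tight", tight)]
}
```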

##### Now, let us quickly store our model as a file.

In [None]:
import joblib
joblib.dump(regressor, filename="../models/kavin_model_v1.joblib")
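
Loading the stored model in another notebook is the mirror image of the dump. A minimal sketch of a round trip, using a temporary folder, a small synthetic model, and a hypothetical file name instead of the real path:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=100, random_state=1)
model = RandomForestClassifier(n_estimators=5, random_state=1).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "demo_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
```

joblib files are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that wrote them.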

### Modeling Version 2 -- Naomi

### Modeling Version 3 -- George

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

import joblib

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from projecttools.utils import model_eval, feat_eng_split


In [None]:
data = pd.read_csv("../data/adult.data",
                   names = ['age', 'workclass', 'fnlwgt', 'education','education-num',
                            'marital-status','occupation','relationship','race','sex',
                           'capital-gain','capital-loss','hours-per-week',
                            'native-country','income'])
data.head()

Drop `education`, since `education-num` already encodes the same information as a number.

In [None]:
data = data.drop("education", axis = 1)

Separate the independent and dependent features

In [None]:
X = data.drop("income", axis = 1)
y = data["income"]

In [None]:
X.head()

In [None]:
y.head()

Use the `feat_eng_split` utils function to:

1. Split the data into training and testing datasets.

2. Feature engineer the training dataset (scale and one-hot encode).

3. Transform the testing dataset using the rules learned from the training dataset.

4. Output the transformed training features and labels and the testing features and labels.

In [None]:
X_train, X_valid, y_train, y_valid = feat_eng_split(X, y)

In [None]:
X_train.head()

In [None]:
X_valid.head()

The validation dataset will be used to evaluate our final model.

Check the null accuracy

In [None]:
#pd.value_counts is deprecated in recent pandas versions
pd.Series(y_train).value_counts(normalize=True)

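Null accuracy is 76%: always predicting the majority class would already be right 76% of the time, so this is the baseline our models need to beat.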
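
A minimal sketch of the null accuracy calculation on made-up labels. The 76/24 split below is constructed to mirror the figure above, not read from the data.

```python
import pandas as pd

toy_labels = pd.Series([" <=50K"] * 76 + [" >50K"] * 24)

class_shares = toy_labels.value_counts(normalize=True)
null_accuracy = class_shares.max()      # share of the most common class
majority_class = class_shares.idxmax()  # the label a constant predictor would output
```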

### Plan

1. Evaluate how well KNN and Decision Trees do on the data.

2. Derive the best version of each model using parameter tuning.

3. Then combine the two tuned models into one ensemble model using stacking.

4. Evaluate how well this stacked model does on the validation dataset.

### KNN

In [None]:
#GridSearch KNN

param_grid = {"n_neighbors":np.arange(3, 21, 2)}

grid_knn = GridSearchCV(estimator=KNeighborsClassifier(), 
                        param_grid=param_grid, scoring="accuracy", cv = 10, verbose = 3)

grid_knn.fit(X_train, y_train)
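
To see what the search actually tried, the fitted object exposes a `cv_results_` table. A small sketch on synthetic data, using 5 folds instead of 10 only to keep the example quick; the names `demo_grid` and `results` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": np.arange(3, 21, 2)}  # 3, 5, ..., 19 (odd values only)

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)
demo_grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy", cv=5)
demo_grid.fit(X_demo, y_demo)

# One row per candidate, sorted so the best mean cross-validated accuracy comes first.
columns = ["param_n_neighbors", "mean_test_score", "std_test_score"]
results = pd.DataFrame(demo_grid.cv_results_)[columns].sort_values(
    "mean_test_score", ascending=False
)
```

Restricting the grid to odd values of `n_neighbors` avoids tied votes in a two-class problem.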

Best params

In [None]:
grid_knn.best_params_

Best score

In [None]:
grid_knn.best_score_

Grab the best estimator, make predictions on the training and validation sets, and then use the `model_eval` function to compare results.

In [None]:
best_knn = grid_knn.best_estimator_

In [None]:
train_knn_preds = best_knn.predict(X_train)
valid_knn_preds = best_knn.predict(X_valid)

In [None]:
knn_performance = model_eval(y_train, y_valid, train_knn_preds, valid_knn_preds)
knn_performance

There doesn't seem to be significant overfitting but scores are certainly lower than I'd like.

### Decision Trees

In [None]:
#Grid search decision trees
param_grid = {"max_depth": np.arange(5, 40, 2)}

grid_dt = GridSearchCV(estimator=DecisionTreeClassifier(random_state=1), 
                        param_grid=param_grid, scoring="accuracy", cv = 10, verbose = 3)

grid_dt.fit(X_train, y_train)

Best params 

In [None]:
grid_dt.best_params_

Best score

In [None]:
grid_dt.best_score_

Grab best dt model

In [None]:
best_dt = grid_dt.best_estimator_

Make predictions on training and validation sets

In [None]:
train_dt_preds = best_dt.predict(X_train)
valid_dt_preds = best_dt.predict(X_valid)

In [None]:
dt_performance = model_eval(y_train, y_valid, train_dt_preds, valid_dt_preds)
dt_performance

The DT model has virtually the same accuracy as KNN, but a much better precision score and a worse recall score.

In other words, decision trees are better at minimizing false positives (people classified as low income who are not), while KNN is better at minimizing false negatives (people classified as high income who are not).
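
The link between these error types and the two scores can be checked on a toy example. The labels below are made up, and treating low income as the positive class is an assumption chosen to match the wording above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

LOW, HIGH = " <=50K", " >50K"
y_true = [LOW, LOW, LOW, LOW, HIGH, HIGH, HIGH, HIGH]
y_pred = [LOW, LOW, LOW, HIGH, LOW, LOW, HIGH, HIGH]

# With labels=[HIGH, LOW], ravel() yields tn, fp, fn, tp for LOW as the positive class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[HIGH, LOW]).ravel()

precision = tp / (tp + fp)  # drops as false positives increase
recall = tp / (tp + fn)     # drops as false negatives increase
```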

### Stacking

In this part we combine our tuned KNN and DT models, with a logistic regression as the final estimator.

Collect models into list of tuple pairs.

In [None]:
models = [("knn", best_knn),
         ("dt", best_dt)]

Initialize stacking classifier

In [None]:
stack = StackingClassifier(estimators=models, final_estimator=LogisticRegression(random_state=1))

Fit model on training data

In [None]:
stack.fit(X_train, y_train)
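
During fitting, `StackingClassifier` clones and refits the base models, then trains the final estimator on their out-of-fold predictions. A small sketch on synthetic data that exposes those intermediate features; the hyperparameter values and the name `demo_stack` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

demo_stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=5)),
                ("dt", DecisionTreeClassifier(max_depth=5, random_state=1))],
    final_estimator=LogisticRegression(random_state=1),
    cv=5,  # folds used to build the out-of-fold predictions fed to the final estimator
)
demo_stack.fit(X_demo, y_demo)

# For a binary problem each base model contributes one column (a class probability).
meta_features = demo_stack.transform(X_demo)
```

The logistic regression therefore sees only two inputs per row here, one from each base model, and learns how much weight to give each.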

Classify training and validation datasets

In [None]:
train_stack_preds = stack.predict(X_train)
valid_stack_preds = stack.predict(X_valid)

In [None]:
stack_performance = model_eval(y_train, y_valid, train_stack_preds, valid_stack_preds)
stack_performance

Let's compare this with the knn and dt results.

In [None]:
knn_performance

In [None]:
dt_performance

The stacked model:

   - Shows an improvement in F1 score.
   - Shows little improvement in accuracy over the other two models.
   - Has virtually the same recall score as KNN.
   - Has a worse precision score than DT.
    

Save model

In [None]:
joblib.dump(stack, filename="../models/george_stack_model.joblib")

### Modeling Version 4 -- Winston

For version 4 of our modeling process, we move to a new branch of machine learning: neural networks. In this part, I will be training a two hidden layer ne