# Modeling

Notebook for building the models and evaluating them.

In [19]:
#SET RANDOM SEED HERE
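
A minimal sketch of what this cell could contain, assuming one notebook-wide seed is wanted. The name `RANDOM_SEED` is hypothetical, and the value 1 is an assumption chosen to match the `random_state=1` used in the cells below.

```python
import random

import numpy as np

# Hypothetical notebook-wide seed; 1 mirrors the random_state used later on.
RANDOM_SEED = 1

# Seed Python's and NumPy's global generators so ad-hoc sampling is repeatable.
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# scikit-learn falls back to NumPy's global state when random_state is None,
# but passing random_state=RANDOM_SEED explicitly is the more robust habit.
```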

In [24]:
#imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from pathlib import Path

# Temporarily move up to the parent directory so the projecttools package can be imported
mycwd = os.getcwd()
os.chdir("..")
# Helper functions from utils.py
from projecttools.utils import featureEngineeringKavinV1
from projecttools.utils import model_eval
# Return to the original (notebook) directory
os.chdir(mycwd)
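
The `os.chdir` approach above relies on the current directory being on `sys.path`. A possible alternative, sketched under the assumption that `projecttools/` sits directly in the parent of the notebook directory, is to add that parent to `sys.path` without changing directories. The helper name `add_project_root_to_path` is hypothetical.

```python
import sys
from pathlib import Path


def add_project_root_to_path(notebook_dir=None):
    """Put the parent of the notebook directory on sys.path and return it."""
    notebook_dir = Path(notebook_dir) if notebook_dir is not None else Path.cwd()
    project_root = notebook_dir.resolve().parent
    root_str = str(project_root)
    if root_str not in sys.path:
        # Prepend so the local package wins over an installed one of the same name.
        sys.path.insert(0, root_str)
    return project_root


project_root = add_project_root_to_path()
# After this call, `from projecttools.utils import model_eval` should resolve,
# assuming projecttools/ lives directly under project_root.
```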

#### Training-Testing Data Split

Used across all versions.

In [3]:
#data read + process
mycwd = os.getcwd()
os.chdir("..")
df = pd.read_csv("data/" + "adult.data", 
            index_col=False, 
            names=['age', 
                   'workclass', 
                   'fnlwgt', 
                   'education', 
                   'education-num', 
                   'marital-status', 
                   'occupation', 
                   'relationship', 
                   'race', 
                   'sex', 
                   'capital-gain', 
                   'capital-loss',
                   'hours-per-week',
                   'native-country',
                   'income'])
os.chdir(mycwd)

In [4]:
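# Features: every column except the last. Target: the last column (income).
# Despite their names, these are not yet the train and test sets.
training = df.iloc[:, :-1]
testing = df.iloc[:, -1]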

In [5]:
X_train, X_test, y_train, y_test = train_test_split(training, testing, test_size=0.2, random_state=1)

Now, let us save these splits as CSV files for reproducibility and for use in other notebooks.

In [6]:
mycwd = os.getcwd()
mycwd

'/Users/kavin/hw07-group24/notebooks'

In [7]:
#X_train
X_train.to_csv(Path(mycwd + "/TrainingTestingData/v1(Kavin)/X_train.csv"))

#X_test
X_test.to_csv(Path(mycwd + "/TrainingTestingData/v1(Kavin)/X_test.csv"))

#y_train
y_train.to_csv(Path(mycwd + "/TrainingTestingData/v1(Kavin)/y_train.csv"))

#y_test
y_test.to_csv(Path(mycwd + "/TrainingTestingData/v1(Kavin)/y_test.csv"))
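
The four `to_csv` calls can also be written as a loop. The sketch below is one way to do it with `pathlib`; the helper names `save_splits` and `load_split` are hypothetical. Because `to_csv` writes the index as the first column by default, the reader passes `index_col=0` to restore it.

```python
from pathlib import Path

import pandas as pd


def save_splits(splits, out_dir):
    """Write each named split to <out_dir>/<name>.csv and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    paths = {}
    for name, data in splits.items():
        path = out_dir / f"{name}.csv"
        data.to_csv(path)  # the index is written as the first column
        paths[name] = path
    return paths


def load_split(path):
    """Read a split back as a DataFrame, restoring the saved index."""
    return pd.read_csv(path, index_col=0)
```

In this notebook the call might look like `save_splits({"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}, Path(mycwd) / "TrainingTestingData" / "v1(Kavin)")`.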

### Modeling Version 1 -- Kavin

I first process the data with the feature engineering function written in the Version 1 (Kavin) section of the FeatureEngineering notebook.

In [8]:
#data read + process
mycwd = os.getcwd()
os.chdir("..")
df = pd.read_csv("data/" + "adult.data", 
            index_col=False, 
            names=['age', 
                   'workclass', 
                   'fnlwgt', 
                   'education', 
                   'education-num', 
                   'marital-status', 
                   'occupation', 
                   'relationship', 
                   'race', 
                   'sex', 
                   'capital-gain', 
                   'capital-loss',
                   'hours-per-week',
                   'native-country',
                   'income'])
os.chdir(mycwd)

#### Feature Engineered data using utils.py function

In [9]:
fe_df = featureEngineeringKavinV1(df)
# Features: every column except the last. Target: the last column.
training = fe_df.iloc[:, :-1]
testing = fe_df.iloc[:, -1]

#### Training-Testing Data Split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(training, testing, test_size=0.2, random_state=1)

Now, let us save these splits as CSV files for reproducibility and for use in other notebooks.

In [11]:
mycwd = os.getcwd()
mycwd

'/Users/kavin/hw07-group24/notebooks'

In [12]:
#X_train
X_train.to_csv(Path(mycwd + "/TrainingTestingData/v1(Kavin)/X_train.csv"))

#X_test
X_test.to_csv(Path(mycwd + "/TrainingTestingData/v1(Kavin)/X_test.csv"))

#y_train
y_train.to_csv(Path(mycwd + "/TrainingTestingData/v1(Kavin)/y_train.csv"))

#y_test
y_test.to_csv(Path(mycwd + "/TrainingTestingData/v1(Kavin)/y_test.csv"))

#### Modeling

##### We will use a Random Forest model here.

In [13]:
from sklearn.ensemble import RandomForestClassifier

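# Note: despite the variable name, this is a classifier, not a regressor.
regressor = RandomForestClassifier(n_estimators=20, random_state=1)
regressor.fit(X_train, y_train)
# Predictions on the same data the model was fitted on
y_pred_train = regressor.predict(X_train)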

In [14]:
#Training Prediction
y_pred_train

array([1, 0, 0, ..., 1, 1, 0], dtype=uint8)

In [15]:
#Testing Prediction
y_pred_test = regressor.predict(X_test)
y_pred_test

array([1, 1, 0, ..., 1, 1, 0], dtype=uint8)

#### Testing

Now, let us test our model.

We will use the `model_eval` function from utils.py, which reports several performance metrics for the training and testing predictions.

In [16]:
model_eval(y_train, y_test, y_pred_train, y_pred_test)

| Metric | Training | Testing |
| --- | --- | --- |
| Accuracy Score | 0.975967 | 0.847382 |
| Precision Score | 0.980884 | 0.898262 |
| Recall | 0.987458 | 0.904696 |
| F1 Score | 0.98416 | 0.901467 |
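
The source of `model_eval` is not shown in this notebook. The sketch below is one way a function could produce a table of this shape, assuming binary 0/1 labels and scikit-learn's metric functions. The name `model_eval_sketch` is hypothetical, and this is not the actual `projecttools.utils` implementation.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def model_eval_sketch(y_train, y_test, y_pred_train, y_pred_test):
    """Return a DataFrame of metrics with one column per data split."""
    # Row label -> scikit-learn metric function (assumed, not from utils.py).
    metrics = {
        "Accuracy Score": accuracy_score,
        "Precision Score": precision_score,
        "Recall": recall_score,
        "F1 Score": f1_score,
    }
    return pd.DataFrame(
        {
            "Training": [fn(y_train, y_pred_train) for fn in metrics.values()],
            "Testing": [fn(y_test, y_pred_test) for fn in metrics.values()],
        },
        index=list(metrics),
    )
```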


##### Results on the testing set

Precision and recall are both fairly high at ~0.90, and F1 is also ~0.90, as expected since F1 is the harmonic mean of precision and recall.

Accuracy is lower at ~0.85. The model performs reasonably well, though it could be better. The gap between training accuracy (~0.98) and testing accuracy (~0.85) suggests that some overfitting is occurring.
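
One possible reason accuracy sits below both precision and recall: neither precision nor recall counts true negatives, so when the positive class is the majority, errors on the smaller negative class pull accuracy down while leaving the other two high. Whether label 1 is the majority class here is an assumption, not something this notebook confirms. The sketch below uses hypothetical confusion-matrix counts (not taken from this model) to show the pattern.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 100 samples, 80 of them positive (the majority class).
scores = binary_metrics(tp=72, fp=8, fn=8, tn=12)
```

With these counts, accuracy is 0.84 while precision, recall, and F1 are all 0.9.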

##### Results on the training set

Accuracy, precision, recall, and F1 are all very high at ~0.98, well above the testing scores, which suggests the model is overfitting the training data. Removing some of the overlapping features or tuning the hyperparameters of the random forest classifier might reduce this.
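
A minimal sketch of how the hyperparameter idea could be checked: fit the same forest with and without limits on tree growth and compare the gap between training and testing accuracy. The data here is a synthetic stand-in from `make_classification`, the helper `train_test_gap` is hypothetical, and the values `max_depth=8` and `min_samples_leaf=5` are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train_test_gap(model, X_train, X_test, y_train, y_test):
    """Fit the model and return (train accuracy, test accuracy, gap)."""
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc


# Synthetic stand-in data; swap in the notebook's X_train/X_test/y_train/y_test.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Same settings as the notebook's forest: no limit on tree depth by default.
baseline = RandomForestClassifier(n_estimators=20, random_state=1)
# Hypothetical constraints that restrict how far each tree can grow.
constrained = RandomForestClassifier(
    n_estimators=20, max_depth=8, min_samples_leaf=5, random_state=1
)

for name, model in [("baseline", baseline), ("constrained", constrained)]:
    train_acc, test_acc, gap = train_test_gap(model, X_tr, X_te, y_tr, y_te)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A smaller gap with similar testing accuracy would point toward the constrained settings. A cross-validated search such as `GridSearchCV` would be the more systematic way to pick values.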

### Modeling Version 2 -- Naomi

### Modeling Version 3 -- George

### Modeling Version 4 -- Winston