# Main Narrative Notebook -- Predicting Income Off of General Census Data, a STAT 159 Final Project

## EDA Analysis

In [1]:
...

Ellipsis

## Feature Engineering Analysis

In [2]:
...

Ellipsis

## Modeling/Testing Analysis

In [3]:
...

Ellipsis

## Final Model and Results

In this section we test whether pooling our four different approaches produces a significantly better model than any one of them alone.

Steps:

   1. Load each member's model and feature engineering methods into this notebook.
   2. Make a train/test split on the data.
   3. Train each member's model on the training data.
   4. Classify the testing dataset with each model and show each model's performance.
   5. Ensemble model:
       - Generate prediction probabilities for each model.
       - Average them.
       - Using a 0.5 threshold, turn the averaged probabilities into classifications.
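
The ensemble step above can be sketched on toy numbers. This is a minimal illustration, not the notebook's code: the function name `soft_vote` and the probability values are made up, and it assumes every model's probabilities refer to the same rows and the same positive class.

```python
import numpy as np

def soft_vote(prob_columns, threshold=0.5):
    """Average per-model positive-class probabilities, then threshold them.

    prob_columns: list of 1-D arrays, one per model, all scored on the same rows.
    """
    stacked = np.column_stack(prob_columns)   # shape (n_rows, n_models)
    mean_probs = stacked.mean(axis=1)         # one averaged probability per row
    return (mean_probs >= threshold).astype(int)

# Made-up probabilities for three rows from two hypothetical models
model_a = np.array([0.9, 0.4, 0.2])
model_b = np.array([0.7, 0.7, 0.1])
votes = soft_vote([model_a, model_b])
```

Averaging probabilities instead of hard labels lets a confident model outweigh an uncertain one on the same row, as in the second row here.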
   



In [22]:
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split

**George**

In [102]:
data = pd.read_csv("data/adult.data",
                   names = ['age', 'workclass', 'fnlwgt', 'education','education-num',
                            'marital-status','occupation','relationship','race','sex',
                           'capital-gain','capital-loss','hours-per-week',
                            'native-country','income'])

X = data.drop(["education", "income"], axis = 1)
y = data["income"]

Load the feature engineering utility functions and apply them to the data

In [103]:
from projecttools.utils import feat_eng_split, model_eval

In [104]:
X_train, X_test, y_train, y_test = feat_eng_split(X, y)



Load the model and fit it on the training dataset

In [105]:
george_model = joblib.load("models/george_stack_model.joblib")
george_model.fit(X_train, y_train)

StackingClassifier(estimators=[('knn', KNeighborsClassifier(n_neighbors=17)),
                               ('dt',
                                DecisionTreeClassifier(max_depth=9,
                                                       random_state=1))],
                   final_estimator=LogisticRegression(random_state=1))

Make predictions

In [106]:
george_train_preds = george_model.predict(X_train)
george_test_preds = george_model.predict(X_test)

In [107]:
george_performance = model_eval(y_train, y_test, george_train_preds, george_test_preds)
george_performance

| | Training | Testing |
|---|---|---|
| Accuracy Score | 0.876986 | 0.860459 |
| Precision Score | 0.804316 | 0.7575 |
| Recall | 0.646489 | 0.618367 |
| F1 Score | 0.716817 | 0.680899 |
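
The source of `model_eval` is in `projecttools.utils` and is not shown in this notebook. As a reminder of what the four reported scores mean, here is a small sketch of their usual definitions for 0/1 labels. The helper `binary_metrics` and the toy labels are hypothetical and are not the project's implementation.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, with 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 3 positives, 5 negatives
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true_toy, y_pred_toy)
```

Precision, recall and F1 all depend on which label is treated as positive, so the same predictions can score very differently if the encoding is flipped.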


Generate probabilities

In [108]:
george_train_probs = george_model.predict_proba(X_train)[:, 1]
george_test_probs = george_model.predict_proba(X_test)[:, 1]
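
The `[:, 1]` index selects the second column of `predict_proba`, which scikit-learn orders to match the fitted model's `classes_` attribute. The sketch below uses made-up data and a plain `LogisticRegression` (not George's stacking model) to show how the positive-class column can be looked up by label instead of assumed to be column 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, made-up data: one feature and string labels resembling the income column
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"])

clf = LogisticRegression().fit(X_toy, y_toy)

# classes_ is sorted, and predict_proba columns follow the same order
positive_label = ">50K"
positive_col = list(clf.classes_).index(positive_label)
positive_probs = clf.predict_proba(X_toy)[:, positive_col]
```

This matters for the ensemble: averaging is only meaningful if column 1 refers to the same income class in every model being pooled.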

**Kavin**

In [109]:
from projecttools.utils import featureEngineeringKavinV1

Transform features

In [110]:
X = featureEngineeringKavinV1(data)
y = X.iloc[:, -1].astype(int)
X = X.iloc[:, :-1]
X.head()

| | ? | Federal-gov | Local-gov | Never-worked | Private | Self-emp-inc | Self-emp-not-inc | State-gov | Without-pay | 10th | ... | United-States | Vietnam | Yugoslavia | age log transformed | years in education log transformed | hours-per-week log transformed | capital-gain log transformed | capital-loss log transformed | years educated / hours worked | capital gains * age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 3.688879 | 2.639057 | 3.713572 | 7.684784 | 0.0 | 0.710652 | 28.348242 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 3.931826 | 2.639057 | 2.639057 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 3.663562 | 2.302585 | 3.713572 | 0.0 | 0.0 | 0.620046 | 0.0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 3.988984 | 2.079442 | 3.713572 | 0.0 | 0.0 | 0.559957 | 0.0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 3.367296 | 2.639057 | 3.713572 | 0.0 | 0.0 | 0.710652 | 0.0 |


In [111]:
# Stratify on the labels so both splits keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
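
As a quick illustration of what stratifying buys us, the sketch below splits a made-up, imbalanced label array with the same `test_size` and `random_state` as the cell above. The toy arrays are assumptions for the example only. The array passed to `stratify` has to be the label array being split, and both splits then keep its class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

# Share of positives in each split; both match the 20% in y_toy
train_rate = y_tr.mean()
test_rate = y_te.mean()
```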

In [112]:
kavin_model = joblib.load("models/kavin_model_v1.joblib")


https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


In [113]:
from sklearn.ensemble import RandomForestClassifier
kavin_model = RandomForestClassifier(n_estimators = 20, random_state = 1)

In [114]:
kavin_model.fit(X_train, y_train)

RandomForestClassifier(n_estimators=20, random_state=1)

Predictions

In [115]:
kavin_train_preds = kavin_model.predict(X_train)
kavin_test_preds = kavin_model.predict(X_test)

Performance

In [116]:
kavin_performance = model_eval(y_train, y_test, kavin_train_preds, kavin_test_preds)
kavin_performance

| | Training | Testing |
|---|---|---|
| Accuracy Score | 0.976126 | 0.840683 |
| Precision Score | 0.981291 | 0.883962 |
| Recall | 0.987378 | 0.909562 |
| F1 Score | 0.984325 | 0.896579 |


Generate probabilities

In [117]:
kavin_train_probs = kavin_model.predict_proba(X_train)[:, 1]
kavin_test_probs = kavin_model.predict_proba(X_test)[:, 1]

**Naomi**

**Winston**

**Ensemble**

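Collect the available sets of probabilities into one dataframe. Naomi's and Winston's sections above are still empty, so only George's and Kavin's probabilities are pooled here.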

In [119]:
ensemble_train_probs = pd.DataFrame({"george":george_train_probs,
                               "kavin": kavin_train_probs})

ensemble_test_probs = pd.DataFrame({"george":george_test_probs,
                               "kavin": kavin_test_probs})


In [121]:
ensemble_train_preds = ensemble_train_probs.mean(axis = 1).apply(lambda x: 1 if x>=0.5 else 0)
ensemble_test_preds = ensemble_test_probs.mean(axis = 1).apply(lambda x: 1 if x>=0.5 else 0)

Ensemble Model Performance

In [123]:
ensemble_performance = model_eval(y_train, y_test, ensemble_train_preds, ensemble_test_preds)
ensemble_performance

| | Training | Testing |
|---|---|---|
| Accuracy Score | 0.8681 | 0.663186 |
| Precision Score | 0.954163 | 0.835251 |
| Recall | 0.867954 | 0.693092 |
| F1 Score | 0.909019 | 0.75756 |
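
The ensemble's testing accuracy (0.663186) is below both George's (0.860459) and Kavin's (0.840683). One possible cause, offered as a hypothesis and not a diagnosis, is row alignment. George's probabilities come from the split made by `feat_eng_split`, Kavin's come from `train_test_split`, and `y_train` and `y_test` at this point are the ones from Kavin's split. The notebook does not show whether the two splits select the same rows in the same order. A minimal sketch of a guard against this follows. The helper `combine_probs` and the toy values are hypothetical, and the sketch assumes each model's probabilities have been wrapped in a pandas Series indexed by original row label.

```python
import pandas as pd

def combine_probs(named_probs):
    """Build a probability table, refusing to mix rows from different splits.

    named_probs: dict mapping a model name to a pandas Series indexed by the
    original row labels of the rows that model was scored on.
    """
    reference = next(iter(named_probs.values())).index
    for name, probs in named_probs.items():
        # Index.equals checks both membership and order
        if not probs.index.equals(reference):
            raise ValueError(f"{name} was scored on different rows")
    return pd.DataFrame(named_probs)

# Made-up probabilities keyed by original row label
a = pd.Series([0.9, 0.2], index=[10, 42])
b = pd.Series([0.8, 0.1], index=[10, 42])
c = pd.Series([0.8, 0.1], index=[42, 10])   # same rows, different order

table = combine_probs({"a": a, "b": b})

try:
    combine_probs({"a": a, "c": c})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Because `predict_proba` returns plain NumPy arrays, the row labels would need to be reattached first, for example from the index of the test features if they are kept as a DataFrame.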


Model Performance

In [4]:
...

Ellipsis

## Author Contributions Statement

Kavin:

...

George McIntire:

...

Wen-Ching (Naomi) Tu:

...

Winston Cai:

...