# Main Narrative Notebook -- Predicting Income Off of General Census Data, a STAT 159 Final Project

## EDA Analysis

In [1]:
...

Ellipsis

## Feature Engineering Analysis

Now let us go over Feature Engineering best practices that we have found across our various versions.

### Version 1

Version 1 has a lot of standard, to-be-expected feature engineering, along with some interesting approaches.

##### Let us start with the standard stuff. 

There is a lot of one-hot encoding for the categorical variables included in the dataset. This lets us use numerical inputs for our model instead of the strings the categorical variables originally stored. 
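As a minimal sketch of this step, the snippet below one-hot encodes two categorical columns with `pd.get_dummies`. The toy rows are illustrative, and the choice of `get_dummies` is an assumption; the Version 1 notebook may use a different encoder.

```python
import pandas as pd

# Toy rows with two categorical columns from the census data (illustrative values).
toy = pd.DataFrame({
    "workclass": ["State-gov", "Private", "Private"],
    "sex": ["Male", "Male", "Female"],
})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["workclass", "sex"], dtype=int)
print(encoded.columns.tolist())
```

Every row ends up with exactly one 1 per original categorical column, which is what lets a model consume the data as numbers.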

Another fairly standard feature-engineering method used is the log transform. It is especially useful for right-skewed numerical variables, where differences among large values should carry less weight than differences among small values. It was applied to the following variables: age, years in education, capital-gain, capital-loss, and hours worked per week.
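A log transform of this kind could look like the sketch below. It assumes `np.log1p`, that is log(1 + x), because capital-gain and capital-loss are often 0, where a plain log is undefined. The toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Two illustrative rows; capital-gain contains a zero on purpose.
toy = pd.DataFrame({"age": [39, 50], "capital-gain": [2174, 0]})

# log(1 + x) compresses large values and maps 0 to 0.
log_features = np.log1p(toy).add_suffix(" log transformed")
print(log_features)
```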

##### Now, let us talk about some less standard, interesting methods used.

Version 1 also used combined features: new features built from two existing ones. The idea is that when two features are related, a single feature built from both can be more informative to a model than either one alone. These may well have contributed to the solid performance of the Version 1 model.

The combined features used were years educated / hours worked and capital gains * age. The ratio is interesting because more years of education are correlated with fewer hours worked, so the two effects reinforce each other and produce larger values of the ratio. For capital gains * age, there is a positive relationship between the two, so multiplying amplifies the effect.

Note that the combined features were computed from the log-transformed versions of the original continuous variables.
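The two combined features can be sketched as follows. This assumes the log versions are log(1 + x); under that assumption the first row reproduces the values shown for the first record in the Version 1 feature table later in this notebook. The column names mirror that table.

```python
import numpy as np
import pandas as pd

# First and third records of the census data (raw values).
toy = pd.DataFrame({
    "education-num": [13, 9],
    "hours-per-week": [40, 40],
    "capital-gain": [2174, 0],
    "age": [39, 38],
})

# Assumption: the "log versions" are log(1 + x).
logged = np.log1p(toy)

combined = pd.DataFrame({
    "years educated / hours worked": logged["education-num"] / logged["hours-per-week"],
    "capital gains * age": logged["capital-gain"] * logged["age"],
})
print(combined.round(6))
```

A zero capital gain gives a product of 0, so the product feature only carries information for the minority of records with non-zero gains.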

##### There were some miscellaneous changes as well.

The target variable was encoded as a binary indicator for <=50k. Additionally, the original columns that had been one-hot encoded or log-transformed were dropped, as the model only needs their transformed versions.

### Version 2

Version 2 takes a simple/classical, but efficient approach to feature engineering. Sometimes doing less is more for feature engineering.

##### Let us start with the standard stuff. 

Version 2 made the excellent observation that the dataset uses "?" instead of NA values, which is not obvious when checking for NAs via standard Pandas functions. Version 2 chose to drop the rows containing these values, in contrast with Version 3, which chose not to drop them. There is no right or wrong in this scenario, and dropping them was a solid decision.
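The sketch below shows why a standard NA check misses these entries and one way to surface and drop them. The inline rows are illustrative, and the use of `na_values` and `skipinitialspace` in `pd.read_csv` is an assumption about how one might do this, not necessarily what Version 2 did.

```python
import io
import pandas as pd

# Illustrative rows in the same ", "-separated layout as the census file.
raw = io.StringIO(
    "39, State-gov, Adm-clerical\n"
    "54, ?, ?\n"
    "38, Private, Handlers-cleaners\n"
)
cols = ["age", "workclass", "occupation"]

# Read naively: "?" is just another string, so isna() finds nothing.
as_read = pd.read_csv(raw, names=cols, skipinitialspace=True)
print(as_read.isna().sum().sum())
print(as_read[["workclass", "occupation"]].eq("?").sum())

# Re-read, telling pandas that "?" marks a missing value, then drop those rows.
raw.seek(0)
cleaned = pd.read_csv(raw, names=cols, skipinitialspace=True, na_values="?")
dropped = cleaned.dropna()
print(len(cleaned), len(dropped))
```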

Next, all the appropriate categorical variables were one-hot encoded. Additionally, the original, unmodified versions of features were dropped from the dataset, as they have no use in the model.

##### Now, let us talk about some less standard, interesting methods used.

One especially interesting thing about Version 2 is that it noticed the target variable (income level) was imbalanced and chose to resample to account for this. This is an uncommon step to include alongside feature engineering, and a direct way to account for the imbalance.
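One common way to resample is to upsample the minority class with replacement, sketched below with `sklearn.utils.resample`. Which resampling technique Version 2 actually used is not stated here, so treat this as an illustration of the idea on toy data.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data with a 4:2 class imbalance (illustrative values).
toy = pd.DataFrame({
    "hours-per-week": [40, 13, 40, 40, 45, 60],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

majority = toy[toy["income"] == "<=50K"]
minority = toy[toy["income"] == ">50K"]

# Draw minority rows with replacement until both classes are the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=1
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["income"].value_counts())
```

If resampling is used, it is usually applied to the training split only, so that duplicated rows cannot appear in both training and testing data.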

Additionally, the continuous variables were standardized. This puts them on a common scale, so that no variable carries extra weight in the model simply because of its units.

### Version 3

Version 3 is very methodical and goes in-depth into more standard feature engineering practices.

##### Let us start with the standard stuff. 

Version 3 made the excellent choice to check for non-standard NA coding. A cursory check of the data with pandas functions suggests there are no NA values. However, Version 3 checked for NA values in different forms, and it turns out the data uses "?" for its NA values. Version 3 decided not to impute these "?" values, as the difference is probably negligible.

Next, Version 3 cut redundant features from the dataset. The most prominent pair was the number of years spent in education and the highest education level, which convey essentially the same information. The highest education variable ended up being dropped.
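A quick way to confirm that two columns carry the same information is to check that they map one-to-one, as in the sketch below. The rows reuse education values from the first few census records; the check itself is an illustration, not necessarily the one Version 3 ran.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
    "education-num": [13, 13, 9, 7, 13],
})

# If every label has exactly one code and every code exactly one label,
# the two columns are interchangeable.
codes_per_label = toy.groupby("education")["education-num"].nunique()
labels_per_code = toy.groupby("education-num")["education"].nunique()
is_redundant = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())

if is_redundant:
    toy = toy.drop(columns="education")
print(is_redundant, toy.columns.tolist())
```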

Next, the target variable was encoded to become binary instead of being <=50k and >50k. Next, the categorical variables were one-hot encoded.

##### Now, let us talk about some less standard, interesting methods used.

The use of fit_transform was interesting: the transformer is first fit to the data and then used to transform it, in a single call. Another interesting choice was a StandardScaler object, whose fit_transform was applied to the continuous variables to standardize them. Standardizing is a safe way to transform continuous variables compared with the log transforms used in Version 1.
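The fit-then-transform pattern can be sketched as below. The arrays are illustrative age and hours-per-week values, and fitting on the training rows only is a common convention assumed here rather than something the Version 3 notebook is known to do.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative (age, hours-per-week) rows.
X_train = np.array([[39.0, 40.0], [50.0, 13.0], [38.0, 40.0], [53.0, 40.0]])
X_test = np.array([[28.0, 40.0]])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then rescales.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the training statistics, so no test information leaks in.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```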

Overall, this feature engineering process was made to be very clean and generalizable. It should perform well across many models.

### Version 4

### Best Feature Engineering Practices Learned Across Versions 1, 2, 3, and 4

Now let us discuss the best feature engineering practices found across Versions 1, 2, 3, and 4 and how they can be applied together in future approaches to this census income problem or even similar ones.

1. The first point is not strictly feature engineering, but it is important to think about. Datasets are not always standardized. In this example, standard Pandas checks report no NA values, yet some entries are effectively missing and are coded as "?" instead of NA. It is important to look for this kind of non-standard coding during feature engineering. Whether to impute, drop, or keep these values is a separate decision, but catching such details can go a long way in our feature engineering efforts and eventual model performance.

2. One-hot encoding is a fairly obvious feature engineering step, but it is important nonetheless: categorical variables must be turned into numbers before most models can use them.

3. Cutting redundancy is important in feature engineering. If two features communicate essentially the same thing, removing one makes the model more efficient and prevents that piece of information from being overweighted.

4. Standardizing the features is another useful method. It centers each feature and rescales it to unit variance, so features measured in large units do not dominate features measured in small ones. This matters most for models that depend on distances or margins, such as the KNN and SVC models used later in this notebook.

5. In a similar vein to number 4, applying log transforms to features is a great practice. Some features, such as age, have diminishing returns as the values get higher. In these situations, a log transform spreads out the lower values and compresses the higher ones, which reduces the influence of extreme values on the model.

6. Another great practice is to combine features that are related to each other. For example, Version 1 used the ratio years educated / hours worked. These two features are correlated, and combining such features can potentially make models perform better.
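Practices 2 and 4 can be applied together in one reusable step. The sketch below is one possible arrangement using scikit-learn's `ColumnTransformer`; it is not taken from any of the four versions, and the toy rows and step names (`scale`, `onehot`) are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours-per-week": [40, 13, 40, 40],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private"],
})

preprocess = ColumnTransformer(
    transformers=[
        # Practice 4: standardize the continuous columns.
        ("scale", StandardScaler(), ["age", "hours-per-week"]),
        # Practice 2: one-hot encode the categorical column.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(toy)
print(features.shape)
```

Keeping the transformations in one fitted object makes it easier to apply exactly the same preprocessing to training and testing data.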

## Modeling/Testing Analysis

In [3]:
...

Ellipsis

## Final Model and Results

In this section we find out whether pooling our four different approaches produces a significantly better model.

Steps:

   1. Load each member's model and feature engineering methods into this notebook.
   2. Make a train/test split on the data.
   3. Train each member's model on the training data.
   4. Classify the testing dataset with each model and show each model's performance.
   5. Build the ensemble model:
       - Generate prediction probabilities from each model.
       - Average them.
       - Using a 0.5 threshold, turn the averaged probabilities into classifications.
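The ensemble step above amounts to soft voting, which can be sketched on made-up numbers as follows. The three rows stand in for three models' predicted probabilities of the positive class; the values are illustrative, not project results.

```python
import numpy as np

# Rows: models; columns: observations. Each entry is P(positive class).
probs = np.array([
    [0.90, 0.20, 0.55],  # hypothetical model A
    [0.70, 0.40, 0.35],  # hypothetical model B
    [0.80, 0.10, 0.65],  # hypothetical model C
])

# Average across models, then apply the 0.5 threshold.
avg = probs.mean(axis=0)
preds = (avg >= 0.5).astype(int)
print(avg.round(3), preds)
```

Averaging is only meaningful when every model scores the same rows in the same order and treats the same class as positive.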
   



In [22]:
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split

**George**

In [140]:
data = pd.read_csv("data/adult.data",
                   names = ['age', 'workclass', 'fnlwgt', 'education','education-num',
                            'marital-status','occupation','relationship','race','sex',
                           'capital-gain','capital-loss','hours-per-week',
                            'native-country','income'])

X = data.drop(["education", "income"], axis = 1)
y = data["income"]

Load in feature engineering utils function and apply it to the data

In [141]:
from projecttools.utils import feat_eng_split, model_eval

In [142]:
X_train, X_test, y_train, y_test = feat_eng_split(X, y)



Load in model and fit it on the training dataset

In [143]:
george_model = joblib.load("models/george_stack_model.joblib")
george_model.fit(X_train, y_train)

https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


StackingClassifier(estimators=[('knn', KNeighborsClassifier(n_neighbors=17)),
                               ('dt',
                                DecisionTreeClassifier(max_depth=9,
                                                       random_state=1))],
                   final_estimator=LogisticRegression(random_state=1))

Make predictions

In [144]:
george_train_preds = george_model.predict(X_train)
george_test_preds = george_model.predict(X_test)

In [145]:
george_performance = model_eval(y_train, y_test, george_train_preds, george_test_preds)
george_performance

| Metric | Training | Testing |
|---|---|---|
| Accuracy Score | 0.876986 | 0.860459 |
| Precision Score | 0.804316 | 0.7575 |
| Recall | 0.646489 | 0.618367 |
| F1 Score | 0.716817 | 0.680899 |


Generate probabilities

In [146]:
george_train_probs = george_model.predict_proba(X_train)[:, 1]
george_test_probs = george_model.predict_proba(X_test)[:, 1]

**Kavin**

In [147]:
from projecttools.utils import featureEngineeringKavinV1

Transform features

In [148]:
X = featureEngineeringKavinV1(data)
y = X.iloc[:, -1].astype(int)
X = X.iloc[:, :-1]
X.head()

| | ? | Federal-gov | Local-gov | Never-worked | Private | Self-emp-inc | Self-emp-not-inc | State-gov | Without-pay | 10th | ... | United-States | Vietnam | Yugoslavia | age log transformed | years in education log transformed | hours-per-week log transformed | capital-gain log transformed | capital-loss log transformed | years educated / hours worked | capital gains * age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 3.688879 | 2.639057 | 3.713572 | 7.684784 | 0.0 | 0.710652 | 28.348242 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 3.931826 | 2.639057 | 2.639057 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 3.663562 | 2.302585 | 3.713572 | 0.0 | 0.0 | 0.620046 | 0.0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 3.988984 | 2.079442 | 3.713572 | 0.0 | 0.0 | 0.559957 | 0.0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 3.367296 | 2.639057 | 3.713572 | 0.0 | 0.0 | 0.710652 | 0.0 |


In [149]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=1, stratify= y)

In [150]:
kavin_model = joblib.load("models/kavin_model_v1.joblib")


https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


In [151]:
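from sklearn.ensemble import RandomForestClassifier
# Note: this rebinds kavin_model to a fresh, unfitted forest,
# so the model loaded from disk in the previous cell is not used.
kavin_model = RandomForestClassifier(n_estimators = 20, random_state = 1)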

In [152]:
kavin_model.fit(X_train, y_train)

RandomForestClassifier(n_estimators=20, random_state=1)

Predictions

In [153]:
kavin_train_preds = kavin_model.predict(X_train)
kavin_test_preds = kavin_model.predict(X_test)

Performance

In [154]:
kavin_performance = model_eval(y_train, y_test, kavin_train_preds, kavin_test_preds)
kavin_performance

| Metric | Training | Testing |
|---|---|---|
| Accuracy Score | 0.976536 | 0.840806 |
| Precision Score | 0.982438 | 0.884706 |
| Recall | 0.986731 | 0.908753 |
| F1 Score | 0.98458 | 0.896568 |


Generate probabilities

In [155]:
kavin_train_probs = kavin_model.predict_proba(X_train)[:, 1]
kavin_test_probs = kavin_model.predict_proba(X_test)[:, 1]

**Naomi**

Replicate Naomi's feature engineering process

In [156]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [157]:
data = pd.read_csv("data/adult.data",
                   names = ['age', 'workclass', 'fnlwgt', 'education','education-num',
                           'marital-status','occupation','relationship','race','sex',
                          'capital-gain','capital-loss','hours-per-week',
                            'native-country','income'])
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [158]:
# Label Encoding
for col in data.columns:
    if data[col].dtypes == 'object':
        encoder = LabelEncoder()
        data[col] = encoder.fit_transform(data[col])

In [159]:
data.head()

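| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 7 | 77516 | 9 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | 0 |
| 1 | 50 | 6 | 83311 | 9 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | 0 |
| 2 | 38 | 4 | 215646 | 11 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
| 3 | 53 | 4 | 234721 | 1 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | 0 |
| 4 | 28 | 4 | 338409 | 9 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | 0 |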


In [160]:
X = data.drop(['workclass', 'education', 'race', 'sex', 'capital-loss',
        'native-country', 'income'], axis = 1)
y = data["income"]

In [161]:
scaler = StandardScaler()
Xs = scaler.fit_transform(X)
Xs = pd.DataFrame(index=X.index, data= Xs, columns = X.columns)

In [162]:
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size = 0.25, random_state=1, stratify= y)

In [170]:
naomi_model = joblib.load("models/naomi_model.joblib")

In [171]:
naomi_model.fit(X_train, y_train)


SVC(probability=True, random_state=1)

In [172]:
naomi_train_preds = naomi_model.predict(X_train)
naomi_test_preds = naomi_model.predict(X_test)

In [173]:
naomi_performance = model_eval(y_train, y_test, naomi_train_preds, naomi_test_preds)
naomi_performance

| Metric | Training | Testing |
|---|---|---|
| Accuracy Score | 0.850041 | 0.844491 |
| Precision Score | 0.775379 | 0.756657 |
| Recall | 0.531202 | 0.521939 |
| F1 Score | 0.630474 | 0.617754 |


In [174]:
naomi_train_probs = naomi_model.predict_proba(X_train)[:, 1]
naomi_test_probs = naomi_model.predict_proba(X_test)[:, 1]

**Winston**

**Ensemble**

Collect the available sets of probabilities (George, Kavin, and Naomi) into one dataframe

In [175]:
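# Averaging assumes every column holds the probability of the same positive class
# for the same rows in the same order. Version 1's target is the <=50k indicator,
# so Kavin's column may need to be flipped (1 - p) to match the others.
ensemble_train_probs = pd.DataFrame({"george": george_train_probs,
                                     "kavin": kavin_train_probs,
                                     "naomi": naomi_train_probs})

ensemble_test_probs = pd.DataFrame({"george": george_test_probs,
                                    "kavin": kavin_test_probs,
                                    "naomi": naomi_test_probs})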


In [176]:
# Average the probabilities across models, then classify with a 0.5 threshold
ensemble_train_preds = ensemble_train_probs.mean(axis = 1).ge(0.5).astype(int)
ensemble_test_preds = ensemble_test_probs.mean(axis = 1).ge(0.5).astype(int)

In [177]:
ensemble_performance = model_eval(y_train, y_test, ensemble_train_preds, ensemble_test_preds)
ensemble_performance

| Metric | Training | Testing |
|---|---|---|
| Accuracy Score | 0.859623 | 0.845105 |
| Precision Score | 0.715364 | 0.686201 |
| Recall | 0.692739 | 0.657143 |
| F1 Score | 0.70387 | 0.671358 |


Model Performance

In [None]:
ensemble_performance = model_eval(y_train, y_test, ensemble_train_preds, ensemble_test_preds)

In [4]:
...

Ellipsis

## Author Contributions Statement

Kavin:

* Version 1 -- EDA.ipynb, FeatureEngineering.ipynb, Modeling.ipynb
* Makefile
* README.md
* Feature Engineering Analysis -- main.ipynb
* Repository Structuring
* Initializing + Formatting Notebooks

George McIntire:

...

Wen-Ching (Naomi) Tu:

...

Winston Cai:

...