# Main Narrative Notebook -- Predicting Income Off of General Census Data, a STAT 159 Final Project

## EDA Analysis

In [1]:
...

Ellipsis

## Feature Engineering Analysis

Now let us go over Feature Engineering best practices that we have found across our various versions.

### Version 1

Version 1 has a lot of standard, to-be-expected feature engineering, along with some interesting approaches.

##### Let us start with the standard stuff. 

There is a lot of one-hot encoding for the categorical variables included in the dataset. This lets us use numerical inputs for our model instead of the strings the categorical variables originally stored. 
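As a rough illustration of what one-hot encoding does, here is a minimal sketch using `pd.get_dummies`. The toy rows and the column names (`workclass`, `sex`) are our assumptions for the example; they are not taken from the Version 1 notebook, which may encode a different set of columns or use a different function.

```python
import pandas as pd

# Toy rows with assumed column names; the project's real columns may differ.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "sex": ["Male", "Female", "Male"],
})

# Each listed column is replaced by one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["workclass", "sex"], dtype=int)
print(encoded.columns.tolist())
```

The string columns disappear and the model only ever sees the numeric indicator columns.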

Another fairly standard feature-engineering method used is the log transform. It is especially useful for numerical variables where differences among large values should carry less weight than differences among small values. This method was used on the following variables: age, years in education, capital-gain, capital-loss, and hours worked per week.
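A log transform along these lines could look like the sketch below. We use `np.log1p`, which computes log(1 + x), because a column like capital-gain can contain zeros; whether Version 1 used `log1p` or a plain log is an assumption on our part, as are the column names and toy values.

```python
import numpy as np
import pandas as pd

# Toy values with assumed column names.
df = pd.DataFrame({
    "age": [25, 40, 60],
    "capital-gain": [0, 2174, 99999],
})

# log1p(x) = log(1 + x) stays finite when x is 0.
for col in ["age", "capital-gain"]:
    df[f"log_{col}"] = np.log1p(df[col])

print(df)
```

The gap between 2174 and 99999 shrinks dramatically on the log scale, which is the compression of large values described above.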

##### Now, let us talk about some less standard, interesting methods used.

Version 1 used combined features, meaning it merged existing features into new ones through operations such as ratios and products. The idea is that two correlated features may be stronger in terms of model performance when they are combined and used as a single feature within a model. These combined features may well have contributed to the solid performance of the Version 1 model.

The combined features used were years educated / hours worked and capital gains * age. Years educated / hours worked is interesting: more years of education are correlated with fewer hours worked, so the ratio grows larger as the two move in opposite directions. For capital gains * age, there is a positive relationship between the two, so multiplying them amplifies the effect.

Note that these combined features were computed from the log-transformed versions of the original continuous variables.
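To make the two combined features concrete, here is a small sketch that builds them from log-transformed columns. The column names, the new feature names (`edu_per_hour`, `gain_x_age`), and the use of `np.log1p` are all hypothetical choices for illustration; the Version 1 notebook may compute them differently.

```python
import numpy as np
import pandas as pd

# Toy values with assumed column names.
df = pd.DataFrame({
    "age": [25, 40, 60],
    "education-num": [9, 13, 16],
    "hours-per-week": [50, 40, 20],
    "capital-gain": [0, 2174, 15024],
})

# Log-transform every column first (log1p keeps zeros finite).
logs = np.log1p(df)

# Hypothetical feature names for the ratio and the product.
features = pd.DataFrame({
    "edu_per_hour": logs["education-num"] / logs["hours-per-week"],
    "gain_x_age": logs["capital-gain"] * logs["age"],
})
print(features)
```

In this toy data, education rises while hours fall, so the ratio increases from row to row.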

##### There were some miscellaneous changes as well.

The target variable was encoded as a binary variable indicating whether income is <=50k. Additionally, the original versions of all the variables that were one-hot encoded or log-transformed were dropped, as they were no longer needed for the model.
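A minimal sketch of these two clean-up steps, assuming the target column is called `income` with the labels `<=50K` and `>50K`, and using a hypothetical name (`income_le_50k`) for the new binary column:

```python
import pandas as pd

# Toy frame: an original column, its log version, and the string target.
df = pd.DataFrame({
    "age": [39, 50, 28],
    "log_age": [3.69, 3.93, 3.37],
    "income": ["<=50K", ">50K", "<=50K"],
})

# 1 when income is <=50K, 0 otherwise.
df["income_le_50k"] = (df["income"] == "<=50K").astype(int)

# Drop the originals that now have engineered replacements.
df = df.drop(columns=["age", "income"])
print(df)
```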

### Version 2

Version 2 takes a simple/classical, but efficient approach to feature engineering. Sometimes doing less is more for feature engineering.

##### Let us start with the standard stuff. 

Version 2 made the excellent observation that the dataset uses "?" instead of NA values, which is not obvious when checking for NAs via standard Pandas functions. Version 2 decided to drop the rows with these values, which contrasts with Version 3, where they were kept. There is no right or wrong in this scenario, and dropping them was a solid decision.
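The sketch below shows why a standard check misses these values and one way to surface and drop them. The toy rows and column names are assumptions; the actual Version 2 code may differ.

```python
import numpy as np
import pandas as pd

# Toy rows where "?" stands in for a missing value.
df = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "?", "Tech-support"],
    "age": [39, 50, 38],
})

# The standard check reports nothing, because "?" is just a string.
print(df.isna().sum().sum())

# Convert "?" to a real missing value so pandas can see it.
cleaned = df.replace("?", np.nan)
print(cleaned.isna().sum())

# The Version 2 choice: drop every row that has a missing value.
dropped = cleaned.dropna()
print(len(df), "->", len(dropped))
```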

Next, all the appropriate categorical variables were one-hot encoded. Additionally, the original, unmodified versions of features were dropped from the dataset, as they have no use in the model.

##### Now, let us talk about some less standard, interesting methods used.

One especially interesting thing about Version 2 is that they noticed the target variable (income level) was imbalanced and chose to resample to account for this. Resampling is not feature engineering in the strict sense, but it is an uncommon and worthwhile step to include at this stage to account for an imbalance.
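We do not restate the exact resampling scheme here, so the following is only one possible sketch: it upsamples the minority class with replacement until both classes are the same size. The column name `income`, the labels, and the choice of upsampling over downsampling are assumptions.

```python
import pandas as pd

# Toy data: 6 rows of the majority class, 2 of the minority class.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44],
    "income": ["<=50K"] * 6 + [">50K"] * 2,
})

majority = df[df["income"] == "<=50K"]
minority = df[df["income"] == ">50K"]

# Draw minority rows with replacement until the class sizes match.
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)

# Recombine and shuffle so the classes are interleaved.
balanced = (
    pd.concat([majority, minority_up])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)
print(balanced["income"].value_counts())
```

When a held-out test set is used, resampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.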

Additionally, standard scaling was used for the continuous variables. This puts them on a common scale, so that no variable carries extra weight in the model simply because of its units or range.
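To show what this scaling computes, here is the z-score written out by hand with pandas. The column names and values are made up for the example, and Version 2 may have used a library scaler instead of this manual formula.

```python
import pandas as pd

# Two toy columns with different ranges.
df = pd.DataFrame({"age": [20, 40, 60], "hours-per-week": [30, 40, 50]})

# z = (x - mean) / std; ddof=0 uses the population standard deviation.
z = (df - df.mean()) / df.std(ddof=0)
print(z)
```

Both columns end up with mean 0 and standard deviation 1, even though their raw ranges differ.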

### Version 3

Version 3 is very methodical and goes in-depth into more standard feature engineering practices.

##### Let us start with the standard stuff. 

Version 3 made the excellent choice to check for non-standard NA coding. A cursory check of the data with pandas functions makes it look like there are no NA values. However, Version 3 checked for NA values in different forms, and it turns out the data uses "?" for its NA values. Version 3 decided not to impute these "?" values, as the difference is probably negligible.

Next, Version 3 chose to cut out any redundant features within the dataset. The most prominent pair was the number of years spent in education and highest education, as they essentially convey the same information. The highest education variable ended up being dropped.
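One way to confirm that two such columns are redundant before dropping one is to check that every label maps to exactly one numeric code. The sketch below assumes the columns are named `education` and `education-num`; it is our illustration, not the check Version 3 actually ran.

```python
import pandas as pd

# Toy rows with assumed column names.
df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad"],
    "education-num": [13, 9, 13, 14, 9],
})

# Count how many distinct numeric codes each education label maps to.
codes_per_label = df.groupby("education")["education-num"].nunique()

# If every label has exactly one code, the label column adds no information.
is_redundant = bool((codes_per_label == 1).all())
if is_redundant:
    df = df.drop(columns="education")

print(is_redundant, list(df.columns))
```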

Next, the target variable was encoded as a binary variable in place of the original <=50k and >50k labels. After that, the categorical variables were one-hot encoded.

##### Now, let us talk about some less standard, interesting methods used.

The use of fit_transform was interesting: it fits a transformer to the data and then transforms that same data in a single call. Another interesting method used was a StandardScaler object, whose fit_transform was applied to the continuous variables to standardize them. The choice to standardize is interesting, as it is a safe way to transform continuous variables compared with the log transforms in the Version 1 example.
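The sketch below shows how `fit_transform` and `transform` differ on scikit-learn's `StandardScaler`. The arrays are toy values, and the train/test split is our assumption for illustration; we are not claiming this is how Version 3 split its data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy continuous features: columns could be age and hours worked.
X_train = np.array([[20.0, 30.0], [40.0, 40.0], [60.0, 50.0]])
X_test = np.array([[40.0, 60.0]])

scaler = StandardScaler()

# fit_transform learns each column's mean and std, then scales the same data.
X_train_scaled = scaler.fit_transform(X_train)

# transform reuses the statistics learned above; nothing is re-fit.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)
print(X_test_scaled)
```

Calling only `transform` on new data keeps it on the same scale the model was trained on.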

Overall, this feature engineering process was made to be very clean and generalizable. It should perform well across many models.

### Version 4

### Best Feature Engineering Practices Learned Across Versions 1, 2, 3, and 4

Now let us discuss the best feature engineering practices found across Versions 1, 2, 3, and 4 and how they can be applied together in future approaches to this census income problem or even similar ones.

1.

The first thing we will be discussing is not strictly feature engineering, but it is an important thing to think about. Datasets are not always standardized. In this particular example, standard checks report no NA values, yet there are still missing entries coded as "?" instead of NA. These don't appear under standard Pandas functions, but they are still important. In feature engineering, it is important to account for these non-standard encodings. Whether we choose to impute, drop, or use other methods is another story, but looking for these little details can go a long way in our feature engineering efforts and eventual model performance.

2.

One-hot encoding is a fairly obvious feature engineering step, but it is important nonetheless. Most models need numerical inputs, so making sure to "numerify" categorical variables is essential.

3.

Cutting redundancy is important in feature engineering. If two features essentially communicate the same thing, removing one will help improve the efficiency of the model and prevent the same information from being counted twice.

4.

Standardizing the features is another interesting method. It puts the features on a comparable scale by expressing each value as its distance from the mean in units of standard deviation. This keeps features with large raw ranges from dominating the model.

5.

In a similar vein to number 4, applying log transforms to features is a great practice in feature engineering. Some features, such as age, have diminishing returns as the values get higher. In these situations, applying a log transform gives more weight to jumps among lower values and compresses the extrema, which changes the way they are viewed by the model.

6.

Another great practice is to combine features if they have some sort of correlation. For example, in Version 1 there was a ratio of years educated / hours worked. These two features are correlated, and combining two such features can potentially make models perform better.
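As a rough sketch of how several of these practices could fit together, the example below chains them with scikit-learn's `ColumnTransformer`. Everything specific here is an assumption made for illustration: the column names, the toy rows, the choice to drop "?" rows rather than keep them, and the ordering of the steps. Combined features (practice 6) are left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Toy rows with assumed column names.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "capital-gain": [2174, 0, 0, 0],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th"],
    "education-num": [13, 13, 9, 7],
})

# Practices 1 and 3: surface "?" as missing, drop those rows,
# then drop the redundant education label column.
df = df.replace("?", np.nan).dropna().drop(columns="education")

numeric = ["age", "capital-gain", "education-num"]
categorical = ["workclass"]

preprocess = ColumnTransformer(
    [
        # Practices 5 and 4: log1p first, then standardize.
        ("num", make_pipeline(FunctionTransformer(np.log1p), StandardScaler()), numeric),
        # Practice 2: one-hot encode the categorical column.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
```

Keeping the steps inside one transformer object means the same fitted preprocessing can be reused on new data with `preprocess.transform`.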

## Modeling/Testing Analysis

In [3]:
...

Ellipsis

## Final Results

In [4]:
...

Ellipsis

## Author Contributions Statement

Kavin:

* Version 1 -- EDA.ipynb, FeatureEngineering.ipynb, Modeling.ipynb
* Makefile
* README.md
* Feature Engineering Analysis -- main.ipynb
* Repository Structuring
* Initializing + Formatting Notebooks

George McIntire:

...

Wen-Ching (Naomi) Tu:

...

Winston Cai:

...