<a href="https://colab.research.google.com/github/brobro10000/CS5262-foundations-of-machine-learning/blob/main/CS5262_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What Drives Income? A Machine Learning Perspective

## Background

The amount an individual earns per year (income) can affect several aspects of that individual's well-being and their ability to access necessary resources. Some documented examples of income correlation are found in [healthcare](https://www.ncbi.nlm.nih.gov/books/NBK578537/), [education](https://cepa.stanford.edu/content/widening-academic-achievement-gap-between-rich-and-poor-new-evidence-and-possible-explanations) and [housing](https://repository.gatech.edu/server/api/core/bitstreams/a131f386-4ca6-4be6-a252-0b5b8542f2cb/content). The ability to accurately predict an individual's income level could support targeted interventions for those at risk, help recognize opportunities for individual growth, and inform policies that address social and economic disparities.

Census income data is derived by the Census Bureau as "income from several major household surveys and programs" ([source](https://www.census.gov/topics/income-poverty/income.html)). One challenge in determining an individual's income is that the data must be collected through a census, with the expectation that all responses are accurate. Another challenge is obtaining a balanced sample of individuals across income levels. An unbalanced sample may contain a large number of data points from a certain income range and not enough from the 'tail ends' of society: very low income and very high income.

With this information in mind, I will be using the [*Adult Income Dataset*](https://archive.ics.uci.edu/dataset/2/adult) with machine learning to create a predictive model of whether an individual's income is greater than `$50,000`. The dataset was originally created in 1994 by Barry Becker of Silicon Graphics and [Ronny Kohavi](https://robotics.stanford.edu/~ronnyk/), a consultant and professor at Stanford Robotics Center, and was donated in 1996.


## Project Description

The Adult Income Dataset I will be using is derived from the [University of California, Irvine, Machine Learning Repository](https://archive.ics.uci.edu/dataset/2/adult). The table below gives an overview of the dataset characteristics.

### Overview

| **Dataset Characteristics** | Multivariate        |
|-----------------------------|---------------------|
| **Subject Area**            | Social Science      |
| **Associated Tasks**        | Classification      |
| **Feature Type**            | Categorical, Integer|
| **# Instances**             | 48,842             |
| **# Features**              | 14                 |


### Variables

| **Variable Name** | **Role**   | **Type**       | **Demographic**      | **Description**                                                                                                                                     | **Units** | **Missing Values** |
|-------------------|------------|----------------|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|-----------|---------------------|
| age               | Feature    | Integer        | Age                  | N/A                                                                                                                                                 |           | no                 |
| capital-gain      | Feature    | Integer        |                       |                                                                                                                                                     |           | no                 |
| capital-loss      | Feature    | Integer        |                       |                                                                                                                                                     |           | no                 |
| education         | Feature    | Categorical    | Education Level      | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. |           | no                 |
| education-num     | Feature    | Integer        | Education Level      |                                                                                                                                                     |           | no                 |
| fnlwgt            | Feature    | Integer        |                       |                                                                                                                                                     |           | no                 |
| hours-per-week    | Feature    | Integer        |                       |                                                                                                                                                     |           | no                 |
| income            | Target     | Binary         | Income               | >50K, <=50K.                                                                                                                                        |           | no                 |
| marital-status    | Feature    | Categorical    | Other                | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.                                          |           | no                 |
| native-country    | Feature    | Categorical    | Other                | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, etc.   |           | yes                |
| occupation        | Feature    | Categorical    | Other                | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, etc.                       |           | yes                |
| race              | Feature    | Categorical    | Race                 | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.                                                                                       |           | no                 |
| relationship      | Feature    | Categorical    | Other                | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.                                                                                |           | no                 |
| sex               | Feature    | Binary         | Sex                  | Female, Male.                                                                                                                                       |           | no                 |
| workclass         | Feature    | Categorical    | Income               | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.                                              |           | yes                |


### Additional Disclosure

The dataset was also filtered using the following criteria:

- `((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))`

- `(AAGE>16)` - Age is greater than 16

- `(AGI>100)` - Adjusted gross income is more than 100 units (most likely dollars)

- `(AFNLWGT > 1)` - This represents the final weight. A description from [Kaggle](https://www.kaggle.com/datasets/uciml/adult-census-income/versions/2/data) is as follows:

  > **Description of `fnlwgt` (final weight)**
  >
  > The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are:
  >
  > - A single cell estimate of the population 16+ for each state.
  > - Controls for Hispanic Origin by age and sex.
  > - Controls by Race, age, and sex.
  >
  > We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights.
  >
  > There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.

- `(HRSWK>0)` - Hours worked is greater than 0.
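
As a rough illustration, the filter above can be expressed as a pandas boolean mask. The column names simply mirror the identifiers in the filter expression, and the sample rows are made up for demonstration; neither is taken from the actual raw census extract.

```python
import pandas as pd

# Hypothetical raw records; column names follow the filter expression above.
raw = pd.DataFrame(
    {
        "AAGE": [15, 25, 40, 62, 33],
        "AGI": [500, 50, 30000, 12000, 42000],
        "AFNLWGT": [1200, 900, 1, 2500, 1800],
        "HRSWK": [10, 40, 45, 0, 38],
    }
)

# Keep a row only when all four conditions hold at once.
mask = (
    (raw["AAGE"] > 16)
    & (raw["AGI"] > 100)
    & (raw["AFNLWGT"] > 1)
    & (raw["HRSWK"] > 0)
)
filtered = raw[mask]
```

Each of the first four sample rows violates exactly one condition, so only the last row survives the filter.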


### Methodology

The process I intend to use to complete this project is listed below.

- Create a problem definition
- Perform exploratory data analysis
- Iteratively perform the following steps:
  - Feature engineering
  - Modeling
  - Assessment
- Summarize findings and model feasibility

#### Problem Definition

Census data pertaining to adult income were retrieved and archived within the UC Irvine Machine Learning Repository. With this data and machine learning, can I create a model that predicts if an individual makes more than `$50,000` a year? The model should accurately and precisely predict whether an individual's income is greater than `$50,000`.

The data source provides baseline model performance values for each metric, and the objective is to improve predictive value by at least 5% above the baseline for each metric. This model could help governments and organizations determine which individuals are most at risk of falling below this value, and which fields correlate with a higher income.

#### Data Source

The data source, UC Irvine, offers several ways to access the data. We can use the API to retrieve the raw dataset along with the individual test and train files, or we can save the files directly within Google Colab, which will act as our development environment. Loading the data locally should slightly improve performance.

Given that the test and train data are provided separately, I plan to use the provided test data as holdout data for the final model evaluation, and to split the provided training data into a 75% training set and a 25% test set.

Since the dataset is considered imbalanced, research will have to be done on how to appropriately handle the test/train split so that each set contains a representative subset of the data.
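
One common way to handle imbalance during splitting is stratification, which keeps the class proportions the same in each subset. The sketch below shows a 75/25 stratified split with Scikit-learn's `train_test_split`. The data is synthetic, and the 24% positive rate is an assumption chosen for illustration rather than a measured property of the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training file: 1,000 rows, 24% positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 240 + [0] * 760)

# test_size=0.25 gives the 75/25 split; stratify=y keeps the class ratio
# the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

train_rate = y_train.mean()
test_rate = y_test.mean()
```

Comparing `train_rate` and `test_rate` is a quick sanity check that neither subset ended up with a skewed share of the positive class.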

#### Exploratory Data Analysis

Using standard Python packages, we will begin parsing through the data and highlight correlations that may exist between individual fields and income. We will also examine how unique each feature is and how features correlate with one another, to avoid redundant features, while maintaining the mantra that "Correlation does not equal causation" ([source](https://www.machinelearningplus.com/statistics/correlation/#ImportanceofCorrelationinDataScience)).

Packages to be used:

- Pandas
- NumPy
- Scikit-learn
- Matplotlib/Seaborn

Pandas and NumPy will be used to manipulate the raw data into more usable data structures. Matplotlib and Seaborn, which is built on top of Matplotlib, will be used to assist in the exploratory data analysis portion, while Scikit-learn will help us break the dataset into our test/train split. Since we also have holdout test data, I will not be performing any exploratory data analysis on that subset of data.

#### Feature Engineering

As part of exploratory data analysis, it is important to process the data further before feeding it into a model. I anticipate the following feature engineering steps for the income dataset:

- Removing or imputing (for example, 'averaging') empty data points
- Encoding of categorical data (such as education level)
- Data aggregation of common fields (such as education level)
- Handling outliers beyond what is provided by `fnlwgt`

Once sufficient data analysis has been done, we will use the outcome of feature engineering to make our final selection of fields to feed into the model.

Given that feature engineering is an iterative process, I anticipate revisiting it to create and consolidate features until we reach the most performant model according to our performance metrics.
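
A minimal sketch of three of the steps listed above (filling empty values, aggregating education levels, and encoding categorical data) might look like the following. The rows are made up, the `"?"` missing-value marker is an assumption about how the raw file flags empty fields, and the `education_groups` mapping is purely illustrative rather than a final design.

```python
import pandas as pd

# Made-up rows; "?" is assumed here to mark a missing categorical value.
df = pd.DataFrame(
    {
        "age": [39, 50, 38, 53],
        "workclass": ["Private", "?", "Private", "State-gov"],
        "education": ["Bachelors", "HS-grad", "11th", "Masters"],
    }
)

# Blank out the "?" marker, then fill the gap with the most frequent category.
df["workclass"] = df["workclass"].mask(df["workclass"] == "?")
df["workclass"] = df["workclass"].fillna(df["workclass"].mode()[0])

# Aggregate fine-grained education levels into broader groups.
education_groups = {
    "11th": "No-diploma",
    "HS-grad": "HS-grad",
    "Bachelors": "College",
    "Masters": "College",
}
df["education_group"] = df["education"].map(education_groups)

# One-hot encode the categorical columns so a model can consume them.
encoded = pd.get_dummies(
    df.drop(columns="education"),
    columns=["workclass", "education_group"],
)
```

Filling with the most frequent category is only one option for categorical fields; dropping the affected rows or adding an explicit "Unknown" category are alternatives worth comparing during iteration.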

#### Modeling
Leveraging the compute and GPU resources of Google Colab, we will select a model to train our data against after each iteration of feature engineering.

Since the outcome of our model is intended to be a binary choice and the dataset is suited to classification, I will be evaluating several different classification models against each iteration of feature engineering. The main model types I will consider are identical to those in the provided baseline model performance, so that I can determine whether the selected features outperform the provided results for precision and accuracy. The models provided are as follows ([source](https://archive.ics.uci.edu/dataset/2/adult)):

- XGBoost
- Support Vector
- Random Forest
- Neural Network
- Logistic Regression

Since the most performant model types provided for both accuracy and precision are XGBoost, Support Vector and Random Forest, those are the model types I will be focusing on specifically.
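
To illustrate how several model types could be compared on the same split, here is a small sketch using Scikit-learn classifiers with default settings. The data comes from `make_classification` and is a synthetic stand-in, not the Adult Income Dataset. XGBoost is left out only because it lives in a separate package, and the hyperparameters shown are assumptions rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (roughly 25% positive class).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate models, mostly with library defaults.
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Support Vector": SVC(),
    "Logistic": LogisticRegression(max_iter=1000),
}

# Fit each model and record accuracy and precision on the test split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "precision": precision_score(y_test, predictions, zero_division=0),
    }
```

Keeping the scores in one dictionary makes it straightforward to rerun the same loop after each feature engineering iteration and compare results side by side.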

#### Assessment

Assessment will be done by evaluating the trained model against the test portion of the train/test split created from the training data. If the results on the test data meet the requirements set by the Problem Definition (an increase of at least 5% in accuracy and precision against each base model's performance), we will then assess the model against the holdout test data provided by the data source. If the model still exceeds our defined performance objectives, further performance metrics will be gauged and the model will be serialized so it can be revisited and iterated on further if necessary.
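
The gating logic described above can be sketched in a few lines. This version assumes the 5% target means an absolute gain of 0.05 in each score; a relative gain would be an equally plausible reading. The helper `meets_target` is hypothetical, and the scores are placeholders for illustration, not the published baselines.

```python
def meets_target(baseline: float, candidate: float, min_gain: float = 0.05) -> bool:
    """Return True when candidate beats baseline by at least min_gain (absolute)."""
    return candidate - baseline >= min_gain


# Placeholder scores for illustration only; not the published baseline values.
baseline_scores = {"accuracy": 0.80, "precision": 0.70}
candidate_scores = {"accuracy": 0.86, "precision": 0.72}

per_metric = {
    metric: meets_target(baseline_scores[metric], candidate_scores[metric])
    for metric in baseline_scores
}

# Only move on to the holdout data when every metric clears the bar.
proceed_to_holdout = all(per_metric.values())
```

With these placeholder numbers the accuracy gain is large enough but the precision gain is not, so the model would go back for another iteration instead of being evaluated on the holdout data.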

#### Summary

Once performant (and non-performant) models are recognized, we will summarize our findings and highlight our initial project definition performance metrics, and any additional metrics that result from analysis of the holdout data against each model.

## Performance Metrics

Given that the data source has provided us with a stepping stone for which metrics we can use, we will leverage those first (precision and accuracy).

An additional option is a cost-benefit analysis of our results based on a confusion matrix. Finally, since this is considered an imbalanced dataset, some brief research ([source](https://futuremachinelearning.org/strategies-to-handle-imbalanced-datasets-in-machine-learning/)) suggests there are several strategies for determining the most relevant metrics to use.

Relevant Formulas:


| **Metric**      | **Formula**                                | **Description**                                                                                  |
|------------------|--------------------------------------------|--------------------------------------------------------------------------------------------------|
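| **Accuracy**     | (TP + TN) / (TP + TN + FP + FN)            | Measures the proportion of all predictions, positive and negative, that are correct.             |
| **Precision**    | TP / (TP + FP)                            | Measures the proportion of positive predictions that are correct; it falls as false positives rise. |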
| **Recall (TPR)** | TP / (TP + FN)                            | Measures the proportion of actual positive cases correctly identified (also called Sensitivity). |
| **FPR**          | FP / (FP + TN)                            | Measures the proportion of actual negative cases incorrectly classified as positive.            |
| **ROC-AUC**      | Integral of TPR against FPR               | Quantifies the trade-off between TPR and FPR; reflects a model's capacity to distinguish classes.|


Where TP is true positive, FP is false positive, TN is true negative, FN is false negative, TPR is true positive rate and FPR is false positive rate. ([source](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification))
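
As a worked example of the formulas above, the sketch below computes precision, recall and FPR from confusion-matrix counts, and approximates ROC-AUC with the trapezoidal rule over (FPR, TPR) points. The counts are made up, and the function names are my own; in practice Scikit-learn's metric functions would likely be used instead.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN), also called TPR."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """FP / (FP + TN)."""
    return fp / (fp + tn)


def roc_auc(points: list[tuple[float, float]]) -> float:
    """Trapezoidal area under (FPR, TPR) points, sorted by FPR."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 20, 130

p = precision(tp, fp)              # 40 / 50
r = recall(tp, fn)                 # 40 / 60
fpr = false_positive_rate(fp, tn)  # 10 / 140

# ROC curve through (0, 0), the single operating point above, and (1, 1).
auc = roc_auc([(0.0, 0.0), (fpr, r), (1.0, 1.0)])
```

A real ROC curve is traced by sweeping the decision threshold, which yields many (FPR, TPR) points; the single-point curve here is only meant to show how the area is accumulated.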

Since the goal is to improve on the baseline model by 5% for accuracy and precision, we expect the same level of increase against the holdout test data.