<a href="https://colab.research.google.com/github/brobro10000/CS5262-foundations-of-machine-learning/blob/main/CS5262_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What Drives Income? A Machine Learning Perspective

## Background

The amount an individual earns per year (income) affects several aspects of that individual's well-being and their ability to access necessary resources. Documented correlations with income are found in [healthcare](https://www.ncbi.nlm.nih.gov/books/NBK578537/), [education](https://cepa.stanford.edu/content/widening-academic-achievement-gap-between-rich-and-poor-new-evidence-and-possible-explanations), and [housing](https://repository.gatech.edu/server/api/core/bitstreams/a131f386-4ca6-4be6-a252-0b5b8542f2cb/content). The ability to accurately predict an individual's income level could help target interventions for those at risk, recognize opportunities for individual growth, and inform policies that address social and economic disparities.

Census income data is derived by the Census Bureau as "income from several major household surveys and programs" ([source](https://www.census.gov/topics/income-poverty/income.html)). One challenge in determining an individual's income is that the data must be collected through a census, which relies on every response being accurate. Another challenge is obtaining a balanced sample across income levels: a sample may contain a large number of data points from one income range and too few from the 'tail ends' of society, that is, very low and very high incomes.

With this in mind, I will apply machine learning to the [*Adult Income Dataset*](https://archive.ics.uci.edu/dataset/2/adult) to build a model that predicts whether an individual makes more than \$50,000 per year, or \$50,000 or less. The dataset was originally created in 1994 and donated in 1996 by Barry Becker of Silicon Graphics and [Ronny Kohavi](https://robotics.stanford.edu/~ronnyk/), a consultant and professor at Stanford Robotics Center.


## Project Description

The Adult Income Dataset I will be using comes from the [University of California, Irvine, Machine Learning Repository](https://archive.ics.uci.edu/dataset/2/adult). The table below gives an overview of the dataset characteristics.

### Overview

| **Dataset Characteristics** | Multivariate        |
|-----------------------------|---------------------|
| **Subject Area**            | Social Science      |
| **Associated Tasks**        | Classification      |
| **Feature Type**            | Categorical, Integer|
| **# Instances**             | 48,842             |
| **# Features**              | 14                 |


### Variables

| **Variable Name** | **Role**   | **Type**       | **Demographic**      | **Description**                                                                                                                                     | **Units** | **Missing Values** |
|-------------------|------------|----------------|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|-----------|---------------------|
| age               | Feature    | Integer        | Age                  | N/A                                                                                                                                                 |           | no                 |
| capital-gain      | Feature    | Integer        |                       |                                                                                                                                                     |           | no                 |
| capital-loss      | Feature    | Integer        |                       |                                                                                                                                                     |           | no                 |
| education         | Feature    | Categorical    | Education Level      | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. |           | no                 |
| education-num     | Feature    | Integer        | Education Level      |                                                                                                                                                     |           | no                 |
| fnlwgt            | Feature    | Integer        |                       |                                                                                                                                                     |           | no                 |
| hours-per-week    | Feature    | Integer        |                       |                                                                                                                                                     |           | no                 |
| income            | Target     | Binary         | Income               | >50K, <=50K.                                                                                                                                        |           | no                 |
| marital-status    | Feature    | Categorical    | Other                | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.                                          |           | no                 |
| native-country    | Feature    | Categorical    | Other                | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, etc.   |           | yes                |
| occupation        | Feature    | Categorical    | Other                | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, etc.                       |           | yes                |
| race              | Feature    | Categorical    | Race                 | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.                                                                                       |           | no                 |
| relationship      | Feature    | Categorical    | Other                | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.                                                                                |           | no                 |
| sex               | Feature    | Binary         | Sex                  | Female, Male.                                                                                                                                       |           | no                 |
| workclass         | Feature    | Categorical    | Income               | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.                                              |           | yes                |
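As a rough illustration of how the variables above might be loaded, the sketch below reads rows in a comma-separated layout with pandas. It assumes the raw file has no header row, lists the columns in the order given in `COLUMNS`, and marks missing values with `?`; these are assumptions to check against the actual download. The two rows in `SAMPLE` are made up for the example and are not taken from the dataset, and `load_adult` is a hypothetical helper name.

```python
import io

import pandas as pd

# Assumed column order of the raw file (verify against the downloaded data).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Made-up rows in the assumed file layout; "?" stands for a missing value.
SAMPLE = """\
39, Private, 120000, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K
52, ?, 98000, HS-grad, 9, Married-civ-spouse, ?, Husband, Black, Male, 0, 0, 45, ?, >50K
"""


def load_adult(source):
    """Read a header-less CSV source and convert '?' entries to NaN."""
    return pd.read_csv(
        source,
        header=None,
        names=COLUMNS,
        na_values="?",          # treat "?" as missing
        skipinitialspace=True,  # drop the space that follows each comma
    )


df = load_adult(io.StringIO(SAMPLE))

# Count missing entries per column and list the columns that have any.
missing_per_column = df.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_missing)
```

On the sample rows, only `workclass`, `occupation`, and `native-country` contain missing entries, matching the "Missing Values" column of the table. The same function should accept a file path or URL in place of the `StringIO` object.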


### Additional Disclosure

The records in the dataset were also extracted using the following conditions:

- `((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))`

- `(AAGE>16)` - Age is greater than 16.

- `(AGI>100)` - Adjusted gross income is more than 100 units (most likely dollars).

- `(AFNLWGT>1)` - This represents the final weight. A description from [Kaggle](https://www.kaggle.com/datasets/uciml/adult-census-income/versions/2/data) is as follows:

  > **Description of `fnlwgt` (final weight)**
  >
  > The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are:
  >
  > - A single cell estimate of the population 16+ for each state.
  > - Controls for Hispanic Origin by age and sex.
  > - Controls by Race, age, and sex.
  >
  > We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights.
  >
  > There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.

- `(HRSWK>0)` - Hours worked is greater than 0.
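The extraction condition above can be read as a simple predicate over raw records. The sketch below is only an illustration of that logic in plain Python: the field names mirror the condition, but the records are hypothetical, and `passes_extraction_filter` is a name introduced here, not part of the original extraction code.

```python
def passes_extraction_filter(record):
    """Return True when a record satisfies all four extraction conditions."""
    return (
        record["AAGE"] > 16         # age greater than 16
        and record["AGI"] > 100     # adjusted gross income above 100 units
        and record["AFNLWGT"] > 1   # final weight greater than 1
        and record["HRSWK"] > 0     # works more than 0 hours per week
    )


# Hypothetical raw records, each failing a different condition except the last.
records = [
    {"AAGE": 15, "AGI": 5000, "AFNLWGT": 1200, "HRSWK": 20},   # too young
    {"AAGE": 30, "AGI": 50, "AFNLWGT": 900, "HRSWK": 40},      # income too low
    {"AAGE": 45, "AGI": 42000, "AFNLWGT": 1500, "HRSWK": 0},   # no hours worked
    {"AAGE": 22, "AGI": 18000, "AFNLWGT": 1, "HRSWK": 35},     # weight not above 1
    {"AAGE": 38, "AGI": 30000, "AFNLWGT": 2000, "HRSWK": 40},  # passes all four
]

kept = [r for r in records if passes_extraction_filter(r)]
print(len(kept), "of", len(records), "records kept")
```

One consequence worth noting is that the published dataset excludes people who are 16 or younger, report no working hours, or have very low adjusted gross income, so a model trained on it says nothing about those groups.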



### Problem Definition


With this dataset, I plan to implement classification models that accurately and precisely predict whether an individual's income is greater than \$50,000. The data source provides baseline model performance values for each metric, and the objective is to improve on the baseline by at least 5% for each metric.
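Since the task is binary classification, the income labels need a numeric encoding before most models can use them. The sketch below shows one possible encoding, assuming `>50K` is treated as the positive class (1) and `<=50K` as the negative class (0). Tolerating surrounding whitespace and a trailing period is a defensive assumption about how labels may appear in the raw files, and `encode_income` is a hypothetical helper.

```python
def encode_income(label):
    """Map an income label to 1 (>50K) or 0 (<=50K)."""
    # Normalize: drop surrounding whitespace and an optional trailing period.
    cleaned = label.strip().rstrip(".")
    if cleaned == ">50K":
        return 1
    if cleaned == "<=50K":
        return 0
    raise ValueError(f"unexpected income label: {label!r}")


# Illustrative label strings, not rows from the dataset.
labels = [" <=50K", ">50K", "<=50K.", " >50K."]
encoded = [encode_income(label) for label in labels]

# Share of positive examples; useful for spotting class imbalance.
positive_share = sum(encoded) / len(encoded)
print(encoded, positive_share)
```

Raising an error on unknown labels, instead of silently mapping them to 0, makes data-quality problems visible early.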



## Performance Metrics

Models are evaluated against the baseline performance values that the data source provides for each metric. As stated in the problem definition, the objective is to improve on the baseline by at least 5% for each metric.
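To make the target concrete, the sketch below computes two common classification metrics from predictions and derives a target score from a baseline. It assumes the metrics of interest are accuracy and precision (suggested by "accurately and precisely" above), and it shows both readings of "5% above the baseline" (relative and absolute) because the text does not say which is intended. The labels, predictions, and the baseline value of 0.80 are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, _, tn, _ = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def target_score(baseline, improvement=0.05, relative=True):
    """Score needed to beat the baseline under either reading of '5%'."""
    if relative:
        return baseline * (1 + improvement)  # 5% of the baseline value
    return baseline + improvement            # 5 percentage points


# Made-up labels and predictions (1 means income > 50K).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("precision:", precision(y_true, y_pred))
print("relative target:", round(target_score(0.80), 4))
print("absolute target:", round(target_score(0.80, relative=False), 4))
```

In practice, a library such as scikit-learn provides equivalent metric functions; the plain-Python versions here only spell out the definitions.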