# **IncomePredictor: Predicting Economic Disparities Using ML**

## **1. Business Understanding**
Income inequality, where earnings are unevenly distributed across a population, is an escalating issue in developing nations worldwide. As AI and automation technologies advance, this disparity may worsen unless proactive measures are taken.

The goal of this project is to develop a machine learning model that can predict whether an individual with certain characteristics would earn above or below a threshold of $50,000.00.

With such a model, tracking critical population metrics, such as income levels, between census periods can become cheaper and more accurate. This would help policymakers make better-informed decisions and take effective action to reduce income inequality.

### 1.1. Hypothesis Testing
- **Null Hypothesis (H0):** Level of education has no statistically significant influence on the income limit.
- **Alternate Hypothesis (Ha):** Level of education has a statistically significant influence on the income limit.
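
One common way to test hypotheses like these, where both variables are categorical, is a chi-square test of independence. The sketch below is only an illustration under stated assumptions: it uses a small toy DataFrame rather than the real data, the `Bachelors degree` and `Above limit` labels are hypothetical stand-ins, and the 0.05 significance level is an assumed choice, not one stated in this notebook.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data standing in for the real `education` and `income_above_limit`
# columns; the counts and some labels are invented for illustration.
toy = pd.DataFrame({
    "education": ["High school graduate"] * 40 + ["Bachelors degree"] * 40,
    "income_above_limit": (
        ["Below limit"] * 36 + ["Above limit"] * 4      # high school group
        + ["Below limit"] * 24 + ["Above limit"] * 16   # bachelors group
    ),
})

# Contingency table: one row per education level, one column per income class
contingency = pd.crosstab(toy["education"], toy["income_above_limit"])

# Chi-square test of independence between the two categorical variables
chi2, p_value, dof, expected = chi2_contingency(contingency)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}, reject H0: {reject_null}")
```

On the real data, the same call would take `pd.crosstab(df["education"], df["income_above_limit"])` as input. With roughly 200,000 rows even small differences tend to be statistically significant, so an effect-size measure may be worth reporting alongside the p-value.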

### 1.2. Analytical Questions
1. How does income vary across different levels of education and types of educational institutions?
2. How does employment status (full-time, part-time, unemployed) relate to wage levels?
3. Does race influence the income limits a person can achieve?
4. From individual demographics, what factors influence an individual’s wage per hour?
5. How does gender influence employment status and income level?
6. How does moving within a region or to a different region affect employment status?
7. How does gender impact income levels?

## **2. Data Understanding**
The dataset consists of various features that provide detailed information about individuals, ranging from demographic details to employment and migration history.<br>

`Key Features:`

<div style="display: flex; justify-content: space-between;">
<div style="width: 50%;">

- **ID**: Unique identifier for each individual.
- **age**: Age of the individual.
- **gender**: Gender of the individual.
- **education**: Level of education of the individual.
- **class**: Class of worker (for example, `Federal government`).
- **education_institute**: Type of educational institution attended.
- **marital_status**: Marital status of the individual.
- **race**: Race of the individual.
- **is_hispanic**: Indicator for Hispanic ethnicity.
- **employment_commitment**: Employment commitment of the individual (for example, `Not in labor force`).
- **unemployment_reason**: Reason for unemployment.
- **employment_stat**: Employment status of the individual.
- **wage_per_hour**: Hourly wage of the individual.
- **is_labor_union**: Membership in a labor union.
- **working_week_per_year**: Number of weeks worked per year.
- **industry_code**: Code representing the industry of employment.
- **industry_code_main**: Main industry code.
- **occupation_code**: Code representing the occupation.
- **occupation_code_main**: Main occupation code.
- **total_employed**: Total number of individuals employed.
- **household_stat**: Household status.
- **household_summary**: Summary of household information.

</div>
<div style="width: 50%;">

- **under_18_family**: Presence of individuals under 18 in the family.
- **veterans_admin_questionnaire**: Veteran status questionnaire.
- **vet_benefit**: Veteran benefits received.
- **tax_status**: Tax status of the individual.
- **gains**: Financial gains.
- **losses**: Financial losses.
- **stocks_status**: Status of stocks owned.
- **citizenship**: Citizenship status.
- **mig_year**: Year of migration.
- **country_of_birth_own**: Country of birth of the individual.
- **country_of_birth_father**: Country of birth of the individual’s father.
- **country_of_birth_mother**: Country of birth of the individual’s mother.
- **migration_code_change_in_msa**: Code for change in MSA migration.
- **migration_prev_sunbelt**: Previous migration status in the Sunbelt.
- **migration_code_move_within_reg**: Code for moving within a region.
- **migration_code_change_in_reg**: Code for change in region migration.
- **residence_1_year_ago**: Previous residence status.
- **old_residence_reg**: Previous residence region.
- **old_residence_state**: Previous residence state.
- **importance_of_record**: Importance of the record.
- **income_above_limit**: Indicator for income above a certain limit.
</div>
</div>


### 2.1. Data Collection
#### Import libraries

In [1]:
# Data manipulation libraries
import numpy as np
import pandas as pd

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#### Load csv file

In [4]:
# Load csv file
df = pd.read_csv("../data/train.csv")

# Preview dataframe
df.head(4)

| | ID | age | gender | education | class | education_institute | marital_status | race | is_hispanic | employment_commitment | ... | country_of_birth_mother | migration_code_change_in_msa | migration_prev_sunbelt | migration_code_move_within_reg | migration_code_change_in_reg | residence_1_year_ago | old_residence_reg | old_residence_state | importance_of_record | income_above_limit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_TZ0000 | 79 | Female | High school graduate | | | Widowed | White | All other | Not in labor force | ... | US | ? | ? | ? | ? | | | | 1779.74 | Below limit |
| 1 | ID_TZ0001 | 65 | Female | High school graduate | | | Widowed | White | All other | Children or Armed Forces | ... | US | unchanged | | unchanged | unchanged | Same | | | 2366.75 | Below limit |
| 2 | ID_TZ0002 | 21 | Male | 12th grade no diploma | Federal government | | Never married | Black | All other | Children or Armed Forces | ... | US | unchanged | | unchanged | unchanged | Same | | | 1693.42 | Below limit |
| 3 | ID_TZ0003 | 2 | Female | Children | | | Never married | Asian or Pacific Islander | All other | Children or Armed Forces | ... | India | unchanged | | unchanged | unchanged | Same | | | 1380.27 | Below limit |
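
The preview shows `?` in several migration columns of the first row. If those are placeholders for unknown values (an assumption; the notebook does not say so), they can be converted to missing values at load time so that `isna()` counts them. The sketch below uses a small inline CSV as a stand-in for `../data/train.csv`, with two columns picked from the preview.

```python
import io
import pandas as pd

# Inline stand-in for ../data/train.csv, mimicking two of the preview rows
csv_text = """ID,migration_code_change_in_msa,residence_1_year_ago
ID_TZ0000,?,
ID_TZ0001,unchanged,Same
"""

# na_values adds "?" to the strings pandas already treats as missing,
# so both the "?" cell and the empty cell are read as NaN
sample = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(sample.isna().sum())
```

The same `na_values=["?"]` argument could be passed when loading the full file. Whether `?` should count as missing or as its own category is a modelling decision.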


### 2.2. Exploratory Data Analysis 

In [7]:
# Check characteristics of dataframe (df.info() prints directly and returns None)
df.info()

# Check for null values: count and percentage per column
print("\n====================== Null Value Count / Percentage ======================")
print(pd.DataFrame({"null_value_count": df.isna().sum(), "percentage_null_value": df.isna().mean().mul(100)}))

# Check for duplicated rows
print("\n====================== Duplicated rows ======================")
print(df.duplicated().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209499 entries, 0 to 209498
Data columns (total 43 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   ID                              209499 non-null  object 
 1   age                             209499 non-null  int64  
 2   gender                          209499 non-null  object 
 3   education                       209499 non-null  object 
 4   class                           104254 non-null  object 
 5   education_institute             13302 non-null   object 
 6   marital_status                  209499 non-null  object 
 7   race                            209499 non-null  object 
 8   is_hispanic                     209499 non-null  object 
 9   employment_commitment           209499 non-null  object 
 10  unemployment_reason             6520 non-null    object 
 11  employment_stat                 209499 non-null  int64  
 12  wage_per_hour   
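
The non-null counts in the `df.info()` output above translate directly into null percentages. As a worked example, the sketch below recomputes them for the three columns with gaps that are visible in the output, then flags those above a cut-off. The 90% threshold is a hypothetical choice for illustration, not a rule taken from this notebook.

```python
import pandas as pd

total_rows = 209499  # number of entries reported by df.info()

# Non-null counts copied from the visible part of the df.info() output
non_null = pd.Series({
    "class": 104254,
    "education_institute": 13302,
    "unemployment_reason": 6520,
})

# Percentage of missing values per column
null_pct = (1 - non_null / total_rows).mul(100).round(2)

threshold = 90.0  # hypothetical cut-off for "mostly empty" columns
sparse_columns = null_pct[null_pct > threshold].index.tolist()

print(null_pct)
print(sparse_columns)
```

Columns that are almost entirely empty are candidates for dropping or for recoding as a "not applicable" category, depending on whether the absence itself carries information.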

In [8]:
df.describe().style.background_gradient(cmap = "YlOrRd")

| | age | employment_stat | wage_per_hour | working_week_per_year | industry_code | occupation_code | total_employed | vet_benefit | gains | losses | stocks_status | mig_year | importance_of_record |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 |
| mean | 34.518728 | 0.17676 | 55.433487 | 23.15885 | 15.332398 | 11.321734 | 1.956067 | 1.515854 | 435.926887 | 36.881737 | 194.53342 | 94.499745 | 1740.888324 |
| std | 22.306738 | 0.555562 | 276.757327 | 24.397963 | 18.049655 | 14.460839 | 2.365154 | 0.850853 | 4696.3595 | 270.383302 | 1956.375501 | 0.500001 | 995.559557 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 94.0 | 37.87 |
| 25% | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 94.0 | 1061.29 |
| 50% | 33.0 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 94.0 | 1617.04 |
| 75% | 50.0 | 0.0 | 0.0 | 52.0 | 33.0 | 26.0 | 4.0 | 2.0 | 0.0 | 0.0 | 0.0 | 95.0 | 2185.48 |
| max | 90.0 | 2.0 | 9999.0 | 52.0 | 51.0 | 46.0 | 6.0 | 2.0 | 99999.0 | 4608.0 | 99999.0 | 95.0 | 18656.3 |


##### Key Insights:

- **Age Distribution:**
The average age is about 34.5 years (median 33), and ages span 0 to 90, so the dataset includes children as well as the elderly.

- **Employment Status:**
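The `employment_stat` code is 0 for at least 75% of records (mean 0.18, maximum 2), so most individuals fall into the lowest category. The meaning of each code is not documented here, so reading 0 as unemployed or low employment status is tentative.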

- **Wage per Hour:**
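Hourly wages are heavily right-skewed: the median and the 75th percentile are both 0, while the mean is 55.43, the standard deviation is 276.76, and the maximum is 9,999. Most records therefore report no hourly wage, and a small number of large values drive the mean.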

- **Working Weeks per Year:**
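The mean is 23.16 weeks and the median is 8, but the distribution is polarized rather than uniformly low: at least a quarter of individuals worked 0 weeks and at least a quarter worked the full 52. The zeros likely include children and others outside the labor force (the 25th percentile of age is 15).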

- **Industry and Occupation Codes:**
Industry codes range from 0 to 51 and occupation codes from 0 to 46, but the median of both is 0, so at least half of the records carry a zero. This suggests either missing data or individuals not associated with a specific industry or occupation.

- **Veteran Benefits:**
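`vet_benefit` takes values from 0 to 2, and at least 75% of records have the value 2 (the 25th percentile is already 2). It appears to be a categorical code rather than a benefit amount, so these statistics alone do not show how many individuals receive veteran benefits.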

- **Financial Gains and Losses:**
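Gains and losses are zero for at least 75% of individuals, yet the maxima reach 99,999 for gains and 4,608 for losses, so both columns are dominated by a few extreme values. The gains maximum of 99,999 may be a cap or placeholder rather than a true amount.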

- **Stock Status:**
At least 75% of records have a `stocks_status` of 0, so only a minority report any stock value; among those who do, values vary widely, up to 99,999.

- **Migration Year:**
`mig_year` ranges only from 94 to 95, with a mean of 94.50 and a standard deviation of 0.50, which suggests the records come from two years in roughly equal proportion.

- **Importance of Record:**
`importance_of_record` ranges from 37.87 to 18,656.30, with a mean of 1,740.89 and a standard deviation of 995.56, indicating that records differ considerably in their assigned importance.
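
Several of the insights above rest on columns being mostly zero with a few extreme values. A quick way to quantify that is to compute the share of zeros and the skewness per numeric column. The sketch below runs on a tiny toy DataFrame with invented values; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy values for illustration only; column names mirror the dataset
toy = pd.DataFrame({
    "wage_per_hour": [0, 0, 0, 500, 1200],
    "gains": [0, 0, 0, 0, 99999],
    "working_week_per_year": [0, 8, 52, 52, 0],
})

# Percentage of rows equal to zero in each column
zero_share = (toy == 0).mean().mul(100)

# Sample skewness; large positive values indicate a long right tail
skewness = toy.skew()

summary = pd.DataFrame({"zero_pct": zero_share, "skew": skewness})
print(summary)
```

On the real data, the same two lines could be applied to `df.select_dtypes("number")`. Columns with a high zero share and strong skew may benefit from a log transform or a binary "non-zero" indicator before modelling.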