Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.

The following is a brief description of the Adult dataset.

Key Features:
1.  Attributes: The dataset consists of 14 attributes, including both continuous and categorical variables. Key features include:
    *  Age: The age of the individual.
    *  Workclass: The type of employment (e.g., Private, Self-employed, Government).
    *  Education: The highest level of education attained (e.g., Bachelors, Masters).
    *  Occupation: The individual's job type (e.g., Tech-support, Sales).
    *  Hours-per-week: The number of hours worked per week.
    *  Income: The target variable, indicating whether the individual earns more than $50,000 annually (binary classification: ">50K" or "<=50K").
2.  Data Size: The full dataset loaded in this notebook contains 48,842 instances (the original training split alone has 32,561).
3.  Data Quality: The dataset includes missing values represented by a question mark ("?") in certain fields, which may require preprocessing before analysis.

### Part 1 Intro, install and import

In [2]:
!pip install ucimlrepo --quiet

The Adult dataset from the UCI Machine Learning Repository is well suited to exploring the relationship between demographic and socioeconomic factors and income level. It can be used to develop predictive models, support social science research, and study income inequality.

In [1]:

#  DATA HANDLING
import pandas as pd
import numpy as np

#  DATA VISUALIZATION
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#  MODELS
from sklearn.linear_model import LogisticRegression


#  METRICS
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score


#  OUTPUT CONFIG
import warnings
warnings.filterwarnings('ignore')
sns.set(style='white', context='notebook')

In [5]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
X = adult.data.features
y = adult.data.targets

adult_df = pd.concat([X, y], axis=1)
adult_df.head()


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**Features:**

+  *Demographic Information*:

1. **Age**: Continuous variable representing the individual's age.
2. **Workclass**: Categorical variable indicating the type of employment (e.g., private, government, self-employed).
3. **Education**: Categorical variable specifying the highest level of education attained.
4. **Marital Status**: Categorical variable describing the individual's marital status.
5. **Occupation**: Categorical variable indicating the specific occupation.
6. **Relationship**: Categorical variable representing the individual's relationship status within the household.
7. **Race**: Categorical variable specifying the individual's racial or ethnic background.
8. **Sex**: Categorical variable indicating the individual's gender.
9. **Native Country**: Categorical variable specifying the individual's country of origin.
+ *Socioeconomic Factors*:

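1. **fnlwgt**: Continuous variable representing the sampling weight ("final weight") assigned to each record.
2. **Education-num**: Integer code from 1 to 16 for the education level, ordered from lowest to highest. It mirrors the 16 categories of Education rather than a literal count of years.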
3. **Capital Gain**: Continuous variable representing capital gains (e.g., from property sales or investments).
4. **Capital Loss**: Continuous variable representing capital losses (e.g., from property sales or investments).
5. **Hours per Week**: Continuous variable indicating the number of hours worked per week.
+ *Target Variable*:

 **Income:** A binary categorical variable (column `income`) indicating whether the individual's annual income exceeds $50,000 (`>50K`) or is less than or equal to $50,000 (`<=50K`).
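For modelling, the target column `income` usually has to be turned into a 0/1 label. The sketch below shows one way to do that. The helper name `encode_income` is hypothetical, and the handling of a trailing period (labels such as `>50K.`) is an assumption about how some copies of the dataset are labelled; it is worth checking `adult_df['income'].unique()` before relying on it.

```python
import pandas as pd

def encode_income(income: pd.Series) -> pd.Series:
    """Map income labels to 1 (>50K) and 0 (<=50K).

    Hypothetical helper: assumes labels may carry surrounding whitespace
    or a trailing period, and normalizes both before mapping.
    """
    cleaned = income.str.strip().str.rstrip(".")
    return cleaned.map({"<=50K": 0, ">50K": 1})

# Toy labels for illustration only (not taken from the dataset output above)
labels = pd.Series(["<=50K", ">50K", "<=50K.", ">50K."])
encoded = encode_income(labels)
print(encoded.tolist())
```

Applied to the notebook's frame, this would look like `adult_df['income_binary'] = encode_income(adult_df['income'])`. Any label outside the two expected values maps to NaN, which makes unexpected categories easy to spot.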




**Potential Research Questions:**

1. What factors significantly influence an individual's income level?
2. Are there gender or racial disparities in income distribution?
3. How does education level correlate with income?
4. What are the most common occupations among high-income earners?
5. Can a predictive model accurately classify individuals into high-income and low-income categories based on these features?
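Several of these questions (disparities by sex or race, the role of education, occupations of high earners) reduce to comparing the share of `>50K` rows across groups. The following is a minimal sketch of that comparison, not a full analysis. The function name `high_income_rate` and the toy data are illustrative assumptions, and the label cleanup assumes some labels may carry a trailing period.

```python
import pandas as pd

def high_income_rate(df, group_col, target_col="income", positive=">50K"):
    """Share of rows labelled `positive` within each group, highest first."""
    # Normalize labels (whitespace, optional trailing period) before comparing
    is_high = df[target_col].str.strip().str.rstrip(".") == positive
    return is_high.groupby(df[group_col]).mean().sort_values(ascending=False)

# Toy frame for illustration only
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "Masters"],
    "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
})
rates = high_income_rate(toy, "education")
print(rates)
```

On the real data, calls such as `high_income_rate(adult_df, 'sex')` or `high_income_rate(adult_df, 'occupation')` give a first descriptive view. These are raw group rates, so they show association only and do not control for other features.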

### Part 2 Data Analysis: Initial Observations

In [12]:
adult_df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
age,48842.0,,,,38.643585,13.71051,17.0,28.0,37.0,48.0,90.0
workclass,47879.0,9.0,Private,33906.0,,,,,,,
fnlwgt,48842.0,,,,189664.134597,105604.025423,12285.0,117550.5,178144.5,237642.0,1490400.0
education,48842.0,16.0,HS-grad,15784.0,,,,,,,
education-num,48842.0,,,,10.078089,2.570973,1.0,9.0,10.0,12.0,16.0
marital-status,48842.0,7.0,Married-civ-spouse,22379.0,,,,,,,
occupation,47876.0,15.0,Prof-specialty,6172.0,,,,,,,
relationship,48842.0,6.0,Husband,19716.0,,,,,,,
race,48842.0,5.0,White,41762.0,,,,,,,
sex,48842.0,2.0,Male,32650.0,,,,,,,


In [8]:
adult_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB



**Data Integrity:**

*  Null Values: The `info()` output shows nulls in three columns: workclass (963), occupation (966), and native-country (274). The remaining columns are complete.
*  Data Types:
    1. Numeric: Age, Final Weight, Education Number, Capital Gain, Capital Loss, and Hours Per Week are stored as integers (`int64`).
    2. Categorical: Workclass, Education, Marital Status, Occupation, Relationship, Race, Sex, Native Country, and Income are stored as `object`.
*  Data Quality:

Missing Values: In addition to true nulls, the dataset marks some missing entries with a '?' placeholder in categorical features. Both forms need consistent handling (for example, converting '?' to null and then dropping or imputing) so that missing data does not bias subsequent analysis.
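A quick way to size the problem is to count true nulls and '?' placeholders side by side, then convert the placeholders to NaN so that pandas treats both the same way. The sketch below illustrates this on a toy frame; the helper name `summarize_missing` is hypothetical, and it assumes the placeholder is exactly the string `?` with no surrounding whitespace.

```python
import numpy as np
import pandas as pd

def summarize_missing(df: pd.DataFrame, placeholder: str = "?") -> pd.DataFrame:
    """Per-column counts of true nulls and of the placeholder string."""
    nulls = df.isna().sum()
    placeholders = (df == placeholder).sum()
    summary = pd.DataFrame({"null": nulls, "placeholder": placeholders})
    summary["total_missing"] = summary["null"] + summary["placeholder"]
    return summary

# Toy frame for illustration only
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", None, "Private"],
    "occupation": ["Adm-clerical", "?", None, "?"],
})

missing_summary = summarize_missing(toy)
print(missing_summary)

# Convert placeholders to NaN so isna(), dropna() and fillna() cover them
toy_clean = toy.replace("?", np.nan)
print(toy_clean.isna().sum())
```

On the notebook's data the same calls would be `summarize_missing(adult_df)` and `adult_df.replace('?', np.nan)`. Whether to drop or impute the affected rows afterwards is a separate modelling decision.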
