# Project: Adult Dataset

- **Project Name:** Adult Classification Project
- **Project Type:** Binary-class Classification
- **Author:** Dr. Saad Laouadi

### Project Overview:
This project leverages the famous **Adult Dataset**, also known as the **Census Income Dataset**, for a **binary-class classification** problem. The objective is to predict whether a person earns more than $50,000 a year based on various demographic features.

The primary focus of this notebook is **feature engineering and standardization**, which includes encoding categorical variables, generating new features, and scaling numerical features.

### Dataset Details:
- **Source**: The Adult Dataset is derived from the 1994 U.S. Census database.
- **Classes**: Binary classification task - the target is to predict income (<=50K or >50K).
- **Number of Samples**: 48,842
- **Number of Features**: 14 features (including age, education, occupation, race, etc.)

### Objectives:
- **Feature Engineering**:
  - Encode categorical variables
- **Scaling Numerical Features**:
  - Feature scaling using standardization.
- **Prepare the dataset**:
  - Apply the final steps so the data is ready for modeling and evaluation in the next notebook.

---

**Copyright © Dr. Saad Laouadi**  
**All Rights Reserved** 🛡️

In [1]:
# Import necessary modules
import json
import os
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from sklearn.preprocessing import (
    LabelEncoder,
    MinMaxScaler,
    OneHotEncoder,
    RobustScaler,
    StandardScaler,
)

# Configuration Variables
PRINT_INFO = True

with open("config.json", "r") as file:
    config = json.load(file)

semi_processed_train_data = config["INTERIM_TRAIN_DATA"]

if PRINT_INFO:
    print("Semi Processed Train Data:", semi_processed_train_data)

%load_ext autoreload
%autoreload 2

from utils import *

%load_ext watermark
%watermark -iv -v

Semi Processed Train Data: data/interim/adult_train_semi_processed.csv
Python implementation: CPython
Python version       : 3.12.5
IPython version      : 8.26.0

pandas    : 2.2.2
json      : 2.0.9
seaborn   : 0.13.2
re        : 2.2.1
numpy     : 1.26.4
requests  : 2.32.3
sklearn   : 1.5.1
matplotlib: 3.9.2



In [2]:
# load the data
adult_train = pd.read_csv(semi_processed_train_data)

In [3]:
adult_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29096 entries, 0 to 29095
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             29096 non-null  int64  
 1   workclass       27464 non-null  object 
 2   education       29096 non-null  object 
 3   education-num   29096 non-null  int64  
 4   marital-status  29096 non-null  object 
 5   occupation      27457 non-null  object 
 6   relationship    29096 non-null  object 
 7   race            29096 non-null  object 
 8   sex             29096 non-null  object 
 9   capital-gain    29096 non-null  float64
 10  capital-loss    29096 non-null  int64  
 11  hours-per-week  29096 non-null  float64
 12  native-country  28516 non-null  object 
 13  income          29096 non-null  object 
dtypes: float64(2), int64(3), object(9)
memory usage: 3.1+ MB


### Grouping Categories Before One-Hot Encoding

### The `workclass` Feature

1. **Group Rare Categories**: Categories like ‘Without-pay’ and ‘Never-worked’ are rare and may not contribute much to the model on their own. We group them into a single category, `Unemployed`.
2. **Group Similar Categories Together**: We can group categories based on their nature. For example, ‘Self-emp-inc’ and ‘Self-emp-not-inc’ are both self-employed categories, so they can be grouped together. Here they are merged with ‘Private’ into `private_sector`, and the three government levels are merged into `government`.
3. **Map Categories**: We create a mapping dictionary and apply it to the original feature to produce the grouped categories.
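The mapping step can be sketched with `Series.map`. The notebook's own `categorize_feature` helper is imported from `utils` and is not shown here, so the function below (`map_categories`, a hypothetical name) is only an assumption about how such a helper might behave: labels that are missing from the mapping, and missing values, come out as NaN.

```python
import pandas as pd


def map_categories(df: pd.DataFrame, column: str, mapping: dict) -> pd.DataFrame:
    """Return a copy of df with `column` recoded through `mapping`.

    Hypothetical stand-in for the notebook's `categorize_feature`.
    Labels absent from `mapping` (and missing values) become NaN.
    """
    out = df.copy()
    out[column] = out[column].map(mapping)
    return out


# Toy data, not taken from the Adult dataset
toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Never-worked", None]})
toy_map = {
    "Private": "private_sector",
    "State-gov": "government",
    "Never-worked": "Unemployed",
}

recoded = map_categories(toy, "workclass", toy_map)
print(recoded["workclass"].tolist())
```

Returning a copy keeps the input frame unchanged, which makes it easier to rerun a cell without recoding an already recoded column.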

In [4]:
adult_train["workclass"].value_counts()

workclass
Private             19621
Self-emp-not-inc     2473
Local-gov            2040
State-gov            1272
Self-emp-inc         1091
Federal-gov           946
Without-pay            14
Never-worked            7
Name: count, dtype: int64

In [5]:
workclass_map = {
    "Private": "private_sector",
    "Self-emp-not-inc": "private_sector",
    "Self-emp-inc": "private_sector",
    "Local-gov": "government",
    "State-gov": "government",
    "Federal-gov": "government",
    "Without-pay": "Unemployed",
    "Never-worked": "Unemployed",
}

adult_train = categorize_feature(adult_train, "workclass", workclass_map)

In [6]:
adult_train["workclass"].value_counts(dropna=False)

workclass
private_sector    23185
government         4258
NaN                1632
Unemployed           21
Name: count, dtype: int64

### The `education` Feature

To group the education categories into meaningful groups, we can categorize them based on common educational levels, such as Primary, Secondary, Vocational, Undergraduate, Graduate, etc., and put the rare ones into an “Other” category. Here’s how we can organize it:

**Steps:**

1.	**Primary education**: Preschool, 1st-4th, 5th-6th, 7th-8th.
2.	**Secondary education**: 9th, 10th, 11th, 12th, HS-grad.
3.	**Vocational education**: Assoc-voc, Assoc-acdm.
4.	**Undergraduate education**: Some-college, Bachelors.
5.	**Graduate education**: Masters, Prof-school, Doctorate.
6.	**Rare categories**: Any category that has a significantly smaller frequency count can be grouped into “Other”.

In [7]:
adult_train["education"].value_counts()

education
HS-grad         8886
Some-college    6378
Bachelors       4810
Masters         1653
Assoc-voc       1331
11th            1056
Assoc-acdm      1053
10th             867
7th-8th          629
Prof-school      566
9th              507
12th             414
Doctorate        402
5th-6th          328
1st-4th          166
Preschool         50
Name: count, dtype: int64

In [8]:
education_map = {
    "1st-4th": "primary_education",
    "5th-6th": "primary_education",
    "7th-8th": "primary_education",
    "Preschool": "primary_education",
    "9th": "secondary_education",
    "10th": "secondary_education",
    "11th": "secondary_education",
    "12th": "secondary_education",
    "HS-grad": "secondary_education",
    "Assoc-voc": "vocational_degrees",
    "Assoc-acdm": "vocational_degrees",
    "Bachelors": "higher_dducation",
    "Masters": "higher_education",
    "Doctorate": "higher_education",
    "Prof-school": "professional_education",
}

In [9]:
adult_train = categorize_feature(adult_train, "education", education_map)

In [10]:
# Display the results
print(adult_train["education"].value_counts())

education
secondary_education       11730
higher_dducation           4810
vocational_degrees         2384
higher_education           2055
primary_education          1173
professional_education      566
Name: count, dtype: int64
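The counts above add up to 22,718 of the 29,096 rows, and `higher_dducation` shows up as a class of its own. Both point at `education_map`: `Some-college` (6,378 rows) has no entry, and the target for `Bachelors` is spelled differently from the other `higher_education` entries. A small audit like the one below can surface such cases before a mapping is applied. It is a sketch on toy data; `audit_mapping` is a hypothetical helper, not part of `utils`.

```python
import pandas as pd


def audit_mapping(series: pd.Series, mapping: dict) -> dict:
    """Report raw labels with no mapping entry and targets used only once.

    A target used by a single key is not necessarily wrong, but it is
    worth a second look because a misspelled target looks exactly like this.
    """
    observed = set(series.dropna().unique())
    target_counts = pd.Series(list(mapping.values())).value_counts()
    return {
        "unmapped": sorted(observed - set(mapping)),
        "singleton_targets": sorted(target_counts[target_counts == 1].index),
    }


# Toy data reproducing the two situations described above
raw = pd.Series(["Bachelors", "Masters", "Doctorate", "Some-college", None])
toy_map = {
    "Bachelors": "higher_dducation",
    "Masters": "higher_education",
    "Doctorate": "higher_education",
}

report = audit_mapping(raw, toy_map)
print(report)
```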


### The `marital-status` Feature

To categorize these common marital statuses together, you can create a new feature or group similar classes. For example, you can categorize all types of married statuses under one label and the rest as separate labels. Here’s how you can implement this:

Categories:

1. **Married**: Combine Married-civ-spouse, Married-spouse-absent, Married-AF-spouse
2. **Single**: Keep Never-married as is
3. **Divorced/Separated**: Combine Divorced, Separated
4. **Widowed**: Can be kept as a separate category

In [11]:
print(generate_table(adult_train, "marital-status"))

                       count
marital-status              
Married-civ-spouse     13249
Never-married           9173
Divorced                4237
Separated               1014
Widowed                  982
Married-spouse-absent    418
Married-AF-spouse         23


In [12]:
status_map = {
    "Married-civ-spouse": "Married",
    "Married-spouse-absent": "Married",
    "Married-AF-spouse": "Married",
    "Never-married": "Single",
    "Divorced": "Divorced/Separated",
    "Separated": "Divorced/Separated",
    "Widowed": "Widowed",
}

adult_train = categorize_feature(adult_train, "marital-status", status_map)

In [13]:
print(generate_table(adult_train, "marital-status"))

                    count
marital-status           
Married             13690
Single               9173
Divorced/Separated   5251
Widowed               982


### The `occupation` Feature

To group the similar classes and rare occupations together into broader categories, we can follow a similar approach as with the marital status. Here’s a potential grouping:

Categories:

1. **White-Collar**: Prof-specialty, Exec-managerial, Adm-clerical, Sales, Tech-support
2. **Blue-Collar**: Craft-repair, Machine-op-inspct, Transport-moving, Handlers-cleaners, Farming-fishing
3. **Service**: Other-service, Protective-serv
4. **Other**: The rare classes Armed-Forces and Priv-house-serv are grouped under a general Other category instead of being kept as separate Military or Service entries.


In [14]:
print(generate_table(adult_train, "occupation"))

                   count
occupation              
Prof-specialty      3885
Exec-managerial     3719
Adm-clerical        3340
Craft-repair        3298
Sales               3270
Other-service       2996
Machine-op-inspct   1702
Transport-moving    1445
Handlers-cleaners   1179
Farming-fishing      962
Tech-support         874
Protective-serv      631
Priv-house-serv      147
Armed-Forces           9


In [15]:
occupation_map = {
    "Prof-specialty": "White-Collar",
    "Exec-managerial": "White-Collar",
    "Adm-clerical": "White-Collar",
    "Sales": "White-Collar",
    "Tech-support": "White-Collar",
    "Craft-repair": "Blue-Collar",
    "Machine-op-inspct": "Blue-Collar",
    "Transport-moving": "Blue-Collar",
    "Handlers-cleaners": "Blue-Collar",
    "Farming-fishing": "Blue-Collar",
    "Other-service": "Service",
    "Protective-serv": "Service",
    # Rare or less frequent classes grouped into 'Other'.
    # Each key appears only once: a repeated key in a dict literal
    # silently keeps only its last value.
    "Armed-Forces": "Other",
    "Priv-house-serv": "Other",
}

In [16]:
adult_train = categorize_feature(adult_train, "occupation", occupation_map)

In [17]:
print(generate_table(adult_train, "occupation"))

              count
occupation         
White-Collar  15088
Blue-Collar    8586
Service        3627
Other           156


In [18]:
print(adult_train.columns.to_list())

['age', 'workclass', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']


### The `relationship` Feature

In [19]:
print(generate_table(adult_train, "relationship"))

                count
relationship         
Husband         11506
Not-in-family    7684
Own-child        4096
Unmarried        3317
Wife             1528
Other-relative    965


### The `race` Feature

In [20]:
print(generate_table(adult_train, "race"))

                    count
race                     
White               24438
Black                3038
Asian-Pac-Islander   1038
Amer-Indian-Eskimo    311
Other                 271


### The `native-country` Feature

We can categorize the `native-country` feature by grouping countries into regions, with a broader Other category available for infrequent countries. For simplicity, the grouping below is by continent or region:


**Categories:**

- **North America**: United-States, Mexico, Canada, Puerto-Rico, Outlying-US(Guam-USVI-etc)
- **Central America** (including the Caribbean): El-Salvador, Cuba, Jamaica, Dominican-Republic, Guatemala, Haiti, Honduras, Nicaragua, Trinadad&Tobago
- **South America**: Columbia, Ecuador, Peru
- **Europe**: Germany, England, Italy, Poland, France, Greece, Ireland, Portugal, Scotland, Yugoslavia, Hungary, Holand-Netherlands
- **Asia**: Philippines, India, China, Vietnam, Japan, Taiwan, Iran, Cambodia, Laos, Thailand, Hong, South (`Hong` and `South` are spelled this way in the raw data)
- **Other**: Any country not frequently occurring or not easily categorized can be grouped into Other
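One way to implement the Other fallback described in the list is to map the known countries and send every remaining non-missing label to `"Other"`, while leaving genuinely missing values as NaN so they can be imputed later. This is a sketch under that assumption; `map_with_other` is a hypothetical helper and `"Atlantis"` is a made-up label used only to trigger the fallback.

```python
import pandas as pd


def map_with_other(series: pd.Series, mapping: dict, other: str = "Other") -> pd.Series:
    """Recode `series` through `mapping`; unmapped non-missing labels become `other`."""
    mapped = series.map(mapping)
    # True only where a raw label exists but has no entry in the mapping
    needs_other = series.notna() & mapped.isna()
    return mapped.mask(needs_other, other)


countries = pd.Series(["United-States", "Peru", "Atlantis", None])
region_map = {"United-States": "North America", "Peru": "South America"}

regions = map_with_other(countries, region_map)
print(regions.tolist())
```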

In [21]:
print(generate_table(adult_train, "native-country"))

                            count
native-country                   
United-States               25721
Mexico                        633
Philippines                   198
Germany                       137
Canada                        121
Puerto-Rico                   114
El-Salvador                   106
India                         100
Cuba                           95
England                        90
Jamaica                        81
South                          80
China                          75
Italy                          73
Dominican-Republic             70
Vietnam                        67
Japan                          62
Guatemala                      62
Poland                         60
Columbia                       59
Taiwan                         51
Haiti                          44
Iran                           43
Portugal                       37
Nicaragua                      34
Peru                           31
France                         29
Greece        

In [22]:
country_map = {
    "United-States": "North America",
    "Mexico": "North America",
    "Canada": "North America",
    "Puerto-Rico": "North America",
    "Outlying-US(Guam-USVI-etc)": "North America",
    "El-Salvador": "Central America",
    "Cuba": "Central America",
    "Jamaica": "Central America",
    "Dominican-Republic": "Central America",
    "Guatemala": "Central America",
    "Haiti": "Central America",
    "Honduras": "Central America",
    "Nicaragua": "Central America",
    "Trinadad&Tobago": "Central America",
    "Columbia": "South America",
    "Ecuador": "South America",
    "Peru": "South America",
    "Germany": "Europe",
    "England": "Europe",
    "Italy": "Europe",
    "Poland": "Europe",
    "France": "Europe",
    "Greece": "Europe",
    "Ireland": "Europe",
    "Portugal": "Europe",
    "Scotland": "Europe",
    "Yugoslavia": "Europe",
    "Hungary": "Europe",
    "Holand-Netherlands": "Europe",
    "Philippines": "Asia",
    "India": "Asia",
    "China": "Asia",
    "Vietnam": "Asia",
    "Japan": "Asia",
    "Taiwan": "Asia",
    "Iran": "Asia",
    "Cambodia": "Asia",
    "Laos": "Asia",
    "Thailand": "Asia",
    "Hong": "Asia",
    "South": "Asia",
}

In [23]:
adult_train = categorize_feature(adult_train, "native-country", country_map)

In [24]:
adult_train["native-country"].value_counts(dropna=False)

native-country
North America      26603
Asia                 751
NaN                  580
Central America      524
Europe               520
South America        118
Name: count, dtype: int64

### Imputing Categorical Features

There are three categorical features with missing values:
1. **Workclass**
2. **Occupation**
3. **Native country**

In [25]:
features_with_missing = ["workclass", "occupation", "native-country"]

print("-" * 32)
for feature in features_with_missing:
    print(adult_train[feature].value_counts(dropna=False))
    print("-" * 32)

--------------------------------
workclass
private_sector    23185
government         4258
NaN                1632
Unemployed           21
Name: count, dtype: int64
--------------------------------
occupation
White-Collar    15088
Blue-Collar      8586
Service          3627
NaN              1639
Other             156
Name: count, dtype: int64
--------------------------------
native-country
North America      26603
Asia                 751
NaN                  580
Central America      524
Europe               520
South America        118
Name: count, dtype: int64
--------------------------------


**Impute these three features with their most frequent class (the mode):**

In [26]:
# imputing with the most frequent class
for feature in features_with_missing:
    adult_train.fillna({feature: adult_train[feature].mode()[0]}, inplace=True)
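If the same imputation has to be applied to another split later, one option is to compute the modes once on the training data and reuse them. The sketch below shows the idea on toy data; `new_rows` is a hypothetical stand-in for such a split and is not part of this notebook.

```python
import pandas as pd

train = pd.DataFrame(
    {"workclass": ["private_sector", "private_sector", "government", None]}
)
new_rows = pd.DataFrame({"workclass": [None, "government"]})

# Learn the most frequent class per column from the training data only
columns = ["workclass"]
modes = {col: train[col].mode()[0] for col in columns}

# fillna accepts a {column: value} dict, so both frames use the same modes
train_filled = train.fillna(modes)
new_filled = new_rows.fillna(modes)

print(modes)
print(new_filled["workclass"].tolist())
```

Keeping the learned modes in a dictionary avoids recomputing them on data the model should not see during training.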

### One-Hot Encode Features


In [27]:
features_to_encode = adult_train.select_dtypes("object").columns.to_list()
features_to_encode.remove("income")
print(features_to_encode)

['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']


In [28]:
# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop="first", handle_unknown="ignore")

# Fit and transform the categorical columns
encoded_features = encoder.fit_transform(adult_train[features_to_encode])

In [29]:
# Get the column names for the new OneHot encoded columns
encoded_columns = encoder.get_feature_names_out(features_to_encode)

In [30]:
# Convert the encoded data into a DataFrame
encoded_adult_train = pd.DataFrame(encoded_features, columns=encoded_columns)
# encoded_adult_train.columns.to_list()

In [32]:
# Drop original categorical columns and concatenate with the encoded DataFrame
df = adult_train.drop(columns=features_to_encode)
encoded_adult_train = pd.concat(
    [df.reset_index(drop=True), encoded_adult_train], axis=1
)
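`pd.concat(..., axis=1)` aligns rows on the index rather than on position, which is why the index is reset before concatenating with the freshly built encoded frame. The `info()` output earlier shows that `adult_train` already has a `RangeIndex`, so here the reset acts as a safeguard. A small sketch on toy data of what can go wrong without it:

```python
import pandas as pd

# `left` carries a non-default index, e.g. after filtering rows
left = pd.DataFrame({"age": [25, 40, 33]}, index=[10, 11, 12])
# `right` is built from a NumPy array, so it gets the default index 0..2
right = pd.DataFrame({"sex_Male": [1.0, 0.0, 1.0]})

# Index labels do not overlap: the result has 6 rows padded with NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting the index makes both frames line up by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```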

In [33]:
encoded_adult_train.columns.to_list()

['age',
 'education-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'income',
 'workclass_government',
 'workclass_private_sector',
 'education_higher_education',
 'education_primary_education',
 'education_professional_education',
 'education_secondary_education',
 'education_vocational_degrees',
 'education_nan',
 'marital-status_Married',
 'marital-status_Single',
 'marital-status_Widowed',
 'occupation_Other',
 'occupation_Service',
 'occupation_White-Collar',
 'relationship_Not-in-family',
 'relationship_Other-relative',
 'relationship_Own-child',
 'relationship_Unmarried',
 'relationship_Wife',
 'race_Asian-Pac-Islander',
 'race_Black',
 'race_Other',
 'race_White',
 'sex_Male',
 'native-country_Central America',
 'native-country_Europe',
 'native-country_North America',
 'native-country_South America']

In [34]:
# Check the number of features
len(encoded_adult_train.columns.to_list())

34

### Numerical Feature Standardization
We will use the `StandardScaler` from the scikit-learn library, which centers each numerical feature to zero mean and scales it to unit variance.
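As a quick illustration, standardization computes `z = (x - mean) / std` per column, where `std` is the population standard deviation. The sketch below checks on toy values that a manual computation agrees with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column of values, not taken from the Adult dataset
x = np.array([[20.0], [30.0], [40.0], [50.0]])

# np.std uses ddof=0 (population standard deviation) by default
manual = (x - x.mean(axis=0)) / x.std(axis=0)
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
```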

In [35]:
scaler = StandardScaler()

In [36]:
# Fit and transform the scaler on the numerical columns
features_to_scale = [
    "age",
    "education-num",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
]
scaled_data = scaler.fit_transform(encoded_adult_train[features_to_scale])

In [37]:
# Convert the scaled data back to a DataFrame
scaled_adult_train = pd.DataFrame(scaled_data, columns=features_to_scale)
scaled_adult_train.columns.to_list()

['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

In [38]:
scaled_adult_train.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week
0,-0.018364,1.095328,0.566071,-0.229186,-0.037839
1,0.785323,1.095328,-0.243974,-0.229186,-2.226349
2,-0.091426,-0.416874,-0.243974,-0.229186,-0.037839
3,1.004511,-1.172976,-0.243974,-0.229186,-0.037839
4,-0.82205,1.095328,-0.243974,-0.229186,-0.037839


In [39]:
# Replace the original numerical columns with their scaled values
encoded_adult_train[features_to_scale] = scaled_adult_train
# Alias for readability; both names refer to the same DataFrame
encoded_scaled_adult_train = encoded_adult_train

In [40]:
encoded_adult_train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,29096.0,1.892599e-16,1.000017,-1.625737,-0.82205,-0.091426,0.639198,3.707821
education-num,29096.0,2.974433e-16,1.000017,-3.44128,-0.416874,-0.038824,1.095328,2.22948
capital-gain,29096.0,-1.2210320000000001e-17,1.000017,-0.243974,-0.243974,-0.243974,-0.243974,15.148368
capital-loss,29096.0,-6.837777000000001e-17,1.000017,-0.229186,-0.229186,-0.229186,-0.229186,10.044376
hours-per-week,29096.0,2.764416e-16,1.000017,-3.19902,-0.037839,-0.037839,0.367441,4.663405
workclass_government,29096.0,0.1463431,0.353456,0.0,0.0,0.0,0.0,1.0
workclass_private_sector,29096.0,0.8529351,0.354177,0.0,1.0,1.0,1.0,1.0
education_higher_education,29096.0,0.07062827,0.256207,0.0,0.0,0.0,0.0,1.0
education_primary_education,29096.0,0.04031482,0.1967,0.0,0.0,0.0,0.0,1.0
education_professional_education,29096.0,0.01945285,0.138113,0.0,0.0,0.0,0.0,1.0
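The `std` column shows 1.000017 rather than exactly 1 for the scaled features. This is consistent with `StandardScaler` dividing by the population standard deviation (`ddof=0`) while `describe()` reports the sample standard deviation (`ddof=1`), which differ by a factor of `sqrt(n / (n - 1))`. A short check with the row count from this notebook, using random values as a stand-in for a real column:

```python
import math

import numpy as np

n = 29096  # number of rows in the training data
ratio = math.sqrt(n / (n - 1))
print(round(ratio, 6))

# Same effect on synthetic data: scale with ddof=0, then measure with ddof=1
rng = np.random.default_rng(0)
values = rng.normal(size=n)
z = (values - values.mean()) / values.std()
sample_std = z.std(ddof=1)
print(round(sample_std, 6))
```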


In [41]:
len(encoded_scaled_adult_train.columns.to_list())

34

In [42]:
encoded_scaled_adult_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29096 entries, 0 to 29095
Data columns (total 34 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   age                               29096 non-null  float64
 1   education-num                     29096 non-null  float64
 2   capital-gain                      29096 non-null  float64
 3   capital-loss                      29096 non-null  float64
 4   hours-per-week                    29096 non-null  float64
 5   income                            29096 non-null  object 
 6   workclass_government              29096 non-null  float64
 7   workclass_private_sector          29096 non-null  float64
 8   education_higher_education        29096 non-null  float64
 9   education_primary_education       29096 non-null  float64
 10  education_professional_education  29096 non-null  float64
 11  education_secondary_education     29096 non-null  float64
 12  educ

In [43]:
# Reorder the DataFrame so 'income' is the first column, while keeping the data intact
cols = ["income"] + [
    col for col in encoded_scaled_adult_train.columns if col != "income"
]

# Reindex the DataFrame to reorder the columns
encoded_scaled_adult_train = encoded_scaled_adult_train[cols]

In [44]:
encoded_scaled_adult_train.head()

Unnamed: 0,income,age,education-num,capital-gain,capital-loss,hours-per-week,workclass_government,workclass_private_sector,education_higher_education,education_primary_education,...,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Male,native-country_Central America,native-country_Europe,native-country_North America,native-country_South America
0,<=50K,-0.018364,1.095328,0.566071,-0.229186,-0.037839,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
1,<=50K,0.785323,1.095328,-0.243974,-0.229186,-2.226349,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
2,<=50K,-0.091426,-0.416874,-0.243974,-0.229186,-0.037839,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
3,<=50K,1.004511,-1.172976,-0.243974,-0.229186,-0.037839,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,<=50K,-0.82205,1.095328,-0.243974,-0.229186,-0.037839,0.0,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


### Encode the Target Variable

In [46]:
# Use a separate name so the fitted OneHotEncoder in `encoder` is not overwritten
label_encoder = LabelEncoder()
encoded_scaled_adult_train["income"] = label_encoder.fit_transform(
    encoded_scaled_adult_train["income"]
)
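`LabelEncoder` assigns integer codes in sorted order of the class labels. For the two income labels shown in the outputs above, the sketch below confirms on toy data that `<=50K` maps to 0 and `>50K` maps to 1, and that the codes can be turned back into labels.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["<=50K", ">50K", "<=50K"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted labels; the position of a label is its code
print(list(le.classes_))
print(codes.tolist())
print(list(le.inverse_transform(codes)))
```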

In [47]:
encoded_scaled_adult_train.to_csv(
    "./data/processed/adult_train_processed.csv", index=False
)
