# Project: Adult Dataset 🌸

- **Project Name:** Adult Classification Project
- **Project Type:** Binary-class Classification
- **Author:** Dr. Saad Laouadi

### Project Overview:
This project leverages the famous **Adult Dataset**, also known as the **Census Income Dataset**, for a **binary-class classification** problem. The objective is to predict whether a person earns more than $50,000 a year based on various demographic features.

The primary focus of this notebook is **data preprocessing**, which includes handling missing values, encoding categorical variables, and feature scaling to prepare the data for machine learning algorithms.

### Dataset Details:
- **Source**: The Adult Dataset is derived from the 1994 U.S. Census database.
- **Classes**: Two income classes, `<=50K` and `>50K`; the task is to predict which class a person belongs to.
- **Number of Samples**: 48,842
- **Number of Features**: 14 features (including age, education, occupation, race, etc.)

### Key Features:
- **Preprocessing Tasks**:
  - Handle missing or incomplete data
  - Encode categorical variables
  - Feature scaling (e.g., Standardization, Normalization)

### Objectives:
1. **Preprocess the dataset**:
   - Handle missing values
   - Convert categorical data into numeric form using encoding techniques
   - Scale/normalize features for optimal performance in future machine learning models
2. **Prepare the dataset** for modeling and evaluation in the next notebook.
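
The three preprocessing steps listed above can be sketched on a tiny, made-up table. This is only an illustration under stated assumptions: the toy values, the column names, and the choice of median/mode imputation are hypothetical and are not taken from the notebook's own pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column type
toy = pd.DataFrame({
    "age": [39.0, 50.0, np.nan, 53.0],
    "workclass": ["Private", None, "Private", "State-gov"],
    "hours_per_week": [40.0, 13.0, 40.0, 45.0],
})

numeric_cols = ["age", "hours_per_week"]
categorical_cols = ["workclass"]

prepared = toy.copy()

# 1. Handle missing values: median for numeric, most frequent value for categorical
prepared[numeric_cols] = prepared[numeric_cols].fillna(prepared[numeric_cols].median())
for col in categorical_cols:
    prepared[col] = prepared[col].fillna(prepared[col].mode().iloc[0])

# 2. Encode categorical variables as one-hot indicator columns
prepared = pd.get_dummies(prepared, columns=categorical_cols, dtype=float)

# 3. Standardize numeric features to zero mean and unit variance
means = prepared[numeric_cols].mean()
stds = prepared[numeric_cols].std(ddof=0)
prepared[numeric_cols] = (prepared[numeric_cols] - means) / stds

print(prepared)
```

In a real workflow the imputation values and scaling statistics would normally be computed on the training split only and then reused on the test split.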

---

**Copyright © Dr. Saad Laouadi**  
**All Rights Reserved** 🛡️

In [40]:
# Import necessary modules
import os
import re
import requests
from pathlib import Path
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data paths
DATA_URL = "https://raw.githubusercontent.com/dr-saad-la/ML-Datasets/refs/heads/main/benchmark-ml-datasets/adult.data"

# os.getenv returns None when the variable is unset, and Path(None) raises a TypeError,
# so check the raw value before building any paths
DATA_PATH = os.getenv('DATA_PATH')

if DATA_PATH:
    BASE_LOCAL_PATH = Path(DATA_PATH)
    LOCAL_PATH_DATA = BASE_LOCAL_PATH.joinpath("ML-Datasets/benchmark-ml-datasets/adult.data")
    LOACL_PATH_DATA_CSV = BASE_LOCAL_PATH.joinpath("ML-Datasets/benchmark-ml-datasets/adult.csv")
    LOCAL_PATH_METADATA = BASE_LOCAL_PATH.joinpath("ML-Datasets/benchmark-ml-datasets/adult.info.txt")
else:
    print("The DATA_PATH environment variable is not set")
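
If the environment variable is missing, the later cells have no local path to read from. One possible fallback, sketched here, is to use the remote URL whenever the local file cannot be found. The helper name `resolve_data_source` and its parameters are hypothetical and not part of the notebook.

```python
import os
from pathlib import Path


def resolve_data_source(url, relative_path, env_var="DATA_PATH"):
    """Return a local file path if it exists, otherwise the remote URL.

    Hypothetical helper: `env_var` names the environment variable that
    holds the base data directory.
    """
    base = os.getenv(env_var)
    if base:
        candidate = Path(base) / relative_path
        if candidate.exists():
            return str(candidate)
    return url


source = resolve_data_source(
    "https://raw.githubusercontent.com/dr-saad-la/ML-Datasets/refs/heads/main/benchmark-ml-datasets/adult.data",
    "ML-Datasets/benchmark-ml-datasets/adult.data",
)
print(source)
```

Because pandas readers accept both local paths and URLs, the returned string can be passed to them directly.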

In [76]:
def extract_pattern_from_file(data_path, start_line=0, end_line=None, pattern=r"([a-zA-Z0-9\-]+):"):
    """
    Fetches the content of a file from a URL or a local path, skips to a specified line,
    and extracts data based on a regex pattern.

    Parameters:
    - data_path (str): The URL or local path of the file.
    - start_line (int): The line number to start reading from (default is 0).
    - end_line (int or None): The line number to stop reading (default is None, which reads till the end).
    - pattern (str): The regex pattern to extract data from the file (default is r"([a-zA-Z0-9\\-]+):").
    
    Returns:
    - extracted_data (list): List of data extracted based on the provided pattern.
    """
    content = ""
    
    # Compile the regex once so it can be reused on the file content
    pattern = re.compile(pattern)

    try:
        if data_path.startswith('http://') or data_path.startswith('https://'):
            # Handle URL case
            response = requests.get(data_path, stream=True)
            
            # Check if the request was successful
            if response.status_code == 200:
                # Read the file line by line and process based on start and end lines
                lines = response.iter_lines(decode_unicode=True)
                content = "\n".join([line for i, line in enumerate(lines) if i >= start_line and (end_line is None or i <= end_line)])
            else:
                print(f"Error: Unable to fetch the file from the URL. Status code: {response.status_code}")
                return []
        
        elif os.path.exists(data_path):
            # Handle local file path case
            with open(data_path, 'r') as file:
                lines = file.readlines()
                # Process lines from start_line to end_line
                if end_line is None:
                    content = "".join(lines[start_line:])  # If no end line is provided, read till the end
                else:
                    content = "".join(lines[start_line:end_line+1])
        else:
            print(f"Error: The file path '{data_path}' does not exist.")
            return []

        # Use the compiled pattern to extract data from the content
        extracted_data = pattern.findall(content)

        return extracted_data
    
    except requests.exceptions.RequestException as e:
        print(f"Error: An error occurred while trying to fetch the URL. {e}")
        return []
    except Exception as e:
        print(f"Error: An unexpected error occurred. {e}")
        return []

def get_possible_categorical_features(data, max_categories):
    """
    Returns a dictionary containing the number of unique categories for each column in the dataset,
    where the number of categories is less than or equal to the user-defined maximum.
    This can be useful for identifying categorical features that can be encoded, such as using one-hot encoding.

    Parameters:
    - data (pd.DataFrame): The dataset as a pandas DataFrame.
    - max_categories (int): The maximum number of categories a column can have to be considered for encoding.

    Returns:
    - dict: A dictionary where keys are column names and values are the number of unique categories.
            Only columns with categories <= max_categories are included.
    """
    categorical_info = {}

    for column in data.columns:
        num_classes = data[column].nunique()
        
        if num_classes <= max_categories:
            categorical_info[column] = num_classes
    
    return categorical_info

def print_value_counts_for_categorical_features(data, max_categories):
    """
    Identifies categorical features with categories <= max_categories and prints the value counts for each.

    Parameters:
    - data (pd.DataFrame): The dataset as a pandas DataFrame.
    - max_categories (int): The maximum number of categories for features to be considered for encoding.
    """
    categorical_features = get_possible_categorical_features(data, max_categories)
    
    for feature in categorical_features.keys():
        print(f"\nValue counts for feature: '{feature}'")
        print(data[feature].value_counts(dropna=False).to_frame())
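
The cardinality check used by `get_possible_categorical_features` can be tried on a small, made-up frame. The sketch below uses a compact equivalent with a hypothetical name, `low_cardinality_columns`. It also shows a caveat: a numeric column with few distinct values is reported as well, so the result is a list of candidates rather than confirmed categorical features.

```python
import pandas as pd


def low_cardinality_columns(data, max_categories):
    """Map each column with at most `max_categories` distinct values to its count."""
    counts = data.nunique()
    return counts[counts <= max_categories].to_dict()


# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "age": [39, 50, 38, 53],
    "education_num": [13, 13, 9, 7],
})

result = low_cardinality_columns(toy, max_categories=3)
print(result)
```

Here `age` is excluded because it has four distinct values, while the numeric `education_num` is included alongside `sex`.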

In [43]:
url_data_path = 'https://raw.githubusercontent.com/dr-saad-la/ML-Datasets/refs/heads/main/benchmark-ml-datasets/adult.info.txt'  
local_path_metadata = str(LOCAL_PATH_METADATA)


start_line = 94
end_line = None 
pattern = r"([a-zA-Z0-9\-]+):" 

feature_names = extract_pattern_from_file(local_path_metadata,
                                           start_line=start_line,
                                           end_line=end_line,
                                           pattern=pattern)

print(feature_names)
print(len(feature_names))

['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
14
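
The default pattern captures any run of letters, digits, or hyphens that is immediately followed by a colon. A quick check on a few sample lines shows why `start_line` matters. The sample text below is illustrative and only assumes the metadata file lists attributes in a `name: values` layout.

```python
import re

# Illustrative lines in a "name: values" layout (not the full metadata file)
sample = (
    "age: continuous.\n"
    "workclass: Private, Self-emp-not-inc, Federal-gov.\n"
    "education-num: continuous.\n"
)

pattern = re.compile(r"([a-zA-Z0-9\-]+):")
names = pattern.findall(sample)
print(names)

# Any earlier line containing "word:" would match too,
# which is why the header is skipped with start_line
noisy = "Note: weights are estimates.\n" + sample
noisy_names = pattern.findall(noisy)
print(noisy_names)
```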


In [46]:
adult = pd.read_table(LOCAL_PATH_DATA, delimiter=",",
                      names=feature_names + ['Income'])
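
The file loaded above appears to be numerically encoded already. If the raw UCI text file is used instead, missing entries are written as `?` and fields carry a leading space after each comma. A minimal sketch of reading that layout follows, using an in-memory sample with a reduced, illustrative set of columns.

```python
import io

import pandas as pd

# Two illustrative rows in the raw layout: ", " separators and "?" for missing values
raw = io.StringIO(
    "39, State-gov, 77516, Bachelors, <=50K\n"
    "54, ?, 180211, Some-college, >50K\n"
)

columns = ["age", "workclass", "fnlwgt", "education", "Income"]

df = pd.read_csv(
    raw,
    names=columns,
    na_values="?",          # treat "?" as missing
    skipinitialspace=True,  # drop the space that follows each comma
)

print(df.isna().sum())
```

With these two options the `?` marker becomes `NaN`, so the usual `isna()` and `fillna()` tools apply.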

In [47]:
# pd.read_csv(LOACL_PATH_DATA_CSV).info()

In [48]:
adult.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             48842 non-null  int64
 1   workclass       48842 non-null  int64
 2   fnlwgt          48842 non-null  int64
 3   education       48842 non-null  int64
 4   education-num   48842 non-null  int64
 5   marital-status  48842 non-null  int64
 6   occupation      48842 non-null  int64
 7   relationship    48842 non-null  int64
 8   race            48842 non-null  int64
 9   sex             48842 non-null  int64
 10  capital-gain    48842 non-null  int64
 11  capital-loss    48842 non-null  int64
 12  hours-per-week  48842 non-null  int64
 13  native-country  48842 non-null  int64
 14  Income          48842 non-null  int64
dtypes: int64(15)
memory usage: 5.6 MB


In [51]:
adult.head()

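|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | Income |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 39 | 8 | 77516 | 10 | 13 | 5 | 2 | 2 | 5 | 2 | 2174 | 0 | 40 | 40 | 1 |
| 1 | 50 | 7 | 83311 | 10 | 13 | 3 | 5 | 1 | 5 | 2 | 0 | 0 | 13 | 40 | 1 |
| 2 | 38 | 5 | 215646 | 12 | 9 | 1 | 7 | 2 | 5 | 2 | 0 | 0 | 40 | 40 | 1 |
| 3 | 53 | 5 | 234721 | 2 | 7 | 3 | 7 | 1 | 3 | 2 | 0 | 0 | 40 | 40 | 1 |
| 4 | 28 | 5 | 338409 | 10 | 13 | 3 | 11 | 6 | 3 | 1 | 0 | 0 | 40 | 6 | 1 |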


In [55]:
# Normalize the feature names
adult.columns = adult.columns.str.title()
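
`str.title()` capitalizes each hyphen-separated part but keeps the hyphens, so columns such as `Education-Num` still need bracket access. An alternative convention, shown here only as an option and not as the notebook's choice, is lowercase snake_case.

```python
import pandas as pd

cols = pd.Index(["age", "education-num", "hours-per-week", "fnlwgt"])

# What the notebook does: title-case each name, hyphens preserved
titled = cols.str.title()

# Alternative: lowercase names with underscores, usable as attributes
snake = cols.str.lower().str.replace("-", "_", regex=False)

print(list(titled))
print(list(snake))
```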

In [57]:
adult.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             48842 non-null  int64
 1   Workclass       48842 non-null  int64
 2   Fnlwgt          48842 non-null  int64
 3   Education       48842 non-null  int64
 4   Education-Num   48842 non-null  int64
 5   Marital-Status  48842 non-null  int64
 6   Occupation      48842 non-null  int64
 7   Relationship    48842 non-null  int64
 8   Race            48842 non-null  int64
 9   Sex             48842 non-null  int64
 10  Capital-Gain    48842 non-null  int64
 11  Capital-Loss    48842 non-null  int64
 12  Hours-Per-Week  48842 non-null  int64
 13  Native-Country  48842 non-null  int64
 14  Income          48842 non-null  int64
dtypes: int64(15)
memory usage: 5.6 MB


## Columns for Removal

The `fnlwgt` (final weight) column is a sampling weight assigned by the Census Bureau: it estimates how many people in the U.S. population each record represents. It is not a count of de-duplicated rows. Because the weight describes the survey design rather than the individual, it is usually left out of the feature set for income prediction, and it is dropped here before model training.
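
To see what a weight column is for, the sketch below compares an unweighted mean with one that uses per-record weights. The numbers are made up for illustration, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical records: three ages and the weight attached to each record
ages = np.array([20.0, 40.0, 60.0])
weights = np.array([100.0, 100.0, 200.0])

unweighted_mean = ages.mean()
weighted_mean = np.average(ages, weights=weights)

print(unweighted_mean, weighted_mean)
```

Weights like these matter for population-level estimates. They say nothing about an individual's income, which is one reason to exclude the column from the feature set.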

In [59]:
adult = adult.drop(columns = ['Fnlwgt'])

In [60]:
print(f"The adult data shape: {adult.shape}")

The adult data shape: (48842, 14)


In [62]:
# describe the data
adult.describe(include='all').T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| Age | 48842.0 | 38.643585 | 13.71051 | 17.0 | 28.0 | 37.0 | 48.0 | 90.0 |
| Workclass | 48842.0 | 4.870439 | 1.464234 | 1.0 | 5.0 | 5.0 | 5.0 | 9.0 |
| Education | 48842.0 | 11.28842 | 3.874492 | 1.0 | 10.0 | 12.0 | 13.0 | 16.0 |
| Education-Num | 48842.0 | 10.078089 | 2.570973 | 1.0 | 9.0 | 10.0 | 12.0 | 16.0 |
| Marital-Status | 48842.0 | 3.61875 | 1.507703 | 1.0 | 3.0 | 3.0 | 5.0 | 7.0 |
| Occupation | 48842.0 | 7.5777 | 4.230509 | 1.0 | 4.0 | 8.0 | 11.0 | 15.0 |
| Relationship | 48842.0 | 2.443287 | 1.602151 | 1.0 | 1.0 | 2.0 | 4.0 | 6.0 |
| Race | 48842.0 | 4.668052 | 0.845986 | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| Sex | 48842.0 | 1.668482 | 0.470764 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 |
| Capital-Gain | 48842.0 | 1079.067626 | 7452.019058 | 0.0 | 0.0 | 0.0 | 0.0 | 99999.0 |


In [73]:
get_possible_categorical_features(adult, max_categories=20)

{'Workclass': 9,
 'Education': 16,
 'Education-Num': 16,
 'Marital-Status': 7,
 'Occupation': 15,
 'Relationship': 6,
 'Race': 5,
 'Sex': 2,
 'Income': 2}
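
`Education` and `Education-Num` both report 16 distinct values, which suggests they may carry the same information under two encodings. One way to test that is to check whether the two columns map one-to-one. The helper `is_one_to_one` and the toy data are hypothetical.

```python
import pandas as pd


def is_one_to_one(df, col_a, col_b):
    """Return True if each value of col_a pairs with exactly one value of col_b, and vice versa."""
    a_to_b = df.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = df.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)


# Hypothetical toy data
toy = pd.DataFrame({
    "Education": [12, 12, 10, 16],
    "Education-Num": [9, 9, 13, 10],
    "Age": [39, 50, 38, 53],
})

same_info = is_one_to_one(toy, "Education", "Education-Num")
different_info = is_one_to_one(toy, "Education", "Age")
print(same_info, different_info)
```

If the check returns `True` on the full dataset, keeping only one of the two columns avoids feeding a duplicated feature to the model.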

In [77]:
print_value_counts_for_categorical_features(adult, max_categories=20)


Value counts for feature: 'Workclass'
           count
Workclass       
5          33906
7           3862
3           3136
1           2799
8           1981
6           1695
2           1432
9             21
4             10

Value counts for feature: 'Education'
           count
Education       
12         15784
16         10878
10          8025
13          2657
9           2061
2           1812
8           1601
1           1389
6            955
15           834
7            756
3            657
11           594
5            509
4            247
14            83

Value counts for feature: 'Education-Num'
               count
Education-Num       
9              15784
10             10878
13              8025
14              2657
11              2061
7               1812
12              1601
6               1389
4                955
15               834
5                756
8                657
16               594
3                509
2                247
1                 83

Value c