# Dataset Preparation: Adult Census Income

This notebook downloads and prepares the Adult Census Income dataset for use with the ML_Engine library.

## Dataset Information
- **Source**: UCI Machine Learning Repository
- **Task**: Binary classification (income >50K or <=50K)
- **Samples**: 48,842
- **Features**: 14 (mix of numerical and categorical)
- **Target**: 'income' column (binary: '>50K' or '<=50K')

## Preparation Steps
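1. Download the dataset using scikit-learn's `fetch_openml`
2. Handle missing values (mode for categorical columns, median for numerical columns)
3. Encode the target variable as 0/1 (categorical features are kept unencoded)
4. Save the processed dataset, a 5,000-row sample, and a train/test split to the `dataset/` folder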

In [None]:
import pandas as pd
import numpy as np
import os
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Ensure dataset directory exists
os.makedirs('../dataset', exist_ok=True)

In [None]:
# Download Adult Census dataset
# This will cache the dataset locally after first download
print("Downloading Adult Census dataset...")
adult = fetch_openml(name='adult', version=2, as_frame=True)

# Extract features and target
X = adult.data
y = adult.target

print(f"Original dataset shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature names: {list(X.columns)}")
print("\nTarget distribution:")
print(y.value_counts())

# Display first few rows
print("\nFirst 5 rows of features:")
display(X.head())

print("\nTarget variable:")
display(y.head())

In [None]:
# Data exploration
print("=== Dataset Information ===")
print(f"Number of samples: {len(X)}")
print(f"Number of features: {len(X.columns)}")

# Check data types
print("\n=== Data Types ===")
print(X.dtypes)

# Check for missing values
print("\n=== Missing Values ===")
missing = X.isnull().sum()
missing = missing[missing > 0]
if len(missing) > 0:
    print(missing)
else:
    print("No missing values found.")

# Check categorical columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns
print(f"\n=== Categorical Columns ({len(categorical_cols)}) ===")
print(list(categorical_cols))

# Check numerical columns
numerical_cols = X.select_dtypes(include='number').columns
print(f"\n=== Numerical Columns ({len(numerical_cols)}) ===")
print(list(numerical_cols))

In [None]:
# Data preprocessing
print("Preprocessing data...")

# Create a copy for preprocessing
X_processed = X.copy()

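# 1. Handle missing values (if any) - fill with mode for categorical, median for numerical
# fetch_openml(as_frame=True) returns categorical features with 'category' dtype,
# so test for a numeric dtype instead of comparing against 'object'.
for col in X_processed.columns:
    if X_processed[col].isnull().any():
        if pd.api.types.is_numeric_dtype(X_processed[col]):
            # Fill numerical with median
            X_processed[col] = X_processed[col].fillna(X_processed[col].median())
        else:
            # Fill categorical with mode (assign back; inplace fillna on a column selection is unreliable)
            X_processed[col] = X_processed[col].fillna(X_processed[col].mode()[0])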

# 2. Encode target variable
# Map target to binary: '>50K' -> 1, '<=50K' -> 0
# The raw target is categorical, so cast to int after mapping
target_mapping = {'<=50K': 0, '>50K': 1}
y_processed = y.map(target_mapping).astype(int)

print(f"Target encoding: {target_mapping}")
print("\nTarget distribution after encoding:")
print(y_processed.value_counts())

# 3. Categorical features are left unencoded (no one-hot encoding here).
#    This assumes ML_Engine handles categorical encoding itself;
#    if it does not, one-hot encode the features before training.

# Combine X and y for saving
df_processed = X_processed.copy()
df_processed['income'] = y_processed

print(f"\nProcessed dataset shape: {df_processed.shape}")
print("\nFirst 5 rows of processed data:")
display(df_processed.head())
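
The preprocessing cell leaves categorical features unencoded. If the downstream library turns out not to encode them itself, one-hot encoding is one possible alternative. The sketch below is illustrative only: it uses a small made-up frame rather than the real data, and `one_hot_encode` is a hypothetical helper name, not part of ML_Engine.

```python
import pandas as pd

# Toy frame standing in for X_processed. Column names mirror the Adult
# dataset, but the values are made up for illustration.
toy_features = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp"],
    "sex": ["Male", "Female", "Male", "Female"],
})


def one_hot_encode(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode every non-numeric column; numeric columns pass through."""
    non_numeric_cols = df.select_dtypes(exclude="number").columns
    # dtype=int yields 0/1 indicator columns instead of booleans
    return pd.get_dummies(df, columns=list(non_numeric_cols), dtype=int)


toy_encoded = one_hot_encode(toy_features)
print(toy_encoded.columns.tolist())
```

One-hot encoding widens the frame by one column per category, so high-cardinality columns such as `native-country` can add many columns.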

In [None]:
# Save processed dataset
output_path = '../dataset/adult_census_processed.csv'
df_processed.to_csv(output_path, index=False)

print(f"Dataset saved to: {output_path}")
print(f"File size: {os.path.getsize(output_path) / 1024 / 1024:.2f} MB")

# Also save a smaller sample for quick testing
sample_size = 5000
if len(df_processed) > sample_size:
    df_sample = df_processed.sample(n=sample_size, random_state=42)
    sample_path = '../dataset/adult_census_sample.csv'
    df_sample.to_csv(sample_path, index=False)
    print(f"\nSample dataset ({sample_size} rows) saved to: {sample_path}")
    print(f"Sample file size: {os.path.getsize(sample_path) / 1024 / 1024:.2f} MB")

# Create a train/test split for consistency
train_df, test_df = train_test_split(df_processed, test_size=0.2, random_state=42, stratify=df_processed['income'])
train_path = '../dataset/adult_census_train.csv'
test_path = '../dataset/adult_census_test.csv'
train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)

print(f"\nTrain set saved to: {train_path} ({len(train_df)} rows)")
print(f"Test set saved to: {test_path} ({len(test_df)} rows)")
print(f"Train target distribution:\n{train_df['income'].value_counts()}")
print(f"\nTest target distribution:\n{test_df['income'].value_counts()}")
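
The split above passes `stratify=df_processed['income']` so that both splits keep roughly the same class balance as the full dataset. A minimal sketch of that effect, using a synthetic 75/25 frame (an assumption for illustration, not the real class ratio) and separate variable names so it does not overwrite the notebook's `train_df` and `test_df`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_processed: 100 rows, 25% positive class.
toy_df = pd.DataFrame({
    "age": range(100),
    "income": [0] * 75 + [1] * 25,
})

toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["income"]
)

# With stratification, the positive rate in each split tracks the full frame.
full_rate = toy_df["income"].mean()
train_rate = toy_train["income"].mean()
test_rate = toy_test["income"].mean()
print(f"full={full_rate:.2f} train={train_rate:.2f} test={test_rate:.2f}")
```

Without `stratify`, the positive rate in a small test split can drift noticeably from the full dataset's rate, which makes metrics harder to compare across runs.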

## Dataset Summary

### Files Created:
1. `adult_census_processed.csv` - Full processed dataset
2. `adult_census_sample.csv` - 5,000 row sample for quick testing
3. `adult_census_train.csv` - Training split (80%)
4. `adult_census_test.csv` - Test split (20%)

### Dataset Characteristics:
- **Total samples**: 48,842
- **Features**: 14 original features (mix of numerical and categorical)
- **Target**: 'income' (binary: 0 for <=50K, 1 for >50K)
- **Missing values**: Handled (filled with mode/median)
- **Categorical encoding**: Target encoded to 0/1, features kept as strings

### For Use in ML_Engine:
The notebooks should use the processed dataset from the `dataset/` folder. For classification tasks, use `income` as the target column.
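
A minimal sketch of loading one of the saved files and separating the features from the `income` target with pandas. ML_Engine's own loading API is not shown in this notebook, so the helper `load_features_and_target` is hypothetical, and the demo writes a tiny made-up CSV to a temporary directory instead of reading the real files.

```python
import os
import tempfile

import pandas as pd


def load_features_and_target(csv_path, target="income"):
    """Read a processed CSV and return (features, target)."""
    df = pd.read_csv(csv_path)
    features = df.drop(columns=[target])
    labels = df[target]
    return features, labels


# Demo on a throwaway CSV with made-up rows; in the notebook layout the
# path would be something like '../dataset/adult_census_train.csv'.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "adult_census_demo.csv")
    pd.DataFrame({
        "age": [39, 50, 38],
        "workclass": ["State-gov", "Private", "Private"],
        "income": [0, 1, 0],
    }).to_csv(demo_path, index=False)

    X_demo, y_demo = load_features_and_target(demo_path)

print(X_demo.columns.tolist())
print(y_demo.tolist())
```

CSV files do not preserve pandas dtypes, so columns that were `category` before saving come back as plain strings and may need to be converted again.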