# Preprocessing

This notebook performs an initial pass over the Adult dataset (`adult.data` and `adult.test`) and focuses on three aspects: consistency, completeness, and cleanliness.

## Cell 1 - Libraries


In [13]:
import os
from pathlib import Path

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)

## Cell 2 - Import data

In [14]:
print(os.getcwd())

# Paths are relative to the notebooks/ directory
dataset_path = '../data/raw/adult/adult.data'
test_dataset_path = '../data/raw/adult/adult.test'

column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'sex',
    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
    'income',
]

df_train = pd.read_csv(
    dataset_path,
    header=None,
    names=column_names,
    na_values=['?', ' ?'],  # catch "?" as missing
    sep=',',
    skipinitialspace=True,
)

df_test = pd.read_csv(
    test_dataset_path,
    header=None,
    names=column_names,
    na_values=['?', ' ?'],  # catch "?" as missing
    sep=',',
    skiprows=1,  # the first line in adult.test is a header/comment
    skipinitialspace=True,
)

# Fix label format in the test split (labels end with a '.'):
# strip whitespace first, then remove only the trailing period
df_test['income'] = df_test['income'].str.strip().str.replace(r'\.$', '', regex=True)

# Add a helper column to keep track of the original split
df_train['split'] = 'train'
df_test['split'] = 'test'

df = pd.concat([df_train, df_test], ignore_index=True)

print(f'Train shape: {df_train.shape}')
print(f'Test shape:  {df_test.shape}')
print(f'Combined:    {df.shape}')

df.head()

/Users/villafuertech/Documents/Academic/University/Septimo_Semestre/Trusthworthy_ML/Projects/3_Project/fairness-project/notebooks
Train shape: (32561, 16)
Test shape:  (16281, 16)
Combined:    (48842, 16)


|   | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income | split |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | train |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | train |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | train |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | train |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | train |
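
The label fix applied to the test split can be checked in isolation. The following is a minimal sketch on made-up label values; the helper name `normalize_income` and the sample series are hypothetical and not part of the notebook. It strips whitespace and removes only a trailing period, so the test labels end up matching the train labels.

```python
import pandas as pd

# Hypothetical sample of raw labels as they might appear in each split
train_labels = pd.Series(['<=50K', '>50K', '<=50K'])
test_labels = pd.Series(['<=50K.', '>50K.', ' >50K. '])


def normalize_income(labels: pd.Series) -> pd.Series:
    """Strip surrounding whitespace, then drop a single trailing period."""
    return labels.str.strip().str.replace(r'\.$', '', regex=True)


normalized_test = normalize_income(test_labels)

# After normalization both splits should share the same label vocabulary
assert set(normalized_test) == set(train_labels)
print(normalized_test.tolist())
```

Anchoring the pattern with `$` is a design choice: it leaves any period elsewhere in a value untouched, which a bare `\.` pattern would not.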


## Cell 3 - Basic cleaning

We strip extra whitespace from the string columns, coerce the numeric columns to numeric dtypes, and keep the result as the cleaned version of the combined dataset.

In [15]:
# Strip leading/trailing whitespace in all string columns
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda col: col.str.strip())

# Ensure numeric columns have numeric dtype
numeric_cols = [
    'age',
    'fnlwgt',
    'education_num',
    'capital_gain',
    'capital_loss',
    'hours_per_week',
]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income            object
split             object
dtype: object
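
Because `errors='coerce'` turns any unparseable value into `NaN` without raising, it can be useful to confirm that the conversion did not create new missing values. The sketch below uses a hypothetical toy frame (the value `'forty'` is invented for illustration) and compares missing counts before and after coercion.

```python
import pandas as pd

# Hypothetical toy frame; 'forty' stands in for an unparseable entry
toy = pd.DataFrame({
    'age': ['39', '50', 'forty'],
    'hours_per_week': ['40', '13', '40'],
})
numeric_cols = ['age', 'hours_per_week']

# Count missing values before and after coercion
na_before = toy[numeric_cols].isna().sum()
converted = toy[numeric_cols].apply(pd.to_numeric, errors='coerce')
na_after = converted.isna().sum()

# Values that became NaN only because they could not be parsed
newly_missing = na_after - na_before
print(newly_missing[newly_missing > 0])
```

In the notebook itself, all six numeric columns come out as `int64`, which already suggests nothing was coerced to `NaN`, since a column containing `NaN` would be promoted to a float dtype.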

## Cell 4 - Data quality checks

Quick checks on shape, target distribution, missing values and duplicates to assess consistency, completeness and cleanliness.

In [16]:
print('DataFrame shape:', df.shape)

print('\nTarget distribution (income):')
print(df['income'].value_counts(normalize=True).rename('proportion'))

# Missing values summary
missing_counts = df.isna().sum()
missing_pct = (missing_counts / len(df)) * 100
missing_summary = pd.DataFrame(
    {'n_missing': missing_counts, 'pct_missing': missing_pct.round(2)}
).sort_values('pct_missing', ascending=False)

print('\nNumber of duplicated rows:', df.duplicated().sum())
missing_summary

DataFrame shape: (48842, 16)

Target distribution (income):
income
<=50K    0.760718
>50K     0.239282
Name: proportion, dtype: float64

Number of duplicated rows: 29


|   | n_missing | pct_missing |
| --- | --- | --- |
| occupation | 2809 | 5.75 |
| workclass | 2799 | 5.73 |
| native_country | 857 | 1.75 |
| age | 0 | 0.0 |
| fnlwgt | 0 | 0.0 |
| education | 0 | 0.0 |
| education_num | 0 | 0.0 |
| marital_status | 0 | 0.0 |
| relationship | 0 | 0.0 |
| race | 0 | 0.0 |
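
The checks above report totals only. As a possible follow-up, the sketch below shows one way to list the duplicated rows themselves and to break the missing-value rate down by split. It runs on a hypothetical toy frame, not on the notebook's `df`, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one duplicated row and one missing value
toy = pd.DataFrame({
    'age': [39, 39, 50, 28],
    'workclass': ['Private', 'Private', np.nan, 'Private'],
    'split': ['train', 'train', 'train', 'test'],
})

# keep=False flags every member of a duplicate group, not just the repeats
dup_mask = toy.duplicated(keep=False)
print(toy[dup_mask])

# Share of missing values per column within each split
missing_by_split = (
    toy.drop(columns='split')
    .isna()
    .groupby(toy['split'])
    .mean()
)
print(missing_by_split)
```

Note that `duplicated()` compares all columns, including `split`, so a row that appears once in each split is not counted as a duplicate unless `split` is excluded via the `subset` argument.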


## Cell 5 - Save cleaned dataset for EDA

We store a single cleaned file under `data/processed/adult/adult_clean.csv` (relative to the project root) so that other notebooks (EDA, modeling, fairness analysis) can load it. The `split` column is saved with it, so the original train/test partition can be recovered.

In [17]:
# Note: notebook runs from notebooks/, so go one level up
processed_dir = Path('../data/processed/adult')
processed_dir.mkdir(parents=True, exist_ok=True)

processed_path = processed_dir / 'adult_clean.csv'
df.to_csv(processed_path, index=False)

print(f'Saved cleaned data to: {processed_path}')

Saved cleaned data to: ../data/processed/adult/adult_clean.csv
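
A downstream notebook could load the saved file and recover the two splits roughly as follows. This is a minimal sketch under stated assumptions: it writes a hypothetical two-row frame to a temporary directory instead of reading the real `adult_clean.csv`, and the names `toy`, `loaded`, `train` and `test` are illustrative.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned dataset
toy = pd.DataFrame({
    'age': [39, 28],
    'occupation': ['Adm-clerical', np.nan],
    'income': ['<=50K', '>50K'],
    'split': ['train', 'test'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'adult_clean.csv'
    toy.to_csv(path, index=False)  # NaN is written as an empty field
    loaded = pd.read_csv(path)     # empty fields are read back as NaN

# Recover the original partition and drop the helper column
train = loaded[loaded['split'] == 'train'].drop(columns='split')
test = loaded[loaded['split'] == 'test'].drop(columns='split')

print(train.shape, test.shape)
```

Missing values survive the round trip as `NaN`, so no `na_values` argument is needed when reading the processed file.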
