# Data Preprocessing & Feature Engineering - Adult Dataset (Jupyter Notebook)

This notebook follows the assignment instructions from **EDA2.docx** and uses the provided **adult_with_headers.csv** dataset.

It covers:
- Data exploration
- Missing value handling
- Scaling (Standard & Min-Max)
- Encoding techniques
- Feature engineering
- Outlier detection (Isolation Forest)
- Feature relationship analysis (PPS vs Correlation)

---

## Step 1: Install & Import Required Libraries

In [None]:

# If needed, install dependencies (run once)
# !pip install ppscore scikit-learn seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.ensemble import IsolationForest
import ppscore as pps


## Step 2: Load Dataset

Make sure `adult_with_headers.csv` is in the same folder as this notebook.

In [None]:

# Load dataset
df = pd.read_csv('adult_with_headers.csv')

df.head()


## Step 3: Data Exploration

In [None]:

# Dataset info
df.info()

# Summary statistics
df.describe()


In [None]:

# Check missing values
df.isnull().sum()


## Step 4: Handle Missing Values

In [None]:

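# Replace '?' placeholders with NaN. Strip surrounding whitespace first so
# that padded values such as ' ?' are caught as well.
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = df[obj_cols].apply(lambda s: s.str.strip())
df = df.replace('?', np.nan)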

# Numerical columns - fill with median
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns - fill with mode
cat_cols = df.select_dtypes(include=['object']).columns
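for col in cat_cols:
    # Assign the result back: fillna(inplace=True) on a column selection warns
    # in recent pandas versions and does not modify df under Copy-on-Write
    df[col] = df[col].fillna(df[col].mode()[0])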

df.isnull().sum()
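
A placeholder such as `'?'` is an ordinary string, so `isnull()` does not count it, and a padded variant such as `' ?'` is not matched by an exact `replace('?', ...)`. The sketch below illustrates this on a small made-up frame; the values are invented for illustration and are not rows from `adult_with_headers.csv`.

```python
import numpy as np
import pandas as pd

# Toy data: invented values, not rows from the real dataset
toy = pd.DataFrame({
    "workclass": ["Private", "?", " ?", "State-gov"],
    "age": [39, 50, 38, 53],
})

# '?' and ' ?' are plain strings, so nothing is reported as missing yet
missing_before = int(toy["workclass"].isnull().sum())

# An exact-match replace catches '?' but not the padded ' ?'
exact_only = toy.replace("?", np.nan)
missing_exact = int(exact_only["workclass"].isnull().sum())

# Stripping whitespace first lets the same replace catch both variants
stripped = toy.copy()
stripped["workclass"] = stripped["workclass"].str.strip()
stripped = stripped.replace("?", np.nan)
missing_stripped = int(stripped["workclass"].isnull().sum())

print(missing_before, missing_exact, missing_stripped)
```

If the missing-value check in Step 3 reports zero missing values, it may be worth inspecting `df[col].unique()` for a few categorical columns to see whether such placeholders are present.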


## Step 5: Feature Scaling

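**Standard Scaling:** Rescales each feature to zero mean and unit variance. Suited to methods that are sensitive to feature variance, such as Logistic Regression, SVM, and PCA.

**Min-Max Scaling:** Rescales each feature to the [0, 1] range. Often used for distance-based models such as KNN, and for neural networks, which tend to train better on bounded inputs.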

In [None]:

standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()

df_standard_scaled = df.copy()
df_minmax_scaled = df.copy()

df_standard_scaled[num_cols] = standard_scaler.fit_transform(df[num_cols])
df_minmax_scaled[num_cols] = minmax_scaler.fit_transform(df[num_cols])

df_standard_scaled.head()
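
For reference, the two transforms amount to the per-column arithmetic below, shown with NumPy on made-up values. `StandardScaler` divides by the population standard deviation (`ddof=0`), which is what the sketch mirrors.

```python
import numpy as np

# Made-up values for a single numeric column
x = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

# Standard scaling: subtract the mean, divide by the population std (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

# Min-max scaling: map the minimum to 0 and the maximum to 1
m = (x - x.min()) / (x.max() - x.min())

print(z.round(3))
print(m)
```

Note that pandas' `Series.std()` defaults to `ddof=1`, so standardizing by hand with pandas gives slightly different values than `StandardScaler` unless `ddof=0` is passed.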


## Step 6: Encoding Techniques

In [None]:

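df_encoded = df.copy()
label_enc = LabelEncoder()

for col in cat_cols:
    if df[col].nunique() < 5:
        # One-hot encode low-cardinality columns. dtype='int64' keeps the dummy
        # columns numeric so the later select_dtypes() calls include them.
        df_encoded = pd.get_dummies(df_encoded, columns=[col], drop_first=True, dtype='int64')
    else:
        # Label-encode higher-cardinality columns. The integer codes follow
        # sorted class order and carry no real ordinal meaning.
        df_encoded[col] = label_enc.fit_transform(df[col])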

df_encoded.head()
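
To make the difference between the two encodings concrete, here is a small sketch on an invented frame. `LabelEncoder` assigns integers in sorted class order, so the codes imply an ordering the categories do not have, while one-hot encoding with `drop_first=True` produces one indicator column per category except the first.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented categorical columns for illustration
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "education": ["Masters", "Bachelors", "HS-grad", "Bachelors"],
})

# Label encoding: classes are sorted, then mapped to 0..n_classes-1
encoder = LabelEncoder()
toy["education_code"] = encoder.fit_transform(toy["education"])
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# One-hot encoding: 'Female' sorts first and is dropped as the reference level
one_hot = pd.get_dummies(toy[["sex"]], columns=["sex"], drop_first=True, dtype=int)

print(mapping)
print(one_hot.columns.tolist())
```

Keeping a separate fitted encoder per column (or saving `classes_`) is one way to map the integer codes back to the original labels later.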


## Step 7: Feature Engineering

In [None]:

# Feature 1: Age Group
df_encoded['Age_Group'] = pd.cut(
    df['age'],
    bins=[0, 25, 40, 60, 100],
    labels=['Young', 'Adult', 'Mid-Age', 'Senior']
)

# Feature 2: Capital Gain Indicator (1 if any capital gain was reported, 0 otherwise)
df_encoded['High_Capital_Gain'] = (df['capital-gain'] > 0).astype(int)

df_encoded.head()
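
`pd.cut` uses right-closed intervals by default, so with the bins above an age of exactly 25 falls into `Young` (the interval (0, 25]) and 26 falls into `Adult`. A quick check on a few hand-picked ages:

```python
import pandas as pd

# Hand-picked ages around the bin edges
ages = pd.Series([17, 25, 26, 40, 41, 60, 61, 90])

groups = pd.cut(
    ages,
    bins=[0, 25, 40, 60, 100],
    labels=["Young", "Adult", "Mid-Age", "Senior"],
)

print(groups.tolist())
```

Values outside the outermost edges (here, 0 or below, or above 100) would come back as `NaN`, so the bin range should cover the observed ages.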


## Step 7.1: Log Transformation for Skewed Feature

In [None]:

df_encoded['Log_Capital_Gain'] = np.log1p(df['capital-gain'])

sns.histplot(df['capital-gain'], bins=50)
plt.title("Original Capital Gain Distribution")
plt.show()

sns.histplot(df_encoded['Log_Capital_Gain'], bins=50)
plt.title("Log-Transformed Capital Gain Distribution")
plt.show()
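
`np.log1p(x)` computes log(1 + x), so zero values stay at zero instead of becoming `-inf` as they would under a plain `np.log`. The transform is inverted by `np.expm1`. A short sketch on made-up values:

```python
import numpy as np

# Made-up capital-gain values, including zeros
gains = np.array([0.0, 0.0, 500.0, 5000.0, 99999.0])

logged = np.log1p(gains)       # log(1 + x); zero stays zero
restored = np.expm1(logged)    # inverse transform: exp(y) - 1

print(logged.round(3))
print(np.allclose(restored, gains))
```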


## Step 8: Outlier Detection using Isolation Forest

In [None]:

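# contamination=0.05 tells the model to flag about 5% of rows as outliers
iso_forest = IsolationForest(contamination=0.05, random_state=42)

# fit_predict returns 1 for inliers and -1 for outliers
numeric_features = df_encoded.select_dtypes(include=['int64', 'float64'])
outlier_labels = iso_forest.fit_predict(numeric_features)

df_encoded['Outlier'] = outlier_labels

# Keep inliers only, and drop the helper column so that a constant
# 'Outlier' column does not end up in the PPS and correlation analysis
df_cleaned = df_encoded[df_encoded['Outlier'] == 1].drop(columns='Outlier')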

df_cleaned.shape
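
`fit_predict` labels inliers as `1` and outliers as `-1`, and `contamination` sets the share of rows that get flagged. A minimal sketch on synthetic data (a Gaussian cluster plus a few planted far-away points, all invented):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 190 points near the origin plus 10 planted points scattered far away
inliers = rng.normal(loc=0.0, scale=1.0, size=(190, 2))
signs = rng.choice([-1.0, 1.0], size=(10, 2))
planted = signs * rng.uniform(6.0, 10.0, size=(10, 2))
X = np.vstack([inliers, planted])

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(X)   # 1 = inlier, -1 = outlier

n_flagged = int((labels == -1).sum())
planted_flagged = int((labels[-10:] == -1).sum())
print(n_flagged, planted_flagged)
```

Because `contamination` is a chosen proportion rather than an estimate, the model flags roughly that share of rows even when the data contains no real anomalies, so the removed rows are worth inspecting before discarding them.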


## Step 9: PPS vs Correlation Matrix

In [None]:

# Compute the Predictive Power Score for every (predictor, target) column pair
pps_matrix = pps.matrix(df_cleaned)

pps_matrix.head()
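
In ppscore 1.x, `pps.matrix` returns a long-format table with one row per feature pair and columns that include `x`, `y`, and `ppscore`. To view it as a grid comparable to the correlation matrix, it is usually pivoted first. The sketch below assumes that layout and uses a small hand-written stand-in table with invented scores, so it runs without ppscore installed.

```python
import pandas as pd

# Stand-in for pps.matrix(df_cleaned)[['x', 'y', 'ppscore']]; scores are invented
pps_long = pd.DataFrame({
    "x": ["age", "age", "hours", "hours"],
    "y": ["age", "hours", "age", "hours"],
    "ppscore": [1.0, 0.05, 0.02, 1.0],
})

# Rows are targets (y), columns are predictors (x)
pps_wide = pps_long.pivot(index="y", columns="x", values="ppscore")
print(pps_wide)

# With seaborn imported as sns, the wide table can be plotted like the
# correlation matrix, e.g. sns.heatmap(pps_wide, vmin=0, vmax=1, annot=True)
```

Unlike a correlation matrix, the resulting grid is generally not symmetric: the score for predicting `y` from `x` can differ from the score for predicting `x` from `y`.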


## Step 9.1: Correlation Matrix

In [None]:

plt.figure(figsize=(10, 6))
sns.heatmap(df_cleaned.select_dtypes(include=['int64', 'float64']).corr(), cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
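
The reason to look at both measures is that Pearson correlation only captures linear association, while PPS is based on how well a decision tree predicts one column from another. The following is a simplified sketch of that idea on synthetic data, not ppscore's actual implementation: it scores a cross-validated decision tree against a median baseline using mean absolute error. The helper name `pps_like_score` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def pps_like_score(x, y):
    """Simplified PPS-style score: 1 - MAE(tree) / MAE(median baseline), floored at 0."""
    preds = cross_val_predict(
        DecisionTreeRegressor(random_state=0), x.reshape(-1, 1), y, cv=4
    )
    mae_model = np.mean(np.abs(y - preds))
    mae_naive = np.mean(np.abs(y - np.median(y)))
    return max(0.0, 1.0 - mae_model / mae_naive)


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2    # strong but non-linear, symmetric relationship

pearson = np.corrcoef(x, y)[0, 1]
score = pps_like_score(x, y)
print(round(pearson, 3), round(score, 3))
```

On data like this the correlation is close to zero while the tree-based score is high, which is the kind of relationship a correlation heatmap alone would miss.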


## Final Conclusion

This notebook covers the assignment objectives from EDA2.docx: **missing value handling, scaling, encoding, feature engineering, outlier detection with Isolation Forest, and feature relationship analysis using PPS and correlation**.