# Data Preprocessing and Feature Engineering in Machine Learning

## 1. Data Exploration and Loading

Import required libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import ppscore as pps
import warnings
warnings.filterwarnings('ignore')
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

Load the dataset

In [None]:
df = pd.read_csv("adult_with_headers.csv")
df.head()

Basic dataset information

In [None]:
# Shape of the DataFrame
df.shape

Data types and summary statistics

In [None]:
# Data types and non-null counts
df.info()
# Summary statistics for numerical features
df.describe().T

### a. Handling Missing Values

In the Adult dataset, missing values are represented as '?'.
The raw values carry a leading space (' ?'), so we first count them as they appear, then strip the whitespace and replace '?' with NaN.

In [None]:
# Count the raw ' ?' placeholders per column (values still carry a leading space)
(df == " ?").sum()

In [None]:
# Remove leading/trailing spaces from all object (string) columns
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()

# Replace '?' with NaN
df.replace("?", np.nan, inplace=True)
# Confirm the placeholders are now counted as missing values
df.isna().sum()
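
The stripping and replacement above can also be done while loading. A minimal sketch, assuming a small inline CSV in place of `adult_with_headers.csv` (the three sample rows are made up for illustration): `skipinitialspace=True` drops the space after each delimiter and `na_values="?"` turns the placeholder into NaN.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the " ?" placeholder in the Adult data
sample_csv = io.StringIO(
    "age,workclass,occupation\n"
    "39, State-gov, Adm-clerical\n"
    "54, ?, ?\n"
    "28, Private, Sales\n"
)

# Skip the space after each comma and treat "?" as a missing value
sample = pd.read_csv(sample_csv, skipinitialspace=True, na_values="?")

# Count missing values per column after parsing
missing_counts = sample.isna().sum()
print(missing_counts)
```

This only handles leading spaces; trailing whitespace, if present, would still need `str.strip()`.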

In [None]:
# Impute categorical features with mode
for col in ['workclass', 'occupation', 'native_country']:
    # Assign the result back; in-place fillna on a column selection is unreliable in recent pandas
    df[col] = df[col].fillna(df[col].mode()[0])

### b. Scaling Techniques

In [None]:
# Separate numerical columns
numerical_cols = df.select_dtypes(exclude="object").columns

### Standard Scaling

In [None]:
standard_scaler = StandardScaler()
df_standard_scaled = df.copy()

df_standard_scaled[numerical_cols] = standard_scaler.fit_transform(df[numerical_cols])


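#### When to Use Standard Scaling:

* Features are roughly normally distributed
* Algorithms that expect centered, comparable-scale inputs, such as Logistic Regression, SVM, and PCA
* Models whose results are sensitive to differences in feature variance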
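
As a worked illustration of what standardization computes, the sketch below applies z = (x - mean) / std by hand to a made-up column and compares the result with `StandardScaler`. The toy values are an assumption for illustration only and do not come from the Adult dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: one feature stored as a single column
ages = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])

# Manual z-score: subtract the mean, divide by the population standard deviation
manual = (ages - ages.mean()) / ages.std()

# StandardScaler performs the same computation per column
scaled = StandardScaler().fit_transform(ages)

print(scaled.ravel())
print(np.allclose(manual, scaled))
```

After scaling, the column has mean 0 and standard deviation 1, which is why features measured in different units become comparable.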

### Min-Max Scaling

In [None]:
minmax_scaler = MinMaxScaler()
df_minmax_scaled = df.copy()

df_minmax_scaled[numerical_cols] = minmax_scaler.fit_transform(df[numerical_cols])

#### When to Use Min-Max Scaling:

* Data is not normally distributed
* Bounded values are required (0–1)
* Distance-based or gradient-based algorithms such as KNN and Neural Networks
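
To make the bounded range concrete, here is a small sketch that computes (x - min) / (max - min) by hand and compares it with `MinMaxScaler`. The values are made up for illustration and include one extreme entry on purpose.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values with one extreme entry (100)
hours = np.array([[10.0], [20.0], [40.0], [50.0], [100.0]])

# Manual min-max scaling: shift by the minimum, divide by the range
manual = (hours - hours.min()) / (hours.max() - hours.min())

# MinMaxScaler defaults to the range [0, 1]
scaled = MinMaxScaler().fit_transform(hours)

print(scaled.ravel())
```

Because the range is set by the minimum and maximum, a single extreme value squeezes the remaining values into a narrow part of the interval.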

## 2. Encoding Techniques

In [None]:
# Identify categorical columns (excluding target variable)
categorical_cols = [col for col in df.select_dtypes(include="object").columns if col != 'income']

In [None]:
# Split based on cardinality
low_cardinality_cols = [col for col in categorical_cols if df[col].nunique() < 5]
high_cardinality_cols = [col for col in categorical_cols if df[col].nunique() >= 5]

### One-Hot Encoding (Low Cardinality)

In [None]:
# One-hot encode the low-cardinality features; drop_first removes one redundant column per feature
df_onehot = pd.get_dummies(df, columns=low_cardinality_cols, drop_first=True)

#### Pros:
* No ordinal assumption
* Works well with linear models

#### Cons:
* Increases dimensionality (one new column per category)
* Produces sparse, mostly-zero columns when a feature has many categories
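
The growth in dimensionality can be seen on a toy frame. This is a minimal sketch with made-up rows; the column names `sex` and `race` are borrowed from the Adult dataset only as labels.

```python
import pandas as pd

# Made-up rows: one feature with 2 categories, one with 3
toy = pd.DataFrame(
    {
        "sex": ["Male", "Female", "Female"],
        "race": ["White", "Black", "Other"],
    }
)

# One column per category
full = pd.get_dummies(toy, columns=["sex", "race"])

# drop_first=True drops the first (alphabetically sorted) category of each feature
reduced = pd.get_dummies(toy, columns=["sex", "race"], drop_first=True)

print(full.shape[1], reduced.shape[1])
print(list(reduced.columns))
```

Two original columns become five indicator columns, or three with `drop_first=True`; the dropped category is implied when all remaining indicators for that feature are zero.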

### Label Encoding (High Cardinality)

In [None]:
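# Label Encoding for high-cardinality features
# Keep one fitted encoder per column so the codes can be mapped back to labels later
label_encoders = {}
for col in high_cardinality_cols:
    label_encoders[col] = LabelEncoder()
    df_onehot[col] = label_encoders[col].fit_transform(df_onehot[col])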

In [None]:
# Encode target variable
df_onehot['income'] = df_onehot['income'].map({'<=50K': 0, '>50K': 1})

#### Pros:

* Memory efficient
* Fast and simple

#### Cons:

* Introduces a false ordinal relationship between categories
* Not ideal for linear or distance-based models, which treat the integer codes as magnitudes
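
The false ordering is easy to see on a small example. In this sketch (the education labels are chosen for illustration), `LabelEncoder` assigns codes in sorted label order, which has nothing to do with the level of education.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; codes are assigned in sorted (alphabetical) order
education = ["Masters", "Bachelors", "HS-grad", "Doctorate", "Bachelors"]

encoder = LabelEncoder()
codes = encoder.fit_transform(education)

# Map each label to its integer code
mapping = {
    label: int(code)
    for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))
}
print(mapping)
```

Here `HS-grad` receives a larger code than `Doctorate`, so a model that reads the codes as magnitudes would learn an ordering that does not exist.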

## 3. Feature Engineering

### Creating New Features

#### Feature 1: Full-Time Work Indicator

In [None]:
# 1 if the person works 40 or more hours per week, else 0
df_onehot["is_full_time"] = (df["hours_per_week"] >= 40).astype(int)

* Rationale: people working 40 or more hours per week are expected to be more likely to earn >50K.

#### Feature 2: Capital Gain Indicator

In [None]:
# 1 if any capital gain was reported, else 0
df_onehot["has_capital_gain"] = (df["capital_gain"] > 0).astype(int)

* Rationale: a non-zero capital gain is expected to be associated with higher income.

### Log Transformation of a Skewed Feature

Identifying skewed features:

In [None]:
df_onehot[numerical_cols].skew().sort_values(ascending=False)

In [None]:
# log1p computes log(1 + x), so zero values map to 0 instead of -inf
df_onehot["capital_gain_log"] = np.log1p(df["capital_gain"])

#### Justification:

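* Reduces the right skew caused by a few very large values
* Improves model stability by compressing the range of the feature
* Helps linear models, which are sensitive to extreme values, fit the bulk of the data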
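
The effect on skewness can be checked on synthetic data. The sketch below assumes a stand-in column that is mostly zeros with a few large positive values, which loosely resembles `capital_gain`; the numbers are generated, not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: about 90% zeros, the rest drawn from a log-normal distribution
is_zero = rng.random(5000) < 0.9
large_values = rng.lognormal(mean=8.0, sigma=1.0, size=5000)
gains = pd.Series(np.where(is_zero, 0.0, large_values))

# log1p keeps zeros at zero and compresses the large values
gains_log = np.log1p(gains)

skew_before = gains.skew()
skew_after = gains_log.skew()
print(skew_before, skew_after)
```

The skewness drops but does not vanish: with this many zeros the transformed column is still lopsided, which is one reason a binary indicator such as `has_capital_gain` is a useful companion feature.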

### Isolation Forest

In [None]:
# All columns are numeric after encoding; exclude the target so it does not influence the outlier scores
numerical_features = df_onehot.select_dtypes(exclude="object").drop(columns="income")

# Apply Isolation Forest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.05,
    random_state=42
)

outliers = iso_forest.fit_predict(numerical_features)

# Adding outlier flag to dataset
df_onehot["outlier_flag"] = outliers

In [None]:
# 1 = normal, -1 = outlier
df_onehot["outlier_flag"].value_counts()

Isolation Forest was used to identify anomalous observations in the dataset.
The method isolates observations by randomly selecting a feature and a split value; outliers need fewer splits to isolate, which makes the approach efficient for high-dimensional data.
With `contamination=0.05`, roughly 5% of rows are flagged. The resulting outlier flag can be used to remove outliers or kept as an additional feature.
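
A small synthetic check shows how the flag behaves. This sketch assumes 200 points drawn from a standard normal distribution plus two planted points far from the cluster; the data is made up and only the `IsolationForest` settings mirror the cell above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic cluster around the origin plus two planted, far-away points
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([inliers, planted])

forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)

# fit_predict returns 1 for normal rows and -1 for outliers
flags = forest.fit_predict(X)

# Lower scores mean a row was easier to isolate, i.e. more anomalous
scores = forest.score_samples(X)

print((flags == -1).sum(), flags[-2:])
```

Note that `contamination` fixes the share of flagged rows in advance, so some ordinary points at the edge of the cluster are flagged alongside the planted ones.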

### PPS (Predictive Power Score) Analysis

PPS computation

In [None]:
# Calculate PPS matrix
pps_matrix = pps.matrix(df_onehot)

PPS Heatmap

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(
    pps_matrix.pivot(index="x", columns="y", values="ppscore"),
    cmap="coolwarm",
    linewidths=0.5
)
plt.title("PPS Score Heatmap")
plt.show()

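Predictive Power Score (PPS) measures how well one column predicts another, covering both linear and non-linear relationships. `pps.matrix` scores every ordered pair of columns, so the heatmap includes each feature against the target variable `income`.
Unlike correlation, PPS is asymmetric (the score for x predicting y can differ from the score for y predicting x) and ranges from 0 to 1, which makes it suitable for feature selection.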
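
The asymmetry can be illustrated without `ppscore`. The sketch below is not the library's implementation; it is a simplified version of the idea, assuming a decision tree scored by cross-validated mean absolute error and normalised against a baseline that always predicts the median. The helper `predictive_score` is hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def predictive_score(feature, target):
    """Hypothetical helper: 1 - MAE_model / MAE_baseline, clipped at 0."""
    model = DecisionTreeRegressor(random_state=0)
    predictions = cross_val_predict(model, feature.reshape(-1, 1), target, cv=4)
    mae_model = mean_absolute_error(target, predictions)
    # Baseline: always predict the median of the target
    baseline = np.full_like(target, np.median(target))
    mae_baseline = mean_absolute_error(target, baseline)
    return max(0.0, 1.0 - mae_model / mae_baseline)


rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=1000)
y = x ** 2  # y is fully determined by x, but x is not determined by y

pearson_r = np.corrcoef(x, y)[0, 1]
score_x_to_y = predictive_score(x, y)
score_y_to_x = predictive_score(y, x)

print(pearson_r, score_x_to_y, score_y_to_x)
```

Pearson correlation is close to zero here even though x determines y exactly, while the score for x predicting y is high and the score for y predicting x is near zero, because the sign of x cannot be recovered from y.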