### Objective:
**This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection techniques**, 
**which are crucial for building efficient machine learning models. You will work with a provided dataset to apply various techniques** 
**such as scaling, encoding, and feature selection methods including isolation forest and PPS score analysis.**


In [46]:
# Load essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [47]:
# LOAD & INSPECT DATA
data_raw = pd.read_csv("/Users/tonystark/Desktop/Data_Science/CSV_files/adult_with_headers (1).csv")   # load dataset

In [48]:
data_raw.head()                                        # preview entries

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [49]:
data_raw.info()                                        # structure and datatypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [50]:
data_raw.describe(include='all').T                     # summary of categorical and numeric fields

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
age,32561.0,,,,38.581647,13.640433,17.0,28.0,37.0,48.0,90.0
workclass,32561.0,9.0,Private,22696.0,,,,,,,
fnlwgt,32561.0,,,,189778.366512,105549.977697,12285.0,117827.0,178356.0,237051.0,1484705.0
education,32561.0,16.0,HS-grad,10501.0,,,,,,,
education_num,32561.0,,,,10.080679,2.57272,1.0,9.0,10.0,12.0,16.0
marital_status,32561.0,7.0,Married-civ-spouse,14976.0,,,,,,,
occupation,32561.0,15.0,Prof-specialty,4140.0,,,,,,,
relationship,32561.0,6.0,Husband,13193.0,,,,,,,
race,32561.0,5.0,White,27816.0,,,,,,,
sex,32561.0,2.0,Male,21790.0,,,,,,,


##### Missing Values Handling
Before modeling, it is important to standardize how missing or invalid entries are treated.
The dataset contains certain placeholders (such as "?") that do not directly represent missing values, so they must be converted into proper NaN values for consistent processing. Once this is done, different strategies are applied to handle missing data depending on the feature type:

- Categorical Columns:
Missing categories are filled with the mode, which preserves the most common group in that column. This avoids introducing unrealistic or rare categories that could distort patterns.

- Numerical Columns:
Missing numeric values are filled with the median, a measure that is stable even when the data contains outliers or is heavily skewed. This prevents extreme values from affecting the imputation process.

Using these two approaches ensures that the dataset remains complete and reliable while staying true to the natural distribution of each feature. This combination is considered a standard and robust practice when preparing structured datasets for machine learning.

In [51]:
# MISSING VALUE HANDLING
# Replace ambiguous "?" markers with NaN so they can be treated consistently
data_raw = data_raw.replace("?", np.nan)

In [52]:
# Identify categorical and numeric groups for selective imputation
cat_features = data_raw.select_dtypes(include='object').columns
num_features = data_raw.select_dtypes(include=np.number).columns

In [53]:
# Fill missing categorical values with most frequent class (preserves majority pattern)
for feature in cat_features:
    data_raw[feature] = data_raw[feature].fillna(data_raw[feature].mode()[0])


In [54]:
# Fill missing numeric values using median (stable even with skewed data)
for feature in num_features:
    data_raw[feature] = data_raw[feature].fillna(data_raw[feature].median())

data_raw.isna().sum()   # verify no remaining missing entries

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

##### PPS Score (Predictive Power Score)
Shows how strong each feature predicts the target.
If PPS library is not installed, fallback to Mutual Information.


#### Scaling Techniques
- Scaling is an important preprocessing step because machine-learning models often assume that all numerical features behave on a similar scale. When features vary widely in magnitude (for example: age ≈ 40 vs. capital_gain ≈ 5000), models may become unstable, biased, or slow to converge. Two commonly used scaling methods are StandardScaler and MinMaxScaler, each suitable for different situations.

##### StandardScaler
- The StandardScaler transforms each numeric feature so that it has:
- - mean = 0
- - standard deviation = 1

This process is called standardization or z-score normalization.
It is most effective when the data roughly follows a bell-shaped (normal-like) distribution.

 ##### How it works (simple example): ######
- If the age column contains values like [25, 30, 35], after applying StandardScaler:

- the middle becomes 0 (centered),

- the others become negative or positive values depending on how far they are from the mean.

This helps the model treat all features with equal importance instead of favoring large-scale features.

##### Best suited for algorithms such as:

- Logistic Regression

- Linear Regression

- Support Vector Machines (SVM)
These models are sensitive to feature magnitudes, so standardization stabilizes their training process.

##### MinMaxScaler
The MinMaxScaler transforms numeric values into a fixed range, usually between 0 and 1.
This is called normalization.

##### How it works (simple example):
If the hours_per_week column ranges from 1 to 99, MinMaxScaler compresses this range to:

- minimum value → 0

- maximum value → 1

- all other values → proportional values between 0 and 1

This keeps the shape of the distribution but reduces the scale.

##### Useful when:

the data is not normally distributed,

features have extreme differences in magnitude,

models rely heavily on distance or gradient behavior.

##### Commonly used with:

K-Nearest Neighbors (KNN)

Neural Networks

Gradient-based models where scale affects learning speed


In [55]:
# SCALING TECHNIQUES
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Create scaler objects for transformation
scale_standard = StandardScaler()
scale_minmax = MinMaxScaler()

In [56]:
# Make two copies for demonstration of scaling impact
data_std = data_raw.copy()
data_mm = data_raw.copy()

In [57]:
# Apply normalization on numeric fields only
data_std[num_features] = scale_standard.fit_transform(data_raw[num_features])
data_mm[num_features] = scale_minmax.fit_transform(data_raw[num_features])
data_std.head()     # inspect scaled output

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,0.030671,State-gov,-1.063611,Bachelors,1.134739,Never-married,Adm-clerical,Not-in-family,White,Male,0.148453,-0.21666,-0.035429,United-States,<=50K
1,0.837109,Self-emp-not-inc,-1.008707,Bachelors,1.134739,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.14592,-0.21666,-2.222153,United-States,<=50K
2,-0.042642,Private,0.245079,HS-grad,-0.42006,Divorced,Handlers-cleaners,Not-in-family,White,Male,-0.14592,-0.21666,-0.035429,United-States,<=50K
3,1.057047,Private,0.425801,11th,-1.197459,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,-0.14592,-0.21666,-0.035429,United-States,<=50K
4,-0.775768,Private,1.408176,Bachelors,1.134739,Married-civ-spouse,Prof-specialty,Wife,Black,Female,-0.14592,-0.21666,-0.035429,Cuba,<=50K


#### Encoding Categorical Variables
Machine learning models cannot directly interpret text categories, so categorical values must be converted into numerical formats. The appropriate encoding method depends on how many unique categories a feature contains and how the model interprets numerical relationships. Two common encoding methods are One-Hot Encoding and Label Encoding, each with their own strengths and limitations.

##### One-Hot Encoding 

One-Hot Encoding creates a separate binary column for each unique category.
For example, if the column marital_status has categories ['Married', 'Single', 'Divorced'], it becomes:
- marital_status_Married → 0 or 1
- marital_status_Single → 0 or 1
- marital_status_Divorced → 0 or 1
This method is best when the number of categories is small (typically fewer than 5).

##### Why it is used:

- It does not impose any artificial order on categories.
- Each category is treated as independent, which is ideal for linear models and tree models.

##### Limitations:

- It can greatly increase the number of columns, especially when a feature has many unique categories.
- Larger feature spaces lead to higher memory usage and slower training.


##### Label Encoding  
Label Encoding assigns a numeric value to each category.
For example, country = ['US', 'Canada', 'India', 'UK', 'Mexico'] might become:

- US → 0
- Canada → 1
- India → 2
- UK → 3
- Mexico → 4

This method is suitable when a feature has many categories (5 or more), because it keeps the feature compact.

##### Advantages:

- Produces a single numeric column, reducing dimensionality.
- Efficient for algorithms that handle categorical integers naturally (e.g., tree-based models).

##### Limitations:

- Introduces a fake numerical order (e.g., “UK > India”), which has no real meaning.
- Linear models may incorrectly interpret the numbers as having importance in magnitude or direction.


In [58]:
# ENCODING CATEGORICAL FEATURES
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Separate predictors and output column
X_base = data_raw.drop(columns=['income'])
y_base = data_raw['income']

# Expanded encoded dataset container
X_trans = X_base.copy()

encoder_ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [59]:
# Apply One-Hot for low-cardinality and Label Encoding for high-cardinality categories
for feature in X_base.columns:
    if X_base[feature].dtype == 'object':
        if X_base[feature].nunique() < 5:
            onehot_out = encoder_ohe.fit_transform(X_base[[feature]])
            onehot_cols = [f"{feature}_{val}" for val in encoder_ohe.categories_[0]]
            onehot_df = pd.DataFrame(onehot_out, columns=onehot_cols, index=X_base.index)
            X_trans = pd.concat([X_trans.drop(columns=[feature]), onehot_df], axis=1)
        else:
            X_trans[feature] = LabelEncoder().fit_transform(X_base[feature])


##### Feature Engineering
Feature engineering helps convert raw data into more meaningful inputs that make patterns easier for a machine-learning model to learn. In this dataset, we create additional features that highlight important signals and reduce skewness in the data.

1. **capital_gain_flag**
This feature marks whether a person has any capital gain at all:
- If capital_gain > 0 → 1
- If capital_gain = 0 → 0

The actual gain amount can vary widely, but simply knowing whether a gain exists often carries strong predictive value. People who report capital gains tend to have different income patterns than those who report none, so this binary flag helps capture that relationship more clearly.

2. **capital_loss_flag**
Similar to the gain flag, this feature indicates whether a person reported any capital loss:
- If capital_loss > 0 → 1
- If capital_loss = 0 → 0

Again, the presence of a loss tells the model more about financial activity than the raw value itself, especially because most individuals have a value of zero in this column. The flag makes the feature more informative and easier to interpret.

3. **Log transform on 'hours-per-week'**
The hours_per_week feature is often right-skewed, meaning most people work around 40 hours, but a few work significantly more.
Skewed features can distort model training, especially for algorithms that assume normally distributed inputs.

#### Applying a log transform:
**hours_log = log(1 + hours_per_week)**

- Compresses large values
- Reduces skew
- Makes the distribution closer to normal
- Helps the model learn smoother patterns
This transformation helps ensure that extremely high working hours do not disproportionately influence the model.


In [60]:
# FEATURE ENGINEERING
# Create binary flags for gain/loss to highlight presence rather than magnitude
X_trans['gain_flag'] = (X_base['capital_gain'] > 0).astype(int)
X_trans['loss_flag'] = (X_base['capital_loss'] > 0).astype(int)
# Log transform for hours-per-week to reduce skew and stabilize distribution
X_trans['hours_log'] = np.log1p(X_base['hours_per_week'])

X_trans.head()    # review engineered dataset


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,sex_ Female,sex_ Male,gain_flag,loss_flag,hours_log
0,39,7,77516,9,13,4,1,1,4,2174,0,40,39,0.0,1.0,1,0,3.713572
1,50,6,83311,9,13,2,4,0,4,0,0,13,39,0.0,1.0,0,0,2.639057
2,38,4,215646,11,9,0,6,1,4,0,0,40,39,0.0,1.0,0,0,3.713572
3,53,4,234721,1,7,2,6,0,2,0,0,40,39,0.0,1.0,0,0,3.713572
4,28,4,338409,9,13,2,10,5,2,0,0,40,5,1.0,0.0,0,0,3.713572


#### Outlier Detection Using Isolation Forest

Isolation Forest is an unsupervised algorithm designed to detect data points that behave very differently from the majority. Instead of trying to model normal behavior directly, it works by randomly splitting the data and observing how quickly a point becomes isolated.

- Points that get isolated in fewer steps are considered outliers

- Points that take more steps to isolate are treated as normal

This method is especially effective on high-dimensional datasets because it does not rely on distance calculations, which often become unreliable as the number of features increases.

In this context, Isolation Forest helps remove individuals with extremely unusual income patterns or feature values that could distort the model. By filtering out these rare but influential records, the remaining dataset becomes more stable and better suited for downstream machine-learning tasks.

In [61]:
# OUTLIER REMOVAL WITH ISOLATION FOREST
from sklearn.ensemble import IsolationForest

# IsolationForest identifies unusual patterns based on isolation depth
iso_detector = IsolationForest(random_state=42)
numeric_subset = X_trans.select_dtypes(include=np.number)
iso_labels = iso_detector.fit_predict(numeric_subset)

# Keep inliers only (1 = normal, -1 = outlier)
valid_mask = iso_labels == 1
X_cleaned = X_trans[valid_mask]
y_cleaned = y_base[valid_mask]

X_cleaned.shape, y_cleaned.shape

((27632, 18), (27632,))

In [62]:
# FEATURE IMPORTANCE USING PPS OR FALLBACK MI
try:
    import ppscore as pps
    pps_result = pps.matrix(pd.concat([X_cleaned, y_cleaned], axis=1))[['x','y','ppscore']]
    pps_result.sort_values(by='ppscore', ascending=False)
except:
    # Mutual Information as fallback when PPS unavailable
    from sklearn.feature_selection import mutual_info_classif
    y_encoded = LabelEncoder().fit_transform(y_cleaned)
    mi_scores = mutual_info_classif(
        X_cleaned.select_dtypes(include=np.number),
        y_encoded
    )
    mi_series = pd.Series(mi_scores, index=X_cleaned.select_dtypes(include=np.number).columns)
    mi_series.sort_values(ascending=False)