# **Data Preprocessing and Feature Engineering in Machine Learning**

**Objective:**

This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection, which are crucial for building efficient machine learning models. You will work with a provided dataset to apply techniques such as scaling and encoding, along with outlier detection using Isolation Forest and Predictive Power Score (PPS) analysis.

**Dataset:**

The provided "Adult" dataset is used to predict whether a person's income exceeds $50K/yr based on census data.

**Tasks:**

1. Data Exploration and Preprocessing:

Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).

Handle missing values as per the best practices (imputation, removal, etc.).

Apply scaling techniques to numerical features:

Standard Scaling

Min-Max Scaling

Discuss the scenarios where each scaling technique is preferred and why.

2. Encoding Techniques:

Apply One-Hot Encoding to categorical variables with fewer than 5 categories.

Use Label Encoding for categorical variables with more than 5 categories.

Discuss the pros and cons of One-Hot Encoding and Label Encoding.

3. Feature Engineering:

Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.

Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.

4. Feature Selection:

Use the Isolation Forest algorithm to identify and remove outliers. Discuss how outliers can affect model performance.

Apply the PPS (Predictive Power Score) to find and discuss the relationships between features. Compare its findings with the correlation matrix.

**1) Data Exploration and Preprocessing:**

 Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv('adult.csv')

# Basic data exploration
print(data.head())
print(data.info())
print(data.describe())
print(data.isnull().sum())


   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country  income  
0          2174             0              40   United-States   <=50

 Handle missing values as per the best practices (imputation, removal, etc.).

In [2]:
# Handle missing values: mode for categorical features, mean for numerical features.
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about.
for column in data.columns:
    if data[column].dtype == 'object':
        data[column] = data[column].fillna(data[column].mode()[0])
    else:
        data[column] = data[column].fillna(data[column].mean())

# Verify missing values are handled
print(data.isnull().sum())


age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
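
One caveat with the check above: `isnull()` only counts true NaN values. Some copies of the Adult data mark unknown entries with a `?` string instead, which would pass this check unnoticed. The following is a minimal sketch, assuming that placeholder is present (the toy frame is made up for illustration and is not the assignment data). It normalizes such tokens to NaN before imputing:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the '?' placeholder is an assumption.
toy = pd.DataFrame({
    "workclass": ["Private", " ?", "Private", "State-gov"],
    "age": [39.0, 50.0, np.nan, 28.0],
})

# Strip stray whitespace in text columns, then mask the placeholder as NaN.
for col in toy.select_dtypes(exclude="number").columns:
    stripped = toy[col].str.strip()
    toy[col] = stripped.mask(stripped == "?")

missing_before = toy.isnull().sum()

# Impute: mean for numerical columns, mode for everything else.
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        toy[col] = toy[col].fillna(toy[col].mode()[0])

missing_after = toy.isnull().sum()
print(missing_before)
print(missing_after)
```

If the real file contains no such placeholder, the masking step is a no-op and the imputation behaves as in the cell above.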


 Apply scaling techniques to numerical features:
-> Standard Scaling
-> Min-Max Scaling

In [3]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Selecting numerical features
numerical_features = data.select_dtypes(include=['int64', 'float64']).columns

# Standard Scaling
scaler_standard = StandardScaler()
data_standard_scaled = data.copy()
data_standard_scaled[numerical_features] = scaler_standard.fit_transform(data[numerical_features])

# Min-Max Scaling
scaler_minmax = MinMaxScaler()
data_minmax_scaled = data.copy()
data_minmax_scaled[numerical_features] = scaler_minmax.fit_transform(data[numerical_features])


 Discuss the scenarios where each scaling technique is preferred and why.

In [4]:
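print("Standard Scaling (zero mean, unit variance) is preferred when the features are roughly normally distributed or when the model benefits from centered inputs, e.g. linear models, SVMs, and PCA.\n\nMin-Max Scaling is useful when the data does not follow a normal distribution and you want to bound the feature values within a specific range (e.g., [0, 1]). It is sensitive to outliers, because the minimum and maximum define the range.")

Standard Scaling (zero mean, unit variance) is preferred when the features are roughly normally distributed or when the model benefits from centered inputs, e.g. linear models, SVMs, and PCA.

Min-Max Scaling is useful when the data does not follow a normal distribution and you want to bound the feature values within a specific range (e.g., [0, 1]). It is sensitive to outliers, because the minimum and maximum define the range.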
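
The difference between the two scalers is easiest to see on a tiny column with one extreme value. The sketch below uses made-up numbers rather than the Adult data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values with one extreme entry, purely for illustration.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(values)  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(values)      # (x - min) / (max - min)

print("standard:", standard.ravel().round(3))
print("min-max: ", minmax.ravel().round(3))
```

With Min-Max Scaling the single extreme value claims the top of the range, so the four ordinary values are squeezed into roughly the lowest 3% of [0, 1]. Standard Scaling does not bound the output, but the extreme value still inflates the mean and standard deviation.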


**2) Encoding Techniques**

 Apply One-Hot Encoding to categorical variables with fewer than 5 categories.

In [5]:
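# Identify categorical features
categorical_features = data.select_dtypes(include=['object']).columns

# Apply One-Hot Encoding only to categorical features with fewer than 5 categories
low_cardinality_features = [col for col in categorical_features if data[col].nunique() < 5]
one_hot_encoded_data = pd.get_dummies(data, columns=low_cardinality_features, drop_first=True)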


 Use Label Encoding for categorical variables with more than 5 categories.

In [6]:
from sklearn.preprocessing import LabelEncoder

# Apply Label Encoding to categorical features with more than 5 categories
for column in categorical_features:
    if data[column].nunique() > 5:
        # Fit a fresh encoder per column so each mapping stays independent
        data[column] = LabelEncoder().fit_transform(data[column])


 Discuss the pros and cons of One-Hot Encoding and Label Encoding.

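-> One-Hot Encoding:

Pros: Represents every category without imposing an ordinal relationship.

Cons: Adds one column per category, so high-cardinality features inflate dimensionality and produce sparse data, which can lead to the curse of dimensionality.

-> Label Encoding:

Pros: Keeps the dataset compact, with a single integer column per feature.

Cons: Imposes an arbitrary ordinal relationship, which can mislead linear and distance-based models when the categories have no natural order.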
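
The trade-off can be seen on a small nominal column. The sketch below uses a made-up `color` column, not a column from the Adult data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up nominal column, for illustration only.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-Hot Encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(colors, columns=["color"])

# Label Encoding: a single integer column; classes are numbered in sorted order.
labels = LabelEncoder().fit_transform(colors["color"])

print(one_hot)
print(labels)
```

Here the integer codes suggest blue < green < red, an ordering that has no meaning for colors. Note also that scikit-learn documents `LabelEncoder` as intended for target labels; for input features, `OrdinalEncoder` is the equivalent designed for that purpose.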

**3) Feature Engineering**

 Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.

In [7]:
# Example of creating new features
data['age_bin'] = pd.cut(data['age'], bins=[0, 30, 60, 90], labels=['Young', 'Middle-aged', 'Senior'])
# The last edge is open-ended so that weekly hours above 80 fall into 'Extreme' instead of becoming NaN
data['hours_per_week_bin'] = pd.cut(data['hours_per_week'], bins=[0, 20, 40, 60, float('inf')], labels=['Part-time', 'Full-time', 'Over-time', 'Extreme'])

# Rationale: these bins can capture distinct age groups and working patterns that might have different income levels.
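
When bin edges are chosen by hand, two `pd.cut` behaviours are worth checking: intervals are closed on the right by default, and values outside the outermost edges become NaN. A small sketch with made-up weekly hours shows both:

```python
import pandas as pd

# Made-up weekly hours, including one value above the last bin edge.
hours = pd.Series([10, 20, 40, 45, 99])

binned = pd.cut(
    hours,
    bins=[0, 20, 40, 60, 80],
    labels=["Part-time", "Full-time", "Over-time", "Extreme"],
)
print(binned)
print("unbinned values:", binned.isna().sum())
```

The value 20 lands in `Part-time` and 40 in `Full-time` because each interval includes its right edge, while 99 is left unbinned. Counting NaN values after binning is a quick way to confirm that the edges cover the full range of the column.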


In [8]:
data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'income', 'age_bin', 'hours_per_week_bin'],
      dtype='object')

 Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.

In [9]:
import numpy as np

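# Log transform a skewed feature (example: 'capital_gain').
# Note: the result is stored in a new column named 'capital-gain' (with a hyphen); the original 'capital_gain' column is kept.
data['capital-gain'] = np.log1p(data['capital_gain'])

# Justification: a log transformation compresses the long right tail, which reduces skewness and brings the distribution closer to normal. log1p is used because it handles zero values, since log1p(0) = 0.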
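
The effect on skewness can be checked numerically. The sketch below uses synthetic log-normal data as a stand-in for a right-skewed feature; it is not the Adult data, and the seed and sample size are arbitrary choices:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), used only for illustration.
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

transformed = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Running the same comparison on the real column is a simple way to justify the transformation. If a feature is dominated by zeros, it will remain so after `log1p`, so the reduction in skewness may be smaller than on this synthetic example.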


**4) Feature Selection**

 Use the Isolation Forest algorithm to identify and remove outliers. Discuss how outliers can affect model performance.

In [10]:
from sklearn.ensemble import IsolationForest

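# Fit Isolation Forest (contamination is the expected share of outliers; random_state makes the result reproducible)
iso = IsolationForest(contamination=0.1, random_state=42)
yhat = iso.fit_predict(data[numerical_features])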

# Select all rows that are not outliers
mask = yhat != -1
data_cleaned = data[mask]

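# Discuss:
print("Outliers can distort fitted parameters and scaling statistics such as the mean, minimum, and maximum, which can skew the model and make it less generalizable.")


Outliers can distort fitted parameters and scaling statistics such as the mean, minimum, and maximum, which can skew the model and make it less generalizable.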
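
To see what the algorithm flags, it helps to run it on data where the outliers are known in advance. The sketch below plants five far-away points in synthetic two-dimensional data; the point locations, the contamination value, and the seeds are all illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 points near the origin plus 5 planted far-away points.
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[10.0, 10.0], [-10.0, 10.0], [10.0, -10.0],
                    [-10.0, -10.0], [0.0, 14.0]])
X = np.vstack([inliers, planted])

# contamination is the expected share of outliers; random_state fixes the trees.
iso = IsolationForest(contamination=0.05, random_state=42)
pred = iso.fit_predict(X)  # 1 = inlier, -1 = outlier

X_clean = X[pred == 1]
n_flagged = int((pred == -1).sum())
print("flagged:", n_flagged, "kept:", len(X_clean))
```

Because `contamination` fixes the share of rows to flag, the model typically marks some ordinary points as well as the planted ones. On real data this means the value should be chosen with care, since a setting such as 0.1 removes about 10% of the rows whether or not they are genuinely anomalous.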


 Apply the PPS (Predictive Power Score) to find and discuss the relationships between features. Compare its findings with the correlation matrix.

In [11]:
%pip install ppscore



In [12]:
import ppscore as pps

# Calculate PPS
pps_matrix = pps.matrix(data)

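# Compare with the correlation matrix (numeric columns only; recent pandas versions raise an error on non-numeric columns otherwise)
corr_matrix = data.corr(numeric_only=True)

# Example: Display PPS matrix
print(pps_matrix)
print(corr_matrix)

# Discussion: PPS is asymmetric and can capture non-linear relationships, including ones that involve categorical features, while Pearson correlation is symmetric and measures only linear relationships between numeric features.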


                x                   y   ppscore            case  \
0             age                 age  1.000000  predict_itself   
1             age           workclass  0.000000      regression   
2             age              fnlwgt  0.000000      regression   
3             age           education  0.000000      regression   
4             age       education_num  0.000000      regression   
..            ...                 ...       ...             ...   
319  capital-gain      native_country  0.000000      regression   
320  capital-gain              income  0.297578  classification   
321  capital-gain             age_bin  0.000000  classification   
322  capital-gain  hours_per_week_bin  0.055303  classification   
323  capital-gain        capital-gain  1.000000  predict_itself   

     is_valid_score               metric  baseline_score   model_score  \
0              True                 None        0.000000      1.000000   
1              True  mean absolute error       
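
The contrast between PPS and correlation can be illustrated without the Adult data. The sketch below builds a target that depends on a feature perfectly but not linearly, then compares Pearson correlation with a hand-rolled PPS-style score. The score follows the general idea behind PPS (a decision tree's cross-validated error relative to a naive median baseline); it is an approximation written for illustration, not the `ppscore` implementation, and the tree depth and fold count are assumptions:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y depends on x perfectly, but not linearly.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2

# Pearson correlation is close to zero because the relationship is symmetric.
pearson = np.corrcoef(x, y)[0, 1]

# PPS-style score: compare a tree's cross-validated MAE with a naive
# median predictor. This mirrors the idea behind PPS, not ppscore's exact code.
tree = DecisionTreeRegressor(max_depth=5, random_state=0)
y_pred = cross_val_predict(tree, x.reshape(-1, 1), y, cv=4)
mae_model = np.mean(np.abs(y - y_pred))
mae_naive = np.mean(np.abs(y - np.median(y)))
pps_like = max(0.0, 1.0 - mae_model / mae_naive)

print(f"Pearson r: {pearson:.3f}, PPS-like score: {pps_like:.3f}")
```

A feature pair with near-zero correlation but a high predictive score, as here, is the kind of relationship the correlation matrix alone would miss.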

