DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING

Objective:

This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection, which are crucial for building efficient machine learning models. You will work with a provided dataset to apply scaling and encoding techniques, outlier detection with Isolation Forest, and Predictive Power Score (PPS) analysis.

Dataset:

The "Adult" dataset is provided. The task is to predict whether a person's income exceeds $50K/yr based on census data.

Tasks:

1. Data Exploration and Preprocessing:

      •	Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).

      •	Handle missing values following best practices (imputation, removal, etc.).

      •	Apply scaling techniques to numerical features:

          •	Standard Scaling
          •	Min-Max Scaling
          •	Discuss the scenarios where each scaling technique is preferred and why.

2. Encoding Techniques:

      •	Apply One-Hot Encoding to categorical variables with fewer than 5 categories.

      •	Use Label Encoding for categorical variables with 5 or more categories.
      
      •	Discuss the pros and cons of One-Hot Encoding and Label Encoding.

3. Feature Engineering:

      •	Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.

      •	Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.

4. Feature Selection:

      •	Use the Isolation Forest algorithm to identify and remove outliers. Discuss how outliers can affect model performance.

      •	Apply the PPS (Predictive Power Score) to find and discuss the relationships between features. Compare its findings with the correlation matrix.


In [None]:
!pip install ppscore



In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler


In [None]:

# Load the dataset
file_path = 'adult_with_headers.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
df.head()


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [None]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None


In [None]:
# Summary statistics and checking for missing values
summary_stats = df.describe(include='all')
missing_values = df.isnull().sum()

summary_stats, missing_values


(                 age workclass        fnlwgt education  education_num  \
 count   32561.000000     32561  3.256100e+04     32561   32561.000000   
 unique           NaN         9           NaN        16            NaN   
 top              NaN   Private           NaN   HS-grad            NaN   
 freq             NaN     22696           NaN     10501            NaN   
 mean       38.581647       NaN  1.897784e+05       NaN      10.080679   
 std        13.640433       NaN  1.055500e+05       NaN       2.572720   
 min        17.000000       NaN  1.228500e+04       NaN       1.000000   
 25%        28.000000       NaN  1.178270e+05       NaN       9.000000   
 50%        37.000000       NaN  1.783560e+05       NaN      10.000000   
 75%        48.000000       NaN  2.370510e+05       NaN      12.000000   
 max        90.000000       NaN  1.484705e+06       NaN      16.000000   
 
              marital_status       occupation relationship    race    sex  \
 count                 32561     
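
The summary above reports no NaN values, yet the Adult data is commonly distributed with missing entries written as a `?` placeholder, often with a leading space. A minimal sketch, using a small made-up frame rather than the real file, of how such placeholders could be surfaced before imputation or removal:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; all values are made up for illustration.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " Private"],
    "occupation": [" Sales", " ?", " Adm-clerical", " ?"],
    "age": [39, 50, 38, 53],
})

# Strip surrounding whitespace from the text columns, then turn '?' into NaN.
text_cols = ["workclass", "occupation"]
cleaned = toy.copy()
for col in text_cols:
    cleaned[col] = cleaned[col].str.strip()
cleaned = cleaned.replace("?", np.nan)

missing_before = toy.isnull().sum()      # placeholders are not counted
missing_after = cleaned.isnull().sum()   # placeholders now count as missing
print(missing_after)
```

Once the placeholders are real NaN values, `dropna()` or an imputer can act on them. Whether to drop or impute depends on how many rows are affected.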

In [None]:
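# Drop rows containing NaN. isnull() reported no NaNs above, so this is a no-op here;
# the Adult data usually marks missing entries with a '?' placeholder instead,
# which dropna() does not catch.
df = df.dropna()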

# Separate numerical columns for scaling
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Apply Standard Scaling and MinMax Scaling
scaler_standard = StandardScaler()
scaler_minmax = MinMaxScaler()

df_standard_scaled = df.copy()
df_minmax_scaled = df.copy()

df_standard_scaled[numerical_cols] = scaler_standard.fit_transform(df[numerical_cols])
df_minmax_scaled[numerical_cols] = scaler_minmax.fit_transform(df[numerical_cols])

# View scaled data
print(df_standard_scaled.head())
print(df_minmax_scaled.head())

        age          workclass    fnlwgt   education  education_num  \
0  0.030671          State-gov -1.063611   Bachelors       1.134739   
1  0.837109   Self-emp-not-inc -1.008707   Bachelors       1.134739   
2 -0.042642            Private  0.245079     HS-grad      -0.420060   
3  1.057047            Private  0.425801        11th      -1.197459   
4 -0.775768            Private  1.408176   Bachelors       1.134739   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country  income  
0      0.148453      -0.21
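
On when each scaler is preferred: standard scaling centers a feature to mean 0 with unit variance and tends to suit models that expect roughly centered inputs (linear models, SVMs, PCA), while min-max scaling maps values into [0, 1] and suits methods that need a bounded range. Min-max scaling is more sensitive to extreme values, because a single outlier defines the range. The sketch below uses made-up numbers and plain NumPy formulas (it assumes the population standard deviation, as scikit-learn's StandardScaler does) to show the effect:

```python
import numpy as np

# Five ordinary values plus one extreme value (made up for illustration).
values = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 10000.0])

# Standard scaling: subtract the mean, divide by the population std (ddof=0).
standard = (values - values.mean()) / values.std()

# Min-max scaling: map the minimum to 0 and the maximum to 1.
minmax = (values - values.min()) / (values.max() - values.min())

print(np.round(standard, 3))
print(np.round(minmax, 3))
```

With the outlier present, the five ordinary values are squeezed into a tiny slice near 0 after min-max scaling, which is one reason to treat outliers before choosing that scaler.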

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Identify categorical columns
categorical_cols = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'income']

# One-Hot Encoding for columns with fewer than 5 categories
ohe_cols = [col for col in categorical_cols if df[col].nunique() < 5]
# scikit-learn >= 1.2 names this parameter 'sparse_output' (formerly 'sparse')
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')
df_ohe = df.copy()
ohe_encoded = pd.DataFrame(
    onehot_encoder.fit_transform(df_ohe[ohe_cols]),
    columns=onehot_encoder.get_feature_names_out(ohe_cols),
    index=df_ohe.index,  # keep row labels aligned for the concat below
)
df_ohe = pd.concat([df_ohe.drop(ohe_cols, axis=1), ohe_encoded], axis=1)

# Label Encoding for columns with 5 or more categories
le_cols = [col for col in categorical_cols if df[col].nunique() >= 5]
df_le = df_ohe.copy()
for col in le_cols:
    # Fit a fresh encoder per column so each column gets its own mapping
    df_le[col] = LabelEncoder().fit_transform(df_le[col])

print(df_le.head()) # View encoded data


   age  workclass  fnlwgt  education  education_num  marital_status  \
0   39          7   77516          9             13               4   
1   50          6   83311          9             13               2   
2   38          4  215646         11              9               0   
3   53          4  234721          1              7               2   
4   28          4  338409          9             13               2   

   occupation  relationship  race  capital_gain  capital_loss  hours_per_week  \
0           1             1     4          2174             0              40   
1           4             0     4             0             0              13   
2           6             1     4             0             0              40   
3           6             0     2             0             0              40   
4          10             5     2             0             0              40   

   native_country  sex_ Male  income_ >50K  
0              39        1.0           0.
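
On the trade-off between the two encodings: label encoding keeps one column per feature but assigns integers by sorted category name, which imposes an order the categories may not have. One-hot encoding avoids that at the cost of one column per category (minus one when a reference level is dropped). A small pure-Python sketch with made-up categories, mirroring conceptually what the encoders above do:

```python
# Made-up category values for illustration.
categories = ["Private", "State-gov", "Private", "Federal-gov"]

# Label encoding: integer codes follow the sorted order of the unique values.
classes = sorted(set(categories))
label_codes = [classes.index(c) for c in categories]

# One-hot encoding with the first class dropped as the reference level.
kept = classes[1:]
one_hot = [[1 if c == k else 0 for k in kept] for c in categories]

print(classes)      # ['Federal-gov', 'Private', 'State-gov']
print(label_codes)  # [1, 2, 1, 0]
print(one_hot)      # [[1, 0], [0, 1], [1, 0], [0, 0]]
```

A distance-based or linear model would read the label codes as "State-gov > Private > Federal-gov", which is an artifact of alphabetical order. Tree-based models are usually less affected by this.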

In [None]:
import numpy as np

# Feature Engineering: creating new features
# Interaction between age and years of education
df_le['age_education_interaction'] = df_le['age'] * df_le['education_num']

# Net capital gain (gain minus loss) per weekly working hour, stored as 'income_per_hour'
df_le['income_per_hour'] = (df_le['capital_gain'] - df_le['capital_loss']) / df_le['hours_per_week']

# Log transformation for skewed features (e.g., 'capital_gain')
df_le['log_capital_gain'] = np.log1p(df_le['capital_gain'])  # log1p to handle zero values

print(df_le[['age_education_interaction', 'income_per_hour', 'log_capital_gain']].head()) # View new features


   age_education_interaction  income_per_hour  log_capital_gain
0                        507            54.35          7.684784
1                        650             0.00          0.000000
2                        342             0.00          0.000000
3                        371             0.00          0.000000
4                        364             0.00          0.000000
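
The log transformation is motivated by how right-skewed `capital_gain` is: most rows are 0 and a few are very large. A minimal standard-library sketch on mostly made-up values (only 2174 comes from the rows shown above) illustrates how `log1p` compresses the range while leaving zeros at zero:

```python
import math

# Mostly made-up values; 2174 is the capital_gain of the first row above.
gains = [0, 0, 0, 0, 2174, 15000, 99999]

# log1p(x) = log(1 + x), so zeros stay at zero instead of becoming -inf.
log_gains = [math.log1p(g) for g in gains]

# Compare how far the largest value sits from 2174 before and after.
raw_ratio = max(gains) / 2174
log_ratio = max(log_gains) / math.log1p(2174)

print([round(v, 6) for v in log_gains])
print(round(raw_ratio, 1), round(log_ratio, 2))
```

After the transformation the largest value is less than twice the transformed 2174, where the raw values differ by a factor of about 46.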


In [None]:
from sklearn.ensemble import IsolationForest
import ppscore as pps

# Isolation Forest to detect and remove outliers
iso_forest = IsolationForest(contamination=0.05, random_state=42)  # Adjust contamination based on data
outliers = iso_forest.fit_predict(df_le[numerical_cols])

# Removing outliers
df_no_outliers = df_le[outliers == 1]

# PPS (Predictive Power Score) to find relationships between features
pps_matrix = pps.matrix(df_no_outliers).pivot(columns='x', index='y', values='ppscore')

# Compare PPS with correlation matrix
correlation_matrix = df_no_outliers.corr()
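
The `contamination` parameter sets the share of rows that Isolation Forest labels as outliers (-1), so 0.05 removes roughly 5% of the data regardless of how many genuine anomalies exist. Outliers matter because they can distort scaling ranges and pull the fit of models that are sensitive to extreme values. A small sketch on synthetic data (the data and the planted outliers are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a Gaussian bulk plus a few points planted far away.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(190, 2))
planted = rng.uniform(low=8.0, high=10.0, size=(10, 2))
X = np.vstack([inliers, planted])

forest = IsolationForest(contamination=0.05, random_state=42)
labels = forest.fit_predict(X)  # 1 = inlier, -1 = outlier

# Keep only the rows labelled as inliers.
n_flagged = int((labels == -1).sum())
X_clean = X[labels == 1]
print(n_flagged, X_clean.shape)
```

Because the flagged share is fixed by `contamination`, it is worth checking which rows are removed rather than assuming they are all errors.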





In [None]:

print(pps_matrix)


x                               age  age_education_interaction  capital_gain  \
y                                                                              
age                        1.000000                   0.701901      0.005200   
age_education_interaction  0.436537                   1.000000      0.009127   
capital_gain               0.000000                   0.000000      1.000000   
capital_loss               0.000000                   0.000000      0.000000   
education                  0.000000                   0.550829      0.000000   
education_num              0.000000                   0.629512      0.000000   
fnlwgt                     0.000000                   0.000000      0.000000   
hours_per_week             0.000000                   0.000000      0.000000   
income_ >50K               0.000000                   0.000000      0.000000   
income_per_hour            0.000000                   0.000000      0.723850   
log_capital_gain           0.000000     

In [None]:
print(correlation_matrix)

                                age  workclass    fnlwgt  education  \
age                        1.000000   0.009988 -0.079026  -0.002812   
workclass                  0.009988   1.000000 -0.017831   0.019409   
fnlwgt                    -0.079026  -0.017831  1.000000  -0.023298   
education                 -0.002812   0.019409 -0.023298   1.000000   
education_num              0.035963   0.046283 -0.039355   0.347690   
marital_status            -0.284393  -0.060436  0.027795  -0.034145   
occupation                -0.017709   0.248579 -0.001048  -0.025483   
relationship              -0.262209  -0.092290  0.008571  -0.010679   
race                       0.027202   0.048704 -0.022577   0.014373   
capital_gain               0.093410   0.020837 -0.012993   0.020707   
capital_loss               0.029835  -0.003454 -0.011045   0.021202   
hours_per_week             0.094233   0.129166 -0.021354   0.053627   
native_country             0.001334  -0.005586 -0.050673   0.062910   
sex_ M
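
One reason the PPS matrix and the correlation matrix can disagree: Pearson correlation measures only linear association and is symmetric, whereas PPS scores how well one column predicts another and is asymmetric (in the matrix printed above, `age` and `age_education_interaction` score 0.70 in one direction and 0.44 in the other). The sketch below does not call ppscore; it only shows, on a made-up quadratic relationship, that a fully predictable target can have zero linear correlation:

```python
import math

# y is fully determined by x, but the relationship is not linear.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

r = pearson(x, y)
print(r)  # 0.0
```

A score based on predictive performance would rate this pair highly in the x to y direction, so low correlation alone is not evidence that a feature is uninformative.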