# DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING

### Objective:
This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection techniques, which are crucial for building efficient machine learning models. You will work with a provided dataset to apply various techniques such as scaling, encoding, and feature selection methods including isolation forest and PPS score analysis.
### Dataset:
The provided "Adult" dataset is used to predict whether a person's income exceeds $50K/yr based on census data.
### Tasks:
1. Data Exploration and Preprocessing:
   + Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).
   + Handle missing values as per the best practices (imputation, removal, etc.).
   + Apply scaling techniques to numerical features:
     + Standard Scaling
     + Min-Max Scaling
   + Discuss the scenarios where each scaling technique is preferred and why.
2. Encoding Techniques:
   + Apply One-Hot Encoding to categorical variables with less than 5 categories.
   + Use Label Encoding for categorical variables with more than 5 categories.
   + Discuss the pros and cons of One-Hot Encoding and Label Encoding.
3. Feature Engineering:
   + Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.
   + Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.
4. Feature Selection:
   + Use the Isolation Forest algorithm to identify and remove outliers. Discuss how outliers can affect model performance.
   + Apply the PPS (Predictive Power Score) to find and discuss the relationships between features. Compare its findings with the correlation matrix.


###### 1. Data Exploration and Preprocessing:
+ Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).
+ Handle missing values as per the best practices (imputation, removal, etc.).
+ Apply scaling techniques to numerical features:
  + Standard Scaling
  + Min-Max Scaling
+ Discuss the scenarios where each scaling technique is preferred and why.

In [1]:
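import pandas as pd

# Load the Adult census data and keep a copy of the raw frame, since `data` is modified in place later
data = pd.read_csv('adult_with_headers.csv')
data1 = data.copy()
data.head()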

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [2]:
# Boolean mask of missing values (True where a cell is NaN)
data.isnull()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
32557,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
32558,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
32559,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [3]:
# Count the missing (NaN) values in each column
data.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

###### No missing values: every column reports 0 NaN entries
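
One caveat: `isnull()` only detects NaN. The original UCI Adult files mark unknown values with a `?` string in some categorical columns, and such placeholders would not be counted above. The sketch below shows one way to check for them on a toy frame. The placeholder string, the helper name `count_placeholders`, and the toy values are assumptions for illustration and are not taken from `adult_with_headers.csv`.

```python
import pandas as pd

def count_placeholders(df, placeholder="?"):
    """Count cells equal to `placeholder` (ignoring surrounding spaces) per non-numeric column."""
    text_cols = df.select_dtypes(exclude=["number"])
    return text_cols.apply(lambda col: (col.str.strip() == placeholder).sum())

# Toy frame (hypothetical values) with "?" used as a missing-value marker
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov", " ?", " Private"],
    "occupation": [" ?", " ?", " Sales"],
})

nan_total = toy.isnull().sum().sum()          # NaN count: the placeholders are not detected
placeholder_counts = count_placeholders(toy)  # per-column count of "?" cells
print(nan_total)
print(placeholder_counts)
```

If the real file does contain such placeholders, they could be turned into NaN first (for example with `data.replace`) and then imputed or dropped.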

In [4]:
# Summary statistics for the numeric columns
data.describe()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [5]:
# Data type of each column
data.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income            object
dtype: object

In [6]:
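# Apply scaling techniques to numerical features:
# Standard Scaling

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Scale only the numeric columns; the result is a NumPy array whose columns follow
# the order age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week
scaled_data = scaler.fit_transform(data.select_dtypes(include=['float64', 'int64']))
scaled_data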

array([[ 0.03067056, -1.06361075,  1.13473876,  0.1484529 , -0.21665953,
        -0.03542945],
       [ 0.83710898, -1.008707  ,  1.13473876, -0.14592048, -0.21665953,
        -2.22215312],
       [-0.04264203,  0.2450785 , -0.42005962, -0.14592048, -0.21665953,
        -0.03542945],
       ...,
       [ 1.42360965, -0.35877741, -0.42005962, -0.14592048, -0.21665953,
        -0.03542945],
       [-1.21564337,  0.11095988, -0.42005962, -0.14592048, -0.21665953,
        -1.65522476],
       [ 0.98373415,  0.92989258, -0.42005962,  1.88842434, -0.21665953,
        -0.03542945]])

In [7]:
# Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
# Map each numeric column to the default [0, 1] range; the result is a NumPy array
min_max_scaled_data = min_max_scaler.fit_transform(data.select_dtypes(include=['float64', 'int64']))
min_max_scaled_data

array([[0.30136986, 0.0443019 , 0.8       , 0.02174022, 0.        ,
        0.39795918],
       [0.45205479, 0.0482376 , 0.8       , 0.        , 0.        ,
        0.12244898],
       [0.28767123, 0.13811345, 0.53333333, 0.        , 0.        ,
        0.39795918],
       ...,
       [0.56164384, 0.09482688, 0.53333333, 0.        , 0.        ,
        0.39795918],
       [0.06849315, 0.12849934, 0.53333333, 0.        , 0.        ,
        0.19387755],
       [0.47945205, 0.18720338, 0.53333333, 0.1502415 , 0.        ,
        0.39795918]])

###### Discuss the scenarios where each scaling technique is preferred and why.
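+ Standard Scaling rescales each feature to a mean of 0 and a standard deviation of 1. It suits algorithms that are sensitive to feature scale, such as regularized linear or logistic regression, SVMs, k-nearest neighbours, and PCA. It does not require normally distributed data, and because it has no fixed output range it is usually less distorted by extreme values than Min-Max Scaling.
+ Min-Max Scaling maps each feature to a fixed range, [0, 1] by default, which is useful when a bounded input is expected, for example in neural networks. It is sensitive to outliers: a single extreme value (such as the capital_gain maximum of 99999) compresses all other values into a narrow band near 0. Tree-based models such as decision trees need neither technique, since their splits are unaffected by monotonic rescaling.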
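
To make the difference concrete, here is a minimal sketch that applies both formulas by hand with NumPy to a toy feature containing one outlier. The values are made up for illustration; the formulas correspond to what `StandardScaler` (which uses the population standard deviation) and `MinMaxScaler` compute per column.

```python
import numpy as np

values = np.array([0.0, 10.0, 20.0, 30.0, 1000.0])  # toy feature; the last value is an outlier

# Standard scaling: subtract the mean, divide by the (population) standard deviation
standardized = (values - values.mean()) / values.std()

# Min-Max scaling: map the minimum to 0 and the maximum to 1
min_max = (values - values.min()) / (values.max() - values.min())

print(standardized.round(3))
print(min_max.round(3))
```

In the Min-Max result the four ordinary values all land at or below 0.03, which is the compression effect caused by the single extreme value.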

###### 2. Encoding Techniques:
+ Apply One-Hot Encoding to categorical variables with less than 5 categories.
+ Use Label Encoding for categorical variables with more than 5 categories.
+ Discuss the pros and cons of One-Hot Encoding and Label Encoding.

In [8]:
# Apply One-Hot Encoding to categorical variables with fewer than 5 categories.
from sklearn.preprocessing import OneHotEncoder

# Count unique values per object (string) column and keep those with < 5 categories.
# Note: this also selects the target column `income` if it has fewer than 5 classes.
categorical_columns = data.select_dtypes(include=['object']).nunique()
columns_to_encode = categorical_columns[categorical_columns < 5].index

# fit_transform returns a sparse matrix, so convert it to a dense array for the DataFrame
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(data[columns_to_encode])
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(columns_to_encode))
# Replace the original columns with their one-hot versions (join aligns on the row index)
df_encoded = data.drop(columns=columns_to_encode).join(encoded_df)

# Print the result
print(df_encoded.head())


   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race  \
0        Never-married        Adm-clerical   Not-in-family   White   
1   Married-civ-spouse     Exec-managerial         Husband   White   
2             Divorced   Handlers-cleaners   Not-in-family   White   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black   
4   Married-civ-spouse      Prof-specialty            Wife   Black   

   capital_gain  capital_loss  hours_per_week  native_country  sex_ Female  \
0          2174             0              40   United-States          0.0   
1             0             0         

In [9]:
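# Use Label Encoding for categorical variables with more than 5 categories.
from sklearn.preprocessing import LabelEncoder

# Note: the strict `> 5` here and `< 5` above skip any column with exactly 5 categories
# (e.g. `race`, which is still a string column in the output below).
columns_to_label_encode = categorical_columns[categorical_columns > 5].index

# The encoder is refit for every column; codes follow the sorted order of the
# category strings and overwrite the original column in `data`.
label_encoder = LabelEncoder()
for col in columns_to_label_encode:
    data[col] = label_encoder.fit_transform(data[col])

print(data.head())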

   age  workclass  fnlwgt  education  education_num  marital_status  \
0   39          7   77516          9             13               4   
1   50          6   83311          9             13               2   
2   38          4  215646         11              9               0   
3   53          4  234721          1              7               2   
4   28          4  338409          9             13               2   

   occupation  relationship    race      sex  capital_gain  capital_loss  \
0           1             1   White     Male          2174             0   
1           4             0   White     Male             0             0   
2           6             1   White     Male             0             0   
3           6             0   Black     Male             0             0   
4          10             5   Black   Female             0             0   

   hours_per_week  native_country  income  
0              40              39   <=50K  
1              13           

###### Discuss the pros and cons of One-Hot Encoding and Label Encoding.
+ One-Hot Encoding is useful for categorical variables with no inherent order, as it creates binary columns for each category, preventing any assumptions about relationships between them. However, it can increase the dataset's dimensionality, especially with high-cardinality features, and lead to sparse data, which may be inefficient for some models.

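+ Label Encoding is more compact: it replaces each category with an integer in a single column, which makes it suitable for ordinal features with a natural order. However, scikit-learn's LabelEncoder assigns codes by the sorted order of the category names, not by any meaningful rank. Applied to nominal data (such as occupation or native_country here), the integers imply an ordering that linear and distance-based models can misread; tree-based models are generally less affected.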
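
The contrast can be seen on a toy column. This is a minimal sketch that uses pandas (`pd.get_dummies` and `pd.Categorical`) rather than the scikit-learn encoders from the cells above; the `color` column and its values are made up for illustration.

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One-hot: one 0/1 column per category, no order implied
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)

# Label-style codes: one integer per category, assigned in sorted order
# (blue=0, green=1, red=2), which implies an order the colours do not have
codes = pd.Categorical(colors).codes

print(one_hot)
print(list(codes))
```

One-hot encoding turns one column into three here, while the integer codes stay in a single column but suggest that red is "greater than" blue.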

###### 3. Feature Engineering:
+ Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.
+ Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.

In [10]:
data.head(2)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,7,77516,9,13,4,1,1,White,Male,2174,0,40,39,<=50K
1,50,6,83311,9,13,2,4,0,White,Male,0,0,13,39,<=50K


In [11]:
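# Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.
def age_group(age):
    """Bucket age into three bands: up to 25, 26 to 50, and over 50."""
    if age <= 25:
        return 'Young'
    elif age <= 50:
        return 'Middle-aged'
    else:
        return 'Senior'

# Feature 1: coarse age band (categorical)
data['age_group'] = data['age'].apply(age_group)
# Feature 2: capital gain relative to capital loss; the +1 avoids division by zero
data['capital_gain_loss_ratio'] = data['capital_gain'] / (data['capital_loss'] + 1)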

In [12]:
data.head(2)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,age_group,capital_gain_loss_ratio
0,39,7,77516,9,13,4,1,1,White,Male,2174,0,40,39,<=50K,Middle-aged,2174.0
1,50,6,83311,9,13,2,4,0,White,Male,0,0,13,39,<=50K,Middle-aged,0.0
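
The `age_group` bucketing can also be written without a row-by-row `apply`. The following is a sketch using `pd.cut`, assuming the same cut points (25 and 50) as the `age_group` function; the toy ages are chosen to include the boundary values.

```python
import numpy as np
import pandas as pd

ages = pd.Series([17, 25, 26, 50, 51, 90])  # toy ages, including the boundary values

# Right-closed bins: (-inf, 25], (25, 50], (50, inf)
age_band = pd.cut(
    ages,
    bins=[-np.inf, 25, 50, np.inf],
    labels=["Young", "Middle-aged", "Senior"],
)

print(age_band.tolist())
```

Unlike the string output of `apply`, `pd.cut` returns an ordered categorical column, which keeps the band order available for later encoding.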


In [13]:
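# Apply a transformation to at least one skewed numerical feature and justify your choice.
import numpy as np
# Treat a skewness above 1 as strongly right-skewed; log1p(x) = log(1 + x) is defined at x = 0
if data['capital_gain'].skew() > 1:
    data['log_capital_gain'] = np.log1p(data['capital_gain'])
else:
    # Not skewed enough: keep the original values under the new column name
    data['log_capital_gain'] = data['capital_gain']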

# Display the new features
print(data[['age_group', 'capital_gain_loss_ratio', 'log_capital_gain']].head())


     age_group  capital_gain_loss_ratio  log_capital_gain
0  Middle-aged                   2174.0          7.684784
1  Middle-aged                      0.0          0.000000
2  Middle-aged                      0.0          0.000000
3       Senior                      0.0          0.000000
4  Middle-aged                      0.0          0.000000


The log transformation was applied to capital_gain because it is strongly right-skewed: the summary statistics show a 75th percentile of 0 but a maximum of 99999. `np.log1p` is used instead of `np.log` because it is defined at 0. The transformation compresses the long right tail and reduces the influence of extreme values, although the feature remains zero-inflated rather than becoming normally distributed.
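
The effect on skewness can be checked directly. This is a minimal sketch on a toy series that is mostly zeros with two large values; the numbers are illustrative and only loosely mimic capital_gain.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical): eight zeros and two large gains
gains = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 2174, 99999], dtype=float)
log_gains = np.log1p(gains)  # log(1 + x), so zeros stay at 0

skew_before = gains.skew()
skew_after = log_gains.skew()
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

The skewness drops but stays positive, because the block of zeros is untouched by the transformation.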

###### 4. Feature Selection:
+ Use the Isolation Forest algorithm to identify and remove outliers. Discuss how outliers can affect model performance.
+ Apply the PPS (Predictive Power Score) to find and discuss the relationships between features. Compare its findings with the correlation matrix.

In [14]:
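# Use the Isolation Forest algorithm to identify and remove outliers.
from sklearn.ensemble import IsolationForest
# Isolation Forest needs numeric input, so one-hot encode the remaining string columns
# (this includes the target column `income`)
data_encoded = pd.get_dummies(data)

In [15]:
# contamination=0.04 sets the threshold so that about 4% of the rows are flagged as outliers
clf = IsolationForest(random_state=10,contamination = 0.04)
clf.fit(data_encoded)

In [16]:
# predict returns 1 for inliers and -1 for outliers
y_pred_outlier = clf.predict(data_encoded)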

In [17]:
y_pred_outlier

array([ 1,  1,  1, ...,  1,  1, -1])
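
The cells above flag outliers but do not drop them. A minimal sketch of the removal step follows, on synthetic data with one planted extreme point; the frame `toy` and its columns are assumptions for illustration, while the `IsolationForest` settings mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic two-feature data (hypothetical) with one planted extreme row
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(0, 1, 200), "x2": rng.normal(0, 1, 200)})
toy.loc[0, ["x1", "x2"]] = [15.0, 15.0]

iso = IsolationForest(random_state=10, contamination=0.04)
labels = iso.fit_predict(toy)  # 1 = inlier, -1 = outlier

# Keep only the rows labelled as inliers
inlier_mask = labels == 1
toy_clean = toy[inlier_mask].reset_index(drop=True)

print(f"rows before: {len(toy)}, after: {len(toy_clean)}")
```

Applied to the notebook's variables, the equivalent filter would be `data[y_pred_outlier == 1]`, ideally stored under a new name so the original frame stays available.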

###### Discuss how outliers can affect model performance.
Outliers can mislead a model during training and cause inaccurate predictions. Models that minimize squared error or rely on distances (for example linear regression or k-nearest neighbours) are especially sensitive, because a few extreme points can dominate the fit. Outliers can also increase variance, so the model overfits to them and generalizes poorly to new data.
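
As a small illustration of this sensitivity, the sketch below fits a least-squares line to toy data with and without a single extreme point. The data are made up; `np.polyfit` stands in for any squared-error model.

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = 2.0 * x              # exact linear relationship with slope 2
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0          # replace one point with an extreme value

# Degree-1 least-squares fit; the first coefficient is the slope
slope_clean = np.polyfit(x, y_clean, deg=1)[0]
slope_outlier = np.polyfit(x, y_outlier, deg=1)[0]
print(f"slope without outlier: {slope_clean:.2f}, with outlier: {slope_outlier:.2f}")
```

One changed point out of ten is enough to pull the fitted slope far away from the true value of 2.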

In [18]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,age_group,capital_gain_loss_ratio,log_capital_gain
0,39,7,77516,9,13,4,1,1,White,Male,2174,0,40,39,<=50K,Middle-aged,2174.0,7.684784
1,50,6,83311,9,13,2,4,0,White,Male,0,0,13,39,<=50K,Middle-aged,0.0,0.000000
2,38,4,215646,11,9,0,6,1,White,Male,0,0,40,39,<=50K,Middle-aged,0.0,0.000000
3,53,4,234721,1,7,2,6,0,Black,Male,0,0,40,39,<=50K,Senior,0.0,0.000000
4,28,4,338409,9,13,2,10,5,Black,Female,0,0,40,5,<=50K,Middle-aged,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,7,12,2,13,5,White,Female,0,0,38,39,<=50K,Middle-aged,0.0,0.000000
32557,40,4,154374,11,9,2,7,0,White,Male,0,0,40,39,>50K,Middle-aged,0.0,0.000000
32558,58,4,151910,11,9,6,1,4,White,Female,0,0,40,39,<=50K,Senior,0.0,0.000000
32559,22,4,201490,11,9,4,1,3,White,Male,0,0,20,39,<=50K,Young,0.0,0.000000


In [19]:
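import ppscore as pps

# Score every feature as a single predictor of the target `income`
X = data.drop(columns=['income'])
pps_scores = {col: pps.score(data, col, 'income') for col in X.columns}

# pps.score returns a dict; its 'ppscore' entry is the score between 0 and 1
for feature, score in pps_scores.items():
    print(f"Feature: {feature}, PPS Score: {score['ppscore']}")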


Feature: age, PPS Score: 0.005415334901707637
Feature: workclass, PPS Score: 0.0940557685801341
Feature: fnlwgt, PPS Score: 0
Feature: education, PPS Score: 0.2431351218589835
Feature: education_num, PPS Score: 0.2431351218589835
Feature: marital_status, PPS Score: 0
Feature: occupation, PPS Score: 0.09240967070380073
Feature: relationship, PPS Score: 0.0
Feature: race, PPS Score: 0.0
Feature: sex, PPS Score: 0.0
Feature: capital_gain, PPS Score: 0.2971227681408571
Feature: capital_loss, PPS Score: 0.1417549226945072
Feature: hours_per_week, PPS Score: 0.04727792815029037
Feature: native_country, PPS Score: 0.009409079091742992
Feature: age_group, PPS Score: 0.0
Feature: capital_gain_loss_ratio, PPS Score: 0.2971227681408571
Feature: log_capital_gain, PPS Score: 0.2975778786274411


Based on the PPS scores, the capital-related features have the highest predictive power for income: log_capital_gain, capital_gain, and capital_gain_loss_ratio each score about 0.30, followed by education and education_num (both about 0.24) and capital_loss (about 0.14). The near-identical scores of the three capital-gain features suggest that the engineered ratio and log features add little beyond capital_gain itself, and education and education_num appear to carry the same information. workclass and occupation score about 0.09, while age, native_country, fnlwgt, marital_status, relationship, race, sex, and age_group score at or near 0, indicating that, taken one at a time, they add little predictive power over the baseline in this analysis.
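
A ranked view makes the scores easier to compare. The sketch below sorts the PPS values against `income`, rounded to four decimals from the printed output; the variable names are illustrative.

```python
import pandas as pd

# PPS scores against `income`, rounded from the printed output
pps_vs_income = pd.Series({
    "age": 0.0054, "workclass": 0.0941, "fnlwgt": 0.0, "education": 0.2431,
    "education_num": 0.2431, "marital_status": 0.0, "occupation": 0.0924,
    "relationship": 0.0, "race": 0.0, "sex": 0.0, "capital_gain": 0.2971,
    "capital_loss": 0.1418, "hours_per_week": 0.0473, "native_country": 0.0094,
    "age_group": 0.0, "capital_gain_loss_ratio": 0.2971, "log_capital_gain": 0.2976,
})

# Highest predictive power first
ranked = pps_vs_income.sort_values(ascending=False)
print(ranked.head(6))
```

In the notebook itself the same series could be built from the existing dictionary with `pd.Series({k: v['ppscore'] for k, v in pps_scores.items()})`.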

In [20]:
# Compare its findings with the correlation matrix.
# pps.matrix returns one row per ordered (x, y) column pair; it does not compute correlations
pps.matrix(data)

Unnamed: 0,x,y,ppscore,case,is_valid_score,metric,baseline_score,model_score,model
0,age,age,1.000000,predict_itself,True,,0.000000,1.000000,
1,age,workclass,0.000000,regression,True,mean absolute error,0.742600,0.875681,DecisionTreeRegressor()
2,age,fnlwgt,0.000000,regression,True,mean absolute error,75872.186200,77535.141544,DecisionTreeRegressor()
3,age,education,0.000000,regression,True,mean absolute error,2.759000,2.806164,DecisionTreeRegressor()
4,age,education_num,0.000000,regression,True,mean absolute error,1.853000,1.898306,DecisionTreeRegressor()
...,...,...,...,...,...,...,...,...,...
319,log_capital_gain,native_country,0.000000,regression,True,mean absolute error,2.374800,4.234430,DecisionTreeRegressor()
320,log_capital_gain,income,0.297578,classification,True,weighted F1,0.653115,0.756341,DecisionTreeClassifier()
321,log_capital_gain,age_group,0.022165,classification,True,weighted F1,0.451800,0.463951,DecisionTreeClassifier()
322,log_capital_gain,capital_gain_loss_ratio,0.996114,regression,True,mean absolute error,1093.884000,4.250600,DecisionTreeRegressor()
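
The task asks for a comparison with the correlation matrix, which is not computed in the cell above. A minimal sketch of that side of the comparison follows, on a toy frame (all column names and values are assumptions). Pearson correlation is symmetric and measures linear association only, whereas PPS is asymmetric and can also pick up non-linear relationships, so the two can disagree.

```python
import numpy as np
import pandas as pd

x = np.arange(-5, 6, dtype=float)
toy = pd.DataFrame({
    "x": x,
    "linear": 3 * x + 1,               # perfectly linear in x
    "square": x ** 2,                  # fully determined by x, but not linearly
    "label": ["a", "b"] * 5 + ["a"],   # string column, excluded below
})

# Pearson correlation is only defined for numeric columns
corr = toy.select_dtypes(include="number").corr()
print(corr.round(3))
```

Here `x` and `square` have a correlation of essentially 0 even though one determines the other, which is the kind of relationship a PPS analysis is designed to surface. On the notebook's frame, the analogous call would be `data.select_dtypes(include="number").corr()`.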
