**1: Data Exploration and Preprocessing**


In [2]:
import pandas as pd

# Load the dataset
data = pd.read_csv('adult_with_headers.csv')

# Show basic information
print(data.head())
data.info()  # info() prints its summary directly and returns None
print(data.describe())


   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country  income  
0          2174             0              40   United-States   <=50

**2: Handle Missing Values**

In [3]:
# Check for missing values
print(data.isnull().sum())

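# Remove rows with missing values. The counts printed below are all zero,
# so dropna() leaves the data unchanged here; if values were missing,
# filling with the mean, median, or mode would be an alternative.
data = data.dropna()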


age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
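The zero counts above only cover values that pandas parsed as NaN. The encoded column names later in the notebook (for example `workclass_ Federal-gov`) suggest that the string values carry a leading space, and copies of the Adult dataset commonly mark unknown entries with a `?` string instead of an empty cell. Whether this particular file does so is an assumption. The sketch below uses a small hypothetical frame, not the real data, to show how such placeholders could be made visible to `isnull()`:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; the " ?" placeholder is an assumption about the file.
sample = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " Adm-clerical"],
    "age": [39, 50, 38],
})

# Strip surrounding whitespace from the text columns.
text_cols = ["workclass", "occupation"]
for col in text_cols:
    sample[col] = sample[col].str.strip()

# Turn the "?" placeholder into a real missing value so isnull() can count it.
sample = sample.replace("?", np.nan)

missing_counts = sample.isnull().sum()
print(missing_counts)
```

If the real file contains such placeholders, the same two steps applied to `data` before `dropna()` would change how many rows survive.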


In [4]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

# Separate numerical and categorical columns
numerical_features = data.select_dtypes(include=['int64', 'float64']).columns
categorical_features = data.select_dtypes(include=['object']).columns
print(data[numerical_features])

# Preview: one-hot encode every categorical column (the first level of each is dropped).
# The result is kept in df_encoded for inspection only; data itself is not modified here.
df_encoded = pd.get_dummies(data[categorical_features], drop_first=True)
df_encoded.describe()



       age  fnlwgt  education_num  capital_gain  capital_loss  hours_per_week
0       39   77516             13          2174             0              40
1       50   83311             13             0             0              13
2       38  215646              9             0             0              40
3       53  234721              7             0             0              40
4       28  338409             13             0             0              40
...    ...     ...            ...           ...           ...             ...
32556   27  257302             12             0             0              38
32557   40  154374              9             0             0              40
32558   58  151910              9             0             0              40
32559   22  201490              9             0             0              20
32560   52  287927              9         15024             0              40

[32561 rows x 6 columns]


Unnamed: 0,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 11th,education_ 12th,...,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia,income_ >50K
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,...,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,0.029483,0.064279,0.000215,0.69703,0.034274,0.078038,0.039864,0.00043,0.036086,0.013298,...,0.003501,0.000369,0.002457,0.001566,0.000553,0.000584,0.895857,0.002058,0.000491,0.24081
std,0.169159,0.245254,0.014661,0.459549,0.181935,0.268236,0.195642,0.020731,0.186507,0.11455,...,0.059068,0.019194,0.049507,0.039546,0.023506,0.024149,0.305451,0.045316,0.022162,0.427581
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**3: Apply Scaling Techniques**

Standard Scaling (Z-score normalization): This scales data such that the mean becomes 0 and standard deviation becomes 1.

In [5]:
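from sklearn.preprocessing import StandardScaler

# Standardize the two columns in place: subtract the mean, divide by the standard deviation.
# Note that the Min-Max cells that follow rescale these same columns again.
scaler = StandardScaler()
data[['age', 'hours_per_week']] = scaler.fit_transform(data[['age', 'hours_per_week']])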
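As a check on the definition, the z-score can be computed by hand. This is a minimal sketch using the five ages from the `head()` output rather than the full column. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows:

```python
import pandas as pd

# The five ages shown in the head() output; used only to illustrate the formula.
ages = pd.Series([39, 50, 38, 53, 28], dtype="float64")

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z_scores = (ages - ages.mean()) / ages.std(ddof=0)

print(z_scores.round(3).tolist())
print(z_scores.mean(), z_scores.std(ddof=0))
```

The last line should report a mean of (numerically) zero and a standard deviation of one.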


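Min-Max Scaling: This rescales each feature linearly to the range [0, 1], so that the minimum becomes 0 and the maximum becomes 1.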

In [6]:
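from sklearn.preprocessing import MinMaxScaler

# Rescale the same two columns to the [0, 1] range.
# This overwrites the standardized values written by the previous cell.
min_max_scaler = MinMaxScaler()
data[['age', 'hours_per_week']] = min_max_scaler.fit_transform(data[['age', 'hours_per_week']])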
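Because cell In [5] already replaced these two columns with z-scores, this cell rescales standardized values rather than raw ones. The end result is the same either way. Standardization is a linear transform with a positive scale factor, and min-max scaling gives the same output for any such transform of its input. A small sketch of that property, again using the five ages from the `head()` output:

```python
import numpy as np

def min_max(x):
    """Rescale to [0, 1] with (x - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

ages = np.array([39.0, 50.0, 38.0, 53.0, 28.0])  # ages from the head() output
standardized = (ages - ages.mean()) / ages.std()

# Min-max of the z-scores equals min-max of the raw values.
same = np.allclose(min_max(standardized), min_max(ages))
print(same)
```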


In [7]:
# Scale every numerical column at once
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Select only numerical columns for scaling
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns

# Option 1 (disabled): Standard Scaling
# scaler = StandardScaler()
# data[numerical_cols] = scaler.fit_transform(data[numerical_cols])

# Option 2 (active): Min-Max Scaling, which maps every numerical column to [0, 1]
min_max_scaler = MinMaxScaler()
data[numerical_cols] = min_max_scaler.fit_transform(data[numerical_cols])


Discussion:

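Standard Scaling is preferred when features have different units or ranges and the algorithm is sensitive to feature scale, for example distance-based methods (k-nearest neighbors, SVMs), PCA, and regularized linear models. It does not make the data normally distributed; it only centers each feature at 0 and gives it unit variance.

Min-Max Scaling is preferred when the data needs to lie within a fixed range such as [0, 1], for example in neural networks where bounded input can lead to faster convergence. Because the range is set by the minimum and maximum values, it is sensitive to outliers.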
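The difference between the two scalers shows up most clearly when a column contains an extreme value. The sketch below uses made-up weekly hours (not values from the dataset) and hand-written versions of both formulas:

```python
import numpy as np

def min_max(x):
    """Rescale to [0, 1] with (x - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Center at 0 and scale to unit (population) standard deviation."""
    return (x - x.mean()) / x.std()

# Made-up weekly hours, plus the same values with one extreme entry appended.
hours = np.array([20.0, 35.0, 40.0, 45.0, 60.0])
hours_with_outlier = np.append(hours, 400.0)

print(min_max(hours))
print(min_max(hours_with_outlier))
print(standardize(hours_with_outlier))
```

In the second print, the five ordinary values all fall below about 0.11 because the single maximum sets the scale. Standardized values are also shifted by the outlier, but they are not forced into a fixed interval.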

**Encoding Techniques**

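One-Hot Encoding (intended for categorical variables with fewer than 5 categories; note that the cell below also one-hot encodes marital_status, which has seven categories):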

In [8]:
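# Apply one-hot encoding; drop_first=True drops the first level of each column
# so that the remaining dummy columns are not redundant
data = pd.get_dummies(data, columns=['sex', 'marital_status'], drop_first=True)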


Label Encoding (for categorical variables with 5 or more categories):


In [9]:
from sklearn.preprocessing import LabelEncoder

# Apply label encoding: each occupation string is replaced by an integer code
le = LabelEncoder()
data['occupation'] = le.fit_transform(data['occupation'])


Discussion:

One-Hot Encoding Pros: Works well for non-ordinal categorical features. It imposes no order on the categories, so a model cannot read meaning into the size of a numeric code.

One-Hot Encoding Cons: May lead to a large number of features if there are many categories.

Label Encoding Pros: Efficient for high-cardinality features and does not increase dimensionality.

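Label Encoding Cons: Imposes an arbitrary numeric order on the labels, which may mislead models that treat the codes as magnitudes (e.g., linear models and distance-based methods). Tree-based methods are generally less affected.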
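The rule of thumb used in this section (one-hot for low-cardinality columns, integer codes otherwise) can be applied automatically by counting the categories in each column. The following is a sketch on a hypothetical six-row frame. The column names follow the notebook, the values are illustrative, and pandas category codes stand in for `LabelEncoder`:

```python
import pandas as pd

# Hypothetical sample; values are illustrative, not taken from the dataset.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "occupation": ["Sales", "Tech-support", "Adm-clerical",
                   "Exec-managerial", "Craft-repair", "Prof-specialty"],
})

# Split columns by cardinality, using the threshold of 5 from the text.
low_card = [c for c in df.columns if df[c].nunique() < 5]
high_card = [c for c in df.columns if c not in low_card]

# One-hot encode the low-cardinality columns.
encoded = pd.get_dummies(df, columns=low_card, drop_first=True)

# Replace each high-cardinality column with integer codes (sorted category order).
for col in high_card:
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

The scikit-learn documentation describes `LabelEncoder` as meant for target labels; for input features, `OrdinalEncoder` plays the same role and handles several columns at once.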

**Feature Engineering**

New Features:

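Work-Life Balance Feature: Ratio of hours worked per week to age. This feature is described here but is not computed in the cell below.

Capital Change Feature: Difference between capital gains and losses. Because capital_gain and capital_loss were each Min-Max scaled separately in cell In [7], the difference is in scaled units rather than in the original currency.

Log-Transformed Capital Gain: log(1 + capital_gain), computed with np.log1p to compress large values.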

In [10]:
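import numpy as np

# Net capital change: gains minus losses
data['capital_change'] = data['capital_gain'] - data['capital_loss']

# log(1 + x) transform of capital gains; log1p is well defined at x = 0
data['capital_gain_log'] = np.log1p(data['capital_gain'])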
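The work-life balance ratio listed above does not appear in the cell. One thing to watch if it is added: by this point cell In [7] has Min-Max scaled age, so the youngest person has a scaled age of 0 and the ratio would divide by zero. A minimal sketch, assuming the ratio is taken from the unscaled values, is shown below. It uses the first five raw age and hours_per_week values printed earlier, and `hours_per_age` is a hypothetical column name:

```python
import numpy as np
import pandas as pd

# First five raw (unscaled) rows as printed earlier in the notebook.
raw = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "hours_per_week": [40, 13, 40, 40, 40],
})

# Hypothetical feature name: weekly hours divided by age.
raw["hours_per_age"] = raw["hours_per_week"] / raw["age"]

print(raw)
print(np.isfinite(raw["hours_per_age"]).all())
```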


**Feature Selection**

Isolation Forest for Outlier Detection:

In [11]:
from sklearn.ensemble import IsolationForest

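# Initialize Isolation Forest model
# contamination=0.05 assumes that about 5% of the rows are outliers;
# no random_state is set, so the flagged rows can vary between runs
iso_forest = IsolationForest(contamination=0.05)
# fit_predict returns 1 for inliers and -1 for outliers
outliers = iso_forest.fit_predict(data.select_dtypes(include=['float64', 'int64']))

# Keep only the rows labeled as inliers
data = data[outliers == 1]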
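A small synthetic example may help show what `fit_predict` returns and how the filter works. The data below is made up (100 random points plus one far-away point). `random_state=0` is an added assumption for repeatability, not a setting used in the notebook:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 100 synthetic inliers around the origin plus one obvious outlier (all made up).
points = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), [[1000.0, 1000.0]]])

# random_state fixes the random trees so repeated runs give the same labels.
forest = IsolationForest(contamination=0.05, random_state=0)
labels = forest.fit_predict(points)  # 1 = inlier, -1 = outlier

# Same filtering step as in the notebook: keep rows labeled 1.
kept = points[labels == 1]
print(labels[-1], len(points), len(kept))
```

Because `contamination` fixes the share of rows to flag, some ordinary points are labeled -1 along with the far-away one. The same applies to the 5% removed from `data`.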



**Predictive Power Score (PPS):**

In [12]:
# Install the PPS library if not already installed
# !pip install ppscore

import ppscore as pps

# Calculate PPS for every (x, y) column pair in the dataset.
# The matrix also pairs each column with itself, which always scores 1.0.
pps_matrix = pps.matrix(data)
print(pps_matrix[['x', 'y', 'ppscore']].sort_values(by='ppscore', ascending=False).head(10))




                 x              y  ppscore
0              age            age      1.0
23       workclass      workclass      1.0
69       education      education      1.0
91   education_num      education      1.0
92   education_num  education_num      1.0
115     occupation     occupation      1.0
138   relationship   relationship      1.0
161           race           race      1.0
184   capital_gain   capital_gain      1.0
207   capital_loss   capital_loss      1.0
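Nine of the ten rows above are columns predicting themselves, which says nothing about relationships between features. One way to get a more useful ranking is to drop the rows where x equals y. The sketch below does this on a hand-built frame with the same `x`, `y`, and `ppscore` columns, using four rows copied from the output above, so it runs without `ppscore` installed:

```python
import pandas as pd

# Four rows copied from the PPS output above.
scores = pd.DataFrame({
    "x": ["age", "education", "education_num", "education_num"],
    "y": ["age", "education", "education", "education_num"],
    "ppscore": [1.0, 1.0, 1.0, 1.0],
})

# Keep only pairs where a column predicts a different column.
informative = scores[scores["x"] != scores["y"]].sort_values(
    by="ppscore", ascending=False
)
print(informative)
```

The same mask applied to `pps_matrix` would leave only cross-feature scores. Filtering further to rows where `y` is `income` would rank the features by how well each predicts the target.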




In [13]:
# Drop two text columns, then compute pairwise correlations over the numeric columns
datanew = data.drop(columns=['workclass', 'education'])
correlation_matrix = datanew.corr(numeric_only=True)
print(correlation_matrix)


                                            age    fnlwgt  education_num  \
age                                    1.000000 -0.079906       0.018680   
fnlwgt                                -0.079906  1.000000      -0.045972   
education_num                          0.018680 -0.045972       1.000000   
occupation                            -0.022695 -0.001573       0.103473   
capital_gain                           0.072562 -0.016877       0.080741   
capital_loss                           0.023366 -0.025016       0.071001   
hours_per_week                         0.066473 -0.022499       0.133862   
sex_ Male                              0.080547  0.025860       0.000709   
marital_status_ Married-AF-spouse     -0.011567 -0.000277       0.001846   
marital_status_ Married-civ-spouse     0.310043 -0.031410       0.069222   
marital_status_ Married-spouse-absent  0.021238  0.004925      -0.033212   
marital_status_ Never-married         -0.536059  0.040601      -0.018203   
marital_stat

**Observations:**

Age: Has weak correlations with the other numerical features, including a 0.066 correlation with hours_per_week, which suggests that older people tend to work slightly more hours. Its strongest relationships in the output are with marital status: -0.536 with marital_status_ Never-married and 0.310 with marital_status_ Married-civ-spouse.

Fnlwgt (final weight): Shows weak correlations with all features shown (at most about 0.08 in absolute value), suggesting it has little linear relationship with other variables.

Education_num: Shows a weak positive correlation with hours_per_week (0.134) and a weaker positive relationship with capital_gain (0.081), suggesting that higher education may have some association with working more hours and with higher capital gains.

Capital_gain and Capital_loss: Have very weak correlations with the features shown. Both have a slightly positive relationship with education_num (0.081 and 0.071, respectively).

Hours_per_week: Among the columns visible in the truncated output, its strongest positive linear relationship is with education_num (0.134), indicating that higher education levels may be associated with working more hours.
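Reading values off a wide, wrapped correlation printout is error-prone. One way to list every pair once, ranked by absolute correlation, is sketched below on synthetic data. The column names are borrowed from the notebook but the values are randomly generated, so the numbers it prints say nothing about the Adult dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: hours_per_week is built to depend on education_num.
education_num = rng.normal(10, 2.5, n)
frame = pd.DataFrame({
    "education_num": education_num,
    "hours_per_week": 40 + 2.0 * (education_num - 10) + rng.normal(0, 5, n),
    "fnlwgt": rng.normal(190000, 100000, n),
})

corr = frame.corr()

# Collect each unordered pair once (upper triangle, diagonal excluded).
cols = list(corr.columns)
pairs = pd.Series({
    (a, b): corr.loc[a, b]
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
})

# Rank by absolute correlation, strongest first.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

Applied to `correlation_matrix`, the same steps would give a single sorted list from which the observations can be read directly.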