**1. Data Exploration and Preprocessing**


In [40]:
import pandas as pd

# Load the dataset
data = pd.read_csv('adult_with_headers.csv')

# Show basic information
print(data.head())
print(data.info())
print(data.describe())
print(data.dtypes())

   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country  income  
0          2174             0              40   United-States   <=50

TypeError: 'Series' object is not callable

**2: Handle Missing Values**

In [2]:
# Check for missing values
print(data.isnull().sum())

# Imputation or removal based on analysis
data = data.dropna()  # or fill with mean, mode, median as needed


age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64


In [39]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

# Separate numerical and categorical columns
numerical_features = data.select_dtypes(include=['int64', 'float64']).columns
categorical_features = data.select_dtypes(include=['object']).columns
print(data[numerical_features])
# Encode categorical variables
# Apply One-Hot Encoding for categorical variables with <5 categories, Label Encoding otherwise
df_encoded = pd.get_dummies(data[categorical_features], drop_first=True)
df_encoded.describe()



            age    fnlwgt  education_num  capital_gain  capital_loss  \
0      0.301370  0.044302       0.800000      0.021740           0.0   
1      0.452055  0.048238       0.800000      0.000000           0.0   
2      0.287671  0.138113       0.533333      0.000000           0.0   
3      0.493151  0.151068       0.400000      0.000000           0.0   
4      0.150685  0.221488       0.800000      0.000000           0.0   
...         ...       ...            ...           ...           ...   
32556  0.136986  0.166404       0.733333      0.000000           0.0   
32557  0.315068  0.096500       0.533333      0.000000           0.0   
32558  0.561644  0.094827       0.533333      0.000000           0.0   
32559  0.068493  0.128499       0.533333      0.000000           0.0   
32560  0.479452  0.187203       0.533333      0.150242           0.0   

       hours_per_week  
0            0.397959  
1            0.122449  
2            0.397959  
3            0.397959  
4            0.

ValueError: No objects to concatenate

**3: Apply Scaling Techniques**

Standard Scaling (Z-score normalization): This scales data such that the mean becomes 0 and standard deviation becomes 1.

In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[['age', 'hours_per_week']] = scaler.fit_transform(data[['age', 'hours_per_week']])


Min-Max Scaling

In [4]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
data[['age', 'hours_per_week']] = min_max_scaler.fit_transform(data[['age', 'hours_per_week']])


In [17]:
# Import necessary libraries
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Select only numerical columns for scaling
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns

# # Apply Standard Scaling or Min-Max Scaling to numerical columns only
# scaler = StandardScaler()
# data[numerical_cols] = scaler.fit_transform(data[numerical_cols])

# Alternatively, for Min-Max Scaling:
min_max_scaler = MinMaxScaler()
data[numerical_cols] = min_max_scaler.fit_transform(data[numerical_cols])


Discussion:

Standard Scaling is preferred when features have different units or ranges, or for algorithms that rely on the assumption of normally distributed features (like linear regression).

Min-Max Scaling is preferred when the data needs to be within a fixed range, for example in neural networks where bounded input can lead to faster convergence.

**Encoding Techniques**

One-Hot Encoding (for categorical variables with less than 5 categories):

In [6]:
# Apply one-hot encoding
data = pd.get_dummies(data, columns=['sex', 'marital_status'], drop_first=True)


Label Encoding (for categorical variables with more than 5 categories):


In [7]:
from sklearn.preprocessing import LabelEncoder

# Apply label encoding
le = LabelEncoder()
data['occupation'] = le.fit_transform(data['occupation'])


Discussion:

One-Hot Encoding Pros: Works well for non-ordinal categorical features. Helps machine learning algorithms to interpret categorical features correctly.

One-Hot Encoding Cons: May lead to a large number of features if there are many categories.

Label Encoding Pros: Efficient for high-cardinality features and does not increase dimensionality.

Label Encoding Cons: Assumes an order between labels, which may mislead some models (e.g., tree-based methods).

**Feature Engineering**

New Features:

Work-Life Balance Feature: Ratio of hours worked per week to age.

Capital Change Feature: Difference between capital gains and losses.

In [8]:
import numpy as np
data['capital_change'] = data['capital_gain'] - data['capital_loss']

data['capital_gain_log'] = np.log1p(data['capital_gain'])


**Feature Selection**

Isolation Forest for Outlier Detection:

In [19]:
from sklearn.ensemble import IsolationForest

# Initialize Isolation Forest model
iso_forest = IsolationForest(contamination=0.05)  # Assume 5% contamination
outliers = iso_forest.fit_predict(data.select_dtypes(include=['float64', 'int64']))

# Filter out outliers
data = data[outliers == 1]



**Predictive Power Score (PPS):**

In [1]:
# Install the PPS library if not already installed
# !pip install ppscore

import ppscore as pps

# Calculate PPS matrix for entire dataset
pps_matrix = pps.matrix(data)
print(pps_matrix[['x', 'y', 'ppscore']].sort_values(by='ppscore', ascending=False).head(10))


ModuleNotFoundError: No module named 'ppscore'

In [42]:
datanew=data.drop(columns=['workclass','education'])
correlation_matrix = datanew.corr(numeric_only=True)
print(correlation_matrix)


                     age    fnlwgt  education_num  capital_gain  capital_loss  \
age             1.000000 -0.076646       0.036527      0.077674      0.057775   
fnlwgt         -0.076646  1.000000      -0.043195      0.000432     -0.010252   
education_num   0.036527 -0.043195       1.000000      0.122630      0.079923   
capital_gain    0.077674  0.000432       0.122630      1.000000     -0.031615   
capital_loss    0.057775 -0.010252       0.079923     -0.031615      1.000000   
hours_per_week  0.068756 -0.018768       0.148123      0.078409      0.054256   

                hours_per_week  
age                   0.068756  
fnlwgt               -0.018768  
education_num         0.148123  
capital_gain          0.078409  
capital_loss          0.054256  
hours_per_week        1.000000  


**Observations:**

Age: Has weak positive correlations with most features, including a 0.068 correlation with hours_per_week. This indicates that older people tend to work slightly more hours, although the relationship is weak.

Fnlwgt (final weight): Shows weak correlations with all features, suggesting it has little linear relationship with other variables.

Education_num: Shows a moderate positive correlation with hours_per_week (0.148) and a weaker positive relationship with capital_gain (0.122), suggesting that higher education may have some association with working more hours and earning more in capital gains.

Capital_gain and Capital_loss: Have very weak correlations with other features. Notably, capital_gain has a slightly positive relationship with education_num.

Hours_per_week: Has the strongest positive linear relationship with education_num (0.148), indicating that higher education levels may be associated with working more hours.