# Task 4: Feature Encoding & Scaling
Adult Income Dataset

In [2]:

import pandas as pd
df = pd.read_csv("adult.csv")
df.head()


Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


This loads the Adult Income dataset from adult.csv into a DataFrame and previews the first five rows.

In [3]:

categorical_cols = df.select_dtypes(include='object').columns
numerical_cols = df.select_dtypes(exclude='object').columns
categorical_cols, numerical_cols


(Index(['workclass', 'education', 'marital.status', 'occupation',
        'relationship', 'race', 'sex', 'native.country', 'income'],
       dtype='object'),
 Index(['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss',
        'hours.per.week'],
       dtype='object'))

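This splits the columns by dtype: the object (string) columns are treated as categorical and will be encoded, while the remaining integer columns are treated as numerical and will be scaled.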
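
The same split can be sketched on a small, hypothetical sample (the `sample`, `cat_cols`, and `num_cols` names are illustrative and not part of the notebook). This variant selects on `"number"` instead of `'object'`, which should give the same two groups for this dataset:

```python
import pandas as pd

# Hypothetical three-row sample that mimics a few Adult Income columns.
sample = pd.DataFrame({
    "age": [90, 82, 66],
    "workclass": ["?", "Private", "?"],
    "hours.per.week": [40, 18, 40],
    "income": ["<=50K", "<=50K", "<=50K"],
})

# Everything numeric is treated as numerical, everything else as categorical.
num_cols = sample.select_dtypes(include="number").columns.tolist()
cat_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(cat_cols)  # ['workclass', 'income']
print(num_cols)  # ['age', 'hours.per.week']
```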

In [4]:

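# The dataset marks missing values with "?"; convert them to real missing values.
df.replace("?", pd.NA, inplace=True)
# Drop every row that has at least one missing value.
df.dropna(inplace=True)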


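The dataset marks missing values with "?". This converts those placeholders to real missing values and drops every row that contains one, so "?" is not later encoded as a category of its own.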
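
Before dropping rows, it can help to see how many placeholders each column holds and how many rows are lost. The following is a minimal sketch on a hypothetical four-row sample (the `sample`, `placeholder_counts`, `cleaned`, and `rows_dropped` names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample with "?" placeholders in two categorical columns.
sample = pd.DataFrame({
    "age": [90, 82, 66, 54],
    "workclass": ["?", "Private", "?", "Private"],
    "occupation": ["?", "Exec-managerial", "?", "Machine-op-inspct"],
})

# Count the "?" placeholders per column before anything is dropped.
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)

# Treat "?" as missing and drop any row that contains a missing value.
cleaned = sample.replace("?", pd.NA).dropna()
rows_dropped = len(sample) - len(cleaned)
print(rows_dropped)  # 2
```

Note that the surviving rows keep their original index labels, so the index has gaps after this step.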

In [5]:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['income'] = le.fit_transform(df['income'])


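Label encoding converts the target variable into the numeric form that ML algorithms require. LabelEncoder sorts the class labels before numbering them, so <=50K becomes 0 and >50K becomes 1.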
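
To check which integer each class received, the fitted encoder's `classes_` attribute can be inspected. This is a small sketch on a hypothetical list of labels (`labels`, `encoder`, and `mapping` are illustrative names, not the notebook's variables):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values using the two income classes.
labels = ["<=50K", ">50K", "<=50K", ">50K"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted; a label's position in it is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'<=50K': 0, '>50K': 1}

# inverse_transform recovers the original string labels.
decoded = encoder.inverse_transform(encoded)
```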

In [6]:

df = pd.get_dummies(df, columns=categorical_cols.drop('income'), drop_first=True)


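One-hot encoding replaces each categorical feature with binary indicator columns, so no artificial ordering is imposed on the categories. With drop_first=True, the first category of each feature is dropped and serves as the baseline, which avoids perfectly redundant columns.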
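
The effect of `drop_first=True` is easiest to see on a tiny example. This sketch uses a hypothetical three-row sample (the `sample` and `encoded` names are illustrative):

```python
import pandas as pd

# Hypothetical sample with two categorical features.
sample = pd.DataFrame({
    "age": [82, 54, 41],
    "workclass": ["Private", "Local-gov", "Private"],
    "sex": ["Female", "Male", "Female"],
})

# Categories are sorted, and the first one per feature is dropped:
# "Local-gov" for workclass and "Female" for sex become the baselines.
encoded = pd.get_dummies(sample, columns=["workclass", "sex"], drop_first=True)
print(encoded.columns.tolist())  # ['age', 'workclass_Private', 'sex_Male']
```

A row with every indicator of a feature set to False therefore belongs to that feature's dropped baseline category.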

In [7]:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df.head()


Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week,income,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,...,native.country_Nicaragua,native.country_Peru,native.country_Philippines,native.country_Poland,native.country_Puerto-Rico,native.country_South,native.country_Taiwan,native.country_Trinadad&Tobago,native.country_United-States,native.country_Vietnam
1,3.318236,-0.529652,-0.914016,0.0,8.941294,-2.332443,0,False,True,False,...,False,False,False,False,False,False,False,False,True,False
3,0.904346,-0.452613,-2.798031,0.0,7.181757,-0.437702,0,False,True,False,...,False,False,False,False,False,False,False,False,True,False
4,-0.216389,0.8261,-0.537212,0.0,7.181757,-0.437702,0,False,True,False,...,False,False,False,False,False,False,False,False,True,False
5,-0.819862,0.334393,-0.914016,0.0,6.680134,-0.007079,0,False,True,False,...,False,False,False,False,False,False,False,False,True,False
6,-0.47502,-0.347253,-2.044425,0.0,6.680134,-0.437702,0,False,True,False,...,False,False,False,False,False,False,False,False,True,False


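StandardScaler standardizes each numerical column to zero mean and unit variance, so features measured on large scales (such as fnlwgt) do not dominate scale-sensitive models. Here the scaler is fit on the whole dataset; if the data is later split for evaluation, fitting it on the training portion only avoids leaking test-set statistics.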
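
Standardization computes z = (x - mean) / std for each column, using the population standard deviation. The sketch below checks this against StandardScaler on a hypothetical single column of weekly hours (`hours`, `scaled`, and `manual` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature: hours worked per week.
hours = np.array([[40.0], [18.0], [40.0], [60.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(hours)

# Manual z-score; NumPy's default std (ddof=0) is the population
# standard deviation, which is what StandardScaler uses.
manual = (hours - hours.mean(axis=0)) / hours.std(axis=0)

print(np.allclose(scaled, manual))  # True
```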

In [8]:

df.to_csv("adult_processed.csv", index=False)


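This saves the fully preprocessed dataset for direct use in machine learning models. index=False leaves out the row index, which is no longer contiguous after the rows with missing values were dropped.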
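
A quick round trip shows what `index=False` changes. This sketch writes a hypothetical two-row frame to an in-memory buffer rather than to disk (`processed`, `buffer`, and `reloaded` are illustrative names):

```python
import io
import pandas as pd

# Hypothetical processed frame whose index has gaps, as after dropna().
processed = pd.DataFrame(
    {"age": [0.5, -0.5], "income": [0, 1], "workclass_Private": [True, False]},
    index=[1, 3],
)

# Write without the index, then read the CSV text back.
buffer = io.StringIO()
processed.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# No extra index column appears, and rows are renumbered from 0.
print(reloaded.columns.tolist())  # ['age', 'income', 'workclass_Private']
print(reloaded.index.tolist())    # [0, 1]
```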