**Assignment‑2: Basic Data Pre‑Processing (UCI Dataset)**

**Objective:** Perform core data‑preprocessing operations on a real dataset using Python.

**Dataset used:** Adult Income Dataset

**The dataset was fetched directly from the UCI Machine Learning Repository using the ucimlrepo Python package.**

In [1]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [2]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
X = adult.data.features
y = adult.data.targets

# metadata
print(adult.metadata)

# variable information
print(adult.variables)


{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

**Import Libraries and Combine X and y into One Dataset**

In [3]:
import pandas as pd

data = pd.concat([X, y], axis=1)
print(data.head())


   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country income  
0          2174             0              40  United-States  <=50K  
1             0             0             

**Dataset Overview: Data Types & Summary**

In [4]:
print(data.shape)      # rows & columns
print(data.info())     # data types; info() prints its own report and returns None, hence the "None" line
print(data.describe()) # statistics


(48842, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
None
                age        fnlwgt  education-num  capital-gain  capital-loss

**Check Missing Values**

In [5]:
print(data.isnull().sum())


age                 0
workclass         963
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
income              0
dtype: int64
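
Note that isnull() counts only true NaN entries. Some copies of the Adult data are reported to mark unknown values with a "?" string and to carry target labels with a trailing period (for example ">50K."); whether the copy fetched here does is an assumption to verify, not something the output above shows. A minimal sketch of such a cleanup, using a toy frame and a hypothetical helper named clean_strings:

```python
import pandas as pd

def clean_strings(df, columns, placeholder="?", label_col="income"):
    """Return a copy where placeholder strings become NaN and labels are tidied.

    `clean_strings`, `placeholder` and `label_col` are illustrative names,
    not part of the original notebook.
    """
    out = df.copy()
    for col in columns:
        # Strip stray whitespace so ' ?' and '?' are treated alike
        stripped = out[col].str.strip()
        # Replace the placeholder with a real missing value
        out[col] = stripped.mask(stripped == placeholder)
    if label_col in out.columns:
        # Drop a trailing period so '>50K.' and '>50K' become one class
        out[label_col] = out[label_col].str.rstrip(".")
    return out

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "workclass": ["Private", "?", None, "State-gov"],
    "income": ["<=50K", ">50K.", "<=50K.", ">50K"],
})
cleaned = clean_strings(toy, columns=["workclass", "income"])
print(cleaned.isnull().sum())
print(cleaned["income"].unique())
```

Running data[col].value_counts(dropna=False) on each categorical column of the real dataset is a quick way to see whether this step is needed at all.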


In [6]:
for col in data.columns:
    # Assign the result back rather than calling fillna(..., inplace=True) on data[col]:
    # the chained inplace form raises a pandas FutureWarning and will stop working in pandas 3.0
    data[col] = data[col].fillna(data[col].mode()[0])

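
Only three columns actually contain missing values, so the imputation can be limited to them. The following is a minimal sketch of mode imputation through a single DataFrame.fillna call with a per-column dictionary; the helper name impute_mode and the toy frame are hypothetical, and mode imputation itself is simply the strategy this notebook chose.

```python
import pandas as pd

def impute_mode(df):
    """Fill missing values with each column's mode, touching only columns with gaps."""
    missing_cols = df.columns[df.isnull().any()]
    # Build {column: most frequent value}; mode() ignores NaN by default
    fill_values = {col: df[col].mode()[0] for col in missing_cols}
    # fillna with a dict returns a new frame and leaves the input unchanged
    return df.fillna(value=fill_values)

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "workclass": ["Private", None, "Private", "State-gov"],
    "age": [39, 50, 38, 53],
})
filled = impute_mode(toy)
print(filled)
print(filled.isnull().sum())
```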

**Handle Categorical Data (Encoding)**

In [7]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for col in data.select_dtypes(include='object').columns:
    data[col] = le.fit_transform(data[col])
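
The loop above refits the same LabelEncoder for every column, so after it finishes only the last column's mapping is still available. If the integer codes need to be translated back to category names later, one option is to keep a separate encoder per column. This is a sketch under that assumption; encode_columns and the toy frame are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_columns(df, columns):
    """Label-encode the given columns; return the encoded copy and the fitted encoders."""
    out = df.copy()
    encoders = {}
    for col in columns:
        enc = LabelEncoder()
        out[col] = enc.fit_transform(out[col])
        encoders[col] = enc  # kept so the codes can be decoded later
    return out, encoders

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "income": ["<=50K", ">50K", "<=50K"],
})
encoded, encoders = encode_columns(toy, ["sex", "income"])
print(encoded)
print(list(encoders["income"].classes_))
print(list(encoders["sex"].inverse_transform(encoded["sex"])))
```

LabelEncoder assigns codes in sorted order of the class names, so the resulting integers imply an ordering that nominal features such as occupation do not really have; distance-based or linear models may be sensitive to this.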


**Remove Duplicates**

In [8]:
data = data.drop_duplicates()
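
The cell above does not report how many rows were removed. The split shapes printed later (39050 + 9763 = 48813 rows) imply that 29 of the 48,842 rows were dropped. A small sketch that makes this count explicit, with a hypothetical helper drop_duplicates_report and a toy frame:

```python
import pandas as pd

def drop_duplicates_report(df):
    """Drop exact duplicate rows and return the result with the number removed."""
    n_dupes = int(df.duplicated().sum())  # rows identical to an earlier row
    deduped = df.drop_duplicates()
    return deduped, n_dupes

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "age": [39, 50, 39, 28],
    "education-num": [13, 13, 13, 9],
})
deduped, n_dupes = drop_duplicates_report(toy)
print("Duplicates removed:", n_dupes)
print("Shape after:", deduped.shape)
```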


**Feature Scaling (Standardization)**

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
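# Scale the feature columns only; the encoded target 'income' must remain a class label
num_cols = data.select_dtypes(include=['int64', 'float64']).columns.drop('income')
data[num_cols] = scaler.fit_transform(data[num_cols])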


**Train-Test Split**

In [10]:
from sklearn.model_selection import train_test_split

X = data.drop('income', axis=1)
y = data['income']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


Train shape: (39050, 14)
Test shape: (9763, 14)
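
In the cells above the scaler is fitted on the full dataset before the split, so statistics from the test rows influence the training features. A common alternative is to split first and fit the scaler on the training portion only. The following is a minimal sketch of that ordering, not the notebook's method; split_then_scale, its parameters, and the toy frame are hypothetical, and stratify is added here as an assumption to keep class proportions similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_then_scale(df, target, scale_cols, test_size=0.2, seed=42):
    """Split into train/test, then standardize scale_cols using train statistics only."""
    features = df.drop(columns=[target])
    labels = df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # Cast to float first so the scaled values fit the column dtype
    X_tr = X_tr.astype({c: "float64" for c in scale_cols})
    X_te = X_te.astype({c: "float64" for c in scale_cols})
    scaler = StandardScaler()
    X_tr[scale_cols] = scaler.fit_transform(X_tr[scale_cols])  # learn mean/std on train
    X_te[scale_cols] = scaler.transform(X_te[scale_cols])      # reuse them on test
    return X_tr, X_te, y_tr, y_te, scaler

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44, 36, 55],
    "hours-per-week": [40, 50, 38, 45, 40, 20, 60, 35, 40, 48],
    "income": [0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
})
X_tr, X_te, y_tr, y_te, scaler = split_then_scale(
    toy, target="income", scale_cols=["age", "hours-per-week"]
)
print("Train shape:", X_tr.shape)
print("Test shape:", X_te.shape)
```

With this ordering the training columns have mean 0 and unit variance, while the test columns generally do not, which is expected because they are transformed with the training statistics.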


**Save Cleaned Dataset to CSV File**

In [11]:
data.to_csv("adult_cleaned.csv", index=False)

print("Cleaned dataset saved successfully!")


Cleaned dataset saved successfully!


In [12]:
print(data.head())
print(data.shape)


        age  workclass    fnlwgt  education  education-num  marital-status  \
0  0.025724   2.246829 -1.061993  -0.332505       1.136595        0.916303   
1  0.828125   1.510353 -1.007118  -0.332505       1.136595       -0.410194   
2 -0.047221   0.037402  0.245993   0.183709      -0.419685       -1.736691   
3  1.046961   0.037402  0.426618  -2.397361      -1.197826       -0.410194   
4 -0.776676   0.037402  1.408464  -0.332505       1.136595       -0.410194   

   occupation  relationship      race       sex  capital-gain  capital-loss  \
0   -1.391146     -0.276731  0.392418  0.704208      0.146804     -0.217195   
1   -0.668572     -0.900803  0.392418  0.704208     -0.144847     -0.217195   
2   -0.186857     -0.276731  0.392418  0.704208     -0.144847     -0.217195   
3   -0.186857     -0.900803 -1.971535  0.704208     -0.144847     -0.217195   
4    0.776574      2.219559 -1.971535 -1.420035     -0.144847     -0.217195   

   hours-per-week  native-country    income  
0       -0

**Conclusion**

In this assignment, the Adult Income dataset from the UCI Machine Learning Repository was loaded and preprocessed using Python. Missing values in workclass (963), occupation (966), and native-country (274) were filled with each column's mode, categorical variables were label-encoded, duplicate rows were removed, and the numeric columns were standardized. The result was split 80/20 into a training set (39,050 rows) and a test set (9,763 rows) and saved as adult_cleaned.csv.

These steps turn the raw 48,842-row dataset into a complete, fully numeric table that machine learning models can consume directly. The assignment illustrates why preprocessing is a crucial step in the data science pipeline: a model can only be as reliable as the data it is trained on.