# 🧠 Deep Learning Practical Assignment (Adult Income Dataset)

## 📌 Dataset
We will use the **Adult Income dataset** (also known as the Census Income dataset).  
The task is to predict whether a person earns **more than $50K/year** based on demographic and employment attributes.

---


In [18]:
# Load the Adult dataset from OpenML via scikit-learn
from sklearn.datasets import fetch_openml
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# as_frame=True returns pandas objects; adult.frame holds features and target together
adult = fetch_openml(name="adult", version=2, as_frame=True)
df = adult.frame

print(df.head())
print(df.shape)  # (48842, 15)

# Separate features and target
X = df.drop(columns="class")
y = df["class"]


   age  workclass  fnlwgt     education  education-num      marital-status  \
0   25    Private  226802          11th              7       Never-married   
1   38    Private   89814       HS-grad              9  Married-civ-spouse   
2   28  Local-gov  336951    Assoc-acdm             12  Married-civ-spouse   
3   44    Private  160323  Some-college             10  Married-civ-spouse   
4   18        NaN  103497  Some-college             10       Never-married   

          occupation relationship   race     sex  capital-gain  capital-loss  \
0  Machine-op-inspct    Own-child  Black    Male             0             0   
1    Farming-fishing      Husband  White    Male             0             0   
2    Protective-serv      Husband  White    Male             0             0   
3  Machine-op-inspct      Husband  Black    Male          7688             0   
4                NaN    Own-child  White  Female             0             0   

   hours-per-week native-country  class  
0       

## Part 0: Data Preparation
1. Load the dataset into a DataFrame.
2. Split the data into **training, validation, and test sets**.  
   - Suggested: 70% training, 15% validation, 15% test.
3. Apply any necessary preprocessing:
   - Handle categorical features (encoding).
   - Scale numerical features if needed.
4. After training your models, always report results on:
   - **Training accuracy**
   - **Validation accuracy**
   - **Test accuracy**
5. At the end of the assignment, **compare all methods** across train, validation, and test sets.
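
Steps 4 and 5 ask for the same three accuracies for every model, so a small helper can keep the bookkeeping consistent. The sketch below is illustrative only: the names `accuracy_from_probs` and `report_splits` and the toy arrays are assumptions, not part of the assignment. With a Keras model, the probabilities would typically come from something like `model.predict(X_val).ravel()`.

```python
import numpy as np

def accuracy_from_probs(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

def report_splits(splits, threshold=0.5):
    """`splits` maps a split name to (y_true, y_prob); returns name -> accuracy."""
    return {
        name: accuracy_from_probs(y_true, y_prob, threshold)
        for name, (y_true, y_prob) in splits.items()
    }

# Toy labels and probabilities, made up purely to show the call pattern.
toy_splits = {
    "train": ([1, 0, 1, 1], [0.9, 0.2, 0.7, 0.4]),
    "val":   ([0, 0, 1, 1], [0.1, 0.6, 0.8, 0.3]),
    "test":  ([1, 0, 0, 1], [0.8, 0.3, 0.1, 0.9]),
}
toy_report = report_splits(toy_splits)
for name, acc in toy_report.items():
    print(f"{name:>5}: {acc:.2f}")
```

Because the classes are imbalanced (37109 vs. 11681 rows in the class counts printed in the preprocessing section), always predicting the majority class already scores about 76% accuracy, so that is the baseline each model should beat.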


### EDA

In [19]:
print(df.info())
print('-'*25)
print(df.describe())
print('-'*25)
print(df.describe(include='category'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             48842 non-null  int64   
 1   workclass       46043 non-null  category
 2   fnlwgt          48842 non-null  int64   
 3   education       48842 non-null  category
 4   education-num   48842 non-null  int64   
 5   marital-status  48842 non-null  category
 6   occupation      46033 non-null  category
 7   relationship    48842 non-null  category
 8   race            48842 non-null  category
 9   sex             48842 non-null  category
 10  capital-gain    48842 non-null  int64   
 11  capital-loss    48842 non-null  int64   
 12  hours-per-week  48842 non-null  int64   
 13  native-country  47985 non-null  category
 14  class           48842 non-null  category
dtypes: category(9), int64(6)
memory usage: 2.7 MB
None
-------------------------
                age        

In [20]:
print(f"duplicates:{df.duplicated().sum()}")
df = df.drop_duplicates()
print(f"after_removing_duplicates:{df.duplicated().sum()}")

duplicates:52
after_removing_duplicates:0


In [21]:
print(df.isna().sum())
print("-"*25)
print(f"Total Null:{df.isna().sum().sum()}")

age                  0
workclass         2795
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        2805
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     856
class                0
dtype: int64
-------------------------
Total Null:6456


In [None]:
numerical_data = df.select_dtypes(include=['float64','int64'])
for i in numerical_data:
    plt.figure(figsize=(13,5))
    plt.subplot(1,2,1)
    sns.histplot(data=numerical_data,x=i,kde=True)
    plt.subplot(1,2,2)
    sns.boxplot(data=numerical_data,y=i)
    plt.show()

In [24]:
categorical_data = df.select_dtypes(include='category')
for c in categorical_data.columns:
    print(f"{c}:\n {categorical_data[c].unique()}, nunique: {categorical_data[c].nunique()}")
    print(categorical_data[c].value_counts())
    print("-" * 40)

workclass:
 ['Private', 'Local-gov', NaN, 'Self-emp-not-inc', 'Federal-gov', 'State-gov', 'Self-emp-inc', 'Without-pay', 'Never-worked']
Categories (8, object): ['Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'], nunique: 8
workclass
Private             33860
Self-emp-not-inc     3861
Local-gov            3136
State-gov            1981
Self-emp-inc         1694
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: count, dtype: int64
----------------------------------------
education:
 ['11th', 'HS-grad', 'Assoc-acdm', 'Some-college', '10th', ..., 'Assoc-voc', '9th', '12th', '1st-4th', 'Preschool']
Length: 16
Categories (16, object): ['10th', '11th', '12th', '1st-4th', ..., 'Masters', 'Preschool', 'Prof-school', 'Some-college'], nunique: 16
education
HS-grad         15770
Some-college    10863
Bachelors        8013
Masters          2656
Assoc-voc        2060
11th             1812
Assoc-ac

In [None]:
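for c in categorical_data:
    # Use the same wide figure as the numerical plots so both panels stay readable
    plt.figure(figsize=(13,5))
    plt.subplot(1,2,1)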
    sns.countplot(data=categorical_data , y=c)

    plt.subplot(1,2,2)
    val = categorical_data[c].value_counts()
    plt.pie(val.values,labels=val.index,autopct='%1.2f%%')

    plt.show()

In [26]:
X = df.drop(columns="class")
y = df["class"]

### Preprocessing

In [27]:
y

0        <=50K
1        <=50K
2         >50K
3         >50K
4        <=50K
         ...  
48837    <=50K
48838     >50K
48839    <=50K
48840    <=50K
48841     >50K
Name: class, Length: 48790, dtype: category
Categories (2, object): ['<=50K', '>50K']

In [28]:
# Encode the positive class as 1: the task is to predict income above $50K
y = (y == '>50K').astype(int)
print(y.value_counts())

class
0    37109
1    11681
Name: count, dtype: int64


In [29]:
X = X.drop(columns='native-country')

In [30]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler,OneHotEncoder

In [31]:
categorical_cols = X.select_dtypes(include='category').columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore",drop='first'))
])

numerical_transformer = Pipeline(steps=[
    ("scaler", RobustScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ("categorical",categorical_transformer,categorical_cols),
    ("numerical",numerical_transformer,numerical_cols)
])

In [32]:
from sklearn.model_selection import train_test_split

X_train,X_tmp,y_train,y_tmp = train_test_split(X,y,test_size=0.3,stratify=y,random_state=42)
X_val,X_test,y_val,y_test = train_test_split(X_tmp,y_tmp,test_size=0.5,stratify=y_tmp,random_state=42)

print(f"train:{X_train.shape},{y_train.shape}")
print(f"val:{X_val.shape},{y_val.shape}")
print(f"test:{X_test.shape},{y_test.shape}")

train:(34153, 13),(34153,)
val:(7318, 13),(7318,)
test:(7319, 13),(7319,)


In [33]:
X_train = preprocessor.fit_transform(X_train).toarray()
X_val = preprocessor.transform(X_val).toarray()
X_test = preprocessor.transform(X_test).toarray()

print(f"train:{X_train.shape},{y_train.shape}")
print(f"val:{X_val.shape},{y_val.shape}")
print(f"test:{X_test.shape},{y_test.shape}")

train:(34153, 57),(34153,)
val:(7318, 57),(7318,)
test:(7319, 57),(7319,)



## Part 1: Optimizers
1. Train the same neural network using:
   - Stochastic Gradient Descent (SGD)
   - SGD with Momentum
   - Adam
2. Compare the training and validation accuracy for each optimizer.
3. Which optimizer converges the fastest? Which gives the best generalization?
4. Explain *why* Adam often performs better than plain SGD.
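
To make question 4 concrete, the three update rules can be written out in NumPy and run on a toy problem. This is a minimal sketch under stated assumptions: the quadratic loss, the starting point, the step count, and the function names are all made up for illustration, and the formulas are the textbook update rules rather than Keras internals.

```python
import numpy as np

def loss(w):
    # Toy loss whose curvature is 100x steeper along w[0] than along w[1]
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return np.array([100.0 * w[0], w[1]])

def run_sgd(w, lr=0.01, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_momentum(w, lr=0.01, beta=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w)  # velocity accumulates past gradients
        w = w + v
    return w

def run_adam(w, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=100):
    m = np.zeros_like(w)  # running mean of gradients
    v = np.zeros_like(w)  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w

start = np.array([1.0, 1.0])
for name, fn in [("sgd", run_sgd), ("momentum", run_momentum), ("adam", run_adam)]:
    w_final = fn(start.copy())
    print(f"{name:>8}: w={w_final}, loss={loss(w_final):.6f}")
```

In this toy setup Adam's two coordinates follow almost the same trajectory even though their gradients differ by a factor of 100, because each step is divided by that parameter's own gradient scale. Plain SGD has a single learning rate for every parameter, so a value that suits the steep direction is slow along the flat one.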

---


In [34]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD,Adam

def create_model(input_dim=57):
    # One hidden layer of 10 ReLU units; sigmoid output for binary classification
    model = Sequential([
        Dense(10, activation='relu', input_shape=(input_dim,)),
        Dense(1, activation='sigmoid')
    ])
    return model

SGD

In [None]:
model_sgd = create_model()
optimizer = SGD(learning_rate=0.01)
model_sgd.compile(optimizer=optimizer,loss='binary_crossentropy', metrics=['accuracy'])
history_sgd = model_sgd.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,batch_size=1)
history_sgd 


Epoch 1/10


Epoch 2/10
 1665/34153 [>.............................] - ETA: 38s - loss: 0.3722 - accuracy: 0.8114

Momentum

In [None]:
model_momentum = create_model()
optimizer = SGD(learning_rate=0.01,momentum=0.9)
model_momentum.compile(optimizer=optimizer,loss='binary_crossentropy', metrics=['accuracy'])
history_momentum = model_momentum.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,batch_size=1)
history_momentum

Adam

In [None]:
model_adam = create_model()
optimizer = Adam(learning_rate=0.01)
model_adam.compile(optimizer=optimizer,loss='binary_crossentropy', metrics=['accuracy'])
history_adam = model_adam.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,batch_size=1)
history_adam

## Part 2: Batch Size
1. Train the same model with different batch sizes (e.g., 1, 32, 128, 1024).
2. Compare:
   - Training speed
   - Validation accuracy
   - Test accuracy
   - Generalization ability
3. Which batch size leads to the **noisiest gradient updates**?
4. Which batch size generalizes better and why?
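
Two effects of batch size can be checked numerically before training anything: the number of weight updates per epoch, and how noisy each mini-batch gradient estimate is. The sketch below is illustrative and rests on assumptions: the per-example gradients are synthetic random numbers for a single weight, and the helper names are made up. Only the training-set size (34153 rows) comes from the split above.

```python
import math
import numpy as np

N_TRAIN = 34153  # training rows reported after the 70/15/15 split

def steps_per_epoch(n_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

def gradient_noise(per_example_grads, batch_size, n_draws=500, seed=0):
    """Std. dev. of the mini-batch mean gradient across random mini-batches."""
    rng = np.random.default_rng(seed)
    estimates = [
        rng.choice(per_example_grads, size=batch_size, replace=False).mean()
        for _ in range(n_draws)
    ]
    return float(np.std(estimates))

# Synthetic per-example gradients for one weight (made-up data).
toy_grads = np.random.default_rng(42).normal(loc=0.5, scale=1.0, size=4096)

for b in (1, 32, 128, 1024):
    print(
        f"batch={b:>4}  steps/epoch={steps_per_epoch(N_TRAIN, b):>5}  "
        f"gradient std={gradient_noise(toy_grads, b):.3f}"
    )
```

The spread of the estimate shrinks roughly with the square root of the batch size, so `batch_size=1` gives the noisiest updates and also the most updates per epoch, which is why it is the slowest setting in wall-clock time.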

b = 1

In [None]:
model_b1 = create_model()
optimizer = Adam(learning_rate=0.01)
model_b1.compile(optimizer=optimizer,loss='binary_crossentropy', metrics=['accuracy'])
history_b1 = model_b1.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,batch_size=1)
history_b1

b = 32

In [None]:
model_b2 = create_model()
optimizer = Adam(learning_rate=0.01)
model_b2.compile(optimizer=optimizer,loss='binary_crossentropy', metrics=['accuracy'])
history_b2 = model_b2.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,batch_size=32)
history_b2

b = 128

In [None]:
model_b3 = create_model()
optimizer = Adam(learning_rate=0.01)
model_b3.compile(optimizer=optimizer,loss='binary_crossentropy', metrics=['accuracy'])
history_b3 = model_b3.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,batch_size=128)
history_b3

b = 1024

In [None]:
model_b4 = create_model()
optimizer = Adam(learning_rate=0.01)
model_b4.compile(optimizer=optimizer,loss='binary_crossentropy', metrics=['accuracy'])
history_b4 = model_b4.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,batch_size=1024)
history_b4


## Part 3: Overfitting and Regularization
1. Train a large neural network (many parameters) on the dataset.
2. Observe training vs. validation accuracy.  
   - Do you see signs of overfitting?
3. Apply regularization techniques:
   - **L2 regularization**
   - **Dropout**
4. Compare the validation results before and after regularization.
5. Which regularization method was more effective in reducing overfitting? Why?
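
Both techniques from step 3 are simple enough to write by hand, which may help when answering question 5. The NumPy sketch below is an illustration under assumptions, not the Keras implementation: the function names and toy arrays are made up. The penalty follows the usual definition of an L2 kernel regularizer (coefficient times the sum of squared weights), and the dropout variant shown is the common "inverted" form that rescales the surviving units during training.

```python
import numpy as np

def l2_penalty(weights, lam=0.01):
    """Term added to the loss: lam * sum of squared weights."""
    return lam * float(np.sum(np.square(weights)))

def dropout(activations, rate=0.5, training=True, rng=None):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return activations  # inference uses every unit unchanged
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

toy_weights = np.array([1.0, -2.0, 0.5])  # made-up weights
toy_acts = np.ones((2, 4))                # made-up hidden activations

print("L2 penalty:", l2_penalty(toy_weights))
print("training:\n", dropout(toy_acts, rate=0.5, rng=np.random.default_rng(0)))
print("inference:\n", dropout(toy_acts, training=False))
```

L2 shrinks every weight a little on each update, while dropout forces the network not to rely on any single hidden unit. Which one helps more on this dataset is an empirical question for the comparison in step 4.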

---


In [None]:
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import Dropout

Control

In [None]:
model_control = Sequential([
    Dense(32, activation='relu', input_shape=(57,)),
    Dense(10, activation='relu'),
    Dense(1,activation='sigmoid')
])
optimizer = Adam(learning_rate=0.01)
model_control.compile(optimizer=optimizer,loss='binary_crossentropy', metrics=['accuracy'])
history_control = model_control.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,batch_size=32)
history_control

L2

In [None]:
model_l2 = Sequential([
    Dense(32, activation='relu', input_shape=(57,),kernel_regularizer=l2(0.01)),
    Dense(10, activation='relu',kernel_regularizer=l2(0.01)),
    Dense(1,activation='sigmoid')
])
optimizer = Adam(learning_rate=0.01)
model_l2.compile(optimizer=optimizer,loss='binary_crossentropy', metrics=['accuracy'])
history_l2 = model_l2.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,batch_size=32)

Dropout

In [None]:
model_dropout = Sequential([
    Dense(32, activation='relu', input_shape=(57,)),
    Dropout(0.5),  # drop hidden activations, not half of the raw input features
    Dense(10, activation='relu'),
    Dense(1,activation='sigmoid')
])
optimizer = Adam(learning_rate=0.01)
model_dropout.compile(optimizer=optimizer,loss='binary_crossentropy', metrics=['accuracy'])
history_dropout = model_dropout.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,batch_size=32)

## Part 4: Early Stopping
1. Train the model for many epochs without early stopping.  
   - Plot training, validation, and test curves.
2. Train again with **early stopping** (monitor validation loss).
3. Compare the number of epochs trained and the final validation/test accuracy.
4. Explain how early stopping helps prevent overfitting.
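
The patience rule behind step 2 can be sketched in plain Python. This is a simplified illustration, not Keras's actual `EarlyStopping` implementation: the function name `simulate_early_stopping` and the validation losses are made up, while `patience=3` and `min_delta=0.01` match the values passed to the callback in this notebook.

```python
def simulate_early_stopping(val_losses, patience=3, min_delta=0.01):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    An epoch counts as an improvement only if the loss drops by more than
    `min_delta` below the best value seen so far. Training stops once
    `patience` consecutive epochs pass without such an improvement.
    """
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, current in enumerate(val_losses):
        if current < best - min_delta:
            best, best_epoch, wait = current, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving at first, then drifting upward.
toy_val_losses = [0.50, 0.40, 0.35, 0.345, 0.36, 0.37, 0.38, 0.30]
stop_epoch, best_epoch = simulate_early_stopping(toy_val_losses)
print(f"stopped after epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

Note that the drop from 0.35 to 0.345 does not count as an improvement because it is smaller than `min_delta`, and that the late value of 0.30 is never reached. With `restore_best_weights=True`, the model would be rolled back to the weights from the best epoch rather than keeping the last ones.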

---

In [None]:
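from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import clone_model

In [None]:
# Train a freshly initialized copy; calling fit on model_control would resume from its trained weights
model_earlystop = clone_model(model_control)
model_earlystop.compile(optimizer=Adam(learning_rate=0.01),loss='binary_crossentropy', metrics=['accuracy'])
early_stopping = EarlyStopping(monitor='val_loss', patience=3,min_delta=0.01, restore_best_weights=True)
# A generous epoch budget lets the callback, not the epoch cap, decide when to stop
history_earlystop = model_earlystop.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100,batch_size=32,callbacks=[early_stopping])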

## Part 5: Reflection
1. Summarize what you learned about:
   - The role of optimizers
   - The effect of batch size
   - Regularization methods
   - Early stopping
   - Train/validation/test splits
2. If you had to train a deep learning model on a new tabular dataset, what choices would you make for:
   - Optimizer
   - Batch size
   - Regularization
   - Early stopping
   - Data splitting strategy  
   and why?