
# Task
Create a PyTorch notebook that downloads the German Credit dataset from "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", preprocesses it by mapping the target variable, one-hot encoding categorical features, scaling numerical features, and splitting into training/testing sets. Then, define, train, and evaluate a Multi-Layer Perceptron (MLP) model for binary classification, reporting the final test accuracy.

## Imports and Setup

### Subtask:
Import necessary libraries such as pandas, numpy, sklearn, and torch. Set up a random seed for reproducibility.


**Reasoning**:
The subtask requires importing several libraries and setting a random seed for reproducibility. This code block will handle all these imports and seed settings.



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.optim as optim

# Set random seeds for reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

print("Libraries imported and random seeds set.")

Libraries imported and random seeds set.


## Data Loading

### Subtask:
Download the dataset from the provided URL. Load the space-separated data into a pandas DataFrame and assign the specified column names.


**Reasoning**:
The subtask requires downloading data from a URL and loading it into a pandas DataFrame with specified column names. This code block will define the URL, column names, and then use pandas to read the data with the correct separator.



In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'

column_names = [
    'checking_status', 'duration', 'credit_history', 'purpose', 'credit_amount',
    'savings_status', 'employment_status', 'installment_rate', 'personal_status',
    'other_parties', 'residence_duration', 'property_magnitude', 'age',
    'other_payment_plans', 'housing', 'existing_credits', 'job', 'num_dependents',
    'telephone', 'foreign_worker', 'creditability'
]

df = pd.read_csv(url, sep=' ', header=None, names=column_names)

print("Dataset loaded successfully. Displaying the first 5 rows:")
print(df.head())

Dataset loaded successfully. Displaying the first 5 rows:
  checking_status  duration credit_history purpose  credit_amount  \
0             A11         6            A34     A43           1169   
1             A12        48            A32     A43           5951   
2             A14        12            A34     A46           2096   
3             A11        42            A32     A42           7882   
4             A11        24            A33     A40           4870   

  savings_status employment_status  installment_rate personal_status  \
0            A65               A75                 4             A93   
1            A61               A73                 2             A92   
2            A61               A74                 2             A93   
3            A61               A74                 2             A93   
4            A61               A73                 3             A93   

  other_parties  ...  property_magnitude age  other_payment_plans housing  \
0          A101  

## Data Preprocessing

### Subtask:
Map the target variable (`creditability`) so that 1 (good credit) stays 1 and 2 (bad credit) becomes 0. Identify categorical features and apply one-hot encoding. Split the data into training and testing sets (80/20 ratio, stratified on the target). Apply StandardScaler to the numerical features, fitting only on the training data and then transforming both the training and test sets.


**Reasoning**:
The subtask requires mapping the target variable, separating features and target, identifying categorical and numerical columns, applying one-hot encoding, splitting data into training and testing sets, and then scaling numerical features. I will perform all these preprocessing steps in a single code block.
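
As a rough illustration of what the target mapping and `drop_first=True` one-hot encoding do, here is a minimal plain-Python sketch. The helper `one_hot_drop_first` and the `raw_targets` list are hypothetical and only mimic the behaviour of `pd.get_dummies`; the `checking_status` values are the first five rows printed by `df.head()`, so only three of the column's levels appear here.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode category codes, omitting the first (sorted) level."""
    levels = sorted(set(values))
    kept = levels[1:]  # the dropped level is implied when every indicator is 0
    columns = [f"{prefix}_{level}" for level in kept]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return columns, rows

# checking_status of the first five rows shown by df.head()
checking_status = ["A11", "A12", "A14", "A11", "A11"]
columns, rows = one_hot_drop_first(checking_status, "checking_status")

# Target mapping used in this notebook: 1 stays 1, 2 becomes 0
target_map = {1: 1, 2: 0}
raw_targets = [1, 2, 1]  # hypothetical raw labels
mapped_targets = [target_map[t] for t in raw_targets]

print(columns)
print(rows)
print(mapped_targets)
```

Dropping one level per categorical column avoids a redundant indicator, since a row of all zeros already identifies the omitted level.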



In [None]:
df['creditability'] = df['creditability'].map({1: 1, 2: 0})

X = df.drop('creditability', axis=1)
y = df['creditability']


categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED, stratify=y)

scaler = StandardScaler()

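# Work on explicit copies so the scaled values are assigned to independent DataFrames
# rather than to slices of X (this avoids pandas' SettingWithCopyWarning).
X_train, X_test = X_train.copy(), X_test.copy()

# One-hot encoding leaves the numerical columns untouched; this filter is only a safety check.
numerical_cols_after_ohe = [col for col in numerical_cols if col in X_train.columns]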

X_train[numerical_cols_after_ohe] = scaler.fit_transform(X_train[numerical_cols_after_ohe])
X_test[numerical_cols_after_ohe] = scaler.transform(X_test[numerical_cols_after_ohe])

print("Data preprocessing complete.")
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Data preprocessing complete.
Shape of X_train: (800, 48)
Shape of X_test: (200, 48)
Shape of y_train: (800,)
Shape of y_test: (200,)
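
The point of fitting the scaler on the training split only is that the test split is transformed with training statistics, so no information about the test data leaks into preprocessing. The sketch below shows the idea in plain Python, assuming the population standard deviation that `StandardScaler` uses. The helper names are hypothetical, the training values are the five `duration` entries from `df.head()`, and the test values are made up for illustration.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Return the mean and population standard deviation of the training values."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # constant column: leave the scale unchanged
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply previously fitted statistics to any split."""
    return [(v - mu) / sigma for v in values]

train_duration = [6, 48, 12, 42, 24]  # 'duration' values from df.head()
test_duration = [36, 18]              # hypothetical unseen values

mu, sigma = fit_standardizer(train_duration)            # fit on training data only
train_scaled = standardize(train_duration, mu, sigma)
test_scaled = standardize(test_duration, mu, sigma)     # reuse training statistics

print(f"mean={mu:.1f}, std={sigma:.3f}")
print([round(v, 3) for v in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction; the scaled test values generally do not, which is expected.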


**Reasoning**:
The data has been preprocessed and split. To prepare for PyTorch training, the pandas DataFrames and Series are converted to NumPy arrays via `.values` and then to `float32` PyTorch tensors. The targets are reshaped to `(N, 1)` with `unsqueeze(1)` so that they match the shape of the model's output, which `nn.BCELoss` requires.



In [None]:
X_train_tensor = torch.tensor(X_train.values.astype(np.float32), dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.values.astype(np.float32), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)

print("Data converted to PyTorch tensors.")
print(f"X_train_tensor shape: {X_train_tensor.shape}")
print(f"y_train_tensor shape: {y_train_tensor.shape}")

Data converted to PyTorch tensors.
X_train_tensor shape: torch.Size([800, 48])
y_train_tensor shape: torch.Size([800, 1])


## Convert to PyTorch Tensors and Create DataLoaders

### Subtask:
Create TensorDatasets and DataLoaders for the training and testing sets to enable efficient batch processing during model training.


**Reasoning**:
I need to create TensorDatasets and DataLoaders as per the subtask instructions. This involves combining the feature and target tensors into datasets and then wrapping them in DataLoaders for batch processing.



In [None]:
from torch.utils.data import TensorDataset, DataLoader

# Create TensorDatasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# Create DataLoaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print("TensorDatasets and DataLoaders created.")
print(f"Training DataLoader has {len(train_loader)} batches of size {batch_size}.")
print(f"Testing DataLoader has {len(test_loader)} batches of size {batch_size}.")

TensorDatasets and DataLoaders created.
Training DataLoader has 13 batches of size 64.
Testing DataLoader has 4 batches of size 64.
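
Note that the printed batch counts include a smaller final batch, because neither split is a multiple of 64 and `DataLoader` keeps the remainder by default (`drop_last=False`). A small sketch of the arithmetic, using a hypothetical helper `batch_sizes`:

```python
import math

def batch_sizes(num_samples, batch_size):
    """Sizes of the batches produced when the last partial batch is kept."""
    sizes = [batch_size] * (num_samples // batch_size)
    remainder = num_samples % batch_size
    if remainder:
        sizes.append(remainder)
    # The number of batches is the ceiling of num_samples / batch_size
    assert len(sizes) == math.ceil(num_samples / batch_size)
    return sizes

train_batches = batch_sizes(800, 64)
test_batches = batch_sizes(200, 64)

print(len(train_batches), train_batches[-1])
print(len(test_batches), test_batches[-1])
```

So the training loader yields 12 full batches plus one batch of 32 samples, and the test loader yields 3 full batches plus one batch of 8 samples.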


## Model Definition

### Subtask:
Define a Multi-Layer Perceptron (MLP) model using PyTorch's nn.Module. The model should have at least two hidden layers with ReLU activation functions, and the final output layer should use a Sigmoid activation for binary classification.


**Reasoning**:
I need to define the MLP model class, instantiate it with the correct input and output dimensions, and then print its architecture as per the subtask instructions.



In [None]:
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.fc3 = nn.Linear(hidden_dim2, output_dim)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.sigmoid(self.fc3(x))
        return x

# Instantiate the model
input_dim = X_train_tensor.shape[1]
hidden_dim1 = 64
hidden_dim2 = 32
output_dim = 1

model = MLP(input_dim, hidden_dim1, hidden_dim2, output_dim)

print("MLP model defined.")
print(model)

MLP model defined.
MLP(
  (fc1): Linear(in_features=48, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=32, bias=True)
  (fc3): Linear(in_features=32, out_features=1, bias=True)
  (relu): ReLU()
  (sigmoid): Sigmoid()
)
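
To get a feel for the model's capacity relative to the 800 training samples, the number of trainable parameters can be worked out from the layer shapes printed above. This is a plain-Python sketch; `linear_params` is a hypothetical helper that counts the weights and biases of one `nn.Linear` layer.

```python
def linear_params(in_features, out_features):
    """Weights (in * out) plus one bias per output unit."""
    return in_features * out_features + out_features

# (in_features, out_features) for fc1, fc2, fc3 as printed by print(model)
layer_shapes = [(48, 64), (64, 32), (32, 1)]
per_layer = [linear_params(i, o) for i, o in layer_shapes]
total_params = sum(per_layer)

print(per_layer)
print(total_params)
```

With several thousand parameters and only 800 training rows, the network has enough capacity to memorize much of the training set, which is worth keeping in mind when reading the training loss.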


## Model Training

### Subtask:
Set up the training loop for the defined MLP model. Use nn.BCELoss as the criterion and Adam as the optimizer. Train the model for 50 epochs, printing the loss every 5 epochs to monitor progress.


**Reasoning**:
I need to define the loss function and optimizer, then implement the training loop as specified, iterating through epochs and batches, performing forward/backward passes, and updating weights, while printing the loss periodically.



In [None]:
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 50
print(f"Starting model training for {epochs} epochs...")

for epoch in range(epochs):
    model.train() # Set model to training mode
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    if (epoch + 1) % 5 == 0:
        avg_loss = running_loss / len(train_loader)
        print(f"Epoch {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}")

print("Training complete.")

Starting model training for 50 epochs...
Epoch 5/50, Loss: 0.5496
Epoch 10/50, Loss: 0.4597
Epoch 15/50, Loss: 0.4099
Epoch 20/50, Loss: 0.3585
Epoch 25/50, Loss: 0.3082
Epoch 30/50, Loss: 0.2600
Epoch 35/50, Loss: 0.2147
Epoch 40/50, Loss: 0.1660
Epoch 45/50, Loss: 0.1247
Epoch 50/50, Loss: 0.0971
Training complete.
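
For reference, the quantity that `nn.BCELoss` averages over a batch can be written out by hand. The following is a minimal plain-Python sketch of the binary cross-entropy formula, not PyTorch's implementation; it mirrors the documented behaviour of clamping the log terms at -100, and the example probabilities are made up.

```python
import math

def bce_loss(probabilities, labels):
    """Mean binary cross-entropy: -(y*log(p) + (1-y)*log(1-p)), averaged."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        # Clamp the logs at -100 so p = 0 or p = 1 gives a finite loss
        log_p = max(math.log(p), -100.0) if p > 0 else -100.0
        log_1mp = max(math.log(1 - p), -100.0) if p < 1 else -100.0
        total += -(y * log_p + (1 - y) * log_1mp)
    return total / len(labels)

# Hypothetical sigmoid outputs and their true labels
confident_and_right = bce_loss([0.9, 0.2], [1, 0])
uninformative = bce_loss([0.5, 0.5], [1, 0])

print(f"{confident_and_right:.4f}")
print(f"{uninformative:.4f}")
```

A model that always outputs 0.5 scores ln 2 (about 0.693), so the losses printed above start below that reference point and keep falling as the model fits the training data.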


## Model Evaluation

### Subtask:
Evaluate the trained MLP model on the test set. Calculate and print the accuracy of the model on the test data.

**Reasoning**:
The subtask is to evaluate the trained model on the test set and calculate its accuracy. This code will put the model in evaluation mode, iterate through the test data, make predictions, and calculate the accuracy.



In [None]:
model.eval() # Set model to evaluation mode
correct = 0
total = 0

with torch.no_grad(): # Disable gradient calculation for evaluation
    for inputs, labels in test_loader:
        outputs = model(inputs)
        predicted = (outputs > 0.5).float() # Convert probabilities to binary predictions (0 or 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Accuracy on the test set: {accuracy:.2f}%")

Accuracy on the test set: 72.50%
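
To put this number in context, it helps to compare it with the accuracy of always predicting the majority class. The sketch below assumes the class balance documented for the German Credit dataset (700 good, 300 bad) and relies on the stratified split preserving that ratio in the 200-sample test set; the variable names are illustrative.

```python
n_total, n_good_total = 1000, 700   # documented class balance (assumption stated above)
n_test = 200                        # test set size from the split

# A stratified split keeps the 70/30 ratio in the test set
n_good_test = n_good_total * n_test // n_total
majority_baseline = 100 * n_good_test / n_test

reported_accuracy = 72.50           # accuracy printed by the evaluation cell
n_correct = round(reported_accuracy * n_test / 100)
extra_correct = n_correct - n_good_test

print(f"Majority-class baseline: {majority_baseline:.2f}%")
print(f"Correct predictions: {n_correct} of {n_test}")
print(f"Improvement over baseline: {extra_correct} samples")
```

Under these assumptions the model classifies only a handful of test samples more correctly than a constant "good credit" prediction would.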


## Final Task

### Subtask:
Summarize the completed task, confirming that the PyTorch notebook for German Credit binary classification has been created and evaluated according to the specified requirements.


## Summary:

### Q&A
The final test accuracy achieved by the Multi-Layer Perceptron (MLP) model was 72.50%.

### Data Analysis Key Findings
*   **Data Acquisition and Preparation**: The German Credit dataset, containing 1000 entries, was successfully downloaded and loaded into a pandas DataFrame with 21 specified columns.
*   **Target Variable Preprocessing**: The 'creditability' target variable was remapped from $\{1, 2\}$ to $\{1, 0\}$ for binary classification compatibility.
*   **Feature Engineering**: Categorical features were identified and transformed using one-hot encoding, while numerical features were scaled using `StandardScaler` fitted on the training data. The dataset was split into training (800 samples) and testing (200 samples) sets.
*   **PyTorch Data Preparation**: Preprocessed data was successfully converted into PyTorch tensors and organized into `TensorDatasets` and `DataLoaders` with a batch size of 64, preparing it for batch-wise training.
*   **MLP Model Architecture**: A Multi-Layer Perceptron (MLP) model was defined with an input layer matching the 48 features, two hidden layers of 64 and 32 neurons respectively, both utilizing ReLU activation, and a final output layer with a Sigmoid activation for binary classification.
*   **Model Training Progress**: The MLP model was trained for 50 epochs using `nn.BCELoss` as the criterion and the Adam optimizer with a learning rate of 0.001. The average training loss decreased steadily from approximately 0.5496 at Epoch 5 to 0.0971 at Epoch 50, showing that the model fits the training data increasingly well. Set against the test accuracy below, such a low training loss suggests the model is overfitting.
*   **Model Performance**: The trained model achieved an accuracy of 72.50% on the unseen test set.

### Insights or Next Steps
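*   The model's 72.50% test accuracy is only slightly above the 70% that a majority-class predictor would reach, given that the dataset contains 700 good and 300 bad credit cases. Further optimization could involve hyperparameter tuning (e.g., learning rate, hidden layer sizes, batch size), exploring different activation functions, or regularization techniques (e.g., dropout, weight decay, early stopping on a validation split) to narrow the gap between training and test performance.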
*   Given that the dataset is for credit risk assessment, a deeper analysis of false positives and false negatives (confusion matrix, precision, recall, F1-score) would provide more critical insights into the model's decision-making and its implications for lending decisions.
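
The error analysis suggested above can be sketched in plain Python as follows. The labels and predictions are hypothetical, not results from this notebook, and the helper names are illustrative. With the notebook's encoding (1 = good credit, 0 = bad credit), a false positive is a bad-credit applicant predicted as good.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with class 1 treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels and thresholded predictions
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(tp, fp, fn, tn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In the notebook itself, the same counts could be accumulated inside the evaluation loop from `predicted` and `labels`.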
