
# Task
Develop an MLP model to predict credit risk using the 'German Credit Data' dataset from "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data". The task involves: loading the data; preprocessing it (one-hot encoding for categorical features, standardization for numerical features); splitting it into training and test sets; defining a PyTorch MLP model with at least two hidden layers; training the model with `nn.BCELoss` and the Adam optimizer for at least 50 epochs; and evaluating its accuracy on the test set.

## Load Data and Initial Preprocessing

### Subtask:
Load the 'German Credit Data' from the specified URL into a pandas DataFrame, assign column names, and map the `creditability` target column from (1, 2) to (1, 0), since `nn.BCELoss` expects targets in the range [0, 1].


**Reasoning**:
First, I'll import the pandas library to handle data manipulation and then define the URL for the dataset and the column names as specified in the instructions. This sets up the necessary components before loading the data.



In [1]:
import pandas as pd

# Define the URL for the 'German Credit Data'
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'

# Create a list of 21 column names
column_names = [
    'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5',
    'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10',
    'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15',
    'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20',
    'creditability'
]

print("Pandas imported, URL and column names defined.")

Pandas imported, URL and column names defined.


**Reasoning**:
Now that the URL and column names are defined, I will load the data into a pandas DataFrame using `pd.read_csv`, specifying the space separator and column names, then map the 'creditability' column values from (1, 2) to (1, 0).



In [2]:
df = pd.read_csv(url, sep=' ', header=None, names=column_names)
df['creditability'] = df['creditability'].map({1: 1, 2: 0})
print("Data loaded and 'creditability' column mapped. Displaying first 5 rows:")
print(df.head())

Data loaded and 'creditability' column mapped. Displaying first 5 rows:
  feature_1  feature_2 feature_3 feature_4  feature_5 feature_6 feature_7  \
0       A11          6       A34       A43       1169       A65       A75   
1       A12         48       A32       A43       5951       A61       A73   
2       A14         12       A34       A46       2096       A61       A74   
3       A11         42       A32       A42       7882       A61       A74   
4       A11         24       A33       A40       4870       A61       A73   

   feature_8 feature_9 feature_10  ...  feature_12 feature_13  feature_14  \
0          4       A93       A101  ...        A121         67        A143   
1          2       A92       A101  ...        A121         22        A143   
2          2       A93       A101  ...        A121         49        A143   
3          2       A93       A103  ...        A122         45        A143   
4          3       A93       A101  ...        A124         53        A143   
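
The label mapping above can be sanity-checked in isolation. The following is a minimal sketch on a toy Series rather than the real data; the value `3` is a deliberately invalid label added for illustration. It relies on the fact that `Series.map` with a dict turns unmatched values into `NaN`, so counting `NaN`s after mapping reveals unexpected labels.

```python
import pandas as pd

# Toy labels in the dataset's original coding (1 = good, 2 = bad).
# The final 3 is a deliberately invalid value, added only for illustration.
raw = pd.Series([1, 2, 2, 1, 3], name="creditability")

mapped = raw.map({1: 1, 2: 0})

# Series.map leaves values missing from the dict as NaN,
# so a non-zero NaN count signals labels outside {1, 2}.
n_unmapped = int(mapped.isna().sum())
print("Unmapped labels:", n_unmapped)
```

On the real DataFrame, the same check would be `df['creditability'].isna().sum()` right after the mapping, which should be zero if every label was 1 or 2.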

  

## Identify Features and Apply One-Hot Encoding

### Subtask:
Identify categorical and numerical features. Apply one-hot encoding using `pd.get_dummies` to the categorical columns to convert them into a numerical format.


**Reasoning**:
First, I'll print the information about the DataFrame to understand the data types and identify categorical and numerical columns. This will help in planning the one-hot encoding and separation of features.



In [3]:
df.info()  # info() prints its report directly and returns None, so no print() wrapper is needed
print("DataFrame information displayed.")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   feature_1      1000 non-null   object
 1   feature_2      1000 non-null   int64 
 2   feature_3      1000 non-null   object
 3   feature_4      1000 non-null   object
 4   feature_5      1000 non-null   int64 
 5   feature_6      1000 non-null   object
 6   feature_7      1000 non-null   object
 7   feature_8      1000 non-null   int64 
 8   feature_9      1000 non-null   object
 9   feature_10     1000 non-null   object
 10  feature_11     1000 non-null   int64 
 11  feature_12     1000 non-null   object
 12  feature_13     1000 non-null   int64 
 13  feature_14     1000 non-null   object
 14  feature_15     1000 non-null   object
 15  feature_16     1000 non-null   int64 
 16  feature_17     1000 non-null   object
 17  feature_18     1000 non-null   int64 
 18  feature_19     1000 non-null 

**Reasoning**:
Based on the `df.info()` output, I will now separate the features into categorical and numerical types, excluding the 'creditability' target column. Then, I will apply one-hot encoding to the categorical features and concatenate them with the numerical features to create a fully processed feature DataFrame.



In [4]:
X = df.drop('creditability', axis=1)
y = df['creditability']

categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

X_categorical_encoded = pd.get_dummies(X[categorical_features], drop_first=True)
X_numerical = X[numerical_features]

df_processed = pd.concat([X_numerical, X_categorical_encoded], axis=1)

print("Categorical features one-hot encoded and concatenated with numerical features.")
print("Shape of processed DataFrame:", df_processed.shape)
print(df_processed.head())

Categorical features one-hot encoded and concatenated with numerical features.
Shape of processed DataFrame: (1000, 48)
   feature_2  feature_5  feature_8  feature_11  feature_13  feature_16  \
0          6       1169          4           4          67           2   
1         48       5951          2           2          22           1   
2         12       2096          2           3          49           1   
3         42       7882          2           4          45           1   
4         24       4870          3           4          53           2   

   feature_18  feature_1_A12  feature_1_A13  feature_1_A14  ...  \
0           1          False          False          False  ...   
1           1           True          False          False  ...   
2           2          False          False           True  ...   
3           2          False          False          False  ...   
4           2          False          False          False  ...   

   feature_12_A124  feature_14_A
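
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy column (the values mimic `feature_1`, but the frame itself is made up). With `drop_first=True`, a column with k levels yields k-1 dummy columns, and the dropped first level is represented by all dummies being `False`.

```python
import pandas as pd

# Toy column with three levels; values are illustrative only.
toy = pd.DataFrame({"feature_1": ["A11", "A12", "A14", "A11"]})

full = pd.get_dummies(toy, drop_first=False)    # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # first level (A11) dropped

print(list(full.columns))
print(list(reduced.columns))

# Rows whose original value was A11 have no True entry in the reduced encoding.
a11_rows_all_false = not reduced.iloc[[0, 3]].to_numpy().any()
print(a11_rows_all_false)
```

This matches the notebook output above, where row 0 (`A11`) shows `False` for `feature_1_A12`, `feature_1_A13`, and `feature_1_A14`.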

## Split Data into Training and Test Sets

### Subtask:
Split the preprocessed data into an 80% training set and a 20% testing set for both features (X) and the target variable (y).


**Reasoning**:
To split the data, I will import the `train_test_split` function from `sklearn.model_selection`.



In [5]:
from sklearn.model_selection import train_test_split

print("train_test_split imported.")

train_test_split imported.


**Reasoning**:
Now that `train_test_split` is imported, I will split the `df_processed` (features) and `y` (target) data into training and testing sets, ensuring an 80/20 split, reproducibility with `random_state=42`, and class proportion preservation using `stratify=y`.



In [6]:
X_train, X_test, y_train, y_test = train_test_split(df_processed, y, test_size=0.2, random_state=42, stratify=y)

print("Data split into training and testing sets.")
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Data split into training and testing sets.
Shape of X_train: (800, 48)
Shape of X_test: (200, 48)
Shape of y_train: (800,)
Shape of y_test: (200,)
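
The effect of `stratify=y` can be illustrated with synthetic labels. This sketch assumes the 700 good / 300 bad class balance given in the dataset's documentation; the feature array is a placeholder and carries no meaning.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an assumed 700 good (1) / 300 bad (0) balance.
y_toy = np.array([1] * 700 + [0] * 300)
X_toy = np.arange(1000).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the 70/30 class ratio.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"good-credit rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Without `stratify`, the test-set class ratio would vary with the random seed, which makes accuracy harder to compare across runs on a dataset this small.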


## Standardize Numerical Features (Post-Split)

### Subtask:
Apply `StandardScaler` to the numerical features. Crucially, fit the scaler only on the *training data* and then use it to transform *both* the training and test sets to prevent data leakage.


**Reasoning**:
To standardize the numerical features, I will first import the `StandardScaler` from `sklearn.preprocessing`.



In [7]:
from sklearn.preprocessing import StandardScaler

print("StandardScaler imported.")

StandardScaler imported.


**Reasoning**:
Now that `StandardScaler` is imported, I will identify the numerical features in `X_train` and `X_test`, initialize the scaler, fit it on the numerical features of `X_train`, and then transform both the training and test sets to standardize them, replacing the original columns.



In [8]:
numerical_features_names = X_numerical.columns

scaler = StandardScaler()
X_train_numerical_scaled = scaler.fit_transform(X_train[numerical_features_names])
X_test_numerical_scaled = scaler.transform(X_test[numerical_features_names])

X_train[numerical_features_names] = X_train_numerical_scaled
X_test[numerical_features_names] = X_test_numerical_scaled

print("Numerical features in X_train and X_test have been standardized.")
print("X_train head after standardization:")
print(X_train.head())
print("X_test head after standardization:")
print(X_test.head())

Numerical features in X_train and X_test have been standardized.
X_train head after standardization:
     feature_2  feature_5  feature_8  feature_11  feature_13  feature_16  \
675   0.755149   0.485384   0.905268    1.044365   -0.825479   -0.719852   
703   0.755149  -0.246578   0.905268   -0.758207    0.493705    1.035885   
12   -0.726746  -0.584573  -1.797024   -1.659492   -1.177262   -0.719852   
845   0.014201   0.285331  -0.896260    0.143079   -0.033969   -0.719852   
795  -0.973728  -0.319522  -0.896260    1.044365   -1.177262   -0.719852   

     feature_18  feature_1_A12  feature_1_A13  feature_1_A14  ...  \
675   -0.442509          False          False           True  ...   
703   -0.442509           True          False          False  ...   
12    -0.442509           True          False          False  ...   
845   -0.442509           True          False          False  ...   
795   -0.442509          False          False           True  ...   

     feature_12_A124  featu
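
The fit-on-train-only pattern can be checked on synthetic data. In this sketch the arrays are random placeholders, not the credit features. The training split ends up with mean 0 and standard deviation 1 by construction, while the test split is transformed with the training statistics and so is only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random placeholder data standing in for one numerical feature.
rng = np.random.default_rng(0)
train = rng.normal(loc=20.0, scale=10.0, size=(800, 1))
test = rng.normal(loc=20.0, scale=10.0, size=(200, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

train_mean = float(train_scaled.mean())
train_std = float(train_scaled.std())
test_mean = float(test_scaled.mean())
print(train_mean, train_std, test_mean)
```

A test mean that is close to, but not exactly, zero is expected and is a sign that no test-set information leaked into the scaler.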

## Create PyTorch DataLoaders

### Subtask:
Convert the fully processed training and testing feature and target sets into PyTorch Tensors and then create `DataLoader` instances for efficient batch processing.


**Reasoning**:
First, I'll import the necessary PyTorch libraries: `torch` for tensor operations, and `TensorDataset` and `DataLoader` from `torch.utils.data` for creating datasets and data loaders. Then I will convert `X_train`, `X_test`, `y_train`, and `y_test` into `float32` tensors, reshaping the targets to `(N, 1)` so that they match the shape of the model's output when the loss is computed.



In [9]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# Convert to PyTorch Tensors
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1)

print("PyTorch and utility modules imported.")
print("DataFrames/Series converted to PyTorch Tensors.")
print("Shape of X_train_tensor:", X_train_tensor.shape)
print("Shape of y_train_tensor:", y_train_tensor.shape)

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint64, uint32, uint16, uint8, and bool.

**Reasoning**:
The previous cell failed because `X_train` and `X_test` mix float columns (the standardized numerical features) with boolean columns (the one-hot encoded features). For such a mixed DataFrame, `.values` returns a NumPy array of dtype `object`, which `torch.tensor` cannot convert. To fix this, I will cast the DataFrames to float before creating the tensors, so that every value is numeric.



In [10]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# Convert to PyTorch Tensors, explicitly casting all values to float
X_train_tensor = torch.tensor(X_train.astype(float).values, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.astype(float).values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1)

print("PyTorch and utility modules imported.")
print("DataFrames/Series converted to PyTorch Tensors.")
print("Shape of X_train_tensor:", X_train_tensor.shape)
print("Shape of y_train_tensor:", y_train_tensor.shape)

PyTorch and utility modules imported.
DataFrames/Series converted to PyTorch Tensors.
Shape of X_train_tensor: torch.Size([800, 48])
Shape of y_train_tensor: torch.Size([800, 1])
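
The dtype problem behind the earlier `TypeError` can be reproduced without PyTorch. This is a minimal sketch on a two-row toy frame whose column names mirror the notebook's but whose values are illustrative.

```python
import pandas as pd

# A float column next to a boolean column, as in X_train after scaling.
mixed = pd.DataFrame({
    "feature_2": [0.755149, -0.726746],
    "feature_1_A12": [False, True],
})

# pandas falls back to object dtype when bool and numeric columns are combined.
dtype_before = mixed.values.dtype
# Casting first gives a homogeneous float array (False -> 0.0, True -> 1.0).
dtype_after = mixed.astype(float).values.dtype

print(dtype_before, dtype_after)
```

An alternative that avoids the issue earlier in the pipeline is passing `dtype=float` to `pd.get_dummies`, so the dummy columns are numeric from the start.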


**Reasoning**:
Now that the data is converted into PyTorch Tensors, I will create `TensorDataset` objects for both training and testing data, and then initialize `DataLoader` instances with a batch size of 64, shuffling the training data but not the testing data for efficient and consistent batch processing.



In [11]:
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print("TensorDatasets and DataLoaders created.")
print(f"Training DataLoader will yield batches of size {batch_size}.")
print(f"Testing DataLoader will yield batches of size {batch_size}.")

TensorDatasets and DataLoaders created.
Training DataLoader will yield batches of size 64.
Testing DataLoader will yield batches of size 64.
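
Note that not every batch has exactly 64 samples: 800 is not a multiple of 64, so the last training batch is smaller. The sketch below uses zero-filled placeholder tensors with the same shapes as the training tensors to show the batch sizes a loader like this yields.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors with the same shapes as the training data.
X_dummy = torch.zeros(800, 48)
y_dummy = torch.zeros(800, 1)

loader = DataLoader(TensorDataset(X_dummy, y_dummy), batch_size=64, shuffle=False)

batch_sizes = [xb.shape[0] for xb, _ in loader]
print(len(loader), batch_sizes[-1])
```

If equally sized batches matter, `DataLoader` accepts `drop_last=True`, at the cost of skipping the remainder samples in each epoch.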


## Define MLP Model

### Subtask:
Define a PyTorch `nn.Module` class for the Multi-Layer Perceptron (MLP) model, ensuring it has at least two hidden layers with an appropriate number of neurons for the input data.


**Reasoning**:
To define the MLP model, I will first import the `torch.nn` module and then define the `MLP` class as instructed, including the necessary linear layers, activation functions, and the forward pass logic.



In [12]:
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim):
        super(MLP, self).__init__()
        self.layer_1 = nn.Linear(input_dim, 64)
        self.relu_1 = nn.ReLU()
        self.layer_2 = nn.Linear(64, 32)
        self.relu_2 = nn.ReLU()
        self.layer_3 = nn.Linear(32, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.layer_1(x)
        x = self.relu_1(x)
        x = self.layer_2(x)
        x = self.relu_2(x)
        x = self.layer_3(x)
        x = self.sigmoid(x)
        return x

print("MLP class defined.")

MLP class defined.
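
For a sense of model size, the same layer stack can be written with `nn.Sequential` and its parameters counted. This is a sketch of an equivalent architecture, not a replacement for the `MLP` class above; the input width of 48 is taken from the processed feature count.

```python
import torch.nn as nn

# Same layers as the MLP class: 48 -> 64 -> 32 -> 1 with ReLU and a final Sigmoid.
model_sketch = nn.Sequential(
    nn.Linear(48, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

# Weights plus biases: (48*64 + 64) + (64*32 + 32) + (32*1 + 1)
n_params = sum(p.numel() for p in model_sketch.parameters())
print(n_params)
```

The parameter count is several times larger than the 800 training samples, which is worth keeping in mind when comparing training loss with test accuracy.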


## Train MLP Model

### Subtask:
Implement a training loop for the MLP model. Use `nn.BCELoss` as the loss function and the Adam optimizer. Train the model for at least 50 epochs on the training data.


**Reasoning**:
To begin the model training, I will initialize the MLP model, the BCELoss function, and the Adam optimizer. This sets up the core components required for the training loop.



In [13]:
input_dim = X_train_tensor.shape[1]
model = MLP(input_dim)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

print(f"MLP model initialized with input dimension: {input_dim}")
print("Loss function (BCELoss) and optimizer (Adam) defined.")

MLP model initialized with input dimension: 48
Loss function (BCELoss) and optimizer (Adam) defined.
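
As a reference point for reading the loss values, binary cross-entropy can be computed by hand. The sketch below is a plain-Python version of the mean loss, $-[y \log p + (1 - y) \log(1 - p)]$ averaged over samples; the helper name `bce` is made up for this example.

```python
import math

def bce(probs, targets):
    """Mean binary cross-entropy over paired probabilities and 0/1 targets."""
    terms = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, targets)
    ]
    return sum(terms) / len(terms)

# A model that always outputs 0.5 scores ln(2), regardless of the labels.
uninformed = bce([0.5, 0.5], [1, 0])
# Confident and correct predictions score much lower.
confident_right = bce([0.9, 0.1], [1, 0])

print(round(uninformed, 4), round(confident_right, 4))
```

A freshly initialized network typically outputs probabilities near 0.5, so a first-epoch loss close to ln 2 (about 0.693) is what one would expect before any learning has happened.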


**Reasoning**:
Now that the model, loss function, and optimizer are initialized, I will implement the training loop for at least 50 epochs, including the forward pass, loss calculation, backward pass, and optimizer steps, while also printing the loss for each epoch to monitor progress.



In [14]:
num_epochs = 50

for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    total_loss = 0
    for inputs, labels in train_loader:
        optimizer.zero_grad()  # Zero the gradients
        outputs = model(inputs)  # Forward pass
        loss = criterion(outputs, labels)  # Calculate loss
        loss.backward()  # Backward pass
        optimizer.step()  # Update weights
        total_loss += loss.item()

    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader):.4f}')

print("Model training complete.")

Epoch 1/50, Loss: 0.7108
Epoch 2/50, Loss: 0.6674
Epoch 3/50, Loss: 0.6094
Epoch 4/50, Loss: 0.5705
Epoch 5/50, Loss: 0.5491
Epoch 6/50, Loss: 0.5352
Epoch 7/50, Loss: 0.5128
Epoch 8/50, Loss: 0.4896
Epoch 9/50, Loss: 0.4782
Epoch 10/50, Loss: 0.4634
Epoch 11/50, Loss: 0.4471
Epoch 12/50, Loss: 0.4447
Epoch 13/50, Loss: 0.4301
Epoch 14/50, Loss: 0.4199
Epoch 15/50, Loss: 0.4093
Epoch 16/50, Loss: 0.4052
Epoch 17/50, Loss: 0.3975
Epoch 18/50, Loss: 0.3920
Epoch 19/50, Loss: 0.3824
Epoch 20/50, Loss: 0.3709
Epoch 21/50, Loss: 0.3630
Epoch 22/50, Loss: 0.3542
Epoch 23/50, Loss: 0.3454
Epoch 24/50, Loss: 0.3395
Epoch 25/50, Loss: 0.3311
Epoch 26/50, Loss: 0.3209
Epoch 27/50, Loss: 0.3130
Epoch 28/50, Loss: 0.2981
Epoch 29/50, Loss: 0.2879
Epoch 30/50, Loss: 0.2819
Epoch 31/50, Loss: 0.2766
Epoch 32/50, Loss: 0.2693
Epoch 33/50, Loss: 0.2559
Epoch 34/50, Loss: 0.2444
Epoch 35/50, Loss: 0.2328
Epoch 36/50, Loss: 0.2265
Epoch 37/50, Loss: 0.2155
Epoch 38/50, Loss: 0.2144
Epoch 39/50, Loss: 0.

## Evaluate MLP Model Accuracy

### Subtask:
Evaluate the trained MLP model on the test set to determine its accuracy. This involves setting the model to evaluation mode, making predictions, converting probabilities to binary labels, and comparing them with the true labels.

**Reasoning**:
To evaluate the model's accuracy, I will import `accuracy_score` from `sklearn.metrics` and set the model to evaluation mode. This model has no dropout or batch-normalization layers, so `model.eval()` does not change its outputs here, but calling it before inference is good practice. I will then predict on the test set under `torch.no_grad()`, threshold the probabilities at 0.5 to get binary labels, and compute the accuracy.



In [15]:
from sklearn.metrics import accuracy_score

model.eval()  # Set the model to evaluation mode

with torch.no_grad():  # Disable gradient calculation for evaluation
    y_pred_proba = model(X_test_tensor)
    y_pred = (y_pred_proba >= 0.5).float() # Convert probabilities to binary predictions (0 or 1)

accuracy = accuracy_score(y_test_tensor.cpu().numpy(), y_pred.cpu().numpy())

print(f"Model accuracy on the test set: {accuracy:.4f}")

Model accuracy on the test set: 0.7450
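
Accuracy is easier to interpret next to a trivial baseline. The sketch below assumes the stratified test set holds 140 good and 60 bad cases (20% of the dataset's documented 700/300 split) and scores a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Assumed test labels: 140 good (1) and 60 bad (0).
y_true = np.array([1] * 140 + [0] * 60)
# Trivial baseline: always predict the majority class (good credit).
y_majority = np.ones_like(y_true)

baseline_acc = accuracy_score(y_true, y_majority)
# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_majority, labels=[0, 1])

print(baseline_acc)
print(cm)
```

Applying `confusion_matrix` to the notebook's `y_test_tensor` and `y_pred` would show how the reported accuracy is split between the two classes, which accuracy alone hides.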


## Final Task

### Subtask:
Summarize the end-to-end process, including data loading, preprocessing steps, MLP model definition, training progress, and the final accuracy achieved on the test set.


## Summary:

### Data Analysis Key Findings

*   **Data Loading and Initial Preprocessing**: The 'German Credit Data' was successfully loaded into a DataFrame with 1000 entries and 21 columns. The 'creditability' target column was remapped from (1, 2) to (1, 0), so that 1 denotes good credit and 0 denotes bad credit, as required for `nn.BCELoss`.
*   **Feature Engineering**:
    *   13 categorical features were identified and one-hot encoded using `pd.get_dummies` with `drop_first=True`.
    *   7 numerical features were identified.
    *   The feature space expanded from the original 20 features to 48 features after one-hot encoding (7 numerical columns plus 41 dummy columns).
*   **Data Splitting**: The preprocessed data was split into an 80% training set (800 samples, 48 features) and a 20% test set (200 samples, 48 features) using `train_test_split` with `stratify=y` to maintain class distribution.
*   **Feature Scaling**: Numerical features were standardized using `StandardScaler`. The scaler was fitted exclusively on the training data and then used to transform both training and test sets to prevent data leakage.
*   **PyTorch Data Preparation**: The processed data was converted into PyTorch Tensors (all features cast to `float32`) and organized into `TensorDataset` and `DataLoader` instances with a batch size of 64 for efficient model training.
*   **MLP Model Architecture**: A PyTorch `nn.Module` class named `MLP` was defined. It features an input layer, two hidden layers with 64 and 32 neurons respectively (each followed by a ReLU activation), and an output layer with a single neuron using a Sigmoid activation function for binary classification.
*   **Model Training**: The MLP model was trained for 50 epochs using `nn.BCELoss` as the loss function and the Adam optimizer with a learning rate of 0.001. The training loss decreased steadily from approximately 0.7108 in Epoch 1 to 0.1219 in Epoch 50, showing that the model fits the training data well. Training loss alone does not measure generalization, which is assessed on the test set.
*   **Model Evaluation**: The trained MLP model achieved an accuracy of **0.7450** on the test set.

### Insights or Next Steps

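*   The current model provides a baseline accuracy of 74.5% for credit risk prediction. For context, the dataset contains 700 good and 300 bad credit cases, so always predicting the majority class would score about 70%. The gap between the low final training loss and the test accuracy may indicate overfitting, so regularization (e.g., dropout, weight decay, early stopping) is worth trying. Further experimentation with hyperparameter tuning (e.g., learning rate, batch size, number of epochs) or exploring more complex MLP architectures could potentially improve performance.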
*   Investigate the misclassified samples in the test set to understand patterns or feature importance that might lead to better model design or more targeted feature engineering in future iterations.
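
As a starting point for the misclassification analysis suggested above, the following sketch separates errors into false positives and false negatives. The arrays are made-up toy values, not model output; on the real data they would be replaced by the test labels and predicted probabilities.

```python
import numpy as np

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.92, 0.61, 0.35, 0.80, 0.10, 0.55])
y_pred = (y_prob >= 0.5).astype(int)

# Positions where the thresholded prediction disagrees with the label.
wrong = np.flatnonzero(y_pred != y_true)
false_pos = wrong[y_pred[wrong] == 1]  # bad credit predicted as good
false_neg = wrong[y_pred[wrong] == 0]  # good credit predicted as bad

print(wrong, false_pos, false_neg)
```

With 1 coded as good credit, false positives are bad applicants predicted as good. The two error types often carry different costs in credit scoring, so inspecting them separately may be more informative than overall accuracy.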
