# 1. (3 pts) What is a neural network? What are the general steps required to build a neural network?

A neural network is a machine learning model loosely inspired by the structure of the human brain. It consists of layers of interconnected neurons, each of which computes a weighted sum of its inputs followed by an activation function, and it learns from data by adjusting those weights. To build a neural network, you start by defining the problem, preprocessing the data (for example, handling missing values and normalizing features), and splitting it into training, validation, and test sets. Next, you design the network architecture by choosing the number of layers, the number of neurons per layer, and the activation functions. After selecting a loss function and an optimizer, you train the model: backpropagation computes the gradient of the loss with respect to each weight, and the optimizer uses those gradients to update the weights. Finally, you tune the hyperparameters against the validation set and report the model’s final performance on the test set, which stays held out until the end.
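The steps above can be sketched end to end with scikit-learn. This is only an illustration under stated assumptions: the data is synthetic (`make_classification`), the roughly 60/20/20 split and the two candidate architectures are arbitrary choices, and none of it reproduces the credit card results later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 600 rows, 10 features, binary target.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Split into roughly 60% training, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp)

# Fit the scaler on the training set only, then apply it to every split.
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s, X_test_s = (scaler.transform(a) for a in (X_train, X_val, X_test))

# Choose between two candidate architectures using the validation set.
best_score, best_model = -1.0, None
for hidden in [(8,), (32, 16)]:
    model = MLPClassifier(hidden_layer_sizes=hidden, activation="relu",
                          solver="adam", max_iter=500, random_state=0)
    model.fit(X_train_s, y_train)
    score = model.score(X_val_s, y_val)
    if score > best_score:
        best_score, best_model = score, model

# Touch the test set once, after all choices have been made.
test_accuracy = best_model.score(X_test_s, y_test)
print(f"validation accuracy: {best_score:.3f}, test accuracy: {test_accuracy:.3f}")
```

Keeping the test set out of the tuning loop is the design point here: the validation score picks the architecture, and the test score is only read at the end.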

# 2. (3 pts) Generally, how do you check the performance of a neural network? Why is this the case?

To check the performance of a neural network, you typically evaluate it on a test set that the model has not seen during training, using metrics such as accuracy, the confusion matrix, and the loss. This is necessary because performance on the training data only shows how well the model fits examples it has already seen, while held-out data shows how well it generalizes to new data. Comparing training and held-out metrics also reveals overfitting (good training performance but poor test performance) and underfitting (poor performance on both).
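As a small illustration of these metrics, the sketch below computes accuracy, the confusion matrix, and recall for the positive class. The labels and predictions are made up for the example (they are not from this notebook's data); only the `sklearn.metrics` functions are real.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical held-out labels and model predictions (illustration only).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)      # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)     # rows: true class, columns: predicted class
recall_pos = recall_score(y_true, y_pred) # share of true 1s that were predicted as 1

print(f"accuracy: {acc:.2f}")
print(cm)
print(f"recall for class 1: {recall_pos:.2f}")
```

Here the accuracy is 0.70 while the recall for class 1 is only 0.50, which is why the confusion matrix is worth reading alongside the single accuracy number.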

In [27]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from typing import List
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import ConfusionMatrixDisplay
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

In [28]:
df = pd.read_csv('Credit_Card.csv')
label_df = pd.read_csv('Credit_card_label.csv')
merged_df = pd.merge(df, label_df, on='Ind_ID')

# 3. (4 pts) Clean the data or do additional cleaning if you have used the dataset for another assignment. Specify the improvements (at least 2) that you made to your cleaning if you selected the dataset before. If you select one with a low number of records, consider oversampling.

In [29]:
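def preprocess_credit_card_data(merged_df: pd.DataFrame,
                          numerical_cols: List[str] = ['Annual_income', 'Birthday_count', 'Employed_days'],
                          categorical_cols: List[str] = ['GENDER', 'Car_Owner', 'Propert_Owner', 'Type_Income',
                                                         'EDUCATION', 'Marital_status', 'Housing_type', 'Type_Occupation'],
                          occupation_col: str = 'Type_Occupation') -> pd.DataFrame:
    """Impute, clip outliers, one-hot encode, and standardize the merged credit card data."""

    # Fill missing numerical values with the column median.
    for col in numerical_cols:
        merged_df[col] = merged_df[col].fillna(merged_df[col].median())

    # Drop rows with no occupation (note: this modifies the DataFrame passed in).
    merged_df.dropna(subset=[occupation_col], inplace=True)

    # Fill missing GENDER values with the most frequent value.
    merged_df['GENDER'] = merged_df['GENDER'].fillna(merged_df['GENDER'].mode()[0])

    # Clip outliers in each numerical column to the 1.5 * IQR bounds.
    for col in numerical_cols:
        Q1 = merged_df[col].quantile(0.25)
        Q3 = merged_df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        merged_df[col] = np.clip(merged_df[col], lower_bound, upper_bound)

    # One-hot encode the categorical columns, dropping the first level of each.
    merged_df_encoded = pd.get_dummies(merged_df, columns=categorical_cols, drop_first=True)

    # Replace infinities and any remaining missing values with 0.
    merged_df_encoded.replace([np.inf, -np.inf], np.nan, inplace=True)
    merged_df_encoded.fillna(0, inplace=True)

    # Standardize the numerical columns to zero mean and unit variance.
    scaler = StandardScaler()
    merged_df_encoded[numerical_cols] = scaler.fit_transform(merged_df_encoded[numerical_cols])

    # Cast the columns to int. This also truncates the standardized numerical
    # columns toward zero, which is why they show as small integers in the output.
    merged_df_encoded = merged_df_encoded.astype(int, errors='ignore')

    return merged_df_encoded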

numerical_cols = ['Annual_income', 'Birthday_count', 'Employed_days']
categorical_cols = ['GENDER', 'Car_Owner', 'Propert_Owner', 'Type_Income', 
                    'EDUCATION', 'Marital_status', 'Housing_type', 'Type_Occupation']
occupation_col = 'Type_Occupation'

df_preprocessed = preprocess_credit_card_data(merged_df, numerical_cols, categorical_cols, occupation_col)
df_preprocessed

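| | Ind_ID | CHILDREN | Annual_income | Birthday_count | Employed_days | Mobile_phone | Work_Phone | Phone | EMAIL_ID | Family_Members | ... | Type_Occupation_Laborers | Type_Occupation_Low-skill Laborers | Type_Occupation_Managers | Type_Occupation_Medicine staff | Type_Occupation_Private service staff | Type_Occupation_Realty agents | Type_Occupation_Sales staff | Type_Occupation_Secretaries | Type_Occupation_Security staff | Type_Occupation_Waiters/barmen staff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 5010864 | 1 | 2 | 0 | 0 | 1 | 0 | 1 | 1 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 5010868 | 1 | 2 | 0 | 0 | 1 | 0 | 1 | 1 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 5010869 | 1 | 2 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | 5018498 | 0 | -1 | -1 | 0 | 1 | 1 | 1 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 5018501 | 0 | 0 | -1 | 0 | 1 | 1 | 1 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1542 | 5118268 | 1 | 2 | 1 | 0 | 1 | 0 | 1 | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1543 | 5028645 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1544 | 5023655 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1545 | 5115992 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 4 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |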


For my additional preprocessing steps, I handled outliers by clipping the numerical columns to the IQR bounds (1.5 times the IQR beyond the quartiles), standardized the numerical columns, and converted all columns to integers where possible.
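The IQR clipping and the integer conversion described above can be traced on a tiny example. This is a sketch with made-up values, using NumPy directly rather than the notebook's pandas and `StandardScaler` calls; it is meant to show the arithmetic, including what an integer cast does to standardized values.

```python
import numpy as np

# Hypothetical values with one extreme outlier (illustration only).
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# IQR bounds: 1.5 * IQR below Q1 and above Q3.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip instead of dropping, so the number of rows is unchanged.
clipped = np.clip(values, lower, upper)

# Standardize with the population standard deviation (ddof=0).
z = (clipped - clipped.mean()) / clipped.std()

# Casting standardized values to int truncates them toward zero.
z_int = z.astype(int)

print("clipped:     ", clipped)
print("standardized:", np.round(z, 2))
print("as integers: ", z_int)
```

The outlier 100 is clipped to the upper bound of 7, and the integer cast collapses the five standardized values to -1, 0, 0, 0, 1. If that loss of resolution is not intended, one option is to cast only the non-standardized columns.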

# 4. (10 pts) Create a neural network using Keras or PyTorch to predict the outcome of your datasets.

In [30]:
X = df_preprocessed.drop(columns=['label'])
y = df_preprocessed['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu', solver='adam', 
                    max_iter=200, random_state=42)

mlp.fit(X_train, y_train)

y_pred = mlp.predict(X_test)

print(f"Scikit-learn MLP Test Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))




Scikit-learn MLP Test Accuracy: 0.8537735849056604
              precision    recall  f1-score   support

           0       0.85      1.00      0.92       181
           1       0.00      0.00      0.00        31

    accuracy                           0.85       212
   macro avg       0.43      0.50      0.46       212
weighted avg       0.73      0.85      0.79       212



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
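A quick sanity check on the report above is to compare the accuracy with a majority-class baseline. The sketch below uses only the class counts from the support column (181 and 31); the all-zeros prediction vector is a constructed baseline, not the model's actual output.

```python
# Test-set class counts taken from the support column of the report above.
n_class0, n_class1 = 181, 31

y_true = [0] * n_class0 + [1] * n_class1
y_baseline = [0] * len(y_true)  # always predict the majority class

# Accuracy of the baseline: fraction of labels that equal 0.
baseline_accuracy = sum(t == p for t, p in zip(y_true, y_baseline)) / len(y_true)

# Recall for class 1: share of true 1s that the baseline predicts as 1.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_baseline))
recall_class1 = true_positives / n_class1

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall for class 1: {recall_class1:.2f}")
```

The baseline accuracy is 181/212, about 0.8538, with a recall of 0.00 for class 1. That matches the reported accuracy and class 1 row, which is consistent with the classifier predicting class 0 for every test sample.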


In [40]:
class ANN_Model(nn.Module):
    def __init__(self, input_features=8, hidden1=20, hidden2=20, out_features=2):
        super().__init__()
        self.layer_1_connection = nn.Linear(input_features, hidden1)
        self.layer_2_connection = nn.Linear(hidden1, hidden2)
        self.out = nn.Linear(hidden2, out_features)

    def forward(self, x):
        x = F.relu(self.layer_1_connection(x))
        x = F.relu(self.layer_2_connection(x))
        x = self.out(x)
        return x

X = df_preprocessed.drop(columns=['label']).values 
y = df_preprocessed['label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

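X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

# train_loader is used in the training loop below but was never defined, so it is built here.
# The batch size of 32 is an assumption; the value used for the recorded output is not shown.
from torch.utils.data import DataLoader, TensorDataset
train_loader = DataLoader(TensorDataset(X_train_tensor, y_train_tensor), batch_size=32, shuffle=True)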


input_features = X_train.shape[1]
model = ANN_Model(input_features=input_features, hidden1=20, hidden2=20, out_features=2)

criterion = nn.CrossEntropyLoss() 
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch) 
        loss = criterion(outputs, y_batch) 
        loss.backward() 
        optimizer.step() 

    # Report the loss of the last mini-batch in this epoch as a plain float.
    print(f'Epoch number: {epoch} with loss {loss.item()}')

model.eval()
with torch.no_grad():
    y_pred = model(X_test_tensor)
    y_pred_class = torch.argmax(y_pred, dim=1)
    accuracy = accuracy_score(y_test, y_pred_class.numpy())
    print(f"PyTorch ANN Test Accuracy: {accuracy}")



Epoch number: 0 with loss 7748.7470703125
Epoch number: 1 with loss 521.01708984375
Epoch number: 2 with loss 0.0
Epoch number: 3 with loss 136.2158203125
Epoch number: 4 with loss 285.87890625
Epoch number: 5 with loss 577.27001953125
Epoch number: 6 with loss 156.9921875
Epoch number: 7 with loss 3120.7431640625
Epoch number: 8 with loss 0.0
Epoch number: 9 with loss 657.7470703125
PyTorch ANN Test Accuracy: 0.8537735849056604
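The per-epoch values printed above are the loss of the last mini-batch only, which can jump around a lot from epoch to epoch. A steadier signal is the average loss over all samples in the epoch. The sketch below shows one way to track it, assuming synthetic standardized data, a small `nn.Sequential` model, and a batch size of 32; none of these are the notebook's actual settings.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (hypothetical): 200 rows, 5 standardized features.
X = torch.randn(200, 5)
y = (X[:, 0] + X[:, 1] > 0).long()  # binary label from a simple linear rule

loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(5, 20), nn.ReLU(), nn.Linear(20, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(20):
    model.train()
    running_loss = 0.0
    for X_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = criterion(model(X_batch), y_batch)
        loss.backward()
        optimizer.step()
        # Weight each batch by its size so the smaller final batch is not over-counted.
        running_loss += loss.item() * X_batch.size(0)
    epoch_losses.append(running_loss / len(loader.dataset))

print(f"first epoch: {epoch_losses[0]:.4f}, last epoch: {epoch_losses[-1]:.4f}")
```

Losses in the hundreds or thousands, like those in the output above, often point to unscaled inputs; in this notebook the `Ind_ID` column is passed to the network with its raw seven-digit values, which may be one contributor.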


# 5. (5 pts) Compare the performance of the neural networks to another model you created. Which performed better? Why do you think that is?

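The comparison between the neural networks (ANN models) and the KNN model showed that the neural networks performed better overall. Both the PyTorch ANN and the scikit-learn MLPClassifier reached a test accuracy of about 0.854, which was higher than the accuracy of KNN. However, the classification report for the MLP shows precision and recall of 0.00 for class 1, so this accuracy equals the share of class 0 in the test set (181 of 212 samples). In other words, the MLP effectively predicted only the majority class, and the identical accuracy of the PyTorch model suggests it did the same. Because of this class imbalance, accuracy alone says little about how well any of the models separates the two classes.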

This difference is likely because neural networks can learn intricate relationships in data through multiple layers and nonlinear activation functions, while KNN relies on simple distance-based calculations. Neural networks can generalize better when the dataset has many features or nonlinear patterns, whereas KNN's performance depends heavily on the choice of k and degrades as the data becomes more complex or high-dimensional. However, KNN is simpler to implement and works well for smaller or less complex datasets.
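A comparison like this is easiest to trust when both models see the same split and are scored on a metric that accounts for class imbalance. The sketch below is one way to set that up, assuming synthetic imbalanced data from `make_classification` and arbitrary settings (k = 5, hidden layers of 64 and 32 units); it does not reproduce the KNN or ANN results discussed here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): about 85% class 0 and 15% class 1.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both models get the same scaling step and the same train/test split.
models = {
    "KNN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "MLP (64, 32)": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Balanced accuracy averages recall over classes, so always predicting
        # the majority class scores 0.5 instead of about 0.85.
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Reporting balanced accuracy next to plain accuracy makes it visible when a model's score comes mostly from the majority class.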