In [86]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
import torch
import torch.nn as nn
import torch.nn.functional as F

**(3 pts) What is a neural network? What are the general steps required to build a neural network?**

A neural network is a computational model loosely inspired by the human brain. It consists of layers of interconnected nodes (neurons) that transform input data and learn patterns in order to solve tasks like classification, regression, and prediction. The general steps required to build a neural network are to:
1. Prepare the Data: Gather, clean, and preprocess the dataset, then split it into training and test sets.

2. Model Selection: Choose the type of network that suits the task (for example, a feed-forward network for tabular data).

3. Define the Model: Specify the architecture (layers, neurons per layer, activation functions, etc.).

4. Initialize the Parameters: Set the initial values of the weights and biases.

5. Forward Propagation: Pass the input data through the network and compute the predictions.

6. Compute Loss: Measure the error between the predictions and the true targets using a loss function.

7. Backward Propagation: Calculate the gradients of the loss with respect to the weights and biases.

8. Update Parameters: Adjust the weights and biases using an optimization algorithm. Steps 5 through 8 are repeated for many epochs.

9. Evaluate Performance: Test the model on data it did not see during training.

10. Deploy Model: Use the trained model for predictions.
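
The steps above can be sketched in a few lines of PyTorch. This is only an illustrative sketch: the synthetic XOR-like data, the layer sizes, the learning rate, and the number of epochs are all assumptions chosen for the example, not values taken from the credit card dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Step 1: prepare the data (synthetic, hypothetical two-class problem).
X = torch.randn(400, 2)
y = (X[:, 0] * X[:, 1] > 0).long()  # XOR-like labels, not linearly separable
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

# Steps 2-4: select and define the model. nn.Linear sets the initial
# weights and biases itself when the layer is created.
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(300):
    logits = model(X_train)          # Step 5: forward propagation
    loss = loss_fn(logits, y_train)  # Step 6: compute the loss
    optimizer.zero_grad()
    loss.backward()                  # Step 7: backward propagation
    optimizer.step()                 # Step 8: update the parameters
    losses.append(loss.item())

# Step 9: evaluate on data the model did not see during training.
model.eval()
with torch.no_grad():
    predictions = model(X_test).argmax(dim=1)
    test_accuracy = (predictions == y_test).float().mean().item()

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
print(f"test accuracy: {test_accuracy:.2%}")
```

Step 10 (deployment) would then reuse `model` inside `torch.no_grad()` on new inputs that have been preprocessed the same way as the training data.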


**(3 pts) Generally, how do you check the performance of a neural network? Why is this the case?**

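You check a neural network's performance by computing metrics like accuracy, precision, recall, or loss on a validation/test set that was held out from training. Training metrics alone only show how well the model fits the data it has already seen. Held-out metrics show whether the model generalizes to unseen data: a large gap between training and validation performance indicates overfitting, while poor performance on both indicates underfitting.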
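
As a small illustration of why more than one metric is useful, the sketch below computes accuracy, precision, and recall with scikit-learn. The label arrays are made up for the example and stand in for held-out targets and model predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical held-out labels and predictions (6 negatives, 4 positives).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_hat)
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)
# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_hat)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(matrix)
```

Here the accuracy is 0.70, but the recall is only 0.50: half of the positive cases are missed, which accuracy alone would hide.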

**(4 pts) Clean the data or do additional cleaning if you have used the dataset for another assignment. Specify the improvements (at least 2) that you made to your cleaning if you selected the dataset before. If you select one with a low number of records, consider oversampling.**

Import the data

In [87]:
card = pd.read_csv("Credit_card.csv")

label = pd.read_csv("Credit_card_label.csv")

data = card.merge(label, on = "Ind_ID", how = "left")

Data Cleaning Function

In [88]:
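def column_encoder(data, columns):
    # Label-encode the categorical columns (missing values become the string "nan").
    labelencoder = preprocessing.LabelEncoder()
    for column in columns:
        if column in data.columns:
            data[column] = labelencoder.fit_transform(data[column].astype(str))
    # Fill the remaining missing numeric values with the column mean.
    data.fillna(data.mean(numeric_only=True), inplace=True)
    # NOTE: this selects every int64/float64 column, so numeric ID and target
    # columns (e.g. "Ind_ID", "label") are scaled and capped as well.
    numeric_columns = data.select_dtypes(include=["float64", "int64"]).columns
    # Standardize to zero mean and unit variance.
    scaler = StandardScaler()
    data[numeric_columns] = scaler.fit_transform(data[numeric_columns])
    # Cap outliers at 1.5 * IQR below the first and above the third quartile.
    for col in numeric_columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        data[col] = data[col].clip(lower=lower_bound, upper=upper_bound)
    return data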

encode_column = ["Car_Owner", "GENDER", "Propert_Owner", "Marital_status", 
                 "EDUCATION", "Housing_type", "Type_Occupation", "Type_Income"]

data = column_encoder(data, encode_column)

Improvements:
1. StandardScaler was used to scale all numerical columns to have a mean equal to zero and a standard deviation of 1.

2. IQR Outlier Capping was used so that the values in each numerical column are clipped to the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], where IQR is the interquartile range. This limits the influence of extreme values, which can make training more stable and reduce the risk of the model fitting to outliers.
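
One property of IQR capping is worth checking before applying it to every numeric column. On a 0/1 column where one value makes up more than 75% of the rows, Q1 and Q3 coincide, the IQR is zero, and every value is clipped to the majority value. The sketch below illustrates this with made-up data; the helper `iqr_clip` and the column names `income` and `target` are hypothetical and not part of the notebook.

```python
import pandas as pd


def iqr_clip(frame, columns, k=1.5):
    """Clip the given columns to [Q1 - k*IQR, Q3 + k*IQR]; other columns are untouched."""
    out = frame.copy()
    for col in columns:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out


# Made-up data: a continuous feature with one outlier and an imbalanced 0/1 target.
toy = pd.DataFrame({
    "income": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 13.0, 12.0, 500.0],
    "target": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

features_only = iqr_clip(toy, ["income"])          # target left as-is
everything = iqr_clip(toy, ["income", "target"])   # target is capped too

print(features_only["income"].max())       # the outlier 500.0 is capped
print(features_only["target"].nunique())   # both classes survive
print(everything["target"].nunique())      # the minority class is clipped away
```

Passing an explicit list of feature columns, as in `features_only`, is one way to keep identifier and target columns out of the scaling and capping steps.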

**(10 pts) Create a neural network using Keras or PyTorch to predict the outcome of your datasets.**

In [89]:
X = data.drop('label', axis=1).values
y = data['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train = torch.FloatTensor(X_train)
X_test = torch.FloatTensor(X_test)
y_test = torch.LongTensor(y_test)
y_train = torch.LongTensor(y_train)

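class ANN_Model(nn.Module):
    """Feed-forward network with two hidden ReLU layers."""
    def __init__(self, input_features=18, hidden1=20, hidden2=20, out_features=2):
        super().__init__()
        self.layer_1_connection = nn.Linear(input_features, hidden1)
        self.layer_2_connection = nn.Linear(hidden1, hidden2)
        self.out = nn.Linear(hidden2, out_features)

    def forward(self, x):
        x = F.relu(self.layer_1_connection(x))
        x = F.relu(self.layer_2_connection(x))
        # Return the output-layer logits; CrossEntropyLoss applies log-softmax itself.
        return self.out(x)

torch.manual_seed(42)
# Match the input layer to the actual number of feature columns.
ann = ANN_Model(input_features=X_train.shape[1])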

loss_function = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(ann.parameters(), lr = 0.01)

final_loss = []
n_epochs = 500
for epoch in range(n_epochs):
    y_pred = ann(X_train)  # calling the module runs forward()
    loss = loss_function(y_pred, y_train)
    final_loss.append(loss.item())  # store a plain float, not the graph-attached tensor

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Predict the class of every test row in one batch, without tracking gradients.
ann.eval()
with torch.no_grad():
    y_pred = ann(X_test).argmax(dim=1).numpy()

y_test = y_test.numpy()

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Model Accuracy: 100.00%


**(5 pts) Compare the performance of the neural networks to another model you created. Which performed better? Why do you think that is?**

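The performance of the neural network compared to the logistic regression model that I used in a prior assignment is night and day. The logistic regression had an accuracy score of 84.52%, while the neural network had an accuracy of 100.00%. A neural network can outperform logistic regression because it can model complex, non-linear relationships in the data, while logistic regression is limited to linear decision boundaries. If the dataset has non-linear patterns or many interacting features, the neural network's flexibility gives it a clear advantage. However, a perfect 100% test accuracy should be treated as a warning sign rather than as proof of a better model. Overfitting usually lowers test accuracy, so a perfect score more often points to a problem in the pipeline, such as the target column being altered during preprocessing, information leaking between the training and test sets, or a test set that contains only one class. The two scores are also only comparable if both models were trained on the same cleaned data and split. Proper validation (e.g., cross-validation and a check of the class distribution) is necessary to confirm the results.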
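
The kind of validation mentioned above could look like the following sketch. It uses stratified 5-fold cross-validation and compares a model against a majority-class baseline. The synthetic, imbalanced data from `make_classification` is an assumption standing in for the credit card dataset, and logistic regression is used only to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with roughly 90% of rows in one class (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline that always predicts the most frequent class.
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_demo, y_demo, cv=cv, scoring="accuracy"
)
model_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print("class counts:", np.bincount(y_demo))
print(f"baseline accuracy: {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")
print(f"model accuracy:    {model_scores.mean():.3f} +/- {model_scores.std():.3f}")
```

If a model's accuracy is close to the baseline, or if the class counts show only one class, a high accuracy says little about what the model has actually learned.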

**Extra credit: (3 pts) for using Keras and PyTorch. Additional (2 pts) for identifying an optimal activation function and optimizer function for the dataset you chose.**

