# Lab Assignment Five: Wide and Deep Network Architectures

In this lab, we will select a prediction task to perform on our dataset, evaluate two different deep learning architectures and tune hyper-parameters for each architecture.

## Team Members:
1) Mohammed Ahmed Abdelrazek Aboelela.

2) Naim Barnett

## Dataset Selection

Dataset: Credit Score Classification - https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv

### Overview and Business Understanding

Global financial companies and banks need a classifier that helps them decide whether to trust a customer before lending a large sum of money, such as a mortgage or a line of credit. To judge a customer's reliability, these institutions use the credit score, which depends on a variety of factors. Kaggle hosts a large public dataset that records the known factors associated with credit score along with the final credit score bracket. It contains over 100,000 data points and includes both numeric and categorical data.

Our goal is to build an intelligent system that sorts people into credit score brackets and so reduces manual effort. The main prediction task is therefore to classify a customer's credit score based on their credit-related attributes. This is of direct interest to third parties (such as companies) that want a tool to reduce the effort of classifying their customers' credit scores. For the prediction algorithm to be considered useful, it needs to predict the credit scores of the test customers reliably. As we understand the difference between online and offline analysis, the model will be used mostly for offline analysis: it is trained and tested on the already provided data points, and new data is then collected and fed to the model, which predicts the corresponding credit score bracket.

## Preparation

In [None]:
"""Importing all the needed packages"""
import numpy as np 
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from copy import deepcopy
import seaborn as sns
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
import re
import missingno as mn         #make sure to have the package installed "pip install missingno"
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import pprint as pp
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import make_scorer, accuracy_score, precision_score
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
import tensorflow as tf
from tensorflow import keras
import os
os.environ['AUTOGRAPH_VERBOSITY'] = '0'

print(tf.__version__)
print(keras.__version__)

from tensorflow.keras.layers import Dense, Activation, Input
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import concatenate

In [None]:
"""Loading the dataset"""
df_train_orig = pd.read_csv('train.csv', low_memory=False)
df_train_orig.head()

In [None]:
data = deepcopy(df_train_orig)

The output below shows the overall summary statistics of the raw data. Once cleaning is complete, values such as the mean should be more representative.

In [None]:
#Showing the initial form of the data and their related features and averages
data.info()
data.describe().T

Below is a table that includes a description of each attribute in our dataset.

| Attribute | Description |
| --- | --- |
| ID | a unique identification of an entry |
| Customer_ID | a unique identification of a person |
| Month | the month of the year |
| Name | the name of a person |
| Age | the age of the person |
| SSN | the social security number of the person |
| Occupation | the occupation of the person |
| Annual_Income | the annual income of the person |
| Monthly_Inhand_Salary | the monthly base salary of a person |
| Num_Bank_Accounts | the number of bank accounts a person holds |
| Num_Credit_Card | the number of other credit cards held by a person |
| Interest_Rate | the interest rate on the credit card |
| Num_of_Loan | the number of loans taken from the bank |
| Type_of_Loan | the types of loan taken by a person |
| Delay_from_due_date | the average number of days delayed from the payment date |
| Num_of_Delayed_Payment | the average number of payments delayed by a person |
| Changed_Credit_Limit | the percentage change in credit card limit |
| Num_Credit_Inquiries | the number of credit card inquiries |
| Credit_Mix | the classification of the mix of credits (the types of different credit accounts) |
| Outstanding_Debt | the remaining debt to be paid (in USD) |
| Credit_Utilization_Ratio | the utilization ratio of credit card (the sum of all your balances, divided by the sum of your cards' credit limits) |
| Credit_History_Age | the age of credit history of the person |
| Payment_of_Min_Amount | whether only the minimum amount was paid by the person |
| Total_EMI_per_month | the monthly EMI "Equated monthly installment" payments (in USD) |
| Amount_invested_monthly | the monthly amount invested by the customer (in USD) |
| Payment_Behaviour | the payment behavior of the customer |
| Monthly_Balance | the monthly balance amount of the customer (in USD) |
| Credit_Score | the bracket of credit score (Poor, Standard, Good) |

Due to the relatively large number of attributes (27 plus Credit_Score), we are inclined to drop some of them if:
1) They are not relevant to our analysis.

2) They have a large amount of missing data (which makes imputation hard).

In [None]:
#Looking at the overall shape of the data
mn.matrix(data)
plt.title("Visualization of the overall data shape", fontsize=30)
plt.show()

To keep our results as accurate as possible, we first need to clean the data.

We first tried imputing the null or unusable values, but found that imputation ended up skewing the values. Additionally, the sheer number of values caused overfitting in some of our calculations. As a result, we chose to simply remove the unusable data.

First, we remove the columns that are not useful to our analysis. Information such as Customer_ID, Month, Name, and SSN is general information that is extremely unlikely to have any effect on the trends we are analyzing, so we can remove these columns to narrow our dataset.

In [None]:
#Print column names
data.columns

In [None]:
'''Remove ID, Customer_ID, Name, SSN, Month, and Type_of_Loan. We remove "Type_of_Loan" because it has a very large number
of unique values and combinations that are hard to trace and would likely make it harder for the network to find a pattern
in training. Note: Occupation is not dropped here; it stays in the dataset as a categorical feature.
'''

data.drop(['ID','Customer_ID', 'Name', 'SSN', 'Type_of_Loan', 'Month'], axis=1, inplace=True)
data.info()

Secondly, we strip illegal characters from the values and drop every row that is left with missing data.

In [None]:
#Strip invalid characters (anything other than word characters, whitespace, or '.', plus '_' and '-')
data = data.replace(r'[^\w\s.]|_|-', '', regex=True)
#Replace all blank strings with null so they can be dropped
data.replace('', np.nan, inplace=True)
#Remove all rows with null values
data.dropna(inplace=True)
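
To make the effect of the cleaning pattern concrete, here is a minimal sketch on a few made-up values written in the style of the raw file (the rows are illustrative assumptions, not actual records). It applies the same regular expression and the same blank-to-null step as the cell above.

```python
import numpy as np
import pandas as pd

# Toy values imitating the kinds of entries seen in the raw file (assumed, not real rows)
raw_toy = pd.DataFrame({
    "Age": ["28_", "34", "-500"],
    "Occupation": ["_______", "Teacher", "Engineer"],
    "Payment_Behaviour": ["!@9#%8", "High_spent_Small_value_payments", "Low_spent_Large_value_payments"],
})

# Same pattern as the notebook: drop every character that is not a word
# character, whitespace, or '.', and also drop '_' and '-'
cleaned_toy = raw_toy.replace(r'[^\w\s.]|_|-', '', regex=True)

# Entries that consisted only of invalid characters are now empty strings
cleaned_toy = cleaned_toy.replace('', np.nan)

print(cleaned_toy)
```

Two side effects are worth keeping in mind. The placeholder `!@9#%8` collapses to `98`, which is why that category shows up later, and because `-` is stripped, a negative entry such as `-500` becomes `500` instead of being flagged as invalid.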

In [None]:
data.info()

Thirdly, we want to clear out all duplicate data so our frequency analysis remains accurate.

In [None]:
#Find duplicate instances
duplicates = data[data.duplicated()]

#Remove all duplicates
data = data.drop_duplicates()

data.info()

Fourthly, we would like to remove outliers from our dataset, so that our data analysis isn't skewed.

In [None]:
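#Determine outliers using the numeric columns only (recent pandas versions raise on object columns)
num_cols = data.select_dtypes(include='number').columns
Q1 = data[num_cols].quantile(0.25)
Q3 = data[num_cols].quantile(0.75)
IQR = Q3 - Q1
#Remove rows where any numeric value falls outside the 1.5*IQR fences
is_outlier = ((data[num_cols] < (Q1 - 1.5*IQR)) | (data[num_cols] > (Q3 + 1.5*IQR))).any(axis=1)
data = data[~is_outlier]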
data = data[(data['Num_Bank_Accounts'] >= 0)]

data[['Age']] = data[['Age']].apply(pd.to_numeric)
data = data.loc[(data["Age"] > 0) & (data["Age"] <= 112)] #keep plausible ages only (upper bound of 112 years)

data[['Changed_Credit_Limit']] = data[['Changed_Credit_Limit']].apply(pd.to_numeric)
#Since it's a percentage, keep only the instances with values above 0 and up to 100%
data = data.loc[(data["Changed_Credit_Limit"] > 0) & (data["Changed_Credit_Limit"] <= 100)]

#Convert to numeric, then keep only reasonable monthly balances (value < 10000)
data[['Monthly_Balance']] = data[['Monthly_Balance']].astype('float64')
data = data.loc[(data["Monthly_Balance"] < 10000)]

data = data.loc[(data["Payment_Behaviour"] != '98')] #noticed this unreasonable category thus removing it

data = data.loc[(data["Payment_of_Min_Amount"] != 'NM')] #keeping only known info about payment of min amount

data.info()
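
As a small worked example of the 1.5 × IQR rule used above, the sketch below applies it to a single toy series (the numbers are invented for illustration). Values outside the interval [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are treated as outliers.

```python
import pandas as pd

# Toy feature with one obvious outlier (illustrative values only)
toy_values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="toy_feature")

q1 = toy_values.quantile(0.25)   # 11.0
q3 = toy_values.quantile(0.75)   # 12.5
iqr = q3 - q1                    # 1.5

lower_fence = q1 - 1.5 * iqr     # 8.75
upper_fence = q3 + 1.5 * iqr     # 14.75

# Keep only the values inside the fences
toy_kept = toy_values[(toy_values >= lower_fence) & (toy_values <= upper_fence)]
print(toy_kept.tolist())
```

In the notebook the same test is applied column by column, and a row is dropped as soon as any one of its numeric values falls outside its column's fences, which is why this step removes a large share of the rows.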

Fifthly, we convert the features that hold numbers but were read in as strings into numeric types for convenience.

In [None]:
"""Converting the numeric features that were read in as strings to numeric types"""
data[['Annual_Income']] = data[['Annual_Income']].apply(pd.to_numeric)
data[['Num_of_Loan']] = data[['Num_of_Loan']].apply(pd.to_numeric)
data[['Num_of_Delayed_Payment']] = data[['Num_of_Delayed_Payment']].apply(pd.to_numeric)
data[['Outstanding_Debt']] = data[['Outstanding_Debt']].apply(pd.to_numeric)
data['Credit_History_Age'] = data['Credit_History_Age'].str[:2] #Keeping the year part only
data[['Credit_History_Age']] = data[['Credit_History_Age']].apply(pd.to_numeric)
data[['Amount_invested_monthly']] = data[['Amount_invested_monthly']].apply(pd.to_numeric)

In [None]:
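#reset_index returns a new frame, so assign it back; drop=True discards the old index
data = data.reset_index(drop=True)
data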

In [None]:
data.info()

Our clean data then becomes: 

In [None]:
mn.matrix(data)
plt.title("Post-Cleaning", fontsize=30)
plt.show()

In [None]:
data.describe().T

Next, we split the feature columns into a categorical group and a numerical group.

In [None]:
categorical_vars = []
numerical_vars = []
for column in data.columns:
    if data[column].dtype == 'object' and column != "Credit_Score":
        categorical_vars.append(column)
    else:
        if column != "Credit_Score": numerical_vars.append(column)
        
print("The categorical variables in our cleaned dataset are:", categorical_vars)
print("The numerical variables in our cleaned dataset are:", numerical_vars)

Performing the usual standard scaling on numerical features.

In [None]:
ss = StandardScaler()
data[numerical_vars] = ss.fit_transform(data[numerical_vars].values)
data.head()
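
Note that the cell above fits the scaler on the full dataset. A common alternative, sketched here on synthetic data, is to fit the scaler on the training split only and reuse its statistics on the test split, so that no test-set information reaches training. The names `X_toy`, `X_tr`, and `X_te` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features standing in for the real columns (assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=0)

scaler_toy = StandardScaler()
X_tr_scaled = scaler_toy.fit_transform(X_tr)  # learn mean and std from the training rows only
X_te_scaled = scaler_toy.transform(X_te)      # apply the training statistics to the test rows

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

With this ordering the training columns have zero mean and unit variance exactly, while the test columns are only approximately standardized, which is the expected behavior.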

In [None]:
#COME BACK LATER IF WE HAVE TIME
"""Creating a heatmap to see which numerical variables are mostly correlated to help in dimensionality reduction (if we will do it)"""
#sns.heatmap(data.corr(),annot=True)
#plt.show()

In [None]:
data["Credit_Score"].value_counts()

Encoding the categorical features as integers using the label encoder from scikit-learn.

In [None]:
"""CHECK ONE-HOT ENCODING INSTEAD OF NORMAL ENCODER LATER HERE IF YOU HAVE TIME"""
# define objects that can encode each variable as integer    
encoders = dict() # save each encoder in dictionary
# train all encoders (special case the target 'Credit_Score')
for col in categorical_vars+['Credit_Score']:
    data[col] = data[col].str.strip()
    
    if col=="Credit_Score":
        # special case the target, just replace the column
        tmp = LabelEncoder()
        data[col] = tmp.fit_transform(data[col])
    else:
        # integer encode strings that are features
        encoders[col] = LabelEncoder() # save the encoder
        data[col+'_int'] = encoders[col].fit_transform(data[col])
        

#Container for the names of our categorical encoded features
categorical_vars_ints = [x+'_int' for x in categorical_vars]

#Collecting together the features we will be interested in using later
feature_columns = categorical_vars_ints+numerical_vars

print(f"We will use the following {len(feature_columns)} features:")
pp.pprint(feature_columns)

In [None]:
data.head()

In [None]:
data.info()

In [None]:
# we have the following lists now of data that we can use with our dataframes:
print("Numeric Headers:")
pp.pprint(numerical_vars) # normalized numeric data
print("\nCategorical String Headers:")
pp.pprint(categorical_vars) # string data
print("\nCategorical Headers, Encoded as Integer:")
pp.pprint(categorical_vars_ints) # string data encoded as an integer

Thus, the final pre-processed dataset consists of 31,212 instances. It has 17 numerical features: ['Age', 'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan', 'Delay_from_due_date', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Num_Credit_Inquiries', 'Outstanding_Debt', 'Credit_Utilization_Ratio', 'Credit_History_Age', 'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance'], and 4 categorical features: ['Occupation', 'Credit_Mix', 'Payment_of_Min_Amount', 'Payment_Behaviour']. The task is to classify the target variable, ['Credit_Score'], which is also integer-encoded with its own label encoder.

In [None]:
# sandbox for looking at different categorical variables
for col in categorical_vars:
    vals = data[col].unique()
    print(col,'has', len(vals), 'unique values:')
    print(vals)

We will now look to combine related features into cross-product features.

In [None]:
# columns to cross, chosen because they describe related payment behavior
cross_columns = ['Credit_Mix', 'Payment_of_Min_Amount', 'Payment_Behaviour']

For our crossed columns, we chose to cross 'Credit_Mix', 'Payment_of_Min_Amount', and 'Payment_Behaviour' because we think combinations of these related payment features can reveal new information. However, we see no relevance in including 'Occupation' in any of the crossings, as it has no reasonable relationship to a customer's credit behavior and adds nothing to what we already have. We can also cross any two of the three columns together; we investigate this in the modeling part.
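
To illustrate what a crossed feature is, the minimal sketch below joins two categorical columns into one combined label and integer-encodes the result. The four rows are toy values chosen for illustration, not rows from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows with two of the categorical columns (illustrative values)
toy_cats = pd.DataFrame({
    "Credit_Mix": ["Good", "Bad", "Good", "Standard"],
    "Payment_of_Min_Amount": ["No", "Yes", "No", "Yes"],
})

# Cross the columns by joining their string values row by row
toy_crossed = toy_cats[["Credit_Mix", "Payment_of_Min_Amount"]].apply(
    lambda row: "_".join(row), axis=1
)

# Each distinct combination gets its own integer id (classes are sorted alphabetically)
toy_cross_enc = LabelEncoder()
toy_cross_codes = toy_cross_enc.fit_transform(toy_crossed)

print(toy_crossed.tolist())
print(list(toy_cross_enc.classes_), toy_cross_codes.tolist())
```

Each combination becomes a single category that the wide branch can memorize directly, at the cost of a cardinality that grows with the product of the individual column cardinalities.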

Before anything else, we make a small modification to our target variable. Our problem is labeled as a multi-class classification, but when you actually think about it, a company is mainly concerned with knowing whether a person has a "Poor" credit score, in order to decide whether to give them the loan. It matters much less whether the score is "Standard" or "Good", since both are deemed good enough to be given loans. This leads us to change our prediction task to a binary one, where we determine whether the target variable is "poor: 0" or "not poor: 1". We implement this below.

In [None]:
data["Credit_Score"].value_counts()

In [None]:
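#LabelEncoder sorted the classes alphabetically: Good=0, Poor=1, Standard=2
#Map Poor -> 0 and Good/Standard -> 1 (assigning back avoids relying on an in-place column update)
data["Credit_Score"] = data["Credit_Score"].map({1: 0, 0: 1, 2: 1})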
data["Credit_Score"].value_counts()
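
The integer codes being remapped come from `LabelEncoder`, which assigns them in sorted (alphabetical) order of the class names. The short sketch below reproduces that ordering and the binary remapping on a handful of toy labels, so the 0/1 meaning can be checked by eye.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target labels (illustrative)
toy_labels = pd.Series(["Good", "Poor", "Standard", "Poor", "Good"])

toy_target_enc = LabelEncoder()
toy_codes = pd.Series(toy_target_enc.fit_transform(toy_labels))

# Classes are sorted alphabetically: Good -> 0, Poor -> 1, Standard -> 2
print(list(toy_target_enc.classes_))

# Binary target: Poor -> 0, Good or Standard -> 1
toy_binary = toy_codes.map({1: 0, 0: 1, 2: 1})
print(toy_binary.tolist())
```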

Given our binary classification problem, the metric we decided to use is precision. The reasoning is that we want to maximize the true positives out of the total predicted positives (where positive means the person is predicted to be not poor, so the company would lend them money). A false positive means lending to a customer whose credit score is actually poor, which is the costly mistake. We care much less about whether a predicted negative is true or false, because we do not lose money on a loan we decide not to give. Thus precision, the ratio of true positives to all predicted positives, TP / (TP + FP), is the metric we want to maximize.
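
As a quick check of the definition, the sketch below computes precision for a small set of made-up predictions, both from the confusion matrix counts and with scikit-learn's `precision_score`. The label vectors are invented for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score

# 1 = not poor (loan granted), 0 = poor; toy labels for illustration
y_true_toy = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_toy = [1, 1, 0, 1, 0, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_manual = tp / (tp + fp)
precision_sklearn = precision_score(y_true_toy, y_pred_toy)

print(tp, fp, precision_manual, precision_sklearn)
```

Here four of the five predicted positives are correct, so the precision is 0.8. The single false positive is the case the lender wants to avoid.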

In [None]:
"""Separating the features and the target variable in the dataframe"""
X, y = data[feature_columns+categorical_vars], data['Credit_Score']

In [None]:
"""Separating into testing and training samples"""
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

For our model, we use stratified 10-fold cross-validation so that every fold keeps the same class distribution. This lets us train the model several times on different training and validation splits. We prefer stratified 10-fold over ShuffleSplit/StratifiedShuffleSplit because the shuffle-based splitters draw each split independently at random, so the same instances can be repeated across iterations while others may never be held out. Our sample is not large enough for us to ignore this.

In [None]:
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(X_train, y_train)
print(skf)
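
The sketch below shows how a stratified 10-fold splitter can be combined with a precision scorer. It uses synthetic data and a logistic regression as a lightweight stand-in for the network, so the dataset, the estimator, and the `*_toy` names are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem (roughly 30% class 0, 70% class 1)
X_cv_toy, y_cv_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.3, 0.7], random_state=0
)

skf_toy = StratifiedKFold(n_splits=10)

# Share of positives in each held-out fold; stratification keeps these nearly equal
fold_pos_rates = [
    y_cv_toy[test_idx].mean() for _, test_idx in skf_toy.split(X_cv_toy, y_cv_toy)
]

# Precision on each held-out fold
cv_precisions = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_cv_toy,
    y_cv_toy,
    cv=skf_toy,
    scoring=make_scorer(precision_score),
)

print(np.round(fold_pos_rates, 2))
print(cv_precisions.mean(), cv_precisions.std())
```

Reporting the mean and spread of the per-fold precision gives a basis for comparing architectures later, since every candidate is scored on the same ten folds.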

## Modeling (5 points total)

[2 points] Create at least three combined wide and deep networks to classify your data using Keras. Visualize the performance of the network on the training data and validation data in the same plot versus the training iterations. Note: use the "history" return parameter that is part of Keras "fit" function to easily access this data.

In [None]:
"""Investigating the crossed columns"""
cols_list = cross_columns

# 1. create crossed labels by string join operation
X_crossed_train = X_train[cols_list].apply(lambda x: '_'.join(x), axis=1)
X_crossed_test = X_test[cols_list].apply(lambda x: '_'.join(x), axis=1)

# combine together for training
all_vals = np.hstack((X_crossed_train.to_numpy(),  X_crossed_test.to_numpy()))
print(np.unique(all_vals))
    
# 2. encode as integers, stacking all possibilities
enc = LabelEncoder()
enc.fit(all_vals)

encoded_vals_train = enc.transform(X_crossed_train)
encoded_vals_test  = enc.transform(X_crossed_test)

print(np.min(encoded_vals_train), np.max(encoded_vals_train))

In [None]:
# now let's create some different crossed values
cross_columns = [['Credit_Mix', 'Payment_of_Min_Amount', 'Payment_Behaviour'],
                 ['Credit_Mix','Payment_of_Min_Amount'],
                 ['Credit_Mix','Payment_Behaviour'],
                 ['Payment_of_Min_Amount','Payment_Behaviour']
                ]


# cross each set of columns in the list above
cross_col_df_names = []
for cols_list in cross_columns:
    # encode as ints for the embedding
    enc = LabelEncoder()
    
    # 1. create crossed labels by join operation
    X_crossed_train = X_train[cols_list].apply(lambda x: '_'.join(x), axis=1)
    X_crossed_test = X_test[cols_list].apply(lambda x: '_'.join(x), axis=1)
    
    # get a nice name for this new crossed column
    cross_col_name = '_'.join(cols_list)
    
    # 2. encode as integers, stacking all possibilities
    enc.fit(np.hstack((X_crossed_train.to_numpy(),  X_crossed_test.to_numpy())))
    
    # 3. Save into dataframe with new name
    X_train[cross_col_name] = enc.transform(X_crossed_train)
    X_test[cross_col_name] = enc.transform(X_crossed_test)
    
    # Save the encoder used here for later:
    encoders[cross_col_name] = enc
    
    # keep track of the new names of the crossed columns
    cross_col_df_names.append(cross_col_name) 
    
cross_col_df_names

In [None]:
# Train a model only using crossed values
# get crossed columns
X_train_crossed = X_train[cross_col_df_names].to_numpy()
X_test_crossed = X_test[cross_col_df_names].to_numpy()

crossed_outputs = [] # this is where we will keep track of output of each branch

input_crossed = Input(shape=(X_train_crossed.shape[1],), dtype='int64', name='categorical')
for idx,col in enumerate(cross_col_df_names):
    
    # number of categories for this crossed variable, taken from the
    # encoder that was fit on the train and test values together
    N = len(encoders[col].classes_)
    N_reduced = int(np.sqrt(N))
    
    # this line of code does this: input_crossed[:,idx]
    x = tf.gather(input_crossed, idx, axis=1)
    
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N, 
                  output_dim=N_reduced, 
                  input_length=1, name=col+'_embed')(x)
    
    # save these outputs to concatenate later
    crossed_outputs.append(x)
    

# now concatenate the outputs and add a fully connected layer
wide_branch = concatenate(crossed_outputs, name='concat_1')
wide_branch = Dense(units=1,activation='sigmoid', name='combined')(wide_branch)

model = Model(inputs=input_crossed, outputs=wide_branch)

model.compile(optimizer='sgd',
              loss='mean_squared_error',
              metrics=['accuracy'])

model.fit(X_train_crossed,
        y_train, epochs=10, batch_size=32, verbose=1)

[2 points] Investigate generalization performance by altering the number of layers in the deep branch of the network. Try at least two different number of layers. Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab to select the number of layers that performs superiorly. 

[1 points] Compare the performance of your best wide and deep network to a standard multi-layer perceptron (MLP). Alternatively, you can compare to a network without the wide branch (i.e., just the deep network). 

## Exceptional Work (1 points total)

5000 students: You have free rein to provide additional analyses.

One idea (required for 7000 level students): For classification tasks, compare using the receiver operating characteristic and area under the curve. For regression tasks, use Bland-Altman plots and residual variance calculations.  Use proper statistical methods to compare the performance of different models.