### Data Gathering and Import

First we use pandas to read from the dataset into our data frame.

We'll also display the first few rows so we get an idea of what the data looks like.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('financial_data.csv')

In [3]:
df.head(10)

Unnamed: 0,Age,Gender,Education Level,Marital Status,Income,Credit Score,Loan Amount,Loan Purpose,Employment Status,Years at Current Job,Payment History,Debt-to-Income Ratio,Assets Value,Number of Dependents,City,State,Country,Previous Defaults,Marital Status Change,Risk Rating
0,49,Male,PhD,Divorced,72799.0,688.0,45713.0,Business,Unemployed,19,Poor,0.154313,120228.0,0.0,Port Elizabeth,AS,Cyprus,2.0,2,Low
1,57,Female,Bachelor's,Widowed,,690.0,33835.0,Auto,Employed,6,Fair,0.14892,55849.0,0.0,North Catherine,OH,Turkmenistan,3.0,2,Medium
2,21,Non-binary,Master's,Single,55687.0,600.0,36623.0,Home,Employed,8,Fair,0.362398,180700.0,3.0,South Scott,OK,Luxembourg,3.0,2,Medium
3,59,Male,Bachelor's,Single,26508.0,622.0,26541.0,Personal,Unemployed,2,Excellent,0.454964,157319.0,3.0,Robinhaven,PR,Uganda,4.0,2,Medium
4,25,Non-binary,Bachelor's,Widowed,49427.0,766.0,36528.0,Personal,Unemployed,10,Fair,0.143242,287140.0,,New Heather,IL,Namibia,3.0,1,Low
5,30,Non-binary,PhD,Divorced,,717.0,15613.0,Business,Unemployed,5,Fair,0.295984,,4.0,Brianland,TN,Iceland,3.0,1,Medium
6,31,Non-binary,Master's,Widowed,45280.0,672.0,6553.0,Personal,Self-employed,1,Good,0.37889,,,West Lindaview,MD,Bouvet Island (Bouvetoya),0.0,1,Low
7,18,Male,Bachelor's,Widowed,93678.0,,,Business,Unemployed,10,Poor,0.396636,246597.0,1.0,Melissahaven,MA,Honduras,1.0,1,Low
8,32,Non-binary,Bachelor's,Widowed,20205.0,710.0,,Auto,Unemployed,4,Fair,0.335965,227599.0,0.0,North Beverly,DC,Pitcairn Islands,4.0,2,Low
9,55,Male,Bachelor's,Married,32190.0,600.0,29918.0,Personal,Self-employed,5,Excellent,0.484333,130507.0,4.0,Davidstad,VT,Thailand,,2,Low


### Data Cleanup and Normalization

##### Data Cleanup
Many rows contain null values or NaN values. We will:

- Locate the missing values
- Replace missing values with the median for that specific column

In [4]:
#missing values per column
missing_vals = df.isnull().sum()
print(missing_vals)

Age                         0
Gender                      0
Education Level             0
Marital Status              0
Income                   2250
Credit Score             2250
Loan Amount              2250
Loan Purpose                0
Employment Status           0
Years at Current Job        0
Payment History             0
Debt-to-Income Ratio        0
Assets Value             2250
Number of Dependents     2250
City                        0
State                       0
Country                     0
Previous Defaults        2250
Marital Status Change       0
Risk Rating                 0
dtype: int64


In [5]:
# store the columns that have null values
cols_missing_vals = ['Income', 'Credit Score', 'Loan Amount', 'Assets Value', 'Number of Dependents', 'Previous Defaults']

# figure out the medians for those columns
medians = df[cols_missing_vals].median()

# reassign null values within those columns to their respective medians
df[cols_missing_vals] = df[cols_missing_vals].fillna(medians)

In [6]:
# see if what we did worked
missing_vals_per_col = df.isnull().sum()

total_missing_vals = missing_vals_per_col.sum()

print(missing_vals_per_col)
print("\nTotal missing values in the entire dataset: ", total_missing_vals)

Age                      0
Gender                   0
Education Level          0
Marital Status           0
Income                   0
Credit Score             0
Loan Amount              0
Loan Purpose             0
Employment Status        0
Years at Current Job     0
Payment History          0
Debt-to-Income Ratio     0
Assets Value             0
Number of Dependents     0
City                     0
State                    0
Country                  0
Previous Defaults        0
Marital Status Change    0
Risk Rating              0
dtype: int64

Total missing values in the entire dataset:  0


##### Data Normalization

We normalize the columns containing numbers to be in the range $[0, 1]$. This will be helpful if we decisde to use KNN, SVM, boosting or PCA

In [7]:
from sklearn.preprocessing import MinMaxScaler

# make a copy of the dataset to normalize it
df_norm = df.copy()

# extract the columns containing numbers
number_cols = df.select_dtypes(include=['float64', 'int64']).columns

scaler = MinMaxScaler()

df_norm[number_cols] = scaler.fit_transform(df_norm[number_cols])

In [8]:
df_norm.head()

Unnamed: 0,Age,Gender,Education Level,Marital Status,Income,Credit Score,Loan Amount,Loan Purpose,Employment Status,Years at Current Job,Payment History,Debt-to-Income Ratio,Assets Value,Number of Dependents,City,State,Country,Previous Defaults,Marital Status Change,Risk Rating
0,0.607843,Male,PhD,Divorced,0.527982,0.442211,0.904774,Business,Unemployed,1.0,Poor,0.108626,0.357832,0.0,Port Elizabeth,AS,Cyprus,0.5,1.0,Low
1,0.764706,Female,Bachelor's,Widowed,0.49772,0.452261,0.640806,Auto,Employed,0.315789,Fair,0.097838,0.127861,0.0,North Catherine,OH,Turkmenistan,0.75,1.0,Medium
2,0.058824,Non-binary,Master's,Single,0.356849,0.0,0.702765,Home,Employed,0.421053,Fair,0.524824,0.573847,0.75,South Scott,OK,Luxembourg,0.75,1.0,Medium
3,0.803922,Male,Bachelor's,Single,0.065035,0.110553,0.47871,Personal,Unemployed,0.105263,Excellent,0.709969,0.490327,0.75,Robinhaven,PR,Uganda,1.0,1.0,Medium
4,0.137255,Non-binary,Bachelor's,Widowed,0.294244,0.834171,0.700653,Personal,Unemployed,0.526316,Fair,0.086482,0.954066,0.5,New Heather,IL,Namibia,0.75,0.5,Low


In [9]:
# get rid of location information, there are too many categories here
df = df.drop(columns=['City', 'State', 'Country'])
df_norm = df_norm.drop(columns=['City', 'State', 'Country'])

### Data Transformation and Conversion

We must now convert columns containing categories of data into numerical values. 

We will use OneHotEncoder to encode categorical values into binary, then transform the columns to prevent redundancy using ColumnTransformer. 

For example, the Education Level column will split into "Education Level_High School" and "Education Level_Master's" and "Education Level_PhD". Each will be assigned a 0 or 1. If all columns are 0, the Education Level must be 'Bachelor's'.

In [10]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# find the columns that are not numbers (categories)
category_cols = df.select_dtypes(include=['object']).columns

# encode the data, use drop='first' to avoid dummy variables and prevent redundancy
category_encoder = ColumnTransformer(
    transformers = [('Category Column Encoder', OneHotEncoder(drop='first', sparse_output=False), category_cols)], 
    remainder='passthrough'
)

# apply the encode to the normalized and non-normalized data
df_encoded = category_encoder.fit_transform(df)
df_norm_encoded = category_encoder.fit_transform(df_norm)


# create new column names based on categories
new_col_names = (
    category_encoder.named_transformers_['Category Column Encoder'].get_feature_names_out(category_cols).tolist() + number_cols.tolist()
)

# transform the encoded data back into a data frame using the column names
df_transformed = pd.DataFrame(df_encoded, columns=new_col_names)
df_norm_transformed = pd.DataFrame(df_norm_encoded, columns=new_col_names)

In [11]:
# have to reconstruct the 'Risk Rating' column since it was split up in the previous step
risk_rating_cols1 = [col for col in df_transformed.columns if col.startswith('Risk Rating')]
risk_rating_cols2 = [col for col in df_norm_transformed.columns if col.startswith('Risk Rating')]

df_transformed['Risk Rating'] = np.argmax(df_transformed[risk_rating_cols1].values, axis=1)
df_norm_transformed['Risk Rating'] = np.argmax(df_norm_transformed[risk_rating_cols2].values, axis=1)

rating_categories = ['Low', 'Medium', 'High']
df_transformed['Risk Rating'] = df_transformed['Risk Rating'].apply(lambda rating: rating_categories[rating])
df_norm_transformed['Risk Rating'] = df_norm_transformed['Risk Rating'].apply(lambda rating: rating_categories[rating])

df_transformed = df_transformed.drop(columns=risk_rating_cols1)
df_norm_transformed = df_norm_transformed.drop(columns=risk_rating_cols2)

Data frame after transformation and conversion:

In [12]:
df_transformed.head()

Unnamed: 0,Gender_Male,Gender_Non-binary,Education Level_High School,Education Level_Master's,Education Level_PhD,Marital Status_Married,Marital Status_Single,Marital Status_Widowed,Loan Purpose_Business,Loan Purpose_Home,...,Income,Credit Score,Loan Amount,Years at Current Job,Debt-to-Income Ratio,Assets Value,Number of Dependents,Previous Defaults,Marital Status Change,Risk Rating
0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,72799.0,688.0,45713.0,19.0,0.154313,120228.0,0.0,2.0,2.0,Low
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,69773.0,690.0,33835.0,6.0,0.14892,55849.0,0.0,3.0,2.0,Medium
2,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,55687.0,600.0,36623.0,8.0,0.362398,180700.0,3.0,3.0,2.0,Medium
3,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,26508.0,622.0,26541.0,2.0,0.454964,157319.0,3.0,4.0,2.0,Medium
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,49427.0,766.0,36528.0,10.0,0.143242,287140.0,2.0,3.0,1.0,Low


Normalized data frame after transformation and conversion:

In [13]:
df_norm_transformed.head()

Unnamed: 0,Gender_Male,Gender_Non-binary,Education Level_High School,Education Level_Master's,Education Level_PhD,Marital Status_Married,Marital Status_Single,Marital Status_Widowed,Loan Purpose_Business,Loan Purpose_Home,...,Income,Credit Score,Loan Amount,Years at Current Job,Debt-to-Income Ratio,Assets Value,Number of Dependents,Previous Defaults,Marital Status Change,Risk Rating
0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.527982,0.442211,0.904774,1.0,0.108626,0.357832,0.0,0.5,1.0,Low
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.49772,0.452261,0.640806,0.315789,0.097838,0.127861,0.0,0.75,1.0,Medium
2,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.356849,0.0,0.702765,0.421053,0.524824,0.573847,0.75,0.75,1.0,Medium
3,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.065035,0.110553,0.47871,0.105263,0.709969,0.490327,0.75,1.0,1.0,Medium
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.294244,0.834171,0.700653,0.526316,0.086482,0.954066,0.5,0.75,0.5,Low


### Machine Learning Setup

Importing the necessary libraries:

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

Test allocation:

In [15]:
X = df_transformed.drop(columns=['Risk Rating'])
Y = df_transformed['Risk Rating']

Xn = df_norm_transformed.drop(columns=['Risk Rating'])
Yn = df_norm_transformed['Risk Rating']

In [16]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=100)

Xn_train, Xn_test, Yn_train, Yn_test = train_test_split(Xn, Yn, test_size=0.3, random_state=100)

##### Machine Learning Algorithms

First defining all the models:

In [17]:
random_forcast = RandomForestClassifier()

gradient_boosting = GradientBoostingClassifier()

naive_bayes = GaussianNB()

KNN = KNeighborsClassifier(n_neighbors=5)

##### Random Forests

In [18]:
random_forcast.fit(X_train, Y_train)

random_forcast_prediction = random_forcast.predict(X_test)

random_forest_accuracy = accuracy_score(Y_test, random_forcast_prediction)

##### Gradient Boosting

In [19]:
gradient_boosting.fit(Xn_train, Yn_train)

gradient_boosting_prediction = gradient_boosting.predict(Xn_test)

gradient_boosting_accuracy = accuracy_score(Yn_test, gradient_boosting_prediction)

##### Naive Bayes

In [20]:
naive_bayes.fit(X_train, Y_train)

naive_bayes_prediction = naive_bayes.predict(X_test)

naive_bayes_accuracy = accuracy_score(Y_test, naive_bayes_prediction)

##### K-Nearest Neighbours

In [21]:
KNN.fit(Xn_train, Yn_train)

KNN_prediction = KNN.predict(Xn_test)

KNN_accuracy = accuracy_score(Yn_test, KNN_prediction)

##### Results

In [22]:
print("Random Forests Accuracy: ", random_forest_accuracy * 100, "%")
print("Gradient Boosting Accuracy: ", gradient_boosting_accuracy * 100, "%")
print("Naive Bayes Accuracy: ", naive_bayes_accuracy * 100, "%")
print("KNN Accuracy: ", KNN_accuracy * 100, "%")

Random Forests Accuracy:  70.33333333333334 %
Gradient Boosting Accuracy:  70.19999999999999 %
Naive Bayes Accuracy:  70.46666666666667 %
KNN Accuracy:  63.977777777777774 %
