<center> <h1><font size=7> Case Study D</font> </h1> </center>

# Exploring Covid US Government Loans - Example Answer

This notebook contains a minimal example that completes the tasks in Case Study D. Other approaches may also work; this is just one of them.

## Exploratory tasks


#### 1. Load the data

In [None]:
# Load libraries
import pandas as pd
import numpy as np
np.random.seed(1)

In [None]:
loans = pd.read_csv("../../data/NMLoans.csv")
loans.head()

#### 2. Explore the data to understand the different columns

In [None]:
loans.dtypes

In [None]:
loans.describe(include="all")

In [None]:
loans.isna().sum()

In [None]:
loans["LoanAmount"].plot.hist();

#### 3. Create X_num using "LoanAmount", "JobsReported" and "DaysApprovedSinceMay1st", and standard scale these features

In [None]:
# Load scaler
from sklearn.preprocessing import StandardScaler

In [None]:
numeric = loans[["LoanAmount", "JobsReported", "DaysApprovedSinceMay1st"]]
numeric

In [None]:
# This will be useful later for interpreting arrays
numeric_names = numeric.columns.to_list()

In [None]:
X_num = numeric.to_numpy()
X_num

#### 4. Perform PCA on X_num with only 1 component. Which feature contributes to this component the most?

In [None]:
from sklearn.decomposition import PCA

In [None]:
num_PCA = PCA(n_components=1).fit(X_num)

num_PCA.components_

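The above result shows that the first feature, "LoanAmount", contributes orders of magnitude more than the other features to the first principal component. Note that X_num as built above was not passed through the StandardScaler, so this largely reflects the much larger numeric range of "LoanAmount".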
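
As a rough illustration of why scale matters here, the sketch below runs PCA by hand (via SVD) on synthetic data, first unscaled and then standard scaled. The columns `amount`, `jobs` and `days` and the helper `first_component` are made-up stand-ins, not the NMLoans data or the notebook's own method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumed, not the NMLoans data): one large-valued
# column and two small-valued columns, all independent
n = 1000
amount = rng.normal(50_000, 20_000, size=n)
jobs = rng.normal(10, 3, size=n)
days = rng.normal(30, 10, size=n)
X_demo = np.column_stack([amount, jobs, days])

def first_component(X):
    """Return the first principal axis of X via SVD of the centred data."""
    X_centred = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centred, full_matrices=False)
    return vt[0]

# Unscaled: the large-valued column dominates the loadings
raw_loadings = np.abs(first_component(X_demo))

# Standard scaled: zero mean and unit variance per column
X_demo_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled_loadings = np.abs(first_component(X_demo_scaled))

print("Unscaled loadings:", raw_loadings.round(4))
print("Scaled loadings:  ", scaled_loadings.round(4))
```

On unscaled data the first component mostly tracks whichever column has the largest variance, so it says little about the relationships between features.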

#### 5. Explore the distributions of, then one-hot encode "BusinessType", "RaceEthnicity", "Gender", "Veteran", "NonProfit" into X_cat.

In [None]:
import matplotlib.pyplot as plt

In [None]:
categorical = loans[["BusinessType", "RaceEthnicity", "Gender", "Veteran", "NonProfit"]]

In [None]:
columns = categorical.columns.to_list()

for index, column in enumerate(columns):
    plt.figure()
    plt.bar(x=categorical[column].value_counts().index, height=categorical[column].value_counts())
    plt.xticks(rotation=90)

Significant parts of the data are unanswered. For now we will treat "Unanswered" as a category in itself rather than as missing data that needs imputing. We do have some responses for these categories, so we will be able to infer some things, but not necessarily rigorously.
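
To make the idea of "Unanswered as its own category" concrete, here is a minimal sketch of one-hot encoding done by hand with NumPy. The `gender_demo` values are hypothetical, and this is not how `OneHotEncoder` is implemented; it only shows the shape of the result.

```python
import numpy as np

# Hypothetical responses; "Unanswered" is kept as a level of its own
gender_demo = np.array(["Male Owned", "Unanswered", "Female Owned", "Unanswered"])

# np.unique returns the sorted categories and each value's category index
categories, codes = np.unique(gender_demo, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, and the "Unanswered" column is treated like any other level.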

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# We are not yet going to drop categories to handle multicollinearity
ohencoder = OneHotEncoder(handle_unknown="ignore")

# The output is originally sparse, it can be easier to view / handle dense data
X_cat = ohencoder.fit_transform(categorical).todense()

In [None]:
X_cat

In [None]:
# Store the order of the features in the array
category_names = ohencoder.get_feature_names().tolist()

category_names

#### 6. Perform PCA on X_cat and explore the resulting components. Which are the 5 most important features in the first component? How many components would you choose to explain 70% of the variance?

In [None]:
# Perform PCA on the categorical features
cat_PCA = PCA().fit(X_cat)

Exploring the first component

In [None]:
# Exploring what is important in the first component
first_component = cat_PCA.components_[0].tolist()

# We don't care if the component's impact is positive or negative, just its magnitude
first_component_absolute = np.absolute(first_component)

In [None]:
# Get the component's corresponding feature names
names_important = list(zip(category_names, first_component_absolute))

In [None]:
names_important_sorted = sorted(names_important, key=lambda x: x[1], reverse=True)

names_important_sorted

In [None]:
# Remind ourselves of the feature groupings
column_indexes = list(enumerate(columns))

# get the prefixes
prefixes = [("x"+str(column_index[0]), column_index[1]) for column_index in column_indexes]

prefixes

In [None]:
# Look at the first 5 most important to this component
names_important_sorted[:5]

From this we can see that the majority class, Unanswered, is impactful on the first component for several of the features. Interestingly, the Male Owned response for Gender also has a significant loading on this component.

Determining the right number of components.

In [None]:
evr_cumsum = cat_PCA.explained_variance_ratio_.cumsum()

component_numbers = list(range(1, cat_PCA.n_components_ + 1))

In [None]:
plt.plot(component_numbers, evr_cumsum)
plt.ylim(0, 1.2)
plt.title("Cumulative sum of variance explained ratios across components")
plt.axhline(y=0.7, color='r', linestyle='--');

In [None]:
# Find the first index where the cumsum is > 0.7
# getting the value out of the 2D array requires multiple indexing
first_index = np.argwhere(evr_cumsum > 0.7)[0][0]

In [None]:
# Use the component numbers to find which component this index corresponds to
corresponding_component = component_numbers[first_index]

corresponding_component

The first 5 components explain 70% of the variance in the features. There are 29 features, so, as the figure above shows, we can compress our data quite well with few components.
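
The threshold search can also be written in one step. The sketch below uses made-up explained variance ratios (`evr_demo` is hypothetical, not the output of `cat_PCA`) and `np.searchsorted`, which works because the cumulative sum is non-decreasing.

```python
import numpy as np

# Hypothetical explained variance ratios, in the descending order PCA returns them
evr_demo = np.array([0.35, 0.20, 0.10, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02])
cumulative = evr_demo.cumsum()

# Index of the first cumulative value strictly greater than 0.7
first_index_demo = np.searchsorted(cumulative, 0.7, side="right")

# Components are counted from 1, indices from 0
n_components_needed = int(first_index_demo) + 1

print(cumulative.round(2))
print("Components needed:", n_components_needed)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (with `svd_solver="full"`), in which case it keeps just enough components to exceed that fraction of explained variance.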

#### 7. Combine `X_num` with `X_cat` to make `X`. 

In [None]:
# The two data sets need to be combined column-wise (side by side):
# the number of records should stay the same,
# but there will be more columns

X = np.concatenate((X_num, X_cat), axis=1)

In [None]:
print("X_num shape:\t", X_num.shape)
print("X_cat shape:\t", X_cat.shape)
print("X shape:\t", X.shape)

In [None]:
# Combine our lists of names to interpret the resulting X array
feature_names = numeric_names + category_names
feature_names
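
As a small sanity check on what `axis=1` does, the sketch below joins two toy arrays (`X_num_demo` and `X_cat_demo` are hypothetical) and confirms that the rows are preserved while the columns add up.

```python
import numpy as np

X_num_demo = np.arange(6).reshape(3, 2)  # 3 records, 2 numeric features
X_cat_demo = np.eye(3, dtype=int)        # 3 records, 3 one-hot columns

# axis=1 places the arrays side by side: same rows, more columns
X_demo = np.concatenate((X_num_demo, X_cat_demo), axis=1)

print(X_demo.shape)
print(X_demo)
```

Keeping the name lists in the same order as the concatenation is what lets column `i` of the array be matched to entry `i` of the combined names.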

#### 8. Remove the column for LoanAmount from X. Use TSNE to reduce the dimensions of X to two. Take a sample of 500 records in X if this method takes a prohibitive amount of time. Plot the data using "LoanAmount" as colour. Can you see a trend based on this projection?

In [None]:
# Remove the first column from the array and from the supplementary names

y = X[:,0]

X = X[:,1:]

feature_names = feature_names[1:]

In [None]:
# get sample to avoid excessive computation
tsne_mask = np.random.randint(X.shape[0], size=500)

X_sample = X[tsne_mask, :]

y_sample = y[tsne_mask, :]

In [None]:
from sklearn.manifold import TSNE

# Produce the learned reduced dimension data
X_red = TSNE(n_components=2, n_jobs=-1).fit_transform(X_sample)

In [None]:
plt.scatter(X_red[:,0], X_red[:,1], c=y_sample.tolist(), alpha=0.5)
plt.colorbar();

The resulting visualisation will differ depending on the random state, because of how t-SNE works with probabilities. In my run there was no clear or consistent trend between the resulting projection and LoanAmount. However, some clustering of the data does occur, which may tell us that the loans form groupings in some way based on the features given.

Reminder: we don't want to interpret results produced from methods we cannot ourselves explain. t-SNE is more of a black box than some other methods for dimension reduction, but it can tell us about the similarity of some data.
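
One detail of the sampling step is worth a quick look: `np.random.randint` draws indices with replacement, so the 500 sampled rows may include repeats. The sketch below, with a hypothetical `n_records`, contrasts this with drawing distinct rows.

```python
import numpy as np

rng = np.random.default_rng(1)

n_records = 2000   # hypothetical number of rows
sample_size = 500

# Like np.random.randint, integers() draws with replacement,
# so some row indices can appear more than once
with_replacement = rng.integers(0, n_records, size=sample_size)

# choice(..., replace=False) guarantees every sampled row is distinct
without_replacement = rng.choice(n_records, size=sample_size, replace=False)

print("Unique rows (with replacement):   ", len(np.unique(with_replacement)))
print("Unique rows (without replacement):", len(np.unique(without_replacement)))
```

For a repeatable projection, `TSNE` also takes a `random_state` argument, although the plot will still change if the sampled rows change.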

## Modelling Task

We are going to encode and process our data from scratch to ensure reproducibility, using a column encoder.

In [None]:
# Let's make a train/test split within pandas to avoid leakage before we fit anything
np.random.seed(1)

# 0.8 for 80% split
# Produce array of True/False values
# With approx 80% True
mask = np.random.rand(len(loans)) < 0.8

# True values to train
train = loans[mask]

# False values to test
test = loans[~mask]

In [None]:
print("Train:", train.shape)
print("Test:", test.shape)
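
The random mask above gives roughly, not exactly, an 80/20 split. As a sketch of an alternative under the same idea, the code below compares the mask approach with shuffling row indices and cutting at a fixed position; `n_rows` is a hypothetical size.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows = 10_000  # hypothetical number of records

# Mask approach: each row independently has an 80% chance of being in train
mask_demo = rng.random(n_rows) < 0.8
print("Train fraction from mask:", mask_demo.mean())

# Exact approach: shuffle the row indices, then cut at the 80% position
indices = rng.permutation(n_rows)
cut = int(0.8 * n_rows)
train_idx, test_idx = indices[:cut], indices[cut:]
print("Train rows:", len(train_idx), "Test rows:", len(test_idx))
```

Either way, each row lands in exactly one of the two sets, which is the property that matters for avoiding leakage.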

In [None]:
# Separate out the target
X_train, y_train = train.drop(columns="LoanAmount"), train[["LoanAmount"]]

X_test, y_test = test.drop(columns="LoanAmount"), test[["LoanAmount"]]

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# Transformer for all columns and processing
# Columns that are not listed are dropped (remainder="drop")
column_trans = ColumnTransformer(
    [('numeric', StandardScaler(), ["JobsReported", "DaysApprovedSinceMay1st"]), # numerical data scaled
    ('categorical', OneHotEncoder(handle_unknown="ignore", sparse=False), 
                 ["BusinessType", "RaceEthnicity", "Gender", "Veteran", "NonProfit"])], # categorical data onehot encoded
    remainder="drop")


In [None]:
# Perform Transformation
X_train = column_trans.fit_transform(X_train)

# Transform but not fit on test set
# We do not want the test data to be learned,
# Therefore we should not fit the data
X_test = column_trans.transform(X_test)

In [None]:
# Perform PCA on X_train
# by not specifying n_components we get all resulting components
pca = PCA().fit(X_train)

# Apply learned transformation to training data
X_train_red = pca.transform(X_train)

# Use the same learned transformation (projection) on test data
X_test_red = pca.transform(X_test)
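
The fit-on-train, transform-on-test pattern used for both the column transformer and PCA can be seen in miniature with standard scaling by hand. This is a sketch on synthetic arrays (`X_train_demo` and `X_test_demo` are made up), not a replacement for the scikit-learn objects above.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train_demo = rng.normal(100, 15, size=(800, 2))
X_test_demo = rng.normal(100, 15, size=(200, 2))

# "fit": learn the statistics from the training rows only
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)

# "transform": apply the training statistics to both sets
train_scaled = (X_train_demo - train_mean) / train_std
test_scaled = (X_test_demo - train_mean) / train_std

print("Train means:", train_scaled.mean(axis=0).round(3))
print("Test means: ", test_scaled.mean(axis=0).round(3))
```

The test set does not end up with exactly zero mean, and that is expected: nothing about the test rows was used to choose the transformation.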

### 1. Finding minimum components to achieve rmse<30000

In [None]:
# Loop through number of components and evaluate model
# we need as many components as there are features
components = list(range(1, X_train.shape[1] + 1))

# store resulting values
rmses = []

for n in components:
    
    # Train model on the first n components 
    # Remember components are ordered by how important they are
    lr = LinearRegression().fit(X_train_red[:,:n], y_train)
    
    y_pred = lr.predict(X_test_red[:,:n])
    
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    
    rmses.append(rmse)

In [None]:
plt.plot(components, rmses)
plt.axhline(y=30000, color='r', linestyle='--')
plt.title("RMSE values across principal components used to train linear model");

In [None]:
# Find first below 30,000
component_scores = zip(components, rmses)

threshold_component, threshold_score = None, None

# Assumes RMSE decreases as components are added, so the first hit is the minimum number needed
for component, score in component_scores:
    if score < 30000:
        threshold_component, threshold_score = component, score
        break
        
print("Minimum number of components needed:", threshold_component)
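
For reference, RMSE is simple to compute directly, which can help if the installed scikit-learn version handles the `squared` argument differently (recent releases deprecate it in favour of `root_mean_squared_error`). The sketch below defines a hypothetical `rmse` helper and finds the first score under the threshold on made-up values.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical RMSE for 1, 2, 3, ... components
demo_rmses = [41000.0, 35500.0, 31200.0, 29800.0, 29750.0]
threshold = 30000

# First component count whose RMSE is below the threshold, or None if there is none
first_below = next(
    (n for n, score in enumerate(demo_rmses, start=1) if score < threshold),
    None,
)

print(rmse([3, 5], [0, 1]))  # sqrt((9 + 16) / 2)
print("Minimum number of components needed:", first_below)
```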

### Exploring Gender breakdown

In [None]:
# We can join the predictions onto the original data frame and perform aggregate scoring
test_breakdowns = test.copy() # copy to avoid pandas' SettingWithCopyWarning

# Train on previously found component number
lr = LinearRegression().fit(X_train_red[:,:6], y_train)

# generate predictions for the test set
y_pred = lr.predict(X_test_red[:,:6])

test_breakdowns["prediction"] = y_pred

In [None]:
# You could do this via loops, but vectorized can be more efficient
unbalanced_breakdown = (test_breakdowns
                        .groupby("Gender")
                        .apply(lambda x : pd.Series({"count": len(x), 
                                                     "rmse": mean_squared_error(x["LoanAmount"], 
                                                                                x["prediction"], 
                                                                                squared=False)})))
unbalanced_breakdown

Remember, we want a low RMSE value for good prediction.

This tells us that our model performs worse for Female Owned businesses than for Male Owned. Be sure to look at the count in each breakdown; without it we can easily misinterpret aggregate data.

The result could be a result of multiple effects, including but not limited to:

* Unbalanced training data: more examples of Unanswered and Male Owned records will skew the model weights towards predicting those categories better.
* Missing data effects: a large portion of our data is "Unanswered", so there may be covariances with Gender that make loans for one gender harder to predict than for another.
* Statistical variance: the result may be due to chance. Hypothesis testing could help us determine whether this is likely, and we can further explore the distributions of the errors produced. We have taken just one sample (a single train-test split), so the specific records we picked may happen to be harder to predict for some Gender values.
* Bias: the features given to the model may predict better for one Gender of owner than for another.

Note: this is not an evaluation of the difference in loans received across genders, but rather an evaluation of how our model performs across this split.
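
One way to probe the "statistical variance" explanation is a bootstrap interval for the difference in RMSE between two groups. The sketch below uses synthetic prediction errors (`errors_a` and `errors_b` are made up, with deliberately different group sizes) and hypothetical helper names. It illustrates the idea and is not a result for this data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-record prediction errors for a large and a small group
errors_a = rng.normal(0, 30_000, size=400)
errors_b = rng.normal(0, 30_000, size=60)

def rmse_from_errors(errors):
    """RMSE computed directly from prediction errors."""
    return float(np.sqrt(np.mean(errors ** 2)))

def bootstrap_rmse_difference(a, b, n_boot=2000):
    """95% percentile interval for RMSE(b) - RMSE(a) under resampling."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its original size
        resampled_a = rng.choice(a, size=len(a), replace=True)
        resampled_b = rng.choice(b, size=len(b), replace=True)
        diffs[i] = rmse_from_errors(resampled_b) - rmse_from_errors(resampled_a)
    return np.percentile(diffs, [2.5, 97.5])

lower, upper = bootstrap_rmse_difference(errors_a, errors_b)
print(f"95% interval for RMSE difference: [{lower:.0f}, {upper:.0f}]")
```

If the interval comfortably includes zero, the gap between groups may be explainable by sampling noise alone, and small groups tend to give wide intervals.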

## Extension

This extension is left as an exercise for the reader.

<br>
<br>
<br>
<br>
<br>
<br>

Some techniques you may wish to consider to improve the performance of the model, potentially reducing the number of components required:

#### Feature Engineering

* Look at combinations of features
* Explore polynomial / higher order relationships
* Combine One-hot encodings, reducing the number of categories
* Explore different standardising methods
* Transform data, such as converting to a log scale
* Explore feature importance
* Explore whether an order can be found for some of the categorical data

#### Enhancing Data

* Rebalance the data used to train the model across different features
* Impute "Unanswered" responses
* Remove outliers if they exist from the training set
* Remove low value columns, reducing noise in components
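
As one illustration of the ideas listed above, the sketch below shows how a log transform can tame a right-skewed variable. The `amounts` array is synthetic (drawn from a log-normal distribution as a stand-in for loan amounts) and `skewness` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed amounts, standing in for loan values
amounts = rng.lognormal(mean=10, sigma=1, size=5000)

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    centred = values - values.mean()
    return float(np.mean(centred ** 3) / values.std() ** 3)

# log1p computes log(1 + x), which is safe for zero amounts
log_amounts = np.log1p(amounts)

# expm1 inverts log1p, so predictions can be mapped back to the original scale
restored = np.expm1(log_amounts)

print("Skewness before:", round(skewness(amounts), 2))
print("Skewness after: ", round(skewness(log_amounts), 2))
```

If the target is transformed this way, any RMSE threshold needs to be evaluated after converting predictions back to the original scale.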