### Intro to Statistical Learning: 
6.1 and 6.2

### Ethical Algorithm:
Chapter 2

## Dataset:
Dataset on Kaggle: https://www.kaggle.com/datasets/saravananselvamohan/freddie-mac-singlefamily-loanlevel-dataset

Freddie Mac Single Family Loan-Level Dataset: https://www.freddiemac.com/research/datasets/sf-loanlevel-dataset

Description of data fields: https://www.freddiemac.com/fmac-resources/research/pdf/file_layout.xlsx

The Federal Home Loan Mortgage Corporation, commonly known as Freddie Mac, is a publicly traded, government-sponsored enterprise, headquartered in Tysons Corner, Virginia. The FHLMC was created in 1970 to expand the secondary market for mortgages in the US: https://en.wikipedia.org/wiki/Freddie_Mac

The Federal National Mortgage Association, commonly known as Fannie Mae, is a United States government-sponsored enterprise and, since 1968, a publicly traded company: https://en.wikipedia.org/wiki/Fannie_Mae

The primary difference between Freddie Mac and Fannie Mae is where they source their mortgages. Fannie Mae mostly buys mortgages from larger commercial banks, while Freddie Mac mostly buys them from smaller banks.

Dataset: Home Mortgage Disclosure Act, National Loan Applications for 2020 https://ffiec.cfpb.gov/data-publication/dynamic-national-loan-level-dataset/2020

Data field definitions/values: https://ffiec.cfpb.gov/documentation/2020/lar-data-fields/



### Attributes used in this tutorial
The dataset has 31 attributes. We are going to use the following as an example:

1. Credit Score
2. First Time Homebuyer flag
3. Original Debt to Income Ratio
4. Number of Borrowers
5. Original Interest Rate (this is what we want to predict)

The code below also includes Postal Code and Loan Purpose as predictors.

This is a prediction problem because we want to predict a buyer’s mortgage interest rate (outcome) based on application information (predictors).



# Essence of Data:

## Why might you want to use model selection:
It is easy to overfit a model if you do not use model selection. Model selection helps you create a model that best balances the bias-variance trade-off.

## How to compare different models?
There are many model selection criteria that can be used to compare models. AIC, BIC, Mallows's $C_p$, adjusted $R^2$, and other criteria that penalize more complex models used to be the standard way to compare models. However, with increases in computing power, validation and cross validation have become a more direct and still accessible way of comparing models. We can now estimate test error directly by computing the MSE on a held-out test set, and we can use cross-validated MSE to tune parameters in certain models.
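
As a rough illustration of comparing models by cross-validated MSE, here is a minimal sketch using only NumPy. The function name `kfold_mse` and the synthetic data are made up for this example and are not part of the tutorial's dataset.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Average validation MSE of a least squares fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column and fit on the training folds only
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_val = np.column_stack([np.ones(len(val)), X[val]])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean((y[val] - A_val @ beta) ** 2))
    return float(np.mean(errors))

# Synthetic data (an assumption for illustration): y depends on column 0 only
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

cv_small = kfold_mse(X[:, [0]], y)  # the one useful predictor
cv_wrong = kfold_mse(X[:, [1]], y)  # a useless predictor
cv_full = kfold_mse(X, y)           # all three predictors
print(cv_small, cv_wrong, cv_full)
```

The model with the lowest cross-validated MSE is the one we would prefer; no penalty term is needed because the error is measured on data the model did not see.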


## Forward vs backwards selection
### Forward selection:
1. Begin with a model that contains no variables (called the Null Model).
2. Add the most significant variables one after the other.
3. Variable significance can be calculated using a statistical significance test such as an F-test or t-test. The most significant variable will have the lowest p-value.
4. Continue to add significant variables until a pre-specified stopping rule is reached or until all the variables under consideration are included in the model.
5. The pre-specified stopping rule could be a specific test of model prediction accuracy or a certain error level (such as MSE or MAE).
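
The steps above can be sketched in a few lines of NumPy. This is only one possible version: it ranks candidates by validation MSE rather than by p-value, and it stops when the best candidate improves the validation MSE by less than `tol`. The helper names, the `tol` value, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def val_mse(X_tr, y_tr, X_va, y_va, cols):
    """Validation MSE of least squares using only the columns in `cols`."""
    A_tr = np.column_stack([np.ones(len(y_tr))] + [X_tr[:, c] for c in cols])
    A_va = np.column_stack([np.ones(len(y_va))] + [X_va[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((y_va - A_va @ beta) ** 2))

def forward_select(X_tr, y_tr, X_va, y_va, tol=1e-3):
    selected = []
    remaining = list(range(X_tr.shape[1]))
    best = val_mse(X_tr, y_tr, X_va, y_va, selected)  # null model (intercept only)
    while remaining:
        # Try adding each remaining variable and keep the best one
        scores = {c: val_mse(X_tr, y_tr, X_va, y_va, selected + [c]) for c in remaining}
        c_best = min(scores, key=scores.get)
        if best - scores[c_best] < tol:  # stopping rule: improvement too small
            break
        selected.append(c_best)
        remaining.remove(c_best)
        best = scores[c_best]
    return selected, best

# Synthetic data (made up): only columns 0 and 3 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=400)
X_tr, X_va, y_tr, y_va = X[:300], X[300:], y[:300], y[300:]

selected, mse = forward_select(X_tr, y_tr, X_va, y_va, tol=0.05)
print(selected, mse)
```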


### Backwards selection:
1. Begin with a model that contains all variables under consideration (called the Full Model).
2. Remove the least significant variables one after the other.
3. Variable significance can be calculated using a statistical significance test such as an F-test or t-test. The least significant variable will have the highest p-value.
4. Continue removing variables until a pre-specified stopping rule is reached or until no variable is left in the model.
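
Backward selection can be sketched the same way. In this hypothetical version the p-value test is swapped for BIC (computed up to an additive constant for a Gaussian linear model): at each step we drop the variable whose removal lowers BIC the most, and we stop when no removal lowers it. The function names and synthetic data are assumptions for illustration.

```python
import numpy as np

def bic(X, y, cols):
    """BIC (up to a constant) of a least squares fit on the columns in `cols`."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + A.shape[1] * np.log(n)

def backward_eliminate(X, y):
    selected = list(range(X.shape[1]))  # start from the full model
    best = bic(X, y, selected)
    while selected:
        # BIC of each model obtained by dropping exactly one variable
        scores = {c: bic(X, y, [s for s in selected if s != c]) for c in selected}
        c_drop = min(scores, key=scores.get)
        if scores[c_drop] >= best:  # stopping rule: no removal improves BIC
            break
        selected.remove(c_drop)
        best = scores[c_drop]
    return selected

# Synthetic data (made up): only columns 0 and 3 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=400)

kept = backward_eliminate(X, y)
print(kept)
```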



# Lasso and Ridge Regression:

## Ridge Regression:
Ridge regression is similar to linear regression, but it adds a penalty term, $\lambda \times \text{slope}^2$, to the quantity being minimized. Instead of minimizing the sum of squared residuals (SSR) alone, ridge regression minimizes $SSR + \lambda \times \text{slope}^2$; with several predictors, the penalty is $\lambda$ times the sum of the squared slopes. This penalty introduces a small amount of bias in how the regression line fits the data and, in return, lessens the variance. The severity of the penalty is determined by the non-negative value of $\lambda$. You can use cross validation to select the best value of $\lambda$ for your model. The larger we make $\lambda$, the closer the slope gets to zero, making the prediction less and less sensitive to our predictor variables.

Ridge regression can also be applied with categorical predictors and/or with logistic regression. The same penalty term and general concepts apply with these other forms of ridge regression.

With least squares, you need at least p data points to estimate a model with p parameters. One benefit of ridge regression is that the penalty term still gives a unique solution when the sample size is smaller than the number of parameters, and cross validation can be used to choose $\lambda$ in that setting.
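
To make the last point concrete, here is a minimal NumPy sketch of the closed-form ridge solution, $(X^TX + \lambda I)^{-1}X^Ty$, on made-up data with fewer rows than columns. For simplicity it fits no intercept; the function name `ridge_fit` and the data are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept term, for simplicity)."""
    p = X.shape[1]
    # Adding lam * I makes the matrix invertible for any lam > 0
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 5, 10  # fewer observations than parameters
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# The rank of X is at most n, so X'X (p x p) is singular and
# ordinary least squares has no unique solution here.
rank = np.linalg.matrix_rank(X)

# Ridge still returns a unique answer, and the coefficients shrink as lam grows
norms = {lam: float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in (0.1, 1.0, 10.0, 100.0)}
print(rank, norms)
```

The printed norms show the size of the coefficient vector falling as $\lambda$ increases, which is the shrinkage described above.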

## Lasso Regression:
LASSO stands for Least Absolute Shrinkage and Selection Operator. This regression method can improve both the prediction accuracy and the interpretability of statistical models. It uses shrinkage: the coefficient estimates are shrunk towards zero, which encourages simple and sparse models.

Lasso regression is very similar to ridge regression, but instead of being penalized by $\lambda \times \text{slope}^2$, the model is penalized by $\lambda \times |\text{slope}|$. Like ridge regression, lasso increases bias in the hope of decreasing variance. Ridge and lasso may shrink different parameters by different amounts. Ridge regression can only shrink a slope close to zero, while lasso can shrink it all the way to zero. This means that lasso regression can remove useless predictors from the model.

Ridge regression tends to do a little better when all of the variables are useful, while Lasso does well with models that have some useless variables.
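
The difference between the two penalties is easiest to see with a single predictor and no intercept, where both problems can be solved by hand. Minimizing $SSR + \lambda \times \text{slope}^2$ gives $\text{slope} = S_{xy} / (S_{xx} + \lambda)$, while minimizing $SSR + \lambda \times |\text{slope}|$ gives a slope that is exactly zero once $\lambda / 2 \ge |S_{xy}|$. The sketch below uses made-up, centered data; note that scikit-learn scales its lasso objective differently, so these $\lambda$ values are not comparable to its `alpha`.

```python
def ridge_slope(sxy, sxx, lam):
    """Minimizer of SSR + lam * slope**2 for one predictor, no intercept."""
    return sxy / (sxx + lam)

def lasso_slope(sxy, sxx, lam):
    """Minimizer of SSR + lam * |slope| for one predictor (soft thresholding)."""
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx

# Made-up centered data for illustration
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.1, -0.4, 0.0, 0.5, 1.0]
sxx = sum(v * v for v in x)
sxy = sum(a * b for a, b in zip(x, y))

for lam in (0.0, 4.0, 12.0, 40.0):
    print(lam, ridge_slope(sxy, sxx, lam), lasso_slope(sxy, sxx, lam))
```

With $\lambda = 0$ both slopes equal the least squares slope. As $\lambda$ grows, the ridge slope keeps shrinking but stays positive, while the lasso slope reaches exactly zero.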



# Now, let's run some code:

In [46]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import statsmodels.api as sm


In [47]:
# Import the data
df = pd.read_csv('loan_level_500k.csv')

In [48]:
# Choose the features we need
cols = ['ORIGINAL_INTEREST_RATE', 'CREDIT_SCORE', 'FIRST_TIME_HOMEBUYER_FLAG',
        'ORIGINAL_DEBT_TO_INCOME_RATIO', 'NUMBER_OF_BORROWERS', 'POSTAL_CODE', 'LOAN_PURPOSE']
df_selected = df[cols]
# drop null values
df_selected = df_selected.dropna()

# Convert categorical variables to dummy variables
df_selected = pd.get_dummies(df_selected, drop_first=True)



In [60]:
# Create a training and test set with a 70/30 split.
# Convert the bool dummy columns to float before splitting so that
# X_train and X_test end up with the same dtypes.
X = df_selected.drop('ORIGINAL_INTEREST_RATE', axis=1).astype(float)
y = df_selected['ORIGINAL_INTEREST_RATE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=8)


In [61]:
# Linear Regression with All Predictors
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
mse_linear = mean_squared_error(y_test, y_pred)
print(f'Linear Regression MSE: {mse_linear}')

Linear Regression MSE: 0.33249260231929145


In [62]:
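# Full OLS fit with statsmodels. The p-values in this summary are what a stepwise
# procedure would use to rank variables; this cell does not itself run forward selection.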
X_train_const = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_const)
results = model.fit()
print(results.summary())

                              OLS Regression Results                              
Dep. Variable:     ORIGINAL_INTEREST_RATE   R-squared:                       0.059
Model:                                OLS   Adj. R-squared:                  0.059
Method:                     Least Squares   F-statistic:                     2261.
Date:                    Thu, 03 Oct 2024   Prob (F-statistic):               0.00
Time:                            16:51:01   Log-Likelihood:            -2.2075e+05
No. Observations:                  253292   AIC:                         4.415e+05
Df Residuals:                      253284   BIC:                         4.416e+05
Df Model:                               7                                         
Covariance Type:                nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------

In [63]:
# Ridge Regression:
# Ridge regression adds a penalty term to control overfitting and reduce variance.
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, ridge_pred)
print(f'Ridge Regression MSE: {mse_ridge}')


Ridge Regression MSE: 0.3324925884734913


In [64]:
# Cross-validation to find the best λ value for Ridge Regression
alphas = np.logspace(-4, 4, 100)
ridge_cv = GridSearchCV(Ridge(), param_grid={'alpha': alphas}, scoring='neg_mean_squared_error', cv=5)
ridge_cv.fit(X_train, y_train)
best_alpha_ridge = ridge_cv.best_params_['alpha']
print(f'Best alpha for Ridge Regression: {best_alpha_ridge}')

Best alpha for Ridge Regression: 25.950242113997373


In [65]:
# Refit with the best alpha
ridge_best = Ridge(alpha=best_alpha_ridge)
ridge_best.fit(X_train, y_train)
ridge_best_pred = ridge_best.predict(X_test)
mse_ridge_best = mean_squared_error(y_test, ridge_best_pred)
print(f'Best Ridge Regression MSE: {mse_ridge_best}')


Best Ridge Regression MSE: 0.3324922462843652


In [66]:
# Lasso Regression:
# Lasso regression adds a penalty term to shrink unnecessary coefficients to zero, thus performing feature selection.
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, lasso_pred)
print(f'Lasso Regression MSE: {mse_lasso}')


Lasso Regression MSE: 0.34460788801877


In [67]:
# Cross-validation to find the best λ value for Lasso Regression
lasso_cv = GridSearchCV(Lasso(), param_grid={'alpha': alphas}, scoring='neg_mean_squared_error', cv=5)
lasso_cv.fit(X_train, y_train)
best_alpha_lasso = lasso_cv.best_params_['alpha']
print(f'Best alpha for Lasso Regression: {best_alpha_lasso}')


Best alpha for Lasso Regression: 0.0001


In [68]:
# Refit with the best alpha for Lasso
lasso_best = Lasso(alpha=best_alpha_lasso)
lasso_best.fit(X_train, y_train)
lasso_best_pred = lasso_best.predict(X_test)
mse_lasso_best = mean_squared_error(y_test, lasso_best_pred)
print(f'Best Lasso Regression MSE: {mse_lasso_best}')


Best Lasso Regression MSE: 0.33249533441281265


In [69]:
# Comparing the results of different models
print(f'Linear Regression MSE: {mse_linear}')
print(f'Best Ridge Regression MSE: {mse_ridge_best}')
print(f'Best Lasso Regression MSE: {mse_lasso_best}')


Linear Regression MSE: 0.33249260231929145
Best Ridge Regression MSE: 0.3324922462843652
Best Lasso Regression MSE: 0.33249533441281265


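The test MSEs of our models are nearly identical: they differ by less than 0.00001. Tuned ridge regression has the lowest test MSE, followed by linear regression with all predictors, and then tuned lasso regression. Note that no forward selection model was fit above, so it is not part of this comparison. The OLS summary also reports an $R^2$ of about 0.059, so none of these models explains much of the variation in interest rates.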



# Ethics
## What are protected characteristics?
https://www.youtube.com/watch?v=VXLtKlmtrvM&t=170s 

## Are they really protected:
Maybe? Many people are trying to protect them, but it can be hard to prove discrimination. Humans are biased, and therefore the models and programs they make, and the data used to train those models, are biased too.



### Examples in Data Science
1. In October 2019, researchers found that an algorithm used on more than 200 million people in US hospitals to predict which patients would likely need extra medical care heavily favored white patients over black patients. Race itself wasn’t a variable in this algorithm, but healthcare cost history, which is highly correlated with race, was. The rationale was that cost summarizes how many healthcare needs a particular person has. For various reasons, black patients on average incurred lower healthcare costs than white patients with the same conditions. Thankfully, researchers worked with Optum to reduce the level of bias by 80%. Had they not examined the algorithm in the first place, it would have continued to discriminate severely.

2. Arguably the most notable example of AI bias is the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) algorithm used in US court systems to predict the likelihood that a defendant would become a recidivist. Due to the data that was used, the model that was chosen, and the process of creating the algorithm overall, the model's false positive rate for recidivism was nearly twice as high for black offenders (45%) as for white offenders (23%).

3. Amazon is one of the largest tech companies in the world, so it’s no surprise that it is a heavy user of machine learning and artificial intelligence. In 2015, Amazon realized that its algorithm for screening job applicants was biased against women. The algorithm was trained on resumes submitted over the previous ten years, and since most of those applicants were men, it learned to favor men over women.


### Ways Forward:
1. Try to have models reflect how decisions should be made, not how they have been made historically. We don't want to perpetuate past biases.

2. Check prediction accuracy (false negative rate, false positive rate, etc.) across social groups to catch models that might be discriminatory.

3. Consider ways of enforcing data governance and fairness. 

Source/further reading: https://towardsdatascience.com/real-life-examples-of-discriminating-artificial-intelligence-cae395a90070 
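
As a small illustration of the second point in the list above, the sketch below computes the false positive rate separately for each group. The labels, predictions, and group names are made up for this example; with real data you would use your model's predictions and a protected characteristic recorded for auditing.

```python
from collections import defaultdict

def false_positive_rates(y_true, y_pred, groups):
    """False positive rate per group: FP / (all true negatives in that group)."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:
            negatives[group] += 1
            if pred == 1:
                false_pos[group] += 1
    return {g: false_pos[g] / negatives[g] for g in negatives}

# Made-up labels (1 = positive outcome) and predictions for two groups
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]
groups = ["A"] * 6 + ["B"] * 6

rates = false_positive_rates(y_true, y_pred, groups)
print(rates)
```

A large gap between groups, like the one in this toy data, is a signal that the model deserves a closer look even if its overall accuracy seems fine.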


### Another Example
related article: https://arxiv.org/pdf/2110.14755v1.pdf

The article discusses machine learning models identifying race as a shortcut when trying to glean medical information from a chest x-ray. This contrasts with human readers, who are unable to (or at least not trained to) look for racial identifiers when looking at chest x-rays. This is an example of how algorithms, specifically black box algorithms, could potentially be using protected characteristics in decisions even when these protected characteristics are not directly identified in the data set.

For reference, the dtypes of the predictor columns before the conversion to float:

CREDIT_SCORE                     float64
ORIGINAL_DEBT_TO_INCOME_RATIO    float64
NUMBER_OF_BORROWERS              float64
POSTAL_CODE                      float64
FIRST_TIME_HOMEBUYER_FLAG_Y         bool
LOAN_PURPOSE_N                      bool
LOAN_PURPOSE_P                      bool
dtype: object
