<img src="../../predictioNN_Logo_JPG(72).jpg" width=200>

---

## Introduction to Fair Lending

### Programming for Data Science
### Last Updated: November 11, 2022
---  

### PREREQUISITES
- variables
- data types
- pandas

### SOURCES 
- Fair Lending [introduction](https://www.occ.treas.gov/topics/consumers-and-communities/consumer-protection/fair-lending/index-fair-lending.html)
- statsmodels [module](https://www.statsmodels.org/stable/index.html)

### OBJECTIVES
- Understand the definition of fair lending
- Identify examples of protected classes and disparate treatment
- Explore how to quantitatively check for fair lending
- Illustrate how to fit a logistic regression model with `statsmodels`
- Understand how to check if a predictor is statistically significant using software

### CONCEPTS
- fair lending
- default
- discrimination
- disparate treatment
- protected classes
- statistical significance

### GRADING
This will be graded as a homework exercise worth 10 points.  

---


### I. Executive Summary

This notebook introduces the concept of fair lending. It discusses how machine learning models are used for systematic decisioning of loans, and how models can be used to assess if the process is fair. You will complete the reading, and then work through an exercise that includes a synthetic dataset of lending information. You will fit (train) a logistic regression model to quantitatively assess fairness in the lending process.

---

### II. Introduction to Fair Lending

The ability to borrow in advance of payment, or **credit**, is essential.

As of September 2022, consumer debt in the United States was over $16 trillion (source: bankrate.com).


Commercial banks are one large player in this space, providing loans and lending products such as:

- mortgages
- home equity loans
- student loans
- auto loans
- credit cards

Lenders want to avoid the possibility of nonpayment, or **default**, on part of the loan or the full loan.  
They use historical information and judgement, among other things, to make lending decisions.  
Unfortunately, some lenders introduce bias, knowingly or unknowingly, into their decisions.

Lending practices are not always fair. They may **discriminate** against subpopulations.

Federal bank regulatory agencies such as the OCC have developed approaches to address unfair and deceptive lending practices. 

The Fair Housing Act (FHA) and the Equal Credit Opportunity Act (ECOA) protect consumers by prohibiting unfair and discriminatory practices.

Discrimination in credit and real estate transactions is prohibited based on factors including:

- race or color
- national origin
- religion
- sex
- age (provided the applicant is old enough to enter into a contract)

A **protected class** is a group of individuals protected by anti-discrimination laws. Examples include:

- women
- individuals over the age of 65
- Black, Hispanic, and Native American individuals

**Disparate Treatment**

Illegal disparate treatment occurs when a lender bases its decision on one or more discriminatory factors covered by fair lending laws. 

Example: A bank uses a lending process where females are offered an auto loan with a higher interest rate than males. This practice would be a violation of fair lending laws.


***

**THINK ABOUT AND DISCUSS**

1) What is discrimination?

2) What are possible reasons for discrimination in lending?

---

### III. Model-Based Lending

In modern finance, banks and other lenders typically use machine learning (ML) models to make lending decisions.

The decisions include:

- accept or deny the loan request (underwriting)
- given acceptance of the loan, which interest rate to offer (pricing)

The underwriting decision model is commonly a logistic regression model.  
The pricing decision model is commonly a linear regression model.

The model predictors should capture the ability and willingness of borrowers to repay the loan.  
The predictors should not use protected class information, as doing so can lead to discrimination and unfair lending.


***

**THINK ABOUT AND DISCUSS**

3) Why do you think ML models are used in lending?

4) Which variables might be good predictors? Which variables would promote discrimination?

---

### IV. Discussion of the Underwriting Models

This exercise discusses two models:

- the underwriting model $M_U$ used for making automated underwriting decisions
- the fair lending model $M_{FL}$ used to assess if the underwriting process was systematically fair

**THINK ABOUT AND DISCUSS**

5) How might we quantify fair lending? Hint: think about the output from the model.


---

### V. A Brief Introduction to Statistical Significance

We want to know which predictors in a regression model matter, in the sense that their coefficients are distinguishable from zero. Such predictors are called **statistically significant**.

This is done with a statistical test. We give an overview next. The details are beyond the scope of this course, but if you would like to dig deeper, you can search the web for "p-values of regression coefficients."

Suppose the logistic regression model has this form:

$$Prob(Denied) = \frac{e^{0.3 - 0.006 \times FICO}}{1 + e^{0.3 - 0.006 \times FICO}}$$

where $Prob(Denied)$ is the probability that the loan request will be denied.
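
To make the formula concrete, here is a minimal sketch that evaluates it in Python. The function name `prob_denied` and the example FICO score of 700 are illustrative choices, not part of the exercise data.

```python
import math

def prob_denied(fico, intercept=0.3, slope=-0.006):
    """Evaluate the example logistic model at a given FICO score."""
    z = intercept + slope * fico          # linear predictor
    return math.exp(z) / (1 + math.exp(z))

# hypothetical applicant with a FICO score of 700
p = prob_denied(700)
print(f"Prob(Denied) at FICO 700: {p:.4f}")
```

Calling the function with other scores is an easy way to explore how the model responds to changes in FICO.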

The coefficient of the FICO score is -0.006. This is fairly close to zero, but to understand if it's statistically different from zero (meaning FICO is important), we would need to:

- estimate its **standard error**, which quantifies the variability that arises because 1) the estimate is based on a sample rather than the entire population and 2) the predictors and response are random variables whose values vary
- run a formal statistical test
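
The two steps above can be sketched numerically. For a logistic regression coefficient, a common choice is the Wald z-test: divide the estimate by its standard error and compare the result with a standard normal distribution. The standard error of 0.002 below is a made-up value for illustration only; in practice it comes from the fitted model.

```python
import math

coef = -0.006      # FICO coefficient from the example model
std_err = 0.002    # hypothetical standard error (assumed for illustration)

# z statistic: how many standard errors the estimate is from zero
z = coef / std_err

# two-sided p-value under a standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value means an estimate this far from zero would be unlikely if the true coefficient were zero.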

We used `sklearn` to fit logistic regression models, but unfortunately this package doesn't compute standard errors or run the statistical test. Next, we look at another module called `statsmodels` which will do this for us.

**THINK ABOUT AND DISCUSS**

6) Given the model, if FICO increases, what happens to $Prob(Denied)$?


---

### VI. Introduction to statsmodels 

The Python module `statsmodels` allows users to build various statistical models including regression models.  
Importantly, it computes standard errors and p-values.

The response data will be synthetic (made up) so that we can fit and examine a logistic regression example.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

Load the Country Happiness dataset.

In [2]:
FULLPATH = '../datasets/Country_Happiness.csv'
df = pd.read_csv(FULLPATH)
df.head()

Unnamed: 0,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.92) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


Prepare the data.

In [3]:
# build the design matrix X
X = df[['Explained by: Social support','Explained by: Healthy life expectancy']]

# append column of 1s to design matrix for intercept term
X = sm.add_constant(X) 
print('X first 5 rows:\n', X.values[:5])

# build a synthetic response vector y: 1 for the first 100 rows, 0 for the remaining 56
# (100 + 56 = 156 rows, the size of the dataset)
y = df["response"] = np.concatenate((np.repeat(1,100), np.repeat(0,56)))

print('\n')
print('y first 5 values:\n', y[:5])

X first 5 rows:
 [[1.    1.592 0.874]
 [1.    1.582 0.861]
 [1.    1.59  0.868]
 [1.    1.644 0.914]
 [1.    1.549 0.927]]


y first 5 values:
 [1 1 1 1 1]


Fit the logistic regression model and show the results.

In [4]:
# Logit is a nickname for logistic regression
results = sm.Logit(y, X).fit()

print(results.summary())

Optimization terminated successfully.
         Current function value: 0.293347
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                  156
Model:                          Logit   Df Residuals:                      153
Method:                           MLE   Df Model:                            2
Date:                Fri, 11 Nov 2022   Pseudo R-squ.:                  0.5506
Time:                        06:04:34   Log-Likelihood:                -45.762
converged:                       True   LL-Null:                       -101.84
Covariance Type:            nonrobust   LLR p-value:                 4.419e-25
                                            coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------------
const                                    -9.5796      

**Understanding Significance**

The bottom table summarizes the regression coefficients.  The column $P>|z|$ shows p-values.  
A p-value less than 0.05 (a common threshold) is typically interpreted to mean that the predictor is statistically significant.

In this case, all predictors are significant:
- const (or intercept) 
- Social support
- Healthy life expectancy

You now have the tools to fit a logistic regression model and interpret significance by yourself!

**TRY FOR YOURSELF**

7) From the summary table, the Pseudo R-squared gives an estimate of how well the model fits (0=no fit, 1=perfect fit).  
Do these things:

- note the model fit
- make a small change to fit a different model (possibilities include dropping the intercept or adding another predictor)
- note the model fit for this new model
- do the results make sense?
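
For reference, the Pseudo R-squared that `statsmodels` reports for a logit model is McFadden's version: one minus the ratio of the fitted log-likelihood to the log-likelihood of an intercept-only model. The sketch below reproduces the reported value from the `Log-Likelihood` and `LL-Null` entries in the summary above; the variable names are illustrative.

```python
log_likelihood = -45.762   # "Log-Likelihood" from the summary table
ll_null = -101.84          # "LL-Null" (intercept-only model) from the summary table

pseudo_r2 = 1 - log_likelihood / ll_null
print(round(pseudo_r2, 4))  # 0.5506, matching "Pseudo R-squ." in the summary
```

On a fitted model, the same quantity is available as `results.prsquared`.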


---

### VII. Quantifying Fair Lending using a Logistic Regression Model

Next, you will load in some synthetic mortgage lending data. This is a subset of what a typical file would look like.  

- The demographic variable is `gender`.
- The credit factors are `fico` (FICO score) and `loan_to_value`.
- The loan attributes are `loan_term` and `rate_type`.
- The applicant's `occupation` is also recorded.
- The outcome variable is `denied`, where 1 = denied and 0 = approved.

You will explore the dataset and carry out the steps requested.

In [6]:
syn = pd.read_csv('../datasets/mortgage_lending.csv')
syn.head()

Unnamed: 0,id,fico,loan_to_value,gender,occupation,denied,loan_term,rate_type
0,0,696,62,m,salesman,0,30,fixed
1,1,752,61,m,salesman,0,30,fixed
2,2,726,63,m,salesman,0,30,fixed
3,3,759,60,m,salesman,0,30,fixed
4,4,798,62,m,salesman,0,15,fixed


**Exploratory Data Analysis**

8) **(2 PTS)** Compute the correlation between:  
a. `fico` and `loan_to_value`  
b. `fico` and `denied`  
c. `loan_to_value` and `denied`  

Do these results make sense? Explain your answer.

9) **(2 PTS)** Create the following scatterplots (first variable on x-axis, second on y-axis):  
a. `fico` vs `loan_to_value`  
b. `fico` vs `denied`  
c. `loan_to_value` vs `denied`  

Do these results make sense? Explain your answer.

10) **(1 PT)** Examine the relationship between `gender` and `occupation`.  
Do you notice anything that is interesting?  
Do you notice anything that could be problematic (discriminatory)?

11) Conduct any additional analysis that you would like and enter it in this section. This may include statistics, tables, or graphs.

---

The predictors used in the underwriting model $M_{U}$ are as follows:

- `fico`
- `loan_to_value`
- `loan_term`
- `nurse`

We will call this list $L_U$.
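
`statsmodels` needs numeric predictors, so text columns such as `gender`, `occupation`, and `rate_type` must be converted before fitting. The sketch below shows one way to build 0/1 indicator columns on a tiny made-up frame. It assumes that `nurse` is meant to be an indicator derived from `occupation`; the notebook does not spell this out, and the toy values (`'f'`, `'nurse'`) are hypothetical rather than taken from the dataset.

```python
import pandas as pd

# tiny made-up frame; values are illustrative, not from mortgage_lending.csv
toy = pd.DataFrame({
    "gender": ["m", "f", "f", "m"],
    "occupation": ["salesman", "nurse", "salesman", "nurse"],
})

# 1 if the condition holds, 0 otherwise
toy["nurse"] = (toy["occupation"] == "nurse").astype(int)   # assumed definition of `nurse`
toy["female"] = (toy["gender"] == "f").astype(int)          # hypothetical column name
print(toy)
```

Check the actual category labels in the data (for example with `value_counts()`) before applying the same pattern.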

12) **(1 PT)** From a Fair Lending perspective, which predictors from $L_U$ should be used, and which should not be used? Explain your answer.

13) **(2 PTS)** Train the underwriting model on the synthetic data, using $L_U$ as the predictors. Include an intercept term.  Print the model summary, and discuss which predictors are significant.

14) **(2 PTS)** Next, you will fit a fair lending model $M_{FL}$ to understand if `gender` is a significant predictor.  
The model should include an intercept term, the credit factors, the attributes of the loan, and `gender`. If you found problematic predictors from $L_U$ used in $M_{U}$, those should NOT be included.

If `gender` is a significant predictor in $M_{FL}$, this would indicate that gender is being used to make lending decisions, which is a discriminatory practice.

Print the model summary, discuss if `gender` is significant, and discuss your conclusions.

---

### VIII. Summary

In this exercise, you learned about fair lending and protected classes.

We discussed how ML models are used in lending decisions, including underwriting and pricing.

Sometimes, discriminatory behavior can unintentionally creep into models, as was the case with the `occupation` variable.

You trained an underwriting model and a fair lending model on some synthetic data.

The results of the models provided insight into which predictors were associated with the denial decision.

This case was based on a real-world scenario. It illustrates the application of data science to a critically important task in finance.

---