# Building A Fair Machine Learning Model Using Reduction & Threshold Techniques For Loan Acceptance Prediction

#### Maintaining Accuracy While Providing Fairness in Loan Acceptance For Sensitive Attributes Like Race & Sex

##### By: Aurelio Barrios

## What is Shown In This Notebook

This is the introductory notebook for this project, where I clean the raw data and prepare it for the machine learning and fairlearn models. A follow-up notebook deploys those models.

## What is Shown In This Project

- **Domain Of Project**
    - This project is in the domain of finance and loan approval. It uses data on loan acceptance to build machine learning models for future decisions, and incorporates fairlearn models to make sure the deployed models remain fair across race and sex.
- **Machine Learning Task**
    - Binary classification. Predict whether to accept or decline a loan application.
- **Metrics**
    - The machine learning models will be evaluated using accuracy.
    - The fairlearn models will be evaluated using the following:
        - Demographic parity difference
        - Equalized odds difference
        - Demographic parity ratio

### Imports

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

### Data Handling 

The data used for this project comes from the Home Mortgage Disclosure Act (HMDA) data for California. It contains public loan data for 2017 in the state of California. The attributes of interest record whether a loan was accepted or rejected, which makes it possible to analyze how loan acceptance rates vary with sensitive attributes such as race and sex.

There were a large number of missing values in this dataset, so I removed the columns that were mostly or entirely empty and then dropped any remaining rows with missing values. Because the dataset is fairly large, these removals still left a completely workable dataset.

In [2]:
df = pd.read_csv('data/ca_hmda17.csv')

In [3]:
#shows the number of missing values per column
df.isnull().sum().to_frame('sum')

| column | sum |
| --- | --- |
| as_of_year | 0 |
| respondent_id | 0 |
| agency_name | 0 |
| agency_abbr | 0 |
| agency_code | 0 |
| ... | ... |
| hud_median_family_income | 7461 |
| tract_to_msamd_income | 7461 |
| number_of_owner_occupied_units | 7461 |
| number_of_1_to_4_family_units | 7461 |


In [4]:
#drop columns with more than 1,000,000 missing values; these have most if not all values missing
s_df = df.isnull().sum().to_frame('sum')

drop_cols = list(s_df[s_df['sum'] > 1000000].index)

df = df.drop(drop_cols, axis=1)

In [5]:
#remove rows that have missing values
df = df.dropna()
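
The cutoff above is an absolute count of missing values. As an illustration only, the same two-step cleanup can be sketched on a toy frame with a relative cutoff. The toy columns, their values, and the 50% threshold below are assumptions, not values used in this project.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): one column is almost entirely empty
toy = pd.DataFrame({
    'loan_amount_000s': [100, 250, np.nan, 400, 320],
    'applicant_income_000s': [60, 80, 75, np.nan, 90],
    'edit_status_name': [np.nan, np.nan, np.nan, np.nan, 'Quality edit failure only'],
})

# Fraction of missing values per column (isnull().mean() averages the True/False flags)
missing_share = toy.isnull().mean()

# Hypothetical cutoff: drop columns that are more than half empty
max_missing_share = 0.5
sparse_cols = list(missing_share[missing_share > max_missing_share].index)

# Step 1 drops the sparse columns, step 2 drops rows that still have gaps
cleaned = toy.drop(columns=sparse_cols).dropna()

print(sparse_cols)
print(cleaned.shape)
```

Dropping the sparse columns first matters: calling `dropna()` before removing a mostly empty column would discard nearly every row.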

In [6]:
#drop constant, identifier, and redundant columns, plus any coded column
#that duplicates a matching '<column>_name' column
repeat_cols = ['as_of_year', 'respondent_id', 'agency_name', 'agency_code',
               'state_name', 'state_abbr', 'state_code', 'county_code']
for i in df.columns:
    if i + '_name' in df.columns:
        repeat_cols.append(i)

df = df.drop(repeat_cols, axis=1)
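
The loop above treats a column as redundant when a twin with the `_name` suffix exists. A small sketch on a hypothetical list of column names shows what this rule catches and what it misses; the names below are for illustration and are not read from the dataset.

```python
# Hypothetical column names, for illustration only
columns = ['loan_type_name', 'loan_type',
           'loan_purpose_name', 'loan_purpose',
           'applicant_race_name_1', 'applicant_race_1',
           'loan_amount_000s']

# A coded column is flagged when '<column>_name' is also present
coded_twins = [c for c in columns if c + '_name' in columns]

print(coded_twins)

# 'applicant_race_1' is not flagged: its labelled twin puts '_name' before
# the trailing '_1', so the suffix rule does not match it
print('applicant_race_1' in coded_twins)
```

If the dataset contains columns that follow the second pattern, they need separate handling, which is what the substring-based filter in the race and ethnicity cleanup later in this notebook ends up doing.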

In [7]:
#save the cleaned dataset to a csv so that it is easier to access in the future
# df.to_csv('data/ca_clean.csv', index=False)

In [8]:
#here we can see the number of applicants per race
df['applicant_race_name_1'].value_counts()

White                                                                                903076
Information not provided by applicant in mail, Internet, or telephone application    242517
Asian                                                                                203556
Black or African American                                                             62957
Not applicable                                                                        35239
Native Hawaiian or Other Pacific Islander                                             15258
American Indian or Alaska Native                                                      14193
Name: applicant_race_name_1, dtype: int64

I will be using the [fairlearn package](https://fairlearn.org), which aims to establish fairness in machine learning algorithms and AI systems. Before I can apply the fairlearn models, I must clean the data further.

In [9]:
#CLEANING FOR SEX MODEL

drop_cols = []
for i in df.columns:
    if 'race' in i:
        if i != 'applicant_race_name_1':
            drop_cols.append(i)
    elif 'ethnicity' in i:
        drop_cols.append(i)
        
#drop columns that are sensitive and not used for this model
df = df.drop(drop_cols, axis=1)

In [10]:
df = df.drop(['co_applicant_sex_name'], axis=1)

In [11]:
#create an integer target for fairlearn, which will be needed later
#(1 = loan originated, 0 = application denied; all other outcomes are dropped)
int_target = {'Loan originated': 1, 'Application denied by financial institution': 0}

def target_helper(action):
    if action in int_target:
        return int_target[action]
    return -1

df['target'] = df['action_taken_name'].apply(target_helper)

df = df[df['target'] >= 0].reset_index(drop=True)

df = df.drop(['action_taken_name'], axis=1)

In [12]:
#keep only applicants whose sex is recorded as Male or Female in order to perform the analysis
df = df[df['applicant_sex_name'].isin(['Male', 'Female'])]
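
With the integer target and the sex filter in place, the acceptance rate per group is just the mean of the target within each group. The sketch below shows this on a made-up toy frame, and it uses `Series.map` as one possible alternative to the helper function above; the rows and the extra "withdrawn" outcome are illustrative assumptions.

```python
import pandas as pd

# Toy frame (made up) with the two columns used in this step
toy = pd.DataFrame({
    'applicant_sex_name': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'action_taken_name': [
        'Loan originated',
        'Loan originated',
        'Application denied by financial institution',
        'Application denied by financial institution',
        'Loan originated',
        'Application withdrawn by applicant',
    ],
})

int_target = {'Loan originated': 1, 'Application denied by financial institution': 0}

# Series.map leaves outcomes missing from the dict as NaN, so they can be dropped
toy['target'] = toy['action_taken_name'].map(int_target)
toy = toy.dropna(subset=['target']).astype({'target': int})

# Mean of a 0/1 target per group is that group's acceptance rate
acceptance_rate = toy.groupby('applicant_sex_name')['target'].mean()

print(acceptance_rate)
```

The per-group rates computed this way are the same selection rates that demographic parity compares, here measured on the observed outcomes rather than on model predictions.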

In [13]:
#save the data for future access
# df.to_csv('data/race2_model.csv', index=False)
# df.to_csv('data/sex_model.csv', index=False)