# Home Loan Prediction
The dataset `home_loans_1.csv` contains home loan applications from San Diego County; each row is an individual loan application. This data could be used to build a machine learning model that predicts whether to accept or reject a loan application.

**Your goal in this assignment is to understand the data and how biases can emerge in datasets.**


## Part 1: Data Exploration

Upload the .zip file ('data.zip') included in the homework assignment. I **strongly** recommend using the following code rather than the Colab web interface for uploading files, particularly for those with slower internet connections. 

In [None]:
pip install --upgrade google-api-python-client

In [None]:
conda install -c conda-forge google-colab

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
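import zipfile
import io

# Wrap the uploaded bytes in a file-like object so zipfile can read them
zf = zipfile.ZipFile(io.BytesIO(uploaded['data.zip']), "r")
zf.extractall()  # extract into the current working directory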

The first few exercises will get you used to looking at the data using `pandas`, a widely used Python library for manipulating data.

> *Optional: Why? Datasets can consume a _lot_ of memory, and traditional Python data structures like lists or dictionaries become painfully slow as we add thousands of rows of data. We use `pandas`, a specialized data library whose `DataFrame` data structure is designed to be fast and efficient. Documentation is here: https://pandas.pydata.org/pandas-docs/stable/*



In [75]:
import pandas as pd # import pandas library
df = pd.read_csv('data/home_loans_1.csv', low_memory=False) # read the csv file into a pandas dataframe object



To understand what kind of data was collected, `pandas` has some handy commands:
- `df.head()` will show us the first 5 rows of our dataset. You can also specify the number of rows: `df.head(18)` will show us the first 18 rows.
- `df.sample(10)` will show us 10 randomly sampled rows of our dataset
- `df.shape` will tell us how many rows and how many columns are in the dataset
- `df.columns` will list the names of all columns in the dataset
- `df.describe()` will give you summary statistics about all numerical columns in the dataset
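
One way to get a feel for these commands is to try them on a tiny hand-made dataframe first. The sketch below uses made-up values (the `toy` frame and its numbers are illustrative assumptions, not rows from `home_loans_1.csv`); only the column names are borrowed from the assignment data.

```python
import pandas as pd

# Toy data: made-up values, not taken from home_loans_1.csv
toy = pd.DataFrame({
    "town_name": ["El Cajon", "Del Mar", "Poway", "Del Mar", "Poway", "El Cajon"],
    "loan_amount_000s": [607.3, 4925.1, 810.0, 3510.7, 795.5, 524.4],
    "loan_approved": [0, 0, 1, 1, 1, 1],
})

print(toy.head(3))                    # first 3 rows
print(toy.sample(2, random_state=0))  # 2 random rows; random_state makes the draw repeatable
print(toy.shape)                      # (number of rows, number of columns)
print(list(toy.columns))              # column names
print(toy.describe())                 # summary statistics for the numeric columns

# shape is a plain tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
```

Unpacking `shape` into two names is just a convenience; `toy.shape[0]` and `toy.shape[1]` give the same two numbers.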



### Question 1.A:  How many rows are in this dataset? How many columns?
_Double-click to write your answer here. Show your work in code below if applicable._

In [76]:
# df.shape returns a (rows, columns) tuple.
# Note: a notebook cell only displays the value of its last expression.

df.shape[0]  # number of rows: 60122

df.shape[1]  # number of columns: 12

12

### Question 1.B: One of the columns in the dataset is the outcome value for each application, the value we will try to predict. Which column is that?
_Double-click to write your answer here. Show your work in code below if applicable._

In [77]:
# Display settings; the full option name "display.max_rows" avoids ambiguous pattern matches
pd.set_option("display.max_rows", 15)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.colheader_justify', 'center')
pd.set_option('display.precision', 3)

display(df)

# The column we are trying to predict is loan_approved

Unnamed: 0,town_name,loan_amount_000s,applicant_income_000s,is_hoepa_loan,occupied_by_owner,loan_purpose_name,loan_approved,denial_reason,co_applicant_sex,co_applicant_race,applicant_sex,applicant_race
0,El Cajon,607.322,43.881,1,1,Home purchase,0,Collateral,Male,White,Male,White
1,El Cajon,524.421,44.531,1,1,Home purchase,1,,Male,White,Female,White
2,El Cajon,595.131,57.734,1,1,Home purchase,1,,No co-applicant,No co-applicant,Male,Asian
3,El Cajon,595.332,56.693,1,1,Refinancing,1,,No co-applicant,No co-applicant,"Information not provided by applicant in mail,...","Information not provided by applicant in mail,..."
4,El Cajon,666.252,49.782,1,1,Home improvement,0,Credit history,No co-applicant,No co-applicant,Male,White
...,...,...,...,...,...,...,...,...,...,...,...,...
60117,Del Mar,4925.062,159.012,0,1,Home purchase,0,Credit history,No co-applicant,No co-applicant,Male,Black or African American
60118,Del Mar,3510.737,405.106,1,1,Home purchase,0,Credit history,Male,White,Male,White
60119,Del Mar,4046.828,226.348,1,1,Home purchase,0,Debt-to-income ratio,Female,Asian,Male,White
60120,Del Mar,4984.123,582.114,0,1,Home purchase,1,,"Information not provided by applicant in mail,...","Information not provided by applicant in mail,...",Female,Asian


### Question 1.C: What reasons were given in this dataset for denying a loan application?
Hint: Try looking up the pandas command to list the unique values in a column.

_Double-click to write your answer here. Show your work in code below if applicable._

In [78]:
df.denial_reason.unique()      # denial reasons (not displayed: only a cell's last expression is shown)
df.loan_purpose_name.unique()  # loan purposes (displayed below)

# Reasons given for denying a loan application: Collateral, Credit history, Debt-to-income ratio,
# Credit application incomplete, Mortgage insurance denied, Unverifiable information,
# Insufficient cash (downpayment, closing costs), Other, Employment history.

array(['Home purchase', 'Refinancing', 'Home improvement'], dtype=object)
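
As the hint suggests, listing the unique values of a column is a one-liner. The sketch below runs it on a toy column with made-up values; it assumes that missing entries stand for applications without a recorded denial reason, which is what the empty `denial_reason` cells for approved loans in the table above suggest.

```python
import pandas as pd

# Toy column: made-up values; None stands in for "no denial reason recorded"
toy = pd.DataFrame({
    "denial_reason": ["Collateral", None, "Credit history", "Collateral", None],
})

# unique() lists each distinct value once, including the missing value
all_values = toy["denial_reason"].unique()

# dropna() removes the missing entries first, leaving only actual reasons
reasons = toy["denial_reason"].dropna().unique()

# value_counts() additionally reports how often each reason occurs
counts = toy["denial_reason"].value_counts()

print(reasons)
print(counts)
```

Calling `print` on each result (or putting each call in its own cell) keeps every result visible, since a notebook cell only displays its last expression.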

### Question 1.D: Given the denial reasons and the columns in this dataset, think about what information you _don't_ have about each application. Rank your top 3 _missing_ pieces of information about each application that could help you better predict the application's loan outcome.
_Double-click to write your answer here. Show your work in code below if applicable._
#1.  
#2.  
#3. 

In [79]:
#1. Missed mortgage payments: an inconsistent payment record makes an applicant look less reliable
# and responsible to the lender being asked for the loan.

#2. Number of dependents: though not a deciding factor like debt-to-income ratio, the number of dependents
# gives lenders some idea of an applicant's responsibilities and possible money challenges.

#3. Bank statements: volatile activity in a borrower's recent transactions can be a useful
# red flag for lenders in their decision making.

## Part 2: Understanding Bias in Datasets

### Question 2.A: Does the likelihood of loan approval differ by town in this data?

You may find the `groupby` function useful for answering this question.

_Double-click to write your answer here. Show your work in code below if applicable._

In [101]:
# Approval rate per town: the mean of the 0/1 loan_approved column within each town
df_towns = df.groupby("town_name")["loan_approved"].mean()

# Sort from highest to lowest approval rate
df_towns = df_towns.sort_values(ascending=False)

df_towns

# Yes. Grouping the applications by town and taking the mean of loan_approved gives the
# approval rate for each town. The rates differ, though only by a small margin: from 0.737
# in El Cajon to 0.720 in La Mesa.


town_name
El Cajon         0.737
Escondido        0.735
San Diego        0.734
Del Mar          0.734
Chula Vista      0.732
National City    0.729
Coronado         0.728
Carlsbad         0.726
Solana Beach     0.725
Oceanside        0.724
Poway            0.721
La Mesa          0.720
Name: loan_approved, dtype: float64

### Question 2.B: Does the likelihood of loan approval differ by gender in this data?

_Double-click to write your answer here. Show your work in code below if applicable._
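
A possible starting point, sketched on toy data: group by the column of interest and report both the approval rate and the number of applications in each group, since a rate computed from very few rows is less trustworthy. The helper name `approval_rate_by` and the toy values are illustrative assumptions; only the column names `applicant_sex` and `loan_approved` come from the dataset.

```python
import pandas as pd

def approval_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Approval rate and number of applications for each value of `column`."""
    return (
        frame.groupby(column)["loan_approved"]
        .agg(approval_rate="mean", applications="count")
        .sort_values("approval_rate", ascending=False)
    )

# Toy data: made-up values, not taken from home_loans_1.csv
toy = pd.DataFrame({
    "applicant_sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "loan_approved": [1, 1, 0, 1, 0, 0],
})

rates = approval_rate_by(toy, "applicant_sex")
print(rates)
```

The same helper could be reused for other grouping columns, such as `applicant_race`, by changing the second argument.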

### Question 2.C: Does the likelihood of loan approval differ by race in this data?

_Double-click to write your answer here. Show your work in code below if applicable._

### Question 2.D: Does the likelihood of loan approval differ by age in this data?

_Double-click to write your answer here. Show your work in code below if applicable._

### Question 2.E: Do you have enough information to determine if differential approval rates are an example of bias? Why or why not?

*Double click to write your answer here.*

## Part 3: Helping Others Understand Fairness & Bias

Imagine that you work as a software engineer for a small credit union. Your boss has asked you to build a machine learning system to predict which home loan applications the credit union should approve. 

There are three possible datasets you could use (included in the assignment materials in data.zip: home_loans_1.csv, home_loans_2.csv, and home_loans_3.csv). You need to design a visualization that will convince your boss to use the dataset that you think is the right choice.
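
Before sketching, it may help to put a few headline numbers for each candidate dataset side by side. The following is a minimal sketch on toy stand-ins with made-up values; the `summarize` helper and the four numbers it reports are placeholder assumptions, not a recommendation of which attributes matter. In the assignment, the toy frames would presumably be replaced by the three CSV files loaded with `pd.read_csv`.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> dict:
    """Collect a few headline numbers for one dataset."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "approval_rate": frame["loan_approved"].mean(),
        "missing_cells": int(frame.isna().sum().sum()),
    }

# Toy stand-ins for the candidate datasets (made-up values)
candidates = {
    "toy_1": pd.DataFrame({
        "loan_approved": [1, 0, 1, 1],
        "applicant_sex": ["Male", "Female", None, "Male"],
    }),
    "toy_2": pd.DataFrame({
        "loan_approved": [0, 0, 1],
        "applicant_sex": ["Female", "Female", "Male"],
    }),
}

# One row per dataset, one column per summary number
comparison = pd.DataFrame(
    {name: summarize(frame) for name, frame in candidates.items()}
).T
print(comparison)
```

A table like this is only raw material for the visualization; which numbers deserve a place in it is the judgment the next question asks for.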

### Part 3.A: List the four most important attributes of the datasets that you think should be considered to decide which dataset to use.

_Double-click to write your answer here._
#1.  
#2.  
#3. 
#4.

### Part 3.B: Sketch a visualization that your boss (who is not a software engineer) can understand and that highlights the aspects of the dataset you consider important.


_Attach a pdf with your sketches. Please include any annotations/description on the pdf itself (not in this notebook)._