# Part A: Grab the Data

We'll start by loading the data into a dataframe:

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/credit-risk/train.csv"
df_train = pd.read_csv(url)

We'll take a peek at how the data looks:

In [2]:
df_train.head()

| Unnamed: 0 | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 43200 | RENT | NaN | VENTURE | B | 1200 | 9.91 | 0 | 0.03 | N | 4 |
| 1 | 27 | 98000 | RENT | 3.0 | EDUCATION | C | 11750 | 13.47 | 0 | 0.12 | Y | 6 |
| 2 | 22 | 36996 | RENT | 5.0 | EDUCATION | A | 10000 | 7.51 | 0 | 0.27 | N | 4 |
| 3 | 24 | 26000 | RENT | 2.0 | MEDICAL | C | 1325 | 12.87 | 1 | 0.05 | N | 4 |
| 4 | 29 | 53004 | MORTGAGE | 2.0 | HOMEIMPROVEMENT | A | 15000 | 9.63 | 0 | 0.28 | N | 10 |


# Part B: Explore the Data

Create at least two visualizations and one summary table in which you explore patterns in the data. You might consider some questions like:

- How does loan intent vary with the age, length of employment, or homeownership status of an individual?
- Which segments of prospective borrowers are offered low interest rates? Which segments are offered high interest rates?
- Which segments of prospective borrowers have access to large lines of credit?
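
As a starting point for the summary table, here is a minimal sketch of a grouped aggregation. The `toy` dataframe is made up for illustration and only borrows a few column names from the training data; in your post you would run the same `groupby` on `df_train`, possibly with a different grouping column.

```python
import pandas as pd

# Hypothetical toy rows (NOT real data) that reuse a few column names
# from the training set. Replace `toy` with `df_train` in your own post.
toy = pd.DataFrame({
    "person_home_ownership": ["RENT", "RENT", "MORTGAGE", "MORTGAGE"],
    "loan_int_rate": [10.0, 14.0, 8.0, 9.0],
    "loan_amnt": [1200, 11750, 15000, 9000],
})

# One row per segment: mean interest rate, mean loan amount, group size.
summary = toy.groupby("person_home_ownership").agg(
    mean_int_rate=("loan_int_rate", "mean"),
    mean_loan_amnt=("loan_amnt", "mean"),
    n_borrowers=("loan_amnt", "size"),
)
print(summary)
```

The same pattern works with `loan_intent` or a binned version of `person_age` as the grouping column.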

# Part C: Build a Model

Please use any technique to construct a score function and threshold for predicting whether a prospective borrower is likely to default on a given loan. You may use all the features in the data except loan_grade (and the target variable loan_status), and you may choose any subset of these. There are several valid ways to approach this modeling task:

1. Choose features and estimate entries of a weight vector **w** by hand (this is allowed but not recommended).
2. (Recommended): Choose your features, estimate new ones if needed, and fit a score-based machine learning model to the data. My suggestion is LogisticRegression. Once you have fit a logistic regression model, the weight vector **w** is stored as the attribute ``model.coef_``.


I suggest that you try several combinations of features, possibly including some which you create, and test out which combinations work best with cross-validation.
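
Here is one possible sketch of comparing feature subsets with cross-validation, assuming you use scikit-learn. The data below is synthetic and generated only so that the example runs on its own; the candidate feature sets, the 5-fold setting, and `max_iter=1000` are illustrative choices, not requirements. Note that the real data contains missing values (for example in `person_emp_length`), so you will likely need something like `df_train.dropna()` before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (NOT real data).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "loan_percent_income": rng.uniform(0.0, 0.6, n),
    "person_emp_length": rng.integers(0, 15, n).astype(float),
})
# Synthetic target: default is more likely when loan_percent_income is high.
noise = rng.normal(0.0, 0.1, n)
toy["loan_status"] = (toy["loan_percent_income"] + noise > 0.3).astype(int)

# Hypothetical candidate feature sets to compare.
candidate_feature_sets = {
    "percent_income_only": ["loan_percent_income"],
    "percent_income_and_emp_length": ["loan_percent_income", "person_emp_length"],
}

y = toy["loan_status"]
cv_means = {}
for name, cols in candidate_feature_sets.items():
    model = LogisticRegression(max_iter=1000)
    # Mean accuracy over 5 cross-validation folds.
    cv_means[name] = cross_val_score(model, toy[cols], y, cv=5).mean()

# Refit the best-scoring feature set on all rows and read off w.
best_name = max(cv_means, key=cv_means.get)
best_cols = candidate_feature_sets[best_name]
model = LogisticRegression(max_iter=1000).fit(toy[best_cols], y)
w = model.coef_[0]  # coef_ has shape (1, n_features) for binary targets
print(best_name, w)
```

Accuracy is only one way to rank feature sets; since the final goal is profit, you may prefer to compare candidates on the profit they achieve instead.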

# Part D: Find a Threshold

Once you have a weight vector **w**, it is time to choose a threshold *t*. To choose a threshold that maximizes profit for the bank, we need to make some assumptions about how the bank makes and loses money on loans. Let’s use the following (simplified) modeling assumptions:

- If the loan is repaid in full, the profit for the bank is equal to ``loan_amnt*(1 + 0.25*loan_int_rate)**10 - loan_amnt``. This formula assumes that the profit earned by the bank on a 10-year loan is equal to 25% of the interest rate each year, with the other 75% of the interest going to things like salaries for the people who manage the bank. It is extremely simplistic and does not account for inflation, amortization over time, opportunity costs, etc.
- If the borrower defaults on the loan, the “profit” for the bank is equal to ``loan_amnt*(1 + 0.25*loan_int_rate)**3 - 1.7*loan_amnt``. This formula corresponds to the same profit-earning mechanism as above, but assumes that the borrower defaults three years into the loan and that the bank loses 70% of the principal.

These modeling assumptions are extremely simplistic! You may deviate from them if you have relevant prior knowledge to inform your approach.
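
The two formulas can be written as small vectorized functions. This is a sketch under one assumption that is not stated above: `loan_int_rate` appears in the data as a percentage (for example `9.91`), so the code divides it by 100 before plugging it into the formulas. If you interpret the column differently, remove that conversion. The function names are my own.

```python
import numpy as np

def profit_if_repaid(loan_amnt, loan_int_rate):
    """Bank profit on a fully repaid 10-year loan."""
    amnt = np.asarray(loan_amnt, dtype=float)
    # ASSUMPTION: loan_int_rate is in percent, so convert to a fraction.
    rate = np.asarray(loan_int_rate, dtype=float) / 100.0
    return amnt * (1 + 0.25 * rate) ** 10 - amnt

def profit_if_default(loan_amnt, loan_int_rate):
    """Bank 'profit' when the borrower defaults three years in."""
    amnt = np.asarray(loan_amnt, dtype=float)
    # ASSUMPTION: same percent-to-fraction conversion as above.
    rate = np.asarray(loan_int_rate, dtype=float) / 100.0
    return amnt * (1 + 0.25 * rate) ** 3 - 1.7 * amnt

# A zero-interest loan earns nothing if repaid and loses 70% of the
# principal on default, which matches the description of the formulas.
print(profit_if_repaid(1000, 0.0), profit_if_default(1000, 0.0))
```

Both functions accept scalars or whole columns, so `profit_if_repaid(df_train["loan_amnt"], df_train["loan_int_rate"])` would return one value per borrower.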

Based on your assumptions, determine the threshold *t* which optimizes profit for the bank on the training set. Explain your approach, including labeled visualizations where appropriate, and include a final estimate of the bank’s expected profit per borrower on the training set.
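
One way to search for *t* is a simple sweep over a grid of candidate thresholds. The sketch below relies on conventions that are my assumptions rather than part of the assignment: a higher score means a higher predicted risk of default, a loan is granted when the score is strictly below *t*, and a denied borrower contributes zero profit. The function name, its arguments, and the toy numbers are hypothetical.

```python
import numpy as np

def sweep_thresholds(scores, defaulted, gain_repaid, gain_default, thresholds):
    """Return (best_t, best_mean_profit, mean_profit_at_each_threshold)."""
    scores = np.asarray(scores, dtype=float)
    defaulted = np.asarray(defaulted, dtype=bool)
    # Profit the bank would actually realize on each loan if it were granted.
    realized = np.where(defaulted,
                        np.asarray(gain_default, dtype=float),
                        np.asarray(gain_repaid, dtype=float))
    thresholds = np.asarray(thresholds, dtype=float)
    # ASSUMPTION: lend when score < t; denied borrowers contribute 0.
    profits = np.array([np.where(scores < t, realized, 0.0).mean()
                        for t in thresholds])
    best = profits.argmax()  # first maximizer if several thresholds tie
    return thresholds[best], profits[best], profits

# Toy example (made-up numbers): two safe borrowers, two risky ones.
scores = [0.1, 0.2, 0.8, 0.9]
defaulted = [0, 0, 1, 1]
gain_repaid = [100.0, 100.0, 100.0, 100.0]
gain_default = [-500.0, -500.0, -500.0, -500.0]
grid = np.linspace(0.0, 1.0, 101)

best_t, best_profit, profits = sweep_thresholds(
    scores, defaulted, gain_repaid, gain_default, grid)
print(best_t, best_profit)
```

Plotting `profits` against `grid` gives the labeled visualization of profit per borrower as a function of the threshold, and `best_profit` is the per-borrower estimate to report.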

# Part E: Evaluate Your Model from the Bank’s Perspective

Only after you have finalized your weight vector **w** and threshold *t*, evaluate your automated decision process on the test set:

In [3]:
url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/credit-risk/test.csv"
df_test = pd.read_csv(url)

What is the expected profit per borrower on the test set? Is it similar to your profit on the training set?

# Part F: Evaluate Your Model from the Borrower’s Perspective

Now evaluate your model from the (aggregate) perspective of the prospective borrowers. Please quantitatively address the following questions, using the predictions of your model on the test data:

1. Is it more difficult for people in certain age groups to access credit under your proposed system?
2. Is it more difficult for people to get loans in order to pay for medical expenses? How does this compare with the actual rate of default in that group? What about people seeking loans for business ventures or education?
3. How does a person’s income level impact the ease with which they can access credit under your decision system?
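
The questions above mostly reduce to comparing approval rates across groups. Here is a minimal sketch, assuming you have stored your model's decisions in a boolean column; the column name `approved`, the age bins, and every value in `toy` are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows (NOT real data). `approved` stands in for your
# model's decision on each test-set borrower.
toy = pd.DataFrame({
    "person_age": [22, 24, 27, 35, 41, 58],
    "loan_intent": ["MEDICAL", "EDUCATION", "MEDICAL",
                    "VENTURE", "EDUCATION", "MEDICAL"],
    "loan_status": [1, 0, 0, 0, 0, 1],
    "approved": [False, False, True, True, True, False],
})

# Approval rate by age group (bin edges are an arbitrary choice).
toy["age_group"] = pd.cut(toy["person_age"], bins=[0, 25, 40, 200],
                          labels=["25 and under", "26-40", "41 and over"])
by_age = toy.groupby("age_group", observed=True)["approved"].mean()

# Approval rate next to the actual default rate, by loan intent.
by_intent = toy.groupby("loan_intent").agg(
    approval_rate=("approved", "mean"),
    default_rate=("loan_status", "mean"),
)
print(by_age)
print(by_intent)
```

For income, `pd.qcut` on `person_income` gives equal-sized groups (for example quintiles) that can be passed to the same `groupby`.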

# Part G: Write and Reflect

Write a brief introductory paragraph for your blog post describing the overall purpose, methodology, and findings of your study. Then, write a concluding discussion describing what you found and what you learned from this blog post.

Please include one paragraph discussing the following questions:

Considering that people seeking loans for medical expenses have high rates of default, is it fair that it is more difficult for them to obtain access to credit?
You are free to define “fairness” in a way that makes sense to you, but **please write down your definition** as part of your discussion.