# Introduction

Thank you for agreeing to take part in this evaluation. During this evaluation you will be asked to carry out a series of tasks related to the dataset shown in the next section. Please carry out these tasks with the same care and rigor you would apply if they were part of your job duties.

Before beginning, please ensure that you are using the prompter kernel by looking at the top right of this notebook. It should say "prompter" rather than "Python 3". If it says you are using a Python kernel, please click where it says Python and select prompter from the drop-down menu. If you encounter any errors, or if it says "No Kernel!", please contact PLACEHOLDER@uchicago.edu


## Task Description

This task is structured into four parts. 

1. **Dataset introduction**
2. **Data cleaning**
3. **Model Training**
4. **Model Selection**

In each of these parts, some code will be pre-written. You are not required to use this code, and may replace it if you like. You may refer to any documentation source you like during this task (such as the pandas or scikit-learn API documentation, or StackOverflow).

We ask that you use pandas and scikit-learn to perform the tasks. We have also installed numpy and matplotlib, should those be helpful. You will not be able to install any other non-standard libraries. 

## 1. Dataset Introduction

We will be asking you to use the loan_data dataset during this experiment. This data was collected in a major metropolitan city in the United States. It contains information about loan applications, aggregated from several different loan providers.

In [51]:
import pandas as pd

loans = pd.read_csv("test_data.csv")
loans.head()

Unnamed: 0.1,Unnamed: 0,race,gender,date,fed,zip,income,type,interest,term,principal,approved
0,0,black,female,2017-10-01,1.15,60637,21155,personal,10.195762,12,44622,True
1,1,black,male,2018-04-01,1.69,60623,58431,home,,180,519375,False
2,2,black,female,2013-11-01,0.08,60637,28489,home,8.780219,360,265714,True
3,3,black,male,2011-10-01,0.07,60626,122782,home,1.91567,360,917704,True
4,4,white,female,2014-07-01,0.09,60614,324243,home,3.680747,300,1728441,True


Let's look at some of the columns in the dataframe. The column ``approved`` indicates whether or not the loan was approved.

In [5]:
sum(loans["approved"])/len(loans["approved"]) # calculating the fraction of approved loans

0.511

The column ``principal`` is the amount of money the loan was for, that is, how much money the applicant would receive if the loan was approved.

In [8]:
# 25th and 75th percentile
print(loans["principal"].quantile(0.25), loans["principal"].quantile(0.75))

14943.0 353479.25


``interest`` is the annual percent interest on the loan. ``term`` is how long in months the loan was supposed to run. 

``income`` is the annual income of the loan applicants.

The ``type`` column denotes the purpose of the loan. 

In [9]:
loans["type"].unique()

array(['personal', 'home', 'auto'], dtype=object)

There are three possible values ``type`` can have: ``personal``, ``home`` and ``auto``. 

* ``auto`` These loans were for automobile purchases. With these loans, the lender may repossess the car if the person who took out the loan is unable to make payments.

* ``home`` These loans were for the purchase of residential real estate. Under these loans, the lender may reposssess the home through foreclosure if the person who took out the loan is unable to make payments.

* ``personal`` These loans are for personal expenses or investments other than a home or automobile. Under these loans, there is not generally any specific piece of property that a lender may repossess.

In [15]:
# percent of each type of loan

pct_df = pd.DataFrame({"type" : loans["type"], "count" : [1 for _ in loans["type"]]})
pct_df.groupby(["type"]).sum()/len(loans["type"])

type,count
auto,0.348
home,0.37
personal,0.282
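
The same proportions can also be obtained with pandas' built-in ``value_counts``. The following is a minimal sketch on a small made-up frame; the ``toy`` dataframe and its values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the loans dataframe (values are made up).
toy = pd.DataFrame({"type": ["home", "auto", "home", "personal"]})

# normalize=True returns the fraction of rows per category instead of raw counts.
type_share = toy["type"].value_counts(normalize=True).sort_index()
print(type_share)
```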


## 2. Data Cleaning

Before we can use this data to build a model, we'll need to clean it up a bit. 

The ``date`` column is the date the loan was applied for. However, because of the way pandas reads CSV files, this column is interpreted as a string. In order to do things like test whether a loan was applied for before or after a certain date, we need to convert it to a datetime format.

In [16]:
loans["date"] = pd.to_datetime(loans["date"])

Now, if we want to see how many applications there were from 2015 onward, we can do something like this:

In [20]:
from datetime import datetime

sum(loans["date"] >= datetime(month=1, day=1, year=2015))

627
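
A comparison against a ``pd.Timestamp``, or the ``.dt`` accessor, gives the same kind of count without importing ``datetime``. This is a sketch on made-up dates; the ``toy`` frame is an assumption and does not come from the dataset.

```python
import pandas as pd

# Hypothetical dates, stored as strings the way read_csv would load them.
toy = pd.DataFrame({"date": ["2013-11-01", "2015-01-01", "2017-10-01"]})
toy["date"] = pd.to_datetime(toy["date"])

# Boolean mask: True for applications dated 2015-01-01 or later.
from_2015 = toy["date"] >= pd.Timestamp("2015-01-01")
n_from_2015 = int(from_2015.sum())

# Equivalent check using only the year component of each date.
n_by_year = int((toy["date"].dt.year >= 2015).sum())
print(n_from_2015, n_by_year)
```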

Another thing we need to do is deal with data that is categorical (not numeric, and without any sort of ordering). Since many scikit-learn models can only handle numeric data, it is necessary to represent these columns numerically.

One example of this is the ``type`` column. To turn this into numeric data, we'll apply a dummy encoding. This creates two columns: one which indicates whether the loan was a home loan and one which indicates whether the loan was a personal loan. If an entry is neither, we can then assume that it was an auto loan.

There are other ways of representing categorical data, such as one-hot encoding or simple numeric labels. If you are familiar with those techniques, and believe them to be applicable in this case, you may choose to apply them instead.
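
For reference, the two alternatives mentioned here might look like the following sketch. It uses a made-up ``toy`` series (an assumption for illustration) rather than the actual ``type`` column.

```python
import pandas as pd

# Hypothetical categorical column (values are made up).
toy = pd.Series(["personal", "home", "auto", "home"], name="type")

# One-hot: one indicator column per category (no category is dropped).
one_hot = pd.get_dummies(toy, prefix="type")

# Numeric labels: each category mapped to an integer code.
# Codes follow the sorted category order: auto=0, home=1, personal=2.
labels = toy.astype("category").cat.codes

print(one_hot.columns.tolist())
print(labels.tolist())
```

Note that numeric labels impose an order on the categories (here auto < home < personal), which some models, such as logistic regression, will treat as meaningful.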

**There are other categorical columns in the data. You should take this opportunity to encode those too.**

In [52]:
coded_cols = pd.get_dummies(loans["type"], prefix="type", drop_first=True)
encoded_loans = pd.concat([loans, coded_cols], axis=1)
loans = encoded_loans.drop(["type"], axis=1)
loans.head()

Unnamed: 0.1,Unnamed: 0,race,gender,date,fed,zip,income,interest,term,principal,approved,type_home,type_personal
0,0,black,female,2017-10-01,1.15,60637,21155,10.195762,12,44622,True,0,1
1,1,black,male,2018-04-01,1.69,60623,58431,,180,519375,False,1,0
2,2,black,female,2013-11-01,0.08,60637,28489,8.780219,360,265714,True,1,0
3,3,black,male,2011-10-01,0.07,60626,122782,1.91567,360,917704,True,1,0
4,4,white,female,2014-07-01,0.09,60614,324243,3.680747,300,1728441,True,1,0
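
Several columns can be dummy-encoded in one call by passing ``columns=`` to ``pd.get_dummies``. The sketch below runs on a made-up ``toy`` frame (an assumption, not the real data). Which columns to encode, and whether to use them as model inputs, is left to you.

```python
import pandas as pd

# Hypothetical frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "race": ["black", "white", "black"],
    "gender": ["female", "male", "male"],
    "income": [21155, 58431, 28489],
})

# columns= encodes the listed columns and removes the originals;
# drop_first=True omits one level per column, as in the cell above.
encoded = pd.get_dummies(toy, columns=["race", "gender"], drop_first=True)
print(encoded.columns.tolist())
```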


Several columns contain null data. This means that the information was not recorded by the loan officer at the time of the application. This may be a problem as many machine learning models cannot handle undefined input or output values.

There are several methods of handling null data. One option is to drop all rows where the entry is not defined. 

In [42]:
all_dropped = loans.dropna()
len(all_dropped)

877

Another option is to drop just the columns where there is a null entry. 

In [44]:
cols_dropped = loans.dropna(axis="columns")
cols_dropped.columns

Index(['Unnamed: 0', 'race', 'gender', 'date', 'fed', 'zip', 'income', 'term',
       'principal', 'approved', 'type__home', 'type__personal'],
      dtype='object')

You can also try to impute a value for the missing entries in a column. Here, for example, we fill the missing entries in the ``interest`` column with 0.

In [46]:
filled_interest = loans["interest"].fillna(0.0)
filled_interest.head()

0    10.195762
1     0.000000
2     8.780219
3     1.915670
4     3.680747
Name: interest, dtype: float64

**You should take a moment now to examine the missing data and pick a method for handling it** 
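
One way to examine the missing data is to count the null entries per column before choosing a strategy. The sketch below does this on a made-up ``toy`` frame (an assumption for illustration) and shows median imputation as one more option alongside the methods above. It is not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing entries (values are made up).
toy = pd.DataFrame({
    "interest": [10.2, np.nan, 8.8, 1.9],
    "income": [21155, 58431, np.nan, 122782],
})

# Number of null entries in each column.
null_counts = toy.isna().sum()

# Fill missing interest values with the column median (NaNs are skipped
# when the median is computed).
median_filled = toy["interest"].fillna(toy["interest"].median())

print(null_counts)
print(median_filled.tolist())
```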

## 3. Model Training

Before beginning this section please execute this block of code. This tells the plugin that you have reached the stage where you are starting to build models. 

In [47]:
# %prompter_plugin model_training%

Executing this block does not prevent you from going back and revisiting previous data analysis decisions. 

Now we ask that you train a classifier which predicts whether a loan will be approved or not. The purpose of this classifier is to be used by loan officers or local lenders to recommend to applicants specific loans they might be eligible for. The lenders hope that this tool will help match financial products to borrowers more efficiently. 

Please note that this classifier *will not* be used to make decisions about whether to grant a loan or not. 

There are a couple of different things you could do here. A small example using ``LogisticRegression`` is included below. Other classifiers include the ``RandomForestClassifier`` and the ``KNeighborsClassifier``.

In [64]:
from sklearn.linear_model import LogisticRegression

X = loans[["type_home", "type_personal", "income", "principal"]]

y = loans["approved"]

lr = LogisticRegression()
lr.fit(X,y)

lr.score(X, y)  # mean accuracy, measured on the same rows used for fitting

0.518
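
The score above is computed on the same rows the model was fitted on, so it can overstate how well the model handles unseen applications. A common alternative is to hold out part of the data for evaluation. The sketch below illustrates this on synthetic data; ``X_toy`` and ``y_toy`` are made-up stand-ins, not the loans dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (made-up data, not the loans dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = X_toy[:, 0] + X_toy[:, 1] > 0

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Mean accuracy on the held-out rows.
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```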

## 4. Model Selection

There are many different ways that models can vary. You could use different model types, different input variables, or different model parameters. In this section, please take a moment to examine how the model's performance changes when you change one or more of these.

At the end of this section you should have a model you believe to be the best for the purpose described in this scenario.
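
One possible way to compare candidates is cross-validation, which averages accuracy over several train/test splits. The sketch below uses synthetic data and arbitrary parameter values; ``X_toy``, ``y_toy``, the ``candidates`` dictionary, and the chosen settings are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X and y (made-up data, not the loans dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Hypothetical set of candidate models with arbitrary parameters.
candidates = {
    "logistic": LogisticRegression(),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
cv_scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in cv_scores.items():
    print(name, round(score, 3))
```

Accuracy is only one criterion. Given the purpose described in this scenario, you may want to weigh other properties of the model as well.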

# End

When you feel you have completed this exercise, please execute this cell to submit your model. You may (but are not required to) submit multiple times. 

In [4]:
# %prompter_plugin submit%