# Introduction

Thank you for agreeing to take part in this evaluation. During this evaluation you will be asked to carry out a series of tasks related to the dataset shown in the next section. Please carry out these tasks with the same care and rigor as if these tasks were part of your job duties.
    
Before beginning, please ensure that you are using the prompter kernel by looking at the top right of this notebook. It should say "prompter" rather than "Python 3". If it says you are using a Python kernel, please click where it says Python and select prompter from the drop-down menu. If you encounter any errors, or if it says "No Kernel!", please contact [PLACEHOLDER@uchicago.edu](PLACEHOLDER@uchicago.edu) so we can fix the issue.


## Task Description

This task is structured into four parts.

1. **Dataset introduction**
2. **Data cleaning**
3. **Model Training**
4. **Model Selection**

In each of these, there will be some code pre-written. This code is meant to help you complete the task by providing structure, but you are not required to use the provided code if you do not want to. You may refer to any documentation source you like during this task (such as StackOverflow, or Pandas API documentation or tutorials).

We ask that you use pandas and scikit-learn to perform the tasks. We have also installed numpy and matplotlib, should those be helpful. You will not be able to install any other non-standard libraries.

## 1. Dataset Introduction



We will be asking you to use the provided "loan_data.csv" dataset during this experiment. This data was collected in a major metropolitan city in the United States. It contains information about applications for loans received by several different loan providers.

Let's start trying to understand the dataset by writing some python code. Feel free to follow along by running the following sample commands in this notebook.

Below are a few lines of Python code that load the provided "loan_data.csv" dataset into a pandas dataframe.

In [1]:
import pandas as pd

loans = pd.read_csv("loan_data.csv")
loans.head()

Unnamed: 0,race,gender,date,zip,income,type,term,interest,principal,approved,adj_bls_2,id
0,hispanic/latino,male,2016-01-01,60623.0,72230.0,home,180,3.389672,508761,False,0.34,AP20161-0-2
1,other,male,2013-03-01,60625.0,18543.0,home,360,0.277318,119738,False,0.14,AP20133-1-23
2,other,male,2011-11-01,60623.0,30228.0,home,240,4.398939,265779,False,0.08,AP201111-2-22
3,hispanic/latino,male,2014-08-01,60623.0,11129.0,personal,60,5.221935,15590,True,0.09,AP20148-1-29
4,black,female,2016-11-01,60637.0,,personal,60,10.843707,56301,True,0.41,AP201611-0-38


We can get a list of the columns in this dataframe with the following command:

In [2]:
loans.columns

Index(['race', 'gender', 'date', 'zip', 'income', 'type', 'term', 'interest',
       'principal', 'approved', 'adj_bls_2', 'id'],
      dtype='object')

Let's look at some of the columns in the dataframe. The column ``approved`` indicates whether or not the loan was approved.

In [3]:
# since python treats True as a 1, and False as a 0, the sum
# of this array is the number of entries in loans where approved == True

sum(loans["approved"])/len(loans["approved"]) # calculating the fraction of approved loans

0.43152257077276207
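As an aside, the same fraction can be computed with the Series ``mean`` method, since the mean of a True/False column is the fraction of ``True`` entries. The sketch below uses a small made-up dataframe (``toy`` is hypothetical and is not the study data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for the "approved" column
toy = pd.DataFrame({"approved": [True, False, False, True, False]})

# sum/len, as in the cell above
frac_sum = sum(toy["approved"]) / len(toy["approved"])

# .mean() treats True as 1 and False as 0, so it gives the same fraction
frac_mean = toy["approved"].mean()

print(frac_sum, frac_mean)
```

Either form works; ``mean`` is simply a little shorter to type.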

The column ``principal`` is the amount of money the loan was for, that is how much money the applicant received if the loan was approved.

In [4]:
# 25th and 75th percentile
print(loans["principal"].quantile(0.25), loans["principal"].quantile(0.75))

16150.0 375306.0


That's a fairly wide variation in loan amounts. One possible reason for this is that the loans were requested for different purposes. The ``type`` column denotes the purpose of the loan.

In [4]:
loans["type"].unique()

array(['home', 'personal', 'auto'], dtype=object)

There are three possible values ``type`` can have: ``personal``, ``home`` and ``auto``. 

* ``auto`` These loans were for automobile purchases. With these loans, the lender may repossess the car if the person who took out the loan is unable to make payments.

* ``home`` These loans were for the purchase of residential real estate. Under these loans, the lender may repossess the home through foreclosure if the person who took out the loan is unable to make payments.

* ``personal`` These loans are for personal expenses or investments other than a home or automobile. Under these loans, there is not generally any specific piece of property that a lender may repossess.
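To check whether loan type might explain the wide spread in ``principal``, one option is to compute the quartiles separately for each type. The following is only a sketch on made-up numbers (the ``toy`` dataframe and its values are hypothetical); on the real data you would use ``loans`` in its place.

```python
import pandas as pd

# Hypothetical toy data, not taken from loan_data.csv
toy = pd.DataFrame({
    "type": ["home", "home", "home", "auto", "auto", "personal", "personal"],
    "principal": [300000, 450000, 250000, 20000, 30000, 5000, 15000],
})

# 25th percentile, median and 75th percentile of principal within each loan type.
# unstack() turns the quantile level into columns, giving one row per type.
by_type = toy.groupby("type")["principal"].quantile([0.25, 0.5, 0.75]).unstack()

print(by_type)
```

If the quartiles differ a lot between types, that would be consistent with loan purpose driving much of the variation.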

In [5]:
# here we use the same trick we used to calculate the approval rate
# to calculate the fraction of each type of loan.

personal_pct = sum(loans["type"] == "personal")/len(loans["type"])

# Since the 'type' column contains strings, the statement
# loans["type"] == "personal" produces a Series of True/False values
# indicating for each entry in the 'type' column whether it equals "personal"
# or not

home_pct = sum(loans["type"] == "home")/len(loans["type"])
auto_pct = sum(loans["type"] == "auto")/len(loans["type"])

print("Personal loans: {0}\nHome loans: {1}\nAuto loans: {2}\n".format(personal_pct, home_pct, auto_pct))

Personal loans: 0.25172149961744456
Home loans: 0.3802601377199694
Auto loans: 0.3680183626625861
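A more compact way to get all of these fractions at once is ``value_counts`` with ``normalize=True``. This is a sketch on a made-up column (``toy`` is hypothetical), shown only to illustrate the call.

```python
import pandas as pd

# Hypothetical toy data standing in for the "type" column
toy = pd.DataFrame({"type": ["home", "auto", "personal", "home"]})

# normalize=True returns fractions of the total instead of raw counts
shares = toy["type"].value_counts(normalize=True)

print(shares)
```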



``interest`` is the annual percent interest on the loan. ``term`` is how long in months the loan was supposed to run. 

``income`` is the annual income of the loan applicants.

# Your Turn

At the end of the next sections, there will be space like this where we will ask you to complete certain tasks. 

There's nothing that you need to do right now. However, if you want to examine any of the columns in more depth, this is a good spot to do so.

In [6]:
# This is where you will write python code. Execute these cells using the
# play button or the keyboard shortcut to execute cells.

## 2. Data Cleaning

Before we can use this data to build a model, we'll need to clean it up a bit. Suppose we wanted to look at loans issued between certain dates. How would we do that?

In [8]:
loans["date"].dtypes # this tells us the type of entries in the date column

dtype('O')

``dtype('O')`` means that pandas thinks that the dates are generic objects (strings, in this case), as opposed to dates.

In [10]:
type(loans["date"][0])

str

Because of the way pandas works with CSV files, this is interpreted as a string. In order to do things like test whether a loan was issued before or after a certain date, we need to convert it to a different type.

In [13]:
loans["date"] = pd.to_datetime(loans["date"])
type(loans["date"][0])

pandas._libs.tslibs.timestamps.Timestamp
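As an alternative, pandas can be asked to parse date columns while reading the file, using the ``parse_dates`` argument of ``read_csv``. The sketch below reads from an in-memory string (``csv_text`` is a made-up stand-in for a file) so that it runs without "loan_data.csv".

```python
import io
import pandas as pd

# A tiny made-up CSV held in memory instead of a file on disk
csv_text = "date,principal\n2016-01-01,508761\n2013-03-01,119738\n"

# parse_dates lists the columns that should be converted to datetimes on load
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(toy["date"].dtype)
```

Both approaches should leave you with a datetime column; which one to use is a matter of preference.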

Now we can easily find out how many applications there were from 2015 on using the same trick we used to get the proportion of different types of loans in the previous section.

In [14]:
from datetime import datetime

sum(loans["date"] >= datetime(month=1, day=1, year=2015))

838
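The question that opened this section was about loans issued between certain dates. One way to do that, sketched here on made-up rows (``toy``, ``start`` and ``end`` are hypothetical names and values), is to combine two comparisons into a single True/False mask and use it to select rows.

```python
import pandas as pd

# Hypothetical toy data; the dates are already strings in YYYY-MM-DD form
toy = pd.DataFrame({
    "date": ["2013-03-01", "2014-08-01", "2015-06-01", "2016-11-01"],
    "principal": [119738, 15590, 40000, 56301],
})
toy["date"] = pd.to_datetime(toy["date"])

start = pd.Timestamp("2014-01-01")
end = pd.Timestamp("2015-12-31")

# True where the date falls in the closed interval [start, end]
in_range = (toy["date"] >= start) & (toy["date"] <= end)

# Indexing with the mask keeps only the rows marked True
selected = toy[in_range]

print(len(selected))
```

Note the parentheses around each comparison; they are needed because ``&`` binds more tightly than ``>=`` in Python.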

### 2a. Categorical data
Another thing we need to do is to deal with data that is categorical (not numeric and which does not have any sort of ordering). Since many scikit-learn models can only handle data that is numeric, it is necessary to represent these columns numerically.

One example of this is the ``type`` column. To turn this into numeric data, we'll apply a dummy encoding. 

In [15]:
# this encodes the type column of loans using a dummy coding
coded_cols = pd.get_dummies(loans["type"], 
                            prefix="type", # this sets the column names as type_<var_name>
                            drop_first=True) # we'll explain what this does below

coded_cols.head()

Unnamed: 0,type_home,type_personal
0,1,0
1,1,0
2,1,0
3,0,1
4,0,1


Each row in this dataframe indicates whether the type was ``home`` or ``personal``.

Maybe you've noticed that there are only two columns in this dataframe. That's because we set the ``drop_first`` argument in the ``get_dummies`` function to ``True``. Since there were three possible values for loan type, we know that any entry that has a 0 in both the ``type_home`` and ``type_personal`` columns is of type ``auto``.

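Why don't we also have a column indicating whether the loan is an auto loan or not? That column would be fully determined by the other two, and in some types of models (linear models without regularization, for example) a redundant column like that can cause problems.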
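To see why the third column carries no extra information, the sketch below rebuilds it from the two that remain. The ``toy_type`` Series is made up for illustration. The ``.astype(int)`` calls are there because newer pandas versions return True/False dummy columns rather than 0/1.

```python
import pandas as pd

# Hypothetical toy column with all three loan types
toy_type = pd.Series(["home", "personal", "auto", "home"], name="type")

# All three dummy columns, and the reduced set with the first one dropped
full = pd.get_dummies(toy_type, prefix="type").astype(int)
reduced = pd.get_dummies(toy_type, prefix="type", drop_first=True).astype(int)

# A row is "auto" exactly when it is neither "home" nor "personal"
rebuilt_auto = 1 - reduced["type_home"] - reduced["type_personal"]

print((rebuilt_auto == full["type_auto"]).all())
```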

Now we need to attach these dummy columns back to our original data.

In [16]:
# this attaches the coded columns "to the side" of loans
encoded_loans = pd.concat([loans, coded_cols], axis=1) 

# but loans still contains the "type" column so we drop that 
# to keep things neat
loans = encoded_loans.drop(["type"], axis=1)
loans.head()

Unnamed: 0,race,gender,date,zip,income,term,interest,principal,approved,adj_bls_2,id,type_home,type_personal
0,hispanic/latino,male,2016-01-01,60623.0,72230.0,180,3.389672,508761,False,0.34,AP20161-0-2,1,0
1,other,male,2013-03-01,60625.0,18543.0,360,0.277318,119738,False,0.14,AP20133-1-23,1,0
2,other,male,2011-11-01,60623.0,30228.0,240,4.398939,265779,False,0.08,AP201111-2-22,1,0
3,hispanic/latino,male,2014-08-01,60623.0,11129.0,60,5.221935,15590,True,0.09,AP20148-1-29,0,1
4,black,female,2016-11-01,60637.0,,60,10.843707,56301,True,0.41,AP201611-0-38,0,1


### 2b. Null data

Several columns contain null data. This means that the information was not recorded by the loan officer at the time of the application. This may be a problem as many machine learning models cannot handle undefined input or output values.

There are several methods of handling null data. One option is to drop all rows where the entry is not defined. 

In [17]:
all_dropped = loans.dropna()
print(len(loans), len(all_dropped))

1307 998


As you can see, this removed a significant number of entries. Doing this makes the data much easier to use, but may introduce systematic errors if the data is not missing in a purely random fashion.
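One rough way to check whether the data might be missing in a non-random fashion is to compare the approval rate of rows with and without a missing value. The sketch below does this on made-up data (the ``toy`` dataframe, its values, and the ``income_status`` column are all hypothetical); a large gap between the two rates would be a warning sign, not proof.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing incomes
toy = pd.DataFrame({
    "income": [72230.0, np.nan, 30228.0, np.nan, 11129.0, 50000.0],
    "approved": [False, True, False, True, True, False],
})

# Label each row by whether its income was recorded
toy["income_status"] = np.where(toy["income"].isna(), "missing", "present")

# Approval rate within each group
rates = toy.groupby("income_status")["approved"].mean()

print(rates)
```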

Another option is to drop just the columns where there is a null entry. 

In [18]:
cols_dropped = loans.dropna(axis="columns")

# which columns are removed this way?
missing = [l for l in loans.columns if l not in cols_dropped.columns]

print(missing)

['race', 'gender', 'zip', 'income']


This preserved the number of entries, but it means that we're missing possibly important columns like ``income``.

You can also try to impute a value for data that's missing in a column. Here, for example, we fill the missing entries in the ``income`` column with 0.

In [19]:
filled_income = loans["income"].fillna(0.0)
filled_income.head()

0    72230.0
1    18543.0
2    30228.0
3    11129.0
4        0.0
Name: income, dtype: float64
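Filling with 0 treats a missing income as if the applicant earned nothing, which may or may not be reasonable. One commonly used alternative, sketched here on made-up values, is to fill with the median of the recorded incomes and keep a separate column that remembers which entries were filled. The column names ``income_was_missing`` and ``income_filled`` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing income
toy = pd.DataFrame({"income": [72230.0, 18543.0, 30228.0, 11129.0, np.nan]})

# Remember which rows were missing before filling them in
toy["income_was_missing"] = toy["income"].isna()

# median() skips missing values by default
median_income = toy["income"].median()
toy["income_filled"] = toy["income"].fillna(median_income)

print(toy)
```

Whether the median, the mean, 0, or something else is the right fill value depends on why the data is missing, so treat this as one option among several.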

# Your Turn

There are two things you need to do here:

1. Handle the missing data in "loan_data.csv". We suggest that you do this first as it may impact how you handle encoding the types of the data.

2. Look through the columns and encode data into the correct type. A helpful command to see the type of each of the columns in the dataframe is shown below.

When completing both 1 & 2, you are free to use any method mentioned here. If you want to use a technique or method not mentioned here, you are also free to do so. You are also free to revisit this or any other section at any time.

**Remember**, when completing the tasks, try to treat them as if they are part of your job, and it is your responsibility to create an effective model for predicting loan acceptance.

In [23]:
#write your code here to perform part 1 of the task. Below is a small
#code snippet to help you get started. You may delete the snippet
#if you wish.

import pandas as pd

def handle_nulls(loans_dataframe):
    '''function for cleaning '''
    # this just prints the columns with nulls
    cols_with_nulls = [c for c in loans_dataframe.columns if loans_dataframe[c].isna().any()]
    print(cols_with_nulls)
    
    # this flags the rows with null data in any entry
    rows_with_nulls = loans_dataframe.isna().any(axis=1)
    
    # loans_dataframe.isna() produces a dataframe of the same size and 
    # shape as loans_dataframe, but with True/False in each entry indicating
    # whether each entry is null or not. 
    
    # .any checks if any of the entries are True; with the axis argument set to 
    # 1, this means that it checks for each row in loans_dataframe.isna() if there is 
    # an entry with True in it
    
    print(sum(rows_with_nulls))
    
    # Now you should do something to the data
    
    return loans_dataframe
        
cleaned_nulls = handle_nulls(pd.read_csv("loan_data.csv"))
   

['race', 'gender', 'zip', 'income']
309


In [28]:
#write your code here to perform part 2 of the task
#Below is a small code snippet to help you get started,
#you may delete the snippet if you wish

def encode_data(cleaned_loans_dataframe):
    print(cleaned_loans_dataframe.dtypes)
    #do something
    return cleaned_loans_dataframe
        
cleaned_loans = handle_nulls(pd.read_csv("loan_data.csv"))        
cleaned_data = encode_data(cleaned_loans)

['race', 'gender', 'zip', 'income']
309
race          object
gender        object
date          object
zip          float64
income       float64
type          object
term           int64
interest     float64
principal      int64
approved        bool
adj_bls_2    float64
id            object
dtype: object


## 3. Model Training

**Do not uncomment the following line of code. Make sure to run the following line of code as is.** This tells the plugin that you have reached the stage where you are starting to build models.

In [6]:
# %prompter_plugin model_training%

Executing this block does not prevent you from going back and revisiting previous data analysis decisions. 

In this section, we will ask you to train a classifier which predicts whether a loan will be approved or not. The purpose of this classifier is to be used by loan officers or local lenders to recommend to applicants specific loans they might be eligible for. The lenders hope that this tool will help match financial products to borrowers more efficiently.

Please note that this classifier will be used to make *recommendations* and **not** decisions about whether to grant a loan or not.

There are a couple of different things you could do here. A small example using ``LogisticRegression`` is included below.

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = loans[["type_home", "type_personal", "principal", "interest"]]
y = loans["approved"]

# this creates training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

lr = LogisticRegression()
lr.fit(X_train, y_train) # this "fits" or "trains" the model using X_train and y_train as training data

# these are the predictions for what y_test should be. 
# they may be helpful if you want to understand more about what a model is doing
preds = lr.predict(X_test)

lr.score(X_test, y_test)

0.4312977099236641

How well does the regression we built work? Is an accuracy of 0.43 good or bad in this case? If you followed along from the start, you might recall that the approval rate in our data was about 0.43 as well. 

This suggests the possibility that ``lr`` is predicting ``True`` no matter what.

In [30]:
preds.all() # this tests whether all the predictions are True

True

So it turns out that ``lr`` doesn't work very well. In fact, a model that guessed ``False`` all the time (that is, that no loan would be approved) would have an accuracy of 0.57!
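scikit-learn includes a ``DummyClassifier`` that makes this kind of constant-guess baseline explicit. The sketch below uses made-up labels (43 approved and 57 not approved, chosen to mirror the approval rate above) and is not run on the study data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 43 approved, 57 not approved
y_toy = np.array([True] * 43 + [False] * 57)

# DummyClassifier ignores the features, so a column of zeros is enough
X_toy = np.zeros((len(y_toy), 1))

# "most_frequent" always predicts the most common label seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```

Comparing any real model against a baseline like this helps tell whether it has learned anything beyond the class proportions.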

We can likely improve upon this baseline. In the next **Your Turn** we will ask you to explore some of the ways in which you might improve the classification performance.
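One thing that sometimes helps logistic regression is putting the features on a comparable scale, since a column like ``principal`` is many orders of magnitude larger than a 0/1 dummy column. Whether this is what holds back ``lr`` here is an assumption, not something this notebook has established. The sketch below shows the mechanics on synthetic data (``X_toy`` and ``y_toy`` are generated, not taken from the loans).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four features
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=10)

# Blow up one column to mimic a dollar-valued feature
X_toy[:, 0] = X_toy[:, 0] * 100000

# The pipeline standardizes each feature, then fits the logistic regression
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_toy, y_toy)

train_acc = scaled_lr.score(X_toy, y_toy)
print(train_acc)
```

The pipeline object supports the same ``fit``, ``predict`` and ``score`` calls as a plain classifier, so it can be used anywhere ``lr`` is used.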

# Your Turn

Use the data you cleaned up in section 2 to build your own Logistic Regression. In the regression we built above, we just used a few columns. Try different combinations of columns to see if that changes the results you get. 

In [None]:
def choose_columns(cleaned_data):
    # write code here
    return cleaned_data

X = choose_columns(cleaned_data)
y = cleaned_data["approved"] # change this if you've renamed the `approved` column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
lr = LogisticRegression()
lr.fit(X_train,y_train)
lr.score(X_test, y_test)

You can also try changing the type of model that you're using. In the example above, we just used the LogisticRegression classifier. Do you get different results if you use a different type of classifier?

There are other types of classifiers in scikit-learn, like the KNeighborsClassifier, or the DecisionTreeClassifier. If you are not familiar with these, don't worry. They all may be trained using the same calls to ``fit``, ``score`` and ``predict``.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

dt = DecisionTreeClassifier()
knn = KNeighborsClassifier()
lr = LogisticRegression()

X = choose_columns(cleaned_data)
y = cleaned_data["approved"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

dt.fit(X_train, y_train)
knn.fit(X_train, y_train)
lr.fit(X_train, y_train)

print(lr.score(X_test, y_test), dt.score(X_test, y_test), knn.score(X_test, y_test))
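When comparing several classifiers, it can be tidier to keep them in a dictionary and loop over them. The following is a sketch on synthetic data (``X_toy`` and ``y_toy`` are generated stand-ins); with your own data you would substitute your ``X`` and ``y``.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own X and y
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)

models = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=10),
    "k-nearest neighbors": KNeighborsClassifier(),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                    # train on the training split only
    scores[name] = model.score(X_te, y_te)   # accuracy on the held-out split

print(scores)
```

Each model is fitted on the training split and scored on the test split, so the numbers are comparable across classifiers.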

# End

Thank you for participating in the study. In section 3 you should have explored the performance of several different classifiers. Now is the time to choose which one you believe to be the best. To submit your model for consideration, please assign your model to the variable SUBMITTED_MODEL in the cell below. Then execute the submission cell below that one.

Remember that the model you submit will be used to recommend financial products to prospective loan applicants. 

You may submit multiple times, however only your last submitted model will be considered.

In [None]:
SUBMITTED_MODEL = None # replace None with the variable holding your chosen model

In [None]:
# %prompter_plugin submit %

You may now return to Qualtrics and complete the survey.