# Introduction

Thank you for agreeing to take part in this evaluation. During this evaluation you will be asked to carry out a series of tasks related to the dataset shown in the next section. Please carry out these tasks with the same care and rigor as if these tasks were part of your job duties.
    
Before beginning, please ensure that you are using the prompter kernel by looking at the top right of this notebook. It should say "prompter" rather than "Python 3". If it says you are using a Python kernel, please click where it says Python and select prompter from the drop-down menu. If you encounter any errors, or if it says "No Kernel!", please contact [retrograde-plugin@uchicago.edu](retrograde-plugin@uchicago.edu) so we can fix the issue.


## Task Description

This task is structured into four parts.

1. **Dataset introduction**
2. **Data cleaning**
3. **Model Training**
4. **Model Selection**

In each of these four sections, there is a tutorial portion and a task portion.
The tutorial portions are meant to provide background and structure to the task, and may be helpful to you when completing the tasks.
You will know that you have reached a task portion because we will mark it <span style="color:red"> **in red** </span>.

In each of these, there will be some code pre-written. This code is meant to help you complete the task by providing structure, but you are not required to use the provided code if you do not want to. You may refer to any documentation source you like during this task (such as StackOverflow, or Pandas API documentation or tutorials).

We ask that you use pandas and scikit-learn to perform the tasks. We have also installed numpy and matplotlib, should those be helpful. You will not be able to install any other non-standard libraries.

## 1. Dataset Introduction



We will be asking you to use the provided "loan&#95;data.csv" dataset during this experiment. This data was collected in a major metropolitan city in the United States. It contains information about applications for loans received by several different loan providers.

Let's start trying to understand the dataset by writing some Python code. Feel free to follow along by running the following sample commands in this notebook.

Below are a few lines of Python code that load the provided "loan&#95;data.csv" dataset into a pandas dataframe.

In [None]:
import pandas as pd

# What each of these columns represents is explained below. This dictionary tells pandas what 
# data type each of the columns should be treated as.

column_types = {
    "race" : "category",
    "gender" : "category",
    "zip" : "category",
    "income" : float,
    "type" : "category",
    "term" : int,
    "interest" : float,
    "principal" : int,
    "approved" : bool,
    "adj_bls_2" : float,
    "id" : str,
}

loans = pd.read_csv("loan_data.csv", parse_dates=["date"], dtype=column_types)
loans.head()

We can get a list of the columns in this dataframe with the following command:

In [None]:
loans.columns

Let's look at some of the columns in the dataframe. The column ``approved`` indicates whether or not the loan was approved.

In [None]:
# since python treats True as a 1, and False as a 0, the sum
# of this array is the number of entries in loans where approved == True

sum(loans["approved"])/len(loans["approved"]) # calculating the fraction of approved loans
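
As a side note, the same fraction can also be obtained with the `mean` method of a boolean Series. The sketch below uses a small made-up Series as a stand-in for ``loans["approved"]``, so the values are purely illustrative.

```python
import pandas as pd

# Toy stand-in for loans["approved"]; these values are made up for illustration.
approved = pd.Series([True, False, False, True, False])

# The approach used above: True counts as 1 and False as 0.
frac_sum = sum(approved) / len(approved)

# Equivalent shortcut: the mean of a boolean Series is the fraction of True values.
frac_mean = approved.mean()

print(frac_sum, frac_mean)
```

Either form should give the same answer; which one you use is a matter of preference.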

The column ``principal`` is the amount of money the loan was for, that is, how much money the applicant received if the loan was approved.

In [None]:
# 25th and 75th percentile
print(loans["principal"].quantile(0.25), loans["principal"].quantile(0.75))

That's a fairly wide variation in loan amounts. One possible reason for this is that the loans were applied for with different purposes in mind. The ``type`` column denotes the purpose of the loan.

In [None]:
loans["type"].unique()

There are three possible values ``type`` can have: ``personal``, ``home`` and ``auto``. 

* ``auto`` These loans were for automobile purchases. With these loans, the lender may repossess the car if the person who took out the loan is unable to make payments.

* ``home`` These loans were for the purchase of residential real estate. Under these loans, the lender may repossess the home through foreclosure if the person who took out the loan is unable to make payments.

* ``personal`` These loans are for personal expenses or investments other than a home or automobile. Under these loans, there is not generally any specific piece of property that a lender may repossess.

In [None]:
# here we use the same trick we used to calculate the approval rate 
# to calculate the fraction of loans of each type. 

personal_pct = sum(loans["type"] == "personal")/len(loans["type"])

# Since the 'type' column holds categorical labels, the statement 
# loans["type"] == "personal" produces a Series of True/False values 
# indicating for each entry in the 'type' column whether it equals "personal"
# or not

home_pct = sum(loans["type"] == "home")/len(loans["type"])
auto_pct = sum(loans["type"] == "auto")/len(loans["type"])

print("Personal loans: {0}\nHome loans: {1}\nAuto loans: {2}\n".format(personal_pct, home_pct, auto_pct))
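
If you prefer, pandas can compute all of these fractions in one call with `value_counts(normalize=True)`. The sketch below uses a small made-up categorical Series in place of ``loans["type"]``; the values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for loans["type"]; these values are made up for illustration.
loan_type = pd.Series(["personal", "home", "auto", "personal"], dtype="category")

# normalize=True returns fractions instead of raw counts.
shares = loan_type.value_counts(normalize=True)

print(shares)
```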

``interest`` is the annual percent interest on the loan. ``term`` is how long in months the loan was supposed to run. 

``income`` is the annual income of the loan applicants.

The file ``loan_data_dictionary.txt`` in this directory is a text file with a description of all of the columns in ``loan_data.csv``. 

### <span style="color:red">Your Turn</span>

Section 1 is just an introduction to the dataset you will be using. There's no specific task for you to do in this section. However, you should have a sense of what this data is and where it comes from before proceeding.

In the next section, you will be asked to make certain decisions about how to clean this data and get it into a usable form.

## 2. Data Cleaning

Before we can use this data to build a model, we'll need to clean it up a bit. 

Several columns contain null data. This means that the information was not recorded by the loan officer at the time of the application. This may be a problem as many machine learning models cannot handle undefined input or output values.

You may also find it useful to revisit your work in this section as you move on to the model building tasks in section 3.


There are several methods of handling null data. One option is to drop all rows where the entry is not defined. 

In [None]:
all_dropped = loans.dropna()
print(len(loans), len(all_dropped))

As you can see, this removed a significant number of entries. Doing this makes the data much easier to use, but may introduce systematic errors if the data is not missing in a purely random fashion. For example:


In [None]:
import numpy as np

true_data = np.array([10, 40, 36, 12, 67])
missing_data_lower = np.array([np.nan, 40, 36, np.nan, 67]) # lower values are missing
missing_data_upper = np.array([10, np.nan, 36, 12, np.nan]) # upper values are missing
# nanmean computes the mean, but ignores nan or missing data
# compare the true mean with the means when upper or lower values are missing
print(np.mean(true_data), np.nanmean(missing_data_upper), np.nanmean(missing_data_lower))


Another option is to drop just the columns where there is a null entry. 

In [None]:
cols_dropped = loans.dropna(axis="columns")

# which columns are removed this way?
missing = [l for l in loans.columns if l not in cols_dropped.columns]

print(missing)

This preserved the number of entries, but means that we're missing possibly important columns like ``income``.
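
Before choosing between dropping rows and dropping columns, it can help to see how many entries are missing in each column. The sketch below shows one way to do that on a small made-up dataframe; the column names mirror the dataset, but the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe with made-up values; "income" has two missing entries.
toy = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, np.nan],
    "principal": [1000, 2500, 400, 9000],
})

# Number of missing entries per column.
null_counts = toy.isna().sum()

# Fraction of missing entries per column.
null_share = toy.isna().mean()

print(null_counts)
print(null_share)
```

A column with only a handful of missing entries may be a better candidate for filling than for dropping.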

You can also try to impute a value for the missing entries in a column. Here, for example, we fill the missing entries in the ``income`` column with 0.

In [None]:
filled_income = loans["income"].fillna(0.0)
filled_income.head()
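
Zero is only one possible fill value. As an alternative not used in the cell above, a numeric column could be filled with a summary statistic such as its median. The sketch below uses a small made-up Series in place of ``loans["income"]``; whether the median is appropriate for this dataset is a judgment call left to you.

```python
import numpy as np
import pandas as pd

# Toy stand-in for loans["income"]; these values are made up for illustration.
income = pd.Series([30000.0, np.nan, 50000.0, 70000.0])

# Series.median() skips missing values by default.
median_income = income.median()

# Fill the missing entry with the median instead of 0.
filled_median = income.fillna(median_income)

print(filled_median)
```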

Note that some columns in the dataframe are encoded as *categorical* data.  In order to fill missing values in columns of these types, you must either add a value to the categories, or fill with a value already in the column.

An example of this is gender.

In [None]:
loans["gender"].unique()

In [None]:
loans["gender"].cat.add_categories("missing").fillna("missing").unique()

### <span style="color:red">Your Turn</span>

Handle the missing data in "loan_data.csv". We suggest that you do this first, as it may impact how you handle the model training in section 3.

You are free to use any method mentioned here. If you want to use a technique or method not mentioned here, you are also free to do so. You are also free to revisit this or any other section at any time.

**Remember**, when completing the tasks try to treat them as if they are part of your job, and it is your responsibility to create an effective model for predicting loan acceptance. 

In [None]:
#write your code here to perform the task. Below is a small
#code snippet to help you get started. You may delete the snippet
#if you wish.

import pandas as pd

def handle_nulls(loans_dataframe):
    '''function for cleaning '''
    # this just prints the columns with nulls
    cols_with_nulls = [c for c in loans_dataframe.columns if loans_dataframe[c].isna().any()]
    print(cols_with_nulls)
    
    # rows_with_nulls marks, for each row, whether any entry in that row is null
    rows_with_nulls = loans_dataframe.isna().any(axis=1)
    
    # loans_dataframe.isna() produces a dataframe of the same size and 
    # shape as loans_dataframe, but with True/False in each entry indicating
    # whether each entry is null or not. 
    
    # .any checks if any of the entries are True. With the axis argument set to 
    # 1, it checks for each row of loans_dataframe.isna() whether there is 
    # an entry with True in it
    
    # this prints the number of rows with null data in any entry
    print(sum(rows_with_nulls))
    
    # Now you should do something to the data
    
    return loans_dataframe
        
cleaned_data = handle_nulls(pd.read_csv("loan_data.csv", parse_dates=["date"], dtype=column_types))
   

## 3. Model Training

In this section, we will ask you to train a classifier which predicts whether a loan will be approved or not. The purpose of this classifier is to be used by loan officers or local lenders to recommend to applicants specific loans they might be eligible for. The lenders hope that this tool will help match financial products to borrowers more efficiently. 

Please note that this classifier will be used to make *recommendations* and **not** decisions about whether to grant a loan or not. 

In this part, there are two <span style="color:red">**Your Turn**</span> sections. In the first, you will be asked to select features to use. In the second, you will be asked to select different model architectures. We suggest completing the first one, then moving on to the second one, and then revisiting the first and any other previous steps as necessary. 

There are a couple different things you could do here. A small example using the ``DummyClassifier`` is included below. 

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X_num = loans[["principal", "interest"]] # these columns are numeric

# since loan type is a categorical variable, we need to encode it numerically
# this creates three columns of 0/1 denoting the type of loan each row is
X_cat = pd.get_dummies(loans["type"], prefix="type")

X = pd.concat([X_num, X_cat], axis=1) # this combines the categorical and numeric columns back into 1 dataframe
y = loans["approved"]

# this creates training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# dummy classifier guesses randomly at the labels
clf = DummyClassifier(strategy="uniform")
clf.fit(X_train,y_train) # this "fits" or "trains" the model using X_train and y_train as training data

# this is the prediction for what y should be. 
# they may be helpful if you want to understand more about what a model is doing
preds = clf.predict(X_test)

clf.score(X_test, y_test)

We need to figure out if this score is good or bad. If you followed along from the start, you might recall that the approval rate in our data was about 0.43 as well. 

What if we tried using a model that always guesses the most frequent decision (to reject)?

In [None]:
# this classifier always predicts the most frequent label in the training data
clf_freq = DummyClassifier(strategy="most_frequent")
clf_freq.fit(X_train, y_train)
clf_freq.score(X_test, y_test)

So it turns out that ``clf`` doesn't work very well. In fact, the model that guessed ``False`` all the time had a higher accuracy!
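
To see why the always-``False`` model scores the way it does, note that the accuracy of a most-frequent baseline is simply the share of the majority class in the data it is scored on. The sketch below illustrates this with made-up labels (6 rejections, 4 approvals); for simplicity it fits and scores on the same toy data rather than on separate training and test sets.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels for illustration: 6 rejections and 4 approvals.
y_toy = np.array([False] * 6 + [True] * 4)

# DummyClassifier ignores the features, so a column of zeros is enough.
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

# Share of the majority class, computed directly from the labels.
majority_share = max(y_toy.mean(), 1 - y_toy.mean())

print(baseline_acc, majority_share)
```

Any model you train should beat this number before its accuracy can be considered meaningful.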

We can likely improve upon this baseline. In the next **Your Turn** we will ask you to explore some of the ways in which you might improve the classification performance. 

### <span style="color:red">Your Turn, Feature Selection</span>

Use the data you cleaned up in section 2 to build your own logistic regression model. In the example above, we used only a few columns. Try different combinations of columns to see if that changes the results you get. 

In [None]:
from sklearn.linear_model import LogisticRegression

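def choose_columns(cleaned_data):
    # write code here
    # Note: you may need to dummy code any non-numeric columns
    # placeholder starting point: replace with the columns you want to use
    numeric_columns = ["principal", "interest"]
    return cleaned_data[numeric_columns]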

X = choose_columns(cleaned_data)
y = cleaned_data["approved"] # change this if you've renamed the `approved` column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# LogisticRegression is another sklearn classifier
lr = LogisticRegression()
lr.fit(X_train,y_train)
lr.score(X_test, y_test)
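
The note about dummy coding can be sketched as follows. This is only an illustration under assumptions: the dataframe is made up, and `choose_columns_sketch` is a hypothetical helper, not part of the provided code. It combines two numeric columns with a dummy-coded ``type`` column, the same way the ``DummyClassifier`` example built its features.

```python
import pandas as pd

# Toy stand-in for cleaned_data; these values are made up for illustration.
toy_cleaned = pd.DataFrame({
    "principal": [1000, 25000, 400],
    "interest": [5.0, 3.5, 12.0],
    "type": pd.Categorical(["personal", "home", "auto"]),
})

def choose_columns_sketch(df):
    """Hypothetical helper: numeric columns plus a dummy-coded 'type' column."""
    numeric = df[["principal", "interest"]]
    # one 0/1 column per category, named type_<category>
    dummies = pd.get_dummies(df["type"], prefix="type")
    return pd.concat([numeric, dummies], axis=1)

X_toy = choose_columns_sketch(toy_cleaned)
print(list(X_toy.columns))
```

The same pattern extends to other categorical columns by dummy coding each one and concatenating the results.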

### Model Selection

You can also try changing the type of model that you're using. In the example above, we just used the LogisticRegression classifier. Do you get different results if you use a different type of classifier?

There are other types of classifiers in scikit-learn, like the KNeighborsClassifier, or the DecisionTreeClassifier. If you are not familiar with these, don't worry. They can all be trained and evaluated using the same calls to ``fit``, ``score`` and ``predict``.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

dt = DecisionTreeClassifier()
knn = KNeighborsClassifier()
lr = LogisticRegression()

X = choose_columns(cleaned_data)
y = cleaned_data["approved"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

dt.fit(X_train, y_train)
knn.fit(X_train, y_train)
lr.fit(X_train, y_train)

print(lr.score(X_test, y_test), dt.score(X_test, y_test), knn.score(X_test, y_test))
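
Because every classifier shares the same ``fit`` and ``score`` interface, the comparison can also be written as a loop over a dictionary of models. The sketch below uses synthetic data from `make_classification` instead of the loan data, so the scores it prints say nothing about how these models perform on the real task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used purely for illustration, not the loan dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=10)

models = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=10),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# fit returns the fitted model, so fit and score can be chained.
scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te) for name, model in models.items()}

for name, score in scores.items():
    print(name, score)
```

Adding another classifier to the comparison then only requires adding one entry to the dictionary.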

### <span style="color:red">Your Turn, Model Selection</span>

Use the data and features you selected in the previous sections to train different types of models. You may use any scikit-learn model you see fit. If you think it might be useful, you can also try different parameters for different models. If you don't know what a model parameter is, or what the model parameter means, you don't need to worry about that.

In [None]:
# these are some examples of classifiers from the scikit-learn library

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

X = choose_columns(cleaned_data)
y = cleaned_data["approved"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# put models here 
rf = RandomForestClassifier().fit(X_train, y_train)

print(rf.score(X_test, y_test))
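
If you do want to try different parameters, one simple approach is to train the same model type several times with different settings and compare the scores. The sketch below varies `max_depth` for a `DecisionTreeClassifier` on synthetic data; the choice of parameter and the depth values are assumptions made for illustration, not recommendations for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used purely for illustration, not the loan dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=10)

depth_scores = {}
for depth in [1, 3, None]:  # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=10)
    depth_scores[depth] = tree.fit(X_tr, y_tr).score(X_te, y_te)

print(depth_scores)
```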

# End

Thank you for participating in the study. In section 3 you should have explored the performance of several different classifiers. Now you need to choose the one you believe to be the best. To submit your model for consideration, please assign your model to the variable SUBMITTED_MODEL in the cell below and then execute the cell.

Remember that the model you submit will be used to recommend financial products to prospective loan applicants. 

You may submit multiple times, however only your last submitted model will be considered.

In [None]:
SUBMITTED_MODEL = None # replace None with the variable holding your model

# if the model you decide on is named clf, then you would write 
# SUBMITTED_MODEL = clf

You may now return to Qualtrics and complete the survey.