# Introduction

Thank you for agreeing to take part in this evaluation. The <span style="color:green"> **goal** </span> of this evaluation task is to <span style="color:green">**build a machine learning model that decides whether or not someone should be granted a loan**</span>.

During this evaluation you are asked to assume the role of a **data scientist** working at a financial technology (FinTech) company that is introducing a new feature to its application. 

This feature will allow customers seeking loans to submit some data to the app and immediately receive a response on whether they will be approved for that loan.

Behind this interaction lies your work: <span style="color:green"> **a machine learning model that will decide whether or not this person should be granted this loan based on the data entered**</span>.

The company has collected some data for you in order to train and test a model, and now it is your job to do so.

**Please carry out these tasks with the same care and rigor as if these tasks were part of your job duties.**
    
*Before beginning, please ensure that you are using the retrograde kernel by looking at the top right of this notebook.* It should say "*retrograde*" rather than "*Python 3*". If it says "*Python 3*" or something similar, please click where it says Python and select retrograde from the drop-down menu. If you encounter any errors, or if it says "No Kernel!", please contact [retrograde-plugin@uchicago.edu](retrograde-plugin@uchicago.edu) so we can fix the issue.


## Task Description

This task is structured into five parts.

#### [**1. Data Exploration**](#data-expl)
#### [**2. Data Cleaning**](#data-clean)
#### [**3. Feature Creation and Feature Selection (Feature Engineering)**](#feat-eng)
#### [**4. Model Training**](#model-train)
#### [**5. Model Selection**](#model-select)

In each of these five sections, there is an explanation/example portion and a task portion.
The explanations are meant to provide background and structure to the task, and may be helpful to you when completing the tasks.

**Example code sections should be executed by you.**

You will know that you have reached a task portion because we will mark them <span style="color:red">**in red**</span>.

When you are done with a section, you will see:
<div class="prompt-ml qualtrics-warning" style="background-color: red; display: inline-block; border-radius: 7px;">
    <h1 style="font-weight: 600; font-size: 18px; text-overflow: ellipsis; overflow: hidden; white-space: nowrap; text-align: left; color: white; padding: 15px 25px; margin: 0;"
>Please return to the Qualtrics Survey tab before continuing.</h1>
</div>

indicating that you should return to the Qualtrics survey tab and finish the questions for that section before moving on to the next section in this Jupyter notebook.

*If at any point throughout working on the task you feel the need to revisit and revise your work on a particular section, feel free to do so.*

For some sections, there may be some code pre-written. **This code is meant to help you complete the task by providing structure, but you are not required to use the provided code if you do not want to.** 

**You may refer to any documentation** source or question-and-answer forum you like during this task (such as *StackOverflow, Pandas API documentation, etc.*).

We ask that you use pandas and scikit-learn to perform the tasks. We have also installed numpy and matplotlib, should those be helpful. You will not be able to install any other non-standard libraries.

---

## 1. Data Exploration <a class="anchor" id="data-expl"></a>

We will be asking you to use the provided "loan_data.csv" dataset during this experiment. This data was sourced from data collected in the company's app. It contains information about various users' loan history and other data voluntarily supplied in the app.

Let's start trying to understand the dataset by writing some Python code. Feel free to follow along by running the following sample commands in this notebook.

Below are a few lines of code that load the provided "loan_data.csv" dataset into a pandas dataframe.

**Your goal for this section should be to understand what is in the data and to identify some of the features/characteristics that may be relevant to later sections**.

In [None]:
import pandas as pd

# What each of these columns represents is explained below. This dictionary tells pandas what 
# data type each of the columns should be treated as.
# There is also a file called loan_data_dictionary.txt that explains what each column represents.

column_types = {
    "race" : "category",
    "gender" : "category",
    "zip" : "category",
    "income" : float,
    "type" : "category",
    "interest" : float,
    "term" : float,
    "principal" : float,
    "approved" : bool,
    "adj_bls_2" : float,
    "id" : str,
}

loans = pd.read_csv("loan_data.csv", parse_dates=["date"], dtype=column_types)
loans.head()

Let's look at some of the columns in the dataframe. **The column** `approved` **indicates whether or not the loan was approved**. <span style="color:green">**This is the column your model should predict.** </span>

In [None]:
# since python treats True as a 1, and False as a 0, the sum
# of this array is the number of entries in loans where approved == True

print(f'{sum(loans["approved"])/len(loans["approved"])*100:.2f}% of loans were approved')

### <span style="color:red">Your Turn, Data Exploration</span>

Section 1 is just an introduction to the dataset you will be using. There's no specific task for you to do in this section; however, you are responsible for understanding what the columns in this dataset represent and for developing an intuition for which of them may be helpful for your model.

In the next section ([**data cleaning**](#data-clean)), you will be asked to make certain decisions about how to clean this data and get it into a usable form.

**If you are feeling unsure of where to start, try some of these** (each has a link to the official documentation):

*Assuming your dataframe is in the variable `df`*
- [df.describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html?highlight=describe#pandas-dataframe-describe)
    - use the `include='all'` parameter to include non-numeric columns in this output
- [df['column_name'].quantile(0.25)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html?highlight=quantile#pandas-dataframe-quantile)
- [df['column_name'].unique()](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html?highlight=unique#pandas-series-unique)
- [df['column_name'].value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html?highlight=value_count#pandas-series-value-counts)

- Using pandas to understand different intersections of your data:
```python
    # boolean indexing with multiple columns 
    print(df[(df['column_name'] == some_val) & (df['different_col'] == different_val)].head())
    print(df[(df['column_name'] == some_val) & (df['different_col'] == different_val)].describe())
    # you have a subset of the original dataframe on 
    # ['column_name'] == some_val 
    # and ['different_col'] == different_val
```

Additionally, here are some brainstorming questions that you could try to explore in this section:
- *Do certain loan types (the `type` column) get approved more frequently?*
- *Is `income` data distributed in any notable way?*
- *Does the approval rate (the share of rows where `approved` is true) change over time?*
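
If it helps, here is a minimal sketch of how questions like these could be approached with pandas. It runs on a small made-up dataframe (the variable `toy` and all of its values are hypothetical stand-ins, not taken from "loan_data.csv"), so treat it as a pattern to adapt rather than a result.

```python
import pandas as pd

# Hypothetical toy data standing in for loan_data.csv; the values are made up.
toy = pd.DataFrame({
    "type": ["auto", "home", "home", "personal", "auto", "personal"],
    "income": [42000.0, 88000.0, 61000.0, 35000.0, 50000.0, 47000.0],
    "approved": [True, True, False, False, True, False],
    "date": pd.to_datetime([
        "2020-01-15", "2020-01-20", "2020-02-03",
        "2020-02-17", "2020-03-01", "2020-03-09",
    ]),
})

# Approval rate per loan type: the mean of a boolean column
# is the share of True values.
rate_by_type = toy.groupby("type")["approved"].mean()
print(rate_by_type)

# Summary statistics describing how income is distributed.
print(toy["income"].describe())

# Approval rate per calendar month.
rate_by_month = toy.groupby(toy["date"].dt.to_period("M"))["approved"].mean()
print(rate_by_month)
```

The same `groupby(...)["approved"].mean()` pattern should work for any other column you want to compare approval rates across.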

In [None]:
# Data Exploration (feel free to create cells here as needed)
# Your code here




<div class="prompt-ml qualtrics-warning" style="background-color: red; display: inline-block; border-radius: 7px;">
    <h1 style="font-weight: 600; font-size: 18px; text-overflow: ellipsis; overflow: hidden; white-space: nowrap; text-align: left; color: white; padding: 15px 25px; margin: 0;"
>Please return to the Qualtrics Survey tab before continuing.</h1>
</div>

---
## 2. Data Cleaning <a class="anchor" id="data-clean"></a>

Before we can use this data to build a model, we'll need to clean it up a bit. Raw data often contains incomplete or inaccurate records due to entry errors, inconsistent data collection practices, and many other reasons. The way this "dirtiness" manifests is often **unknown without inspection** and is different for every dataset.

It is a vital part of the data scientist role **to clean and standardize** the data used and so too will it be vital for you in this section.

Ideally, during the data exploration section you identified some signs of "dirtiness". If you have not yet identified any, your first step here should be to identify the kind(s) of dirtiness present in your data. Once you have done that, it will be up to you to decide how to **clean** the data. We recommend experimenting with several methods and deciding which method(s) best achieve your goals.
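
As a starting point for that inspection, the sketch below shows a few common checks. The dataframe `toy` is hypothetical, with problems planted on purpose. It is not a description of what is actually wrong with "loan_data.csv", and it only detects issues. How to resolve them is left to you.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberately planted problems;
# none of these values come from loan_data.csv.
toy = pd.DataFrame({
    "income": [52000.0, np.nan, -1.0, 61000.0, 61000.0],
    "type": ["home", "Home", "auto", "personal", "personal"],
    "principal": [10000.0, 5000.0, 7500.0, 3000.0, 3000.0],
})

# Count missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count rows that exactly duplicate an earlier row.
n_duplicates = toy.duplicated().sum()
print(n_duplicates)

# Inconsistent spellings ("home" vs "Home") show up as separate values.
print(toy["type"].value_counts())

# Count values outside a plausible range (here: negative income).
n_negative_income = (toy["income"] < 0).sum()
print(n_negative_income)
```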

### <span style="color:red">Your Turn, Data Cleaning</span>

*If at any point throughout working on the task you feel the need to revisit and revise your work on a previous section, feel free to do so.*

**Remember**, when completing the tasks try to treat them as if they are part of your job, and it is your responsibility to create an effective model for <span style="color:green">**deciding whether or not someone should be granted a loan**</span>.

In [None]:
# Data Cleaning (feel free to create cells here as needed)
# Below is a small code snippet to help you get started. 
# You may delete the snippet if you wish.

import pandas as pd
import numpy as np

def clean_data(loan_df):
    '''function for cleaning '''

    return loan_df # some clean dataframe
        
cleaned_data = clean_data(pd.read_csv("loan_data.csv", parse_dates=["date"], dtype=column_types))

<div class="prompt-ml qualtrics-warning" style="background-color: red; display: inline-block; border-radius: 7px;">
    <h1 style="font-weight: 600; font-size: 18px; text-overflow: ellipsis; overflow: hidden; white-space: nowrap; text-align: left; color: white; padding: 15px 25px; margin: 0;"
>Please return to the Qualtrics Survey tab before continuing.</h1>
</div>

---
## 3. Feature Creation and Selection (Feature Engineering) <a class="anchor" id="feat-eng"></a>

Feature engineering is a process in which you, a data scientist, use domain knowledge to extract notable features from the raw data. In this section you will be doing just that for the data from "loan_data.csv". Given that you are not expected to be an expert on loan decisions, use your best judgement and focus on the columns you think will help your model the most. Also recall that data science is a cyclical process, and it is normal to return to feature engineering to add and remove features after training.

If you are unsure of where to start try these common methods:
- **If you are selecting categorical variables** like `type` or `gender`, you will **need** to encode them numerically. One simple way is one-hot encoding with [pd.get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html#pandas-get-dummies); another is ordinal encoding with [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn-preprocessing-ordinalencoder).

```python
# type is a categorical variable
onehot_type = pd.get_dummies(cleaned_data['type'])
''' 
output looks like this below
     auto  home  personal
0        0     1         0
1        0     1         0
2        0     1         0
3        0     0         1
4        0     0         1
...    ...   ...       ...
1302     1     0         0
1303     1     0         0
1304     1     0         0
1305     0     0         1
1306     0     0         1

[1307 rows x 3 columns]
'''
```
- Normalizing/Scaling data (try [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html))
- Grouping/Clustering/Binning of values ([KBinsDiscretizer](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization.html))

As with previous sections, this is by no means an exhaustive list, just some simple things to start with.
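
The scikit-learn transformers mentioned above can be sketched as follows. The dataframe `toy` and its values are hypothetical, and the specific settings (three equal-width bins, default encoder options) are assumptions chosen for illustration, not recommendations for this dataset.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, OrdinalEncoder

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "income": [20000.0, 40000.0, 60000.0, 80000.0, 100000.0, 120000.0],
    "type": ["auto", "home", "personal", "home", "auto", "personal"],
})

# Rescale income linearly so the smallest value maps to 0 and the largest to 1.
scaler = MinMaxScaler()
income_scaled = scaler.fit_transform(toy[["income"]])

# Group income into 3 equal-width bins, labelled 0, 1, 2.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
income_binned = binner.fit_transform(toy[["income"]])

# Map each category to an integer code (categories are sorted alphabetically).
encoder = OrdinalEncoder()
type_encoded = encoder.fit_transform(toy[["type"]])

print(income_scaled.ravel())
print(income_binned.ravel())
print(encoder.categories_, type_encoded.ravel())
```

Note that `OrdinalEncoder` assigns codes in sorted order, which implies an ordering between categories that may not be meaningful for a column like `type`. One-hot encoding avoids that implied ordering.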

### <span style="color:red">Your Turn, Feature Engineering</span>

*If at any point throughout working on the task you feel the need to revisit and revise your work on a previous section, feel free to do so.*

In [None]:
# Feature Engineering (feel free to create cells here as needed)
# Below is a small code snippet to help you get started. 
# You may delete the snippet if you wish.

import pandas as pd
import numpy as np

def get_features(data):
    '''
    function for creating and selecting features from your data
    '''
    
    X = data # select your features
    y = data['approved'] # your target variable is 'approved'
    return X, y

X, y = get_features(cleaned_data)
X.columns

<div class="prompt-ml qualtrics-warning" style="background-color: red; display: inline-block; border-radius: 7px;">
    <h1 style="font-weight: 600; font-size: 18px; text-overflow: ellipsis; overflow: hidden; white-space: nowrap; text-align: left; color: white; padding: 15px 25px; margin: 0;"
>Please return to the Qualtrics Survey tab before continuing.</h1>
</div>

---
## 4. Model Training <a class="anchor" id="model-train"></a>

<span style="color:green">**In this section, we ask you to train a classifier model which will decide whether a loan will be approved or not.**</span> Your goal is to get a **baseline level of performance** from one model before moving on to the [**model selection**](#model-select) section.

Using the features you selected in the previous section, you will train a model that gives you a baseline level of performance to improve upon in the following [**model selection**](#model-select) section. Recall that you may revisit prior sections as necessary.

A small example using the [`DummyClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) is included below. 

In [None]:
def ready_for_training_testing(X, y):
    '''
    This function does 3 basic checks to see if you are ready to train/test your model.
    '''
    if len(X) != len(y):
        print(f"Your feature set 'X' is not the same size as your labels 'y'. This will cause an error with sklearn.")
        print(f"X size: {len(X)} != y size: {len(y)}")
        return False
    if pd.isna(X).any().any():
        print(f"Your feature set 'X' has null values. This will cause an error with sklearn.")
        has_nulls = []
        for col, null in zip(pd.isna(X).any().index, pd.isna(X).any()):
            if null:
                has_nulls.append(col)
        print(f"These columns {has_nulls}, have null values.")
        return False
    if pd.isna(y).any():
        print(f"Your labels 'y' have null values. This will cause an error with sklearn.")
        return False
    return True

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

def example_get_features(data):
    X_num = data[["principal", "interest"]] # these columns are numeric

    # since loan type is a categorical variable, we need to encode it numerically
    # this creates three columns of 0/1 denoting the type of loan each row is
    X_cat = pd.get_dummies(data["type"], prefix="type")

    X = pd.concat([X_num, X_cat], axis=1) # this combines the categorical and numeric columns back into 1 dataframe
    y = data["approved"]
    return X, y
X, y = example_get_features(loans)

# this creates training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# dummy classifier guesses randomly at the labels
clf = DummyClassifier(strategy="uniform")

# this "fits" or "trains" the model using
# X_train and y_train as training data
clf.fit(X_train,y_train) 

# these are the predictions the machine learning model made
preds = clf.predict(X_test)

# this outputs the accuracy of the model's predictions on the test set (X_test, y_test)
clf.score(X_test, y_test)

This is a pretty weak baseline that you can likely improve upon. In the next **Your Turn** we will ask you to explore some of the ways you might improve the classification performance. 
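
To see why a dummy baseline is weak, it can help to compare two of its strategies on labels with a known class balance. The sketch below uses made-up labels (70 approved, 30 not approved). The names `X_demo` and `y_demo` are hypothetical and unrelated to the loan data, and the scores are computed on the same rows used for fitting, which is acceptable here only because a dummy model learns nothing from the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 70 approved, 30 not approved.
y_demo = np.array([True] * 70 + [False] * 30)
# DummyClassifier ignores feature values, so a placeholder column is enough.
X_demo = np.zeros((100, 1))

# Always predicts the most common label, so accuracy equals the majority share.
majority = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
majority_accuracy = majority.score(X_demo, y_demo)

# Picks each label with equal probability, regardless of class balance.
uniform = DummyClassifier(strategy="uniform", random_state=0).fit(X_demo, y_demo)
uniform_accuracy = uniform.score(X_demo, y_demo)

print(f"most_frequent: {majority_accuracy:.2f}")
print(f"uniform:       {uniform_accuracy:.2f}")
```

A useful model should beat both numbers. On imbalanced labels, the `most_frequent` score is usually the more demanding of the two to beat.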

### <span style="color:red">Your Turn, Model Training</span>

Use the data you cleaned up in [**the data cleaning section**](#data-clean) and the features you selected in [**the feature engineering section**](#feat-eng) to build your own machine learning model. 

**The cell below demonstrates importing several different models. We ask that you experiment with just one model in this section so you can get a baseline measure of performance.**

Also recall every model from `sklearn` is trained and tested like so:
```python
# 0. initialize model
model = MachineLearningModel()
# 1. fit
model.fit(X_train, y_train)
# 2. predict
preds = model.predict(X_test) # preds is a 1D array of predicted labels (you can compare these with y_test)
# 3. score
model.score(X_test, y_test) # Return the mean accuracy on the given test data and labels.
```
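
For a concrete instance of this pattern, the sketch below uses a decision tree on synthetic data from `make_classification`. The data, the `_demo`-style variable names, and the chosen `max_depth` are all assumptions made for illustration. They are kept separate from the notebook's own `X`, `y`, and split variables so that running it does not overwrite your work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the loan data): 200 rows, 5 numeric features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# 0. initialize model
model = DecisionTreeClassifier(max_depth=3, random_state=0)
# 1. fit
model.fit(X_tr, y_tr)
# 2. predict
demo_preds = model.predict(X_te)
# 3. score (mean accuracy on the held-out rows)
demo_accuracy = model.score(X_te, y_te)

print(f"accuracy: {demo_accuracy:.2f}")
```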

*If at any point throughout working on the task you feel the need to revisit and revise your work on a previous section, feel free to do so.*

#### Set these parameters for [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) first!

In [None]:
'''
important parameter description:
test_size:    float - Represents the proportion of the dataset to include in the test split
random_state:   int - Controls the shuffling applied to the data before applying the split. 
                      Pass an int for reproducible output across multiple function calls.
'''
test_size = 
random_state = 

**Recall**, your goal is to get a **baseline level of performance** from **one** model before moving on to the [**model selection**](#model-select) section.

In [None]:
# Model Training (feel free to create cells here as needed)
# Below is a small code snippet to help you get started. 
# You may delete the snippet if you wish.

from sklearn.model_selection import train_test_split

X, y = get_features(cleaned_data)

if not ready_for_training_testing(X, y):
    print("We identified an issue with your data above! You must fix this before continuing.")

    
# split your data into training and test sets
'''
important parameter description:
test_size:    float - Represents the proportion of the dataset to include in the test split
random_state:   int - Controls the shuffling applied to the data before applying the split. 
                      Pass an int for reproducible output across multiple function calls.
'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

# YOUR CODE HERE

# SELECT A MACHINE LEARNING MODEL FROM HERE BY UNCOMMENTING THE LINE
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.linear_model import SGDClassifier
# from sklearn.svm import SVC
# from sklearn.linear_model import LogisticRegression
# from sklearn.tree import DecisionTreeClassifier



<div class="prompt-ml qualtrics-warning" style="background-color: red; display: inline-block; border-radius: 7px;">
    <h1 style="font-weight: 600; font-size: 18px; text-overflow: ellipsis; overflow: hidden; white-space: nowrap; text-align: left; color: white; padding: 15px 25px; margin: 0;"
>Please return to the Qualtrics Survey tab before continuing.</h1>
</div>

---
## 5. Model Selection <a class="anchor" id="model-select"></a>

**Your goal in this section is to evaluate the performance of different model architectures.**
Use the data and features you selected in the previous sections to train different types of models. You may use any scikit-learn model you see fit and any method of model evaluation/selection available to you. 

### <span style="color:red">Your Turn, Model Selection</span>

If you think it might be useful, you can also try different hyperparameters for different models, or any other available method of evaluating or improving a model. If you don't know what a hyperparameter is or what a particular one means, don't worry: tuning hyperparameters is not required.
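
If you do want to try different hyperparameters, one option scikit-learn provides is `GridSearchCV`, which fits a model for every combination in a grid and keeps the best one according to cross-validation. The sketch below is only an illustration on synthetic data. The grid values and the `_demo`-style variable names are assumptions, not tuned choices for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the loan data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Candidate hyperparameter values to try; every combination is evaluated.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

# 5-fold cross-validation on the training rows only.
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
search.fit(X_tr, y_tr)

# The best combination is refit on all training rows and used for scoring.
search_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(f"test accuracy: {search_accuracy:.3f}")
```

Because the search only sees the training rows, the held-out test score remains an honest estimate of how the selected model performs on new data.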

To see more models that you can use, here is an [extensive list of different models](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) from the scikit-learn docs. If you are not familiar with these different models, don't worry. They all may be trained and tested using the same calls to ``fit``, ``score`` and ``predict``.
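
Because the models share the same interface, comparing several of them can be written as a single loop. The sketch below assumes synthetic data and three arbitrarily chosen classifiers. The dictionary `candidates` and the other names are hypothetical, so swap in your own features and models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the loan data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# Each candidate exposes the same fit / score methods.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=2),
    "k_nearest_neighbors": KNeighborsClassifier(),
}

# Train every candidate on the same split and record its test accuracy.
test_accuracy = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    test_accuracy[name] = candidate.score(X_te, y_te)
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Using one fixed split for all candidates keeps the comparison fair, since every model is scored on exactly the same held-out rows.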

*If at any point throughout working on the task you feel the need to revisit and revise your work on a previous section, feel free to do so.*

In [None]:
# Model Selection (feel free to create cells here as needed)
# Below is a small code snippet to help you get started. 
# You may delete the snippet if you wish.

# import more models here
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = get_features(cleaned_data)

# split your data into training and test sets
'''
important parameter description:
test_size: represents the proportion of the dataset to include in the test split
random_state: Controls the shuffling applied to the data before applying the split. 
    Pass an int for reproducible output across multiple function calls.
'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

# put models here 
lr = LogisticRegression().fit(X_train, y_train)

print(lr.score(X_test, y_test))

---
# End

Thank you for participating in the study. In [**model selection**](#model-select) you should have explored the performance of several different classifiers. Now is the time to choose which one you believe to be the best.

**To submit your model for evaluation, please assign your model to the variable SUBMITTED_MODEL in the cell below and then execute the cell.**

**(You may submit multiple times, however only your last submitted model will be considered)**

In [None]:
SUBMITTED_MODEL = # write the model's variable name here

# if the model you want to submit is named clf, then you would write 
# SUBMITTED_MODEL = clf

<div class="prompt-ml qualtrics-warning" style="background-color: red; display: inline-block; border-radius: 7px;">
    <h1 style="font-weight: 600; font-size: 18px; text-overflow: ellipsis; overflow: hidden; white-space: nowrap; text-align: left; color: white; padding: 15px 25px; margin: 0;"
>Please return to the Qualtrics Survey tab before continuing.</h1>
</div>