# Introduction

Thank you for agreeing to take part in this evaluation. During this evaluation you will be asked to carry out a series of tasks related to the dataset shown in the next section. Please carry out these tasks with the same care and rigor as if these tasks were part of your job duties.
    
Before beginning, please ensure that you are using the prompter kernel by looking at the top right of this notebook. It should say "prompter" rather than "Python 3". If it says you are using a Python kernel, please click where it says Python and select prompter from the drop-down menu. If you encounter any errors, or if it says "No Kernel!", please contact [retrograde-plugin@uchicago.edu](mailto:retrograde-plugin@uchicago.edu) so we can fix the issue.


## Task Description

This task is structured into five parts.

1. [**Data Exploration**](#data-expl)
2. [**Data Cleaning**](#data-clean)
3. [**Feature Creation and Feature Selection (Feature Engineering)**](#feat-eng)
4. [**Model Training**](#model-train)
5. [**Model Selection**](#model-select)

In each of these five sections, there is an explanation/example portion and a task portion.
The explanations are meant to provide background and structure to the task, and may be helpful to you when completing the tasks.
You will know that you have reached a task portion because it will be marked <span style="color:red"> **in red** </span>.

*If at any point throughout working on the task you feel the need to revisit and revise your work on a particular section, feel free to do so.*

For some sections, there may be some pre-written code. This code is meant to help you complete the task by providing structure, but you are not required to use it. You may refer to any documentation source or question-and-answer forum you like during this task (such as StackOverflow or the pandas API documentation). We ask that you list the sources you used in the [**references**](#ref) section below.

We ask that you use pandas and scikit-learn to perform the tasks. We have also installed numpy and matplotlib, should those be helpful. You will not be able to install any other non-standard libraries.

### References
*(double click this cell to edit)*
- ex: https://stackoverflow.com/questions/17071871/how-do-i-select-rows-from-a-dataframe-based-on-column-values
- 

## 1. Data Exploration <a class="anchor" id="data-expl"></a>

We will be asking you to use the provided "loan_data.csv" dataset during this experiment. This data was collected in a major metropolitan city in the United States. It contains information about applications for loans received by several different loan providers.

**Your goal is to build a machine learning model capable of giving good recommendations for granting/approving loans.**

Let's start trying to understand the dataset by writing some python code. Feel free to follow along by running the following sample commands in this notebook.  

Below are a few lines of Python code that load the provided "loan_data.csv" dataset into a pandas dataframe.

In [None]:
import pandas as pd

# What each of these columns represents is explained below. This dictionary tells pandas what 
# data type each of the columns should be treated as.

column_types = {
    "race" : "category",
    "gender" : "category",
    "zip" : "category",
    "income" : float,
    "type" : "category",
    "term" : int,
    "interest" : float,
    "principal" : int,
    "approved" : bool,
    "adj_bls_2" : float,
    "id" : str,
}

loans = pd.read_csv("loan_data.csv", parse_dates=["date"], dtype=column_types)
loans.head()

We can get a list of the columns in this dataframe with `loans.columns`.

Let's look at some of the columns in the dataframe. The column ``approved`` indicates whether or not the loan was approved.

In [None]:
# since python treats True as a 1, and False as a 0, the sum
# of this array is the number of entries in loans where approved == True

print(f'{sum(loans["approved"])/len(loans["approved"])*100:.2f}% of loans were approved')

### <span style="color:red">Your Turn, Data Exploration</span>

Section 1 is just an introduction to the dataset you will be using. There's no specific task for you to do in this section. However, you are responsible for understanding what the columns in this dataset represent and for developing an intuition for which of them may be helpful for your model.

In the next section ([**data cleaning**](#data-clean)), you will be asked to make certain decisions about how to clean this data and get it into a usable form.

**If you are feeling unsure of where to start, try some of these** (each has a link to the official documentation):

*Assuming your dataframe is in the variable `df`*
- [`df.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html?highlight=describe#pandas-dataframe-describe)
    - use the `include='all'` parameter to include non-numeric columns in this output
- [`df['column_name'].quantile(0.25)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html?highlight=quantile#pandas-dataframe-quantile)
- [`df['column_name'].unique()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html?highlight=unique#pandas-series-unique)
- [`df['column_name'].value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html?highlight=value_count#pandas-series-value-counts)

- Using pandas to understand different intersections of your data:
```python
    # boolean indexing with multiple columns 
    subset = (df['column_name'] == some_val) & (df['different_col'] == different_val)
    print(df[subset].head())
    print(df[subset].describe())
    # you have a subset of the original dataframe on 
    # ['column_name'] == some_val 
    # and ['different_col'] == different_val
```

Additionally, here are some brainstorming questions that you could try to explore in this section:
- *Do certain `types` of loans get approved more frequently?*
- *Is `income` data distributed in any notable way?*
- *Does `approval` rate change over time?*
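
As a rough illustration of how questions like these can be turned into pandas calls, here is a minimal sketch on a tiny made-up dataframe. The column names mirror those in `column_types`, but every value (including the loan type labels `auto`, `home`, and `personal`) is an assumption invented for the example and says nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the loans dataframe; every value here is made up.
toy = pd.DataFrame({
    "type": ["auto", "auto", "home", "home", "personal", "personal"],
    "income": [42000.0, 58000.0, 91000.0, 120000.0, 35000.0, 47000.0],
    "approved": [True, False, True, True, False, False],
    "date": pd.to_datetime([
        "2020-01-15", "2020-01-20", "2020-02-03",
        "2020-02-18", "2020-03-07", "2020-03-22",
    ]),
})

# The mean of a boolean column is the fraction of True values,
# so grouping by type gives an approval rate per loan type.
rate_by_type = toy.groupby("type")["approved"].mean()
print(rate_by_type)

# Summary statistics describe how income is distributed.
print(toy["income"].describe())

# Grouping by calendar month shows how the approval rate moves over time.
rate_by_month = toy.groupby(toy["date"].dt.to_period("M"))["approved"].mean()
print(rate_by_month)
```

The same calls should work on your own dataframe once the toy one is swapped out; which groupings are worth looking at is up to you.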

In [None]:
# Data Exploration (feel free to create cells here as needed)
# Your code here



### <span style="color:orange">You should return to the Qualtrics and complete the questions for this section when you move forward.</span>

## 2. Data Cleaning <a class="anchor" id="data-clean"></a>

Before we can use this data to build a model, we'll need to clean it up a bit. Raw data will often have incomplete or inaccurate records due to entry errors, inconsistent practices in data collection, and many other reasons. The way this "dirtiness" will manifest is often **unknown without inspection** and is different for every dataset.

It is a vital part of the data scientist role **to clean and standardize** the data used and so too will it be vital for you in this section.

Ideally, during the data exploration section you identified some of these signs of "dirtiness". If you have not yet identified any, your first step here should be to identify the kind(s) of dirtiness present in your data. Once you have done that, it will be up to you to decide how you will **clean** the data. We recommend experimenting with several methods and deciding which method(s) best achieve your goals.
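
As one possible starting point for that inspection, the sketch below runs a few common checks on a toy dataframe with deliberately planted problems. All values are made up, and the particular issues shown (a missing value, a negative income, inconsistent capitalization, a duplicated row) are assumptions for illustration, not a description of what "loan_data.csv" contains.

```python
import numpy as np
import pandas as pd

# Toy dataframe with deliberately planted problems; all values are made up.
toy = pd.DataFrame({
    "income": [52000.0, np.nan, -1.0, 61000.0, 61000.0],
    "term": [36, 60, 60, 360, 360],
    "type": ["auto", "Auto", "home", "personal", "personal"],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count rows that are exact copies of an earlier row.
n_duplicates = toy.duplicated().sum()
print(n_duplicates)

# Look for inconsistent labels and implausible numeric values.
print(toy["type"].value_counts())
n_negative_income = (toy["income"] < 0).sum()
print(n_negative_income)
```

These checks only surface candidate problems. Whether to drop, fill, or correct the affected rows remains your decision.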

### <span style="color:red">Your Turn, Data Cleaning</span>

*If at any point throughout working on the task you feel the need to revisit and revise your work on a previous section, feel free to do so.*

**Remember**, when completing the tasks try to treat them as if they are part of your job, and it is your responsibility to create an effective model for predicting loan acceptance. 

In [None]:
# Data Cleaning (feel free to create cells here as needed)
# Below is a small code snippet to help you get started. 
# You may delete the snippet if you wish.

import pandas as pd

def clean_data(loans_dataframe):
    '''function for cleaning '''
    
    return loans_dataframe # some clean dataframe
        
cleaned_data = clean_data(pd.read_csv("loan_data.csv", parse_dates=["date"], dtype=column_types))

### <span style="color:orange">You should return to the Qualtrics and complete the questions for this section when you move forward.</span>

## 3. Feature Creation and Selection (Feature Engineering) <a class="anchor" id="feat-eng"></a>

Feature engineering is a process in which you, a data scientist, use domain knowledge to extract notable features from the raw data. In this section you will be doing just that for the data from "loan_data.csv". Since you are not expected to be an expert on loan decisions, use your best judgment and focus on the columns you think will help your model the most. Also recall that data science is a cyclical process, and it is normal to return to feature engineering to add and remove features after training.

If you are unsure of where to start try these common methods:
- Normalizing/Scaling data (try [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html))
- Encoding categorical variables, one common way is one-hot encoding (try [`pd.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html#pandas-get-dummies) or [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder))
- Grouping/Clustering/Binning of values ([`KBinsDiscretizer`](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization.html))

As with previous sections, this is by no means an exhaustive list, just some simple things to start with.
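
To make the three methods listed above concrete, here is a minimal sketch that applies each of them to a toy dataframe. The values, the choice of two bins, and the loan type labels are assumptions made only for the example.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

# Toy data; every value here is made up.
toy = pd.DataFrame({
    "income": [30000.0, 45000.0, 60000.0, 90000.0],
    "type": ["auto", "home", "auto", "personal"],
})

# Scaling: map income onto the range [0, 1].
scaled = MinMaxScaler().fit_transform(toy[["income"]])
toy["income_scaled"] = scaled[:, 0]

# One-hot encoding: one indicator column per loan type.
type_dummies = pd.get_dummies(toy["type"], prefix="type")

# Binning: place each income in one of two equal-width bins.
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
toy["income_bin"] = binner.fit_transform(toy[["income"]])[:, 0]

# Combine the engineered columns into a single feature dataframe.
features = pd.concat([toy[["income_scaled", "income_bin"]], type_dummies], axis=1)
print(features)
```

Whether any of these transformations helps depends on the model and the data, so treat them as options to experiment with rather than required steps.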

### <span style="color:red">Your Turn, Feature Engineering</span>

*If at any point throughout working on the task you feel the need to revisit and revise your work on a previous section, feel free to do so.*

In [None]:
# Feature Engineering (feel free to create cells here as needed)
# Below is a small code snippet to help you get started. 
# You may delete the snippet if you wish.

import pandas as pd
import numpy as np

def get_features(data):
    '''
    function for creating and selecting features from your data
    '''
    col_list = ['income', 'principal']
    fake_data = np.random.randint(5, 30, size=len(data)) # this is fake data, only used for example
    # reuse data's index so the concat below lines up row by row
    new_feature = pd.DataFrame(fake_data, columns=["new_feature"], index=data.index)
    
    features = data[col_list]
    
    return pd.concat([features, new_feature], axis=1)
        
X = get_features(cleaned_data)
y = cleaned_data['approved']
X.columns

### <span style="color:orange">You should return to the Qualtrics and complete the questions for this section when you move forward.</span>

## 4. Model Training <a class="anchor" id="model-train"></a>

In this section, we ask you to train a classifier which predicts whether a loan will be approved or not. The purpose of this classifier is to be used by loan officers to recommend to applicants specific loans they might be eligible for.

Using the features you selected in the previous section, you will be asked to train a model that will give you a baseline level of performance to improve upon in the following [**model selection**](#model-select) section. Recall that you may revisit prior sections as necessary.

A small example using the [`DummyClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) is included below. 

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X_num = loans[["principal", "interest"]] # these columns are numeric

# since loan type is a categorical variable, we need to encode it numerically
# this creates three columns of 0/1 denoting the type of loan each row is
X_cat = pd.get_dummies(loans["type"], prefix="type")

example_X = pd.concat([X_num, X_cat], axis=1) # this combines the categorical and numeric columns back into 1 dataframe
example_y = loans["approved"]

# this creates training and testing sets 
ex_X_train, ex_X_test, ex_y_train, ex_y_test = train_test_split(example_X, example_y, test_size=0.2, random_state=10)

# dummy classifier guesses randomly at the labels
clf = DummyClassifier(strategy="uniform")

# this "fits" or "trains" the model using 
# ex_X_train and ex_y_train as training data
clf.fit(ex_X_train,ex_y_train) 

# these are the predictions for what y should be. 
# they may be helpful if you want to understand more about what a model is doing
# since they are given on a row by row basis.
preds = clf.predict(ex_X_test)

clf.score(ex_X_test, ex_y_test)

This is a pretty weak baseline that you can likely improve upon. In the next **Your Turn** we will ask you to explore some of the ways you might improve the classification performance. 

### <span style="color:red">Your Turn, Model Training</span>

Use the data you cleaned up in [**the data cleaning section**](#data-clean) to build your own Logistic Regression. In the example above, we used just a few columns. Try different combinations of columns to see if that changes the results you get.

Use the features you selected in the [**feature engineering**](#feat-eng) section to train a model. Experiment with one model here to get a baseline measure of the model's performance with this data.

If you do not know which classifier to use, refer to this extensive [list](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) from the docs or expand the output from the cell below.

Also recall every model from `sklearn` is trained and tested like so:
```python
# 0. initialize model
model = MachineLearningModel()
# 1. fit
model.fit(training_data, training_labels)
# 2. predict
predictions = model.predict(test_data)
# 3. score
model.score(test_data, test_labels)
```
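
For a version of this pattern that runs as written, here is a minimal sketch using synthetic data from scikit-learn's `make_classification`. The data has nothing to do with "loan_data.csv", and the variable names are chosen only for this example. Note that `fit` takes both the training features and the training labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; it has nothing to do with loan_data.csv.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# 0. initialize model
model = LogisticRegression()
# 1. fit on the training features and labels
model.fit(X_tr, y_tr)
# 2. predict labels for the held-out rows
predictions = model.predict(X_te)
# 3. score: for classifiers this is the accuracy on the held-out rows
accuracy = model.score(X_te, y_te)
print(accuracy)
```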

*If at any point throughout working on the task you feel the need to revisit and revise your work on a previous section, feel free to do so.*

In [1]:
from sklearn.utils import all_estimators

estimators = all_estimators()

for name, class_ in estimators:
    if hasattr(class_, 'predict_proba'):
        print(name)

AdaBoostClassifier
BaggingClassifier
BayesianGaussianMixture
BernoulliNB
CalibratedClassifierCV
CategoricalNB
ClassifierChain
ComplementNB
DecisionTreeClassifier
DummyClassifier
ExtraTreeClassifier
ExtraTreesClassifier
GaussianMixture
GaussianNB
GaussianProcessClassifier
GradientBoostingClassifier
GridSearchCV
HalvingGridSearchCV
HalvingRandomSearchCV
HistGradientBoostingClassifier
KNeighborsClassifier
LabelPropagation
LabelSpreading
LinearDiscriminantAnalysis
LogisticRegression
LogisticRegressionCV
MLPClassifier
MultiOutputClassifier
MultinomialNB
NuSVC
OneVsRestClassifier
Pipeline
QuadraticDiscriminantAnalysis
RFE
RFECV
RadiusNeighborsClassifier
RandomForestClassifier
RandomizedSearchCV
SGDClassifier
SVC
SelfTrainingClassifier
StackingClassifier
VotingClassifier


In [None]:
# Model Training (feel free to create cells here as needed)
# Below is a small code snippet to help you get started. 
# You may delete the snippet if you wish.

# import a machine learning model here

X = get_features(cleaned_data)
y = cleaned_data["approved"] # you are predicting the "approved" column


### <span style="color:orange">You should return to the Qualtrics and complete the questions for this section when you move forward.</span>

## 5. Model Selection <a class="anchor" id="model-select"></a>

You can also try changing the type of model that you're using. In the example above, you used one type of classifier. Different classifier models will give you different performance and different results. Additionally you may consider tuning some of the hyperparameters of the model to achieve the results you want.

If you are not familiar with these different models, don't worry. They all may be trained and tested using the same calls to ``fit``, ``score`` and ``predict``.
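
One way to compare several models on equal footing is cross-validation. The sketch below is only an illustration on synthetic data: the three candidate models and their settings are arbitrary assumptions, not recommendations for this task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; it has nothing to do with loan_data.csv.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Candidate models with arbitrary, illustrative hyperparameters.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validation: each model is fit five times, and each fit is
# scored on the fold that was held out from training.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```

Averaging over folds gives a steadier estimate than a single train/test split, at the cost of fitting each model several times.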

### <span style="color:red">Your Turn, Model Selection</span>

Use the data and features you selected in the previous sections to train different types of models. You may use any scikit-learn model you see fit and any method of model evaluation/selection available to you. 

If you think it might be useful, you can also try different hyperparameters for different models or any other method of evaluating/improving a model available. If you don't know what a hyperparameter is, or what a particular hyperparameter means, you don't need to worry about it, as you are not required to tune hyperparameters.

*If at any point throughout working on the task you feel the need to revisit and revise your work on a previous section, feel free to do so.*

In [None]:
# Model Selection (feel free to create cells here as needed)
# Below is a small code snippet to help you get started. 
# You may delete the snippet if you wish.

# import more models here
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = get_features(cleaned_data)
y = cleaned_data["approved"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# put models here 
lr = LogisticRegression().fit(X_train, y_train)

print(lr.score(X_test, y_test))

### <span style="color:orange">You should return to the Qualtrics and complete the questions for this section when you move forward.</span>

# End

Thank you for participating in the study. In [**model selection**](#model-select) you should have explored the performance of several different classifiers. Now is the time where you need to choose which one you believe to be the best. To submit your model for consideration, please assign your model to the variable SUBMITTED_MODEL in the cell below and then execute the cell.

Remember that the model you submit will be used to recommend financial products to prospective loan applicants. 

You may submit multiple times, however only your last submitted model will be considered.

In [None]:
SUBMITTED_MODEL = None # replace None with your model variable

# if the model you decide on is named clf, then you would write 
# SUBMITTED_MODEL = clf

### <span style="color:orange">If you have run the above cell you may now return to the Qualtrics and complete the survey.</span>