In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("lab4.ipynb")

# Lab 4: Putting it all together in a mini project

**This lab is an optional group lab.** You can choose to work alone or in a group of up to four students. You are in charge of how you want to work and who you want to work with. Maybe you really want to go through all the steps of the ML process yourself, or maybe you want to practice your collaboration skills; it is up to you! Just remember to indicate who your group members are (if any) when you submit on Gradescope. If you choose to work in a group, you only need to use one of your GitHub repos.

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## Submission instructions
rubric={mechanics}

<p>You receive marks for submitting your lab correctly. Please follow these instructions:</p>

<ul>
  <li><a href="https://ubc-mds.github.io/resources_pages/general_lab_instructions/">
      Follow the general lab instructions.</a></li>
  <li><a href="https://github.com/UBC-MDS/public/tree/master/rubric">
      Click here to view a description of the rubrics used to grade the questions</a></li>
  <li>Make at least three commits.</li>
  <li>Push your <code>.ipynb</code> file to your GitHub repository for this lab and upload it to Gradescope.</li>
    <ul>
      <li>Before submitting, make sure you restart the kernel and rerun all cells.</li>
    </ul>
  <li>Also upload a <code>.pdf</code> export of the notebook to facilitate grading of manual questions (preferably WebPDF; you can select two files when uploading to Gradescope).</li>
  <li>Don't change any variable names that are given to you, don't move cells around, and don't include any code to install packages in the notebook.</li>
  <li>The data you download for this lab <b>SHOULD NOT BE PUSHED TO YOUR REPOSITORY</b> (there is also a <code>.gitignore</code> in the repo to prevent this).</li>
  <li>Include a clickable link to your GitHub repo for the lab just below this cell
    <ul>
      <li>It should look something like this https://github.ubc.ca/MDS-2020-21/DSCI_531_labX_yourcwl.</li>
    </ul>
  </li>
</ul>
</div>

_Points:_ 2

https://github.ubc.ca/MDS-2022-23/DSCI_573_lab4_wthass

<!-- END QUESTION -->

## Introduction <a name="in"></a>

In this lab you will be working on an open-ended mini-project, where you will put all the different things you have learned so far in 571 and 573 together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips
1. Since this mini-project is open-ended there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is the key to being successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution. 

#### Assessment
We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you instead do a bunch of sane things and you have clearly motivated your choices, but still get lower model performance than your friend, don't sweat it.


#### A final note
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "several hours" but not "many hours" is a good guideline for a high quality submission. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and we hope you enjoy it as well. 

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 1. Pick your problem and explain the prediction problem <a name="1"></a>
rubric={reasoning}

In this mini project, you will pick one of the following problems: 

1. A classification problem of predicting whether a credit card client will default or not. For this problem, you will use [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 

OR 

2. A regression problem of predicting `reviews_per_month`, as a proxy for the popularity of the listing, with the [New York City Airbnb listings from 2019 dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data). Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help guide hosts to create more appealing listings. In reality they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. Write a few sentences on your initial thoughts on the problem and the dataset. 
2. Download the dataset and read it as a pandas dataframe. 
3. Carry out any preliminary preprocessing, if needed (e.g., changing feature names, handling of NaN values etc.)
    
</div>

_Points:_ 3

The problem is to predict whether a client will default on their payment the next month or not. The dataset provides us with demographic and payment information about 30,000 clients from Taiwan from April 2005 to September 2005, with no missing values. All the columns are numeric, but certain features seem to be categorical, like ‘SEX’ and ‘MARRIAGE’, or ordinal, like ‘EDUCATION’. The repayment status columns ‘PAY_0’ and ‘PAY_2’ to ‘PAY_4’ seem like they may be the most useful in predicting whether a client will default the next month, as a client already missing payments may be more likely to continue doing so. The “ID” column simply identifies the client and will not assist in prediction, so it is dropped, and ‘default.payment.next.month’ is renamed to “target” during preliminary preprocessing. All column names are also changed to lower case for ease of use later on, and the `education` codes 0, 5 and 6 are merged into category 4.

In [118]:
# Import 
import sklearn # for tests
from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder,
    OrdinalEncoder,
    PolynomialFeatures
)
from sklearn.metrics import recall_score, precision_score
from lightgbm.sklearn import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from xgboost import XGBClassifier
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import (
    RandomizedSearchCV, cross_validate, train_test_split
)
from scipy.stats import loguniform
import pandas as pd
import numpy as np
from numpy.linalg import norm
import altair as alt
from pandas_profiling import ProfileReport
import eli5
import shap
import matplotlib
%matplotlib inline

In [3]:
data = pd.read_csv("data/UCI_Credit_Card.csv")
data_processed = data.drop("ID", axis=1)
data_processed = data_processed.rename(columns={"default.payment.next.month": "target"})
data_processed.columns = data_processed.columns.str.lower()
data_processed["education"] = data_processed['education'].replace([0, 5, 6], 4)

data_processed.head()

Unnamed: 0,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,...,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,target
0,20000.0,2,2,1,24,2,2,-1,-1,-2,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,120000.0,2,2,2,26,-1,2,0,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,90000.0,2,2,2,34,0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,50000.0,2,2,1,37,0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,50000.0,1,2,1,57,-1,0,-1,0,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 2. Data splitting <a name="2"></a>
rubric={reasoning}

**Your tasks:**

1. Split the data into train and test portions.

> Make the decision on the `test_size` based on the capacity of your laptop. 
    
</div>

_Points:_ 1

In [116]:
# Fix the seed so that the split (and every result below) is reproducible
train_df, test_df = train_test_split(data_processed, test_size=0.4, random_state=123)
X_train, y_train = train_df.drop("target", axis=1), train_df["target"]
X_test, y_test = test_df.drop("target", axis=1), test_df["target"]
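
With roughly 22% positives in the target, a purely random split can leave the train and test portions with slightly different default rates. The sketch below shows one way to keep the class ratio equal in both portions by passing `stratify` to `train_test_split`. It runs on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical names, not the lab's data), so it is an illustration under those assumptions rather than the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 22% positives (assumption for illustration)
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 220 + [0] * 780)

# stratify=y_demo keeps the share of positives the same in both portions;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=123
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Whether stratification matters in practice depends on the dataset size; with 30,000 rows the difference is likely small, but it costs nothing to add.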

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 3. EDA <a name="3"></a>
rubric={viz,reasoning}
    
Perform exploratory data analysis on the train set.

**Your tasks:**

1. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
2. Summarize your initial observations about the data. 
3. Pick appropriate metric/metrics for assessment. 
    
</div>

_Points:_ 6

1 & 2. (See the plots from Pandas Profiling below for visuals.) 
Our task is to predict whether a client will default on their payment next month or not (“target” == 0 is a predicted no, “target” == 1 is a predicted yes). The dataset provides us with demographic and payment information about 30,000 clients from Taiwan from April 2005 to September 2005. 

1. We can see that most clients are clustered around similar values in each column (for example, the `BILL_AMT*` columns are all heavily right-skewed). The target shows a strong class imbalance: only approximately 22% of clients in the training set default (target == 1), while about 78% do not (target == 0). This imbalance may need to be dealt with at some point. 

2. Using Pandas Profiling, we can see that there are no missing values and that all columns are numeric, but certain features seem to be categorical, like ‘SEX’ and ‘MARRIAGE’, or ordinal, like ‘EDUCATION’. The repayment status columns ‘PAY_0’ and ‘PAY_2’ to ‘PAY_4’ seem like they may be the most useful in predicting whether a client will default the next month: intuitively, if a client is already missing payments, their financial situation is unlikely to change in such a short time, and therefore they may be more likely to continue doing so. The “ID” column simply identifies the client and will not assist in prediction, so it is dropped, and ‘default.payment.next.month’ is renamed to “target” during preliminary preprocessing. 

3. An appropriate metric for our classification problem is recall. We want to predict whether our customers are going to default, and it is most costly to our company if the model predicts that someone will not default and they then do (as we now have to pay their bills), i.e. a false negative. We therefore want to minimize the number of false negatives, in other words, maximize our recall. This matters more to us than overall accuracy (correctly predicting both true positives and true negatives) and more than precision, which is lowered by false positives, where we predict someone is going to default but they don't. A false positive is less harmful to us, since in that case we don't have to pay any bills.
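
To make the link between false negatives and recall concrete, here is a small worked example in plain Python. The labels are made up for illustration (`y_true_toy` and `y_pred_toy` are hypothetical), with 1 meaning "default".

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 as 'default'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn


# Hypothetical toy labels: 4 defaulters followed by 6 non-defaulters
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(y_true_toy, y_pred_toy)

recall_toy = tp / (tp + fn)        # share of actual defaulters we caught
precision_toy = tp / (tp + fp)     # share of predicted defaulters who defaulted
accuracy_toy = (tp + tn) / len(y_true_toy)

print(f"recall={recall_toy:.2f}, precision={precision_toy:.2f}, accuracy={accuracy_toy:.2f}")
```

Three of the four defaulters are missed here, so recall is only 0.25 even though accuracy is 0.60; this is the kind of gap that accuracy alone would hide.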

In [15]:
train_df.describe()

Unnamed: 0,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,...,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,target
count,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0,...,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0
mean,167854.093333,1.603,1.8395,1.556444,35.385667,-0.021611,-0.140667,-0.172,-0.2175,-0.264944,...,43029.184333,40112.907556,38551.271944,5658.218944,6037.105,5218.411278,4837.536278,4845.267833,5254.729556,0.218722
std,129733.791829,0.48929,0.745426,0.521692,9.162583,1.135143,1.202019,1.197936,1.170158,1.135849,...,63701.280202,60238.176491,58655.020543,15695.228662,24403.28,15869.11163,15427.455419,15398.159304,18178.615856,0.413391
min,10000.0,1.0,1.0,0.0,21.0,-2.0,-2.0,-2.0,-2.0,-2.0,...,-81334.0,-61372.0,-339603.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,50000.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,2354.0,1722.75,1265.5,979.75,880.0,391.75,280.0,281.5,100.0,0.0
50%,140000.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,...,19102.0,18160.5,17179.0,2108.5,2011.0,1804.0,1500.0,1510.0,1500.0,0.0
75%,240000.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,0.0,...,54332.25,50237.25,49054.75,5003.0,5000.0,4500.0,4000.0,4119.25,4000.0,0.0
max,800000.0,2.0,4.0,3.0,79.0,8.0,8.0,8.0,8.0,8.0,...,706864.0,823540.0,699944.0,505000.0,1684259.0,508229.0,528897.0,426529.0,528666.0,1.0


In [6]:
train_df["target"].value_counts(normalize=True)

0    0.781278
1    0.218722
Name: target, dtype: float64

In [7]:
corr_matrix = train_df.corr('spearman').style.background_gradient()
corr_matrix

Unnamed: 0,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,target
limit_bal,1.0,0.059125,-0.262535,-0.107608,0.183254,-0.302271,-0.353007,-0.339854,-0.310888,-0.286625,-0.265661,0.046355,0.042619,0.058468,0.07206,0.080562,0.09017,0.272898,0.283265,0.279774,0.281165,0.30061,0.316776,-0.171089
sex,0.059125,1.0,0.017911,-0.031691,-0.094915,-0.056713,-0.072104,-0.070464,-0.061901,-0.054427,-0.045033,-0.041748,-0.042684,-0.030402,-0.026779,-0.017679,-0.014126,-0.004306,0.009935,0.015126,0.008106,0.012519,0.028027,-0.038183
education,-0.262535,0.017911,1.0,-0.155665,0.151755,0.126954,0.167649,0.160024,0.151579,0.133907,0.120005,0.093232,0.09122,0.080576,0.069291,0.05567,0.050869,-0.038347,-0.042503,-0.035303,-0.041882,-0.055484,-0.054027,0.047817
marriage,-0.107608,-0.031691,-0.155665,1.0,-0.46754,0.029904,0.040095,0.045655,0.050381,0.055019,0.047182,0.008516,0.00997,0.00652,0.010902,0.008727,0.006132,-0.007177,-0.011544,-0.013173,-0.015358,-0.014276,-0.013561,-0.020312
age,0.183254,-0.094915,0.151755,-0.46754,1.0,-0.067862,-0.088109,-0.088417,-0.085936,-0.091607,-0.083778,-0.007627,-0.00572,-0.004832,-0.009336,-0.007267,-0.006023,0.037897,0.043635,0.031497,0.039743,0.035646,0.036764,0.008158
pay_0,-0.302271,-0.056713,0.126954,0.029904,-0.067862,1.0,0.632394,0.553705,0.519837,0.488419,0.47473,0.319852,0.334366,0.317411,0.307567,0.301405,0.293518,-0.098906,-0.067628,-0.055509,-0.035605,-0.031071,-0.043585,0.287631
pay_2,-0.353007,-0.072104,0.167649,0.040095,-0.088109,0.632394,1.0,0.800274,0.714781,0.676008,0.639782,0.57585,0.553901,0.523305,0.500465,0.481273,0.463516,0.020637,0.081687,0.089825,0.096115,0.095759,0.083016,0.206025
pay_3,-0.339854,-0.070464,0.160024,0.045655,-0.088417,0.553705,0.800274,1.0,0.804886,0.721108,0.679549,0.529901,0.58996,0.560468,0.533217,0.508621,0.486891,0.214471,0.03406,0.104768,0.115703,0.117378,0.097207,0.185653
pay_4,-0.310888,-0.061901,0.151579,0.050381,-0.085936,0.519837,0.714781,0.804886,1.0,0.818532,0.733615,0.516527,0.558882,0.61906,0.591201,0.561654,0.53356,0.184424,0.241307,0.070706,0.141397,0.153113,0.145053,0.165471
pay_5,-0.286625,-0.054427,0.133907,0.055019,-0.091607,0.488419,0.676008,0.721108,0.818532,1.0,0.823422,0.501105,0.536847,0.584458,0.646661,0.617448,0.577247,0.173378,0.216297,0.263239,0.102616,0.179597,0.176396,0.150523


In [8]:
profile = ProfileReport(train_df, title="Pandas Profiling Report", minimal=True)
profile.to_notebook_iframe()


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-warning">

## 4. Feature engineering (Challenging)
rubric={reasoning}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing.
    
</div>

_Points:_ 0.5

In [9]:
...

Ellipsis

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 5. Preprocessing and transformations <a name="5"></a>
rubric={accuracy,reasoning}

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 
    
</div>

_Points:_ 4

In [18]:
categorical_features = ["marriage"] # encoded ordinally, but actually categorical
binary_features = ["sex"] # encoded with 1,2 - maybe a good idea to switch to 0,1?
passthrough_features = ["pay_0", "pay_2", "pay_3", "pay_4", "pay_5", "pay_6", "education"] # ordinal features already encoded
numeric_features = [
    "limit_bal",
    "age",
    "bill_amt1",
    "bill_amt2",
    "bill_amt3",
    "bill_amt4",
    "bill_amt5",
    "bill_amt6",
    "pay_amt1",
    "pay_amt2",
    "pay_amt3",
    "pay_amt4",
    "pay_amt5",
    "pay_amt6",
]
preprocessor = make_column_transformer(
    (OneHotEncoder(), categorical_features),
    (OneHotEncoder(drop='if_binary'), binary_features),
    (StandardScaler(), numeric_features),
    ("passthrough", passthrough_features)
)
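
As a rough illustration of what this column transformer produces, the sketch below applies the same four kinds of transformation to a tiny made-up frame (`toy` and `toy_preprocessor` are hypothetical names, with one column per feature type). It assumes `marriage` takes four distinct values and `sex` two, as the summary statistics in this notebook suggest.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows: one column for each feature type used in the lab
toy = pd.DataFrame(
    {
        "marriage": [0, 1, 2, 3, 1, 2],   # categorical, 4 levels
        "sex": [1, 2, 2, 1, 2, 1],        # binary
        "limit_bal": [20000.0, 120000.0, 90000.0, 50000.0, 50000.0, 30000.0],
        "pay_0": [2, -1, 0, 0, -1, 1],    # ordinal, passed through as-is
    }
)

toy_preprocessor = make_column_transformer(
    (OneHotEncoder(), ["marriage"]),                 # 4 one-hot columns
    (OneHotEncoder(drop="if_binary"), ["sex"]),      # 1 column for a binary feature
    (StandardScaler(), ["limit_bal"]),               # 1 scaled column
    ("passthrough", ["pay_0"]),                      # 1 untouched column
)

toy_transformed = toy_preprocessor.fit_transform(toy)
n_columns = toy_transformed.shape[1]
print(n_columns)
```

By the same counting, the lab's preprocessor would be expected to yield 4 + 1 + 14 + 7 = 26 columns, provided all four `marriage` levels appear in the data it is fitted on.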

In [19]:
train_df.query("target == 1")

Unnamed: 0,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,...,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,target
21057,160000.0,1,2,2,38,4,3,2,2,3,...,103928.0,101540.0,99587.0,0.0,5500.0,6700.0,0.0,27.0,2800.0,1
2144,20000.0,2,2,1,27,3,2,0,0,2,...,6490.0,6837.0,6435.0,0.0,1095.0,900.0,596.0,0.0,0.0,1
28287,70000.0,2,2,2,32,2,0,0,0,0,...,36908.0,29439.0,19494.0,3007.0,1794.0,2000.0,967.0,1000.0,870.0,1
27569,340000.0,1,2,1,53,0,0,0,0,0,...,304706.0,250216.0,253526.0,12604.0,14000.0,11000.0,9000.0,10000.0,30000.0,1
18071,60000.0,1,2,2,36,2,2,2,2,2,...,50737.0,52602.0,53613.0,2000.0,1500.0,2000.0,3000.0,2000.0,2000.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14434,10000.0,1,2,2,32,1,2,0,0,0,...,8630.0,7850.0,8150.0,0.0,1400.0,0.0,0.0,1600.0,0.0,1
12216,280000.0,2,2,1,39,2,3,2,2,2,...,189806.0,201410.0,205479.0,8000.0,8000.0,0.0,14500.0,7300.0,7500.0,1
23073,360000.0,2,3,2,45,-2,-2,-2,-2,-2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
6291,190000.0,1,2,1,51,0,0,0,0,0,...,34904.0,108419.0,75455.0,3359.0,13487.0,1414.0,40710.0,43406.0,2773.0,1


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 6. Baseline model <a name="6"></a>
rubric={accuracy}

**Your tasks:**
1. Train a baseline model for your task and report its performance.
    
</div>

_Points:_ 2

In [34]:
# Attributed to Varada, DSCI 571
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Return the mean and standard deviation of cross-validation scores.

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data
    **kwargs :
        additional keyword arguments passed on to `cross_validate`

    Returns
    -------
        pandas Series of "mean (+/- std)" strings, one per cross-validation score
    """

    scores = pd.DataFrame(cross_validate(model, X_train, y_train, **kwargs))

    mean_scores = scores.mean()
    std_scores = scores.std()

    # Format each score as "mean (+/- std)"; zip avoids positional lookups
    # on a label-indexed Series
    out_col = [
        "%0.3f (+/- %0.3f)" % (mean, std)
        for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data=out_col, index=mean_scores.index)
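
The formatting step in this helper can be checked by hand. The sketch below uses made-up fold scores (`fold_recalls` is hypothetical) and the standard library only; `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is also the default for pandas' `.std()`.

```python
import statistics

# Made-up recall scores from five cross-validation folds
fold_recalls = [0.80, 0.82, 0.81, 0.79, 0.83]

mean_recall = statistics.mean(fold_recalls)
std_recall = statistics.stdev(fold_recalls)  # sample std (n - 1 in the denominator)

# Same "mean (+/- std)" layout as in the results tables
summary = "%0.3f (+/- %0.3f)" % (mean_recall, std_recall)
print(summary)  # 0.810 (+/- 0.016)
```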

In [36]:
cross_val_results = {}
classification_metrics = ["accuracy", "precision", "recall", "f1"]
dc = DummyClassifier()
cross_val_results["Dummy"] = mean_std_cross_val_scores(
    dc, X_train, y_train, return_train_score=True, scoring=classification_metrics, n_jobs=-1
)
pd.DataFrame(cross_val_results).T



|  | fit_time | score_time | test_accuracy | train_accuracy | test_precision | train_precision | test_recall | train_recall | test_f1 | train_f1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Dummy | 0.005 (+/- 0.000) | 0.007 (+/- 0.001) | 0.781 (+/- 0.000) | 0.781 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) |
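
The baseline numbers follow directly from the class balance. A short check in plain Python, assuming the 18,000 training rows and the 0.218722 positive rate reported earlier in this notebook, and assuming the dummy model predicts the majority class (no default) for everyone:

```python
n_train = 18000
positive_rate = 0.218722  # share of target == 1 in the training set

n_positive = round(n_train * positive_rate)
n_negative = n_train - n_positive

# A majority-class model predicts 0 for every client
tp_dummy, fp_dummy = 0, 0
fn_dummy, tn_dummy = n_positive, n_negative

accuracy_dummy = (tp_dummy + tn_dummy) / n_train
recall_dummy = tp_dummy / (tp_dummy + fn_dummy)
# Precision is undefined with no positive predictions; report it as 0 here
predicted_positive = tp_dummy + fp_dummy
precision_dummy = tp_dummy / predicted_positive if predicted_positive else 0.0

print(f"accuracy={accuracy_dummy:.3f}, recall={recall_dummy:.3f}, precision={precision_dummy:.3f}")
```

An accuracy of about 0.781 with a recall of 0 is the reason accuracy alone is a poor guide for this problem.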


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 7. Linear models <a name="7"></a>
rubric={accuracy,reasoning}

**Your tasks:**

1. Try a linear model as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the regularization hyperparameter. 
3. Report cross-validation scores along with standard deviation. 
4. Summarize your results.
    
</div>

_Points:_ 8

Since this is a classification problem, the first linear model we attempt is logistic regression. To tune the regularisation hyperparameter `C`, we perform a random search that optimises recall. We then cross-validate to see how the model scores on our classification metrics.

As we can see from the results, we initially get a fairly high accuracy, but our f1 and recall scores are very low, even after the random search determined that setting `class_weight="balanced"` resulted in a better recall. Low recall means that we have a high number of false negatives, which is exactly what we want to prevent. We will consider different hyperparameter values and alternative models to see if we can improve our recall score.

Aside from this, it's worth mentioning that we end up with low standard deviations for all of our classification metrics, which indicates that our model performs consistently across the cross-validation folds rather than just 'getting lucky' on one of them.

In [111]:
lr_param = {
    'logisticregression__C': loguniform(1e-3, 1e3),
    'logisticregression__class_weight': [None, "balanced"]
}

pipe_lr = make_pipeline(preprocessor, LogisticRegression(random_state=123, max_iter=1000))
cross_val_results["Logistic Regression"] = mean_std_cross_val_scores(
    pipe_lr, X_train, y_train, return_train_score=True,
    scoring=classification_metrics, n_jobs=-1
)

random_search_lr = RandomizedSearchCV(
    pipe_lr, lr_param, n_iter=20, n_jobs=-1, scoring='recall', random_state=123
)

cross_val_results["Tuned Logistic Regression"] = mean_std_cross_val_scores(
    random_search_lr, X_train, y_train, return_train_score=True,
    scoring=classification_metrics, n_jobs=-1
)
pd.DataFrame(cross_val_results).T

|  | fit_time | score_time | test_accuracy | train_accuracy | test_precision | train_precision | test_recall | train_recall | test_f1 | train_f1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Dummy | 0.005 (+/- 0.000) | 0.007 (+/- 0.001) | 0.781 (+/- 0.000) | 0.781 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) |
| Logistic Regression | 0.413 (+/- 0.019) | 0.018 (+/- 0.004) | 0.810 (+/- 0.003) | 0.811 (+/- 0.001) | 0.714 (+/- 0.024) | 0.716 (+/- 0.005) | 0.223 (+/- 0.010) | 0.224 (+/- 0.007) | 0.339 (+/- 0.012) | 0.341 (+/- 0.008) |
| SVC | 15.829 (+/- 0.560) | 2.579 (+/- 0.073) | 0.822 (+/- 0.006) | 0.827 (+/- 0.001) | 0.689 (+/- 0.032) | 0.710 (+/- 0.005) | 0.337 (+/- 0.015) | 0.352 (+/- 0.005) | 0.453 (+/- 0.018) | 0.471 (+/- 0.004) |
| Random Forest | 4.270 (+/- 0.595) | 0.081 (+/- 0.007) | 0.815 (+/- 0.006) | 1.000 (+/- 0.000) | 0.640 (+/- 0.026) | 0.999 (+/- 0.000) | 0.356 (+/- 0.019) | 0.998 (+/- 0.001) | 0.458 (+/- 0.019) | 0.999 (+/- 0.000) |
| Stacking Model | 29.997 (+/- 3.518) | 0.051 (+/- 0.002) | 0.821 (+/- 0.006) | 0.856 (+/- 0.001) | 0.676 (+/- 0.036) | 0.825 (+/- 0.008) | 0.346 (+/- 0.008) | 0.436 (+/- 0.008) | 0.458 (+/- 0.012) | 0.571 (+/- 0.006) |
| RFE SVC | 22.107 (+/- 2.830) | 1.383 (+/- 0.286) | 0.821 (+/- 0.004) | 0.821 (+/- 0.001) | 0.691 (+/- 0.027) | 0.691 (+/- 0.004) | 0.327 (+/- 0.005) | 0.327 (+/- 0.002) | 0.444 (+/- 0.009) | 0.444 (+/- 0.002) |
| RFE Random Forest | 17.710 (+/- 1.254) | 0.047 (+/- 0.019) | 0.818 (+/- 0.004) | 0.853 (+/- 0.071) | 0.670 (+/- 0.030) | 0.752 (+/- 0.135) | 0.332 (+/- 0.014) | 0.446 (+/- 0.263) | 0.444 (+/- 0.008) | 0.547 (+/- 0.228) |
| RFE Stacking Model | 243.981 (+/- 6.527) | 0.045 (+/- 0.014) | 0.821 (+/- 0.005) | 0.823 (+/- 0.005) | 0.692 (+/- 0.027) | 0.703 (+/- 0.027) | 0.330 (+/- 0.009) | 0.332 (+/- 0.010) | 0.447 (+/- 0.013) | 0.451 (+/- 0.014) |
| Tuned SVC | 22.824 (+/- 0.946) | 3.374 (+/- 0.695) | 0.772 (+/- 0.010) | 0.780 (+/- 0.003) | 0.482 (+/- 0.020) | 0.498 (+/- 0.005) | 0.581 (+/- 0.016) | 0.605 (+/- 0.007) | 0.527 (+/- 0.018) | 0.546 (+/- 0.005) |
| Tuned Random Forest | 22.379 (+/- 1.276) | 0.105 (+/- 0.018) | 0.813 (+/- 0.006) | 0.999 (+/- 0.000) | 0.627 (+/- 0.028) | 0.999 (+/- 0.000) | 0.359 (+/- 0.016) | 0.999 (+/- 0.001) | 0.456 (+/- 0.015) | 0.999 (+/- 0.000) |


In [112]:
random_search_lr.fit(X_train, y_train)

In [113]:
lg_C = random_search_lr.best_params_["logisticregression__C"]
print("Logistic Regression C:", lg_C)
print("Logistic Regression Alpha:", 1/lg_C)
print("Class Weight:", random_search_lr.best_params_["logisticregression__class_weight"])

Logistic Regression C: 0.8845321047965241
Logistic Regression Alpha: 1.1305412144763676
Class Weight: balanced
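
Two details of the search are easy to verify by hand: `C` is the inverse of the regularization strength (so the "Alpha" printed above is simply `1 / C`), and a log-uniform prior spreads candidates evenly across orders of magnitude rather than across raw values. The sketch below uses a hand-rolled sampler (`sample_log_uniform` is a hypothetical helper written for this illustration, not scipy's implementation) to show both.

```python
import math
import random

best_C = 0.8845321047965241   # value printed by the search above
alpha = 1 / best_C            # smaller C means larger alpha, i.e. stronger regularization


def sample_log_uniform(low, high, rng):
    """Draw a value whose log10 is uniform between log10(low) and log10(high)."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


rng = random.Random(123)
samples = [sample_log_uniform(1e-3, 1e3, rng) for _ in range(1000)]

# About half of the draws fall below 1, even though [0.001, 1) is a tiny
# slice of the raw range [0.001, 1000]
share_below_one = sum(s < 1 for s in samples) / len(samples)
print(f"alpha={alpha:.4f}, share of samples below 1: {share_below_one:.2f}")
```

A plain uniform prior over the same range would almost never propose values of `C` below 1, which is why a log scale is the usual choice for this kind of hyperparameter.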


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 8. Different models <a name="8"></a>
rubric={accuracy,reasoning}

**Your tasks:**
1. Try out three other models aside from the linear model. 
2. Summarize your results in terms of overfitting/underfitting and fit and score times. Can you beat the performance of the linear model? 
    
</div>

_Points:_ 10

The three other models we added were an SVM classifier (SVC), a random forest classifier, and a stacked model containing logistic regression, LightGBM and XGBoost classifiers. In terms of fit time, the stacked model was the slowest (about 30 s), followed by SVC (about 16 s) and random forest (about 4 s). SVC had by far the longest score time (about 2.6 s), with random forest and the stacked model following suit. For this problem, recall is the most important metric since we want to reduce the number of false negatives. The random forest classifier was severely overfit, with a train recall of 0.998 and a test recall of 0.356 (this could be due to `max_depth` not being set). The stacked model was slightly overfit, with a train recall of 0.436 and a test recall of 0.346. Lastly, SVC reported a train recall of 0.352 and a test recall of 0.337. Compared to the linear model, both SVC and the stacked model outperformed logistic regression, which obtained 0.223 and 0.222 for the train and test recall; random forest also reached a higher test recall, but only while heavily overfitting.

In [46]:
pipe_svc = make_pipeline(preprocessor, SVC(random_state=123))
pipe_rf = make_pipeline(preprocessor, RandomForestClassifier(random_state=123))
pipe_lgbm = make_pipeline(preprocessor, LGBMClassifier(random_state=123))
pipe_xgb = make_pipeline(preprocessor, XGBClassifier(random_state=123))

classifiers = {
    "Logistic Regression": pipe_lr,
    "LightGBM": pipe_lgbm,
    "XGBoost": pipe_xgb
}

models = {
    "SVC": pipe_svc,
    "Random Forest": pipe_rf,
    "Stacking Model": StackingClassifier(list(classifiers.items()))
}

for model_name, model in models.items():
    cross_val_results[model_name] = mean_std_cross_val_scores(
        model, X_train, y_train, return_train_score=True,
        scoring=classification_metrics, n_jobs=-1
        )


In [48]:
pd.DataFrame(cross_val_results).T

|  | fit_time | score_time | test_accuracy | train_accuracy | test_precision | train_precision | test_recall | train_recall | test_f1 | train_f1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Dummy | 0.005 (+/- 0.000) | 0.007 (+/- 0.001) | 0.781 (+/- 0.000) | 0.781 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) | 0.000 (+/- 0.000) |
| Logistic Regression | 2.545 (+/- 0.373) | 0.014 (+/- 0.002) | 0.810 (+/- 0.002) | 0.811 (+/- 0.001) | 0.715 (+/- 0.024) | 0.717 (+/- 0.004) | 0.222 (+/- 0.011) | 0.223 (+/- 0.006) | 0.339 (+/- 0.012) | 0.340 (+/- 0.008) |
| SVC | 15.829 (+/- 0.560) | 2.579 (+/- 0.073) | 0.822 (+/- 0.006) | 0.827 (+/- 0.001) | 0.689 (+/- 0.032) | 0.710 (+/- 0.005) | 0.337 (+/- 0.015) | 0.352 (+/- 0.005) | 0.453 (+/- 0.018) | 0.471 (+/- 0.004) |
| Random Forest | 4.270 (+/- 0.595) | 0.081 (+/- 0.007) | 0.815 (+/- 0.006) | 1.000 (+/- 0.000) | 0.640 (+/- 0.026) | 0.999 (+/- 0.000) | 0.356 (+/- 0.019) | 0.998 (+/- 0.001) | 0.458 (+/- 0.019) | 0.999 (+/- 0.000) |
| Stacking Model | 29.997 (+/- 3.518) | 0.051 (+/- 0.002) | 0.821 (+/- 0.006) | 0.856 (+/- 0.001) | 0.676 (+/- 0.036) | 0.825 (+/- 0.008) | 0.346 (+/- 0.008) | 0.436 (+/- 0.008) | 0.458 (+/- 0.012) | 0.571 (+/- 0.006) |
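
One simple way to compare overfitting across models is the gap between train recall and cross-validation recall. The sketch below copies the mean recall values from the table above into a plain dictionary (`recall_scores` is a name chosen for this illustration) and ranks the models by that gap.

```python
# (train_recall, test_recall) means taken from the results table above
recall_scores = {
    "Logistic Regression": (0.223, 0.222),
    "SVC": (0.352, 0.337),
    "Random Forest": (0.998, 0.356),
    "Stacking Model": (0.436, 0.346),
}

# A large positive gap means the model does much better on data it has seen
recall_gaps = {
    name: round(train - cv, 3) for name, (train, cv) in recall_scores.items()
}
most_overfit = max(recall_gaps, key=recall_gaps.get)

for name, gap in sorted(recall_gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} gap = {gap:.3f}")
```

The gap says nothing about absolute performance, so it is best read alongside the cross-validation recall itself.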


In [49]:
pipe_rf.fit(X_train, y_train)
# n_features_in_ replaces the deprecated n_features_ attribute in recent scikit-learn
pipe_rf.named_steps['randomforestclassifier'].n_features_in_

{'columntransformer': ColumnTransformer(transformers=[('onehotencoder-1', OneHotEncoder(),
                                  ['marriage']),
                                 ('onehotencoder-2',
                                  OneHotEncoder(drop='if_binary'), ['sex']),
                                 ('standardscaler', StandardScaler(),
                                  ['limit_bal', 'age', 'bill_amt1', 'bill_amt2',
                                   'bill_amt3', 'bill_amt4', 'bill_amt5',
                                   'bill_amt6', 'pay_amt1', 'pay_amt2',
                                   'pay_amt3', 'pay_amt4', 'pay_amt5',
                                   'pay_amt6']),
                                 ('passthrough', 'passthrough',
                                  ['pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5',
                                   'pay_6', 'education'])]),
 'randomforestclassifier': RandomForestClassifier(random_state=123)}

In [51]:
X_train.columns.size

26

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-warning">

## 9. Feature selection (Challenging)
rubric={reasoning}

**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV`, forward selection or L1 regularization for this. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises unless you think there are other benefits to using fewer features.
    
</div>

_Points:_ 0.5

The feature selection we performed did not improve the CV scores for recall. We attempted `RFECV` with logistic regression and then piped the selected features into all of our models. Only two features were selected by `RFECV`: ____ and ____.
We debated using the models with only two features. All of those models had smaller differences between cross-validation scores and training scores, leading us to think that they were overfitting less. Ultimately, though, we chose the larger models with the higher scores.
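
For reference, here is a minimal sketch of how the features kept by `RFECV` can be listed by name. It runs on synthetic data with hypothetical feature names (`f0` to `f7`), so it does not reproduce the selection made in this lab; in the lab pipeline the names would presumably come from the fitted preprocessor (for example via `get_feature_names_out()`, if the installed scikit-learn version supports it).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, only 2 of which carry signal (assumption for illustration)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=2, n_redundant=0, random_state=123
)
feature_names = np.array([f"f{i}" for i in range(X_syn.shape[1])])

# Recursive feature elimination with cross-validation to pick the number of features
selector = RFECV(LogisticRegression(max_iter=1000), cv=5)
selector.fit(X_syn, y_syn)

# support_ is a boolean mask over the input columns
selected_names = feature_names[selector.support_]
print(selector.n_features_, list(selected_names))
```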

In [58]:
rfe = RFECV(LogisticRegression(max_iter=1000))

pipe_rfe_lr = make_pipeline(preprocessor, rfe, LogisticRegression(random_state=123, max_iter=1000))
pipe_rfe_svc = make_pipeline(preprocessor, rfe, SVC(random_state=123))
pipe_rfe_rf = make_pipeline(preprocessor, rfe, RandomForestClassifier(random_state=123))
pipe_rfe_lgbm = make_pipeline(preprocessor, rfe, LGBMClassifier(random_state=123))
pipe_rfe_xgb = make_pipeline(preprocessor, rfe, XGBClassifier(random_state=123))

classifiers_rfe = {
    "Logistic Regression": pipe_rfe_lr,
    "LightGBM": pipe_rfe_lgbm,
    "XGBoost": pipe_rfe_xgb
}

models_rfe = {
    "RFE SVC": pipe_rfe_svc,
    "RFE Random Forest": pipe_rfe_rf,
    "RFE Stacking Model": StackingClassifier(list(classifiers_rfe.items()))
}

for model_name, model in models_rfe.items():
    cross_val_results[model_name] = mean_std_cross_val_scores(
        model, X_train, y_train, return_train_score=True,
        scoring=classification_metrics, n_jobs=-1
        )

In [70]:
pd.DataFrame(cross_val_results).T

model,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall,test_f1,train_f1
Dummy,0.005 (+/- 0.000),0.007 (+/- 0.001),0.781 (+/- 0.000),0.781 (+/- 0.000),0.000 (+/- 0.000),0.000 (+/- 0.000),0.000 (+/- 0.000),0.000 (+/- 0.000),0.000 (+/- 0.000),0.000 (+/- 0.000)
Logistic Regression,2.545 (+/- 0.373),0.014 (+/- 0.002),0.810 (+/- 0.002),0.811 (+/- 0.001),0.715 (+/- 0.024),0.717 (+/- 0.004),0.222 (+/- 0.011),0.223 (+/- 0.006),0.339 (+/- 0.012),0.340 (+/- 0.008)
SVC,15.829 (+/- 0.560),2.579 (+/- 0.073),0.822 (+/- 0.006),0.827 (+/- 0.001),0.689 (+/- 0.032),0.710 (+/- 0.005),0.337 (+/- 0.015),0.352 (+/- 0.005),0.453 (+/- 0.018),0.471 (+/- 0.004)
Random Forest,4.270 (+/- 0.595),0.081 (+/- 0.007),0.815 (+/- 0.006),1.000 (+/- 0.000),0.640 (+/- 0.026),0.999 (+/- 0.000),0.356 (+/- 0.019),0.998 (+/- 0.001),0.458 (+/- 0.019),0.999 (+/- 0.000)
Stacking Model,29.997 (+/- 3.518),0.051 (+/- 0.002),0.821 (+/- 0.006),0.856 (+/- 0.001),0.676 (+/- 0.036),0.825 (+/- 0.008),0.346 (+/- 0.008),0.436 (+/- 0.008),0.458 (+/- 0.012),0.571 (+/- 0.006)
RFE SVC,22.107 (+/- 2.830),1.383 (+/- 0.286),0.821 (+/- 0.004),0.821 (+/- 0.001),0.691 (+/- 0.027),0.691 (+/- 0.004),0.327 (+/- 0.005),0.327 (+/- 0.002),0.444 (+/- 0.009),0.444 (+/- 0.002)
RFE Random Forest,17.710 (+/- 1.254),0.047 (+/- 0.019),0.818 (+/- 0.004),0.853 (+/- 0.071),0.670 (+/- 0.030),0.752 (+/- 0.135),0.332 (+/- 0.014),0.446 (+/- 0.263),0.444 (+/- 0.008),0.547 (+/- 0.228)
RFE Stacking Model,243.981 (+/- 6.527),0.045 (+/- 0.014),0.821 (+/- 0.005),0.823 (+/- 0.005),0.692 (+/- 0.027),0.703 (+/- 0.027),0.330 (+/- 0.009),0.332 (+/- 0.010),0.447 (+/- 0.013),0.451 (+/- 0.014)
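
The overfitting comparison can be read off this table by subtracting the mean test recall from the mean train recall. The sketch below does that arithmetic for a few rows, with the values copied by hand from the table (standard deviations are ignored, which is a simplification; note that the RFE Random Forest train recall has a large spread of 0.263). The dictionary names are hypothetical.

```python
# Mean recall values copied from the cross-validation table above.
recall_scores = {
    "SVC": {"train": 0.352, "test": 0.337},
    "RFE SVC": {"train": 0.327, "test": 0.327},
    "Random Forest": {"train": 0.998, "test": 0.356},
    "RFE Random Forest": {"train": 0.446, "test": 0.332},
    "Stacking Model": {"train": 0.436, "test": 0.346},
    "RFE Stacking Model": {"train": 0.332, "test": 0.330},
}

# A smaller train-test gap suggests less overfitting.
recall_gaps = {
    name: round(scores["train"] - scores["test"], 3)
    for name, scores in recall_scores.items()
}

for name, gap in recall_gaps.items():
    print(f"{name}: train-test recall gap = {gap:.3f}")
```

In each pair the RFE variant has the smaller gap, but also a slightly lower mean test recall.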


In [69]:
pipe_rfe_svc.fit(X_train, y_train)

In [72]:
pipe_rfe_lr.fit(X_train, y_train)
print(pipe_rfe_lr[:-1].get_feature_names_out())
print(pipe_rfe_lr[:-1].get_feature_names_out().size)

# RFE selected the same 2 features for every model

----------
['standardscaler__pay_amt2' 'passthrough__pay_0']
2
----------
['standardscaler__pay_amt2' 'passthrough__pay_0']
2
----------
['standardscaler__pay_amt2' 'passthrough__pay_0']
2
----------
['standardscaler__pay_amt2' 'passthrough__pay_0']
2
----------
['standardscaler__pay_amt2' 'passthrough__pay_0']
2
----------


KeyboardInterrupt: 

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 10. Hyperparameter optimization
rubric={accuracy,reasoning}

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize) 
    
</div>

_Points:_ 6

_Type your answer here, replacing this text._

In [96]:
params = [
    {
        "svc__class_weight": [None, "balanced"],
        "svc__gamma": loguniform(1e-3, 1e3),
        "svc__C": loguniform(1e-3, 1e3)
    },
    {
        "logisticregression__class_weight": [None, "balanced"],
        "logisticregression__C": loguniform(1e-3, 1e3),
    },
    {
        "xgbclassifier__gamma": loguniform(1e-3, 1e3)
    },
    {
        "lgbmclassifier__class_weight": [None, "balanced"],
        "lgbmclassifier__max_depth": np.arange(10, 100, 10)
    },
    {
        "randomforestclassifier__max_features": ["sqrt", "log2", None],
        "randomforestclassifier__max_depth": np.arange(10, 100)
    }
]

classifiers_tuning = {
    "SVC": pipe_svc,
    "Logistic Regression": pipe_lr,
    "XGBoost": pipe_xgb,
    "LightGBM": pipe_lgbm,
    "Random Forest": pipe_rf
}

optim_models = {}

for i, model_name in enumerate(classifiers_tuning):
    print(model_name)
    param_grid = params[i]
    model = classifiers_tuning[model_name]
    random_search = RandomizedSearchCV(
        model, param_grid, n_iter=10, n_jobs=-1, random_state=123,
        scoring="recall", return_train_score=True
    )
    random_search.fit(X_train, y_train)
    optim_models[model_name] = random_search.best_estimator_
    print(random_search.best_params_)

SVC
{'svc__C': 0.11456925707187304, 'svc__class_weight': 'balanced', 'svc__gamma': 0.08808568992665847}
Logistic Regression
{'logisticregression__C': 0.8845321047965241, 'logisticregression__class_weight': 'balanced'}
XGBoost
{'xgbclassifier__gamma': 0.022967235384741526}
LightGBM
{'lgbmclassifier__max_depth': 30, 'lgbmclassifier__class_weight': 'balanced'}
Random Forest
{'randomforestclassifier__max_features': None, 'randomforestclassifier__max_depth': 89}
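
The loop above prints only `best_params_`. To summarize a search in more detail, the `cv_results_` attribute of a fitted `RandomizedSearchCV` can be turned into a table. The following is a minimal sketch on synthetic data (an assumption; it is not the lab data), with hypothetical names `pipe_demo`, `search_demo` and `cv_summary`.

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: NOT the credit default dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=123
)

pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_dist = {
    "logisticregression__C": loguniform(1e-3, 1e3),
    "logisticregression__class_weight": [None, "balanced"],
}

search_demo = RandomizedSearchCV(
    pipe_demo, param_dist, n_iter=5, cv=3, scoring="recall",
    return_train_score=True, random_state=123,
)
search_demo.fit(X_demo, y_demo)

# One row per sampled candidate, best candidate first.
cv_summary = pd.DataFrame(search_demo.cv_results_)[
    [
        "param_logisticregression__C",
        "param_logisticregression__class_weight",
        "mean_test_score",
        "mean_train_score",
        "rank_test_score",
    ]
].sort_values("rank_test_score")
print(cv_summary)
```

Comparing `mean_train_score` with `mean_test_score` per candidate shows whether a high-scoring setting is also overfitting.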


In [98]:
tuned_classifiers = {
    "Logistic Regression": optim_models["Logistic Regression"],
    "LightGBM": optim_models["LightGBM"],
    "XGBoost": optim_models["XGBoost"]
}

tuned_models = {
    "Tuned SVC": optim_models["SVC"],
    "Tuned Random Forest": optim_models["Random Forest"],
    "Tuned Stacking Model": StackingClassifier(list(tuned_classifiers.items()))
}

for model_name, model in tuned_models.items():
    cross_val_results[model_name] = mean_std_cross_val_scores(
        model, X_train, y_train, return_train_score=True,
        scoring=classification_metrics, n_jobs=-1
        )

In [110]:
pd.DataFrame(cross_val_results).T

model,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall,test_f1,train_f1
Dummy,0.005 (+/- 0.000),0.007 (+/- 0.001),0.781 (+/- 0.000),0.781 (+/- 0.000),0.000 (+/- 0.000),0.000 (+/- 0.000),0.000 (+/- 0.000),0.000 (+/- 0.000),0.000 (+/- 0.000),0.000 (+/- 0.000)
Logistic Regression,0.439 (+/- 0.057),0.019 (+/- 0.004),0.810 (+/- 0.003),0.811 (+/- 0.001),0.714 (+/- 0.024),0.716 (+/- 0.005),0.223 (+/- 0.010),0.224 (+/- 0.007),0.339 (+/- 0.012),0.341 (+/- 0.008)
SVC,15.829 (+/- 0.560),2.579 (+/- 0.073),0.822 (+/- 0.006),0.827 (+/- 0.001),0.689 (+/- 0.032),0.710 (+/- 0.005),0.337 (+/- 0.015),0.352 (+/- 0.005),0.453 (+/- 0.018),0.471 (+/- 0.004)
Random Forest,4.270 (+/- 0.595),0.081 (+/- 0.007),0.815 (+/- 0.006),1.000 (+/- 0.000),0.640 (+/- 0.026),0.999 (+/- 0.000),0.356 (+/- 0.019),0.998 (+/- 0.001),0.458 (+/- 0.019),0.999 (+/- 0.000)
Stacking Model,29.997 (+/- 3.518),0.051 (+/- 0.002),0.821 (+/- 0.006),0.856 (+/- 0.001),0.676 (+/- 0.036),0.825 (+/- 0.008),0.346 (+/- 0.008),0.436 (+/- 0.008),0.458 (+/- 0.012),0.571 (+/- 0.006)
RFE SVC,22.107 (+/- 2.830),1.383 (+/- 0.286),0.821 (+/- 0.004),0.821 (+/- 0.001),0.691 (+/- 0.027),0.691 (+/- 0.004),0.327 (+/- 0.005),0.327 (+/- 0.002),0.444 (+/- 0.009),0.444 (+/- 0.002)
RFE Random Forest,17.710 (+/- 1.254),0.047 (+/- 0.019),0.818 (+/- 0.004),0.853 (+/- 0.071),0.670 (+/- 0.030),0.752 (+/- 0.135),0.332 (+/- 0.014),0.446 (+/- 0.263),0.444 (+/- 0.008),0.547 (+/- 0.228)
RFE Stacking Model,243.981 (+/- 6.527),0.045 (+/- 0.014),0.821 (+/- 0.005),0.823 (+/- 0.005),0.692 (+/- 0.027),0.703 (+/- 0.027),0.330 (+/- 0.009),0.332 (+/- 0.010),0.447 (+/- 0.013),0.451 (+/- 0.014)
Tuned SVC,22.824 (+/- 0.946),3.374 (+/- 0.695),0.772 (+/- 0.010),0.780 (+/- 0.003),0.482 (+/- 0.020),0.498 (+/- 0.005),0.581 (+/- 0.016),0.605 (+/- 0.007),0.527 (+/- 0.018),0.546 (+/- 0.005)
Tuned Random Forest,22.379 (+/- 1.276),0.105 (+/- 0.018),0.813 (+/- 0.006),0.999 (+/- 0.000),0.627 (+/- 0.028),0.999 (+/- 0.000),0.359 (+/- 0.016),0.999 (+/- 0.001),0.456 (+/- 0.015),0.999 (+/- 0.000)


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 11. Interpretation and feature importances <a name="1"></a>
rubric={accuracy,reasoning}

**Your tasks:**

1. Use the methods we saw in class (e.g., `eli5`, `shap`) (or any other methods of your choice) to examine the most important features of one of the non-linear models. 
2. Summarize your observations. 
    
</div>

_Points:_ 8

For this section, we inspect our `RFC` (Random Forest Classifier) model. We extract the feature importances using the `eli5` method (explain like I'm five), which gives us a table of our features, sorted by importance.

We can see that `pay_0` is by far the most important feature for classifying whether someone will default on next month's credit card bill. This makes intuitive sense: if someone missed their previous payment, they are often in financial difficulty that is not resolved within a month, which leads them to miss the next month's bill too.

The next few features are weaker than `pay_0`, but we can offer plausible explanations for them. Note that these importances only rank features; they do not show whether a feature raises or lowers the predicted risk, so the directions below are our hypotheses rather than findings.
- `age` plays a major role in someone's financial status. As you become older, you often get the opportunity to set money aside and build up savings. If you encounter a bad financial month, you can rely on your savings to cover your credit card bill.
- `bill_amt1` is the amount of the bill in September. A higher bill plausibly increases the probability of someone defaulting on their payment.
- `limit_bal` is the amount of credit given. A lower credit limit plausibly goes with a higher probability of defaulting on a payment.

Lastly, `education` appears to have little effect on classifying whether someone will default on their payment. Of course, this is not an inference about the population; it is limited to the scope of this dataset.
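
To illustrate why the importances alone say nothing about direction, here is a small sketch on synthetic data (an assumption; it is not the lab data). It shows that a random forest's impurity-based importances are non-negative and sum to 1, and then does a crude direction check by shifting the top feature upward and watching the mean predicted probability. All names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: NOT the credit default dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=123
)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=123).fit(X_demo, y_demo)

# Impurity-based importances: all >= 0 and normalized to sum to 1,
# so they can rank features but cannot carry a sign.
importances = rf_demo.feature_importances_
print("Importances:", importances.round(3), "sum =", importances.sum().round(3))

# Crude direction check: shift the most important feature up by one
# standard deviation and compare the mean predicted probability.
top = int(np.argmax(importances))
X_shifted = X_demo.copy()
X_shifted[:, top] += X_demo[:, top].std()
delta = (
    rf_demo.predict_proba(X_shifted)[:, 1].mean()
    - rf_demo.predict_proba(X_demo)[:, 1].mean()
)
print(f"Change in mean P(class 1) after shifting feature {top}: {delta:+.3f}")
```

The sign of `delta` gives a rough direction for that one feature; methods such as SHAP give a per-example version of the same idea.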

In [109]:
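# Take the feature names from the same fitted pipeline that holds the tuned random forest
explan = eli5.explain_weights(
    optim_models["Random Forest"].named_steps["randomforestclassifier"],
    feature_names=optim_models["Random Forest"][:-1].get_feature_names_out().tolist(),
)
eli5.format_as_dataframe(explan)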

Unnamed: 0,feature,weight,std
0,passthrough__pay_0,0.164015,0.01031
1,standardscaler__age,0.076076,0.006082
2,standardscaler__bill_amt1,0.067208,0.007482
3,standardscaler__limit_bal,0.061873,0.005026
4,standardscaler__pay_amt3,0.053391,0.006971
5,standardscaler__pay_amt2,0.051012,0.006712
6,standardscaler__pay_amt6,0.049022,0.005599
7,standardscaler__pay_amt1,0.048685,0.005492
8,standardscaler__bill_amt6,0.045723,0.004961
9,standardscaler__pay_amt5,0.045661,0.005319


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 12. Results on the test set <a name="12"></a>
rubric={accuracy,reasoning}

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 
3. Take one or two test predictions and explain them with SHAP force plots.  
    
</div>

_Points:_ 6

- The best-performing model is the tuned logistic regression. We evaluate it on the test set, where it reaches a recall of about 0.66 and a precision of about 0.39, and we explain its predictions for two test examples.
- In the SHAP force plot for the `no default` example, the output value is strongly negative, and `passthrough__pay_0` makes the largest negative contribution (it is the most important feature for this prediction). This means the model predicts `no default` with high probability. In the force plot for the `default` example, the output value is strongly positive, and `passthrough__pay_0` makes the largest positive contribution. This means the model predicts `default` with high probability. The `predict_proba` output printed after each force plot confirms both outcomes.
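
The link between a force plot and `predict_proba` can be reproduced by hand for a linear model. The sketch below does not call `shap`; it assumes independent features, in which case the SHAP value of feature i for a linear model is its coefficient times the feature's deviation from the background mean. The data is synthetic (an assumption; it is not the lab data) and all names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the credit default dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=123
)
lr_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Per-feature contribution under the independence assumption:
# coef_i * (x_i - mean_i), measured in log-odds units.
background_mean = X_demo.mean(axis=0)
x = X_demo[0]
contributions = lr_demo.coef_[0] * (x - background_mean)

# Base value: the raw model output at the background mean, which for a
# linear model equals the mean raw output over the background data.
base_value = lr_demo.decision_function(background_mean.reshape(1, -1))[0]

# Base value plus contributions gives the raw output (log-odds);
# the sigmoid turns it into the probability of the positive class.
raw_output = base_value + contributions.sum()
probability = 1 / (1 + np.exp(-raw_output))

print("Contributions (log-odds):", contributions.round(3))
print(f"Raw output: {raw_output:.3f}, P(class 1): {probability:.3f}")
```

A strongly negative raw output therefore corresponds to a probability near 0 for the positive class, and a strongly positive one to a probability near 1.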

In [120]:
print("Recall:", recall_score(y_test, random_search_lr.predict(X_test)))
print("Precision:", precision_score(y_test, random_search_lr.predict(X_test)))

Recall: 0.6580135440180587
Precision: 0.3854120758043191


In [None]:
shap.initjs()

best_model_lr = random_search_lr.best_estimator_.fit(X_train, y_train)
feature_names = pipe_lr[:-1].get_feature_names_out()

# transformed features on train data
X_train_enc = pd.DataFrame(
    data=preprocessor.transform(X_train),
    columns=feature_names,
    index=X_train.index,
)

# transformed features on test data
X_test_enc = pd.DataFrame(
    data=preprocessor.transform(X_test),
    columns=feature_names,
    index=X_test.index,
)
X_test_enc = round(X_test_enc, 3) 

# SHAP explainer with the training data as background; SHAP values for the train and test sets
lr_explainer = shap.LinearExplainer(best_model_lr.named_steps['logisticregression'], X_train_enc)
train_lr_shap_values = lr_explainer.shap_values(X_train_enc)
test_lr_shap_values = lr_explainer.shap_values(X_test_enc)


In [None]:
# reset the index so that positions line up with the rows of X_test_enc
y_test_reset = y_test.reset_index(drop=True)

default_ind = y_test_reset[y_test_reset == 1].index.tolist()
no_default_ind = y_test_reset[y_test_reset == 0].index.tolist()

# pick one test example from each class
ex_default_index = default_ind[1200]            # example predicted as default with high probability
ex_no_default_index = no_default_ind[1212]      # example predicted as no default with high probability

# SHAP force plot for no default test prediction
shap.force_plot(
    lr_explainer.expected_value,
    test_lr_shap_values[ex_no_default_index, :],
    X_test_enc.iloc[ex_no_default_index, :],
    matplotlib=True,
)
# compare with model prediction
no_default_prob = best_model_lr.predict_proba(X_test)[ex_no_default_index] 
print('No default prediction probability', no_default_prob)     # prediction is right, no default

# SHAP force plot for default test prediction
shap.force_plot(
    lr_explainer.expected_value,
    test_lr_shap_values[ex_default_index, :],
    X_test_enc.iloc[ex_default_index, :],
    matplotlib=True,
)
# compare with model prediction
default_prob = best_model_lr.predict_proba(X_test)[ex_default_index] 
# compare SHAP force plot with predict proba
print('Default prediction probability', default_prob)           # prediction is right, default


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 13. Summary of results <a name="13"></a>
rubric={reasoning}

Imagine that you want to present the summary of these results to your boss and co-workers. 

**Your tasks:**

1. Create a table summarizing important results. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability.
4. Report your final test score along with the metric you used at the top of this notebook.
    
</div>

_Points:_ 8

To predict whether a customer will default on their payment next month, we cross-validated and tuned a variety of models to find the best performer. We chose recall as our metric because, in this context, we want to identify as many customers as possible who may need an intervention, and we are less concerned about accidentally reaching out to customers who would not in fact default on their next payment.

Unfortunately, despite trying a wide variety of models and conducting hyperparameter optimization, we were only able to achieve a test recall of about 0.66, with a precision of about 0.39. The features we had access to might not be the most informative for this prediction problem, or the relationships may be hard to capture with the models we used. To improve performance, we could tune our models using a different metric, collect more training data, or consult domain experts to guide feature engineering and selection so that we extract more relevant information.
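
To put the recall and precision trade-off in concrete terms, the short sketch below combines the test recall and precision printed in section 12 into an F1 score and into the number of customers flagged per correctly identified defaulter. The variable names are hypothetical; the two input values are copied from the earlier output.

```python
# Test scores copied from the output in section 12.
test_recall = 0.6580135440180587
test_precision = 0.3854120758043191

# F1 is the harmonic mean of precision and recall.
test_f1 = 2 * test_precision * test_recall / (test_precision + test_recall)

# 1 / precision = customers flagged for every true defaulter among them.
flagged_per_true_default = 1 / test_precision

print(f"Test F1: {test_f1:.3f}")
print(f"Customers flagged per true defaulter: {flagged_per_true_default:.2f}")
```

At this operating point, roughly 2.6 customers are contacted for every one who would actually default, which is the cost we accept in exchange for higher recall.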

In [None]:
# Keep recall, precision, and F1, with recall first since it is our main metric
summary_columns = [
    "test_recall", "train_recall",
    "test_precision", "train_precision",
    "test_f1", "train_f1",
]
results_summary_df = (
    pd.DataFrame(cross_val_results)
    .T[summary_columns]
    .sort_values(by="test_recall", ascending=False)
)
results_summary_df

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-warning">

## 14. Creating a data analysis pipeline (Challenging)
rubric={reasoning}

**Your tasks:**
- In 522 you learned how to build a reproducible data analysis pipeline. Convert this notebook into scripts and create a reproducible data analysis pipeline with appropriate documentation. Submit your project folder in addition to this notebook on GitHub and briefly comment on your organization in the text box below.
    
</div>

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-warning">

## 15. Your takeaway from the course (Challenging)
rubric={reasoning}

**Your tasks:**

What is your biggest takeaway from this course? 
    
</div>

_Points:_ 0.25

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<div class="alert alert-danger" style="color:black">
    
**Restart, run all and export a PDF before submitting**
    
Before submitting,
don't forget to run all cells in your notebook
to make sure there are no errors
and so that the TAs can see your plots on Gradescope.
You can do this by clicking the ▶▶ button
or going to `Kernel -> Restart Kernel and Run All Cells...` in the menu.
This is not only important for MDS,
but a good habit you should get into before ever committing a notebook to GitHub,
so that your collaborators can run it from top to bottom
without issues.
    
After running all the cells,
export a PDF of the notebook (preferably the WebPDF export)
and upload this PDF together with the ipynb file to Gradescope
(you can select two files when uploading to Gradescope)
</div>

---

## Help us improve the labs

The MDS program is continually looking to improve our courses, including lab questions and content. The following optional questions will not affect your grade in any way nor will they be used for anything other than program improvement:

1. Approximately how many hours did you spend working or thinking about this assignment (including lab time)?

#Ans:

2. Do you have any feedback on the lab that you would be willing to share? For example, any part or question that you particularly liked or disliked?

#Ans: