# Pipelines (preprocessing to estimator), CV, Gridsearch

- [ ] how to do: preprocessing (with `ColumnTransformer`)
- [ ] how to examine: preprecessing
- [ ] how to do: pipelines + cross validation + scoring
    - `make_pipeline()` with preprocessing and estimator
    - examine pipeline elements 
    - fit and predict using the pipeline (not CV)
    - examine the pipeline using CV (using the cv function and examining its output)
- [ ] scoring vocab: recall/sensitivity, precision, specificity, accuracy, 
- [ ] how to do: optimizing a pipeline by "tuning the hyperparameters"
    - hyperparameters are the parameters of functions/estimators in the steps in your pipeline
    - set up and use `gridsearchCV`
    - examine output of `gridsearchCV`
    

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline 
from sklearn.impute import SimpleImputer
from df_after_transform import df_after_transform
from sklearn.model_selection import KFold, cross_validate, GridSearchCV

from sklearn import set_config
set_config(display="diagram")  # display='text' is the default

pd.set_option('display.max_colwidth', 1000, 'display.max_rows', 50, 'display.max_columns', None) 

## Load data 

## Create the training and holdout samples

Split your data into test and train. Your options:
- [sklearn has some built in splitters](https://scikit-learn.org/stable/modules/cross_validation.html)
    - These are rarely the best optios for real world data. Prediction is often about the future!
- [`test_train_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) can do basic splits, but may not be appropriate
    - _test_ is the typical sklearn vernacular, on the website I call this this holdout sample
- You can just keep the most recent time period of your samples in the holdout and put the rest into the training data.

In [5]:
loans = pd.read_csv('inputs/2013_subsample.zip')

# split to test and train (what we call the "test" subset here is the "holdout" data)

# first let's separate y from X (as is typically done)
y = loans.loan_status == 'Charged Off'
print(y.value_counts()) # rougly 16% ish 
loans = loans.drop('loan_status',axis=1)

# stratify will make sure that test/train both have equal fractions of outcome
# stratify means that in test and train, the fraction of charge offs is equal
# test_size means 20% of data is in the holdout
# random_state is a seed meaning we all have the exact same split (reproducible!)
X_train, X_test, y_train, y_test = train_test_split(loans, y, stratify=y, test_size=.2, random_state=0)

# I would save X_train and y_train in a folder
# I would save X_test and y_test in a holdout folder

del y_test # paranoid step - ensures we can't look at the holdout data too soon
del X_test
del loans  # not paranoid! easy to use this by accident 

loan_status
False    113780
True      21024
Name: count, dtype: int64


## EDA

On the **TRAINING DATA ONLY**: 
- do lots of EDA
- look for missing values, which variables are what type, and outliers 
- figure out how you'd clean the data (imputation, scaling, encoding categorical vars)
- these lessons will go into the preprocessinG portion of your pipeline 
- PRO TIP: `pandas-profiling` and `dabl` build automated reports. Which is nice, but remember you need to examine them closely! **There is no shortcut for EDA.**

In [None]:
# from pandas_profiling import ProfileReport # now use ydata-profiling
# 
# profile = ProfileReport(pd.concat([y_train, X_train], axis=1), 
#                         title='Lending Club Profiling Report',
#                         html={'style':{'full_width':True}}) 
# profile.to_file("inputs/lending_club_EDA_training.html") # can take a minute or two with this dataset size. Let's look at the one I uploaded...


In [27]:
# X_train.isna().mean()
# X_train['purpose'].value_counts() 
# X_train['emp_title']
# X_train[X_train['emp_title'].isna()]['purpose'].value_counts()
X_train['issue_d'].str[-4:].value_counts()
X_train['term'].value_counts()

term
 36 months    80314
 60 months    27529
Name: count, dtype: int64

In [28]:
X_train.shape

(107843, 32)

EDA findings
- id is the unit identifier AND it is unique (no duplicate rows) 
- date: 2013 for all loans
- unit: a loan
- 32 vars, 107k loans 
- variables: - missing, categorical, info about continuous vars, outliers
   - 5% of cells are missing
   - member ID always missing, desc 65%, emp length and title halfish, 2 vars with a few missing
   - application_type is meaningless (always individ)
   - outliers in income: 6k to 6m
   - some are unemployed maybe? what does missing emp_title mean?
   - loan amounts: 14k mean, 35k max, min 1k  (corr with installment)
   - 36 or 60 months 

## NOW LET'S LEARN EACH PART OF "Optimize a series of models" FROM THE TEMPLATE

### Steps 1 and 2: Preprocessing


In [47]:
# set up pipeline to clean each type of variable (1 pipe per var type)

numer_pipe = make_pipeline(SimpleImputer(strategy='mean'),
                           StandardScaler()) 

cat_pipe   = make_pipeline(OneHotEncoder(drop='first'))
# drop first for reg like estimators, keep for others , eg,  trees 

# combine those pipes into "preprocess" pipe

preproc_pipe = ColumnTransformer(  
    [ 
        # a list of tuples 
        # tuple: name for step, which pipe for this step, which vars 
        ("num_impute", numer_pipe, ['annual_inc','dti, fico_range_high']),
        ("cat_trans", cat_pipe, ['grade'])
    ]
    , 
    remainder = 'drop' 
)

numer_pipe # prints out dialog, click the pipeline to show full steps 
cat_pipe
preproc_pipe

_Note: You can have multiple "numeric pipes" or "cat pipes" if you think some number variables should receive different treatments, which is very common!_

In [33]:
###########
# hot tip: check out what this preprocessing does before you continue!
###########

from df_after_transform import df_after_transform

preproc_df = df_after_transform(preproc_pipe,X_train)
print(f'There are {preproc_df.shape[1]} columns in the preprocessed data.')
preproc_df.describe().T.round(2)

There are 7 columns in the preprocessed data.


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
annual_inc,107843.0,0.0,1.0,-1.36,-0.55,-0.19,0.32,121.55
grade_B,107843.0,0.33,0.47,0.0,0.0,0.0,1.0,1.0
grade_C,107843.0,0.28,0.45,0.0,0.0,0.0,1.0,1.0
grade_D,107843.0,0.15,0.36,0.0,0.0,0.0,0.0,1.0
grade_E,107843.0,0.07,0.25,0.0,0.0,0.0,0.0,1.0
grade_F,107843.0,0.03,0.18,0.0,0.0,0.0,0.0,1.0
grade_G,107843.0,0.01,0.08,0.0,0.0,0.0,0.0,1.0


In [34]:
preproc_df

Unnamed: 0,annual_inc,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G
0,-0.569106,0.0,1.0,0.0,0.0,0.0,0.0
1,-0.952298,0.0,1.0,0.0,0.0,0.0,0.0
2,-0.024571,0.0,0.0,1.0,0.0,0.0,0.0
3,-0.831290,0.0,1.0,0.0,0.0,0.0,0.0
4,0.237613,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...
107838,-0.468266,0.0,0.0,1.0,0.0,0.0,0.0
107839,0.035933,0.0,1.0,0.0,0.0,0.0,0.0
107840,-0.750618,0.0,0.0,1.0,0.0,0.0,0.0
107841,0.580468,1.0,0.0,0.0,0.0,0.0,0.0


### EXERCISES

1. How many observations for `annual_inc` were non-missing before the processing? Does our imputation choice here we make a small or large impact?
1. How many values of `grade` was there in the data before and after `preproc_pipe`? 
1. Above, revise `preproc_pipe` to include another continuous variable.

In [45]:
# answers here
X_train['annual_inc'].count(), len(X_train), X_train['annual_inc'].isna().sum()
len(X_train['grade'].value_counts())

7

Answers:
1. Zero. Doesn't matter for THIS variable in THIS data
1. Seven before - 6 after
1. dti, fico_range_high

### Prof demo: Fitting and using ONE model

Warning: This is not best practice to do on the whole training sample. The point here is to simply show you how to estimate and use a model.

_(The only time you fit a model on the whole training sample is the VERY end of the template, right before you check to see how it does on the holdout.)_

Steps: https://ledatascifi.github.io/ledatascifi-2022/content/05/04c_onemodel.html

To fit the model: `<model>.fit(X)`

In [None]:
# create ("instantiate") the estimator class with some hyperparameters,
# note: the hypermeters are whatever is inside the "()"
# assign this instance of the estimator to an object
logit = LogisticRegression()

# fit the model to the data: <model>.fit(X,y)
# note: I'm only using annual income here for illustration
logit.fit(X_train[['annual_inc']], 
          y_train) 

To use the model: `<model>.predict(X)`

In [None]:
# this creates predicted values 
y_pred = logit.predict(X_train[['annual_inc']],)

print(f'''
% predicted as charge offs: {round(100*(y_pred == 1).mean(),2)}
Accuracy:                   {round(100*(y_pred == y_train).mean(),2)}
''')

### EXERCISES

- Q4: Let's estimate a different logit model, and see if the accuracy or the number of predictions of charge offs changes. [This time change the penalty OR the value of "C".](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)

In [None]:
# answers here

### Steps 3, 4, and 5: Pipelines, CV, and model evaluation (scoring)

Making a pipeline is easy: `make_pipeline` will put steps together for you:

In [None]:
logit_pipe = make_pipeline(preproc_pipe, LogisticRegression())

A pipeline is an object that stores its steps (and steps within steps)

In [None]:
# logit_pipe

Fitting the model and using it is easy:

In [None]:
logit_pipe.fit(X_train, y_train) 
y_pred = logit_pipe.predict(X_train)

print(f'''
% predicted as charge offs: {round(100*(y_pred == 1).mean(),2)}
Accuracy:                   {round(100*(y_pred == y_train).mean(),2)}
''')

But the better idea is to see how that model does on different "folds" of the data: CROSS VALIDATION.

The [`cross_validate` (function docs here)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html?highlight=cross_validate#sklearn.model_selection.cross_validate) makes this easy:

In [None]:
scores = cross_validate(logit_pipe,
                        X_train, y_train,
                        scoring='recall')

### EXERCISES

- Q5: What is the average score of this model in the folds?
- Q6: How many folds did we use above (by default). Change to 10 folds and repeat.
- Q7: What is the "score" being reported? 
- Q8: Without running code, [what do you think our model is currently scoring for precision, sensitivity, and recall?]
- Q9: Change the scoring method to one of those. [Specify them in sklearn is here.](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter). How did your score change and why?


In [None]:
# put answers here

### Step 6: Optimizing the model

### Aka "tuning the hyperparameters"

It seems like something about how the default values of the logit model are set is leading to a simpleton model: Predict every loan is paid back!

So, the idea here is to repeat the CV above, many times, with different parameter values.


### What hyperparameters can I change?

The easiest way to see all the hyperparameters in the pipeline is this: 

`<pipename>.get_params()`

Notice how the "C" parameter for the LogisticRegression function is called "`logisticregression__C`" below? We will come back to that in  a second!

In [None]:
# logit_pipe.get_params()

### Setting up the "hyperparameter grid"

The combination of parameters you want to try is
- a dictionary
- the **keys** in the dict are the hyperparameters you want to change (specifically, how the parameter is **named in the pipeline**
- the **values** in the dict are the values for that hyperparameter you want to change

For example:
```python
parameters =  {'logisticregression__C': [0.001,0.1,1,5]}
```

The reason I wrote the weird "`logisticregression__C`", is because this will help `sklearn` find the function and its parameter to change. This means that within the called "logisticregression" (followed by two underscores) there is a parameter called "C".

Similarly, within the "columntransformer" step, there is a "num_impute" step, which has a "simpleimputer" step, which is a function that has a parameter called "strategy".

Thus, if you want to try other strategies for filling blank numbers in, "`'columntransformer__num_impute__simpleimputer__strategy'`" is what you need.

Thus:
```python
parameters =  {'logisticregression__C': [0.001,0.1,1,5],
              'columntransformer__num_impute__simpleimputer__strategy' : ['mean','median']}
```


### All together now...

1. Set up your parameter grid
1. Make a "super estimator" object with `GridSearchCV`. This just runs cross_validate for every combination of parameters in the grid. (Below, 2x3=6 combinations.)
1. Just like CV, use `.fit()` to run the grid search.

In [None]:
# set up hyper param grid - what params in a pipeline do you want to change?
# a dictionary. keys are things to change in pipeline, values are what to try for that param
# key: <stepname>__<parametername>

parameters =  {'logisticregression__C': [0.1,1,5], 
              'columntransformer__num_impute__simpleimputer__strategy' : ['mean','median']}

#     find optimal hyper params (gridsearchcv)

grid_search = GridSearchCV(estimator = logit_pipe, 
                           param_grid = parameters,
                           cv = 10
                           )

results = grid_search.fit(X_train,y_train)

#     save pipeline with optimal params in place
#     (Note: you should spend time interrogating model predictions, plotting and printing.
#     Does the model struggle predicting certain obs? Excel at some?)

### What to do after that?

[The "outputs" of the grid search are the attributes of the results object, as listed here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV)

- Q10: Output the CV results
    - Bonus: Do it as a dataframe
- Q11: Make the grid search output recall, sensitivity, f1, and accuracy 

---

Now, here is what I'd _**like**_ to ask you:

- Q12: Which of these models would you choose, taking into account the bias-variance tradeoff? Discuss whether these models are high bias or not, and whether they are high variance or not.
- Q13: Outline how we might adjust our model here to improve its performance

**Except: Don't answer those for today's (bad) logit model. I've "hidden" something from you about why this model is performing so poorly. Think of this as "a case study in miniature" about how "black boxes" can make dealing with ML models impenetrable.**

---


In [None]:
# put answers here for Q10 and Q11