# Pipelines (preprocessing to estimator), CV, Gridsearch

- [ ] how to do: preprocessing (with `ColumnTransformer`)
- [ ] how to examine: preprocessing
- [ ] how to do: pipelines + cross validation + scoring
    - `make_pipeline()` with preprocessing and estimator
    - examine pipeline elements 
    - fit and predict using the pipeline (not CV)
    - examine the pipeline using CV (using the cv function and examining its output)
- [ ] scoring vocab: recall/sensitivity, precision, specificity, accuracy
- [ ] how to do: optimizing a pipeline by "tuning the hyperparameters"
    - hyperparameters are the parameters of functions/estimators in the steps in your pipeline
    - set up and use `GridSearchCV`
    - examine output of `GridSearchCV`
    

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline 
from sklearn.impute import SimpleImputer
from df_after_transform import df_after_transform
from sklearn.model_selection import KFold, cross_validate, GridSearchCV

from sklearn import set_config
set_config(display="diagram")  # display='text' is the default

pd.set_option('display.max_colwidth', 1000, 'display.max_rows', 50, 'display.max_columns', None) 

## Load data 

In [2]:
loans = pd.read_csv('inputs/2013_subsample.zip')

In [3]:
loans

Unnamed: 0,id,member_id,loan_status,addr_state,annual_inc,application_type,desc,dti,earliest_cr_line,emp_length,emp_title,fico_range_high,fico_range_low,grade,home_ownership,initial_list_status,installment,int_rate,issue_d,loan_amnt,mort_acc,open_acc,pub_rec,pub_rec_bankruptcies,purpose,revol_bal,revol_util,sub_grade,term,title,total_acc,verification_status,zip_code
0,10148122,,Fully Paid,TX,96500.0,Individual,"Borrower added on 12/31/13 > Bought a new house, furniture, water softener, a second car, etc. Got our lives started and now a manageable monthly payment will help keep them going!<br>",12.61,Sep-2003,3 years,Systems Engineer,709.0,705.0,A,MORTGAGE,f,373.94,7.62,Dec-2013,12000.0,1.0,17.0,0.0,0.0,debt_consolidation,13248.0,55.7,A3,36 months,Debt Consolidation and Credit Transfer,30.0,Not Verified,782xx
1,10149342,,Fully Paid,MI,55000.0,Individual,Borrower added on 12/31/13 > Combining high interest credit cards to lower interest rate.<br>,22.87,Oct-1986,10+ years,Team Leadern Customer Ops & Systems,734.0,730.0,B,OWN,w,885.46,10.99,Dec-2013,27050.0,4.0,14.0,0.0,0.0,debt_consolidation,36638.0,61.2,B2,36 months,Debt Consolidation,27.0,Verified,481xx
2,10119623,,Fully Paid,CO,130000.0,Individual,,13.03,Nov-1997,10+ years,LTC,719.0,715.0,B,MORTGAGE,f,398.52,11.99,Dec-2013,12000.0,3.0,9.0,0.0,0.0,debt_consolidation,10805.0,67.0,B3,36 months,Debt consolidation,19.0,Source Verified,809xx
3,10149577,,Fully Paid,CA,325000.0,Individual,,18.55,Nov-1994,5 years,Area Sales Manager,749.0,745.0,A,MORTGAGE,w,872.52,7.62,Dec-2013,28000.0,5.0,15.0,0.0,0.0,debt_consolidation,29581.0,54.6,A3,36 months,Pay off other Installment loan,31.0,Source Verified,945xx
4,10129454,,Fully Paid,NC,60000.0,Individual,Borrower added on 12/31/13 > I would like to use this money to payoff existing credit card debt and use the remaining about to purchase a used car that is fuel efficient.<br>,4.62,Dec-2009,4 years,Project Manager,724.0,720.0,B,RENT,f,392.81,10.99,Dec-2013,12000.0,0.0,15.0,0.0,0.0,debt_consolidation,7137.0,24.0,B2,36 months,No Regrets,18.0,Not Verified,281xx
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
134799,2334898,,Fully Paid,CA,85000.0,Individual,Borrower added on 12/05/12 > pay off credit card debt<br><br> Borrower added on 12/10/12 > pay credit card debt<br><br> Borrower added on 12/12/12 > credit card debt<br>,21.70,Aug-1997,10+ years,local 729,734.0,730.0,B,MORTGAGE,w,341.22,10.16,Jan-2013,16000.0,3.0,10.0,0.0,0.0,credit_card,8921.0,54.7,B1,60 months,lending club loan,28.0,Verified,910xx
134800,2375068,,Fully Paid,NJ,55500.0,Individual,,23.48,Jun-1991,1 year,Pomptonian food service company,669.0,665.0,D,RENT,f,657.54,18.75,Jan-2013,18000.0,0.0,15.0,0.0,0.0,debt_consolidation,13102.0,82.1,D3,36 months,consolidation,38.0,Verified,088xx
134801,2374791,,Fully Paid,TX,158000.0,Individual,"Borrower added on 12/07/12 > I'm wanting to get this consolidation loan to help my cash flow and make one payment each month. I have a great income and hope to get the note payed off much quicker then the 36 month terms, so I can get back to saving again. Thanks<br>",25.54,May-1990,10+ years,,679.0,675.0,B,RENT,f,565.62,12.12,Jan-2013,17000.0,4.0,7.0,0.0,0.0,debt_consolidation,5896.0,57.8,B3,36 months,DEBT CONSOLIDATION,19.0,Verified,781xx
134802,2301035,,Fully Paid,CA,200000.0,Individual,,13.81,Aug-2000,9 years,direct telecom inc,709.0,705.0,B,MORTGAGE,f,1048.06,12.12,Jan-2013,31500.0,3.0,13.0,0.0,0.0,credit_card,31860.0,79.9,B3,36 months,my cc loan,24.0,Verified,915xx


## Create the training and holdout samples

Split your data into test and train. Your options:
- [sklearn has some built in splitters](https://scikit-learn.org/stable/modules/cross_validation.html)
    - These are rarely the best options for real world data. Prediction is often about the future!
- [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) can do basic splits, but may not be appropriate
    - _test_ is the typical sklearn vernacular; on the website I call this the holdout sample
- You can just keep the most recent time period of your samples in the holdout and put the rest into the training data.
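
The last option can be sketched as follows. This is a minimal illustration on a made-up six-row frame, not the split used in this notebook; the `toy` data and the `cutoff` date are assumptions chosen for the example, and `issue_d` mimics the `Dec-2013` format in the loans data.

```python
import pandas as pd

# Toy stand-in for the loans data (hypothetical values)
toy = pd.DataFrame({
    "issue_d": ["Jan-2013", "Mar-2013", "Jun-2013", "Sep-2013", "Nov-2013", "Dec-2013"],
    "loan_amnt": [16000, 12125, 30000, 9000, 6000, 12000],
})

# Parse the month-year strings into real dates so they compare chronologically
toy["issue_date"] = pd.to_datetime(toy["issue_d"], format="%b-%Y")

# Hypothetical cutoff: loans issued on or after this date go to the holdout
cutoff = pd.Timestamp("2013-11-01")

train_part = toy[toy["issue_date"] < cutoff]
holdout_part = toy[toy["issue_date"] >= cutoff]

print(len(train_part), len(holdout_part))
```

Every loan in the holdout is then more recent than every loan in training, which is closer to how a model would be used on future loans.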

In [4]:
# split to test and train (what we call the "test" subset here is the "holdout" data)

# first let's separate y from X (as is typically done)
y = loans.loan_status == 'Charged Off'
y.value_counts()
loans = loans.drop('loan_status',axis=1)

# stratify will make sure that test/train both have equal fractions of outcome
X_train, X_test, y_train, y_test = train_test_split(loans, y, stratify=y, test_size=.2, random_state=0)
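
A quick way to convince yourself that `stratify` does what the comment says is to compare the outcome rate in each piece. The sketch below uses a made-up outcome vector (20 positives out of 100), not the loans data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20% of outcomes are 1, the rest are 0
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=.2, random_state=0
)

# With stratify, both pieces keep the same share of positive outcomes
print(y_tr.mean(), y_te.mean())
```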

In [5]:
X_test

Unnamed: 0,id,member_id,addr_state,annual_inc,application_type,desc,dti,earliest_cr_line,emp_length,emp_title,fico_range_high,fico_range_low,grade,home_ownership,initial_list_status,installment,int_rate,issue_d,loan_amnt,mort_acc,open_acc,pub_rec,pub_rec_bankruptcies,purpose,revol_bal,revol_util,sub_grade,term,title,total_acc,verification_status,zip_code
15170,9075096,,SC,93600.0,Individual,,12.01,Mar-1990,7 years,Project Controls Manager,724.0,720.0,A,MORTGAGE,f,187.75,7.90,Nov-2013,6000.0,4.0,9.0,0.0,0.0,debt_consolidation,9622.0,18.5,A4,36 months,Debt Freeze,27.0,Not Verified,296xx
8533,9695357,,PA,72000.0,Individual,,18.66,Aug-2001,10+ years,Production Supervisor,679.0,675.0,C,MORTGAGE,f,282.16,14.47,Dec-2013,12000.0,2.0,9.0,0.0,0.0,debt_consolidation,5502.0,79.7,C2,60 months,Debt consolidation,34.0,Verified,168xx
12199,9187121,,CA,36000.0,Individual,,16.93,Jun-1992,3 years,Assistant to the Athletic Director,694.0,690.0,B,RENT,w,257.38,11.99,Dec-2013,7750.0,7.0,13.0,0.0,0.0,debt_consolidation,4039.0,45.9,B3,36 months,Debt Consolidation,49.0,Source Verified,920xx
21305,8874946,,FL,38000.0,Individual,,18.29,Jun-1994,10+ years,Scheduling Coordinator,679.0,675.0,E,RENT,w,652.17,23.10,Nov-2013,16825.0,0.0,16.0,0.0,0.0,debt_consolidation,13018.0,73.1,E4,36 months,Debt Consolidation,28.0,Source Verified,330xx
78574,5976932,,IL,45466.0,Individual,Borrower added on 06/24/13 > Lower rate to pay off credit cards<br>,16.13,Dec-1967,10+ years,Elizabeth doty,764.0,760.0,A,MORTGAGE,f,506.62,6.62,Jul-2013,16500.0,4.0,14.0,0.0,0.0,credit_card,17749.0,43.2,A2,36 months,Credit Card Consoladation,31.0,Verified,610xx
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35474,7691411,,NY,88000.0,Individual,,11.51,Mar-1977,9 years,Superintendent,689.0,685.0,B,RENT,f,415.12,11.99,Oct-2013,12500.0,0.0,12.0,0.0,0.0,credit_card,9819.0,60.6,B3,36 months,Credit card refinancing,37.0,Not Verified,113xx
87086,5610652,,IL,85000.0,Individual,"Borrower added on 06/03/13 > Pay off PNC Personal loan, Citi Bank Loan, Room place Credit card, Best Buy Credit card, Bank Of America credit card, Menards credit card, Walmart credit card and use the rest for my wedding in September of this year. This will help me save some money instead of making 8 payments I can now make one.<br>",18.11,Feb-2000,10+ years,"Gateway Glazing, Inc.",689.0,685.0,D,MORTGAGE,f,790.15,19.72,Jun-2013,30000.0,2.0,15.0,0.0,0.0,debt_consolidation,10311.0,31.7,D5,60 months,Debt consolidation,31.0,Verified,601xx
116743,3638191,,CA,58000.0,Individual,,24.70,Jul-1999,10+ years,stater bros.,714.0,710.0,B,MORTGAGE,f,392.16,10.16,Mar-2013,12125.0,4.0,8.0,0.0,0.0,credit_card,35217.0,77.2,B1,36 months,Credit card refinancing,24.0,Verified,925xx
93939,5146891,,TX,90000.0,Individual,,16.27,Nov-1991,10+ years,American National Insurance Company,664.0,660.0,D,RENT,w,584.48,18.75,May-2013,16000.0,0.0,16.0,0.0,0.0,credit_card,21188.0,72.3,D3,36 months,Credit card refinancing,36.0,Not Verified,775xx


## EDA

On the **TRAINING DATA ONLY**: 
- do lots of EDA
- look for missing values, which variables are what type, and outliers 
  - You want to know which variables are continuous. Maybe we log these? Or scale them? Or standardize them?
  - Which are categorical? Watch especially for "numeric" looking categories. We need to tell sklearn how to use these!
- figure out how you'd clean the data (imputation, scaling, encoding categorical vars)
- these lessons will go into the preprocessing portion of your pipeline 
- PRO TIP: `ydata-profiling` (formerly `pandas-profiling`) and `dabl` build automated reports. That is nice, but remember you need to examine them closely! **There is no shortcut for EDA.**
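
Before reaching for an automated report, a few pandas one-liners cover the checklist above. This is a rough sketch on a made-up frame (the `toy` values are assumptions); on the real data you would run the same calls on `X_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with some missing values
toy = pd.DataFrame({
    "annual_inc": [96500.0, 55000.0, np.nan, 325000.0],
    "grade": ["A", "B", "B", None],
    "term": ["36 months", "36 months", "60 months", "36 months"],
})

missing_share = toy.isna().mean()   # fraction of missing values per column
col_types = toy.dtypes              # numeric vs. non-numeric columns
n_unique = toy.nunique()            # few distinct values often means "categorical"

summary = pd.DataFrame({"dtype": col_types,
                        "missing_share": missing_share,
                        "n_unique": n_unique})
print(summary)
```

For outliers, `toy.describe()` on the numeric columns is a reasonable first look.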

In [6]:
# from pandas_profiling import ProfileReport # now use ydata-profiling
# 
# profile = ProfileReport(pd.concat([y_train, X_train], axis=1), 
#                         title='Lending Club Profiling Report',
#                         html={'style':{'full_width':True}}) 
# profile.to_file("inputs/lending_club_EDA_training.html") # can take a minute or two with this dataset size. Let's look at the one I uploaded...


## NOW LET'S LEARN EACH PART OF "Optimize a series of models" FROM THE TEMPLATE

### Steps 1 and 2: Preprocessing


In [7]:
# set up pipeline to clean each type of variable (1 pipe per var type)

numer_pipe = make_pipeline(SimpleImputer(strategy='mean'),
                           StandardScaler()) 

cat_pipe   = make_pipeline(OneHotEncoder(drop='first'))

# combine those pipes into "preprocess" pipe

preproc_pipe = ColumnTransformer(  
    [ # arg 1 of ColumnTransformer is a list, so this starts the list
    # a tuple for the numerical vars: name, pipe, which vars to apply to
    ("num_impute", numer_pipe, ['annual_inc','fico_range_high','dti']),
    # a tuple for the categorical vars: name, pipe, which vars to apply to
    ("cat_trans", cat_pipe, ['grade'])
    ]
    , # ColumnTransformer can take other args, most important: "remainder"
    remainder = 'drop' # you either drop or passthrough any vars not modified above
)
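
Instead of typing column names, the third element of each tuple can be a `make_column_selector` (imported above), which picks columns by dtype. The sketch below is an illustration on a made-up frame, not the preprocessor used in this notebook. On the real loans data, selecting every numeric column would also sweep in ID-like columns, so check what gets selected.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mini-sample
toy = pd.DataFrame({
    "annual_inc": [96500.0, 55000.0, np.nan, 325000.0],
    "dti": [12.61, 22.87, 13.03, 18.55],
    "grade": ["A", "B", "B", "A"],
})

toy_preproc = ColumnTransformer(
    [
        # numeric columns: impute then scale
        ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
         make_column_selector(dtype_include=np.number)),
        # everything that is not numeric: one-hot encode
        ("cat", OneHotEncoder(drop="first"),
         make_column_selector(dtype_exclude=np.number)),
    ],
    remainder="drop",
)

out = toy_preproc.fit_transform(toy)
print(out.shape)  # 2 numeric columns + 1 dummy for grade
```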



In [8]:
# show numer_pipe on annual_inc as an example:
income_changed = numer_pipe.fit_transform(X_train[['annual_inc']])

# print mean and Std and count
print(f"Mean before: {X_train['annual_inc'].mean()}")
print(f"Mean after: {income_changed.mean()}")
print(f"Std before: {X_train['annual_inc'].std()}")
print(f"Std after: {income_changed.std()}")
print(f"Count before: {X_train['annual_inc'].count()}")
print(f"Count after: {len(income_changed)}")

Mean before: 73218.32334949881
Mean after: 3.16256514715696e-17
Std before: 49583.80197879759
Std after: 1.0
Count before: 107843
Count after: 107843


In [9]:
X_train['grade'].value_counts().sort_index()

grade
A    14183
B    35307
C    30395
D    16517
E     7216
F     3515
G      710
Name: count, dtype: int64

In [10]:
# show cat_pipe on grade as an example:
grade_changed = cat_pipe.fit_transform(X_train[['grade']]).toarray()
print(grade_changed)

X_train[['grade']].head(3)

[[0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 ...
 [0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]]


Unnamed: 0,grade
23251,C
108933,C
55850,D


_Note: You can have multiple "numeric pipes" or "cat pipes" if you think some numeric variables should receive different treatments, which is very common!_
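
As a sketch of that note: below, one numeric pipe takes logs before scaling and another does not. The choice of which variable gets logged, the pipe names, and the toy data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical mini-sample
toy = pd.DataFrame({
    "annual_inc": [96500.0, 55000.0, np.nan, 325000.0],
    "dti": [12.61, 22.87, 13.03, 18.55],
})

# Pipe for skewed variables: impute, take log(1 + x), then scale
skewed_pipe = make_pipeline(SimpleImputer(strategy="median"),
                            FunctionTransformer(np.log1p),
                            StandardScaler())

# Pipe for the other numeric variables: impute, then scale
plain_pipe = make_pipeline(SimpleImputer(strategy="mean"),
                           StandardScaler())

two_pipes = ColumnTransformer([
    ("num_skewed", skewed_pipe, ["annual_inc"]),
    ("num_plain", plain_pipe, ["dti"]),
])

out = two_pipes.fit_transform(toy)
print(out.shape)
```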

In [11]:
# try the preproc_pipe on X_train

preproc_pipe.fit_transform(X_train)

array([[-5.69106302e-01, -4.19381405e-06,  3.18333958e-01, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [-9.52297734e-01, -3.47906878e-01, -3.28137507e-01, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [-2.45711089e-02, -8.69760903e-01,  4.59214705e-01, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       ...,
       [-7.50618033e-01, -8.69760903e-01,  7.73883813e-02, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 5.80467994e-01,  5.21849832e-01, -9.68025760e-01, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 3.38452353e-01,  3.47898490e-01, -3.11021155e-01, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00]])

In [12]:
###########
# hot tip: check out what this preprocessing does before you continue!
###########

from df_after_transform import df_after_transform

preproc_df = df_after_transform(preproc_pipe,X_train)
print(f'There are {preproc_df.shape[1]} columns in the preprocessed data.')
preproc_df.describe().T.round(2)

There are 9 columns in the preprocessed data.


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
annual_inc,107843.0,0.0,1.0,-1.36,-0.55,-0.19,0.32,121.55
fico_range_high,107843.0,0.0,1.0,-1.22,-0.7,-0.17,0.52,5.25
dti,107843.0,-0.0,1.0,-2.27,-0.76,-0.04,0.73,2.34
grade_B,107843.0,0.33,0.47,0.0,0.0,0.0,1.0,1.0
grade_C,107843.0,0.28,0.45,0.0,0.0,0.0,1.0,1.0
grade_D,107843.0,0.15,0.36,0.0,0.0,0.0,0.0,1.0
grade_E,107843.0,0.07,0.25,0.0,0.0,0.0,0.0,1.0
grade_F,107843.0,0.03,0.18,0.0,0.0,0.0,0.0,1.0
grade_G,107843.0,0.01,0.08,0.0,0.0,0.0,0.0,1.0


In [13]:
preproc_df

Unnamed: 0,annual_inc,fico_range_high,dti,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G
0,-0.569106,-0.000004,0.318334,0.0,1.0,0.0,0.0,0.0,0.0
1,-0.952298,-0.347907,-0.328138,0.0,1.0,0.0,0.0,0.0,0.0
2,-0.024571,-0.869761,0.459215,0.0,0.0,1.0,0.0,0.0,0.0
3,-0.831290,0.521850,0.318334,0.0,1.0,0.0,0.0,0.0,0.0
4,0.237613,1.391607,0.977972,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
107838,-0.468266,-1.043712,1.155719,0.0,0.0,1.0,0.0,0.0,0.0
107839,0.035933,0.521850,-0.671781,0.0,1.0,0.0,0.0,0.0,0.0
107840,-0.750618,-0.869761,0.077388,0.0,0.0,1.0,0.0,0.0,0.0
107841,0.580468,0.521850,-0.968026,1.0,0.0,0.0,0.0,0.0,0.0


### EXERCISES

1. How many observations for `annual_inc` were non-missing before the processing? Does our imputation choice here have a small or large impact?
1. How many values of `grade` were there in the data before and after `preproc_pipe`? 
1. Above, revise `preproc_pipe` to include another continuous variable.

In [14]:
# answers here
print('q1: ',
X_train['annual_inc'].isna().sum(), preproc_df['annual_inc'].isna().sum()) # 0 missing values in annual_inc after preprocessing

print('q2',
X_train['grade'].nunique(), len(preproc_df.filter(like='grade').columns))

q1:  0 0
q2 7 6


### Prof demo: Fitting and using ONE model

Warning: Fitting on the whole training sample like this is not best practice. The point here is simply to show you how to estimate and use a model.

_(The only time you fit a model on the whole training sample is the VERY end of the template, right before you check to see how it does on the holdout.)_

Steps: https://ledatascifi.github.io/ledatascifi-2022/content/05/04c_onemodel.html

To fit the model: `<model>.fit(X, y)`

In [15]:
# create ("instantiate") the estimator class with some hyperparameters,
# note: the hyperparameters are whatever is inside the "()"
# assign this instance of the estimator to an object
logit = LogisticRegression()

# fit the model to the data: <model>.fit(X,y)
# note: I'm only using annual income here for illustration
logit.fit(X_train[['annual_inc']], 
          y_train) 

logit

To use the model: `<model>.predict(X)`

In [16]:
# this creates predicted values 
y_pred = logit.predict(X_train[['annual_inc']],)

print(f'''
% predicted as charge offs: {round(100*(y_pred == 1).mean(),2)}
Accuracy:                   {round(100*(y_pred == y_train).mean(),2)}
''')

y_pred.mean()


% predicted as charge offs: 0.0
Accuracy:                   84.4



0.0

### EXERCISES

- Q4: Let's estimate a different logit model, and see if the accuracy or the number of predictions of charge offs changes. [This time change the penalty OR the value of "C".](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)

In [17]:
logit2 = LogisticRegression(C=0.5)

logit2.fit(X_train[['annual_inc']], 
           y_train) 

y_pred2 = logit2.predict(X_train[['annual_inc']],)

y_pred2.mean()

0.0

### Steps 3, 4, and 5: Pipelines, CV, and model evaluation (scoring)

Making a pipeline is easy: `make_pipeline` will put steps together for you:

In [18]:
logit_pipe = make_pipeline(preproc_pipe, LogisticRegression())

A pipeline is an object that stores its steps (and steps within steps)

In [19]:
logit_pipe
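
To look inside a pipeline programmatically (rather than via the diagram), you can list its steps and pull one out by name or position. A minimal sketch on a small stand-in pipeline; `toy_pipe` is an assumption for the example, and the same calls work on `logit_pipe`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# make_pipeline names each step after its class, in lowercase
step_names = [name for name, _ in toy_pipe.steps]
print(step_names)

# Grab a step by name, or by position
est = toy_pipe.named_steps["logisticregression"]
last = toy_pipe[-1]
print(est is last)
```

Those automatic step names are the ones that show up later in the hyperparameter names.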

Fitting the model and using it is easy:

In [20]:
logit_pipe.fit(X_train, y_train) 
y_pred = logit_pipe.predict(X_train)

print(f'''
% predicted as charge offs: {round(100*(y_pred == 1).mean(),2)}
Accuracy:                   {round(100*(y_pred == y_train).mean(),2)}
''')


% predicted as charge offs: 0.0
Accuracy:                   84.4



But the better idea is to see how that model does on different "folds" of the data: CROSS VALIDATION.

The [`cross_validate` function (docs here)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html?highlight=cross_validate#sklearn.model_selection.cross_validate) makes this easy:

In [21]:
scores = cross_validate(logit_pipe,
                        X_train, y_train,
                        cv=10,
                        scoring='recall', )
scores

{'fit_time': array([0.21168399, 0.17036748, 0.18379927, 0.2187016 , 0.23139644,
        0.2145288 , 0.2036984 , 0.21788812, 0.22253323, 0.21006274]),
 'score_time': array([0.01679897, 0.00896144, 0.01736474, 0.01685119, 0.02115774,
        0.01198697, 0.02766085, 0.01584458, 0.01775241, 0.01608443]),
 'test_score': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])}

### EXERCISES

- Q5: What is the average score of this model in the folds?
- Q6: How many folds does `cross_validate` use by default? Change to 10 folds and repeat.
- Q7: What is the "score" being reported? 
- Q8: Without running code, what do you think our model is currently scoring for precision, specificity, and recall (sensitivity)?
- Q9: Change the scoring method to one of those. [How to specify them in sklearn is described here.](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter). How did your score change and why?


In [22]:
# put answers here
scores['test_score'].mean()

# q6 - 5 

# q7 = recall: fraction of actual charge offs that are predicted as charge offs

0.0

- recall (aka sensitivity): what fraction of the actual cases of a given outcome does my model catch?
   - recall (defaults): # of defaults the model predicts as default / # of actual defaults
      - TP / # of actual defaults = 0 / ~16.8k = 0% 
   - recall (pay off): # of paid off loans the model predicts as paid off / # of actual pay offs 
      - TN / # of actual pay offs = 100% here  
- precision: when my model assigns a given label, what fraction of those labels are correct?
   - precision (default): # of defaults the model predicts as default / # of predicted defaults 
      - TP / # of predicted defaults = 0/0 (undefined) 
   - precision (pay off): # of paid off loans the model predicts as paid off / # of predicted pay offs 
      - TN / # of predicted pay offs = ~84%
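
The same arithmetic can be checked on a made-up example where a model predicts "paid off" for everyone. The ten outcomes below are hypothetical, and passing `zero_division=0` is a choice made here so that the 0/0 precision is reported as 0 instead of raising a warning.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical outcomes: 1 = charged off, 0 = paid off
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_hat = np.zeros_like(y_true)   # the model predicts "paid off" for every loan

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall_default = recall_score(y_true, y_hat)                          # TP / (TP + FN)
precision_default = precision_score(y_true, y_hat, zero_division=0)   # 0/0, reported as 0
specificity = tn / (tn + fp)                                          # recall for the paid-off class
accuracy = (tp + tn) / len(y_true)

print(recall_default, precision_default, specificity, accuracy)
```

Accuracy looks fine (0.8) even though the model never catches a charge off, which is why accuracy alone is a poor score for imbalanced outcomes.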

### Step 6: Optimizing the model

### Aka "tuning the hyperparameters"

It seems like something about how the default values of the logit model are set is leading to a simpleton model: Predict every loan is paid back!

So, the idea here is to repeat the CV above, many times, with different parameter values.


### What hyperparameters can I change?

The easiest way to see all the hyperparameters in the pipeline is this: 

`<pipename>.get_params()`

Notice how the "C" parameter for the LogisticRegression function is called "`logisticregression__C`" below? We will come back to that in a second!

In [39]:
# logit_pipe.get_params()
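
The full `get_params()` output is long, so it can help to filter the keys. A sketch on a stand-in pipeline built to mirror the step names used in this notebook (`toy_preproc` and `toy_pipe` are assumptions for the example):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_preproc = ColumnTransformer(
    [("num_impute", make_pipeline(SimpleImputer(), StandardScaler()), ["annual_inc"])]
)
toy_pipe = make_pipeline(toy_preproc, LogisticRegression())

# Keep only the parameter names we might want to tune
keys = [k for k in toy_pipe.get_params()
        if k.endswith("__C") or k.endswith("__strategy")]
print(sorted(keys))
```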

### Setting up the "hyperparameter grid"

The combination of parameters you want to try is
- a dictionary
- the **keys** in the dict are the hyperparameters you want to change (specifically, how the parameter is **named in the pipeline**)
- the **values** in the dict are the values for that hyperparameter you want to change

For example:
```python
parameters =  {'logisticregression__C': [0.001,0.1,1,5]}
```

The reason I wrote the weird "`logisticregression__C`" is that this helps `sklearn` find the function and its parameter to change. It means that within the step called "logisticregression" (followed by two underscores) there is a parameter called "C".

Similarly, within the "columntransformer" step, there is a "num_impute" step, which has a "simpleimputer" step, which is a function that has a parameter called "strategy".

Thus, if you want to try other strategies for filling blank numbers in, "`'columntransformer__num_impute__simpleimputer__strategy'`" is what you need.

Thus:
```python
parameters =  {'logisticregression__C': [0.001,0.1,1,5],
              'columntransformer__num_impute__simpleimputer__strategy' : ['mean','median']}
```
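
If you want to see exactly which combinations a grid like this implies before running anything, `ParameterGrid` from `sklearn.model_selection` expands it. This is only a way to inspect the grid; the grid search itself does not need it.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'logisticregression__C': [0.001, 0.1, 1, 5],
              'columntransformer__num_impute__simpleimputer__strategy': ['mean', 'median']}

# Each element is one dict of settings that the grid search would try
combos = list(ParameterGrid(parameters))
print(len(combos))  # 4 values of C x 2 strategies = 8
```

Multiply that count by the number of folds to get the number of model fits.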


### All together now...

1. Set up your parameter grid
1. Make a "super estimator" object with `GridSearchCV`. This just runs cross_validate for every combination of parameters in the grid. (Below, 2x3=6 combinations.)
1. Just like CV, use `.fit()` to run the grid search.

In [40]:
# set up hyper param grid - what params in a pipeline do you want to change?
# a dictionary. keys are things to change in pipeline, values are what to try for that param
# key: <stepname>__<parametername>

parameters =  {'logisticregression__C': [0.1,1,5], 
              'columntransformer__num_impute__simpleimputer__strategy' : ['mean','median']}

#     find optimal hyper params (gridsearchcv)

grid_search = GridSearchCV(estimator = logit_pipe, 
                           param_grid = parameters,
                           cv = 3,
                           scoring='recall'
                           )

results = grid_search.fit(X_train,y_train)

#     save pipeline with optimal params in place
#     (Note: you should spend time interrogating model predictions, plotting and printing.
#     Does the model struggle predicting certain obs? Excel at some?)

In [41]:
results.best_params_
results.best_score_
results.cv_results_
pd.DataFrame(results.cv_results_).round(2) # easy to read!

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__num_impute__simpleimputer__strategy,param_logisticregression__C,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,0.11,0.03,0.02,0.01,mean,0.1,"{'columntransformer__num_impute__simpleimputer__strategy': 'mean', 'logisticregression__C': 0.1}",0.0,0.0,0.0,0.0,0.0,1
1,0.11,0.01,0.02,0.0,mean,1.0,"{'columntransformer__num_impute__simpleimputer__strategy': 'mean', 'logisticregression__C': 1}",0.0,0.0,0.0,0.0,0.0,1
2,0.11,0.01,0.02,0.0,mean,5.0,"{'columntransformer__num_impute__simpleimputer__strategy': 'mean', 'logisticregression__C': 5}",0.0,0.0,0.0,0.0,0.0,1
3,0.11,0.01,0.02,0.01,median,0.1,"{'columntransformer__num_impute__simpleimputer__strategy': 'median', 'logisticregression__C': 0.1}",0.0,0.0,0.0,0.0,0.0,1
4,0.11,0.01,0.01,0.01,median,1.0,"{'columntransformer__num_impute__simpleimputer__strategy': 'median', 'logisticregression__C': 1}",0.0,0.0,0.0,0.0,0.0,1
5,0.12,0.0,0.02,0.0,median,5.0,"{'columntransformer__num_impute__simpleimputer__strategy': 'median', 'logisticregression__C': 5}",0.0,0.0,0.0,0.0,0.0,1
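
Once a search has been fit, the winning pipeline is available as `best_estimator_` and can be used like any fitted model. A minimal sketch on synthetic data from `make_classification`; the data and `toy_*` names are assumptions for the example, not the loans model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

toy_pipe = make_pipeline(StandardScaler(), LogisticRegression())
toy_search = GridSearchCV(toy_pipe,
                          {"logisticregression__C": [0.1, 1, 5]},
                          cv=3,
                          scoring="recall")
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)

# With the default refit=True, this is the pipeline refit on all of X_toy
# using the best parameter combination
best_pipe = toy_search.best_estimator_
preds = best_pipe.predict(X_toy)
```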


### What to do after that?

[The "outputs" of the grid search are the attributes of the results object, as listed here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV)

- Q10: Output the CV results
    - Bonus: Do it as a dataframe
- Q11: Make the grid search output recall, precision, f1, and accuracy 

---

Now, here is what I'd _**like**_ to ask you:

- Q12: Which of these models would you choose, taking into account the bias-variance tradeoff? Discuss whether these models are high bias or not, and whether they are high variance or not.
- Q13: Outline how we might adjust our model here to improve its performance

**Except: Don't answer those for today's (bad) logit model. I've "hidden" something from you about why this model is performing so poorly. Think of this as "a case study in miniature" about how "black boxes" can make dealing with ML models impenetrable.**

---


In [None]:
# put answers here for Q10 and Q11