# Code for the results in the preprocessing lecture

Exercises:
- Don't run my code as-is. It will take 15+ minutes. 
- Delete my `param_grid` and create your own. (Good practice!) Try PCA, and then Poly(2). 
- Try to examine and use the estimators _after_ the grid search. 
- How can you use a model that isn't `best_estimator_`? (Perhaps you prefer the model without the absolute highest mean test score!) Pick a model that isn't the `best_estimator_`, and figure out how you can use that model (to `.fit()` it and `.predict()` with it).
- See how our profitable model does on the test sample!
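
For the "model that isn't `best_estimator_`" exercise, here is one possible approach, sketched on toy data rather than the loan data. The names `toy_pipe`, `toy_grid`, and `runner_up` are made up for illustration, and the search uses the default accuracy score instead of the profit scorer. The idea: every row of `cv_results_` stores its settings in the `params` column, and those settings can be handed to `set_params()`.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# toy data standing in for the loan data (assumption, not the lecture's dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

toy_pipe = Pipeline([
    ("feature_select", "passthrough"),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_grid = {"feature_select": ["passthrough",
                               SelectKBest(f_classif, k=2),
                               SelectKBest(f_classif, k=4)]}

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=3).fit(X_toy, y_toy)

# rank every candidate, then take the second row instead of the first
cv_table = pd.DataFrame(toy_search.cv_results_).sort_values("rank_test_score")
runner_up_params = cv_table.iloc[1]["params"]

# clone() gives an unfitted copy of the pipeline; set_params() applies the chosen settings
runner_up = clone(toy_pipe).set_params(**runner_up_params)
runner_up.fit(X_toy, y_toy)
print(runner_up.predict(X_toy[:5]))
```

The same pattern should carry over to `results` from the real grid search: pick a row of `results.cv_results_`, grab its `params`, and apply them to a clone of `pipe`.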

In [1]:
# I'm putting all code we've seen before here

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from df_after_transform import df_after_transform
from sklearn import set_config
from sklearn.calibration import CalibrationDisplay
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
    make_column_transformer,
)
from sklearn.decomposition import PCA
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.feature_selection import (
    RFECV,
    SelectFromModel,
    SelectKBest,
    SequentialFeatureSelector,
    f_classif,
)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, LassoCV, LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    DetCurveDisplay,
    PrecisionRecallDisplay,
    RocCurveDisplay,
    classification_report,
    make_scorer,
)
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    PolynomialFeatures,
    StandardScaler,
)
from sklearn.svm import LinearSVC

set_config(display="diagram")  # display='text' is the default

pd.set_option(
    "display.max_colwidth", 1000, "display.max_rows", 50, "display.max_columns", None
)

import warnings

warnings.filterwarnings("ignore")

# load data

loans = pd.read_csv("ML - prof only/inputs/2013_subsample.zip")

# drop some bad columns here, or in the pipeline

# loans = loans.drop(
#     ["member_id", "id", "desc", "earliest_cr_line", "emp_title", "issue_d"], axis=1
# )

# create holdout sample

y = loans.loan_status == "Charged Off"
y.value_counts()
loans = loans.drop("loan_status", axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    loans, y, stratify=y, test_size=0.2, random_state=0
)  # (stratify will make sure that test/train both have equal fractions of outcome)

# define the profit function


def custom_prof_score(y, y_pred, roa=0.02, haircut=0.20):
    """
    Firm profit is this times the average loan size. We can
    ignore that term for the purposes of maximization. 
    """
    TN = sum((y_pred == 0) & (y == 0))  # count loans made and actually paid back
    FN = sum((y_pred == 0) & (y == 1))  # count loans made and actually defaulting
    return TN * roa - FN * haircut


# so that we can use the fcn in sklearn, "make a scorer" out of that function

prof_score = make_scorer(custom_prof_score)
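
To see what the profit function rewards, here is a tiny worked example. The five loans are invented, and `toy_prof_score` just restates the formula above so the snippet runs on its own.

```python
import numpy as np


def toy_prof_score(y, y_pred, roa=0.02, haircut=0.20):
    # same formula as custom_prof_score, restated so this snippet is self-contained
    TN = np.sum((y_pred == 0) & (y == 0))  # loans made and paid back
    FN = np.sum((y_pred == 0) & (y == 1))  # loans made that defaulted
    return TN * roa - FN * haircut


y_true = np.array([0, 0, 0, 1, 1])  # 1 = charged off
y_hat = np.array([0, 0, 1, 0, 1])   # 0 = we predict "no default", so we make the loan

# loans made: rows 0, 1, 3 -> two repaid (TN = 2), one default (FN = 1)
score = toy_prof_score(y_true, y_hat)
print(round(score, 2))  # 2 * 0.02 - 1 * 0.20 = -0.16
```

One default wipes out the return on ten good loans at these settings, which is why the scores in the grid search can easily be negative.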

### A new step right before `preproc_pipe`

The `ColumnTransformer` takes a list of pipelines, and each pipeline needs a list of variables to use. Options:
1. Define manually: `num_pipe_vars = ['A','B','C']`, then `(numer_pipe, num_pipe_vars)` 
    - Maybe tedious, but explicit (good)
2. Let the CT find the numeric vars: `(numer_pipe, make_column_selector(dtype_include=np.number))` 
    - gets numbers, similar for cat vars
    - lets in all numeric/object variables, but often some of your columns shouldn't be used!
3. Get pandas to list all numeric/cat vars, then drop those you don't want (example below)
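
A quick sketch of how options 2 and 3 differ, on a made-up frame (the columns `id`, `income`, `rate`, and `grade` are hypothetical, not the loan data):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# toy frame (hypothetical columns)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "income": [50.0, 60.0, 70.0],
    "rate": [0.10, 0.12, 0.08],
    "grade": ["A", "B", "A"],
})

# option 2: the selector grabs every numeric column, including the useless "id"
auto_nums = make_column_selector(dtype_include=np.number)(toy)

# option 3: list the numeric columns with pandas, then drop the ones to exclude
skip = ["id"]
picked_nums = [c for c in toy.select_dtypes(include="number").columns if c not in skip]

print(auto_nums)    # ['id', 'income', 'rate']
print(picked_nums)  # ['income', 'rate']
```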

In [2]:
dont_use = ["member_id", "id", "desc", "earliest_cr_line", "emp_title", "issue_d","title"]

# list of all num vars:
num_pipe_features = X_train.select_dtypes(include="number").columns

# exclude any bad features:
num_pipe_features = [e for e in num_pipe_features if e not in dont_use]

cat_pipe_features = ["grade"]  # all: X_train.select_dtypes(include='object').columns

#### Tips:
1. **Check these lists manually!** (It's very easy to mess up.)
2. **You may want/need more than 2 pipes**. E.g.:
    - one for boolean vars 
    - one for log-normal vars
    - one for skewed vars
    - one for cat vars
    - also: you can take a column and use/transform it multiple ways
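
As a rough illustration of the last two tips, here is a `ColumnTransformer` with three pipes where one column is used twice. The data and the step names (`scaled`, `logged`, `cats`) are invented for this sketch, and `np.log1p` is just one possible choice for a skewed variable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "income": [30.0, 45.0, 60.0, 250.0],  # skewed
    "rate": [0.10, 0.12, 0.08, 0.15],
    "grade": ["A", "B", "A", "C"],
})

# a pipe for skewed vars: take logs, then standardize
log_pipe = make_pipeline(FunctionTransformer(np.log1p), StandardScaler())

toy_ct = ColumnTransformer([
    ("scaled", StandardScaler(), ["income", "rate"]),
    ("logged", log_pipe, ["income"]),       # income is used a second time
    ("cats", OneHotEncoder(), ["grade"]),
], remainder="drop")

out = toy_ct.fit_transform(toy)
print(out.shape)  # 2 scaled + 1 logged + 3 grade dummies = 6 columns
```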

In [3]:
numer_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

cat_pipe = make_pipeline(OneHotEncoder())

# make_column_transformer names the inner pipes automatically ('pipeline-1', 'pipeline-2')
preproc_pipe = make_column_transformer(
    (numer_pipe, num_pipe_features), 
    (cat_pipe, cat_pipe_features), 
    remainder="drop",
)

In [4]:
# let's make the model from last class too, as a baseline
preproc_pipe_old = make_column_transformer(
    (numer_pipe, ["annual_inc", "int_rate"]), 
    (cat_pipe, cat_pipe_features), 
    remainder="drop",
)

In [None]:
# preproc_df = df_after_transform(preproc_pipe,X_train)
# preproc_df.describe().T.round(2)

## Our generic pipeline

1. Process data
1. Create features
1. Select features
1. Estimator/model

Let's make the pipeline. `'passthrough'` means a step isn't doing anything. (Yet!)
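
A small sketch of what a `'passthrough'` placeholder buys you, on toy data (the names `toy_pipe`, `n_before`, and `n_after` are made up): the named slot can be filled later with `set_params()`, which is exactly what a grid search does behind the scenes.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

toy_pipe = Pipeline([
    ("feature_select", "passthrough"),  # placeholder: data flows through unchanged
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_pipe.fit(X_toy, y_toy)
n_before = toy_pipe["clf"].coef_.shape[1]  # all 6 features reach the model

# swap a real transformer into the named slot, then refit
toy_pipe.set_params(feature_select=PCA(3))
toy_pipe.fit(X_toy, y_toy)
n_after = toy_pipe["clf"].coef_.shape[1]  # only 3 components reach the model

print(n_before, n_after)  # 6 3
```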

In [5]:
# I used "Pipeline" not "make_pipeline" bc I wanted to name the steps
pipe = Pipeline([('columntransformer',preproc_pipe),
                 ('feature_create','passthrough'), 
                 ('feature_select','passthrough'), 
                 ('clf', LogisticRegression(class_weight='balanced'))
                ])

pipe

## Trying many different models 

By using `GridSearchCV`, we can see how our model does for as many different combinations of choices as we want.

This means we can try different options for the preprocessing step, the feature selection step, the feature creation step, and the estimator. 

We just need to set up the list of combinations (`param_grid` below) we want to try.
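
Before launching a long search, it can help to count how many combinations a grid implies. A minimal sketch with `ParameterGrid`, where the strings (`"poly"`, `"pca5"`, `"pca10"`) are placeholder labels standing in for real transformers:

```python
from sklearn.model_selection import ParameterGrid

# one dict: every value is crossed with every other value
crossed = {"feature_create": ["passthrough", "poly"],
           "feature_select": ["passthrough", "pca5", "pca10"]}

# list of dicts: each dict is expanded on its own, then the results are stacked
separate = [
    {"feature_select": ["passthrough", "pca5", "pca10"]},
    {"feature_create": ["poly"], "feature_select": ["passthrough", "pca10"]},
]

print(len(ParameterGrid(crossed)))   # 2 * 3 = 6
print(len(ParameterGrid(separate)))  # 3 + 2 = 5
```

Multiply the count by the number of CV folds to get the number of model fits, which is a decent first guess at how long the search will run.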

In [55]:
# I'm setting up this parameter grid a little differently. Read this:
# https://stackoverflow.com/questions/45352420/avoid-certain-parameter-combinations-in-gridsearchcv

param_grid = [
    
    # baseline: last class's 3 variable logit, no feature creation or selection
    {'columntransformer': [preproc_pipe_old]},
    
    # now, try different feature selection methods (no creation, logit as estimator)
    dict(feature_select=['passthrough', 
                         
                         PCA(5), 
                         PCA(10), 
                         PCA(15),
                         
                         SelectKBest(f_classif,k=5),
                         SelectKBest(f_classif,k=10),
                         SelectKBest(f_classif,k=15),
                         
                         SelectFromModel(LassoCV()),
                         SelectFromModel(LinearSVC(penalty="l1",
                                                   dual=False,
                                                   class_weight='balanced'),
                                        threshold='median'),
                         
                         # RFECV is slow here...
                         RFECV(LinearSVC(penalty="l1",
                                         dual=False,
                                         class_weight='balanced'),
                               cv=2,scoring=prof_score),                         
                         RFECV(LogisticRegression(class_weight='balanced'),
                               cv=2,scoring=prof_score),
                         
                         # slow but faster than RFECV 
                         SequentialFeatureSelector(LogisticRegression(class_weight='balanced'),
                                                   scoring=prof_score,
                                                   n_features_to_select=5,
                                                   cv=2),      
                         SequentialFeatureSelector(LogisticRegression(class_weight='balanced'),
                                                   scoring=prof_score,
                                                   n_features_to_select=10,
                                                   cv=2),    
                         SequentialFeatureSelector(LogisticRegression(class_weight='balanced'),
                                                   scoring=prof_score,
                                                   n_features_to_select=15,
                                                   cv=2)                         
                         
                         ]),
    
    # now, try different feature creation methods (and possibly reduce the features after)
    {'feature_create': [
                        # this creates interactions between all variables
                        PolynomialFeatures(degree=2, interaction_only=True)],
     'feature_select': ['passthrough', PCA(15), PCA(25)]
    },
    
]

Now we put that list of combinations into `GridSearchCV` to run them all!

In [9]:
grid_search = GridSearchCV(estimator = pipe, 
                           param_grid = param_grid,
                           cv = 5, 
                           scoring=prof_score
                           )

results = grid_search.fit(X_train,y_train)

And this is here just to make the slides a little nicer:

In [52]:
pretty = pd.DataFrame(results.cv_results_).iloc[:,[4,5,6,-3,-2,1,2]].fillna('')
pretty = pretty.replace('passthrough','')
pretty.iloc[6,1] = 'SelectKBest(k=10)'
pretty.iloc[0,0] = '3 vars (Last class)'
pretty.iloc[1:,0] = 'All numeric vars + Grade'
pretty

| | param_columntransformer | param_feature_select | param_feature_create | mean_test_score | std_test_score | std_fit_time | mean_score_time |
|---|---|---|---|---|---|---|---|
| 0 | 3 vars (Last class) | | | -2.7 | 5.508829 | 0.02692 | 0.021488 |
| 1 | All numeric vars + Grade | | | -3.84 | 4.428237 | 0.075386 | 0.023774 |
| 2 | All numeric vars + Grade | PCA(n_components=5) | | -27.768 | 5.06618 | 0.006887 | 0.027951 |
| 3 | All numeric vars + Grade | PCA(n_components=10) | | -8.232 | 3.808592 | 0.006719 | 0.02623 |
| 4 | All numeric vars + Grade | PCA(n_components=15) | | -2.452 | 3.665048 | 0.010823 | 0.023858 |
| 5 | All numeric vars + Grade | SelectKBest(k=5) | | -5.068 | 4.560519 | 0.020482 | 0.027804 |
| 6 | All numeric vars + Grade | SelectKBest(k=10) | | -5.896 | 6.231397 | 0.010137 | 0.028221 |
| 7 | All numeric vars + Grade | SelectKBest(k=15) | | -2.58 | 5.209683 | 0.04222 | 0.033146 |
| 8 | All numeric vars + Grade | SelectFromModel(estimator=LassoCV()) | | -3.892 | 4.542028 | 0.115747 | 0.032657 |
| 9 | All numeric vars + Grade | SelectFromModel(estimator=LinearSVC(class_weight='balanced', dual=False, penalty='l1'), threshold='median') | | -6.736 | 5.610268 | 0.321201 | 0.03471 |


We can use the best estimator easily, and we can also learn about what that estimator decided.



In [60]:
# this is the highest ranked estimator. You can now .predict() with it on new data!
print(results.best_estimator_)

# this tells us which vars we picked, but is a boolean array (not var names)
print("="*60)
print("Columns picked (boolean mask)")
print(results.best_estimator_['feature_select'].support_)

# the df AFTER the preproc step has the var names
preproc_df = df_after_transform(preproc_pipe,X_train)

# so to get the names of the vars, our df_after_transform helps!
mask = results.best_estimator_['feature_select'].support_
print("="*60)
print("Columns picked")
print(preproc_df.columns[mask].tolist())

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['annual_inc', 'dti',
                                                   'fico_range_high',
                                                   'fico_range_low',
                                                   'installment', 'int_rate',
                                                   'loan_amnt', 'mort_acc',
                                                   'open_acc', 'pub_rec',
                                                   'pub_rec_bankruptcies',
                                          