# Challenge 2: Make a Scikit Learn + MLFlow Tracked Lead Scoring Model


Machine Learning & APIs with Python Course (DS4B 201-P)

Business Science

# Challenge Summary

**The Situation:** Your team is excited about the results from the H2O solution, but wants to try using a model that doesn't require the H2O initialization in production. They aren't familiar with Pycaret, but they are familiar with Scikit Learn. They want it tracked in MLFlow. 

**Your Proposed Solution:** You recommend creating a solution that mimics your Pycaret solution but is built using Scikit Learn and XGBoost, and then tracking the AUC in MLFlow.

## Objectives:

1. Implement `sklearn` functionality to build a pipeline that mimics the Pycaret solution. 

2. Use `mlflow` to track the solution in a new experiment called: `sklearn_lead_scoring_2`.

## Getting Started

To read in the data, make sure your current working directory is set to the project directory. Two useful Jupyter magic commands are:

1. `%pwd`: Print working directory (you can detect your current directory)
2. `%cd`: You can change directory to your working directory using relative paths or full paths.

In [1]:
%pwd

'C:\\Users\\daver\\OneDrive\\DESKTOP\\DS4B_201P\\ds4b_201p_course\\challenge_02'

In [2]:
# You can change this to your correct directory
%cd C:\\Users\\daver\\OneDrive\\DESKTOP\\DS4B_201P\\ds4b_201p_course

C:\Users\daver\OneDrive\DESKTOP\DS4B_201P\ds4b_201p_course


In [3]:
%pwd

'C:\\Users\\daver\\OneDrive\\DESKTOP\\DS4B_201P\\ds4b_201p_course'
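If you are working outside of Jupyter (for example in a plain Python script), the magic commands are not available. A minimal sketch of the same idea with the standard library is below. The temporary folder is only a stand-in for your project directory, and the names `start_dir` and `project_dir` are made up for this illustration.

```python
import os
import tempfile

# Remember where we started so we can return afterwards.
start_dir = os.getcwd()          # same information as %pwd

# A temporary folder stands in for the project directory (hypothetical).
with tempfile.TemporaryDirectory() as project_dir:
    os.chdir(project_dir)        # same effect as %cd <path>
    inside_project = os.path.samefile(os.getcwd(), project_dir)
    os.chdir(start_dir)          # go back before the temp folder is deleted

print(inside_project)
```

In your own project you would pass the path to `ds4b_201p_course` to `os.chdir()` instead of a temporary folder.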

## Python Package Imports

Import core packages

In [4]:
import pandas as pd
import numpy as np

import email_lead_scoring as els

Add Pycaret, Scikit Learn, and MLFlow Imports

In [5]:
# Pycaret
import pycaret.classification as clf

# Sklearn
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

# Sklearn Model Selection
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

# Sklearn Feature Engineering
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

# Sklearn Metrics
from sklearn.metrics import SCORERS

# Model
from xgboost import XGBClassifier

# MLFlow
import mlflow
import mlflow.sklearn as mlflow_sklearn

Read in the data. 

In [6]:
leads_df = els.db_read_and_process_els_data()

leads_df

Unnamed: 0,mailchimp_id,user_full_name,user_email,member_rating,optin_time,country_code,tag_count,made_purchase,optin_days,email_provider,tag_count_by_optin_day,tag_aws_webinar,tag_learning_lab,tag_learning_lab_05,tag_learning_lab_09,tag_learning_lab_11,tag_learning_lab_12,tag_learning_lab_13,tag_learning_lab_14,tag_learning_lab_15,tag_learning_lab_16,tag_learning_lab_17,tag_learning_lab_18,tag_learning_lab_19,tag_learning_lab_20,tag_learning_lab_21,tag_learning_lab_22,tag_learning_lab_23,tag_learning_lab_24,tag_learning_lab_25,tag_learning_lab_26,tag_learning_lab_27,tag_learning_lab_28,tag_learning_lab_29,tag_learning_lab_30,tag_learning_lab_31,tag_learning_lab_32,tag_learning_lab_33,tag_learning_lab_34,tag_learning_lab_35,tag_learning_lab_36,tag_learning_lab_37,tag_learning_lab_38,tag_learning_lab_39,tag_learning_lab_40,tag_learning_lab_41,tag_learning_lab_42,tag_learning_lab_43,tag_learning_lab_44,tag_learning_lab_45,tag_learning_lab_46,tag_learning_lab_47,tag_time_series_webinar,tag_webinar,tag_webinar_01,tag_webinar_no_degree,tag_webinar_no_degree_02
0,3,Garrick Langworth,garrick.langworth@gmail.com,2,2019-05-22,in,6,1,589,gmail.com,0.010169,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,4,Cordell Dickens,cordell.dickens@gmail.com,4,2018-11-19,other,0,1,773,gmail.com,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,8,Inga Dach,inga.dach@gmail.com,2,2018-11-19,other,0,1,773,gmail.com,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,10,Ferdinand Bergstrom,ferdinand.bergstrom@gmail.com,2,2020-03-20,co,3,1,286,gmail.com,0.010453,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,11,Justen Simonis,justen.simonis@gmail.com,2,2020-04-14,other,0,0,261,gmail.com,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19914,33405,Ms. Felicity Moore MD,ms.felicity.moore.md@gmail.com,2,2018-11-18,other,0,0,774,gmail.com,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19915,33406,Shirley Rowe,shirley.rowe@gmail.com,1,2019-03-12,br,0,0,660,gmail.com,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19916,33407,Jarrett Walker-Carroll,jarrett.walkercarroll@gmail.com,2,2019-09-09,in,0,0,479,gmail.com,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19917,33408,Tanja Herzog,tanja.herzog@gmail.com,2,2019-10-24,other,2,0,434,gmail.com,0.004598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Remove unnecessary columns.

In [7]:
# Removing unnecessary columns
df = leads_df \
    .drop(['mailchimp_id', 'user_full_name', 'user_email', 'optin_time', 'email_provider'], axis = 1)

df

Unnamed: 0,member_rating,country_code,tag_count,made_purchase,optin_days,tag_count_by_optin_day,tag_aws_webinar,tag_learning_lab,tag_learning_lab_05,tag_learning_lab_09,tag_learning_lab_11,tag_learning_lab_12,tag_learning_lab_13,tag_learning_lab_14,tag_learning_lab_15,tag_learning_lab_16,tag_learning_lab_17,tag_learning_lab_18,tag_learning_lab_19,tag_learning_lab_20,tag_learning_lab_21,tag_learning_lab_22,tag_learning_lab_23,tag_learning_lab_24,tag_learning_lab_25,tag_learning_lab_26,tag_learning_lab_27,tag_learning_lab_28,tag_learning_lab_29,tag_learning_lab_30,tag_learning_lab_31,tag_learning_lab_32,tag_learning_lab_33,tag_learning_lab_34,tag_learning_lab_35,tag_learning_lab_36,tag_learning_lab_37,tag_learning_lab_38,tag_learning_lab_39,tag_learning_lab_40,tag_learning_lab_41,tag_learning_lab_42,tag_learning_lab_43,tag_learning_lab_44,tag_learning_lab_45,tag_learning_lab_46,tag_learning_lab_47,tag_time_series_webinar,tag_webinar,tag_webinar_01,tag_webinar_no_degree,tag_webinar_no_degree_02
0,2,in,6,1,589,0.010169,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,4,other,0,1,773,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,other,0,1,773,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2,co,3,1,286,0.010453,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2,other,0,0,261,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19914,2,other,0,0,774,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19915,1,br,0,0,660,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19916,2,in,0,0,479,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19917,2,other,2,0,434,0.004598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Part 1: Examine a Pycaret Model

### Step 1: Load the pycaret model. 

Use `clf.load_model()` to get the Pycaret model provided in challenge_02/model_pycaret/xgb_model_tuned.pkl. Store it as `model_pycaret`.

In [8]:
# Enter your code here...
model_pycaret = clf.load_model('challenge_02/model_pycaret/xgb_model_tuned')
model_pycaret

Transformation Pipeline and Model Successfully Loaded


Pipeline(steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=['country_code'],
                                      display_types=False,
                                      ml_usecase='classification',
                                      numerical_features=['tag_count',
                                                          'tag_count_by_optin_day',
                                                          'tag_aws_webinar',
                                                          'tag_learning_lab',
                                                          'tag_learning_lab_05',
                                                          'tag_learning_lab_09',
                                                          'tag_learning_lab_11',
                                                          'tag_learning_lab_12',
                                                          'tag_learning_lab_13',
                                                       

### Step 2: Examine the Pipeline. 

Examine the elements in the Pipeline, `model_pycaret`, with a for-loop. Use something like: 

``` python
for index, item in enumerate(my_list):
    print(f"Index: {index}, Item: {item}")
```

In [9]:
# Enter your code here...
for index, item in enumerate(model_pycaret):
    print(f"Index: {index}, Item: {item}")


Index: 0, Item: DataTypes_Auto_infer(categorical_features=['country_code'], display_types=False,
                     ml_usecase='classification',
                     numerical_features=['tag_count', 'tag_count_by_optin_day',
                                         'tag_aws_webinar', 'tag_learning_lab',
                                         'tag_learning_lab_05',
                                         'tag_learning_lab_09',
                                         'tag_learning_lab_11',
                                         'tag_learning_lab_12',
                                         'tag_learning_lab_13',
                                         'tag_learning_lab_14',
                                         'tag_learning_lab...
                                         'tag_learning_lab_20',
                                         'tag_learning_lab_21',
                                         'tag_learning_lab_22',
                                         'tag_learning_

Based on the Pycaret pipeline, there are 2 main things to do to prepare the data:

1. Imputation
2. Categorical Encoding & Ordinal Encoding 

Reviewing the pycaret model, you'll need these columns to help identify what to do for the different feature types. 

In [10]:
categorical_features=['country_code']

#ordinal_features = {'member_rating':["1", "2", "3", "4", "5"]}

ordinal_features = ['member_rating']

numeric_features = ['tag_count',
 'tag_count_by_optin_day',
 'tag_aws_webinar',
 'tag_learning_lab',
 'tag_learning_lab_05',
 'tag_learning_lab_09',
 'tag_learning_lab_11',
 'tag_learning_lab_12',
 'tag_learning_lab_13',
 'tag_learning_lab_14',
 'tag_learning_lab_15',
 'tag_learning_lab_16',
 'tag_learning_lab_17',
 'tag_learning_lab_18',
 'tag_learning_lab_19',
 'tag_learning_lab_20',
 'tag_learning_lab_21',
 'tag_learning_lab_22',
 'tag_learning_lab_23',
 'tag_learning_lab_24',
 'tag_learning_lab_25',
 'tag_learning_lab_26',
 'tag_learning_lab_27',
 'tag_learning_lab_28',
 'tag_learning_lab_29',
 'tag_learning_lab_30',
 'tag_learning_lab_31',
 'tag_learning_lab_32',
 'tag_learning_lab_33',
 'tag_learning_lab_34',
 'tag_learning_lab_35',
 'tag_learning_lab_36',
 'tag_learning_lab_37',
 'tag_learning_lab_38',
 'tag_learning_lab_39',
 'tag_learning_lab_40',
 'tag_learning_lab_41',
 'tag_learning_lab_42',
 'tag_learning_lab_43',
 'tag_learning_lab_44',
 'tag_learning_lab_45',
 'tag_learning_lab_46',
 'tag_learning_lab_47',
 'tag_time_series_webinar',
 'tag_webinar',
 'tag_webinar_01',
 'tag_webinar_no_degree',
 'tag_webinar_no_degree_02',
 'optin_days']

# Part 2: Build a Scikit Learn Model



## Step 1: Prepare the Data

Create `X` and `y` from the categorical, ordinal and numeric features. The target is 'made_purchase'.

In [11]:
# Enter your code here...
X = df.drop('made_purchase', axis = 1)
y = df['made_purchase']

## Step 2: Make the Transformers (Preprocessor)

We will implement several key preprocessing steps:
1. `sklearn.impute.SimpleImputer()`
2. `sklearn.preprocessing.OrdinalEncoder()`
3. `sklearn.preprocessing.OneHotEncoder()`

### Step 2A: Create the Stacked Transformer

1. Use the `make_column_transformer()` function to stack 3 transformation steps:
   * A `SimpleImputer()` for the `numeric_features`
   * An `OrdinalEncoder()` for the `ordinal_features`.
   * A `OneHotEncoder()` that is applied to only the `categorical_features`
2. Make sure to set `remainder = 'passthrough'` to keep other columns. 
3. Store the output as `transformers`.
4. Test the transformation by running `pd.DataFrame(transformers.fit_transform(df))`

In [12]:
# Enter your code here...
transformers = make_column_transformer(
    (SimpleImputer(), numeric_features),
    (OrdinalEncoder(), ordinal_features),
    (OneHotEncoder(), categorical_features),
    remainder='passthrough'
)
#transformers
transformers.fit_transform(df)

array([[6.00000000e+00, 1.01694915e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.00000000e+00, 4.59770115e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
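The raw array makes it hard to see what each transformer did. The sketch below applies the same three transformers to a tiny made-up frame so the effect of each step is visible. The data and the names `toy`, `toy_transformers`, and `toy_out` are assumptions for illustration only, not part of the challenge data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up data, loosely shaped like the lead scoring columns.
toy = pd.DataFrame({
    "tag_count":     [6.0, np.nan, 3.0, 0.0],
    "member_rating": [2, 4, 2, 1],
    "country_code":  ["in", "other", "co", "other"],
})

toy_transformers = make_column_transformer(
    (SimpleImputer(), ["tag_count"]),        # NaN -> column mean
    (OrdinalEncoder(), ["member_rating"]),   # sorted categories -> 0, 1, 2, ...
    (OneHotEncoder(), ["country_code"]),     # one 0/1 column per country
    remainder="passthrough",
)

toy_out = toy_transformers.fit_transform(toy)
# The stacked result can be sparse when most values are zero, so densify if needed.
if hasattr(toy_out, "toarray"):
    toy_out = toy_out.toarray()
toy_out = pd.DataFrame(toy_out)
print(toy_out)
```

The output columns follow the order in which the transformers were listed: first the imputed numeric column, then the ordinal code, then one column per country (sorted alphabetically).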

## Step 3: Make the Scikit Learn Model

### Step 3A: Make a StratifiedKFold() 


Create a Stratified K-Fold Cross Validation Strategy by:
1. Using `StratifiedKFold()`
2. Setting key parameters: n_splits=5, shuffle=True, random_state=123
3. Store the output as `stratified_kfold`

In [13]:
# Enter your code here...
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
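As a quick check of what "stratified" means here, the sketch below splits a made-up imbalanced target and reports the share of positives in each validation fold. The arrays `X_toy` and `y_toy` are hypothetical stand-ins, not the lead scoring data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced target: 10% positives, similar in spirit to made_purchase.
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.zeros((len(y_toy), 1))   # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

# Share of positives in each validation fold
fold_rates = [y_toy[test_idx].mean() for _, test_idx in skf.split(X_toy, y_toy)]
print(fold_rates)
```

Each fold keeps the same 10% positive rate as the full target, which matters when purchases are rare and an unlucky plain split could leave a fold with very few positives.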



### Step 3B: Make the Grid Search Cross Validation


Next, make a `GridSearchCV()` using:
1. estimator = XGBClassifier()
2. param_grid = {'learning_rate': [0.1, 0.2, 0.35, 0.4, 0.5]}
3. cv = stratified_kfold
4. refit = True
5. scoring = "roc_auc"
6. Store the output as `grid_xgb`

In [14]:
# Enter your code here...
grid_xgb = GridSearchCV(
    estimator  = XGBClassifier(),
    param_grid = {'learning_rate': [0.1,0.2, 0.35, 0.4, 0.5]},
    cv         = stratified_kfold,
    refit      = True,
    scoring    = "roc_auc"
) 


### Step 3C: Make the Pipeline

Use the `make_pipeline()` function to combine the `transformers` and `grid_xgb` into a single pipeline. Store the object as `pipeline_grid_xgb`.

In [15]:
# Enter your code here...
pipeline_grid_xgb = make_pipeline(transformers, grid_xgb)


Next, fit the `pipeline_grid_xgb`. This will take 30 seconds or so.  

In [16]:
# Enter your code here...
pipeline_grid_xgb.fit(X, y) 


Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(),
                                                  ['tag_count',
                                                   'tag_count_by_optin_day',
                                                   'tag_aws_webinar',
                                                   'tag_learning_lab',
                                                   'tag_learning_lab_05',
                                                   'tag_learning_lab_09',
                                                   'tag_learning_lab_11',
                                                   'tag_learning_lab_12',
                                                   'tag_learning_lab_13',
                                                   'tag_learning_lab_14',
                                          

Note that I have stored this model in `challenge_02/model_sklearn`. You can load this model to proceed with the challenge if needed.

In [17]:
# import joblib
# joblib.dump(pipeline_grid_xgb, "challenge_02/model_sklearn/pipeline_grid_xgb.pkl")
# pipeline_grid_xgb = joblib.load("challenge_02/model_sklearn/pipeline_grid_xgb.pkl")

### Step 3D: Extract Information from the Grid Search CV

Extract the Best Parameters by:
1. Extracting the XGBoost Grid Search object from the pipeline: `pipeline_grid_xgb[1]`
2. Using the `.best_params_` attribute to return the results.
3. Storing the results as `best_params`.

In [18]:
# Enter your code here...
best_params = pipeline_grid_xgb[1].best_params_

Next, extract the CV results as a DataFrame by using: `pd.DataFrame(pipeline_grid_xgb[1].cv_results_)`

And store the result as `cv_results`.

In [19]:
# Enter your code here...
cv_results = pd.DataFrame(pipeline_grid_xgb[1].cv_results_)
cv_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,6.605837,1.02674,0.043769,0.01239,0.1,{'learning_rate': 0.1},0.796591,0.801273,0.836092,0.81178,0.815966,0.812341,0.01377,1
1,3.956447,1.02426,0.040236,0.009907,0.2,{'learning_rate': 0.2},0.780018,0.78473,0.815499,0.805747,0.807167,0.798632,0.013767,2
2,3.283359,0.186593,0.032873,0.012571,0.35,{'learning_rate': 0.35},0.753555,0.775321,0.798627,0.797505,0.796512,0.784304,0.017633,3
3,2.169538,2.225239,0.011203,0.008563,0.4,{'learning_rate': 0.4},0.75535,0.763089,0.7858,0.776463,0.782932,0.772727,0.011695,4
4,0.522773,0.009882,0.006401,0.00196,0.5,{'learning_rate': 0.5},0.744965,0.75073,0.781247,0.772833,0.775083,0.764972,0.014367,5


Extract the `mean_test_score` of the best model from the grid search run. Store the result as a single numeric value called `best_auc`.

In [20]:
# Enter your code here...
# Take the highest mean CV score instead of assuming the best model is in row 0
best_auc = cv_results['mean_test_score'].max()
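`GridSearchCV` also exposes the winning score directly as `best_score_`, which avoids depending on the row order of the results table. The sketch below shows this on made-up data, with `LogisticRegression` swapped in for XGBoost so it runs quickly. All `demo_*` names and the `C` grid are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the lead scoring data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=123)

demo_grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=123),
    refit=True,
    scoring="roc_auc",
)
demo_pipeline = make_pipeline(StandardScaler(), demo_grid).fit(X_demo, y_demo)

search = demo_pipeline[1]   # same indexing as pipeline_grid_xgb[1]
demo_cv_results = pd.DataFrame(search.cv_results_)

# best_score_ is the mean cross-validated score of the best parameter set
demo_best_auc = float(search.best_score_)
print(search.best_params_, demo_best_auc)
```

In the challenge, the equivalent would be `pipeline_grid_xgb[1].best_score_`, which should agree with the top `mean_test_score` in `cv_results`.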


Finally, predict with the pipeline to validate: `pipeline_grid_xgb.predict_proba(X)`

In [21]:
# Enter your code here...
pipeline_grid_xgb.predict_proba(X)


array([[0.880723  , 0.11927701],
       [0.93436354, 0.06563646],
       [0.98283577, 0.01716421],
       ...,
       [0.98630804, 0.01369198],
       [0.9826373 , 0.01736271],
       [0.9744798 , 0.02552019]], dtype=float32)

# Part 3: MLFlow

With the model and attributes in hand, we can now store the model for production and track it with MLFlow. 

In [22]:
# Make sure you have these objects before proceeding:
[
    pipeline_grid_xgb,
    best_params,
    cv_results,
    best_auc
]

[Pipeline(steps=[('columntransformer',
                  ColumnTransformer(remainder='passthrough',
                                    transformers=[('simpleimputer',
                                                   SimpleImputer(),
                                                   ['tag_count',
                                                    'tag_count_by_optin_day',
                                                    'tag_aws_webinar',
                                                    'tag_learning_lab',
                                                    'tag_learning_lab_05',
                                                    'tag_learning_lab_09',
                                                    'tag_learning_lab_11',
                                                    'tag_learning_lab_12',
                                                    'tag_learning_lab_13',
                                                    'tag_learning_lab_14',
                            

### Create the MLFlow Experiment

Create a new mlflow experiment:
1. Use `mlflow.set_tracking_uri("mlruns")` to set the active tracking folder to your mlruns folder in this project.
2. Create an `mlflow` experiment named, `'sklearn_lead_scoring_2'`. 
3. Use the `try except` strategy to create the experiment with `mlflow.create_experiment()` from the course. 

In [23]:
# Enter your code here...
EXPERIMENT_NAME = 'sklearn_lead_scoring_2'
mlflow.set_tracking_uri("mlruns")

try:
    mlflow.create_experiment(EXPERIMENT_NAME)
except Exception:
    # create_experiment() raises an error if the experiment name is already taken
    print(f"Experiment Name: {EXPERIMENT_NAME} already exists.")

### Set the Experiment and Start the Run

Use `mlflow.set_experiment()` to set the active experiment to your 'sklearn_lead_scoring_2' experiment. 

In [24]:
# Enter your code here...
mlflow.set_experiment('sklearn_lead_scoring_2')

<Experiment: artifact_location='mlruns/3', experiment_id='3', lifecycle_stage='active', name='sklearn_lead_scoring_2', tags={}>

Start the mlflow run. 

In [25]:
# Enter your code here...
mlflow.start_run()

<ActiveRun: >

Log the Model:
1. Use `mlflow_sklearn.log_model()`
2. The `sk_model` should be set to `pipeline_grid_xgb` from your GridSearchCV pipeline. 
3. The `artifact_path` should be set to "model"

In [27]:
# Enter your code here...
mlflow_sklearn.log_model(
 sk_model      = pipeline_grid_xgb,
 artifact_path = "model"    
 )

ModelInfo(artifact_path='model', flavors={'python_function': {'model_path': 'model.pkl', 'loader_module': 'mlflow.sklearn', 'python_version': '3.7.1', 'env': 'conda.yaml'}, 'sklearn': {'pickled_model': 'model.pkl', 'sklearn_version': '0.23.2', 'serialization_format': 'cloudpickle', 'code': None}}, model_uri='runs:/b6f1c2dda3d04fa3aa512dd64ea021bb/model', model_uuid='ab3cb07cf38040b2bbd10bbda80dce3a', run_id='b6f1c2dda3d04fa3aa512dd64ea021bb', saved_input_example_info=None, signature_dict=None, utc_time_created='2023-08-09 22:14:48.052185', mlflow_version='1.28.0')

Log the best AUC:
1. Use `mlflow.log_metric()` to log the AUC of the best model. 
2. Use the `best_auc` that was stored previously.

In [28]:
# Enter your code here...
mlflow.log_metric("AUC", best_auc) 


Set a Source Tag:
1. Use `mlflow.set_tag()` to create a "Source" tag. 
2. Tag the model `"sklearn_gridsearch_model"` so you know that scikit learn GridSearchCV produced this model.

In [30]:
# Enter your code here...
mlflow.set_tag("Source", "sklearn_gridsearch_model")



Save the CV Results dataframe as a CSV (this code is provided to eliminate confusion):

In [31]:
# Save CV Results as CSV

experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

exp_id = experiment.experiment_id

run_id = mlflow.active_run().info.run_id

try:
    cv_path = f'mlruns/{exp_id}/{run_id}/artifacts/model/cv_results.csv'
    cv_results.to_csv(cv_path, index=False) 
    print(f'Cross Validations results saved in {cv_path}')
except:
    print('Could not save Cross Validation as CSV.')


Cross Validations results saved in mlruns/3/b6f1c2dda3d04fa3aa512dd64ea021bb/artifacts/model/cv_results.csv


End the MLFlow Run:
1. Use `mlflow.end_run()` to stop the active run

In [32]:
# Enter your code here...
mlflow.end_run()

Run the MLFlow UI and get the prediction scripts (this code is provided to eliminate confusion):
1. Use `mlflow ui` if needed to find your run ID and production deployment script (it's in the model folder)
2. Update the `logged_model` with your run id.
3. Verify that predictions are made.

In [33]:
import mlflow
# logged_model = 'runs:/f28d5ca56170494b839215274ea318d0/model'
logged_model = 'runs:/b6f1c2dda3d04fa3aa512dd64ea021bb/model'

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)

# Make sure to use predict_proba!
loaded_model._model_impl.predict_proba(X)

array([[0.880723  , 0.11927701],
       [0.93436354, 0.06563646],
       [0.98283577, 0.01716421],
       ...,
       [0.98630804, 0.01369198],
       [0.9826373 , 0.01736271],
       [0.9744798 , 0.02552019]], dtype=float32)

# Conclusions

Congratulations! You've just:
1. Developed a Lead Scoring Model with Scikit Learn (0.81 mean AUC, which rivals H2O's model)
2. Used sklearn to stack advanced preprocessing strategies with GridSearch
3. Built a solution that translates into $1M+ savings for an organization with 100K email subscribers
4. Tracked the entire process for production in MLFlow: Models, Experiments, Cross Validation Results, and best AUC
5. Made it deployable through MLFlow