# Integrate MLFlow and Papermill to track ML experiments with DataRobot

This notebook outlines how to:

<ul><li>Use MLFlow with the DataRobot API to track and log machine learning experiments
<ul><li>Benefit: Consistent comparison of results across experiments.</li></ul></li>
<li>Use Papermill with the DataRobot API to create artifacts from machine learning experiments and reduce the effort needed for collaboration
<ul><li>Benefit: Automation of experiments to avoid errors and reduce manual effort.</li></ul></li>
<li>Execute Jupyter notebooks with parameters, like Python scripts</li>
<li>Loop through parameter combinations to run multiple projects and build a Model Factory</li></ul>

This orchestration notebook illustrates the framework to integrate MLFlow and Papermill with the DataRobot API to run the experiment notebook with different parameters per experiment. 

<font style="color:blue">This notebook will run the `experiment_notebook.ipynb` with different parameters</font><br>

Required Python Libraries:
<ul>
    <li><a href='https://docs.datarobot.com/en/docs/api/api-quickstart/index.html'>datarobot</a></li>
    <li><a href='https://mlflow.org/docs/latest/quickstart.html'>mlflow</a></li>
    <li><a href='https://papermill.readthedocs.io/en/latest/installation.html'>papermill</a></li>
    <li><a href='https://pypi.org/project/permetrics/'>permetrics</a></li>
</ul>




## Setup

### Import libraries

`uuid` is used to generate a unique identifier for each experiment.
`itertools` is used to generate every combination of the experiment parameter values.

In [1]:
import papermill as pm
import uuid
import itertools
import os

Use the snippet below to create requisite folders.

In [2]:
# Create the folder for executed notebook copies if it does not already exist
os.makedirs('./experiments_bkup', exist_ok=True)

### Configure use case settings 

These are the basic settings needed to run Time Series projects through the DataRobot API. These settings have to be updated for the intended use case.

In [3]:
DR_AUTH_YAML_FILE = '~/.config/datarobot/drconfig.yaml' # yaml file with authentication details
TRAINING_DATA = './DR_Demo_Sales_Multiseries_training (1).xlsx' # location of training dataset
DATE_COL = 'Date' # datetime column
TRAINING_STOP_DATE = '01-06-2014' # cutoff date for private holdout for experiments
TRAINING_STOP_DATE_FORMAT = '%d-%m-%Y' # datetime format specifier for TRAINING_STOP_DATE 
TARGET_COL = 'Sales' # target column for the use case
KIA_COLS = ['Marketing', 'Near_Xmas', 'Near_BlackFriday',
            'Holiday', 'DestinationEvent'] # known in advance features
IS_MULTISERIES = True # does the dataset have multiple time series
MULTISERIES_COLS = ['Store'] # if the dataset has multiple ts, columns that uniquely identify a ts.
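
Because `TRAINING_STOP_DATE` is passed as a string together with its format specifier, it can be worth confirming that the two agree before launching any project. The check below is a minimal sketch using only the standard library; it is an illustration and not part of the original workflow.

```python
from datetime import datetime

# Values copied from the settings cell above
TRAINING_STOP_DATE = '01-06-2014'
TRAINING_STOP_DATE_FORMAT = '%d-%m-%Y'

# strptime raises ValueError if the string does not match the format
stop_date = datetime.strptime(TRAINING_STOP_DATE, TRAINING_STOP_DATE_FORMAT)
print(stop_date.date())  # 2014-06-01, that is 1 June 2014 (day first), not 6 January
```

With the day-first format `%d-%m-%Y`, the cutoff is 1 June 2014. A month-first reading of the same string would silently move the holdout boundary.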

### Scenario

Time Series projects call for many experiments. The most basic ones include experimenting with multiple feature derivation windows and enabling known in advance features. These two parameters alone can result in at least six different experiments, as shown by the example in the cells below.

### First experiment series set

This example starts with a basic set of experiments to quickly identify whether the dataset has any signal. You will use a combination of feature derivation windows and known in advance features to do so.

In [4]:
fdws = [35, 70, 14] # The Time Series feature derivation window parameter values to experiment with
kias = [False, True] # The known in advance parameter values to experiment with
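
The number of experiments these two lists produce can be checked before any project is started. The sketch below repeats the two lists so that it runs on its own and uses `itertools.product`, which yields the Cartesian product of its inputs.

```python
import itertools

fdws = [35, 70, 14]    # feature derivation windows
kias = [False, True]   # known in advance on or off

# Every (FDW, KIA) pair, in the order the experiments would be launched
combos = list(itertools.product(fdws, kias))
print(len(combos))  # 6

for fdw, kia in combos:
    print(f"FDW={fdw}, KIA={kia}")
```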

Run one project for every combination of values from the two parameter lists above. This can be seen as a <b>"DataRobot Project Factory"</b> in which Papermill runs multiple projects. Papermill lets you pass parameters to a Jupyter notebook and execute it with those parameters. It also saves a copy of each executed notebook to the folder you specify.

In [5]:
INPUT_PATH = './experiment_notebook.ipynb'
for item in itertools.product(fdws,kias):
    UUID = str(uuid.uuid1())
    OUTPUT_PATH = './experiments_bkup/experiment_{}.ipynb'.format(UUID)
    pm.execute_notebook(input_path=INPUT_PATH,
                        output_path=OUTPUT_PATH,
                        parameters={'FDW':item[0],
                                    'KIA':item[1], 
                                    'UUID':UUID, 
                                    'DR_AUTH_YAML_FILE':DR_AUTH_YAML_FILE,
                                    'TRAINING_DATA':TRAINING_DATA,
                                    'DATE_COL':DATE_COL,
                                    'TRAINING_STOP_DATE':TRAINING_STOP_DATE,
                                    'TRAINING_STOP_DATE_FORMAT':TRAINING_STOP_DATE_FORMAT,
                                    'TARGET_COL':TARGET_COL,
                                    'KIA_COLS':KIA_COLS,
                                    'IS_MULTISERIES':IS_MULTISERIES,
                                    'MULTISERIES_COLS':MULTISERIES_COLS,
                                    'REFERENCE_NOTEBOOK':OUTPUT_PATH})

Executing:   0%|          | 0/25 [00:00<?, ?cell/s]
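
Papermill overrides values by injecting a new cell directly after the cell tagged `parameters` in the input notebook. The contents of `experiment_notebook.ipynb` are not shown here, so the following is only a sketch of what such a tagged cell could look like. The parameter names are taken from the `pm.execute_notebook` call above; every default value is a placeholder assumption.

```python
# Hypothetical cell tagged "parameters" in experiment_notebook.ipynb.
# Values passed through pm.execute_notebook(..., parameters={...}) are
# injected in a cell after this one and therefore override these defaults.
FDW = 35                       # feature derivation window (placeholder default)
KIA = False                    # enable known in advance features
UUID = 'manual-run'            # placeholder identifier for interactive runs
DR_AUTH_YAML_FILE = '~/.config/datarobot/drconfig.yaml'
TRAINING_DATA = './DR_Demo_Sales_Multiseries_training (1).xlsx'
DATE_COL = 'Date'
TRAINING_STOP_DATE = '01-06-2014'
TRAINING_STOP_DATE_FORMAT = '%d-%m-%Y'
TARGET_COL = 'Sales'
KIA_COLS = ['Marketing', 'Near_Xmas', 'Near_BlackFriday', 'Holiday', 'DestinationEvent']
IS_MULTISERIES = True
MULTISERIES_COLS = ['Store']
REFERENCE_NOTEBOOK = ''        # path of the executed copy, filled in by the orchestrator
```

Keeping defaults in the tagged cell means the experiment notebook can still be run interactively without Papermill.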

### Experiment results

After the experiments above complete, open the MLFlow dashboard to review the results. Run the cell below, or run its command from the command line, to start the MLFlow server and UI.

In [6]:
# This cell blocks while the server runs; interrupt it before running the next cells
!mlflow ui

[2022-12-07 15:47:24 +0530] [20341] [INFO] Starting gunicorn 20.1.0
[2022-12-07 15:47:24 +0530] [20341] [INFO] Listening at: http://127.0.0.1:5000 (20341)
[2022-12-07 15:47:24 +0530] [20341] [INFO] Using worker: sync
[2022-12-07 15:47:24 +0530] [20345] [INFO] Booting worker with pid: 20345
^C
[2022-12-07 15:58:55 +0530] [20341] [INFO] Handling signal: int
[2022-12-07 15:58:55 +0530] [20345] [INFO] Worker exiting (pid: 20345)
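
Because `!mlflow ui` blocks the notebook until it is interrupted, one alternative is to start the UI as a background process from Python. The sketch below assumes the `mlflow` command is available on the PATH; the helper names `build_ui_command` and `start_ui` are hypothetical and exist only for this illustration.

```python
import subprocess

def build_ui_command(port=5000):
    """Return the command line that starts the MLflow UI on the given port."""
    return ["mlflow", "ui", "--port", str(port)]

def start_ui(port=5000):
    """Start the MLflow UI in the background and return the process handle."""
    return subprocess.Popen(build_ui_command(port))

# Usage (commented out so that nothing is launched on import):
# ui = start_ui()    # the notebook stays responsive
# ...                # browse http://127.0.0.1:5000
# ui.terminate()     # stop the server when finished
# ui.wait()
```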


### Further experimentation

Once you are comfortable with the initial set of experiments and results, you can expand the experiment combinations as shown below. The advantage of parameterizing the notebook is that you can run only the experiments you need and keep building on the experiments you have already run. <br>For example, the accuracy-optimized blueprints setting is False by default, so the experiments in the prior cells already cover that case. You can save time and compute by using only the True option for this parameter in subsequent experiments.
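
The idea of running only what is still missing can be sketched as a filter over the candidate grid. The example assumes that the first series ran with accuracy-optimized blueprints left at False; the names `completed`, `candidates`, and `pending` are illustrative and do not appear in the original notebook.

```python
import itertools

# (FDW, KIA, ACC_OPT) combinations assumed to be covered by the first series
completed = {(fdw, kia, False)
             for fdw, kia in itertools.product([35, 70, 14], [False, True])}

# Candidate grid for the next round, including both ACC_OPT options
candidates = itertools.product([35, 14], [False], [False, True])

# Keep only the combinations that have not been run yet
pending = [combo for combo in candidates if combo not in completed]
print(pending)  # [(35, False, True), (14, False, True)]
```

Only the two combinations with accuracy-optimized blueprints enabled remain.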

In [7]:
# Import datarobot library for the enums
import datarobot as dr

In [8]:
fdws = [35, 14] # Time Series feature derivation window parameter values to experiment with
kias = [False] # Known in advance parameter values to experiment with
acc_opt = [True] # Enable accuracy-optimized blueprints
search_int = [True] # Search for interactions between features
mode = [dr.enums.AUTOPILOT_MODE.FULL_AUTO] # Autopilot mode values to experiment with

In [9]:
INPUT_PATH = './experiment_notebook.ipynb'
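# Note: item[3] (search_int) and item[4] (mode) are generated here but are not passed to the notebook below
for item in itertools.product(fdws, kias, acc_opt, search_int, mode):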
    UUID = str(uuid.uuid1())
    OUTPUT_PATH = './experiments_bkup/experiment_{}.ipynb'.format(UUID)
    pm.execute_notebook(input_path=INPUT_PATH,
                        output_path=OUTPUT_PATH,
                        parameters={'FDW':item[0],
                                    'KIA':item[1], 
                                    'ACC_OPT':item[2], 
                                    'UUID':UUID,
                                    'DR_AUTH_YAML_FILE':DR_AUTH_YAML_FILE,
                                    'TRAINING_DATA':TRAINING_DATA,
                                    'DATE_COL':DATE_COL,
                                    'TRAINING_STOP_DATE':TRAINING_STOP_DATE,
                                    'TRAINING_STOP_DATE_FORMAT':TRAINING_STOP_DATE_FORMAT,
                                    'TARGET_COL':TARGET_COL,
                                    'KIA_COLS':KIA_COLS,
                                    'IS_MULTISERIES':IS_MULTISERIES,
                                    'MULTISERIES_COLS':MULTISERIES_COLS,
                                    'REFERENCE_NOTEBOOK':OUTPUT_PATH})

Executing:   0%|          | 0/25 [00:00<?, ?cell/s]
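
Both loops above repeat the same shared settings in every `pm.execute_notebook` call. One way to reduce that duplication is a small helper that builds the output path and the parameter dictionary for a single run. The function `build_run` below is a hypothetical sketch, not part of the original notebook, and the shared settings are abbreviated.

```python
import uuid

def build_run(shared_settings, experiment_params, output_dir='./experiments_bkup'):
    """Return (output_path, parameters) for one Papermill execution."""
    run_id = str(uuid.uuid1())
    output_path = '{}/experiment_{}.ipynb'.format(output_dir, run_id)
    # Experiment-specific values are merged over the shared settings
    parameters = {**shared_settings, **experiment_params,
                  'UUID': run_id, 'REFERENCE_NOTEBOOK': output_path}
    return output_path, parameters

shared = {'TARGET_COL': 'Sales', 'DATE_COL': 'Date'}  # abbreviated for the example
output_path, parameters = build_run(shared, {'FDW': 35, 'KIA': False})

# The result would then be passed on unchanged:
# pm.execute_notebook(input_path=INPUT_PATH, output_path=output_path, parameters=parameters)
```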

In [12]:
!mlflow ui

[2022-12-07 17:04:35 +0530] [45452] [INFO] Starting gunicorn 20.1.0
[2022-12-07 17:04:35 +0530] [45452] [INFO] Listening at: http://127.0.0.1:5000 (45452)
[2022-12-07 17:04:35 +0530] [45452] [INFO] Using worker: sync
[2022-12-07 17:04:35 +0530] [45457] [INFO] Booting worker with pid: 45457
^C
[2022-12-07 17:05:36 +0530] [45452] [INFO] Handling signal: int
[2022-12-07 17:05:36 +0530] [45457] [INFO] Worker exiting (pid: 45457)
