# BLS MLflow Example Notebook 1: Scikit-learn & ONET
*Remy Stewart, BLS Civic Digital Fellow Summer 2022*


# 1.0 Introduction

This notebook provides a walkthrough of how to integrate MLflow into machine learning workflows aligned with comparable data science projects at the BLS. It highlights general MLflow features such as setting experiments & tracking model runs as well as parameter, metric, and artifact logging. It additionally demonstrates methods from `mlflow.sklearn`, MLflow's module designed for supporting Scikit-learn based models. For a general overview of the MLflow platform, please refer to this repository's BLS MLflow Documentation.   

## 1.1 Data

This example draws from public data sourced from the Occupational Information Network's (ONET) occupational requirements content module. ONET characterizes occupations across diverse industries by identifying the General Work Activities (**GWAs**) associated with different careers and the assorted work **tasks** that can characterize potentially multiple GWAs. The 2020 ONET occupational requirements module distinguishes 37 GWAs linked to 17,119 unique tasks. For example, the 4A3b1 GWA code representing the work activity of "Interacting with Computers" is affiliated with 259 specific tasks within ONET, such as the tasks of "Modifying existing programs to enhance efficiency" and "Perform and direct Website updates". 


## 1.2 Model Goals 
Our goal is to build a supervised classifier that can robustly identify which GWAs a given task is affiliated with. This use case features some important characteristics we'll need to account for when incorporating MLflow into our model pipeline. This is a *multi-class* problem in that there are 37 potential GWAs a task can be classified with. It is additionally a *multi-label* problem in that tasks can be referenced by multiple GWAs rather than each task only being affiliated with a single GWA. 

# 2.0 Set-Up and Connecting to MLflow 
Bearing these modeling goals in mind, let's start with our necessary model preparation steps and establish our connection to the MLflow server. 


## 2.1 Importing Packages

First, uncomment the code below if you're running this script directly and don't have MLflow installed within your environment. If you're currently using the bls-mlflow conda virtual environment provided within this repository, then MLflow will already be installed. 

In [None]:
# !pip install mlflow

We'll use a standard set of packages for a Scikit-learn based ML pipeline. We'll import MLflow directly as well as two of MLflow's model flavors with `mlflow.pyfunc` being its generic base flavor that specialized flavors such as `mlflow.sklearn` build from. We'll also import the `MlflowClient` from the `mlflow.tracking` module which will facilitate our interactions with the MLflow server directly within the notebook. 

We will additionally retrieve a set of helper functions designed to streamline our modeling pipeline's core steps including text preprocessing, splitting data into training and testing sets, vectorizing the work task text, and computing multi-class metrics. None of these functions draw on MLflow directly and they are reviewable from the `helpers.py` file within the same repository folder as this notebook.

In [1]:
# Standard Modules 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import pickle
import tempfile
from pprint import pprint

# Turning off UserWarnings since the penalty='none' warning will trigger many times across the logistic 
# regression grid search.
import warnings
warnings.filterwarnings("ignore", category = UserWarning)

# Sklearn Modules
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV

# MLFlow Modules
import mlflow
import mlflow.pyfunc
import mlflow.sklearn
from mlflow.tracking import MlflowClient

# Helper functions
import pyfiles

%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     /ext/home/stewart_r/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2.2 Connecting to MLflow Server 

We can then connect to the server hosting the BLS MLflow platform by passing the following address through `mlflow.set_tracking_uri`.

In [None]:
mlflow.set_tracking_uri("http://<Remote IP>:<Port>")
mlflow.get_tracking_uri()

## 2.3 Setting Experiments & Model Runs

An experiment in MLflow acts as a named project under which multiple runs of a model are stored. If you pass a name to `mlflow.set_experiment` that isn't already catalogued within MLflow, it'll create a new experiment for you automatically. 

In [3]:
mlflow.set_experiment('sklearn-onet-experiment')

<Experiment: artifact_location='/ext/mlflow-artifacts/1', experiment_id='1', lifecycle_stage='active', name='sklearn-onet-experiment', tags={}>

With our experiment's name now established, we're ready to start recording our workflow as a model run with MLflow. Calling `mlflow.start_run` begins timing the run until `mlflow.end_run` is called, so you can place these calls strategically to record runtimes across your full modeling script. A model run will appear in the MLflow UI as soon as you start the run and will be reported as "UNFINISHED" until you end it. 

If you call an MLflow logging function without having started a run first, MLflow will automatically start a run for you. The baseline configuration for MLflow is to have only one active run at a time per user connecting to an experiment. It's therefore easy to accidentally start a run without realizing you've already done so, particularly when you're initially integrating MLflow into your code. You can optionally add a `run_name` to label your run within the Tracking Server, but the run will be assigned a unique hash code as its underlying ID either way.

In [None]:
run = mlflow.start_run(run_name="test_run_1")
print(f"Started run {run.info.run_id}")

# 3.0 Data Preprocessing 
Let's now load in our ONET data and prepare it for our classifier. We'll draw from MLflow to log relevant information regarding our preprocessing such as metadata on our data sets we're using and our text vectorizer.   

In [5]:
onet_df = pd.read_parquet("../data/onet_task_gwa.pqt")
onet_df = onet_df[['Task', 'GWA']]
onet_df

Unnamed: 0,Task,GWA
0,"Review and analyze legislation, laws, or publi...",4A2a4
1,"Review and analyze legislation, laws, or publi...",4A4b6
2,Direct or coordinate an organization's financi...,4A4b4
3,"Confer with board members, organization offici...",4A4a2
4,Analyze operations to evaluate performance of ...,4A2a4
...,...,...
23009,Unload cars containing liquids by connecting h...,4A3a2
23010,Copy and attach load specifications to loaded ...,4A1b1
23011,Start pumps and adjust valves or cables to reg...,4A3a3
23012,"Perform general warehouse activities, such as ...",4A1b3


Our preprocessing helper function cleans up the task text by removing punctuation and stopwords along with creating multi-label encodings for each task regarding its affiliation with the 37 GWAs. You can see its resulting data frame transformation as follows: 

In [6]:
onet_df = pyfiles.helpers.preprocess(onet_df)
onet_df.head()

  onet_df['Task'] = onet_df['Task'].str.replace(r'[^\w\s]+', '')


Unnamed: 0,Task,4A1a1,4A1a2,4A1b1,4A1b2,4A1b3,4A2a1,4A2a2,4A2a3,4A2a4,...,4A4a6,4A4a7,4A4a8,4A4b3,4A4b4,4A4b5,4A4b6,4A4c1,4A4c2,4A4c3
0,Accept check containers mail large volume mail...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,Accept check containers mail parcels large vol...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,Accept credit applications verify credit refer...,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Accept music requests event guests,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,Accept payment accounts,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
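
Since `helpers.py` isn't reproduced in this notebook, here is a minimal, standard-library sketch of the kind of transformation the preprocessing step performs: stripping punctuation, dropping stopwords, and collapsing repeated task rows into one 0/1 indicator row per task. The function names, the tiny stopword set, and the toy rows are assumptions made for illustration, not the actual helper code.

```python
import re

# Hypothetical stand-in for the stopword list used by the real helper
STOPWORDS = {"and", "or", "to", "the", "of", "a", "an", "by"}

def clean_task(text):
    """Strip punctuation, then drop stopwords (case-insensitive)."""
    no_punct = re.sub(r"[^\w\s]+", "", text)
    return " ".join(w for w in no_punct.split() if w.lower() not in STOPWORDS)

def multilabel_encode(pairs):
    """Collapse (task, gwa) rows into one 0/1 indicator row per unique task."""
    labels = sorted({gwa for _, gwa in pairs})
    encoded = {}
    for task, gwa in pairs:
        # Create an all-zero row the first time a task is seen
        row = encoded.setdefault(clean_task(task), dict.fromkeys(labels, 0))
        row[gwa] = 1
    return labels, encoded

# Toy rows; the label assignments are illustrative only
pairs = [
    ("Modify existing programs to enhance efficiency.", "4A3b1"),
    ("Modify existing programs to enhance efficiency.", "4A2a4"),
    ("Perform and direct Website updates.", "4A3b1"),
]
labels, encoded = multilabel_encode(pairs)
for task, row in encoded.items():
    print(task, [row[label] for label in labels])
```

The first toy task ends up with two positive labels, which is exactly the multi-label situation our classifier has to handle.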


A relevant metadata attribute we may be interested in tracking is the shape of this transformed data frame (its number of rows and columns) before partitioning it into training and testing sets. We can use the pandas `shape` attribute to obtain the data frame's shape and store the returned tuple as a parameter logged within our model run. 

In [7]:
mlflow.log_param("Dataset Shape", onet_df.shape)

## 3.1 Logging Data Vectorizers & Train/Test Splits 

We'll now split our data into 78% of the sample devoted to training, 20% for testing, and 2% to generate predictions with after successfully registering our model into MLflow. We additionally employ Scikit-learn's `TfidfVectorizer` in the following helper function to transform the task text to be read by our model. 
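
To make the 78/20/2 partition concrete, here is a rough sketch of a seeded three-way split using only the standard library. The function name, the default fractions as arguments, and the reuse of `607` as a seed are assumptions for illustration; the actual `split_and_encode` helper may be implemented differently.

```python
import random

def three_way_split(items, test_frac=0.20, predict_frac=0.02, seed=607):
    """Shuffle a copy of items and slice it into train / test / predict lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_frac)
    n_predict = round(len(shuffled) * predict_frac)
    test = shuffled[:n_test]
    predict = shuffled[n_test:n_test + n_predict]
    train = shuffled[n_test + n_predict:]
    return train, test, predict

train, test, predict = three_way_split(range(1000))
print(len(train), len(test), len(predict))
```

Fixing the seed keeps the three partitions identical between runs, which matters here because the fitted vectorizer depends on which rows land in the training set.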

Since TF-IDF is computed from the training data, which can vary based on training data split sizes, data record shuffling, and/or different random seeds, it'll be helpful to log our generated `TfidfVectorizer` within our model run. Vectorizers can be saved as pickle files, which MLflow treats as model artifacts. MLflow's `log_artifact` function logs an artifact directly from a local file, so we'll save our pickled vectorizer within a temporary local directory and log it from there. This approach can be easily adapted for other common encoders, vectorizers, and scalers within Scikit-learn and other ML libraries. 

In [9]:
task_train, task_test, gwa_train, gwa_test, task_predict, tfidf_vectorizer = pyfiles.helpers.split_and_encode(onet_df, "Task")

tempdir = tempfile.mkdtemp()
vectorizer_pickled = os.path.join(tempdir, "tfidf_vectorizer.pkl")
with open(vectorizer_pickled, 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)
mlflow.log_artifact(vectorizer_pickled)
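
If you'd prefer the temporary directory to be cleaned up automatically, one possible variation on the pickling step above is to use `tempfile.TemporaryDirectory` as a context manager. The sketch below uses a plain dictionary as a stand-in for the fitted vectorizer (an assumption, so that the snippet runs without Scikit-learn) and only marks with a comment where the `mlflow.log_artifact` call would go.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted TfidfVectorizer; any picklable object behaves the same way
fitted_object = {"vocabulary": {"accept": 0, "payment": 1}, "lowercase": True}

with tempfile.TemporaryDirectory() as tempdir:
    path = os.path.join(tempdir, "tfidf_vectorizer.pkl")
    with open(path, "wb") as f:
        pickle.dump(fitted_object, f)
    # mlflow.log_artifact(path) would be called here, while the file still exists
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == fitted_object)  # the round trip preserves the object
print(os.path.exists(path))       # the temporary file is gone after the block
```

The artifact has to be logged inside the `with` block, since the directory and everything in it is deleted as soon as the block exits.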

Logging the vectorizer ensures that we can later retrieve it with our saved models and therefore process unseen text through the same previously fitted TF-IDF vectorizer that the model was originally trained on. We may also be interested in logging the size of the split data frames themselves, particularly our training and testing set for future reference within MLflow to understand how much data our run was trained on and subsequently tested on to compute its final performance metrics. 

In [10]:
mlflow.log_param("Training Data Size", gwa_train.shape[0])

In [11]:
mlflow.log_param("Testing Data Size", gwa_test.shape[0])

# 4.0 Model Logging

We now have our data ready for our first model training and testing sequence. The following code block is where the majority of the MLflow code incorporation occurs. It's a long code chunk, so let's break it down into pieces regarding both the model architecture itself and notes on successfully logging the function's steps into MLflow.


## 4.1 Model Framework

Our algorithm of choice will be Scikit-learn's `LogisticRegression` classifier. We'll tackle our multi-class multi-label prediction goal through a "one-vs-all" approach where we train 37 binary classifiers to individually predict whether a task is linked to their respective GWA. This is facilitated by the `for gwa in gwa_train.columns` loop. 

Within a one-vs-all model design there's the chance that none of the 37 classifiers predict that a given task belongs to their respective GWA. We have prior knowledge that all tasks are affiliated with at least one GWA, so in cases where none of the predicted probabilities generated by the classifiers exceed a 0.5 threshold, we want to take the highest probability overall as the one predicted GWA classification. The `helpers.force_prediction` function achieves this for us.
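
The forcing rule described above can be sketched in a few lines of plain Python. This is an illustrative version that works on lists of probabilities; the real `force_prediction` helper operates on a data frame and may differ in its details.

```python
def force_prediction(prob_rows, threshold=0.5):
    """Binarize each row of probabilities, guaranteeing at least one positive label."""
    forced = []
    for row in prob_rows:
        labels = [1 if p > threshold else 0 for p in row]
        if not any(labels):
            # No classifier cleared the threshold: keep the single most likely GWA
            labels[row.index(max(row))] = 1
        forced.append(labels)
    return forced

rows = [
    [0.10, 0.80, 0.60],  # two GWAs clear the threshold
    [0.20, 0.45, 0.30],  # none do, so the 0.45 column is forced to 1
]
print(force_prediction(rows))  # [[0, 1, 1], [0, 1, 0]]
```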


## 4.2 Parameters & Tags

When we call `mlflow.log_params` initially on our model's **parameters**, we retrieve the parameters we set originally for our classifier such as `class_weight='balanced'` and `solver='lbfgs'` along with the default hyperparameter values for any we didn't configure ourselves. 

Hyperparameter tuning is a common use case for MLflow run tracking, so we're including in our model training a small search of whether we'd like to include L2 regularization and the strength of said regularization through the `C` argument via `GridSearchCV`. We'll want to store the best parameters identified through the cross-validation search for our 37 models as additional parameters within MLflow.

MLflow can record multiple parameters at once if they're passed through a dictionary such as `best_cv_params`, in which we log each GWA model as keys and their identified best performing parameters as values. Because we're performing a grid search on a nested model (the logistic regression model itself sits inside the `OneVsRestClassifier`), we'll need to add `estimator__` in front of all of our parameter names in the grid search dictionary so that the parameters are passed to the logistic regression estimator rather than the `OneVsRestClassifier`, which would otherwise raise an incompatible argument error.
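
The two dictionary manipulations mentioned here are easy to get wrong by hand, so the sketch below shows one way to build the prefixed grid and, as an optional variation, to flatten the nested best-parameter dictionary into one key per GWA and parameter. Both helper names and the example best-parameter values are hypothetical; the model function in this notebook logs `best_cv_params` as-is.

```python
def prefix_grid(grid, prefix="estimator__"):
    """Prefix every grid key so GridSearchCV routes it to the wrapped estimator."""
    return {f"{prefix}{name}": values for name, values in grid.items()}

def flatten_best_params(best_cv_params, sep="."):
    """Turn {gwa: {param: value}} into a flat {"gwa.param": value} dictionary."""
    return {
        f"{gwa}{sep}{param}": value
        for gwa, params in best_cv_params.items()
        for param, value in params.items()
    }

grid_params = prefix_grid({"C": [1, 10, 100], "penalty": ["none", "l2"]})
print(grid_params)

# Illustrative values only, not results from an actual grid search
best_cv_params = {
    "4A1a1": {"estimator__C": 10, "estimator__penalty": "l2"},
    "4A1a2": {"estimator__C": 1, "estimator__penalty": "none"},
}
print(flatten_best_params(best_cv_params))
```

Flattening gives each tuned value its own parameter entry, which can be easier to sort and compare across runs than a single stringified dictionary per GWA.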
    
Additionally, **Model tags** allow you to add free-form text to your recorded model run with a corresponding label. The following function demonstrates how to set an example tag within MLflow that directly follows the model parameter logging. 
    

## 4.3 Metrics

Having identified our best-fitted models following training and tuning, our script then logs a variety of model performance **metrics**. `precision_recall_fscore_support` produces four values that need to be passed as a dictionary through `mlflow.log_metrics`, similar to `mlflow.log_params`. We therefore draw on another helper function from `helpers.py` to convert both of our variations of `precision_recall_fscore_support`'s output to a format ready for MLflow logging. We additionally record Hamming score and Hamming loss, two less well-known metrics that are well suited to measuring accuracy within multi-label classification tasks and that draw on the concept of [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance). 
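
For readers without `helpers.py` at hand, here is a hedged, pure-Python sketch of what these metric helpers might look like. The key naming scheme in `prf_to_dict` is an assumption, and `hamming_score` follows one common definition (per-task intersection over union of true and predicted labels); the actual helpers may differ.

```python
def prf_to_dict(prf, suffix):
    """Name the precision/recall/F-score values; support is None when averaging."""
    names = ("precision", "recall", "fscore", "support")
    return {f"{name}_{suffix}": float(value)
            for name, value in zip(names, prf) if value is not None}

def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / (len(y_true) * len(y_true[0]))

def hamming_score(y_true, y_pred):
    """Mean per-task overlap: |true AND predicted| / |true OR predicted|."""
    scores = []
    for tr, pr in zip(y_true, y_pred):
        both = sum(1 for t, p in zip(tr, pr) if t and p)
        either = sum(1 for t, p in zip(tr, pr) if t or p)
        scores.append(both / either if either else 1.0)
    return sum(scores) / len(scores)

# Toy labels: the first task misses one of its two true GWAs
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
print(prf_to_dict((0.75, 0.5, 0.6, None), "micro"))
print(hamming_loss(y_true, y_pred))   # 1 wrong slot out of 6
print(hamming_score(y_true, y_pred))  # (0.5 + 1.0) / 2 = 0.75
```

Hamming loss counts every label slot equally, so it stays low when most of the 37 columns are correctly zero, while the Hamming score only looks at the labels that are actually in play for each task.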


## 4.4 Artifacts

`log_model` following our hyperparameter tuning generates an **artifact** directory for us that includes the pickled model file, our model dependencies, and our conda virtual environment configuration. We pass our customized conda environment yaml file with `conda_env`. MLflow will use a default environment file for Scikit-learn models if no environment is specified when logging the model. We use `infer_signature` to identify the associated inputs (TF-IDF vectorized text) and outputs (binary predictions across the 37 models) of our model and include this signature as an argument to `log_model`. We additionally record our model within MLflow's Model Registry by setting a label within the `registered_model_name` argument. 

There are a few additional artifacts we may be interested in saving within our model run as well. We generate a matplotlib bar chart representing the counts of tasks assigned to different numbers of GWAs, and record its associated png file through `log_figure`. We also record this Jupyter Notebook directly as well as our helper function Python file via `log_artifact`.
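
The quantity behind that bar chart is simply how many tasks received one, two, three, or more predicted GWAs. As a small illustration with made-up predictions, the same tally can be computed without pandas:

```python
from collections import Counter

# Toy binary predictions: one row per task, one column per GWA
predictions = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
]
labels_per_task = [sum(row) for row in predictions]
counts = Counter(labels_per_task)
for n_labels, n_tasks in sorted(counts.items()):
    print(f"{n_labels} GWA(s): {n_tasks} task(s)")
```

In the model function, `predicted_prob_df.sum(axis=1)` followed by `value_counts()` plays the same role before the result is plotted.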

With all of the MLflow functions in the following script now overviewed, let's move forward with running our model.

In [16]:
def model_run(task_train, task_test, gwa_train, gwa_test):
    
    # Setting grid search parameters that will be logged by MLflow, storing predicted probability results and
    # best parameters after cross validation for all GWAs.
    grid_params = {"estimator__C":[1, 10, 100], "estimator__penalty":["none","l2"]}
    predicted_prob = {}
    best_cv_params = {}
        
    # Instantiate model, log model parameters, and set a model tag.
    logreg = OneVsRestClassifier(LogisticRegression(class_weight='balanced', solver='lbfgs', 
                                                    max_iter=300, random_state=607))
    cross_val = GridSearchCV(logreg, grid_params, scoring='f1_micro', cv=3)
    model_parameters = cross_val.get_params()
    mlflow.log_params(model_parameters)
    mlflow.set_tag("Example Tag", "This is a test tag.")
  
    for gwa in gwa_train.columns:
        # Fit logistic reg in One vs All multiclass multilabel task
        cross_val.fit(task_train, gwa_train[gwa])
            
        # Add to best parameters dictionary
        gwa_parameters = cross_val.best_params_
        best_cv_params[gwa] = gwa_parameters 
        
        # Estimate with best performing model
        final_model = cross_val.best_estimator_
        probabilities = [x[1] for x in final_model.predict_proba(task_test)]
        predicted_prob[gwa] = probabilities 
    
    # Ensure a GWA prediction for each task, log the best performing grid search parameters per model, generate an 
    # inferred model signature based on inputs & outputs, and log the model in MLflow's Model Registry. 
    mlflow.log_params(best_cv_params)
    predicted_prob_df = pyfiles.helpers.force_prediction(pd.DataFrame.from_dict(predicted_prob))
    signature = mlflow.models.infer_signature(task_train, predicted_prob_df)
    mlflow.sklearn.log_model(cross_val, "logreg_model", conda_env="../conda.yaml", 
                             registered_model_name="sklearn_onet", signature=signature)
    
    # MLflow requires multiple parameters within log_metrics to be passed as a dictionary. Review helpers.py for an
    # example function designed for precision_recall_fscore_support output.
        
    # scikit-learn metrics expect the true labels first, then the predictions
    prf_micro = pyfiles.helpers.prf_to_dict(metrics.precision_recall_fscore_support(gwa_test, predicted_prob_df,
                                                                            average = 'micro'), 'micro')
    prf_sample =  pyfiles.helpers.prf_to_dict(metrics.precision_recall_fscore_support(gwa_test, predicted_prob_df,
                                                                              average = 'samples'), 'sample')
    
    accuracy_score = metrics.accuracy_score(predicted_prob_df, gwa_test)
    hamming_score_result =  pyfiles.helpers.hamming_score(predicted_prob_df, gwa_test)
    hamming_loss = metrics.hamming_loss(predicted_prob_df, gwa_test)
        
    # Logging all associated metrics into MLflow. Note the argument differences between log_metric and log_metrics.
    mlflow.log_metric("Accuracy", accuracy_score)
    mlflow.log_metrics(prf_micro)
    mlflow.log_metrics(prf_sample)
    mlflow.log_metric("Hamming Score", hamming_score_result)
    mlflow.log_metric("Hamming Loss", hamming_loss)
        
    # Logging classification label counts figure, current notebook, and helper Python file as artifacts
    row_sums = predicted_prob_df.sum(axis=1)
    row_sum_plot = row_sums.value_counts().plot(kind='bar')
    row_sum_plot = row_sum_plot.get_figure()
    mlflow.log_figure(row_sum_plot, "row_sum_plot.png")
    mlflow.log_artifact("sklearn_logreg_example_1.ipynb")
    mlflow.log_artifact("./pyfiles/helpers.py")

In [None]:
model_run(task_train, task_test, gwa_train, gwa_test)
mlflow.end_run()

We see through the resulting output that our model was registered by MLflow and that we obtained the figure of count of GWAs assigned to each task. Success!

# 5.0 MLflow UI

We'll now head over to `http://<Remote IP>:<Port>/` on our local browser to see the results of the model run directly. You'll arrive on the Experiments page with separated experiments on the left each logging their respective model runs. 

![ui.png](../imgs/ui.png)

## 5.1 Model Run Pages

Runs with the green check next to them are logged as finished after calling `mlflow.end_run`. Clicking on the hyperlinked start time will then take you to the run's results page, where you can see the toggle drop-downs for parameters, metrics, tags, and artifacts. The artifacts section contains a series of subdirectories storing our various files, such as the pickled model and its vectorizer, this notebook and its helper functions, the matplotlib graph, and the Python and conda environment configurations. We'll now be able to retrieve any of those logged artifacts through MLflow. The model schema has also been logged and is displayed to the right of the artifact directory tree. This provides a quick reference regarding the required inputs for this model when it's loaded into a script directly from MLflow. 

## 5.2 Model Registry Page

The top navigation bar within the MLflow UI allows us to toggle to the Model Registry page. 

![registry.png](../imgs/registry.png)

We see that our model has been successfully registered and includes its most recent version. Clicking on a given model will take you to its history of registered versions, which is then linked to the runs that logged each version of the model. 

# 6.0 Model Registry Features 

The Model Registry includes various configurations that we can add to our newly registered model. We're able to interact directly with the Model Registry by initializing an `MlflowClient` instance. `client.list_registered_models` is a helpful function for reviewing the models we currently have registered. It outputs a lot of information on our registered models at once, so we'll use `pprint` to make its contents easier to read.  

In [None]:
client = MlflowClient()

for model in client.list_registered_models():
    pprint(dict(model), indent=4)

## 6.1 Registered Model Descriptions

The Model Registry is designed to serve as a centralized location where team members can refer to logged models. It can therefore be helpful to add additional information to registered models. We'll draw from the concept of Model Cards introduced by [Mitchell et al. (2019)](https://arxiv.org/pdf/1810.03993.pdf) to add documentation to our Scikit-learn model including an overview of its classification task, the projected use case of this model, and its associated limitations. This can be easily extended to additional topics of interest such as information regarding the data set, decisions around evaluation metrics, and ethical considerations affiliated with the model.

By calling `client.update_model_version`, we're able to add an open-text description to our registered model: 

In [None]:
client.update_model_version(
    name="sklearn_onet",
    version=1,
    description="""Overview: This is a Scikit-learn based logistic regression model designed to predict whether a task is affiliated with a given General Work Activity (GWA) within the Occupational Information Network's (ONET) occupational requirement content module.
    Use Cases: This model provides one experimental framework towards the research project of building a high-performing predictive classifier for labeling the currently unlabeled work tasks collected within the Occupational Requirements Survey.
    Limitations: This is a simple model with minimal hyperparameter tuning and is therefore only moderately successful in its predictive performance.""")

Our added model description doesn't look particularly readable from `update_model_version`'s output, but you'll find it to be nicely configured underneath the model's current version page in the Model Registry UI.

## 6.2 Model Staging

You've likely noticed the "Staging" and "Production" columns on the landing page of the Model Registry as well. This MLflow feature makes it easy to track the current status of models in use at the BLS. MLflow offers three stage levels: Staging, Production, and Archived. We can shift our logged model from having no set stage to the Staging level as follows:

In [None]:
client.transition_model_version_stage(
    name="sklearn_onet",
    version=1,
    stage="Staging"
)

# 7.0 Loading Models & Generating Predictions

Let's finish up by testing how easily MLflow lets us retrieve saved models. Because we approached our modeling task through a one-vs-all framework that fits 37 individual models, the model that MLflow logged in our training function is the final binary classifier trained for predicting whether a task is affiliated with the 37th GWA. We could have structured our script to save all 37 models within our MLflow run, but for example's sake we'll just stick with this final model.

`mlflow.pyfunc.load_model` is the universal retrieval function within MLflow for saved models originally created through any ML library. We can easily obtain our saved model through its logged name and version within the Model Registry. Retrieving our TF-IDF vectorizer requires a bit more navigation. We create a local directory to store our pickled vectorizer file. We then call `client.download_artifacts` with the ID of the run that logged the vectorizer, the pickle file name, and the directory path to download the file, which we then load through `pickle.load`. 

In [16]:
# load in the registered model 
registered_model = mlflow.pyfunc.load_model(model_uri="models:/sklearn_onet/1")

local_dir = "/tmp/artifact_downloads"
os.makedirs(local_dir, exist_ok=True)

# Replace "<Model Run>" with the ID of the run that logged the vectorizer
tfidf_scaler_path = client.download_artifacts("<Model Run>", "tfidf_vectorizer.pkl", local_dir)
with open(tfidf_scaler_path, 'rb') as f:
    tfidf_scaler = pickle.load(f)

In [17]:
tfidf_scaler

TfidfVectorizer()

During our preprocessing of the ONET data we reserved 2% of the listed tasks to test out model retrieval and prediction generation with. Let's vectorize that task text now and see our final results.  

In [18]:
task_predict_vectorized = tfidf_scaler.transform(task_predict)
task_predict_vectorized

<273x11321 sparse matrix of type '<class 'numpy.float64'>'
	with 2463 stored elements in Compressed Sparse Row format>

In [19]:
registered_model.predict(task_predict_vectorized)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)

The loaded model is producing predictions of task affiliation with the final GWA code, so we've successfully achieved our goal of a full modeling pipeline supported by MLflow, from data preprocessing through training and tuning to prediction. 