In [1]:
%matplotlib inline

import gzip
import io
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_profiling
import seaborn as sns
import statsmodels.api as sm
import xgboost as xgb
from bokeh.io import output_notebook
from bokeh.layouts import column
from bokeh.models import Band, ColumnDataSource, HoverTool, NumeralTickFormatter, Select
from bokeh.plotting import figure, gridplot, show
from fancyimpute import KNN, NuclearNormMinimization, SoftImpute, BiScaler
from s3fs import S3FileSystem
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, learning_curve, ShuffleSplit, train_test_split
from sklearn.naive_bayes import GaussianNB


RANDOM_STATE = 42
output_notebook()

# Model Review: Taking a model from exploration to production
  
<span style="display: inline-block; height:250px; width:720px"></span>
  
  
    
Presented by:  
[Andy R. Terrel, PhD](https://www.linkedin.com/in/aterrel/)  | Chief Data Scientist, [REX Inc.](https://rexhomes.com) | President, [NumFOCUS](https://numfocus.org)  

Contributions by:  
[Andy Maloney](https://linkedin.com/in/andy-maloney-a43a34195) | [John Hanley](https://linkedin.com/in/jhanley714) | REX Data Team




## Model is Done! NOW WHAT?

<div style="display: flex;">
<div style="height:250px; width:15%"></div>
<div>
<img src="../images/austin-neill-emH2e5SBifE-unsplash.jpg" width="720">
    </div>
</div>


<span style="font-size: small">Photo by <a href="https://unsplash.com/@arstyy?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Austin Neill</a> on <a href="https://unsplash.com/s/photos/ship?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>


<img src="../images/sculley-et-al_hidden-tech-debt-ml.png" width=1080>

<span style="font-size: small">Figure from Sculley, D & Holt, Gary & Golovin, Daniel & Davydov, Eugene & Phillips, Todd & Ebner, Dietmar & Chaudhary, Vinay & Young, Michael & Dennison, Dan. (2015). Hidden Technical Debt in Machine Learning Systems. NIPS. 2494-2502. <a href="https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf">https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf</a></span>

Charting our journey:


- Understanding your model's deployment

- Tracking data into and out of your model

- Detecting problems with your model

- Pulling it together in a checklist

## Understanding your model's deployment

<img src="../images/guillaume-bolduc-uBe2mknURG4-unsplash.jpg" width="720">

<span style="font-size: small">Photo by <a href="https://unsplash.com/@guibolduc?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Guillaume Bolduc</a> on <a href="https://unsplash.com/s/photos/ship?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

### Data collection

#### **`DATA SOURCES`**

  List all data sources. This should help you understand what data can
  be used for the analysis/model, and how to ultimately do the ETL for the
  model. For example:

  * Application database
  * Customer Relationship Management (CRM) records
  * Web analytic events


#### PROVENANCE

  You need to show _and store_ the following Entity, Activity, and Agent information about the notebook and its models.
  
  <img src="../images/provbook-example.png" width=720>
  
  <span style="font-size: small">Dong Huynh,
    http://trungdong.github.io/prov-python-short-tutorial.html</span>

  #### PROVENANCE

  * Entity

    * An entity is a physical, digital, conceptual, or other kind of thing
      with some fixed aspects; entities may be real or imaginary.

  * Activity

    * An activity is something that occurs over a period of time and acts upon
      or with entities; it may include consuming, processing, transforming,
      modifying, relocating, using, or generating entities.

  * Agent

    * An agent is something that bears some form of responsibility for an
      activity taking place, for the existence of an entity, or for another
      agent's activity.
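
One way to make the three terms concrete is to record them as plain data next to the model artifacts. The sketch below is a minimal illustration using only the standard library. The class names, field names, and identifiers such as `iris-raw` and `data-team` are hypothetical; a dedicated provenance library would offer a richer vocabulary.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Entity:
    """A thing with fixed aspects, e.g. a frozen data file."""
    id: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Agent:
    """Whoever (or whatever) bears responsibility for an activity."""
    id: str


@dataclass
class Activity:
    """Something that happens over time and uses or generates entities."""
    id: str
    agent: str
    used: list
    generated: list
    started_at: str
    ended_at: str


def utc_now():
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


# Hypothetical record for one ETL run.
raw = Entity('iris-raw', {'source': 'sklearn.datasets.load_iris'})
clean = Entity('iris-clean', {'format': 'csv.gz'})
analyst = Agent('data-team')
started = utc_now()
etl = Activity(
    id='iris-etl',
    agent=analyst.id,
    used=[raw.id],
    generated=[clean.id],
    started_at=started,
    ended_at=utc_now(),
)

# Serialize the record so the lineage survives the notebook session.
record = {
    'entities': [asdict(raw), asdict(clean)],
    'agents': [asdict(analyst)],
    'activities': [asdict(etl)],
}
provenance_json = json.dumps(record, indent=2)
```

Writing `provenance_json` to the same location as the data set would cover the "and store" half of the requirement.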

#### Testing contracts into and out of your model

- During model building, you discovered many things about features:
  + range of feature
  + distribution of feature
  + sensitivity to feature variance
- These learnings can be codified into preprocessing stages of your model
- Additionally, you can monitor the predictions of your model in the same fashion, detecting when your predictions start to be biased
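
As a rough illustration of codifying these learnings, the sketch below checks incoming features against ranges and flags a shift in the mean prediction. The bounds in `FEATURE_RANGES`, the exception name, and the drift rule are assumptions made for this example, not values taken from a real model.

```python
from statistics import mean


class ContractViolation(ValueError):
    """Raised when a value falls outside what was seen during model building."""


# Hypothetical contract: per-feature (min, max) bounds found during exploration.
FEATURE_RANGES = {
    'sepal length (cm)': (4.0, 8.0),
    'petal width (cm)': (0.0, 3.0),
}


def check_features(row, ranges=FEATURE_RANGES):
    """Return the row unchanged if every feature is present and in range."""
    for name, (low, high) in ranges.items():
        if name not in row:
            raise ContractViolation(f'missing feature: {name}')
        value = row[name]
        if not low <= value <= high:
            raise ContractViolation(f'{name}={value} outside [{low}, {high}]')
    return row


def prediction_drift(recent_predictions, reference_mean, tolerance):
    """Return True when the recent mean prediction moves away from the reference."""
    return abs(mean(recent_predictions) - reference_mean) > tolerance


# An in-range row passes through untouched.
checked = check_features({'sepal length (cm)': 5.1, 'petal width (cm)': 0.2})

# Recent predictions far from the reference mean are flagged.
drifting = prediction_drift([0.9, 0.95, 0.97], reference_mean=0.5, tolerance=0.2)
```

The same check can run as a preprocessing stage in front of the model and as a monitor behind it.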


### Prediction Environment


- How do you get called?
- What is the SLA your code needs to adhere to?
- What are the systems monitoring the model?
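
If the SLA is phrased as a latency budget, a first check could look like the sketch below. It assumes the SLA is expressed as "a given quantile of calls must finish within N seconds"; the function names and the stand-in `fake_predict` are hypothetical.

```python
import math
import time


def timed_call(predict, payload):
    """Call predict(payload) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = predict(payload)
    return result, time.perf_counter() - start


def within_sla(latencies, budget_seconds, quantile=0.95):
    """Check that the given quantile of observed latencies fits the budget."""
    ordered = sorted(latencies)
    # Nearest-rank quantile: smallest value covering `quantile` of the calls.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1] <= budget_seconds


def fake_predict(payload):
    """Stand-in for a real model call."""
    return sum(payload)


result, elapsed = timed_call(fake_predict, [1, 2, 3])
```

Collecting `elapsed` over many calls and passing the list to `within_sla` gives a yes/no answer that a monitoring system can alert on.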

#### How do you get called


<img src="../images/richards-event-architecture.png" width=720>

<span style="font-size: small">Figure from Mark Richards, __Software Architecture Patterns__</span>

#### What is your speed layer

<img src="../images/mapr-lambda-architecture.png" width=720>


<span style="font-size: small">Figure by MapR</span>

#### Who is monitoring you?


<img src="../images/terrel-breakdown-of-on-node-monitors.png" width=720>

<span style="font-size: small">Figure by Andy Terrel</span>

## How did you get there?!


<img src="../images/noaa-3duT-54VuK8-unsplash.jpg" width=720>


<span style="font-size: small">Photo by <a href="https://unsplash.com/@noaa?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">NOAA</a> on <a href="https://unsplash.com/s/photos/ship?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

### Data provenance

In [1]:
class Extract:
    """This is an example of an Agent. This object is going to be a recipe
    for the extraction of data. It will not do anything by itself, since the
    extraction is done through the _extract method that must be written by a
    class that inherits this object.
    """

    def _extract(self, *args, **kwargs):
        raise NotImplementedError('To be implemented by the inheriting class.')

    def extract(self, *args, **kwargs):
        extracted_data = self._extract(*args, **kwargs)
        return extracted_data

In [2]:
class Transform:
    """This is another Agent that we will use to transform any extracted data."""

    def _transform(self, *args, **kwargs):
        raise NotImplementedError('To be implemented by the inheriting class.')

    def transform(self, *args, **kwargs):
        transformed_data = self._transform(*args, **kwargs)
        return transformed_data

In [3]:
class Load:
    """Just as in the Extract and Transform objects, we create a Load object
    that is yet another Agent. We will use this data loading object to build
    another data set object that will be used to create entities for our
    model(s).
    """

    def _load(self, *args, **kwargs):
        raise NotImplementedError('To be implemented by the inheriting class.')

    def load(self, *args, **kwargs):
        loaded_data = self._load(*args, **kwargs)
        return loaded_data

In [5]:
class CSVDataSet:
    """This example will show how to persist a data set locally and on S3. This
    is an Agent that orchestrates the loading of a data set and the
    creation of the data set object.

    Again, this object will not do anything useful since it requires bucket
    names and file names to be created, as well as the _load method.
    """

    name = None
    data_set_name = None
    data_set_file_extension = 'csv'
    file_compression = 'gz'
    data_set_file_name = None
    bucket_local = None
    bucket_remote = None
    bucket_prefix = None
    data_set_path_local = None
    data_set_path_remote = None
    s3 = S3FileSystem()

In [None]:
    def cache_df(self, df, where):
        """Write df as a gzipped CSV to the local or the remote cache."""
        if where not in ('locally', 'remotely'):
            raise ValueError(f"where must be 'locally' or 'remotely', got {where!r}")

        print(f'Caching data set {where}.')
        # Serialize to CSV text, then gzip the UTF-8 bytes in memory.
        gzipped_data_set = gzip.compress(df.to_csv(index=False).encode('utf-8'))

        if where == 'locally':
            with open(self.data_set_path_local, 'wb') as file_:
                file_.write(gzipped_data_set)

        elif where == 'remotely':
            with self.s3.open(self.data_set_path_remote, 'wb') as file_:
                file_.write(gzipped_data_set)


In [None]:
    def _load_data_set(self, *args, **kwargs):
        overwrite_cache = kwargs.get('overwrite_cache', False)
            
        ...

        if data_set_exists_locally:
            print('Loading the data set from the local cache.')
            self.df = pd.read_csv(self.data_set_path_local, low_memory=False)
            if not data_set_exists_remotely:
                self.cache_df(df=self.df, where='remotely')
                
        ...

    def load_data_set(self, *args, **kwargs):
        self._load_data_set(*args, **kwargs)

    def is_data_set_cached(self):
        return not self.df.empty

In [6]:
class IrisDataExtractor(Extract):
    """This is an example of an Activity. It will require the inheritance of
    the agent called Extract. Here we will write the actual extraction of the
    data. This object could perform the generation of an Entity, if the
    extraction is a time-consuming process.
    """

    def _extract(self):
        iris_data = load_iris()
        return iris_data

In [7]:
class IrisDataTransformer(Transform):
    """As was done with the data extraction, here we will write the data
    transformation. This is another example of an Activity. It should be
    noted that this is not a generic transformation class. It is dependent
    on the extracted Iris data set. If you make this class a generic data
    transformation, then you need to make sure that any assumptions about the
    model are described further in the process. Making this "generic" could be
    useful if you need to make many models with different assumptions in them
    that are all based on the same data.

    If you do not need to make this generic, then it is a good idea to place
    all assumptions about the model and data in this object.
    """

    def _transform(self, data):
        df = pd.DataFrame(
            data=data['data'],
            columns=data['feature_names'],
        )
        name_dict = {i: name for i, name in enumerate(data['target_names'])}
        df['iris_name'] = data['target']
        df['iris_name'] = df['iris_name'].apply(lambda index: name_dict[index])
        return df

In [8]:
class IrisDataLoader(Load):
    """This specific example of an Activity also sets the dataframe attribute
    as an Entity to this object. This may seem unusual right now, but it will
    make sense when used in conjunction with the DataSet object below.
    """

    df = pd.DataFrame()

    def _load(self):
        extractor = IrisDataExtractor()
        extracted_data = extractor.extract()
        transformer = IrisDataTransformer()
        transformed_data = transformer.transform(extracted_data)
        self.df = transformed_data.copy()
        return self.df

In [9]:
class IrisDataSet(CSVDataSet, IrisDataLoader):

    bucket_local = '../data/'
    bucket_remote = 'org-rex-data'
    bucket_prefix = 'research/rex-analysis/amaloney'

    def __init__(self):
        self.generate_names_with_cache_paths()

### Showing your Data Prep

### Write down the meta-model

**objective** – a sentence or two on what your model or analysis aims at.

**KPIs** – List all key performance indicators. For example:

- Model predicts X.
- Further actions that need to be taken to acquire data for the model.
- Out of sample predictions show that...
- Y is an outlier when these assumptions are made...



### Write down the meta-model

**inputs**
- Tell us what the model reads, e.g. `postgres_db_uri`, or an S3 location like https://s3.console.aws.amazon.com/s3/buckets/bar/baz
- What is the provenance? Directly querying a System of Record? Or a subsequent journey? Can we reproduce it?

**outputs** 
- Where will inferences go? Stdout, file, S3, a table? 
- Should we be worried about overwriting some frozen output?
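
For outputs written to a local file, one simple guard against overwriting is exclusive-creation mode. The sketch below assumes a local filesystem; the function name and the file name `predictions-2020-04-14.csv` are hypothetical. For S3 or a database table the protection has to come from that system (for example, bucket versioning) and is not shown here.

```python
import os
import tempfile


def write_frozen(path, data):
    """Write bytes to path, failing if the file already exists."""
    # Mode 'xb' is exclusive creation: open() raises FileExistsError
    # instead of silently replacing an existing file.
    with open(path, 'xb') as file_:
        file_.write(data)


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, 'predictions-2020-04-14.csv')
    write_frozen(target, b'id,prediction\n1,setosa\n')
    try:
        # A second write to the same frozen output is rejected.
        write_frozen(target, b'id,prediction\n1,virginica\n')
        overwritten = True
    except FileExistsError:
        overwritten = False
```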

### Write down the meta-model

**assumptions**

- Are assumptions explicitly written down in a ReadMe? 
- The most common violations are:
  - assuming independence over observations that aren't independent, or 
  - implicitly assuming some statistical distribution such as normality.

- Do assumptions actually hold in practice, in the observed data? 

**benchmark** 

- What competing model are you comparing performance to? 
- Is it the best available?

**bias-variance tradeoff**
- What does the learning curve say about overfitting?
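
To show what a learning curve measures, here is a small sketch that uses only the standard library. The data are synthetic and the 1-nearest-neighbor classifier is chosen because it memorizes its training set, so both are assumptions for illustration. In practice, scikit-learn's `learning_curve` (imported at the top of this notebook) computes the same kind of curve with cross-validation.

```python
import random


def nearest_neighbor_predict(train, x):
    """Predict the label of x from its closest training point (1-NN)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(train, data):
    """Fraction of points in data that the 1-NN model labels correctly."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in data)
    return hits / len(data)


def make_data(n, rng):
    """Synthetic 1-D points: the label is x > 0.5, flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < 0.2:
            label = not label
        data.append((x, label))
    return data


rng = random.Random(42)
validation = make_data(200, rng)
pool = make_data(400, rng)

# One (size, training score, validation score) triple per training size.
curve = []
for size in (10, 50, 100, 400):
    train = pool[:size]
    curve.append((size, accuracy(train, train), accuracy(train, validation)))

for size, train_score, validation_score in curve:
    print(f'{size:>4}  train={train_score:.2f}  validation={validation_score:.2f}')
```

The training score here is perfect by construction, because each training point is its own nearest neighbor. A validation score that stays well below the training score as the training set grows is the usual sign of overfitting.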

**transformation** 
- How is data transformed during ETL? 
- Filtering? 
- Missing attributes? 
- Imputation?
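
Whatever imputation is used, it helps to report how much of it happened. The sketch below fills missing values with the column mean and counts the fills. It works on plain dictionaries to stay self-contained, and the function name and sample rows are hypothetical. scikit-learn's `SimpleImputer` (imported at the top of this notebook) provides mean imputation for arrays.

```python
from statistics import mean


def impute_mean(rows, column):
    """Fill missing values in one column with the mean of the observed ones.

    Returns the new rows and the number of filled values, so the amount
    of imputation can be reported alongside the model.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = mean(observed)
    filled = 0
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        if new_row[column] is None:
            new_row[column] = fill_value
            filled += 1
        result.append(new_row)
    return result, filled


rows = [
    {'petal length (cm)': 1.0},
    {'petal length (cm)': None},
    {'petal length (cm)': 3.0},
]
imputed_rows, n_filled = impute_mean(rows, 'petal length (cm)')
```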

### The Review Checklist

#### Summary

Author will create a request file, and Reviewer will create a result file, which are permanently added to the repo.

#### Outcome

At the end of the review, we’ll have learned whether the Reviewer believes the model is useful and finds it easy to use.

#### Prerequisites

- You have a model checked into source control. Good! 
- You have an **Objective**, in the form of a README.md or similar file. It, too, is checked into the repo.
- You have observations. Since they are “big”, too big to conveniently feed to the elephant named Git, they can be found in S3, or perhaps in an RDBMS table. They are frozen: they shall not change in the next week or two. Consider putting them in an S3 object or DB table that has an ISO 8601 date as part of the name, e.g. foo-2020-04-14.
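
A dated name says when the data were frozen, and a checksum lets the Reviewer confirm they stayed frozen. The sketch below is one possible approach, not a prescribed one; the helper names and the sample bytes are hypothetical.

```python
import hashlib
from datetime import date


def frozen_name(prefix, day):
    """Build a name such as 'foo-2020-04-14' from a prefix and a date."""
    return f'{prefix}-{day.isoformat()}'


def fingerprint(data):
    """Return the SHA-256 hex digest of the frozen bytes."""
    return hashlib.sha256(data).hexdigest()


name = frozen_name('foo', date(2020, 4, 14))
observations = b'sepal_length,iris_name\n5.1,setosa\n'
recorded_digest = fingerprint(observations)

# Later, the Reviewer can confirm that the data have not changed.
unchanged = fingerprint(observations) == recorded_digest
```

Recording the digest in the README or the review-request file keeps it under source control alongside the model.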

#### Prerequisites (continued)

- You have predictions, model outputs that are stored somewhere. They are frozen, just like the input observations. Perhaps they are “small” and may conveniently be stored in a Git repo. Or perhaps they are big and are more conveniently stored in S3 or a table. If the table contains additional rows, be sure your ReadMe or review-request shows how to query just the rows that matter for this review.
- You think the model is mature enough for review. Consider running it through a brief code review beforehand. 
- Consider doing a dry run, where the Author pretends to be the Reviewer and verifies the task is feasible.

#### Author process

- Pick a reviewer. That’s a reviewer, a single reviewer, just one, as there will be some effort involved. Count not to two. We can invite more to the party later.

- Create a new Git feature branch (along with a Jira ticket) for this review. Not for code development. Just for review.

- Commit and push a file full of review-specific instructions to the reviewer, with a name like doc/2020-04-14-review-request.md, or in the top-level directory, or whatever is convenient for your repo. Invite the reviewer to tweak an aspect of training or prediction, something that won’t take days of compute time. (Consider testing it yourself! For “big” models, consider producing model plus toy_model, which processes small input in less than an hour. There are lots of things the reviewer might do – your job is to make it so easy that many of them are quickly accomplished.)

- Push to Bitbucket and send your reviewer a PR in the usual way. Make no further edits. This branch is no longer yours – it belongs exclusively to the single reviewer you nominated.

#### Reviewer process

- Check out the feature / review branch.
- Create and commit an empty file with a name like doc/2020-04-16-review-result.md.
- Read the Objective, found in README.md or wherever the review-request explains it may be found. Add a sentence or two to your review-result file, describing whether the Objective is clear and seems relevant to stakeholders.
- Read the frozen model inputs, a few individual records plus stat summaries that Author helpfully made it easy to view. Write a sentence or two describing whether inputs seem to match reality, and match the Objective.
- Read the frozen model outputs. Write a sentence or two describing whether they seem to match reality, and match the Objective. For each of these three, a simple “makes sense, yes, I agree” will suffice.

#### Reviewer process (continued)

- Add a sentence or two describing transformation(s) done during ETL, and whether they seem reasonable. You should be starting with copy-n-paste of a sentence the Author helpfully put in the ReadMe.
- Read the model code, if you like, and append comments to the review-result file. This part is optional – code review should have been handled prior to model review, perhaps involving the same reviewer. The code must respect frozen inputs and outputs, leaving them untouched. It must be able to send the reviewer’s output to a new S3 object, table, or similar.
- Retrain the model from frozen inputs, or at least retrain toy_model. Timebox to one hour running time, and add a paragraph to review-result. Write “took too long” if model or toy_model get stuck at this stage.
- Reproduce the frozen model outputs. Timebox to one hour running time, and add a paragraph to review-result.
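
The timebox can be enforced mechanically rather than by watching the clock. The sketch below wraps a command with a timeout; the two commands are stand-ins for a real training run, and the short limits are only there to keep the example quick. For the review itself the limit would be 3600 seconds.

```python
import subprocess
import sys


def run_timeboxed(command, limit_seconds):
    """Run a command and report whether it finished within the time limit."""
    try:
        completed = subprocess.run(command, timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run stops the child process before raising.
        return 'took too long'
    return 'finished' if completed.returncode == 0 else 'failed'


# Stand-ins for a training command: one quick, one that exceeds its limit.
quick = [sys.executable, '-c', 'pass']
slow = [sys.executable, '-c', 'import time; time.sleep(5)']

quick_result = run_timeboxed(quick, limit_seconds=30)
slow_result = run_timeboxed(slow, limit_seconds=0.5)
```

The returned string can be pasted straight into the review-result file.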

#### Reviewer process (continued)
- Read the review-request instructions and tweak the model in the suggested way. Timebox to one hour, and add a paragraph to review-result.
- Tweak training or prediction in a way you find interesting. Timebox to one hour, and add a paragraph to review-result.
- Pick an example prediction error, or a summary description of how the error is distributed. Add a paragraph describing why the error makes sense, or does not.
- Revisit the Objective. Add a paragraph relating it to the model. Describe Next Steps, things the Reviewer feels offer opportunity for improvement, based on what we learned during review.
- Do a final commit on review-result and push the branch. Invite others / comment within the PR if you wish. Click Approve on the PR, and click Merge down to develop.

## ZOMG It's down!!!

<img src="../images/casey-horner-y7jrFSlVZAQ-unsplash-cropped.jpg" width=720>


<span style="font-size: small">Photo by <a href="https://unsplash.com/@mischievous_penguins?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Casey Horner</a> on <a href="https://unsplash.com/s/photos/ship?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

### Define healthchecks and alerts

<img src="../images/terrel-system-diagram.png" width=720>

<span style="font-size:small">Figure by Andy R. Terrel</span>

### Use model servers

<img src="../images/bentoml-readme-header.jpeg" width=360>

<img style="background: black;padding: 5px" src="../images/MLflow-logo-final-white-TM.png" width=360>

<img src="../images/redis-ai.png" width=360>

## Finally time to start again!

<img src="../images/whoisbenjamin-ApJp5Nk24a0-unsplash.jpg" width=720>

<span style="font-size: small">Photo by <a href="https://unsplash.com/@whoisbenjamin?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">@whoisbenjamin</a> on <a href="https://unsplash.com/s/photos/ships-indian-ocean?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>