<a href="https://colab.research.google.com/github/datarobot-community/DRU-MLOps/blob/master/17May2021 - MLOps_III_DRUM_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLOps III - DRUM Laboratory

In this notebook we will:

* Build a simple regression model using Scikit-Learn
* Use DRUM to test & validate the model
* Use DRUM to score data in batch mode


## Use case to be addressed:

We will build a regression model to predict the median value of owner-occupied homes in the Boston area.

Let's begin by uploading a few resources we will need:

1. Training set: **boston_housing.csv**
2. Scoring set: **boston_housing_inference.csv**
3. Requirements file: **colab_requirements.txt**
4. File with hooks used by the model: **custom.py**

In [1]:
from google.colab import files
uploaded = files.upload()

Saving boston_housing_inference.csv to boston_housing_inference.csv
Saving boston_housing.csv to boston_housing.csv
Saving colab_requirements.txt to colab_requirements.txt
Saving custom.py to custom.py


In [2]:
!ls

boston_housing.csv	      colab_requirements.txt  sample_data
boston_housing_inference.csv  custom.py


Let's install the Python modules we need using the requirements file:

In [3]:
!pip install -r colab_requirements.txt -q


# 1.- Model Training

We will now build a very simple Scikit-Learn Regression model using the boston_housing prices dataset.

In [4]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import pickle
import datetime

## load data

df = pd.read_csv('boston_housing.csv')
df.head()

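| Unnamed: 0 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |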


In [5]:
## set features and target

X = df.drop('MEDV', axis=1)
y = df['MEDV']

## train the model
rf = RandomForestRegressor(n_estimators=20)
rf.fit(X,y)

## serialize the model

with open('rf.pkl', 'wb') as pkl:
    pickle.dump(rf, pkl)

print("Done!")    

Done!


# 2.- Model Testing

We will now use DRUM to test how the model performs by measuring latency and memory usage across several test-case sizes. DRUM prints a report once the run completes.



In [6]:
%%sh 
drum perf-test 

Preparing test data...



Running test case: 72 bytes - 1 samples, 100 iterations
Running test case: 0.1MB - 1447 samples, 50 iterations
Running test case: 10MB - 144742 samples, 5 iterations
Running test case: 50MB - 723711 samples, 1 iterations

  size     samples   iters    min     avg     max    used (MB)   total (MB)
72 bytes         1     100   0.008   0.008   0.012     513.230    13021.062
0.1MB         1447      50   0.014   0.014   0.018     516.723    13021.062
10MB        144742       5   0.619   0.633   0.656     580.734    13021.062
50MB        723711       1   3.169   3.169   3.169     731.633    13021.062


2021-04-27 14:18:39.926033: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
  defaults = yaml.load(f)
tput: terminal attributes: No such device or address
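
To make the report easier to interpret, the sketch below converts its rows into throughput and per-row latency. The numbers are copied from the table above; the one assumption is that the `min`/`avg`/`max` columns are in seconds, which the report itself does not state. The function names are illustrative only.

```python
# Rows copied from the perf-test report above: (label, samples, average time).
# Assumption: the avg column is in seconds; the report does not state its units.
PERF_ROWS = [
    ("72 bytes", 1, 0.008),
    ("0.1MB", 1447, 0.014),
    ("10MB", 144742, 0.633),
    ("50MB", 723711, 3.169),
]


def throughput(samples: int, avg_seconds: float) -> float:
    """Return the number of rows scored per second for one test case."""
    return samples / avg_seconds


def per_row_latency_ms(samples: int, avg_seconds: float) -> float:
    """Return the average time spent per row, in milliseconds."""
    return 1000.0 * avg_seconds / samples


for label, samples, avg in PERF_ROWS:
    print(
        f"{label:>8}: {throughput(samples, avg):>10,.0f} rows/s, "
        f"{per_row_latency_ms(samples, avg):.4f} ms/row"
    )
```

Under that assumption, the single-row case costs about 8 ms per request, while the 50MB case works out to roughly 228,000 rows per second. This suggests that fixed per-call overhead dominates small requests.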



# 3.- Model Validation: Handling of Missing Values

We will now validate the model to detect and address issues before deployment. We strongly encourage you to run these tests, which are the same ones that DataRobot performs automatically before deploying models.

Specifically, DRUM will test null-value imputation by setting each feature in the dataset to "missing" and then feeding the features to the model. We will send the results to **validation.log**.
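
The idea behind this check can be sketched in a few lines of pandas. This is not DRUM's actual implementation, only an approximation of the procedure described above: blank out one feature at a time and see whether the model still returns one prediction per row. `check_null_handling` and `ZeroFillModel` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd


def check_null_handling(model, features: pd.DataFrame) -> dict:
    """Blank out one column at a time and record whether predict() still works.

    Returns a mapping of column name -> True (one prediction per row was
    produced) or False (the model raised an error or returned the wrong size).
    """
    results = {}
    for column in features.columns:
        probe = features.copy()
        probe[column] = np.nan  # simulate a completely missing feature
        try:
            predictions = model.predict(probe)
            results[column] = len(predictions) == len(probe)
        except Exception:  # any failure means the gap is not handled
            results[column] = False
    return results


class ZeroFillModel:
    """Hypothetical stand-in model: fills missing values with 0, sums each row."""

    def predict(self, data: pd.DataFrame) -> np.ndarray:
        return data.fillna(0).sum(axis=1).to_numpy()


demo = pd.DataFrame({"RM": [6.575, 6.421], "LSTAT": [4.98, 9.14]})
print(check_null_handling(ZeroFillModel(), demo))
```

A model that rejects missing input would report `False` for every column, which is a signal that imputation needs to happen before scoring.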

In [7]:
%%sh 
drum validation 

2021-04-27 14:34:40.337159: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
  defaults = yaml.load(f)

In [8]:
! cat validation.log

   Predictions
0       24.955
1       21.640
2       34.695
3       33.650
4       34.445
5       27.760
6       21.480
7       21.135
8       18.120
   Predictions
0       25.130
1       21.895
2       34.795
3       33.670
4       35.265
5       27.760
6       22.000
7       20.975
8       17.385
   Predictions
0       24.955
1       22.245
2       34.675
3       33.670
4       35.265
5       27.760
6       23.830
7       20.975
8       17.175
   Predictions
0       24.955
1       21.895
2       34.795
3       33.670
4       35.265
5       27.760
6       22.000
7       20.975
8       17.385
   Predictions
0       26.795
1       22.025
2       34.445
3       33.365
4       34.975
5       27.620
6       21.630
7       20.840
8       17.260
   Predictions
0       22.315
1       18.270
2       23.365
3       24.485
4       25.320
5       24.680
6       18.565
7       19.580
8       17.240
   Predictions
0       25.495
1       23.010
2       35.295
3       34.265
4       35.365
5       27

# 4.- Batch Scoring with DRUM
<a id="setup_complete"></a>

We want to use our model to make predictions; to do this, we'll leverage DRUM and its ability to natively handle our Scikit-Learn model. All we need to do is tell DRUM where the model resides and what data we wish to score.  
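
As a rough illustration of "telling DRUM where the model resides and what data to score", the sketch below assembles a DRUM command line from Python without running it. It assumes the `--code-dir`, `--input`, `--target-type`, and `--output` flags as I understand the DRUM CLI (confirm with `drum score --help` for your installed version). It also assumes the code directory is the notebook's working directory, where `rf.pkl` and `custom.py` live. `build_drum_command` is a hypothetical helper, not part of DRUM.

```python
import shlex
from typing import List, Optional


def build_drum_command(
    subcommand: str,
    code_dir: str,
    input_path: str,
    target_type: str = "regression",
    output_path: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for a DRUM CLI call (does not execute it)."""
    command = [
        "drum", subcommand,
        "--code-dir", code_dir,        # directory holding rf.pkl and custom.py
        "--input", input_path,         # CSV file with the rows to score
        "--target-type", target_type,  # assumed flag; check your DRUM version
    ]
    if output_path is not None:
        command += ["--output", output_path]
    return command


score_cmd = build_drum_command(
    "score", ".", "boston_housing_inference.csv", output_path="predictions.csv"
)
print(shlex.join(score_cmd))
```

The resulting list can be passed to `subprocess.run(score_cmd, check=True)`. Building the arguments as a list rather than a single string avoids shell-quoting problems with unusual file names.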

DRUM provides native support for many frameworks. To use DRUM with model frameworks that are not supported out of the box, we just need to create some custom hooks so that DRUM can work with them. In this example, we'll explain some very simple custom hooks and provide links to more complex examples.
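
The uploaded **custom.py** is not reproduced in this notebook, so the following is only a sketch of what a simple hooks file could look like. It assumes the `load_model`, `transform`, and `score` hook names and signatures as I understand DRUM's Python hook interface, and that regression output is a single `Predictions` column, as seen in the outputs in this notebook. The zero-fill strategy is a placeholder, not a recommendation.

```python
import os
import pickle

import pandas as pd


def load_model(code_dir: str):
    """Hook: load the pickled estimator from the code directory."""
    with open(os.path.join(code_dir, "rf.pkl"), "rb") as pkl:
        return pickle.load(pkl)


def transform(data: pd.DataFrame, model) -> pd.DataFrame:
    """Hook: clean the incoming frame before it reaches the model.

    Assumption: filling gaps with 0 is a placeholder imputation strategy.
    """
    return data.fillna(0)


def score(data: pd.DataFrame, model, **kwargs) -> pd.DataFrame:
    """Hook: return predictions in a single 'Predictions' column."""
    return pd.DataFrame({"Predictions": model.predict(data)})
```

Keeping imputation in `transform` rather than inside the model means the same pickled estimator can be reused with different cleaning rules.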

In [9]:
%%sh
drum score 

2021-04-27 14:41:04.131795: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
  defaults = yaml.load(f)


Let's have a look at the predictions:

In [10]:
pd.read_csv("predictions.csv").head()

| Unnamed: 0 | Predictions |
|---|---|
| 0 | 24.955 |
| 1 | 21.895 |
| 2 | 34.795 |
| 3 | 33.67 |
| 4 | 35.265 |
