# Building a Custom Task

This notebook walks through the general workflow for building a custom task. We'll also demonstrate how to then deploy your custom task to a cloud b

## Note
The final sections of this tutorial require that you have access to Cloud DataRobot (app.datarobot.com or app.eu.datarobot.com).

## Agenda
In this tutorial, we'll learn:
1. How to create a custom task using simple Python classes
2. How to test your Python class
3. How to use the DRUM CLI tools to test your custom task
4. How to use the DataRobot API to deploy your custom task to the DataRobot cloud for use in projects
5. How to insert a custom task on the DataRobot cloud into a blueprint

## Setup and Requirements [ In Progress]
This tutorial assumes a few things about your filepath and prior work. 

**Firstly, you need a feature flag enabled:**

Secondly, you should have a folder at the path `~/datarobot-user-models/`. If you put the folder in a different location, make sure you update the `TESTING_PATH` variable. This folder should contain 4 things:
1. A folder containing your properly configured custom environment.     
    In this example, it's named `public_dropin_environments/python3_pytorch/`
    
    
2. A folder containing your properly-configured custom model.     
    In this example, it's named `model_templates/python3_pytorch/`
    
    
3. The current version of the DataRobot Python Client.
    - Installation instructions for the client can be found here: [DataRobot Python Client Docs](https://datarobot-public-api-client.readthedocs-hosted.com/en/v2.20.0/setup/getting_started.html#installation)
    - Full documentation for the client can be found here: [DataRobot Python Client Docs](https://datarobot-public-api-client.readthedocs-hosted.com/en/v2.20.0/index.html)


4. A test dataset that you can use to test predictions from your custom model.     
    In this example, it's stored at `tests/testdata/juniors_3_year_stats_regression.csv`

It also assumes that you have access to app.datarobot.com.
If you use a different DataRobot installation, use the appropriate credentials and URL.


## Configuring Models and Environments
For more information on how to properly configure custom models and environments, read the README of our [DataRobot User Models repository](https://github.com/datarobot/datarobot-user-models).

# Building a custom task

First, we need to import a few things.

In [1]:
from pathlib import Path
import pandas as pd
import tensorflow as tf
import pickle

from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

from sklearn.pipeline import Pipeline


Now let's build a neural network! First we'll lay out the code, then we'll walk through it.

In [16]:
from datarobot_drum.custom_task_interfaces import RegressionEstimatorInterface

class CustomTask(RegressionEstimatorInterface):
    
    def fit(self, X, y, row_weights=None, **kwargs):
        tf.random.set_seed(1234)
        input_dim, output_dim = len(X.columns), 1

        model = Sequential(
            [
                Dense(
                    input_dim, activation="relu", input_dim=input_dim, kernel_initializer="normal"
                ),
                Dense(input_dim // 2, activation="relu", kernel_initializer="normal"),
                Dense(output_dim, kernel_initializer="normal"),
            ]
        )
        model.compile(loss="mse", optimizer="adam", metrics=["mae", "mse"])

        callback = EarlyStopping(monitor="loss", patience=3)
        model.fit(
            X, y, epochs=20, batch_size=8, validation_split=0.33, verbose=1, callbacks=[callback]
        )

        # Attach the model to our object for future use
        self.estimator = model
        return self

    
    def save(self, artifact_directory):

        # If your estimator is not pickle-able, you can serialize it using its native method,
        # i.e. in this case for keras we use model.save, and then set the estimator to none
        keras.models.save_model(self.estimator, Path(artifact_directory) / "model.h5")

        # Helper method to handle serializing, via pickle, the CustomTask class
        self.save_task(artifact_directory, exclude=["estimator"])

        return self


    @classmethod
    def load(cls, artifact_directory):
        
        # Helper method to load the serialized CustomTask class
        custom_task = cls.load_task(artifact_directory)
        custom_task.estimator = keras.models.load_model(Path(artifact_directory) / "model.h5")

        return custom_task
    

    def predict(self, X, **kwargs):
        # Note how the regression estimator only outputs one column, so no explicit column names are needed
        return pd.DataFrame(data=self.estimator.predict(X))



There's a lot above, but the key idea is that we have 4 hooks: `fit`, `save`, `load`, and `predict`. DataRobot uses these hooks automatically to run our custom task. As you can probably guess, the hooks run in a specific order: first we train (fit) a model, then we serialize (save) it (see the section [below]), then we load (i.e. deserialize) it again, and finally we make predictions.

One thing to note is that the above CustomTask is simply a Python class, which means we can also add helper methods, or keep functions and classes in a separate helper file that we import. The more complex your CustomTask, the more sense it makes to move logic into a helper file to keep the class itself simple. See [here] for an example of directly using helper methods in the class and [here] for using a separate helper file.

## Training our model with Fit

Now let's actually use the class above. Since this is an ordinary Python class, all we need to do is build an object and call its methods to check that they work! First, let's grab a dataset and separate out the target column.

In [17]:
df = pd.read_csv("tests/testdata/juniors_3_year_stats_regression.csv")

y = df['Grade 2014']
X = df.drop(labels=['Grade 2014'], axis=1)

Now let's train our model!

In [18]:
task = CustomTask()

In [19]:
task = task.fit(X,y)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20


## Saving and Loading our Custom Task

You may be wondering why we need to save our model only to immediately load it again to make predictions. The reason is that model training and prediction (also known as inference or scoring) are distinct tasks that may have very different resource requirements. As an extreme example, one of DataRobot's proprietary models is a genetic algorithm known as Eureqa. Training this model can take some time, as it iterates through hundreds, thousands, or even millions of mathematical transformations. However, the output of this model is a simple mathematical equation, which can run almost instantly on very modest computational resources. So during training we want to allocate a large amount of compute, but while the model is making predictions, e.g. when it is deployed and waiting to receive new data, we want far less compute allocated.

The way we achieve this is by using Docker containers, which we can think of as extremely lightweight virtual machines. This allows our training container to have significantly different computational resources allocated than our prediction container. But since the training and prediction steps are in separate containers, we need a way to move trained models and other useful artifacts, e.g. class labels, between them. The solution is to write out the artifacts to disk, i.e. serialize them. So our save method at the end of training will write out the model to a shared file storage location and then our load method at the beginning of making predictions will read the artifacts into memory, i.e. deserialize them. 
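
The hand-off between the two containers can be sketched with nothing but the standard library. This is a minimal illustration, not DataRobot's implementation: `MeanTask`, `training_step`, and `prediction_step` are hypothetical names, a temporary directory stands in for the shared file storage, and the two functions share nothing except the artifact directory path.

```python
import pickle
import tempfile
from pathlib import Path


class MeanTask:
    """Hypothetical stand-in for a CustomTask: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def save(self, artifact_directory):
        # Serialize the learned state to the shared location.
        # (Pickling the attribute dict keeps this sketch independent of where the class lives.)
        with open(Path(artifact_directory) / "artifact.pkl", "wb") as f:
            pickle.dump(vars(self), f)
        return self

    @classmethod
    def load(cls, artifact_directory):
        # Deserialize the state written by save() into a fresh object
        task = cls()
        with open(Path(artifact_directory) / "artifact.pkl", "rb") as f:
            task.__dict__.update(pickle.load(f))
        return task

    def predict(self, X):
        return [self.mean_ for _ in X]


def training_step(artifact_directory):
    """Plays the role of the training container: fit, then save."""
    X, y = [[1], [2], [3]], [10.0, 20.0, 30.0]
    MeanTask().fit(X, y).save(artifact_directory)


def prediction_step(artifact_directory, X_new):
    """Plays the role of the prediction container: load, then predict."""
    return MeanTask.load(artifact_directory).predict(X_new)


with tempfile.TemporaryDirectory() as shared_dir:
    training_step(shared_dir)
    handoff_predictions = prediction_step(shared_dir, [[4], [5]])
```

Because `prediction_step` never sees the object created in `training_step`, anything it needs at prediction time has to be written to disk by `save`.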


In [20]:
task.save(".")

<__main__.CustomTask at 0x18677de90>

In [21]:
task = task.load(".")

In [22]:
task.predict(X)

|      | 0         |
|------|-----------|
| 0    | 27.046207 |
| 1    | 25.635019 |
| 2    | 26.916660 |
| 3    | 25.292889 |
| 4    | 28.178623 |
| ...  | ...       |
| 1472 | 24.894941 |
| 1473 | 26.109985 |
| 1474 | 31.427782 |
| 1475 | 23.208033 |


Let's take a closer look at the save method:

In [23]:
??task.save

Signature: task.save(artifact_directory)
Docstring:
Serializes the object and stores it in `artifact_directory`

Parameters
----------
artifact_directory: str
    Path to the directory to save the serialized artifact(s) to.

Returns
-------
self
Source:
    def save(self, artifact_directory):

        # If your estimator is not pickle-able, you can serialize it using its native method,
        # i.e. in this case for keras we use model.save, and then set the estimator to none
        keras.models.save_model(self.estimator, Path(artifact_directory) ...

As we can see, we use two distinct functions to save our model. First, we use the keras function `save_model` to save `self.estimator`, i.e. the trained model from the fit function. Then we use a built-in helper function, `save_task`. Let's look at `save_task` quickly:



In [24]:
??task.save_task

Signature: task.save_task(artifact_directory, exclude=None)
Source:
    def save_task(self, artifact_directory, exclude=None):
        """
        Helper function that abstracts away pickling the CustomTask object. It also can
        automatically set previously serialized variables to None, e.g. when using keras you likely
        want to serialize self.estimator using model.save() or keras.models.save_model() and then
        pass in exclude='estimator'

        Parameters
        ----------
        artifact_directory: str
            Path to the directory to save the serialized artifact(s) to

Don't worry about understanding every line above. The key point is that we set everything passed in the exclude parameter to None, then we use the standard Python library pickle to serialize the CustomTask object. The reason we do this is flexibility. There is a wide array of Python ML frameworks, e.g. keras / tensorflow, pytorch, xgboost, etc. Many of these frameworks, particularly those for neural networks, have their own serialization functions that handle all the complexities of storing weights, architectures, etc.

So the recommended pattern is to save your estimator using your framework's serialization function, e.g. keras.models.save_model above, and then use the helper function save_task we provide to save the rest of your CustomTask object. 
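
To make the pattern concrete, here is a minimal standard-library sketch of "native save for the estimator, pickle for everything else". It is not the DRUM implementation of `save_task` / `load_task`: `ToyEstimator`, `ToyTask`, `native_save`, and `native_load` are hypothetical names, JSON stands in for a framework's own serialization format, and the sketch pickles the task's attribute dict rather than the task object itself.

```python
import json
import pickle
import tempfile
import threading
from pathlib import Path


class ToyEstimator:
    """Hypothetical estimator that is not pickle-able because it holds a lock."""

    def __init__(self, weights):
        self.weights = weights
        self._lock = threading.Lock()  # thread locks cannot be pickled

    def native_save(self, path):
        # Framework-style serialization: only the weights are written out
        Path(path).write_text(json.dumps(self.weights))

    @classmethod
    def native_load(cls, path):
        return cls(json.loads(Path(path).read_text()))


class ToyTask:
    def __init__(self):
        self.estimator = ToyEstimator([0.5, 1.5])
        self.feature_names = ["a", "b"]

    def save(self, artifact_directory, exclude=("estimator",)):
        directory = Path(artifact_directory)
        # Step 1: the estimator goes through its own serialization routine
        self.estimator.native_save(directory / "estimator.json")
        # Step 2: everything else is pickled, with excluded attributes replaced by None
        state = {k: (None if k in exclude else v) for k, v in vars(self).items()}
        with open(directory / "task.pkl", "wb") as f:
            pickle.dump(state, f)
        return self

    @classmethod
    def load(cls, artifact_directory):
        directory = Path(artifact_directory)
        task = cls.__new__(cls)  # skip __init__; the state comes from disk
        # Reverse order: restore the pickled attributes first, then the estimator
        with open(directory / "task.pkl", "rb") as f:
            task.__dict__.update(pickle.load(f))
        task.estimator = ToyEstimator.native_load(directory / "estimator.json")
        return task


with tempfile.TemporaryDirectory() as artifact_dir:
    restored = ToyTask().save(artifact_dir).load(artifact_dir)
```

After the round trip, `restored.feature_names` comes from the pickle file and `restored.estimator.weights` comes from the JSON file, mirroring the split between `save_task` and `keras.models.save_model` above.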

If we look at the load method, we see that we simply reverse the order. First we use the helper function load_task to load our CustomTask object using pickle, then we load our estimator into self.estimator using the keras function load_model:

In [25]:
??task.load

Signature: task.load(artifact_directory)
Docstring:
Deserializes the object stored within `artifact_directory`.

Parameters
----------
artifact_directory: str
    Path to the directory to save the serialized artifact(s) to.

Returns
-------
cls
    The deserialized object
Source:
    @classmethod
    def load(cls, artifact_directory):

        # Helper method to load the serialized CustomTask class
        custom_task = cls.load_task(artifact_directory)
        custom_task.estimator = keras.models. ...

It's important to understand that some Python ML frameworks, notably sklearn, use pickle for their serialization as well. This means we don't need to write our own save / load functions in our CustomTask, as the default functions will simply pickle everything, including the model. The example below is from the template [here] and shows how this looks for a simple sklearn model. You can see that the whole CustomTask is around 10 lines of code. Pretty neat, huh?

In [27]:
from sklearn.linear_model import Ridge


class CustomTask(RegressionEstimatorInterface):
    def fit(self, X, y, row_weights=None, **kwargs):
        
        self.estimator = Ridge()
        self.estimator.fit(X, y)

        return self

    def predict(self, X, **kwargs):

        return pd.DataFrame(data=self.estimator.predict(X))



## Making Predictions with the Correct Output

A CustomTask currently has to output a pandas dataframe where the rows are the samples and the columns the predictions. 

As you may have noticed, all of the examples so far have been regressors, i.e. they output a single numeric prediction per sample. So the output has one row per sample and a single column (which doesn't need a specific name). Our CustomTasks above inherit from the RegressionEstimatorInterface, which enforces this behavior.

We can use the exact same approach when creating an anomaly CustomTask, because the output is again a single numeric column representing how anomalous each sample is. There is a corresponding AnomalyEstimatorInterface.
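
When testing a regression or anomaly task locally, it can help to check the shape of the output yourself before handing the task to DataRobot. The helper below is a hypothetical sketch (`check_single_column_output` is not part of DRUM) that assumes predictions are expected as a DataFrame with one row per input sample and exactly one numeric column.

```python
import numpy as np
import pandas as pd


def check_single_column_output(predictions, X):
    """Hypothetical sanity check for regression or anomaly predictions."""
    if not isinstance(predictions, pd.DataFrame):
        raise ValueError("predictions must be a pandas DataFrame")
    if predictions.shape != (len(X), 1):
        raise ValueError(
            f"expected shape {(len(X), 1)} (one row per sample, one column), "
            f"got {predictions.shape}"
        )
    if not pd.api.types.is_numeric_dtype(predictions.iloc[:, 0]):
        raise ValueError("predictions must be numeric")
    return True


X_demo = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})
# A 2-D array with a single column, similar in shape to the estimator output used above
raw_predictions = np.array([[0.1], [0.2], [0.3]])
single_column_ok = check_single_column_output(pd.DataFrame(data=raw_predictions), X_demo)
```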


Things are a little trickier though when we need to create a binary or multiclass estimator. That's because we'll need to align the columns to the classes. Keep in mind that our CustomTask will run inside a DataRobot blueprint, which will be given a list of classes in the target. Let's take a look at an example binary estimator:

In [30]:
from sklearn.tree import DecisionTreeClassifier

from datarobot_drum.custom_task_interfaces import BinaryEstimatorInterface

class CustomTask(BinaryEstimatorInterface):
    def fit(self, X, y, row_weights=None, **kwargs):
        
        self.estimator = DecisionTreeClassifier()
        self.estimator.fit(X, y)

        return self

    def predict_proba(self, X, **kwargs):

        # Note that binary estimators require two columns in the output, one for the positive and one
        # for the negative class label. So we need to pass in the class names derived from the
        # estimator as column names OR we can use the class labels from DataRobot stored in
        # kwargs['positive_class_label'] and kwargs['negative_class_label']
        return pd.DataFrame(data=self.estimator.predict_proba(X), columns=self.estimator.classes_)

The first thing to notice is that because we're outputting probabilities, we define a predict_proba function instead of a predict function. The second thing to notice is that we have to provide the column names of our dataframe, and they have to align with the classes of our dataset. If you look at the fit function, you'll notice we directly pass in the target column y. This contains our target labels, which are passed to our model as we train it, i.e. in self.estimator.fit().


For binary classification, DataRobot requires 2 columns: the positive class probability and the negative class probability (which is its complement, i.e. 1 minus the positive class probability). As a result, the two values in each row should sum to 1.0.
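
The two requirements above (two labelled columns, rows summing to 1.0) are easy to verify locally. The following is a sketch under assumptions: `check_binary_output` is a hypothetical helper, and `"yes"` / `"no"` are made-up class labels standing in for the values DataRobot would supply.

```python
import numpy as np
import pandas as pd


def check_binary_output(predictions, positive_class_label, negative_class_label):
    """Hypothetical check: two labelled columns whose rows each sum to 1.0."""
    expected = {positive_class_label, negative_class_label}
    if set(predictions.columns) != expected:
        raise ValueError(
            f"columns must be {sorted(expected, key=str)}, got {list(predictions.columns)}"
        )
    if not np.allclose(predictions.sum(axis=1), 1.0):
        raise ValueError("each row of probabilities must sum to 1.0")
    return True


positive_probability = np.array([0.9, 0.2, 0.5])
binary_predictions = pd.DataFrame(
    {
        "yes": positive_probability,
        # The negative class probability is the complement of the positive one
        "no": 1 - positive_probability,
    }
)
binary_ok = check_binary_output(binary_predictions, "yes", "no")
```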

For frameworks that output a column for each of the 2 classes, like sklearn, we can simply use the classes stored by the sklearn model itself, i.e. self.estimator.classes_.

Some frameworks, such as pytorch, instead only output one column (typically the positive class probability). In those cases we have to derive the negative class column, as seen below:

In [32]:
    def predict_proba(self, X, **kwargs):
        """Since pytorch only outputs a single probability, i.e. the probability of the positive class,
         we use the class labels passed in kwargs to label the columns"""
        data_tensor = torch.from_numpy(X.values).type(torch.FloatTensor)
        predictions = self.estimator(data_tensor).cpu().data.numpy()
        
        predictions = pd.DataFrame(predictions, columns=[kwargs["positive_class_label"]])

        # The negative class probability is just the complement (1 - p) of what the model predicts above
        predictions[kwargs["negative_class_label"]] = (
            1 - predictions[kwargs["positive_class_label"]]
        )
        return predictions


Multiclass is slightly more challenging. Here we'll need to output the probability of each class as a separate column. If our framework stores the classes, like many sklearn models, we can use the exact same approach as above with binary classification. If the framework doesn't store the classes, e.g. pytorch, then we'll need to store the classes on the self object during the fit step so they can be used as the column names (NOTE: the save & load methods are excluded below to focus on the unique aspects of multiclass):

In [35]:
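import torch
from sklearn.preprocessing import LabelEncoder

from datarobot_drum.custom_task_interfaces import MulticlassEstimatorInterface

# NOTE: build_classifier and train_classifier are helper functions defined outside this cell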


class CustomTask(MulticlassEstimatorInterface):
    def fit(self, X, y, row_weights=None, **kwargs):
        """Note how we encode the class labels and store them on self to be used in the predict hook"""
        self.lb = LabelEncoder().fit(y)
        y = self.lb.transform(y)

        # For reproducible results
        torch.manual_seed(0)

        self.estimator, optimizer, criterion = build_classifier(X, len(self.lb.classes_))
        train_classifier(X, y, self.estimator, optimizer, criterion)

        return self


    def predict_proba(self, X, **kwargs):
        """Note how the column names come from the encoded class labels in the fit hook above"""
        data_tensor = torch.from_numpy(X.values).type(torch.FloatTensor)
        predictions = self.estimator(data_tensor).cpu().data.numpy()

        # Note that multiclass estimators require one column per class in the output
        # So we need to pass in the class names stored on the label encoder as column names.
        return pd.DataFrame(data=predictions, columns=self.lb.classes_)
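
The reason this works is that the encoder assigns each class an integer position, and column `i` of the model output corresponds to the class encoded as `i`. The sketch below illustrates the same idea with `np.unique`, which (like sklearn's `LabelEncoder`) returns the sorted unique labels together with each sample's index into them. The labels and probabilities are made-up values for illustration.

```python
import numpy as np
import pandas as pd

y_train = np.array(["setosa", "virginica", "setosa", "versicolor"])

# classes holds the sorted unique labels; y_encoded holds each label's index in classes
classes, y_encoded = np.unique(y_train, return_inverse=True)

# Stand-in for the model's raw output: one probability per class, in encoded order
raw_probabilities = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])

# Label column i with the class that was encoded as integer i during fit
multiclass_predictions = pd.DataFrame(data=raw_probabilities, columns=classes)
```

If the classes were stored in a different order than the one used for encoding, the probabilities would silently end up under the wrong labels, which is why the encoder is kept on `self`.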


## Transformers vs. Estimators

So far we've focused on estimators, which output a prediction. We can also create transformers, which manipulate the data and pass it along to (eventually) an estimator. Note: this final estimator can be either a built-in DataRobot task or a CustomTask.

The key difference between estimators and transformers is that instead of a predict function we have a transform function. The transform function also returns a dataframe, but instead of predictions it returns the transformed data. Note that while you can certainly create or remove columns, the transform function must output the same number of rows as it receives, e.g. you can't add new synthetic rows.
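
As a rough illustration of the fit / transform pairing, here is a plain Python class that standardizes numeric columns. It is a sketch only: `StandardizeNumericTransform` is a hypothetical name and the class deliberately does not inherit from any DRUM interface, so it shows the shape of the hooks rather than a deployable task.

```python
import pandas as pd


class StandardizeNumericTransform:
    """Hypothetical transformer: a plain class with fit and transform hooks."""

    def fit(self, X, y=None, **kwargs):
        numeric = X.select_dtypes(include="number")
        # Learn the statistics from the training data only
        self.means_ = numeric.mean()
        self.stds_ = numeric.std(ddof=0).replace(0, 1.0)  # avoid division by zero
        return self

    def transform(self, X, **kwargs):
        # Non-numeric columns are dropped, but every input row yields exactly one output row
        numeric = X[self.means_.index]
        return (numeric - self.means_) / self.stds_


X_raw = pd.DataFrame({"height": [1.0, 2.0, 3.0], "name": ["a", "b", "c"]})
X_transformed = StandardizeNumericTransform().fit(X_raw).transform(X_raw)
```

Here the column count changes (the `name` column is dropped) while the row count stays the same, which is the constraint described above.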

## Understanding the CustomTask Interface (Optional)

# How to upload a model via the web application

TODO: mention what they'll need to copy into custom.py (have a separate folder for this example so they can see the difference between notebook land and custom.py)