In this notebook we create a generic SET UDF script that uses a trained scikit-learn model to make predictions. To use the script, one first has to create and train a scikit-learn model or pipeline and upload its pickle file to BucketFS. The features supplied to the UDF must match those used in model training and/or pre-processing, in the same order. The script emits the sample IDs together with the model's predictions. The model output can be multi-dimensional, and the script works the same way in regression and classification scenarios.

To communicate with the Exasol database we will use the <a href="https://github.com/exasol/pyexasol" target="_blank" rel="noopener">`pyexasol`</a> module.

Prior to using this notebook one needs to [create the database schema](setup_db.ipynb).
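
The upload step mentioned above can be sketched as follows. BucketFS accepts files through an HTTP PUT request with basic authentication, so the standard library is enough for a minimal version. The file name `my_model.pkl`, the helper names, and the placeholder model are assumptions made for illustration; the host, port, and credentials are the sandbox values used in this notebook.

```python
import base64
import pickle
import urllib.request


def bucketfs_upload_url(host, port, bucket, file_name, use_https=False):
    """Build the HTTP(S) URL of a file in a BucketFS bucket."""
    scheme = "https" if use_https else "http"
    return f"{scheme}://{host}:{port}/{bucket}/{file_name}"


def build_upload_request(model, url, user, password):
    """Pickle `model` and wrap it in an authenticated HTTP PUT request."""
    payload = pickle.dumps(model)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, data=payload, method="PUT")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical stand-in for a trained scikit-learn model or pipeline;
# any picklable object is handled the same way.
model = {"placeholder": "trained model goes here"}

url = bucketfs_upload_url("192.168.124.93", "6666", "default", "my_model.pkl")
request = build_upload_request(model, url, user="w", password="write")

# Uncomment to perform the upload against a running BucketFS:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

With the configuration used in this notebook, a file uploaded this way should be visible inside the UDF container as `/buckets/bfsdefault/default/my_model.pkl`, which is the path to pass as the first argument of the prediction script.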


In [1]:
# TODO: Move this to a separate configuration notebook. Here we just need to load this configuration from a store.
from dataclasses import dataclass

@dataclass
class SandboxConfig:
    EXTERNAL_HOST_NAME = "192.168.124.93"
    HOST_PORT = "8888"

    @property
    def EXTERNAL_HOST(self):
        return f"""{self.EXTERNAL_HOST_NAME}:{self.HOST_PORT}"""

    USER = "sys"
    PASSWORD = "exasol"
    BUCKETFS_PORT = "6666"
    BUCKETFS_USER = "w"
    BUCKETFS_PASSWORD = "write"
    BUCKETFS_USE_HTTPS = False
    BUCKETFS_SERVICE = "bfsdefault"
    BUCKETFS_BUCKET = "default"

    @property
    def EXTERNAL_BUCKETFS_HOST(self):
        return f"""{self.EXTERNAL_HOST_NAME}:{self.BUCKETFS_PORT}"""

    @property
    def BUCKETFS_URL_PREFIX(self):
        return "https://" if self.BUCKETFS_USE_HTTPS else "http://"

    @property
    def BUCKETFS_PATH(self):
        # Filesystem-Path to the read-only mounted BucketFS inside the running UDF Container
        return f"/buckets/{self.BUCKETFS_SERVICE}/{self.BUCKETFS_BUCKET}"

    SCRIPT_LANGUAGE_NAME = "PYTHON3_60"
    UDF_FLAVOR = "python3-ds-EXASOL-6.0.0"
    UDF_RELEASE = "20190116"
    UDF_CLIENT = "exaudfclient" # or for newer versions of the flavor exaudfclient_py3
    SCHEMA = "IDA"

    @property
    def SCRIPT_LANGUAGES(self):
        # Build the value from adjacent string literals so that it contains
        # no line breaks or indentation.
        return (
            f"{self.SCRIPT_LANGUAGE_NAME}=localzmq+protobuf:///{self.BUCKETFS_SERVICE}/"
            f"{self.BUCKETFS_BUCKET}/{self.UDF_FLAVOR}?lang=python#buckets/{self.BUCKETFS_SERVICE}/"
            f"{self.BUCKETFS_BUCKET}/{self.UDF_FLAVOR}/exaudf/{self.UDF_CLIENT}"
        )

    @property
    def connection_params(self):
        return {"dsn": self.EXTERNAL_HOST, "user": self.USER, "password": self.PASSWORD, "compression": True}

    @property
    def params(self):
        return {
            "script_languages": self.SCRIPT_LANGUAGES,
            "script_language_name": self.SCRIPT_LANGUAGE_NAME,
            "schema": self.SCHEMA,
            "BUCKETFS_PORT": self.BUCKETFS_PORT,
            "BUCKETFS_USER": self.BUCKETFS_USER,
            "BUCKETFS_PASSWORD": self.BUCKETFS_PASSWORD,
            "BUCKETFS_USE_HTTPS": self.BUCKETFS_USE_HTTPS,
            "BUCKETFS_BUCKET": self.BUCKETFS_BUCKET,
            "BUCKETFS_PATH": self.BUCKETFS_PATH
        }

conf = SandboxConfig()
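
The script-language entry assembled by the configuration has to be a single string without embedded whitespace. The sketch below rebuilds it with a standalone helper (the name `script_languages` is an assumption, not part of the configuration class) and shows how such an entry could be activated for a session. Note that setting `SCRIPT_LANGUAGES` this way replaces the whole list of languages for the session, so in practice the entry would usually be appended to the existing value.

```python
def script_languages(language_name, service, bucket, flavor, client):
    """Assemble one SCRIPT_LANGUAGES entry as a string without whitespace."""
    container = f"{service}/{bucket}/{flavor}"
    return (
        f"{language_name}=localzmq+protobuf:///{container}"
        f"?lang=python#buckets/{container}/exaudf/{client}"
    )


entry = script_languages(
    "PYTHON3_60", "bfsdefault", "default", "python3-ds-EXASOL-6.0.0", "exaudfclient"
)
print(entry)

# Assumed usage: activate the language container for the current session.
activation_sql = f"ALTER SESSION SET SCRIPT_LANGUAGES='{entry}'"
```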

In [3]:
import textwrap
import pyexasol
from stopwatch import Stopwatch

stopwatch = Stopwatch()

# Create script to test the model
sql = textwrap.dedent("""\
CREATE OR REPLACE PYTHON3 SET SCRIPT
{schema!i}.SKLEARN_PREDICT(...)
EMITS(...) AS

# Generic scikit-learn predictor that runs a prediction for a data batch.
# Loads a scikit-learn model or a pipeline from the specified file. Calls its `predict` method
# passing to it all provided data columns. Emits sample IDs and the output of the model.
#
# Note that the model should have been trained without feature names!
# 
# Input columns:
#    [0]:  Full BucketFS path to the model file;
#    [1]:  Sample ID, can be the ROWID of the test batch.
#    [2+]: Feature columns.
#
# Output columns:
#    [0]:  Sample ID copied from the input.
#    [1+]: Model output.

import pickle
import pandas as pd

def run(ctx):
    # Load the model from BucketFS; the first input column holds its path.
    with open(ctx[0], 'rb') as f:
        model = pickle.load(f)

    # Stream the data through the model to reduce the required main memory of the UDF.
    # This allows running the UDF on larger datasets.
    while True:
        # Read the input skipping the first column which holds the model path.
        X_pred = ctx.get_dataframe(num_rows=1000, start_col=1)
        if X_pred is None:
            break

        # Call the model to get the predictions. Omit the first column in the input
        # which holds the sample ids.
        df_features = X_pred.drop(X_pred.columns[0], axis=1)
        y_pred = model.predict(df_features)

        # Combine predictions with the sample ids.
        df_rowid = X_pred[X_pred.columns[0]].reset_index(drop=True)
        df_pred = pd.concat((df_rowid, pd.DataFrame(y_pred)), axis=1)

        # Output data
        ctx.emit(df_pred)
/
""")

with pyexasol.connect(dsn=conf.EXTERNAL_HOST, user=conf.USER, password=conf.PASSWORD, compression=True) as conn:
    conn.execute(query=sql, query_params=conf.params)

print(f"Creating prediction script took: {stopwatch}")

Creating prediction script took: 106.19ms
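
Once the script exists, it can be called from a `SELECT` statement. Because both the input and the output are declared as `(...)`, the output columns have to be named and typed in an `EMITS` clause at call time. The sketch below assembles such a query; the table name `TEST_DATA`, the feature column names, the model file name, and the output types are assumptions made for illustration and must be adapted to the actual data and model.

```python
def build_predict_query(schema, table, model_path, feature_columns, output_columns):
    """Assemble a SELECT that feeds a table through SKLEARN_PREDICT."""
    features = ", ".join(feature_columns)
    emits = ", ".join(f"{name} {sql_type}" for name, sql_type in output_columns)
    return (
        f"SELECT {schema}.SKLEARN_PREDICT('{model_path}', ROWID, {features})\n"
        f"EMITS ({emits})\n"
        f"FROM {schema}.{table}"
    )


query = build_predict_query(
    schema="IDA",
    table="TEST_DATA",  # hypothetical table holding the samples to score
    model_path="/buckets/bfsdefault/default/my_model.pkl",  # hypothetical file name
    feature_columns=["FEATURE_1", "FEATURE_2"],  # hypothetical feature columns
    output_columns=[("SAMPLE_ID", "DECIMAL(36,0)"), ("PREDICTION", "DOUBLE")],
)
print(query)
```

The argument order follows the script's header comment: model path first, then the sample ID, then the features. Plain string interpolation is used here only to keep the sketch self-contained; when running the query through `pyexasol`, the `query_params` formatting shown above is the safer choice.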
