In this notebook we create a generic SET UDF script that uses a trained scikit-learn model to make predictions. To use the script, one has to train a scikit-learn model or pipeline and upload its pickle file to BucketFS. The feature columns supplied to the UDF must match those used in model training and/or pre-processing. The script emits the sample IDs together with the model's predictions, which can be multi-dimensional. It works the same way in regression and classification scenarios.

To communicate with the Exasol database we use the <a href="https://github.com/exasol/pyexasol" target="_blank" rel="noopener">`pyexasol`</a> module.

Prior to using this notebook one needs to [create the database schema](setup_db.ipynb).
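
The model upload mentioned above could look roughly like the following sketch. It only serializes a fitted model and derives the two locations involved: the HTTP URL used to write the file into BucketFS and the filesystem path under which the UDF later reads it. The file name `my_model.pkl` and the helper names are hypothetical; the host, ports and credentials are the defaults from the configuration cell below. Pickle protocol 4 is an assumption made so that the older Python inside the UDF container can read the file; the scikit-learn version in the container should also match the one used for training.

```python
import pickle
from urllib.parse import quote


def dump_model(model) -> bytes:
    """Serialize a fitted model or pipeline to bytes.

    Protocol 4 is chosen (assumption) so that a Python 3.6 UDF container can load it.
    """
    return pickle.dumps(model, protocol=4)


def bucketfs_upload_url(host, port, bucket, file_name, user, password, use_https=False):
    """Build the HTTP(S) URL for writing a file into a BucketFS bucket."""
    scheme = "https" if use_https else "http"
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"{scheme}://{credentials}@{host}:{port}/{bucket}/{file_name}"


def bucketfs_udf_path(service, bucket, file_name):
    """Path of the same file as seen from inside the running UDF container."""
    return f"/buckets/{service}/{bucket}/{file_name}"


MODEL_FILE = "my_model.pkl"  # hypothetical file name

upload_url = bucketfs_upload_url("192.168.124.93", "6666", "default", MODEL_FILE, "w", "write")
udf_path = bucketfs_udf_path("bfsdefault", "default", MODEL_FILE)

# With a fitted `model`, the upload itself could be done with the `requests` package:
#     requests.put(upload_url, data=dump_model(model))
# `udf_path` is then the value to pass as the first argument of the prediction UDF.
```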


In [4]:
# TODO: Move this to a separate configuration notebook. Here we just need to load this configuration from a store.
EXASOL_EXTERNAL_HOST_NAME = "192.168.124.93"
EXASOL_HOST_PORT = "8888"
EXASOL_EXTERNAL_HOST = f"""{EXASOL_EXTERNAL_HOST_NAME}:{EXASOL_HOST_PORT}"""
EXASOL_USER = "sys"
EXASOL_PASSWORD = "exasol"
EXASOL_BUCKETFS_PORT = "6666"
EXASOL_EXTERNAL_BUCKETFS_HOST = f"""{EXASOL_EXTERNAL_HOST_NAME}:{EXASOL_BUCKETFS_PORT}"""
EXASOL_BUCKETFS_USER = "w"
EXASOL_BUCKETFS_PASSWORD = "write"
EXASOL_BUCKETFS_USE_HTTPS = False
EXASOL_BUCKETFS_URL_PREFIX = "https://" if EXASOL_BUCKETFS_USE_HTTPS else "http://"
EXASOL_BUCKETFS_SERVICE = "bfsdefault"
EXASOL_BUCKETFS_BUCKET = "default"
EXASOL_BUCKETFS_PATH = f"/buckets/{EXASOL_BUCKETFS_SERVICE}/{EXASOL_BUCKETFS_BUCKET}" # Filesystem-Path to the read-only mounted BucketFS inside the running UDF Container
EXASOL_SCRIPT_LANGUAGE_NAME = "PYTHON3_60"
EXASOL_UDF_FLAVOR = "python3-ds-EXASOL-6.0.0"
EXASOL_UDF_RELEASE = "20190116"
EXASOL_UDF_CLIENT = "exaudfclient" # or for newer versions of the flavor exaudfclient_py3
EXASOL_SCRIPT_LANGUAGES = f"{EXASOL_SCRIPT_LANGUAGE_NAME}=localzmq+protobuf:///{EXASOL_BUCKETFS_SERVICE}/{EXASOL_BUCKETFS_BUCKET}/{EXASOL_UDF_FLAVOR}?lang=python#buckets/{EXASOL_BUCKETFS_SERVICE}/{EXASOL_BUCKETFS_BUCKET}/{EXASOL_UDF_FLAVOR}/exaudf/{EXASOL_UDF_CLIENT}"
EXASOL_SCHEMA = "IDA"

connection_params = {"dsn": EXASOL_EXTERNAL_HOST, "user": EXASOL_USER, "password": EXASOL_PASSWORD, "compression": True}

params = {
    "script_languages": EXASOL_SCRIPT_LANGUAGES,
    "script_language_name": EXASOL_SCRIPT_LANGUAGE_NAME,
    "schema": EXASOL_SCHEMA,
    "EXASOL_BUCKETFS_PORT": EXASOL_BUCKETFS_PORT,
    "EXASOL_BUCKETFS_USER": EXASOL_BUCKETFS_USER,
    "EXASOL_BUCKETFS_PASSWORD": EXASOL_BUCKETFS_PASSWORD,
    "EXASOL_BUCKETFS_USE_HTTPS": EXASOL_BUCKETFS_USE_HTTPS,
    "EXASOL_BUCKETFS_BUCKET": EXASOL_BUCKETFS_BUCKET,
    "EXASOL_BUCKETFS_PATH": EXASOL_BUCKETFS_PATH
}

In [5]:
import textwrap
import pyexasol
from stopwatch import Stopwatch

stopwatch = Stopwatch()

# Create script to test the model
sql = textwrap.dedent("""\
CREATE OR REPLACE PYTHON3 SET SCRIPT
{schema!i}.SKLEARN_PREDICT(...)
EMITS(...) AS

# Generic scikit-learn predictor that runs a prediction for a data batch.
# Loads a scikit-learn model or a pipeline from the specified file. Calls its `predict` method
# passing to it all provided data columns. Emits sample IDs and the output of the model.
#
# Note that the model should be trained without feature names
# (e.g. on a NumPy array rather than a DataFrame with named columns)!
# 
# Input columns:
#    [0]:  Full BucketFS path to the model file;
#    [1]:  Sample ID, can be the ROWID of the test batch.
#    [2+]: Feature columns.
#
# Output columns:
#    [0]:  Sample ID copied from the input.
#    [1+]: Model output.

import pickle
import pandas as pd

def run(ctx):
    # Load the model from BucketFS
    with open(ctx[0], 'rb') as f:
        model = pickle.load(f)

    # Stream the data through the model to reduce the required main memory of the UDF.
    # This allows running the UDF on larger datasets.
    while True:
        # Read the input skipping the first column which holds the model path.
        X_pred = ctx.get_dataframe(num_rows=1000, start_col=1)
        if X_pred is None:
            break

        # Call the model to get the predictions. Omit the first column in the input
        # which holds the sample ids.
        df_features = X_pred.drop(X_pred.columns[0], axis=1)
        y_pred = model.predict(df_features)

        # Combine predictions with the sample ids.
        df_rowid = X_pred[X_pred.columns[0]].reset_index(drop=True)
        df_pred = pd.concat((df_rowid, pd.DataFrame(y_pred)), axis=1)

        # Output data
        ctx.emit(df_pred)
/
""")

with pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True) as conn:
    conn.execute(query=sql, query_params=params)

print(f"Creating prediction script took: {stopwatch}")

Creating prediction script took: 38.46ms
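
Once the script exists, it can be called from a `SELECT` statement. Because the script is declared with `EMITS(...)`, the output columns have to be declared in the query itself. The sketch below only assembles such a query string. The table name `TEST_DATA`, the feature columns `F1` and `F2`, the model file name and the output column names and types are all hypothetical, and `DECIMAL(36,0)` for the `ROWID` is an assumption; adjust them to the actual data and model.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_predict_query(schema, table, feature_columns, model_path, output_columns):
    """Assemble a SELECT that calls SKLEARN_PREDICT on every row of a table.

    output_columns is a list of (name, sql_type) pairs for the EMITS clause;
    the first pair describes the sample ID, the rest the model output.
    """
    features = ", ".join(quote_ident(c) for c in feature_columns)
    emits = ", ".join(f"{quote_ident(n)} {t}" for n, t in output_columns)
    model_literal = "'" + model_path.replace("'", "''") + "'"
    return (
        f"SELECT {quote_ident(schema)}.SKLEARN_PREDICT({model_literal}, ROWID, {features}) "
        f"EMITS ({emits}) "
        f"FROM {quote_ident(schema)}.{quote_ident(table)}"
    )


query = build_predict_query(
    schema="IDA",
    table="TEST_DATA",  # hypothetical table
    feature_columns=["F1", "F2"],  # hypothetical feature columns
    model_path="/buckets/bfsdefault/default/my_model.pkl",  # hypothetical model file
    output_columns=[("SAMPLE_ID", "DECIMAL(36,0)"), ("PREDICTION", "DOUBLE")],
)

# The query could then be run over an open pyexasol connection, for example:
#     df_pred = conn.export_to_pandas(query)
```

Without a `GROUP BY` clause the SET script receives all rows of the table as a single group, which is why the script reads its input in batches.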
