# Generic Prediction UDF

In this notebook, we create a generic SET UDF script that uses a trained scikit-learn model to make predictions. To use the script, first create and train a scikit-learn model or pipeline and upload its pickle file to BucketFS. The features supplied to the UDF must match those used in model training and/or pre-processing, in both number and order. The script emits the sample IDs together with the model output, which can be multi-dimensional. It works the same way in regression and classification scenarios.

To communicate with the Exasol database we will be using the <a href="https://github.com/exasol/pyexasol" target="_blank" rel="noopener">`pyexasol`</a> module.
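
The preparation step described above (training a model and producing its pickle) might look like the following minimal sketch. The toy data, the choice of `StandardScaler` with `LogisticRegression`, and the variable names are assumptions made for illustration only; any fitted scikit-learn estimator or pipeline with a `predict` method should work the same way. The model is fitted on a NumPy array rather than a DataFrame, so no feature names are stored in it. Uploading the resulting bytes to BucketFS is not shown here, since it depends on your setup.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data, made up for illustration: two features, binary label.
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array([0, 0, 1, 1])

# Fitting on a NumPy array (not a DataFrame) keeps feature names out of the model.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Serialize the fitted pipeline. These bytes are the content of the pickle file
# that would be uploaded to BucketFS.
model_bytes = pickle.dumps(model)

# Round trip: the restored object exposes the same `predict` method the UDF calls.
restored = pickle.loads(model_bytes)
print(restored.predict(X_train))
```

Note that unpickling requires a compatible scikit-learn version, so the version used for training should match the one available to the UDF.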

## Prerequisites

Prior to using this notebook the following steps need to be completed:
1. [Configure the sandbox](../sandbox_config.ipynb).

## Setup

### Access configuration

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

## Create UDF

In [None]:
import textwrap
from exasol.connections import open_pyexasol_connection
from stopwatch import Stopwatch

stopwatch = Stopwatch()

# Create the UDF script that runs predictions with a stored model
sql = textwrap.dedent("""\
CREATE OR REPLACE PYTHON3 SET SCRIPT
{schema!q}.SKLEARN_PREDICT(...)
EMITS(...) AS

# Generic scikit-learn predictor that runs a prediction for a data batch.
# Loads a scikit-learn model or a pipeline from the specified file. Calls its `predict` method
# passing to it all provided data columns. Emits sample IDs and the output of the model.
#
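# Note that the model should be trained without feature names (e.g. on a NumPy array),
# because the column names seen by this script may not match those used in training.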
# 
# Input columns:
#    [0]:  Full BucketFS path to the model file.
#    [1]:  Sample ID, can be the ROWID of the test batch.
#    [2+]: Feature columns.
#
# Output columns:
#    [0]:  Sample ID copied from the input.
#    [1+]: Model output.

import pickle
import pandas as pd

def run(ctx):
    # Load the model from BucketFS; the path is taken from the first input column.
    with open(ctx[0], 'rb') as f:
        model = pickle.load(f)

    # Stream the data through the model to reduce the required main memory of the UDF.
    # This allows running the UDF on larger datasets.
    while True:
        # Read the input skipping the first column which holds the model path.
        X_pred = ctx.get_dataframe(num_rows=1000, start_col=1)
        if X_pred is None:
            break

        # Call the model to get the predictions. Omit the first column in the input
        # which holds the sample IDs.
        df_features = X_pred.drop(X_pred.columns[0], axis=1)
        y_pred = model.predict(df_features)

        # Combine predictions with the sample IDs.
        df_rowid = X_pred[X_pred.columns[0]].reset_index(drop=True)
        df_pred = pd.concat((df_rowid, pd.DataFrame(y_pred)), axis=1)

        # Emit the sample IDs and the predictions for this batch.
        ctx.emit(df_pred)
/
""")

with open_pyexasol_connection(sb_config, compression=True) as conn:
    conn.execute(query=sql, query_params={'schema': sb_config.SCHEMA})

print(f"Creating prediction script took: {stopwatch}")
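
Once the script exists, it can be called from a `SELECT` statement. Because both the input and the output of the script are declared as variadic (`...`), the call has to list the input columns in the order described in the script header and spell out the output columns in an `EMITS` clause. The sketch below only composes such a statement as a string. The helper names, the schema, table and column names, the BucketFS path, and the output column types are all hypothetical and need to be adapted to your data and model.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_predict_query(schema, table, model_path, feature_columns, emits, id_expr="ROWID"):
    """Compose a SELECT statement that feeds a table through SKLEARN_PREDICT.

    `emits` is a list of (column name, SQL type) pairs describing the output;
    the first pair is the sample ID, the rest are the model outputs.
    """
    # Input order expected by the script: model path, sample ID, then the features.
    path_literal = "'" + model_path.replace("'", "''") + "'"
    args = ", ".join([path_literal, id_expr] + [quote_ident(c) for c in feature_columns])
    emit_spec = ", ".join(f"{quote_ident(n)} {t}" for n, t in emits)
    return (
        f"SELECT {quote_ident(schema)}.SKLEARN_PREDICT({args}) "
        f"EMITS ({emit_spec}) "
        f"FROM {quote_ident(schema)}.{quote_ident(table)}"
    )


# Hypothetical names and types, for illustration only.
query = build_predict_query(
    schema="MY_SCHEMA",
    table="TEST_DATA",
    model_path="/buckets/bfsdefault/default/my_model.pkl",
    feature_columns=["F1", "F2"],
    emits=[("SAMPLE_ID", "DECIMAL(36,0)"), ("PREDICTION", "DOUBLE")],
)
print(query)
```

The resulting string could then be passed to `conn.execute` in the same way as the script definition above. For large tables it may be worth adding a `GROUP BY` clause so that the work is split across several UDF instances.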