# Predicting with a Regression Model

In this notebook, we will use a <a href="https://scikit-learn.org/stable/" target="_blank" rel="noopener">`scikit-learn`</a> model created earlier to predict the age of an abalone from its physical measurements and sex. You can find more information about the problem domain <a href="https://archive.ics.uci.edu/dataset/1/abalone" target="_blank" rel="noopener">here</a>.

We will use a generic prediction UDF script. To execute queries and load data from the Exasol database, we will use the <a href="https://github.com/exasol/pyexasol" target="_blank" rel="noopener">`pyexasol`</a> module.

## Prerequisites

Before using this notebook, complete the following steps:
1. [Create generic scikit-learn prediction UDF script](sklearn_predict_udf.ipynb).
2. [Train a model on the Abalone dataset](sklearn_train_abalone.ipynb).

## Setup

### Access configuration

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

## Run predictions

Let's make predictions on the data in the table `ABALONE_TEST`. This table also includes a column with ground truth labels, which we will use to assess the performance of our predictor. In the code below, we pass the ROWID to the UDF (as required by the generic prediction UDF) so that it is returned with each prediction. This will allow us to link each result to its ground truth label.

In [None]:
from exasol.connections import open_pyexasol_connection, get_udf_bucket_path
from stopwatch import Stopwatch

target_column = 'RINGS'
bfs_model_path = get_udf_bucket_path(sb_config) + '/abalone_svm_model.pkl'
params = {'schema': sb_config.SCHEMA, 'test_table': 'ABALONE_TEST', 'model_path': bfs_model_path}

stopwatch = Stopwatch()

with open_pyexasol_connection(sb_config, compression=True) as conn:
    # Get the list of feature columns
    sql = 'SELECT * FROM {schema!q}.{test_table!q} LIMIT 1'
    df_tmp = conn.export_to_pandas(query_or_table=sql, query_params=params)
    params['column_names'] = [f'[{c}]' for c in df_tmp.columns if c != target_column]

    # Get the predictions for all rows in the test table by calling the prediction UDF.
    # The first two parameters are the model path and the row ID.
    sql = f'SELECT {{schema!q}}.SKLEARN_PREDICT({{model_path!s}}, ROWID, {{column_names!r}}) ' \
        f'emits ([sample_id] DECIMAL(20,0), [{target_column}] DOUBLE) FROM {{schema!q}}.{{test_table!q}}'
    df_pred = conn.export_to_pandas(query_or_table=sql, query_params=params)

print(f"Getting predictions took: {stopwatch}")
df_pred.head()
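
The queries above mix two formatting stages: the f-string fills in `target_column` immediately, while the doubled braces are left behind as placeholders for the `query_params` substitution. The sketch below illustrates this with plain Python. `SketchFormatter` is a hypothetical stand-in written for this example only; it is not the `pyexasol` implementation, and `MY_SCHEMA` is a made-up schema name.

```python
import string

target_column = 'RINGS'

# Stage 1: the f-string substitutes {target_column} right away, while
# doubled braces collapse to single braces and stay as placeholders.
sql = f'SELECT [{target_column}] FROM {{schema!q}}.{{test_table!q}}'
print(sql)  # SELECT [RINGS] FROM {schema!q}.{test_table!q}


class SketchFormatter(string.Formatter):
    """Hypothetical stand-in for a SQL-aware formatter (assumption, not pyexasol)."""

    def convert_field(self, value, conversion):
        if conversion == 'q':
            # Treat the value as an identifier: wrap it in double quotes
            # and double any embedded double quotes.
            return '"' + str(value).replace('"', '""') + '"'
        return super().convert_field(value, conversion)


# Stage 2: the remaining placeholders are filled from the parameters.
final_sql = SketchFormatter().format(sql, schema='MY_SCHEMA', test_table='ABALONE_TEST')
print(final_sql)  # SELECT [RINGS] FROM "MY_SCHEMA"."ABALONE_TEST"
```

Keeping the schema and table names in the second stage means they go through the formatter's quoting rules rather than being pasted into the SQL text directly.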

## Evaluate predictions

We are going to check the performance of our predictor by linking the results to the ground truth labels and computing some regression metrics. This should give us similar results to what we have seen in the [training notebook](sklearn_train_abalone.ipynb).
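
As a small illustration of the linking step, the sketch below merges two toy data frames on `sample_id`. The values are invented for this example and do not come from the Abalone data. Because both frames carry a `RINGS` column, the suffixes decide how the two columns are named in the result.

```python
import pandas as pd

# Toy ground truth and predictions; the rows are deliberately in a different order.
df_true = pd.DataFrame({'sample_id': [101, 102, 103], 'RINGS': [9, 10, 7]})
df_pred = pd.DataFrame({'sample_id': [103, 101, 102], 'RINGS': [7.4, 8.6, 10.9]})

# The default inner join matches rows by sample_id, regardless of row order.
# The shared RINGS column is split into RINGS_true and RINGS_pred.
df_res = pd.merge(left=df_true, right=df_pred, on='sample_id', suffixes=('_true', '_pred'))
print(df_res)
```

Joining on the row ID rather than relying on row position matters here, since nothing guarantees that two separate queries return rows in the same order.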

In [None]:
import pandas as pd
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt

# Get the ground truth labels for the test set.
with open_pyexasol_connection(sb_config, compression=True) as conn:
    sql = f'SELECT ROWID AS [sample_id], [{target_column}] FROM {{schema!q}}.{{test_table!q}}'
    df_true = conn.export_to_pandas(query_or_table=sql, query_params=params)

# Merge predictions and the ground truth on the sample ID.
df_res = pd.merge(left=df_true, right=df_pred, on='sample_id', suffixes=['_true', '_pred'])

y_true = df_res[f'{target_column}_true']
y_pred = df_res[f'{target_column}_pred']

print('Mean absolute error:', mean_absolute_error(y_true, y_pred))
print('Mean squared error:', mean_squared_error(y_true, y_pred))
print('Explained variance:', explained_variance_score(y_true, y_pred))
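
To make the three metrics more concrete, here is a minimal sketch that computes them by hand on a few invented values, using only the standard library. It follows the usual definitions: the mean of absolute errors, the mean of squared errors, and one minus the ratio of the error variance to the variance of the true values. The numbers are illustrative and unrelated to the Abalone results.

```python
from statistics import fmean, pvariance

# Invented values for illustration only.
y_true = [9.0, 10.0, 7.0, 12.0]
y_pred = [8.0, 11.0, 7.0, 10.0]

# Per-sample errors: [1.0, -1.0, 0.0, 2.0]
errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean absolute error: average size of the errors, in the unit of the target.
mae = fmean(abs(e) for e in errors)

# Mean squared error: penalises large errors more heavily.
mse = fmean(e * e for e in errors)

# Explained variance: 1 - Var(errors) / Var(y_true); 1.0 is the best score.
explained_variance = 1 - pvariance(errors) / pvariance(y_true)

print('Mean absolute error:', mae)
print('Mean squared error:', mse)
print('Explained variance:', round(explained_variance, 4))
```

Note that the explained variance is computed from the variance of the errors, so a constant offset in the predictions does not lower it; the absolute and squared errors are the metrics that would reveal such a bias.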