# Training a Regression Model

In this notebook, we will train a simple regression model for predicting the age of an abalone, expressed as its number of rings (the `RINGS` column), from its physical measurements and sex. You can find more information about the problem domain <a href="https://archive.ics.uci.edu/dataset/1/abalone" target="_blank" rel="noopener">here</a>.

We will train the model in this notebook using <a href="https://scikit-learn.org/stable/" target="_blank" rel="noopener">`scikit-learn`</a>, on the training data we are going to export from the database.

To execute queries and load data from the Exasol database we will be using the <a href="https://github.com/exasol/pyexasol" target="_blank" rel="noopener">`pyexasol`</a> module.

## Prerequisites

Before using this notebook, complete the following steps:
1. [Configure the sandbox](../sandbox_config.ipynb).
2. [Load the Abalone data](../data/data_abalone.ipynb).

## Setup

### Access configuration

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

## Load data

First, we will export data into a pandas DataFrame and split it into training and validation sets.

In [None]:
from exasol.connections import open_pyexasol_connection
from sklearn.model_selection import train_test_split
from stopwatch import Stopwatch

stopwatch = Stopwatch()

with open_pyexasol_connection(sb_config, compression=True) as conn:
    df = conn.export_to_pandas(query_or_table=(sb_config.SCHEMA, 'ABALONE_TRAIN'))

X, y = df.drop(columns='RINGS'), df['RINGS']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

print(f"Loading the data took: {stopwatch}")
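
The split above is random, so the validation metrics will differ slightly from run to run. The following is a minimal sketch, on a synthetic stand-in table rather than the real Abalone data, of how passing a fixed `random_state` makes the split reproducible. The column values and the seed `42` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the exported table (hypothetical values, not Abalone data).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "SEX": rng.choice(["M", "F", "I"], size=100),
    "LENGTH": rng.uniform(0.1, 0.8, size=100),
    "RINGS": rng.integers(1, 30, size=100),
})
demo_X, demo_y = demo_df.drop(columns="RINGS"), demo_df["RINGS"]

# Two splits with the same seed select exactly the same rows.
split_a = train_test_split(demo_X, demo_y, test_size=0.2, random_state=42)
split_b = train_test_split(demo_X, demo_y, test_size=0.2, random_state=42)

n_train, n_valid = len(split_a[0]), len(split_a[1])
same_rows = split_a[0].index.equals(split_b[0].index)
print(n_train, n_valid, same_rows)
```

Whether to fix the seed is a judgment call: it helps when comparing models across runs, while an unseeded split gives a rough feel for how much the metrics vary.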

## Analyze data

Let's look at the features. First, we will check which physical measurements have predictive power by computing the mutual information between each measurement column and the target.

In [None]:
from sklearn.feature_selection import mutual_info_regression

X_meas = X_train.drop(columns='SEX')
mi = mutual_info_regression(X_meas, y_train)
dict(zip(X_meas.columns, mi))
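
To get a feel for how to read mutual information scores, here is a small sketch on synthetic data (not the Abalone table). One feature determines the target through a non-linear relationship and the other is pure noise. All names and values below are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500

informative = rng.normal(size=n)   # drives the target
noise = rng.normal(size=n)         # unrelated to the target
# The dependence is quadratic, so the linear correlation is close to zero.
target = informative ** 2 + 0.1 * rng.normal(size=n)

features = np.column_stack([informative, noise])
mi_demo = mutual_info_regression(features, target, random_state=0)

linear_corr = np.corrcoef(informative, target)[0, 1]
print("Mutual information:", dict(zip(["informative", "noise"], mi_demo)))
print("Linear correlation of the informative feature:", linear_corr)
```

The informative feature should score clearly higher than the noise feature even though its linear correlation with the target is weak. This is the main reason to prefer mutual information over a plain correlation coefficient when screening features.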


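Now let's see whether `SEX` is a good predictor. We will run a one-way ANOVA test, treating `SEX` as the grouping variable and the number of rings as the measured value, and print the p-value.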

In [None]:
from sklearn.feature_selection import f_classif

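# f_classif expects (features, classes): the ring counts act as the single "feature"
# and SEX supplies the groups. It returns (F-statistics, p-values); we take the p-value.
f_classif(y_train.to_frame(), X_train['SEX'])[1][0]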
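
The call above may look inverted, because the target is passed as the feature and `SEX` as the class label. The sketch below, on synthetic groups with made-up means, suggests that this is an ordinary one-way ANOVA: `f_classif` should give the same F-statistic as `scipy.stats.f_oneway` applied to the per-group values.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Hypothetical ring counts for three groups with different means.
groups = {
    "M": rng.normal(10, 3, 200),
    "F": rng.normal(11, 3, 200),
    "I": rng.normal(8, 3, 200),
}
values = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), 200)

# scikit-learn: a single "feature" column, grouped by label.
f_stat, p_values = f_classif(values.reshape(-1, 1), labels)

# SciPy: the same test expressed as one array per group.
scipy_result = f_oneway(*groups.values())

print("f_classif:", f_stat[0], p_values[0])
print("f_oneway: ", scipy_result.statistic, scipy_result.pvalue)
```

A small p-value indicates that the mean number of rings differs between at least two of the groups, which is what makes the grouping variable useful as a predictor.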

## Train model

Let's build a pipeline that uses all input features. We will one-hot encode the `SEX` column and standardize all other columns, including the target. As the regression model we will use a Support Vector Machine. We will pass the inputs without column names, because the names will not be available in the prediction UDF.

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Create the pipeline. Columns are addressed by position:
# column 0 is SEX, columns 1-7 are the physical measurements.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), [1, 2, 3, 4, 5, 6, 7]),
        ("cat", OneHotEncoder(), [0]),
    ]
)
regressor = SVR(kernel='rbf')
model = Pipeline(
    steps=[("preprocessor", preprocessor), ('regressor', regressor)]
)
model = TransformedTargetRegressor(regressor=model, transformer=StandardScaler())

stopwatch = Stopwatch()

# Train the model.
model.fit(X_train.values, y_train.values)

print(f"Training of the model took: {stopwatch}")
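
`TransformedTargetRegressor` is what standardizes the target here. The following sketch, on synthetic data with a plain `SVR` instead of the full pipeline, illustrates what the wrapper is expected to do: scale the target before fitting and map predictions back to the original scale. The data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 10 + 3 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Wrapped version: target scaling is handled inside the estimator.
wrapped = TransformedTargetRegressor(regressor=SVR(kernel="rbf"),
                                     transformer=StandardScaler())
wrapped.fit(X_demo, y_demo)
pred_wrapped = wrapped.predict(X_demo)

# Manual version: scale the target, fit, then invert the scaling on predictions.
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y_demo.reshape(-1, 1)).ravel()
manual = SVR(kernel="rbf").fit(X_demo, y_scaled)
pred_manual = y_scaler.inverse_transform(
    manual.predict(X_demo).reshape(-1, 1)
).ravel()

print("Predictions match:", np.allclose(pred_wrapped, pred_manual))
```

Keeping the target scaling inside the model object means the pickled model returns predictions in the original units, so the code that later loads the model does not need to know about the scaler.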

## Evaluate model

Let's check the prediction performance on the validation set by printing some regression metrics.

In [None]:
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error

y_pred = model.predict(X_valid.values)

print('Mean absolute error:', mean_absolute_error(y_valid, y_pred))
print('Mean squared error:', mean_squared_error(y_valid, y_pred))
print('Explained variance:', explained_variance_score(y_valid, y_pred))
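
As a reminder of what these three numbers mean, here is a small worked example on four made-up values, computing each metric by hand with NumPy and comparing it with the scikit-learn function.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error

y_true = np.array([9.0, 10.0, 7.0, 12.0])
y_hat = np.array([8.0, 11.0, 7.0, 10.0])
residuals = y_true - y_hat  # [1, -1, 0, 2]

# Mean absolute error: the average size of the residuals.
mae = np.mean(np.abs(residuals))        # (1 + 1 + 0 + 2) / 4 = 1.0
# Mean squared error: the average squared residual, which penalizes large misses.
mse = np.mean(residuals ** 2)           # (1 + 1 + 0 + 4) / 4 = 1.5
# Explained variance: 1 - Var(residuals) / Var(y_true) = 1 - 1.25 / 3.25
explained = 1 - np.var(residuals) / np.var(y_true)

print(mae, mean_absolute_error(y_true, y_hat))
print(mse, mean_squared_error(y_true, y_hat))
print(explained, explained_variance_score(y_true, y_hat))
```

The mean absolute error is in the same units as the target (rings), which makes it the easiest of the three to interpret. An explained variance of 1.0 would correspond to residuals with no spread at all.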

## Upload model into BucketFS

Now, let's upload the model to BucketFS so that it can be used for making predictions in SQL queries. To communicate with BucketFS, we will use the <a href="https://exasol.github.io/bucketfs-python/" target="_blank" rel="noopener">`bucketfs-python`</a> module.

In [None]:
import pickle
from exasol.connections import open_bucketfs_connection

MODEL_FILE = 'abalone_svm_model.pkl'

stopwatch = Stopwatch()

# Connect to the BucketFS service
bucket = open_bucketfs_connection(sb_config)

# Serialize the model into a byte string and upload it to BucketFS,
# where it will be saved in a file with the specified name.
bucket.upload(MODEL_FILE, pickle.dumps(model))

print(f"Uploading the model took: {stopwatch}")
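
The upload relies on `pickle` turning the fitted model into bytes that can later be turned back into a working model. The sketch below checks that round trip locally on a small synthetic model, without touching BucketFS. The data and names are assumptions for illustration only.

```python
import pickle

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1]

demo_model = SVR(kernel="rbf").fit(X_demo, y_demo)

# Serialize to bytes, as done before the upload, and restore from those bytes.
payload = pickle.dumps(demo_model)
restored = pickle.loads(payload)

round_trip_ok = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print(type(payload).__name__, len(payload), round_trip_ok)
```

Note that a pickled scikit-learn model is generally only safe to load with a compatible scikit-learn version, so the environment that loads the model should match the one used for training as closely as possible.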

Now we are ready to use this model in our SQL queries. This is demonstrated in the [following notebook](sklearn_predict_abalone.ipynb).