In this notebook we will train a simple regression model that predicts the age of an abalone from its physical measurements and sex. You can find more information about the problem domain <a href="https://archive.ics.uci.edu/dataset/1/abalone" target="_blank" rel="noopener">here</a>.

We will train the model in this notebook using <a href="https://scikit-learn.org/stable/" target="_blank" rel="noopener">`scikit-learn`</a>, on the training data we are going to export from the database.

To execute queries and load data from the Exasol database we will use the <a href="https://github.com/exasol/pyexasol" target="_blank" rel="noopener">`pyexasol`</a> module.

Prior to using this notebook one needs to complete the following steps:
1. [Create the database schema](../setup_db.ipynb).
2. [Load the Abalone data](../data/data_abalone.ipynb).

In [1]:
# TODO: Move this to a separate configuration notebook. Here we just need to load this configuration from a store.
from dataclasses import dataclass

@dataclass
class SandboxConfig:
    EXTERNAL_HOST_NAME = "192.168.124.93"
    HOST_PORT = "8888"

    @property
    def EXTERNAL_HOST(self):
        return f"""{self.EXTERNAL_HOST_NAME}:{self.HOST_PORT}"""

    USER = "sys"
    PASSWORD = "exasol"
    BUCKETFS_PORT = "6666"
    BUCKETFS_USER = "w"
    BUCKETFS_PASSWORD = "write"
    BUCKETFS_USE_HTTPS = False
    BUCKETFS_SERVICE = "bfsdefault"
    BUCKETFS_BUCKET = "default"

    @property
    def EXTERNAL_BUCKETFS_HOST(self):
        return f"""{self.EXTERNAL_HOST_NAME}:{self.BUCKETFS_PORT}"""

    @property
    def BUCKETFS_URL_PREFIX(self):
        return "https://" if self.BUCKETFS_USE_HTTPS else "http://"

    @property
    def BUCKETFS_PATH(self):
        # Filesystem-Path to the read-only mounted BucketFS inside the running UDF Container
        return f"/buckets/{self.BUCKETFS_SERVICE}/{self.BUCKETFS_BUCKET}"

    SCRIPT_LANGUAGE_NAME = "PYTHON3_60"
    UDF_FLAVOR = "python3-ds-EXASOL-6.0.0"
    UDF_RELEASE= "20190116"
    UDF_CLIENT = "exaudfclient" # or for newer versions of the flavor exaudfclient_py3
    SCHEMA = "IDA"

    @property
    def SCRIPT_LANGUAGES(self):
        # Built from adjacent f-strings so that the URL contains no embedded newlines or spaces.
        return (
            f"{self.SCRIPT_LANGUAGE_NAME}=localzmq+protobuf:///{self.BUCKETFS_SERVICE}/"
            f"{self.BUCKETFS_BUCKET}/{self.UDF_FLAVOR}?lang=python#buckets/{self.BUCKETFS_SERVICE}/"
            f"{self.BUCKETFS_BUCKET}/{self.UDF_FLAVOR}/exaudf/{self.UDF_CLIENT}"
        )

    @property
    def connection_params(self):
        # pyexasol.connect expects the keyword "dsn" (data source name), not "dns".
        return {"dsn": self.EXTERNAL_HOST, "user": self.USER, "password": self.PASSWORD, "compression": True}

    @property
    def params(self):
        return {
            "script_languages": self.SCRIPT_LANGUAGES,
            "script_language_name": self.SCRIPT_LANGUAGE_NAME,
            "schema": self.SCHEMA,
            "BUCKETFS_PORT": self.BUCKETFS_PORT,
            "BUCKETFS_USER": self.BUCKETFS_USER,
            "BUCKETFS_PASSWORD": self.BUCKETFS_PASSWORD,
            "BUCKETFS_USE_HTTPS": self.BUCKETFS_USE_HTTPS,
            "BUCKETFS_BUCKET": self.BUCKETFS_BUCKET,
            "BUCKETFS_PATH": self.BUCKETFS_PATH
        }

conf = SandboxConfig()

First, we will export the training data into a pandas DataFrame and split it into training and validation sets.

In [2]:
import pyexasol
from sklearn.model_selection import train_test_split
from stopwatch import Stopwatch

stopwatch = Stopwatch()

with pyexasol.connect(dsn=conf.EXTERNAL_HOST, user=conf.USER, password=conf.PASSWORD, compression=True) as conn:
    df = conn.export_to_pandas(query_or_table=(conf.SCHEMA, 'ABALONE_TRAIN'))

X, y = df.drop(columns='RINGS'), df['RINGS']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

print(f"Loading the data took: {stopwatch}")

Loading the data took: 1.27s
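
Note that the split above is random, so the validation metrics will differ slightly from run to run. `train_test_split` accepts a `random_state` argument for fixing the shuffle. The following plain-Python sketch illustrates the idea of a seeded 80/20 split. The helper name `seeded_split` is hypothetical, and this is not scikit-learn's actual implementation.

```python
import math
import random

def seeded_split(rows, valid_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and cut off a validation part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_valid = math.ceil(len(rows) * valid_fraction)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    return [rows[i] for i in train_idx], [rows[i] for i in valid_idx]

# Ten made-up rows; the same seed always yields the same partition.
rows = list(range(10))
train_a, valid_a = seeded_split(rows, seed=42)
train_b, valid_b = seeded_split(rows, seed=42)
```

With a fixed seed, repeated runs evaluate the model on the same validation rows, which makes the metrics comparable between experiments.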


Let's look at the features. We will first check which physical measurements have predictive power. For that we will compute the mutual information between each measurement column and the target.

In [4]:
from sklearn.feature_selection import mutual_info_regression

X_meas = X_train.drop(columns='SEX')
mi = mutual_info_regression(X_meas, y_train)
dict(zip(X_meas.columns, mi))


{'LENGTH': 0.3777307983714149,
 'DIAMETER': 0.4130225409481696,
 'HEIGHT': 0.3730705618531607,
 'WHOLE_WEIGHT': 0.3839492388004473,
 'SHUCKED_WEIGHT': 0.33404766440572065,
 'VISCERA_WEIGHT': 0.35472141609994523,
 'SHELL_WEIGHT': 0.4472509397786135}

Now let's see if SEX is a good predictor. We will run a one-way ANOVA F-test, which checks whether the mean number of rings differs between the sexes, and print the p-value.

In [6]:
from sklearn.feature_selection import f_classif

f_classif(y_train.to_frame(), X_train['SEX'])[1][0]

2.5467010907188578e-126
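
To make the test above more concrete, here is a minimal plain-Python sketch of the one-way ANOVA F statistic, the quantity from which such a p-value is derived. The function name `anova_f` and the ring counts are made up for illustration. The sketch stops at the F statistic and does not compute the p-value.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for a dict of {label: list of values}."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    # Between-group variability: how far each group mean is from the grand mean.
    ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values())
    # Within-group variability: spread of the values around their own group mean.
    ss_within = sum((v - mean(vals)) ** 2 for vals in groups.values() for v in vals)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up ring counts per sex, for illustration only.
rings_by_sex = {"M": [10, 11, 12], "F": [10, 12, 14], "I": [5, 6, 7]}
f_stat = anova_f(rings_by_sex)
```

A large F statistic means the group means differ by much more than the spread within each group would suggest, which translates into a small p-value.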

Let's make a pipeline. We will use all input features. We will one-hot encode the SEX column and standardize all other columns, as well as the target. We will use a Support Vector Machine as the regression model. We will pass the inputs as plain arrays without column names, since the names will not be available in the prediction UDF.

In [7]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Create the pipeline. Columns are addressed by position:
# column 0 is SEX, columns 1-7 are the physical measurements.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), [1, 2, 3, 4, 5, 6, 7]),
        ("cat", OneHotEncoder(), [0]),
    ]
)
regressor = SVR(kernel='rbf')
model = Pipeline(
    steps=[("preprocessor", preprocessor), ('regressor', regressor)]
)
model = TransformedTargetRegressor(regressor=model, transformer=StandardScaler())

stopwatch = Stopwatch()

# Train the model.
model.fit(X_train.values, y_train.values)

print(f"Training of the model took: {stopwatch}")

Training of the model took: 168.23ms
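
As an illustration of what the preprocessing in this pipeline does to the data, here is a minimal plain-Python sketch of standardization and one-hot encoding. The helper names and the sample values are hypothetical, and the sketch is only an approximation of the behaviour of `StandardScaler`, `OneHotEncoder` and `TransformedTargetRegressor`, not their implementation.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score using the population standard deviation, as StandardScaler does."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma

def one_hot(values):
    """One indicator column per category, with categories in sorted order."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

# Made-up values, for illustration only.
sex = ["M", "F", "I", "M"]
length = [0.2, 0.4, 0.6, 0.8]
rings = [7, 9, 5, 11]

scaled_length, _, _ = standardize(length)
encoded_sex = one_hot(sex)
# Numeric columns come first, then the encoded category, matching the transformer order.
features = [[x] + e for x, e in zip(scaled_length, encoded_sex)]

# The target is scaled for training; predictions are mapped back to ring counts.
scaled_rings, mu, sigma = standardize(rings)
restored_rings = [z * sigma + mu for z in scaled_rings]
```

Scaling the target and mapping predictions back means the metrics are still reported in rings rather than in standardized units.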


Let's see what prediction performance we've got, printing some regression metrics.

In [8]:
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error

y_pred = model.predict(X_valid.values)

print('Mean absolute error:', mean_absolute_error(y_valid, y_pred))
print('Mean squared error:', mean_squared_error(y_valid, y_pred))
print('Explained variance:', explained_variance_score(y_valid, y_pred))

Mean absolute error: 1.3993917946085437
Mean squared error: 4.266767837896115
Explained variance: 0.5731816112246741
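
For reference, the three metrics printed above can be written out in a few lines of plain Python. This is a sketch of the standard definitions on made-up numbers. The function name `regression_metrics` is hypothetical, and the code is not how scikit-learn implements these metrics.

```python
from statistics import mean, pvariance

def regression_metrics(y_true, y_pred):
    """Return (mean absolute error, mean squared error, explained variance)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = mean([abs(r) for r in residuals])
    mse = mean([r ** 2 for r in residuals])
    # Share of the target's variance that is not left over in the residuals.
    explained_variance = 1 - pvariance(residuals) / pvariance(y_true)
    return mae, mse, explained_variance

# Made-up true and predicted ring counts.
mae, mse, ev = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
```

Because the mean squared error is in squared units, its square root is easier to interpret: for the run above, an MSE of about 4.27 corresponds to a root mean squared error of roughly 2.07 rings.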


Now, let's upload the model to BucketFS, so that it can be used for making predictions in SQL queries. To communicate with BucketFS we will use the <a href="https://exasol.github.io/bucketfs-python/" target="_blank" rel="noopener">`bucketfs-python`</a> module.

In [9]:
import pickle
from exasol.bucketfs import Service

MODEL_FILE = 'abalone_svm_model.pkl'

# Setup the connection parameters.
buckfs_url = f'{conf.BUCKETFS_URL_PREFIX}{conf.EXTERNAL_BUCKETFS_HOST}'
buckfs_credentials = {conf.BUCKETFS_BUCKET: {'username': conf.BUCKETFS_USER, 'password': conf.BUCKETFS_PASSWORD}}

stopwatch = Stopwatch()

# Connect to the BucketFS service and navigate to the bucket of choice.
bucketfs = Service(buckfs_url, buckfs_credentials)
bucket = bucketfs[conf.BUCKETFS_BUCKET]

# Serialize model into a byte-array and upload it to the BucketFS, 
# where it will be saved in the file with the specified name.
bucket.upload(MODEL_FILE, pickle.dumps(model))

print(f"Uploading the model took: {stopwatch}")

Uploading the model took: 605.55ms
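
What gets uploaded is simply the pickled bytes of the fitted model, so whatever reads the file later can restore the object with `pickle.load`. The sketch below imitates that round trip with a temporary local directory standing in for the bucket and a small dictionary standing in for the model. Both stand-ins are assumptions for illustration, and no BucketFS call is involved.

```python
import pickle
import tempfile
from pathlib import Path

MODEL_FILE = "abalone_svm_model.pkl"

# Hypothetical stand-in for the fitted model object.
stand_in_model = {"kernel": "rbf", "n_features": 8}

with tempfile.TemporaryDirectory() as fake_bucket:
    # "Upload": write the serialized bytes under the chosen file name.
    model_path = Path(fake_bucket) / MODEL_FILE
    model_path.write_bytes(pickle.dumps(stand_in_model))
    # "Download": read the file back and deserialize it.
    with model_path.open("rb") as f:
        restored_model = pickle.load(f)
```

Inside a UDF, the file would be expected under the read-only mount described by `conf.BUCKETFS_PATH`. Unpickling a scikit-learn model generally requires a compatible scikit-learn version on the reading side, so it is worth checking the version installed in the UDF container.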


Now we are ready to use this model in our SQL queries. This will be demonstrated in the [following notebook](sklearn_predict_abalone.ipynb).