In this notebook we will train a simple regression model that predicts the age of an abalone from its physical measurements and sex. You can find more information about the problem domain <a href="https://archive.ics.uci.edu/dataset/1/abalone" target="_blank" rel="noopener">here</a>.

We will train the model with <a href="https://scikit-learn.org/stable/" target="_blank" rel="noopener">`scikit-learn`</a> on training data exported from the database.

To execute queries and load data from the Exasol database we will use the <a href="https://github.com/exasol/pyexasol" target="_blank" rel="noopener">`pyexasol`</a> module.

Before using this notebook, complete the following steps:
1. [Create the database schema](../setup_db.ipynb).
2. [Load the Abalone data](../data/data_abalone.ipynb).

In [2]:
# TODO: Move this to a separate configuration notebook. Here we just need to load this configuration from a store.
EXASOL_EXTERNAL_HOST_NAME = "192.168.124.93"
EXASOL_HOST_PORT = "8888"
EXASOL_EXTERNAL_HOST = f"""{EXASOL_EXTERNAL_HOST_NAME}:{EXASOL_HOST_PORT}"""
EXASOL_USER = "sys"
EXASOL_PASSWORD = "exasol"
EXASOL_BUCKETFS_PORT = "6666"
EXASOL_EXTERNAL_BUCKETFS_HOST = f"""{EXASOL_EXTERNAL_HOST_NAME}:{EXASOL_BUCKETFS_PORT}"""
EXASOL_BUCKETFS_USER = "w"
EXASOL_BUCKETFS_PASSWORD = "write"
EXASOL_BUCKETFS_USE_HTTPS = False
EXASOL_BUCKETFS_URL_PREFIX = "https://" if EXASOL_BUCKETFS_USE_HTTPS else "http://"
EXASOL_BUCKETFS_SERVICE = "bfsdefault"
EXASOL_BUCKETFS_BUCKET = "default"
EXASOL_BUCKETFS_PATH = f"/buckets/{EXASOL_BUCKETFS_SERVICE}/{EXASOL_BUCKETFS_BUCKET}" # Filesystem-Path to the read-only mounted BucketFS inside the running UDF Container
EXASOL_SCRIPT_LANGUAGE_NAME = "PYTHON3_60"
EXASOL_UDF_FLAVOR = "python3-ds-EXASOL-6.0.0"
EXASOL_UDF_RELEASE = "20190116"
EXASOL_UDF_CLIENT = "exaudfclient" # or for newer versions of the flavor exaudfclient_py3
EXASOL_SCRIPT_LANGUAGES = f"{EXASOL_SCRIPT_LANGUAGE_NAME}=localzmq+protobuf:///{EXASOL_BUCKETFS_SERVICE}/{EXASOL_BUCKETFS_BUCKET}/{EXASOL_UDF_FLAVOR}?lang=python#buckets/{EXASOL_BUCKETFS_SERVICE}/{EXASOL_BUCKETFS_BUCKET}/{EXASOL_UDF_FLAVOR}/exaudf/{EXASOL_UDF_CLIENT}"
EXASOL_SCHEMA = "IDA"

# pyexasol.connect() expects the keyword "dsn" (data source name).
connection_params = {"dsn": EXASOL_EXTERNAL_HOST, "user": EXASOL_USER, "password": EXASOL_PASSWORD, "compression": True}

params = {
    "script_languages": EXASOL_SCRIPT_LANGUAGES,
    "script_language_name": EXASOL_SCRIPT_LANGUAGE_NAME,
    "schema": EXASOL_SCHEMA,
    "EXASOL_BUCKETFS_PORT": EXASOL_BUCKETFS_PORT,
    "EXASOL_BUCKETFS_USER": EXASOL_BUCKETFS_USER,
    "EXASOL_BUCKETFS_PASSWORD": EXASOL_BUCKETFS_PASSWORD,
    "EXASOL_BUCKETFS_USE_HTTPS": EXASOL_BUCKETFS_USE_HTTPS,
    "EXASOL_BUCKETFS_BUCKET": EXASOL_BUCKETFS_BUCKET,
    "EXASOL_BUCKETFS_PATH": EXASOL_BUCKETFS_PATH
}

First we will export data into a pandas DataFrame and split it into training and validation sets.

In [73]:
import pyexasol
from sklearn.model_selection import train_test_split
from stopwatch import Stopwatch

stopwatch = Stopwatch()

with pyexasol.connect(dsn=EXASOL_EXTERNAL_HOST, user=EXASOL_USER, password=EXASOL_PASSWORD, compression=True) as conn:
    df = conn.export_to_pandas(query_or_table=(EXASOL_SCHEMA, 'ABALONE_TRAIN'))

X, y = df.drop(columns='RINGS'), df['RINGS']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

print(f"Loading the data took: {stopwatch}")

Loading the data took: 1.16s


Let's look at the features. We will first check which physical measurements have predictive power by computing the mutual information between each measurement column and the target.

In [76]:
from sklearn.feature_selection import mutual_info_regression

X_meas = X_train.drop(columns='SEX')
mi = mutual_info_regression(X_meas, y_train)
# Pair each measurement column with its mutual information score.
dict(zip(X_meas.columns, mi))


{'LENGTH': 0.3843467581506941,
 'DIAMETER': 0.4029831219029347,
 'HEIGHT': 0.37975381266846053,
 'WHOLE_WEIGHT': 0.3956317866384804,
 'SHUCKED_WEIGHT': 0.3260761868093418,
 'VISCERA_WEIGHT': 0.36861377572116716,
 'SHELL_WEIGHT': 0.4156639387513801}
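
To build some intuition for these scores, here is a minimal sketch of mutual information for discrete variables. Note that this is not the estimator `mutual_info_regression` uses (scikit-learn estimates the quantity for continuous data with a nearest-neighbour method); the helper `discrete_mutual_information` and the toy data are made up for illustration. The result is in nats, as in scikit-learn, and a value of zero means the two variables are independent.

```python
from collections import Counter
from math import log


def discrete_mutual_information(xs, ys):
    """Mutual information (in nats) of two equally long discrete sequences.

    Hypothetical helper for illustration only; it uses the textbook formula
    sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (count / n) * log((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in joint.items()
    )


# Toy data: in the first case y is fully determined by x,
# in the second case x and y are independent.
mi_dependent = discrete_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = discrete_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])

print("Dependent:", mi_dependent)
print("Independent:", mi_independent)
```

In the output above all seven measurements score in a similar range, which suggests that none of them should be discarded on this criterion alone.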

Now let's see if SEX is a good predictor. We will run a one-way ANOVA test and print the p-value. A small p-value indicates that the mean number of rings differs between the sexes.

In [77]:
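from sklearn.feature_selection import f_classif

# f_classif returns (F-statistics, p-values); take the p-value of the single RINGS column.
f_classif(y_train.to_frame(), X_train['SEX'])[1][0]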

2.839928432549384e-117
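
For reference, the F-statistic behind this p-value can be computed by hand. The sketch below is a plain-Python one-way ANOVA on made-up ring counts grouped by sex; `one_way_anova_f` and the numbers are hypothetical and only illustrate the ratio of between-group to within-group variance. Turning the statistic into a p-value additionally requires the F-distribution, which is omitted here.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F-statistic for a dict mapping group label -> list of values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of the individual values around their own group mean.
    ss_within = sum(
        (v - sum(values) / len(values)) ** 2
        for values in groups.values()
        for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up ring counts per sex (M = male, F = female, I = infant).
toy_rings_by_sex = {
    "M": [10, 11, 12],
    "F": [11, 12, 13],
    "I": [6, 7, 8],
}

f_statistic = one_way_anova_f(toy_rings_by_sex)
print("F-statistic:", f_statistic)
```

A large F-statistic means the group means are far apart relative to the spread within each group, which corresponds to a small p-value.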

Let's make a pipeline. We will use all features as input, one-hot encode the SEX column, and standardize all other columns as well as the target. We will use a Support Vector Machine as the regression model. We will pass plain arrays without column names to the model, as the names will not be available in the prediction UDF.

In [98]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Create the pipeline.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), [1, 2, 3, 4, 5, 6, 7]),
        ("cat", OneHotEncoder(), [0]),
    ]
)
regressor = SVR(kernel='rbf')
model = Pipeline(
    steps=[("preprocessor", preprocessor), ('regressor', regressor)]
)
model = TransformedTargetRegressor(regressor=model, transformer=StandardScaler())

stopwatch = Stopwatch()

# Train the model.
model.fit(X_train.values, y_train.values)

print(f"Training of the model took: {stopwatch}")

Training of the model took: 172.10ms
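
To make the preprocessing steps concrete, the following plain-Python sketch mimics what the two transformers do to a handful of made-up values. It assumes the default behaviour of `StandardScaler` (subtract the mean, divide by the population standard deviation) and of `OneHotEncoder` (one indicator column per category, categories in sorted order); the helper names are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(labels):
    """Return the sorted categories and one indicator row per label."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


lengths = [0.35, 0.45, 0.55]   # made-up LENGTH values
sexes = ["M", "F", "I", "M"]   # made-up SEX values

scaled = standardize(lengths)
categories, encoded = one_hot(sexes)

print("Scaled lengths:", scaled)
print("Categories:", categories)
print("Encoded sexes:", encoded)
```

`TransformedTargetRegressor` applies the same kind of scaling to the target during `fit` and inverts it in `predict`, so the predictions come back as ring counts rather than standardized values.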


Let's see what prediction performance we've got, printing some regression metrics.

In [101]:
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error

y_pred = model.predict(X_valid.values)

print('Mean absolute error:', mean_absolute_error(y_valid, y_pred))
print('Mean squared error:', mean_squared_error(y_valid, y_pred))
print('Explained variance:', explained_variance_score(y_valid, y_pred))

Mean absolute error: 1.4784716097061217
Mean squared error: 4.375086550238619
Explained variance: 0.5503658142734189
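
The three metrics are simple to compute by hand, which can help when interpreting them. The sketch below reimplements them in plain Python on made-up values (the function names are hypothetical, not scikit-learn's): the mean absolute error averages |y - y_pred|, the mean squared error averages (y - y_pred)^2, and the explained variance is 1 - Var(y - y_pred) / Var(y).

```python
from statistics import fmean, pvariance


def mean_abs_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_sq_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def explained_variance(y_true, y_pred):
    """Share of the target variance that is not left in the residuals."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)


# Made-up ring counts and predictions.
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 10.0]

mae = mean_abs_error(toy_true, toy_pred)
mse = mean_sq_error(toy_true, toy_pred)
ev = explained_variance(toy_true, toy_pred)

print("Mean absolute error:", mae)
print("Mean squared error:", mse)
print("Explained variance:", ev)
```

Read this way, the results above say that the model is off by about 1.5 rings on average and accounts for roughly 55% of the variance in the validation targets.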


Now, let's upload the model to BucketFS, so that it can be used for making predictions in SQL queries. To communicate with BucketFS we will use the <a href="https://exasol.github.io/bucketfs-python/" target="_blank" rel="noopener">`bucketfs-python`</a> module.

In [102]:
import pickle
from exasol.bucketfs import Service

MODEL_FILE = 'abalone_svm_model.pkl'

# Setup the connection parameters.
buckfs_url = f'{EXASOL_BUCKETFS_URL_PREFIX}{EXASOL_EXTERNAL_BUCKETFS_HOST}'
buckfs_credentials = {EXASOL_BUCKETFS_BUCKET: {'username': EXASOL_BUCKETFS_USER, 'password': EXASOL_BUCKETFS_PASSWORD}}

stopwatch = Stopwatch()

# Connect to the BucketFS service and navigate to the bucket of choice.
bucketfs = Service(buckfs_url, buckfs_credentials)
bucket = bucketfs[EXASOL_BUCKETFS_BUCKET]

# Serialize model into a byte-array and upload it to the BucketFS, 
# where it will be saved in the file with the specified name.
bucket.upload(MODEL_FILE, pickle.dumps(model))

print(f"Uploading the model took: {stopwatch}")

Uploading the model took: 304.44ms


Now we are ready to use this model in our SQL queries. This is demonstrated in the [following notebook](sklearn_predict_abalone.ipynb).