# Training a Classification Model

In this notebook, we will train a very simple classification model for labeling Cherenkov radiation shower images. Each image will be classified as either caused by primary gammas (signal) or initiated by cosmic rays in the upper atmosphere (background). You can find more information about the problem domain <a href="https://archive.ics.uci.edu/dataset/159/magic+gamma+telescope" target="_blank" rel="noopener">here</a>.

We will train the model in this notebook using <a href="https://scikit-learn.org/stable/" target="_blank" rel="noopener">`scikit-learn`</a>, on the training data we are going to export from the database.

To execute queries and load data from the Exasol database we will be using the <a href="https://github.com/exasol/pyexasol" target="_blank" rel="noopener">`pyexasol`</a> module.

## Prerequisites

Prior to using this notebook the following steps need to be completed:
1. [Configure the sandbox](../sandbox_config.ipynb).
2. [Load the MAGIC Gamma Telescope data](../data/data_telescope.ipynb).

## Setup

### Access configuration

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

## Load data

First, we will export data into a pandas DataFrame.

In [None]:
from exasol.connections import open_pyexasol_connection
from stopwatch import Stopwatch

stopwatch = Stopwatch()

with open_pyexasol_connection(sb_config, compression=True) as conn:
    df = conn.export_to_pandas(query_or_table=(sb_config.SCHEMA, 'TELESCOPE_TRAIN'))

print(f"Loading the data took: {stopwatch}")
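
Before training, it can be worth a quick look at the exported frame. The sketch below counts missing values per column and reports the class balance. It runs on a small stand-in frame rather than the real `df`; the feature column names, the values, and the `g`/`h` labels are illustrative assumptions, and only the `CLASS` column name is taken from this notebook.

```python
import pandas as pd

# Stand-in for the exported frame. In the notebook, `df` comes from export_to_pandas.
# FLENGTH, FWIDTH, the numbers and the 'g'/'h' labels are assumptions for illustration.
df_demo = pd.DataFrame({
    'FLENGTH': [28.8, 31.6, 162.1, 23.8, 75.1],
    'FWIDTH': [16.0, 11.7, 136.0, 9.6, 30.9],
    'CLASS': ['g', 'g', 'h', 'g', 'h'],
})

# Number of missing values in each column.
missing_per_column = df_demo.isna().sum()

# Share of each class label, useful for spotting class imbalance.
class_share = df_demo['CLASS'].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

Running the same two calls on the real `df` should confirm whether any column contains missing values and how the two classes are distributed.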

## Train model

The data has no missing values. To keep things simple, we will use a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn-tree-decisiontreeclassifier" target="_blank" rel="noopener">`Decision Tree Classifier`</a>, which requires little pre-processing for this dataset.

In [None]:
from sklearn import tree
from sklearn.model_selection import train_test_split

# Split the dataset into train and validation sets. Use all available feature columns.
X, y = df.drop(columns='CLASS'), df['CLASS']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

stopwatch = Stopwatch()

# Create and train the model.
model = tree.DecisionTreeClassifier()
model.fit(X_train, y_train)

print(f"Training took: {stopwatch}")
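
The split and the tree above are unseeded, so the validation results will differ slightly from run to run. One possible variation, shown here as a sketch on synthetic data rather than on the telescope table, fixes `random_state` and passes `stratify` so that both subsets keep the same class proportions. The synthetic features, the labeling rule and the `max_depth=5` setting are assumptions made for the example, not choices taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 4 numeric features, two string labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.where(X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0, 'g', 'h')

# A fixed seed makes the split repeatable; stratify preserves class proportions.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# A depth limit (hypothetical value) is one simple way to curb overfitting.
model_demo = DecisionTreeClassifier(max_depth=5, random_state=42)
model_demo.fit(X_tr, y_tr)

valid_accuracy = model_demo.score(X_va, y_va)
print(f"Validation accuracy: {valid_accuracy:.3f}")
```

With the seeds fixed, rerunning the cell reproduces the same split and the same tree, which makes it easier to compare changes to the model settings.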

## Evaluate model

Let's evaluate the model using the validation set.
The results may not look particularly impressive, but that is fine. We are aiming for simplicity and clarity, not the best prediction performance.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Make the predictions on the validation set.
y_pred = model.predict(X_valid)

# Build and display the confusion matrix.
cm = confusion_matrix(y_valid, y_pred, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
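
The confusion matrix can also be summarized as a few numbers. The sketch below computes accuracy, precision and recall with `sklearn.metrics` on a handful of hand-written labels. The labels `g` (signal) and `h` (background) and the choice of `g` as the positive class are assumptions for the example; in the notebook, `y_valid` and `y_pred` would take the place of the two lists.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-written stand-ins for y_valid and y_pred.
y_true_demo = ['g', 'g', 'g', 'h', 'h', 'g', 'h', 'g']
y_pred_demo = ['g', 'g', 'h', 'h', 'g', 'g', 'h', 'g']

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=['g', 'h'])

# Fraction of all predictions that are correct.
accuracy = accuracy_score(y_true_demo, y_pred_demo)
# Of the events predicted as 'g', the fraction that really are 'g'.
precision = precision_score(y_true_demo, y_pred_demo, pos_label='g')
# Of the true 'g' events, the fraction that were predicted as 'g'.
recall = recall_score(y_true_demo, y_pred_demo, pos_label='g')

print(cm_demo)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Which metric matters most depends on the relative cost of accepting a background event as signal versus losing a true signal event.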

## Upload model into BucketFS

Now, let's upload the model to BucketFS so that it can be used for classification in SQL queries. To communicate with BucketFS, we will use the <a href="https://exasol.github.io/bucketfs-python/" target="_blank" rel="noopener">`bucketfs-python`</a> module.

In [None]:
import pickle
from exasol.connections import open_bucketfs_connection

MODEL_FILE = 'telescope_tree_model.pkl'

stopwatch = Stopwatch()

# Connect to the BucketFS service
bucket = open_bucketfs_connection(sb_config)

# Serialize the model into a bytes object and upload it to BucketFS,
# where it will be saved in a file with the specified name.
bucket.upload(MODEL_FILE, pickle.dumps(model))

print(f"Uploading the model took: {stopwatch}")
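
Before relying on an uploaded model, it can help to confirm locally that it survives serialization. The sketch below pickles a small tree, restores it and checks that both copies make the same predictions. It does not touch BucketFS, and the tiny dataset is made up for the example. Note that a pickled scikit-learn model is generally expected to be loaded with the same scikit-learn version that produced it.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset, only to have a fitted model to serialize.
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]])
y_demo = np.array(['g', 'h', 'g', 'h'])
model_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# pickle.dumps returns a bytes object, which is what gets uploaded above.
payload = pickle.dumps(model_demo)

# Restore the model from the bytes, as code on the receiving side would.
restored = pickle.loads(payload)

predictions_match = np.array_equal(
    model_demo.predict(X_demo), restored.predict(X_demo)
)
print(f"Serialized size: {len(payload)} bytes, predictions match: {predictions_match}")
```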

Now we are ready to use this model in our SQL queries. This will be demonstrated in the [following notebook](sklearn_predict_telescope.ipynb).