# Train an ML model on Exasol Data

In this tutorial, you will load the data from Azure Blob Storage and run a Python script as an AzureML job to preprocess the data and train a simple scikit-learn model. Then you will register the trained model with AzureML for further use.

## Prerequisites
You completed the [previous part of this tutorial series](ConnectAzureMLtoExasol.ipynb) and therefore have:
 - A running AzureML compute instance
 - An Azure Storage account
 - The [Scania Trucks](https://archive.ics.uci.edu/ml/datasets/IDA2016Challenge) dataset loaded into Azure Blob Storage


## Python script for training the model

We will use a Python script to create and train a scikit-learn model on the data we loaded from Exasol. You can find the script [here](main.py).
The script loads the data from the files we saved in Azure Blob Storage, preprocesses it to counter the class imbalance in the dataset, and removes empty values so that training can work properly.
Then, it creates a simple scikit-learn model and trains it on the data. The model is evaluated using the test dataset and registered in the AzureML Workspace using MLflow.
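
To make the two preprocessing ideas more concrete, here is a minimal, standard-library-only sketch of one possible reading of them: dropping rows that contain empty values and randomly oversampling the smaller class. It is not taken from `main.py`, which may well impute values or rebalance differently. The column names (`class`, `aa_000`) and the missing-value markers (`""`, `"na"`) are assumptions for illustration.

```python
import random
from collections import Counter

# Assumed markers for empty values; adjust to how your export encodes NULLs.
MISSING = {"", "na"}


def drop_incomplete(rows):
    """Keep only rows in which no field is an assumed missing-value marker."""
    return [
        row for row in rows
        if not any(str(value).strip().lower() in MISSING for value in row.values())
    ]


def oversample_minority(rows, label_key="class", seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes are equally large."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing number of rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Tiny illustrative sample (hypothetical values).
rows = [
    {"class": "neg", "aa_000": "10"},
    {"class": "neg", "aa_000": "12"},
    {"class": "neg", "aa_000": "na"},
    {"class": "neg", "aa_000": "7"},
    {"class": "pos", "aa_000": "99"},
]
clean = drop_incomplete(rows)
balanced = oversample_minority(clean)
class_counts = Counter(row["class"] for row in balanced)
print(class_counts)
```

Oversampling should only be applied to the training data, so that the test dataset still reflects the original class distribution.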

This script creates a model that only uses Python packages natively available in Exasol SaaS UDFs. This means you can upload the model directly to your Exasol database and run it using a UDF. If your own models use different packages but you still need to run them directly on the cluster, you need to [build and install your own Script-Language Container](https://docs.exasol.com/db/latest/database_concepts/udf_scripts/adding_new_packages_script_languages.htm). Information on which packages are supported out of the box can be found [here](https://docs.exasol.com/saas/database_concepts/udf_scripts/python3.htm).


## Prepare AzureML Studio to run the Python script

This notebook is meant to be run in AzureML Studio, so upload it to your Notebooks, open it and select your compute instance in the drop-down menu at the top of your notebook. The same steps can be achieved by accessing AzureML using remote scripts, but for demonstration purposes we use AzureML Studio here.

First, we install some AzureML functionality.

In [None]:
!pip install azure-identity
!pip install azure-ai-ml==1.3.0

Then, we create an MLClient for accessing our AzureML jobs programmatically. For this we need our AzureML subscription ID, resource group name and workspace name. If you are not sure what your resource group name is, you can find it by clicking your subscription in the top left of AzureML Studio.
Make sure to use the workspace you set up in the previous tutorial.

![](img_src/resource_group.png)

In [None]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<your subscription id>",               # change
    resource_group_name="<your resource group name>",       # change
    workspace_name="<your workspace name>",                 # change
)

### Create a new Python Environment

To run our Python script we need to create a new environment and install the required dependencies. For this, we first create a new directory called "dependencies".

In [None]:
# Create a local directory for the environment definition
import os

dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)

For our model to be usable in the Exasol SaaS database later, we need to make sure the scikit-learn version we use matches the version in SaaS.

In [None]:
%%writefile {dependencies_dir}/conda.yml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - scikit-learn=1.0.2
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0


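Because the scikit-learn version has to match the one in Exasol SaaS, it can help to read the pin back from the dependencies file before building the environment. The following is a small sketch using only the standard library. The helper `pinned_version` is hypothetical, and `expected_version` is a value you have to look up yourself in the Exasol SaaS documentation; `1.0.2` is simply the version pinned in this tutorial.

```python
import re


def pinned_version(conda_yaml_text, package):
    """Return the exact version pinned for `package`, or None if there is no exact pin."""
    pattern = rf"^\s*-\s*{re.escape(package)}\s*={{1,2}}\s*([0-9][\w.]*)\s*$"
    match = re.search(pattern, conda_yaml_text, flags=re.MULTILINE)
    return match.group(1) if match else None


# In the notebook you would read the file written above instead, e.g.
# conda_text = open("./dependencies/conda.yml").read()
conda_text = """\
dependencies:
  - python=3.8
  - scikit-learn=1.0.2
  - pandas>=1.1,<1.2
  - pip:
    - mlflow==1.26.1
"""

expected_version = "1.0.2"  # assumption: replace with the version listed for Exasol SaaS
found_version = pinned_version(conda_text, "scikit-learn")
if found_version != expected_version:
    raise ValueError(f"scikit-learn is pinned to {found_version}, expected {expected_version}")
print(f"scikit-learn pin matches: {found_version}")
```

The helper only recognizes exact pins (`=` or `==`), so a range such as the `pandas` entry yields `None`.
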
Next, we will create a new environment to run our job in. We will use the new dependencies file and use an Ubuntu image as the base for our environment. Then we will create the new environment on our *MLClient*.

In [1]:
from azure.ai.ml.entities import Environment
custom_env_name = "<Name your environment here>"    # change

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for AzureML tutorial",
    tags={"scikit-learn": "1.0.2"},
    conda_file=os.path.join(dependencies_dir, "conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}."
)


## Run the Python script

Now we need to create an AzureML job with the following inputs:

 - The path to the Python script
 - A command to run the script
 - Information on which AzureML Compute and Environment to use

This job will be used to run our Python script on our Compute using the environment we created in the step before.
The script takes links to the data files we loaded into Azure Blob Storage in the previous tutorial as input. You can find these links by navigating to your data files in your datastore and clicking the kebab menu beside each file. A drop-down menu will open where you can select the "Copy URI" option.
![](img_src/get_data_link.png)

This opens a pop-up window where you can copy the link to the file.
![](img_src/get_data_link_2.png)

Also, don't forget to set the compute name, experiment name and display name in the cell below.


In [None]:
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes


job = command(
    inputs=dict(
        train_data=Input(
            type=AssetTypes.URI_FILE,
            path="< link to training data file >",       # change
        ),
        test_data=Input(
            type=AssetTypes.URI_FILE,
            path="< link to test data file >",       # change
        ),
        learning_rate=0.05
    ),
    code=".",  # location of source code
    command="python main.py --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} --learning_rate ${{inputs.learning_rate}}",
    environment=pipeline_job_env,
    compute="<your_compute_name>",                      # change
    experiment_name="<experiment_name>",                # change
    display_name="<experiment_name_>",                  # change
)
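
The `command` string above passes three options to `main.py`. When the job runs, AzureML replaces each `${{inputs.…}}` placeholder with the actual value, which for the file inputs is a path the script can read from. As a minimal sketch of how a script could pick these options up with `argparse`, assuming the option names from the command string (the real `main.py` may parse them differently, and the default learning rate here is an assumption):

```python
import argparse


def parse_args(argv=None):
    """Parse the options that the AzureML command string passes to the script."""
    parser = argparse.ArgumentParser(description="Train a model on the Scania Trucks data")
    parser.add_argument("--train_data", type=str, required=True, help="path to the training data file")
    parser.add_argument("--test_data", type=str, required=True, help="path to the test data file")
    # Default chosen to mirror the value used in the job definition (assumption).
    parser.add_argument("--learning_rate", type=float, default=0.05, help="learning rate for the model")
    return parser.parse_args(argv)


# Simulate the command line that the job would produce (file names are hypothetical).
args = parse_args(["--train_data", "train.csv", "--test_data", "test.csv", "--learning_rate", "0.05"])
print(args.train_data, args.test_data, args.learning_rate)
```

Passing `argv=None` makes `argparse` fall back to `sys.argv`, which is what happens when the script is started by the job.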


Now, we can run the script on our compute instance. A link will show up below, which you can click on to see the job details and output logs.

In [None]:
ml_client.create_or_update(job)

Here is the confusion matrix of our trained model on the test dataset.


|            | predicted neg | predicted pos |
|------------|---------------|---------------|
| actual neg | 14841         | 784           |
| actual pos | 13            | 362           |

The model has a total cost of 14340 according to the ida-score, which we implemented in accordance with the problem description of the Scania Trucks dataset.
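
The total cost can be reproduced from the confusion matrix. The sketch below assumes the cost values from the IDA 2016 challenge description, which are not stated in this tutorial: 10 for each false positive (an unnecessary check) and 500 for each false negative (a missed failure). The function name `ida_cost` is illustrative and not necessarily the one used in `main.py`.

```python
# Cost values as given in the IDA 2016 challenge description (assumption, see above).
COST_FALSE_POSITIVE = 10
COST_FALSE_NEGATIVE = 500


def ida_cost(false_positives, false_negatives):
    """Total misclassification cost: cheap false alarms plus expensive missed failures."""
    return COST_FALSE_POSITIVE * false_positives + COST_FALSE_NEGATIVE * false_negatives


# Values from the confusion matrix above:
# 784 actual negatives predicted positive, 13 actual positives predicted negative.
total_cost = ida_cost(false_positives=784, false_negatives=13)
print(total_cost)  # 14340
```

With these weights, the 13 missed failures contribute 6500 to the total and the 784 false alarms contribute 7840.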

## Save the trained model

The script will directly register the trained model in your AzureML Workspace, so you can use it to run inference in AzureML. It will also save the model in the output of the job. From there, you can extract it to run it in your Exasol cluster. You can find your registered model under the Assets, Model entry in the AzureML Studio menu on the left.

![](img_src/registered_model.png)

Now that we have trained and registered a model on the data we imported from our Exasol SaaS instance, we can move on to the
[next part](InvokeModelFromExasolDBwithUDF.ipynb), where we will use this model from within our Exasol cluster to classify some data.