# Train an ML model on Exasol Data

In this tutorial you will load the data from Azure Blob Storage and run a Python script as an AzureML job that preprocesses the data and trains a simple TensorFlow model. You will then register the trained model with AzureML for further use.


## Prerequisites
You have completed the [previous part of this tutorial series](ConnectAzureMLtoExasol.ipynb) and therefore have:
 - a running AzureML compute instance
 - an Azure storage account
 - the [Scania Trucks](https://archive.ics.uci.edu/ml/datasets/IDA2016Challenge) dataset loaded into Azure Blob Storage


## Python script for training the model

We will use a Python script to create and train a TensorFlow model on the data we loaded from Exasol. You can find the script [here](main.py).
The script loads the data from the files we saved in Azure Blob Storage, then preprocesses it to counter the unbalanced nature of the dataset and to remove empty values so that TensorFlow backpropagation can work properly. It then creates a simple TensorFlow model and trains it on the data. The model is evaluated on the test dataset and saved to the job output.
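
To illustrate what countering an unbalanced dataset can look like, here is a minimal sketch of one common approach: weighting each class inversely to its frequency. This is not taken from `main.py`, which may handle the imbalance differently; the function name `balanced_class_weights` and the toy labels are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get a weight above 1, frequent classes a weight below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (not the Scania data): eight negatives and two positives.
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

With these toy labels the rare class 1 receives a weight of 2.5 and the frequent class 0 a weight of 0.625. A dictionary of this shape is what Keras accepts as the `class_weight` argument of `model.fit`.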

This script creates a model that only uses Python packages available natively in Exasol SaaS UDFs. This means you can upload the model directly to your Exasol cluster and run it there using a UDF. If your own models use different packages but you still need to run them directly on the cluster, you need to [build and install your own Script-Language Container](https://docs.exasol.com/db/latest/database_concepts/udf_scripts/adding_new_packages_script_languages.htm). Information on which packages are supported out of the box can be found [here](https://docs.exasol.com/saas/database_concepts/udf_scripts.htm).


## Prepare AzureML studio to run the Python script

This notebook is meant to be run in AzureML Studio, so upload it to your Notebooks, open it, and select your compute instance in the drop-down menu at the top of the notebook.
The same can be achieved by accessing AzureML from remote scripts, but for demonstration purposes we use AzureML Studio here.

First, we install some AzureML functionality.

In [None]:
!pip install azure-identity
!pip install azure-ai-ml==1.3.0

Then, we create an MLClient for accessing our AzureML jobs programmatically. For this we need our AzureML subscription id, resource group name and workspace name. Make sure to use the workspace you set up in the previous tutorial.

You can find the subscription ID and resource group name on the Overview page of your AzureML workspace in the Azure portal.

In [None]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<your subscription id>",               # change
    resource_group_name="<your resource group name>",       # change
    workspace_name="<your workspace name>",                 # change
)
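
If you prefer not to hard-code these three values in the notebook, one option is to read them from environment variables. The following is a minimal sketch under that assumption; the helper `workspace_settings` and the variable names are hypothetical and not part of the Azure SDK.

```python
import os


def workspace_settings(env=os.environ):
    """Collect the three MLClient arguments from environment variables.

    The variable names below are hypothetical; use whatever fits your setup.
    """
    names = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group_name": "AZURE_RESOURCE_GROUP",
        "workspace_name": "AZUREML_WORKSPACE_NAME",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in names.items()}


# Possible usage with the client from the cell above:
# ml_client = MLClient(credential=credential, **workspace_settings())
```

Failing early with the list of missing variables makes a misconfigured notebook easier to diagnose than an authentication error later on.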

### Create a new Python Environment

In order to run our Python script we need to create a new environment and install some dependencies. For this we first create a new directory called "dependencies".

In [None]:
# Create a directory for the environment definition
import os

dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)

In [None]:
%%writefile {dependencies_dir}/conda.yml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - tensorflow
  - pip:
    - inference-schema[numpy-support]==1.3.0

Next, we create a new environment to run our job in. We use the new dependencies file and an Ubuntu image as the base for the environment, then register the environment through our ml_client.

In [None]:
from azure.ai.ml.entities import Environment
custom_env_name = "<Name your environment here>"    # change

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for azureML tut",
    tags={"scikit-learn": "0.24.2"},
    conda_file=os.path.join(dependencies_dir, "conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}."
)

## Run the Python script
Now we need to create an AzureML job with some inputs, the path to the needed code, a command to run the code and information on which AzureML Compute and environment to use.
This job will be used to run our Python script on our Compute using the environment we created in the step before.
The script takes links to the data files we loaded into Azure Blob Storage in the previous tutorial as input. You can find these links by
# TODO add description and images
Also don't forget to change the variables for your Compute.


In [None]:
from azure.ai.ml import command
from azure.ai.ml import Input


job = command(
    inputs=dict(
        train_data=Input(
            type="uri_file",
            path="< link to training data file >",       # change
        ),
        test_data=Input(
            type="uri_file",
            path="< link to test data file >",           # change
        ),
        validation_data=Input(
            type="uri_file",
            path="< link to validation data file >",     # change
        ),
        learning_rate=0.001,
    ),
    code=".",  # location of source code, change if script not in same directory as this notebook

    command="python main.py --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} --validation_data ${{inputs.validation_data}} --learning_rate ${{inputs.learning_rate}} ",
    environment=pipeline_job_env,
    compute="<your_compute_name>",                      # change
    experiment_name="<experiment_name>",                # change
    display_name="<experiment_name_>",                  # change
)
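
The command above passes four flags to `main.py`, so the script has to accept them. The actual `main.py` is not reproduced here; the following is only a sketch of how such a script might parse its arguments with `argparse`, assuming that AzureML hands each `uri_file` input to the script as a file path at runtime.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the job command (sketch, not the real main.py)."""
    parser = argparse.ArgumentParser(description="Train a model on the Scania Trucks data")
    parser.add_argument("--train_data", type=str, required=True, help="path to the training data file")
    parser.add_argument("--test_data", type=str, required=True, help="path to the test data file")
    parser.add_argument("--validation_data", type=str, required=True, help="path to the validation data file")
    parser.add_argument("--learning_rate", type=float, default=0.001, help="optimizer learning rate")
    return parser.parse_args(argv)


# Example call with placeholder file names instead of real job inputs.
args = parse_args([
    "--train_data", "train.csv",
    "--test_data", "test.csv",
    "--validation_data", "val.csv",
    "--learning_rate", "0.01",
])
print(args.train_data, args.learning_rate)
```

Passing `argv=None` makes `argparse` fall back to the real command line, so the same function works both inside the job and in a quick local test.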

Now we can run the script on our compute instance. A link will show up below which you can click on to see the job details and output logs.

In [None]:
ml_client.create_or_update(job)

## Save the trained model

The Python script saves the trained model in the output files of the AzureML job. In order to register the model with AzureML we need to wait for the job to complete, and then we can create an AzureML Model instance from the TensorFlow model.

You can find the name of your most recent job in the output of the step above. Copy and paste it into the code below in order to get the trained model from this specific job.

In [None]:
job_name = "sad_glass_j5n9vtm0t3"                   # change each run!
registered_model_name = "trucks_defaults_model"

# stream the output and wait until the job is finished
ml_client.jobs.stream(job_name)

# refresh the latest status of the job after streaming
job_out = ml_client.jobs.get(name=job_name)

from azure.ai.ml.entities import Model

if job_out.status == "Completed":
    # let's get the model from this run
    model = Model(
        # the script stores the model as "model"; the f-string inserts the job name
        path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/outputs/model/",
        name=registered_model_name,
        description="Model created from run.",
        type="custom_model",
    )
else:
    # without a completed job there is no model to register
    raise RuntimeError(f"Job {job_name} ended with status {job_out.status}.")

Finally, we can register the model we retrieved from the job output with AzureML. This will allow us to access the model in the future and use it for inference in AzureML or download it.

In [None]:
registered_model = ml_client.models.create_or_update(model=model)

In our run, the model trained on the Scania Trucks dataset reached a test accuracy of 97.78%. The confusion matrix on the test data was:

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | TN: 15538   | FP: 87      |
| Actual 1 | FN: 268     | TP: 107     |
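
The accuracy can be recomputed from the four counts in the table, together with precision and recall for class 1. The sketch below only does that arithmetic; the function name `confusion_metrics` is chosen for illustration and is not part of `main.py`.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts taken from the table above.
metrics = confusion_metrics(tn=15538, fp=87, fn=268, tp=107)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.9778
# precision: 0.5515
# recall: 0.2853
```

Because the dataset is unbalanced, accuracy alone looks better than the model's ability to detect class 1: only 107 of the 375 actual positives are found.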

You can find your registered model in AzureML Studio under "Models" in the left-hand navigation.

next part,