# Mask Detection Demo - Training and Evaluation (1 / 3)
The following example demonstrates a training workflow: building and training a model that classifies whether a person is wearing a mask. The training is auto-logged to both TensorBoard and MLRun, and can easily be distributed using Horovod. After training, we run an evaluation to check the model's performance and update its logs, as part of a routine test.

1. [Setup the Project](#section_1)
2. [Download the Data](#section_2)
3. [Write the Training and Evaluation Code](#section_3)
4. [Create the MLRun Function](#section_4)
5. [Run Training and Evaluation](#section_5)
6. [Run Distributed Training Using Horovod](#section_6)

Before we begin, **please select the desired framework** (comment and uncomment the below lines as needed):

In [1]:
framework = "tf-keras"

<a id="section_1"></a>
## 1. Setup the Project

Create a project using `mlrun.get_or_create_project` (which loads the project if it already exists), and set the directory where we'll store the project's artifacts:

In [2]:
import mlrun
import os

# Set our project's name and directory:
project_name = f"{framework}-mask-detection"
project_dir = os.path.abspath("./")

# Create the project:
project = mlrun.get_or_create_project(project_name, project_dir, user_project=True)

> 2021-11-05 11:14:31,306 [info] loaded project tf-keras-mask-detection from MLRun DB


A project in MLRun is built around the MLRun Functions it can run. In this notebook we will see two ways to create an MLRun Function:
* `mlrun.code_to_function`: Create our own MLRun Function from code (will be used for training and evaluation in [section 4](#section_4)).
* `mlrun.import_function`: Import from [MLRun's functions marketplace](https://docs.mlrun.org/en/latest/load-from-marketplace.html) - a functions hub intended to be a centralized location for open source contributions of function components (will be used for downloading the data in [section 2](#section_2)).

<a id="section_2"></a>
## 2. Download the Data

### 2.1. Import a Function

We will download the images using `open_archive`, a function from MLRun's functions marketplace. We will import the function using `mlrun.import_function` and call `doc` on it to print the function's documentation:

In [3]:
# Import the function:
open_archive_function = mlrun.import_function("hub://open_archive")

# Print the function's documentation:
open_archive_function.doc()

function: open-archive
Open a file/object archive into a target directory
default handler: open_archive
entry points:
  open_archive: Open a file/object archive into a target directory

Currently supports zip and tar.gz
    context(MLClientCtx)  - function execution context, default=
    archive_url(DataItem)  - url of archive file, default=
    subdir(str)  - path within artifact store where extracted files are stored, default=content
    key(str)  - key of archive contents in artifact store, default=content
    target_path(str)  - file system path to store extracted files (use either this or subdir), default=None


### 2.2. Run the Function - Download the Images

* **Function handlers**: We'll download the images by running the function with the `open_archive` handler, as noted in the function's documentation. An MLRun function is a collection of code, and its handlers are the functions defined inside it. Every function that takes a context (type: `mlrun.MLClientCtx`) can be used as a handler.
* **Passing parameters**: An MLRun function expects two kinds of arguments: inputs (type: `mlrun.DataItem`) and parameters. The function's documentation shows that `archive_url` is an `mlrun.DataItem`, so it should be passed in the `inputs` argument of the `run` method. The others are passed via the `params` argument.
* **Running locally**: Notice we pass the `local` argument as `True`. That means the function runs locally and not on a pod. Using `local` is a convenient way to debug the code.

For more information regarding MLRun functions, context and data items, refer to [MLRun's documentation](https://docs.mlrun.org/en/latest/index.html).
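
To make the handler and parameter conventions above concrete, here is a minimal sketch of what a handler could look like. The function name `report_archive_size`, its `label` parameter and the result key are hypothetical and are not part of `open_archive`; the sketch only relies on `DataItem.get()` for reading the input's content and on `context.log_result()` for logging a value.

```python
from __future__ import annotations  # keep the mlrun type hints as plain strings

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type checkers only, so the sketch loads without mlrun
    import mlrun


def report_archive_size(
    context: mlrun.MLClientCtx,   # execution context: what makes this usable as a handler
    archive_url: mlrun.DataItem,  # passed through `inputs={...}` when calling `run`
    label: str = "archive",       # plain value, passed through `params={...}`
) -> None:
    """Hypothetical handler: log the size in bytes of the given archive."""
    content = archive_url.get()  # read the data item's content
    context.log_result(f"{label}_size_bytes", len(content))
```

Such a handler would be run with `inputs={'archive_url': ...}` and `params={'label': ...}`, mirroring the `open_archive` call in the next cell.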

In [4]:
# Setup the archive url for downloading the dataset images:
archive_url = f"{mlrun.mlconf.default_samples_path}data/prajnasb-generated-mask-detection/prajnasb_generated_mask_detection.zip"

# Set the path to download the images data to:
dataset_path = os.path.abspath('./Dataset')

# Run the function using the 'open_archive' handler:
open_archive_run = open_archive_function.run(
    name='download_data',
    handler='open_archive',
    inputs={'archive_url': archive_url},
    params={'target_path': dataset_path},
    local=True
)

> 2021-11-05 11:14:31,570 [info] starting run download_data uid=38e7ecb39354461688967ea66250b20a DB=http://mlrun-api:8080
> 2021-11-05 11:14:32,976 [info] downloading https://s3.wasabisys.com/iguazio/data/prajnasb-generated-mask-detection/prajnasb_generated_mask_detection.zip to local temp file


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tf-keras-mask-detection-guyl | ...6250b20a | 0 | Nov 05 11:14:32 | completed | download_data | v3io_user=guyl<br>kind=<br>owner=guyl<br>host=guyl-jupyter-bdbbcc6cc-x2mzd | archive_url | target_path=/User/demos/mask-detection/tf-keras/Dataset | | content |





> 2021-11-05 11:14:42,497 [info] run executed, status=completed


<a id="section_3"></a>
## 3. Write the Training and Evaluation Code

### TF.Keras

The code is taken from the Python file [training-and-evaluation.py](tf-keras/training-and-evaluation.py). It is classic and straightforward. We:
1. Use `_get_datasets` for getting the training and validation datasets (on evaluation - the evaluation dataset).
2. Use `_get_model` to build our classifier - simple transfer learning from MobileNetV2.
3. Call `train` to train the model.
4. Call `evaluate` to evaluate the model.
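
For readers who have not opened the file, step 2 (transfer learning from MobileNetV2) typically looks something like the sketch below. This is not the code of `_get_model`: the function name `build_mask_classifier`, the input size, the frozen base and the single-unit sigmoid head are all assumptions chosen for illustration.

```python
def build_mask_classifier(input_shape=(224, 224, 3), lr=1e-4):
    """Hypothetical binary classifier on top of a pre-trained MobileNetV2."""
    import tensorflow as tf  # imported lazily so the sketch can be loaded without TensorFlow

    # Pre-trained feature extractor, without its ImageNet classification head:
    base_model = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights="imagenet"
    )
    base_model.trainable = False  # train only the new head (an assumption of this sketch)

    model = tf.keras.Sequential(
        [
            base_model,
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(1, activation="sigmoid"),  # mask / no mask
        ]
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

The actual file may differ in every one of these choices, so treat it as the reference.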

Taking this code one step further is **MLRun**'s framework for `tf.keras`: 

```python
# Apply MLRun's interface for tf.keras:
mlrun_tf_keras.apply_mlrun(model=model, context=context, ...)
```

With just one line of code, it seamlessly provides:
* **Automatic logging**: auto-log your training and model to both **TensorBoard** and **MLRun**. Additional settings can be passed to this method to gain extra logging capabilities, such as:
  * Weights histograms and distributions
  * Weights statistics
  * Weights images (work in progress)
  * Configurable tracking of static and dynamic hyperparameters
  * Logging frequency and more
* **Distributed training with Horovod**: Horovod is initialized and used automatically if the MLRun Function's `kind` attribute is equal to `"mpijob"`. No additional changes to the original code are needed! More on that later in [section 6](#section_6).

In addition, in the `evaluate` function code, we use the `mlrun.frameworks.tf_keras.TFKerasModelHandler` class. This class supports loading, saving and logging `tf.keras` models with ease, enabling easy versioning of the model and its results, artifacts and custom objects.

For further options, we suggest reading the documentation. In this example we simply use the default settings.

<a id="section_4"></a>
## 4. Create the MLRun Function

We will use MLRun's `mlrun.code_to_function` to create an MLRun Function from our code in the Python file mentioned above. Notice that our MLRun Function will have two handlers: `train` and `evaluate`.

We wish to run the training first as a Job, so we will set the `kind` parameter to `"job"`.

In [5]:
# Create the function parsing the given file code using 'code_to_function':
training_and_evaluation = mlrun.code_to_function(
    filename=os.path.join(framework, "training-and-evaluation.py"),
    name="main_function",
    kind="job",
    image="mlrun/ml-models"
)

# Mount it:
training_and_evaluation.apply(mlrun.platforms.auto_mount())

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7fb0061b0190>

<a id="section_5"></a>
## 5. Run Training and Evaluation

### 5.1. Train the Model:
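
Before launching the training job, it can be worth checking that the archive was extracted as expected. The sketch below counts image files per sub-directory of the dataset directory. It assumes the dataset is organized as one folder per class, which this notebook does not confirm, and the helper name `count_images_per_class` is made up for illustration.

```python
from pathlib import Path
from typing import Dict

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(dataset_dir: str) -> Dict[str, int]:
    """Count image files under each immediate sub-directory (assumed: one per class)."""
    counts = {}
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for path in class_dir.rglob("*")
                if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

In this notebook it would be called as `count_images_per_class(dataset_path)`, using the `dataset_path` variable set in section 2.2.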

We will run the training as a job using the `train` handler. We will pass the desired hyperparameters and keep the returned run object in order to pass the trained model to the evaluation later on. Notice that `local` is now `False` (its default value).

In [6]:
training_run = training_and_evaluation.run(
    name="training",
    handler="train",
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3
    },
    local=False
)

> 2021-11-05 11:14:42,564 [info] starting run training uid=9cf33258b0d14af7905c9e42c7ebb4ca DB=http://mlrun-api:8080
> 2021-11-05 11:14:42,919 [info] Job is running in the background, pod: training-ztv6z
2021-11-05 11:14:47.946076: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-05 11:14:47.946120: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-11-05 11:14:49.029148: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-11-05 11:14:49.029380: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-11-05 11:14:49.029403: W tensorflow/stream_exec

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tf-keras-mask-detection-guyl | ...c7ebb4ca | 0 | Nov 05 11:14:49 | completed | training | v3io_user=guyl<br>kind=job<br>owner=guyl<br>host=training-ztv6z | | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>lr=0.0001<br>epochs=3 | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>epochs=3<br>lr=9.999999747378752e-05<br>training_loss=0.03330254554748535<br>training_accuracy=1.00042724609375<br>validation_loss=0.045975986454221934<br>validation_accuracy=0.9891304439968533 | training_loss_epoch_1.html<br>training_accuracy_epoch_1.html<br>validation_loss_epoch_1.html<br>validation_accuracy_epoch_1.html<br>training_loss_epoch_2.html<br>training_accuracy_epoch_2.html<br>validation_loss_epoch_2.html<br>validation_accuracy_epoch_2.html<br>training_loss_epoch_3.html<br>training_accuracy_epoch_3.html<br>validation_loss_epoch_3.html<br>validation_accuracy_epoch_3.html<br>loss_summary.html<br>accuracy_summary.html<br>lr.html.html<br>mask_detector.zip<br>mask_detector |





> 2021-11-05 11:16:24,671 [info] run executed, status=completed


When the training is done, there will be a list of all the <span style="background:lightgreen">artifacts created</span> in MLRun during the training run. All the (hopefully smooth) loss and metrics graphs we all love will be in both MLRun and TensorBoard, as well as the model weights and custom objects.

### 5.2. Evaluate the Model:

Evaluating the model requires, you guessed it, the trained model. To get the model, we will use the training run object and fetch the model artifact by its name (as seen in the artifacts list generated above).
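
The value passed as `model_path` is a store URI rather than a file path, as the evaluation run summary further down shows. Judging only from that value, it appears to follow the pattern `store://artifacts/<project>/<key>:<tag or uid>`; the sketch below splits such a string under that assumption. `split_store_uri` is a hypothetical helper, not an MLRun API, and the full URI grammar may allow other forms.

```python
from typing import NamedTuple, Optional


class StoreUri(NamedTuple):
    kind: str
    project: str
    key: str
    tag: Optional[str]


def split_store_uri(uri: str) -> StoreUri:
    """Split 'store://<kind>/<project>/<key>[:<tag>]' into its parts (assumed layout)."""
    prefix = "store://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a store URI: {uri!r}")
    kind, project, rest = uri[len(prefix):].split("/", 2)
    key, _, tag = rest.partition(":")
    return StoreUri(kind, project, key, tag or None)


# The value is copied from this notebook's evaluation run summary:
uri = "store://artifacts/tf-keras-mask-detection-guyl/mask_detector:9cf33258b0d14af7905c9e42c7ebb4ca"
print(split_store_uri(uri))
```

Note that the suffix after the colon matches the `uid` printed when the training run started, which ties the model artifact to the run that produced it.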

In [7]:
evaluation_run = training_and_evaluation.run(
    name="evaluating",
    handler="evaluate",
    params={
        "model_path": training_run.outputs['mask_detector'],  # <- Take the model we trained from the previous MLRun function via the run object.
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32
    }
)

> 2021-11-05 11:16:24,677 [info] starting run evaluating uid=e2cbe6f50ca441bc8f739aafb008da24 DB=http://mlrun-api:8080
> 2021-11-05 11:16:24,870 [info] Job is running in the background, pod: evaluating-9b72n
2021-11-05 11:16:29.720341: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-05 11:16:29.720385: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-11-05 11:16:30.869887: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-11-05 11:16:30.870121: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-11-05 11:16:30.870146: W tensorflow/stream_

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tf-keras-mask-detection-guyl | ...b008da24 | 0 | Nov 05 11:16:31 | completed | evaluating | v3io_user=guyl<br>kind=job<br>owner=guyl<br>host=evaluating-9b72n | | model_path=store://artifacts/tf-keras-mask-detection-guyl/mask_detector:9cf33258b0d14af7905c9e42c7ebb4ca<br>dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32 | model_path=store://artifacts/tf-keras-mask-detection-guyl/mask_detector:9cf33258b0d14af7905c9e42c7ebb4ca<br>dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>evaluation_loss=0.03784840055849663<br>evaluation_accuracy=0.9905523255813954 | evaluation_loss_epoch_1.html<br>evaluation_accuracy_epoch_1.html |





> 2021-11-05 11:17:08,579 [info] run executed, status=completed


We will now save our project with the registered functions for using them later in the pipeline:

<a id="section_6"></a>
## 6. Run Distributed Training Using Horovod

Now we can see the second benefit of MLRun: we can **distribute** our model **training** across **multiple workers** (i.e., perform distributed training), assign **GPUs**, and more. We don't need to bother with Dockerfiles or K8s YAML configuration files, since MLRun does all of this for us. All we need to do is create our function with `kind="mpijob"`.

Note: for this demo, in order to use GPUs in training, set the `use_gpu` variable to `True`. This will later assign the configurations required to use the GPUs and pass the correct image to support GPUs (an image with the CUDA libraries).

In [8]:
# If you wish to train on gpu, set this variable to 'True', otherwise 'False':
use_gpu = False

# Create the MLRun Function:
distributed_training_function = mlrun.code_to_function(
    filename=os.path.join(framework, "training-and-evaluation.py"),
    name="distributed-training",
    handler="train",
    kind="mpijob",
    image="mlrun/ml-models-gpu" if use_gpu else "mlrun/ml-models",
    with_doc=False
)

We can set additional configurations for our run, such as the image, workers, GPUs and more. Here we set up 2 workers; if `use_gpu` is `True`, each worker gets 1 GPU, otherwise each worker requests 2 CPUs:

In [9]:
# Setup the desired configurations:
distributed_training_function.spec.replicas = 2
if use_gpu:
    # Select the number of GPUs per replica:
    distributed_training_function.gpus(1)
else:
    distributed_training_function.with_requests(cpu=2)

# Mount it:
distributed_training_function.apply(mlrun.platforms.auto_mount())

<mlrun.runtimes.mpijob.v1.MpiRuntimeV1 at 0x7fb0067dc2d0>

Call `run`, and notice that each epoch is shorter, as we now have 2 workers instead of 1. Because the 2 workers print a lot of output, we would rather wait for completion and then show the results. For that, we pass `watch=False` and use the run object's `wait_for_completion` and `show` methods.
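
The shorter epochs follow from data parallelism. Assuming each worker processes its own batch of `batch_size` samples per step (the usual Horovod setup, not verified here against the training file), the number of steps per epoch drops roughly in proportion to the number of workers. The dataset size used below is a made-up figure for illustration only.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int, workers: int = 1) -> int:
    """Steps needed to see every sample once when each worker takes one batch per step."""
    return math.ceil(n_samples / (batch_size * workers))


n_samples = 1000  # hypothetical dataset size
print(steps_per_epoch(n_samples, batch_size=32, workers=1))  # 32
print(steps_per_epoch(n_samples, batch_size=32, workers=2))  # 16
```

Wall-clock time will not halve exactly, since the workers also spend time synchronizing gradients.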

In order to see the logs, you are welcome to go into the UI by clicking the blue hyperlink <span style="color:blue">**click here**</span> after running the function and see the logs there:

In [10]:
# Run the training job:
distributed_training_run = distributed_training_function.run(
    name="trainer-mpijob-run",
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3,
    },
    watch=False,  # <- Turn off the logs.
)

# Wait for completion and show the results:
distributed_training_run.wait_for_completion()
distributed_training_run.show()

> 2021-11-05 11:17:08,644 [info] starting run trainer-mpijob-run uid=97030ea191084664951837a38c677ea0 DB=http://mlrun-api:8080
> 2021-11-05 11:17:15,997 [info] MpiJob trainer-mpijob-run-346306b6 launcher pod trainer-mpijob-run-346306b6-launcher state active


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tf-keras-mask-detection-guyl | ...8c677ea0 | 0 | Nov 05 11:17:08 | running | trainer-mpijob-run | v3io_user=guyl<br>kind=mpijob<br>owner=guyl | | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>lr=0.0001<br>epochs=3 | | |





> 2021-11-05 11:17:16,019 [info] run executed, status=running


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tf-keras-mask-detection-guyl | ...8c677ea0 | 0 | Nov 05 11:17:21 | completed | trainer-mpijob-run | v3io_user=guyl<br>kind=mpijob<br>owner=guyl<br>mlrun/job=trainer-mpijob-run-346306b6<br>host=trainer-mpijob-run-346306b6-worker-0 | | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>lr=0.0001<br>epochs=3 | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>epochs=3<br>lr=0.00015999999595806003<br>training_loss=0.07539021968841553<br>training_accuracy=0.96875<br>validation_loss=0.04837816291385227<br>validation_accuracy=0.9963767793443468 | training_loss_epoch_1.html<br>training_accuracy_epoch_1.html<br>validation_loss_epoch_1.html<br>validation_accuracy_epoch_1.html<br>training_loss_epoch_2.html<br>training_accuracy_epoch_2.html<br>validation_loss_epoch_2.html<br>validation_accuracy_epoch_2.html<br>training_loss_epoch_3.html<br>training_accuracy_epoch_3.html<br>validation_loss_epoch_3.html<br>validation_accuracy_epoch_3.html<br>loss_summary.html<br>accuracy_summary.html<br>lr.html.html<br>mask_detector.zip<br>mask_detector |
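
To put the two training runs side by side, the sketch below copies the final validation metrics from the run summaries in sections 5.1 and 6 into plain dictionaries and prints the per-metric difference. The values are rounded and typed in by hand; reading them from the run objects is deliberately left out, and `metric_deltas` is a hypothetical helper.

```python
# Final validation metrics, rounded to 4 decimals from this notebook's run summaries:
single_worker = {"validation_loss": 0.0460, "validation_accuracy": 0.9891}
two_workers = {"validation_loss": 0.0484, "validation_accuracy": 0.9964}


def metric_deltas(baseline: dict, candidate: dict) -> dict:
    """Return candidate - baseline for every metric in the baseline."""
    return {name: round(candidate[name] - baseline[name], 4) for name in baseline}


for name, delta in metric_deltas(single_worker, two_workers).items():
    print(f"{name}: {delta:+.4f}")
```

With only three epochs and a single run of each configuration, differences of this size should not be read as one setup being better than the other.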
