# Mask Detection Demo - Training and Evaluation (1 / 3)
The following example demonstrates a training workflow: building and training a model that classifies whether a person is wearing a mask. The training is auto-logged to both TensorBoard and MLRun, and is easily distributed using Horovod. After training, we run an evaluation to check the model's performance and update its logs as part of a routine test.

1. [Setup the Project](#section_1)
2. [Download the Data](#section_2)
3. [Write the Code](#section_3)
4. [Create the MLRun Function](#section_4)
5. [Run Training and Evaluation](#section_5)
6. [Run Distributed Training Using Horovod](#section_6)

<a id="section_1"></a>
## 1. Setup the Project

Create a project using `mlrun.new_project`, creating the paths where we'll store the project's artifacts:

In [1]:
import mlrun
import os

# Set our project's name and directory:
project_name = "tf-keras-mask-detection"
project_dir = os.path.abspath("./")

# Create the project:
project = mlrun.new_project(project_name, project_dir, user_project=True)

A project in MLRun is built around the MLRun Functions it can run. In this notebook we will see two ways to create an MLRun Function:
* `mlrun.code_to_function`: Create our own MLRun Function from code (will be used for training and evaluation in [section 4](#section_4)).
* `mlrun.import_function`: Import from [MLRun's functions marketplace](https://docs.mlrun.org/en/latest/load-from-marketplace.html) - a functions hub intended to be a centralized location for open source contributions of function components (will be used for downloading the data in [section 2](#section_2)).

<a id="section_2"></a>
## 2. Download the Data

### 2.1. Import a Function

We will download the images using `open_archive`, a function from MLRun's functions marketplace. We will import the function using `mlrun.import_function` and call its `doc` method to print the function's documentation:

In [2]:
# Import the function:
open_archive_function = mlrun.import_function("hub://open_archive")

# Print the function's documentation:
open_archive_function.doc()

function: open-archive
Open a file/object archive into a target directory
default handler: open_archive
entry points:
  open_archive: Open a file/object archive into a target directory

Currently supports zip and tar.gz
    context(MLClientCtx)  - function execution context, default=
    archive_url(DataItem)  - url of archive file, default=
    subdir(str)  - path within artifact store where extracted files are stored, default=content
    key(str)  - key of archive contents in artifact store, default=content
    target_path(str)  - file system path to store extracted files (use either this or subdir), default=None


### 2.2. Run the Function - Download the Images

* **Function handlers**: We'll download the images by running the function with the `open_archive` handler, as noted in the function's documentation. An MLRun function is a collection of code, and its handlers are the Python functions defined inside it. Every function that takes a context (type: `mlrun.MLClientCtx`) can be used as a handler.
* **Passing parameters**: An MLRun function expects two types of arguments: inputs (type: `mlrun.DataItem`) and parameters. The function's documentation shows that `archive_url` is an `mlrun.DataItem`, so it should be passed in the `inputs` argument of the `run` method. The others are passed via the `params` argument.
* Notice that we pass `local=True`. That means the function runs locally and not on a pod, which is a convenient way to debug the code.

For more information regarding MLRun functions, context and data items, refer to [MLRun's documentation](https://docs.mlrun.org/en/latest/index.html).
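
To make the split between `inputs` and `params` concrete, here is a minimal sketch that sorts keyword arguments by the handler's type annotations. The `DataItem` class and the `split_arguments` helper are hypothetical stand-ins written for this illustration; they are not part of MLRun, and the example URL is a placeholder.

```python
import inspect


class DataItem:
    """Hypothetical stand-in for mlrun.DataItem, used only in this sketch."""

    def __init__(self, url: str):
        self.url = url


def open_archive(context, archive_url: DataItem, subdir: str = "content",
                 key: str = "content", target_path: str = None):
    """Mirrors the documented handler signature; the body is omitted."""


def split_arguments(handler, arguments: dict):
    """Split keyword arguments into (inputs, params) by type annotation."""
    inputs, params = {}, {}
    signature = inspect.signature(handler)
    for name, value in arguments.items():
        # Arguments annotated as DataItem are data inputs; the rest are parameters.
        if signature.parameters[name].annotation is DataItem:
            inputs[name] = value
        else:
            params[name] = value
    return inputs, params


inputs, params = split_arguments(
    open_archive,
    {"archive_url": "https://example.com/archive.zip", "target_path": "./Dataset"},
)
print(inputs)  # the DataItem-annotated argument
print(params)  # everything else
```

The same reasoning explains the call in the next cell: `archive_url` goes into `inputs` and `target_path` goes into `params`.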

In [3]:
# Setup the archive url for downloading the dataset images:
archive_url = f"{mlrun.mlconf.default_samples_path}data/prajnasb-generated-mask-detection/prajnasb_generated_mask_detection.zip"

# Set the path to download the images data to:
dataset_path = os.path.abspath('./Dataset')

# Run the function using the 'open_archive' handler:
open_archive_run = open_archive_function.run(
    name='download_data',
    handler='open_archive',
    inputs={'archive_url': archive_url},
    params={'target_path': dataset_path},
    local=True
)

> 2021-11-02 12:27:55,324 [info] starting run download_data uid=c2aa623c8bed4b649f5431035be78783 DB=http://mlrun-api:8080
> 2021-11-02 12:27:55,591 [info] downloading https://s3.wasabisys.com/iguazio/data/prajnasb-generated-mask-detection/prajnasb_generated_mask_detection.zip to local temp file


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tf-keras-mask-detection-guyl | ...5be78783 | 0 | Nov 02 12:27:55 | completed | download_data | v3io_user=guyl<br>kind=<br>owner=guyl<br>host=guyl-jupyter-bdbbcc6cc-x2mzd | archive_url | target_path=/User/demos/mask-detection/tf-keras/Dataset | | content |

> 2021-11-02 12:28:05,634 [info] run executed, status=completed


<a id="section_3"></a>
## 3. Write the Code

Our code is classic and straightforward. We:
1. Use `_get_datasets` for getting the training and validation datasets (on evaluation - the evaluation dataset).
2. Use `_get_model` to build our classifier - simple transfer learning from MobileNetV2.
3. Call `train` to train the model.
4. Call `evaluate` to evaluate the model.

Taking this code one step further is **MLRun**'s framework for `tf.keras`: 

```python
# Apply MLRun's interface for tf.keras:
mlrun_tf_keras.apply_mlrun(model=model, context=context, ...)
```

With just one line of code, it seamlessly provides:
* **Automatic logging**: auto-log your training and model to both **TensorBoard** and **MLRun**. Additional settings can be passed to this function to gain extra logging capabilities, such as:
  * Weights histograms and distributions
  * Weights statistics
  * Weights images (work in progress)
  * Tracking of static and dynamic hyperparameters
  * Logging frequency and more
* **Distributed training with Horovod**: Horovod is initialized and used automatically if the MLRun Function's `kind` attribute is equal to `"mpijob"`. No additional changes to the original code are needed! More on that later in [section 6](#section_6).

In addition, in the `evaluate` function, we use the `mlrun.frameworks.tf_keras.TFKerasModelHandler` class. This class supports loading, saving and logging `tf.keras` models with ease, enabling easy versioning of the model and its results, artifacts and custom objects.

For further options, we suggest reading the documentation. In this example we use the default settings.

In [4]:
# mlrun: start-code

In [5]:
import os
import pathlib
import zipfile

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

import tensorflow as tf
from tensorflow import keras
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

import mlrun
import mlrun.frameworks.tf_keras as mlrun_tf_keras
from mlrun.frameworks.tf_keras import TFKerasModelHandler

In [6]:
def _get_datasets(
    dataset_path: str,
    batch_size: int,
    is_evaluation: bool = False,
):
    # Build the dataset:
    images = []
    labels = []
    for label, directory in enumerate(["with_mask", "without_mask"]):
        images_directory = os.path.join(dataset_path, directory)
        images_files = [os.path.join(images_directory, file) for file in os.listdir(images_directory)]
        for image_file in images_files:
            if not os.path.isfile(image_file):
                continue
            image = keras.preprocessing.image.load_img(image_file, target_size=(224, 224))
            image = keras.preprocessing.image.img_to_array(image)
            image = keras.applications.mobilenet_v2.preprocess_input(image)
            images.append(image)
            labels.append(label)

    # Convert the images and labels to NumPy arrays
    images = np.array(images, dtype="float32")
    labels = np.array(labels)

    # Perform one-hot encoding on the labels:
    labels = LabelBinarizer().fit_transform(labels)
    labels = keras.utils.to_categorical(labels)
    
    # Check if it's an evaluation; if so, use the entire dataset:
    if is_evaluation:
        return (images, labels)
    
    # Split the dataset into training and validation sets:
    x_train, x_test, y_train, y_test = train_test_split(
        images,
        labels,
        test_size=0.2,
        stratify=labels,
        random_state=42,
    )

    # Construct the training image generator for data augmentation:
    image_data_generator = keras.preprocessing.image.ImageDataGenerator(
        rotation_range=20,
        zoom_range=0.15,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.15,
        horizontal_flip=True,
        fill_mode="nearest",
    )

    return (
        image_data_generator.flow(x_train, y_train, batch_size=batch_size),
        (x_test, y_test),
    )
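
The label handling in `_get_datasets` can be illustrated without TensorFlow or scikit-learn. The sketch below is a plain-Python illustration, not the notebook's code: it shows how each class directory maps to an integer label (its position in the list) and what the resulting one-hot vector looks like. The `one_hot` helper is a hypothetical name introduced here.

```python
CLASS_DIRECTORIES = ["with_mask", "without_mask"]


def one_hot(label: int, num_classes: int) -> list:
    """Return a vector with 1.0 at the label's index and 0.0 elsewhere."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


# The integer label of a class is the position of its directory in the list.
label_of = {directory: label for label, directory in enumerate(CLASS_DIRECTORIES)}

encoded = {
    directory: one_hot(label, len(CLASS_DIRECTORIES))
    for directory, label in label_of.items()
}
print(encoded)
```

With this ordering, index 0 of the model's softmax output corresponds to `with_mask` and index 1 to `without_mask`.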

In [7]:
def _get_model() -> keras.Model:
    # The model will be based on MobileNetV2:
    base_model = keras.applications.MobileNetV2(
        weights="imagenet",
        include_top=False,
        input_tensor=keras.layers.Input(shape=(224, 224, 3)),
    )

    # Construct the head of the model that will be placed on top of the base model:
    head_model = base_model.output
    head_model = keras.layers.AveragePooling2D(pool_size=(7, 7))(head_model)
    head_model = keras.layers.Flatten(name="flatten")(head_model)
    head_model = keras.layers.Dense(128, activation="relu")(head_model)
    head_model = keras.layers.Dropout(0.5)(head_model)
    head_model = keras.layers.Dense(2, activation="softmax")(head_model)

    # Place the head FC model on top of the base model (this will become the actual model we will train):
    model = keras.Model(
        name="mask_detector", 
        inputs=base_model.input, 
        outputs=head_model
    )

    # Loop over layers in the base model and freeze them so they will not be updated during the first training process:
    for layer in base_model.layers:
        layer.trainable = False

    return model
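
As a sanity check on the head defined above, the following sketch works through the shapes and the number of trainable parameters by hand. It assumes the standard MobileNetV2 configuration (a 224x224 input reduced by a factor of 32, with 1280 output channels) and the default pooling stride equal to the pool size. The helper names are introduced here for illustration only.

```python
def pooled_size(size: int, pool: int) -> int:
    """Output length of non-overlapping pooling (stride == pool, no padding)."""
    return (size - pool) // pool + 1


def dense_parameters(n_inputs: int, units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * units + units


# Feature map produced by the frozen base model (assumed standard MobileNetV2).
height, width, channels = 224 // 32, 224 // 32, 1280

# AveragePooling2D with a 7x7 pool collapses the 7x7 map to 1x1.
pooled_shape = (pooled_size(height, 7), pooled_size(width, 7), channels)
flattened = pooled_shape[0] * pooled_shape[1] * pooled_shape[2]

# Only the two Dense layers carry weights; pooling, flatten and dropout have none.
hidden_parameters = dense_parameters(flattened, 128)
output_parameters = dense_parameters(128, 2)
head_parameters = hidden_parameters + output_parameters

print(pooled_shape)     # (1, 1, 1280)
print(flattened)        # 1280
print(head_parameters)  # 164226
```

Because the base model's layers are frozen, these head parameters are the only ones updated during this training run.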

In [18]:
def train(
    context: mlrun.MLClientCtx,
    archive_url: mlrun.DataItem,
    dataset_path: str,
    batch_size: int = 32,
    lr: float = 1e-4,
    epochs: int = 3,
):
    # Get the datasets:
    training_set, validation_set = _get_datasets(
        dataset_path=dataset_path,
        batch_size=batch_size
    )

    # Get the model:
    model = _get_model()

    # Apply MLRun's interface for tf.keras:
    mlrun_tf_keras.apply_mlrun(model=model, model_name='mask_detector', context=context)

    # Initialize the optimizer:
    optimizer = keras.optimizers.Adam(learning_rate=lr)

    # Compile the model:
    model.compile(
        optimizer=optimizer,
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )

    # Train the head of the network:
    model.fit(
        training_set,
        validation_data=validation_set,
        epochs=epochs,
        callbacks=[keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)],
        steps_per_epoch=35,
    )
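
The `steps_per_epoch=35` value above is hard-coded. A common way to derive such a value is to divide the number of training samples by the batch size and round up. The sketch below assumes, purely for illustration, 1,100 training images; that figure is an assumption and is not stated in the notebook.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


# Assumed training-set size, used only for this illustration.
assumed_training_samples = 1100

print(steps_per_epoch(assumed_training_samples, 32))  # 35
```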

In [19]:
def evaluate(
    context: mlrun.MLClientCtx, 
    model_path: str,
    dataset_path: str, 
    batch_size: int, 
):
    # Get the dataset:
    x, y = _get_datasets(
        dataset_path=dataset_path,
        batch_size=batch_size,
        is_evaluation=True
    )

    # Load the model using MLRun's model handler:
    model_handler = TFKerasModelHandler(model_name='mask_detector', model_path=model_path, context=context)
    model_handler.load()

    # Apply MLRun's interface for tf.keras:
    mlrun_tf_keras.apply_mlrun(model=model_handler.model, model_name='mask_detector', model_path=model_path, context=context)

    # Evaluate:
    model_handler.model.evaluate(
        x=x, 
        y=y, 
        batch_size=batch_size
    )

In [20]:
# mlrun: end-code

<a id="section_4"></a>
## 4. Create the MLRun Function

We will use MLRun's `mlrun.code_to_function` to create an MLRun Function from the code in this notebook. The comments `# mlrun: start-code` and `# mlrun: end-code` mark the code that will become our MLRun Function. Notice that our MLRun Function will have two handlers: `train` and `evaluate`.

We wish to run the training first as a Job, so we will set the `kind` parameter to `"job"`.

In [21]:
# Create the function parsing this notebook's code using 'code_to_function':
main_function = mlrun.code_to_function(
    name="main_function",
    kind="job",
    image="mlrun/ml-models"
)
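
To illustrate the idea behind the `# mlrun: start-code` and `# mlrun: end-code` markers, here is a minimal sketch that collects only the cells found between them. It is a simplified illustration of the concept, not MLRun's actual implementation; `extract_marked_code` and the sample cells are hypothetical.

```python
START_MARKER = "# mlrun: start-code"
END_MARKER = "# mlrun: end-code"


def extract_marked_code(cells: list) -> str:
    """Join the source of all cells located between the start and end markers."""
    collected = []
    inside = False
    for cell in cells:
        stripped = cell.strip()
        if stripped == START_MARKER:
            inside = True
        elif stripped == END_MARKER:
            inside = False
        elif inside:
            collected.append(cell)
    return "\n\n".join(collected)


# Hypothetical notebook cells: only the two handlers sit between the markers.
cells = [
    "project_name = 'demo'",
    "# mlrun: start-code",
    "def train(context):\n    pass",
    "def evaluate(context):\n    pass",
    "# mlrun: end-code",
    "main_function = None",
]

code = extract_marked_code(cells)
print(code)
```

Everything outside the markers, such as the project setup and the data download, stays out of the resulting function code.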

<a id="section_5"></a>
## 5. Run Training and Evaluation

### 5.1. Train the Model:

We will run the training as a job using the `train` handler. We will pass the desired hyperparameters and keep the returned run object in order to pass the trained model to the evaluation later on. Notice that `local` is now `False` (its default value).

In [12]:
main_function.apply(mlrun.platforms.auto_mount())
training_run = main_function.run(
    name="training",
    handler="train",
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3
    },
    local=False
)

> 2021-11-02 12:28:32,774 [info] starting run training uid=446598a3791d4486906fc1c607a8d933 DB=http://mlrun-api:8080
> 2021-11-02 12:28:32,966 [info] Job is running in the background, pod: training-g48kk
2021-11-02 12:29:32.287028: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-02 12:29:32.287068: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-11-02 12:29:33.398481: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-11-02 12:29:33.398697: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-11-02 12:29:33.398722: W tensorflow/stream_exec

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tf-keras-mask-detection-guyl | ...07a8d933 | 0 | Nov 02 12:29:33 | completed | training | v3io_user=guyl<br>kind=job<br>owner=guyl<br>host=training-g48kk | | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>lr=0.0001<br>epochs=3 | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>epochs=3<br>lr=9.999999747378752e-05<br>training_loss=0.051856279373168945<br>training_accuracy=1.0004768371582031<br>validation_loss=0.04408383038308886<br>validation_accuracy=0.9891304439968533 | training_loss_epoch_1.html<br>training_accuracy_epoch_1.html<br>validation_loss_epoch_1.html<br>validation_accuracy_epoch_1.html<br>training_loss_epoch_2.html<br>training_accuracy_epoch_2.html<br>validation_loss_epoch_2.html<br>validation_accuracy_epoch_2.html<br>training_loss_epoch_3.html<br>training_accuracy_epoch_3.html<br>validation_loss_epoch_3.html<br>validation_accuracy_epoch_3.html<br>loss_summary.html<br>accuracy_summary.html<br>lr.html.html<br>mask_detector.zip<br>mask_detector |

> 2021-11-02 12:31:04,762 [info] run executed, status=completed


When the training is done, there will be a list of all the <span style="background:lightgreen">artifacts created</span> in MLRun during the training run. All the (hopefully smooth) loss and metrics graphs we all love will be in both MLRun and TensorBoard, as well as the model weights and custom objects.

### 5.2. Evaluate the Model:

Evaluating the model requires, you guessed it, the trained model. To get the model, we will use the training run object and fetch the model artifact by its name (as seen in the artifacts list generated above).

In [13]:
evaluation_run = main_function.run(
    name="evaluating",
    handler="evaluate",
    params={
        "model_path": training_run.outputs['mask_detector'],
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32
    }
)

> 2021-11-02 12:31:04,767 [info] starting run evaluating uid=500be45f9284407fac097d5b187e20d3 DB=http://mlrun-api:8080
> 2021-11-02 12:31:04,947 [info] Job is running in the background, pod: evaluating-k4dk4
2021-11-02 12:31:10.429591: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-02 12:31:10.429632: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-11-02 12:31:11.535007: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-11-02 12:31:11.535239: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-11-02 12:31:11.535263: W tensorflow/stream_

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tf-keras-mask-detection-guyl | ...187e20d3 | 0 | Nov 02 12:31:11 | completed | evaluating | v3io_user=guyl<br>kind=job<br>owner=guyl<br>host=evaluating-k4dk4 | | model_path=store://artifacts/tf-keras-mask-detection-guyl/mask_detector:446598a3791d4486906fc1c607a8d933<br>dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32 | model_path=store://artifacts/tf-keras-mask-detection-guyl/mask_detector:446598a3791d4486906fc1c607a8d933<br>dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>evaluation_loss=0.037020824676336245<br>evaluation_accuracy=0.9898255813953488 | evaluation_loss_epoch_1.html<br>evaluation_accuracy_epoch_1.html |

> 2021-11-02 12:31:48,676 [info] run executed, status=completed
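
The evaluation run summary reports `model_path` as a `store://` URI, and the value after the colon matches the uid of the training run. The sketch below splits that URI into its visible parts. The field names (`kind`, `project`, `key`, `reference`) are labels chosen for this illustration and are an assumption about how the URI is structured, not a documented MLRun API.

```python
def parse_store_uri(uri: str) -> dict:
    """Split a 'store://<kind>/<project>/<key>:<reference>' URI into its parts."""
    prefix = "store://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a store URI: {uri}")
    kind, project, rest = uri[len(prefix):].split("/", 2)
    # The part after the colon identifies a specific version of the artifact.
    key, _, reference = rest.partition(":")
    return {"kind": kind, "project": project, "key": key, "reference": reference}


# URI copied from the evaluation run summary above.
model_uri = (
    "store://artifacts/tf-keras-mask-detection-guyl/"
    "mask_detector:446598a3791d4486906fc1c607a8d933"
)

parts = parse_store_uri(model_uri)
print(parts)
```

Passing the URI rather than a file path lets the evaluation run refer to the exact model artifact produced by the training run.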


We will now save our project with the registered functions for using them later in the pipeline:

<a id="section_6"></a>
## 6. Run Distributed Training Using Horovod

Now we can see the second benefit of MLRun: we can **distribute** our model **training** across **multiple workers** (i.e., perform distributed training), assign **GPUs**, and more. We don't need to bother with Dockerfiles or K8s YAML configuration files; MLRun does all of this for us.

All we need to do is create our function with `kind="mpijob"`:

In [14]:
# If you wish to train on gpu, set this variable to 'True', otherwise 'False':
use_gpu = False

# Create the MLRun Function:
distributed_training_function = mlrun.code_to_function(
    name="distributed-training",
    handler="train",
    kind="mpijob",
    image="mlrun/ml-models-gpu" if use_gpu else "mlrun/ml-models",
    with_doc=False
)

We can set additional configurations for our run, such as the image, the number of workers, GPUs and more. We will set up 2 workers, each with 1 GPU when `use_gpu` is `True`, or with a request for 2 CPUs otherwise:

In [15]:
# Setup the desired configurations:
distributed_training_function.spec.replicas = 2
if use_gpu:
    distributed_training_function.gpus(1)
else:
    distributed_training_function.with_requests(cpu=2)
distributed_training_function.apply(mlrun.platforms.auto_mount())

<mlrun.runtimes.mpijob.v1.MpiRuntimeV1 at 0x7fb28226c2d0>

Call run, and notice each epoch is shorter as we now have 2 workers instead of 1.

In [16]:
# Run the training job:
distributed_training_run = distributed_training_function.run(
    name="trainer-mpijob-run",
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 6,
    },
    watch=False,
)

# Print the progress in steps as the 2 workers will print a lot of tf outputs...
import time
from IPython.display import clear_output

while distributed_training_run.state() not in ['completed', 'error']:
    time.sleep(3)
    clear_output(wait=True)
    distributed_training_run.show()

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tf-keras-mask-detection-guyl | ...98d77052 | 0 | Nov 02 12:32:42 | completed | trainer-mpijob-run | v3io_user=guyl<br>kind=mpijob<br>owner=guyl<br>mlrun/job=trainer-mpijob-run-689e6b6e<br>host=trainer-mpijob-run-689e6b6e-worker-1 | | dataset_path=/User/demos/mask-detection/tf-keras/Dataset<br>batch_size=32<br>lr=0.0001<br>epochs=6 | | |
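
The polling loop above runs until the run reaches a terminal state, so it never exits if the run gets stuck. A possible variation, sketched below, adds a timeout. `wait_for_terminal_state` is a hypothetical helper written for this illustration; it only assumes a zero-argument callable that returns the current state as a string.

```python
import time


def wait_for_terminal_state(get_state, terminal_states=("completed", "error"),
                            interval=3.0, timeout=600.0):
    """Poll get_state() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still in state '{state}' after {timeout} seconds")
        time.sleep(interval)


# Simulated run that reports 'running' twice before completing.
simulated_states = iter(["running", "running", "completed"])
final_state = wait_for_terminal_state(
    lambda: next(simulated_states), interval=0.0, timeout=5.0
)
print(final_state)  # completed
```

In the notebook, the callable could be `distributed_training_run.state`, with the `clear_output` and `show` calls kept in the caller if a live view is wanted.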
