# Mask Detection Demo - Training (1 / 2)
The following example demonstrates a training workflow: building and training a model that classifies whether a person is wearing a mask. The training is auto-logged to both TensorBoard and MLRun, and can easily be distributed using Horovod.

#### Key Technologies:
- [**Tensorflow-Keras**](https://www.tensorflow.org/api_docs/python/tf/keras) to train the model
- [**Horovod**](https://horovod.ai/) to run distributed training
- [**MLRun**](https://www.mlrun.org/) to orchestrate the process

#### Credits:

* The model is trained on a dataset containing images of people with or without masks. The data used was taken from Prajna Bhandary, [github link](https://github.com/prajnasb/observations). 
* The training code is taken from Adrian Rosebrock, COVID-19: Face Mask Detector with OpenCV, Keras/TensorFlow, and Deep Learning, PyImageSearch, [page link](https://www.pyimagesearch.com/2020/05/04/covid-19-face-mask-detector-with-opencv-keras-tensorflow-and-deep-learning/), accessed on 29 June 2021

#### Table of Contents:
1. [Setup the Project and Environment](#section_1)
2. [Write the Training Code](#section_2)
3. [Create the Training Function](#section_3)
4. [Run Training](#section_4)

<a id="section_1"></a>
## 1. Setup the Project and Environment
Create a new project, set the environment and create the paths where we'll store the project's artifacts:

In [1]:
import mlrun
import os

# Create the project:
project_name='mask-detection'
project_dir = os.path.abspath('./')
project = mlrun.new_project(project_name, project_dir)

# Set the environment:
mlrun.set_environment(project=project.metadata.name)

# Set up the archive URL for downloading the dataset images:
archive_url = "https://s3.wasabisys.com/iguazio/data/prajnasb-generated-mask-detection/prajnasb_generated_mask_detection.zip"



<a id="section_2"></a>
## 2. Write the Training Code

Our training code is classic and straightforward. We:
1. Use `get_datasets` for downloading the images and initializing our datasets.
2. Use `get_model` to build our classifier - simple transfer learning from MobileNetV2.
3. Call `train` to train the model.

**MLRun**'s framework for tf.keras takes this code one step further. With just one line of code, it provides automatic logging and enables distributed training with Horovod:

```python
# Apply MLRun's interface for tensorflow.keras:
mlrun_keras.apply_mlrun(model=model, context=context)
```

The interface wraps the model's methods and inserts MLRun's callbacks, enabling logging to both TensorBoard and MLRun. Additional settings can be passed to this method to gain extra logging capabilities, such as:

* Weights histograms and distributions
* Weights statistics
* Weights images (work in progress)
* Customized tracking of static and dynamic hyperparameters
* Logging frequency and more

We suggest reading the documentation for further options. In this example, we use the default settings.
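
To make the idea of wrapping a model's methods and inserting callbacks more concrete, here is a minimal sketch in plain Python. It is not MLRun's implementation: `ToyModel`, `LossRecorder` and `apply_logging` are hypothetical stand-ins that only illustrate how a `fit` method can be wrapped so that a logging callback is always appended to the user's own callbacks.

```python
import functools


class LossRecorder:
    """Hypothetical stand-in for a logging callback: stores the logs of each epoch."""

    def __init__(self):
        self.logged = []

    def on_epoch_end(self, epoch, logs):
        self.logged.append((epoch, dict(logs)))


class ToyModel:
    """Hypothetical stand-in for a Keras model; `fit` only simulates epochs."""

    def fit(self, epochs=1, callbacks=None):
        for epoch in range(epochs):
            logs = {"loss": 1.0 / (epoch + 1)}
            for callback in callbacks or []:
                callback.on_epoch_end(epoch, logs)
        return epochs


def apply_logging(model, callback):
    """Wrap `model.fit` so `callback` is appended to whatever callbacks the user passes."""
    original_fit = model.fit

    @functools.wraps(original_fit)
    def fit_with_logging(*args, callbacks=None, **kwargs):
        callbacks = list(callbacks or []) + [callback]
        return original_fit(*args, callbacks=callbacks, **kwargs)

    model.fit = fit_with_logging
    return model


recorder = LossRecorder()
model = apply_logging(ToyModel(), recorder)
model.fit(epochs=3)  # The user never passes the recorder explicitly.
print(recorder.logged)
```

The training code keeps calling `fit` as usual, which is why a single line is enough to switch the logging on.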

In [2]:
# mlrun: start-code

In [3]:
import os
import pathlib
import zipfile

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

import tensorflow as tf
from tensorflow import keras
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

import mlrun
import mlrun.frameworks.keras as mlrun_keras

In [4]:
def get_datasets(
    archive_url: mlrun.DataItem,
    dataset_path: str,
    batch_size: int,
    train_test_split_ratio: float,
):
    # Download the dataset images if needed:
    os.makedirs(dataset_path, exist_ok=True)
    dataset_directory_size = sum([
        f.stat().st_size
        for f in pathlib.Path(dataset_path).glob("**/*")
        if f.is_file()
    ])
    if dataset_directory_size == 0:
        # Download it:
        zip_file = archive_url.local()
        # Extract it:
        with zipfile.ZipFile(zip_file, "r") as archive:
            archive.extractall(dataset_path)

    # Build the dataset:
    images = []
    labels = []
    for label, directory in enumerate(["with_mask", "without_mask"]):
        images_directory = os.path.join(dataset_path, directory)
        for image_file in os.listdir(images_directory):
            image = keras.preprocessing.image.load_img(
                os.path.join(images_directory, image_file), target_size=(224, 224)
            )
            image = keras.preprocessing.image.img_to_array(image)
            image = keras.applications.mobilenet_v2.preprocess_input(image)
            images.append(image)
            labels.append(label)

    # Convert the images and labels to NumPy arrays
    images = np.array(images, dtype="float32")
    labels = np.array(labels)

    # Perform one-hot encoding on the labels:
    labels = LabelBinarizer().fit_transform(labels)
    labels = keras.utils.to_categorical(labels)

    # Split the dataset into training and validation sets:
    x_train, x_test, y_train, y_test = train_test_split(
        images,
        labels,
        test_size=train_test_split_ratio,
        stratify=labels,
        random_state=42,
    )

    # Construct the training image generator for data augmentation:
    image_data_generator = keras.preprocessing.image.ImageDataGenerator(
        rotation_range=20,
        zoom_range=0.15,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.15,
        horizontal_flip=True,
        fill_mode="nearest",
    )

    return (
        image_data_generator.flow(x_train, y_train, batch_size=batch_size),
        (x_test, y_test),
    )
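
The "download if needed" check in `get_datasets` relies on the total size of the files under `dataset_path`. The following standalone sketch shows that check on a temporary directory. The helper name `directory_size` and the dummy file are ours, for illustration only.

```python
import pathlib
import tempfile


def directory_size(path: str) -> int:
    """Return the total size in bytes of all regular files under `path`."""
    return sum(
        f.stat().st_size for f in pathlib.Path(path).glob("**/*") if f.is_file()
    )


with tempfile.TemporaryDirectory() as dataset_path:
    # A fresh directory holds no files, so the download branch would run:
    size_before = directory_size(dataset_path)

    # Simulate an extracted archive with a single dummy 10-byte "image":
    pathlib.Path(dataset_path, "with_mask").mkdir()
    pathlib.Path(dataset_path, "with_mask", "img.jpg").write_bytes(b"\x00" * 10)
    size_after = directory_size(dataset_path)

print(size_before, size_after)
```

Note that empty sub-directories do not count towards the size, so a directory tree without files is still treated as "not downloaded yet".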

In [5]:
def get_model() -> keras.Model:
    # The model will be based on MobileNetV2:
    base_model = keras.applications.MobileNetV2(
        weights="imagenet",
        include_top=False,
        input_tensor=keras.layers.Input(shape=(224, 224, 3)),
    )

    # Construct the head of the model that will be placed on top of the base model:
    head_model = base_model.output
    head_model = keras.layers.AveragePooling2D(pool_size=(7, 7))(head_model)
    head_model = keras.layers.Flatten(name="flatten")(head_model)
    head_model = keras.layers.Dense(128, activation="relu")(head_model)
    head_model = keras.layers.Dropout(0.5)(head_model)
    head_model = keras.layers.Dense(2, activation="softmax")(head_model)

    # Place the head FC model on top of the base model (this will become the actual model we will train):
    model = keras.Model(inputs=base_model.input, outputs=head_model)

    # Loop over layers in the base model and freeze them so they will not be updated during the first training process:
    for layer in base_model.layers:
        layer.trainable = False

    return model
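
As a sanity check on the head above, the shapes and trainable parameter count can be worked out by hand. The sketch below assumes the standard MobileNetV2 backbone, which downsamples a 224x224 input by a factor of 32 and ends with 1280 channels. These two numbers are assumptions about the base model and are not read from this notebook.

```python
input_size = 224
total_stride = 32          # assumption: MobileNetV2 downsamples by a factor of 32
backbone_channels = 1280   # assumption: channels of the last MobileNetV2 feature map

feature_map = input_size // total_stride          # 7x7 feature map
pooled = feature_map // 7                         # AveragePooling2D(pool_size=(7, 7)) -> 1x1
flattened = pooled * pooled * backbone_channels   # Flatten -> 1280 values

# Dense layers have (inputs * units) weights plus one bias per unit:
dense_128_params = flattened * 128 + 128
dense_2_params = 128 * 2 + 2
trainable_params = dense_128_params + dense_2_params

print(feature_map, flattened, trainable_params)
```

Since the base model layers are frozen, only these head parameters are updated during training.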

In [6]:
def train(
    context: mlrun.MLClientCtx,
    archive_url: mlrun.DataItem,
    dataset_path: str = os.path.abspath('./Dataset'),
    batch_size: int = 32,
    lr: float = 1e-4,
    epochs: int = 3,
):
    # Get the datasets:
    training_set, validation_set = get_datasets(
        archive_url=archive_url,
        dataset_path=dataset_path,
        batch_size=batch_size,
        train_test_split_ratio=0.2,
    )

    # Get the model:
    model = get_model()

    # Apply MLRun's interface for tensorflow.keras:
    mlrun_keras.apply_mlrun(model=model, context=context)

    # Initialize the optimizer:
    optimizer = keras.optimizers.Adam(learning_rate=lr)

    # Compile the model:
    model.compile(
        loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"],
    )

    # Train the head of the network:
    model.fit(
        training_set,
        validation_data=validation_set,
        epochs=epochs,
        callbacks=[keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)],
        steps_per_epoch=35,
    )
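
The `steps_per_epoch=35` value passed to `fit` fixes how many batches make up one epoch, independently of the dataset size. Here is a rough sketch of the arithmetic, where `n_images` is a hypothetical dataset size chosen for illustration and not the size of the real archive:

```python
import math

n_images = 1000        # hypothetical dataset size, for illustration only
test_ratio = 0.2       # the `train_test_split_ratio` used in `train`
batch_size = 32        # the default batch size of `train`
steps_per_epoch = 35   # the value passed to `model.fit`

n_train = round(n_images * (1 - test_ratio))            # images left for training
steps_for_full_pass = math.ceil(n_train / batch_size)   # batches in one pass over them
samples_per_epoch = steps_per_epoch * batch_size        # upper bound on samples per epoch

print(n_train, steps_for_full_pass, samples_per_epoch)
```

When `steps_per_epoch` is larger than the number of batches in one pass, the augmenting generator simply keeps producing new randomly transformed batches.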

In [7]:
# mlrun: end-code

<a id="section_3"></a>
## 3. Create the Training Function

We will use MLRun's `code_to_function` to get our code from this notebook. Notice the comments `# mlrun: start-code` and `# mlrun: end-code`: they mark the code to turn into an MLRun function.

We wish to run the training first as a Job, so we will set the `kind` parameter to `"job"`.

In [8]:
training_function = mlrun.code_to_function(
    name="job-trainer",
    handler="train",
    kind="job",
    image="guyliguazio/ml-models-gpu-066:tf243",
    with_doc=False
)
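
To illustrate what the two marker comments achieve, here is a simplified sketch of selecting only the cells between them. It is not how `code_to_function` is implemented. `extract_marked_code` and the sample cells are hypothetical and only demonstrate the idea.

```python
START_MARKER = "# mlrun: start-code"
END_MARKER = "# mlrun: end-code"


def extract_marked_code(cells):
    """Return the source of the cells located between the start and end marker cells."""
    collected = []
    inside = False
    for source in cells:
        stripped = source.strip()
        if stripped == START_MARKER:
            inside = True
        elif stripped == END_MARKER:
            inside = False
        elif inside:
            collected.append(source)
    return "\n\n".join(collected)


# Hypothetical notebook cells, each given as its source string:
notebook_cells = [
    "project_name = 'mask-detection'",
    "# mlrun: start-code",
    "import os",
    "def train(context):\n    pass",
    "# mlrun: end-code",
    "training_function = None",
]
code = extract_marked_code(notebook_cells)
print(code)
```

Only the imports and function definitions end up in the function's code, while the project setup and the cells that launch runs stay in the notebook.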

<a id="section_4"></a>
## 4. Run Training

### 4.1. Train Locally:

First, we will run the training locally by setting `local` to `True`.

In [9]:
training_run = training_function.run(
    name="job-trainer-local-run",
    inputs={
        "archive_url": archive_url
    },
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3
    },
    local=True
)

> 2021-08-03 08:48:40,261 [info] starting run job-trainer-local-run uid=bc57dfb4c0e744caa0177e47c75fe10e DB=http://mlrun-api:8080
Epoch 1/3
Epoch 2/3
Epoch 3/3


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection | ...c75fe10e | 0 | Aug 03 08:48:40 | completed | job-trainer-local-run | v3io_user=admin; kind=; owner=admin; host=jupyter-guyl-5fbdd9b8c-xg82f | archive_url | dataset_path=/User/demos/mask-detection/Dataset; batch_size=32; lr=0.0001; epochs=3 | dataset_path=/User/demos/mask-detection/Dataset; batch_size=32; epochs=3; lr=9.999999747378752e-05; training_loss=0.025029420852661133; training_accuracy=1.0004119873046875; validation_loss=0.037291023466322154; validation_accuracy=0.9891304439968533 | loss_summary.html; accuracy_summary.html; lr.html; model.h5; model |


to track results use .show() or .logs() or in CLI: 
!mlrun get run bc57dfb4c0e744caa0177e47c75fe10e --project mask-detection , !mlrun logs bc57dfb4c0e744caa0177e47c75fe10e --project mask-detection
> 2021-08-03 08:50:07,532 [info] run executed, status=completed


### 4.2. Train with Kubernetes Job:

Now, we will run the training as a job, so we set the `local` parameter we used before to `False`.

In [10]:
training_function.apply(mlrun.platforms.auto_mount())
training_run = training_function.run(
    name="job-trainer-run",
    inputs={
        "archive_url": archive_url
    },
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3
    },
    local=False
)

> 2021-08-03 08:50:07,539 [info] starting run job-trainer-run uid=4dc847d5348e42868743ed5acd95c9d9 DB=http://mlrun-api:8080
> 2021-08-03 08:50:07,691 [info] Job is running in the background, pod: job-trainer-run-bcj8k
2021-08-03 08:50:12.122084: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-08-03 08:50:13.118334: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-08-03 08:50:13.119482: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-08-03 08:50:13.153344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-03 08:50:13.153937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:00:1e.

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection | ...cd95c9d9 | 0 | Aug 03 08:50:13 | completed | job-trainer-run | v3io_user=admin; kind=job; owner=admin; host=job-trainer-run-bcj8k | archive_url | dataset_path=/User/demos/mask-detection/Dataset; batch_size=32; lr=0.0001; epochs=3 | dataset_path=/User/demos/mask-detection/Dataset; batch_size=32; epochs=3; lr=9.999999747378752e-05; training_loss=0.07963061332702637; training_accuracy=1.0003585815429688; validation_loss=0.03511989116668701; validation_accuracy=0.9963767793443468 | loss_summary.html; accuracy_summary.html; lr.html; model.h5; model |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 4dc847d5348e42868743ed5acd95c9d9 --project mask-detection , !mlrun logs 4dc847d5348e42868743ed5acd95c9d9 --project mask-detection
> 2021-08-03 08:51:05,424 [info] run executed, status=completed


### 4.3. Train with Horovod:

Now we can see a second benefit of MLRun: we can **distribute** our model **training** across **multiple workers** (i.e., perform distributed training), assign **GPUs**, and more. We don't need to bother with Dockerfiles or K8s YAML configuration files; MLRun does all of this for us.

All we need to do is create our function with `kind="mpijob"`:

In [8]:
training_function = mlrun.code_to_function(
    name="mpijob-trainer",
    handler="train",
    kind="mpijob",
    image="guyliguazio/ml-models-gpu-066:tf243",
    with_doc=False
)

We can set additional configurations for our run, such as the image, the number of workers, GPUs and more. We will set up 4 workers with 1 GPU per worker:

In [9]:
# Set `use_gpu` to True to train on GPUs, or to False to train on CPUs:
use_gpu = True

# Set up the desired configuration:
training_function.spec.replicas = 4
if use_gpu:
    training_function.gpus(1)
else:
    training_function.with_requests(cpu=4)
training_function.apply(mlrun.platforms.auto_mount())

<mlrun.runtimes.mpijob.v1.MpiRuntimeV1 at 0x7f47e3d69910>

Call `run`, and notice that each epoch is shorter, as we now have 4 workers instead of 1.

In [10]:
# Run the training job:
training_run = training_function.run(
    name="trainer-mpijob-run",
    inputs={
        "archive_url": archive_url
    },
    params={
        "dataset_path": os.path.abspath('./Dataset'),
        "batch_size": 32,
        "lr": 1e-4,
        "epochs": 3,
    },
    watch=False,
)

# Print the progress in steps as the 4 workers will print a lot of tf outputs...
import time
from IPython.display import clear_output

while training_run.state() not in ["completed", "error"]:
    time.sleep(3)
    clear_output(wait=True)
    training_run.show()

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mask-detection | ...a37a8e22 | 0 | Aug 03 09:09:43 | completed | trainer-mpijob-run | v3io_user=admin; kind=mpijob; owner=admin; mlrun/job=trainer-mpijob-run-f932eb02; host=trainer-mpijob-run-f932eb02-worker-0 | archive_url | dataset_path=/User/demos/mask-detection/Dataset; batch_size=32; lr=0.0001; epochs=3 | dataset_path=/User/demos/mask-detection/Dataset; batch_size=32; epochs=3; lr=0.0002799999783746898; training_loss=0.0443190336227417; training_accuracy=1.0; validation_loss=0.06190176804860433; validation_accuracy=0.9782608879937066 | loss_summary.html; accuracy_summary.html; lr.html; model.h5; model |
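
To see why each epoch gets shorter with more workers, here is a back-of-the-envelope sketch of the resources and per-worker work in this configuration. It assumes the steps of an epoch are divided evenly between the workers, which is a common convention in Horovod-based data-parallel training. We have not verified that MLRun splits the steps exactly this way, so treat `steps_per_worker` as an illustration.

```python
import math

workers = 4            # training_function.spec.replicas
gpus_per_worker = 1    # training_function.gpus(1)
steps_per_epoch = 35   # the value passed to `model.fit` in `train`

# Total number of GPUs requested by the job:
total_gpus = workers * gpus_per_worker

# Assumption: the steps of an epoch are shared evenly between the workers,
# rounding up so that no batch is dropped.
steps_per_worker = math.ceil(steps_per_epoch / workers)

print(total_gpus, steps_per_worker)
```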
