# **Amazon Lookout for Equipment** - Demonstration on an anonymized expander dataset
*Part 3: Model training*

In [None]:
BUCKET = '<YOUR_BUCKET_NAME_HERE>'
PREFIX = 'data'

## Initialization
---
Following the data preparation notebook, this repository should now be structured as follows:
```
/lookout-equipment-demo
|
+-- data/
|   |
|   +-- labelled-data/
|   |   \-- labels.csv
|   |
|   \-- training-data/
|       \-- expander/
|           |-- subsystem-01
|           |   \-- subsystem-01.csv
|           |
|           |-- subsystem-02
|           |   \-- subsystem-02.csv
|           |
|           |-- ...
|           |
|           \-- subsystem-24
|               \-- subsystem-24.csv
|
+-- dataset/
|   |-- labels.csv
|   |-- tags_description.csv
|   |-- tags_list.txt
|   |-- timeranges.txt
|   \-- timeseries.zip
|
+-- notebooks/
|   |-- 1_data_preparation.ipynb
|   |-- 2_dataset_creation.ipynb
|   |-- 3_model_training.ipynb              <<< This notebook <<<
|   |-- 4_model_evaluation.ipynb
|   \-- 5_inference_scheduling.ipynb
|
+-- utils/
    |-- lookout_equipment_utils.py
    \-- lookoutequipment.json
```

### Imports

In [None]:
%%sh
pip -q install --upgrade pip
pip -q install --upgrade awscli boto3 sagemaker
aws configure add-model --service-model file://../utils/lookoutequipment.json --service-name lookoutequipment

In [None]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import boto3
import os
import pandas as pd
import pytz
import sagemaker
import sys
import time
import uuid
import warnings

from datetime import datetime

# Helper functions for managing Lookout for Equipment API calls:
sys.path.append('../utils')
import lookout_equipment_utils as lookout

### Parameters

In [None]:
warnings.filterwarnings('ignore')

DATA       = os.path.join('..', 'data')
LABEL_DATA = os.path.join(DATA, 'labelled-data')
TRAIN_DATA = os.path.join(DATA, 'training-data', 'expander')

ROLE_ARN = sagemaker.get_execution_role()
REGION_NAME = boto3.session.Session().region_name

Based on our previous analysis, we will use the following time ranges:

* **Train set:** 1st January 2015 - 31st August 2015. Lookout for Equipment needs at least 180 days of training data. March is one of the anomaly periods tagged in the labels, so including it in the training range should not change the modeling behaviour.
* **Test set:** 1st September 2015 - 30th November 2015 *(this test set should include both normal and abnormal data so that the model can be evaluated on each)*
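
As a quick sanity check of the ranges listed above, the sketch below hard-codes the four dates from the bullet list and verifies that the training span meets the 180-day minimum quoted there and that the test set starts after the training set ends. The variable names and the `MIN_TRAINING_DAYS` constant are illustrative only; they are not part of the notebook's helper module.

```python
import pandas as pd

# Minimum training duration quoted in the text above (assumed constant name).
MIN_TRAINING_DAYS = 180

# Dates copied from the bullet list above.
train_start = pd.Timestamp('2015-01-01')
train_end   = pd.Timestamp('2015-08-31')
test_start  = pd.Timestamp('2015-09-01')
test_end    = pd.Timestamp('2015-11-30')

# Number of whole days between the training boundaries.
training_days = (train_end - train_start).days

# The training range must be long enough and must precede the test range.
assert training_days >= MIN_TRAINING_DAYS, 'Training range is too short'
assert train_end < test_start < test_end, 'Ranges overlap or are out of order'

print(f'Training span: {training_days} days')
```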

In [None]:
# Loading time ranges:
timeranges_fname = os.path.join(DATA, 'timeranges.txt')
with open(timeranges_fname, 'r') as f:
    timeranges = f.readlines()
    
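# Use strip() rather than [:-1] so that a last line without a trailing
# newline does not lose its final character:
training_start   = pd.to_datetime(timeranges[0].strip())
training_end     = pd.to_datetime(timeranges[1].strip())
evaluation_start = pd.to_datetime(timeranges[2].strip())
evaluation_end   = pd.to_datetime(timeranges[3].strip())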

print(f'Training period: from {training_start} to {training_end}')
print(f'Evaluation period: from {evaluation_start} to {evaluation_end}')

dataset_fname = os.path.join(DATA, 'dataset_name.txt')
with open(dataset_fname, 'r') as f:
    DATASET_NAME = f.readline().strip()
    
print('Dataset used:', DATASET_NAME)

In [None]:
lookout_client = lookout.get_client(region_name=REGION_NAME)

## Model training
---

In [None]:
def train_lookout_equipment_model(
    sampling_rate, 
    model_name, 
    training_start, 
    training_end, 
    evaluation_start, 
    evaluation_end, 
    unsupervised=False, 
    schema=None
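):
    """
    Create a Lookout for Equipment model, which starts its training.

    Labels are passed only when unsupervised is False. An optional schema
    restricts the model to a subset of the dataset. Returns the response
    of the create_model API call.
    """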
    TARGET_SAMPLING_RATE = sampling_rate

    TRAINING_DATA_START_TIME   = training_start.to_pydatetime()
    TRAINING_DATA_END_TIME     = training_end.to_pydatetime()
    EVALUATION_DATA_START_TIME = evaluation_start.to_pydatetime()
    EVALUATION_DATA_END_TIME   = evaluation_end.to_pydatetime()

    LABEL_DATA_SOURCE_BUCKET   = BUCKET
    LABEL_DATA_SOURCE_PREFIX   = f'{PREFIX}/labelled-data/'
    labels_input_config = dict()
    labels_input_config['S3InputConfiguration'] = dict([
        ('Bucket', LABEL_DATA_SOURCE_BUCKET),
        ('Prefix', LABEL_DATA_SOURCE_PREFIX)
    ])

    MODEL_NAME = model_name
    
    client_token = uuid.uuid4().hex
    create_model_request = {
        'ModelName': MODEL_NAME,
        'DatasetName': DATASET_NAME,
        'ClientToken': client_token,
        'DataPreProcessingConfiguration': {
            'TargetSamplingRate': TARGET_SAMPLING_RATE
        },
        'TrainingDataStartTime': TRAINING_DATA_START_TIME,
        'TrainingDataEndTime': TRAINING_DATA_END_TIME,
        'EvaluationDataStartTime': EVALUATION_DATA_START_TIME,
        'EvaluationDataEndTime': EVALUATION_DATA_END_TIME
    }
    
    if not unsupervised:
        create_model_request.update({
            'RoleArn': ROLE_ARN,
            'LabelsInputConfiguration': labels_input_config
        })
        
    if schema is not None:
        DATA_SCHEMA_FOR_MODEL = lookout.create_data_schema(schema)
        data_schema_for_model = {
            'InlineDataSchema': DATA_SCHEMA_FOR_MODEL,
        }
        create_model_request['DatasetSchema'] = data_schema_for_model

    lookout_client = lookout.get_client(region_name=REGION_NAME)
    return lookout_client.create_model(**create_model_request)

In [None]:
MODEL_NAME = 'lookout-demo-model-v1'
model_response = train_lookout_equipment_model(
    sampling_rate='PT5M',
    model_name=MODEL_NAME,
    training_start=training_start, 
    training_end=training_end, 
    evaluation_start=evaluation_start, 
    evaluation_end=evaluation_end,
    unsupervised=False,
)

A training job is now in progress, as shown in the console:
    
![Training in progress](../assets/model-training-in-progress.png)

Use the polling cell below to track the model training progress. **This model should take about an hour to train.** Key drivers for training time are:
* Number of labels in the label dataset (if provided)
* Number of datapoints. This number depends on the sampling rate, the number of time series and the time range.
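
To make the second driver concrete, here is a rough sketch that estimates the number of datapoints from the sampling rate, the time range and the number of time series. The two helper functions and the `n_series=100` value are hypothetical placeholders chosen for illustration; the parser only handles simple durations of the form `PT<n>S`, `PT<n>M` or `PT<n>H`, such as the `PT5M` rate used in this notebook.

```python
import re
from datetime import datetime

def sampling_rate_to_seconds(rate):
    """Convert a simple ISO 8601 duration such as 'PT5M' into seconds."""
    match = re.fullmatch(r'PT(\d+)([SMH])', rate)
    if match is None:
        raise ValueError(f'Unsupported sampling rate: {rate}')
    value, unit = int(match.group(1)), match.group(2)
    return value * {'S': 1, 'M': 60, 'H': 3600}[unit]

def estimate_datapoints(rate, start, end, n_series):
    """Estimate datapoints as (samples per series) x (number of series)."""
    duration_seconds = (end - start).total_seconds()
    samples_per_series = int(duration_seconds // sampling_rate_to_seconds(rate))
    return samples_per_series * n_series

# n_series=100 is a placeholder, not the actual tag count of this dataset.
points = estimate_datapoints(
    'PT5M', datetime(2015, 1, 1), datetime(2015, 8, 31), n_series=100
)
print(f'Estimated datapoints: {points:,}')
```

Halving the sampling interval or doubling the number of series doubles this estimate, which gives a first intuition of how these choices affect training time.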

In [None]:
describe_model_response = lookout_client.describe_model(ModelName=MODEL_NAME)

status = describe_model_response['Status']
while status == 'IN_PROGRESS':
    time.sleep(60)
    describe_model_response = lookout_client.describe_model(ModelName=MODEL_NAME)
    status = describe_model_response['Status']
    timestamp = datetime.now(pytz.timezone("Europe/Paris")).strftime('%Y-%m-%d %H:%M:%S')
    print(timestamp, "| Model training:", status)
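
The loop above keeps polling for as long as the status stays `IN_PROGRESS`. One possible variant, sketched below, wraps the same logic in a function with a timeout so that a stalled job does not block the notebook indefinitely. The `wait_for_model` name, its parameters and the stubbed status sequence are assumptions made for this illustration; in the notebook, `describe_fn` could be `lambda: lookout_client.describe_model(ModelName=MODEL_NAME)`.

```python
import time

def wait_for_model(describe_fn, poll_seconds=60, timeout_seconds=4 * 3600,
                   sleep_fn=time.sleep):
    """Poll describe_fn() until the status leaves IN_PROGRESS or time runs out."""
    waited = 0
    status = describe_fn()['Status']
    while status == 'IN_PROGRESS':
        if waited >= timeout_seconds:
            raise TimeoutError(f'Still IN_PROGRESS after {waited} seconds')
        sleep_fn(poll_seconds)
        waited += poll_seconds
        status = describe_fn()['Status']
    return status

# Stubbed status sequence standing in for the real describe_model call,
# with a no-op sleep so the example returns immediately.
responses = iter(['IN_PROGRESS', 'IN_PROGRESS', 'SUCCESS'])
final_status = wait_for_model(
    lambda: {'Status': next(responses)},
    sleep_fn=lambda seconds: None,
)
print('Final status:', final_status)
```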

The model is now trained and we can visualize the results of the backtesting on the evaluation window selected at the beginning of this notebook:

![Training complete](../assets/model-training-complete.png)

## Conclusion
---
In this notebook, we used the dataset created in part 2 of this notebook series to train a Lookout for Equipment model.

From here, you can head either:
* To the next notebook, where we will **extract the evaluation data** for this model and use it to further analyze the model results.
* Or to the **inference scheduling notebook**, where we will start the model, feed it some new data and collect the results.