# **Amazon Lookout for Equipment** - Getting started
*Part 2 - Dataset creation*

## Initialization
---
This repository is structured as follows:

```sh
. lookout-equipment-demo
|
├── data/
|   ├── interim                          # Temporary intermediate data are stored here
|   ├── processed                        # Finalized datasets are usually stored here
|   |                                    # before they are sent to S3 to allow the
|   |                                    # service to reach them
|   └── raw                              # Immutable original data are stored here
|
├── getting_started/
|   ├── 1_data_preparation.ipynb
|   ├── 2_dataset_creation.ipynb         <<< THIS NOTEBOOK <<<
|   ├── 3_model_training.ipynb
|   ├── 4_model_evaluation.ipynb
|   ├── 5_inference_scheduling.ipynb
|   └── 6_cleanup.ipynb
|
└── utils/
    └── lookout_equipment_utils.py
```

### Notebook configuration update
Amazon Lookout for Equipment is a recent service, so we need to make sure that we have access to the latest version of the AWS Python packages. If you see a `pip` dependency error, check the `boto3` version: if it is 1.17.48 or later (1.17.48 being the first version that includes the `lookoutequipment` API), you can ignore this error and move on to the next cell:

In [None]:
#!pip install --quiet --upgrade boto3 awscli aiobotocore botocore sagemaker tqdm

import boto3
print(f'boto3 version: {boto3.__version__} (should be >= 1.17.48 to include Lookout for Equipment API)')

# Restart the notebook kernel so that the upgraded packages are taken into account:
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

boto3 version: 1.17.99 (should be >= 1.17.48 to include Lookout for Equipment API)
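The version check above can also be done programmatically rather than by eye. The following is a minimal sketch that compares dotted version strings as integer tuples. `version_at_least` is a hypothetical helper name (it is not part of `boto3` or of this repository), and it assumes purely numeric versions such as `1.17.99`, without pre-release suffixes:

```python
def version_at_least(installed: str, minimum: str) -> bool:
    """Return True if `installed` is greater than or equal to `minimum`.

    Assumes purely numeric dotted versions (e.g. '1.17.48').
    """
    def as_tuple(version: str) -> tuple:
        return tuple(int(part) for part in version.split('.'))

    return as_tuple(installed) >= as_tuple(minimum)


# According to this notebook, 1.17.48 is the first boto3 release that
# includes the Lookout for Equipment API:
MIN_BOTO3_VERSION = '1.17.48'
print(version_at_least('1.17.99', MIN_BOTO3_VERSION))
```

In the notebook, you would pass `boto3.__version__` as the first argument. Comparing integer tuples avoids the pitfall of comparing raw strings, where `'1.17.9'` would wrongly sort after `'1.17.48'`.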


### Imports

In [1]:
import boto3
import config
import os
import pandas as pd
import sagemaker
import sys
import time

from datetime import datetime

# Helper functions for managing Lookout for Equipment API calls:
sys.path.append('../utils')
import lookout_equipment_utils as lookout

In [2]:
PROCESSED_DATA = os.path.join('..', 'data', 'processed', '97978334-97be-4e45-bf5d-84d9e535a601')
TRAIN_DATA     = os.path.join(PROCESSED_DATA, 'training-data')

ROLE_ARN       = sagemaker.get_execution_role()
REGION_NAME    = boto3.session.Session().region_name
DATASET_NAME   = config.DATASET_NAME
BUCKET         = config.BUCKET
PREFIX         = config.PREFIX_TRAINING

In [3]:
# List the top-level directories of the training data
# directory: each directory corresponds to a subsystem:
components = sorted(
    entry.name
    for entry in os.scandir(TRAIN_DATA)
    if entry.is_dir() and entry.name != '.ipynb_checkpoints'
)
        
components

['centrifugal-pump']

## Create a dataset
---

### Create data schema

First, we need to set up the schema of the dataset. In the cell below, we define `DATASET_COMPONENT_FIELDS_MAP`, a Python dictionary (hashmap). The key of each entry is the `Component` name, and the value is a list of column names. The column names must exactly match the header of your CSV files, and they must appear in the same order:

```python
DATASET_COMPONENT_FIELDS_MAP = {
    "Component1": ['Timestamp', 'Tag1', 'Tag2', ...],
    "Component2": ['Timestamp', 'Tag1', 'Tag2', ...],
    # ...
    "ComponentN": ['Timestamp', 'Tag1', 'Tag2', ...]
}
```
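The header-matching requirement described above can be verified before creating the dataset. The sketch below is only an illustration: `header_matches` is a hypothetical helper (not part of `lookout_equipment_utils`) that reads the first row of a CSV file and compares it, order and letter case included, with the expected column list:

```python
import csv


def header_matches(csv_path, expected_columns):
    """Check that the first row of a CSV file equals `expected_columns`.

    The comparison is strict: same names, same order, same letter case.
    """
    with open(csv_path, newline='') as f:
        # An empty file yields an empty header, which never matches
        # a non-empty list of expected columns:
        header = next(csv.reader(f), [])

    return header == list(expected_columns)
```

Assuming a component stored as a single CSV file, you could call `header_matches(fname, DATASET_COMPONENT_FIELDS_MAP[component])` for that file and stop early if it returns `False`.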

The component name must also **match exactly** the name of the folder in S3 (everything is **case sensitive**). For the example used here, the dictionary will look like this:
```python
DATASET_COMPONENT_FIELDS_MAP = {
    "centrifugal-pump": ['Timestamp', 'Sensor0', 'Sensor1', ..., 'Sensor29']
}
```
The following cell builds this map, then converts it into a JSON schema in the format below, which is ready to be processed by Lookout for Equipment:

```json
{
  "Components": [
    {
      "ComponentName": "centrifugal-pump",
      "Columns": [
        {"Name": "Timestamp", "Type": "DATETIME"},
        {"Name": "Sensor0", "Type": "DOUBLE"},
        {"Name": "Sensor1", "Type": "DOUBLE"},
        {"Name": "Sensor2", "Type": "DOUBLE"},
        {"Name": "Sensor3", "Type": "DOUBLE"},
          
        ...
          
        {"Name": "Sensor29", "Type": "DOUBLE"}
      ]
    }
  ]
}
```
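To make the conversion from the component map to this JSON format concrete, here is a minimal sketch. It is not necessarily how `lookout_equipment_utils` implements it: `build_schema` and its `timestamp_column` parameter are hypothetical names, and the sketch assumes that every column other than the timestamp is a `DOUBLE`, as in the format shown above:

```python
import json


def build_schema(component_fields_map, timestamp_column='Timestamp'):
    """Convert a {component: [columns]} map into a JSON schema string."""
    components = []
    for component_name, columns in component_fields_map.items():
        components.append({
            'ComponentName': component_name,
            'Columns': [
                {
                    'Name': name,
                    # Assumption: only the timestamp is a DATETIME,
                    # every other column holds numeric sensor values.
                    'Type': 'DATETIME' if name == timestamp_column else 'DOUBLE',
                }
                for name in columns
            ],
        })

    return json.dumps({'Components': components})


schema = build_schema({'centrifugal-pump': ['Timestamp', 'Sensor0', 'Sensor1']})
print(schema)
```

The function returns a string rather than a dictionary because the schema is passed to the service as an inline JSON document.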

In [4]:
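DATASET_COMPONENT_FIELDS_MAP = dict()
for subsystem in components:
    # Every component starts with the timestamp column:
    subsystem_tags = ['Timestamp']
    for root, _, files in os.walk(os.path.join(TRAIN_DATA, subsystem)):
        for file in files:
            # Read a single row, just to get the column names, and
            # skip the first column (the timestamp):
            fname = os.path.join(root, file)
            current_subsystem_df = pd.read_csv(fname, nrows=1)
            subsystem_tags = subsystem_tags + current_subsystem_df.columns.tolist()[1:]

    # Register the component once all its files have been scanned:
    DATASET_COMPONENT_FIELDS_MAP.update({subsystem: subsystem_tags})
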
lookout_dataset = lookout.LookoutEquipmentDataset(
    dataset_name=DATASET_NAME,
    component_fields_map=DATASET_COMPONENT_FIELDS_MAP,
    region_name=REGION_NAME,
    access_role_arn=ROLE_ARN
)

If you wanted to use the console, the following string would be the one to use to configure the **dataset schema**:

![Dataset creation with schema](assets/dataset-schema.png)

In [5]:
import json
import pprint

# The schema is a JSON string: parse it with json.loads() rather than eval()
pp = pprint.PrettyPrinter(depth=5)
pp.pprint(json.loads(lookout_dataset.dataset_schema))

{'Components': [{'Columns': [{'Name': 'Timestamp', 'Type': 'DATETIME'},
                             {'Name': 'Sensor0', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor1', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor2', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor3', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor4', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor5', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor6', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor7', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor8', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor9', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor10', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor11', 'Type': 'DOUBLE'},
                             {'Name': 'Sensor24', 'Type': 'DOUBLE'},
                             {'Name': 'Se

### Create the dataset
The following method encapsulates the [**CreateDataset**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_CreateDataset.html) API:

```python
lookout_client.create_dataset(
    DatasetName=self.dataset_name,
    DatasetSchema={
        'InlineDataSchema': "schema"
    }
)
```
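The output of the next cell suggests that the helper checks whether the dataset exists before creating it. The sketch below shows one way such a check could be written; it is an assumption about the behavior, not the actual code of `lookout_equipment_utils`. `create_dataset_if_missing` is a hypothetical name, `client` is expected to be a `boto3` `lookoutequipment` client, and pagination of `list_datasets` is ignored for brevity:

```python
def create_dataset_if_missing(client, dataset_name, schema_json):
    """Create a dataset unless one with the same name already exists.

    `client` is assumed to expose list_datasets() and create_dataset()
    like a boto3 'lookoutequipment' client. Returns the create_dataset()
    response, or None if the dataset was already there.
    """
    # Only the first page of results is inspected in this sketch:
    listing = client.list_datasets(DatasetNameBeginsWith=dataset_name)
    names = [d['DatasetName'] for d in listing.get('DatasetSummaries', [])]

    if dataset_name in names:
        print(f'Dataset "{dataset_name}" already exists')
        return None

    print(f'Dataset "{dataset_name}" does not exist, creating it...')
    return client.create_dataset(
        DatasetName=dataset_name,
        # The schema is passed as a JSON string, not as a dictionary:
        DatasetSchema={'InlineDataSchema': schema_json},
    )
```

With real credentials, the client would typically be obtained with `boto3.client('lookoutequipment', region_name=REGION_NAME)`.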

In [6]:
lookout_dataset.create()

Dataset "97978334-97be-4e45-bf5d-84d9e535a601" does not exist, creating it...



{'DatasetName': '97978334-97be-4e45-bf5d-84d9e535a601',
 'DatasetArn': 'arn:aws:lookoutequipment:us-east-1:593512547852:dataset/97978334-97be-4e45-bf5d-84d9e535a601/52d6f9f9-cebe-4a15-bcca-3bea29e618a8',
 'Status': 'CREATED',
 'ResponseMetadata': {'RequestId': 'c75ef9a7-96a9-47a1-974c-ab9d6e547b86',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'c75ef9a7-96a9-47a1-974c-ab9d6e547b86',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '218',
   'date': 'Fri, 23 Jul 2021 01:51:06 GMT'},
  'RetryAttempts': 0}}

The dataset is now created but still empty, ready to receive the time series data that we will ingest from the S3 location prepared in the previous notebook:

![Dataset created](assets/dataset-created.png)

## Ingest data into a dataset
---
Let's double check the values of all the parameters that will be used to ingest some data into an existing Lookout for Equipment dataset:

In [7]:
ROLE_ARN, BUCKET, PREFIX, DATASET_NAME

('arn:aws:iam::593512547852:role/L4ESagemaker-SageMakerIamRole-RLE9HD9GRB46',
 'l4e-sitewise-d7bf5fa0',
 '97978334-97be-4e45-bf5d-84d9e535a601/training-data/',
 '97978334-97be-4e45-bf5d-84d9e535a601')

Launch the ingestion job in the Lookout for Equipment dataset: the following method encapsulates the [**StartDataIngestionJob**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_StartDataIngestionJob.html) API:

```python
lookout_client.start_data_ingestion_job(
    DatasetName=DATASET_NAME,
    RoleArn=ROLE_ARN, 
    IngestionInputConfiguration={ 
        'S3InputConfiguration': { 
            'Bucket': BUCKET,
            'Prefix': PREFIX
        }
    }
)
```

In [8]:
response = lookout_dataset.ingest_data(BUCKET, PREFIX)

The ingestion is launched. With this amount of data (around 50 MB), it should take less than 5 minutes:

![dataset_schema](assets/dataset-ingestion-in-progress.png)

We use the following cell to monitor the ingestion process by calling the [**DescribeDataIngestionJob**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_DescribeDataIngestionJob.html) API every 60 seconds:

In [9]:
# Get the ingestion job ID and status:
data_ingestion_job_id = response['JobId']
data_ingestion_status = response['Status']

# Wait until ingestion completes:
print("=====Polling Data Ingestion Status=====\n")
lookout_client = lookout.get_client(region_name=REGION_NAME)
print(str(pd.to_datetime(datetime.now()))[:19], "|", data_ingestion_status)

while data_ingestion_status == 'IN_PROGRESS':
    time.sleep(60)
    describe_data_ingestion_job_response = lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)
    data_ingestion_status = describe_data_ingestion_job_response['Status']
    print(str(pd.to_datetime(datetime.now()))[:19], "|", data_ingestion_status)
    
print("\n=====End of Polling Data Ingestion Status=====")

=====Polling Data Ingestion Status=====

2021-07-23 01:51:51 | IN_PROGRESS
2021-07-23 01:52:51 | IN_PROGRESS
2021-07-23 01:53:51 | IN_PROGRESS
2021-07-23 01:54:51 | IN_PROGRESS
2021-07-23 01:55:51 | SUCCESS

=====End of Polling Data Ingestion Status=====
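The polling loop above runs until the status leaves `IN_PROGRESS`, however long that takes. As a hedged variant, the sketch below adds a timeout so that a stuck job does not block the notebook forever. `wait_for_ingestion` and its parameters are hypothetical names; `describe_job` is assumed to behave like `lookout_client.describe_data_ingestion_job`:

```python
import time


def wait_for_ingestion(describe_job, job_id, poll_seconds=60,
                       timeout_seconds=3600, sleep=time.sleep):
    """Poll an ingestion job until it leaves the IN_PROGRESS state.

    `describe_job` is any callable accepting JobId=... and returning a
    dictionary with a 'Status' key. Raises TimeoutError if the job is
    still in progress after roughly `timeout_seconds`.
    """
    waited = 0
    status = describe_job(JobId=job_id)['Status']

    while status == 'IN_PROGRESS':
        if waited >= timeout_seconds:
            raise TimeoutError(
                f'Ingestion job {job_id} still in progress after {waited} seconds'
            )
        sleep(poll_seconds)
        waited += poll_seconds
        status = describe_job(JobId=job_id)['Status']

    return status
```

In this notebook, the call would look like `wait_for_ingestion(lookout_client.describe_data_ingestion_job, data_ingestion_job_id)`. The `sleep` parameter exists only so that the waiting can be replaced by a no-op when testing.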


If any issue arises, you can inspect the API response, which is available as a JSON document:

In [10]:
lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)

{'JobId': '92288628733de3630a014d30a7bc9cc6',
 'DatasetArn': 'arn:aws:lookoutequipment:us-east-1:593512547852:dataset/97978334-97be-4e45-bf5d-84d9e535a601/52d6f9f9-cebe-4a15-bcca-3bea29e618a8',
 'IngestionInputConfiguration': {'S3InputConfiguration': {'Bucket': 'l4e-sitewise-d7bf5fa0',
   'Prefix': '97978334-97be-4e45-bf5d-84d9e535a601/training-data/'}},
 'RoleArn': 'arn:aws:iam::593512547852:role/L4ESagemaker-SageMakerIamRole-RLE9HD9GRB46',
 'CreatedAt': datetime.datetime(2021, 7, 23, 1, 51, 40, 252000, tzinfo=tzlocal()),
 'Status': 'SUCCESS',
 'ResponseMetadata': {'RequestId': 'da597933-299d-482b-b73c-df63690ac7ea',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'da597933-299d-482b-b73c-df63690ac7ea',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '476',
   'date': 'Fri, 23 Jul 2021 01:57:22 GMT',
   'connection': 'close'},
  'RetryAttempts': 0}}
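When a job does not end in `SUCCESS`, the interesting parts of this response are the status and the failure reason. The following sketch condenses a response into one line. `summarize_ingestion` is a hypothetical helper, and it assumes that a failed job carries a `FailedReason` field in the response (the field is simply skipped when absent):

```python
def summarize_ingestion(response):
    """Return a one-line summary of a DescribeDataIngestionJob response."""
    status = response.get('Status', 'UNKNOWN')
    summary = f"Job {response.get('JobId', '?')}: {status}"

    # Assumption: failed jobs expose the cause under 'FailedReason'
    reason = response.get('FailedReason')
    if reason:
        summary += f' ({reason})'

    return summary
```

For instance, `print(summarize_ingestion(lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)))` would print the job ID followed by its status.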

The ingestion should now be complete as can be seen in the console:

![Ingestion done](assets/dataset-ingestion-done.png)

## Conclusion
---

In this notebook, we created a **Lookout for Equipment dataset** and ingested the S3 data previously uploaded into this dataset. **Move now to the next notebook to train a model based on these data.**