# **Amazon Lookout for Equipment** - Getting started
*Part 2 - Dataset creation*

## Initialization
---
This repository is structured as follows:

```sh
. lookout-equipment-demo
|
├── data/
|   ├── interim                          # Temporary intermediate data are stored here
|   ├── processed                        # Finalized datasets are usually stored here
|   |                                    # before they are sent to S3 to allow the
|   |                                    # service to reach them
|   └── raw                              # Immutable original data are stored here
|
├── getting_started/
|   ├── 1_data_preparation.ipynb
|   ├── 2_dataset_creation.ipynb         <<< THIS NOTEBOOK <<<
|   ├── 3_model_training.ipynb
|   ├── 4_model_evaluation.ipynb
|   ├── 5_inference_scheduling.ipynb
|   └── 6_cleanup.ipynb
|
└── utils/
    └── lookout_equipment_utils.py
```

### Notebook configuration update

In [None]:
!pip install --quiet --upgrade sagemaker tqdm lookoutequipment

### Imports

In [None]:
import boto3
import config
import os
import pandas as pd
import pprint
import sagemaker
import sys
import time

from datetime import datetime

# SDK / toolbox for managing Lookout for Equipment API calls:
import lookoutequipment as lookout

In [None]:
PROCESSED_DATA = os.path.join('..', 'data', 'processed', 'getting-started')
TRAIN_DATA     = os.path.join(PROCESSED_DATA, 'training-data')

ROLE_ARN       = sagemaker.get_execution_role()
REGION_NAME    = boto3.session.Session().region_name
DATASET_NAME   = config.DATASET_NAME
BUCKET         = config.BUCKET
PREFIX         = config.PREFIX_TRAINING

## Create a dataset
---

### Create data schema

In [None]:
lookout_dataset = lookout.LookoutEquipmentDataset(
    dataset_name=DATASET_NAME,
    component_root_dir=TRAIN_DATA,
    access_role_arn=ROLE_ARN
)

If you prefer to create the dataset from the console, the following string is the one to paste when configuring the **dataset schema**:

![Dataset creation with schema](assets/dataset-schema.png)

In [None]:
import json

# The schema is a JSON string: parse it with json.loads rather than eval
pp = pprint.PrettyPrinter(depth=5)
pp.pprint(json.loads(lookout_dataset.dataset_schema))
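To make the structure of this schema more concrete, here is a minimal sketch of how such a document could be assembled by hand. It follows the `Components` / `Columns` layout described in the CreateDataset documentation; the helper `build_inline_schema`, the component name `pump` and the sensor names are hypothetical, and the sketch assumes the timestamp column is called `Timestamp`. It is not the code the `lookoutequipment` SDK runs internally.

```python
import json


def build_inline_schema(components):
    """Build a schema dict from {component_name: [sensor_column, ...]}."""
    schema = {"Components": []}
    for component_name, sensor_columns in components.items():
        # Assumption: each component file starts with a "Timestamp" column
        columns = [{"Name": "Timestamp", "Type": "DATETIME"}]
        columns += [{"Name": name, "Type": "DOUBLE"} for name in sensor_columns]
        schema["Components"].append(
            {"ComponentName": component_name, "Columns": columns}
        )
    return schema


# Hypothetical component and sensor names, for illustration only
example_schema = build_inline_schema({"pump": ["pressure", "temperature"]})

# InlineDataSchema expects the schema serialized as a JSON string
inline_schema = json.dumps(example_schema)
print(json.dumps(example_schema, indent=2))
```

The serialized string (`inline_schema` here) plays the same role as the string printed by the cell above.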

The following method encapsulates the [**CreateDataset**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_CreateDataset.html) API:

```python
lookout_client.create_dataset(
    DatasetName=self.dataset_name,
    DatasetSchema={
        'InlineDataSchema': "schema"
    }
)
```

In [None]:
lookout_dataset.create()

The dataset is now created, but it is still empty. It is ready to receive the time series data that we will ingest from the S3 location prepared in the previous notebook:

![Dataset created](assets/dataset-created.png)

## Ingest data into a dataset
---
Let's double-check the values of all the parameters that will be used to ingest data into the existing Lookout for Equipment dataset:

In [None]:
ROLE_ARN, BUCKET, PREFIX, DATASET_NAME

Launch the ingestion job in the Lookout for Equipment dataset: the following method encapsulates the [**StartDataIngestionJob**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_StartDataIngestionJob.html) API:

```python
lookout_client.start_data_ingestion_job(
    DatasetName=DATASET_NAME,
    RoleArn=ROLE_ARN, 
    IngestionInputConfiguration={ 
        'S3InputConfiguration': { 
            'Bucket': BUCKET,
            'Prefix': PREFIX
        }
    }
)
```

In [None]:
response = lookout_dataset.ingest_data(BUCKET, PREFIX)

The ingestion is launched. With this amount of data (around 50 MB), it should take less than 5 minutes:

![Ingestion in progress](assets/dataset-ingestion-in-progress.png)

The following cell monitors the ingestion process. The method it calls encapsulates the [**DescribeDataIngestionJob**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_DescribeDataIngestionJob.html) API and runs it every 60 seconds:

In [None]:
lookout_dataset.poll_data_ingestion(sleep_time=60)
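For reference, a polling loop of this kind can be sketched as below. This is an illustration under stated assumptions, not the SDK's actual implementation: `wait_for_ingestion` and `fake_describe_job` are hypothetical names, and the fake function stands in for the real API call so that the sketch runs without AWS access. The status values (`IN_PROGRESS`, `SUCCESS`, `FAILED`) are the ones documented for DescribeDataIngestionJob.

```python
import time


def wait_for_ingestion(describe_job, job_id, sleep_time=60, sleep=time.sleep):
    """Poll until the job leaves IN_PROGRESS, then return the last response."""
    while True:
        response = describe_job(JobId=job_id)
        status = response["Status"]
        print(f"Ingestion job {job_id}: {status}")
        if status != "IN_PROGRESS":
            return response
        sleep(sleep_time)


# Stand-in for the real client call so the sketch runs offline
_fake_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCESS"])


def fake_describe_job(JobId):
    return {"JobId": JobId, "Status": next(_fake_statuses)}


final_response = wait_for_ingestion(fake_describe_job, "job-123", sleep_time=0)
```

With real credentials, `boto3.client('lookoutequipment').describe_data_ingestion_job` could be passed in place of `fake_describe_job`, together with the job ID returned when the ingestion was started.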

If any issue arises, you can inspect the API response, which is available as a JSON document:

In [None]:
lookout_dataset.ingestion_job_response
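When troubleshooting, it can help to reduce such a response to its key fields. The sketch below assumes a response shaped like the DescribeDataIngestionJob output (a `Status` field and, on failure, a `FailedReason` field); the helper names and the sample values are hypothetical. Responses returned by `boto3` may contain `datetime` objects, which is why `default=str` is passed when serializing to JSON.

```python
import json
from datetime import datetime, timezone


def summarize_ingestion_response(response):
    """Return a one-line summary of an ingestion job response."""
    status = response.get("Status", "UNKNOWN")
    summary = f"Status: {status}"
    if status == "FAILED":
        reason = response.get("FailedReason", "no reason reported")
        summary += f" ({reason})"
    return summary


def to_json(response):
    """Serialize a response to JSON, converting datetimes to strings."""
    return json.dumps(response, indent=2, default=str)


# Hypothetical sample response, for illustration only
sample_response = {
    "JobId": "job-123",
    "Status": "FAILED",
    "FailedReason": "example failure message",
    "CreatedAt": datetime(2021, 1, 1, tzinfo=timezone.utc),
}

print(summarize_ingestion_response(sample_response))
print(to_json(sample_response))
```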

The ingestion should now be complete, as can be seen in the console:

![Ingestion done](assets/dataset-ingestion-done.png)

## Conclusion
---

In this notebook, we created a **Lookout for Equipment dataset** and ingested into it the data previously uploaded to S3. **Now move on to the next notebook to train a model based on this data.**