# **Amazon Lookout for Equipment** - Getting started
*Part 1 - Data preparation*

## Initialization
---
This repository is structured as follows:

```sh
. lookout-equipment-demo
|
├── data/
|   ├── interim                          # Temporary intermediate data are stored here
|   ├── processed                        # Finalized datasets are usually stored here
|   |                                    # before they are sent to S3 to allow the
|   |                                    # service to reach them
|   └── raw                              # Immutable original data are stored here
|
├── getting_started/
|   ├── 1_data_preparation.ipynb         <<< THIS NOTEBOOK <<<
|   ├── 2_dataset_creation.ipynb
|   ├── 3_model_training.ipynb
|   ├── 4_model_evaluation.ipynb
|   ├── 5_inference_scheduling.ipynb
|   └── 6_cleanup.ipynb
|
└── utils/
    └── lookout_equipment_utils.py
```

### Notebook configuration update
Amazon Lookout for Equipment is a recent service, so we need to make sure that we have access to a recent version of the AWS Python packages. If you see a `pip` dependency error, check that the installed `boto3` version meets the minimum printed by the cell below before moving forward.

In [None]:
!pip install --quiet --upgrade boto3 tqdm tsia

import boto3
print(f'boto3 version: {boto3.__version__} (should be >= 1.17.48 to include Lookout for Equipment API)')

# Restart the kernel of the current notebook so that the upgraded packages are taken into account:
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
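
The version requirement printed above can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: the helper names `version_tuple` and `is_at_least` are hypothetical, and the parsing assumes plain dotted version strings such as `1.17.48`. Versions are compared as tuples of integers because a plain string comparison would rank `1.9.100` above `1.17.48`.

```python
def version_tuple(version: str) -> tuple:
    """Convert the leading numeric fields of a dotted version string into a tuple of ints."""
    parts = []
    for field in version.split('.'):
        if not field.isdigit():
            break  # stop at the first non-numeric field (e.g. 'dev0')
        parts.append(int(field))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Return True when the installed version is greater than or equal to the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# Made-up version numbers, for illustration only:
print(is_at_least('1.17.48', '1.17.48'))
print(is_at_least('1.9.100', '1.17.48'))
```

In this notebook, the check would read `is_at_least(boto3.__version__, '1.17.48')`.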

### Imports

In [None]:
import boto3
import config
import os
import pandas as pd

from botocore.exceptions import ClientError

### Parameters
Let's first check if the bucket name is defined, if it exists and if we have access to it from this notebook:

In [None]:
BUCKET = config.BUCKET

if BUCKET == '<<YOUR_BUCKET>>':
    raise Exception('Please update your Amazon S3 bucket name in the config.py file located at the root of this repository and restart the kernel for this notebook.')

else:
    # Check access to the configured bucket:
    try:
        s3_resource = boto3.resource('s3')
        s3_resource.meta.client.head_bucket(Bucket=BUCKET)
        print(f'Bucket "{BUCKET}" exists')

    # Expose error reason. The error code is compared as a string,
    # since not every error code is guaranteed to be numeric:
    except ClientError as error:
        error_code = error.response['Error']['Code']
        if error_code == '403':
            raise Exception(f'Bucket "{BUCKET}" is private: access is forbidden!')

        elif error_code == '404':
            raise Exception(f'Bucket "{BUCKET}" does not exist!')

        # Any other error is re-raised instead of being silently ignored:
        else:
            raise
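
The mapping between error codes and messages used in the cell above can be isolated in a small function, which makes it easy to test without calling Amazon S3. This is only a sketch: `explain_head_bucket_error` is a hypothetical helper, and the fallback message for unexpected codes is an assumption, not something the original notebook defines.

```python
def explain_head_bucket_error(error_code, bucket: str) -> str:
    """Map the error code of a failed bucket check to a readable message."""
    messages = {
        '403': f'Bucket "{bucket}" is private: access is forbidden!',
        '404': f'Bucket "{bucket}" does not exist!',
    }
    # The code is converted to a string so that both 404 and '404' are accepted:
    fallback = f'Unexpected error ({error_code}) while checking bucket "{bucket}".'
    return messages.get(str(error_code), fallback)


# Made-up bucket name, for illustration only:
print(explain_head_bucket_error('404', 'my-example-bucket'))
```

Inside the `except ClientError as error:` branch, the code to pass would be `error.response['Error']['Code']`.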

In [None]:
RAW_DATA       = os.path.join('..', 'data', 'raw')
TMP_DATA       = os.path.join('..', 'data', 'interim')
PROCESSED_DATA = os.path.join('..', 'data', 'processed')
LABEL_DATA     = os.path.join(PROCESSED_DATA, 'label-data')
TRAIN_DATA     = os.path.join(PROCESSED_DATA, 'training-data')

os.makedirs(TMP_DATA,         exist_ok=True)
os.makedirs(RAW_DATA,         exist_ok=True)
os.makedirs(PROCESSED_DATA,   exist_ok=True)
os.makedirs(LABEL_DATA,       exist_ok=True)
os.makedirs(TRAIN_DATA,       exist_ok=True)

ORIGINAL_DATA = 's3://lookout-equipment-getting-started/raw/lookout-equipment.zip'

## Downloading data
---
Downloading and unzipping the getting started dataset locally on this instance:

In [None]:
data_exists = os.path.exists(os.path.join(TMP_DATA, 'sensors-data', 'impeller', 'component2_file1.csv'))
raw_data_exists = os.path.exists(os.path.join(RAW_DATA, 'lookout-equipment.zip'))

if data_exists:
    print('Dataset already available locally, nothing to do.')
    print(f'Dataset is available in {TMP_DATA}.')
    
else:
    if not raw_data_exists:
        print('Raw data not found, downloading it')
        if ORIGINAL_DATA[:5] == 's3://':
            !aws s3 cp $ORIGINAL_DATA $RAW_DATA/lookout-equipment.zip
        elif ORIGINAL_DATA[:4] == 'http':
            !wget $ORIGINAL_DATA -O $RAW_DATA/lookout-equipment.zip
        
    print('Unzipping raw data...')
    !unzip $RAW_DATA/lookout-equipment.zip -d $TMP_DATA
    print(f'Done: dataset now available in {TMP_DATA}.')
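
The cell above relies on the `unzip` command being available on the instance. If it is not, the extraction step can be done with the Python standard library instead. The sketch below assumes a regular, unencrypted ZIP archive; `unzip_archive` is a hypothetical helper name.

```python
import os
import zipfile


def unzip_archive(archive_path: str, target_dir: str) -> list:
    """Extract every member of a ZIP archive into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

In this notebook, the call would be `unzip_archive(os.path.join(RAW_DATA, 'lookout-equipment.zip'), TMP_DATA)`.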

## Preparing time series data
---
The time series data are available in the `sensors-data` directory. The industrial asset we are looking at is a [centrifugal pump](https://en.wikipedia.org/wiki/Centrifugal_pump). Such a pump is used to move a fluid by transferring the rotational energy provided by a motor to hydrodynamic energy:

<img src="assets/centrifugal_pump_annotated.png" alt="Centrifugal pump" style="width: 658px"/>

<div style="text-align: center"><i>Warman centrifugal pump in a coal preparation plant application</i>, by Bernard S. Janse, licensed under <a href="https://creativecommons.org/licenses/by/2.5/deed.fr">CC BY 2.5</a></div>

On a pump such as the one displayed in the photo above, the fluid enters at its axis (the black pipe arriving at the "eye" of the impeller). Measurements can be taken around the four main components of the centrifugal pump:
* The **impeller** (hidden inside the round white casing above): this component consists of a series of curved vanes (blades)
* The drive **shaft** arriving at the impeller axis (the "eye")
* The **motor** connected to the impeller by the drive shaft (on the other end of the black pipe above)
* The **volute** chamber, offset to the right of the impeller axis: this creates a curved funnel with an increasing cross-section area towards the pump outlet (at the top of the white pipe above)

In the dataset provided, the sensors that are not located on one of these components are positioned at the **pump** level.

**Let's load the content of each CSV file (we have one per component) and build a single CSV file with all the sensors:** we will obtain a dataset with 10 months of data (spanning from `2019-01-01` to `2019-10-27`) for 30 sensors (`Sensor0` to `Sensor29`) with a 1-minute sampling rate:

In [None]:
%%time

# Loops through each subfolder of the original dataset:
sensor_df_list = []
tags_description_dict = dict()
for root, dirs, files in os.walk(os.path.join(TMP_DATA, 'sensors-data')):
    # Reads each file and set the first column as an index:
    for f in files:
        print('Processing:', os.path.join(root, f))
        df = pd.read_csv(os.path.join(root, f))
        df = df.set_index('Timestamp')
        sensor_df_list.append(df)
        
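        # The name of the parent folder identifies the component
        # (os.path.basename works whatever the path separator of the OS):
        component = os.path.basename(root)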
        current_sensors = df.columns.tolist()
        current_sensors = dict(zip(current_sensors, [component] * len(current_sensors)))
        tags_description_dict = {**tags_description_dict, **current_sensors}
        
# Concatenate into a single dataframe:
equipment_df = pd.concat(sensor_df_list, axis='columns')
equipment_df = equipment_df.reset_index()
equipment_df['Timestamp'] = pd.to_datetime(equipment_df['Timestamp'])
equipment_df = equipment_df[[
    'Timestamp', 'Sensor0', 'Sensor1', 'Sensor2', 'Sensor3', 'Sensor4',
    'Sensor5', 'Sensor6', 'Sensor7', 'Sensor8', 'Sensor9', 'Sensor10',
    'Sensor11', 'Sensor24', 'Sensor25', 'Sensor26', 'Sensor27', 'Sensor28',
    'Sensor29', 'Sensor12', 'Sensor13', 'Sensor14', 'Sensor15', 'Sensor16',
    'Sensor17', 'Sensor18', 'Sensor19', 'Sensor20', 'Sensor21', 'Sensor22',
    'Sensor23'
]]

# Register a component for each sensor:
tags_description_df = pd.DataFrame.from_dict(tags_description_dict, orient='index')
tags_description_df = tags_description_df.reset_index()
tags_description_df.columns = ['Tag', 'Component']

print(equipment_df.shape)
equipment_df.head()

In [None]:
%%time

os.makedirs(os.path.join(TRAIN_DATA, 'centrifugal-pump'), exist_ok=True)
equipment_fname = os.path.join(TRAIN_DATA, 'centrifugal-pump', 'sensors.csv')
equipment_df.to_csv(equipment_fname, index=None)
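
Before uploading this file, it can be worth checking that it matches the description given earlier (30 sensors, 1-minute sampling rate). The sketch below uses only the standard library and is not part of the original notebook: `summarize_sensor_csv` is a hypothetical helper, and it assumes the `Timestamp` column holds ISO-like strings (for instance `2019-01-01 00:00:00`) sorted in chronological order.

```python
import csv
from collections import Counter
from datetime import datetime


def summarize_sensor_csv(path: str, timestamp_col: str = 'Timestamp'):
    """Return the number of sensor columns and a count of each sampling interval."""
    intervals = Counter()
    previous = None
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        # Every column except the timestamp is considered a sensor:
        sensor_count = len(reader.fieldnames) - 1
        for row in reader:
            current = datetime.fromisoformat(row[timestamp_col])
            if previous is not None:
                intervals[current - previous] += 1
            previous = current
    return sensor_count, intervals
```

Calling `summarize_sensor_csv(equipment_fname)` should report 30 sensors, and a single one-minute interval if the sampling is perfectly regular. Any other interval in the result points to gaps or duplicated timestamps.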

Let's also persist the tags description file as it will be useful when analyzing the model results:

In [None]:
tags_description_fname = os.path.join(TMP_DATA, 'tags_description.csv')
tags_description_df.to_csv(tags_description_fname, index=None)

## Loading label data
---
This dataset contains synthetically generated anomalies over different periods of time. Labels are stored as time ranges with a start and an end timestamp. Each label is a period of time during which we know some anomalous behavior happened:

In [None]:
label_fname = os.path.join(TMP_DATA, 'label-data', 'labels.csv')
labels_df = pd.read_csv(label_fname, header=None)
labels_df.to_csv(os.path.join(LABEL_DATA, 'labels.csv'), index=None, header=None)
labels_df.columns = ['start', 'end']
labels_df.head()
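
Since each label is a time range, deciding whether a given measurement belongs to a known anomaly comes down to an interval membership test. The sketch below illustrates this with the standard library. The function name `is_labeled_anomaly` is hypothetical, the ranges are made up (they are not taken from the dataset), and treating both bounds as inclusive is an assumption.

```python
from datetime import datetime


def is_labeled_anomaly(timestamp: str, label_ranges) -> bool:
    """Return True when the timestamp falls inside one of the (start, end) ranges."""
    ts = datetime.fromisoformat(timestamp)
    return any(
        datetime.fromisoformat(start) <= ts <= datetime.fromisoformat(end)
        for start, end in label_ranges
    )


# Made-up ranges, for illustration only:
example_ranges = [
    ('2019-01-10 00:00:00', '2019-01-12 00:00:00'),
    ('2019-03-01 06:00:00', '2019-03-01 18:00:00'),
]
print(is_labeled_anomaly('2019-01-11 08:30:00', example_ranges))
print(is_labeled_anomaly('2019-02-01 00:00:00', example_ranges))
```

Assuming the `start` and `end` columns of `labels_df` hold ISO-formatted strings, `labels_df.values.tolist()` could be passed as `label_ranges`.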

## Uploading data to Amazon S3
---
Let's now upload our training data and labels to Amazon S3, so that Lookout for Equipment can access them to train and evaluate a model.

In [None]:
train_s3_path = f's3://{BUCKET}/training-data/centrifugal-pump/sensors.csv'
!aws s3 cp $equipment_fname $train_s3_path

label_s3_path = f's3://{BUCKET}/label-data/labels.csv'
!aws s3 cp $label_fname $label_s3_path
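
The same upload can be done from Python when the AWS CLI is not available. The sketch below is an alternative, not the method used by this notebook: `split_s3_uri` and `upload_to_s3` are hypothetical helpers built on the `upload_file` method of the `boto3` S3 client, and they assume that the notebook role is allowed to write to the bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'Not a valid S3 URI: {s3_uri}')
    return parsed.netloc, parsed.path.lstrip('/')


def upload_to_s3(local_path: str, s3_uri: str) -> None:
    """Upload a local file to the location designated by an S3 URI."""
    # boto3 is imported here so that split_s3_uri stays usable without it:
    import boto3

    bucket, key = split_s3_uri(s3_uri)
    boto3.client('s3').upload_file(local_path, bucket, key)
```

With the variables defined above, the two uploads would read `upload_to_s3(equipment_fname, train_s3_path)` and `upload_to_s3(label_fname, label_s3_path)`.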

## (Optional) Data exploration
---
This section is optional and only aims to give you a quick overview of what the data looks like:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import sys
import tsia
import warnings

sys.path.append('../utils')
import lookout_equipment_utils as lookout

%matplotlib inline
plt.style.use('Solarize_Light2')
plt.rcParams['lines.linewidth'] = 0.5
warnings.filterwarnings("ignore")

In [None]:
equipment_df['Timestamp'] = pd.to_datetime(equipment_df['Timestamp'])
equipment_df = equipment_df.set_index('Timestamp')
print(equipment_df.shape)
equipment_df.head()

In [None]:
start = equipment_df.index.min()
end = equipment_df.index.max()

print(start, '|', end)

**Let's plot the first signal and the associated labels:** the `plot_timeseries` function is a utility function you can use to plot a signal and the associated labels on the same figure:

In [None]:
tag = 'Sensor0'
tag_df = equipment_df.loc[start:end, [tag]]
tag_df.columns = ['Value']

fig1, axes = lookout.plot_timeseries(
    tag_df, 
    tag, 
    fig_width=20, 
    labels_df=labels_df,
    custom_grid=False
)

**Run the following cell to get an overview of every signal in the dataset:** colors are allocated to each sensor according to the component it is associated with. This generates a big matplotlib picture in memory. On smaller instances, this can lead to some *out of memory* issues. If this happens, upgrade to a bigger instance, or free up memory on the instance if you have other notebooks running in parallel to this one:

In [None]:
df_list = []
features = equipment_df.columns.tolist()
for sensor in features:
    df_list.append(equipment_df[[sensor]])
    
fig2 = tsia.plot.plot_multivariate_timeseries(
    timeseries_list=df_list,
    tags_list=features,
    tags_description_df=tags_description_df,
    tags_grouping_key='Component',
    num_cols=3,
)

## Conclusion
---
In this notebook, you downloaded the getting started dataset and prepared it for ingestion into Amazon Lookout for Equipment.

You also had a quick overview of the dataset with basic timeseries visualization.

You uploaded the training time series data and the anomaly labels to Amazon S3: in the next notebook of this getting started series, you will use the Amazon Lookout for Equipment API to create your first dataset.

In [None]:
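# Cleanup, might be necessary on smaller instances. The figures only exist
# if the optional data exploration section was run, so missing names are ignored:
import gc
for name in ('fig1', 'fig2', 'equipment_df'):
    globals().pop(name, None)
gc.collect()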