# Automate detection of abnormal equipment behavior and review predictions with human in the loop using Amazon Lookout for Equipment and Amazon A2I

In this notebook we will show you how to set up Amazon Lookout for Equipment to train an abnormal behavior detection model on a wind turbine dataset for predictive maintenance, and how to set up a human-in-the-loop workflow that uses Amazon A2I to review the predictions, augment the dataset, and retrain the model.

To get started with Amazon Lookout for Equipment, we will create a dataset, ingest data, train a model, and run inference by setting up a scheduler. After these steps, we will show you how to quickly set up a human review process using Amazon A2I and retrain your model with augmented or human-reviewed datasets. We will walk you through the following steps:
1.	Creating a dataset in Amazon Lookout for Equipment
2.	Ingesting data into the Amazon Lookout for Equipment dataset
3.	Training a model in Amazon Lookout for Equipment
4.	Running diagnostics on the trained model
5.	Creating an inference scheduler in Amazon Lookout for Equipment to send a simulated stream of real-time requests.
6.	Setting up an Amazon A2I private human loop and reviewing the predictions from Amazon Lookout for Equipment.
7.	Retraining your Amazon Lookout for Equipment model based on augmented datasets from Amazon A2I.

**Note:** 
1. Before you get started, make sure you have downloaded the open source wind turbine dataset from Engie and saved it in a designated S3 path. If you haven't done this, please go through the `1_data_preparation.ipynb` notebook.

2. The open source wind turbine dataset doesn't come with known date ranges for when the turbine behaved abnormally, which is a common issue for many of our customers. Please also go through the `2_discover_anomaly_labels.ipynb` notebook to generate labels.

## Setup environment

In [1]:
%%sh
pip -q install --upgrade pip
pip -q install --upgrade awscli boto3 sagemaker smart_open
pip -q install tqdm
aws configure add-model --service-model file://../../getting_started/utils/lookoutequipment.json --service-name lookoutequipment

In [2]:
# Restart this notebook after installing the L4E model in the cell above
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
# Uncomment the lines below to display all rows and columns of a dataframe (this can be resource intensive)
#import pandas as pd
#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)
#pd.set_option('display.width', None)
#pd.set_option('display.max_colwidth', None)

In [39]:
import sagemaker  # needed for get_execution_role() when running the notebook top to bottom

REGION_NAME = 'us-east-1'
BUCKET = 'l4e-demo'
PREFIX = 'wind-turbine'

ROLE_ARN = sagemaker.get_execution_role()

TURBINE_ID = 'R80711'
TRAIN_DATA = f's3://{BUCKET}/{PREFIX}/training_data/{TURBINE_ID}'
LABEL_DATA = f's3://{BUCKET}/{PREFIX}/labelled_data/{TURBINE_ID}'

DATASET_NAME = 'wind-turbine-dataset'
MODEL_NAME = 'wind-turbine-model'

In [4]:
import boto3
import datetime
import os
import pandas as pd
import pprint
import pyarrow as pa
import pyarrow.parquet as pq
import sagemaker
from sagemaker import get_execution_role
from sagemaker.s3 import S3Uploader, S3Downloader
import s3fs
import sys
import time
import uuid
import warnings

# Helper functions for managing Lookout for Equipment API calls:
sys.path.append('../../getting_started/utils')
import lookout_equipment_utils as lookout

## View Datasets

In [44]:
df = pd.read_csv(f'{TRAIN_DATA}/telemetry.csv', index_col = 'Timestamp')
df.head()

Timestamp,Q_avg,Q_min,Q_max,Q_std,Ws1_avg,Ws1_min,Ws1_max,Ws1_std,Ws2_avg,Ws2_min,...,Gb1t_max,Gb1t_std,Db1t_avg,Db1t_min,Db1t_max,Db1t_std,Rbt_avg,Rbt_min,Rbt_max,Rbt_std
2012-12-31T23:00:00.000000,14.49,-0.44,41.18,8.19,8.770001,6.27,11.37,0.82,9.16,6.68,...,66.699997,0.73,39.02,37.0,41.0,1.09,28.709999,28.6,28.799999,0.03
2012-12-31T23:10:00.000000,23.700001,1.75,43.02,8.3,8.66,6.01,11.37,1.02,9.12,5.46,...,70.099998,0.92,35.919998,35.099998,37.299999,0.6,28.700001,28.6,28.75,0.01
2012-12-31T23:20:00.000000,25.48,3.2,46.619999,9.479999,8.94,6.08,11.29,0.99,9.45,5.89,...,72.300003,0.7,36.849998,35.400002,38.400002,0.82,28.790001,28.700001,28.799999,0.03
2012-12-31T23:30:00.000000,24.379999,2.2,57.880001,11.1,8.87,5.96,12.15,1.14,8.979999,5.64,...,73.449997,0.62,39.75,38.200001,41.099998,0.81,28.860001,28.799999,29.0,0.07
2012-12-31T23:40:00.000000,14.47,-10.88,35.189999,10.02,9.44,6.06,12.31,1.12,9.51,6.1,...,71.300003,1.4,40.950001,39.599998,41.700001,0.54,28.77,28.700001,28.9,0.05


In [45]:
df.shape

(264673, 112)

In [46]:
labels = pd.read_csv(f'{LABEL_DATA}/labels.csv', header=None)
labels.head()

Unnamed: 0,0,1
0,2013-01-02T02:30:00.000000,2013-01-02T15:30:00.000000
1,2013-01-05T13:50:00.000000,2013-01-10T04:30:00.000000
2,2013-01-10T19:30:00.000000,2013-01-11T12:10:00.000000
3,2013-01-12T13:30:00.000000,2013-01-12T14:00:00.000000
4,2013-01-13T14:50:00.000000,2013-01-14T18:50:00.000000


In [47]:
labels.shape

(726, 2)

### Create the Dataset Component Map

In [48]:
DATASET_COMPONENT_FIELDS_MAP = dict()
DATASET_COMPONENT_FIELDS_MAP[TURBINE_ID] = df.reset_index().columns.to_list()
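
To make the role of the component map more concrete, here is a minimal sketch of how such a map could be turned into a schema structure with `Components` and `Columns` entries. The helper `build_schema` is hypothetical (this notebook relies on `lookout_equipment_utils` for the real work), and it assumes that the first field of each component is the timestamp and that every other field is numeric.

```python
import json

def build_schema(component_fields_map):
    """Hypothetical helper: map {component: [fields]} to a schema dict.

    Assumes the first field is the timestamp and all others are numeric.
    """
    components = []
    for component_name, fields in component_fields_map.items():
        columns = [{'Name': fields[0], 'Type': 'DATETIME'}]
        columns += [{'Name': name, 'Type': 'DOUBLE'} for name in fields[1:]]
        components.append({'ComponentName': component_name, 'Columns': columns})
    return {'Components': components}

# Illustrative map with only three of the turbine's fields
example_map = {'R80711': ['Timestamp', 'Q_avg', 'Ws1_avg']}
schema = build_schema(example_map)
print(json.dumps(schema, indent=2))
```

With a single turbine there is a single component, so every sensor column ends up under the same `Columns` list.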

### Create L4E Dataset

In [49]:
lookout_dataset = lookout.LookoutEquipmentDataset(
    dataset_name=DATASET_NAME,
    component_fields_map=DATASET_COMPONENT_FIELDS_MAP,
    region_name=REGION_NAME,
    access_role_arn=ROLE_ARN
)

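import ast

pp = pprint.PrettyPrinter(depth=5)
# ast.literal_eval parses the schema string without executing arbitrary code, unlike eval()
pp.pprint(ast.literal_eval(lookout_dataset.dataset_schema))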

{'Components': [{'Columns': [{'Name': 'Timestamp', 'Type': 'DATETIME'},
                             {'Name': 'Q_avg', 'Type': 'DOUBLE'},
                             {'Name': 'Q_min', 'Type': 'DOUBLE'},
                             {'Name': 'Q_max', 'Type': 'DOUBLE'},
                             {'Name': 'Q_std', 'Type': 'DOUBLE'},
                             {'Name': 'Ws1_avg', 'Type': 'DOUBLE'},
                             {'Name': 'Ws1_min', 'Type': 'DOUBLE'},
                             {'Name': 'Ws1_max', 'Type': 'DOUBLE'},
                             {'Name': 'Ws1_std', 'Type': 'DOUBLE'},
                             {'Name': 'Ws2_avg', 'Type': 'DOUBLE'},
                             {'Name': 'Ws2_min', 'Type': 'DOUBLE'},
                             {'Name': 'Ws2_max', 'Type': 'DOUBLE'},
                             {'Name': 'Ws2_std', 'Type': 'DOUBLE'},
                             {'Name': 'Ws_avg', 'Type': 'DOUBLE'},
                             {'Name': 'Ws_min', 'Type

In [50]:
lookout_dataset.create()

Dataset "wind-turbine-dataset" does not exist, creating it...



{'DatasetName': 'wind-turbine-dataset',
 'DatasetArn': 'arn:aws:lookoutequipment:us-east-1:631071447677:dataset/wind-turbine-dataset/79f71c7d-db76-4606-9396-4d5d04be0111',
 'Status': 'CREATED',
 'ResponseMetadata': {'RequestId': 'c332ceb4-3222-4318-96ae-7c2a4189b904',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'c332ceb4-3222-4318-96ae-7c2a4189b904',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '186',
   'date': 'Thu, 08 Apr 2021 04:51:47 GMT'},
  'RetryAttempts': 0}}

### Ingest data into L4E dataset

In [51]:
response = lookout_dataset.ingest_data(BUCKET, f'{PREFIX}/training_data/')

In [52]:
# Get the ingestion job ID and status:
data_ingestion_job_id = response['JobId']
data_ingestion_status = response['Status']

# Wait until ingestion completes:
print("=====Polling Data Ingestion Status=====\n")
lookout_client = lookout.get_client(region_name=REGION_NAME)
print(str(pd.to_datetime(datetime.datetime.now()))[:19], "| ", data_ingestion_status)

while data_ingestion_status == 'IN_PROGRESS':
    time.sleep(60)
    describe_data_ingestion_job_response = lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)
    data_ingestion_status = describe_data_ingestion_job_response['Status']
    print(str(pd.to_datetime(datetime.datetime.now()))[:19], "| ", data_ingestion_status)
    
print("\n=====End of Polling Data Ingestion Status=====")

=====Polling Data Ingestion Status=====

2021-04-08 04:51:56 |  IN_PROGRESS
2021-04-08 04:52:57 |  IN_PROGRESS
2021-04-08 04:53:57 |  IN_PROGRESS
2021-04-08 04:54:57 |  IN_PROGRESS
2021-04-08 04:55:57 |  SUCCESS

=====End of Polling Data Ingestion Status=====
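
The polling loop above waits for as long as the job reports `IN_PROGRESS`. A minimal sketch of the same pattern with an upper bound on the wait is shown below. The helper `poll_until_done` and its parameters are hypothetical, and the status source is a stand-in iterator rather than a real `describe_data_ingestion_job` call, so the sketch runs without AWS access.

```python
import time

def poll_until_done(get_status, in_progress='IN_PROGRESS',
                    interval_s=60, timeout_s=3600, sleep=time.sleep):
    """Call get_status() until it stops returning `in_progress`.

    Raises TimeoutError once roughly `timeout_s` seconds have been waited.
    """
    waited = 0
    status = get_status()
    while status == in_progress:
        if waited >= timeout_s:
            raise TimeoutError(f'Still {in_progress} after {waited} seconds')
        sleep(interval_s)
        waited += interval_s
        status = get_status()
    return status

# Stand-in for the service: two pending answers, then a final status
fake_statuses = iter(['IN_PROGRESS', 'IN_PROGRESS', 'SUCCESS'])
final_status = poll_until_done(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)
```

In the notebook, `get_status` could wrap the `describe_data_ingestion_job` call and return its `'Status'` field.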


In [53]:
describe_data_ingestion_job_response

{'JobId': '34ee44cfad33fed1221caebdc83a07bd',
 'DatasetArn': 'arn:aws:lookoutequipment:us-east-1:631071447677:dataset/wind-turbine-dataset/79f71c7d-db76-4606-9396-4d5d04be0111',
 'IngestionInputConfiguration': {'S3InputConfiguration': {'Bucket': 'l4e-demo',
   'Prefix': 'wind-turbine/training_data/'}},
 'RoleArn': 'arn:aws:iam::631071447677:role/l4e-role',
 'CreatedAt': datetime.datetime(2021, 4, 8, 4, 51, 53, 783000, tzinfo=tzlocal()),
 'Status': 'SUCCESS',
 'ResponseMetadata': {'RequestId': 'f0f600b1-7cf8-4c8a-81e3-fff30ea89106',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'f0f600b1-7cf8-4c8a-81e3-fff30ea89106',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '389',
   'date': 'Thu, 08 Apr 2021 04:55:56 GMT'},
  'RetryAttempts': 0}}

## Label your current dataset for L4E training

Some customers may not have an existing labeled dataset that can be used directly with L4E. This section presents an example of how to use Amazon SageMaker's private labeling workforce to create labels for your dataset.

#### Initialize variables

In [54]:
timestamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
# Amazon SageMaker client
sagemaker_client = boto3.client('sagemaker')

# Amazon Augment AI (A2I) client
a2i = boto3.client('sagemaker-a2i-runtime')

# Amazon S3 client 
s3 = boto3.client('s3')

# Flow definition name - this value is unique per account and region. You can also provide your own value here.
flowDefinitionName = 'lblfd-l4e-' + timestamp

# Task UI name - this value is unique per account and region. You can also provide your own value here.
taskUIName = 'lblui-l4e-' + timestamp

# Flow definition outputs - temp S3 bucket in current region, as L4E is in AP region currently - to be changed at GA
a2ibucket = 'prem-experiments'
OUTPUT_PATH = f's3://{a2ibucket}/{PREFIX}/label-example/'

role = get_execution_role()
print("RoleArn: {}".format(role))
WORKTEAM_ARN = 'arn:aws:sagemaker:us-east-1:631071447677:workteam/private-crowd/l4e-a2i-workforce'

RoleArn: arn:aws:iam::631071447677:role/l4e-role


#### Setup the Labeling UI

In [55]:
lbltemplate=r"""
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<style>
  table, tr, th, td {
    border: 1px solid black;
    border-collapse: collapse;
    padding: 5px;
  }
</style>

<crowd-form>
    <div>
        <h1>Instructions</h1>
        <p>Please review the equipment sensor inference inputs, and make corrections to anomaly predictions from the Lookout for Equipment Model. </p>
    </div>
   <div>
      <h3>Equipment Sensor Data Inputs</h3>
   <table>
    <tr>
        <th>TIMESTAMP</th>
        <th>Reactive Power</th>
        <th>Wind Speed 1</th>
        <th>Outdoor Temp</th>
        <th>Grid Frequency</th>
        <th>Pitch Angle</th>
    </tr>
    {% for pair in task.input.signal %}
        <tr>
          <td>{{ pair.timestamp }}</td>
          <td>{{ pair.reactive_power }}</td>     
          <td>{{ pair.wind_speed_1 }}</td>
          <td>{{ pair.outdoor_temp }}</td>     
          <td>{{ pair.grid_frequency }}</td>
          <td>{{ pair.pitch_angle }}</td>     
        </tr>
      {% endfor %}
    </table>   
   </div>
    <br>
    <h1>Enter the Start and End Time Ranges below</h1>
    <h3>These date ranges indicate previously detected anomalies and will serve as labels for your dataset</h3>
    <table>
    <tr>
        <th>START</th>
        <th>END</th>
    </tr>
    {% for pair in task.input.anomaly %}

        <tr>
          <td>
          <p>
            <input type="text" name="lblstart{{ forloop.index }}" style="height:50%; width:100%" />
            </p>
            </td>
            <td>
            <p>
            <input type="text" name="lblend{{ forloop.index }}" style="height:50%; width:100%" />
            </p>
            </td>
        </tr>

      {% endfor %}
    </table>
    <br>
    <br>
</crowd-form>
"""

#### Create the Task UI to use for our labeling activity

In [56]:
def create_task_ui():
    '''
    Creates a Human Task UI resource.
    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': lbltemplate})
    return response

In [57]:
# Create task UI
humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)

arn:aws:sagemaker:us-east-1:631071447677:human-task-ui/lblui-l4e-2021-04-08-04-56-48


In [58]:
role = get_execution_role()
print("RoleArn: {}".format(role))

RoleArn: arn:aws:iam::631071447677:role/l4e-role


#### Create the Human Workflow Definition and activate it

In [59]:
create_workflow_definition_response = sagemaker_client.create_flow_definition(
        FlowDefinitionName= flowDefinitionName,
        RoleArn=role,
        HumanLoopConfig= {
            "WorkteamArn": WORKTEAM_ARN,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Review the contents and enter the start and end time ranges for labeling your dataset",
            "TaskTitle": "Equipment Anomaly Labels"
        },
        OutputConfig={
            "S3OutputPath" : OUTPUT_PATH
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] # let's save this ARN for future use

In [60]:
for x in range(60):
    describeFlowDefinitionResponse = sagemaker_client.describe_flow_definition(FlowDefinitionName=flowDefinitionName)
    print(describeFlowDefinitionResponse['FlowDefinitionStatus'])
    if (describeFlowDefinitionResponse['FlowDefinitionStatus'] == 'Active'):
        print("Flow Definition is active")
        break
    time.sleep(2)

Active
Flow Definition is active


#### Prepare a list of training data points for the Labeling UI

In [61]:
NUM_TO_REVIEW = 10 # number of line items to review
dftimestamp = df.index.astype(str).to_list()
dfsig001 = df['Q_avg'].astype(str).to_list()
dfsig002 = df['Ws1_avg'].astype(str).to_list()
dfsig003 = df['Ot_avg'].astype(str).to_list()
dfsig004 = df['Nf_avg'].astype(str).to_list()
dfsig046 = df['Ba_avg'].astype(str).to_list()
sig_list = [{'timestamp': dftimestamp[x], 'reactive_power': dfsig001[x], 'wind_speed_1': dfsig002[x], 'outdoor_temp': dfsig003[x], 'grid_frequency': dfsig004[x], 'pitch_angle': dfsig046[x]} for x in range(NUM_TO_REVIEW)]
sig_list

[{'timestamp': '2012-12-31T23:00:00.000000',
  'reactive_power': '14.49',
  'wind_speed_1': '8.7700005',
  'outdoor_temp': '5.0900002',
  'grid_frequency': '50.009998',
  'pitch_angle': '-1.0'},
 {'timestamp': '2012-12-31T23:10:00.000000',
  'reactive_power': '23.700001',
  'wind_speed_1': '8.659999800000001',
  'outdoor_temp': '5.2600002',
  'grid_frequency': '49.959999',
  'pitch_angle': '-1.0'},
 {'timestamp': '2012-12-31T23:20:00.000000',
  'reactive_power': '25.48',
  'wind_speed_1': '8.9399996',
  'outdoor_temp': '5.5599999',
  'grid_frequency': '49.990002',
  'pitch_angle': '-1.0'},
 {'timestamp': '2012-12-31T23:30:00.000000',
  'reactive_power': '24.379999',
  'wind_speed_1': '8.869999900000002',
  'outdoor_temp': '5.6999998',
  'grid_frequency': '50.0',
  'pitch_angle': '-1.0'},
 {'timestamp': '2012-12-31T23:40:00.000000',
  'reactive_power': '14.47',
  'wind_speed_1': '9.4399996',
  'outdoor_temp': '5.8200002',
  'grid_frequency': '49.98',
  'pitch_angle': '-0.98000002'},
 {'

#### Load it into a list

In [62]:
# How many labels do we want the user to enter
num_labels = 3
anomaly = []
for i in range(0, num_labels):
    anomaly.append(i)
    
ip_content = {"signal": sig_list,
              "anomaly": anomaly
             }
ip_content

{'signal': [{'timestamp': '2012-12-31T23:00:00.000000',
   'reactive_power': '14.49',
   'wind_speed_1': '8.7700005',
   'outdoor_temp': '5.0900002',
   'grid_frequency': '50.009998',
   'pitch_angle': '-1.0'},
  {'timestamp': '2012-12-31T23:10:00.000000',
   'reactive_power': '23.700001',
   'wind_speed_1': '8.659999800000001',
   'outdoor_temp': '5.2600002',
   'grid_frequency': '49.959999',
   'pitch_angle': '-1.0'},
  {'timestamp': '2012-12-31T23:20:00.000000',
   'reactive_power': '25.48',
   'wind_speed_1': '8.9399996',
   'outdoor_temp': '5.5599999',
   'grid_frequency': '49.990002',
   'pitch_angle': '-1.0'},
  {'timestamp': '2012-12-31T23:30:00.000000',
   'reactive_power': '24.379999',
   'wind_speed_1': '8.869999900000002',
   'outdoor_temp': '5.6999998',
   'grid_frequency': '50.0',
   'pitch_angle': '-1.0'},
  {'timestamp': '2012-12-31T23:40:00.000000',
   'reactive_power': '14.47',
   'wind_speed_1': '9.4399996',
   'outdoor_temp': '5.8200002',
   'grid_frequency': '49.98

#### Start the human workflow loop

In [63]:
import json
humanLoopName = str(uuid.uuid4())

start_loop_response = a2i.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(ip_content)
            }
        )


#### Check the status of the human loop

In [64]:
completed_human_loops = []
resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print(f'HumanLoop Name: {humanLoopName}')
print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
print('\n')
   
      
if resp["HumanLoopStatus"] == "Completed":
    completed_human_loops.append(resp)

HumanLoop Name: 47bdd099-82b5-4ae5-9469-ef6d3a5deab3
HumanLoop Status: InProgress
HumanLoop Output Destination: {'OutputS3Uri': 's3://prem-experiments/wind-turbine/label-example/lblfd-l4e-2021-04-08-04-56-48/2021/04/08/04/57/44/47bdd099-82b5-4ae5-9469-ef6d3a5deab3/output.json'}




In [65]:
resp

{'ResponseMetadata': {'RequestId': 'fbc11978-a4cc-46ca-a450-760039d64111',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Thu, 08 Apr 2021 04:57:48 GMT',
   'content-type': 'application/json; charset=UTF-8',
   'content-length': '2679',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'fbc11978-a4cc-46ca-a450-760039d64111',
   'access-control-allow-origin': '*',
   'x-amz-apigw-id': 'dcsD6GdcoAMF3zQ=',
   'x-amzn-trace-id': 'Root=1-606e8d4c-11af7ce76dad866a2c8b8b0b'},
  'RetryAttempts': 0},
 'CreationTime': datetime.datetime(2021, 4, 8, 4, 57, 44, 672000, tzinfo=tzlocal()),
 'HumanLoopStatus': 'InProgress',
 'HumanLoopName': '47bdd099-82b5-4ae5-9469-ef6d3a5deab3',
 'HumanLoopArn': 'arn:aws:sagemaker:us-east-1:631071447677:human-loop/47bdd099-82b5-4ae5-9469-ef6d3a5deab3',
 'FlowDefinitionArn': 'arn:aws:sagemaker:us-east-1:631071447677:flow-definition/lblfd-l4e-2021-04-08-04-56-48',
 'HumanLoopOutput': {'OutputS3Uri': 's3://prem-experiments/wind-turbine/label-example/lblfd-l4e-

#### Get the URL to the labeling task UI so our workers can login and do the labeling task

In [66]:
workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!
https://klkkf8ofpo.labeling.us-east-1.sagemaker.aws


#### Check the status of the human loop again

In [67]:
completed_human_loops = []
resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print(f'HumanLoop Name: {humanLoopName}')
print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
print('\n')
   
      
if resp["HumanLoopStatus"] == "Completed":
    completed_human_loops.append(resp)

HumanLoop Name: 47bdd099-82b5-4ae5-9469-ef6d3a5deab3
HumanLoop Status: Failed
HumanLoop Output Destination: {'OutputS3Uri': 's3://prem-experiments/wind-turbine/label-example/lblfd-l4e-2021-04-08-04-56-48/2021/04/08/04/57/44/47bdd099-82b5-4ae5-9469-ef6d3a5deab3/output.json'}




#### Review the results of the labeling output to extract our labels

In [68]:
import re
import pprint

pp = pprint.PrettyPrinter(indent=4)
json_output = ''
for resp in completed_human_loops:
    splitted_string = re.split('s3://' + a2ibucket  + '/', resp['HumanLoopOutput']['OutputS3Uri'])
    print(splitted_string[1])
    output_bucket_key = splitted_string[1]
    response = s3.get_object(Bucket=a2ibucket, Key=output_bucket_key)
    content = response["Body"].read()
    json_output = json.loads(content)
    pp.pprint(json_output)
    print('\n')

#### Create our labels file

In [69]:
for i in json_output['humanAnswers']:
    print("checking entered labels...")
    x = i['answerContent']
    print(len(x))

TypeError: string indices must be integers

In [70]:
lbl_df = pd.DataFrame(columns=['start','end'])
tslist = {}

# Let's first check if the users mark equipment as faulty and if so get those row numbers into a dataframe            
for i in json_output['humanAnswers']:
    print("checking entered labels...")
    x = i['answerContent']
    print("Number of labeled date ranges specified: " + str(int(len(x)/2)))
    
# Now we will get the date ranges for the faulty choices                     
for k in range(1, int(len(x)/2)+1):
    y = json_output['humanAnswers'][0]
    strchk = "lblstart"+str(k)
    endchk = "lblend"+str(k)
    for i in y['answerContent']:
        if i == strchk:
            tslist[i] = y['answerContent'].get(i)
        if i == endchk:
            tslist[i] = y['answerContent'].get(i)
    lbl_df.loc[len(lbl_df.index)] = [tslist[strchk], tslist[endchk]]
    

lbl_df    

TypeError: string indices must be integers
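
The `TypeError` above appears because the human loop did not complete, so `json_output` is still an empty string when it is indexed. The sketch below shows one way to extract the label ranges defensively. The function `extract_label_ranges` is hypothetical, the sample output is made up for illustration, and the code assumes the answers use the `lblstart<k>` / `lblend<k>` field names from the labeling template, numbered from 1.

```python
def extract_label_ranges(human_loop_output):
    """Return a list of (start, end) strings entered by the workers."""
    ranges = []
    if not isinstance(human_loop_output, dict):
        return ranges  # e.g. the loop failed and no output was loaded
    for answer in human_loop_output.get('humanAnswers', []):
        content = answer.get('answerContent', {})
        k = 1
        while f'lblstart{k}' in content and f'lblend{k}' in content:
            start = content[f'lblstart{k}'].strip()
            end = content[f'lblend{k}'].strip()
            if start and end:  # skip rows the worker left blank
                ranges.append((start, end))
            k += 1
    return ranges

# Made-up output in the shape this sketch expects
sample_output = {'humanAnswers': [{'answerContent': {
    'lblstart1': '2013-01-02T02:30:00', 'lblend1': '2013-01-02T15:30:00',
    'lblstart2': '', 'lblend2': '',
    'lblstart3': '2013-01-05T13:50:00', 'lblend3': '2013-01-10T04:30:00',
}}]}

label_ranges = extract_label_ranges(sample_output)
empty_ranges = extract_label_ranges('')  # what happens when no loop completed
print(label_ranges, empty_ranges)
```

The resulting list of pairs can be passed to `pd.DataFrame(label_ranges, columns=['start', 'end'])` to obtain a frame like `lbl_df`.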

#### Load the labels into a csv file
**Note:** For training, this notebook uses the labels that came with the dataset, so the cell below only writes an `example-labels.csv` file to conclude the labeling demonstration. If you want to train L4E with the labels you created here instead, copy the label file to an Amazon S3 bucket and provide its location in the training configuration when you set up your training job, as shown below.

lookout_model.set_label_data(bucket=BUCKET,  <br>
                          $\;\;\;\;\;\;$prefix=PREFIX+'/labelled_data/', <br>
                          $\;\;\;\;\;\;$access_role_arn=ROLE_ARN)

In [None]:
# Now let's create an example-labels.csv file to store our labels. For the rest of this notebook we will use the labels
# that came along with our dataset
lbl_df.to_csv('../data/wind-turbine/interim/example-labels.csv', header=None, index=None)

## Train L4E Model

### Split train and test data

In [30]:
train_ratio = 0.8
train_split = int(len(df.index)*train_ratio)

def change_date_format(timestamp):
    # The argument is named `timestamp` so it does not shadow the imported datetime module
    return pd.to_datetime(timestamp).strftime("%Y-%m-%d %H:%M:%S")

training_start   = pd.to_datetime(df.index[0])
training_end     = pd.to_datetime(df.index[train_split])
evaluation_start = pd.to_datetime(df.index[train_split+1])
evaluation_end   = pd.to_datetime(df.index[-1])

print(f'Training period: from {training_start} to {training_end}')
print(f'Evaluation period: from {evaluation_start} to {evaluation_end}')

Training period: from 2012-12-31 23:00:00 to 2017-01-10 08:40:00
Evaluation period: from 2017-01-10 08:50:00 to 2018-01-12 23:00:00
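
The split above is purely positional: the first 80% of rows (plus the row at the split index) form the training period and the remaining rows form the evaluation period. The sketch below reproduces that arithmetic on a small, made-up list of ten timestamps. The function `split_periods` is hypothetical and assumes the list is sorted and has at least `split + 2` entries.

```python
from datetime import datetime, timedelta

def split_periods(timestamps, train_ratio=0.8):
    """Positional train/evaluation split over a sorted list of timestamps."""
    split = int(len(timestamps) * train_ratio)
    return {
        'training_start': timestamps[0],
        'training_end': timestamps[split],
        'evaluation_start': timestamps[split + 1],
        'evaluation_end': timestamps[-1],
    }

# Ten made-up samples at a 10-minute interval
start = datetime(2012, 12, 31, 23, 0)
timestamps = [start + timedelta(minutes=10 * i) for i in range(10)]
periods = split_periods(timestamps)

for name, value in periods.items():
    print(f'{name}: {value}')
```

With ten samples the split index is 8, so training covers indices 0 through 8 and evaluation starts at index 9.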


### Prepare labels
For this notebook example we are using the existing labels available in our dataset. If you would like to know how to create your own labels for your dataset, please see the previous section, **Label your current dataset for L4E training**.

In [31]:
df_labels = pd.read_csv('../data/wind-turbine/interim/R80711_labels.csv', header=None, index_col=0, parse_dates=True)
df_labels[1] = [pd.to_datetime(x).strftime("%Y-%m-%dT%H:%M:%S.%f") for x in df_labels[1]]
df_labels[2] = [pd.to_datetime(x).strftime("%Y-%m-%dT%H:%M:%S.%f") for x in df_labels[2]]
df_labels

0,1,2
0,2013-01-02T02:30:00.000000,2013-01-02T15:30:00.000000
1,2013-01-05T13:50:00.000000,2013-01-10T04:30:00.000000
2,2013-01-10T19:30:00.000000,2013-01-11T12:10:00.000000
3,2013-01-12T13:30:00.000000,2013-01-12T14:00:00.000000
4,2013-01-13T14:50:00.000000,2013-01-14T18:50:00.000000
...,...,...
721,2017-12-22T14:00:00.000000,2017-12-23T07:40:00.000000
722,2017-12-25T00:20:00.000000,2017-12-25T01:40:00.000000
723,2018-01-03T05:30:00.000000,2018-01-03T11:00:00.000000
724,2018-01-06T04:40:00.000000,2018-01-06T13:00:00.000000


In [None]:
df_labels.to_csv('../data/wind-turbine/final/labelled-data/labels.csv', header=None, index=None)

In [None]:
!aws s3 cp ../data/wind-turbine/final/labelled-data/labels.csv s3://$BUCKET/$PREFIX/labelled_data/labels.csv

### Setup Training Config

In [32]:
# Prepare the model parameters:
lookout_model = lookout.LookoutEquipmentModel(model_name=MODEL_NAME,
                                              dataset_name=DATASET_NAME,
                                              region_name=REGION_NAME)

# Set the training / evaluation split date:
lookout_model.set_time_periods(evaluation_start,
                               evaluation_end,
                               training_start,
                               training_end)

# Set the label data location:
lookout_model.set_label_data(bucket=BUCKET, 
                             prefix=PREFIX+'/labelled_data/',
                             access_role_arn=ROLE_ARN)

# This sets up the rate the service will resample the data before 
# training:
lookout_model.set_target_sampling_rate(sampling_rate='PT10M')

### Train model

In [None]:
# Actually create the model and train it:
lookout_model.train()

#### The step below will make this notebook poll for 2.5 hours

In [None]:
# Run this only if you want this notebook to wait here till the training is complete
lookout_model.poll_model_training()

### Get diagnostics for the trained model

In [33]:
MODEL_NAME

'wind-turbine-10min-PR-trial2'

In [34]:
lookout_client = lookout.get_client(region_name=REGION_NAME)
describe_model_response = lookout_client.describe_model(ModelName=MODEL_NAME)
list(describe_model_response.keys())

['ModelName',
 'ModelArn',
 'DatasetName',
 'DatasetArn',
 'Schema',
 'LabelsInputConfiguration',
 'TrainingDataStartTime',
 'TrainingDataEndTime',
 'EvaluationDataStartTime',
 'EvaluationDataEndTime',
 'RoleArn',
 'DataPreProcessingConfiguration',
 'Status',
 'TrainingExecutionStartTime',
 'TrainingExecutionEndTime',
 'ModelMetrics',
 'LastUpdatedTime',
 'CreatedAt',
 'ResponseMetadata']

In [None]:
describe_model_response['Status']

In [35]:
LookoutDiagnostics = lookout.LookoutEquipmentAnalysis(model_name=MODEL_NAME, tags_df=df, region_name=REGION_NAME)
LookoutDiagnostics.set_time_periods(evaluation_start, evaluation_end, training_start, training_end)
predicted_ranges = LookoutDiagnostics.get_predictions()
labels_fname = f'{LABEL_DATA}/labels.csv'  # S3 URI, so build it with '/' rather than os.path.join
labeled_ranges = LookoutDiagnostics.get_labels(labels_fname)

In [None]:
labeled_ranges

In [None]:
predicted_ranges.to_csv('../data/wind-turbine/final/inference-a2i/predicted_ranges.csv')

#### Model diagnostics with feature contribution (% that the feature contributed to the anomaly that was detected) toward anomaly patterns

In [36]:
list_d = []
for rec in predicted_ranges['diagnostics']:
    list_d.append(pd.DataFrame.from_dict(rec).set_index('name'))
diagnostics_df_ = pd.concat(list_d, axis=1).T.reset_index(drop=True)
diagnostics_df = pd.concat([predicted_ranges[['start','end']],diagnostics_df_], axis=1)
diagnostics_df

Unnamed: 0,start,end,R80711\Q_avg,R80711\Q_min,R80711\Q_max,R80711\Q_std,R80711\Ws1_avg,R80711\Ws1_min,R80711\Ws1_max,R80711\Ws1_std,...,R80711\Gb1t_max,R80711\Gb1t_std,R80711\Db1t_avg,R80711\Db1t_min,R80711\Db1t_max,R80711\Db1t_std,R80711\Rbt_avg,R80711\Rbt_min,R80711\Rbt_max,R80711\Rbt_std
0,2017-01-10 12:40:00,2017-01-10 13:50:00,0.005211,0.002247,0.003430,0.003763,0.008415,0.002388,0.004297,0.007515,...,0.012911,0.012361,0.009474,0.003436,0.009422,0.008364,0.008244,0.011312,0.006233,0.012759
1,2017-01-10 20:00:00,2017-01-10 20:00:00,0.004847,0.002041,0.004639,0.003613,0.008423,0.001509,0.004229,0.012388,...,0.014946,0.013435,0.006608,0.006157,0.002863,0.006902,0.003349,0.017126,0.009902,0.016110
2,2017-01-12 20:00:00,2017-01-13 01:40:00,0.015175,0.021451,0.014111,0.012320,0.017866,0.005015,0.009011,0.011949,...,0.004139,0.008411,0.004603,0.004496,0.004590,0.007221,0.003540,0.004551,0.004699,0.003594
3,2017-01-15 01:00:00,2017-01-15 01:10:00,0.009687,0.005040,0.004712,0.002825,0.004340,0.007455,0.007712,0.009380,...,0.007486,0.013601,0.005142,0.004162,0.003610,0.016625,0.007284,0.010275,0.009079,0.006205
4,2017-01-15 11:40:00,2017-01-15 12:00:00,0.009312,0.003107,0.006634,0.008769,0.002880,0.005982,0.003760,0.003926,...,0.004826,0.008104,0.008848,0.005986,0.009512,0.007832,0.010284,0.008849,0.003863,0.007695
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1174,2018-01-09 03:50:00,2018-01-09 04:10:00,0.012959,0.005610,0.006387,0.035898,0.006276,0.004817,0.005699,0.003555,...,0.012475,0.010787,0.007991,0.004956,0.006308,0.004431,0.007931,0.003963,0.005998,0.006738
1175,2018-01-09 04:40:00,2018-01-09 05:10:00,0.004599,0.006711,0.007207,0.013650,0.006878,0.001041,0.007021,0.010034,...,0.011389,0.012814,0.003246,0.003944,0.003818,0.004609,0.004314,0.006407,0.010395,0.008531
1176,2018-01-09 06:00:00,2018-01-09 06:20:00,0.010506,0.015828,0.011177,0.032212,0.005308,0.000950,0.003832,0.001718,...,0.002351,0.000641,0.002690,0.004197,0.005192,0.008365,0.009064,0.014650,0.011299,0.006367
1177,2018-01-11 01:00:00,2018-01-11 01:20:00,0.009189,0.003402,0.009638,-0.002875,0.002518,0.008195,0.001232,0.001593,...,0.005293,-0.002936,0.003613,0.007226,0.005021,0.006018,-0.000070,0.010705,-0.001379,0.011369
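
Each row of the table above holds one contribution value per sensor for one predicted anomaly. To see which sensors matter most for a given event, the values can be ranked. The sketch below does this in plain Python. The function `top_contributors` is hypothetical, and it assumes each diagnostics record is a list of dictionaries with `name` and `value` keys (the `value` key is an assumption); the four sample values are taken from row 2 of the table.

```python
def top_contributors(diagnostics, n=3):
    """Return the n (name, value) pairs with the highest contribution."""
    ranked = sorted(diagnostics, key=lambda d: d['value'], reverse=True)
    return [(d['name'], d['value']) for d in ranked[:n]]

# A few contributions from the anomaly starting 2017-01-12 20:00:00
sample_diagnostics = [
    {'name': 'R80711\\Q_avg', 'value': 0.015175},
    {'name': 'R80711\\Q_min', 'value': 0.021451},
    {'name': 'R80711\\Ws1_avg', 'value': 0.017866},
    {'name': 'R80711\\Rbt_std', 'value': 0.003594},
]

top = top_contributors(sample_diagnostics, n=2)
print(top)
```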


### Display Anomaly Events

In [37]:
def build_labels_df(df, predicted_ranges, labeled_ranges):
    labels_df = pd.DataFrame(index=pd.to_datetime(df.index))
    labels_df['true'] = 0
    labels_df['predicted'] = 0
    
    mask = labels_df.index >= evaluation_start
    labels_df = labels_df.loc[mask, :]
    
    for row in labeled_ranges.iterrows():
        s = pd.to_datetime(row[1]['start'])
        e = pd.to_datetime(row[1]['end'])
        labels_df.loc[s:e,'true'] = 1
    
    for row in predicted_ranges.iterrows():
        s = pd.to_datetime(row[1]['start'])
        e = pd.to_datetime(row[1]['end'])
        labels_df.loc[s:e,'predicted'] = 1
    
    return labels_df

labels_df = build_labels_df(df, predicted_ranges, labeled_ranges)
labels_df

Timestamp,true,predicted
2017-01-10 08:50:00,0,0
2017-01-10 09:00:00,0,0
2017-01-10 09:10:00,0,0
2017-01-10 09:20:00,0,0
2017-01-10 09:30:00,0,0
...,...,...
2018-01-12 22:20:00,0,0
2018-01-12 22:30:00,0,0
2018-01-12 22:40:00,0,0
2018-01-12 22:50:00,0,0


In [38]:
# Timestamps where both the label and the prediction flag an anomaly (computed once, outside the loop)
b = set(labels_df.loc[labels_df.sum(axis=1) == 2].index)

c_ = []
for row in labeled_ranges.iterrows():
    s = pd.to_datetime(row[1]['start'])
    e = pd.to_datetime(row[1]['end'])
    a = labels_df.loc[s:e,:].index
    c = set(a).intersection(b)
    if c:
        c_.append(1)

print('Total abnormal events detected: ', len(c_))
print('Total abnormal events in the evaluation period: ', len(labeled_ranges.loc[labeled_ranges['start']>=evaluation_start,:]))

Total abnormal events detected:  148
Total abnormal events in the evaluation period:  150
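
The count above is event-level: a labeled event is considered detected as soon as one timestamp inside it is also predicted as abnormal. The sketch below expresses a closely related idea with plain interval overlap. It is an approximation, not the cell above: it works on continuous ranges rather than on sampled timestamps, and the function name and sample ranges are hypothetical.

```python
from datetime import datetime


def count_detected_events(labeled_ranges, predicted_ranges):
    """Count labeled (start, end) ranges that overlap at least one predicted range."""
    detected = 0
    for l_start, l_end in labeled_ranges:
        # Two closed intervals overlap when each one starts before the other ends
        if any(p_start <= l_end and l_start <= p_end
               for p_start, p_end in predicted_ranges):
            detected += 1
    return detected


# Hypothetical ranges for illustration only
labeled = [
    (datetime(2018, 1, 9, 4, 40), datetime(2018, 1, 9, 5, 10)),
    (datetime(2018, 1, 11, 1, 0), datetime(2018, 1, 11, 1, 20)),
]
predicted = [(datetime(2018, 1, 9, 5, 0), datetime(2018, 1, 9, 6, 0))]

print(count_detected_events(labeled, predicted), 'of', len(labeled), 'events detected')
```

An event-level count like this says nothing about false positives, so it is best read alongside the per-timestamp `labels_df`.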


In [39]:
PREDICTIONS_FNAME = 'predictions.csv'
labels_df.to_csv(f's3://{BUCKET}/{PREFIX}/labelled_data/{PREDICTIONS_FNAME}')

## Run inference on the L4E model

### Create the inference scheduler
The CreateInferenceScheduler API creates a scheduler **and** starts it, so you start incurring costs right away. However, you can stop and restart an existing scheduler at will (see later in this notebook):

In [57]:
ROLE_ARN = sagemaker.get_execution_role()
# REGION_NAME = boto3.session.Session().region_name
REGION_NAME = 'ap-northeast-2'
DATASET_NAME = 'wind-turbine-train-dsv2-PR'
MODEL_NAME = 'wind-turbine-10min-PR-trial2'

# Name of the inference scheduler you want to create
INFERENCE_SCHEDULER_NAME = 'wind-turbine-scheduler-a2i-for-baris'

# Name of the model on which you want to create this inference scheduler
MODEL_NAME_FOR_CREATING_INFERENCE_SCHEDULER = MODEL_NAME

# Mandatory parameters:
INFERENCE_DATA_SOURCE_BUCKET = BUCKET
INFERENCE_DATA_SOURCE_PREFIX = f'{PREFIX}/inference-a2i/input/'
INFERENCE_DATA_OUTPUT_BUCKET = BUCKET
INFERENCE_DATA_OUTPUT_PREFIX = f'{PREFIX}/inference-a2i/output/'
ROLE_ARN_FOR_INFERENCE = ROLE_ARN
DATA_UPLOAD_FREQUENCY = 'PT10M'

In [58]:
DATA_DELAY_OFFSET_IN_MINUTES = None
INPUT_TIMEZONE_OFFSET = '+00:00'
COMPONENT_TIMESTAMP_DELIMITER = '_'
TIMESTAMP_FORMAT = 'yyyyMMddHHmmss'

In [59]:
scheduler = lookout.LookoutEquipmentScheduler(
    scheduler_name=INFERENCE_SCHEDULER_NAME,
    model_name=MODEL_NAME_FOR_CREATING_INFERENCE_SCHEDULER,
    region_name=REGION_NAME
)

scheduler_params = {
    'input_bucket': INFERENCE_DATA_SOURCE_BUCKET,
    'input_prefix': INFERENCE_DATA_SOURCE_PREFIX,
    'output_bucket': INFERENCE_DATA_OUTPUT_BUCKET,
    'output_prefix': INFERENCE_DATA_OUTPUT_PREFIX,
    'role_arn': ROLE_ARN_FOR_INFERENCE,
    'upload_frequency': DATA_UPLOAD_FREQUENCY,
    'delay_offset': DATA_DELAY_OFFSET_IN_MINUTES,
    'timezone_offset': INPUT_TIMEZONE_OFFSET,
    'component_delimiter': COMPONENT_TIMESTAMP_DELIMITER,
    'timestamp_format': TIMESTAMP_FORMAT
}

scheduler.set_parameters(**scheduler_params)
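
The `lookout.LookoutEquipmentScheduler` helper is defined outside this section, so its internals are not shown here. As a rough sketch, assuming it forwards these parameters to the CreateInferenceScheduler API, the request could be assembled as below. The function name `build_scheduler_request` and the example values are hypothetical, and the field names follow the boto3 `lookoutequipment` client as we understand it, so check them against the API reference before relying on them.

```python
def build_scheduler_request(scheduler_name, model_name, role_arn,
                            input_bucket, input_prefix,
                            output_bucket, output_prefix,
                            upload_frequency='PT10M',
                            delay_offset=None,
                            timezone_offset='+00:00',
                            component_delimiter='_',
                            timestamp_format='yyyyMMddHHmmss'):
    """Assemble keyword arguments for create_inference_scheduler (sketch)."""
    request = {
        'ModelName': model_name,
        'InferenceSchedulerName': scheduler_name,
        'DataUploadFrequency': upload_frequency,
        'RoleArn': role_arn,
        'DataInputConfiguration': {
            'S3InputConfiguration': {'Bucket': input_bucket, 'Prefix': input_prefix},
            'InputTimeZoneOffset': timezone_offset,
            'InferenceInputNameConfiguration': {
                'TimestampFormat': timestamp_format,
                'ComponentTimestampDelimiter': component_delimiter,
            },
        },
        'DataOutputConfiguration': {
            'S3OutputConfiguration': {'Bucket': output_bucket, 'Prefix': output_prefix},
        },
    }
    # The delay offset is optional, so it is only sent when a value is given
    if delay_offset is not None:
        request['DataDelayOffsetInMinutes'] = delay_offset
    return request


# Placeholder values for illustration only
request = build_scheduler_request(
    scheduler_name='example-scheduler',
    model_name='example-model',
    role_arn='arn:aws:iam::123456789012:role/example-role',
    input_bucket='example-bucket', input_prefix='inference/input/',
    output_bucket='example-bucket', output_prefix='inference/output/',
)
print(sorted(request))

# With AWS credentials configured, the request could then be sent with:
# boto3.client('lookoutequipment').create_inference_scheduler(
#     ClientToken=str(uuid.uuid4()), **request)
```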

### Prepare the inference data
---
Let's prepare and send some data in the S3 input location our scheduler will monitor:

In [43]:
# Let's load all our original signals:
all_tags_fname = TRAIN_DATA+'/'+turbine_id+'/'+turbine_id+'.csv'
all_tags_df = pd.read_csv(all_tags_fname)
all_tags_df['Timestamp']= pd.to_datetime(all_tags_df['Timestamp'])
all_tags_df = all_tags_df.set_index('Timestamp')
all_tags_df.head()

Unnamed: 0_level_0,Q_avg,Q_min,Q_max,Q_std,Ws1_avg,Ws1_min,Ws1_max,Ws1_std,Ws2_avg,Ws2_min,...,Gb1t_max,Gb1t_std,Db1t_avg,Db1t_min,Db1t_max,Db1t_std,Rbt_avg,Rbt_min,Rbt_max,Rbt_std
Timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2012-12-31 23:00:00,14.49,-0.44,41.18,8.19,8.770001,6.27,11.37,0.82,9.16,6.68,...,66.699997,0.73,39.02,37.0,41.0,1.09,28.709999,28.6,28.799999,0.03
2012-12-31 23:10:00,23.700001,1.75,43.02,8.3,8.66,6.01,11.37,1.02,9.12,5.46,...,70.099998,0.92,35.919998,35.099998,37.299999,0.6,28.700001,28.6,28.75,0.01
2012-12-31 23:20:00,25.48,3.2,46.619999,9.479999,8.94,6.08,11.29,0.99,9.45,5.89,...,72.300003,0.7,36.849998,35.400002,38.400002,0.82,28.790001,28.700001,28.799999,0.03
2012-12-31 23:30:00,24.379999,2.2,57.880001,11.1,8.87,5.96,12.15,1.14,8.979999,5.64,...,73.449997,0.62,39.75,38.200001,41.099998,0.81,28.860001,28.799999,29.0,0.07
2012-12-31 23:40:00,14.47,-10.88,35.189999,10.02,9.44,6.06,12.31,1.12,9.51,6.1,...,71.300003,1.4,40.950001,39.599998,41.700001,0.54,28.77,28.700001,28.9,0.05


In [44]:
all_tags_df.index.max()

Timestamp('2018-01-12 23:00:00')

Let's load the tags description: this dataset comes with a data description file. From here, we can collect the list of components (subsystem column) if required. Note that the steps below are not mandatory for this notebook, they only serve as a point of reference for our interpretation.

In [45]:
RAW_DATA   = os.path.join(WTDATA, 'raw')
#os.makedirs(RAW_DATA, exist_ok=True)

In [46]:
tags_description_fname = os.path.join(RAW_DATA, 'data_description.csv')
tags_description_df = pd.read_csv(tags_description_fname, sep=';')
tags_description_df.head()

Unnamed: 0,Variable_name,Variable_long_name,Unit_long_name,Comment
0,Q,Reactive_power,kVAr,
1,Ws,Wind_speed,m/s,Average wind speed
2,Va2,Vane_position_2,deg,Second wind vane on the nacelle
3,Git,Gearbox_inlet_temperature,deg_C,
4,Ot,Outdoor_temperature,deg_C,


**Note:** We will use the wind turbine name as the Subsystem in this example, but the code is ready to handle multiple components or subsystems as your use case needs. In case of multiple subsystems, uncomment the cell below and also uncomment the for loop in the sample inference dataset cell below.

In [None]:
#tags_description_df['Subsystem'] = turbine_id
#components = tags_description_df['Subsystem'].unique()

#### To build our sample inference dataset, we will extract the last two hours of the evaluation period of the original time series:
Specifically, we will create 3 CSV files for our turbine, with file timestamps 10 minutes apart to match the scheduler frequency. Each file holds 40 minutes of the original data. These are all stored in S3 in the `inference-a2i` folder.

In [60]:
# How many sequences do we want to extract:
num_sequences = 3

# The scheduling frequency in minutes: this **MUST** match the
# resampling rate used to train the model:
frequency = 10
# Getting a better range for more data points
duration = 40

# Loops through each sequence:
start = all_tags_df.index.max() + datetime.timedelta(minutes=-duration * (num_sequences))
j = 0
for i in range(num_sequences):
    print("num seq i: " + str(i))
    end = start + datetime.timedelta(minutes=+duration)
    
    # Rounding the current time down to the previous multiple of `frequency` minutes:
    tm = datetime.datetime.now()
    print(tm)
    tm = tm - datetime.timedelta(
        minutes=tm.minute % frequency,
        seconds=tm.second,
        microseconds=tm.microsecond
    )
    tm = tm + datetime.timedelta(minutes=+frequency * (i))
    current_timestamp = (tm).strftime(format='%Y%m%d%H%M%S')


    # For each sequence, we need to loop through all components:
    print(f'Extracting data from {start} to {end}:')
    new_index = None
    
    #for component in components:
        #print(component)
        # Extracting the dataframe for this component and this particular time range:
    signals = list(df.columns)
    signals_df = all_tags_df.loc[start:end, signals]
        
        # We need to reset the index to match the time 
        # at which the scheduler will run inference:
    if new_index is None:
        new_index = pd.date_range(
            start=tm,
            periods=signals_df.shape[0], 
            freq='2min'
        )
    signals_df.index = new_index
    signals_df.index.name = 'Timestamp'
    signals_df = signals_df.reset_index()
    signals_df['Timestamp'] = pd.to_datetime(signals_df['Timestamp'], errors='coerce')
    #signals_df['Timestamp'] = signals_df['Timestamp'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f')
    # IMPORTANT STEP - we are populating a new data frame here to be used in A2I display UI for reference
    if j == 0:
        sig_full_df = signals_df
        j = 1
    else:
        sig_full_df = pd.concat([sig_full_df,signals_df], ignore_index=True)
    # Export this file in CSV format:
    component_fname = os.path.join(INFER_DATA_A2I, 'input', f'{turbine_id}_{current_timestamp}.csv')
    print("creating inference input files: " + component_fname)
    signals_df.to_csv(component_fname, index=None)
    
    start = start + datetime.timedelta(minutes=+duration)
    
    # Upload the whole folder to S3, in the input location:
    INFERENCE_INPUT = os.path.join(INFER_DATA_A2I, 'input')
    !aws s3 cp --recursive --quiet $INFERENCE_INPUT s3://$BUCKET/$PREFIX/inference-a2i/input
    


num seq i: 0
2021-04-07 14:42:41.051319
Extracting data from 2018-01-12 21:00:00 to 2018-01-12 21:40:00:
creating inference input files: ../data/wind-turbine/final/inference-a2i/input/R80711_20210407144000.csv
num seq i: 1
2021-04-07 14:42:44.112115
Extracting data from 2018-01-12 21:40:00 to 2018-01-12 22:20:00:
creating inference input files: ../data/wind-turbine/final/inference-a2i/input/R80711_20210407145000.csv
num seq i: 2
2021-04-07 14:42:47.179511
Extracting data from 2018-01-12 22:20:00 to 2018-01-12 23:00:00:
creating inference input files: ../data/wind-turbine/final/inference-a2i/input/R80711_20210407150000.csv
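
The file names above come from rounding the current time down to a multiple of the scheduling frequency and then stepping forward one interval per sequence. The sketch below isolates that step with a fixed clock value taken from the first line of the output above; the helper name `scheduler_timestamps` is ours, not part of the notebook.

```python
import datetime


def scheduler_timestamps(now, frequency, num_sequences, fmt='%Y%m%d%H%M%S'):
    """Round `now` down to a multiple of `frequency` minutes, then step forward."""
    base = now - datetime.timedelta(
        minutes=now.minute % frequency,
        seconds=now.second,
        microseconds=now.microsecond,
    )
    return [
        (base + datetime.timedelta(minutes=frequency * i)).strftime(fmt)
        for i in range(num_sequences)
    ]


# Fixed "current time" so the result is reproducible
now = datetime.datetime(2021, 4, 7, 14, 42, 41, 51319)
stamps = scheduler_timestamps(now, frequency=10, num_sequences=3)
print([f'R80711_{ts}.csv' for ts in stamps])
```

The resulting suffixes follow the `yyyyMMddHHmmss` timestamp format and the `_` delimiter configured for the scheduler earlier.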


In [55]:
sig_full_df

Unnamed: 0,Timestamp,Q_avg,Q_min,Q_max,Q_std,Ws1_avg,Ws1_min,Ws1_max,Ws1_std,Ws2_avg,...,Gb1t_max,Gb1t_std,Db1t_avg,Db1t_min,Db1t_max,Db1t_std,Rbt_avg,Rbt_min,Rbt_max,Rbt_std
0,2021-04-07 14:30:00,15.56,9.48,19.09,1.6,4.69,3.29,6.04,0.46,4.83,...,62.65,0.16,41.71,40.9,42.6,0.41,23.69,23.6,23.7,0.02
1,2021-04-07 14:32:00,14.85,6.79,19.85,2.59,4.21,3.32,5.52,0.4,4.33,...,62.0,0.19,43.02,42.2,43.85,0.37,23.69,23.6,23.7,0.02
2,2021-04-07 14:34:00,12.26,4.03,19.27,4.46,4.16,3.04,5.38,0.42,4.29,...,61.7,0.27,43.72,42.6,44.5,0.37,23.61,23.4,23.7,0.1
3,2021-04-07 14:36:00,16.07,9.43,19.26,1.15,4.53,3.48,5.65,0.41,4.68,...,61.2,0.11,40.72,38.6,42.9,1.17,23.23,23.0,23.4,0.12
4,2021-04-07 14:38:00,15.02,8.67,18.47,1.44,5.04,3.84,6.5,0.44,5.22,...,61.35,0.09,36.83,34.85,38.8,1.08,22.94,22.8,23.1,0.1
5,2021-04-07 14:40:00,15.02,8.67,18.47,1.44,5.04,3.84,6.5,0.44,5.22,...,61.35,0.09,36.83,34.85,38.8,1.08,22.94,22.8,23.1,0.1
6,2021-04-07 14:42:00,15.9,8.61,19.58,1.62,5.28,4.1,6.84,0.47,5.44,...,62.1,0.24,33.56,32.6,35.1,0.71,22.78,22.6,23.0,0.06
7,2021-04-07 14:44:00,17.7,10.95,21.15,1.58,5.31,4.26,6.77,0.41,5.58,...,62.9,0.2,33.37,32.6,34.3,0.4,23.04,22.8,23.1,0.06
8,2021-04-07 14:46:00,17.74,10.87,20.85,1.49,4.96,4.01,6.05,0.4,5.15,...,63.05,0.08,34.91,33.85,35.9,0.5,23.26,23.1,23.35,0.07
9,2021-04-07 14:48:00,16.01,6.51,21.1,3.98,4.18,2.91,6.05,0.59,4.34,...,62.7,0.4,36.48,35.55,37.3,0.45,23.34,23.3,23.4,0.04


In [None]:
# Save the combined inference dataset (all sensors, by timestamp) and upload it to S3, so it can be displayed to reviewers during the A2I review process
sig_full_df.to_csv('../data/inference-a2i/insights.csv', index=False)
s3_ins_url = S3Uploader.upload('../data/inference-a2i/insights.csv', 's3://{}/{}/images'.format(BUCKET, PREFIX))

In [61]:
# Now that we've prepared the data, create the scheduler by running:
create_scheduler_response = scheduler.create()

===== Polling Inference Scheduler Status =====

Scheduler Status: PENDING
Scheduler Status: RUNNING

===== End of Polling Inference Scheduler Status =====


## Get inference results
---

### List inference executions

**Let's now wait 5-15 minutes to give the scheduler time to run its first inferences.** Once the wait is over, we can call the ListInferenceExecutions API for our current inference scheduler. The only mandatory parameter is the scheduler name.

You can also choose a time period for which to query inference executions. If you don't specify one, all executions for the inference scheduler are listed. To specify the time range, you can do this:

```python
START_TIME_FOR_INFERENCE_EXECUTIONS = datetime.datetime(2010,1,3,0,0,0)
END_TIME_FOR_INFERENCE_EXECUTIONS = datetime.datetime(2010,1,5,0,0,0)
```

Which means the executions after `2010-01-03 00:00:00` and before `2010-01-05 00:00:00` will be listed.

You can also query for executions in a particular status. The allowed statuses are `IN_PROGRESS`, `SUCCESS` and `FAILED`.
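
As a sketch of how these optional filters fit together, the helper below builds the keyword arguments for a direct boto3 call, leaving out any filter that is `None`. The helper name is hypothetical, and the parameter names (`DataStartTimeAfter`, `DataEndTimeBefore`, `Status`) are those of the boto3 `list_inference_executions` call as we understand it. The notebook itself goes through `scheduler.list_inference_executions`.

```python
import datetime

ALLOWED_STATUSES = {'IN_PROGRESS', 'SUCCESS', 'FAILED'}


def build_list_executions_kwargs(scheduler_name, start_time=None,
                                 end_time=None, status=None):
    """Build keyword arguments for list_inference_executions, skipping unset filters."""
    if status is not None and status not in ALLOWED_STATUSES:
        raise ValueError(f'Unsupported status: {status}')
    kwargs = {'InferenceSchedulerName': scheduler_name}
    if start_time is not None:
        kwargs['DataStartTimeAfter'] = start_time
    if end_time is not None:
        kwargs['DataEndTimeBefore'] = end_time
    if status is not None:
        kwargs['Status'] = status
    return kwargs


# Scheduler name is a placeholder
list_kwargs = build_list_executions_kwargs(
    'example-scheduler',
    start_time=datetime.datetime(2010, 1, 3, 0, 0, 0),
    end_time=datetime.datetime(2010, 1, 5, 0, 0, 0),
    status='SUCCESS',
)
print(list_kwargs)

# With AWS credentials configured:
# boto3.client('lookoutequipment').list_inference_executions(**list_kwargs)
```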

**Note:** Re-run the cell below about every 10 minutes until it lists 3 executions. You can stop once you have 3 rows.

In [91]:
START_TIME_FOR_INFERENCE_EXECUTIONS = None
END_TIME_FOR_INFERENCE_EXECUTIONS = None
EXECUTION_STATUS = None

execution_summaries = []

while len(execution_summaries) == 0:
    execution_summaries = scheduler.list_inference_executions(
        start_time=START_TIME_FOR_INFERENCE_EXECUTIONS,
        end_time=END_TIME_FOR_INFERENCE_EXECUTIONS,
        execution_status=EXECUTION_STATUS
    )
    if len(execution_summaries) == 0:
        print('WAITING FOR THE FIRST INFERENCE EXECUTION')
        time.sleep(60)
        
    else:
        print('FIRST INFERENCE EXECUTED\n')
        break
            
execution_summaries

FIRST INFERENCE EXECUTED



[{'ModelName': 'wind-turbine-10min-PR-trial2',
  'ModelArn': 'arn:aws:lookoutequipment:ap-northeast-2:631071447677:model/wind-turbine-10min-PR-trial2/176ce413-a57c-4ea4-b53c-2aea59bf3a95',
  'InferenceSchedulerName': 'wind-turbine-scheduler-a2i-for-baris',
  'InferenceSchedulerArn': 'arn:aws:lookoutequipment:ap-northeast-2:631071447677:inference-scheduler/wind-turbine-scheduler-a2i-for-baris/9d88b382-d7d4-4b95-b818-22cec70c0a94',
  'ScheduledStartTime': datetime.datetime(2021, 4, 7, 15, 10, tzinfo=tzlocal()),
  'DataStartTime': datetime.datetime(2021, 4, 7, 15, 0, tzinfo=tzlocal()),
  'DataEndTime': datetime.datetime(2021, 4, 7, 15, 10, tzinfo=tzlocal()),
  'DataInputConfiguration': {'S3InputConfiguration': {'Bucket': 'prem-experiments-ap',
    'Prefix': 'data/wind-turbine/inference-a2i/input/'}},
  'DataOutputConfiguration': {'S3OutputConfiguration': {'Bucket': 'prem-experiments-ap',
    'Prefix': 'data/wind-turbine/inference-a2i/output/'}},
  'CustomerResultObject': {'Bucket': 'prem-

### Get actual prediction results

After each successful inference, a results file is created in the output location of your bucket, in a new folder per inference. The cell below locates each file through the `CustomerResultObject` entry of the execution summary and parses it as JSON. Let's read these files and display their content here:

In [92]:
# If not installed at the beginning of the notebook, run this
#!pip install smart_open

In [93]:
import json
from smart_open import smart_open
results = []
for execution_summary in execution_summaries:
    print("Checking inference for " + str(execution_summary['ScheduledStartTime']) + " with status " + execution_summary['Status'])
    if execution_summary['Status'] == 'SUCCESS':
        bucket = execution_summary['CustomerResultObject']['Bucket']
        key = execution_summary['CustomerResultObject']['Key']
        fname = f's3://{bucket}/{key}'
        with smart_open(fname,'r') as file:
            data = json.load(file)
        results.append(pd.DataFrame([data]))

# Assemble the successful results into a single DataFrame:
if results:
    results_df = pd.concat(results, axis='index')
    results_df.columns = ['Timestamp', 'Predictions']
    results_df['Timestamp'] = pd.to_datetime(results_df['Timestamp'],errors='coerce')
    results_df = results_df.set_index('Timestamp')
else:
    # Fall back to an empty DataFrame with the same layout
    results_df = pd.DataFrame(columns=['Predictions'], index=pd.DatetimeIndex([], name='Timestamp'))
    print('No successful inference results yet, please try again.')

results_df

Checking inference for 2021-04-07 15:10:00+00:00 with status SUCCESS
Checking inference for 2021-04-07 15:00:00+00:00 with status SUCCESS
Checking inference for 2021-04-07 14:50:00+00:00 with status SUCCESS


Unnamed: 0_level_0,Predictions
Timestamp,Unnamed: 1_level_1
2021-04-07 15:00:00,0
2021-04-07 14:50:00,0
2021-04-07 14:40:00,0


In [94]:
results_df.to_csv(os.path.join(INFER_DATA_A2I, 'output', 'results.csv'))
results_df

Unnamed: 0_level_0,Predictions
Timestamp,Unnamed: 1_level_1
2021-04-07 15:00:00,0
2021-04-07 14:50:00,0
2021-04-07 14:40:00,0


### Stop Inference Scheduler
Let's stop the inference scheduler, as we won't need it for the rest of the steps below. As part of your own solution, however, the inference scheduler should keep running so that real-time inference for your equipment continues.

In [None]:
scheduler.stop(wait=True)

In [None]:
# Run this only if you don't need this scheduler anymore
scheduler.delete()
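
If you prefer to call the service directly instead of going through the `scheduler` helper, stopping a scheduler and waiting for it to settle could look like the sketch below. It assumes the boto3 `lookoutequipment` operations `stop_inference_scheduler` and `describe_inference_scheduler`. The `FakeLookoutClient` class is only a stand-in so the example runs without AWS access; replace it with `boto3.client('lookoutequipment')` in practice.

```python
import time


def stop_scheduler_and_wait(client, scheduler_name, poll_seconds=5, max_polls=60):
    """Stop a scheduler, then poll until the service reports it as STOPPED."""
    client.stop_inference_scheduler(InferenceSchedulerName=scheduler_name)
    for _ in range(max_polls):
        status = client.describe_inference_scheduler(
            InferenceSchedulerName=scheduler_name
        )['Status']
        if status == 'STOPPED':
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f'{scheduler_name} did not stop after {max_polls} polls')


class FakeLookoutClient:
    """Stand-in for boto3.client('lookoutequipment') so the sketch runs offline."""

    def __init__(self):
        self._statuses = iter(['STOPPING', 'STOPPED'])

    def stop_inference_scheduler(self, InferenceSchedulerName):
        return {'Status': 'STOPPING'}

    def describe_inference_scheduler(self, InferenceSchedulerName):
        return {'Status': next(self._statuses)}


final_status = stop_scheduler_and_wait(
    FakeLookoutClient(), 'example-scheduler', poll_seconds=0)
print(final_status)
```

A stopped scheduler can be restarted later, whereas a deleted one has to be created again.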

# A2I activities start here
Now that the inference has run, let's set up a UI to review the inference results and correct them, so we can send them back to L4E to retrain the model. Follow the steps below.


### Initialize handlers

In [150]:
timestamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
# Amazon SageMaker client
sagemaker_client = boto3.client('sagemaker')

# Amazon Augmented AI (A2I) client
a2i = boto3.client('sagemaker-a2i-runtime')

# Amazon S3 client 
s3 = boto3.client('s3')

# Flow definition name - this value is unique per account and region. You can also provide your own value here.
flowDefinitionName = 'fd-l4e-' + timestamp

# Task UI name - this value is unique per account and region. You can also provide your own value here.
taskUIName = 'ui-l4e-' + timestamp

# Flow definition outputs - temp S3 bucket in current region, as L4E is in AP region currently - to be changed at GA
a2ibucket = 'prem-experiments'
OUTPUT_PATH = f's3://{a2ibucket}/{PREFIX}/a2i-results'

role = get_execution_role()
print("RoleArn: {}".format(role))
WORKTEAM_ARN = 'arn:aws:sagemaker:us-east-1:631071447677:workteam/private-crowd/l4e-a2i-workforce'

RoleArn: arn:aws:iam::631071447677:role/service-role/AmazonSageMaker-ExecutionRole-20210304T115503


### Create the human task UI
Create a human task UI resource from a UI template written in Liquid HTML. This template is rendered to the human workers whenever a human loop is required, and you can download and customize it. For over 70 pre-built UIs, see https://github.com/aws-samples/amazon-a2i-sample-task-uis. But first, let's declare some variables that we need during the next set of steps.

In [151]:
# We customized the tabular template for our notebook as below
template = r"""
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.5.0/Chart.min.js"></script>

<style>
  table, tr, th, td {
    border: 1px solid black;
    border-collapse: collapse;
    padding: 5px;
  }
</style>

<crowd-form>
    <div>
        <h1>Instructions</h1>
        <p>Please review the equipment sensor inference inputs, and make corrections to anomaly predictions from the Lookout for Equipment Model. </p>
    </div>
   <div>
      <h3>Equipment Sensor Readings</h3>
      <div style="width:50%;">
        <canvas id="canvas"></canvas>
      </div>
   <table>
    <tr>
        <th>TIMESTAMP</th>
        <th>Reactive Power</th>
        <th>Wind Speed 1</th>
        <th>Outdoor Temp</th>
        <th>Grid Frequency</th>
        <th>Pitch Angle</th>
    </tr>
    {% for pair in task.input.signal %}
        <tr>
          <td>{{ pair.timestamp }}</td>
          <td>{{ pair.reactive_power }}</td>     
          <td>{{ pair.wind_speed_1 }}</td>
          <td>{{ pair.outdoor_temp }}</td>     
          <td>{{ pair.grid_frequency }}</td>
          <td>{{ pair.pitch_angle }}</td>     
        </tr>
      {% endfor %}
    </table>   
   </div>
    <br>
    <h1>Select the correct equipment status below</h1>
    <h3>0 means the equipment is fine. 1 means the equipment is faulty or is in the process of wearing down</h3>
    <table>
    <tr>
        <th>START</th>
        <th>END</th>
        <th>PREDICTED ANOMALY</th>
        <th>CORRECTED START</th>
        <th>CORRECTED END</th>
        <th>CORRECTED STATUS - Select an option</th>
        <th>COMMENTS</th>
    </tr>
    {% for pair in task.input.anomaly %}

        <tr>
          <td><crowd-text-area name="startts-{{ forloop.index }}" value="{{ pair.startts }}" rows="2"></crowd-text-area></td>
          <td><crowd-text-area name="endts-{{ forloop.index }}" value="{{ pair.endts }}" rows="2"></crowd-text-area></td>
          <td><crowd-text-area name="ano-{{ forloop.index }}" value="{{ pair.ano }}"></crowd-text-area></td>     
          <td>
          <p>
            <input type="text" name="TrueStart{{ forloop.index }}" value="{{ pair.startts }}" style="height:50%; width:100%" />
            </p>
            </td>
            <td>
            <p>
            <input type="text" name="TrueEnd{{ forloop.index }}" value="{{ pair.endts }}" style="height:50%; width:100%" />
            </p>
            </td>
            <td>
            <p>
            <input type="radio" name="faulty-{{forloop.index}}" value="1">
              <label for="faulty-{{forloop.index}}">1-Faulty</label><br>
              <input type="radio" name="good-{{forloop.index}}" value="0">
              <label for="good-{{forloop.index}}">0-Good</label><br>
            </p>
           </td>
           <td>
            <p>
            <input type="text" name="Comments{{ forloop.index }}" placeholder="Explain why you changed the value" style="height:50%; width:80%"/>
            </p>
           </td>
        </tr>

      {% endfor %}
    </table>
    <br>
    <br>
</crowd-form>

<script>
window.chartColors = {
  red: 'rgb(255, 99, 132)',
  orange: 'rgb(255, 159, 64)',
  yellow: 'rgb(255, 205, 86)',
  green: 'rgb(75, 192, 192)',
  blue: 'rgb(54, 162, 235)',
  purple: 'rgb(153, 102, 255)',
  grey: 'rgb(231,233,237)'
};

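// Labels for the x-axis; filled from task.input.signal in populateChart()
var timestamps = [];
// Placeholder series: these hard-coded values are not read from the task input
var reactive_power = [10, 20, 30, 40, 50];
var wind_speed_1 = [5, 7, 12, 15, 38];
var outdoor_temp = [12, 18, 23, 35, 38];
var grid_frequency = [2, 6, 78, 23, 9];
var pitch_angle = [6, 12, 56, 65, 87];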
          



var config = {
  type: 'line',
  data: {
    labels: timestamps,
    datasets: [{
      label: "Reactive Power",
      backgroundColor: window.chartColors.red,
      borderColor: window.chartColors.red,
      data: reactive_power,
      fill: false,
    }, {
      label: "Wind Speed 1",
      fill: false,
      backgroundColor: window.chartColors.blue,
      borderColor: window.chartColors.blue,
      data: wind_speed_1,
    }, {
      label: "Outdoor Temp",
      fill: false,
      backgroundColor: window.chartColors.orange,
      borderColor: window.chartColors.orange,
      data: outdoor_temp,
    }, {
      label: "Grid Frequency",
      fill: false,
      backgroundColor: window.chartColors.green,
      borderColor: window.chartColors.green,
      data: grid_frequency,
    }, {
      label: "Pitch Angle",
      fill: false,
      backgroundColor: window.chartColors.purple,
      borderColor: window.chartColors.purple,
      data: pitch_angle,
    }         
              ]
  },
  options: {
    responsive: true,
    title:{
      display:true,
      text:'Equipment Sensor Readings Line Chart'
    },
    tooltips: {
      mode: 'index',
      intersect: false,
    },
   hover: {
      mode: 'nearest',
      intersect: true
    },
    scales: {
      xAxes: [{
        display: true,
        scaleLabel: {
          display: true,
          labelString: 'Timestamp'
        }
      }],
      yAxes: [{
        display: true,
        scaleLabel: {
          display: true,
        },
      }]
    }
  }
};

document.addEventListener('all-crowd-elements-ready', populateChart);

function populateChart() {
  
{% for pair in task.input.signal %}
    
    timestamps.push("{{ pair.timestamp }}");
    
{% endfor %}

  var ctx = document.getElementById("canvas").getContext("2d");
  var myLine = new Chart(ctx, config);
}
  
</script>
"""

In [152]:
def create_task_ui():
    '''
    Creates a Human Task UI resource.
    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response

In [153]:
# Create task UI
humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)

arn:aws:sagemaker:us-east-1:631071447677:human-task-ui/ui-l4e-2021-04-07-16-07-54


In [154]:
role = get_execution_role()
print("RoleArn: {}".format(role))

RoleArn: arn:aws:iam::631071447677:role/service-role/AmazonSageMaker-ExecutionRole-20210304T115503


In [155]:
create_workflow_definition_response = sagemaker_client.create_flow_definition(
        FlowDefinitionName= flowDefinitionName,
        RoleArn=role,
        HumanLoopConfig= {
            "WorkteamArn": WORKTEAM_ARN,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Review the contents and select correct values as indicated",
            "TaskTitle": "Equipment Condition Review"
        },
        OutputConfig={
            "S3OutputPath" : OUTPUT_PATH
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] # let's save this ARN for future use

In [156]:
for x in range(60):
    describeFlowDefinitionResponse = sagemaker_client.describe_flow_definition(FlowDefinitionName=flowDefinitionName)
    print(describeFlowDefinitionResponse['FlowDefinitionStatus'])
    if (describeFlowDefinitionResponse['FlowDefinitionStatus'] == 'Active'):
        print("Flow Definition is active")
        break
    time.sleep(2)

Initializing
Active
Flow Definition is active


# Sending predictions to Amazon A2I human loops

In [157]:
a2i_sig_full_df = sig_full_df.reset_index()

In [158]:
NUM_TO_REVIEW = 5 # number of line items to review
dftimestamp = a2i_sig_full_df['Timestamp'].astype(str).to_list()
dfsig001 = a2i_sig_full_df['Q_avg'].astype(str).to_list()
dfsig002 = a2i_sig_full_df['Ws1_avg'].astype(str).to_list()
dfsig003 = a2i_sig_full_df['Ot_avg'].astype(str).to_list()
dfsig004 = a2i_sig_full_df['Nf_avg'].astype(str).to_list()
dfsig046 = a2i_sig_full_df['Ba_avg'].astype(str).to_list()
sig_list = [{'timestamp': dftimestamp[x], 'reactive_power': dfsig001[x], 'wind_speed_1': dfsig002[x], 'outdoor_temp': dfsig003[x], 'grid_frequency': dfsig004[x], 'pitch_angle': dfsig046[x]} for x in range(NUM_TO_REVIEW)]
sig_list

[{'timestamp': '2021-04-07 14:40:00',
  'reactive_power': '15.56',
  'wind_speed_1': '4.69',
  'outdoor_temp': '4.78',
  'grid_frequency': '50.0',
  'pitch_angle': '-0.99'},
 {'timestamp': '2021-04-07 14:42:00',
  'reactive_power': '14.85',
  'wind_speed_1': '4.21',
  'outdoor_temp': '4.84',
  'grid_frequency': '49.96',
  'pitch_angle': '-0.86'},
 {'timestamp': '2021-04-07 14:44:00',
  'reactive_power': '12.26',
  'wind_speed_1': '4.16',
  'outdoor_temp': '4.69',
  'grid_frequency': '49.99',
  'pitch_angle': '-0.73'},
 {'timestamp': '2021-04-07 14:46:00',
  'reactive_power': '16.07',
  'wind_speed_1': '4.53',
  'outdoor_temp': '4.42',
  'grid_frequency': '50.01',
  'pitch_angle': '-0.99'},
 {'timestamp': '2021-04-07 14:48:00',
  'reactive_power': '15.02',
  'wind_speed_1': '5.04',
  'outdoor_temp': '4.13',
  'grid_frequency': '50.01',
  'pitch_angle': '-0.99'}]

In [159]:
# Keep an independent copy, since results_df is modified in place below
old_results_df = results_df.copy()

**Note:** Run the cell below only once after each inference run. Running it again adds an extra `index` column.

In [160]:
# Run this only once after an inference run
results_df.reset_index(inplace=True)

**Note:** You can run the cells below any number of times.

In [161]:
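# Give reviewers a window of 12 sampling intervals (2 hours at a 10-minute frequency) on each side of the prediction
results_df['StartTimestamp'] = results_df['Timestamp'] - datetime.timedelta(minutes=frequency*12)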
results_df['EndTimestamp'] = results_df['Timestamp'] + datetime.timedelta(minutes=frequency*12)

In [162]:
#results_df = results_df.drop(['index'], axis=1)
results_df

Unnamed: 0,index,Timestamp,Predictions,StartTimestamp,EndTimestamp
0,0,2021-04-07 15:00:00,0,2021-04-07 13:00:00,2021-04-07 17:00:00
1,1,2021-04-07 14:50:00,0,2021-04-07 12:50:00,2021-04-07 16:50:00
2,2,2021-04-07 14:40:00,0,2021-04-07 12:40:00,2021-04-07 16:40:00


In [163]:
dfstartts = results_df['StartTimestamp'].astype(str).to_list()
dfendts = results_df['EndTimestamp'].astype(str).to_list()
dfano = results_df['Predictions'].to_list()
ano_list = [{'startts': dfstartts[x], 'endts': dfendts[x], 'ano': dfano[x]} for x in range(len(results_df))]
ano_list

[{'startts': '2021-04-07 13:00:00', 'endts': '2021-04-07 17:00:00', 'ano': 0},
 {'startts': '2021-04-07 12:50:00', 'endts': '2021-04-07 16:50:00', 'ano': 0},
 {'startts': '2021-04-07 12:40:00', 'endts': '2021-04-07 16:40:00', 'ano': 0}]

In [164]:
ip_content = {"signal": sig_list,
             'anomaly': ano_list
             }
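
The Liquid template above reads `task.input.signal` and `task.input.anomaly`, so every item in the payload needs the fields the template references. The sketch below is an optional sanity check to run before starting a human loop. The helper name and the sample payload are hypothetical; the field names are the ones used in the template.

```python
import json

# Fields referenced by the task template for each section of the input
SIGNAL_FIELDS = {'timestamp', 'reactive_power', 'wind_speed_1',
                 'outdoor_temp', 'grid_frequency', 'pitch_angle'}
ANOMALY_FIELDS = {'startts', 'endts', 'ano'}


def missing_template_fields(content):
    """Return (section, item index, missing fields) for every incomplete item."""
    problems = []
    for section, required in (('signal', SIGNAL_FIELDS), ('anomaly', ANOMALY_FIELDS)):
        for i, item in enumerate(content.get(section, [])):
            missing = required - set(item)
            if missing:
                problems.append((section, i, sorted(missing)))
    return problems


# Hypothetical payload: the anomaly entry lacks its 'ano' field
sample_content = {
    'signal': [{'timestamp': '2021-04-07 14:40:00', 'reactive_power': '15.56',
                'wind_speed_1': '4.69', 'outdoor_temp': '4.78',
                'grid_frequency': '50.0', 'pitch_angle': '-0.99'}],
    'anomaly': [{'startts': '2021-04-07 13:00:00', 'endts': '2021-04-07 17:00:00'}],
}

problems = missing_template_fields(sample_content)
print(problems)

# The payload is sent to A2I as a JSON string, so it must also be serializable
payload = json.dumps(sample_content)
```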

# Start the human review

In [165]:
import json
humanLoopName = str(uuid.uuid4())

start_loop_response = a2i.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(ip_content)
            }
        )


In [166]:
completed_human_loops = []
resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print(f'HumanLoop Name: {humanLoopName}')
print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
print('\n')
   
      
if resp["HumanLoopStatus"] == "Completed":
    completed_human_loops.append(resp)

HumanLoop Name: fe281ece-18d3-4c58-a938-5b8f648c3a54
HumanLoop Status: InProgress
HumanLoop Output Destination: {'OutputS3Uri': 's3://prem-experiments/data/wind-turbine/a2i-results/fd-l4e-2021-04-07-16-07-54/2021/04/07/16/08/25/fe281ece-18d3-4c58-a938-5b8f648c3a54/output.json'}




# Login link to the private workforce portal

In [167]:
workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!
https://klkkf8ofpo.labeling.us-east-1.sagemaker.aws


In [85]:
completed_human_loops = []
resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print(f'HumanLoop Name: {humanLoopName}')
print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
print('\n')
   
      
if resp["HumanLoopStatus"] == "Completed":
    completed_human_loops.append(resp)

HumanLoop Name: b22513e8-38f2-4b38-80b7-81c4c99b6ca2
HumanLoop Status: Completed
HumanLoop Output Destination: {'OutputS3Uri': 's3://prem-experiments/data/wind-turbine/a2i-results/fd-l4e-2021-04-07-14-52-15/2021/04/07/14/54/38/b22513e8-38f2-4b38-80b7-81c4c99b6ca2/output.json'}




# Evaluating the results

When the labeling work is complete, your results should be available in the S3 output path specified in the human review workflow definition. 
The human answers are returned and saved in the JSON file.
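
The output location is reported as an S3 URI, which has to be split into a bucket and a key before calling `get_object`. One way to do this without hard-coding the bucket name is the standard library's `urlparse`, sketched below. The helper name and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'Not an S3 URI: {uri}')
    return parsed.netloc, parsed.path.lstrip('/')


# Example URI for illustration only
bucket, key = split_s3_uri('s3://example-bucket/a2i-results/loop-1/output.json')
print(bucket, key)
```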

In [86]:
import re
import pprint

pp = pprint.PrettyPrinter(indent=4)
json_output = ''
for resp in completed_human_loops:
    splitted_string = re.split('s3://' + a2ibucket  + '/', resp['HumanLoopOutput']['OutputS3Uri'])
    print(splitted_string[1])
    output_bucket_key = splitted_string[1]
    response = s3.get_object(Bucket=a2ibucket, Key=output_bucket_key)
    content = response["Body"].read()
    json_output = json.loads(content)
    pp.pprint(json_output)
    print('\n')

data/wind-turbine/a2i-results/fd-l4e-2021-04-07-14-52-15/2021/04/07/14/54/38/b22513e8-38f2-4b38-80b7-81c4c99b6ca2/output.json
{   'flowDefinitionArn': 'arn:aws:sagemaker:us-east-1:631071447677:flow-definition/fd-l4e-2021-04-07-14-52-15',
    'humanAnswers': [   {   'acceptanceTime': '2021-04-07T14:54:46.080Z',
                            'answerContent': {   'TrueEnd1': '2021-04-07 '
                                                             '16:40:00',
                                                 'TrueStart1': '2021-04-07 '
                                                               '12:40:00',
                                                 'ano-1': '0',
                                                 'endts-1': '2021-04-07 '
                                                            '16:40:00',
                                                 'faulty-1': {'1': False},
                                                 'good-1': {'0': False},
                               

## Retrain L4E based on A2I correction
Now we'll take the A2I output, preprocess it, and send it back to L4E to retrain our model based on the reviewers' corrections.
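As a rough illustration of the preprocessing step, the sketch below groups the fields of one `answerContent` dictionary by the row number embedded in each key and keeps the date ranges of rows flagged as faulty. The helper name `flatten_answer_content` and the sample values are hypothetical. The sample is only shaped like the output shown earlier, and the single-entry dictionaries for `faulty-N` and `good-N` are assumed to hold the radio-button state.

```python
import re
from collections import defaultdict


def flatten_answer_content(answer_content):
    """Group A2I form fields by the row number embedded in each key."""
    rows = defaultdict(dict)
    for key, value in answer_content.items():
        # Keys look like 'TrueStart1', 'TrueEnd1', 'faulty-1' or 'endts-1'.
        match = re.fullmatch(r"([A-Za-z]+)-?(\d+)", key)
        if match is None:
            continue  # ignore fields that do not carry a row number
        field, row = match.groups()
        # Assumption: radio-button fields arrive as dicts such as {'1': False}.
        if isinstance(value, dict):
            value = any(value.values())
        rows[int(row)][field] = value
    return dict(rows)


# Hypothetical sample shaped like the A2I output (values are illustrative).
sample = {
    "TrueStart1": "2021-04-07 12:40:00",
    "TrueEnd1": "2021-04-07 16:40:00",
    "ano-1": "0",
    "endts-1": "2021-04-07 16:40:00",
    "faulty-1": {"1": True},
    "good-1": {"0": False},
}

rows = flatten_answer_content(sample)
faulty_ranges = [
    (r["TrueStart"], r["TrueEnd"]) for r in rows.values() if r.get("faulty")
]
print(faulty_ranges)
```

Working with one record per row keeps the faulty check and the date lookup in a single pass, at the cost of assuming the key naming pattern stays stable.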

In [None]:
labels_df = pd.read_csv(os.path.join(LABEL_DATA, 'labels.csv'), header=None)
labels_df[0] = pd.to_datetime(labels_df[0])
labels_df[1] = pd.to_datetime(labels_df[1])
labels_df.columns = ['start', 'end']
labels_df.tail()

In [None]:
a2i_lbl_df = pd.DataFrame()

### Update Labels with new date ranges

In [None]:
faulty = False
a2i_lbl_df = labels_df.copy()  # copy so the original labels stay untouched
row_df = pd.DataFrame(columns=['rownr'])
tslist = {}

# First, check whether the reviewers marked any equipment as faulty and,
# if so, collect the corresponding row numbers in a dataframe
for answer in json_output['humanAnswers']:
    print("checking equipment review...")
    content = answer['answerContent']
    for key, value in content.items():
        # Radio-button answers look like {'1': True}; True means "faulty" was selected
        if "faulty" in key and any(value.values()):
            faulty = True
            row_df.loc[len(row_df.index)] = [key.split('-')[1]]
            print("found faulty equipment in row: " + key.split('-')[1])


# Next, collect the corrected date ranges for the rows marked as faulty
content = json_output['humanAnswers'][0]['answerContent']
for _, row in row_df.iterrows():
    for field in ("TrueStart" + row['rownr'], "TrueEnd" + row['rownr']):
        if field in content:
            tslist[field] = content[field]


# Finally, append the corrected ranges to the new A2I labels dataset
for _, row in row_df.iterrows():
    start = pd.to_datetime(tslist["TrueStart" + row['rownr']])
    end = pd.to_datetime(tslist["TrueEnd" + row['rownr']])
    a2i_lbl_df.loc[len(a2i_lbl_df.index)] = [start, end]

### Don't execute the steps below if no new label was added
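One way to make this check explicit is to compare the augmented labels with the original ones. The sketch below is a minimal illustration on made-up data. The helper `new_label_rows` is hypothetical, and it assumes both dataframes use the `start` and `end` columns set earlier.

```python
import pandas as pd


def new_label_rows(original, augmented):
    """Return the rows of `augmented` that are not present in `original`."""
    merged = augmented.merge(
        original, on=["start", "end"], how="left", indicator=True
    )
    only_new = merged["_merge"] == "left_only"
    return merged.loc[only_new, ["start", "end"]].reset_index(drop=True)


# Illustrative data: one existing label range and one range added after review.
original = pd.DataFrame({
    "start": pd.to_datetime(["2018-01-01 00:00:00"]),
    "end": pd.to_datetime(["2018-01-02 00:00:00"]),
})
reviewed = pd.DataFrame({
    "start": pd.to_datetime(["2021-04-07 12:40:00"]),
    "end": pd.to_datetime(["2021-04-07 16:40:00"]),
})
augmented = pd.concat([original, reviewed], ignore_index=True)

added = new_label_rows(original, augmented)
if added.empty:
    print("No new labels; skip the retraining steps.")
else:
    print(f"{len(added)} new label range(s) found.")
```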

In [None]:
# Updated Labels after A2I results are included
a2i_lbl_df

In [None]:
a2i_label_src_fname = os.path.join(A2I_LABEL_DATA, 'labels.csv')
a2i_lbl_df.to_csv(a2i_label_src_fname, header=None, index=False)

In [None]:
# Uploading label dataset to S3:
a2i_label_s3_dest_path = f's3://{BUCKET}/{PREFIX}/augmented-labelled-data/labels.csv'
!aws s3 cp $a2i_label_src_fname $a2i_label_s3_dest_path

### Update the training dataset with new measurements
We will now update our original training dataset with the new measurement range, based on what we got back from A2I.

In [None]:
turbine_id = 'R80711'
file = '../data/wind-turbine/final/training-data/'+turbine_id+'/'+turbine_id+'.csv'
newdf = pd.read_csv(file, index_col='Timestamp')
newdf.head()

In [None]:
# Inspect the shape without overwriting the dataframe
newdf.shape

In [None]:
sig_full_df = sig_full_df.set_index('Timestamp')

In [None]:
sig_full_df

In [None]:
sig_full_df.shape

In [None]:
tm = pd.to_datetime('2021-04-05 20:30:00')
print(tm)
new_index = pd.date_range(
        start=tm,
        periods=sig_full_df.shape[0], 
        freq='10min'
        )
sig_full_df.index = new_index
sig_full_df.index.name = 'Timestamp'
sig_full_df = sig_full_df.reset_index()
sig_full_df['Timestamp'] = pd.to_datetime(sig_full_df['Timestamp'], errors='coerce')

In [None]:
sig_full_df

In [None]:
# Append the original training data with the new measurements that we simulated before we ran our inference. We should be updating this only 
# if A2I reviews tagged faulty equipment
newdf = newdf.reset_index()
newdf = pd.concat([newdf,sig_full_df])
newdf.head()

In [None]:
newdf.tail()

In [None]:
newdf = newdf.set_index('Timestamp')

**Note:** As we can see above, 15 rows were appended to the end of the training dataset. Now let's create a CSV file and copy the data to the training channel in S3.
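Before writing the file, it can be worth confirming that the combined index is still ordered and free of duplicate timestamps. The sketch below is a small sanity check on synthetic data. The helper `check_time_index` and the `power` column are hypothetical, and the check is a suggestion rather than a documented requirement of the service.

```python
import pandas as pd


def check_time_index(df):
    """Summarise index properties of a time-indexed dataframe."""
    idx = pd.to_datetime(df.index)
    return {
        "rows": len(df),
        "duplicates": int(idx.duplicated().sum()),
        "sorted": bool(idx.is_monotonic_increasing),
    }


# Synthetic stand-ins for the original training data and the new measurements.
old = pd.DataFrame(
    {"power": [1.0, 2.0]},
    index=pd.to_datetime(["2018-01-01 00:00:00", "2018-01-01 00:10:00"]),
)
new = pd.DataFrame(
    {"power": [3.0]},
    index=pd.date_range("2021-04-05 20:30:00", periods=1, freq="10min"),
)
combined = pd.concat([old, new])

summary = check_time_index(combined)
print(summary)
```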

In [None]:
TRAIN_DATA_AUGMENTED = os.path.join(TRAIN_DATA,'augmented')
os.makedirs(TRAIN_DATA_AUGMENTED, exist_ok=True)
newdf.to_csv('../data/wind-turbine/final/training-data/augmented/'+turbine_id+'.csv')
!aws s3 sync $TRAIN_DATA_AUGMENTED s3://$BUCKET/$PREFIX/training_data/augmented

In [None]:
# Update the component map for the augmented dataset. A2I updates should not change the dataset structure, but we rebuild the map just in case
DATASET_COMPONENT_FIELDS_MAP = dict()
for subsystem in components:
    # Use an exact match: `in` on a string performs a substring test
    if subsystem == "augmented":
        subsystem = turbine_id
        print("sub: " + subsystem)
        subsystem_tags = ['Timestamp']
        for root, _, files in os.walk(f'{TRAIN_DATA}/{subsystem}'):
            for file in files:
                print("file: " + file)
                fname = os.path.join(root, file)
                current_subsystem_df = pd.read_csv(fname, nrows=1)
                subsystem_tags = subsystem_tags + current_subsystem_df.columns.tolist()[1:]

            DATASET_COMPONENT_FIELDS_MAP.update({subsystem: subsystem_tags})    

In [None]:
DATASET_COMPONENT_FIELDS_MAP
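To make the shape of this map concrete, the sketch below builds one entry from the header row of a CSV file using only the standard library. The helper `component_fields`, the tag names, and the temporary file are hypothetical. Like the cell above, it assumes the first column is the timestamp.

```python
import csv
import os
import tempfile


def component_fields(csv_path, timestamp_col="Timestamp"):
    """Read only the header row and return the timestamp column plus tag names."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle))
    # The first column is treated as the timestamp, the rest as sensor tags.
    return [timestamp_col] + header[1:]


# Hypothetical component file written to a temporary directory for illustration.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "R80711.csv")
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows([
            ["Timestamp", "tag_a", "tag_b"],
            ["2018-01-01 00:00:00", 1, 2],
        ])
    fields_map = {"R80711": component_fields(path)}

print(fields_map)
```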

### Create the augmented dataset

In [None]:
ROLE_ARN = sagemaker.get_execution_role()
# REGION_NAME = boto3.session.Session().region_name
REGION_NAME = 'ap-northeast-2'
DATASET_NAME = 'wind-turbine-train-augmented-PR-trial5'
MODEL_NAME = 'wind-turbine-augmented-trial3'

lookout_dataset = lookout.LookoutEquipmentDataset(
    dataset_name=DATASET_NAME,
    component_fields_map=DATASET_COMPONENT_FIELDS_MAP,
    region_name=REGION_NAME,
    access_role_arn=ROLE_ARN
)

pp = pprint.PrettyPrinter(depth=5)
# The schema is a JSON string, so parse it with json.loads rather than eval
pp.pprint(json.loads(lookout_dataset.dataset_schema))

In [None]:
lookout_dataset.create()

### Ingest augmented data into L4E

In [None]:
response = lookout_dataset.ingest_data(BUCKET, f'{PREFIX}/training_data/augmented/')

In [None]:
# Get the ingestion job ID and status:
data_ingestion_job_id = response['JobId']
data_ingestion_status = response['Status']

# Wait until ingestion completes:
print("=====Polling Data Ingestion Status=====\n")
lookout_client = lookout.get_client(region_name=REGION_NAME)
print(str(pd.to_datetime(datetime.datetime.now()))[:19], "| ", data_ingestion_status)

while data_ingestion_status == 'IN_PROGRESS':
    time.sleep(60)
    describe_data_ingestion_job_response = lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)
    data_ingestion_status = describe_data_ingestion_job_response['Status']
    print(str(pd.to_datetime(datetime.datetime.now()))[:19], "| ", data_ingestion_status)
    
print("\n=====End of Polling Data Ingestion Status=====")

In [None]:
describe_data_ingestion_job_response
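The polling loop above waits for as long as the job reports `IN_PROGRESS`. A variation with an upper bound on the waiting time might look like the sketch below. The helper `poll_until_done` is hypothetical, and the status values in the fake sequence are illustrative stand-ins for what `describe_data_ingestion_job(...)['Status']` would return.

```python
import time


def poll_until_done(get_status, in_progress=("IN_PROGRESS",),
                    interval=60, timeout=3600, sleep=time.sleep):
    """Call `get_status()` until it leaves the in-progress states or time runs out."""
    waited = 0
    status = get_status()
    while status in in_progress:
        if waited >= timeout:
            raise TimeoutError(f"Still {status} after {waited} seconds")
        sleep(interval)
        waited += interval
        status = get_status()
    return status


# Stand-in for repeated describe calls; no AWS request is made here.
fake_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCESS"])
final = poll_until_done(
    lambda: next(fake_statuses), interval=0, sleep=lambda seconds: None
)
print(final)
```

Passing the status lookup as a callable keeps the helper independent of any particular client, so the same loop could wrap the ingestion check or a model training check.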

### Update the time ranges for training and evaluation of augmented dataset
When data is sampled and the model is retrained continuously, there are no gaps (or only minimal gaps) between the time range of the previous training run and that of the current augmented training run. In our wind turbine example, however, the dataset was last recorded in 2018, so we select the training and evaluation periods as shown below. In an operational setting, choose time periods that leave a back-test window for evaluation while keeping enough data available for training.
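As a minimal sketch of a ratio-based split, the example below derives the four boundaries from a synthetic 10-minute index. The helper `split_periods` is hypothetical, and it assumes the index is already sorted. It places the training end on the last row before the split point, which differs by one row from the indexing used in this notebook.

```python
import pandas as pd


def split_periods(index, train_ratio=0.8):
    """Return (train_start, train_end, eval_start, eval_end) for a sorted index."""
    if len(index) < 2:
        raise ValueError("need at least two timestamps to split")
    # Clamp the split so that both periods contain at least one timestamp.
    split = min(max(int(len(index) * train_ratio), 1), len(index) - 1)
    return index[0], index[split - 1], index[split], index[-1]


# Synthetic index: ten samples, ten minutes apart.
idx = pd.date_range("2018-01-01 00:00:00", periods=10, freq="10min")
train_start, train_end, eval_start, eval_end = split_periods(idx)

print(f"Training period: from {train_start} to {train_end}")
print(f"Evaluation period: from {eval_start} to {eval_end}")
```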

In [None]:
newdf.index = pd.to_datetime(newdf.index)

In [None]:
# Compute the time ranges for augmented training: the first 80% of the rows form the training period,
# and the rest, up to the last timestamp of the new data points, form the evaluation period

train_ratio = 0.8
train_split = int(len(newdf.index)*train_ratio)

   
training_start   = pd.to_datetime(newdf.index[0])
training_end     = pd.to_datetime(newdf.index[train_split])
evaluation_start = pd.to_datetime(newdf.index[train_split+1])
evaluation_end   = pd.to_datetime(newdf.index.max())
    

print(f'Training period: from {training_start} to {training_end}')
print(f'Evaluation period: from {evaluation_start} to {evaluation_end}')

print('Dataset used:', DATASET_NAME)

In [None]:
REGION_NAME

### Finally, retrain L4E on the augmented dataset

In [None]:
# Prepare the model parameters:
lookout_model = lookout.LookoutEquipmentModel(model_name='wind-turbine-augmented-PR-trial10',
                                              dataset_name=DATASET_NAME,
                                              region_name=REGION_NAME)

# Set the training / evaluation split date:
lookout_model.set_time_periods(evaluation_start,
                               evaluation_end,
                               training_start,
                               training_end)

# Set the label data location:
lookout_model.set_label_data(bucket=BUCKET, 
                             prefix=f'{PREFIX}/augmented-labelled-data/',
                             access_role_arn=ROLE_ARN)

# This sets up the rate the service will resample the data before 
# training:
lookout_model.set_target_sampling_rate(sampling_rate='PT10M')

In [None]:
# Actually create the model and train it:
lookout_model.train()

In [None]:
lookout_model.poll_model_training()

## Conclusion
---
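In this notebook, we trained an Amazon Lookout for Equipment model, configured an inference scheduler, and extracted the predictions obtained after it executed a few inferences. We then reviewed those predictions with an Amazon A2I private human loop and retrained the model on the augmented dataset.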