# Generate Data for Amazon Lookout for Metrics 

Amazon Lookout for Metrics (ALFM) needs, at a minimum, a new stream of data that matches the dataset defined in the previous notebook. It can also use historical data to train on beforehand. In this notebook we will walk through curating both a historical dataset and a forward-looking one. Once ALFM finds the historical data, it will start to train better models behind the scenes. The forward-looking data will then be crawled in real time as it arrives and used as inference input. If an anomaly is found, an alert will be sent to the SNS topic defined previously.

First restore the variables from the previous notebook and then import the libraries needed.

In [1]:
%store -r

In [2]:
import boto3
import os
import pandas as pd
import numpy as np
import csv
import botocore.config
import datetime
import dateutil.parser
import random

Just as in the last notebook, connect to AWS through the SDK.

In [3]:
session = boto3.Session(region_name = "us-east-1")
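# "poirot" was the preview-era service name; generally available boto3 releases expose the service as "lookoutmetrics".
poirot = session.client("poirot")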

The cell below contains the logic for generating synthetic data. It is the basis for how we build the entire dataset.

In [4]:
def generate_synthetic_data(start_time, end_time, normal=True, closed=None):
    if not normal:
        print("=> generate_synthetic_data({}, {}, {}, {})".format(start_time, end_time, normal, closed))
    result = []

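    # One timestamp per second. Note: pandas 2.0 removed `closed` from date_range in favor of `inclusive`.
    date_range = pd.date_range(start=start_time, end=end_time, freq='1S', closed=closed)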
    for ts in date_range:
        str_ts = ts.strftime("%Y-%m-%d %H:%M:%S")
        product = 'A' # dimension

        count = np.random.randint(10)
        if normal:
            mu, sigma = 100, 10 # mean and standard deviation
        else:
            mu, sigma = 70, 30 # mean and standard deviation
        price = round(np.random.normal(mu, sigma), 2)
        result.append([str_ts, product, count, price])
        #print("{}, {}, {}, {}".format(ts, product, count, price))
        
        product = 'B'  # dimension
        count = np.random.randint(10, 20)
        if normal:
            mu, sigma = 50, 5 # mean and standard deviation
        else:
            mu, sigma = 50, 1 # mean and standard deviation
        price = round(np.random.normal(mu, sigma), 2)
        result.append([str_ts, product, count, price])
        #print("{}, {}, {}, {}".format(ts, product, count, price))
    
    #print("<= generate_synthetic_data(): {} rows".format(len(result)))
    return result


def write_data_to_csv(data, csv_path):
    with open(csv_path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file, quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(["timestamp", "product", "count", "price"])
        for row in data:
            writer.writerow(row)
            

def write_csv_to_s3(csv_path, s3_bucket, s3_key):
    s3 = boto3.client('s3', config=botocore.config.Config(s3={'addressing_style':'path'}))
    with open(csv_path, "rb") as f:
        s3.upload_fileobj(f, s3_bucket, s3_key)
    return "s3://%s/%s" % ( s3_bucket, s3_key )


def simulate_data(timestamp, normal=True):    

    time = dateutil.parser.parse(str(timestamp))
    #print("using time = {}".format(time))
        
    interval_minutes = 5

    interval = datetime.timedelta(minutes=interval_minutes)  # generate new data every time interval, e.g. 5 minutes

    start_time = time - interval
    data = generate_synthetic_data(start_time, time, normal, closed='left')

    csv_file = "{}-{:02d}-{:02d}-{:02d}-{:02d}.csv".format(start_time.year, start_time.month, start_time.day, start_time.hour, start_time.minute)
    csv_path = "/tmp/{}".format(csv_file)
    write_data_to_csv(data, csv_path)
    #print("wrote to {}".format(csv_file))
    
    #print("project = {}, s3_bucket = {}".format(project, s3_bucket))
    if project is None or s3_bucket is None:
        print("both 'project' and 's3_bucket' must be set as environment variables to upload to S3, so skipping S3 upload")
    else:
        s3_key = "{}/{:02d}/{:02d}/{:02d}-{:02d}/{}".format(start_time.year, start_time.month, start_time.day, start_time.hour, start_time.minute, csv_file)
        s3_url = write_csv_to_s3(csv_path, s3_bucket, s3_key)
        #print("uploaded {}".format(s3_url))
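
The difference between a normal and an anomalous interval comes down to the distribution the prices are drawn from. The following is a minimal sketch of that difference for product A, assuming only the parameters hard-coded in `generate_synthetic_data`. The helper `sample_prices` is hypothetical, and it uses a seeded NumPy generator (instead of the global `np.random` calls above) purely so the sketch is repeatable.

```python
import numpy as np

# Parameters copied from generate_synthetic_data for product 'A'.
NORMAL_A = (100, 10)      # mean, standard deviation
ANOMALOUS_A = (70, 30)    # lower mean, much wider spread


def sample_prices(mu, sigma, n=300, seed=0):
    """Draw n prices, one per second of a 5-minute file (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded only to make the sketch repeatable
    return np.round(rng.normal(mu, sigma, size=n), 2)


normal_prices = sample_prices(*NORMAL_A)
anomalous_prices = sample_prices(*ANOMALOUS_A)

print("normal:    mean={:.1f} std={:.1f}".format(normal_prices.mean(), normal_prices.std()))
print("anomalous: mean={:.1f} std={:.1f}".format(anomalous_prices.mean(), anomalous_prices.std()))
```

For product B the mean stays at 50 and only the standard deviation changes (5 for normal data, 1 for anomalous data), so its anomalies show up as an unusually narrow spread and not as a shift in the average.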

## Running The Simulation

Knowing that the data needs to exist in 5-minute intervals, the next question is when to start. To keep things simple, we start at the beginning of the current day and generate 1,000 five-minute slices. Depending on when you run this exercise, that provides a sample of historical data as well as some forward-looking data.

Once we have the range, we loop over it. For each slice, we call our function to create synthetic data and write it to S3. From there you can sit back and wait to be notified of any anomalies.
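
To see where each slice ends up, here is a small sketch that mirrors the string formatting used in `simulate_data`. The helper name `s3_key_for` is hypothetical and nothing is uploaded. It only shows that a file is named after the start of its interval, while the timestamp passed in marks the end.

```python
import datetime


def s3_key_for(interval_end, interval_minutes=5):
    """Return (csv_file, s3_key) for the interval ending at interval_end.

    Hypothetical helper that copies the formatting used in simulate_data.
    """
    start_time = interval_end - datetime.timedelta(minutes=interval_minutes)
    csv_file = "{}-{:02d}-{:02d}-{:02d}-{:02d}.csv".format(
        start_time.year, start_time.month, start_time.day,
        start_time.hour, start_time.minute)
    s3_key = "{}/{:02d}/{:02d}/{:02d}-{:02d}/{}".format(
        start_time.year, start_time.month, start_time.day,
        start_time.hour, start_time.minute, csv_file)
    return csv_file, s3_key


csv_file, s3_key = s3_key_for(datetime.datetime(2020, 11, 4, 0, 25))
print(s3_key)  # 2020/11/04/00-20/2020-11-04-00-20.csv
```

One consequence of this layout is that the first timestamp in the range (midnight) produces a file for the last five minutes of the previous day.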


In [5]:
# Get Current Date as a string:
todays_date = datetime.datetime.now().strftime("%Y-%m-%d")
print(todays_date)
#todays_date = "2020-10-22"

2020-11-04


In [6]:
# Build a time series of 1000 5 min slices from the start of the day
times = pd.date_range(todays_date, periods=1000, freq='5Min')
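
As a quick sanity check, the range can be inspected before any data is generated. The sketch below uses a fixed example date in place of `todays_date` so that the result is repeatable. The names `example_date` and `example_times` are assumptions made for illustration.

```python
import pandas as pd

# A fixed date stands in for todays_date so the sketch is repeatable.
example_date = "2020-11-04"
example_times = pd.date_range(example_date, periods=1000, freq="5min")

print(example_times[0])                      # first interval end
print(example_times[-1])                     # last interval end
print(example_times[-1] - example_times[0])  # total span covered
```

With 1,000 slices of five minutes each, the range covers roughly three and a half days. Slices that fall before the moment you run the notebook act as historical data, and the rest act as forward-looking data.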

The cell below generates the data and uploads it to S3 for you. The output shown is printed only for the anomalous intervals. Save it in a text file so you can later validate that the alerts were delivered correctly.

In [7]:
for ts in times:
    chance = random.randint(1,100)
    if chance <= 5:
        simulate_data(timestamp=ts, normal=False)
    else:
        simulate_data(timestamp=ts, normal=True)

=> generate_synthetic_data(2020-11-04 00:20:00, 2020-11-04 00:25:00, False, left)
=> generate_synthetic_data(2020-11-04 02:25:00, 2020-11-04 02:30:00, False, left)
=> generate_synthetic_data(2020-11-04 02:45:00, 2020-11-04 02:50:00, False, left)
=> generate_synthetic_data(2020-11-04 04:10:00, 2020-11-04 04:15:00, False, left)
=> generate_synthetic_data(2020-11-04 07:30:00, 2020-11-04 07:35:00, False, left)
=> generate_synthetic_data(2020-11-04 09:00:00, 2020-11-04 09:05:00, False, left)
=> generate_synthetic_data(2020-11-04 09:45:00, 2020-11-04 09:50:00, False, left)
=> generate_synthetic_data(2020-11-04 09:55:00, 2020-11-04 10:00:00, False, left)
=> generate_synthetic_data(2020-11-04 10:20:00, 2020-11-04 10:25:00, False, left)
=> generate_synthetic_data(2020-11-04 14:20:00, 2020-11-04 14:25:00, False, left)
=> generate_synthetic_data(2020-11-04 15:05:00, 2020-11-04 15:10:00, False, left)
=> generate_synthetic_data(2020-11-04 15:40:00, 2020-11-04 15:45:00, False, left)
=> generate_synt

## What Next?

Note the timestamps above: those are the anomalous events that have been written into your dataset. Keep a lookout for any alerts that reach your mobile number to see how the service performs with this synthetic dataset.

You can also review the anomalies in the console later.
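
If you would rather not copy the cell output by hand, the anomalous slices could be collected and written to a file instead. This is only a sketch under stated assumptions: `pick_anomalous_slices` and the output path are hypothetical, the draw uses its own seeded generator in place of the global `random` module, and it does not call `simulate_data` itself.

```python
import random

import pandas as pd


def pick_anomalous_slices(times, anomaly_percent=5, seed=None):
    """Return the timestamps that would be simulated with normal=False.

    Hypothetical helper using the same 1-to-100 draw as the simulation loop.
    """
    rng = random.Random(seed)
    return [ts for ts in times if rng.randint(1, 100) <= anomaly_percent]


example_times = pd.date_range("2020-11-04", periods=1000, freq="5min")
anomalous = pick_anomalous_slices(example_times, seed=42)

# Hypothetical output path; one interval-end timestamp per line.
with open("/tmp/anomalous_intervals.txt", "w") as f:
    for ts in anomalous:
        f.write(ts.strftime("%Y-%m-%d %H:%M:%S") + "\n")

print("{} of {} slices flagged".format(len(anomalous), len(example_times)))
```

To use this with the simulation, one option would be to build the list first and then pass `normal=False` to `simulate_data` only for timestamps in that list. With a 5% chance per slice, about 50 of the 1,000 slices should be flagged on average.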