# Generate Backtesting Data for Amazon Lookout for Metrics

Amazon Lookout for Metrics (ALFM) also supports backtesting against your historical data. In this notebook you will generate sample data, upload it to S3, and then be guided through the backtesting functionality within the console. Once the backtesting job has completed, you can see all of the anomalies that ALFM detected in the last 20% of your historical data. From here you can begin to unpack the kinds of results you will see from ALFM in the future when you start streaming in new data. **NOTE: YOU MUST CREATE A NEW DETECTOR TO LEVERAGE REAL TIME DATA. BACKTESTING IS ONLY FOR EXPLORATION.**

This notebook assumes that you have already executed the real time example in the previous notebooks. If you have not, go back and complete those first, then give backtesting a whirl while you wait for the real time anomalies to alert you!

First restore the variables from the previous notebook and then import the libraries needed.

In [1]:
%store -r

In [2]:
import boto3
import os
import pandas as pd
import numpy as np
import csv
import botocore.config
import datetime
import dateutil.parser
import random

Just as in the last notebook, connect to AWS through the SDK.

In [3]:
session = boto3.Session(region_name = "us-east-1")
poirot = session.client("poirot")

The cell below contains the logic for generating synthetic data. It is the basis for how we build the entire dataset.

In [4]:
def generate_synthetic_data(logfile, start_time, end_time, normal=True, closed=None):
    if not normal:
        logfile.writelines("=> generate_synthetic_data({}, {}, {}, {})".format(start_time, end_time, normal, closed))
        logfile.writelines("\n")
    result = []

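    # pandas 1.4 renamed date_range's `closed` argument to `inclusive`, and pandas 2.0 removed `closed`.
    # On pandas older than 1.4, pass closed=closed instead.
    date_range = pd.date_range(start=start_time, end=end_time, freq='5Min', inclusive=closed or 'both')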
    for ts in date_range:
        str_ts = ts.strftime("%Y-%m-%d %H:%M:%S")
        product = 'A' # dimension

        count = np.random.randint(10)
        if normal:
            mu, sigma = 100, 10 # mean and standard deviation
        else:
            mu, sigma = 70, 30 # mean and standard deviation
        price = round(np.random.normal(mu, sigma), 2)
        result.append([str_ts, product, count, price])
        #print("{}, {}, {}, {}".format(ts, product, count, price))
        
        product = 'B'  # dimension
        count = np.random.randint(10, 20)
        if normal:
            mu, sigma = 50, 5 # mean and standard deviation
        else:
            mu, sigma = 50, 1 # mean and standard deviation
        price = round(np.random.normal(mu, sigma), 2)
        result.append([str_ts, product, count, price])
        #print("{}, {}, {}, {}".format(ts, product, count, price))
    
    #print("<= generate_synthetic_data(): {} rows".format(len(result)))
    return result


def simulate_data(logfile, csvfile, timestamp, normal=True):    

    time = dateutil.parser.parse(str(timestamp))
    #print("using time = {}".format(time))
        
    interval_minutes = 5
    
    interval = datetime.timedelta(minutes=interval_minutes)  # generate new data every time interval, e.g. 5 minutes

    start_time = time - interval
    data = generate_synthetic_data(logfile, start_time, time, normal, closed='left')
    
    for row in data:
        csvfile.writerow(row)
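
To get a feel for how far apart the two regimes in `generate_synthetic_data` really are, the sketch below computes how often an anomalous price would still land inside the normal regime's three-sigma band. It uses the standard library's `statistics.NormalDist` as an analytic stand-in for `np.random.normal`, ignores the rounding to two decimals, and the helper name `share_inside_normal_band` is hypothetical. Note that `count` is drawn the same way in both regimes, so only `price` carries the injected anomaly.

```python
from statistics import NormalDist

def share_inside_normal_band(normal, anomalous, n_sigmas=3):
    """Probability that a draw from `anomalous` lies within
    mean +/- n_sigmas * stdev of the `normal` distribution."""
    low = normal.mean - n_sigmas * normal.stdev
    high = normal.mean + n_sigmas * normal.stdev
    return anomalous.cdf(high) - anomalous.cdf(low)

# (mu, sigma) pairs copied from generate_synthetic_data
share_a = share_inside_normal_band(NormalDist(100, 10), NormalDist(70, 30))
share_b = share_inside_normal_band(NormalDist(50, 5), NormalDist(50, 1))

print(f"Product A: {share_a:.1%} of anomalous prices fall inside the normal band")
print(f"Product B: {share_b:.1%} of anomalous prices fall inside the normal band")
```

Under these assumptions a little under half of product A's anomalous prices look normal point by point, and product B's anomaly is a drop in variance rather than a shift in level, so not every injected anomaly should be expected to stand out on its own.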

## Generating the Simulated Data

At present ALFM supports 5,000 historical data points for backtesting, so in this case we are going to generate 5,000 historical intervals of 5 minutes each, a small share of them anomalous. Backtesting will also only consider points starting from the moment when you create your detector. For that reason we start with today's date and then walk backwards 17 days (roughly 5,000 5-minute slots). From here you will have a dataset ready for validating anomalies in the past.
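
The 17-day figure follows from simple arithmetic. The sketch below uses only the 5-minute interval and the 5,000-point limit quoted above; the variable names are illustrative.

```python
import datetime

interval = datetime.timedelta(minutes=5)
max_points = 5000  # backtesting limit quoted above

intervals_per_day = datetime.timedelta(days=1) // interval
points_in_17_days = 17 * intervals_per_day
span_of_max_points = max_points * interval

print(intervals_per_day)    # 288
print(points_in_17_days)    # 4896
print(span_of_max_points)   # 17 days, 8:40:00
```

So 17 days covers slightly fewer than 5,000 slots, and a full 5,000 slots spans a little more than 17 days.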


In [None]:
todays_date = datetime.datetime.now()
# Delta of 17 days is roughly 5,000 data points for the service
delta_date = datetime.timedelta(days=17)
start_date = todays_date - delta_date
start_date = start_date.strftime("%Y-%m-%d")
print(start_date)

In [6]:
# Build a time series of 5 min slices from the start date
times = pd.date_range(start_date, periods=5000, freq='5Min')

The cells below generate the data and upload it to S3 for you. They also create a file `anomalies.log` that records all the anomalous events that were created. Here we use a 1% chance of an anomaly occurring in any period. You can raise the probability above 1 if you like, but if anomalies become too frequent... they cease to be anomalous.

In [7]:
# Must be an integer between 0 and 100
probabilty_of_anomaly_in_percent = 1
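
As a rough guide to what this setting produces, the sketch below works out the expected number of anomalous intervals and compares it with a seeded simulation of the same `random.randint(1, 100)` draw that the generation loop uses. The helper names and the seed are illustrative assumptions.

```python
import random

def expected_anomalous_intervals(n_intervals, percent):
    """Expected number of anomalous intervals when each one is
    independently anomalous with probability percent / 100."""
    return n_intervals * percent / 100

def simulate_anomalous_intervals(n_intervals, percent, seed=0):
    """Count how many of n_intervals draws of randint(1, 100) are <= percent."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 100) <= percent for _ in range(n_intervals))

print(expected_anomalous_intervals(5000, 1))   # 50.0
print(simulate_anomalous_intervals(5000, 1))   # varies with the seed
```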

In [8]:
# Open the anomaly log and the CSV file, then write the rows for each interval
with open("anomalies.log", 'w') as logfile:
    with open("backtestdata.csv", 'w', newline='') as csv_file:
        writer = csv.writer(csv_file, quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(["timestamp", "product", "count", "price"])
        for ts in times:
            chance = random.randint(1,100)
            if chance <= probabilty_of_anomaly_in_percent:
                simulate_data(logfile, writer, timestamp=ts, normal=False)
            else:
                simulate_data(logfile, writer, timestamp=ts, normal=True)
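
Before uploading, it can be worth confirming that the file has the expected shape: one header, then two rows (products `A` and `B`) per interval. The following is a minimal sketch using only the standard library; `summarize_backtest_csv` is a hypothetical helper, not part of the original notebook.

```python
import csv
from collections import Counter

def summarize_backtest_csv(path):
    """Return row counts per product plus the first and last timestamps."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    timestamps = [row["timestamp"] for row in rows]
    return {
        "rows": len(rows),
        "rows_per_product": dict(Counter(row["product"] for row in rows)),
        # Zero-padded "YYYY-MM-DD HH:MM:SS" strings sort chronologically as text
        "first_timestamp": min(timestamps) if timestamps else None,
        "last_timestamp": max(timestamps) if timestamps else None,
    }

# Usage in the notebook, after the cell above has run:
# summarize_backtest_csv("backtestdata.csv")
```

With 5,000 intervals this should report 10,000 rows, 5,000 for each product.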

In [9]:
# Upload the Data
s3 = boto3.client('s3', config=botocore.config.Config(s3={'addressing_style':'path'}))
with open("backtestdata.csv", "rb") as f:
    s3.upload_fileobj(f, s3_bucket, "backtestdata.csv")
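
A quick way to confirm the upload, sketched here under the assumption that your role is allowed to call `HeadObject` on the bucket, is to ask S3 for the object's size and compare it with the local file. `uploaded_size_matches` is a hypothetical helper; `head_object` and its `ContentLength` response field belong to the boto3 S3 client.

```python
import os

def uploaded_size_matches(s3_client, bucket, key, local_path):
    """Compare the size S3 reports for bucket/key with the local file size."""
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response["ContentLength"] == os.path.getsize(local_path)

# Usage in the notebook, reusing `s3` and `s3_bucket` from the cells above:
# uploaded_size_matches(s3, s3_bucket, "backtestdata.csv", "backtestdata.csv")
```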

## Running the Simulation

Next you are ready to configure the backtesting job. You will need a few parameters for this, which we can glean from these notebooks. To begin, open the ALFM console in a new browser window or tab by visiting: https://console.aws.amazon.com/poirot/home?region=us-east-1#landing.

From here click the `Create detector` button. 

Next give it a simple name like `InitialBacktestDetector`; note that the name needs to be unique within your account. You can specify whatever you like for the description, then choose `Advertising` for the domain and select an interval of `5 minute intervals`. If a custom encryption key is needed for this workload, this is where you would specify it. Note that ALFM will automatically use a generated key for you in the background if no key is provided. This ensures your information is always encrypted at rest and in transit with ALFM. Lastly, click `Create`.

Now the dataset can be specified. Click the `Add a dataset` button on the page.

This is where we start to select the historical data that was just uploaded.

Start by giving this a name such as `InitialBacktestDataSet`. GMT is perfectly fine for the timezone. For Data source, specify Amazon S3. Then select `Test`.

The cell below prints the path to your data. Copy and paste it into the historical data field.

In [10]:
print("s3://"+s3_bucket+"/"+"backtestdata.csv")

s3://059124553121poirottestbacktest/backtestdata.csv


The defaults provided in the rest of the form are sufficient. Click `Next`.

Now it is time to map the fields of your data into ALFM. Start with the measures, using the following configuration:

```
Fieldname: count
Type: "IMPRESSIONS"
Aggregate by: AVG
```

Then click `Add measure` and enter the next configuration:

```
Fieldname: price
Type: "REVENUE"
Aggregate by: SUM
```

Note that the aggregation is not used here, because the data already arrives at exactly the rate the detector expects: one record per product per 5 minute interval. With a single value in each interval, the aggregation functions have nothing to combine.
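
The reason the aggregation has no effect can be seen by grouping rows by interval and by dimension value, which is how we assume the service buckets them. The sketch below uses a few hand-written rows in the shape of `backtestdata.csv`; the values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# (timestamp, product, count, price): illustrative values only
rows = [
    ("2021-01-01 00:00:00", "A", 3, 101.25),
    ("2021-01-01 00:00:00", "B", 14, 49.80),
    ("2021-01-01 00:05:00", "A", 7, 96.10),
    ("2021-01-01 00:05:00", "B", 11, 51.35),
]

groups = defaultdict(list)
for timestamp, product, count, price in rows:
    groups[(timestamp, product)].append((count, price))

for key, values in groups.items():
    counts = [c for c, _ in values]
    prices = [p for _, p in values]
    # With one row per group, AVG and SUM both return the raw value
    print(key, len(values), mean(counts), sum(prices))
```

Every group holds exactly one row, so `AVG` of `count` and `SUM` of `price` simply return the values that were written.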

Next, click `Add dimension` and select `product`.

Lastly, set up the timestamp: select `timestamp` from the dropdown for the field, enter `yyyy-MM-dd HH:mm:ss` for your Format, and click `Next`.
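
The format entered here describes the same layout as the `%Y-%m-%d %H:%M:%S` pattern that `generate_synthetic_data` uses when writing timestamps. As a sanity check, a small sketch can confirm that a value in that shape parses with the Python pattern; the sample value is made up.

```python
import datetime

PYTHON_FORMAT = "%Y-%m-%d %H:%M:%S"  # what the generator writes
# Console equivalent entered in the form: yyyy-MM-dd HH:mm:ss

sample = "2021-01-01 00:05:00"  # illustrative value, not from a real run
parsed = datetime.datetime.strptime(sample, PYTHON_FORMAT)

# Round-tripping confirms the string matches the pattern exactly
assert parsed.strftime(PYTHON_FORMAT) == sample
print(parsed)
```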

To kick off the training and backtesting process, click `Save and activate`. You will see a warning about costs; click `Activate`.

The training process will run for a bit. Go relax, take a walk, grab a coffee, and come back in about 2 hours to continue on.
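
While you wait, it can help to turn `anomalies.log` into a list of injected anomaly windows that you can later compare with what the detector reports. The sketch below assumes each log line has the shape written by `generate_synthetic_data`, for example `=> generate_synthetic_data(2021-01-01 00:00:00, 2021-01-01 00:05:00, False, left)`; `parse_anomaly_log` is a hypothetical helper.

```python
import datetime
import re

LOG_LINE = re.compile(
    r"=> generate_synthetic_data\((?P<start>.+?), (?P<end>.+?), "
    r"(?P<normal>True|False), (?P<closed>\w+)\)"
)

def parse_anomaly_log(lines):
    """Return (start, end) datetime pairs for every logged anomalous window."""
    windows = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip blank or unrecognized lines
        start = datetime.datetime.fromisoformat(match["start"])
        end = datetime.datetime.fromisoformat(match["end"])
        windows.append((start, end))
    return windows

# Usage in the notebook:
# with open("anomalies.log") as f:
#     for start, end in parse_anomaly_log(f):
#         print(start, "->", end)
```

Because the generator builds a left-closed range, the CSV rows for a logged window carry the window's start time as their timestamp.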
