# Prerequisite: Set up an S3 bucket

This notebook assumes you completed the earlier steps in `README.md`. If you did not, go back and do that; the notebook will wait patiently for you to come back.

In this notebook you will use a provided e-commerce dataset to demonstrate the functionality of Amazon Lookout for Metrics (L4M). This is meant to be educational and to guide you through an approach that should work well for your own datasets.

If you are looking to leverage your own dataset, simply export it as a CSV and follow along here after the extraction steps. 

## Data set-up workflow:

1. Create a bucket
2. Uncompress the dataset
3. Format the data to be read by Lookout for Metrics (already done)
4. Save data to bucket

After these steps have been completed you are ready to get started exploring the data with Amazon Lookout For Metrics.

## Import libraries

In [None]:
import os
import json
import shutil
import zipfile
import pathlib
import pandas as pd
import boto3
import utility
import synth_data

### Create a bucket

As mentioned above, data needs to exist somewhere. Run the next cells to create a bucket for you to use.

Note that in the very next cell you can define whether you are using SageMaker. If you are not using SageMaker, change `USING_SAGEMAKER` to `False` AND manually set `region` to the correct value.

In [None]:
# Set to False if you are not using SageMaker
USING_SAGEMAKER = True
# If you aren't using SageMaker, set the region explicitly, for example region = "us-east-1"
region = None

if USING_SAGEMAKER: 
    with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
        data = json.load(notebook_info)
        resource_arn = data['ResourceArn']
        region = resource_arn.split(':')[3]
print(region)
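
The cell above relies on the standard ARN layout `arn:partition:service:region:account-id:resource`, in which the region is the fourth colon-separated field. As a minimal sketch, the same parsing can be wrapped in a small function with basic validation. The name `region_from_arn` and the example ARN are hypothetical and used only for illustration.

```python
def region_from_arn(arn: str) -> str:
    """Return the region field of an ARN.

    ARNs follow the layout arn:partition:service:region:account-id:resource,
    so the region is the fourth colon-separated field (index 3).
    """
    # Split at most 5 times so colons inside the resource part stay intact.
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    return parts[3]


# Hypothetical ARN, for illustration only.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:notebook-instance/example"
print(region_from_arn(example_arn))
```

Failing loudly on a malformed ARN is a design choice here; the original cell simply indexes into the split result.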

In [None]:
account_id = boto3.client('sts').get_caller_identity().get('Account')
s3_bucket = account_id + "-lookoutmetrics-lab"

utility.create_bucket(s3_bucket, region=region)

s3_bucket
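
The implementation of `utility.create_bucket` is not shown in this notebook. As a rough sketch of what a bucket-creation helper could look like with `boto3`, the code below assumes only the documented S3 behavior that `us-east-1` must be requested without a `LocationConstraint`, while other regions require one. The names `create_bucket_kwargs` and `create_bucket_sketch` are hypothetical and are not part of the provided `utility` module.

```python
def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build keyword arguments for the S3 CreateBucket call.

    us-east-1 is the default location and must not be passed as a
    LocationConstraint; every other region needs one.
    """
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket_sketch(bucket_name: str, region: str) -> None:
    """Hypothetical helper: create the bucket in the given region."""
    # Imported lazily so the pure helper above works without boto3 installed.
    import boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket_name, region))
```

Calling `create_bucket_sketch(s3_bucket, region)` would issue a real request against your account, so treat it as illustrative rather than a replacement for the provided helper.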

### Uncompress dataset

Next, uncompress the archive that was provided. You would skip this if you are bringing your own data; however, you should still create a folder named `data`. You can do that by right-clicking in the panel to the left and creating a new folder with the dialog.

In [None]:
data_dirname = os.path.join("./data")

if os.path.exists(data_dirname):
    shutil.rmtree(data_dirname)
os.makedirs(data_dirname)

zip_filename = os.path.join("./ecommerce.zip")

with zipfile.ZipFile(zip_filename, "r") as zip_fd:
    zip_fd.extractall(data_dirname)

Before you proceed to the next step let's take a quick look at the folder structure.


In [None]:
paths = utility.DisplayablePath.make_tree(pathlib.Path('data'))
for path in paths:
    print(path.displayable())

Specifically, notice that there is only one `input.csv` in the `backtest` folder, whereas data in the `live` folder is broken down into days (ex: `20210101` for January 1, 2021) and hours (ex: `0200` for 2:00 AM).

This path structure for the live data is *CRITICAL* for using the service to detect live information. Your own datasets must be built in a similar manner so that Lookout for Metrics will understand how to find data in the future. Soon we will have a sample that showcases how to stream data from Lambda or Kinesis into a structure like this.

**Note: it is totally fine to create these live data points later.** The point is to upload data just before the time for anomaly detection, so you will need some form of automated process to do this.
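
The day and hour layout described above can be sketched as a small path builder. This is a minimal illustration that assumes hourly data and the file name pattern `YYYYMMDD_HHMMSS.csv` seen in the sample live file read later in this notebook. The function name `live_key` and the `ecommerce/live` root are illustrative choices.

```python
from datetime import datetime
from pathlib import PurePosixPath


def live_key(root: str, ts: datetime) -> str:
    """Build a path such as <root>/20210101/0200/20210101_020000.csv."""
    day = ts.strftime("%Y%m%d")     # day folder, e.g. 20210101
    hour = ts.strftime("%H%M")      # hour folder, e.g. 0200
    filename = ts.strftime("%Y%m%d_%H%M%S") + ".csv"
    # PurePosixPath keeps forward slashes on every OS, as S3 keys expect.
    return str(PurePosixPath(root) / day / hour / filename)


print(live_key("ecommerce/live", datetime(2021, 1, 1, 2, 0)))
# ecommerce/live/20210101/0200/20210101_020000.csv
```

An automated uploader could call a function like this with the current hour to decide where each new file belongs.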

Also, notice that our data goes far into the future. Of course, this is unrealistic for any real-world scenario, but it works for this demonstration.

Now when you take a quick look into the data, you will notice the schema for `backtest` and `live` data is identical. 

If you are providing your own data to understand the service, again it is totally fine to just use your backtesting data.

In [None]:
backtest_df = pd.read_csv('data/ecommerce/backtest/input.csv')
backtest_df.head()

In [None]:
live_sample_df = pd.read_csv('data/ecommerce/live/20220425/0000/20220425_000000.csv')
live_sample_df.head()
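
Rather than comparing the two previews by eye, you could check the headers programmatically. The sketch below uses only the standard library `csv` module and compares the first row of each file. The function names `csv_header` and `same_schema` are hypothetical, and the check covers column names and order only, not data types.

```python
import csv


def csv_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="") as f:
        # An empty file yields an empty header instead of raising StopIteration.
        return next(csv.reader(f), [])


def same_schema(path_a, path_b):
    """True when both files have the same column names in the same order."""
    return csv_header(path_a) == csv_header(path_b)
```

For example, `same_schema('data/ecommerce/backtest/input.csv', 'data/ecommerce/live/20220425/0000/20220425_000000.csv')` should return `True` if the two schemas match as described.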

To ensure you have live data into the future, run the cell below. It updates your data folders so that they contain an up-to-date history as well as data well into the future. This is crucial for generating alerts later when using continuous mode. If you aren't going that route, feel free to skip this step and move on to the bucket syncing step.

In [None]:
synth_data.generate_data()

### Save data to bucket

Finally, save the data to your S3 bucket. Note the `--quiet` at the end of the command: it prevents the output from consuming a ton of resources in this browser window (you'd see thousands of files listed here without it). The sync will take a few minutes to complete.

**Important:** The cell below references a folder called `ecommerce`. Update it to the name of your dataset folder. If you placed your content directly inside the `data` folder, delete the `ecommerce` bit and leave one trailing slash.

In [None]:
!aws s3 sync {data_dirname}/ecommerce/ s3://{s3_bucket}/ecommerce/ --quiet
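
If the AWS CLI is not available in your environment, a similar upload could be sketched in Python. The code below is an illustration under stated assumptions, not an equivalent of `aws s3 sync`: it uploads every file unconditionally and does not skip unchanged objects. The function names `s3_keys_for` and `upload_folder_sketch` are hypothetical.

```python
from pathlib import Path


def s3_keys_for(local_root, prefix):
    """Map every file under local_root to an S3 key under prefix.

    Returns a list of (local_path, key) pairs, sorted by local path.
    """
    root = Path(local_root)
    pairs = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # as_posix() keeps forward slashes in the key on every OS.
            rel = path.relative_to(root).as_posix()
            pairs.append((str(path), f"{prefix.rstrip('/')}/{rel}"))
    return pairs


def upload_folder_sketch(local_root, bucket, prefix):
    """Hypothetical helper: upload every file under local_root to the bucket."""
    # Imported lazily so the key mapping above works without boto3 installed.
    import boto3

    s3 = boto3.client("s3")
    for local_path, key in s3_keys_for(local_root, prefix):
        s3.upload_file(local_path, bucket, key)
```

With the variables from this notebook, `upload_folder_sketch(data_dirname + "/ecommerce", s3_bucket, "ecommerce")` would target the same keys as the CLI command above.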

To make things easier for yourself, use the magic functions of IPython to save a few variables for later.

In [None]:
%store s3_bucket

## Configuring IAM

Before Lookout for Metrics can read your data, an IAM role must be created so that the service can communicate with S3. Additionally, you will need to enable SNS support if you wish to receive alerts later. The cell below creates that role for you and returns its ARN so that you can use it later via the notebooks or the console.

In [None]:
role_name = "L4MTestRole"
role_arn = utility.get_or_create_iam_role(role_name)
%store role_arn
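
The body of `utility.get_or_create_iam_role` is not shown here, so the following is only a sketch of the general get-or-create pattern with `boto3`. It assumes the service principal for Lookout for Metrics is `lookoutmetrics.amazonaws.com`, and it deliberately leaves out the S3 and SNS permission policies that a usable role also needs. The names `lookout_trust_policy` and `get_or_create_role_sketch` are hypothetical.

```python
import json


def lookout_trust_policy() -> dict:
    """Trust policy that lets the service assume the role.

    Assumption: the Lookout for Metrics service principal is
    lookoutmetrics.amazonaws.com.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lookoutmetrics.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def get_or_create_role_sketch(role_name: str) -> str:
    """Hypothetical helper: return the role ARN, creating the role if missing."""
    # Imported lazily so the policy builder above works without boto3 installed.
    import boto3

    iam = boto3.client("iam")
    try:
        return iam.get_role(RoleName=role_name)["Role"]["Arn"]
    except iam.exceptions.NoSuchEntityException:
        response = iam.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(lookout_trust_policy()),
        )
        # Note: S3 and SNS permissions would still need to be attached.
        return response["Role"]["Arn"]
```

Checking for an existing role first makes the cell safe to re-run, which matters in a notebook where cells are often executed more than once.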

## Next Steps

With data loaded into S3 you can now move on to working with Lookout for Metrics. It is recommended that you explore your historical data via Backtesting first, so continue on to `2.BacktestingWithHistoricalData.ipynb`.