# Implementation with Step Functions

Now that we have a satisfactory machine learning model, we want to run it as part of a process that executes every day at 3 AM. The process uses the latest transactional data that we have dumped on S3 to create forecasts for each reseller.
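
As a rough illustration of the daily 3 AM trigger, the sketch below builds an EventBridge (CloudWatch Events) schedule expression and registers it as a rule. The function names and the rule name you pass in are hypothetical and not part of this lab. The sketch also assumes the schedule is expressed in UTC, which is how EventBridge rule cron expressions are evaluated.

```python
def daily_schedule_expression(hour_utc: int, minute: int = 0) -> str:
    """Build an EventBridge cron expression that fires once a day (UTC)."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Field order: minutes hours day-of-month month day-of-week year.
    # One of day-of-month / day-of-week has to be '?'.
    return f"cron({minute} {hour_utc} * * ? *)"


def create_daily_rule(rule_name: str, hour_utc: int = 3):
    """Create or update the scheduled rule. Needs AWS credentials."""
    # Imported here so the pure helper above works without boto3 installed.
    import boto3

    events = boto3.client("events")
    return events.put_rule(
        Name=rule_name,  # hypothetical name chosen by the caller
        ScheduleExpression=daily_schedule_expression(hour_utc),
        State="ENABLED",
    )
```

Calling `daily_schedule_expression(3)` gives the expression for 03:00 UTC. The rule alone does nothing until a target is attached to it, which this sketch leaves out.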


## 1. Upload data to S3

First, create a bucket for this experiment. Then upload the data from the following public location to your own S3 bucket. To make the crawler's work easier, use two different prefixes (folders): one for the billing information and one for the reseller information.
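
If you prefer to create the bucket from code rather than from the console, a minimal sketch with boto3 could look like the following. The helper names are hypothetical, and the region is an assumption you need to replace with your own. The one detail worth noting is that `us-east-1` must not be passed as a `LocationConstraint`.

```python
def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Build the arguments for S3 create_bucket for a given region."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default region and is rejected as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket: str, region: str):
    """Create the bucket. Needs AWS credentials and a globally unique name."""
    # Imported here so the pure helper above works without boto3 installed.
    import boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket, region))
```

The `billing` and `reseller` prefixes do not need to be created separately. They come into existence when the first object is uploaded under them.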

### Download the data

In [1]:
# your bucket name
your_bucket = 'blackb-mggaska-implementation'

In [7]:
!wget https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/billing_sm.csv
!wget https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/reseller_sm.csv
!wget https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/awswrangler-0.0b2-py3.6.egg

--2019-07-02 16:35:59--  https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/billing_sm.csv
Resolving ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)... 52.216.20.203
Connecting to ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)|52.216.20.203|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15803443 (15M) [binary/octet-stream]
Saving to: ‘billing_sm.csv.1’


2019-07-02 16:35:59 (99.2 MB/s) - ‘billing_sm.csv.1’ saved [15803443/15803443]

--2019-07-02 16:35:59--  https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/reseller_sm.csv
Resolving ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)... 52.216.20.203
Connecting to ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)|52.216.20.203|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 210111 (205K) [binary/octet-stream]
Saving to: ‘reseller_sm.csv.1’


2019-07-02 16:35:59 (32.0 MB/s) - ‘reseller_sm.csv.1’ saved [210111/2

Now we upload the files to our S3 bucket, placing each one under its own prefix.

In [8]:
import boto3

# Reuse one bucket handle for all uploads. S3 keys always use '/' as the
# separator, so they are written out directly instead of using os.path.join.
bucket = boto3.Session().resource('s3').Bucket(your_bucket)
bucket.upload_file('billing_sm.csv', 'billing/billing_sm.csv')
bucket.upload_file('reseller_sm.csv', 'reseller/reseller_sm.csv')
bucket.upload_file('awswrangler-0.0b2-py3.6.egg', 'python/awswrangler-0.0b2-py3.6.egg')


## 2. Create Crawler

To use this CSV data in a Glue ETL job, we first have to create a Glue crawler pointing to the location of each file. The crawler tries to infer the data type of each column. The safest approach is to create one crawler per table, each pointing to a different location.

1. Go to the AWS Console.
2. Under Services, select AWS Glue.
3. Under Crawlers, choose Add Crawler and create two crawlers, one pointing to each S3 location (one for billing and one for reseller).

    3.1 Name: Billing. Data Store: Specific Path in my Account; navigate to your bucket and your billing folder. Create an IAM role billing-crawler-role, add database implementationdb, then Next, Finish.
    
    3.2 After the crawler is created select Run it now.
    
    3.3 Name: Reseller. Data Store: Specific Path in my Account; navigate to your bucket and your reseller folder. Create an IAM role reseller-crawler-role, select database implementationdb, then Next, Finish.
    
    3.4 After the crawler is created select Run it now.
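
The console steps above can also be sketched with boto3. This is an illustration, not the lab's required path: the helper names are hypothetical, and unlike the console wizard the code assumes the IAM role and the `implementationdb` database already exist.

```python
def crawler_definition(name: str, role: str, database: str,
                       bucket: str, prefix: str) -> dict:
    """Build the arguments for Glue create_crawler for one S3 prefix."""
    return {
        "Name": name,
        "Role": role,  # assumed to exist already
        "DatabaseName": database,
        # One S3 target per crawler, so each crawler produces one table.
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{prefix}/"}]},
    }


def create_and_run_crawler(definition: dict) -> None:
    """Create the crawler and start it. Needs AWS credentials."""
    # Imported here so the pure helper above works without boto3 installed.
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**definition)
    glue.start_crawler(Name=definition["Name"])
```

For this lab you would build one definition with the `billing` prefix and one with the `reseller` prefix, then pass each to `create_and_run_crawler`.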

After both crawlers have run, you should see that one table has been added for each. You can use Athena to inspect the tables and double-check that the data has been loaded properly.
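
One way to do that check from code is to start a row-count query through the Athena API. The sketch below makes several assumptions: the helper names are hypothetical, the table names match whatever the crawlers created (often the folder name, such as `billing`), and `output_location` is an S3 path of your choosing where Athena may write results.

```python
def row_count_query(table: str) -> str:
    """Build a simple sanity-check query for one crawled table."""
    return f'SELECT COUNT(*) AS row_count FROM "{table}"'


def start_row_count(table: str, database: str, output_location: str) -> str:
    """Start the query in Athena and return its execution id."""
    # Imported here so the pure helper above works without boto3 installed.
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=row_count_query(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Athena runs queries asynchronously, so the returned id has to be polled with `get_query_execution` before the result can be read with `get_query_results`.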

## 3. Create Glue Job

Now we are going to create a Glue ETL job in Python 3.6. In this job we can combine both the ETL from Notebook #2 and the preprocessing pipeline from Notebook #4.

Note that instead of reading from a CSV file, we are going to use Athena to read from the tables produced by the Glue crawlers.

Glue is a "serverless" service, so the processing power assigned to a job is measured in DPUs (data processing units). Each DPU is equivalent to 4 vCPUs and 16 GB of RAM.
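
To make the DPU arithmetic concrete, here is a small helper that converts a DPU allocation into memory and vCPUs using the ratio stated above. The function name is hypothetical. The fractional value in the usage note is the smaller capacity that Python Shell jobs can run with.

```python
GB_PER_DPU = 16    # memory per DPU, as stated above
VCPU_PER_DPU = 4   # vCPUs per DPU, as stated above


def dpu_resources(dpus: float) -> dict:
    """Translate a DPU allocation into memory (GB) and vCPUs."""
    return {
        "memory_gb": dpus * GB_PER_DPU,
        "vcpus": dpus * VCPU_PER_DPU,
    }
```

With this ratio, a job with a maximum capacity of 1 DPU gets 16 GB and 4 vCPUs, while a Python Shell job running at 0.0625 DPU (1/16 of a DPU) gets 1 GB and a quarter of a vCPU.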

1. Open the AWS Console.
2. Under Services, go to AWS Glue.
3. Under Jobs, add a new job.
4. Name: etlandpipeline, Role: Glueadmin, Type: Python Shell, Python 3, and select A new script authored by you.
Under Security Configuration..., select Python library path and browse to the location of the AWS Wrangler egg. Under Maximum capacity, enter 1. Then choose "Save job and edit script".
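
For reference, the same job settings can be expressed as arguments to the Glue `create_job` API. This is a sketch under assumptions: the helper names are hypothetical, and `script_location` is an S3 path you choose for the job script, which the console otherwise generates for you. The Python library path from the console corresponds to the `--extra-py-files` default argument.

```python
def python_shell_job_definition(name: str, role: str, script_location: str,
                                egg_location: str,
                                max_capacity: float = 1.0) -> dict:
    """Build the arguments for Glue create_job for a Python Shell job."""
    return {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "pythonshell",              # job type: Python Shell
            "ScriptLocation": script_location,  # hypothetical S3 path to the script
            "PythonVersion": "3",
        },
        # Extra libraries (here the AWS Wrangler egg) are passed as a job argument.
        "DefaultArguments": {"--extra-py-files": egg_location},
        "MaxCapacity": max_capacity,  # in DPUs
    }


def create_job(definition: dict):
    """Create the Glue job. Needs AWS credentials."""
    # Imported here so the pure helper above works without boto3 installed.
    import boto3

    return boto3.client("glue").create_job(**definition)
```

For this lab the egg location would be the `python/awswrangler-0.0b2-py3.6.egg` object uploaded earlier, written as a full `s3://` path in your bucket.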
