# Implementation with Step Functions

Now that we have built our data science project, we want to implement it in a robust and repeatable manner. To do this, we are going to deploy the ETL with AWS Glue, and then train the model and batch transform the input using the SageMaker integration with AWS Step Functions.

This notebook guides you through the process step by step.

But first you need to create or choose a bucket of your own. The SageMaker SDK is a good way to start.

## 1. Upload to an S3 Bucket

In [2]:
import sagemaker

In [7]:
ses = sagemaker.Session()
your_bucket = ses.default_bucket()

In [5]:
!wget https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/billing/billing_sm.csv
!wget https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/reseller/reseller_sm.csv

--2020-12-14 18:34:50--  https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/billing/billing_sm.csv
Resolving ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)... 52.217.108.156
Connecting to ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)|52.217.108.156|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15803443 (15M) [binary/octet-stream]
Saving to: ‘billing_sm.csv’


2020-12-14 18:34:50 (64.8 MB/s) - ‘billing_sm.csv’ saved [15803443/15803443]

--2020-12-14 18:34:50--  https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/reseller/reseller_sm.csv
Resolving ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)... 52.217.108.156
Connecting to ml-lab-mggaska.s3.amazonaws.com (ml-lab-mggaska.s3.amazonaws.com)|52.217.108.156|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 210111 (205K) [binary/octet-stream]
Saving to: ‘reseller_sm.csv’


2020-12-14 18:34:51 (31.3 MB/s) - ‘reseller_sm.csv’ sa

In [8]:
import boto3

# Reuse one S3 resource for both uploads; each file goes into its own folder (key prefix).
s3 = boto3.resource('s3')
s3.Bucket(your_bucket).Object('billing/billing_sm.csv').upload_file('billing_sm.csv')
s3.Bucket(your_bucket).Object('reseller/reseller_sm.csv').upload_file('reseller_sm.csv')

## 2. Create the Glue Crawlers

To use this CSV data in a Glue ETL job, we first have to create a Glue crawler pointing to the location of each file. The crawler will try to infer the data type of each column. The safest approach is to create one crawler per table, each pointing to a different location.

Go to the AWS Console and select AWS Glue under Services, or follow <a href='https://us-east-1.console.aws.amazon.com/glue/home?region=us-east-1#catalog:tab=crawlers'>this link</a>.
    

Under Crawlers, choose Add Crawler and create two crawlers, one pointing to each S3 location (one to billing and one to reseller). Start with the billing crawler:
* Crawler Name: Billing - Next
* Crawler source type: Data Store - Next
* Add a data store: S3, Specific Path in my Account, Navigate to your bucket and your folder Billing - Next
* Add another data store: no - Next
* Choose an IAM role: create an IAM role billing-crawler-role (if exists, choose the existing) - Next
* Frequency: run on demand - Next
* Crawler’s output: Add database implementationdb - Next
* Finish


Tips:
- Make sure to name your Glue database "implementationdb".
- Make sure to point to the S3 folder containing each file, not the actual file.
- Make sure to create a new role to run each crawler.
- Don't forget to run the crawler after creating it.

<img src='img/crawler1.png' style='width:400px' />

<img src='img/crawler2.png' style='width:400px' />



Let’s add the second crawler:

* Crawler Name: Reseller - Next
* Crawler source type: Data Store - Next
* Add a data store: S3, Specific Path in my Account, Navigate to your bucket and your folder Reseller - Next
* Add another data store: no - Next
* Choose an IAM role: create an IAM role reseller-crawler-role (if exists, choose the existing) - Next
* Frequency: run on demand - Next
* Crawler’s output: Select database implementationdb - Next
* Finish


Tips:
- Use the same database (implementationdb) but create a different role, as each crawler needs to access a different folder.
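
The console steps above can also be scripted. The following is a minimal sketch, not part of the original lab, that assumes the boto3 Glue client (`create_crawler`, `start_crawler`) and that the roles `billing-crawler-role` and `reseller-crawler-role` already exist. The helper names `crawler_request` and `create_and_start_crawler` are hypothetical.

```python
def crawler_request(name, bucket, folder, role, database="implementationdb"):
    """Build the keyword arguments for glue.create_crawler (no AWS call is made here)."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        # Point at the folder that holds the file, not at the file itself.
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{folder}/"}]},
    }


def create_and_start_crawler(glue, request):
    """Create one crawler and run it once on demand."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Usage (needs AWS credentials; the role names are assumed to exist already):
# import boto3
# glue = boto3.client("glue")
# for name, folder, role in [("Billing", "billing", "billing-crawler-role"),
#                            ("Reseller", "reseller", "reseller-crawler-role")]:
#     create_and_start_crawler(glue, crawler_request(name, your_bucket, folder, role))
```

Keeping the request builder separate from the AWS call makes it easy to inspect the S3 path before anything is created.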


## 3. Test crawlers on Athena

Go to the <a href='https://console.aws.amazon.com/athena/home?region=us-east-1#'> Athena Console </a> and hit Get Started.

Under Settings in the top right corner you can configure an output path for your queries. 
You can use the following value:

In [9]:
f's3://{your_bucket}/athena_results/'

's3://sagemaker-us-east-1-646862220717/athena_results/'

After you set a destination for query results, you can preview the tables created by the crawlers.

<img src='img/athena1.png' style='width:500px'>
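
The same preview can be requested from the notebook. This is a hedged sketch that assumes the boto3 Athena client (`start_query_execution`) and that the crawlers named the tables `billing` and `reseller` after their folders; check the actual table names in the Glue catalog first. `preview_query_request` is a hypothetical helper.

```python
def preview_query_request(table, bucket, database="implementationdb", limit=10):
    """Build the arguments for athena.start_query_execution (no AWS call is made here)."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        # Same output path as the one configured under Settings.
        "ResultConfiguration": {"OutputLocation": f"s3://{bucket}/athena_results/"},
    }


# Usage (needs AWS credentials; the table name is an assumption):
# import boto3
# athena = boto3.client("athena")
# response = athena.start_query_execution(**preview_query_request("billing", your_bucket))
# print(response["QueryExecutionId"])
```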

## 4. Create Glue Job

First of all, you need to create a role to run the Glue Job. For simplicity we are going to build a role that can be assumed by the Glue Service with administrator access. 

In the <a href='https://console.aws.amazon.com/iam/home#/roles'> IAM Console </a>, create a new role:

* Under use case select Glue
* Under Policies Select Administrator Access
* Name your role GlueAdmin and accept.

<img src='img/gluerole1.png' style='width:500px'>
<img src='img/gluerole2.png' style='width:500px'>
<img src='img/gluerole3.png' style='width:500px'>
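
For reference, the role shown in the screenshots could be created from code as well. This is a sketch under the assumption that you use the boto3 IAM client (`create_role`, `attach_role_policy`); the helper names are hypothetical. As in the console steps, AdministratorAccess is used only for simplicity while testing.

```python
import json

# AWS managed policy that grants full access; suitable for testing only.
ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"


def trust_policy(service):
    """Return a trust policy that lets a single AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_admin_role(iam, role_name, service):
    """Create the role, then attach AdministratorAccess in a second call."""
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy(service)),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=ADMIN_POLICY_ARN)


# Usage (needs AWS credentials with IAM permissions):
# import boto3
# create_admin_role(boto3.client("iam"), "GlueAdmin", "glue.amazonaws.com")
```

The service principal is the only part specific to Glue; a role for another service would pass a different principal.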



Now move to the <a href='https://console.aws.amazon.com/glue/home?region=us-east-1#addJob:'> Glue Job Console </a> and author a new job.
    

* Name: etlandpipeline
* Role: GlueAdmin, the role with AdministratorAccess created above (this broad access is only because we are testing)
* Type: Python Shell
* Glue version: Python3 (Glue Version 1.0)
* Select: A new script authored by you
* Under Maximum Capacity: 1 - Next

    
Then hit “Save Job and Edit Script”

You can use the following script to run your job:
    

In [17]:
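# Read the job script and replace the 'your_bucket' placeholder with the real bucket name.
with open('etlandpipeline.py', 'r') as f:
    job = f.read().replace('your_bucket', your_bucket)
print(job)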
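
If you prefer to register the job from the notebook instead of pasting the script into the console, something like the following could work. It is a sketch, not the lab's method: it assumes the boto3 Glue client (`create_job`) and an S3 key `scripts/etlandpipeline.py` chosen here purely for illustration. `glue_job_request` is a hypothetical helper.

```python
def glue_job_request(bucket, role="GlueAdmin", name="etlandpipeline",
                     script_key="scripts/etlandpipeline.py"):
    """Build the arguments for glue.create_job for a Python Shell job (no AWS call here)."""
    return {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "pythonshell",  # Python Shell job type
            "ScriptLocation": f"s3://{bucket}/{script_key}",
            "PythonVersion": "3",
        },
        "MaxCapacity": 1.0,  # matches "Maximum Capacity: 1" in the console steps
    }


# Usage (needs AWS credentials; `job` is the script text printed above):
# import boto3
# request = glue_job_request(your_bucket)
# boto3.client("s3").put_object(Bucket=your_bucket,
#                               Key="scripts/etlandpipeline.py",
#                               Body=job.encode("utf-8"))
# boto3.client("glue").create_job(**request)
```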

## 5. Create the Step Function

First you need a role that AWS Step Functions can assume and that has enough permissions to create SageMaker models, use them for inference, and run Glue jobs.
We are going to create a role that can be assumed by the Step Functions service and then modify it to add AdministratorAccess. You can name this role StepFunctionsAdmin.


<img src='img/iamstep.png' />

Tip: In this particular case, the policy cannot be added in the same step that creates the role; add it afterwards by modifying the role.





Next go to the <a href='https://console.aws.amazon.com/states/home?region=us-east-1#/statemachines'> Step Functions </a> console and create a new State Machine.

* Author with code snippets
* Standard


In the JSON definition field, you can use the following script:

In [20]:
from sagemaker import get_execution_role

your_role = get_execution_role()

In [21]:
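# Fill in the 'your_bucket' and 'your_role' placeholders of the state machine definition.
with open('step_function.json', 'r') as f:
    definition = f.read().replace('your_bucket', your_bucket).replace('your_role', your_role)
print(definition)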

{
  "Comment": "Full ML Pipeline",
  "StartAt": "Start Glue Job",
  "States": {
    "Start Glue Job": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "etlandpipeline"
      },
      "Next": "Train model (XGBoost)"
    },
    "Train model (XGBoost)": {
      "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
      "Parameters": {
        "AlgorithmSpecification": {
          "TrainingImage": "811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest",
          "TrainingInputMode": "File"
        },
        "OutputDataConfig": {
          "S3OutputPath": "s3://sagemaker-us-east-1-646862220717/models"
        },
        "StoppingCondition": {
          "MaxRuntimeInSeconds": 86400
        },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m4.xlarge",
          "VolumeSizeInGB": 30
        },
        "RoleArn": "arn:aws:iam::646862220717:role/TeamRole",
    

Use the role that you previously created and then you can create and run your state machine. 
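
Creating and starting the state machine can also be done from code. The sketch below assumes the boto3 Step Functions client (`create_state_machine`, `start_execution`) and the ARN of the StepFunctionsAdmin role; the state machine name `ml-pipeline` and the helper name are hypothetical.

```python
import json


def create_and_start_state_machine(sfn, definition, role_arn, name="ml-pipeline"):
    """Create a Standard state machine from a JSON definition and start one execution."""
    json.loads(definition)  # raises ValueError early if the definition is not valid JSON
    machine = sfn.create_state_machine(
        name=name,
        definition=definition,
        roleArn=role_arn,
        type="STANDARD",
    )
    execution = sfn.start_execution(stateMachineArn=machine["stateMachineArn"])
    return execution["executionArn"]


# Usage (needs AWS credentials; replace the placeholder with your account's role ARN):
# import boto3
# sfn = boto3.client("stepfunctions")
# execution_arn = create_and_start_state_machine(
#     sfn, definition, "arn:aws:iam::<account-id>:role/StepFunctionsAdmin")
```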


As your process starts running and moves through each step, you will be able to follow its progress in each service's console.

Check <a href='https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=jobs'> Glue </a> for job logs and
<a href='https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs'> SageMaker </a> to see the training job, the model that you created and the batch transform process. 

After your step function finishes its execution, you should see the graph turn green:

<img src='img/step.png' style='width:500px' />
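
Instead of watching the graph, you could poll the execution status from the notebook. This is a minimal sketch assuming the boto3 Step Functions client (`describe_execution`) and an execution ARN copied from the console; `wait_for_execution` is a hypothetical helper.

```python
import time

# Statuses after which the execution will not change any more.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(sfn, execution_arn, poll_seconds=30, sleep=time.sleep):
    """Poll describe_execution until the execution reaches a terminal status."""
    while True:
        status = sfn.describe_execution(executionArn=execution_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Usage (needs AWS credentials; execution_arn comes from the Step Functions console):
# import boto3
# print(wait_for_execution(boto3.client("stepfunctions"), execution_arn))
```

A green graph in the console corresponds to the `SUCCEEDED` status here.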

You can inspect your predictions in the predictions folder of your bucket by checking <a href='https://s3.console.aws.amazon.com/s3/home?region=us-east-1'>S3</a>.
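
To list the prediction files without leaving the notebook, a sketch like the following may help. It assumes the boto3 S3 client with its `list_objects_v2` paginator, and that the predictions folder corresponds to the key prefix `predictions/`; `list_prediction_keys` is a hypothetical helper.

```python
def list_prediction_keys(s3, bucket, prefix="predictions/"):
    """Return every object key under the prefix, following pagination."""
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Usage (needs AWS credentials; the prefix is an assumption):
# import boto3
# for key in list_prediction_keys(boto3.client("s3"), your_bucket):
#     print(key)
```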