# Lab Environment for BIA Pipeline

This notebook instance acts as the lab environment for setting up and triggering changes to our pipeline. Using it provides a consistent environment, builds some familiarity with Amazon SageMaker Notebook Instances, and avoids debugging individual laptop configurations during the workshop.

Please review the sample notebook [xgboost_customer_churn](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_applying_machine_learning/xgboost_customer_churn) for detailed documentation on the model being built.

---

## Step 1:  View the Data

In this step we load and view the data, which was prepared using the same processing detailed in the example notebook, [xgboost_customer_churn](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_applying_machine_learning/xgboost_customer_churn).

The following sample shows the label and a subset of the features included in the training dataset.  The label is in the first column, churn, which is the value we are trying to predict to determine whether a customer will churn. 

[Sample with Header](images/training_data_sample.png)
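
Because the label sits in the first column, separating it from the features is a matter of slicing by position. The following is a minimal sketch, assuming the CSV files carry no header row; the three inline records are an illustrative stand-in for `./data/train.csv`, truncated to five columns.

```python
import io
import pandas as pd

# Illustrative stand-in for ./data/train.csv (assumption: no header row,
# churn label in the first column). Only five columns are shown here.
sample_csv = io.StringIO(
    "0,106,0,274.4,120\n"
    "0,28,0,187.8,94\n"
    "1,148,0,279.3,104\n"
)

# header=None keeps the first record as data instead of using it as column names.
df = pd.read_csv(sample_csv, header=None)

labels = df.iloc[:, 0]      # first column: churn label (0 or 1)
features = df.iloc[:, 1:]   # remaining columns: model features

print("rows, columns:", df.shape)
print("label counts:", labels.value_counts().to_dict())
```

Slicing by position avoids depending on column names, which a headerless file does not have.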



In [1]:
import pandas as pd

# Display options are global, so set them once for all three datasets.
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 5)          # Keep the output on one page

# Note: the CSV files have no header row, so pandas uses the first record as column labels.
train_data = pd.read_csv('./data/train.csv', sep=',')
print('\nTraining Data\n', train_data)

smoketest_data = pd.read_csv('./data/smoketest.csv', sep=',')
print('\nSmoke Test Data\n', smoketest_data)

validation_data = pd.read_csv('./data/validation.csv', sep=',')
print('\nValidation Data\n', validation_data)



Training Data
       0  106  0.1  274.4  120  198.6   82  160.8   62   6.0  3  1  0.2  0.3  \
0     0   28    0  187.8   94  248.6   86  208.8  124  10.6  5  0    0    0   
1     1  148    0  279.3  104  201.6   87  280.8   99   7.9  2  2    0    0   
...  ..  ...  ...    ...  ...    ...  ...    ...  ...   ... .. ..  ...  ...   
2330  0  159    0  198.8  107  195.5   91  213.3  120  16.5  7  5    0    0   
2331  0   99   33  179.1   93  238.3  102  165.7   96  10.6  1  2    0    0   

      0.4  0.5  0.6  0.7  0.8  0.9  0.10  0.11  0.12  0.13  0.14  1.1  0.15  \
0       0    0    0    0    0    0     0     0     0     0     0    0     0   
1       0    0    0    0    0    0     0     0     0     0     0    0     0   
...   ...  ...  ...  ...  ...  ...   ...   ...   ...   ...   ...  ...   ...   
2330    0    0    0    0    0    0     0     0     0     0     0    0     0   
2331    0    0    0    0    0    0     0     0     0     0     0    0     0   

      0.16  0.17  0.18  0.19  0.20

---
## Step 2:  Upload Data to S3 

We will use this notebook to perform some of the setup required to trigger the first execution of our pipeline. In this step, we simulate what would typically be the last step in an analytics pipeline: creating the training and validation datasets.

To accomplish this, we upload data from our local notebook instance (found under /data/*) to S3. In a typical scenario, this would be done by your analytics pipeline. We will use the S3 bucket that was created by the CloudFormation template we launched at the beginning of the lab. You can validate that the S3 bucket exists by:
  1. Going to the [S3 Service](https://s3.console.aws.amazon.com/s3/) inside the AWS Console
  2. Finding the name of the S3 data bucket created by the CloudFormation template: mlops-bia-data-*yourinitials*-*randomid*
  3. Updating the bucket variable in the cell below
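
A quick local check can catch a bucket variable that still holds the placeholder before any upload is attempted. This is a rough sketch only: the pattern is an assumption inferred from the single example name `mlops-bia-data-jdd-df4d4850` (fixed prefix, lowercase initials, hexadecimal id), and `looks_like_lab_bucket` is a hypothetical helper, not part of the lab code.

```python
import re

# Assumed naming pattern, inferred from the example mlops-bia-data-jdd-df4d4850.
# The real generated id may not always be hexadecimal.
BUCKET_PATTERN = re.compile(r"mlops-bia-data-[a-z0-9]+-[0-9a-f]+")


def looks_like_lab_bucket(name):
    """Return True if name matches the assumed lab bucket naming pattern."""
    return BUCKET_PATTERN.fullmatch(name) is not None


print(looks_like_lab_bucket("mlops-bia-data-jdd-df4d4850"))
print(looks_like_lab_bucket("mlops-bia-data-<yourinitials>-<generated id>"))
```

A name that passes this check can still be wrong, so the console lookup in the steps above remains the authoritative confirmation.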

   ### UPDATE THE BUCKET NAME BELOW BEFORE EXECUTING THE CELL

In [None]:
import boto3
from sagemaker import get_execution_role

# UPDATE THE NAME OF THE BUCKET TO MATCH THE ONE WE CREATED THROUGH THE CLOUDFORMATION TEMPLATE
# Example: mlops-bia-data-jdd-df4d4850
#bucket = 'mlops-bia-data-<yourinitials>-<generated id>'
bucket = 'mlops-bia-data-yourinitials-uniqueid'

# Execution role and region of this notebook instance (not used by the upload itself)
role = get_execution_role()
region = boto3.Session().region_name

# Object keys (paths inside the bucket) for each dataset
trainfilename = 'train/train.csv'
smoketestfilename = 'smoketest/smoketest.csv'
validationfilename = 'validation/validation.csv'

s3 = boto3.resource('s3')

# Upload each local file to its key in the bucket
s3.meta.client.upload_file('./data/train.csv', bucket, trainfilename)
s3.meta.client.upload_file('./data/smoketest.csv', bucket, smoketestfilename)
s3.meta.client.upload_file('./data/validation.csv', bucket, validationfilename)
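
The three uploads follow one pattern: the local file `./data/<name>.csv` lands at the key `<name>/<name>.csv`. The sketch below derives the keys and the resulting S3 URIs without contacting AWS, which can help when checking the bucket contents afterwards. `upload_plan` is a hypothetical helper, and the bucket name is the example from the cell above.

```python
# Dataset names used in this lab. Each maps to ./data/<name>.csv locally
# and to <name>/<name>.csv inside the bucket.
DATASETS = ["train", "smoketest", "validation"]


def upload_plan(bucket, datasets=DATASETS):
    """Return (local_path, key, s3_uri) tuples without contacting AWS."""
    plan = []
    for name in datasets:
        local_path = f"./data/{name}.csv"
        key = f"{name}/{name}.csv"
        plan.append((local_path, key, f"s3://{bucket}/{key}"))
    return plan


for local_path, key, uri in upload_plan("mlops-bia-data-jdd-df4d4850"):
    print(f"{local_path} -> {uri}")
```

Each `(local_path, key)` pair corresponds to one `upload_file(local_path, bucket, key)` call in the cell above.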

---

## Step 3:  Monitor CodePipeline Execution

The code above will trigger the execution of your CodePipeline. You can monitor the progress of the pipeline execution in the [CodePipeline dashboard](https://console.aws.amazon.com/codesuite/codepipeline/pipelines). Within the pipeline, explore the stages while the pipeline is executing to understand what is being performed and which user parameters are passed as input to each stage. For example, StartTraining takes a set of parameters detailing the training environment as well as the hyperparameters for training:

    {"Algorithm": "xgboost:1", "traincompute": "ml.c4.2xlarge" , "traininstancevolumesize": 10, "traininstancecount": 1, "MaxDepth": "5", "eta": "0.2", "gamma": "4", "MinChildWeight": "6", "SubSample": 0.8, "Silent": 0, "Objective": "binary:logistic", "NumRound": "100"} 
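
Note that the parameters mix value types: some numbers are quoted strings (`"MaxDepth": "5"`) while others are plain JSON numbers (`"SubSample": 0.8`). The sketch below parses the same string and prints each value with its type. How the pipeline's Lambda function actually consumes these values is not shown in this notebook, so the conversions at the end are only an illustration.

```python
import json

# The StartTraining user parameters, copied from the example above.
user_parameters = (
    '{"Algorithm": "xgboost:1", "traincompute": "ml.c4.2xlarge", '
    '"traininstancevolumesize": 10, "traininstancecount": 1, '
    '"MaxDepth": "5", "eta": "0.2", "gamma": "4", "MinChildWeight": "6", '
    '"SubSample": 0.8, "Silent": 0, "Objective": "binary:logistic", '
    '"NumRound": "100"}'
)

params = json.loads(user_parameters)

# Show each parameter with the Python type it was parsed into.
for name, value in params.items():
    print(f"{name:>24} = {value!r} ({type(value).__name__})")

# Quoted numbers stay strings after parsing and need an explicit conversion
# before they can be used numerically.
max_depth = int(params["MaxDepth"])
eta = float(params["eta"])
```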

As the pipeline executes, information is logged to [CloudWatch logs](https://console.aws.amazon.com/cloudwatch/logs). Explore the logs for your Lambda functions (/aws/lambda/MLOps-BIA*) as well as the output logs from SageMaker (/aws/sagemaker/*). Also, since this is a Built-In Algorithm, SageMaker automatically emits training metrics that help you understand how well your model is learning and whether it will generalize well on unseen data. Those metrics are logged to /aws/sagemaker/TrainingJobs in CloudWatch.
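
The same log groups can also be listed from the notebook instead of the console. The following is a minimal sketch: `list_log_groups` is a hypothetical helper that accepts any client exposing the CloudWatch Logs `describe_log_groups` call, so in the lab it would be given `boto3.client("logs")`. It assumes the notebook's role is allowed to describe log groups.

```python
def list_log_groups(logs_client, prefix):
    """Return the names of all CloudWatch log groups that start with prefix."""
    names = []
    kwargs = {"logGroupNamePrefix": prefix}
    while True:
        page = logs_client.describe_log_groups(**kwargs)
        names.extend(group["logGroupName"] for group in page.get("logGroups", []))
        token = page.get("nextToken")
        if not token:
            return names
        # More results are available; request the next page.
        kwargs["nextToken"] = token


# Usage in the lab (requires AWS credentials, so it is left commented out):
# import boto3
# print(list_log_groups(boto3.client("logs"), "/aws/lambda/MLOps-BIA"))
# print(list_log_groups(boto3.client("logs"), "/aws/sagemaker/"))
```

Passing the client in as an argument keeps the helper free of AWS calls at import time and makes it easy to exercise with a stand-in object.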


Note: It will take a while to execute all the way through the pipeline. Please don't proceed to the next step until the last stage shows **'succeeded'**.
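
The success check can also be done programmatically on the response of the CodePipeline `get_pipeline_state` call, which reports a `latestExecution.status` for each stage. The sketch below is a hypothetical helper operating on such a response; the example dictionary is hand-written for illustration, and every stage name in it other than StartTraining is an assumption.

```python
def pipeline_succeeded(state):
    """Return True only if every stage's latest execution status is 'Succeeded'."""
    stages = state.get("stageStates", [])
    if not stages:
        return False
    return all(
        stage.get("latestExecution", {}).get("status") == "Succeeded"
        for stage in stages
    )


# Hand-written response fragment for illustration only.
example_state = {
    "stageStates": [
        {"stageName": "Source", "latestExecution": {"status": "Succeeded"}},
        {"stageName": "StartTraining", "latestExecution": {"status": "InProgress"}},
    ]
}
print(pipeline_succeeded(example_state))

# Usage in the lab (requires AWS credentials and your pipeline's name):
# import boto3
# state = boto3.client("codepipeline").get_pipeline_state(name="<your pipeline name>")
# print(pipeline_succeeded(state))
```

A stage that has not started yet has no `latestExecution` entry, so the helper treats it as not succeeded.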

---

## Step 4: Additional Clean-Up

Return to the [README.md](https://github.com/aws-samples/amazon-sagemaker-devops-with-ml/1-Built-In-Algorithm/README.md) to complete the environment cleanup instructions. 

# CONGRATULATIONS! 

You've built a basic pipeline for the use case of a Built-In SageMaker algorithm. This pipeline can act as a starting point for adding capabilities such as additional quality gates, more dynamic logic for capturing hyperparameter changes, or various deployment strategies (e.g., A/B testing). Another common extension to the pipeline is creating or updating the API that serves predictions through API Gateway.