# Lab Environment for BYOA Pipeline

This notebook instance acts as the lab environment for setting up and triggering changes to our pipeline. Using it gives everyone a consistent environment, builds some familiarity with Amazon SageMaker Notebook Instances, and avoids having to debug individual laptop configurations during the workshop.

PLEASE review the [sample notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb) for detailed documentation on the model being built.

---

## Step 1:  Upload Data to S3 

We will use this notebook to perform some of the setup required to trigger the first execution of our pipeline. In this first step of our machine learning pipeline, we simulate what would typically be the last step of an analytics pipeline: creating the datasets used for data science.

To accomplish this, we will upload data from our local notebook instance (data can be found under /data/1-train/*) to S3. In a typical scenario, this would be done by your analytics pipeline. We will use the S3 bucket that was created by the CloudFormation template we launched at the beginning of the lab. You can validate that the S3 bucket exists by:
  1. Going to the [S3 Service](https://s3.console.aws.amazon.com/s3/) inside the AWS Console
  2. Finding the name of the S3 data bucket created by the CloudFormation template: mlops-data-*yourinitials*-*randomid*
  
In the code cell below, we'll take a look at the training, smoke test, and validation datasets. The cell after that uploads them to S3.

In [1]:
import pandas as pd

# Display options apply to the whole session, so set them once
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 5)          # Keep the output on one page

# Note: read_csv treats the first line of each file as a header row by default,
# which is why the first record shows up as the column labels in the output.
train_data = pd.read_csv('./data/1-train/train/iris.csv', sep=',')
print('\nTraining Data\n', train_data)

smoketest_data = pd.read_csv('./data/1-train/smoketest/iris.csv', sep=',')
print('\nSmoke Test Data\n', smoketest_data)

validation_data = pd.read_csv('./data/1-train/validation/iris.csv', sep=',')
print('\nValidation Data\n', validation_data)



Training Data
         setosa  5.1  3.5  1.4  0.2
0       setosa  4.9  3.0  1.4  0.2
1       setosa  4.7  3.2  1.3  0.2
..         ...  ...  ...  ...  ...
147  virginica  6.2  3.4  5.4  2.3
148  virginica  5.9  3.0  5.1  1.8

[149 rows x 5 columns]

Smoke Test Data
        setosa  5.0  3.5  1.3  0.3
0  versicolor  5.5  2.6  4.4  1.2
1   virginica  5.8  2.7  5.1  1.9

Validation Data
         setosa  5.2  3.5  1.5  0.2
0       setosa  5.2  3.4  1.4  0.2
1       setosa  4.7  3.2  1.6  0.2
..         ...  ...  ...  ...  ...
42  versicolor  5.9  3.2  4.8  1.8
43  versicolor  6.1  2.8  4.0  1.3

[44 rows x 5 columns]
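
In the output above, the first record of each file appears as the column labels, which suggests the CSV files have no header row. The following is a minimal sketch of how `header=None` changes what pandas reads. It uses a small in-memory sample copied from the rows shown above rather than the lab files, and the column names are hypothetical labels chosen for illustration, not names defined by the lab.

```python
import io

import pandas as pd

# Three rows in the same layout as the lab's iris.csv (label first, no header row).
sample_csv = (
    "setosa,5.1,3.5,1.4,0.2\n"
    "setosa,4.9,3.0,1.4,0.2\n"
    "virginica,5.9,3.0,5.1,1.8\n"
)

# Hypothetical column names (an assumption; the lab files do not define any).
columns = ["species", "sepal_length", "sepal_width", "petal_length", "petal_width"]

# Default behaviour: the first line is consumed as the header.
default_read = pd.read_csv(io.StringIO(sample_csv))

# header=None keeps every line as data; names= supplies the labels.
explicit_read = pd.read_csv(io.StringIO(sample_csv), header=None, names=columns)

print(len(default_read))    # 2 data rows, first record used as labels
print(len(explicit_read))   # 3 data rows
```

This only affects how the data is displayed in the notebook. The upload step copies the files to S3 unchanged.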


### UPDATE THE BUCKET NAME BELOW BEFORE EXECUTING THE CELL

In [2]:
import os
import boto3
import re
import time

# UPDATE THE NAME OF THE BUCKET TO MATCH THE ONE WE CREATED THROUGH THE CLOUDFORMATION TEMPLATE
# Example: mlops-data-jdd-df4d4850
#bucket = 'mlops-data-<yourinitials>-<generated id>'
bucket = 'mlops-data-jdd-ec8a0350'


from sagemaker import get_execution_role

role = get_execution_role()
region = boto3.Session().region_name

ts = time.strftime("%m%d%Y%H%M%S")
trainfilename = 'train/train'+ts+'.csv'
smoketestfilename = 'smoketest/smoketest.csv'
validationfilename = 'validation/validation.csv'


s3 = boto3.resource('s3')

s3.meta.client.upload_file('./data/1-train/train/iris.csv', bucket, trainfilename)
s3.meta.client.upload_file('./data/1-train/smoketest/iris.csv', bucket, smoketestfilename)
s3.meta.client.upload_file('./data/1-train/validation/iris.csv', bucket, validationfilename)
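
After the upload, it can be useful to confirm that the objects landed where expected. The sketch below is one way to do that with the S3 `list_objects_v2` call. The helper name `list_uploaded_keys` is hypothetical, and the function takes the client as an argument so that it can be exercised without AWS credentials.

```python
def list_uploaded_keys(s3_client, bucket, prefix):
    """Return the object keys under `prefix` in `bucket`.

    Only the first page of results is read (up to 1000 keys), which is
    enough for the handful of files uploaded in this lab.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent from the response when no objects match the prefix.
    return [obj["Key"] for obj in response.get("Contents", [])]


# Usage in the notebook (needs AWS credentials, so it is left commented out):
# import boto3
# s3_client = boto3.client("s3")
# for prefix in ("train/", "smoketest/", "validation/"):
#     print(prefix, list_uploaded_keys(s3_client, bucket, prefix))
```

Here `bucket` is the variable set in the cell above. Each prefix should list at least one key once the upload has succeeded.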

## Step 2:  Commit Training Code To Trigger Pipeline Build

In this step, we trigger an execution of the pipeline by committing our training code to the CodeCommit repository that was set up as part of the CloudFormation stack. The pipeline is currently set up to trigger on a commit to the master branch; in a real-world scenario, adjust this to match your branching strategy.

You can view the CodeCommit repository that was created by:
  1. Going to the [CodeCommit Service](https://console.aws.amazon.com/codesuite/codecommit/repositories) inside the AWS Console
  2. Finding the name of the repository created by the CloudFormation template: mlops-codecommit-byo-*yourinitials*
  
**UPDATE:** Update the cell below where noted before executing it.

In [4]:
# View the Git remotes configured for this notebook instance.
# The Git integration was configured when the notebook instance was created by the CloudFormation stack.

# Show the remotes before adding CodeCommit
!git remote -v

# Set the Git identity used for commits
!git config --global user.name "Jane Smith"
!git config --global user.email JaneSmith@example.com

# UPDATE: Add a new remote for CodeCommit so we can push code and trigger our pipeline
# Example: https://git-codecommit.us-east-1.amazonaws.com/v1/repos/mlops-codecommit-byo-jdd
!git remote add codecommit https://git-codecommit.us-east-1.amazonaws.com/v1/repos/mlops-codecommit-byo-jdd

# Confirm the CodeCommit remote is now set up
!git remote -v




origin	git@github.com:aws-samples/amazon-sagemaker-devops-with-ml.git (fetch)
origin	git@github.com:aws-samples/amazon-sagemaker-devops-with-ml.git (push)
codecommit	https://git-codecommit.us-east-1.amazonaws.com/v1/repos/mlops-codecommit-byo-jdd (fetch)
codecommit	https://git-codecommit.us-east-1.amazonaws.com/v1/repos/mlops-codecommit-byo-jdd (push)
origin	git@github.com:aws-samples/amazon-sagemaker-devops-with-ml.git (fetch)
origin	git@github.com:aws-samples/amazon-sagemaker-devops-with-ml.git (push)
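
The CodeCommit remote URL used above follows a fixed pattern made of the AWS region and the repository name. The sketch below builds that URL from its parts, which may help if your stack runs outside us-east-1. The helper name `codecommit_https_url` is hypothetical.

```python
def codecommit_https_url(region, repo_name):
    """Build the HTTPS clone URL for a CodeCommit repository."""
    return f"https://git-codecommit.{region}.amazonaws.com/v1/repos/{repo_name}"


# Example values taken from the cell above; substitute your own region and initials.
url = codecommit_https_url("us-east-1", "mlops-codecommit-byo-jdd")
print(url)
# https://git-codecommit.us-east-1.amazonaws.com/v1/repos/mlops-codecommit-byo-jdd
```

In the notebook, the `region` value from `boto3.Session().region_name` in the earlier cell could be passed in place of the hardcoded region.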


### Commit training code to the CodeCommit repository to trigger the execution of the CodePipeline

In [None]:
!git pull
!git add ./model-code/*
!git commit -m "Initial add of model code to CodeCommit Repo"
!git push codecommit master

The authenticity of host 'github.com (140.82.113.4)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
RSA key fingerprint is MD5:16:27:ac:a5:76:28:2d:36:63:1b:56:4d:eb:df:a6:48.
Are you sure you want to continue connecting (yes/no)? 

### Monitor CodePipeline Execution

The code above will trigger the execution of your CodePipeline. You can monitor progress of the pipeline execution in the [CodePipeline dashboard](https://console.aws.amazon.com/codesuite/codepipeline/pipelines).
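
As an alternative to the console, the pipeline state can also be read from the notebook. The following is a minimal sketch that uses the CodePipeline `get_pipeline_state` call. The helper name `summarize_pipeline_state` and the `"NotStarted"` placeholder are assumptions made for illustration, and the pipeline name is not given in this notebook, so look it up in the CodePipeline dashboard.

```python
def summarize_pipeline_state(codepipeline_client, pipeline_name):
    """Return a {stage name: latest execution status} mapping for a pipeline."""
    state = codepipeline_client.get_pipeline_state(name=pipeline_name)
    summary = {}
    for stage in state.get("stageStates", []):
        # A stage that has never run has no "latestExecution" entry;
        # "NotStarted" is a placeholder label chosen for this sketch.
        latest = stage.get("latestExecution", {})
        summary[stage["stageName"]] = latest.get("status", "NotStarted")
    return summary


# Usage in the notebook (needs AWS credentials, so it is left commented out):
# import boto3
# codepipeline = boto3.client("codepipeline")
# print(summarize_pipeline_state(codepipeline, "<your-pipeline-name>"))
```

Re-running the call every minute or so shows each stage moving through its statuses as the execution progresses.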