UDACITY SageMaker Essentials: Training Job Exercise

In [None]:
import os
import boto3
import sagemaker
import json
import zipfile

import pandas as pd
import numpy as np

Preprocessing

The data we'll be examining today is a collection of reviews for an assortment of toys and games found on Amazon. This data includes, but is not limited to, the text of the review itself as well as the number of user "votes" on whether or not the review was helpful. Today, we will be making a model that predicts the usefulness of a review, given only the text of the review. This is an example of a problem in the domain of supervised sentiment analysis; we are trying to extract something subjective from text given prior labeled text.

Before we get started, we want to know what form of data is accepted by the algorithm we're using. We'll be using BlazingText, an implementation of Word2Vec optimized for SageMaker. In order for this optimization to be effective, data needs to be preprocessed to match the correct format. The documentation for this algorithm can be found here: https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html

We will be training under "File Mode", which requires us to do two things in preprocessing this data. First, we need to generate labels from the votes. For this exercise, if the majority of votes for a review is helpful, we will designate it __label__1, and if the majority of votes for a review is unhelpful, we will designate it __label__2. In the edge case where the values are equal, we will drop the review from consideration. Second, we need to separate the sentences, while keeping the original label for the review. These reviews will often consist of several sentences, and this algorithm is optimized to perform best on many small sentences rather than fewer larger paragraphs. We will separate these sentences by the character "."

(This process is obviously very naive, but we will get remarkable results even without a lot of finetuning!)
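
As a rough illustration of the two preprocessing steps described above, here is a minimal sketch that labels a single review and splits it into sentences. The sample record is hypothetical and carries only the two fields the labeling needs, and the helper name label_and_split is an assumption made for this example. Unlike the notebook's own functions, it also strips whitespace around each sentence.

```python
import json

HELPFUL_LABEL = "__label__1"
UNHELPFUL_LABEL = "__label__2"


def label_and_split(record):
    """Return one labeled line per sentence, or [] if the review is dropped."""
    helpful_votes, total_votes = record["helpful"]
    # Drop reviews with no votes and reviews where the votes are tied.
    if total_votes == 0 or 2 * helpful_votes == total_votes:
        return []
    label = HELPFUL_LABEL if 2 * helpful_votes > total_votes else UNHELPFUL_LABEL
    # Naive sentence split on "."; empty pieces (common with "...") are skipped.
    sentences = [s.strip() for s in record["reviewText"].split(".")]
    return [f"{label} {s}" for s in sentences if s]


# Hypothetical record: 3 of 4 votes were "helpful".
sample = json.loads('{"helpful": [3, 4], "reviewText": "Great toy. My kids love it..."}')
labeled_lines = label_and_split(sample)
print(labeled_lines)
```

Each output line starts with the label, followed by one sentence of the review, which is the shape BlazingText expects in supervised mode.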

This preprocessing is done for you in the cells below. Make sure you go through the code and understand what's being done in each step.

In [None]:
import zipfile

# Function below unzips the archive to the local directory. 

def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')

# Input data is a file with a single JSON object per line with the following format: 
# {
#  "reviewerID": <string>,
#  "asin": <string>,
#  "reviewerName" <string>,
#  "helpful": [
#    <int>, (indicating number of "helpful votes")
#    <int>  (indicating total number of votes)
#  ],
#  "reviewText": "<string>",
#  "overall": <int>,
#  "summary": "<string>",
#  "unixReviewTime": <int>,
#  "reviewTime": "<string>"
# }
# 
# We are specifically interested in the fields "helpful" and "reviewText"
#

def label_data(input_data):
    labeled_data = []
    HELPFUL_LABEL = "__label__1"
    UNHELPFUL_LABEL = "__label__2"
     
    for l in open(input_data, 'r'):
        l_object = json.loads(l)
        helpful_votes = float(l_object['helpful'][0])
        total_votes = l_object['helpful'][1]
        reviewText = l_object['reviewText']
        if total_votes != 0:
            if helpful_votes / total_votes > .5:
                labeled_data.append(" ".join([HELPFUL_LABEL, reviewText]))
            elif helpful_votes / total_votes < .5:
                labeled_data.append(" ".join([UNHELPFUL_LABEL, reviewText]))
          
    return labeled_data


# Labeled data is a list of sentences, starting with the label defined in label_data. 

def split_sentences(labeled_data):
    split_sentences = []
    for d in labeled_data:
        label = d.split()[0]        
        sentences = " ".join(d.split()[1:]).split(".") # Initially split to separate label, then separate sentences
        for s in sentences:
            if s: # Make sure sentences isn't empty. Common w/ "..."
                split_sentences.append(" ".join([label, s]))
    return split_sentences


unzip_data('Toys_and_Games_5.json.zip')  # unzip_data returns nothing, so there is no result to keep
labeled_data = label_data('Toys_and_Games_5.json')
split_sentence_data = split_sentences(labeled_data)

print(split_sentence_data[0:9])

Exercise: Upload Data

Your first responsibility is to separate split_sentence_data into a training_file and a validation_file. Have the training file make up 90% of the data, and have the validation file make up the remaining 10%. Be careful that the two files don't overlap! (Overlap leaks training examples into validation, which might result in nice validation metrics, but crummy generalization.)
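
One way to get a non-overlapping 90/10 split is to shuffle once and cut the list at a single index. The sketch below is only an illustration: the helper name train_validation_split, the fixed seed, and the generated sample rows are all assumptions, and shuffling is optional (the exercise only requires that the two parts be disjoint).

```python
import random


def train_validation_split(rows, train_fraction=0.9, seed=0):
    """Shuffle a copy of rows and cut it once, so no row lands in both parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for split_sentence_data.
rows = [f"__label__1 sentence {i}" for i in range(100)]
train_rows, validation_rows = train_validation_split(rows)
print(len(train_rows), len(validation_rows))
```

Because both parts come from one cut of the same list, every row ends up in exactly one of them. Note that identical sentences appearing more than once in the source data can still land on both sides.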

Using the methodology of your choice, upload these files to S3. (In practice, it's important to know how to do this through the console, programmatically, and through the CLI. If you're feeling frisky, try all 3!) If you're doing this programmatically, the Boto3 documentation is a good reference: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html

The BUCKET will be the name of the bucket you wish to upload to. The s3_prefix will be the desired 'file path' that you upload your file to within the bucket. For example, if you wanted to upload a file to:

"s3://example-bucket/1/2/3/example.txt"

The BUCKET would be 'example-bucket', and the s3_prefix would be '1/2/3'
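
To make the bucket/prefix distinction concrete, the sketch below takes the example URI apart using only the standard library. The helper name split_s3_uri is an assumption made for this example and is not part of Boto3.

```python
import posixpath
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, prefix, file name)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")          # object key without the leading slash
    prefix, file_name = posixpath.split(key)
    return parsed.netloc, prefix, file_name


bucket, prefix, file_name = split_s3_uri("s3://example-bucket/1/2/3/example.txt")
print(bucket, prefix, file_name)

# The object key passed to an upload call is prefix + "/" + file name.
object_key = posixpath.join(prefix, file_name)
print(object_key)
```

The prefix contains neither the "s3://" scheme nor the bucket name; those are supplied separately when uploading.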

The code below shows you how to upload it programmatically.

In [None]:
import logging
import boto3
from botocore.exceptions import ClientError
# Note: This section assumes that the bucket below has already been made and that you have access
# to that bucket. You would need to change the bucket below to a bucket that you have write
# permissions to. The upload will take time depending on your internet connection; the training file is ~40 MB

BUCKET = "mldeployex"
# s3_prefix = "s3://mldeployex/txtdata/"


def cycle_data(fp, data):
    for d in data:
        fp.write(d + "\n")

def write_trainfile(split_sentence_data):
    train_path = "hello_blaze_train"
    with open(train_path, 'w') as f:
        cycle_data(f, split_sentence_data)
    return train_path

def write_validationfile(split_sentence_data):
    validation_path = "hello_blaze_validation"
    with open(validation_path, 'w') as f:
        cycle_data(f, split_sentence_data)
    return validation_path 

def upload_file_to_s3(file_name, s3_prefix):
    object_name = os.path.join(s3_prefix, file_name)
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, BUCKET, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True

# The prefix is the path inside the bucket; it does not include "s3://" or the bucket name
s3_prefix = "txtdata"

# Split the data
split_data_trainlen = int(len(split_sentence_data) * .9)
split_data_validationlen = int(len(split_sentence_data) * .1)

# Todo: write the training file
train_path = write_trainfile(split_sentence_data[:split_data_trainlen])
print("Training file written!")

# Todo: write the validation file
validation_path = write_validationfile(split_sentence_data[split_data_trainlen:])
print("Validation file written!")

upload_file_to_s3(train_path, s3_prefix)
print("Train file uploaded!")
upload_file_to_s3(validation_path, s3_prefix)
print("Validation file uploaded!")

print(" ".join([train_path, validation_path]))

Exercise: Train SageMaker Model

Believe it or not, you're already almost done! Part of the appeal of SageMaker is that AWS has already done the heavy implementation lifting for you. Launch a "BlazingText" training job from the SageMaker console. You can do so by searching "SageMaker", and navigating to Training Jobs on the left hand side. After selecting "Create Training Job", perform the following steps:

Select "BlazingText" from the algorithms available.
Specify the "file" input mode of training.
Under "resource configuration", select the "ml.m5.large" instance type. Specify 5 GB of additional storage.
Set a stopping condition for 15 minutes.
Under hyperparameters, set "mode" to "supervised"
Under input data configuration, input the S3 path to your training and validation datasets under the "train" and "validation" channels. You will need to create a channel named "validation".
Specify an output path in the same bucket that you uploaded training and validation data.
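
For reference, the same console settings can be written down as the request body that Boto3's create_training_job call accepts. This is only a sketch under stated assumptions: the helper name build_training_job_request, the bucket, the role ARN, the job name, and the image URI placeholder are all hypothetical, and the "text/plain" content type is an assumption about how the channels are configured. The code only builds the dictionary and does not contact AWS.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_uri, validation_uri, output_uri):
    """Assemble keyword arguments for a SageMaker create_training_job call."""

    def channel(name, s3_uri):
        return {
            "ChannelName": name,
            "ContentType": "text/plain",  # assumption for labeled text files
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_uri,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }

    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",           # "file" input mode
        },
        "RoleArn": role_arn,
        "HyperParameters": {"mode": "supervised"},
        "InputDataConfig": [
            channel("train", train_uri),
            channel("validation", validation_uri),
        ],
        "OutputDataConfig": {"S3OutputPath": output_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 5,                   # additional storage
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 15 * 60},
    }


# All values below are hypothetical placeholders.
request = build_training_job_request(
    job_name="hello-blaze-example",
    image_uri="<blazingtext-image-uri>",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_uri="s3://example-bucket/txtdata/hello_blaze_train",
    validation_uri="s3://example-bucket/txtdata/hello_blaze_validation",
    output_uri="s3://example-bucket/output/",
)
print(request["StoppingCondition"])
```

With real values filled in, the dictionary could be passed as boto3.client("sagemaker").create_training_job(**request).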

## Endpoint SDK demo

Through the SDK, we will first initiate a boto3 client. Then, we obtain the model image uri, the model artifact, and an execution role. These are used to instantiate a Model object. Then, we call the deploy method, specifying what kind of instance we want and how many.

In [None]:
from sagemaker import get_execution_role, image_uris
from sagemaker.model import Model

role = get_execution_role()
image_uri = image_uris.retrieve(framework='xgboost',region='us-west-2', version='latest')
model_data = "s3://sagemaker-us-west-2-565094796913/boston-xgboost-HL/output/xgboost-2021-08-31-23-02-30-970/output/model.tar.gz"

model = Model(image_uri=image_uri, model_data=model_data, role=role)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

Using endpoint through SDK

To use this endpoint programmatically, go through the SDK's Predictor interface. You pass in the endpoint name and a SageMaker session (which wraps your Boto3 session). Depending on the type of data, you may need to serialize it first. Serializing converts the data into a form, typically bytes, from which it can be recreated later. An example of a predictor object and serialization is shown below.
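
As a small, endpoint-free illustration of what "serialize so it can be recreated later" means, the sketch below round-trips a JSON payload through bytes using only the standard library. The sample sentences are hypothetical; no SageMaker call is made.

```python
import json

# Hypothetical request payload.
payload = {"instances": ["This toy is great", "It broke after a day"]}

# Serialize: Python object -> JSON text -> bytes that can be sent over the wire.
serialized = json.dumps(payload).encode("utf-8")

# Deserialize: bytes -> JSON text -> an equal Python object.
restored = json.loads(serialized.decode("utf-8"))

print(type(serialized).__name__, restored == payload)
```

A serializer attached to a Predictor plays the same role for whatever content type the endpoint expects.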

In [None]:
from sagemaker.serializers import IdentitySerializer

#deploy the model
#note: deploy() on a generic Model returns None unless predictor_cls is set,
#so the endpoint name is read from the model object afterwards
model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
    )
#get the endpoint name
endpoint = model.endpoint_name

#instantiate a Predictor
predictor = sagemaker.predictor.Predictor(
    endpoint,
    sagemaker_session=sagemaker.Session(),
)

#prepare one image for prediction
predictor.serializer = IdentitySerializer("image/png")
with open("test_image.png", "rb") as f:
    payload = f.read()

#use the predictor to make a prediction
inference = predictor.predict(payload)

## UDACITY SageMaker Essentials: Endpoint Exercise

In [None]:
import boto3
import json
import sagemaker
import zipfile

Understanding Exercise: Preprocessing Data (again)

Before we start, we're going to do preprocessing on a new set of data that we'll be evaluating on HelloBlaze. We won't keep track of the labels here, we're just seeing how we could potentially evaluate new data using an existing model. This code should be very familiar, and requires no modification. Something to note: it is getting tedious to have to manually process the data ourselves whenever we want to do something with our model. We are also doing this on our local machine. Can you think of potential limitations and dangers to the preprocessing setup we currently have? Keep this in mind when we move on to our lesson about batch-transform jobs.

In [None]:
# Function below unzips the archive to the local directory. 

def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')

# Input data is a file with a single JSON object per line with the following format: 
# {
#  "reviewerID": <string>,
#  "asin": <string>,
#  "reviewerName" <string>,
#  "helpful": [
#    <int>, (indicating number of "helpful votes")
#    <int>  (indicating total number of votes)
#  ],
#  "reviewText": "<string>",
#  "overall": <int>,
#  "summary": "<string>",
#  "unixReviewTime": <int>,
#  "reviewTime": "<string>"
# }
# 
# We are specifically interested in the fields "helpful" and "reviewText"
#

def label_data(input_data):
    labeled_data = []
    HELPFUL_LABEL = "__label__1"
    UNHELPFUL_LABEL = "__label__2"
     
    for l in open(input_data, 'r'):
        l_object = json.loads(l)
        helpful_votes = float(l_object['helpful'][0])
        total_votes = l_object['helpful'][1]
        reviewText = l_object['reviewText']
        if total_votes != 0:
            if helpful_votes / total_votes > .5:
                labeled_data.append(" ".join([HELPFUL_LABEL, reviewText]))
            elif helpful_votes / total_votes < .5:
                labeled_data.append(" ".join([UNHELPFUL_LABEL, reviewText]))
          
    return labeled_data


# Labeled data is a list of sentences, starting with the label defined in label_data. 

def split_sentences(labeled_data):
    new_split_sentences = []
    for d in labeled_data:       
        sentences = " ".join(d.split()[1:]).split(".") # Initially split to separate label, then separate sentences
        for s in sentences:
            if s: # Make sure sentences isn't empty. Common w/ "..."
                new_split_sentences.append(s)
    return new_split_sentences


unzip_data('reviews_Musical_Instruments_5.json.zip')
labeled_data = label_data('reviews_Musical_Instruments_5.json')
new_split_sentence_data = split_sentences(labeled_data)

print(new_split_sentence_data[0:9])

Exercise: Deploy Model

Once you have your model, it's trivially easy to create an endpoint. All you need to do is initialize a "model" object, and call the deploy method. Fill in the method below with the proper addresses and an endpoint will be created, serving your model. Once this is done, confirm that the endpoint is live by consulting the SageMaker Console. You'll see this under "Endpoints" in the "Inference" menu on the left-hand side. If done correctly, this will take a while to get instantiated.

You will need the following methods:

You'll need the image_uris.retrieve method to get the URI of the BlazingText Docker image: https://sagemaker.readthedocs.io/en/stable/api/utility/image_uris.html
You'll need a model_data value, which is the S3 location of your SageMaker model artifact.
You'll need to use the Model object: https://sagemaker.readthedocs.io/en/stable/api/inference/model.html
You'll need to get the execution role.

In [None]:
from sagemaker import get_execution_role
from sagemaker.model import Model
from sagemaker import image_uris

# get the execution role
role = get_execution_role()
# get the image using the "blazingtext" framework and your region
image_uri = image_uris.retrieve(framework='blazingtext', region='us-east-1', version='latest')
# get the S3 location of a SageMaker model data
model_data = 's3://mldeployex/12el/model_artifact/blazetxt/output/model.tar.gz'
# define a model object
model = Model(image_uri=image_uri, model_data=model_data, role=role)
# deploy the model using a single instance of "ml.m5.large"
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

Exercise: Evaluate Data

Alright, we now have an easy way to evaluate our data! You will want to interact with the endpoint using the predictor interface: https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html

Predictor is not the endpoint itself, but instead is an interface that we can use to easily interact with our deployed model. Your task is to take new_split_sentence_data and evaluate it using the predictor.

Note that BlazingText supports "application/json" as the content type for inference, and the model expects a payload that contains a list of sentences under the key "instances".

The method you'll need to call is highlighted below.

Another recommendation: try evaluating a subset of the data before evaluating all of the data. This will make debugging significantly faster.
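
One way to follow that recommendation is to cut the sentences into small batches and build one JSON payload per batch, so a single small batch can be sent first while debugging. The sketch below only builds the payload strings; the helper name iter_payloads, the batch size, and the generated sentences are assumptions for illustration, and no endpoint is called.

```python
import json


def iter_payloads(sentences, batch_size=5):
    """Yield one JSON string per batch, each with an "instances" list."""
    for start in range(0, len(sentences), batch_size):
        yield json.dumps({"instances": sentences[start:start + batch_size]})


# Hypothetical stand-in for new_split_sentence_data.
sentences = [f"sample sentence {i}" for i in range(12)]
payloads = list(iter_payloads(sentences))
print(len(payloads))
```

Each string in payloads could then be passed to the predictor's predict method with the "application/json" content type.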

In [None]:
from sagemaker.predictor import Predictor
import json

predictor = Predictor('blazingtext-2022-05-14-06-32-59-663')

# load the first five reviews from new_split_sentence_data
example_sentences = new_split_sentence_data[0:5]

payload = {"instances": example_sentences}

print(json.dumps(payload))

# make predictions using the "predict" method. Set initial_args to {'ContentType': 'application/json'}
predictions = json.loads(predictor.predict(json.dumps(payload), initial_args={'ContentType': 'application/json'}))
print(predictions)

In [None]:
# Delete the endpoint when you are done so it stops incurring charges
predictor.delete_endpoint()

## Batch Transform 

In [None]:
import boto3
client = boto3.client('sagemaker')
# Job and model names may contain only letters, digits, and hyphens
response = client.create_transform_job(
    TransformJobName='my-job',
    ModelName='target-model-for-inference',
    TransformInput={
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://mybucket/inputfolder/'
            }
        },
        'SplitType': 'Line'
    },
    TransformOutput={
        'S3OutputPath': 's3://mybucket/outputpath',
        'AssembleWith': 'Line'
    },
    TransformResources={
        'InstanceType': 'ml.m4.xlarge',
        'InstanceCount': 1
    }
)

## Batch Transform Demo

Using the SDK

Programmatically, we do this operation in a similar pattern to what we’re used to. First, we’ll create a boto3 session. Then, similar to our operations on endpoints, we create a model object utilizing the image uri of Amazon’s XGBoost, our model artifact, and an execution role.



In [None]:
from sagemaker import get_execution_role, image_uris
from sagemaker.model import Model

#creating model objects
role = get_execution_role()
image_uri = image_uris.retrieve(framework='xgboost',region='us-west-2', version='latest')
model_data = "s3://sagemaker-us-west-2-565094796913/boston-xgboost-HL/output/xgboost-2021-08-31-23-02-30-970/output/model.tar.gz"

model = Model(image_uri=image_uri, model_data=model_data, role=role)

Using this model object, we create a transformer object. This is used to perform operations on data, which in this case means inference. We specify the instance type, the number of instances, and the S3 location where we want the transformation output to go.



In [None]:
transformer = model.transformer(
    instance_count = 1,
    instance_type = 'ml.m4.xlarge',
    output_path = batch_transform_output_path
)


Then, you call the transform method to do the job.

In [None]:
transformer.transform(
    data=data_s3_path,
    data_type='S3Prefix',
    content_type='text/csv',
    split_type='Line'
)

Exercise: Preprocess (again, again) and upload to S3

The cell below provides you with two functions. split_sentences preprocesses the reviews, and you should be very familiar with it by now. Remember that BlazingText expects JSON input here: cycle_data formats each sentence as {'source': 'THIS IS A SAMPLE SENTENCE'} and writes it to a file, one object per line.

Use the cell to complete the following tasks:

Preprocess reviews_Musical_Instruments_5.json
Upload the file containing the data to S3
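
To see what the formatted file actually contains, the sketch below writes two sentences in the same one-JSON-object-per-line layout to an in-memory buffer and reads them back. The helper name write_json_lines and the second sample sentence are assumptions for illustration. Note that json.dumps emits double quotes and escapes any quotes inside the sentence, so each line is valid JSON.

```python
import io
import json


def write_json_lines(fp, sentences):
    """Write one {"source": ...} JSON object per line."""
    for sentence in sentences:
        fp.write(json.dumps({"source": sentence}) + "\n")


buffer = io.StringIO()
write_json_lines(buffer, ["THIS IS A SAMPLE SENTENCE", 'She said "wow"'])

lines = buffer.getvalue().splitlines()
print(lines[0])

# Every line parses on its own, which is what split_type='Line' relies on.
records = [json.loads(line) for line in lines]
print(records[1]["source"])
```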

In [None]:
import boto3
import json
import os
import zipfile

# Todo: Input the s3 bucket
s3_bucket = "mldeployex"

# Todo: Input the s3 prefix
s3_prefix = "13el"

# Todo: Input the the file to write the data to
file_name = "music_instruments_reviews.txt"

# Function below unzips the archive to the local directory. 

def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')


def split_sentences(input_data):
    split_sentences = []
    for l in open(input_data, 'r'):
        l_object = json.loads(l)
        helpful_votes = float(l_object['helpful'][0])
        total_votes = l_object['helpful'][1]
        if total_votes != 0 and helpful_votes/total_votes != .5:  # Filter out same data as prior jobs. 
            reviewText = l_object['reviewText']
            sentences = reviewText.split(".") 
            for s in sentences:
                if s: # Make sure sentences isn't empty. Common w/ "..."
                    split_sentences.append(s)
    return split_sentences

# Format the data as {'source': 'THIS IS A SAMPLE SENTENCE'}
# And write the data into a file
def cycle_data(fp, data):
    for d in data:
        fp.write(json.dumps({'source':d}) + '\n')

# Imports needed by the error handling below
import logging
from botocore.exceptions import ClientError

# Todo: write a function to upload the data to s3
def upload_file_to_s3(file_name, s3_prefix):
    object_name = os.path.join(s3_prefix, file_name)
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, s3_bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True


# Unzips file.
unzip_data('reviews_Musical_Instruments_5.json.zip')

# Todo: preprocess reviews_Musical_Instruments_5.json 
sentences = split_sentences('reviews_Musical_Instruments_5.json')

# Write data to a file and upload it to s3.
with open(file_name, 'w') as f:
    cycle_data(f, sentences)

upload_file_to_s3(file_name, s3_prefix)

# Get the s3 path for the data
batch_transform_input_path = "s3://" + "/".join([s3_bucket, s3_prefix, file_name])

print(batch_transform_input_path)

Exercise: Use Batch Transform to perform an inference on the dataset


We utilize batch transform through a transformer object. Similar to how we initialized a predictor object in the last exercise, complete the code below to initialize a transformer object and launch a transform job.

You will need the following:

Similar to last exercise, you will need to get a BlazingText image uri from AWS. The methodology you use to do so should be identical to the last exercise.
You will need to instantiate a "model" object.
You will need to call the "transformer" method on the model object to create a transformer. We suggest using 1 instance of ml.m4.xlarge. If this isn't available in your region, feel free to use another instance, such as ml.m5.large
You will need to use this transformer on the data we uploaded to s3. You will be able to do so by passing an "S3Prefix" data_type and an "application/jsonlines" content_type, split by "Line".
Consult the following documentation: https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html

End-to-end, this process should take about 5 minutes on the whole dataset. While developing, consider uploading a subset of the data to s3, and evaluate on that instead.
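
Once the job finishes, batch transform writes one result object per input object under the output path, named after the input file with ".out" appended. The sketch below computes that expected location with the standard library only. The helper name expected_output_uri is an assumption for this example, and the two paths are the ones used in this exercise.

```python
import posixpath


def expected_output_uri(input_uri, output_path):
    """Return where batch transform should place the results for one input object."""
    file_name = posixpath.basename(input_uri)
    return output_path.rstrip("/") + "/" + file_name + ".out"


result_uri = expected_output_uri(
    "s3://mldeployex/13el/music_instruments_reviews.txt",
    "s3://mldeployex/l3el/batchtransform_output",
)
print(result_uri)
```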

In [None]:
from sagemaker import get_execution_role
from sagemaker.model import Model
from sagemaker import image_uris

# Get the execution role

role = get_execution_role()

# Get the image uri using the "blazingtext" algorithm in your region. 

image_uri = image_uris.retrieve(framework='blazingtext',region='us-east-1')

# Get the model artifact from S3

model_data = 's3://mldeployex/12el/model_artifact/blazetxt/output/model.tar.gz'

# Set the s3 output path for the batch transform results

batch_transform_output_path = "s3://mldeployex/l3el/batchtransform_output"

# Define a model object

model = Model(image_uri=image_uri, model_data=model_data, role=role)

# Define a transformer object, using a single instance ml.m4.xlarge. Specify an output path to your s3 bucket. 

transformer = model.transformer(
        instance_count=1,
        instance_type='ml.m4.xlarge',
        output_path=batch_transform_output_path
)

# Call the transform method. Set content_type='application/jsonlines', split_type='Line'

transformer.transform(
    data=batch_transform_input_path, 
    data_type='S3Prefix',
    content_type='application/jsonlines', 
    split_type='Line'
)

transformer.wait()


## Processing Job Demo

In [None]:
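from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

# framework_version, role, and instance_type are strings; instance_count is an int
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
    role=get_execution_role(),
    instance_type='ml.m5.large',
    instance_count=1)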

Then we start the processing job by executing the run method of the Processor object. Within the run method, we need to specify:

The code we wish to use to process the data
Input channels. (Both S3 and local path within the ProcessingInput)
Output channels. (The local path within the ProcessingOutput)
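
To give a feel for what the processing code sees, the sketch below mimics the input and output directories with a temporary folder and runs a stand-in processing function against them. Everything here is an assumption for illustration: the process function and its upper-casing step are placeholders for real processing logic, and in an actual job the directories would be the container paths given in ProcessingInput and ProcessingOutput rather than temporary folders.

```python
import os
import tempfile


def process(input_dir, output_dir):
    """Read every file in input_dir and write a processed copy to output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name)) as src, \
                open(os.path.join(output_dir, name), "w") as dst:
            for line in src:
                dst.write(line.upper())  # placeholder for real processing


with tempfile.TemporaryDirectory() as root:
    # Stand-ins for /opt/ml/processing/input/data and /opt/ml/processing/output.
    input_dir = os.path.join(root, "input", "data")
    output_dir = os.path.join(root, "output")
    os.makedirs(input_dir)
    with open(os.path.join(input_dir, "train.csv"), "w") as f:
        f.write("a,b\n1,2\n")

    process(input_dir, output_dir)

    with open(os.path.join(output_dir, "train.csv")) as f:
        result = f.read()

print(result)
```

In a real job, SageMaker downloads the S3 input into the input directory before the script starts and uploads whatever is in the output directory when it ends.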

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(code='xgboost_process_script.py',
                      inputs=[ProcessingInput(
                          source='s3://sagemaker-us-west-2-565094796913/boston-xgboost-HL/train.csv',
                          destination='/opt/ml/processing/input/data/')],
                      outputs=[ProcessingOutput(source='/opt/ml/processing/output')]
                     )

AWS Console

To launch a processing job through the console, we need to specify the following:

A name.

Desired ECR image: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3

An IAM role.

Computing resources, again going for the cheaper option.

Entry point: python3;/opt/ml/processing/input/code/xgboost_process_script.py

Input channel

    Data S3 location

    Data local path: /opt/ml/processing/input/data

    Code S3 location

    Code local path: /opt/ml/processing/input/code

Output channels

    Output S3 location

    Local path: /opt/ml/processing/output

Trigger for upload (End of Job)

With that, we can submit the job and track it through the Processing Jobs landing page.

## UDACITY SageMaker Essentials: Processing Job Exercise

In prior exercises, we've been running and rerunning the same preprocessing job over and over again. For cleanly formatted data, it's possible to do some preprocessing utilizing batch transform. However, if slightly more sophisticated processing is needed, we would want to do so through a processing job. After beating around the bush for a few exercises, we're finally going to offload the preprocessing step of our data!

In [None]:
import sklearn
import boto3
import json

## Preprocessing (for the final time!)

The cell below should be very familiar to you by now. This cell will write the preprocessing code to a file called "HelloBlazePreprocess.py". This code will be utilized by AWS via a SciKitLearn processing interface to process our data.

In [None]:
%%writefile HelloBlazePreprocess.py

import json
import zipfile

# Function below unzips the archive to the local directory. 

def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')
        return input_data_zip.namelist()[0]

# Input data is a file with a single JSON object per line with the following format: 
# {
#  "reviewerID": <string>,
#  "asin": <string>,
#  "reviewerName" <string>,
#  "helpful": [
#    <int>, (indicating number of "helpful votes")
#    <int>  (indicating total number of votes)
#  ],
#  "reviewText": "<string>",
#  "overall": <int>,
#  "summary": "<string>",
#  "unixReviewTime": <int>,
#  "reviewTime": "<string>"
# }
# 
# We are specifically interested in the fields "helpful" and "reviewText"
#

def label_data(input_data):
    labeled_data = []
    HELPFUL_LABEL = "__label__1"
    UNHELPFUL_LABEL = "__label__2"
     
    for l in open(input_data, 'r'):
        l_object = json.loads(l)
        helpful_votes = float(l_object['helpful'][0])
        total_votes = l_object['helpful'][1]
        reviewText = l_object['reviewText']
        if total_votes != 0:
            if helpful_votes / total_votes > .5:
                labeled_data.append(" ".join([HELPFUL_LABEL, reviewText]))
            elif helpful_votes / total_votes < .5:
                labeled_data.append(" ".join([UNHELPFUL_LABEL, reviewText]))
          
    return labeled_data


# Labeled data is a list of sentences, starting with the label defined in label_data. 

def split_sentences(labeled_data):
    new_split_sentences = []
    for d in labeled_data:
        label = d.split()[0]        
        sentences = " ".join(d.split()[1:]).split(".") # Initially split to separate label, then separate sentences
        for s in sentences:
            if s: # Make sure sentences isn't empty. Common w/ "..."
                new_split_sentences.append(" ".join([label, s]))
    return new_split_sentences

def write_data(data, train_path, test_path, proportion):
    border_index = int(proportion * len(data))
    # Context managers make sure both files are flushed and closed before the job uploads them
    with open(train_path, 'w') as train_f, open(test_path, 'w') as test_f:
        for index, d in enumerate(data):
            if index < border_index:
                train_f.write(d + '\n')
            else:
                test_f.write(d + '\n')

if __name__ == "__main__":
    unzipped_path = unzip_data('/opt/ml/processing/input/Toys_and_Games_5.json.zip')
    labeled_data = label_data(unzipped_path)
    new_split_sentence_data = split_sentences(labeled_data)
    write_data(new_split_sentence_data, '/opt/ml/processing/output/train/hello_blaze_train_scikit', '/opt/ml/processing/output/test/hello_blaze_test_scikit', .9)

## Exercise: Upload unprocessed data to s3

No more local preprocessing! Upload the raw zipped Toys_and_Games dataset to s3.

In [None]:
import logging
import os
import boto3
from botocore.exceptions import ClientError


# Todo
s3_bucket = "mldeployex"
s3_prefix = "13e1"
file_name = "Toys_and_Games_5.json.zip"

def upload_file_to_s3(file_name):
    object_name = os.path.join(s3_prefix, file_name)
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, s3_bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    
upload_file_to_s3(file_name)

source_path = "s3://" + "/".join([s3_bucket, s3_prefix, file_name])
print(source_path)

Exercise: Launch a processing job through the SciKitLearn interface.

We'll be utilizing the SKLearnProcessor object from SageMaker to launch a processing job. Here is some information you'll need to complete the exercise:

You will need to use the SKLearnProcessor object.
You will need 1 instance of ml.m5.large. You can specify this programmatically.
You will need an execution role.

You will need to call the "run" method on the SKLearnProcessor object.

You will need to specify the preprocessing code
the S3 path of the unprocessed data
a 'local' directory path for the input to be downloaded into
'local' directory paths for where you expect the output to be.
you will need to make use of the ProcessingInput and ProcessingOutput features. Review the preprocessing code for where the output is going to go, and where it expects the input to come from.

It is highly recommended that you consult the documentation to help you implement this. https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html
Remember that, conceptually, you are creating a server that our code will be run from. This server will download data, execute code that we specify, and upload data to s3.

If done successfully, you should see a processing job launch in the SageMaker console. To see it, go to the "processing" drop-down menu on the left-hand side and select "processing jobs." Wait until the job is finished.



In [None]:
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Get role

role = get_execution_role()

# Create an SKLearnProcessor. Set framework_version='0.20.0'.

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
    role=role,
    instance_type= 'ml.m5.large',
    instance_count=1)

# Start a run job. You will pass in as parameters the local location of the processing code, 
# a processing input object, two processing output objects. The paths that you pass in here are directories, 
# not the files themselves. Check the preprocessing code for a hint about what these directories should be. 

sklearn_processor.run(code= 'HelloBlazePreprocess.py', # preprocessing code
                      inputs=[ProcessingInput(
                          source = source_path, # the S3 path of the unprocessed data
                          destination='/opt/ml/processing/input' , # a 'local' directory path for the input to be downloaded into
                      )],
                      outputs=[ProcessingOutput(source='/opt/ml/processing/output/train' ),# a 'local' directory path for where you expect the output for train data to be
                               ProcessingOutput(source='/opt/ml/processing/output/test' )]) # a 'local' directory path for where you expect the output for test data to be 

In [None]:
sklearn_processor.jobs[-1].describe() #to check the output path 
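
The description returned above is a nested dictionary, and the S3 locations of the outputs sit under its ProcessingOutputConfig entry. The sketch below pulls them out of a hand-written sample shaped like that response. The helper name output_s3_uris and all values in the sample (bucket, job path, output names) are hypothetical.

```python
def output_s3_uris(description):
    """Map each processing output name to the S3 URI it was uploaded to."""
    outputs = description.get("ProcessingOutputConfig", {}).get("Outputs", [])
    return {o["OutputName"]: o["S3Output"]["S3Uri"] for o in outputs}


# Hypothetical sample in the shape of a describe() response.
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "output-1",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/example-job/output/output-1",
                    "LocalPath": "/opt/ml/processing/output/train",
                    "S3UploadMode": "EndOfJob",
                },
            },
            {
                "OutputName": "output-2",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/example-job/output/output-2",
                    "LocalPath": "/opt/ml/processing/output/test",
                    "S3UploadMode": "EndOfJob",
                },
            },
        ]
    }
}

uris = output_s3_uris(sample_description)
print(uris)
```

Calling output_s3_uris(sklearn_processor.jobs[-1].describe()) should give the train and test locations for the job launched in this exercise.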