In [2]:
from random import shuffle
import multiprocessing
from multiprocessing import Pool
import csv
import re
import numpy as np
import pandas as pd
from pathlib import Path

In [3]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3

sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

bucket = sess.default_bucket() # Replace with your own bucket name if needed

prefix = 'review_topics' #Replace with the prefix under which you want to store the data if needed

arn:aws:iam::443501626368:role/service-role/AmazonSageMaker-ExecutionRole-20200806T142735


In [15]:
print(bucket)

sagemaker-us-east-1-443501626368


### Preprocess training and test data

In [8]:
# define a simple tokenizer (NLTK won't be available to us later on, in our Lambda function)

def simple_tokenizer(input_text):
    REPLACE_NO_SPACE = re.compile("(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(=)|(`)")
    REPLACE_WITH_SPACE = re.compile("(<br\s*/><br\s*/>)|(\-)|(\/)|(\n)|(\t)")
    tokens = REPLACE_NO_SPACE.sub("", input_text.lower())
    tokens = REPLACE_WITH_SPACE.sub(" ", tokens) # note that blazing text expects space-separated tokens
    return tokens

In [9]:
# read and process the dataset
def preprocesser(input_file, output_file):
    filepath=Path.joinpath(Path.cwd().parent, 'data', 'amazon_review_polarity_csv', input_file)
    df = pd.read_csv(filepath, names=["Label", "Title", "Review"])
    # class 0 is the negative and class 1 is the positive
    df['Label']=df['Label']-1
    # drop Title, focus on review itself
    df=df.drop('Title', axis=1)
    # Shuffle reviews and corresponding labels within training and test sets
    df = df.sample(frac = 1)
    # apply the simple tokenizer
    df['Review']=df['Review'].apply(simple_tokenizer)
    # Prefix the index-ed label with __label__
    df['Label']=df['Label'].apply(lambda row: "__label__" +  str(row) )
    # convert the transformed dataframe into a list
    transformed_rows = np.array(df).tolist()
    # write to csv file
    with open(output_file, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n') # notice the delimiter.
        csv_writer.writerows(transformed_rows)

In [10]:
%%time

# Preparing the training dataset
preprocesser('train.csv', 'polarity.train')

CPU times: user 15.1 s, sys: 1.15 s, total: 16.3 s
Wall time: 22.7 s


In [12]:
%%time

# Preparing the validation dataset        
preprocesser('test.csv', 'polarity.validation')

CPU times: user 1.74 s, sys: 164 ms, total: 1.91 s
Wall time: 3.15 s


The data preprocessing cell might take a minute to run. After the data preprocessing is complete, we need to upload it to S3 so that it can be consumed by SageMaker to execute training jobs. We'll use Python SDK to upload these two files to the bucket and prefix location that we have set above.

In [14]:
%%time

train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

sess.upload_data(path='polarity.train', bucket=bucket, key_prefix=train_channel)
sess.upload_data(path='polarity.validation', bucket=bucket, key_prefix=validation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)

CPU times: user 48.9 ms, sys: 7.05 ms, total: 56 ms
Wall time: 240 ms


Next we need to setup an output location at S3, where the model artifact will be dumped. These artifacts are also the output of the algorithm's traning job.

In [16]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

## Training the Sentiment Classifier
Now that we are done with all the setup that is needed, we are ready to train our object detector. To begin, let us create a ``sageMaker.estimator.Estimator`` object. This estimator will launch the training job.

In [17]:
region_name = boto3.Session().region_name

In [18]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest (us-east-1)


Now, let's define the SageMaker `Estimator` with resource configurations and hyperparameters to train Text Classification on *DBPedia* dataset, using "supervised" mode on a `c4.4xlarge` instance.


In [19]:
bt_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.c4.4xlarge',
                                         train_volume_size = 30,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.


Please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters.

In [20]:
bt_model.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)

Now that the hyper-parameters are setup, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create the `sagemaker.session.s3_input` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes.

In [21]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', 
                             content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


Training the algorithm involves a few steps. Firstly, the instance that we requested while creating the Estimator classes is provisioned and is setup with the appropriate libraries. Then, the data from our channels are downloaded into the instance. Once this is done, the training job begins.

In [22]:
bt_model.fit(inputs=data_channels, logs=True)

2020-08-13 15:30:20 Starting - Starting the training job...
2020-08-13 15:30:23 Starting - Launching requested ML instances.........
2020-08-13 15:32:06 Starting - Preparing the instances for training......
2020-08-13 15:33:06 Downloading - Downloading input data
2020-08-13 15:33:06 Training - Downloading the training image.[34mArguments: train[0m
[34m[08/13/2020 15:33:22 INFO 140550977075008] nvidia-smi took: 0.0251748561859 secs to identify 0 gpus[0m
[34m[08/13/2020 15:33:22 INFO 140550977075008] Running single machine CPU BlazingText training using supervised mode.[0m
[34m[08/13/2020 15:33:22 INFO 140550977075008] Processing /opt/ml/input/data/train/polarity.train . File size: 0 MB[0m
[34m[08/13/2020 15:33:22 INFO 140550977075008] Processing /opt/ml/input/data/validation/polarity.validation . File size: 0 MB[0m
[34mRead 0M words[0m
[34mNumber of words:  104[0m
[34mLoading validation data from /opt/ml/input/data/validation/polarity.validation[0m
[34mLoaded validation

## Hosting / Inference
Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. 
note on pricing: https://aws.amazon.com/sagemaker/pricing/instance-types/

In [23]:
text_classifier = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.t2.medium')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


---------------!

BlazingText supports `application/json` as the content-type for inference. The payload should contain a list of sentences with the key as "**instances**" while being passed to the endpoint.

In [50]:
sentences = ['best happy joy smile wonderful amazing loved it recommend very glad excited eager feeling good quality top-notch great']
tokenized_sentences = [simple_tokenizer(sentences[0])]
payload = {"instances" : tokenized_sentences}
response = text_classifier.predict(json.dumps(payload)) 
predictions = json.loads(response)
print(json.dumps(predictions, indent=2))

[
  {
    "prob": [
      0.5000717043876648
    ],
    "label": [
      "__label__0"
    ]
  }
]


### Prepare for Lambda Function
When we eventually construct our Lambda function we won't have access to the `text_classifier` object, so how do we call a SageMaker endpoint?

Python functions that are used in Lambda have access to another Amazon library called boto3. The `boto3` library provides an API for working with Amazon services, including SageMaker. To start with, we need to get a handle to the SageMaker runtime.

In [51]:
import boto3
runtime = boto3.Session().client('sagemaker-runtime')
print(runtime)

<botocore.client.SageMakerRuntime object at 0x7fc60627aac8>


And now that we have access to the SageMaker runtime, we can ask it to make use of (invoke) an endpoint that has already been created. However, we need to provide SageMaker with the name of the deployed endpoint. To find this out we can print it out using the text_classifier object.

In [52]:
print(text_classifier.endpoint)

blazingtext-2020-08-13-15-30-20-492


In [57]:
# write function to call the endpoint
def make_prediction(sentences):
    sentences=list(sentences)
    tokenized_sentences = [simple_tokenizer(sentences[0])]
    payload = {"instances" : tokenized_sentences}
    response = text_classifier.predict(json.dumps(payload)) 

    response = runtime.invoke_endpoint(EndpointName = text_classifier.endpoint, # The name of the endpoint we created
                                           ContentType = 'application/json', # The data format that is expected
                                           Body = json.dumps(payload))

    output = json.loads(response['Body'].read().decode('utf-8'))
    print(output[0]['prob'][0])
    print(output[0]['label'][0].split('__label__')[1])

In [58]:
# try a positive sentence

make_prediction('best happy joy smile wonderful amazing loved it recommend very glad excited eager feeling good quality top-notch great')

0.5002014636993408
1


In [59]:
# try a negative sentence

make_prediction('awful horrible hated worst never disgusting bad terrible destroy bad')

0.5000330805778503
1
