# Classify positive and negative reviews by topic

In [1]:
from random import shuffle
import multiprocessing
from multiprocessing import Pool
import csv
import re
import numpy as np
import pandas as pd
from pathlib import Path

In [6]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3

sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

bucket = sess.default_bucket() # Replace with your own bucket name if needed

region_name = boto3.Session().region_name

arn:aws:iam::443501626368:role/service-role/AmazonSageMaker-ExecutionRole-20200806T142735


In [23]:
print(bucket)

sagemaker-us-east-1-443501626368


# Classify Negative Topics

In [9]:
prefix = 'negative_topics'  # Replace with the prefix under which you want to store the data, if needed

In [37]:
df = pd.read_csv('topics_negative.train', header=None)
df.head()

Unnamed: 0,0
0,i only read the first 2 chapters of this book ...
1,i have to start by saying i love this movie an...
2,i started this book a gift threw it across the...
3,poor quality lasted for less than a month even...
4,im sorry to say that this book almost put me t...
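
BlazingText's supervised mode expects one example per line, with each label prefixed by `__label__` and followed by the space-separated tokens of the text. The preview above is truncated, so it does not show whether the `topics_*.train` files already follow this layout. The sketch below shows how such a line could be produced; the helper name `to_blazingtext_line` and the sample label are hypothetical and do not come from this notebook.

```python
import re


def to_blazingtext_line(label, text):
    """Format one example as '__label__<label> <text>' (hypothetical helper)."""
    # Lowercase the text and collapse runs of whitespace into single spaces.
    tokens = re.sub(r"\s+", " ", text.strip().lower())
    # A label containing spaces would be split into tokens, so join it with underscores.
    safe_label = re.sub(r"\s+", "_", label.strip())
    return "__label__{} {}".format(safe_label, tokens)


# The label "poor quality" is an invented example, not a topic from the dataset.
print(to_blazingtext_line("poor quality", "Poor quality,  lasted for less than a month"))
```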


In [13]:
%%time

train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

sess.upload_data(path='topics_negative.train', bucket=bucket, key_prefix=train_channel)
sess.upload_data(path='topics_negative.validation', bucket=bucket, key_prefix=validation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)

CPU times: user 122 ms, sys: 21.6 ms, total: 144 ms
Wall time: 437 ms


Next we need to set up an output location in S3 where the model artifact will be stored. This artifact is the output of the algorithm's training job.

In [14]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)
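
As a rough illustration of where the artifact ends up: SageMaker appends the training job name and `output/model.tar.gz` to the estimator's output path. The sketch below only composes that string; the function name and the bucket and job names are placeholders, not values from this notebook.

```python
def model_artifact_uri(bucket, prefix, job_name):
    """Return the S3 URI where the model archive is expected after training."""
    # Same pattern as s3_output_location in the notebook.
    output_path = "s3://{}/{}/output".format(bucket, prefix)
    # SageMaker appends '<job name>/output/model.tar.gz' to the output path.
    return "{}/{}/output/model.tar.gz".format(output_path, job_name)


# "my-bucket" and "example-job" are placeholder names.
print(model_artifact_uri("my-bucket", "negative_topics", "example-job"))
```

After `fit` completes, the estimator's `model_data` attribute reports the actual location, so the string above is only useful for orientation.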

Now that the setup is complete, we are ready to train our text classifier. We first retrieve the BlazingText container image; the ``sagemaker.estimator.Estimator`` object created afterwards will launch the training job.

In [15]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest (us-east-1)


Now, let's define the SageMaker Estimator with the resource configuration for training a text classifier on the negative-review topics data, using a single ml.c4.4xlarge instance. The hyperparameters, including the "supervised" mode, are set in the cell after it.

In [17]:
neg_topics_classifier = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type='ml.c4.4xlarge',
    train_volume_size=30,   # EBS volume size in GB
    train_max_run=360000,   # training timeout in seconds (100 hours)
    input_mode='File',
    output_path=s3_output_location,
    sagemaker_session=sess)

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
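
The deprecation warnings printed by these cells refer to SageMaker Python SDK v2, which renames several of the keyword arguments used here. The sketch below lists the renames relevant to this notebook and applies them to a plain dictionary; the helper `to_v2_kwargs` is hypothetical and is not part of the SDK.

```python
# Keyword renames between SageMaker Python SDK v1 and v2 for the arguments used here.
V1_TO_V2_KWARGS = {
    "image_name": "image_uri",
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_volume_size": "volume_size",
    "train_max_run": "max_run",
}


def to_v2_kwargs(v1_kwargs):
    """Return a copy of v1_kwargs with v1 keyword names replaced by their v2 names."""
    return {V1_TO_V2_KWARGS.get(name, name): value for name, value in v1_kwargs.items()}


v1 = {
    "train_instance_count": 1,
    "train_instance_type": "ml.c4.4xlarge",
    "train_volume_size": 30,
    "train_max_run": 360000,
    "input_mode": "File",
}
print(to_v2_kwargs(v1))
```

In v2 the `s3_input` class likewise becomes `sagemaker.inputs.TrainingInput`, and `get_image_uri` is replaced by `sagemaker.image_uris.retrieve`.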


In [18]:
neg_topics_classifier.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)
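
To make the interplay of `early_stopping`, `patience`, and `min_epochs` concrete, here is a minimal sketch of a generic early-stopping rule: stop once validation accuracy has failed to improve for `patience` consecutive epochs, but never before `min_epochs`. This is an assumption about how the three settings combine, not BlazingText's actual implementation, and the function name `stopping_epoch` is hypothetical.

```python
def stopping_epoch(val_accuracies, min_epochs=5, patience=4):
    """Return the 1-based epoch at which this sketch would stop training."""
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            # New best validation accuracy: reset the counter.
            best = accuracy
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
        # Stop only after min_epochs, and only once patience is exhausted.
        if epoch >= min_epochs and epochs_without_gain >= patience:
            return epoch
    # No early stop: every configured epoch was run.
    return len(val_accuracies)


# Invented accuracies for 10 epochs; the plateau starts after epoch 2.
print(stopping_epoch([0.5, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]))
```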

Now that the hyperparameters are set up, let us connect our data channels to the algorithm. To do this, we create ``sagemaker.session.s3_input`` objects from the S3 locations of the data channels and put them in a simple dictionary, which the algorithm consumes.

In [19]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', 
                             content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
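
For orientation, each channel in that dictionary corresponds to one entry of the `InputDataConfig` list in the low-level `CreateTrainingJob` request. The sketch below builds such entries as plain dictionaries, assuming the same settings as the cell above; the helper `channel_config` and the bucket name are hypothetical, and the SDK builds the real request itself.

```python
def channel_config(name, s3_uri, content_type="text/plain"):
    """Build one InputDataConfig entry shaped like the CreateTrainingJob API expects."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                # Every training instance receives a full copy of the data.
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": content_type,
    }


# "my-bucket" is a placeholder bucket name.
input_data_config = [
    channel_config("train", "s3://my-bucket/negative_topics/train"),
    channel_config("validation", "s3://my-bucket/negative_topics/validation"),
]
print([channel["ChannelName"] for channel in input_data_config])
```

With `input_mode='File'`, the contents of each channel are downloaded to `/opt/ml/input/data/<channel name>` inside the training container before training starts.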


In [20]:
neg_topics_classifier.fit(inputs=data_channels, logs=True)

2020-08-13 18:57:15 Starting - Starting the training job...
2020-08-13 18:57:18 Starting - Launching requested ML instances......
2020-08-13 18:58:33 Starting - Preparing the instances for training......
2020-08-13 18:59:38 Downloading - Downloading input data
2020-08-13 18:59:38 Training - Downloading the training image..
Arguments: train
[08/13/2020 18:59:54 INFO 140585622771520] nvidia-smi took: 0.0252358913422 secs to identify 0 gpus
[08/13/2020 18:59:54 INFO 140585622771520] Running single machine CPU BlazingText training using supervised mode.
[08/13/2020 18:59:54 INFO 140585622771520] Processing /opt/ml/input/data/train/topics_negative.train . File size: 8 MB
[08/13/2020 18:59:54 INFO 140585622771520] Processing /opt/ml/input/data/validation/topics_negative.validation . File size: 2 MB
Read 1M words
Number of words:  27328
Loading validation data from /opt/ml/input/data/validation/topics_negative.validation


Once training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint, which lets us request predictions (inference) from the model.

In [21]:
neg_topics_classifier_deployed = neg_topics_classifier.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


-------------!

In [24]:
neg_topics_classifier_deployed.endpoint

'blazingtext-2020-08-13-18-57-15-339'
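
The notebook stops at deployment, so the following is only a sketch of what an inference round trip might look like. BlazingText hosting accepts a JSON body with an `instances` list and an optional `configuration` such as `k` (the number of labels to return), and answers with a list of `label` and `prob` arrays. The helper names and the sample response are hypothetical; no real endpoint output is reproduced here.

```python
import json


def build_payload(sentences, k=1):
    """Serialize tokenized sentences into a JSON request body."""
    return json.dumps({"instances": sentences, "configuration": {"k": k}})


def top_labels(response_body):
    """Extract (label, probability) for the best prediction of each instance."""
    results = []
    for item in json.loads(response_body):
        # Strip the '__label__' prefix that BlazingText adds to class names.
        label = item["label"][0].replace("__label__", "", 1)
        results.append((label, item["prob"][0]))
    return results


# Invented response for illustration; not output from the deployed endpoint.
sample_response = '[{"label": ["__label__shipping"], "prob": [0.87]}]'
print(build_payload(["poor quality lasted for less than a month"]))
print(top_labels(sample_response))
```

With the v1 SDK used in this notebook, the payload would presumably be sent with `neg_topics_classifier_deployed.predict(build_payload([...]))`.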

# Classify Positive Topics

In [34]:
prefix = 'positive_topics'  # Replace with the prefix under which you want to store the data, if needed

In [38]:
df1 = pd.read_csv('topics_positive.train', header=None)
df1.head()

Unnamed: 0,0
0,i only read the first 2 chapters of this book ...
1,i have to start by saying i love this movie an...
2,i started this book a gift threw it across the...
3,poor quality lasted for less than a month even...
4,im sorry to say that this book almost put me t...


In [40]:
%%time

train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

sess.upload_data(path='topics_positive.train', bucket=bucket, key_prefix=train_channel)
sess.upload_data(path='topics_positive.validation', bucket=bucket, key_prefix=validation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)

CPU times: user 138 ms, sys: 9.23 ms, total: 147 ms
Wall time: 595 ms


Next we need to set up an output location in S3 where the model artifact will be stored. This artifact is the output of the algorithm's training job.

In [41]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

In [42]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest (us-east-1)


Now, let's define the SageMaker Estimator with the resource configuration for training a text classifier on the positive-review topics data, using a single ml.c4.4xlarge instance. The hyperparameters, including the "supervised" mode, are set in the cell after it.

In [43]:
pos_topics_classifier = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type='ml.c4.4xlarge',
    train_volume_size=30,   # EBS volume size in GB
    train_max_run=360000,   # training timeout in seconds (100 hours)
    input_mode='File',
    output_path=s3_output_location,
    sagemaker_session=sess)

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.


In [44]:
pos_topics_classifier.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)
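
The `word_ngrams=2` setting asks the model to use word bigrams as features in addition to single words. The sketch below only enumerates those features for a sentence. How BlazingText stores them internally is not shown in this notebook, and the function name is hypothetical.

```python
def word_ngrams(text, n=2):
    """Return the unigrams plus all contiguous word n-grams up to length n."""
    tokens = text.split()
    features = list(tokens)
    for size in range(2, n + 1):
        # Slide a window of `size` words across the sentence.
        for start in range(len(tokens) - size + 1):
            features.append(" ".join(tokens[start:start + size]))
    return features


print(word_ngrams("poor quality lasted"))
```

Bigrams let the classifier tell phrases such as "poor quality" apart from the same words appearing far from each other.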

Now that the hyperparameters are set up, let us connect our data channels to the algorithm. To do this, we create ``sagemaker.session.s3_input`` objects from the S3 locations of the data channels and put them in a simple dictionary, which the algorithm consumes.

In [45]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', 
                             content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


In [46]:
# train the positive topics classifier
pos_topics_classifier.fit(inputs=data_channels, logs=True)

2020-08-13 20:36:38 Starting - Starting the training job...
2020-08-13 20:36:41 Starting - Launching requested ML instances......
2020-08-13 20:38:00 Starting - Preparing the instances for training.........
2020-08-13 20:39:28 Downloading - Downloading input data...
2020-08-13 20:40:08 Training - Training image download completed. Training in progress.
2020-08-13 20:40:08 Uploading - Uploading generated training model
Arguments: train
[08/13/2020 20:39:58 INFO 140543817901888] nvidia-smi took: 0.0252060890198 secs to identify 0 gpus
[08/13/2020 20:39:58 INFO 140543817901888] Running single machine CPU BlazingText training using supervised mode.
[08/13/2020 20:39:58 INFO 140543817901888] 2 files found in train channel. Using /opt/ml/input/data/train/topics_positive.train for training...
[08/13/2020 20:39:58 INFO 140543817901888] Processing /opt/ml/input/data/train/topics_positive.train . File size: 8 MB
[08/13/2020 20:39:58 INFO 140543817

Once training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint, which lets us request predictions (inference) from the model.

In [47]:
pos_topics_classifier_deployed = pos_topics_classifier.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


---------------!

In [48]:
pos_topics_classifier_deployed.endpoint

'blazingtext-2020-08-13-20-36-38-040'
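
Real-time endpoints keep running, and keep incurring charges, until they are deleted. A minimal cleanup sketch follows, assuming the two predictor objects returned by `deploy` are still in scope. The helper `delete_endpoints` and the stand-in class are hypothetical; the stand-in exists only so the sketch can run without touching AWS.

```python
def delete_endpoints(predictors):
    """Call delete_endpoint() on each predictor and return how many were deleted."""
    deleted = 0
    for predictor in predictors:
        predictor.delete_endpoint()
        deleted += 1
    return deleted


class DryRunPredictor:
    """Stand-in exposing delete_endpoint(), used here only for a dry run."""

    def __init__(self, name):
        self.name = name
        self.deleted = False

    def delete_endpoint(self):
        self.deleted = True


# In the notebook this would be:
# delete_endpoints([neg_topics_classifier_deployed, pos_topics_classifier_deployed])
print(delete_endpoints([DryRunPredictor("negative"), DryRunPredictor("positive")]))
```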