## Introduction

Text classification can be used to solve various use cases such as sentiment analysis, spam detection, and hashtag prediction. This notebook demonstrates how to use SageMaker BlazingText for supervised text classification, either binary or multi-class, with single or multiple labels. BlazingText can train a model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with state-of-the-art deep learning text classification algorithms. BlazingText extends the fastText text classifier to leverage GPU acceleration using custom CUDA kernels.

## Setup

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.

In [1]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import nltk
sess = sagemaker.Session()

role = get_execution_role()
print(
    role
)  # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

bucket = 'inox-icm-bt'  # Replace with your own bucket name if needed
print(bucket)
prefix = "blazingtext/new_data/supervised"  # Replace with the prefix under which you want to store the data if needed

arn:aws:iam::430758128697:role/service-role/AmazonSageMaker-ExecutionRole-20220308T183116
inox-icm-bt


### Data Preparation

Now we'll download the dataset on which we want to train the text classification model. BlazingText expects a single preprocessed text file with space-separated tokens, where each line contains a single sentence and the corresponding label(s) prefixed by "\__label\__".

In this example, we train the text classification model on a set of plain-text files stored in the S3 bucket under `icm_data/`, one file per category (13 categories, such as `Advertisement`, `Goods`, and `NonTaxable`). Each line of a file is one text record, and the file name supplies its label.
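
To make the expected input format concrete, here is a minimal sketch that turns a list of labels and a list of tokens into one BlazingText training line. The helper name `to_blazingtext_line` is hypothetical and not part of any SDK, and the sketch assumes the text has already been preprocessed into tokens.

```python
def to_blazingtext_line(labels, tokens):
    """Build one training line: label(s) first, then space-separated tokens.

    Hypothetical helper for illustration; `labels` is a list of label names
    and `tokens` is a list of already preprocessed tokens.
    """
    prefix = " ".join("__label__" + label for label in labels)
    return prefix + " " + " ".join(tokens)


line = to_blazingtext_line(["Advertisement"], ["16999", "branding", "hbda", "station"])
print(line)  # __label__Advertisement 16999 branding hbda station
```

For multi-label data, passing several names in `labels` puts each prefixed label at the start of the same line.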

In [2]:
import pandas as pd
import numpy as np
import glob
import os
import boto3

In [3]:
!mkdir -p icm_data
# Bucket that holds the per-category text files
my_bucket = 'inox-icm-bt'
categories = [
    'Advertisement', 'CallCentreServices', 'Commissions', 'GeneralContractualServices',
    'Goods', 'NonTaxable', 'Professional', 'Rent_Ia', 'Rent_Ib', 'Royalty',
    'RoyaltyOther', 'Technical', 'Transportation',
]
# Connect to the S3 bucket and download one file per category
s3 = boto3.resource('s3')
for category in categories:
    key = 'icm_data/{}.txt'.format(category)
    s3.Bucket(my_bucket).download_file(key, key)


In [4]:
file_list = glob.glob(os.path.join(os.getcwd(), "icm_data", "*.txt"))
file_list

['/root/other/prep-model-to-use/icm_data/GeneralContractualServices.txt',
 '/root/other/prep-model-to-use/icm_data/Rent_Ia.txt',
 '/root/other/prep-model-to-use/icm_data/RoyaltyOther.txt',
 '/root/other/prep-model-to-use/icm_data/Royalty.txt',
 '/root/other/prep-model-to-use/icm_data/Commissions.txt',
 '/root/other/prep-model-to-use/icm_data/NonTaxable.txt',
 '/root/other/prep-model-to-use/icm_data/Goods.txt',
 '/root/other/prep-model-to-use/icm_data/Transportation.txt',
 '/root/other/prep-model-to-use/icm_data/Rent_Ib.txt',
 '/root/other/prep-model-to-use/icm_data/Technical.txt',
 '/root/other/prep-model-to-use/icm_data/Advertisement.txt',
 '/root/other/prep-model-to-use/icm_data/Professional.txt',
 '/root/other/prep-model-to-use/icm_data/CallCentreServices.txt']

# Loading the text files into a DataFrame

In [5]:
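df = pd.DataFrame()
for tag in file_list:
    # Each line of the file is one record; sep='\n' keeps the whole line in the Text column
    data = pd.read_csv(tag,sep='\n',header=None,names=['Text','label'])
    # The label is the file name without its .txt extension
    data['label']=os.path.splitext(os.path.basename(tag))[0]
    df = pd.concat([df,data],axis=0)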

In [6]:
pd.set_option("display.max_colwidth", None)  # None disables column truncation


In [7]:
df.head(4)

Unnamed: 0,Text,label
0,\t3 Nos. VRC AMC contract for SDEE.,GeneralContractualServices
1,\tAMC for 10 Nos of VRC's 1000KG,GeneralContractualServices
2,\tAMC for 3 No's VRC installed at SJAB-FC from the 1st September 2018 to 31st August 2019,GeneralContractualServices
3,\tAMC for 3 No's VRC installed at SJAB-FC from the 1st September 2019 to 31st August 2020,GeneralContractualServices


In [8]:
# we need to reshuffle the dataframe
df = df.sample(frac = 1)

In [9]:
df.head(3)

Unnamed: 0,Text,label
10946,60840 ;Batch<2955>/Line<2597>-44013002-Health Club-Gym Bill-Company,NonTaxable
2497,"20553 Refund for order: 403-1869284-1935521, sellerId 11647761925, amount 1050.00 INR",NonTaxable
4,16999 Branding for HBDA station,Advertisement


In [10]:
df.label.value_counts()

NonTaxable                    38564
Goods                         7777 
Rent_Ia                       5185 
GeneralContractualServices    4678 
Rent_Ib                       2739 
Professional                  1774 
Transportation                1339 
Advertisement                 841  
RoyaltyOther                  262  
Technical                     57   
CallCentreServices            36   
Commissions                   12   
Royalty                       5    
Name: label, dtype: int64

In [11]:
# Checking for Duplicate rows
df.shape

(63269, 2)

In [12]:
### Dropping Duplicate rows
df = df.drop_duplicates(subset = ['Text', 'label']).reset_index(drop = True)
df.shape

(62627, 2)

In [13]:
df.label.value_counts()

NonTaxable                    38443
Goods                         7584 
Rent_Ia                       5173 
GeneralContractualServices    4463 
Rent_Ib                       2723 
Professional                  1744 
Transportation                1311 
Advertisement                 816  
RoyaltyOther                  262  
Technical                     57   
CallCentreServices            34   
Commissions                   12   
Royalty                       5    
Name: label, dtype: int64

# Preprocessing the text

#### Removing Commissions from training and testing dataset due to high imbalance

In [14]:
df.head(3)

Unnamed: 0,Text,label
0,60840 ;Batch<2955>/Line<2597>-44013002-Health Club-Gym Bill-Company,NonTaxable
1,"20553 Refund for order: 403-1869284-1935521, sellerId 11647761925, amount 1050.00 INR",NonTaxable
2,16999 Branding for HBDA station,Advertisement


#### Feature Engineering

In [15]:
all_categories_df = df

In [16]:
all_categories_df["label"].value_counts()

NonTaxable                    38443
Goods                         7584 
Rent_Ia                       5173 
GeneralContractualServices    4463 
Rent_Ib                       2723 
Professional                  1744 
Transportation                1311 
Advertisement                 816  
RoyaltyOther                  262  
Technical                     57   
CallCentreServices            34   
Commissions                   12   
Royalty                       5    
Name: label, dtype: int64

Let's calculate the number of words in each row.

In [17]:
all_categories_df.head(2)

Unnamed: 0,Text,label
0,60840 ;Batch<2955>/Line<2597>-44013002-Health Club-Gym Bill-Company,NonTaxable
1,"20553 Refund for order: 403-1869284-1935521, sellerId 11647761925, amount 1050.00 INR",NonTaxable


In [18]:
all_categories_df["word_count"] = all_categories_df["Text"].apply(lambda x: len(str(x).split()))
all_categories_df.head()

Unnamed: 0,Text,label,word_count
0,60840 ;Batch<2955>/Line<2597>-44013002-Health Club-Gym Bill-Company,NonTaxable,4
1,"20553 Refund for order: 403-1869284-1935521, sellerId 11647761925, amount 1050.00 INR",NonTaxable,10
2,16999 Branding for HBDA station,Advertisement,5
3,66030 ;Batch<4300>/Line<4096>-66398982-Individual Meals-OFFICE WORK-Company,NonTaxable,4
4,14873\tTOWARDS SNACKS & BERVERAGES VENDING MANCHINE CHAREGS TDS 194I 2%,Rent_Ia,11


Let's get basic statistics about the dataset.

In [19]:
stat=dict(all_categories_df["word_count"].describe())
stat

{'count': 62627.0,
 'mean': 7.635732192185479,
 'std': 4.8250174359064095,
 'min': 1.0,
 '25%': 4.0,
 '50%': 7.0,
 '75%': 10.0,
 'max': 50.0}

In [20]:
Rule_IQR_Range = stat['75%'] + 1.5 * (stat['75%']-stat['25%'])
print(Rule_IQR_Range)

19.0


In [21]:
# import matplotlib.pyplot as plt
# %matplotlib inline
# plt.figure(figsize = (15,6))
# all_categories_df["word_count"].plot()

The mean is around 8 words per text, but there are outliers, such as a text with 50 words. Outliers like these can make it harder for the model to perform well, so we will drop those rows.
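
The `Rule_IQR_Range` value computed above is the upper fence of Tukey's IQR rule. As a minimal sketch of that arithmetic, the hypothetical helper below recomputes the fence from the quartiles in the statistics and applies it to a few made-up word counts.

```python
def iqr_upper_fence(q1, q3, k=1.5):
    """Upper outlier fence from Tukey's rule: Q3 + k * (Q3 - Q1)."""
    return q3 + k * (q3 - q1)


# Quartiles taken from the word-count statistics above (25% = 4, 75% = 10)
fence = iqr_upper_fence(4.0, 10.0)
print(fence)  # 19.0

# Hypothetical word counts: keep only the rows at or below the fence
word_counts = [4, 7, 10, 19, 23, 50]
kept = [n for n in word_counts if n <= fence]
print(kept)  # [4, 7, 10, 19]
```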

Let's first drop the rows whose text consists of a single token, since they carry essentially no descriptive text.

In [22]:
no_text = all_categories_df[all_categories_df["word_count"] == 1]
print(len(no_text))

# drop these rows
all_categories_df.drop(no_text.index, inplace=True)

25


Let's drop the rows that are longer than 20 words, a cutoff just above the upper IQR fence of 19 computed earlier. Removing these outliers makes the dataset easier for the model to train on.

In [23]:
long_text = all_categories_df[(all_categories_df["word_count"] > 20)]
print(len(long_text))

# drop these rows
all_categories_df.drop(long_text.index, inplace=True)

1421


In [24]:
all_categories_df["label"].value_counts()

NonTaxable                    38259
Goods                         7269 
Rent_Ia                       4913 
GeneralContractualServices    4216 
Rent_Ib                       2442 
Professional                  1675 
Transportation                1279 
Advertisement                 772  
RoyaltyOther                  259  
Technical                     46   
CallCentreServices            34   
Commissions                   12   
Royalty                       5    
Name: label, dtype: int64

Let's get basic statistics about the dataset after removing the outliers.

In [25]:
all_categories_df["word_count"].describe()

count    61181.000000
mean     7.213269    
std      3.895428    
min      2.000000    
25%      4.000000    
50%      7.000000    
75%      10.000000   
max      20.000000   
Name: word_count, dtype: float64

In [26]:
# plt.figure(figsize = (15,8))
all_categories_df["word_count"].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7fc0be416b90>

This looks much more balanced.

Now we drop the `word_count` column, as we no longer need it.

In [27]:
all_categories_df.drop(columns="word_count", inplace=True)

In [28]:
all_categories_df.head(3)

Unnamed: 0,Text,label
0,60840 ;Batch<2955>/Line<2597>-44013002-Health Club-Gym Bill-Company,NonTaxable
1,"20553 Refund for order: 403-1869284-1935521, sellerId 11647761925, amount 1050.00 INR",NonTaxable
2,16999 Branding for HBDA station,Advertisement


In [29]:
# Randomly split the imbalanced dataset into train and test sets.
# No `stratify` argument is passed, so the split is not stratified and rare classes can be missing from the test set.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.02, random_state=6)
# train = all_categories_df.copy()
# test = all_categories_df.sample(frac = .06)

In [30]:
test.label.value_counts(),test.shape

(NonTaxable                    782
 Goods                         147
 Rent_Ia                       94 
 GeneralContractualServices    70 
 Rent_Ib                       52 
 Professional                  33 
 Transportation                25 
 Advertisement                 13 
 RoyaltyOther                  7  
 Royalty                       1  
 Name: label, dtype: int64,
 (1224, 2))

In [31]:
train.label.value_counts(),train.shape

(NonTaxable                    37477
 Goods                         7122 
 Rent_Ia                       4819 
 GeneralContractualServices    4146 
 Rent_Ib                       2390 
 Professional                  1642 
 Transportation                1254 
 Advertisement                 759  
 RoyaltyOther                  252  
 Technical                     46   
 CallCentreServices            34   
 Commissions                   12   
 Royalty                       4    
 Name: label, dtype: int64,
 (59957, 2))
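
The test split above contains no `Technical`, `CallCentreServices`, or `Commissions` rows, because the split is purely random. One possible remedy is a stratified split, which scikit-learn's `train_test_split` supports through its `stratify` argument. The pure-Python sketch below only illustrates the idea: `stratified_split` is a hypothetical helper, and its rule of keeping at least one test row per class is an assumption, not what the notebook does.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac, seed=0):
    """Return (train_idx, test_idx) so each label with >= 2 rows appears in both sets.

    Hypothetical pure-Python helper for illustration only.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        if len(idxs) > 1:
            # At least one test row per class, but never the whole class
            n_test = min(max(1, round(len(idxs) * test_frac)), len(idxs) - 1)
        else:
            n_test = 0
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


labels = ["NonTaxable"] * 90 + ["Goods"] * 8 + ["Royalty"] * 2
train_idx, test_idx = stratified_split(labels, test_frac=0.1)
print(sorted({labels[i] for i in test_idx}))  # ['Goods', 'NonTaxable', 'Royalty']
```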

In [32]:
train.head(2)

Unnamed: 0,Text,label
33009,60840 ;Batch<3036>/Line<4117>-45276292-Health Club--Company,NonTaxable
52973,66030 ;Batch<3903>/Line<18401>-59742725-Individual Meals--Company,NonTaxable


In [33]:
# train.drop(train[train.label=="Royalty"].index,inplace = True)

In [34]:
train.shape

(59957, 2)

In [35]:
# we need to reshuffle the dataframe
train = train.sample(frac = 1)

In [36]:
train.head(4)

Unnamed: 0,Text,label
19773,61110 Towards Aug ART Collateral print and dispatch Jaipur,Goods
25671,60840 ;Batch<3024>/Line<5271>-45125904-Health Club-Employee FItness-Company,NonTaxable
3046,16999\tTOWARDS PROJECT MANAGEMNT CABLING SERVICES TDS 194J-10%,Professional
41351,61720 Mar-19: Initiative Media - APV- Made in Heaven RADIO campaign (#OgA4MHY) - Budgeted,Advertisement


In [37]:
test.head(3)

Unnamed: 0,Text,label
16821,14873\tTOWARDS WOODEN PALLET HIRE CHARGES TDS 194I-2%,Rent_Ia
27172,"20553 Refund for order: 406-1895653-0114735, sellerId 19696890935, amount 299.00 INR",NonTaxable
59914,66016 ;Batch<3685>/Line<27502>-55706402- Taxi-went to BRDD-Company,NonTaxable


In [38]:
!mkdir -p data
train.to_csv("./data/train.csv",index=False)
test.to_csv("./data/test.csv",index=False)

The `preprocess` function from the local `preprocessing` module converts each CSV file into the BlazingText input format described above.

In [39]:
%%time
import preprocessing
# Preparing the training dataset
preprocessing.preprocess("data/train.csv", "icm.train")

# Preparing the validation dataset
preprocessing.preprocess("data/test.csv", "icm.validation")

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


CPU times: user 20.7 s, sys: 258 ms, total: 21 s
Wall time: 27.8 s
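
The `preprocessing` module is local to this project and its source is not shown here. Judging only from the preprocessed test sentences printed later in the notebook (lowercased, punctuation replaced by spaces, some common words dropped), the text normalization might look roughly like the sketch below. The function name `normalize` and the stop-word list are assumptions made for illustration, not the module's actual code.

```python
import re

# Small illustrative stop-word list (an assumption, not the module's actual list)
STOP_WORDS = {"for", "to", "the", "from", "of", "and", "at"}


def normalize(text):
    """Lowercase, replace non-alphanumeric runs with spaces, drop stop words."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


print(normalize("14873\tTOWARDS WOODEN PALLET HIRE CHARGES TDS 194I-2%"))
# 14873 towards wooden pallet hire charges tds 194i 2
```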


The data preprocessing cell might take a minute to run. Once preprocessing is complete, we need to upload the files to S3 so that SageMaker can consume them in training jobs. We'll use the SageMaker Python SDK to upload these two files to the bucket and prefix that we set above.

In [40]:
%%time

train_channel = prefix + "/train"
validation_channel = prefix + "/validation"

sess.upload_data(path="icm.train", bucket=bucket, key_prefix=train_channel)
sess.upload_data(path="icm.validation", bucket=bucket, key_prefix=validation_channel)

s3_train_data = "s3://{}/{}".format(bucket, train_channel)
s3_validation_data = "s3://{}/{}".format(bucket, validation_channel)

CPU times: user 86.3 ms, sys: 4.18 ms, total: 90.5 ms
Wall time: 317 ms


Next we need to set up an output location in S3 where the model artifacts will be stored. These artifacts are the output of the algorithm's training job.

In [41]:
s3_output_location = "s3://{}/{}/output".format(bucket, prefix)

## Training
Now that all the required setup is done, we are ready to train our text classifier. To begin, let us create a `sagemaker.estimator.Estimator` object. This estimator will launch the training job.

In [42]:
region_name = boto3.Session().region_name

In [43]:
container = sagemaker.image_uris.retrieve("blazingtext", region_name)
print("Using SageMaker BlazingText container: {} ({})".format(container, region_name))


Using SageMaker BlazingText container: 433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:1 (us-west-2)


## Training the BlazingText model for supervised text classification

Similar to the original implementation of [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using Negative Sampling, on CPUs and additionally on GPU[s]. The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to [*BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*](https://dl.acm.org/citation.cfm?doid=3146347.3146354).




Besides skip-gram and CBOW, SageMaker BlazingText also supports the "Batch Skipgram" mode, which uses efficient mini-batching and matrix-matrix operations ([BLAS Level 3 routines](https://software.intel.com/en-us/mkl-developer-reference-fortran-blas-level-3-routines)). This mode enables distributed word2vec training across multiple CPU nodes, allowing almost linear scale up of word2vec computation to process hundreds of millions of words per second. Please refer to [*Parallelizing Word2Vec in Shared and Distributed Memory*](https://arxiv.org/pdf/1604.04661.pdf) to learn more.

BlazingText also supports a *supervised* mode for text classification. It extends the FastText text classifier to leverage GPU acceleration using custom CUDA kernels. The model can be trained on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. For more information, please refer to the [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).

To summarize, the following modes are supported by BlazingText on different types of instances:

| Modes | cbow (supports subwords training) | skipgram (supports subwords training) | batch_skipgram | supervised |
|:---:|:---:|:---:|:---:|:---:|
| Single CPU instance | ✔ | ✔ | ✔ | ✔ |
| Single GPU instance | ✔ | ✔ | | ✔ (Instance with 1 GPU only) |
| Multiple CPU instances | | | ✔ | |
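
The support matrix can also be written as a small lookup, which is handy for validating a configuration before launching a job. This is only a sketch transcribed from the table; the layout keys (`single_cpu`, `single_gpu`, `multi_cpu`) and the helper `is_supported` are hypothetical names, not SageMaker identifiers.

```python
# Supported modes per instance layout, transcribed from the table above
SUPPORTED_MODES = {
    "single_cpu": {"cbow", "skipgram", "batch_skipgram", "supervised"},
    "single_gpu": {"cbow", "skipgram", "supervised"},
    "multi_cpu": {"batch_skipgram"},
}


def is_supported(mode, layout):
    """Return True if `mode` can run on the given instance layout."""
    return mode in SUPPORTED_MODES.get(layout, set())


print(is_supported("supervised", "single_cpu"))  # True
print(is_supported("supervised", "multi_cpu"))   # False
```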

Now, let's define the SageMaker `Estimator` with resource configurations and hyperparameters to train text classification on our dataset, using "supervised" mode on an `ml.c4.4xlarge` instance.

Refer to [BlazingText Hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) in the Amazon SageMaker documentation for the complete list of hyperparameters.

In [44]:
bt_model = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.c4.4xlarge",
    volume_size=30,
#     max_run=360000,
    input_mode="File",
    output_path=s3_output_location,
    hyperparameters={
        "mode": "supervised",
        "epochs": 7,
#         "min_count": 2,
        "learning_rate": 0.12,
#         "vector_dim": 10,
#         "early_stopping": True,
#         "patience": 4,
        "min_epochs": 5,
        "word_ngrams": 2,
    },
)

In [45]:
# Auto tune
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "epochs": IntegerParameter(10,45),
#     "batch_size": IntegerParameter(8,32),
    "learning_rate": ContinuousParameter(0.020, 0.300),
#     "window_size": IntegerParameter(1, 10),
    "word_ngrams": IntegerParameter(1,2),
#     "buckets": IntegerParameter(1000000,10000000),
#     "min_count": IntegerParameter(1,100)
}

objective_metric_name = "validation:accuracy"
objective_type = "Maximize"

tuner = HyperparameterTuner(
    bt_model,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=6,
    max_parallel_jobs=2,
    objective_type=objective_type,
)

Now that the hyperparameters are set up, let us prepare the handshake between our data channels and the algorithm. To do this, we create `sagemaker.inputs.TrainingInput` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes.

In [46]:
train_data = sagemaker.inputs.TrainingInput(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.inputs.TrainingInput(
    s3_validation_data,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}

We have our `Estimator` object, we have set its hyperparameters, and we have linked our data channels with the algorithm. The only remaining thing to do is to train it. Training involves a few steps. First, the instance that we requested while creating the `Estimator` is provisioned and set up with the appropriate libraries. Then the data from our channels is downloaded onto the instance. Once this is done, the training job begins. Provisioning and data download take some time, depending on the size of the data, so it might be a few minutes before training logs start to appear. The logs also print the accuracy on the validation data for every epoch after the training job has executed `min_epochs`. This metric is a proxy for the quality of the algorithm.

Once the job has finished, a "Job complete" message will be printed. The trained model can be found in the S3 bucket that was set up as `output_path` in the estimator.

### Launch hyperparameter tuning job

Now we can launch a hyperparameter tuning job by calling the `fit()` function. After the tuning job is created, we can go to the SageMaker console to track its progress until it completes.

This should take around 12 minutes to complete.


In [47]:
%%time

tuner.fit(inputs=data_channels, logs=True)

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


.....................................................................................................................................!
CPU times: user 535 ms, sys: 70.3 ms, total: 605 ms
Wall time: 11min 13s



### Analyze Results of a Hyperparameter Tuning job

Once you have completed a tuning job (or even while the job is still running), you can use the code below to analyze the results and understand how each hyperparameter affects the quality of the model.


In [48]:
sm_client = boto3.Session().client("sagemaker")

tuning_job_name = tuner.latest_tuning_job.name
tuning_job_name

'blazingtext-220601-1820'

### Track hyperparameter tuning job progress

After you launch a tuning job, you can see its progress by calling the `describe_hyper_parameter_tuning_job` API. Its output is a JSON object that contains information about the current state of the tuning job. You can call `list_training_jobs_for_hyper_parameter_tuning_job` to see a detailed list of the training jobs that the tuning job launched.

In [49]:
tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
    print("Reminder: the tuning job has not been completed.")

job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)

is_minimize = (
    tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]["Type"]
    != "Maximize"
)
objective_name = tuning_job_result["HyperParameterTuningJobConfig"][
    "HyperParameterTuningJobObjective"
]["MetricName"]

6 training jobs have completed
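
The per-job listing mentioned above can be sketched as follows. The helper `list_tuning_training_jobs` is a hypothetical wrapper around the boto3 SageMaker client call `list_training_jobs_for_hyper_parameter_tuning_job`; it takes the client as an argument so that it does not create AWS resources itself.

```python
def list_tuning_training_jobs(sm_client, tuning_job_name, max_results=10):
    """Return (job name, status, objective value) tuples, highest objective first.

    `sm_client` is expected to be a boto3 SageMaker client.
    """
    response = sm_client.list_training_jobs_for_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name,
        MaxResults=max_results,
        SortBy="FinalObjectiveMetricValue",
        SortOrder="Descending",
    )
    rows = []
    for summary in response["TrainingJobSummaries"]:
        # The objective metric is absent until a job has reported a value
        metric = summary.get("FinalHyperParameterTuningJobObjectiveMetric", {})
        rows.append(
            (summary["TrainingJobName"], summary["TrainingJobStatus"], metric.get("Value"))
        )
    return rows


# Usage with the objects defined in this notebook:
# for row in list_tuning_training_jobs(sm_client, tuning_job_name):
#     print(row)
```

Only the first page of results is read here; a tuning job with more training jobs than `max_results` would need pagination via the response's `NextToken`.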


In [50]:
from pprint import pprint

if tuning_job_result.get("BestTrainingJob", None):
    print("Best model found so far:")
    pprint(tuning_job_result["BestTrainingJob"])
else:
    print("No training jobs have reported results yet.")

Best model found so far:
{'CreationTime': datetime.datetime(2022, 6, 1, 18, 27, 42, tzinfo=tzlocal()),
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'validation:accuracy',
                                                 'Value': 0.964900016784668},
 'ObjectiveStatus': 'Succeeded',
 'TrainingEndTime': datetime.datetime(2022, 6, 1, 18, 31, 42, tzinfo=tzlocal()),
 'TrainingJobArn': 'arn:aws:sagemaker:us-west-2:430758128697:training-job/blazingtext-220601-1820-005-036a40d9',
 'TrainingJobName': 'blazingtext-220601-1820-005-036a40d9',
 'TrainingJobStatus': 'Completed',
 'TrainingStartTime': datetime.datetime(2022, 6, 1, 18, 29, 25, tzinfo=tzlocal()),
 'TunedHyperParameters': {'epochs': '13',
                          'learning_rate': '0.2878765151001413',
                          'word_ngrams': '2'}}


#### We can list hyperparameters and objective metrics of all training jobs and pick up the training job with the best objective metric.

In [51]:
import pandas as pd

tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner_analytics.dataframe()

if len(full_df) > 0:
    df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]
    if len(df) > 0:
        df = df.sort_values("FinalObjectiveValue", ascending=is_minimize)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest": min(df["FinalObjectiveValue"]), "highest": max(df["FinalObjectiveValue"])})
        pd.set_option("display.max_colwidth", None)  # Don't truncate TrainingJobName
    else:
        print("No training jobs have reported valid results yet.")

df

Number of training jobs with valid objective: 6
{'lowest': 0.9534000158309937, 'highest': 0.964900016784668}




Unnamed: 0,epochs,learning_rate,word_ngrams,TrainingJobName,TrainingJobStatus,FinalObjectiveValue,TrainingStartTime,TrainingEndTime,TrainingElapsedTimeSeconds
1,13.0,0.287877,2.0,blazingtext-220601-1820-005-036a40d9,Completed,0.9649,2022-06-01 18:29:25+00:00,2022-06-01 18:31:42+00:00,137.0
3,33.0,0.073748,2.0,blazingtext-220601-1820-003-67fcde4f,Completed,0.9649,2022-06-01 18:25:10+00:00,2022-06-01 18:27:22+00:00,132.0
5,31.0,0.081417,2.0,blazingtext-220601-1820-001-c003843c,Completed,0.9649,2022-06-01 18:22:31+00:00,2022-06-01 18:24:49+00:00,138.0
0,18.0,0.052974,1.0,blazingtext-220601-1820-006-d4832edb,Completed,0.9583,2022-06-01 18:29:24+00:00,2022-06-01 18:30:06+00:00,42.0
4,37.0,0.164153,1.0,blazingtext-220601-1820-002-be54cbf3,Completed,0.9551,2022-06-01 18:22:29+00:00,2022-06-01 18:23:16+00:00,47.0
2,41.0,0.072506,1.0,blazingtext-220601-1820-004-90148ee0,Completed,0.9534,2022-06-01 18:26:39+00:00,2022-06-01 18:27:26+00:00,47.0


In [52]:
bt_model = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.c4.4xlarge",
    volume_size=30,
#     max_run=360000,
    input_mode="File",
    output_path=s3_output_location,
    hyperparameters={
        "mode": "supervised",
        "epochs": 16,
#         "min_count": 2,
        "learning_rate": 0.06049598865396785,
#         "vector_dim": 10,
#         "early_stopping": True,
#         "patience": 4,
        "min_epochs": 5,
        "word_ngrams": 2,
    },
)

### Deploy the best trained model

Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same type of instance that we used to train. Inference usually needs less compute power than training, and because endpoints stay up and running for a long time, it's advisable to choose a cheaper instance for inference.

    ml.c4.4xlarge - Compute Optimized instances are ideal for compute bound applications that benefit from high performance processors.
    ml.m4.xlarge - General purpose instances provide a balance of compute, memory and networking resources, and can be used for a variety of diverse workloads.



In [53]:
from sagemaker.serializers import JSONSerializer

text_classifier = tuner.deploy(
    initial_instance_count=1, instance_type="ml.m4.xlarge", serializer=JSONSerializer()
)


2022-06-01 18:31:42 Starting - Preparing the instances for training
2022-06-01 18:31:42 Downloading - Downloading input data
2022-06-01 18:31:42 Training - Training image download completed. Training in progress.
2022-06-01 18:31:42 Uploading - Uploading generated training model
2022-06-01 18:31:42 Completed - Training job completed
------!

## Hosting / Inference
With the endpoint deployed, we can send it requests to get real-time predictions (or inference) from the model.

#### Use JSON format for inference
BlazingText supports `application/json` as the content-type for inference. The payload should contain a list of sentences under the key "**instances**" when passed to the endpoint.

In [54]:
sentences = [
"67410 [PC_INVOICE_LINE_ID: 50849626]Prione PO April - December 2019",
"61410 Jan ART Specials Influencer campaign",
"61110 Towards leaflets & Posters print for Van & Stores",
"61410 Customer Assistance Campaign Measurement cost",
"61410 PR for App Next - Seller App Marketing",
"61730 Nov-19: Initiative - APV- Inside Edge2 TV Campaign (#OGA3IE2Y) - Budgeted",
"14490 Labour charges for Maintenance & Testing of Oil Type,500 KVA ,22KV/433V ,Transformer are as following; a)Magnetic Balance Test. b)Ratio Test c)insulation Resistance Test.",
"67410 [PC_INVOICE_LINE_ID: 50849626]Prione PO April - December 2019",
"16999	Cold Aisle Containment modification to accommodate 1 rack per ro; Cold Aisle Containment modification to accommodate 1 rack per row from Row103 to Row118."
]

sentences = preprocessing.preprocess_line(sentences)
payload = {"instances": sentences, "configuration": {"k": 1}}

response = text_classifier.predict(payload)

predictions = json.loads(response)
# print(json.dumps(predictions, indent=2))
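# Keep the top label (without its 9-character "__label__" prefix) and its probability as a percentage
listed = []
for i in predictions:
    listed.append({"Intent":i["label"][0][9:],"Intent_confidence":i["prob"][0]*100})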

listed



[{'Intent': 'Technical', 'Intent_confidence': 87.72117495536804},
 {'Intent': 'Advertisement', 'Intent_confidence': 100.00004768371582},
 {'Intent': 'Advertisement', 'Intent_confidence': 96.89972996711731},
 {'Intent': 'Advertisement', 'Intent_confidence': 99.98526573181152},
 {'Intent': 'Advertisement', 'Intent_confidence': 99.99953508377075},
 {'Intent': 'Advertisement', 'Intent_confidence': 99.99761581420898},
 {'Intent': 'GeneralContractualServices',
  'Intent_confidence': 38.53313624858856},
 {'Intent': 'Technical', 'Intent_confidence': 87.72117495536804},
 {'Intent': 'Goods', 'Intent_confidence': 92.15027093887329}]

In [55]:
%%time
labels = test.label.to_list()
sentences = preprocessing.preprocess_line(test.Text)
payload = {"instances": sentences, "configuration": {"k": 1}}

response = text_classifier.predict(payload)

predictions_ = json.loads(response)
# Keep the top label for each sentence, dropping the "__label__" prefix (9 characters)
pred = [i["label"][0][9:] for i in predictions_]

# print(json.dumps(predictions_, indent=2))

CPU times: user 519 ms, sys: 7.75 ms, total: 527 ms
Wall time: 597 ms


In [56]:
sentences[:10]

['14873 towards wooden pallet hire charges tds 194i 2',
 '20553 refund order 406 1895653 0114735 sellerid 19696890935 amount 299 00 inr',
 '66016 batch 3685 line 27502 55706402 taxi went brdd company',
 '66010 vt192300022000020000036 batch 3495 line 602 52543941 hotel business trip curator kpp project company',
 '52105 add fund',
 '64530 batch 4330 line 2940 66588816 home internet company',
 '66010 vt200580019005190000126 batch 4341 line 2269 66971298 hotel business company',
 '64757 towards wooden aplelt charges tds 194i 2',
 '66010 vt192800022000280000101 batch 3742 line 10354 56943180 hotel tax ab mobile transition company',
 '28603 lease rent atspl del 17 ambience corporate tower 2 7th floor sft 3095 1st jan 19 31st dec 19']

In [57]:
len(pred), len(test)

(1224, 1224)

In [58]:
pd.Series(labels).value_counts(),pd.Series(pred).value_counts()

(NonTaxable                    782
 Goods                         147
 Rent_Ia                       94 
 GeneralContractualServices    70 
 Rent_Ib                       52 
 Professional                  33 
 Transportation                25 
 Advertisement                 13 
 RoyaltyOther                  7  
 Royalty                       1  
 dtype: int64,
 NonTaxable                    779
 Goods                         153
 Rent_Ia                       98 
 GeneralContractualServices    70 
 Rent_Ib                       49 
 Professional                  30 
 Transportation                27 
 Advertisement                 10 
 RoyaltyOther                  7  
 Technical                     1  
 dtype: int64)

In [59]:
len(labels),len(pred)

(1224, 1224)

In [60]:
df_confusion = pd.crosstab(pd.Series(labels),pd.Series(pred), rownames=['Actual'], colnames=['Predicted'], margins=True)

In [61]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title("Blazing Text Confusion Matrix")
sns.heatmap(df_confusion,annot=True,fmt='',cmap='Blues');

In [62]:
from sklearn.metrics import classification_report
print("\t\tBlazing Text Report\n",classification_report(pd.Series(labels),pd.Series(pred)))

		Blazing Text Report
                             precision    recall  f1-score   support

             Advertisement       1.00      0.77      0.87        13
GeneralContractualServices       0.87      0.87      0.87        70
                     Goods       0.92      0.95      0.93       147
                NonTaxable       0.99      0.99      0.99       782
              Professional       0.90      0.82      0.86        33
                   Rent_Ia       0.94      0.98      0.96        94
                   Rent_Ib       1.00      0.94      0.97        52
                   Royalty       0.00      0.00      0.00         1
              RoyaltyOther       0.86      0.86      0.86         7
                 Technical       0.00      0.00      0.00         0
            Transportation       0.85      0.92      0.88        25

                  accuracy                           0.96      1224
                 macro avg       0.76      0.74      0.74      1224
              weighted 

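The zero rows for `Royalty` and `Technical` follow from how per-class metrics are defined. A class that is never predicted has an undefined precision (0/0), and a class with no true examples has an undefined recall (0/0). scikit-learn reports both as 0.00 and warns about it. Below is a minimal pure-Python sketch with made-up toy labels; the helper `per_class_metrics` is hypothetical and not part of scikit-learn.

```python
from collections import Counter


def per_class_metrics(actual, predicted):
    """Compute precision, recall and support per class; undefined ratios become 0.0."""
    true_pos = Counter(a for a, p in zip(actual, predicted) if a == p)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    metrics = {}
    for cls in sorted(set(actual) | set(predicted)):
        tp = true_pos[cls]
        precision = tp / predicted_counts[cls] if predicted_counts[cls] else 0.0
        recall = tp / actual_counts[cls] if actual_counts[cls] else 0.0
        metrics[cls] = (precision, recall, actual_counts[cls])
    return metrics


# Toy data: "Royalty" is never predicted, "Technical" never occurs in the ground truth.
actual = ["Goods", "Goods", "Royalty", "NonTaxable"]
predicted = ["Goods", "Goods", "Technical", "NonTaxable"]

metrics = per_class_metrics(actual, predicted)
for cls, (precision, recall, support) in metrics.items():
    print(f"{cls:12s} precision={precision:.2f} recall={recall:.2f} support={support}")

# The macro average weights every class equally, so two zero rows pull it well below accuracy.
macro_precision = sum(m[0] for m in metrics.values()) / len(metrics)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(f"macro precision={macro_precision:.2f} accuracy={accuracy:.2f}")
```

The same effect explains why the macro average in the report (0.76 precision) sits far below the accuracy (0.96). In recent scikit-learn versions, `classification_report` also accepts a `zero_division` argument to control the value used for undefined ratios and a `labels` argument to restrict the report to chosen classes.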


### Stop / Close the Endpoint (Optional)
Finally, unless the endpoint needs to keep serving real-time predictions, we should delete it before closing the notebook so that it stops incurring charges.

In [64]:
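# The output below was recorded with the deprecated `text_classifier.endpoint`
# attribute; sagemaker>=2 renames it to `endpoint_name`.
sess.delete_endpoint(text_classifier.endpoint_name)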

The endpoint attribute has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


ClientError: An error occurred (ValidationException) when calling the DeleteEndpoint operation: Could not find endpoint "arn:aws:sagemaker:us-west-2:430758128697:endpoint/blazingtext-220601-1820-005-036a40d9".
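The `ClientError` above quotes a full endpoint ARN, while the `DeleteEndpoint` operation looks endpoints up by name. One possible cause, which is an assumption the recorded output does not confirm, is that the predictor was holding an ARN instead of a bare endpoint name. A minimal sketch of a hypothetical helper, `endpoint_name_from`, that accepts either form:

```python
def endpoint_name_from(identifier: str) -> str:
    """Return the bare endpoint name from either an endpoint name or an endpoint ARN."""
    marker = ":endpoint/"
    if identifier.startswith("arn:") and marker in identifier:
        # Everything after ":endpoint/" is the endpoint name.
        return identifier.split(marker, 1)[1]
    return identifier


# ARN copied from the error message above.
arn = "arn:aws:sagemaker:us-west-2:430758128697:endpoint/blazingtext-220601-1820-005-036a40d9"
name = endpoint_name_from(arn)
print(name)

# In the notebook, the name could then be passed on (left commented out here,
# because it needs AWS credentials and a live endpoint):
# sess.delete_endpoint(name)
```

If the endpoint was already deleted, or lives in a different region from the session, the call fails with the same `ValidationException` regardless of which form is passed.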