## Table of Contents
1. [Introduction](#sec_1)

2. [Data Preparation](#sec_2)

3. [Training](#sec_3)

4. [Hosting / Inference](#sec_4)


<a id='sec_1'></a>
## Introduction

Text classification can be used to solve various use cases such as sentiment analysis, spam detection, and hashtag prediction. This notebook demonstrates how to use SageMaker BlazingText for supervised text classification: binary or multi-class, with single or multiple labels. BlazingText can train the model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with state-of-the-art deep learning text classification algorithms. BlazingText extends the fastText text classifier to leverage GPU acceleration using custom CUDA kernels.

<a id='sec_2'></a>
## Data Preparation  

Deserialize the pickle file generated in the previous step; its contents are used to train the text classification model. BlazingText expects a single preprocessed text file with space-separated tokens. Each line of the file should contain a single sentence and the corresponding label(s), each prefixed by "\__label\__".
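The line format described above can be sketched as follows. This is only an illustration: the helper name `to_blazingtext_line` is hypothetical and not part of the notebook, and it assumes the sentence has already been tokenized.

```python
def to_blazingtext_line(labels, tokens):
    """Build one training line: prefixed label(s) first, then space-separated tokens."""
    prefixed = ["__label__" + label for label in labels]
    return " ".join(prefixed + list(tokens))


# One label and an already-tokenized sentence
line = to_blazingtext_line(["news_culture"], ["黄杨木", "是", "什么", "树", "？"])
print(line)
```

For multi-label data, the same helper simply emits several `__label__` tokens at the start of the line.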



### Prepare Training & Testing Data

In [1]:
!pip install jieba 
!pip install ckiptagger 
!pip install tensorflow==1.13.1



In [3]:
import pickle

# Load the preprocessed, labeled sentences produced in the previous step
with open('/home/ec2-user/SageMaker/chinese-corpus/all_contents.p', 'rb') as f:
    all_contents = pickle.load(f)

all_contents[0:10]


['__label__news_culture 京城 最 值得 你 来场 文化 之旅 的 博物馆',
 '__label__news_culture 发酵 床 的 垫料 种类 有 哪些 ？ 哪 种 更好 ？',
 '__label__news_culture 上联 ： 黄山 黄河 黄皮肤 黄土高原 。 怎么 对 下联 ？',
 '__label__news_culture 林徽因 什么 理由 拒绝 了 徐志摩 而 选择 梁思成 为 终身伴侣 ？',
 '__label__news_culture 黄杨木 是 什么 树 ？',
 '__label__news_culture 上联 ： 草根 登上 星光 道 ， 怎么 对 下联 ？',
 '__label__news_culture 什么 是 超 写实 绘画 ？',
 '__label__news_culture 松涛 听雨莺 婉转 ， 下联 ？',
 '__label__news_culture 上联 ： 老子 骑牛 读书 ， 下联 怎么 对 ？',
 '__label__news_culture 上联 ： 山水 醉人 何须 酒 。 如何 对 下联 ？']

In [4]:
import random

# Shuffle, then hold out 20% of the lines for testing
random.shuffle(all_contents)
train_number = int(len(all_contents) * 0.8)
train = all_contents[:train_number]
test = all_contents[train_number:]

train_file_name = 'toutaio_blazingtext.train.txt'
test_file_name = 'toutaio_blazingtext.test.txt'

# Write one labeled sentence per line; UTF-8 keeps the Chinese tokens intact
with open(train_file_name, 'w', encoding='utf-8') as f_train:
    for t in train:
        f_train.write(t + '\n')
with open(test_file_name, 'w', encoding='utf-8') as f_test:
    for t in test:
        f_test.write(t + '\n')


### Setup SageMaker

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK creates a default bucket in the same region, following a predefined naming convention.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.

In [8]:
!aws s3 mb s3://nlp-demo-yianc/ --region us-east-1

make_bucket: nlp-demo-yianc


In [9]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3

sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf
bucket = 'nlp-demo-yianc' # Replace with your own bucket name if needed
print(bucket)
prefix = 'blazingtext/supervised' #Replace with the prefix under which you want to store the data if needed

arn:aws:iam::230755935769:role/SageMakerExecutionRole
nlp-demo-yianc


In [10]:
%%time

train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

sess.upload_data(path=train_file_name, bucket=bucket, key_prefix=train_channel)
sess.upload_data(path=test_file_name, bucket=bucket, key_prefix=validation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)
s3_train_data

CPU times: user 802 ms, sys: 189 ms, total: 991 ms
Wall time: 1.18 s


's3://nlp-demo-yianc/blazingtext/supervised/train'

Next, we need to set up an output location in S3 where the model artifacts will be stored. These artifacts are the output of the algorithm's training job.

In [11]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)
print(s3_output_location)

s3://nlp-demo-yianc/blazingtext/supervised/output


<a id='sec_3'></a>
## Training
Now that all the setup is done, we are ready to train our text classifier. To begin, let us create a `sagemaker.estimator.Estimator` object. This estimator will launch the training job.

In [12]:
region_name = boto3.Session().region_name

In [13]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: latest.


Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:1 (us-east-1)


### Training the BlazingText model for supervised text classification

Similar to the original implementation of [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using negative sampling, on CPUs and additionally on GPUs. The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to [*BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*](https://dl.acm.org/citation.cfm?doid=3146347.3146354).




Besides skip-gram and CBOW, SageMaker BlazingText also supports the "Batch Skipgram" mode, which uses efficient mini-batching and matrix-matrix operations ([BLAS Level 3 routines](https://software.intel.com/en-us/mkl-developer-reference-fortran-blas-level-3-routines)). This mode enables distributed word2vec training across multiple CPU nodes, allowing almost linear scale up of word2vec computation to process hundreds of millions of words per second. Please refer to [*Parallelizing Word2Vec in Shared and Distributed Memory*](https://arxiv.org/pdf/1604.04661.pdf) to learn more.

BlazingText also supports a *supervised* mode for text classification. It extends the FastText text classifier to leverage GPU acceleration using custom CUDA kernels. The model can be trained on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. For more information, please refer to the [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).

To summarize, BlazingText supports the following modes on different instance types:

|          Modes         | cbow (supports subwords training) | skipgram (supports subwords training) | batch_skipgram | supervised |
|:----------------------:|:----:|:--------:|:--------------:|:--------------:|
|   Single CPU instance  |   ✔  |     ✔    |        ✔       |  ✔  |
|   Single GPU instance  |   ✔  |     ✔    |                |  ✔ (Instance with 1 GPU only)  |
| Multiple CPU instances |      |          |        ✔       |     |
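
The table above can also be encoded as a small lookup, which is handy for checking a planned configuration before launching a job. This is a sketch only: the setup keys (`single_cpu`, `single_gpu`, `multi_cpu`) and the helper name are made up for illustration, and the "1 GPU only" restriction for supervised mode is not modeled.

```python
# Mode support per instance setup, transcribed from the table above
SUPPORTED_MODES = {
    "single_cpu": {"cbow", "skipgram", "batch_skipgram", "supervised"},
    "single_gpu": {"cbow", "skipgram", "supervised"},
    "multi_cpu": {"batch_skipgram"},
}


def is_supported(mode, instance_setup):
    """Return True if the table lists `mode` for the given instance setup."""
    return mode in SUPPORTED_MODES.get(instance_setup, set())


print(is_supported("supervised", "single_cpu"))
print(is_supported("supervised", "multi_cpu"))
```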

Now, let's define the SageMaker `Estimator` with resource configurations and hyperparameters to train text classification on the corpus prepared above, using "supervised" mode on a `c4.4xlarge` instance.


In [14]:
bt_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.c4.4xlarge',
                                         train_volume_size = 30,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_max_run has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_volume_size has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
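
The warnings above say that four keyword arguments were renamed in SageMaker Python SDK v2. As a sketch, the mapping can be written down as a plain dictionary with a helper that rewrites old-style keyword arguments. The helper `to_v2_kwargs` is hypothetical and not part of the SDK; the new names follow the v2 migration guide linked in the warnings.

```python
# Old (v1) keyword argument -> new (v2) keyword argument
V2_RENAMES = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_volume_size": "volume_size",
    "train_max_run": "max_run",
}


def to_v2_kwargs(kwargs):
    """Rename v1-style Estimator keyword arguments, leaving all others untouched."""
    return {V2_RENAMES.get(name, name): value for name, value in kwargs.items()}


old_style = {
    "train_instance_count": 1,
    "train_instance_type": "ml.c4.4xlarge",
    "train_volume_size": 30,
    "train_max_run": 360000,
    "input_mode": "File",
}
print(to_v2_kwargs(old_style))
```

The resulting dictionary could then be passed to the estimator as `**kwargs` alongside the image URI and role.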


Please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters.

In [15]:
bt_model.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)
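
The `early_stopping`, `patience`, and `min_epochs` settings above work together. The following is a generic sketch of patience-based early stopping, not BlazingText's internal code: it assumes validation accuracy is first consulted at epoch `min_epochs` and that training stops once `patience` consecutive epochs bring no improvement. The function name and the accuracy values are made up for illustration.

```python
def stopping_epoch(val_accuracies, min_epochs=5, patience=4):
    """Return the 1-based epoch at which training would stop under the assumed rule."""
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if epoch < min_epochs:
            continue  # assumption: validation is not consulted before min_epochs
        if acc > best:
            best = acc
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch
    # Never ran out of patience: train for all available epochs
    return len(val_accuracies)


# Hypothetical accuracies: the peak is at epoch 5, followed by four epochs without gain
history = [0.50, 0.60, 0.70, 0.80, 0.85, 0.84, 0.83, 0.82, 0.81, 0.80]
print(stopping_epoch(history, min_epochs=5, patience=4))
```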

Now that the hyperparameters are set up, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create the `sagemaker.session.s3_input` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes.

In [16]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', 
                             content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}

The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


We now have an `Estimator` object, its hyperparameters are set, and the data channels are linked to the algorithm. The only remaining step is to train, which the following command does. Training involves a few steps. First, the instance requested when creating the `Estimator` is provisioned and set up with the appropriate libraries. Then the data from the channels is downloaded onto the instance. Once this is done, the training job begins. Provisioning and data download take some time, depending on the size of the data, so it may be a few minutes before training logs appear. After the job has run `min_epochs` epochs, the logs also print the accuracy on the validation data for every epoch. This metric is a proxy for the quality of the model.

Once the job has finished, a "Job complete" message is printed. The trained model can be found in the S3 bucket that was set up as `output_path` in the estimator.

In [17]:
bt_model.fit(inputs=data_channels, logs=True)

2020-12-01 03:41:52 Starting - Starting the training job...
2020-12-01 03:41:55 Starting - Launching requested ML instances.........
2020-12-01 03:43:25 Starting - Preparing the instances for training...
2020-12-01 03:44:19 Downloading - Downloading input data...
2020-12-01 03:44:46 Training - Training image download completed. Training in progress.
Arguments: train
[12/01/2020 03:44:47 INFO 139871151068544] nvidia-smi took: 0.025226593017578125 secs to identify 0 gpus
[12/01/2020 03:44:47 INFO 139871151068544] Running single machine CPU BlazingText training using supervised mode.
Number of CPU sockets found in instance is  1
[12/01/2020 03:44:47 INFO 139871151068544] Processing /opt/ml/input/data/train/toutaio_blazingtext.train.txt . File size: 29.15908718109131 MB
[12/01/2020 03:44:47 INFO 139871151068544] Processing /opt/ml/input/data/validation/toutaio_blazingtext.test.txt . File size: 7.292818069458008 MB
Read 4M words


<a id='sec_4'></a>
## Hosting / Inference
Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same type of instance that we used to train. Because endpoints typically stay up and running for a long time, it is advisable to choose a cheaper instance for inference.

In [21]:
from sagemaker.serializers import JSONSerializer

# Deploy the trained model to a real-time endpoint that accepts JSON requests
text_classifier = bt_model.deploy(initial_instance_count=1,
                                  instance_type='ml.m4.xlarge',
                                  serializer=JSONSerializer())

---------------!

#### Use JSON format for inference
BlazingText supports `application/json` as the content-type for inference. The payload passed to the endpoint should contain a list of sentences under the key "**instances**".
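
As a minimal sketch of the payload structure described above, the helper below wraps tokenized sentences under the `instances` key and optionally adds the `configuration` / `k` setting that this notebook uses for top-k predictions. The helper name `build_payload` is hypothetical; nothing here calls the endpoint.

```python
import json


def build_payload(tokenized_sentences, k=None):
    """Wrap space-separated sentences in the JSON structure sent to the endpoint."""
    payload = {"instances": list(tokenized_sentences)}
    if k is not None:
        # Optional: ask for the top-k labels instead of only the best one
        payload["configuration"] = {"k": k}
    return payload


# ensure_ascii=False keeps the Chinese tokens readable in the printed JSON
body = json.dumps(build_payload(["黄杨木 是 什么 树 ？"]), ensure_ascii=False)
print(body)
```

When the predictor is created with a JSON serializer, the dictionary returned by the helper can be passed to `predict` directly, without calling `json.dumps` first.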

In [23]:
import sys
sys.path.append('/home/ec2-user/SageMaker/nlp_processing/')
from segmenter import JiebaSegmenter

sentence = "MLB／林子偉守一壘價值高 羅里奇解釋調度 紅襪隊台灣好手林子偉日前參加夏季訓練，而他也被分配到新守備位置，那就是一壘手，總教練Ron Roenicke（羅里奇）解釋，這是為今年有突破僵局的特別做法。"
segmenter = JiebaSegmenter()
toks = segmenter.segment(sentence)
tokenized_sentences = [' '.join(toks)]

payload = {"instances" : tokenized_sentences}

response = text_classifier.predict(payload)

predictions = json.loads(response)
print(json.dumps(predictions, indent=2))

[
  {
    "label": [
      "__label__news_entertainment"
    ],
    "prob": [
      0.17639851570129395
    ]
  }
]


By default, the model returns only one prediction, the one with the highest probability. To retrieve the top k predictions, set `k` in the configuration as shown below:

In [25]:
payload = {"instances" : tokenized_sentences,
          "configuration": {"k": 2}}

response = text_classifier.predict(payload)

predictions = json.loads(response)
print(json.dumps(predictions, indent=2))

[
  {
    "label": [
      "__label__news_entertainment",
      "__label__news_sports"
    ],
    "prob": [
      0.17639851570129395,
      0.17404887080192566
    ]
  }
]
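
The response above holds parallel `label` and `prob` lists for each input sentence. A small sketch of turning it into (label, probability) pairs with the `__label__` prefix stripped is shown below. The function name `parse_predictions` is hypothetical, and the sample string simply repeats the output printed above.

```python
import json

LABEL_PREFIX = "__label__"


def parse_predictions(response_body):
    """Turn the endpoint's JSON response into (label, probability) pairs per sentence."""
    parsed = []
    for item in json.loads(response_body):
        pairs = []
        for label, prob in zip(item["label"], item["prob"]):
            if label.startswith(LABEL_PREFIX):
                label = label[len(LABEL_PREFIX):]
            pairs.append((label, prob))
        parsed.append(pairs)
    return parsed


sample = (
    '[{"label": ["__label__news_entertainment", "__label__news_sports"], '
    '"prob": [0.17639851570129395, 0.17404887080192566]}]'
)
for label, prob in parse_predictions(sample)[0]:
    print(f"{label}: {prob:.3f}")
```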


### Stop / Close the Endpoint (Optional)
Finally, if we don't need to keep the endpoint running to serve real-time predictions, we should delete it before closing the notebook.

In [None]:
sess.delete_endpoint(text_classifier.endpoint)