## Introduction

Word2Vec is a popular algorithm used for generating dense vector representations of words in large corpora using unsupervised learning. The resulting vectors have been shown to capture semantic relationships between the corresponding words and are used extensively for many downstream natural language processing (NLP) tasks like sentiment analysis, named entity recognition and machine translation.  

In this notebook, we present *SageMaker BlazingText*, which provides efficient implementations of Word2Vec on:
- a single CPU instance
- a single instance with multiple GPUs (P2 or P3 instances)
- multiple CPU instances (distributed training)


## Setup

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** function from the SageMaker Python SDK.

In [36]:
bucket = '<bucket-name>' # Put your s3 bucket name here, and create the s3 bucket if it does not exist already
prefix = 'sagemaker/DEMO-blazingtext-text8'

In [37]:
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [38]:
from time import gmtime, strftime
import time
import numpy as np
import os
import json

### Data ingestion

Next, we download a dataset from the web on which to train the word vectors. BlazingText expects a single preprocessed text file with space-separated tokens, where each line of the file contains a single sentence.
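
If you bring your own raw text, the expected layout can be produced along these lines. This is a minimal sketch, not BlazingText's own preprocessing: the helper name `to_blazingtext_lines`, the lowercasing, and the naive sentence split on `.`, `!` and `?` are all assumptions made for illustration.

```python
import re

def to_blazingtext_lines(raw_text):
    """Return one lowercased, space-separated sentence per line.

    Hypothetical helper: the naive split on '.', '!' and '?' is an
    assumption, not BlazingText's own preprocessing.
    """
    sentences = re.split(r"[.!?]+", raw_text)
    lines = []
    for sentence in sentences:
        # Keep runs of letters, digits and apostrophes as tokens.
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        if tokens:
            lines.append(" ".join(tokens))
    return "\n".join(lines)

sample = "Word2Vec learns vectors. BlazingText trains them fast!"
print(to_blazingtext_lines(sample))
```

The text8 file used below is already preprocessed, so this step is only relevant for your own corpora.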

In this example, let us train the vectors on the [text8](http://mattmahoney.net/dc/textdata.html) dataset (100 MB), which is a small, already preprocessed version of a Wikipedia dump.

#### You may skip to the [next section](#Set-the-algorithm-container-image) if you have your own training file in the "train" folder on S3 (s3://bucket/prefix/train)

In [None]:
!wget http://mattmahoney.net/dc/text8.zip -O text8.gz

In [5]:
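# Uncompress the download; gzip can extract a zip archive that holds a single deflate-compressed member, as text8.zip does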
!gzip -d text8.gz -f

Upload the file to the S3 bucket under the "train" channel.

In [39]:
def upload_to_s3(bucket, prefix, channel, file):
    s3 = boto3.resource('s3')
    key = prefix + "/" + channel + '/' + file
    # Open in binary mode and close the handle once the upload finishes
    with open(file, "rb") as data:
        s3.Bucket(bucket).put_object(Key=key, Body=data)

upload_to_s3(bucket, prefix, 'train', 'text8')

### Set the algorithm container image

In [40]:
region_name = boto3.Session().region_name

In [None]:
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/blazingtext:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/blazingtext:latest',
              'ap-northeast-1': '501404015308.dkr.ecr.ap-northeast-1.amazonaws.com/blazingtext:latest'}
container = containers[region_name]
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))
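
The lookup above raises a bare `KeyError` when the notebook runs in a region that is not in the mapping. A minimal sketch of a friendlier lookup follows; `resolve_container` is a hypothetical helper, and the two example entries are copied from the mapping above.

```python
def resolve_container(region, containers):
    """Return the image URI for `region`, or raise a descriptive error.

    Hypothetical helper, not part of boto3 or the SageMaker SDK.
    """
    try:
        return containers[region]
    except KeyError:
        supported = ", ".join(sorted(containers))
        raise ValueError(
            "No BlazingText image listed for region '{}'. "
            "Regions listed: {}".format(region, supported)
        )

# Two entries copied from the mapping above, for illustration.
example_containers = {
    'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest',
    'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest',
}
print(resolve_container('us-east-1', example_containers))
```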

## Training the BlazingText model for generating word vectors

Similar to the original implementation of [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using negative sampling, on CPUs and additionally on GPU(s). The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to [*BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*](https://dl.acm.org/citation.cfm?doid=3146347.3146354).




Besides skip-gram and CBOW, SageMaker BlazingText also supports the "Batch Skipgram" mode, which uses efficient mini-batching and matrix-matrix operations ([BLAS Level 3 routines](https://software.intel.com/en-us/mkl-developer-reference-fortran-blas-level-3-routines)). This mode enables distributed word2vec training across multiple CPU nodes, allowing almost linear scale up of word2vec computation to process hundreds of millions of words per second. Please refer to [*Parallelizing Word2Vec in Shared and Distributed Memory*](https://arxiv.org/pdf/1604.04661.pdf) to learn more.

To summarize, the following modes are supported by BlazingText on different types of instances:

|          Modes         	| cbow 	| skipgram 	| batch_skipgram 	|
|:----------------------:	|:----:	|:--------:	|:--------------:	|
|   Single CPU instance  	|   ✔  	|     ✔    	|        ✔       	|
|   Single GPU instance  	|   ✔  	|     ✔    	|                	|
| Multiple CPU instances 	|      	|          	|        ✔       	|
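
The table can also be expressed as a small lookup to check a configuration before launching a job. This is only a sketch: `SUPPORTED_MODES` and `is_supported` are hypothetical names, and the matrix is transcribed from the table above rather than taken from any SageMaker API.

```python
# Support matrix transcribed from the table above.
SUPPORTED_MODES = {
    "single_cpu": {"cbow", "skipgram", "batch_skipgram"},
    "single_gpu": {"cbow", "skipgram"},
    "multi_cpu": {"batch_skipgram"},
}

def is_supported(mode, instance_count, uses_gpu):
    """Hypothetical check of a mode against the table's three setups."""
    if instance_count > 1:
        # The table lists no multi-instance GPU setup.
        setup = None if uses_gpu else "multi_cpu"
    else:
        setup = "single_gpu" if uses_gpu else "single_cpu"
    return setup is not None and mode in SUPPORTED_MODES[setup]

print(is_supported("batch_skipgram", 2, False))
print(is_supported("cbow", 2, False))
```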

Now, let's define the resource configuration and hyperparameters to train word vectors on the *text8* dataset, using "batch_skipgram" mode on two `ml.c4.2xlarge` instances.


In [42]:
resource_config = {
        "InstanceCount": 2,
        "InstanceType": "ml.c4.2xlarge",
        "VolumeSizeInGB": 2
    }

Please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters.

In [43]:
hyperparameters = {
        "mode": "batch_skipgram",
        "epochs": "5",
        "min_count": "5",
        "sampling_threshold": "0.0001",
        "learning_rate": "0.025",
        "window_size": "5",
        "vector_dim": "100",
        "negative_samples": "5",
        "batch_size": "11", #  = (2*window_size + 1) (Preferred)
        "evaluation": "true" # Perform similarity evaluation on WS-353 dataset at the end of training
    }
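
The preferred `batch_size` noted in the comment above follows directly from `window_size`. As a small sketch (the helper `with_preferred_batch_size` is hypothetical), the value can be derived instead of typed by hand, so the two stay in sync when the window changes:

```python
def with_preferred_batch_size(hyperparameters):
    """Return a copy with batch_size set to 2 * window_size + 1.

    Hypothetical helper; values stay strings, as in the cell above.
    """
    updated = dict(hyperparameters)
    window_size = int(updated["window_size"])
    updated["batch_size"] = str(2 * window_size + 1)
    return updated

example = with_preferred_batch_size({"mode": "batch_skipgram", "window_size": "5"})
print(example["batch_size"])  # 11
```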

**Before starting the training, let us validate resource_config and hyperparameters using the validation script provided with this notebook, to check for any errors.**

In [None]:
from validator import validate_params

validate_params(resource_config, hyperparameters)

In [None]:
job_name = "DEMO-BT-text8-{}-{}-{}-".format(resource_config["InstanceCount"],
                                            resource_config["InstanceType"].replace(".","-"),
                                            hyperparameters["mode"].replace("_","-"))\
                                    + strftime("%Y-%m-%d-%H-%M", gmtime())
print("Training job", job_name)

create_training_params = \
{
    "TrainingJobName": job_name,
    "ResourceConfig": resource_config,
    "HyperParameters": hyperparameters,
    "AlgorithmSpecification": {
        "TrainingImage": container,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/".format(bucket, prefix)
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600 * 24  # 24 hours, expressed in seconds
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated"  # Always keep FullyReplicated for BlazingText
                }
            },
        },
    ]
}

sagemaker_client = boto3.Session().client(service_name='sagemaker')
sagemaker_client.create_training_job(**create_training_params)
status = sagemaker_client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)
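
The generated job name is long, so it may be worth checking it before calling `create_training_job`. The sketch below assumes the constraint documented for `TrainingJobName` (at most 63 characters, alphanumeric with hyphens only between characters); the pattern and the helper `is_valid_job_name` are written here for illustration and should be verified against the current API reference.

```python
import re

# Assumed constraint: starts and ends with an alphanumeric character,
# hyphens allowed in between, at most 63 characters overall.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")

def is_valid_job_name(name):
    """Hypothetical pre-flight check for a training job name."""
    return len(name) <= 63 and JOB_NAME_PATTERN.match(name) is not None

# A name shaped like the one built above, with a made-up timestamp.
sample_name = "DEMO-BT-text8-2-ml-c4-2xlarge-batch-skipgram-2018-01-01-00-00"
print(len(sample_name), is_valid_job_name(sample_name))
```

This is also why the cell above replaces `.` and `_` with `-` when building the name.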

Run the cell below to wait for the job to complete...

In [None]:
sagemaker_client.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)

status = sagemaker_client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)

# if the job failed, determine why
if status == 'Failed':
    message = sagemaker_client.describe_training_job(TrainingJobName=job_name)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed')

### Hosting/Inference

Unlike other SageMaker algorithms, model hosting is not supported for BlazingText because the model artifacts (the weights) consist of the word-to-vector mapping we are interested in. This mapping is likely to be used as input data for a subsequent ML algorithm, or it could be served from a technology that supports fast key-value lookups, such as DynamoDB.
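
As an in-memory illustration of the key-value idea, the mapping can be loaded into a plain dictionary. This sketch assumes the text layout that the visualization code later in this notebook parses (a header line with the word count and dimension, then one word per line followed by its values); `load_vectors` is a hypothetical helper and the sample data is made up.

```python
import io

def load_vectors(handle):
    """Parse word vectors into a dict mapping word -> list of floats."""
    header = handle.readline().split()
    dim = int(header[1])
    table = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table

# Tiny made-up sample standing in for vectors.txt.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\nof 0.4 0.5 0.6\n")
vectors = load_vectors(sample)
print(vectors["of"])
```

With the real file, the same function could be called as `load_vectors(open("vectors.txt"))` once the model artifacts have been downloaded and extracted.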

### Evaluation and Visualization

Let us now download the word vectors learned by our model and visualize them using a [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) plot.

In [47]:
info = sagemaker_client.describe_training_job(TrainingJobName=job_name)
if "ModelArtifacts" not in info:
    raise Exception('Could not find model artifacts. Please wait for the job to finish!')

In [48]:
key = "{}/{}/output/model.tar.gz".format(prefix, info["TrainingJobName"])
s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(key, 'model.tar.gz')

Uncompress `model.tar.gz` to get `vectors.txt`

In [None]:
!tar -xvzf model.tar.gz

If you set "evaluation" to "true" in the hyperparameters, then "eval.json" will be included in the model artifacts.

The quality of the trained model is evaluated on a word similarity task. We use [WS-353](http://alfonseca.org/eng/research/wordsim353.html), which is one of the most popular test datasets used for this purpose. It contains word pairs together with human-assigned similarity judgments.

The word representations are evaluated by ranking the pairs according to their cosine similarities, and measuring Spearman's rank correlation coefficient with the human judgments.
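
The procedure can be sketched in plain Python as follows. This is not the evaluation code that BlazingText runs; the vectors and human scores are made up, and Spearman's coefficient is computed here as the Pearson correlation of average ranks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Made-up vectors and human scores, for illustration only.
vecs = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
pairs = [("cat", "dog", 9.0), ("cat", "car", 2.0), ("dog", "car", 3.0)]
model_scores = [cosine(vecs[a], vecs[b]) for a, b, _ in pairs]
human_scores = [score for _, _, score in pairs]
print(spearman(model_scores, human_scores))
```

A coefficient near 1 means the model orders the pairs the same way the human annotators do.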

Let's look at the evaluation scores which are there in `eval.json`.  For embeddings trained on the text8 dataset, scores above `0.65` are pretty good.

In [None]:
!cat eval.json

Now, let us create a 2D visualization of the word vectors.

In [34]:
import numpy as np
from sklearn.preprocessing import normalize

# Read the 400 most frequent word vectors. The vectors in the file are in descending order of frequency.
num_points = 400

first_line = True
index_to_word = []
with open("vectors.txt","r") as f:
    for line_num, line in enumerate(f):
        if first_line:
            dim = int(line.strip().split()[1])
            word_vecs = np.zeros((num_points, dim), dtype=float)
            first_line = False
            continue
        line = line.strip()
        word = line.split()[0]
        vec = word_vecs[line_num-1]
        for index, vec_val in enumerate(line.split()[1:]):
            vec[index] = float(vec_val)
        index_to_word.append(word)
        if line_num >= num_points:
            break
word_vecs = normalize(word_vecs, copy=False, return_norm=False)

In [32]:
from sklearn.manifold import TSNE

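# Note: recent scikit-learn releases rename `n_iter` to `max_iter`; adjust if your version rejects `n_iter`
tsne = TSNE(perplexity=40, n_components=2, init='pca', n_iter=10000)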
two_d_embeddings = tsne.fit_transform(word_vecs[:num_points])
labels = index_to_word[:num_points]

In [None]:
from matplotlib import pylab
%matplotlib inline

def plot(embeddings, labels):
    pylab.figure(figsize=(20,20))
    for i, label in enumerate(labels):
        x, y = embeddings[i,:]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom')
    pylab.show()

plot(two_d_embeddings, labels)

Running the code above might generate a plot like the one below. t-SNE and Word2Vec are stochastic, so your plot won't look exactly like this one, but you should still see clusters of similar words. In the plot below, for example, 'british', 'american', 'french' and 'english' are near the bottom-left, and 'military', 'army' and 'forces' are all together near the bottom.

![tsne plot of embeddings](./tsne.png)