# Word Embedding for Georgian Language

Word2Vec is a popular algorithm for generating dense vector representations of words (called embeddings) from large corpora using unsupervised learning. The resulting vectors have been shown to capture semantic relationships between the corresponding words and are used extensively in downstream natural language processing (NLP) tasks such as sentiment analysis, named entity recognition and machine translation.

SageMaker BlazingText provides efficient implementations of Word2Vec on:

- single CPU instance
- single instance with multiple GPUs - P2 or P3 instances
- multiple CPU instances (Distributed training)

In this notebook, we demonstrate how BlazingText can be used for distributed training of word2vec using multiple CPU instances.

## Step 1: Getting a large corpus of text in Georgian

### Data Ingestion

We will use the latest Georgian Wikipedia dump, which at the time of writing can be found here: https://dumps.wikimedia.org/kawiki/latest/. For the algorithm to work we will need large amounts of plain text. For that purpose, we can just download the main `...pages-articles...` file. Files and formats in the dump are explained here: https://meta.wikimedia.org/wiki/Data_dumps/Dump_format

In [None]:
!wget https://dumps.wikimedia.org/kawiki/latest/kawiki-latest-pages-articles-multistream.xml.bz2

This archive contains rich markup and lots of info besides plain text. To just get the text we will use the [`WikiExtractor.py`](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) script created by Giuseppe Attardi and co-contributors.

In [None]:
!wget https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py
!chmod u+x WikiExtractor.py

Next, we will run WikiExtractor on the dump as shown in an example here: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

In [None]:
!./WikiExtractor.py -o extracted-ka kawiki-latest-pages-articles-multistream.xml.bz2 2>&1 | tee we.out | awk 'NR<=100'

The plain text files will be dumped into a directory structure under the specified directory `extracted-ka`. All that remains is to concatenate all files together while stripping out remaining XML `<doc/>` tags and empty lines. **BlazingText expects a single preprocessed text file with space separated tokens.**

In [None]:
!find extracted-ka -type f -exec cat {} \; | grep -v '^<doc' | grep -v '^</doc' | grep -v '^$' > ka-corpus.txt
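
If you prefer to stay in Python, the shell pipeline above can be approximated with a short function. This is a minimal sketch, not part of the original notebook: the name `build_corpus` is hypothetical, and unlike `grep -v '^$'` it also drops lines that contain only whitespace.

```python
import os


def build_corpus(extracted_dir, corpus_path):
    """Concatenate all extracted files into one text file.

    Lines that start with <doc or </doc and blank lines are skipped.
    Returns the number of lines written.
    """
    kept = 0
    with open(corpus_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(extracted_dir):
            dirs.sort()  # visit subdirectories in a deterministic order
            for name in sorted(files):
                with open(os.path.join(root, name), encoding="utf-8") as src:
                    for line in src:
                        stripped = line.strip()
                        if not stripped:
                            continue  # blank (or whitespace-only) line
                        if stripped.startswith(("<doc", "</doc")):
                            continue  # leftover WikiExtractor wrapper tag
                        out.write(stripped + "\n")
                        kept += 1
    return kept
```

For example, `build_corpus('extracted-ka', 'ka-corpus.txt')` would produce the same kind of file as the `find`/`grep` cell. Write the corpus outside the extracted directory so the function does not read its own output.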

In [None]:
!head ka-corpus.txt

## Step 2: Training the BlazingText model with SageMaker

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region. 
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.

In [None]:
import sagemaker
from sagemaker import get_execution_role
import boto3
import json

sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

bucket = sess.default_bucket() # Replace with your own bucket name if needed
print(bucket)
prefix = 'sagemaker/DEMO-blazingtext-georgian' #Replace with the prefix under which you want to store the data if needed

We need to upload the data to S3 so that SageMaker can consume it when executing training jobs. We'll use the SageMaker Python SDK to upload the text file to the bucket and prefix that we set earlier.

In [None]:
train_channel = prefix + '/train'

sess.upload_data(path='ka-corpus.txt', bucket=bucket, key_prefix=train_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)

Next, we need to set up an output location in S3 where the model artifacts will be stored. These artifacts are the output of the algorithm's training job.

In [None]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

## Training the BlazingText model for generating word vectors

Similar to the original implementation of [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using Negative Sampling, on CPUs and additionally on GPU[s]. The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to [*BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*](https://dl.acm.org/citation.cfm?doid=3146347.3146354). BlazingText also supports learning of subword embeddings with CBOW and skip-gram modes. This enables BlazingText to generate vectors for out-of-vocabulary (OOV) words, as demonstrated in this [notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/blazingtext_word2vec_subwords_text8/blazingtext_word2vec_subwords_text8.ipynb).
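
To make the difference between the two architectures concrete, here is a minimal sketch of how training examples can be derived from a token sequence. It illustrates the general idea only and is not BlazingText's implementation: the function names are hypothetical, and the sketch uses a fixed window, whereas the original word2vec code also shrinks the window randomly.

```python
def skipgram_pairs(tokens, window_size):
    """Skip-gram: each center word predicts every word inside its window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window_size):
    """CBOW: the words inside the window jointly predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


sentence = ["ძველი", "წიგნი", "ახალი"]
print(skipgram_pairs(sentence, 1))
print(cbow_examples(sentence, 1))
```

With negative sampling, each such positive example is contrasted with a handful of randomly drawn "negative" words instead of computing a softmax over the whole vocabulary.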




Besides skip-gram and CBOW, SageMaker BlazingText also supports the "Batch Skipgram" mode, which uses efficient mini-batching and matrix-matrix operations ([BLAS Level 3 routines](https://software.intel.com/en-us/mkl-developer-reference-fortran-blas-level-3-routines)). This mode enables distributed word2vec training across multiple CPU nodes, allowing almost linear scale up of word2vec computation to process hundreds of millions of words per second. Please refer to [*Parallelizing Word2Vec in Shared and Distributed Memory*](https://arxiv.org/pdf/1604.04661.pdf) to learn more.

BlazingText also supports a *supervised* mode for text classification. It extends the FastText text classifier to leverage GPU acceleration using custom CUDA kernels. The model can be trained on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. For more information, please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html) or [the text classification notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/blazingtext_text_classification_dbpedia/blazingtext_text_classification_dbpedia.ipynb).

To summarize, the following modes are supported by BlazingText on different types of instances:

|          Modes         	| cbow (supports subwords training) 	| skipgram (supports subwords training) 	| batch_skipgram 	| supervised |
|:----------------------:	|:----:	|:--------:	|:--------------:	| :--------------:	|
|   Single CPU instance  	|   ✔  	|     ✔    	|        ✔       	|  ✔  |
|   Single GPU instance  	|   ✔  	|     ✔    	|                	|  ✔ (Instance with 1 GPU only)  |
| Multiple CPU instances 	|     	|         	|        ✔       	|      |
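
The table can also be written down as a small lookup, which is handy for checking a configuration before launching a job. This is a sketch that only restates the table above: the names `SUPPORTED_MODES` and `is_supported` are hypothetical and not part of the SageMaker SDK.

```python
# Mode support per instance setup, transcribed from the table above.
SUPPORTED_MODES = {
    "single_cpu": {"cbow", "skipgram", "batch_skipgram", "supervised"},
    # Per the table, supervised mode on GPU needs an instance with exactly 1 GPU.
    "single_gpu": {"cbow", "skipgram", "supervised"},
    "multiple_cpu": {"batch_skipgram"},
}


def is_supported(mode, instance_count, uses_gpu):
    """Return True if the table lists `mode` for the given instance setup."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    if instance_count > 1:
        # The table lists no multi-instance GPU configuration.
        if uses_gpu:
            return False
        setup = "multiple_cpu"
    else:
        setup = "single_gpu" if uses_gpu else "single_cpu"
    return mode in SUPPORTED_MODES[setup]


print(is_supported("batch_skipgram", 4, uses_gpu=False))
print(is_supported("skipgram", 4, uses_gpu=False))
```

The configuration used in this notebook, `batch_skipgram` on several CPU instances, is the only mode the table lists for distributed training.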



## Training Setup
Now that we are done with all the setup that is needed, we are ready to train our algorithm. To begin, let's figure out the BlazingText container we will use.

In [None]:
region_name = boto3.Session().region_name

In [None]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

Next, let us create a ``sagemaker.estimator.Estimator`` object. This estimator will launch the training job.

In [None]:
bt_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=4, 
                                         train_instance_type='ml.c4.2xlarge',
                                         train_volume_size = 5,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

Now, let's define the hyperparameters to train word vectors on the Georgian Wikipedia dataset, using "batch_skipgram" mode on four ml.c4.2xlarge instances (matching `train_instance_count=4` above).

Please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters.

In [None]:
bt_model.set_hyperparameters(mode="batch_skipgram",
                             epochs=20,
                             min_count=5,
                             sampling_threshold=0.0001,
                             learning_rate=0.05,
                             window_size=5,
                             vector_dim=100,
                             negative_samples=5,
                             batch_size=11, #  = (2*window_size + 1) (Preferred. Used only if mode is batch_skipgram)
                             evaluation=False,# Perform similarity evaluation on WS-353 dataset at the end of training
                             subwords=False) # Subword embedding learning is not supported by batch_skipgram
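
Two of these hyperparameters are easier to reason about with numbers in hand. `min_count` drops words that occur fewer than that many times, and `sampling_threshold` controls how aggressively very frequent words are subsampled. The sketch below is illustrative only: the helper names are hypothetical, and the discard formula is the one from the original word2vec paper, which we assume here and which BlazingText may not follow exactly.

```python
import math
from collections import Counter


def vocabulary(tokens, min_count):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n >= min_count}


def discard_probability(word_count, total_count, threshold):
    """Assumed word2vec-paper heuristic: P(discard) = 1 - sqrt(threshold / frequency)."""
    frequency = word_count / total_count
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))


tokens = ["და"] * 6 + ["წიგნი"] * 5 + ["ჩაი"] * 2
print(vocabulary(tokens, min_count=5))

# A word that makes up 1% of the corpus is dropped most of the time,
# while a word that makes up 0.0001% of the corpus is never dropped.
print(discard_probability(1, 100, threshold=0.0001))
print(discard_probability(1, 1000000, threshold=0.0001))
```

Running `vocabulary` on a whitespace-split `ka-corpus.txt` with a few candidate values of `min_count` gives a quick feel for how many distinct words will receive a vector.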

Before kicking off training, we need to point to the location of the training data. No validation data is needed for unsupervised learning.

In [None]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data}

Now we are ready to start training.

In [None]:
bt_model.fit(inputs=data_channels, logs=True)

## Step 3: Deploy and verify word embeddings

Once the model is trained, we can create a real-time endpoint and use it to compute word embeddings.

In [None]:
bt_endpoint = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')

Now that the "model" is deployed, we can look up embeddings for these words:

In [None]:
tsigni = 'წიგნი' # book
chai   = 'ჩაი'   # tea

ahali  = 'ახალი' # new
dzveli = 'ძველი' # old

erti = 'ერთი' # one
ori = 'ორი' # two
orive = 'ორივე' # both

words = [ tsigni, chai, ahali, dzveli, erti, ori, orive ]

payload = {"instances" : words}

response = bt_endpoint.predict(json.dumps(payload))

vecs = json.loads(response)
print(vecs)

In [None]:
import numpy as np

v_tsigni = np.array(vecs[0]['vector'])
v_chai   = np.array(vecs[1]['vector'])
print(np.linalg.norm(v_tsigni - v_chai))

v_ahali  = np.array(vecs[2]['vector'])
v_dzveli = np.array(vecs[3]['vector'])
print(np.linalg.norm(v_ahali - v_dzveli))

v_erti = np.array(vecs[4]['vector'])
v_ori = np.array(vecs[5]['vector'])
v_orive = np.array(vecs[6]['vector'])
print(np.linalg.norm(v_erti - v_ori))
print(np.linalg.norm(v_ori - v_orive))
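
Euclidean distance depends on the length of the vectors as well as their direction, so word similarity is often also reported as cosine similarity. The helper below is a minimal sketch, not part of the original notebook; it accepts any sequence of floats, so it should work both on the raw `vecs[i]['vector']` lists and on the NumPy arrays built above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Toy vectors standing in for endpoint output, for illustration only.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

For example, `cosine_similarity(v_erti, v_ori)` can be compared against `cosine_similarity(v_erti, v_chai)` to check that the two numerals are closer to each other than to an unrelated noun.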

In [None]:
s3 = boto3.resource('s3')

key = bt_model.model_data[bt_model.model_data.find("/", 5)+1:]
s3.Bucket(bucket).download_file(key, 'model.tar.gz')
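
The slicing above relies on the `s3://` prefix being exactly five characters long. A slightly more explicit way to split the URI is to use the standard library's URL parser. This is a sketch: `split_s3_uri` is a hypothetical helper, and the bucket and key in the example are made up.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, shaped like the value of bt_model.model_data.
print(split_s3_uri("s3://my-bucket/sagemaker/demo/output/job-1/output/model.tar.gz"))
```

With this helper, `model_bucket, key = split_s3_uri(bt_model.model_data)` yields the same key as the slicing expression and also returns the bucket the artifact actually lives in.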

In [None]:
!tar -xvzf model.tar.gz

In [None]:
from sklearn.preprocessing import normalize

# Read the num_points most frequent word vectors. The vectors in the file are in descending order of frequency.
num_points = 400

first_line = True
index_to_word = []
words = []
with open("vectors.txt","r") as f:
    for line_num, line in enumerate(f):
        if first_line:
            dim = int(line.strip().split()[1])
            word_vecs = np.zeros((num_points, dim), dtype=float)
            first_line = False
            continue
        line = line.strip()
        word = line.split()[0]
        words.append(word)
        vec = word_vecs[line_num-1]
        for index, vec_val in enumerate(line.split()[1:]):
            vec[index] = float(vec_val)
        index_to_word.append(word)
        if line_num >= num_points:
            break
word_vecs = normalize(word_vecs, copy=False, return_norm=False)
words

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(perplexity=40, n_components=2, init='pca', n_iter=10000)
two_d_embeddings = tsne.fit_transform(word_vecs[:num_points])
labels = index_to_word[:num_points]

In [None]:
from matplotlib import pylab
%matplotlib inline

def plot(embeddings, labels):
    pylab.figure(figsize=(30,30))
    for i, label in enumerate(labels):
        x, y = embeddings[i,:]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom')
    pylab.show()

plot(two_d_embeddings, labels)

## Interesting observations

Though the t-SNE plot may look different for you, you can often observe the following:

ერთ (ert/"one") is right next to ორ (or/"two") and ორივე (orive/"both") is not too far.

ახალი (akhali/"new") is next to ძველი (dzveli/"old").

რომის (romis/"Roman") is next to იმპერიის (imperiis/"empire").
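
Because t-SNE distorts distances when projecting to two dimensions, it can be worth confirming such observations in the original embedding space. The function below is a minimal sketch of a brute-force nearest-neighbour lookup by cosine similarity. The name `nearest_neighbors` is hypothetical, and the toy vectors are for illustration only.

```python
import math


def nearest_neighbors(query, words, vectors, k=3):
    """Return the k words whose vectors are most cosine-similar to the query word's."""
    def unit(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector] if norm else list(vector)

    units = [unit(v) for v in vectors]
    q = units[words.index(query)]  # raises ValueError if the word is unknown
    scored = [
        (sum(a * b for a, b in zip(q, u)), word)
        for word, u in zip(words, units)
        if word != query
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _score, word in scored[:k]]


toy_words = ["ერთი", "ორი", "ჩაი"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(nearest_neighbors("ერთი", toy_words, toy_vectors, k=1))
```

With the data loaded earlier, `nearest_neighbors('ახალი', index_to_word, word_vecs, k=5)` would list the closest of the `num_points` most frequent words, provided the query word is among them.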

### (Optional) Clean-up

If you're ready to be done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
sess.delete_endpoint(bt_endpoint.endpoint)