# SageMaker Latent Dirichlet Allocation - An End-to-End Example

1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Training](#Training)
1. [Inference](#Inference)
1. [Epilogue](#Epilogue)

# Introduction
***

Amazon SageMaker LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. Latent Dirichlet Allocation (LDA) is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.

In this notebook we will use the Amazon SageMaker LDA algorithm to train an LDA model on some example synthetic data. We will then use this model to classify (perform inference on) the data. The main goals of this notebook are to,

* learn how to obtain and store data for use in Amazon SageMaker,
* create an Amazon SageMaker training job on a data set to produce an LDA model,
* use the LDA model to perform inference with an Amazon SageMaker endpoint.

The following are ***not*** goals of this notebook:

* understand the LDA model,
* understand how the Amazon SageMaker LDA algorithm works,
* interpret the meaning of the inference output.

If you would like to know more about these things, take a minute to run this notebook and then check out the SageMaker LDA Documentation and the **LDA - Science.ipynb** notebook.

In [None]:
!conda install -y scipy

In [None]:
%matplotlib inline

import io, json, os, re, sys, time

import boto3
import matplotlib.pyplot as plt
import numpy as np
np.set_printoptions(precision=3, suppress=True)

# some helpful utility functions are defined in the Python module
# "generate_example_data" located in the same directory as this
# notebook
from generate_example_data import generate_griffiths_data, plot_lda, match_estimated_topics

# accessing SageMaker via Python
sagemaker = boto3.client('sagemaker')
from sagemaker.amazon.common import write_numpy_to_dense_tensor
from mxnet.recordio import MXRecordIO

# Setup

***

*This notebook was created and tested on an ml.m4.xlarge notebook instance.*

Before we do anything at all, we need data! We also need to set up our AWS credentials so that Amazon SageMaker can store and access data. In this section we will do four things:

1. [Setup AWS Credentials](#SetupAWSCredentials)
1. [Obtain Example Data](#ObtainExampleData)
1. [Inspect Example Data](#InspectExampleData)
1. [Store Data on S3](#StoreDataonS3)

## Setup AWS Credentials

We first need to specify some AWS credentials; specifically data locations and access roles. This is the only cell of this notebook that you will need to edit. In particular, we need the following data:

* `bucket` - An S3 bucket accessible by this account.
  * Used to store input training data and model data output.
  * Should be within the same region as this notebook instance, training, and hosting.
* `prefix` - The location in the bucket where this notebook's input and output data will be stored. (The default value is sufficient.)
* `role` - The IAM Role ARN used to give training and hosting access to your data.
  * See documentation on how to create these.
  * The script below will try to determine an appropriate Role ARN.

In [None]:
bucket = '<your_s3_bucket_name_here>'
prefix = 'sagemaker/lda_rosetta_stone'

assumed_role = boto3.client('sts').get_caller_identity()['Arn']
role = re.sub(r'^(.+)sts::(\d+):assumed-role/(.+?)/.*$', r'\1iam::\2:role/\3', assumed_role)


print('Training input/output will be stored in {}/{}'.format(bucket, prefix))
print('\nIAM Role: {}'.format(role))
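
The regular expression in the cell above rewrites the ARN of the assumed role session into the ARN of the underlying IAM role. The sketch below applies the same substitution to made-up ARNs (the account ID, role name, and user name are hypothetical) so the transformation can be checked without AWS access. It also shows that an ARN which is not an assumed-role ARN passes through unchanged, in which case `role` would need to be set by hand.

```python
import re

PATTERN = r'^(.+)sts::(\d+):assumed-role/(.+?)/.*$'
REPLACEMENT = r'\1iam::\2:role/\3'

# Hypothetical ARN of the form returned by sts.get_caller_identity()['Arn']
# when running under an assumed role; the values are invented.
assumed_role = 'arn:aws:sts::123456789012:assumed-role/MySageMakerRole/SageMaker'

# Swap the "sts" service for "iam", "assumed-role" for "role", and drop the
# trailing session name.
role = re.sub(PATTERN, REPLACEMENT, assumed_role)
print(role)

# An IAM user ARN does not match the pattern, so re.sub returns it unchanged.
user_arn = 'arn:aws:iam::123456789012:user/alice'
unchanged = re.sub(PATTERN, REPLACEMENT, user_arn)
print(unchanged)
```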

## Obtain Example Data


We generate some example synthetic document data. For the purposes of this notebook we will omit the details of this process. All we need to know is that each piece of data, commonly called a *"document"*, is a vector of integers representing *"word counts"* within the document. In this particular example there are a total of 25 words in the *"vocabulary"*.

$$
\underbrace{w}_{\text{document}} = \overbrace{\big[ w_1, w_2, \ldots, w_V \big] }^{\text{word counts}},
\quad
V = \text{vocabulary size}
$$

These data are based on those used by Griffiths and Steyvers in their paper [Finding scientific topics](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf). For more information, see the **LDA - Science.ipynb** notebook.

In [None]:
print('Generating example data...')
num_documents = 6000
known_alpha, known_beta, documents, topic_mixtures = generate_griffiths_data(
    num_documents=num_documents, num_topics=5)
num_topics, vocabulary_size = known_beta.shape


# separate the generated data into training and test subsets
num_documents_training = int(0.9*num_documents)
num_documents_test = num_documents - num_documents_training

documents_training = documents[:num_documents_training]
documents_test = documents[num_documents_training:]

topic_mixtures_training = topic_mixtures[:num_documents_training]
topic_mixtures_test = topic_mixtures[num_documents_training:]

print('documents_training.shape = {}'.format(documents_training.shape))
print('documents_test.shape = {}'.format(documents_test.shape))

## Inspect Example Data

*What does the example data actually look like?* Below we print an example document as well as its corresponding known *topic-mixture*. A topic-mixture serves as the "label" in the LDA model. It describes the ratio of topics from which the words in the document are found.

For example, if the topic mixture of an input document $\mathbf{w}$ is,

$$\theta = \left[ 0.3, 0.2, 0, 0.5, 0 \right]$$

then $\mathbf{w}$ is 30% generated from the first topic, 20% from the second topic, and 50% from the fourth topic. For more information see **How LDA Works** in the documentation as well as the **LDA - Science.ipynb** notebook.
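
To make the percentages above concrete, here is a minimal sketch of how a document could be sampled from that topic mixture. The topic-word distributions in `beta` are hypothetical (each topic spreads its mass uniformly over its own block of five words); they are not the topics used by `generate_griffiths_data`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Topic mixture from the text: 30% topic 1, 20% topic 2, 50% topic 4.
theta = np.array([0.3, 0.2, 0.0, 0.5, 0.0])

# Hypothetical topic-word distributions: topic k puts all of its mass,
# uniformly, on words 5k .. 5k+4 of a 25-word vocabulary.
num_topics, vocabulary_size = 5, 25
beta = np.zeros((num_topics, vocabulary_size))
for k in range(num_topics):
    beta[k, 5 * k:5 * (k + 1)] = 0.2

# Draw a topic for each word of the document, then a word from that topic.
num_words = 1000
topics = rng.choice(num_topics, size=num_words, p=theta)
words = np.array([rng.choice(vocabulary_size, p=beta[k]) for k in topics])

# The document is the vector of word counts.
w = np.bincount(words, minlength=vocabulary_size)

# Fraction of the words contributed by each topic; this is close to theta
# because each topic owns a disjoint block of words in this toy setup.
topic_share = w.reshape(num_topics, 5).sum(axis=1) / num_words
print(topic_share)
```

Topics three and five have zero weight in `theta`, so none of their words appear in the sampled document.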

Below, we print the first training document and its known topic mixture. As we can see, each document is a vector of word counts from the 25-word vocabulary and its topic-mixture is a probability distribution across the five topics used to generate the sample dataset.

In [None]:
print('First training document =\n{}'.format(documents[0]))
print('\nVocabulary size = {}'.format(vocabulary_size))

In [None]:
print('Known topic mixture of first document =\n{}'.format(topic_mixtures_training[0]))
print('\nNumber of topics = {}'.format(num_topics))
print('Sum of elements = {}'.format(topic_mixtures_training[0].sum()))

Later, when we perform inference, we will compare inferred topic mixtures to known ones like this.


Human beings are visual creatures, so it might be helpful to come up with a visual representation of these documents. In the plots below, each pixel of a document represents a word. The greyscale intensity is a measure of how frequently that word occurs. We plot the first few documents of the training set reshaped into 5x5 pixel grids.

In [None]:
fig = plot_lda(documents_training, nrows=3, ncols=4, cmap='gray_r', with_colorbar=True)
fig.suptitle('Example Document Word Counts')
fig.set_dpi(160)

## Store Data on S3

A SageMaker training job needs access to training data stored in an S3 bucket. Although training can accept data in various formats, we convert the documents to MXNet RecordIO Protobuf format before uploading them to the S3 bucket defined at the beginning of this notebook.

In [None]:
# convert documents_training to Protobuf RecordIO format
fname = 'data.pbr'
recordio = MXRecordIO(fname, 'w')
write_numpy_to_dense_tensor(recordio, documents_training)
recordio.close()

# upload to S3 in bucket/prefix/train
s3_object = os.path.join(prefix, 'train', fname)
boto3.Session().resource('s3').Bucket(bucket).Object(s3_object).upload_file(fname)

print('Uploaded data to S3: {}/{}'.format(bucket, s3_object))
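
One detail worth noting about the upload cell: S3 object keys always use forward slashes, while `os.path.join` uses the separator of the local operating system. On a Linux notebook instance the two agree. The sketch below shows a platform-independent way to build the same key with `posixpath.join`. The bucket name is hypothetical, and this is an optional variation rather than a required change.

```python
import posixpath

prefix = 'sagemaker/lda_rosetta_stone'
fname = 'data.pbr'

# posixpath.join always joins with "/", which is what S3 keys expect;
# os.path.join would insert backslashes on Windows.
s3_object = posixpath.join(prefix, 'train', fname)

# The training channel should point at the folder that contains this object.
bucket = 'my-example-bucket'  # hypothetical bucket name
s3_train_uri = 's3://{}/{}/'.format(bucket, posixpath.dirname(s3_object))

print(s3_object)
print(s3_train_uri)
```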

# Training

***

Once the data is preprocessed and available in a recommended format, the next step is to train our model on the data. SageMaker LDA requires a number of parameters that configure the model and define the computational environment in which training will take place.

Particular to a SageMaker LDA training job are the following hyperparameters:

* **`num_topics`** - The number of topics or categories in the LDA model.
  * Usually, this is not known a priori.
  * However, in this example we know that the data is generated by five topics.

* **`feature_dim`** - The size of the *"vocabulary"*, in LDA parlance.
  * In this example, this is equal to 25.

* **`mini_batch_size`** - The number of input training documents.

* **`alpha0`** - *(optional)* a measurement of how "mixed" the topic-mixtures are.
  * When `alpha0` is small the data tends to be represented by one or few topics.
  * When `alpha0` is large the data tends to be an even combination of several or many topics.
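
The effect of `alpha0` described above can be illustrated by sampling topic mixtures from a Dirichlet distribution. This is only a sketch under an assumption: it treats `alpha0` as the sum of the parameters of a symmetric Dirichlet prior, so each topic receives `alpha0 / num_topics`. The helper `mean_largest_weight` is introduced here for illustration and is not part of SageMaker.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics = 5

def mean_largest_weight(alpha0, num_samples=2000):
    """Average weight of the dominant topic under a symmetric Dirichlet prior.

    Assumes every topic gets the parameter alpha0 / num_topics, so that the
    parameters sum to alpha0.
    """
    alpha = np.full(num_topics, alpha0 / num_topics)
    theta = rng.dirichlet(alpha, size=num_samples)  # shape (num_samples, num_topics)
    return theta.max(axis=1).mean()

# A value near 1 means "mostly one topic per document"; a value near
# 1 / num_topics means an even blend of all topics.
results = {alpha0: mean_largest_weight(alpha0) for alpha0 in (0.5, 5.0, 500.0)}
for alpha0, value in results.items():
    print('alpha0 = {:5.1f} -> mean largest topic weight = {:.2f}'.format(alpha0, value))
```

The dominant topic's average weight shrinks toward `1 / num_topics` as `alpha0` grows, which is the "even combination" regime described in the second bullet.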

In addition to these LDA model hyperparameters, we provide additional parameters defining things like the EC2 instance type on which training will run, the S3 bucket containing the data, and the AWS access role. Note that,

* Recommended instance type: `ml.c4`
* Current limitations:
  * SageMaker LDA *training* can only run on a single instance.
  * SageMaker LDA does not take advantage of GPU hardware.
  * (The Amazon AI Algorithms team is working hard to provide these capabilities in a future release!)

In [None]:
containers = {
    'us-west-2': '266724342769.dkr.ecr.us-west-2.amazonaws.com/lda:latest',
    'us-east-1': '766337827248.dkr.ecr.us-east-1.amazonaws.com/lda:latest',
    'us-east-2': '999911452149.dkr.ecr.us-east-2.amazonaws.com/lda:latest',
    'eu-west-1': '999678624901.dkr.ecr.eu-west-1.amazonaws.com/lda:latest'
}
region_name = boto3.Session().region_name
container = containers[region_name]

print('Using SageMaker LDA container: {} ({})'.format(container, region_name))

In [None]:
# create a name for this training job. to better distinguish this training job
# from others we append a timestamp.
job_name_prefix = 'lda-rosetta-stone-notebook'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp


# set up the parameters for the SageMaker create_training_job call. This includes
# things like
#   * which algorithm image to use
#   * algorithm hyperparameters
#   * S3 locations of input/output data
training_params = {
    'AlgorithmSpecification': {
        'TrainingImage': container,
        'TrainingInputMode': 'File',
    },
    'HyperParameters': {
        'num_topics': str(num_topics),
        'feature_dim': str(vocabulary_size),
        'mini_batch_size': str(num_documents_training),
        'alpha0': str(1.0),
    },
    'InputDataConfig': [
        {
            'ChannelName': 'train',
            'CompressionType': 'None',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': 's3://{}/{}/train/'.format(bucket, prefix),
                    'S3DataDistributionType': 'FullyReplicated',
                }
            },
            'RecordWrapperType': 'None',
        }
    ],
    'OutputDataConfig': {
        'S3OutputPath': 's3://{}/{}/output'.format(bucket, prefix),
    },
    'ResourceConfig': {
        'InstanceCount': 1,
        'InstanceType': 'ml.c4.2xlarge',
        'VolumeSizeInGB': 50,
    },
    'RoleArn': role,
    'StoppingCondition': {
        'MaxRuntimeInSeconds': 60 * 60,
    },
    'TrainingJobName': job_name,
}


print('Training job name: {}'.format(job_name))

Using the above configuration, we use the SageMaker client created at the top of this notebook to create a training job.

In [None]:
# create the Amazon SageMaker training job
sagemaker.create_training_job(**training_params)

# confirm that the training job has started, wait for the job to
# finish, and then report the ending status
#
# if the job failed, report why
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print('Training job ended with status: {}'.format(status))

if status == 'Failed':
    message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed')
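
The cell above blocks on a boto3 waiter until the job finishes. An explicit polling loop is a possible alternative when you want to control the polling interval or log intermediate states. The sketch below is an illustration only: `wait_for_training_job` and `fake_describe` are hypothetical helpers, and the fake stands in for `sagemaker.describe_training_job` so that the example runs without AWS access.

```python
import time

TERMINAL_STATES = ('Completed', 'Failed', 'Stopped')

def wait_for_training_job(describe, job_name, delay=30, max_attempts=120):
    """Poll describe(TrainingJobName=...) until the job reaches a terminal state."""
    for _ in range(max_attempts):
        info = describe(TrainingJobName=job_name)
        if info['TrainingJobStatus'] in TERMINAL_STATES:
            return info
        time.sleep(delay)
    raise TimeoutError('Training job {} did not finish in time'.format(job_name))

# Stand-in for sagemaker.describe_training_job: reports "InProgress" twice
# and then "Completed".
_statuses = iter(['InProgress', 'InProgress', 'Completed'])

def fake_describe(TrainingJobName):
    return {'TrainingJobName': TrainingJobName, 'TrainingJobStatus': next(_statuses)}

info = wait_for_training_job(fake_describe, 'example-job', delay=0)
print('Training job ended with status: {}'.format(info['TrainingJobStatus']))
```

With a real client the call would look like `wait_for_training_job(sagemaker.describe_training_job, job_name)`.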

If you see the message,

> `Training job ended with status: Completed`

then that means training successfully completed and the output LDA model was stored in the output path specified by `training_params['OutputDataConfig']`.

You can also view information about, and the status of, a training job using the Amazon SageMaker console. Just click on the "Jobs" tab and select the training job matching the training job name below:

In [None]:
print('Training job name: {}'.format(job_name))

# Inference

***

A trained model does nothing on its own. We now want to use the model we computed to perform inference on data. For this example, that means predicting the topic mixture representing a given document.

This section involves several steps,

1. [Create Model](#CreateModel) - Create a hosting model from the training output.
1. [Create Endpoint Configuration](#CreateEndpointConfiguration) - Create a configuration defining an endpoint.
1. [Create Endpoint](#CreateEndpoint) - Use the configuration to create an inference endpoint.
1. [Perform Inference](#PerformInference) - Perform inference on some input data using the endpoint.

## Create Model

We now use the output from our training job to create a formal LDA model. This model will serve as the core of our inference hosting endpoint.

In [None]:
model_name = job_name
model_data_url = sagemaker.describe_training_job(TrainingJobName=job_name)['ModelArtifacts']['S3ModelArtifacts']
print('Model data located at {}'.format(model_data_url))

model_response = sagemaker.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        'Image': container,
        'ModelDataUrl': model_data_url,
    }
)
print('\nModel ARN: {}'.format(model_response['ModelArn']))

## Create Endpoint Configuration

Next, we use the model to create an endpoint configuration. The endpoint configuration also contains information about the type and number of EC2 instances to use when hosting the algorithm for inference.

Recommended instance type: `ml.c4`.

In [None]:
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-endpoint-config' + timestamp
print('Endpoint configuration name: {}'.format(endpoint_config_name))

endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': 'ml.c4.2xlarge',
        'InitialInstanceCount': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic',
    }]
)
print('\nEndpoint configuration ARN:  {}'.format(endpoint_config_response['EndpointConfigArn']))

## Create Endpoint

Finally, we use the endpoint configuration to create a SageMaker LDA hosting endpoint.

In [None]:
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-endpoint' + timestamp
print('Endpoint name: {}'.format(endpoint_name))

endpoint_response = sagemaker.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
print('\nEndpoint ARN = {}'.format(endpoint_response['EndpointArn']))


# confirm that endpoint creation has started, wait for the creation
# process to finish, and then report the ending status
response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print('\nEndpoint creation current status: {}'.format(status))

sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
endpoint_response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = endpoint_response['EndpointStatus']
print('Endpoint creation ended with status: {}'.format(status))

if status != 'InService':
    raise Exception('Endpoint creation failed')

If you see the message,

> `Endpoint creation ended with status: InService`

then congratulations! You now have a functioning SageMaker LDA inference endpoint. You can confirm the endpoint configuration and status by navigating to the "Endpoints" tab in the Amazon SageMaker console and selecting the endpoint matching the endpoint name below:

In [None]:
print('Endpoint name: {}'.format(endpoint_name))

## Perform Inference

With this realtime endpoint at our fingertips we can finally perform inference on our training and test data.

We can pass a variety of data formats to our inference endpoint. In this example we will demonstrate passing CSV-formatted data. Other available formats are JSON-formatted, JSON-sparse-formatted, and RecordIO Protobuf.

Below is a helper function which will convert Numpy arrays to CSV format.

In [None]:
def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()
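
As a quick sanity check of the helper above, the sketch below repeats its definition so the example is self-contained, applies it to a tiny made-up array, and parses the result back with `np.loadtxt`. Each document becomes one comma-separated line.

```python
import io

import numpy as np

def np2csv(arr):
    # Same helper as in the cell above.
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()

# Two made-up "documents" over a three-word vocabulary.
docs = np.array([[2, 0, 3], [0, 1, 0]])
payload = np2csv(docs)
print(payload)

# Parsing the payload back recovers the original word counts (as floats).
parsed = np.loadtxt(io.StringIO(payload), delimiter=',', ndmin=2)
print(parsed)
```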

Let's convert some test document data to CSV format.

In [None]:
payload = np2csv(documents_test[:12])
print('CSV payload:\n\n{}'.format(payload))

Finally, we connect to the SageMaker LDA Endpoint and invoke it using this payload.

In [None]:
sagemaker_lda_runtime = boto3.Session().client('sagemaker-runtime')

print('Invoking endpoint...')
response = sagemaker_lda_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='text/csv',
    Body=payload
)

print('\nResponse:')
results = json.loads(response['Body'].read().decode())
print(results)

It may be hard to see, but the output of the SageMaker LDA inference endpoint is a Python dictionary with the following format.

```
{
  'predictions': [
    {'topic_mixture': [ ... ] },
    {'topic_mixture': [ ... ] },
    {'topic_mixture': [ ... ] },
    ...
  ]
}
```

We extract the topic mixtures, themselves, corresponding to each of the input documents.

In [None]:
computed_topic_mixtures = np.array([prediction['topic_mixture'] for prediction in results['predictions']])

print(computed_topic_mixtures)

If you decide to compare these results to the known topic mixtures generated in [Obtain Example Data](#ObtainExampleData) keep in mind that SageMaker LDA discovers topics in no particular order. That is, the approximate topic mixtures computed above may be permutations of the known topic mixtures corresponding to the same documents.

In [None]:
print(topic_mixtures_test[0])      # known test topic mixture
print(computed_topic_mixtures[0])  # computed topic mixture (topics permuted)
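
Because the topic order is arbitrary, a comparison first has to align the computed topics with the known ones. The sketch below is one simple way to do that, not the method used by the `match_estimated_topics` utility imported earlier (its implementation is not shown in this notebook). It brute-forces all column permutations, which is feasible for five topics, and runs on synthetic stand-in data rather than the endpoint output. The helper name `best_topic_permutation` is hypothetical.

```python
import itertools

import numpy as np

def best_topic_permutation(known, computed):
    """Return the column order of `computed` that best matches `known`."""
    num_topics = known.shape[1]
    best_perm, best_err = None, np.inf
    for perm in itertools.permutations(range(num_topics)):
        # Total absolute difference after reordering the computed topics.
        err = np.abs(known - computed[:, list(perm)]).sum()
        if err < best_err:
            best_perm, best_err = perm, err
    return list(best_perm)

# Synthetic stand-ins: "known" mixtures, and "computed" mixtures that are the
# same values with shuffled topic columns plus a little noise.
rng = np.random.default_rng(0)
known = rng.dirichlet(np.ones(5), size=12)
shuffle = [3, 0, 4, 1, 2]
computed = known[:, shuffle] + rng.normal(scale=0.01, size=known.shape)

perm = best_topic_permutation(known, computed)
aligned = computed[:, perm]
print('Recovered column order:', perm)
print('Largest difference after alignment:', np.abs(aligned - known).max())
```

The number of permutations grows factorially with the number of topics, so for larger models an assignment solver such as `scipy.optimize.linear_sum_assignment` would be a more practical choice.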

## Stop / Close the Endpoint

Finally, we should delete the endpoint before we close the notebook.

To do so, uncomment and execute the cell below. Alternatively, you can navigate to the "Endpoints" tab in the SageMaker console, select the endpoint with the name stored in the variable `endpoint_name`, and select "Delete" from the "Actions" dropdown menu.

In [None]:
#sagemaker.delete_endpoint(EndpointName=endpoint_name)

# Epilogue

---

In this notebook we,

* generated some example LDA documents and their corresponding topic-mixtures,
* trained a SageMaker LDA model on a training set of documents,
* created an inference endpoint,
* used the endpoint to infer the topic mixtures of a test input.

There are several things to keep in mind when applying SageMaker LDA to real-world data such as a corpus of text documents. Note that input documents to the algorithm, both in training and inference, need to be vectors of integers representing word counts. Each index corresponds to a word in the corpus vocabulary. Therefore, one will need to "tokenize" the corpus vocabulary, that is, assign an integer index to each word.

$$
\text{"cat"} \mapsto 0, \; \text{"dog"} \mapsto 1, \; \text{"bird"} \mapsto 2, \ldots
$$

Each text document then needs to be converted to a "bag-of-words" format document.

$$
w = \text{"cat bird bird bird cat"} \quad \longmapsto \quad w = [2, 0, 3, 0, \ldots, 0]
$$
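
The mapping above can be sketched in a few lines of Python. The four-word vocabulary and the helper name `to_bag_of_words` are hypothetical, and splitting on whitespace is a deliberately naive tokenizer. A real corpus would need proper tokenization and a much larger vocabulary.

```python
from collections import Counter

# Hypothetical toy vocabulary: word -> index.
vocabulary = {'cat': 0, 'dog': 1, 'bird': 2, 'fish': 3}

def to_bag_of_words(text, vocabulary):
    """Convert a text document into a vector of word counts."""
    # Naive whitespace tokenization; words outside the vocabulary are dropped.
    counts = Counter(vocabulary[token] for token in text.lower().split() if token in vocabulary)
    return [counts.get(index, 0) for index in range(len(vocabulary))]

w = to_bag_of_words('cat bird bird bird cat', vocabulary)
print(w)  # [2, 0, 3, 0]
```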

Also note that many real-world applications have large vocabulary sizes. It may be necessary to represent the input documents in sparse format. Finally, stemming and lemmatization can improve not only compute time, by reducing the effective vocabulary size, but also the topic quality of the learned model.
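
To illustrate why a sparse representation helps with large vocabularies, the sketch below stores only the non-zero entries of a word-count vector as `(index, count)` pairs. The helpers `to_sparse` and `to_dense` are hypothetical and only show the idea. The exact sparse input formats accepted by SageMaker LDA are described in its documentation and are not reproduced here.

```python
def to_sparse(counts):
    """Keep only the non-zero entries of a word-count vector as (index, count) pairs."""
    return [(index, count) for index, count in enumerate(counts) if count != 0]

def to_dense(pairs, vocabulary_size):
    """Rebuild the full word-count vector from (index, count) pairs."""
    dense = [0] * vocabulary_size
    for index, count in pairs:
        dense[index] = count
    return dense

# A document over a 10,000-word vocabulary that uses only two distinct words.
vocabulary_size = 10000
w = [2, 0, 3] + [0] * (vocabulary_size - 3)

pairs = to_sparse(w)
print(pairs)  # [(0, 2), (2, 3)]
print(len(w), 'dense entries vs', len(pairs), 'sparse entries')
```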