# SageMaker Latent Dirichlet Allocation - Scientific Deep Dive


### Table of Contents

1. [Introduction](#Introduction)
1. [Data Exploration](#DataExploration)
1. [Training](#Training)
1. [Inference](#Inference)

# Introduction
***

Amazon SageMaker LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. Latent Dirichlet Allocation (LDA) is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.

## The LDA Model

As mentioned above, LDA is a model for discovering latent topics describing a collection of documents. In this section we will give a brief introduction to the model. Let,

* $M$ = the number of *documents* in a corpus
* $N$ = the average *length* of a document
* $V$ = the size of the *vocabulary*; the total number of unique words

We denote a *document* by a vector $w \in \mathbb{R}^V$ where $w_i$ equals the number of times the $i$th word in the vocabulary occurs within the document. This is called the "bag-of-words" format of representing a document. The *length* of a document is equal to the total number of words in the document: $N_w = \sum_{i=1}^V w_i$.
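As a small illustration of the bag-of-words format, the sketch below counts word occurrences for a toy document. The five-word vocabulary and the helper `to_bag_of_words` are hypothetical and are not part of the example dataset used later.

```python
import numpy as np

# Hypothetical five-word vocabulary; index i corresponds to the i-th word.
vocabulary = ['apple', 'banana', 'cherry', 'date', 'elderberry']
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def to_bag_of_words(tokens, word_to_index):
    """Count how often each vocabulary word occurs in a tokenized document."""
    w = np.zeros(len(word_to_index), dtype=int)
    for token in tokens:
        w[word_to_index[token]] += 1
    return w

tokens = ['apple', 'cherry', 'apple', 'date', 'apple']
w = to_bag_of_words(tokens, word_to_index)

# The document length N_w is the sum of the word counts.
document_length = w.sum()
print('w = {}, N_w = {}'.format(w, document_length))
```

Note that word order is discarded: any reordering of `tokens` produces the same vector `w`.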

An LDA model is defined by two parameters: a topic-word distribution matrix $\beta \in \mathbb{R}^{K \times V}$ and a  Dirichlet topic prior $\alpha \in \mathbb{R}^K$. In particular, let,

$$\beta = \left[ \beta_1, \ldots, \beta_K \right]$$

be a collection of $K$ *topics* where each topic $\beta_k \in \mathbb{R}^V$ is represented as probability distribution over the vocabulary. One of the utilities of the LDA model is that a given word is allowed to appear in multiple topics with positive probability. The Dirichlet topic prior is a vector $\alpha \in \mathbb{R}^K$ such that $\alpha_k > 0$ for all $k$.
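To make these constraints concrete, here is a minimal sketch that checks them on a toy model with $K = 2$ topics and $V = 4$ words. The helper `is_valid_lda_model` and the toy values are assumptions made for illustration only.

```python
import numpy as np

def is_valid_lda_model(alpha, beta, tol=1e-8):
    """Check that beta has one probability distribution per row and alpha > 0."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    # beta must be K x V and alpha must have K entries
    if beta.ndim != 2 or alpha.shape != (beta.shape[0],):
        return False
    rows_are_distributions = (
        np.all(beta >= 0) and np.allclose(beta.sum(axis=1), 1.0, atol=tol))
    prior_is_positive = np.all(alpha > 0)
    return bool(rows_are_distributions and prior_is_positive)

# Toy model: the second word has positive probability in both topics.
beta = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.25, 0.25, 0.5]])
alpha = np.array([0.5, 0.5])
print(is_valid_lda_model(alpha, beta))
```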

# Data Exploration

---

## An Example Dataset

Before explaining further let's get our hands dirty with an example dataset. The following synthetic data is taken from [1] and has a very useful visual interpretation.

> [1] Thomas Griffiths and Mark Steyvers. *Finding Scientific Topics.* Proceedings of the National Academy of Sciences, 101(suppl 1):5228-5235, 2004.

In [None]:
import numpy as np
np.set_printoptions(precision=3, suppress=True)

In [None]:
!conda install -y scipy

In [None]:
%%time
from generate_example_data import generate_griffiths_data

num_documents = 10000
known_alpha, known_beta, documents, topic_mixtures = generate_griffiths_data(
    num_documents=num_documents, num_topics=10)
num_topics, vocabulary_size = known_beta.shape


# reserve a holdout set of documents and topic mixtures for testing purposes
num_training = int(0.8*num_documents)
documents_test = documents[num_training:]
documents = documents[:num_training]

topic_mixtures_test = topic_mixtures[num_training:]
topic_mixtures = topic_mixtures[:num_training]

num_documents_test = len(documents_test)
num_documents = len(documents)

Let's inspect these data. Starting with the documents, note that the vocabulary size is equal to 25.

In [None]:
print('first document = {}'.format(documents[0]))
print('\nlength of first document = {}'.format(documents[0].sum()))

Next, we investigate the topic-word probability matrix, $\beta$. Let's look at the first topic and verify that it is a probability distribution on the vocabulary.

In [None]:
print('first topic = {}'.format(known_beta[0]))

print('\nbeta shape: (num_topics, vocabulary_size) = {}'.format(known_beta.shape))
print('\nsum of elements of first topic = {}'.format(known_beta[0].sum()))

Human beings are visual creatures. Lucky for us, this example LDA dataset has a natural visualization. We reshape each topic-word distribution to a 5x5 pixel image. Each pixel represents a word from the 25-word-long vocabulary and the color represents the frequency of occurrence.

In [None]:
from generate_example_data import plot_lda

fig = plot_lda(known_beta, nrows=1, ncols=10)
fig.suptitle(r'Known $\beta$ - Topic-Word Probability Distributions')
fig.set_dpi(160)
fig.set_figheight(1.5)

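Finally, let's inspect some documents in this visual format. In these images each pixel is shaded by the number of times the corresponding word occurs in the document; with the `gray_r` colormap used below, darker pixels mean higher counts.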

In [None]:
fig = plot_lda(documents[:12], nrows=3, ncols=4, cmap='gray_r')
fig.suptitle(r'$w$ - Sample Document Word Counts')
fig.set_dpi(160)
fig.set_figheight(4)

## Generating Documents

LDA is a generative model, meaning that the LDA parameters $(\alpha, \beta)$ can be used to construct documents word-by-word by drawing from the topics. In fact, looking closely at the example documents above you can see that some documents sample more words from some topics than from others.

The generative process works as follows: given an LDA model $(\alpha, \beta)$ and an average document length of $N$, the $M$ documents $w^{(1)}, w^{(2)}, \ldots, w^{(M)}$ are constructed by,

**For** each document $m$:
* sample a topic mixture: $\theta^{(m)} \sim \text{Dirichlet}(\alpha)$
* **For** each word $n$ in the document:
  * Sample a topic $z_n^{(m)} \sim \text{Multinomial}\big( \theta^{(m)} \big)$
  * Sample a word from this topic, $w_n^{(m)} \sim \text{Multinomial}\big( \beta_{z_n^{(m)}} \; \big)$
  * Add to document
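The sampling steps above can be sketched with NumPy as follows. This is a minimal illustration, not the code behind `generate_griffiths_data`: the function `sample_documents` is hypothetical, and it assumes every document has the same fixed length rather than a random length with mean $N$.

```python
import numpy as np

def sample_documents(alpha, beta, num_documents, document_length, seed=0):
    """Draw bag-of-words documents from an LDA model (alpha, beta)."""
    rng = np.random.default_rng(seed)
    num_topics, vocabulary_size = beta.shape
    documents = np.zeros((num_documents, vocabulary_size), dtype=int)

    # one topic mixture theta per document
    topic_mixtures = rng.dirichlet(alpha, size=num_documents)

    for m in range(num_documents):
        # sample a topic z_n for every word position in document m
        topics = rng.choice(num_topics, size=document_length, p=topic_mixtures[m])
        for z in topics:
            # sample a word from the chosen topic and add it to the document
            word = rng.choice(vocabulary_size, p=beta[z])
            documents[m, word] += 1

    return documents, topic_mixtures

# Toy model: three topics, each concentrated on two of six words.
toy_beta = np.array([[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]])
toy_alpha = np.ones(3)
toy_documents, toy_mixtures = sample_documents(toy_alpha, toy_beta, 4, 20)
print(toy_documents)
```

Each row of `toy_documents` sums to the chosen document length, and each row of `toy_mixtures` sums to one.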

The [plate notation](https://en.wikipedia.org/wiki/Plate_notation) for the LDA model, introduced in [2], encapsulates this process pictorially.

![](http://scikit-learn.org/stable/_images/lda_model_graph.png)

> [2] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

## Topic Mixtures

For the documents we generated above let's look at their corresponding topic mixtures, $\theta \in \mathbb{R}^K$. The topic mixtures represent the probability that a given word of the document is sampled from a particular topic. The objective of inference, also known as scoring, is to determine the most likely topic mixture of a given input document.
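One consequence of this interpretation is that, for a fixed topic mixture, the probability that a given word position holds the $i$th vocabulary word is $\sum_{k=1}^K \theta_k \beta_{k,i}$, obtained by summing over the possible topics. The sketch below computes this for a toy model; the values of `theta` and `beta` are made up for illustration.

```python
import numpy as np

# Toy model with K = 2 topics and V = 4 words (illustrative values only).
beta = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 0.5, 0.5]])
theta = np.array([0.7, 0.3])

# Probability of each vocabulary word at a single word position:
# p(word = i | theta) = sum_k theta_k * beta[k, i]
word_probabilities = theta @ beta

# Expected word counts for a document of 100 words.
expected_counts = 100 * word_probabilities
print(word_probabilities)
print(expected_counts)
```

A document whose mixture favors the first topic is therefore expected to contain more of that topic's words, which is the pattern inference tries to invert.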

Since we generated these example documents using the LDA model we know the topic mixture generating them. Let's examine these topic mixtures.

In [None]:
print('first document =\n{}'.format(documents[0]))
print('\nlength of first document = {}'.format(documents[0].sum()))

In [None]:
print('first document topic mixture =\n{}'.format(topic_mixtures[0]))
print('\nsum(theta) = {}'.format(topic_mixtures[0].sum()))

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

plt.matshow(documents[0].reshape(5,5), cmap='gray_r')
plt.title(r'$w$ - Sample Document', fontsize=20)
plt.xticks([])
plt.yticks([])


plt.matshow(topic_mixtures[0].reshape(1,-1), cmap='Reds', vmin=0, vmax=1)
plt.colorbar(orientation='horizontal')
plt.title(r'$\theta$ - Sample Topic Mixture', fontsize=20)
plt.xticks([])
plt.yticks([])

fig = plot_lda(known_beta, nrows=1, ncols=10)
fig.suptitle(r'Known $\beta$ - Topic-Word Probability Distributions')
fig.set_dpi(160)
fig.set_figheight(1.5)

The above shows that the ***first*** and ***third*** topics are most represented in this document. These correspond to the first and third "column topics". Looking at the document itself, this seems to be the case, as the word count is noticeably larger in these two columns.

We plot the first few sample documents $w \in \mathbb{R}^V$ along with their corresponding topic mixtures $\theta \in \mathbb{R}^K$.

In [None]:
%matplotlib inline

import matplotlib.cm as cm
from matplotlib.gridspec import GridSpec, GridSpecFromSubplotSpec

def plot_document_with_topic(fig, gsi, index, topic_mixtures=None,
                             vmin=0, vmax=32):
    ax_doc = fig.add_subplot(gsi[:5,:])
    ax_doc.matshow(documents[index].reshape(5,5), cmap='gray_r',
                   vmin=vmin, vmax=vmax)
    ax_doc.set_xticks([])
    ax_doc.set_yticks([])

    if topic_mixtures is not None:
        ax_topic = fig.add_subplot(gsi[-1,:])
        ax_topic.matshow(topic_mixtures[index].reshape(1,-1), cmap='Reds',
                         vmin=0, vmax=1)
        ax_topic.set_xticks([])
        ax_topic.set_yticks([])

def plot_lda_topics(documents, nrows, ncols, with_colorbar=True,
                    topic_mixtures=None, cmap='viridis', dpi=160):
    fig = plt.figure()
    gs = GridSpec(nrows, ncols)
    
    vmin, vmax = (0, documents.max())
    
    for i in range(nrows):
        for j in range(ncols):
            index = i*ncols + j
            gsi = GridSpecFromSubplotSpec(6, 5, subplot_spec=gs[i,j])
            plot_document_with_topic(fig, gsi, index, topic_mixtures=topic_mixtures,
                                     vmin=vmin, vmax=vmax)
            
    return fig
        
    
# plot the documents with their topic mixtures
fig = plot_lda_topics(documents, 3, 4, topic_mixtures=topic_mixtures)
fig.suptitle(r'$(w,\theta)$ - Sample Document Word Counts and Topic Mixtures')
fig.set_dpi(160)

# plot the known probability distributions (for reference)
fig = plot_lda(known_beta, nrows=1, ncols=10)
fig.suptitle(r'Known $\beta$ - Topic-Word Probability Distributions')
fig.set_dpi(160)
fig.set_figheight(1.5)

# Training

---

In this section we will give some insight into how Amazon SageMaker LDA fits an LDA model to a corpus, create and run a SageMaker LDA training job, and examine the output trained model.



## Topic Estimation using Tensor Decompositions

Given a document corpus, Amazon SageMaker LDA uses a spectral tensor decomposition technique to determine the LDA model $(\alpha, \beta)$ which most likely describes the corpus. See [1] for a primary reference of the theory behind the algorithm. The spectral decomposition, itself, is computed using the CPDecomp algorithm described in [2].

The overall idea is the following: given a corpus of documents $\mathcal{W} = \{w^{(1)}, \ldots, w^{(M)}\}, \; w^{(m)} \in \mathbb{R}^V,$ we construct a statistic tensor,

$$T \in \bigotimes^3 \mathbb{R}^V$$

such that the spectral decomposition of the tensor is approximately the LDA parameters $\alpha \in \mathbb{R}^K$ and $\beta \in \mathbb{R}^{K \times V}$ which maximize the likelihood of observing the corpus for a given number of topics, $K$,

$$T \approx \sum_{k=1}^K \alpha_k \; (\beta_k \otimes \beta_k \otimes \beta_k)$$

This statistic tensor encapsulates information from the corpus such as the document mean, cross correlation, and higher order statistics. For details, see [1].
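To make the right-hand side of the decomposition concrete, the sketch below builds $\sum_{k=1}^K \alpha_k \, (\beta_k \otimes \beta_k \otimes \beta_k)$ for a toy model using `np.einsum`. It only illustrates the structure of the decomposed tensor; it does not estimate $T$ from a corpus or perform the decomposition itself, and the function name `lda_moment_tensor` is hypothetical.

```python
import numpy as np

def lda_moment_tensor(alpha, beta):
    """Return sum_k alpha_k * (beta_k x beta_k x beta_k) as a V x V x V array."""
    # T[i, j, l] = sum_k alpha[k] * beta[k, i] * beta[k, j] * beta[k, l]
    return np.einsum('k,ki,kj,kl->ijl', alpha, beta, beta, beta)

# Toy model with K = 2 topics and V = 3 words (illustrative values only).
alpha = np.array([0.4, 0.6])
beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])

T = lda_moment_tensor(alpha, beta)
print(T.shape)
```

The resulting tensor is symmetric under any permutation of its three indices, and because each $\beta_k$ sums to one, the entries of `T` sum to $\sum_k \alpha_k$.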


> [1] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham Kakade, and Matus Telgarsky. *"Tensor Decompositions for Learning Latent Variable Models"*, Journal of Machine Learning Research, 15:2773–2832, 2014.
>
> [2] Tamara Kolda and Brett Bader. *"Tensor Decompositions and Applications"*. SIAM Review, 51(3):455–500, 2009.




## Creating and Running a Training Job

To run this job all you need to provide is an Amazon S3 bucket to store the training input and output, along with an access role that allows SageMaker to access this bucket.

In [None]:
# replace the placeholders below with your own IAM role ARN and S3 bucket name
s3_access_role = 'arn:aws:iam: <<<PROVIDE ACCESS ROLE>>>'
bucket = '<<<PROVIDE BUCKET>>>'

### Convert Data to RecordIO Protobuf Format

In [None]:
import boto3
from mxnet.recordio import MXRecordIO
from record_pb2 import Record

def save_documents(fname, documents):
    """Saves a Numpy array of documents to RecordIO Protobuf format."""
    feature_size = documents.shape[1]
    
    # convert to protobuf
    protobuf = [
        list_to_record_bytes(
            document.astype(np.float32).tolist(),
            feature_size=feature_size)
        for document in documents
    ]

    # write to recordio
    recordio = MXRecordIO(fname, "w")
    for datum in protobuf:
        recordio.write(datum)
    recordio.close()
    

def list_to_record_bytes(values, keys=None, label=None, feature_size=None):
    """Takes a list and returns a serialized bytestring (using the vector/record representation)"""
    record = Record()
    record.features['values'].float32_tensor.values.extend(values)
 
    if keys is not None:
        if feature_size is None:
            raise ValueError("For sparse tensors the feature size must be specified.")
        record.features['values'].float32_tensor.keys.extend(keys)

    if feature_size is not None:
        record.features['values'].float32_tensor.shape.extend([feature_size])
 
    if label is not None:
        record.label['values'].float32_tensor.values.extend([label])
        
    return record.SerializeToString()

    
def libsvm_record_converter(label, keys, values, feature_size=None):
    record = Record()
    record.features['values'].float32_tensor.values.extend(values)
 
    if keys is not None:
        if feature_size is None:
            raise ValueError("For sparse tensors the feature size must be specified.")
        record.features['values'].float32_tensor.keys.extend(keys)

    if feature_size is not None:
        record.features['values'].float32_tensor.shape.extend([feature_size])
 
    if label is not None:
        record.label['values'].float32_tensor.values.extend([label])
 
    return record


fname = 'data.pbr'
save_documents(fname, documents)
key = 'lda-science-notebook/training/' + fname
boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_file(fname)
print('Uploaded document data "{}" to "{}/{}"'.format(fname, bucket, key))

### Training Parameters

* `num_topics = 10` - In this example we know a priori that the training corpus was generated by ten topics. Let's see if we can recover these topics.
* `alpha0 = 1.0` - Some documents have "decent" mixing whereas others are singularly represented by a topic. Setting `alpha0` to 1 tells Amazon SageMaker LDA to not favor sparse or dense topic mixtures.

In [None]:
import time

job_name_prefix = 'lda-science-notebook'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp

training_image = '462891221994.dkr.ecr.us-west-2.amazonaws.com/lda:1'

training_params = {
    'AlgorithmSpecification': {
        'TrainingImage': training_image,
        'TrainingInputMode': 'File',
    },
    'HyperParameters': {
        'num_topics': str(10),
        'feature_dim': str(25),
        'mini_batch_size': str(len(documents)),
        'alpha0': str(1.0),
    },
    'InputDataConfig': [
        {
            'ChannelName': 'train',
            'CompressionType': 'None',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': 's3://{}/{}/training/'.format(bucket, job_name_prefix),
                    'S3DataDistributionType': 'FullyReplicated',
                }
            },
            'RecordWrapperType': 'None',
        }
    ],
    'OutputDataConfig': {
        'S3OutputPath': 's3://{}/{}/output'.format(bucket, job_name_prefix),
    },
    'ResourceConfig': {
        'InstanceCount': 1,
        'InstanceType': 'ml.c4.2xlarge',
        'VolumeSizeInGB': 50,
    },
    'RoleArn': s3_access_role,
    'StoppingCondition': {
        'MaxRuntimeInSeconds': 60*60,
    },
    'TrainingJobName': job_name,
}


print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))

In [None]:
# create a training job
sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))


# wait for the job to finish and report the ending status
sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)
if status == 'Failed':
    message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed')

## Inspecting the Trained Model

Let's compare the trained model to the known one used to generate the document corpus.

In [None]:
import os, os.path, tarfile
import mxnet as mx

# download and extract the model file from S3
#
model_fname = 'model.tar.gz'
model_key = os.path.join('lda-science-notebook', 'output', job_name, 'output', model_fname)
# save the tarball under its own name rather than overwriting the training data file
boto3.Session().resource('s3').Bucket(bucket).Object(model_key).download_file(model_fname)
print('Downloaded model tarball {}'.format(model_key))

with tarfile.open(model_fname) as tar:
    tar.extractall()
print('Extracted model tarball')

model_list = [
    fname for fname in os.listdir('.')
    if fname.startswith('model_')
]
model_fname = model_list[0]
print('Found model file: {}'.format(model_fname))


# get the model from the model file and store in Numpy arrays
#
alpha, beta = mx.ndarray.load(model_fname)
found_alpha_permuted = alpha.asnumpy()
found_beta_permuted = beta.asnumpy()

Presumably, SageMaker LDA has found the topics most likely used to generate the training corpus. However, even if this is the case, the topics would not be returned in any particular order. Therefore, we match the found topics to the known topics closest in L1-norm in order to find the topic permutation.

Note that we will use the `permutation` later during inference to match known topic mixtures to found topic mixtures.
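The matching idea can be sketched as follows. The implementation of `match_estimated_topics` is not shown in this notebook, so this is an illustrative greedy alternative under the assumption that both models have the same number of topics; the name `match_topics_greedy` is hypothetical.

```python
import numpy as np

def match_topics_greedy(known_beta, found_beta_permuted):
    """Greedily pair each known topic with a distinct found topic by L1 distance."""
    # distances[i, j] = L1 distance between known topic i and found topic j
    distances = np.abs(
        known_beta[:, None, :] - found_beta_permuted[None, :, :]).sum(axis=2)
    num_topics = known_beta.shape[0]
    permutation = np.full(num_topics, -1, dtype=int)
    used = np.zeros(num_topics, dtype=bool)

    # visit (known, found) pairs from closest to farthest and keep the
    # first pairing available for each known topic
    for flat_index in np.argsort(distances, axis=None):
        i, j = np.unravel_index(flat_index, distances.shape)
        if permutation[i] == -1 and not used[j]:
            permutation[i] = j
            used[j] = True

    return permutation, found_beta_permuted[permutation]

# Toy check: shuffle the rows of a random topic matrix and recover the order.
rng = np.random.default_rng(0)
toy_known = rng.dirichlet(np.ones(25), size=5)
shuffle = rng.permutation(5)
toy_found_permuted = toy_known[shuffle]
toy_permutation, toy_found = match_topics_greedy(toy_known, toy_found_permuted)
print(toy_permutation)
```

A greedy pairing is adequate when the topics are well separated, as in this toy check. When several found topics are nearly equidistant from a known topic, an optimal assignment method may give a better match.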

In [None]:
from generate_example_data import match_estimated_topics

permutation, found_beta = match_estimated_topics(known_beta, found_beta_permuted)
found_alpha = found_alpha_permuted[permutation]

We plot the known topic-word probability distribution, $\beta \in \mathbb{R}^{K \times V}$, next to the distribution found by SageMaker LDA, and report the L1-norm errors between the two.

In [None]:
fig = plot_lda(np.vstack([known_beta, found_beta]), 2, 10)
fig.set_dpi(160)
fig.suptitle('Known vs. Found Topic-Word Probability Distributions')
fig.set_figheight(3)

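# note: for the 2-D beta arrays, ord=1 is the induced matrix 1-norm (maximum
# absolute column sum), not the sum of all elementwise absolute differences
beta_error = np.linalg.norm(known_beta - found_beta, 1)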
alpha_error = np.linalg.norm(known_alpha - found_alpha, 1)
print('L1-error (beta) = {}'.format(beta_error))
print('L1-error (alpha) = {}'.format(alpha_error))

# Inference

With a trained model in hand we will now perform inference on the input training documents to recover their topic mixtures. We'll compare the inferred mixtures to those known from generating the example data.

## Setup

We first create a SageMaker Model, SageMaker Endpoint Configuration, and SageMaker Endpoint.

In [None]:
import boto3, time

# get the location of the model generated by the above training job
model_name = job_name
model_data = training_info['ModelArtifacts']['S3ModelArtifacts']
model_params = {
    'ExecutionRoleArn': s3_access_role,
    'ModelName': model_name,
    'PrimaryContainer': {
        'Image': training_image,
        'ModelDataUrl': model_data,
    },
}

model_response = sagemaker.create_model(**model_params)
print('Model name: {}'.format(model_name))
print('ModelArn:   {}\n'.format(model_response['ModelArn']))

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-endpoint-config' + timestamp
endpoint_config_params = {
    'EndpointConfigName': endpoint_config_name,
    'ProductionVariants': [
        {
            'InstanceType': 'ml.c4.xlarge',
            'InitialInstanceCount': 1,
            'ModelName': model_name,
            'VariantName': 'AllTraffic'
        }
    ]
}

endpoint_config_response = sagemaker.create_endpoint_config(**endpoint_config_params)
print('Endpoint configuration name: {}'.format(endpoint_config_name))
print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))

In [None]:
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-endpoint' + timestamp
print('Endpoint name: {}'.format(endpoint_name))

endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}

endpoint_response = sagemaker.create_endpoint(**endpoint_params)
response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print('EndpointStatus = {}'.format(status))

sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))

# print the final status of the endpoint
endpoint_response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = endpoint_response['EndpointStatus']
print('Endpoint creation ended with EndpointStatus = {}'.format(status))
if status != 'InService':
    raise Exception('Endpoint creation failed.')

## Invoke Inference on Training Data

We infer the topic mixture $\theta \in \mathbb{R}^K$ from a set of input documents $w \in \mathbb{R}^V$.

In [None]:
import io, json
import boto3
import numpy as np

lda_runtime = boto3.client('sagemaker-runtime')
def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()

payload = np2csv(documents[:12])
invoke_endpoint_params = {
    'EndpointName': endpoint_name,
    'ContentType': 'text/csv',
    'Body': payload,
}
response = lda_runtime.invoke_endpoint(**invoke_endpoint_params)
results = json.loads(response['Body'].read().decode())

inferred_topic_mixtures_permuted = np.array([prediction['topic_mixture'] for prediction in results['predictions']])
print('Computed inferred topic mixtures (permuted)')

Because we knew the topics a priori we were able to match known topics to found topics. To more easily compare known topic mixtures to found topic mixtures, we apply the same permutation to the results here.

In [None]:
inferred_topic_mixtures = inferred_topic_mixtures_permuted[:,permutation]

print('Known topic mixture:\n{}'.format(topic_mixtures[0]))
print('Found topic mixture:\n{}'.format(inferred_topic_mixtures[0]))

In [None]:
width = 0.4
x = np.arange(10)

nrows, ncols = 3, 4
fig, ax = plt.subplots(nrows, ncols)
for i in range(nrows):
    for j in range(ncols):
        index = i*ncols + j
        ax[i,j].bar(x, topic_mixtures[index], width, color='C0')
        ax[i,j].bar(x+width, inferred_topic_mixtures[index], width, color='C1')
        ax[i,j].set_xticks(range(num_topics))
        ax[i,j].set_xticklabels([])
        ax[i,j].set_yticks(np.linspace(0,1,10))
        ax[i,j].set_yticklabels([])
        
fig.suptitle('Known vs. Inferred Topic Mixtures (Training Data)')
fig.set_dpi(160)

One of the benefits of the LDA model as opposed to, say, [Probabilistic latent semantic indexing (pLSI)](https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis) is that we can infer topic mixtures of documents outside the training corpus. Let's also apply inference to the first few test documents and compare with their known topic mixtures.
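Inference on the held-out documents could be sketched as below. The helper `infer_topic_mixtures` is hypothetical; its request and response handling mirror the invocation cell earlier in this section, and the commented usage line assumes the notebook variables `lda_runtime`, `endpoint_name`, `documents_test`, and `permutation` are defined.

```python
import io
import json

import numpy as np

def infer_topic_mixtures(runtime_client, endpoint_name, documents, permutation):
    """Send bag-of-words documents to an LDA endpoint; return reordered mixtures."""
    # serialize the documents as CSV, one document per line
    buffer = io.BytesIO()
    np.savetxt(buffer, np.atleast_2d(documents), delimiter=',', fmt='%g')
    payload = buffer.getvalue().decode().rstrip()

    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',
        Body=payload,
    )
    results = json.loads(response['Body'].read().decode())
    mixtures = np.array(
        [prediction['topic_mixture'] for prediction in results['predictions']])

    # reorder the topic columns so they line up with the known topics
    return mixtures[:, permutation]

# Usage inside this notebook (requires the live endpoint created above):
# inferred_test = infer_topic_mixtures(
#     lda_runtime, endpoint_name, documents_test[:12], permutation)
# print('Known test mixture:\n{}'.format(topic_mixtures_test[0]))
# print('Found test mixture:\n{}'.format(inferred_test[0]))
```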

Finally, let's measure the inference error across the training and test sets.
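One simple way to quantify this is the L1 distance between each known and inferred topic mixture. The sketch below is a minimal example on made-up mixtures; the helper `mixture_l1_errors` is hypothetical, and the choice of L1 distance is an assumption that follows the L1-norm comparison used for the topics above.

```python
import numpy as np

def mixture_l1_errors(known_mixtures, inferred_mixtures):
    """Per-document L1 distance between known and inferred topic mixtures."""
    known = np.asarray(known_mixtures, dtype=float)
    inferred = np.asarray(inferred_mixtures, dtype=float)
    return np.abs(known - inferred).sum(axis=1)

# Toy example with two topics: the first document is inferred perfectly,
# the second is assigned entirely to the wrong topic.
toy_known = np.array([[0.5, 0.5],
                      [1.0, 0.0]])
toy_inferred = np.array([[0.5, 0.5],
                         [0.0, 1.0]])
toy_errors = mixture_l1_errors(toy_known, toy_inferred)
print('per-document errors = {}'.format(toy_errors))
print('mean error = {}'.format(toy_errors.mean()))
```

Since both arguments are probability vectors, each error lies between 0 (identical mixtures) and 2 (mixtures with no overlap). In this notebook the same function could be applied to `topic_mixtures[:12]` and `inferred_topic_mixtures`, and likewise to the test-set mixtures once they have been inferred.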

## Stop / Close the Endpoint

Finally, we should delete the endpoint before we close the notebook.

To delete the endpoint, uncomment and run the cell below, or navigate to the "Endpoints" tab in the SageMaker console, select the endpoint with the name stored in the variable `endpoint_name`, and select "Delete" from the "Actions" dropdown menu. To restart the endpoint afterwards, rerun the endpoint creation code above.

In [None]:
#sagemaker.delete_endpoint(EndpointName=endpoint_name)