# SageMaker Latent Dirichlet Allocation - An End-to-End SageMaker Example

1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Training](#Training)
1. [Inference](#Inference)

# Introduction
***

Amazon SageMaker LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. Latent Dirichlet Allocation (LDA) is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.

In this notebook we will use the Amazon SageMaker LDA algorithm to train an LDA model on some example synthetic data. We will then use this model to classify (perform inference on) the data. The main goals of this notebook are to,

* learn how to obtain and store data for use in Amazon SageMaker,
* create an Amazon SageMaker training job on a data set to produce an LDA model,
* use the LDA model to perform inference with an Amazon SageMaker endpoint.

The following are ***not*** goals of this notebook:

* understand the LDA model,
* understand how the Amazon SageMaker LDA algorithm works,
* interpret the meaning of the inference output.

If you would like to know more about these things, take a minute to run this notebook and then check out the SageMaker LDA Documentation and the [LDA - Science](http://www.example.com) notebook.

# Setup

***

Before we do anything at all, we need data! We also need to set up our AWS credentials so that Amazon SageMaker can store and access data. In this section we will do four things:

1. [Setup AWS Credentials](#SetupAWSCredentials)
1. [Obtain Example Data](#ObtainExampleData)
1. [Inspect Example Data](#InspectExampleData)
1. [Store Data on S3](#StoreDataonS3)

## Setup AWS Credentials

> **NOTE** To run this notebook, all you need to provide is an Amazon S3 bucket to store the training input and output, along with an access role that lets SageMaker access this bucket. The only other user input is the location of the Amazon SageMaker LDA algorithm Docker image. You shouldn't need to edit this.

In [None]:
s3_access_role = 'arn:aws:iam::<<<ACCESS ROLE>>>'
bucket = '<<<BUCKET>>>'

s3_access_role = 'arn:aws:iam::874786414999:role/ease-access-role'
bucket = 'lda-notebook-example'

training_image = '462891221994.dkr.ecr.us-west-2.amazonaws.com/lda:1'

## Obtain Example Data


We generate some example synthetic document data. For the purposes of this notebook we will omit the details of this process. All we need to know is that each piece of data, commonly called a "document", is a vector of integers representing "word counts" within the document. In this particular example there are a total of 25 words in the "vocabulary".
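
To make the "vector of word counts" representation concrete, here is a minimal sketch that turns a tokenized document into such a vector. The five-word vocabulary, the tokens, and the helper name `to_count_vector` are hypothetical and chosen only for illustration; the synthetic data in this notebook has a 25-word vocabulary.

```python
from collections import Counter


def to_count_vector(tokens, vocabulary):
    """Return the number of occurrences of each vocabulary word in tokens."""
    counts = Counter(tokens)
    # words outside the vocabulary are ignored; absent words get a count of 0
    return [counts[word] for word in vocabulary]


# hypothetical toy vocabulary and document (not part of the notebook's data)
vocabulary = ['apple', 'banana', 'cherry', 'date', 'elder']
tokens = ['banana', 'apple', 'banana', 'elder', 'fig']

count_vector = to_count_vector(tokens, vocabulary)
print(count_vector)
```

Each position in the resulting list corresponds to one vocabulary word, which is the same layout as one row of `documents`.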

In [None]:
!conda install -y scipy

In [None]:
import numpy as np
from generate_example_data import generate_griffiths_data

# generate the sample data
num_documents = 2000
known_alpha, known_beta, documents, topic_mixtures = generate_griffiths_data(
    num_documents=num_documents, num_topics=10, seed=0)
num_topics, vocabulary_size = known_beta.shape


# separate the generated data into training and test subsets
num_documents_training = int(0.8*num_documents)
num_documents_test = num_documents - num_documents_training

documents_training = documents[:num_documents_training]
documents_test = documents[num_documents_training:]

topic_mixtures_training = topic_mixtures[:num_documents_training]
topic_mixtures_test = topic_mixtures[num_documents_training:]

## Inspect Example Data

*What does the example data actually look like?* Below we print an example document as well as its corresponding *known* topic mixture. Later, when we perform inference on the training data set we will compare the inferred topic mixture to this known one.

As we can see, each document is a vector of word counts from the 25-word vocabulary.

In [None]:
print('First training document =\n\t{}'.format(documents_training[0]))
print('\nVocabulary size = {}'.format(vocabulary_size))

In [None]:
import numpy as np
np.set_printoptions(precision=4, suppress=True)

print('Known topic mixture of first document =\n\t{}'.format(topic_mixtures_training[0]))
print('\nNumber of topics = {}'.format(num_topics))

Because we are visual creatures, let's try plotting the documents. In the plots below, each pixel of a document represents a word. The greyscale intensity is a measure of how frequently that word occurs. Below we plot the first few documents of the training set reshaped into 5x5 pixel grids.

In [None]:
%matplotlib inline

from generate_example_data import plot_lda

fig = plot_lda(documents_training, nrows=3, ncols=4, cmap='gray_r', with_colorbar=True)
fig.suptitle('Example Document Word Counts')
fig.set_dpi(160)
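
The reshaping described above can be sketched without any plotting library. This is an illustration only: the helper name `to_grid` and the example counts are hypothetical, and row-major ordering is an assumption (it is NumPy's default for `reshape`, but the source does not state what `plot_lda` does internally).

```python
def to_grid(document, side=5):
    """Reshape a flat list of side*side word counts into rows of a square grid."""
    if len(document) != side * side:
        raise ValueError(
            'expected {} word counts, got {}'.format(side * side, len(document)))
    # assumed row-major layout: the first `side` words form the first row, and so on
    return [list(document[row * side:(row + 1) * side]) for row in range(side)]


# hypothetical document in which only the words of the second row occur
example_document = [0] * 5 + [3, 1, 4, 2, 5] + [0] * 15

grid = to_grid(example_document)
for row in grid:
    print(row)
```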

## Store Data on S3

A SageMaker training job needs access to training data stored in an S3 bucket. Although training can accept data in various formats, we convert the documents to MXNet RecordIO Protobuf format before uploading them to the S3 bucket defined at the beginning of this notebook.

In [None]:
import boto3
from mxnet.recordio import MXRecordIO
from record_pb2 import Record

def save_documents(fname, documents):
    """Saves a Numpy array of documents to RecordIO Protobuf format."""
    feature_size = documents.shape[1]
    
    # convert to protobuf
    protobuf = [
        list_to_record_bytes(
            document.astype(np.float32).tolist(),
            feature_size=feature_size)
        for document in documents
    ]

    # write to recordio
    recordio = MXRecordIO(fname, "w")
    for datum in protobuf:
        recordio.write(datum)
    recordio.close()
    

def list_to_record_bytes(values, keys=None, label=None, feature_size=None):
    """Takes a list and returns a serialized bytestring (using the vector/record representation)"""
    record = Record()
    record.features['values'].float32_tensor.values.extend(values)
 
    if keys is not None:
        if feature_size is None:
            raise ValueError("For sparse tensors the feature size must be specified.")
        record.features['values'].float32_tensor.keys.extend(keys)

    if feature_size is not None:
        record.features['values'].float32_tensor.shape.extend([feature_size])
 
    if label is not None:
        record.label['values'].float32_tensor.values.extend([label])
        
    return record.SerializeToString()

    
def libsvm_record_converter(label, keys, values, feature_size=None):
    record = Record()
    record.features['values'].float32_tensor.values.extend(values)
 
    if keys is not None:
        if feature_size is None:
            raise ValueError("For sparse tensors the feature size must be specified.")
        record.features['values'].float32_tensor.keys.extend(keys)

    if feature_size is not None:
        record.features['values'].float32_tensor.shape.extend([feature_size])
 
    if label is not None:
        record.label['values'].float32_tensor.values.extend([label])
 
    return record


# convert and upload the training documents
fname = 'data.pbr'
save_documents(fname, documents_training)
key = 'lda-rosetta-stone-notebook/training/' + fname
boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_file(fname)
print('Uploaded document data "{}" to "{}/{}"'.format(fname, bucket, key))

# Training

***

Once the data is preprocessed and available in a recommended format, the next step is to train our model on the data. SageMaker LDA requires a number of parameters that configure the model and define the computational environment in which training will take place.

An LDA model uses the following hyperparameters:

* **`num_topics`** - The number of topics or categories in the LDA model. Usually, this is not known a priori. However, here we know that this example data is generated by ten topics (the `num_topics=10` passed to `generate_griffiths_data` above).

* **`feature_dim`** - The size of the *"vocabulary"*, in LDA parlance. In this case, this is equal to 25. (Each pixel represents one word in the vocabulary.)

* **`mini_batch_size`** - The number of input training *"documents"*, in LDA parlance. In this case, this is the total number of documents in the training set (1,600).

* **`alpha0`** - *(optional)* a measurement of how "mixed" the topics are for each document. When `alpha0` is small the data tends to be represented by one or few topics. When `alpha0` is large the data tends to be an even combination of several or many topics. In this example we set `alpha0` to 1.0.
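
The effect of `alpha0` can be illustrated with a small simulation. The sketch below is not how SageMaker LDA uses the parameter internally; it assumes, for illustration, that `alpha0` is the sum of the parameters of a symmetric Dirichlet prior over five topics, and the helper names are hypothetical. It samples topic mixtures and reports how much weight the dominant topic receives on average.

```python
import random


def sample_topic_mixture(alpha0, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet whose parameters sum to alpha0."""
    alpha = alpha0 / num_topics
    # a Dirichlet sample is a vector of independent Gamma draws normalized to sum to 1
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [draw / total for draw in draws]


def average_largest_weight(alpha0, num_topics=5, num_samples=500, seed=0):
    """Average weight of the dominant topic over many sampled mixtures."""
    rng = random.Random(seed)
    largest = [
        max(sample_topic_mixture(alpha0, num_topics, rng))
        for _ in range(num_samples)
    ]
    return sum(largest) / num_samples


concentrated = average_largest_weight(alpha0=0.5)
spread_out = average_largest_weight(alpha0=50.0)
print('alpha0 = 0.5 : average largest topic weight = {:.2f}'.format(concentrated))
print('alpha0 = 50.0: average largest topic weight = {:.2f}'.format(spread_out))
```

With a small `alpha0` the dominant topic takes most of the weight, while a large `alpha0` pushes the mixtures toward an even split across topics.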

In addition to these LDA model hyperparameters, we provide additional parameters defining things like the EC2 instance type on which training will run, the S3 bucket containing the data, and the AWS access role.
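
The `HyperParameters` field of `create_training_job` is a map of strings to strings, which is why the values are wrapped in `str()` in the cell below. As a minimal sketch, the hypothetical helper `build_hyperparameters` shows one way to validate the values and do that conversion in one place; the specific checks (positive integers, positive `alpha0`) are assumptions for illustration rather than the algorithm's documented limits.

```python
def build_hyperparameters(num_topics, feature_dim, mini_batch_size, alpha0=None):
    """Validate LDA hyperparameters and return them as a dict of strings."""
    required = [
        ('num_topics', num_topics),
        ('feature_dim', feature_dim),
        ('mini_batch_size', mini_batch_size),
    ]
    hyperparameters = {}
    for name, value in required:
        # assumed constraint: each required value is a positive whole number
        if int(value) != value or value <= 0:
            raise ValueError('{} must be a positive integer, got {!r}'.format(name, value))
        hyperparameters[name] = str(int(value))

    # alpha0 is optional; only include it when the caller provides a value
    if alpha0 is not None:
        if alpha0 <= 0:
            raise ValueError('alpha0 must be positive, got {!r}'.format(alpha0))
        hyperparameters['alpha0'] = str(float(alpha0))
    return hyperparameters


# values matching this notebook: 10 topics, 25-word vocabulary, 1600 training documents
example_hyperparameters = build_hyperparameters(10, 25, 1600, alpha0=1.0)
print(example_hyperparameters)
```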

In [None]:
import time


# create a name for this training job. to better distinguish this training job
# from others we append a timestamp.
job_name_prefix = 'lda-rosetta-stone-notebook'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp


# set up the parameters for the SageMaker create_training_job call. This includes
# things like
#   * which algorithm image to use
#   * algorithm hyperparameters
#   * S3 locations of input/output data
training_params = {
    'AlgorithmSpecification': {
        'TrainingImage': training_image,
        'TrainingInputMode': 'File',
    },
    'HyperParameters': {
        'num_topics': str(num_topics),
        'feature_dim': str(vocabulary_size),
        'mini_batch_size': str(num_documents_training),
        'alpha0': str(1.0),
    },
    'InputDataConfig': [
        {
            'ChannelName': 'train',
            'CompressionType': 'None',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': 's3://{}/{}/training/'.format(bucket, job_name_prefix),
                    'S3DataDistributionType': 'FullyReplicated',
                }
            },
            'RecordWrapperType': 'None',
        }
    ],
    'OutputDataConfig': {
        'S3OutputPath': 's3://{}/{}/output'.format(bucket, job_name_prefix),
    },
    'ResourceConfig': {
        'InstanceCount': 1,
        'InstanceType': 'ml.c4.2xlarge',
        'VolumeSizeInGB': 50,
    },
    'RoleArn': s3_access_role,
    'StoppingCondition': {
        'MaxRuntimeInSeconds': 60*60,
    },
    'TrainingJobName': job_name,
}


print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))

Using the above configuration, create a SageMaker client and use the client to create a training job.

In [None]:
# create the Amazon SageMaker training job
sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)


# confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))


# wait for the job to finish and report the ending status
sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)


# if the job failed, determine why
if status == 'Failed':
    message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed')

If you see the message,

> `Training job ended with status: Completed`

then training completed successfully and the output LDA model was stored in the output path specified by `training_params['OutputDataConfig']`.

You can also view information about and the status of a training job using the AWS SageMaker console. Just click on the "Jobs" tab.
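
The training cell above blocks on a boto3 waiter, which conceptually polls the job status until it reaches a terminal state. A minimal sketch of that kind of polling loop follows. The helper name `wait_for_status`, its parameters, and the fake `describe` function are hypothetical; the fake stands in for `describe_training_job` so the sketch runs without AWS access.

```python
import time


def wait_for_status(describe, status_key, terminal_statuses,
                    delay_seconds=60, max_attempts=60):
    """Poll describe() until response[status_key] is terminal; return the last response."""
    for attempt in range(max_attempts):
        response = describe()
        if response[status_key] in terminal_statuses:
            return response
        # do not sleep after the final attempt
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    raise TimeoutError(
        'status did not become terminal after {} attempts'.format(max_attempts))


# stand-in for the real describe call: reports two in-progress polls, then success
fake_statuses = iter(['InProgress', 'InProgress', 'Completed'])


def fake_describe():
    return {'TrainingJobStatus': next(fake_statuses)}


final = wait_for_status(
    fake_describe, 'TrainingJobStatus',
    terminal_statuses={'Completed', 'Failed', 'Stopped'},
    delay_seconds=0)
print('Final status: {}'.format(final['TrainingJobStatus']))
```

With a real client, `describe` could be `lambda: sagemaker.describe_training_job(TrainingJobName=job_name)`. The built-in waiter remains the simpler choice; a hand-written loop is mainly useful when you want to log progress between polls.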

# Inference

***

A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the topic mixture representing a given document.

This section involves several steps:

1. [Create Model](#CreateModel) - Create a SageMaker model from the training output.
1. [Create Endpoint Configuration](#CreateEndpointConfiguration) - Create a configuration defining an endpoint.
1. [Create Endpoint](#CreateEndpoint) - Use the configuration to create an inference endpoint.
1. [Perform Inference](#PerformInference) - Perform inference on some input data using the endpoint.

## Create Model

We now create a SageMaker Model from the training output. Using the model we can create an Endpoint Configuration.

In [None]:
import boto3

# get the location of the model generated by the above training job
model_name = job_name
model_data = training_info['ModelArtifacts']['S3ModelArtifacts']
model_params = {
    'ExecutionRoleArn': s3_access_role,
    'ModelName': model_name,
    'PrimaryContainer': {
        'Image': training_image,
        'ModelDataUrl': model_data,
    },
}
model_response = sagemaker.create_model(**model_params)
print('Model name: {}'.format(model_name))
print('ModelArn:   {}\n'.format(model_response['ModelArn']))

## Create Endpoint Configuration

Use the model to create an endpoint configuration. The endpoint configuration also contains information about the type and number of EC2 instances to use when hosting the algorithm.

SageMaker LDA is compute-bound, so we will use an `ml.c4.xlarge` instance in this example. For problems with a larger vocabulary or a large volume of data, consider using `ml.c4.2xlarge`, `ml.c4.4xlarge`, or `ml.c4.8xlarge` instances. (Or the `ml.c5` instances, once available!)

In [None]:
import boto3, time

# define the endpoint configuration parameters. these include things like
#   * the type of instance to use at the endpoint
#   * the number of instances to spin up at the endpoint
#   * the name of the model to use    
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-endpoint-config' + timestamp
endpoint_config_params = {
    'EndpointConfigName': endpoint_config_name,
    'ProductionVariants': [
        {
            'InstanceType': 'ml.c4.xlarge',
            'InitialInstanceCount': 1,
            'ModelName': model_name,
            'VariantName': 'AllTraffic'
        }
    ]
}


# create the endpoint configuration
endpoint_config_response = sagemaker.create_endpoint_config(**endpoint_config_params)
print('Endpoint configuration name: {}'.format(endpoint_config_name))
print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))

## Create Endpoint

Use the configuration to create an endpoint.

In [None]:
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-endpoint' + timestamp
print('Endpoint name: {}'.format(endpoint_name))


endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}
endpoint_response = sagemaker.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))

In [None]:
# get the status of the endpoint
response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print('EndpointStatus = {}'.format(status))


# wait until the endpoint is in service
sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)


# print the status of the endpoint
endpoint_response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = endpoint_response['EndpointStatus']
print('Endpoint creation ended with EndpointStatus = {}'.format(status))

if status != 'InService':
    raise Exception('Endpoint creation failed.')

If you see the message,

> `Endpoint creation ended with EndpointStatus = InService`

then congratulations! You now have a functioning inference endpoint. You can confirm the endpoint configuration and status by navigating to the "Endpoints" tab in the AWS SageMaker console.

Finally, we create a runtime object from which we can invoke the endpoint.

In [None]:
import boto3

lda_runtime = boto3.client('sagemaker-runtime')

## Perform Inference

With this realtime endpoint at our fingertips we can finally perform inference on our training and test data.

### LDA Inference

We should first discuss the meaning of the SageMaker LDA inference output. For more information see [How LDA Works](http://www.example.com) and the [LDA-Science](http://www.example.com) notebook.

For each document we wish to compute its corresponding `topic_mixture`. Each topic mixture is a probability distribution over the topics learned during LDA training, of which there are `num_topics` (ten in this example). Each element of the topic mixture is the proportion of the input document that is attributed to the corresponding topic.

For example, in a simpler model with only five topics, if the topic mixture of an input document $\mathbf{w}$ is,

$$\theta = \left[ 0.3, 0.2, 0, 0.5, 0 \right]$$

then $\mathbf{w}$ is 30% generated from the first topic, 20% from the second topic, and 50% from the fourth topic. Below, we compute the topic mixtures for the first ten training documents.

In [None]:
import numpy as np
import io, json

# for demonstration purposes we show that input data can be passed in csv
# format. additional available formats are JSON, JSON sparse format, and 
# RecordIO Protobuf
def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()

payload = np2csv(documents_training[:10])
print('Input text/csv document payload:\n{}'.format(payload))

print('\nInvoking endpoint...')
invoke_endpoint_params = {
    'EndpointName': endpoint_name,
    'ContentType': 'text/csv',
    'Body': payload,
}
response = lda_runtime.invoke_endpoint(**invoke_endpoint_params)


print('\nObtaining results...')
results = json.loads(response['Body'].read().decode())


print('\nPrinting results...\n')
print(results)

It may be hard to see, but the output of the SageMaker LDA inference endpoint is a Python dictionary with the following format.

```
{
  'predictions': [
    {'topic_mixture': [ ... ] },
    {'topic_mixture': [ ... ] },
    {'topic_mixture': [ ... ] },
    ...
  ]
}
```

We extract the topic mixtures themselves, one for each of the input documents.

In [None]:
computed_topic_mixtures = np.array([prediction['topic_mixture'] for prediction in results['predictions']])

print(computed_topic_mixtures)
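
Since each topic mixture is described above as a probability distribution, a quick sanity check and summary can help when reading the raw numbers. The sketch below is illustrative: the helper name `summarize_topic_mixture` and the tolerance are hypothetical, topic indices are zero-based, and the input is the five-topic example mixture from the text rather than real endpoint output.

```python
def summarize_topic_mixture(topic_mixture, tolerance=1e-3):
    """Check that a mixture is a probability distribution and rank its topics.

    Returns (topic_index, weight) pairs for topics with non-zero weight,
    ordered from the largest weight to the smallest.
    """
    if any(weight < 0 for weight in topic_mixture):
        raise ValueError('topic weights must be non-negative')
    total = sum(topic_mixture)
    if abs(total - 1.0) > tolerance:
        raise ValueError('topic weights sum to {}, expected 1'.format(total))

    ranked = sorted(range(len(topic_mixture)),
                    key=lambda topic: topic_mixture[topic], reverse=True)
    return [(topic, topic_mixture[topic]) for topic in ranked if topic_mixture[topic] > 0]


# the example mixture from the text: 30% topic 0, 20% topic 1, 50% topic 3
example_mixture = [0.3, 0.2, 0.0, 0.5, 0.0]
summary = summarize_topic_mixture(example_mixture)
print(summary)
```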

If you decide to compare these results to the known topic mixtures generated in [Obtain Example Data](#ObtainExampleData) keep in mind that SageMaker LDA discovers topics in no particular order. That is, the approximate topic mixtures computed above may be permutations of the known topic mixtures corresponding to the same documents.

In [None]:
print(topic_mixtures_training[0])  # known topic mixture
print(computed_topic_mixtures[0])  # computed topic mixture
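
Because the learned topics may come back in a different order, a direct element-by-element comparison can be misleading. One way to compare, sketched below under the assumption that a brute-force search is acceptable, is to try every reordering of the computed topics and keep the one with the smallest total absolute error. The helper names and the three-topic toy data are hypothetical.

```python
from itertools import permutations


def best_topic_permutation(known, computed):
    """Return the ordering of computed topics that best matches the known mixtures."""
    num_topics = len(known[0])

    def total_l1_error(order):
        # known topic i is compared against computed topic order[i]
        return sum(
            abs(known_mixture[i] - computed_mixture[order[i]])
            for known_mixture, computed_mixture in zip(known, computed)
            for i in range(num_topics)
        )

    return min(permutations(range(num_topics)), key=total_l1_error)


def apply_permutation(mixture, order):
    """Reorder a computed mixture so its topics line up with the known ones."""
    return [mixture[j] for j in order]


# toy data: the computed topics are a shuffled, slightly noisy copy of the known ones
known = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
computed = [[0.12, 0.68, 0.20], [0.79, 0.11, 0.10]]

order = best_topic_permutation(known, computed)
aligned = [apply_permutation(mixture, order) for mixture in computed]
print('Best ordering of computed topics: {}'.format(order))
print('Aligned computed mixtures: {}'.format(aligned))
```

The search visits every permutation, so its cost grows factorially with the number of topics. For more than a handful of topics, an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to a topic-by-topic cost matrix is likely a better fit.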

## Stop / Close the Endpoint

Finally, we should delete the endpoint before we close the notebook.

To do so, uncomment and run the cell below. Alternatively, you can navigate to the "Endpoints" tab in the SageMaker console, select the endpoint with the name stored in the variable `endpoint_name`, and select "Delete" from the "Actions" dropdown menu.

In [None]:
#sagemaker.delete_endpoint(EndpointName=endpoint_name)
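
Deleting the endpoint leaves the endpoint configuration and the model created earlier in place. If you also want to remove those, the calls can be grouped as in the sketch below. The helper name `delete_hosting_resources` and the `RecordingClient` stand-in are hypothetical; the stand-in only records calls so the example runs without AWS access, and whether to delete the configuration and model at all is a judgment call.

```python
def delete_hosting_resources(client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then the endpoint configuration and model behind it."""
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stand-in for the boto3 SageMaker client that only records the calls made."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, **kwargs):
        self.calls.append(('delete_endpoint', kwargs))

    def delete_endpoint_config(self, **kwargs):
        self.calls.append(('delete_endpoint_config', kwargs))

    def delete_model(self, **kwargs):
        self.calls.append(('delete_model', kwargs))


# hypothetical resource names, used only to exercise the helper
recording_client = RecordingClient()
delete_hosting_resources(
    recording_client, 'example-endpoint', 'example-endpoint-config', 'example-model')
for name, kwargs in recording_client.calls:
    print(name, kwargs)
```

With the real client from this notebook, the call would be `delete_hosting_resources(sagemaker, endpoint_name, endpoint_config_name, model_name)`.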