# Challenge Lab 6.3: Implementing Topic Extraction

In this lab, you will use either Amazon Comprehend or the Amazon SageMaker Neural Topic Model (NTM) algorithm to extract topics from the plot summaries of the CMU Movie Summary Corpus.

## CMU Movie Summary Corpus

The CMU Movie Summary Corpus is a collection of 42,306 movie plot summaries and metadata at both the movie level (including box office revenues, genre, and date of release) and character level (including gender and estimated age). This data supports work in the following paper:

David Bamman, Brendan O'Connor and Noah Smith, "Learning Latent Personas of Film Characters," in: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 2013.

There are two datasets you will use in this lab:

**plot_summaries.txt**

Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia.  Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

**movie.metadata.tsv**

Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

1. Wikipedia movie ID
2. Freebase movie ID
3. Movie name
4. Movie release date
5. Movie box office revenue
6. Movie runtime
7. Movie languages (Freebase ID:name tuples)
8. Movie countries (Freebase ID:name tuples)
9. Movie genres (Freebase ID:name tuples)


## Lab Steps

To complete this lab, you will follow these steps:

1. [Install packages](#1.-Install-packages)
2. [Reviewing the dataset](#2.-Reviewing-the-dataset)

3. [Extracting Topics](#3.-Extracting-Topics)



## Submitting your work

1. In the lab console, choose `Submit` to record your progress and when prompted, choose **Yes**.

1. If the results don't display after a couple of minutes, return to the top of these instructions and choose `Grades`.

     **Tip**: You can submit your work multiple times. After you change your work, choose **Submit** again. Your last submission is what will be recorded for this lab.

1. To find detailed feedback on your work, choose `Details` followed by **View Submission Report**.

## 1. Install packages
([Go to top](#Challenge-Lab-6.3:-Implementing-Topic-Extraction))

Start by updating and installing the packages you will use in the notebook. 


In [None]:
%matplotlib inline

import boto3
import os, io, struct, json
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import uuid
from time import sleep
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:
bucket = "c46255a638438l1748394t1w538120888142-labbucket-12figcw8iu648"
job_data_access_role = 'arn:aws:iam::538120888142:role/service-role/c46255a638438l1748394t1w5-ComprehendDataAccessRole-1A1092NM0Q4C7'
prefix='lab63'

## 2. Reviewing the dataset
([Go to top](#Challenge-Lab-6.3:-Implementing-Topic-Extraction))



Start by loading the **plot_summaries.tsv** data into a pandas dataframe.

There are only 2 columns in the file, the movie_id and the plot. The data is 'tab' separated, so the '\t' escape sequence is used as the separator.

In [None]:
df = pd.read_csv('../data/plot_summaries.tsv', sep='\t', names=['movie_id','plot'])

Take a look at the first few rows of data.

In [None]:
pd.options.display.max_rows
pd.set_option('display.max_colwidth', None)

df.head(5)

You can check the number of rows and columns with the shape property:

In [None]:
df.shape

Now examine the metadata. From the [documentation](http://www.cs.cmu.edu/~ark/personas/data/README.txt) we see there are 9 fields. We can load the data into a pandas dataframe and specify the column names.

In [None]:
movie_meta_df = pd.read_csv('../data/movie.metadata.tsv', sep='\t', names=['movie_id','freebase_id','name','release_date','box_office_revenue','runtime','languages','countries','genres'])
movie_meta_df.head()

You can set the index to the movie_id, which will make merging this dataset with the plot much easier.

In [None]:
movie_meta_df.set_index('movie_id', inplace=True)

Since you only need the movie name and some way of linking this metadata to the plot (movie_id), you can drop the remaining columns.

In [None]:
movie_meta_df=movie_meta_df.drop(['freebase_id','release_date','box_office_revenue','runtime','languages','countries','genres'], axis=1)
movie_meta_df.head()
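
Because the metadata is indexed by `movie_id`, the movie names can later be joined to the plots. The following is a minimal sketch on made-up data, not part of the lab solution; the `toy_` dataframes and their values are hypothetical stand-ins for `df` and `movie_meta_df`.

```python
import pandas as pd

# Toy stand-ins for the lab's dataframes (values are made up for illustration).
toy_plots = pd.DataFrame({
    'movie_id': [101, 102, 103],
    'plot': ['A detective hunts a killer.', 'A robot saves a city.', 'Two friends travel.'],
})
toy_meta = pd.DataFrame(
    {'name': ['Night Case', 'Steel Heart']},
    index=pd.Index([101, 102], name='movie_id'),
)

# Left join keeps every plot; movies without metadata get NaN for the name.
toy_merged = toy_plots.merge(toy_meta, how='left', left_on='movie_id', right_index=True)
print(toy_merged[['movie_id', 'name']])
```

A left join is used here so that plots without a metadata entry are kept rather than silently dropped.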

## 3. Extracting Topics
([Go to top](#Challenge-Lab-6.3:-Implementing-Topic-Extraction))

You must now decide if you are going to use Amazon Comprehend or Amazon SageMaker NTM to extract your topics. Both will do a good job of giving you topics, but have different data requirements. You can refer to the notebooks from lab 6.1 and 6.2 for any code snippets you might need for each solution. Experiment with the number of topics to see if you can get better results. 

Questions to address:

1. What data cleanup will you need to perform?

2. How many topics will give you the best results?

    


# Amazon Comprehend Solution

In [None]:
df.head()

Clean the data so that it can be written to a file with one document on each line.

In [None]:
stop = stopwords.words('english')
lem = WordNetLemmatizer()
from nltk.tokenize import word_tokenize

def clean(sent):
    # Normalize case and whitespace
    sent = sent.lower()
    sent = re.sub(r'\s+', ' ', sent)
    sent = sent.strip()
    # Remove HTML tags
    sent = re.compile('<.*?>').sub('',sent)
    # remove special characters and digits
    sent=re.sub("(\\d|\\W)+"," ",sent)
    # Remove stand-alone "br" tokens without altering words such as "brother"
    sent=re.sub(r"\bbr\b","",sent)
    filtered_sentence = []
    
    for w in word_tokenize(sent):
        # You are applying custom filtering here. Feel free to try different things.
        # Check if it is not numeric, its length > 2, and it is not in stopwords
        if(not w.isnumeric()) and (len(w)>2) and (w not in stop):  
            # Lemmatize and add to filtered list
            filtered_sentence.append(lem.lemmatize(w))
    final_string = " ".join(filtered_sentence) #final string of cleaned words
    return final_string

In [None]:
df['X'] = df.apply(lambda row : clean(row['plot']), axis = 1)

In [None]:
df.head()

Write the plot summaries to a local file, one document per line, and then upload the file to Amazon S3.

In [None]:
df['plot'].to_csv(r'pandas.txt', header=None, index=None, sep=' ', mode='w')

In [None]:
s3 = boto3.resource('s3')
s3.Bucket(bucket).upload_file('pandas.txt', 'comprehend/pandas.txt')

In [None]:
number_of_topics = 20

input_s3_url = f"s3://{bucket}/comprehend"
input_doc_format = "ONE_DOC_PER_LINE"
input_data_config = {"S3Uri": input_s3_url, "InputFormat": input_doc_format}

output_s3_url = f"s3://{bucket}/outputfolder/"
output_data_config = {"S3Uri": output_s3_url}

job_uuid = uuid.uuid1()
job_name = f"top-job-{job_uuid}"

print(input_s3_url)
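
The `ONE_DOC_PER_LINE` input format treats every line of the input file as a separate document, so a stray newline inside a plot would split it in two. The sketch below shows one way to guarantee one document per line using plain file writes. The sample documents and the temporary file path are made up for illustration; the lab itself writes `pandas.txt` with `to_csv`.

```python
import os
import tempfile

toy_docs = ['First plot.\nIt spans two lines.', 'Second   plot with "quotes".']

# Hypothetical output path; the lab itself writes to pandas.txt.
toy_path = os.path.join(tempfile.mkdtemp(), 'one_doc_per_line.txt')

with open(toy_path, 'w', encoding='utf-8') as f:
    for doc in toy_docs:
        # Collapse all whitespace runs (including newlines) into single spaces.
        f.write(' '.join(doc.split()) + '\n')

# Read the file back and confirm there is exactly one line per document.
with open(toy_path, encoding='utf-8') as f:
    toy_lines = f.read().splitlines()

print(len(toy_lines), 'lines for', len(toy_docs), 'documents')
```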

Now you can start the Amazon Comprehend topic detection job.

In [None]:
comprehend_client = boto3.client(service_name='comprehend')
start_topics_detection_job_result = comprehend_client.start_topics_detection_job(NumberOfTopics=number_of_topics,
                                                                              InputDataConfig=input_data_config,
                                                                              JobName=job_name,
                                                                              OutputDataConfig=output_data_config,
                                                                              DataAccessRoleArn=job_data_access_role
                                                                            )




In [None]:
# Get the current job status
from time import sleep
job_id = start_topics_detection_job_result['JobId']
job = comprehend_client.describe_topics_detection_job(JobId=job_id)

# Poll once a minute until the job completes, fails, or times out
waited = 0
timeout_minutes = 40
while job['TopicsDetectionJobProperties']['JobStatus'] != 'COMPLETED':
    status = job['TopicsDetectionJobProperties']['JobStatus']
    # Stop right away on a terminal failure instead of waiting for the timeout
    assert status not in ('FAILED', 'STOPPED'), "Job ended with status %s." % status
    sleep(60)
    waited += 60
    assert waited//60 < timeout_minutes, "Job timed out after %d seconds." % waited
    print('.', end='')
    job = comprehend_client.describe_topics_detection_job(JobId=job_id)

print('Ready')

In [None]:
topic_comprehend_output_file = job['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']
print(f'output filename: {topic_comprehend_output_file}')

topics_comprehend_bucket, topics_comprehend_key = topic_comprehend_output_file.replace("s3://", "").split("/", 1)
print(topics_comprehend_key)
s3r = boto3.resource('s3')
s3r.meta.client.download_file(topics_comprehend_bucket, topics_comprehend_key, 'output.tar.gz')

In [None]:
# Extract the tar file
import tarfile
tf = tarfile.open('output.tar.gz')
tf.extractall()

The compressed file you downloaded from Amazon Comprehend contained 2 files:

- **topic-terms.csv** is a list of topics in the collection. For each topic the list includes the top 10 terms according to the weight.
- **doc-topics.csv** lists the documents associated with a topic and the proportion of the document that is concerned with the topic. A document can belong to more than one topic.

You can load topic-terms.csv into a pandas dataframe.

In [None]:
import pandas as pd
dftopicterms = pd.read_csv("topic-terms.csv")

Take a look at the first few rows.

In [None]:
dftopicterms.head()

Print each topic along with its top terms.

In [None]:
# selecting rows based on condition
for t in range(0,number_of_topics):
    rslt_df = dftopicterms.loc[dftopicterms['topic'] == t]
    topic_list = rslt_df['term'].values.tolist()
    print(f'Topic {t:2} - {topic_list}')

Note that the topics are not named; naming them is left to you. Based on the words above, try to think of a word that describes each topic and update the cell below, replacing the placeholder text with your chosen word.

In [None]:
colnames = pd.DataFrame({'topics':['topic 0', 'topic 1', 'topic 2', 'topic 3', 'topic 4', 'topic 5', 'topic 6','topic 7','topic 8','topic 9',
       'topic 10', 'topic 11', 'topic 12', 'topic 13', 'topic 14', 'topic 15', 'topic 16','topic 17','topic 18','topic 19']})

Next you can read in the doc-topics.csv file.

In [None]:
dfdoctopics = pd.read_csv("doc-topics.csv")
dfdoctopics.head()

To make visualizations easier, you can select 5 documents to plot.

In [None]:
to_chart = dfdoctopics.loc[dfdoctopics['docname'].isin(['pandas.txt:1000','pandas.txt:2000','pandas.txt:3000','pandas.txt:4000','pandas.txt:5000'])]

Next, use the pivot_table function to map the column values in the docname column to actual columns. This will make creating a chart easier.

In [None]:
to_chart = to_chart.pivot_table(values='proportion', index='topic', columns='docname')
to_chart.head()
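
To see what the pivot does, here is a small sketch on made-up rows that follow the same `docname`, `topic`, `proportion` layout as doc-topics.csv. The `toy_` names and values are hypothetical. It also shows one way to read off the dominant topic for each document.

```python
import pandas as pd

# Made-up rows in the same three-column layout as doc-topics.csv.
toy_doc_topics = pd.DataFrame({
    'docname': ['toy.txt:0', 'toy.txt:0', 'toy.txt:1'],
    'topic': [0, 3, 3],
    'proportion': [0.7, 0.3, 1.0],
})

# One row per topic, one column per document; missing pairs become NaN.
toy_wide = toy_doc_topics.pivot_table(values='proportion', index='topic', columns='docname')
print(toy_wide)

# Dominant topic per document: the row label with the largest proportion.
toy_dominant = toy_wide.idxmax()
print(toy_dominant)
```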

Finally, you can plot the topics belonging to the selected plot summaries. You will see that some documents have a single topic, while others contain a mixture. If you have too many topics, you can try changing the number of topics to get better results.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

fs = 12
# df.index = colnames['topic']
to_chart.plot(kind='bar', figsize=(16,4), fontsize=fs)
plt.ylabel('Topic assignment', fontsize=fs+2)
plt.xlabel('Topic ID', fontsize=fs+2)

# NTM Topic Extraction Solution

In [None]:
%%time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vocab_size = 2000
print('Tokenizing and counting, this may take a few minutes...')

# vectorizer = CountVectorizer(input='content', max_features=vocab_size, max_df=0.95, min_df=2)
vectorizer = CountVectorizer(input='content', max_features=vocab_size)
# vectors = vectorizer.fit_transform(data)
vectors = vectorizer.fit_transform(df['X'])
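# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() when it is available
if hasattr(vectorizer, 'get_feature_names_out'):
    vocab_list = vectorizer.get_feature_names_out().tolist()
else:
    vocab_list = vectorizer.get_feature_names()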

print('vocab size:', len(vocab_list))
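
As a small illustration of what `CountVectorizer` produces, the sketch below builds a bag-of-words matrix from three made-up documents. The corpus and the `toy_` names are hypothetical, and `max_features=4` plays the role that `vocab_size` plays above.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    'detective hunt killer',
    'killer escape detective',
    'robot save city',
]

# Keep only the 4 most frequent terms, mirroring the role of vocab_size.
toy_vectorizer = CountVectorizer(input='content', max_features=4)
toy_vectors = toy_vectorizer.fit_transform(toy_corpus)

print(toy_vectors.shape)                   # (number of documents, vocabulary size)
print(sorted(toy_vectorizer.vocabulary_))  # the retained terms
```

Terms that occur once are tied in frequency here, so which of them fill the remaining two vocabulary slots is not something to rely on.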

Optionally, you may consider removing very short documents. The following cell removes documents that contain 25 or fewer vocabulary words. Whether this is appropriate depends on the application, but there is a general justification: topic modeling tries to model each document as a mixture of multiple topics, and it is hard to imagine a very short document expressing more than one topic, so topic modeling may not be the best choice for short documents.

In [None]:
threshold = 25
vectors = vectors[np.array(vectors.sum(axis=1)>threshold).reshape(-1,)]
print('removed short docs (<={} words)'.format(threshold))
print(vectors.shape)
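
The row-sum filter can be checked on a tiny matrix. This is a sketch on made-up counts (the `toy_` names are hypothetical); note that a document whose count equals the threshold would also be dropped, because the comparison is strict.

```python
import numpy as np
import scipy.sparse as sparse

# Three toy documents over a 4-word vocabulary; rows sum to 3, 30 and 26 words.
toy_counts = sparse.csr_matrix(np.array([
    [1, 2, 0, 0],
    [10, 5, 15, 0],
    [0, 0, 20, 6],
]))
toy_threshold = 25

# Row sums give each document's word count; flatten the (n, 1) result into a boolean mask.
toy_keep = np.asarray(toy_counts.sum(axis=1) > toy_threshold).reshape(-1)
toy_filtered = toy_counts[toy_keep]
print(toy_keep, toy_filtered.shape)
```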

The output of `CountVectorizer` is a sparse matrix whose elements are integers.

In [None]:
print(type(vectors), vectors.dtype)
print(vectors[0])

Because all the parameters (weights and biases) in the NTM model are of type `np.float32`, the input data must also be `np.float32`. It is better to do this type-casting up front rather than repeatedly casting during mini-batch training.

In [None]:
import scipy.sparse as sparse
vectors = sparse.csr_matrix(vectors, dtype=np.float32)
print(type(vectors), vectors.dtype)

As is common practice in model training, you should have a training set, a validation set, and a test set. The training set is the data the model is actually trained on. But what you really care about is not the model's performance on the training set but its performance on future, unseen data. Therefore, during training, you periodically calculate scores (or losses) on the validation set to validate the performance of the model on unseen data. By assessing the model's ability to generalize, you can stop the training at the optimal point via early stopping to avoid over-training.

Note that when you only have a training set and no validation set, the NTM model will rely on scores on the training set to perform early stopping, which could result in over-training. Therefore, you should always supply a validation set to the model.

Here you use 80% of the data set as the training set and the rest for validation set and test set. You will use the validation set in training and use the test set for demonstrating model inference.

In [None]:
from sklearn.model_selection import train_test_split
def split_data(df):
    train, test_validate = train_test_split(df,
                                            test_size=0.2,
                                            shuffle=True,
                                            random_state=324
                                            )
    test, validate = train_test_split(test_validate,
                                            test_size=0.5,
                                            shuffle=True,
                                            random_state=324
                                            )
    return train, validate, test

In [None]:
train_vectors, val_vectors, test_vectors = split_data(vectors)

In [None]:
print(train_vectors.shape, val_vectors.shape)
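
The two-stage split yields an 80/10/10 division. A quick sketch on 100 made-up rows, using the same `train_test_split` arguments as `split_data` (the `toy_` names are hypothetical), makes the resulting sizes visible.

```python
from sklearn.model_selection import train_test_split

toy_rows = list(range(100))

# First split: 80% train, 20% held out.
toy_train, toy_rest = train_test_split(toy_rows, test_size=0.2, shuffle=True, random_state=324)
# Second split: divide the held-out 20% evenly between test and validation.
toy_test, toy_val = train_test_split(toy_rest, test_size=0.5, shuffle=True, random_state=324)

print(len(toy_train), len(toy_val), len(toy_test))
```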


## Save the vocabulary file

To make use of the auxiliary channel for the vocabulary file, first save the vocabulary as a text file named vocab.txt in the auxiliary directory.


In [None]:
import os
import shutil
def check_create_dir(dir):
    if os.path.exists(dir):  # cleanup existing data folder
        shutil.rmtree(dir)
    os.mkdir(dir)
    
data_dir = './'
aux_data_dir = os.path.join(data_dir, 'auxiliary')
check_create_dir(aux_data_dir)
with open(os.path.join(aux_data_dir, 'vocab.txt'), 'w', encoding='utf-8') as f:
    for item in vocab_list:
        f.write(item+'\n')



## Store Data on S3

The NTM algorithm, as well as other first-party SageMaker algorithms, accepts data in [RecordIO](https://mxnet.apache.org/api/python/io/io.html#module-mxnet.recordio) [Protobuf](https://developers.google.com/protocol-buffers/) format. The SageMaker Python API provides helper functions for easily converting your data into this format. Below you will convert the data from numpy/scipy format and upload it to an Amazon S3 destination so that the model can access it during training.


### Setup AWS Credentials

You first need to specify data locations and access roles. In particular, you need the following data:

- The S3 `bucket` and `prefix` that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM `role` is used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, replace the boto regexp with the appropriate full IAM role ARN string(s).

**Note**: These values will have been supplied when the lab starts.

In [None]:
import os
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()

sess = sagemaker.Session()


In [None]:
prefix = 'movies-ntm'

train_prefix = os.path.join(prefix, 'train')
val_prefix = os.path.join(prefix, 'val')
aux_prefix = os.path.join(prefix, 'auxiliary')
output_prefix = os.path.join(prefix, 'output')

s3_train_data = os.path.join('s3://', bucket, train_prefix)
s3_val_data = os.path.join('s3://', bucket, val_prefix)
s3_aux_data = os.path.join('s3://', bucket, aux_prefix)
output_path = os.path.join('s3://', bucket, output_prefix)
print('Training set location', s3_train_data)
print('Validation set location', s3_val_data)
print('Auxiliary data location', s3_aux_data)
print('Trained model will be saved at', output_path)

Here you define a helper function to convert the data to RecordIO Protobuf format and upload it to S3. In addition, you will have the option to split the data into several parts specified by `n_parts`.

The algorithm inherently supports multiple files in the training folder ("channel"), which could be very helpful for large data set. In addition, when you use distributed training with multiple workers (compute instances), having multiple files allows you to distribute different portions of the training data to different workers conveniently.

Inside this helper function you use `write_spmatrix_to_sparse_tensor` function provided by [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to convert scipy sparse matrix into RecordIO Protobuf format.

In [None]:
def split_convert_upload(sparray, bucket, prefix, fname_template='data_part{}.pbr', n_parts=2):
    import io
    import boto3
    import sagemaker.amazon.common as smac
    
    chunk_size = sparray.shape[0]// n_parts
    for i in range(n_parts):

        # Calculate start and end indices
        start = i*chunk_size
        end = (i+1)*chunk_size
        if i+1 == n_parts:
            end = sparray.shape[0]
        
        # Convert to record protobuf
        buf = io.BytesIO()
        smac.write_spmatrix_to_sparse_tensor(array=sparray[start:end], file=buf, labels=None)
        buf.seek(0)
        
        # Upload to s3 location specified by bucket and prefix
        fname = os.path.join(prefix, fname_template.format(i))
        boto3.resource('s3').Bucket(bucket).Object(fname).upload_fileobj(buf)
        print('Uploaded data to s3://{}'.format(os.path.join(bucket, fname)))
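
The chunking arithmetic inside `split_convert_upload` can be looked at in isolation. The sketch below reproduces only the start/end calculation; `chunk_bounds` is a hypothetical helper written for illustration and is not part of the SageMaker SDK.

```python
def chunk_bounds(n_rows, n_parts):
    """Return (start, end) row ranges; the last part absorbs any remainder."""
    chunk_size = n_rows // n_parts
    bounds = []
    for i in range(n_parts):
        start = i * chunk_size
        end = n_rows if i + 1 == n_parts else (i + 1) * chunk_size
        bounds.append((start, end))
    return bounds

print(chunk_bounds(10, 3))
```

With 10 rows and 3 parts, the first two parts hold 3 rows each and the last part holds the remaining 4, so no row is lost to integer division.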

In [None]:
split_convert_upload(train_vectors, bucket=bucket, prefix=train_prefix, fname_template='train_part{}.pbr', n_parts=8)
split_convert_upload(val_vectors, bucket=bucket, prefix=val_prefix, fname_template='val_part{}.pbr', n_parts=1)

Upload the vocab.txt file.

In [None]:
boto3.resource('s3').Bucket(bucket).Object(aux_prefix+'/vocab.txt').upload_file(aux_data_dir+'/vocab.txt')


You have created the training and validation data sets and uploaded them to S3. Next, configure a SageMaker training job to use the NTM algorithm on the data you prepared.

In [None]:
from sagemaker.image_uris import retrieve
container = retrieve('ntm',boto3.Session().region_name)

The cell above chooses the NTM algorithm container for the current region. In the call to `sagemaker.estimator.Estimator` below, you also specify the type and count of instances for the training job. Because this data set is relatively small, you can use a CPU-only instance (`ml.c4.xlarge`), but feel free to change to [other instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/). NTM takes full advantage of GPU hardware and in general trains roughly an order of magnitude faster on a GPU than on a CPU. Multi-GPU or multi-instance training further improves training speed roughly linearly if communication overhead is low compared to compute time.

In [None]:
import sagemaker
sess = sagemaker.Session()
ntm = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=2, 
                                    instance_type='ml.c4.xlarge',
                                    output_path=output_path,
                                    sagemaker_session=sagemaker.Session())

## Hyperparameters

Here are a few hyperparameters. For information about the full list of available hyperparameters, please refer to [NTM Hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/ntm_hyperparameters.html).

- **feature_dim** - the "feature dimension", it should be set to the vocabulary size
- **num_topics** - the number of topics to extract
- **mini_batch_size** - this is the batch size for each worker instance. Note that in multi-GPU instances, this number will be further divided by the number of GPUs. Therefore, for example, if you plan to train on an 8-GPU machine (such as `ml.p2.8xlarge`) and wish each GPU to have 1024 training examples per batch, `mini_batch_size` should be set to 8192 (8 × 1024).
- **epochs** - the maximal number of epochs to train for, training may stop early
- **num_patience_epochs** and **tolerance** - control the early stopping behavior. Roughly speaking, the algorithm will stop training if within the last `num_patience_epochs` epochs there have not been improvements on validation loss. Improvements smaller than `tolerance` will be considered non-improvement.
- **optimizer** and **learning_rate** - by default we use `adadelta` optimizer and `learning_rate` does not need to be set. For other optimizers, the choice of an appropriate learning rate may require experimentation.
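
To make the interplay of `num_patience_epochs` and `tolerance` concrete, here is a minimal sketch of patience-based early stopping. It follows the rough description above and is not the actual NTM implementation; `should_stop` and the loss values are hypothetical.

```python
def should_stop(val_losses, num_patience_epochs, tolerance):
    """Return True if the best loss in the last `num_patience_epochs` epochs
    did not beat the best earlier loss by more than `tolerance`."""
    if len(val_losses) <= num_patience_epochs:
        return False  # not enough history yet
    best_before = min(val_losses[:-num_patience_epochs])
    best_recent = min(val_losses[-num_patience_epochs:])
    return best_before - best_recent <= tolerance

# Loss has plateaued: the last 3 epochs improve on 0.8 by only about 0.01.
print(should_stop([1.0, 0.8, 0.79, 0.795, 0.79], num_patience_epochs=3, tolerance=0.05))
# Loss is still falling quickly, so training would continue.
print(should_stop([1.0, 0.8, 0.6, 0.5, 0.4], num_patience_epochs=3, tolerance=0.05))
```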



In [None]:
num_topics = 20
ntm.set_hyperparameters(num_topics=num_topics, 
                        feature_dim=vocab_size, 
                        mini_batch_size=256, 
                        num_patience_epochs=10, 
                        optimizer='adam')

Next, you need to specify how the training data and validation data will be distributed to the workers during training. There are two modes for data channels:

- `FullyReplicated`: all data files will be copied to all workers
- `ShardedByS3Key`: data files will be sharded to different workers, i.e. each worker will receive a different portion of the full data set.

At the time of writing, by default, the Python SDK will use `FullyReplicated` mode for all data channels. This is desirable for validation (test) channel but not suitable for training channel. The reason is that when you use multiple workers you would like to go through the full data set by each of them going through a different portion of the data set, so as to provide different gradients within epochs. Using `FullyReplicated` mode on training data not only results in slower training time per epoch (nearly 1.5X in this example), but also defeats the purpose of distributed training. To set the training data channel correctly you specify `distribution` to be `ShardedByS3Key` for the training data channel as follows.

In [None]:
from sagemaker.inputs import TrainingInput
# sagemaker.inputs.TrainingInput
s3_train = TrainingInput(s3_train_data, distribution='ShardedByS3Key') 
s3_val = TrainingInput(s3_val_data, distribution='FullyReplicated')
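
The difference between the two distribution modes can be pictured with a small simulation. This is only a sketch: `assign_files` is a hypothetical helper, and the round-robin assignment is an assumption made for illustration, since the exact way SageMaker assigns S3 keys to workers is not described in this lab.

```python
def assign_files(files, n_workers, distribution):
    """Return a list with the files each worker would receive."""
    if distribution == 'FullyReplicated':
        # Every worker gets a full copy of the channel.
        return [list(files) for _ in range(n_workers)]
    if distribution == 'ShardedByS3Key':
        # Round-robin is an assumption made for this sketch only.
        return [list(files[w::n_workers]) for w in range(n_workers)]
    raise ValueError('unknown distribution: ' + distribution)

toy_files = ['train_part{}.pbr'.format(i) for i in range(8)]
replicated = assign_files(toy_files, 2, 'FullyReplicated')
sharded = assign_files(toy_files, 2, 'ShardedByS3Key')
print([len(f) for f in replicated], [len(f) for f in sharded])
```

With 8 training files and 2 workers, each worker reads all 8 files when replicated but only 4 when sharded, which is why splitting the training data into several files matters for sharding.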

The final step before training is to define the auxiliary channel, which supplies the vocabulary file. This replaces integers in the log files with the actual words.

In [None]:
s3_aux = TrainingInput(s3_aux_data, distribution='FullyReplicated', content_type='text/plain')

Now you are ready to train. The following cell takes a few minutes to run. The command below will first provision the required hardware. You will see a series of dots indicating the progress of the hardware provisioning process. Once the resources are allocated, training logs will be displayed. With multiple workers, the log color and the ID following `INFO` identifies logs emitted by different workers.

In [None]:
# ntm.fit({'train': s3_train, 'validation': s3_train, 'auxiliary': s3_aux})
ntm.fit({'train': s3_train, 'validation': s3_val, 'auxiliary': s3_aux})

If you see the message

> `===== Job Complete =====`

at the bottom of the output logs then that means training successfully completed and the output NTM model was stored in the specified output path. You can also view information about and the status of a training job using the AWS SageMaker console. Just click on the "Jobs" tab and select training job matching the training job name, below:

In [None]:
print('Training job name: {}'.format(ntm.latest_training_job.job_name))

In the cell that contains the log information for the training job, scroll to the bottom until you find a line similar to the one below.

**Tip:** look for the phrase *Topics from epoch:final*

    [05/04/2021 02:01:05 INFO 140593644394304] Topics from epoch:final (num_topics:20) [wetc 0.33, tu 0.68]

There are 2 numbers of interest here; wetc and tu.

- **wetc** is the *word embedding topic coherence* and indicates the degree of topic coherence. A higher number indicates a higher degree of topic coherence.
- **tu** is the *topic uniqueness* metric and indicates how unique the terms are within the topic. The higher the number, the more unique the topic terms.

In the example above, the wetc is average at 0.33 and the tu is above average at 0.68.
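
For intuition about topic uniqueness, the sketch below computes a TU score from lists of top words using one published definition: the average, over a topic's top words, of the reciprocal of the number of topics that share each word. Whether NTM computes exactly this is an assumption; `topic_uniqueness` and the word lists are hypothetical.

```python
from collections import Counter

def topic_uniqueness(top_words_per_topic):
    """TU per topic: mean of 1/cnt(word), where cnt(word) is the number of
    topics whose top-word list contains that word."""
    counts = Counter(w for words in top_words_per_topic for w in set(words))
    return [sum(1.0 / counts[w] for w in words) / len(words)
            for words in top_words_per_topic]

toy_topics = [['police', 'murder'], ['police', 'love'], ['war', 'army']]
print(topic_uniqueness(toy_topics))
```

Under this definition a topic scores 1.0 when none of its top words appear in any other topic, and the score drops as more words are shared.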

After the line displaying the overall wetc and tu metrics, you should see a list of the topics that were identified, along with the words that make up each topic. The topics are not named; naming them is a task that still requires a human. For each topic, you see its WETC and TU scores, as well as the top words within that topic.

You can use these words to try and determine a name for the topic.

Try and attach your own topic names to each topic.


A trained NTM model does nothing on its own. You now want to use the model you computed to perform inference on data. For this example, that means predicting the topic mixture representing a given document.

You create an inference endpoint using the SageMaker Python SDK `deploy()` function from the job you defined above. You specify the instance type where inference is computed as well as an initial number of instances to spin up.

In [None]:
ntm_predictor = ntm.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Congratulations! You now have a functioning SageMaker NTM inference endpoint. You can confirm the endpoint configuration and status by navigating to the "Endpoints" tab in the AWS SageMaker console and selecting the endpoint matching the endpoint name, below: 

In [None]:
print('Endpoint name: {}'.format(ntm_predictor.endpoint_name))


### Data Serialization/Deserialization

You can pass data in a variety of formats to the inference endpoint. First, you will pass CSV-formatted data. You can make use of the SageMaker Python SDK utilities `CSVSerializer` and `JSONDeserializer` when configuring the inference endpoint.

### Inference with CSV

In [None]:
ntm_predictor.content_types = 'text/csv'
ntm_predictor.serializer = sagemaker.serializers.CSVSerializer()
ntm_predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

Pass 5 examples from the test set to the inference endpoint.

In [None]:
test_data = np.array(test_vectors.todense())
results = ntm_predictor.predict(test_data[:5])
print(results)

You can see the output format of SageMaker NTM inference endpoint is a Python dictionary with the following format.

```
{
  'predictions': [
    {'topic_weights': [ ... ] },
    {'topic_weights': [ ... ] },
    {'topic_weights': [ ... ] },
    ...
  ]
}
```

You can extract the topic weights corresponding to each of the input documents.

In [None]:
predictions = np.array([prediction['topic_weights'] for prediction in results['predictions']])
print(predictions)
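
The same extraction can be tried without an endpoint on a made-up response that follows the dictionary format shown above. The `toy_` names and weights are hypothetical; the sketch also picks the heaviest topic for each document.

```python
import numpy as np

# Made-up response in the same shape as the endpoint output described above.
toy_results = {
    'predictions': [
        {'topic_weights': [0.1, 0.7, 0.2]},
        {'topic_weights': [0.5, 0.25, 0.25]},
    ]
}

toy_predictions = np.array([p['topic_weights'] for p in toy_results['predictions']])
print(toy_predictions.shape)           # (documents, topics)
print(toy_predictions.argmax(axis=1))  # index of the heaviest topic per document
```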

Replace the topic names in the cell below with the topic names you decided on.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
colnames = pd.DataFrame({'topics':['topic 0', 'topic 1', 'topic 2', 'topic 3', 'topic 4', 'topic 5', 'topic 6','topic 7','topic 8','topic 9',
       'topic 10', 'topic 11', 'topic 12', 'topic 13', 'topic 14', 'topic 15', 'topic 16','topic 17','topic 18','topic 19']})

Now you can take a look at how the 20 topics are assigned to the 5 test documents with a bar plot.

In [None]:
fs = 12
df=pd.DataFrame(predictions.T)
df.index = colnames['topics']
df.plot(kind='bar', figsize=(16,4), fontsize=fs)
plt.ylabel('Topic assignment', fontsize=fs+2)
plt.xlabel('Topic ID', fontsize=fs+2)

Some ways you could improve the model include adding or removing specific words to influence topics, increasing or decreasing the number of topics, and trying different hyperparameters.

## Stop / Close the Endpoint

Finally, you should delete the endpoint before you close the notebook.

To restart the endpoint you can follow the code above using the same `endpoint_name`.

In [None]:
sagemaker.Session().delete_endpoint(ntm_predictor.endpoint_name)

## Model Exploration


***Note: The following section is meant as a deeper dive into exploring the trained models. The demonstrated functionalities may not be fully supported or guaranteed. For example, the parameter names may change without notice.***


The trained model artifact is a compressed package of MXNet models from the two workers. To explore the model, you first need to install mxnet.

In [None]:
# If you use the conda_mxnet_p36 kernel, mxnet is already installed and you can comment out the following line.
!pip install mxnet 
import mxnet as mx

Here you download and unpack the artifact

In [None]:
model_path = os.path.join(output_prefix, ntm._current_job_name, 'output/model.tar.gz')
model_path

In [None]:
boto3.resource('s3').Bucket(bucket).download_file(model_path, 'downloaded_model.tar.gz')

In [None]:
!tar -xzvf 'downloaded_model.tar.gz'

In [None]:
# use flag -o to overwrite previous unzipped content
!unzip -o model_algo-2

You can load the model parameters and extract the weight matrix $W$ in the decoder as follows

In [None]:
model = mx.ndarray.load('params')

W = model['arg:projection_weight']

In [None]:
print(W)

You can visualize each topic as a word cloud, with the size of each word proportional to the pseudo-probability of that word appearing under the topic.

In [None]:
!pip install wordcloud
import wordcloud as wc

In [None]:
import matplotlib.pyplot as plt
word_to_id = dict()
for i, v in enumerate(vocab_list):
    word_to_id[v] = i

limit = 24
n_col = 4
counter = 0

plt.figure(figsize=(20,16))
for ind in range(num_topics):

    if counter >= limit:
        break

    title_str = 'Topic{}'.format(ind)

    #pvals = mx.nd.softmax(W[:, ind]).asnumpy()
    pvals = mx.nd.softmax(mx.nd.array(W[:, ind])).asnumpy()

    word_freq = dict()
    for k in word_to_id.keys():
        i = word_to_id[k]
        word_freq[k] =pvals[i]

    wordcloud = wc.WordCloud(background_color='white').fit_words(word_freq)

    plt.subplot(limit // n_col, n_col, counter+1)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(title_str)
    #plt.close()

    counter +=1
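
The pseudo-probabilities used for the word clouds come from applying a softmax to one topic's column of the weight matrix. The sketch below shows the same transformation in NumPy on made-up weights; `softmax` here is a hypothetical helper, not the MXNet function used above.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = np.asarray(x, dtype=np.float64) - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()

toy_weights = np.array([2.0, 1.0, -1.0])  # made-up decoder weights for one topic
toy_pvals = softmax(toy_weights)
print(toy_pvals, toy_pvals.sum())
```

The outputs are positive and sum to 1, and words with larger weights receive larger shares, which is what drives the word sizes in the cloud.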

# Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.

*©2021 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.*
