## Setup

Install and import dependencies 

In [None]:
!pip install wordcloud

%matplotlib inline

from collections import defaultdict
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import os
import boto3
import wordcloud as wc
import csv
import json  # needed later for json.dump when writing the Neptune loader parameters

Provide details of the S3 bucket you just created

In [None]:
bucket='<fill-in-your-bucket-name>'
prefix='data/translated/politics-2000'

!aws s3api head-bucket --bucket $bucket # verify bucket 

Create a tmp folder for storing outputs

In [None]:
OUTPUT_FOLDER='output'

def make_tmp_folder(folder_name):
    # exist_ok=True makes this a no-op when the folder is already there,
    # while still raising on real failures such as permission errors.
    os.makedirs(folder_name, exist_ok=True)

make_tmp_folder(OUTPUT_FOLDER)

Copy the translated text into your bucket

In [None]:
!aws s3 sync s3://large-text-understanding/data/THUCNews/translated/political-news-2000/ s3://$bucket/$prefix/ --quiet

## Use Amazon Comprehend to run a topic modeling job

You can use Amazon Comprehend to examine the content of a collection of documents to determine common themes without providing pre-labeled data. 

Amazon Comprehend uses a Latent Dirichlet Allocation-based learning model to determine the topics in a set of documents. It examines each document to determine the context and meaning of a word. The set of words that frequently belong to the same context across the entire document set makes up a topic.

In [None]:
input_s3_location = f's3://{bucket}/{prefix}/'
print(f'Topic modeling job input s3 location: {input_s3_location}')

Use the **Amazon Comprehend** console to run a topic modeling job: [https://console.aws.amazon.com/comprehend/home](https://console.aws.amazon.com/comprehend/home)
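
The same job can also be started from the notebook instead of the console. The sketch below is one possible way to do so with the boto3 `start_topics_detection_job` call. The helper name `build_topics_job_request`, the example bucket, the output prefix, the role ARN, and the topic count of 20 are all placeholders, not values from this workshop. It assumes each document is stored as its own S3 object, and that the role lets Comprehend read the input location and write to the output location.

```python
def build_topics_job_request(input_s3_uri, output_s3_uri, role_arn, num_topics=20):
    """Assemble keyword arguments for Comprehend's start_topics_detection_job."""
    return {
        'InputDataConfig': {
            'S3Uri': input_s3_uri,
            # Assumption: every article is stored as a separate S3 object.
            'InputFormat': 'ONE_DOC_PER_FILE',
        },
        'OutputDataConfig': {'S3Uri': output_s3_uri},
        'DataAccessRoleArn': role_arn,
        'NumberOfTopics': num_topics,
    }

# Placeholder values (assumptions): substitute your own bucket and IAM role.
request = build_topics_job_request(
    input_s3_uri='s3://my-example-bucket/data/translated/politics-2000/',
    output_s3_uri='s3://my-example-bucket/comprehend-output/',
    role_arn='arn:aws:iam::123456789012:role/ExampleComprehendDataAccessRole',
)
print(request)

# To submit the job (requires AWS credentials and a real role), uncomment:
# import boto3
# response = boto3.client('comprehend').start_topics_detection_job(**request)
# topic_modeling_job_id = response['JobId']
```

If you start the job this way, the returned `JobId` is the value to use as the job ID in the next section.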


## Download topic modeling results from Amazon Comprehend

When the topic detection job is finished, the service creates an output file in S3. The `S3Uri` field of the job output configuration contains the location of the output file, called `output.tar.gz`. It is a compressed archive that contains the output of the topic detection job:
* `topic-terms.csv` - a list of topics in the collection. For each topic, the list includes the top 10 terms, ranked by weight.
* `doc-topics.csv` - lists the documents associated with each topic and the proportion of each document that is concerned with that topic.
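
To make the two file layouts concrete, here is a minimal sketch that parses small, made-up samples using only the standard library. The column names (`topic,term,weight` and `docname,topic,proportion`) are assumed from the descriptions above, so check them against the header rows of your own files. The helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows that mimic the assumed column layout of the two files.
SAMPLE_TOPIC_TERMS = """topic,term,weight
000,election,0.051
000,vote,0.032
001,trade,0.047
"""

SAMPLE_DOC_TOPICS = """docname,topic,proportion
doc-a.txt,000,0.7
doc-a.txt,001,0.3
doc-b.txt,001,1.0
"""

def top_terms_by_topic(text):
    """Map each topic id to its (term, weight) pairs, heaviest first."""
    terms = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        terms[row['topic']].append((row['term'], float(row['weight'])))
    return {
        topic: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for topic, pairs in terms.items()
    }

def dominant_topic_by_doc(text):
    """Map each document to the (topic, proportion) with the largest proportion."""
    best = {}
    for row in csv.DictReader(io.StringIO(text)):
        proportion = float(row['proportion'])
        if row['docname'] not in best or proportion > best[row['docname']][1]:
            best[row['docname']] = (row['topic'], proportion)
    return best

print(top_terms_by_topic(SAMPLE_TOPIC_TERMS))
print(dominant_topic_by_doc(SAMPLE_DOC_TOPICS))
```

A document can appear on several rows of `doc-topics.csv`, one per associated topic, which is why the second helper keeps only the row with the highest proportion.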

In [None]:
topic_modeling_job_id = '<fill-in-job-id>' 


In [None]:
comprehend_client = boto3.client('comprehend')

In [None]:
topics_detection_response = comprehend_client.describe_topics_detection_job(
    JobId=topic_modeling_job_id
)
output_file = topics_detection_response['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']
num_topics = topics_detection_response['TopicsDetectionJobProperties']['NumberOfTopics']

print(f'output file: {output_file}')
print(f'number of topics: {num_topics}')

download_output=os.path.join(OUTPUT_FOLDER, f'output-{num_topics}-topics.tar.gz')
!aws s3 cp $output_file $download_output
print(f'downloaded topic modeling output to: {download_output}')


In [None]:
!tar -xvzf $download_output -C $OUTPUT_FOLDER/

In [None]:
topic_terms_csv = os.path.join(OUTPUT_FOLDER, 'topic-terms.csv')
doc_topics_csv = os.path.join(OUTPUT_FOLDER, 'doc-topics.csv')

Review the `topic-terms.csv` file content to see the top terms that appear for each topic detected: 

In [None]:
!cat $topic_terms_csv

Review the first few lines of the `doc-topics.csv` file to see the output format, which lists the topics detected for each document:

In [None]:
!head -n 15 $doc_topics_csv

## Initial visualizations of the topic modeling result

Using `matplotlib` and `wordcloud`, we can do some initial exploration and visualization of the topic modeling output.

### Topic visualization using wordcloud
`wordcloud` uses the size of each word to reflect the relative weight of the term, which gives a quick sense of what each topic is about.

In [None]:
def parse_topic_terms(topic_terms_csv):
    topics=defaultdict(dict)
    with open(topic_terms_csv) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        line_count = 0
        for row in csv_reader:
            if line_count == 0:
                print(f'Column names are {", ".join(row)}')
                line_count += 1
            else:
                topic = row[0]
                term = row[1]
                freq= float(row[2])
                topics[topic][term]=freq
                line_count += 1
        print(f'Processed {line_count} lines.')
    return topics

def plot_topic_word_cloud(topics):      
    plt.figure(figsize=(20,16))

    n_col = 6

    for i, item in enumerate(topics):

        title_str = 'Topic{}'.format(item)

        wordcloud = wc.WordCloud(background_color='white').fit_words(topics[item])

        plt.subplot(len(topics) // n_col+1, n_col, i+1)
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.title(title_str)
                      

In [None]:
topics = parse_topic_terms(topic_terms_csv)
plot_topic_word_cloud(topics)

### Topic distribution
What's the makeup of topics across the whole corpus?

In [None]:
def parse_doc_topic_distribution(doc_topics_csv):
    doc_topic_df = pd.read_csv(doc_topics_csv)
    return doc_topic_df

def summarize_topic_distributions(doc_topic_df):
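    # Sum only the numeric 'proportion' column; summing the document-name
    # column as well would concatenate strings or fail, depending on the pandas version.
    df_summary = doc_topic_df.groupby('topic')[['proportion']].sum()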
    df_summary['proportion'] = df_summary['proportion']/(df_summary['proportion'].sum())
    df_summary = df_summary.sort_values(by=['proportion'], ascending=False)
    return df_summary

def summarize_topic_frequency(doc_topic_df):
    df_summary = doc_topic_df.groupby('topic').size().to_frame()
    df_summary.columns = ['count']
    df_summary = df_summary.sort_values(by=['count'], ascending=False)
    return df_summary


The `doc-topics.csv` output is loaded into a pandas DataFrame. We then sort the topics by the proportion they make up of the entire corpus.

In [None]:
doc_topic_df = parse_doc_topic_distribution(doc_topics_csv)
topic_distribution_df = summarize_topic_distributions(doc_topic_df)
topic_distribution_df.plot(kind='bar', figsize=(16,4))

In [None]:
topic_frequency_df =  summarize_topic_frequency(doc_topic_df)
topic_frequency_df.plot(kind='bar', figsize=(16,4),  title='topic distribution by document count')

##### Top 5 topics in the corpus

In [None]:
topic_distribution_df.head(5)

##### Top 5 topics in the corpus by number of documents containing them

In [None]:
topic_frequency_df.head(5)

#### Look for documents most highly related to given topic 

In [None]:
topic_of_interest=4

In [None]:
def find_top_document_for_topic(topic_of_interest, max_results=10):
    filtered_by_topic_df = doc_topic_df[doc_topic_df['topic']==topic_of_interest]
    filtered_by_topic_df = filtered_by_topic_df.sort_values(by=['proportion'], ascending=False)
    return filtered_by_topic_df.head(max_results)

In [None]:
top_10_docs = find_top_document_for_topic(topic_of_interest, max_results=10)
top_10_docs

In [None]:
doc_id = top_10_docs.iloc[0,0]
doc_id

In [None]:
!aws s3 cp s3://$bucket/$prefix/$doc_id $OUTPUT_FOLDER/

In [None]:
!cat $OUTPUT_FOLDER/$doc_id

## Using a graph database (Amazon Neptune) to better explore the topic modeling output

Using a graph database such as [Amazon Neptune](https://aws.amazon.com/neptune/) allows us to understand the relationships between topics and documents in a more natural way. 

### Load results into Amazon Neptune

To [load data into Neptune](https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load.html), we will be using the Neptune **Load** command. The **Load** API requires the input to be in a [specific CSV format](https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html). Therefore, we will first convert our topic modeling output files into the required format.

#### Convert result files to Neptune accepted format
We need to convert our data into two CSV files:
* a vertex CSV file (represents nodes in the graph)
* an edge CSV file (represents edges in the graph)
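
As a rough illustration of the target layout, the sketch below writes one tiny vertex file and one tiny edge file in memory. The Gremlin load format uses the system columns `~id` and `~label` for vertices, and `~id`, `~from`, `~to`, and `~label` for edges, followed by optional typed property columns such as `name:String`. The IDs, labels, and property names here are hypothetical, and the `loader.py` script used in this notebook may choose different ones.

```python
import csv
import io

def write_csv(header, rows):
    """Render a header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Vertices: one row per topic and one row per document (hypothetical values).
vertex_csv = write_csv(
    ['~id', '~label', 'name:String'],
    [
        ['topic-000', 'topic', 'Topic 000'],
        ['doc-a.txt', 'document', 'doc-a.txt'],
    ],
)

# Edges: connect a document to a topic and carry the proportion as a property.
edge_csv = write_csv(
    ['~id', '~from', '~to', '~label', 'proportion:Double'],
    [['e-1', 'doc-a.txt', 'topic-000', 'has_topic', 0.7]],
)

print(vertex_csv)
print(edge_csv)
```

Every `~from` and `~to` value in the edge file refers to a `~id` in the vertex file.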

In [None]:
neptune_vertex_csv = os.path.join(OUTPUT_FOLDER, 'neptune-nodes.csv')
neptune_edge_csv = os.path.join(OUTPUT_FOLDER, 'neptune-edges.csv')
neptune_vertex_csv_s3_path = f's3://{bucket}/neptune/neptune-nodes.csv'
neptune_edge_csv_s3_path = f's3://{bucket}/neptune/neptune-edges.csv'

!python ./neptune/neptune_csv_converter/loader.py --topictermscsv $topic_terms_csv --doctopiccsv $doc_topics_csv --edgeoutput $neptune_edge_csv --vertexoutput $neptune_vertex_csv

In [None]:
!aws s3 cp $neptune_vertex_csv $neptune_vertex_csv_s3_path
!aws s3 cp $neptune_edge_csv $neptune_edge_csv_s3_path
    

#### Set Neptune endpoint parameters 
Get parameters such as the Neptune endpoint from the CloudFormation stack outputs.

In [None]:
cloudformation_stack_name='large-text-understanding'

In [None]:
client = boto3.client('cloudformation')

# No NextToken is needed when looking up a single stack by name.
response = client.describe_stacks(
    StackName=cloudformation_stack_name
)

cloudformation_outputs=response['Stacks'][0]['Outputs']

NEPTUNE_CLUSTER_ENDPOINT=''
NEPTUNE_LOAD_FROM_S3_ROLE_ARN=''
NEPTUNE_CLUSTER_PORT=8182
AWS_REGION=''
for output in cloudformation_outputs: 
    if output['OutputKey'] == 'NeptuneEndpoint':
        NEPTUNE_CLUSTER_ENDPOINT = output['OutputValue'] 
    if output['OutputKey'] == 'NeptuneLoadFromS3IAMRoleArn':
        NEPTUNE_LOAD_FROM_S3_ROLE_ARN = output['OutputValue'] 
    if output['OutputKey'] == 'AWSRegion':
        AWS_REGION = output['OutputValue'] 


%env NEPTUNE_CLUSTER_ENDPOINT=$NEPTUNE_CLUSTER_ENDPOINT
%env NEPTUNE_CLUSTER_PORT=$NEPTUNE_CLUSTER_PORT
%env NEPTUNE_LOAD_FROM_S3_ROLE_ARN=$NEPTUNE_LOAD_FROM_S3_ROLE_ARN
%env AWS_REGION=$AWS_REGION


In [None]:
%run './neptune/neptune.py'

Clean existing data in the Neptune database (if any)

In [None]:
neptune.clear(batch_size=5000)

#### Load the nodes csv into Neptune

In [None]:
vertex_load_params = {
    "source": neptune_vertex_csv_s3_path,
    "format": "csv",
    "iamRoleArn": NEPTUNE_LOAD_FROM_S3_ROLE_ARN,
    "region": AWS_REGION,
    "failOnError": "FALSE",
    "parallelism": "HIGH",
}

vertex_params_json = os.path.join(OUTPUT_FOLDER, 'vertex_params.json')
with open(vertex_params_json, 'w') as outfile:
    json.dump(vertex_load_params, outfile)
    
!curl -X POST -H 'Content-Type: application/json' \
    https://$NEPTUNE_CLUSTER_ENDPOINT:$NEPTUNE_CLUSTER_PORT/loader -d @$vertex_params_json

Replace the value below with the `loadId` from the response above.

In [None]:
load_id = "<replace-with-loadId>"

In [None]:
!curl -G https://$NEPTUNE_CLUSTER_ENDPOINT:$NEPTUNE_CLUSTER_PORT/loader/$load_id
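
The status can also be checked in code instead of by reading the `curl` output. The sketch below assumes the loader status response is JSON with the overall status under `payload.overallStatus.status`. The sample response is abbreviated and made up, and the helper names are hypothetical.

```python
import json

def load_status(response_text):
    """Return the overall status string from a loader status response body."""
    body = json.loads(response_text)
    return body['payload']['overallStatus']['status']

def is_load_finished(response_text):
    """True only when the loader reports LOAD_COMPLETED."""
    return load_status(response_text) == 'LOAD_COMPLETED'

# Abbreviated, made-up response body for illustration only.
sample_response = json.dumps({
    'status': '200 OK',
    'payload': {
        'feedCount': [{'LOAD_COMPLETED': 1}],
        'overallStatus': {'status': 'LOAD_COMPLETED', 'totalRecords': 100},
    },
})

in_progress_response = json.dumps({
    'status': '200 OK',
    'payload': {'overallStatus': {'status': 'LOAD_IN_PROGRESS'}},
})

print(load_status(sample_response))
print(is_load_finished(in_progress_response))
```

In the notebook, the text passed to these helpers would be the body returned by the status request above.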

#### !!! ONLY PROCEED IF THE ABOVE SHOWS A STATUS OF `LOAD_COMPLETED`

Once all the nodes are loaded into the database, load the edges.

In [None]:
edge_load_params = {
    "source": neptune_edge_csv_s3_path,
    "format": "csv",
    "iamRoleArn": NEPTUNE_LOAD_FROM_S3_ROLE_ARN,
    "region": AWS_REGION,
    "failOnError": "FALSE",
    "parallelism": "HIGH",
}

edge_params_json = os.path.join(OUTPUT_FOLDER, 'edge_params.json')
with open(edge_params_json, 'w') as outfile:
    json.dump(edge_load_params, outfile)
    
!curl -X POST -H 'Content-Type: application/json' \
    https://$NEPTUNE_CLUSTER_ENDPOINT:$NEPTUNE_CLUSTER_PORT/loader -d @$edge_params_json

Replace the value below with the `loadId` from the response above.

In [None]:
load_id = "<replace-with-loadId>"
!curl -G https://$NEPTUNE_CLUSTER_ENDPOINT:$NEPTUNE_CLUSTER_PORT/loader/$load_id

### Querying the graph using Gremlin

Amazon Neptune is compatible with [Apache TinkerPop3](https://tinkerpop.apache.org/docs/current/reference/#intro) and Gremlin 3.4.1. This means that you can connect to a Neptune DB instance and use the Gremlin traversal language to query the graph.

We can use Gremlin to verify that the right number of nodes and edges were loaded into the database.

In [None]:
g = neptune.graphTraversal()

In [None]:
vertices = g.V().groupCount().by(T.label).toList()
edges  = g.E().groupCount().by(T.label).toList()
print('nodes:')
print(vertices)
print('edges:')
print(edges)


### Use a web app to interactively explore graph data

Choose a development environment:
1. If you have `npm`, `node`, and `git` installed on your laptop, you can use your laptop.
1. Alternatively, open [Cloud9](https://us-west-2.console.aws.amazon.com/cloud9/home?region=us-west-2) and create a new Cloud9 environment.

On either your laptop or the Cloud9 environment:
1. In a terminal window, clone the GitHub repo:
    ```
    git clone https://github.com/angelarw/topic-modeling-visualizations
    ```
1. Open the `topic-modeling-visualizations/webapp/src/amplify-config.js` file and replace all configuration values with the values from the CloudFormation output.
1. Install the Angular CLI
   ```
   npm install -g @angular/cli
   ```
1. Install code dependencies 
   ```
   cd topic-modeling-visualizations/webapp/
   npm install
   ```
1. Run the webapp 
   ```
   ng serve --port 8080 --disable-host-check
   ```