# Challenge Lab 6.3: Implementing Topic Modeling

In this lab, you will use either Amazon Comprehend or the Amazon SageMaker Neural Topic Model (NTM) to extract topics from the [CMU Movie Summary Corpus](http://www.cs.cmu.edu/~ark/personas/). 

## CMU Movie Summary Corpus

The CMU Movie Summary Corpus is a collection of 42,306 movie plot summaries and metadata at both the movie level (including box office revenue, genre, and date of release) and character level (including gender and estimated age). This data supports work in the following paper:

David Bamman, Brendan O'Connor, and Noah Smith. "Learning Latent Personas of Film Characters." Presented at the Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 2013. http://www.cs.cmu.edu/~dbamman/pubs/pdf/bamman+oconnor+smith.acl13.pdf.

You will use two datasets in this lab:

**plot_summaries.txt**

This dataset contains plot summaries of 42,306 movies, extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

**movie.metadata.tsv**

This dataset contains metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. The data is tab-separated and contains the following columns:

1. Wikipedia movie ID
2. Freebase movie ID
3. Movie name
4. Movie release date
5. Movie box office revenue
6. Movie runtime
7. Movie languages (Freebase ID:name tuples)
8. Movie countries (Freebase ID:name tuples)
9. Movie genres (Freebase ID:name tuples)
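
Columns 7 through 9 pack several values into a single field. The following is a minimal sketch of unpacking one such field, assuming each value is serialized as a JSON object that maps Freebase IDs to names (inspect a few rows of your copy of the data to confirm this). The sample string is illustrative only, and `parse_tuple_field` is a hypothetical helper, not part of the lab code.

```python
import json

def parse_tuple_field(raw):
    """Return the names from a JSON-style {freebase_id: name} field."""
    # Missing values (for example NaN from pandas) are not strings.
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(mapping, dict):
        return []
    return list(mapping.values())

# Illustrative value only; real rows may differ.
sample_genres = '{"/m/07s9rl0": "Drama", "/m/01jfsb": "Thriller"}'
genre_names = parse_tuple_field(sample_genres)
print(genre_names)
```

In a notebook, a helper like this could be applied to a column with `Series.apply` if you decide to keep the genre information.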

## Lab steps

To complete this lab, you will follow these steps:

1. [Installing the packages](#1.-Installing-the-packages)
2. [Reviewing the dataset](#2.-Reviewing-the-dataset)
3. [Extracting topics](#3.-Extracting-topics)

## Submitting your work

1. In the lab console, choose **Submit** to record your progress, and when prompted, choose **Yes**.

1. If the results don't display after a couple of minutes, return to the top of these instructions and choose **Grades**.

     **Tip**: You can submit your work multiple times. After you change your work, choose **Submit** again. Your last submission is what will be recorded for this lab.

1. To find detailed feedback on your work, choose **Details** followed by **View Submission Report**.

## 1. Installing the packages
([Go to top](#Challenge-Lab-6.3:-Implementing-Topic-Modeling))

First, import the packages that you will use in the notebook and download the NLTK resources that they depend on.

In [None]:
%matplotlib inline

import boto3
import os, io, struct, json
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import uuid
from time import sleep
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:
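# S3 bucket that holds the lab's input and output data.
bucket = "c51302a798363l1767466t1w753256443787-labbucket-vzr6xg8irt81"
# Replace the placeholder with the ARN of the role that grants the job access to your data.
job_data_access_role = '<DATABASE-ARN-GOES-HERE>'
# Key prefix under which this lab's files are stored in the bucket.
prefix = 'lab63'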

## 2. Reviewing the dataset
([Go to top](#Challenge-Lab-6.3:-Implementing-Topic-Modeling))

First, load the plot summaries file (`../data/plot_summaries.tsv` in this lab) into a pandas DataFrame.

The file has no header row and contains two tab-separated columns: **movie_id** and **plot**. Pass `sep='\t'` so that pandas splits on tabs, and supply the column names through `names`.

In [None]:
df = pd.read_csv('../data/plot_summaries.tsv', sep='\t', names=['movie_id','plot'])

Review the first few rows of data to get an overview of how the data is structured.

In [None]:
# Show the full plot text instead of truncating long cells.
pd.set_option('display.max_colwidth', None)

df.head(5)

To check the number of rows and columns, use the `shape` property.

In [None]:
df.shape

Now examine the metadata. The [dataset documentation](http://www.cs.cmu.edu/~ark/personas/data/README.txt) explains that the data contains nine fields. Load the data into a pandas DataFrame and specify the column names.

In [None]:
# Column names in the order listed in the dataset documentation.
meta_columns = ['movie_id', 'freebase_id', 'name', 'release_date', 'box_office_revenue',
                'runtime', 'languages', 'countries', 'genres']
movie_meta_df = pd.read_csv('../data/movie.metadata.tsv', sep='\t', names=meta_columns)
movie_meta_df.head()

Set the index to **movie_id**, which will make it easier to merge this dataset with the plot summaries.

In [None]:
movie_meta_df.set_index('movie_id', inplace=True)

Because you only need the movie name, plus **movie_id** to link the metadata to the plot summaries, drop the remaining columns.

In [None]:
movie_meta_df = movie_meta_df.drop(columns=['freebase_id', 'release_date', 'box_office_revenue', 'runtime', 'languages', 'countries', 'genres'])
movie_meta_df.head()
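
With **movie_id** as the index of the metadata, one way to link plots to movie names is a left join. The following sketch uses two tiny stand-in DataFrames whose IDs, names, and plots are invented for illustration; in the notebook you would use `df` and `movie_meta_df` in their place.

```python
import pandas as pd

# Stand-ins for df and movie_meta_df; all values are invented for illustration.
plots = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'plot': ['A heist goes wrong.', 'Two friends travel.', 'A ship is lost.'],
})
names = pd.DataFrame({
    'movie_id': [1, 2],
    'name': ['First Movie', 'Second Movie'],
}).set_index('movie_id')

# A left join keeps every plot, even when no metadata row matches.
plots_with_names = plots.join(names, on='movie_id', how='left')

# Count the plots that found no matching movie name.
missing_names = int(plots_with_names['name'].isna().sum())
print(plots_with_names)
print('plots without a name:', missing_names)
```

A left join is used here so that no plot summary is silently dropped; whether unmatched plots should be kept is a choice for you to make.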

## 3. Extracting topics
([Go to top](#Challenge-Lab-6.3:-Implementing-Topic-Modeling))

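You must now decide whether to use Amazon Comprehend or the SageMaker NTM algorithm to extract your topics. Both can produce usable topics, but their input requirements differ: an Amazon Comprehend topic detection job reads plain-text documents from Amazon S3 (one document per file or one document per line), whereas NTM trains on numeric bag-of-words vectors, so you must tokenize the text and build the vocabulary yourself.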
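
To make the difference in data requirements concrete, the following sketch prepares the same two made-up plots in both shapes: one document per line of text for Amazon Comprehend, and one word-count vector per document for NTM. The helper names, the sample plots, and the `vocab_size` parameter are assumptions for illustration, and the vectors would still need to be serialized into a format that NTM accepts, as in the earlier labs.

```python
from collections import Counter

def to_one_doc_per_line(plots):
    """Collapse whitespace so each plot occupies exactly one line of text."""
    return '\n'.join(' '.join(str(plot).split()) for plot in plots)

def to_bag_of_words(plots, vocab_size=5):
    """Build a small vocabulary and one word-count vector per plot."""
    tokenized = [str(plot).lower().split() for plot in plots]
    counts = Counter(token for tokens in tokenized for token in tokens)
    # Keep only the most frequent tokens as the vocabulary.
    vocab = [token for token, _ in counts.most_common(vocab_size)]
    index = {token: i for i, token in enumerate(vocab)}
    vectors = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for token in tokens:
            if token in index:
                row[index[token]] += 1
        vectors.append(row)
    return vocab, vectors

# Made-up plots; the first contains an embedded newline on purpose.
sample_plots = ['A detective hunts\na thief', 'The thief escapes the detective']
text_for_comprehend = to_one_doc_per_line(sample_plots)
vocab, vectors_for_ntm = to_bag_of_words(sample_plots)
print(text_for_comprehend)
print(vocab, vectors_for_ntm)
```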

Refer to the notebooks from labs 6.1 and 6.2 for code snippets that you can adapt for each solution. Experiment with the number of topics to see whether the results improve.
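
If you choose Amazon Comprehend, one way to experiment with the number of topics is to prepare one job request per candidate value. The sketch below only builds the keyword arguments for the boto3 `start_topics_detection_job` call; it does not contact AWS. The helper name, the `input/` and `output/` key layout, the candidate topic counts, and the placeholder bucket and role ARN are all assumptions for illustration.

```python
import uuid

def build_topics_job_request(bucket, prefix, role_arn, number_of_topics):
    """Assemble keyword arguments for Comprehend's start_topics_detection_job."""
    return {
        # A short random suffix keeps job names distinct across reruns.
        'JobName': f'{prefix}-topics-{number_of_topics}-{uuid.uuid4().hex[:8]}',
        'NumberOfTopics': number_of_topics,
        'InputDataConfig': {
            'S3Uri': f's3://{bucket}/{prefix}/input/',  # assumed key layout
            'InputFormat': 'ONE_DOC_PER_LINE',
        },
        'OutputDataConfig': {
            'S3Uri': f's3://{bucket}/{prefix}/output/topics-{number_of_topics}/',
        },
        'DataAccessRoleArn': role_arn,
    }

# Placeholder values; substitute the bucket, prefix, and role from your notebook.
job_requests = [
    build_topics_job_request(
        'example-bucket', 'lab63',
        'arn:aws:iam::123456789012:role/example-role', k)
    for k in (10, 20, 30)
]

# To submit a job (requires AWS credentials and may incur cost):
# import boto3
# comprehend = boto3.client('comprehend')
# response = comprehend.start_topics_detection_job(**job_requests[0])
```

Writing each job's output under its own prefix makes it easier to compare the resulting topic terms side by side.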

Questions to address:

1. What data cleanup do you need to perform?

2. How many topics will give you the best results?
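
As a starting point for the first question, the following sketch shows one possible cleanup pass: lowercasing, removing markup-like residue and non-letter characters, and dropping stop words and very short tokens. The function name, the regular expressions, the `min_length` threshold, and the tiny stop-word set are assumptions for illustration; check your data to see which kinds of noise are actually present.

```python
import re

# Tiny stand-in stop-word list; in the notebook, NLTK's
# stopwords.words('english') is a fuller alternative.
STOP_WORDS = {'a', 'an', 'and', 'in', 'is', 'of', 'the', 'to'}

def clean_plot(text, min_length=3):
    """Lowercase, strip markup-like residue and non-letters, drop stop words."""
    text = text.lower()
    text = re.sub(r'\{\{.*?\}\}', ' ', text)  # wiki-style template residue
    text = re.sub(r'<[^>]+>', ' ', text)      # HTML-like tags
    text = re.sub(r'[^a-z\s]', ' ', text)     # digits and punctuation
    tokens = [t for t in text.split()
              if t not in STOP_WORDS and len(t) >= min_length]
    return ' '.join(tokens)

# Made-up sample that mixes several kinds of noise.
sample = 'In 1999, the {{plot}} detective <ref>x</ref> finds 3 clues in Paris.'
cleaned = clean_plot(sample)
print(cleaned)
```

In the notebook, a function like this could be applied to the **plot** column before you write the Amazon Comprehend input file or build the NTM vocabulary.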

# Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.

*©2021 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.*
