Interpreting Topic Modeling Tool Results - Panama Papers News Coverage
"Panama Papers" News Story
The "Panama Papers" were about 12 million leaked documents that shed light on people around the world who were hiding money in offshore entities through the Panamanian law firm Mossack Fonseca. News coverage in the immediate aftermath of the leak focused on a wide range of issues, from the technical aspects of the hack to the implication of various world leaders and celebrities who were exposed. In the weeks following the leak, I selected 34 articles about the issue and extracted their text content into text files (see articles below).
Manual Topic Modeling Activity
Topic modeling is a technique that finds "a recurring pattern of co-occurring words" in a corpus of text. It takes the overall usage of each word across the entire corpus into consideration, so a word that is used often in many documents will not automatically appear in every topic. For example, "Panama" will almost certainly be used at least once in every article; the average is about 6.3 times per article. However, "Panama" will only be included in an article's keywords if it is used a significant number of times and has close usage relationships with other words.
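To make the idea of "overall usage across the corpus" concrete, here is a minimal sketch in Python that computes the raw numbers involved: how many documents contain a word, how often it appears per document on average, and how many documents contain two words together. This is not what MALLET does internally; it only shows the kind of counts a topic model starts from. The three short documents and their file names are invented for illustration and are not taken from the 34 articles.

```python
import re
from collections import Counter

# Hypothetical toy corpus; these are NOT the articles used in the activity.
docs = {
    "a.txt": "Panama leak exposes offshore accounts. Offshore accounts in Panama hid money.",
    "b.txt": "Hackers breached an email server in Panama. The server was outdated.",
    "c.txt": "Iceland prime minister resigns after Panama leak. Protests in Iceland.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# One word-count table per document.
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

def corpus_stats(word):
    """Return (number of documents containing word, average uses per document)."""
    per_doc = [c[word] for c in counts.values()]
    doc_freq = sum(1 for n in per_doc if n > 0)
    return doc_freq, sum(per_doc) / len(per_doc)

def cooccur(word_a, word_b):
    """Return the number of documents in which both words appear."""
    return sum(1 for c in counts.values() if c[word_a] > 0 and c[word_b] > 0)

for word in ("panama", "offshore", "server"):
    doc_freq, average = corpus_stats(word)
    print(f"{word}: in {doc_freq} of {len(docs)} documents, {average:.2f} uses per document")

print("panama + leak together in", cooccur("panama", "leak"), "documents")
```

In this toy corpus "panama" shows up in every document, so by itself it says little about what any one document is about, while "offshore" and "server" are each concentrated in a single document. That contrast is what you are looking for when you skim your assigned articles.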
Today you'll skim a few articles and try your best to imitate topic modeling algorithms. Your assigned articles can be found below.
Comparing and Analyzing Results
I used Topic Modeling Tool (TMT), an easy-to-use graphical interface for running MALLET topic models. TMT produces a set of CSV files and a set of HTML files as output. Take a look at these results with 20 topics. (Remember, these aren't labeled topics; they're clusters of words that likely represent a topic.)
Are there any identifiable "topics" here? Are there any "topics" that don't seem to make sense?
If you click on one of the topics, you'll see a list of documents ordered by how closely each document corresponds to the topic. The number in parentheses is the number of times words in the topic appear in the document. Now click on one of the text files. This will show you the full text of the file, and it will also show the topics that align most closely with that document.
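The ordering described above can be imitated by hand. The sketch below, in Python, counts how many times any of a topic's words appear in each document and sorts the documents by that count. It assumes the number in parentheses is a simple count of topic words; the tool's own numbers come from the model, so they may not match a plain keyword count exactly. The topic words, file names, and texts are hypothetical.

```python
import re

# Hypothetical topic keywords and documents, invented for illustration.
topic_words = {"hack", "server", "email", "security"}
docs = {
    "article01.txt": "The hack exploited an outdated email server. Security experts blamed the server.",
    "article02.txt": "The prime minister resigned after the leak.",
    "article03.txt": "Email security was weak, researchers said.",
}

def topic_word_count(text, words):
    """Count how many tokens in the text belong to the topic's word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in words)

def rank_documents(documents, words):
    """Return (file name, count) pairs, highest count first."""
    scored = [(name, topic_word_count(text, words)) for name, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Print the ranking in the same "name (count)" shape as the topic pages.
for name, count in rank_documents(docs, topic_words):
    print(f"{name} ({count})")
```

A document with a count of zero still appears at the bottom of this list; when you explore the real results, notice how quickly the counts drop off after the first few documents for each topic.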
Take a few minutes and explore these results. Click through the network of topics and documents and see if you can find any patterns.
Once you've examined the results with 20 topics, take a look at the same articles run with 40 topics.
What differences do you see between 20 topics and 40 topics? Which set do you think is most useful to you? Why?