## Some Simple Topic Analysis Network Vis with Python, a Topic GUI (running Mallet), Excel, Gephi

### By Lynn Cherny (@arnicas) (March, 2015 and Oct, 2015) for PyLadies Boston and UMiami CCS

Topic modeling (often referred to by the name of its best-known algorithm, LDA, for "latent Dirichlet allocation") is a statistical method for exploring the words in document collections and the relationships of documents to one another via those words. In topic modeling, a topic is inferred from documents as a collection of likely words that are found in those documents. Some documents may be associated strongly with some topics and less strongly with others. Here's an overview borrowed from this [presentation](http://www.slideshare.net/vitomirkovanovic/topic-modeling-for-learning-analytics-researchers-lak15-tutorial?utm_content=buffer89b5f&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer):

<img src="images/goal_topic_modeling.png">

For instance, a news article about a Chinese soccer team might be loosely associated with a topic that includes words about Chinese politics, but more strongly associated with a topic that includes words about sports and soccer. Documents with similar vocabularies (content) will generally end up associated with similar topics, because the topics are constructed out of the observed frequencies of words in the documents.
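
To make the "strongly" versus "loosely" idea concrete, here is a minimal sketch with made-up numbers. The topic labels and weights are hypothetical, not output from this data; the point is only that a document ends up as a set of weights over topics, which we can rank.

```python
# Hypothetical topic weights for one news article about a Chinese soccer team.
# These numbers are invented for illustration; a real model would infer them.
article_topics = {
    "sports_soccer": 0.62,
    "chinese_politics": 0.21,
    "finance": 0.17,
}

# Rank topics from strongest to weakest association.
ranked = sorted(article_topics.items(), key=lambda pair: pair[1], reverse=True)
strongest_topic, strongest_weight = ranked[0]

print(ranked)
print(strongest_topic)
```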

### A Few More Technical (and Non-Technical) References For Those Interested in LDA

* ["Probabilistic Topic Models"](https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf) by David Blei in CACM
* [Dimensionality Reduction and Latent Topic Models](http://pages.cs.wisc.edu/~jerryzhu/cs769/latent.pdf) -  notes by Xiaojin Zhu, covers LSA and other methods, not just LDA
* ["Topic modeling made just simple enough"](http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/) by Ted Underwood from Digital Humanities perspective

### But We're Mainly Trying to Visualize Results In This Demo


There are a lot of more technical introductions to this topic, but this is meant to be a short talk illustrating the easiest way to get visual results without a lot of coding.

We'll be using some chapters of well-known books from [Project Gutenberg](https://www.gutenberg.org/) as well as some excerpts from modern best-sellers: 

* **Twilight (Stephenie Meyer)**, 
* **Fifty Shades of Grey (EL James)**, 
* **The Da Vinci Code** and **Angels and Demons (both by Dan Brown)**,
* **Jane Austen's Pride and Prejudice** and **Sense and Sensibility**, 
* **JM Barrie's Peter Pan**, 
* **Joseph Conrad's The Secret Agent**, 
* **AC Doyle's Sherlock Holmes** stories, 
* **George Eliot's Middlemarch**, 
* **Grimm's Fairytales**, 
* **CS Lewis's The Lion, the Witch, and the Wardrobe,** and 
* **RL Stevenson's Treasure Island.**

I picked these since we'd expect to be able to see patterns in their content across both chapters and authors, but we might also learn something interesting, too.

#### Verify the contents of the data folder we'll be using: 

In [None]:
ls data/mixed_chapters/

*Alert: If you're on a Mac, check for a .DS_Store file here and delete it if you can before running the topic modeler.*
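
If you'd rather check from Python than from the Finder or a shell, here's a small sketch that lists any dot-files in a folder. The helper name `find_hidden_files` is made up for this example; it only reports files and deletes nothing.

```python
import os

def find_hidden_files(folder):
    """Return names in `folder` that start with a dot (e.g. .DS_Store)."""
    return sorted(name for name in os.listdir(folder) if name.startswith("."))

data_dir = "data/mixed_chapters"  # the input folder used in this demo
if os.path.isdir(data_dir):
    # An empty list means there is nothing to clean up.
    print(find_hidden_files(data_dir))
```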

The topic tool GUI we will use is available at https://code.google.com/p/topic-modeling-tool/ (it is also duplicated in my GitHub repo, but that copy may not run for you without a fresh download, especially on Windows).  This tool is a GUI wrapper around [MALLET](http://mallet.cs.umass.edu/), a state-of-the-art command-line tool for topic modeling. After you've done this exercise, you can try running MALLET without the GUI and exploring the other options in that package.

Download it if you don't have it already.  The tool will be called **TopicModelingTool.jar.**

Create a directory for output files, called **topic_output,** for example.  Double-click on the jar file to run it. Select the input directory **data/mixed_chapters** and the output directory you created. Click on **Advanced...** and change the settings to match the screenshot below.


#### It should look something like this:
<img src="images/topicModelingTool_advanced.png">

Run it by clicking "Learn Topics." If it ran correctly, 2 folders will have appeared in your output directory (e.g., topic_output):

* output_csv
* output_html

If you want to browse around the output produced in the HTML folder, feel free (click on the _all_topics.html_ file to load it in a browser). **NB: You will probably see slightly different results... LDA training is non-deterministic and the element of randomness may lead to different results.**  But they will probably nevertheless be comparable; I've run several times on this data and gotten similar results each time.  For instance, with 12 topics, here is one set of results:

<img src="images/topicModeling_allTopicsHTML.png">

Each topic is described by ten words associated with the documents in that topic. When you click on a topic, you can see some of the documents and the matching score for each:

<img src="images/topicModeling_docsInTopic.png">

And if you click on a document, you can see how well it matches other topics in the set of topics. 

<img src="images/topicModeling_docMatches.png">

I find the HTML non-visual display kind of confusing. So we're going to make a network diagram instead. In this code, you want to set the path to your output directory for the csv files:


In [None]:
import csv
import collections

DIR = 'topic_output/output_csv/'
topicWords = DIR + 'Topics_Words.csv'
topicDocs = DIR + 'TopicsInDocs.csv'
docsTopics = DIR + 'DocsInTopics.csv'


In [None]:
# This will give you the 10 words per topic; we asked for 12 topics.
# Be careful in your code - the topics are numbered starting with 1, not 0.

def list_words_for_topics(filename):
    """ Expects the Topics_Words csv file."""
    words = {}
    with open(filename, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for index, row in enumerate(reader):
            if index > 0:  # skip first row
                words[row[0]] = [x.strip() for x in row[1].split() if len(x.strip())>0]
                print row
    return words

In [None]:
word_list = list_words_for_topics(topicWords)
len(word_list)  # you will have as many entries as topics you requested - so, 12.

In [None]:
word_list['4'] # listing the words in the topic number

Before we move on, what are some things you notice about the words in the topics? Anything odd, or any patterns here?  Why?  Is it good, or bad?
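
One way to start poking at that question is to count how many topics each word shows up in. This is a minimal sketch using a small made-up dict, `toy_words`, in the same shape as `word_list`; the words in it are invented, and the helper name is just for this example.

```python
from collections import Counter

def words_shared_across_topics(words_by_topic):
    """Count, for each word, how many topics list it; keep words in 2+ topics."""
    counts = Counter()
    for words in words_by_topic.values():
        counts.update(set(words))  # set() so a word counts once per topic
    return {word: n for word, n in counts.items() if n > 1}

# Hypothetical topics; swap in word_list to try it on the real output.
toy_words = {
    "1": ["said", "elinor", "mr"],
    "2": ["said", "holmes", "door"],
    "3": ["said", "door", "ship"],
}
shared = words_shared_across_topics(toy_words)
print(sorted(shared.items()))
```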



#### Let's get the filenames mapped to the document ids, which are used in most of these data files. The document filename and id are found in DocsInTopics.csv.  While we're at it, let's get the authors, too.
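
The parsing leans on a naming convention: the author comes first in the filename, followed by an underscore. Here's a tiny sketch of just that step on a made-up path. The filename is hypothetical, and `os.path.basename` is swapped in here for the manual split on '/'.

```python
import os

def split_title_and_author(path):
    """Return (filename, author), taking the author as the text before the first underscore."""
    title = os.path.basename(path)
    author = title.split("_")[0]
    return title, author

example_path = "data/mixed_chapters/Austen_Pride_01.txt"  # hypothetical filename
print(split_title_and_author(example_path))
```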

In [None]:
def get_names_for_ids(filename):
    """ Expects the DocsInTopics csv file."""
    
    doc_titles = {}  # dictionary, the keys will be the doc id, the filename the value
    doc_authors = {}
    with open(filename, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for index, row in enumerate(reader):
            print row
            if index > 0:  # skip first row making the data structure
                title = row[3].split('/')[-1]  # last element of the split
                author = title.split('_')[0]
                id = row[2]
                print "Parsed out ", title, author
                doc_titles[id] = title
                doc_authors[id] = author
    return doc_titles, doc_authors

In [None]:
doc_ids, doc_authors = get_names_for_ids(docsTopics)

In [None]:
doc_ids['1']

*Note here: If you see ".DS_Store" as a document, you need to delete that file from the directory you input to the topic modeling tool GUI and rerun that.*

In [None]:
doc_authors

#### One of the output files, the most useful for network drawing, is in an INSANE format where topic numbers alternate with scores for the percentage of the document matched to the topic.


<img src="images/TopicsInDocsCSV.png">

From looking at it, we can see that some documents - like 3, 4, 5, 6 - don't have as many topics associated with them.  We used a cutoff value of .05 (5%) in the GUI when we ran the modeling, so we don't report any associations weaker than that. 
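
The trick for untangling the alternating columns is slicing with a step of 2. Here's a sketch on a made-up row. The values are invented, and the second column, which the parser skips, is assumed here to be the filename.

```python
# A hypothetical row: doc id, a skipped column, then topic/score pairs.
toy_row = ["1", "Austen_Pride_01.txt", "7", "0.61", "11", "0.22", "3", "0.08"]

pairs = toy_row[2:]          # drop the first two columns
topic_numbers = pairs[::2]   # every other item, starting with the first
scores = pairs[1::2]         # every other item, starting with the second
topics_dict = dict(zip(topic_numbers, scores))

print(topics_dict)
```

Note that everything stays a string here, so scores need a `float()` before any math.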

**But let's parse it nicely so we can use this in a visual.**

In [None]:
# This gets you the topic assignments and strength of association for each document and topic.

def parse_topicsDocs(filename):
    """ Filename input is the TopicsInDocs.csv file path. """
    
    docs = {}
    with open(filename, "rb") as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for index, row in enumerate(spamreader):
            if index > 0:  # skip first row
                print "row", row
                docid = row[0]
                topics = row[2:]
                topics_dict = dict(zip(topics[::2],topics[1::2]))  #alternating
                print docid, topics_dict
                docs[docid] = topics_dict
    return docs

In [None]:
docs_alltopics = parse_topicsDocs(topicDocs)

In [None]:
# This should match the first row of the spreadsheet!  Although it's not ordered the same way.

docs_alltopics['1']

#### Check that you have the right number of documents in your new dictionary, to match the file count (note: if you have .DS_Store you will have a mismatch!)

In [None]:
ls data/mixed_chapters | wc -l

In [None]:
# Verify the length of the dict is the number of docs in the data directory

len(docs_alltopics)

## Now let's do a simple first visualization of the data in Excel.


To make a much simpler list of the data, I want to format it "long form," with a document-topic pair per row.  The filename and author are just helpful for understanding the data, and for grouping by author if we want to.

In [None]:
# Make a simple csv format we could see in Excel: doc #, topic, strength, title

def doc_topics_for_excel_table(docs_alltopics, doc_ids, doc_authors, filename):
    ''' Produce a csv of docs, topics, scores, filename, and author.
    
    Args:
        docs_alltopics: the output from parse_topicsDocs
        doc_ids: document ids to filename dict
        doc_authors: document ids to author dict
        filename: to save to
    Output:
        A csv file we can open in Excel.
    '''
    with open(filename, "w") as handle:
        print "DocID,Topic,Score,File,Author"  # first row headers for cell below
        
        handle.write("DocID,Topic,Score,File,Author\n")  # the header of the file
        for id, topics in docs_alltopics.iteritems():
            #print x, docs_alltopics[x].keys()
            for topic, score in topics.iteritems():
                #print topic, score
                print ','.join([str(id), "Topic"+str(topic), str(score), doc_ids[id], doc_authors[id]])
                handle.write(','.join([str(id), "Topic"+str(topic), str(score), doc_ids[id], doc_authors[id] + "\n"]))

In [None]:
doc_topics_for_excel_table(docs_alltopics, doc_ids, doc_authors, 'data/for_excel.csv')

#### Now open that file in Excel and do some nice analysis / viewing of the topics in a pivot table!

####  In a pivot table, you can see the matrix of files and topics and scores, and sort as you like. Note that this table view will have different numbers than yours...
<img src="images/Excel_Pivot.png" width="90%">

Some things are clear from this view: the Austen files are all associated strongly with the same topic numbers (7 and 11 in this picture), and not with most other topics. Topics 11 and 9 are pretty diffuse - weakly matching most of the documents.  The association between author and topic, even across books, is remarkably strong for some authors, like Dan Brown, Doyle, and EL James.  But we should expect that, knowing what we know about their writing!
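
If you don't have Excel handy, the same files-by-topics matrix can be roughed out in plain Python. This is only a sketch: the helper name `pivot_scores` and the toy lines are made up, though the columns follow the layout of for_excel.csv.

```python
import csv
from collections import defaultdict

def pivot_scores(rows):
    """Build {file: {topic: score}} from long-form rows with File, Topic, Score keys."""
    table = defaultdict(dict)
    for row in rows:
        table[row["File"]][row["Topic"]] = float(row["Score"])
    return table

# Hypothetical long-form lines in the same column layout as for_excel.csv.
toy_lines = [
    "DocID,Topic,Score,File,Author",
    "1,Topic7,0.61,Austen_Pride_01.txt,Austen",
    "1,Topic11,0.22,Austen_Pride_01.txt,Austen",
    "2,Topic4,0.80,Doyle_Holmes_01.txt,Doyle",
]
table = pivot_scores(csv.DictReader(toy_lines))
for filename in sorted(table):
    print(filename + " " + str(sorted(table[filename].items())))
```

To try it on the real data, pass `csv.DictReader(open('data/for_excel.csv'))` instead of the toy lines.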

Let's try a network view now.  Let's generate that data...

# Network Views

One way to visualize this data is in a network - since we have relationships among a bunch of objects. 
A simple first pass is to imagine 2 node types: documents, and topics.  The links between these nodes 
are the relations of the documents to topics.  We can also scale the edge lines by the strength of the relationship to the topic.

<img src="images/simple_network.png" width="30%">

In the [Gephi network visualization tool](http://gephi.org), we can import simple nodes (the dots above) and edges, the links between them, from CSV files.  So we have to do some data munging to write those out.

In [None]:
def make_nodes_file(doc_ids, doc_authors, topic_list=None, filename='nodes.csv'):
    ''' Produce a csv of nodes (the documents, plus the topics if given) for Gephi.
    
    Args:
        doc_ids: dict with document id keys to filename values
        doc_authors: dict with doc id keys to authors of docs
        topic_list (opt'l): a dict of the words for each topic, the output from list_words_for_topics
        filename: file to save to, default nodes.csv
    Output:
        A dict of node id numbers for docs and topics, keys being 'Doc' + id or 'Topic' + id.
        This is because Gephi needs unique node id's.
    '''
    
    idNumbering = {}
    
    counter = 1
    for doc in doc_ids.keys():
        idNumbering['Doc' + doc] = counter
        counter += 1
    
    if topic_list:
        for topic in topic_list:  # use the argument, not the global word_list
            idNumbering['Topic' + topic] = counter
            counter += 1
                
    with open(filename, "w") as handle:
        print "Id,Label,Type,Author"  # first row headers for cell below
        handle.write("Id,Label,Type,Author\n")  # the header of the file
        for doc in doc_ids.keys():
            string = ','.join([str(idNumbering['Doc' + doc]),
                        doc_ids[doc].replace('.txt',''), 'Doc',
                        str(doc_authors[doc])])
            print string
            handle.write(string + '\n')
        if topic_list:
            for topic in topic_list.keys():
                # label each topic node with its number and first 4 words
                string = ','.join([str(idNumbering['Topic' + topic]),
                            ':'.join(['Topic' + topic] + topic_list[topic][0:4]), 'Topic',
                            'None'])
                print string
                handle.write(string + '\n')
    
    return idNumbering

In [None]:
numbering = make_nodes_file(doc_ids, doc_authors, topic_list=word_list, filename = 'data/nodes.csv')

In [None]:
def make_edge_file(doctopics, filename = 'edges.csv'):
    """ 
    Print out the edges, the links from documents to topics, with scores, as csv.
    Note: this relies on the global `numbering` dict returned by make_nodes_file.
    Args:
        doctopics: the result of parse_topicsDocs
        filename: file to save result to as csv
    Output:
        None
    """

    with open(filename, 'w') as handle:
        handle.write("Source,Target,Weight\n")
        print "Source,Target,Weight"
        for doc, topics in doctopics.iteritems():
            for topic, score in topics.iteritems():
                string = ','.join([str(numbering['Doc' + str(doc)]), 
                                   str(numbering["Topic"+str(topic)]), 
                                   str(score)])
                print string
                handle.write(string + '\n')

In [None]:
make_edge_file(docs_alltopics, filename = 'data/edges.csv')
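
Before importing into Gephi, it can save some head-scratching to confirm that every edge points at a node id that actually exists. Here's a minimal sketch of such a check. The helper name `find_dangling_edges` and the toy contents are made up; only the column headers come from the files written above.

```python
import csv

def find_dangling_edges(node_lines, edge_lines):
    """Return edges whose Source or Target id is missing from the nodes table."""
    node_ids = set(row["Id"] for row in csv.DictReader(node_lines))
    return [
        (row["Source"], row["Target"])
        for row in csv.DictReader(edge_lines)
        if row["Source"] not in node_ids or row["Target"] not in node_ids
    ]

# Hypothetical file contents using the same headers as nodes.csv and edges.csv.
toy_nodes = ["Id,Label,Type,Author",
             "1,Austen_Pride_01,Doc,Austen",
             "2,Topic7:elinor,Topic,None"]
toy_edges = ["Source,Target,Weight", "1,2,0.61", "1,3,0.22"]
print(find_dangling_edges(toy_nodes, toy_edges))
```

On the real files you could pass open file handles, e.g. `find_dangling_edges(open('data/nodes.csv'), open('data/edges.csv'))`; an empty list is what you hope to see.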

## Now open Gephi.

Load the nodes and edges files into Gephi.  Get the Data Table window open. 

<img src='images/Gephi_ImportSpreadsheet.png'>

Load the nodes file and the edges file using the "import spreadsheet" button.

<img src='images/Gephi_ImportNodes.png'>

Don't forget the edges file too!

<img src='images/Gephi_ImportEdges.png'>

If you need more help, try this [Gephi help-page](https://github.com/gephi/gephi/wiki/Import-CSV-Data).

Now switch to the Overview tab to lay out your nodes and size/color things.  Use the Preview tab to make a pretty version to print. There are some helpful instructions at the beginning of the [GephiToSigmaJS pdf](files/GephiToSigmaJS_Grimms.pdf) in the files directory, but the end of it isn't relevant here.

Some things to try:
* Color ("partition") nodes by type - topic or document.
* Color by author - so you can see documents by the same author, regardless of the chapter/title.
* Size nodes by degree - the number of edges attached to them.  This shows you the more "popular" topics.
* Use a force (force atlas 2) layout, so that items more connected are closer together.
* Use "label adjust" to shift nodes a little, to prevent overlap of label and node. You'll still need to hand adjust for a good final version.
* Adjust in Overview, and fine-tune the visual for export in Preview.
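
Gephi computes degree for you, so you don't need code for it; this sketch only shows what the number means. The edges are hypothetical (source, target) pairs and `node_degrees` is a made-up helper name.

```python
from collections import Counter

def node_degrees(edges):
    """Count how many edges touch each node, given (source, target) pairs."""
    degrees = Counter()
    for source, target in edges:
        degrees[source] += 1
        degrees[target] += 1
    return degrees

# Hypothetical edges: three documents (1, 2, 3) linked to two topics (10, 11).
toy_edges = [(1, 10), (2, 10), (3, 10), (3, 11)]
degrees = node_degrees(toy_edges)
print(degrees.most_common(2))
```

Here the topic node 10 comes out on top, since three documents link to it.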


When you're done, your Preview might look something like this, where thicker edges mean a stronger relationship:

<img src='images/Docs_Topics_Gephi.png' width="80%">

In the image, you can see the Austen files grouped together closely to a topic with "elinor" in it.  The EL James files are close to their topic node.  Thicker lines represent stronger ties.  You can see some strong ties crossing the entire network.

## Suppose we want a network of just the documents, removing the topics in between?  

We can do a little data munging to link documents that belong to the same topic, with a link score derived from their weights.  First, let's use the nicely formatted Excel csv file to get a dictionary of the topics and their documents and weights:

In [None]:
from collections import defaultdict
topics = defaultdict(list)

with open('data/for_excel.csv', 'r') as handle:
    reader = csv.reader(handle)
    next(reader)  # skip the first row which is labels
    for row in reader:
        doc = row[0]
        topic = row[1]
        weight = row[2]
        topics[topic].append((doc, weight))

In [None]:
# This is a dict of topics with a list of doc-weight pairs per topic.
topics

In [None]:
topics['Topic2']

We'll do a little clever data munging to make relationships between all the documents that are associated with a topic.

In [None]:
# A short way to make a set of pairs of items in a list
import itertools
pairs = [24,34,454,54]
[x for x in itertools.combinations(pairs,2)]

In [None]:
# we can apply this principle to the pairs of the doc and score, too:

[x for x in itertools.combinations(topics['Topic2'],2)]

#### Now let's create the edges file without the topics in the mix...  We'll also do a little dirty math to try to adjust the scores to make a variant that relates two documents to each other, based on their original relationship to the topic.
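
To get a feel for that dirty math before running it on real data, here is the scoring step pulled out on its own. The function name `closeness_score` is made up for this sketch and the weight pairs are invented, but the formula mirrors the one in the cell below.

```python
def closeness_score(weight1, weight2):
    """Closer topic weights give a higher score; identical weights get a fixed 100."""
    if weight1 != weight2:
        return (1 / abs(weight1 - weight2)) / 10
    return 100  # identical weights get a fixed high score

# Made-up pairs of topic weights for two documents sharing a topic.
for w1, w2 in [(0.9, 0.4), (0.5, 0.45), (0.6, 0.6)]:
    print(str(w1) + " " + str(w2) + " " + str(round(closeness_score(w1, w2), 2)))
```

One wrinkle worth knowing: the score grows without limit as the two weights approach each other, so a difference smaller than .001 scores above the 100 given to exact ties.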

In [None]:
edgesScores = defaultdict(list)

for topic, doclist in topics.iteritems():
    # make the pairs from all the document, weight items.  First filter at whatever weight you want.
    filtered = [x for x in doclist if float(x[1]) >= .30]
    
    # filtering because there are just TOO MANY LINKS otherwise.  Trust me, I tried it first without.
    print "Topic", topic, "originally", len(doclist), "filtered to", len(filtered)
    combos = [x for x in itertools.combinations(filtered,2)]
    for pair in combos:
        #print pair
        node1 = pair[0][0]
        node2 = pair[1][0]
        weight1 = float(pair[0][1])
        weight2 = float(pair[1][1])
        if weight1 and weight2:
            # a rough closeness score: the nearer the two topic weights, the higher the score
            if weight1 != weight2:
                weight = (1 / abs(weight1 - weight2)) / 10
            else:  # just make it high if the scores are the same
                weight = 100
            if node2 < node1:
                # swap to keep the same ordering everywhere (ids are strings, so this is alphabetical)
                node1, node2 = node2, node1
            # only record an edge when a weight was actually computed for this pair
            edgesScores[(node1,node2)].append(weight)

In [None]:
list(edgesScores.iteritems())[0:2]

In [None]:

with open('data/edges2.csv', 'w') as handle:
    print 'Source,Target,Weight'
    handle.write('Source,Target,Weight\n')
    for pair, weights in edgesScores.iteritems():
        source = str(numbering['Doc' + pair[0]])
        target = str(numbering['Doc' + pair[1]])
        weight = round(sum(weights)/len(weights),2)  # avg, rounded to 2 decimal places
        print ','.join([source, target, str(weight)])
        handle.write(','.join([source, target, str(weight)]) + '\n')

#### Write out a new nodes file without the topics in there.  That's why it was optional!

In [None]:
numbering = make_nodes_file(doc_ids, doc_authors, filename = 'data/nodes2.csv')

### The view in Gephi differs now...

After loading that into Gephi and doing some work on it, you get something like this... where, as expected, most of the chapters by the same author are linked.  There are a few interesting oddities, though!
<img src="images/Docs_Only.png" width="80%">
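
To hunt for those oddities without eyeballing the graph, you can list the links whose two documents have different authors. This is a sketch with made-up ids and authors, shaped like the keys of `edgesScores` and the `doc_authors` dict; `cross_author_edges` is a hypothetical helper name.

```python
def cross_author_edges(edges, authors):
    """Return the (doc1, doc2) pairs whose two documents have different authors."""
    return [
        (doc1, doc2)
        for doc1, doc2 in edges
        if authors[doc1] != authors[doc2]
    ]

# Hypothetical doc ids and authors, shaped like edgesScores keys and doc_authors.
toy_authors = {"1": "Austen", "2": "Austen", "3": "Doyle"}
toy_edges = [("1", "2"), ("2", "3")]
print(cross_author_edges(toy_edges, toy_authors))
```

On the real data, the call would look like `cross_author_edges(edgesScores.keys(), doc_authors)`.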

### Export a Web Graph

If you want to export an interactive web page for your graph, you can use the Gephi Plugin for sigma.js. There are instructions in the file [GephiToSigmaJS_Grimms.pdf](https://github.com/arnicas/TopicsNetworksPyladies/blob/master/files/GephiToSigmaJS_Grimms.pdf).

Before you run the server to display your sigma.js graph, you should make some modifications to the display settings in the config file.  They are described in the PDF.  Or, you can run your server using files/run_network.py and it will modify the file for you:

````
>python run_network.py [network_dir] [optl port]
````


### More Reading/Code

For some more fun, you can run topic modeling in Python directly using Gensim (or other packages like [lda](http://pythonhosted.org/lda/)).

* A recent example of topic modeling with gensim on Shakespeare's sonnets: http://nbviewer.ipython.org/github/sgsinclair/alta/blob/master/ipynb/TopicModelling.ipynb
* A tutorial slidedeck focused mostly on R, not Python: http://www.slideshare.net/vitomirkovanovic/topic-modeling-for-learning-analytics-researchers-lak15-tutorial?utm_content=buffer89b5f&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
* The gensim package's intro material on LDA: http://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation
* Slightly more stuff in my longer tutorial on github: [TopicsPythonGephi](http://github.com/arnicas/TopicsPythonGephi)
