# Topic Modeling in Python
<ol><li>Prep</li>
<li>Pre-Process</li>
<li>Topic Model</li>
<li>Interpreting the Model</li>
<li>Revising Model Inputs</li></ol>

# 1. Prep

In [None]:
from datascience import *
import nltk
modules = ["punkt", "words", "stopwords", "averaged_perceptron_tagger", "maxent_ne_chunker"]
for module in modules:
    nltk.download(module)

### Corpus Description
English-language subset of Andrew Piper's novel corpus, totaling 150 novels by British and American authors spanning the years 1771-1930. These texts reside on disk, each in a separate plaintext file. Metadata is contained in a spreadsheet distributed with the novel files.

### Metadata Columns
<ol><li>Filename: Name of file on disk</li>
<li>ID: Unique ID in Piper corpus</li>
<li>Language: Language of novel</li>
<li>Date: Initial publication date</li>
<li>Title: Title of novel</li>
<li>Gender: Authorial gender</li>
<li>Person: Textual perspective</li>
<li>Length: Number of tokens in novel</li></ol>

## Import Corpus

In [None]:
# Read metadata
metadata_tb = Table.read_table('txtlab_Novel150_English.csv')

In [None]:
metadata_tb

In [None]:
# Set location of corpus folder
fiction_path = 'txtlab_Novel150_English/'

In [None]:
# Read novel plaintext from file

# Create empty list; each entry will be the full text of one novel
novel_list = []

# Iterate through filenames in metadata table
for filename in metadata_tb['filename']:
    
    # Read in novel text as a single string
    # (lowercasing happens later, at the tokenizing step)
    with open(fiction_path+filename, 'r') as file_in:
        novel = file_in.read()
    
    # Add novel text to master list
    novel_list.append(novel)

# 2. Pre-Process

Typically, this is the point where we would process texts into a document-term matrix. In this case, we instead build the token lists, dictionary, and bag-of-words structures that the topic-modeling package expects.

## Tokenize

In [None]:
# Even though I love nltk's word_tokenize(), I just need a
# quick and dirty tokenizer. Emphasis on quick.

def fast_tokenizer(text):
    
    # Get a list of punctuation marks
    from string import punctuation
    
    # Iterate through text removing punctuation characters
    no_punct = "".join([char for char in text if char not in punctuation])
    
    # Split text over whitespace
    tokens = no_punct.split()
    
    return tokens

In [None]:
# Compare tokenizers on a test sentence

wuthering_heights = "1801.—I have just returned from a visit to my landlord—the solitary neighbour that I shall be troubled with."

print(fast_tokenizer(wuthering_heights))
print(nltk.word_tokenize(wuthering_heights))
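
One detail worth noticing in the comparison: `string.punctuation` contains only ASCII punctuation, so the long dashes in the test sentence are not removed and `fast_tokenizer` fuses the words on either side of them. A minimal sketch of one possible workaround, assuming a regular-expression tokenizer is acceptable; `regex_tokenizer` is a hypothetical name, not part of the workflow above:

```python
import re

def regex_tokenizer(text):
    # Hypothetical alternative to fast_tokenizer (not part of the original workflow).
    # \w+ matches runs of letters, digits, and underscores, so every other
    # character, including Unicode dashes and curly quotes, acts as a separator.
    return re.findall(r"\w+", text)

sentence = "1801.—I have just returned from a visit to my landlord—the solitary neighbour that I shall be troubled with."
regex_tokens = regex_tokenizer(sentence)
print(regex_tokens)
```

The trade-off is that contractions split at the apostrophe ("don't" becomes "don" and "t"), whereas `fast_tokenizer` would keep them together as "dont".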

In [None]:
# Tokenize
noveltokens_list = [fast_tokenizer(novel.lower()) for novel in novel_list]

In [None]:
# Inspect tokens from first novel
noveltokens_list[0]

## Gensim Dictionary

In [None]:
# Import Topic Model package
import gensim

In [None]:
# Create dictionary based on corpus tokens
# Each token is mapped to its own unique ID

dictionary = gensim.corpora.dictionary.Dictionary(noveltokens_list)

In [None]:
# Map lists of tokens to the dictionary IDs
dictionary.doc2bow(['pride','prejudice', 'pride'])
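
To make the shape of that output concrete, here is a rough pure-Python sketch of the bag-of-words idea: each known token is replaced by its ID and paired with its count. The helper `toy_doc2bow` and the ID assignments are made up for illustration; this is not gensim's implementation.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    # Hypothetical helper mimicking the shape of doc2bow's output:
    # count each known token, ignore unknown ones, sort by ID.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_token2id = {"pride": 0, "prejudice": 1}  # made-up IDs for illustration
toy_bow = toy_doc2bow(["pride", "prejudice", "pride", "zombies"], toy_token2id)
print(toy_bow)
```

Here "pride" appears twice, "prejudice" once, and "zombies" is dropped because it has no ID in the toy dictionary.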

In [None]:
# Remove stopwords, (some!) proper names from dictionary
from nltk.corpus import stopwords, words

In [None]:
stopwords.words('english')

In [None]:
words.words()

In [None]:
# Proper name test
'Ishmael' in words.words()

In [None]:
proper_names = [word.lower() for word in words.words() if word.istitle()]
bad_words = stopwords.words('english')+proper_names

In [None]:
# Map stopwords, proper names to dictionary IDs
stop_ids = [_id for _id,count in dictionary.doc2bow(bad_words)]

# Remove stopwords and proper names from dictionary mappings
dictionary.filter_tokens(bad_ids = stop_ids)

In [None]:
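# Remove terms by document frequency: drop terms found in fewer than 15 novels
# Note that filter_extremes also applies its default no_above=0.5,
# which drops terms found in more than half of the novels
dictionary.filter_extremes(no_below=15)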
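
A minimal sketch of what filtering by document frequency means, using a toy corpus. The helper `filter_by_doc_frequency` is hypothetical and only approximates the idea (it ignores gensim's `keep_n` cap, for example): a token survives if it appears in at least `no_below` documents and in no more than a `no_above` fraction of them.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below=2, no_above=0.5):
    # Hypothetical helper: document frequency = number of documents containing a token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["whale", "sea", "the"],
    ["sea", "ship", "the"],
    ["moor", "the"],
    ["moor", "house", "the"],
]
kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

In the toy corpus, "the" is removed for being too common, while "whale", "ship", and "house" are removed for being too rare.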

## Bag-of-Words

In [None]:
# Create list of dictionary mappings by novel
# This is gensim's version of a document-term matrix
corpus = [dictionary.doc2bow(doc) for doc in noveltokens_list]

In [None]:
# Inspect corpus element
corpus[0]

# 3. Topic Model

### Latent Dirichlet Allocation (LDA) Models
LDA reflects an intuition that words in a text are not merely chosen at random but are drawn from underlying concepts (the so-called "latent variables"). The goal of LDA is to look across many texts in order to reverse engineer these concepts by finding words that tend to cluster with one another. For this reason, LDA has been referred to as "the mother of all word collocation techniques."
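
The intuition can be sketched as a toy generative story: each document first draws its own mixture of topics, and each word is then produced by picking a topic from that mixture and a word from that topic. The topics, vocabulary, and parameter values below are invented for illustration; training a model amounts to running this story in reverse.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two made-up topics, each a probability distribution over words
toy_topics = {
    "sea": {"ship": 0.5, "whale": 0.3, "captain": 0.2},
    "home": {"house": 0.5, "garden": 0.3, "tea": 0.2},
}

def sample_dirichlet(alpha, k, rng):
    # A symmetric Dirichlet draw via normalized gamma variates
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    names = list(toy_topics)
    # 1. Draw this document's mixture of topics
    mixture = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position, then
        # 3. pick a word from that topic's distribution
        topic = rng.choices(names, weights=mixture)[0]
        vocab = toy_topics[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return mixture, words

toy_mixture, toy_words = generate_document(12)
print(toy_mixture, toy_words)
```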

### Topic Model Features
<ul><li>Corpus: Pre-processed textual corpus</li>
<li>Number of Topics: Choosing this is the art of Topic Modeling </li>
<li>Alpha (Hyperparameter): Prior reflecting expected distribution of topics over documents</li>
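<li>Iterations: Maximum number of update steps used to infer each document's topic distribution; the model starts from a random initialization and iteratively refines it</li>
<li>Passes: Number of complete passes over the corpus during training; this parameter name is specific to the Gensim implementation</li></ul>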

## Training

In [None]:
# Train Topic Model
lda_model = gensim.models.LdaModel(corpus, num_topics=40, alpha='auto',
                                   id2word=dictionary, iterations=2500, passes=4)

# If you have more than two cores at your disposal, then perhaps try:
#lda_model = gensim.models.ldamulticore.LdaMulticore(corpus, num_topics=40, id2word=dictionary, iterations=2500, passes = 4)

## Topics

In [None]:
# Quick look at n topics among those inferred
lda_model.show_topics(10)

In [None]:
# Deeper look at particular topic
lda_model.show_topic(8, topn=20)

In [None]:
# Same topic, but terms are given as dictionary IDs rather than as words
lda_model.get_topic_terms(8, topn=20)

In [None]:
## EX. The 'topn' argument returns only the given number of terms
##     for each topic. Rewrite that argument to return all values.

In [None]:
## EX. Return a list of all term values for topic 0. By default,
##     these are ordered by the probability associated with each term.
##     Instead, order the list according to each word's id2word label.

## CHALLENGE: Create a table that contains all topic-term distributions.
##     Each row is a topic and each column is labelled by the word it represents.

## Corpus

In [None]:
# Measure of model's "fit" to data
# Related to the probability of seeing text given inferred model

lda_model.log_perplexity(corpus)
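
As far as we understand gensim's convention, the value returned here is a per-word likelihood bound in base 2, and the perplexity figure gensim writes to its log is 2 raised to the negative of that bound. A minimal sketch of the conversion, using a made-up bound rather than a result from this corpus:

```python
def perplexity_from_bound(per_word_bound):
    # Assumes the bound is a base-2 per-word log-likelihood bound
    return 2 ** (-per_word_bound)

example_bound = -10.0  # made-up value, not a result from this corpus
example_perplexity = perplexity_from_bound(example_bound)
print(example_perplexity)
```

Lower perplexity (a bound closer to zero) suggests the model is less "surprised" by the text.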

In [None]:
# Topics ranked by their coherence score over the corpus
lda_model.top_topics(corpus)

## Documents

In [None]:
# Most prominent topics in a given document
lda_model.get_document_topics(corpus[0])

In [None]:
# Distribution of all topics over a document
lda_model.get_document_topics(corpus[0], minimum_probability=0)

In [None]:
## EX. Return a list of the most prominent topics in document 10.
##     What terms are most prominent in those topics?

## EX. Compare your answers to the previous exercise with a classmate.
##     Do similar topics come up? Different ones?

# 4. Interpreting the Model

### Metadata
There are many strategies that can be used to interpret the output of a topic model. In this case, we will look for any correlations between the topic distributions and metadata.

In [None]:
# Create list of all document-topic distributions
list_of_doctopics = [lda_model.get_document_topics(corpus[i], minimum_probability=0) for i in range(len(corpus))]

In [None]:
list_of_doctopics[0]

In [None]:
# In the list above, each topic got represented as a tuple containing
# the label of the topic and its probability within the given document

# Create list containing only the probabilities (remains ordered by topic label)
list_of_probabilities = [[probability for label,probability in distribution] for distribution in list_of_doctopics]

In [None]:
list_of_probabilities[0]

In [None]:
# We'll put these into a labeled column format so that we can add
# document-topic distributions to our original metadata table

# Note that this means a cumbersome switch from lists that represent rows
# to lists that represent columns

# Loop over the number of topics in the trained model (40 above)
labeled_columns = [['Topic '+str(i),[document[i] for document in list_of_probabilities]] for i in range(lda_model.num_topics)]

In [None]:
labeled_columns[0]
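
The switch from rows to columns can also be sketched with `zip`, which some readers may find easier to follow. The values below are made-up stand-ins for `list_of_probabilities`, with three documents and two topics:

```python
# Toy stand-in for list_of_probabilities: 3 documents x 2 topics (made-up values)
toy_rows = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.2, 0.8],
]

# zip(*rows) regroups the i-th entry of every row into the i-th column
toy_columns = [list(column) for column in zip(*toy_rows)]
toy_labeled = [["Topic " + str(i), column] for i, column in enumerate(toy_columns)]
print(toy_labeled)
```

Each labeled column now holds one topic's probability across all documents, which is the shape a table column needs.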

In [None]:
# Add these as new columns to the metadata table
metatopic_tb = metadata_tb.with_columns(labeled_columns)

In [None]:
# Quick and dirty correlation function

def correlator(tb, col_1, col_2):
    import numpy as np
    # Convert each column to standard units (mean 0, standard deviation 1)
    x = np.asarray(tb[col_1], dtype=float)
    y = np.asarray(tb[col_2], dtype=float)
    col_1_in_su = (x - np.mean(x)) / np.std(x)
    col_2_in_su = (y - np.mean(y)) / np.std(y)
    # Pearson's r is the mean of the products of paired standard units
    r = np.mean(col_1_in_su * col_2_in_su)
    return r

In [None]:
correlator(metatopic_tb, 'date', 'Topic 0')
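
To see how the sign of r behaves, here is a small standard-library sketch of the same standard-units calculation on made-up data. The helper names and the numbers are hypothetical and not drawn from the corpus:

```python
from statistics import mean, pstdev

def standard_units(values):
    # Population standard deviation, matching numpy's default np.std
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def toy_correlation(xs, ys):
    # Mean of the products of paired standard units
    return mean(a * b for a, b in zip(standard_units(xs), standard_units(ys)))

toy_dates = [1800, 1850, 1900, 1950]     # made-up publication years
toy_rising = [0.10, 0.20, 0.30, 0.40]    # rises in lockstep with the dates
toy_falling = [0.40, 0.30, 0.20, 0.10]   # falls in lockstep with the dates
r_up = toy_correlation(toy_dates, toy_rising)
r_down = toy_correlation(toy_dates, toy_falling)
print(round(r_up, 6), round(r_down, 6))
```

A topic that grows more prominent over time gives r near 1, one that fades gives r near -1, and squaring r discards that direction.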

In [None]:
## EX. Find any topics that have an r^2 value greater than 0.1.
##     Return the top terms for those topics. Are the correlations
##     positive or negative?

## EX. Try running the topic model without removing any words from
##     the dictionary. How do the topics change?
##     Try changing the minimum document frequency.

# 5. Revising Model Inputs

In [None]:
## EX. Some proper names and titles still came through our filter.
##     Use nltk's NER function to remove names in a more targeted way.

## EX. In Matt Jockers's study of literary theme, he included only
##     nouns for topic modeling. Use nltk's POS tagger to remove all
##     words from the corpus that are not common nouns.

## EX. Jockers also found it useful to split texts into 1000-noun chunks
##     after the POS filter. Run the topic model over these smaller chunks.