# Computational Text Analysis (LATIS/Libraries), Spring 2020

## Intro

### Strings
To work with text in Python it's important to be able to manipulate string variables. Let's create a variable called text_string and print it as output.

In [None]:
text_string = 'Hi there'
print(text_string)

We can check what kind of a variable this is by using the built-in type() function in Python.

In [None]:
type(text_string)

We can refer to specific characters in the string by slicing it using brackets:

In [None]:
print(text_string[0])

In [None]:
text_string[1]

In [None]:
text_string[0:3]

In [None]:
text_string[3:]
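
As a quick recap of the slicing rules used above, here is a minimal sketch. The variable name `greeting` is ours, chosen so the example doesn't overwrite text_string.

```python
# 'greeting' is a hypothetical variable used only for this sketch
greeting = 'Hi there'

first_char = greeting[0]      # indexing starts at 0
first_three = greeting[0:3]   # the end index (3) is not included
from_third = greeting[3:]     # leaving out the end runs to the end of the string
last_five = greeting[-5:]     # negative indexes count back from the end

print(repr(first_char), repr(first_three), repr(from_third), repr(last_five))
```

Note that the space counts as a character, so the first three characters are 'Hi ' (with a trailing space).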

We can also 'add' or *concatenate* strings together using the plus sign.

In [None]:
text_string + '!'

In [None]:
text_string += '!'

In [None]:
text_string + ' ' + 'How are you today?'

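Note that ```+``` on its own doesn't change text_string: unless we save the result to a variable, the concatenation is only displayed. The ```+=``` cell above did save its result back to text_string, so the string is now one character longer.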

In [None]:
len(text_string)

### Built-in string methods
Strings also have built-in methods that can operate on them. These include *join*, *find*, *replace*, *lower* and *upper*.

In [None]:
how_string = 'How are you today, Mike?'
how_string.replace('H', 'C')

In [None]:
how_string.replace('are you today', 'were you yesterday')

In [None]:
how_string.lower()

In [None]:
how_string.upper()

In [None]:
'x'.join(how_string)
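
The list above mentions *find*, and *join* is more often used with a list of words than with a single string. Here is a small sketch of both; the names `question` and `words` are ours, used only for illustration.

```python
# 'question' and 'words' are hypothetical names for this sketch
question = 'How are you today, Mike?'

# find() returns the index where the substring starts, or -1 if it is absent
position = question.find('you')
missing = question.find('yesterday')

# join() is called on the separator and takes a list (or any iterable) of strings
words = ['How', 'are', 'you']
joined = ' '.join(words)

print(position, missing, joined)
```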

### Reading files

Usually you want to work with text from files though, and not manually create string variables. 

The first step is to read in the files containing the data. Common file types for text data are: 
* `.txt`
* `.csv`
* `.json`
* `.html` 
* `.xml`

Each file format requires specific Python tools or methods to read, but for our case, we'll be working with .txt files.

#### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

Let's take a look at the first file in our directory (folder) of State of the Union addresses (`/sotu_text`):

In [None]:
# create a new variable called file1 and read ("r") the first file in the sotu_text folder
file1 = open("sotu_text/215.txt","r") 

In [None]:
# but when we check the type of the variable, it's not yet stored as a string
type(file1)

In [None]:
# to view the text, let's read in the file1 object to a new variable called "text" using .read() and then print out the first 250 characters
text = file1.read()

In [None]:
print(text[0:250])
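
The cells above never close the file. A common pattern is to open files in a ```with``` statement, which closes them automatically. The sketch below writes a small throwaway file first so that it runs anywhere; the file name `example.txt` and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the sketch runs anywhere.
# The file name and its contents are made up for illustration.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'example.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Mr. Speaker, Mr. Vice President, Members of Congress.\n')

# The with-statement closes the file automatically when the block ends
with open(path, 'r', encoding='utf-8') as f:
    sample_text = f.read()

print(type(sample_text))
print(sample_text[0:11])
```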

### Tokenization

Once we've read in the data, a common next step is to split a longer string into words. This step is referred to as "tokenization". That's because each occurrence of a word is called a "token". Each distinct word used is called a word "type". So the word type "the" may correspond to multiple tokens of "the" in a text.

#### Tokenizing by whitespace
Let's save each word to a list variable called 'tokens'

In [None]:
# use the split() function to split the text variable up by whitespace into a tokens list
tokens = text.split()

In [None]:
# what kind of a variable is tokens?
type(tokens)
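
To make the token/type distinction concrete, here is a minimal sketch on a made-up sentence (the name `sample` is ours). A Python set keeps only one copy of each distinct item, which gives us the word types.

```python
# 'sample' is a hypothetical sentence used only for this sketch
sample = 'the cat saw the dog and the bird'

sample_tokens = sample.split()     # every occurrence of a word is a token
sample_types = set(sample_tokens)  # a set keeps one copy of each distinct word

print(len(sample_tokens))          # number of tokens
print(len(sample_types))           # number of types
print(sample_tokens.count('the'))  # tokens of the type 'the'
```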

### Lists 
You can view each item in a Python list using the same syntax we used above to slice a str variable. The first item in the tokens list is at ```tokens[0]``` and the second is ```tokens[1]```. You can view a range of the first 10 as follows:

In [None]:
tokens[0:10]

In [None]:
# the first one
tokens[0]

In [None]:
# the last ten
tokens[-10:]

In [None]:
# Note: you can also slice the string variables stored inside of a list
tokens[1][0:5]

### Sentence segmentation

Sentence segmentation involves identifying the boundaries of sentences, and provides a different way to tokenize our text.

#### Sentence segmentation by splitting on punctuation

In [None]:
# instead of the default whitespace for split(), you can identify the character or characters you'd like to split on
sentences = text.split('.')
sentences[0]

We can check how many items are in any list using the len() function.

In [None]:
len(sentences)

In [None]:
# note that this method doesn't break out sentences that end with other punctuation, like question marks
sentences[35]

### Regular Expressions
We could improve on this by using regular expressions. They allow us to split strings using specific characters or patterns that match different *kinds* of characters. Regex is a very powerful tool, but we won't go into it much today. For help figuring out and working with regular expressions we recommend https://regex101.com/

In [None]:
import re

In [None]:
# this pattern matches periods, question marks, or exclamation marks
boundary_pattern = r'[.?!]'
sentences_re = re.split(boundary_pattern, text)

In [None]:
# there are now a few more sentences in our list
len(sentences_re)

In [None]:
# and this sentence ends at the question mark
sentences_re[35]
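
Splitting on punctuation tends to leave whitespace at the start of each piece and an empty string at the end of the list. One way to tidy that up is sketched below on a made-up string (the names `sample`, `pieces` and `tidy` are ours).

```python
import re

# 'sample' is a made-up string for illustration
sample = 'Who are we? We are one people. We will prevail!'

pieces = re.split(r'[.?!]', sample)

# strip the whitespace around each piece and drop pieces that end up empty
tidy = [p.strip() for p in pieces if p.strip()]

print(pieces)
print(tidy)
```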

### Strip whitespace

This is an extremely common step in text cleaning. It's simple to perform and nicely pre-packaged in Python. It's particularly common for user-generated text (think survey forms).

In [None]:
padded_string = " Hi there! "
padded_string

In [None]:
padded_string.strip()

We can also use ```strip()``` to remove "line breaks" from strings. Line breaks are often represented with "escape characters" such as ```\n``` in text files.

In [None]:
text[-25:]

In [None]:
# we can remove whitespace at the beginning and end of a string using .strip()
stripped_text = text.strip()
stripped_text[-25:]

You can also run more complex find/replace patterns using regex. Here we use ```re.sub()``` to replace any run of whitespace characters (matched by \s+) with a single space.

In [None]:
# we can use regular expressions to remove whitespace throughout the string
# note that we are replacing any of the matching whitespace patterns with a single space ' '.
whitespace_pattern = r'\s+'
clean_text = re.sub(whitespace_pattern, ' ', text)
clean_text[-25:]
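
The difference between the two approaches is easier to see on a short string. This is a small sketch with a made-up value (`messy` is our name): ```strip()``` only cleans the two ends, while the regex collapses whitespace everywhere.

```python
import re

# 'messy' is a made-up string containing spaces, a tab and line breaks
messy = '  Hi   there!\n\nHow\tare you?  \n'

# strip() only touches the two ends of the string
ends_only = messy.strip()

# re.sub() replaces every run of whitespace, wherever it occurs
collapsed = re.sub(r'\s+', ' ', messy).strip()

print(repr(ends_only))
print(repr(collapsed))
```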

### Text normalization
Text normalization can help us clean our text to fit some standard patterns. One common normalization step is to remove case from the text.

If you want to count the frequencies of words, for example, using lower case will ensure you don't count "Death" and "death" as two separate words.

In [None]:
caps_string = "Hi There! Can you believe it's 2020?"
caps_string.lower()

In [None]:
clean_text = clean_text.lower()
clean_text[0:250]

Depending on your analysis, you might also want to throw out numerals.

In [None]:
# remove digits using regex
digits = r'\d+'
re.sub(digits, '', caps_string)

In [None]:
# note that since we didn't assign the changes to the string variable, the changes aren't "saved"
caps_string

#### Removing punctuation

Sometimes you might want to keep only the alphanumeric characters (the letters and numbers) and ditch the punctuation. Here's how we can do that.


In [None]:
import string
string.punctuation

In [None]:
# strip() will remove punctuation from the beginning or end of the string
caps_string.strip(string.punctuation)

The following code looks a little complex, but essentially it will move through each character in our ```caps_string``` variable, and replace any punctuation mark from our ```string.punctuation``` list with a blank string, ```''```

In [None]:
# this code will remove all punctuation from the caps_string variable
''.join(word.strip(string.punctuation) for word in caps_string)

In [None]:
# let's remove punctuation from our SOTU speech
clean_text = ''.join(word.strip(string.punctuation) for word in clean_text)

### List comprehension

This is what is called a *comprehension* in Python: a way to iterate or loop over multiple similar items, perform a task, and capture the result of that task for each of the items in a single object. It is very concise, and a powerful way to think about repetitive tasks, like text cleaning.

Comprehensions are easiest to understand by reading them backwards: start from the loop and any conditions, then look at what is done to each item.

So first, we're looping over each item (which we're calling *word*) in the string *clean_text*. Because clean_text is a string, each item is a single character.
Then, we're running the .strip() method on that item and stripping all punctuation as listed in string.punctuation.

In [None]:
# as a for-loop, this would look like the following, but the output wouldn't be saved
for word in clean_text:
    word.strip(string.punctuation)

In [None]:
# as a list comprehension, we can capture the output into a single list
output = [word.strip(string.punctuation) for word in clean_text]
output[0:10]

Lastly, we want that result as a single string, rather than a list of separate items, so we're using some Python sleight of hand to make that happen: we're joining, or concatenating, each item of that list to an empty string.

In [None]:
''.join(output)

We're also overwriting clean_text with the punctuation-stripped string, which is why you see that same variable on both the left and the right hand side of the equals sign.

In [None]:
clean_text = ''.join(word.strip(string.punctuation) for word in clean_text)
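
Another common way to delete punctuation is ```str.translate()``` with a table built by ```str.maketrans()```. This is an alternative to the comprehension above, not the approach used in the rest of this notebook; the names `sample` and `remove_punct` are ours.

```python
import string

# 'sample' is a made-up string for illustration
sample = "Hi There! Can you believe it's 2020?"

# maketrans('', '', chars) builds a table that deletes every character in chars
remove_punct = str.maketrans('', '', string.punctuation)
no_punct = sample.translate(remove_punct)

# the character-by-character comprehension from above gives the same result
same = ''.join(ch.strip(string.punctuation) for ch in sample)

print(no_punct)
print(no_punct == same)
```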

#### Remove anything but letters
We can use a regular expression that matches only upper and lower case letters to remove everything else.

In [None]:
# in this case we sub any non-letter characters out with a space, ' '
letters_only = r'[^A-Za-z]+'
re.sub(letters_only, ' ', caps_string)

### Tokenizing with the Natural Language Toolkit (NLTK)

We can also use the Natural Language Toolkit (NLTK) to accomplish many of the steps we showed manually above. Or in the case of our clean_text variable, where we've already removed punctuation, we can use the word_tokenize function to break the text up into its constituent tokens:

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
tokens = word_tokenize(clean_text)
tokens[:10]

In [None]:
from nltk.probability import FreqDist

Now that we have a list of tokens we can count their frequencies in the speech. Let's use a builtin NLTK function called FreqDist() to look at our most common words.

In [None]:
#apply the FreqDist function to our tokens variable
fdist = FreqDist(tokens)

#fdist is a dictionary of unique words and the number of times they occur
fdist

### Dictionaries
A Python dictionary is a way to hold a collection of items as 'key:value' pairs, where each value is looked up by its key rather than by its position. Above you can see the keys (tokens) and values (word counts) from the FreqDist dictionary. A good way to differentiate a Python dictionary from a Python list is to look at the brackets used:

* ```{}``` curly brackets for dictionaries
* ```[]``` square brackets for lists

In [None]:
#it also includes a handy method to find the most common words 
fdist.most_common(10)
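
FreqDist behaves much like ```collections.Counter``` from the Python standard library. The sketch below uses Counter on a made-up token list (`sample_tokens` is our name) so that it runs without NLTK; it illustrates the idea rather than reproducing FreqDist itself.

```python
from collections import Counter

# 'sample_tokens' is a made-up token list for illustration
sample_tokens = ['we', 'the', 'people', 'of', 'the', 'united', 'states', 'the']

counts = Counter(sample_tokens)

print(counts['the'])          # look up a value by its key
print(counts['congress'])     # a word that never occurs counts as 0
print(counts.most_common(2))  # list of (word, count) pairs, highest count first
```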

#### Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "am", "i", "the" and "a": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

NLTK includes a stopwords module we can use. Not all stopwords lists are equal though: for your own research you might want to customize a stopwords list, or find one that is best-suited to your domain.

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

# how many stopwords are on the list?
len(stop)

In [None]:
# what are the first ten words on the stopword list?
stop[0:10]

Let's create a new list of tokens, removing our stopwords along the way. 

This loop checks each word in our original tokens list, and if it does *not* appear on the stopword list, it adds it to a new list called tokens_clean.

In [None]:
tokens_clean = [] 
  
for w in tokens: 
    if w not in stop: 
        tokens_clean.append(w)
tokens_clean[0:10]

In [None]:
# advanced: we can do the same thing quite efficiently with a list comprehension
tokens_clean = [w for w in tokens if w not in stop]
tokens_clean[0:10]
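
Checking ```w not in stop``` against a list means Python may scan the whole list for every token. Converting the stop words to a set usually makes this lookup faster on long texts. A minimal sketch, using a tiny made-up stop word list rather than NLTK's:

```python
# A tiny made-up stop word list, standing in for stopwords.words('english')
sample_stop = ['the', 'of', 'and', 'a', 'to']
sample_tokens = ['the', 'state', 'of', 'the', 'union', 'and', 'a', 'plan']

# membership tests on a set do not need to scan every item
stop_set = set(sample_stop)
kept = [w for w in sample_tokens if w not in stop_set]

print(kept)
```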

In [None]:
# now we can re-count the most common words after stop words are removed
freq = FreqDist(tokens_clean)
freq.most_common(10)

Hmmm, still not terribly interesting but getting better...

#### Stemming

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because often we care more about the core content of the word (i.e. that it has something to do with growth or running) than about the fact that it's a third person present tense verb, or progressive participle.

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm.

In [None]:
# import the PorterStemmer and then stem the word "states" as an example
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('states')

In [None]:
stemmer.stem('united')

In [None]:
stemmer.stem('government')

In a similar manner as the stopwords loop above, we can create a new list of stemmed tokens:

In [None]:
tokens_stemmed = []
for t in tokens_clean:
    tokens_stemmed.append(stemmer.stem(t))

In [None]:
#or as a comprehension:
tokens_stemmed = [stemmer.stem(t) for t in tokens_clean]

In [None]:
tokens_stemmed[0:10]

Now that the words are stemmed, are the most common words any different? 

Here are the stemmed top ten.

In [None]:
freq_stemmed = FreqDist(tokens_stemmed)
for f in freq_stemmed.most_common(10):
    print(f)

And the unstemmed top ten:

In [None]:
for f in freq.most_common(10):
    print(f)

Similar, but with some important differences. Notice that "work" went from 42 to 69 after stemming.  

Why would that be?
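
One way to think about that question is with a toy example. The function below is a deliberately crude suffix stripper that we made up for illustration; it is not the Porter algorithm, and the token list is invented. It shows how several surface forms collapse into one stem, so their counts are added together.

```python
from collections import Counter

def toy_stem(word):
    # A deliberately crude stand-in for a real stemmer such as Porter:
    # it only strips a few common endings and is NOT the Porter algorithm.
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# made-up tokens for illustration
sample_tokens = ['work', 'works', 'working', 'worked', 'work', 'jobs']

before = Counter(sample_tokens)
after = Counter(toy_stem(w) for w in sample_tokens)

print(before['work'])  # only the exact form 'work'
print(after['work'])   # 'work', 'works', 'working' and 'worked' combined
```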

### Reading in multiple files

Often, our text data is split across multiple files in a folder. We can read them all into a single variable using a Python tool called glob.

In [None]:
import glob

In [None]:
# save all of the files that end with .txt in the sotu_text/ folder to a variable called sotu_all
sotu_all = glob.glob("sotu_text/*.txt")

In [None]:
# this just saves the file-paths to a list though
sotu_all[0:10]

Those are out of order though. Let's sort the list so that the list index is in the same order as the speeches themselves (sotu_all[0] would equal 001.txt).

In [None]:
sotu_all.sort()

In [None]:
sotu_all[0:10]

Now that we have a list of all the files we need to cycle through each one and save the text from the file.

To do that we'll create a new list variable, speeches. For each file in the sotu_all variable we'll open and read the file, and save the text to the speeches list. 

In [None]:
speeches = []
for speech in sotu_all:
    s = open(speech, 'r')
    text = s.read()
    speeches.append(text)

In [None]:
# now we can refer to each speech from the list using the list index
speeches[45][0:250]

In [None]:
#which file is that?
sotu_all[45]

Here's a list comprehension that tidies the open/append loop into a single line.

In [None]:
speeches = [open(speech, 'r').read() for speech in sotu_all]

In [None]:
len(speeches)

In [None]:
speeches[235][0:250]
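
The loops above leave each file open after reading it. Here is a sketch of the same glob, sort and read pattern using a ```with``` statement, which closes each file as soon as it has been read. It builds a throwaway folder first so that it runs anywhere; the file names and contents are made up.

```python
import glob
import os
import tempfile

# Build a throwaway folder with three small files so the sketch runs anywhere.
# The file names and contents are made up for illustration.
folder = tempfile.mkdtemp()
for name, content in [('002.txt', 'second'), ('001.txt', 'first'), ('003.txt', 'third')]:
    with open(os.path.join(folder, name), 'w', encoding='utf-8') as f:
        f.write(content)

# sorted() returns the paths in name order, like calling .sort() on the list
paths = sorted(glob.glob(os.path.join(folder, '*.txt')))

sample_speeches = []
for path in paths:
    # the with-statement closes each file as soon as it has been read
    with open(path, 'r', encoding='utf-8') as f:
        sample_speeches.append(f.read())

print([os.path.basename(p) for p in paths])
print(sample_speeches)
```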

# Part 2
## Functions
Now that we have all the text data loaded, we can think about working on the corpus as a whole.

Let's create a function that combines all of our cleaning protocols so that we can clean each State of the Union speech with a single piece of code. 

The function definition opens with the keyword ```def``` followed by the name of the function (clean_speech) and a parenthesized list of parameter names (speech). The body of the function — the statements that are executed when it runs — is indented below the definition line. The body concludes with a return keyword followed by the return value.

In [None]:
def clean_speech(speech):
    speech = ''.join(word.strip(string.punctuation) for word in speech.lower())
    speech = [stemmer.stem(w) for w in word_tokenize(speech) if w not in stop]
    return speech

You can call the function using the name of the function, and the variable you'd like to process as its parameter. To process only the first speech, for example, you could call:

```clean_speech(speeches[0])```

You could also assign the function's output to a variable so you can work with it later:

```first_cleaned = clean_speech(speeches[0])```

Let's put it all together and clean all of the speeches, and assign them to a new list, ```cleaned_speeches```.

In [None]:
cleaned_speeches = [clean_speech(speech) for speech in speeches]

In [None]:
len(cleaned_speeches)

Notice that each item in the cleaned_speeches list is also a list.

In [None]:
type(cleaned_speeches[0])
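
To see the ```def``` / parameter / ```return``` pattern in isolation, here is a simplified sketch that runs without NLTK. The names `simple_clean` and `sample_stop` are ours; unlike clean_speech, it skips stemming and uses split() instead of word_tokenize.

```python
import string

# 'simple_clean' and 'sample_stop' are hypothetical names for this sketch;
# it skips stemming and uses split() instead of NLTK's word_tokenize
sample_stop = {'the', 'of', 'and', 'is'}

def simple_clean(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return [w for w in text.split() if w not in sample_stop]

result = simple_clean('The state of the Union is strong.')
print(result)
```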

### Word Frequencies (reprise)

We know we can get word frequencies from a speech. Let's apply that to the entire cleaned and stemmed corpus we've created.

First, remember how we can look at the frequencies for a single document:

In [None]:
fd = FreqDist(cleaned_speeches[100])
fd.most_common()[0:10]

### Matplotlib
We can use a tool called matplotlib to help visualize some of our results. NLTK uses matplotlib as the engine for its .plot() function, but let's import it here and also run some Jupyter "magic" to make sure the matplotlib visualizations appear inline (in the display) of our Jupyter view.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

Now we can use the built-in plot() function to visualize the FreqDist for our ten most common words. Unfortunately the .plot() tool here only works with line graphs. A bar chart would be better! We can do that later with the full matplotlib library.

In [None]:
# we can use the built-in plot function for FreqDist 
fd.plot(10)

#### enumerate() & building a Python dictionary

The following short loop uses a couple of essential Python tools. 

1. Let's create our own Python dictionary, which we'll call freq_dists, to keep track of each speech. Each dictionary entry will represent a speech from the corpus. The dictionary "key" will be the index for the speech, and the "value" will be the FreqDist data for that speech. Remember that the FreqDist data is also stored as a dictionary. So the value of each speech is itself a dictionary.

2. Since we want to reference the index for each speech in the loop, we can use the Python enumerate() function. Enumerate allows us to basically keep a count in a for loop, and to reference the enumerate variable. We'll call our enumerate variable 'idx' for index. 

In [None]:
freq_dists = {}
for idx, speech in enumerate(cleaned_speeches):
    freq_dists[idx] = FreqDist(speech) 
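
If enumerate() is new to you, this small sketch shows what it hands to the loop. The list `colors` and the dictionary `by_index` are made up for illustration.

```python
# 'colors' is a made-up list for illustration
colors = ['red', 'green', 'blue']

# enumerate() yields (index, item) pairs, so the loop gets both at once
by_index = {}
for idx, color in enumerate(colors):
    by_index[idx] = color

print(by_index)
```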

Now we can look at the last speech in the freq_dists dictionary by referencing its key in the same way we would reference a list index. 

In [None]:
freq_dists[235]

We can then home in on a specific key from the dictionary that is stored as the value for speech 235 (confusing, I know!), by again calling the key name. Note that in this case the key is a string ('american'), so we need to put it in quotes.

In [None]:
freq_dists[235]['american']

Now let's create a new for loop to look at the occurrence of the term american in all of our speeches. We can use a very similar technique as above to capture the data in a new dictionary. 

* ```americans[speech[0]]``` uses the key from freq_dists (which is the index number) as the key in our new americans dictionary
* ```speech[1]['american']``` looks up the 'american' key in the dictionary within the dictionary (the FreqDist), which gives the number of times 'american' appears in that speech.

In [None]:
americans = {}
for speech in freq_dists.items():
    americans[speech[0]] =  speech[1]['american']
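
The same loop can be written by unpacking each (key, value) pair into two named variables instead of indexing with [0] and [1]. A minimal sketch on made-up data, with ```collections.Counter``` standing in for FreqDist so that it runs without NLTK:

```python
from collections import Counter

# Made-up data: two tiny "speeches", with Counter standing in for FreqDist
sample_dists = {
    0: Counter(['american', 'people', 'american']),
    1: Counter(['union', 'people']),
}

# .items() yields (key, value) pairs, which the loop can unpack into two names
sample_counts = {}
for idx, dist in sample_dists.items():
    sample_counts[idx] = dist['american']

print(sample_counts)
```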

In [None]:
#americans

### Pandas & dataframes
Pandas is a flexible data analysis tool for Python. R users will be familiar with data frames (and probably be a little frustrated with how Pandas implements dataframes, because they're a little different!). We're going to use Pandas in a very simple way, to take a closer look at our data in a tabular form, and then to visualize using matplotlib.

The following defines a new variable df_americans by calling the pandas DataFrame.from_dict() method. We tell it to create the dataframe from the americans dictionary, we set the orientation so that the dictionary keys are the index for the dataframe, then we assign the column name "count" for the word count column (which is from the dictionary values).

In [None]:
## Convert dictionary to pandas
import pandas as pd
df_americans = pd.DataFrame.from_dict(americans, orient='index')
df_americans.columns = ['count']

You can easily peek at the first 5 rows of a pandas dataframe using .head().

In [None]:
df_americans.head()

And we can sort the data by any column using sort_values(). Let's see who says 'american' the most!

In [None]:
df_americans.sort_values(by='count', ascending=False)[0:5]

Since our index matches our ```speeches``` index we can take a closer look at speech 123.

In [None]:
# who says "americans" the most?
speeches[123][-250:]

Now let's use matplotlib to chart the occurrence of the word 'american' over time in the full corpus.

* The x-axis is the index for the df_americans dataframe (each speech index)
* The y-axis is the word count number from the dataframe.
* Then we can just add some labels, a title, and call the matplotlib function ```.show()``` to take a look!

In [None]:
x_speech = df_americans.index
y_count = df_americans['count']
plt.plot(x_speech, y_count)
plt.ylabel("Count of 'american'")
plt.xlabel("Speech")
plt.title("Use of 'american' over time")
plt.show()

## Challenge questions:
* What's suboptimal about the chart above? 
* How could we clean up this data to create a more meaningful chart? 
* What's missing in the data we're using here? 
* How could we visualize it better?

## Sentiment Analysis

Sentiment analysis is an exploratory data analysis technique that "seeks to quantify the emotional intensity of words and phrases within a text." (quote from the [Programming Historian SA tutorial](https://programminghistorian.org/en/lessons/sentiment-analysis))

We can use more NLTK tools to run a simple sentiment analysis on our SOTU corpus. To download files from NLTK we first need to import the full package. We'll also want some sentiment-specific modules.

In [None]:
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import sentiment

Now that we have the toolbox we need, we could also load NLTK's English sentence tokenizer (punkt). We don't use it in the steps below, so the line is left commented out.

In [None]:
#tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

Then we can initialize the vader SentimentIntensityAnalyzer and save it to a variable called sid.

In [None]:
sid = SentimentIntensityAnalyzer()

Now we can use the features of the sentiment analysis tool. You can take a look at some of those features by typing sid. and then tabbing through the options.

In [None]:
#sid

First let's look at the 'polarity_scores' for a specific speech. For Sentiment analysis we don't need the cleaned speech, so we'll go back to our original speeches list.

polarity_scores will give us negative (neg), neutral (neu) and positive (pos) scores, along with an overall compound score. This feature is built into VADER and can be requested on demand.

In [None]:
scores = sid.polarity_scores(speeches[100])
scores

### Dictionaries
We can format those by looping through the scores dictionary. Remember that dictionaries are key:value pairs stored in curly brackets. We can cycle through the scores dictionary like so: 

In [None]:
for key in sorted(scores):
    print('{0}: {1}'.format(key, scores[key]), end='\n')
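
The same kind of output can be produced with an f-string, a more recent alternative to ```.format()```. The scores below are made up and only mimic the shape of the dictionary that polarity_scores returns.

```python
# made-up scores in the same shape as the dictionary polarity_scores() returns
sample_scores = {'neg': 0.1, 'neu': 0.7, 'pos': 0.2, 'compound': 0.5}

lines = []
for key in sorted(sample_scores):
    # an f-string is a more recent alternative to str.format()
    lines.append(f'{key}: {sample_scores[key]}')

print('\n'.join(lines))
```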

Now let's look at the scores for the entire speeches corpus.
We'll create another dictionary, 'all_scores', that will use the speeches index as the key, and the scores as its value. Note that this means that the value for each item in 'all_scores' will contain *another* dictionary.

This might take a few minutes to run because it has to analyze every speech in the corpus.

In [None]:
all_scores = {}
for idx, speech in enumerate(speeches):
    all_scores[idx] = sid.polarity_scores(speech)

Now we can take a look at the scores for specific speeches by referencing the index/key of all_scores:

In [None]:
all_scores[235]

We can look at a specific score by referencing the key within the scores dictionary. 

In [None]:
all_scores[235]['neg']

From here, we can list all of the negative scores for the corpus. 

To keep it somewhat simple, let's just create a new dictionary that will only contain negative scores. We can create an empty dictionary called negative, then cycle through each key:value item in the all_scores dictionary from above. For each item, we'll assign the index number as its key and the negative score as its value.

In [None]:
negative = {}
for score in all_scores.items():
    negative[score[0]] =  score[1]['neg']
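
For readers comfortable with comprehensions, the same dictionary can be built in one line with a dictionary comprehension. A minimal sketch on made-up scores for two speeches (all names and values here are ours):

```python
# made-up scores for two speeches, for illustration only
sample_all_scores = {
    0: {'neg': 0.05, 'neu': 0.80, 'pos': 0.15},
    1: {'neg': 0.12, 'neu': 0.70, 'pos': 0.18},
}

# a dictionary comprehension builds the same kind of dictionary as the loop above
sample_negative = {idx: s['neg'] for idx, s in sample_all_scores.items()}

print(sample_negative)
```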

We'll use a Python tool called Pandas to look at the negative dictionary in a dataframe.

In [None]:
import pandas as pd
df = pd.DataFrame.from_dict(negative, orient='index')
df.columns = ['neg_value']

We can peek at the first few rows of the dataframe using the built-in Pandas method .head().

In [None]:
df.head()

We can use Matplotlib again here to plot out the negative scores for the entire corpus.

In [None]:
x_speech = df.index
y_neg = df['neg_value']
plt.plot(x_speech, y_neg)
plt.ylabel("Negative value")
plt.xlabel("Speech")
plt.title('Negativity of speeches over time')
#plt.show()

#### Most negative speeches
The graph gives us a nice visualization of some overall trends, but it's hard to identify specific speeches here. We can just sort our dataframe to look at the most negatively scored speeches in the corpus.

In [None]:
df.sort_values(by='neg_value', ascending=False)[0:5]

In [None]:
speeches[222][0:500]

#### Least negative speeches
We can use the default sort (ascending values) to view the least negative speeches in the corpus.

In [None]:
df.sort_values(by='neg_value')[0:5]

## Topic Models

In topic models each document is represented as a distribution over topics, and each topic is represented as a distribution over words. 

Topic model algorithms require documents to be transformed into a document-term-matrix (DTM). In a DTM each row represents a document (for us, a SOTU speech), each column represents a specific word or token in the corpus, and the matrix contains the number of times each word appears in a given document. A document can be a tweet, a novel, or an entire corpus of an author's work. How you define the boundaries of a document makes a huge difference to the output of a topic model algorithm.

Let's consider two documents, each one sentence long: 

In [None]:
sentence_1 = "The brown fox jumped over the white cow."
sentence_2 = "The red hen ran away from the brown fox."
tokens_s1 = clean_speech(sentence_1)
tokens_s2 = clean_speech(sentence_2)
print(tokens_s1)
print(tokens_s2)

Here's how we would represent those two documents in a document-term-matrix, after stemming and throwing out stop words:

|brown | fox | jump | white | cow | red | hen | ran | away|
|--|--| --| --| --| --| --| --| --|
|1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0|
|1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1|
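
To show where those numbers come from, here is a plain-Python sketch that builds the same matrix. The two token lists are written out by hand from the table above, and the variable names are ours.

```python
# The stemmed, stop-word-free tokens for the two sentences, written out by hand
doc_1 = ['brown', 'fox', 'jump', 'white', 'cow']
doc_2 = ['red', 'hen', 'ran', 'away', 'brown', 'fox']
docs = [doc_1, doc_2]

# Collect the vocabulary in order of first appearance
vocab = []
for doc in docs:
    for token in doc:
        if token not in vocab:
            vocab.append(token)

# One row per document, one column per vocabulary term
dtm = [[doc.count(term) for term in vocab] for doc in docs]

print(vocab)
for row in dtm:
    print(row)
```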

### Gensim

Gensim is a popular Python library built specifically for topic modeling. While other popular data science libraries, such as scikit-learn, can be used for topic modeling, gensim has a lot of handy built-in features that will help us out.

In [None]:
'''import sys
!{sys.executable} -m pip install gensim'''

In [None]:
import gensim
from gensim import corpora

Let's apply gensim to our cleaned_speeches corpus. 

First, we want to run the ```corpora.Dictionary()``` function from Gensim to create a dictionary of all of the terms in the corpus. Within the dictionary object is a long list of every unique word/token. In our case there are 23,662.

In [None]:
dictionary = corpora.Dictionary(cleaned_speeches)
print("dictionary:", dictionary)

Also available within the gensim dictionary object, is a Python dictionary, ```.token2id```, that contains the unique terms from the tokens list as keys, and the values are an index for each term.

In [None]:
#dictionary.token2id

Next we want to convert the lists of tokens from cleaned_speeches to a bag-of-words model using the ```doc2bow()``` function. This bit of code uses list comprehension to cycle through every tokenized speech (here called 'text') in cleaned_speeches and apply the doc2bow() function to it.

In [None]:
corpus = [dictionary.doc2bow(text) for text in cleaned_speeches]

The corpus doesn't look like much initially. Here are the first 20 items in a list for the first item/speech in the entire corpus. 

In [None]:
corpus[0][0:20]

The function doc2bow() counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. So the above is a list of the first 20 token ids + counts from the first speech, corpus[0].

Say what?! Well, let's take a look at a simple example using a sentence from before.

In [None]:
print(tokens_s1)

In [None]:
new_vec = dictionary.doc2bow(tokens_s1)
print(new_vec)

Since we're still using the dictionary object built from the cleaned_speeches corpus, the first number in each pair above, for example the 2114 in ```(2114,1)```, refers to the id for that token in that dictionary. The second number is the number of times the token occurs in our tokens list *tokens_s1*.
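
The idea behind a bag-of-words vector can be sketched in plain Python. The helper `simple_doc2bow` and the vocabulary `sample_token2id` below are hypothetical names we made up; this mimics the idea of doc2bow() and is not gensim's implementation.

```python
from collections import Counter

# Made-up vocabulary; gensim's Dictionary builds a similar token-to-id mapping
sample_token2id = {'brown': 0, 'cow': 1, 'fox': 2, 'jump': 3, 'white': 4}

def simple_doc2bow(tokens, token2id):
    # Hypothetical helper that mimics the idea of doc2bow(), not gensim's code:
    # count each known token, then report (id, count) pairs sorted by id.
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

vec = simple_doc2bow(['brown', 'fox', 'jump', 'fox', 'zebra'], sample_token2id)
print(vec)
```

Tokens that are missing from the vocabulary ('zebra' here) are simply left out of the vector, and terms with a count of zero are never stored, which is what makes the vector sparse.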

If we want to save our corpus and dictionary objects to work with later, we can do the following:

In [None]:
import pickle
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')
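
To load a pickled object back later, open the file in 'rb' (read bytes) mode and call pickle.load(). The round trip below is a minimal sketch on made-up data, written to a throwaway path so that it doesn't overwrite corpus.pkl.

```python
import os
import pickle
import tempfile

# made-up bag-of-words data, in the same shape as the corpus above
sample_corpus = [[(0, 1), (2, 2)], [(1, 3)]]

# a throwaway path so the sketch does not overwrite corpus.pkl
path = os.path.join(tempfile.mkdtemp(), 'sample_corpus.pkl')

with open(path, 'wb') as f:      # 'wb' = write bytes
    pickle.dump(sample_corpus, f)

with open(path, 'rb') as f:      # 'rb' = read bytes
    loaded = pickle.load(f)

print(loaded == sample_corpus)
```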

### LDA topic models
Now let's use gensim to create an LDA (Latent Dirichlet Allocation) topic model with our corpus.

Topic models require us to set the number of topics ahead of time. Let's guess there are about 20 topics in these speeches.

LDA on a large-ish corpus like ours will take a little while to run since it uses machine learning to iterate over the corpus in many different passes, adjusting its findings as it "learns" what topics "fit."

In [None]:
n_topics = 20

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = n_topics, id2word=dictionary, passes=15)

Now we have the model, but LDA doesn't actually tell us what the topics are called. It doesn't understand a topic at all; that's our job. What it can tell us is which words are most strongly associated with each topic. From there we might see clear patterns and label the topics as we like.

Here we use the print_topics() function, which takes an argument for how many words per topic we want to see, and cycle through each topic to view the associated words, along with their "prevalence" score to the topic. 

In [None]:
topics = ldamodel.print_topics(num_words = 10)
for topic in topics:
    print(topic)

Wow, those topics are really repetitive and kind of meaningless! How can we improve these?

Seems like we want to remove more words from our corpus since the most common words (state, nation, govern...) show up over and over here. We could go back and manually add those stop words to our stop word list, but we have an easier option: we modify our gensim dictionary with a filter to remove the *n* most frequent terms. 

Let's try to remove the most common 25 words and re-run our model.

In [None]:
dictionary.filter_n_most_frequent(25)

# since we've updated the dictionary we have to re-assign the corpus
corpus = [dictionary.doc2bow(text) for text in cleaned_speeches]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = n_topics, id2word=dictionary, passes=15)
ldamodel.save('model_20.gensim')
#https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21


If we re-run the topic word list code, adding an enumerate() counter so we can print out each topic number a little more clearly, the topics become more interesting and some are easy to label.

Some of these are still weird though! Topic models can be a nice way to see what parts of the data still need to be cleaned. Looking at this output, what else do you think we could do to clean up our corpus?

In [None]:
topics = ldamodel.print_topics(num_words = 10)
for i, topic in enumerate(topics):
    print("Topic", i, "\n", topic, "\n")

We can look at a specific speech, to see what topics are strongly associated with it, using the dictionary .doc2bow() function.

In [None]:
bush_bow = dictionary.doc2bow(cleaned_speeches[235])
print(ldamodel.get_document_topics(bush_bow))

In [None]:
kennedy_bow = dictionary.doc2bow(cleaned_speeches[177])
print(ldamodel.get_document_topics(kennedy_bow))

In [None]:
washington_bow = dictionary.doc2bow(cleaned_speeches[0])
print(ldamodel.get_document_topics(washington_bow))
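
The output of get_document_topics() is a list of (topic id, probability) pairs, so ordinary Python tools can pick out the strongest topic. A small sketch on made-up pairs (the values and names are ours, not output from the model):

```python
# made-up (topic id, probability) pairs in the shape get_document_topics() returns
sample_topics = [(3, 0.12), (7, 0.61), (15, 0.27)]

# pick the pair with the highest probability
top_topic = max(sample_topics, key=lambda pair: pair[1])

# or list every topic from strongest to weakest
ranked = sorted(sample_topics, key=lambda pair: pair[1], reverse=True)

print(top_topic)
print(ranked)
```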

## Visualizing topic models

There's one more cool tool we can use with gensim - pyLDAvis - to actually visualize what these topics look like, what words are associated with them, and how distant they are from each other in a vector space.

In [None]:
import pyLDAvis.gensim
lda = gensim.models.ldamodel.LdaModel.load('model_20.gensim')
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

If you create a great visualization using pyLDAvis you can save it as an html file!

In [None]:
pyLDAvis.save_html(lda_display,'lda_viz.html')

### Challenge - can you change the model to have fewer, more meaningful topics?

### Challenge - can you remove less frequent terms that might be creating unwanted noise in the model?

## Acknowledgements
Some of the code, descriptions, and examples above are taken from:
* UC Berkeley's D-Lab [workshop on Text Analysis Fundamentals](https://dlab.berkeley.edu/training/text-analysis-fundamentals-unsupervised-approaches-10).
* Software Carpentry's open source [Python curriculum](http://swcarpentry.github.io/python-novice-inflammation/).