# Intro to Text as Data (UMN LATIS/Libraries)

## Intro
This workshop will cover how to:
- Read and write text files in Python
- Manipulate ‘strings’ of text
- Pre-process and clean text for analysis
- Count and plot word frequencies
- Conduct sentiment analysis
- Run basic topic models

### Prereqs
You should be familiar with basic Python, for example from a previous introductory Python workshop.

## JupyterLab - Get attendees set up
We'll share JupyterLab features as we go, but to get started let's create a new blank `workshop.ipynb` file that you can work with throughout the workshop. 

Note the control icons above the Notebook. You can use these to:
- add new cells
- run code (which you can also do by pressing Shift+Enter)
- and switch cells from code to markdown

### Reading files

Let's start to work with text by reading it in from a series of text files.

For most text as data projects, your first step is going to be to read in the files containing the data. Common file types for text data are: 
* `.txt`
* `.csv`
* `.json`
* `.html` 
* `.xml`

Each file format requires specific Python tools or methods to read, but for our case, we'll be working with .txt files.

#### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

Let's take a look at the first file in our directory (folder) of State of the Union addresses (`/sotu_text`):

In [None]:
# create a new variable called file1 and read ("r") the first file in the sotu_text folder
file1 = open('sotu_text/215.txt','r') 

If we try to print out the file, however, we don't see its contents. The `file1` variable is not yet a Python string; it is a `TextIOWrapper`, a text "wrapper" object from Python's `io` (input/output) module.

In [None]:
# Printing doesn't print the contents
print(file1)

To access the text, we can use the .read() method of the `TextIOWrapper` and then use the string index of the text object to show the first 250 characters.

In [None]:
text = file1.read()
print(text[0:250])
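
One thing the cells above leave out is closing the file. A common pattern, sketched here under the assumption that the text is UTF-8 encoded (the workshop files may differ), is to open the file inside a `with` block so Python closes it automatically. The temporary file and its one-line content are made up so the example runs on its own.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
# (the file name and sentence are invented for illustration).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("Fellow-Citizens of the Senate and House of Representatives.")

# The with block closes the file as soon as the indented code finishes.
with open(sample_path, "r", encoding="utf-8") as f:
    sample_text = f.read()

print(sample_text[0:15])
print(f.closed)
```

The same pattern would work for `sotu_text/215.txt`; only the path changes.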

### Tokenization

Once we've read in the data, a common next step is to split a longer string into words or, more helpfully, "word chunks" called "tokens." This step is referred to as "tokenization." Tokens can be of any length (sentences, words, or parts of words), but usually we want to reduce our text into meaningful word chunks.

#### Tokenizing by whitespace
An easy way to tokenize a string, though not the most accurate, is to split up a string using whitespace. Let's save each word to a list variable called 'tokens.' `.split()` creates a Python list of the words from the string. If we leave the argument for `split()` blank, it will split on whitespace characters in the string.

In [None]:
tokens = text.split()

In [None]:
tokens[0:10]
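
To see why whitespace splitting is "not the most accurate," here is a small sketch on a made-up sentence (not taken from the speeches). It shows that `.split()` handles extra spaces well but leaves punctuation attached to the words.

```python
sample = "The Union is strong. Is it?  Yes, it is!"

# split() with no argument splits on any run of whitespace,
# so the double space does not produce an empty token.
sample_tokens = sample.split()
print(sample_tokens)

# Punctuation stays attached, so "strong." and "strong" would be counted separately.
print("strong." in sample_tokens, "strong" in sample_tokens)
```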

### Splitting on other characters

Sentence segmentation involves identifying the boundaries of sentences, and provides a different way to tokenize our text.

#### Sentence segmentation by splitting on punctuation

In [None]:
# instead of the default whitespace for split(), you can identify the character or characters you'd like to split on
sentences = text.split('.')
sentences[0]

We can check how many items are in any list using the len() function.

In [None]:
len(sentences)

In [None]:
# note that this method doesn't break out sentences that end with other punctuation, like question marks
sentences[35]
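
One possible way around that limitation, offered as a rough sketch rather than a full sentence segmenter, is to split with a regular expression that treats periods, question marks, and exclamation points alike. The sample string is invented for illustration.

```python
import re

sample = "Is the Union strong? It is. We will keep it that way!"

# Split on whitespace that follows a period, question mark, or exclamation point.
# The lookbehind (?<=[.?!]) keeps the punctuation attached to its sentence.
sample_sentences = re.split(r"(?<=[.?!])\s+", sample)
print(sample_sentences)
```

This still breaks on abbreviations such as "Mr." or "U.S.", which is one reason dedicated tokenizers exist.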

### Counting words

Once we have a list of terms, we can count them up using a Python class called `Counter` from the `collections` module.

In [None]:
# import the Counter class from the collections module
from collections import Counter

token_counts = Counter(tokens)

#print(token_counts)

If you hit tab after ```token_counts.``` you can see Counter methods that are available to examine this data more closely. We can use ```.most_common(5)``` to look at the five most common words in the corpus. (You can change the argument to print out as many of the most common words as you want.)

In [None]:
token_counts.most_common(5)

You can also look at specific word counts by calling the term in the same way you would refer to a Python dictionary key:

In [None]:
token_counts['President']

We can improve this token list for later analysis by using built-in string methods to clean the data a bit. We'll create a new list, called "unigrams," to hold each token. Unigrams refer to single words, while "bigrams" are sequences of two adjacent words and "trigrams" are sequences of three. Researchers often refer to these collections of tokens of different lengths as "ngrams."

We'll use the built-in string method `.lower()` to convert the text all to lower case. This helps us count up word frequencies regardless of capitalization. 

The `.replace()` method replaces every occurrence of its first argument with its second argument. `'text'.replace('t', '')`, for example, would change the string 'text' to 'ex', since it replaces every 't' character with an empty string.

In [None]:
#create a new empty list
unigrams = []

# Loop through each token in the tokens list
for token in tokens:
    token = token.lower() # lowercase tokens
    token = token.replace('.', '') # remove periods
    token = token.replace('!', '') # remove exclamation points
    token = token.replace('?', '') # remove question marks
    unigrams.append(token) #append each updated token to the new unigrams list

Let's convert this list of unigrams using Counter and see if there were any changes in the most common five words.

In [None]:
word_counts = Counter(unigrams)
print('Original list:', token_counts.most_common(5))
print('Cleaned up list:', word_counts.most_common(5))

Why do you think the counts have changed?
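
The bigrams mentioned earlier can be built from a list of unigrams with plain Python. This is a minimal sketch on a made-up word list; the variable names are illustrative and not part of the workshop code.

```python
from collections import Counter

sample_unigrams = ["the", "state", "of", "the", "union", "the", "state", "of", "affairs"]

# Pair each word with the word that follows it.
# zip() stops at the shorter input, so the last word has no partner.
sample_bigrams = list(zip(sample_unigrams, sample_unigrams[1:]))

bigram_counts = Counter(sample_bigrams)
print(bigram_counts.most_common(2))
```

Because each bigram is a tuple, it can be counted with `Counter` in the same way as a single word.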

### Tokenizing with the Natural Language Toolkit (NLTK)

The steps we took above are pretty time-consuming, and require us to identify each specific character to remove. Most researchers will use the Natural Language Toolkit (NLTK) or spaCy to accomplish many of the steps we showed manually above with fewer steps.

We can use NLTK's `word_tokenize` tool to do a lot of the tokenizing work for us.

In [None]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize


Let's convert our original sotu string, called ```text```, to lowercase, and tokenize it using `word_tokenize()`.

In [None]:
word_tokens = word_tokenize(text.lower())
print(word_tokens[0:25])

#### Removing punctuation

In the example above, we have a lot of punctuation marks as tokens in our data. 

We can manually remove punctuation from strings, which can be useful in certain settings. Let's import the `string` module, which includes a ready-made string of common punctuation marks.

In [None]:
import string
string.punctuation

In [None]:
# strip() will remove punctuation from the beginning or end of the string
"?Test 1, 2, 3!".strip(string.punctuation)

To remove all of the punctuation in a string we can loop through every character in the string, strip it if it is a punctuation mark, and use the ```.join()``` method to recombine the characters with an empty string as the separator. First let's see how ```.join()``` works on a simpler example:

In [None]:
'x'.join('Put an x in between every character')

In [None]:
# let's remove all punctuation from our SOTU speech
clean_text = ''.join(char.strip(string.punctuation) for char in text)
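
An alternative that some people find easier to read, shown here as a sketch on an invented sentence, is `str.translate()` with a deletion table. It should give the same result as the character loop above.

```python
import string

sample = "Fellow-citizens: the state of the Union is strong!"

# str.maketrans('', '', chars) builds a table that deletes every character in chars.
remove_punct = str.maketrans("", "", string.punctuation)
sample_clean = sample.translate(remove_punct)
print(sample_clean)
```

Note that either approach also deletes hyphens, so "Fellow-citizens" becomes a single run of letters.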

### Tokenize with word_tokenize
Now that we've removed the punctuation from our text, we can retokenize the corpus using `clean_text`.

In [None]:
tokens = word_tokenize(clean_text.lower())
tokens[:10]

Now that we have a list of tokens we can count their frequencies in the speech. Let's use an NLTK class called `FreqDist` to look at our most common words. It is similar to the `Counter` class we worked with earlier, though it provides some extended functionality.

In [None]:
from nltk.probability import FreqDist

In [None]:
#apply the FreqDist function to our tokens variable
fdist = FreqDist(tokens)

#fdist is a dictionary of unique words and the number of times they occur
fdist

In [None]:
#fdist also includes a handy method to find the most common words 
fdist.most_common(10)

...which, to make things even more complicated, returns a list (see the square brackets containing comma-separated items) containing tuples (those objects in parentheses, also containing comma-separated items). But we needn't get overly worried about that here.

#### Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "am", "i", "the" and "a": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

NLTK includes a stopwords module we can use. Not all stopwords lists are equal though: for your own research you might want to customize a stopwords list, or find one that is best-suited to your domain.

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

# how many stopwords are on the list?
len(stop)

In [None]:
# what are the first ten words on the stopword list?
stop[0:10]

Let's create a new list of tokens, removing our stopwords along the way. 

This loop checks each word in our original tokens list, and if it does *not* appear on the stopword list, it adds it to a new list called tokens_clean.

In [None]:
tokens_clean = [] 
  
for w in tokens: 
    if w not in stop: 
        tokens_clean.append(w)
tokens_clean[0:10]

In [None]:
tokens[0:10]

In [None]:
# advanced: we can do the same thing quite efficiently with a list comprehension
tokens_clean = [w for w in tokens if w not in stop]
tokens_clean[0:10]
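
A small optional speed-up, sketched here with a tiny made-up stop word list in place of NLTK's, is to store the stop words in a set. Checking `w not in` a set is generally faster than scanning a list, which can matter on a large corpus.

```python
# A tiny made-up stop word list; NLTK's list is much longer.
sample_stop = {"the", "of", "is", "a"}  # curly braces make a set
sample_tokens = ["the", "state", "of", "the", "union", "is", "strong"]

# Same comprehension as before, but membership is checked against a set.
sample_clean = [w for w in sample_tokens if w not in sample_stop]
print(sample_clean)
```

With the workshop's `stop` list, the conversion would be `set(stop)`.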

In [None]:
# now we can re-count the most common words after stop words are removed
freq = FreqDist(tokens_clean)
freq.most_common(10)

Hmmm, still not terribly interesting but getting better...

#### Stemming

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because often we care more about the core content of the word (i.e. that it has something to do with growth or running, rather than the fact that it's a third person present tense verb, or progressive participle).

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm.

In [None]:
# import the PorterStemmer and then stem the word "states" as an example
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('states')

In [None]:
stemmer.stem('government')

In [None]:
stemmer.stem('roosevelt')

In a similar manner as the stopwords loop above, we can create a new list of stemmed tokens:

In [None]:
tokens_stemmed = []
for t in tokens_clean:
    tokens_stemmed.append(stemmer.stem(t))

In [None]:
#or as a comprehension:
tokens_stemmed = [stemmer.stem(t) for t in tokens_clean]

In [None]:
tokens_stemmed[0:10]

Now that the words are stemmed, are the most common words any different? 

Here are the stemmed top ten.

In [None]:
freq_stemmed = FreqDist(tokens_stemmed)
for f in freq_stemmed.most_common(10):
    print(f)

And the unstemmed top ten:

In [None]:
for f in freq.most_common(10):
    print(f)

Similar, but with some important differences. Notice that "work" went from 42 to 69 after stemming.  

Why would that be?

### Reading in multiple files

Often, our text data is split across multiple files in a folder. We can read them all into a single variable using a Python module called `glob`.

In [None]:
import glob

In [None]:
# save all of the files that end with .txt in the sotu_text/ folder to a variable called sotu_all
sotu_all = glob.glob("sotu_text/*.txt")

In [None]:
# this just saves the file-paths to a list though
sotu_all[0:10]

Those are out of order though. Let's sort the list so that the list index is in the same order as the speeches themselves (sotu_all[0] would equal 001.txt).

*Something important to note is that **glob** can return files in a different order on different systems (Windows/macOS/Linux). If your file names include a numeric identifier, sorting them is always a good idea for the reproducibility of your code, regardless of the system it runs on.*

In [None]:
sotu_all.sort()

In [None]:
sotu_all[0:10]
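
A plain `.sort()` works here because the file names appear to be zero-padded (`001.txt`). If your own files are not padded, alphabetical order and numeric order can disagree. The sketch below uses hypothetical file names and a hypothetical helper, `file_number`, to sort by the number in the name.

```python
import os

# Hypothetical file names without zero padding.
paths = ["sotu_text/10.txt", "sotu_text/2.txt", "sotu_text/1.txt"]

# A plain sort compares character by character, so "10" lands before "2".
alphabetical = sorted(paths)
print(alphabetical)

def file_number(path):
    # "sotu_text/10.txt" -> "10.txt" -> "10" -> 10
    name = os.path.basename(path)
    return int(os.path.splitext(name)[0])

# Sorting with a key function orders the paths by their numeric value instead.
numeric = sorted(paths, key=file_number)
print(numeric)
```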

Now that we have a list of all the files we need to cycle through each one and save the text from the file.

To do that we'll create a new list variable, speeches. For each file in the sotu_all variable we'll open and read the file, and save the text to the speeches list. 

In [None]:
speeches = []
for speech in sotu_all:
    # the with block closes each file once it has been read
    with open(speech, 'r') as s:
        text = s.read()
    speeches.append(text)

In [None]:
# now we can refer to each speech from the list using the list index
speeches[45][0:250]

In [None]:
#which file is that?
sotu_all[45]

Here's a short comprehension version to tidy the open/append loop found above.

In [None]:
speeches = [open(speech, 'r').read() for speech in sotu_all]
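
Another option, shown as a sketch rather than a replacement for the cell above, is the standard library's `pathlib`, whose `read_text()` opens, reads, and closes a file in one call. The folder and file contents below are invented, and UTF-8 encoding is an assumption.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with two small files (contents are invented).
folder = Path(tempfile.mkdtemp())
(folder / "001.txt").write_text("First speech.", encoding="utf-8")
(folder / "002.txt").write_text("Second speech.", encoding="utf-8")

# Path.glob finds matching files; sorted() puts them in a predictable order.
sample_paths = sorted(folder.glob("*.txt"))

# read_text opens the file, reads it, and closes it in one call.
sample_speeches = [p.read_text(encoding="utf-8") for p in sample_paths]
print(sample_speeches)
```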

In [None]:
len(speeches)

In [None]:
speeches[235][0:250]

### Creating a Cleaning Function
Now that we have all the text data loaded, we can think about working on the corpus as a whole.

Let's create a function that combines all of our cleaning protocols so that we can clean each State of the Union speech with a single piece of code. 

The function definition opens with the keyword ```def``` followed by the name of the function (clean_speech) and a parenthesized list of parameter names (speech). The body of the function — the statements that are executed when it runs — is indented below the definition line. The body concludes with a return keyword followed by the return value.

In [None]:
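def clean_speech(speech):
    # lowercase the speech and drop every punctuation character
    speech = ''.join(char.strip(string.punctuation) for char in speech.lower())
    # tokenize, skip stop words, and stem what remains
    speech = [stemmer.stem(w) for w in word_tokenize(speech) if w not in stop]
    return speech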

You can call the function using the name of the function, with the variable you'd like to process as its argument. To process only the first speech, for example, you could call:

```clean_speech(speeches[0])```

You could also assign the function's output to a variable so you can work with it later:

```first_cleaned = clean_speech(speeches[0])```

Let's put it all together and clean all of the speeches, and assign them to a new list, ```cleaned_speeches```.

In [None]:
# this cell might take a few minutes to run!
cleaned_speeches = [clean_speech(speech) for speech in speeches]

In [None]:
len(cleaned_speeches)

Notice that each item in the cleaned_speeches list is also a list.

In [None]:
type(cleaned_speeches[0])

In [None]:
cleaned_speeches[0][0:10]

### Word frequencies across the corpus


#### enumerate()
* The built-in `enumerate()` function lets us keep track of our place in a for loop. We'll call the counter variable `i`, for index.
* The for loop below iterates through each speech in `cleaned_speeches` and builds a `FreqDist` for it.
* Each `FreqDist` is appended to the `freqs` list, so `freqs[i]` holds the word counts for `cleaned_speeches[i]`.

In [None]:
freqs = []

for i,speech in enumerate(cleaned_speeches):
    freqs.append(FreqDist(speech))
    

`FreqDist.N()` gives us the total 'outcomes', in other words the total number of words from each document. We can use this to calculate the relative frequency of a specific term to each document and track that value over time. 

In [None]:
print(freqs[0].N())
print(freqs[0]['war'])
print(freqs[0]['war']/freqs[0].N())
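
The same arithmetic can be tried without NLTK. The sketch below uses `Counter` and two tiny made-up token lists to show how `enumerate()` supplies the index while the ratio is computed as the term count divided by the total word count.

```python
from collections import Counter

# Two tiny made-up "speeches", already tokenized.
sample_docs = [
    ["war", "peace", "war", "trade"],
    ["trade", "peace", "peace", "peace", "war"],
]

sample_ratios = []
for i, doc in enumerate(sample_docs):
    counts = Counter(doc)
    total = sum(counts.values())   # same idea as FreqDist.N()
    ratio = counts["war"] / total  # relative frequency of 'war'
    sample_ratios.append(ratio)
    print(i, total, counts["war"], ratio)
```

Both documents mention 'war', but it makes up half of the first one and only a fifth of the second, which is the kind of difference a relative frequency is meant to surface.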

Let's edit the loop above to look at the token 'war' across the full corpus and plot its frequency in each speech. We'll build a list of tuples, one per speech, each holding the index of the speech in cleaned_speeches, the total number of words, the count of 'war', and the ratio of 'war' to the total number of words in that speech.

In [None]:
war_ratios = []

for i, speech in enumerate(cleaned_speeches):
    freq = FreqDist(speech)
    total_words = freq.N()
    war_count = freq['war']
    ratio = war_count / total_words
    # call this "row" rather than "tuple" so we don't overwrite Python's built-in tuple type
    row = (i, total_words, war_count, ratio)
    war_ratios.append(row)

We can convert this list of tuples into a Pandas DataFrame so we can work with the data in a more easy-to-read tabular format.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(war_ratios, columns =['speech', 'total_words', 'war_count', 'ratio'])
df

### Plotly
Let's use an interactive plotting library, Plotly, to build a plot that will allow us to hover over specific points to find more information. This will allow us to pinpoint the spikes in the plot that show the highest frequency of 'war' mentions.

In [None]:
#conda install plotly
import plotly.express as px

In [None]:
fig = px.line(x=df.index, y=df['ratio'])
fig.update_layout(xaxis_title="Speech",
    yaxis_title="Relative frequency of 'war'")
fig.show()

In [None]:
speeches[155][:800]

Alternatively, we could look at the plain word counts of 'war' from each speech. The results are similar, but the spikes at speeches 24 and 128 no longer stand out, which shows that 'war' was a frequent topic relative to the length of those speeches.

In [None]:
fig = px.line(x=df.index, y=df['war_count'])
fig.update_layout(xaxis_title="Speech",
    yaxis_title="N of 'war'")
fig.show()

In [None]:
# james madison, 1813, war of 1812
print(speeches[24][-300:])

# woodrow wilson, 1917, wwi
print(speeches[128][:800])

### Sentiment Analysis

Sentiment analysis is an exploratory data analysis technique that "seeks to quantify the emotional intensity of words and phrases within a text." (quote from the [Programming Historian SA tutorial](https://programminghistorian.org/en/lessons/sentiment-analysis))

We can use more NLTK tools to run a simple sentiment analysis on our SOTU corpus. We'll download the vader_lexicon for sentiment analysis and import the VADER and sentiment modules. Don't worry if you see a warning that we don't have the twython library. We won't be using it since we're not analyzing Twitter text.

In [None]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import sentiment

Note that we could use a tokenizer that works best for sentiment analysis (see the commented out code below). Since we've already tokenized our text we'll stick with that corpus. 

In [None]:
#nltk.download('punkt')
#tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

Let's initialize the vader SentimentIntensityAnalyzer and save it to a variable called sid.

In [None]:
sid = SentimentIntensityAnalyzer()

Now we can use the features of the sentiment analysis tool. You can take a look at some of those features by typing sid. and then tabbing through the options.

In [None]:
#sid

First let's look at the 'polarity_scores' for a specific speech. For sentiment analysis we don't need the cleaned speech, so we'll go back to our original speeches list.

polarity_scores will give us negative (`neg`), neutral (`neu`), and positive (`pos`) scores, along with an overall `compound` score. This feature is built into VADER and can be requested on demand.

In [None]:
scores = sid.polarity_scores(speeches[100])
scores

### Dictionaries
We can format the output by looping through the scores dictionary. Remember that dictionaries are key:value pairs stored in curly brackets. We can cycle through the scores dictionary like so: 

In [None]:
for key in sorted(scores):
    print('{0}: {1}'.format(key, scores[key]), end='\n')
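
Another common way to walk through a dictionary is `.items()`, which hands back each key together with its value. The numbers in this sketch are made up; they only mimic the shape of what `polarity_scores` returns.

```python
# Made-up scores in the same shape that polarity_scores returns.
sample_scores = {"neg": 0.1, "neu": 0.7, "pos": 0.2, "compound": 0.5}

# .items() yields each key together with its value.
for key, value in sample_scores.items():
    print(f"{key}: {value}")

# Look up a single value by its key.
print(sample_scores["neg"])
```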

Now let's look at the scores for the entire speeches corpus.
We'll create another dictionary, 'all_scores', that will use the speeches index as the key, and the scores as its value. Note that this means that the value for each item in 'all_scores' will contain *another* dictionary.

This might take a few minutes to run because it has to analyze all 235 speeches.

In [None]:
all_scores = {}
for idx, speech in enumerate(speeches):
    all_scores[idx] = sid.polarity_scores(speech)

Now we can take a look at the scores for specific speeches by referencing the index/key of all_scores:

In [None]:
all_scores[235]

We can look at a specific score by referencing the key within the scores dictionary. 

In [None]:
all_scores[235]['neg']

From here, we can list all of the negative scores for the corpus. 

To keep it somewhat simple, let's just create a new dictionary that will only contain negative scores. We can create an empty dictionary called negative, then cycle through each key:value item in the all_scores dictionary from above. For each item, we'll assign the index number as its key and the negative score as its value.

In [None]:
negative = {}
for idx, score in all_scores.items():
    negative[idx] = score['neg']

In [None]:
x_speech = list(negative.keys())
y_neg = list(negative.values())
fig = px.line(x=x_speech, y=y_neg)
fig.update_layout(xaxis_title="Speech",
    yaxis_title="Negative value")
fig.show()

#### Most negative speeches
The graph gives us a nice visualization of some overall trends, and we can take a closer look at some of the most negative speeches here. We could also sort our dictionary, using the `sorted()` function, to list the speeches with the most negative scores in the corpus.

In [None]:
sorted(negative, key=negative.get, reverse=True)[:5]
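
To make the `key=negative.get` trick less mysterious, here is a sketch with a small made-up dictionary. Sorting a dictionary sorts its keys, and the `key` argument tells `sorted()` to rank those keys by their values instead.

```python
# Made-up negative scores keyed by speech index.
sample_negative = {0: 0.05, 1: 0.12, 2: 0.08, 3: 0.20}

# Sorting a dictionary sorts its keys; key=sample_negative.get ranks them by value.
top_keys = sorted(sample_negative, key=sample_negative.get, reverse=True)[:2]
print(top_keys)

# To keep the scores alongside the indexes, sort the (key, value) pairs instead.
top_pairs = sorted(sample_negative.items(), key=lambda pair: pair[1], reverse=True)[:2]
print(top_pairs)
```

The first form returns only the speech indexes; the second keeps each index paired with its score.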

In [None]:
speeches[222][0:500]

#### Least negative speeches
We can use the default sort (ascending values) to view the least negative speeches in the corpus.

In [None]:
sorted(negative, key=negative.get)[:5]

## Topic models (sklearn)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
speeches[0][0:100]

In [None]:
# Initialize regex tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# Vectorize document using TF-IDF
tfidf = TfidfVectorizer(lowercase=True,
                        stop_words='english',
                        ngram_range = (1,1),
                        tokenizer = tokenizer.tokenize)

# Fit and Transform the documents
train_data = tfidf.fit_transform(speeches)  

In [None]:
n_components = 12

model=LatentDirichletAllocation(n_components=n_components)

# Fit and transform the LDA model on the data
lda_matrix = model.fit_transform(train_data)

# Get Components 
lda_components=model.components_


In [None]:
# Print the topics with their terms
terms = tfidf.get_feature_names_out()

for index, component in enumerate(lda_components):
    # pair each term with its weight in this topic and keep the seven heaviest
    top_terms = sorted(zip(terms, component), key=lambda t: t[1], reverse=True)[:7]
    top_terms_list = [term for term, weight in top_terms]
    print("Topic "+str(index)+": ",top_terms_list)

## Topic models (Gensim)

In [None]:
import gensim
from gensim.models import CoherenceModel
import pyLDAvis.gensim_models

In [None]:
# had issues with gensim and recent scipy package. had to back install 1.12.0
#scipy.__version__

In [None]:
dictionary = gensim.corpora.Dictionary(cleaned_speeches)

In [None]:
doc_count = len(cleaned_speeches)
num_topics = 12 # Change the number of topics
passes = 10 # The number of passes used to train the model
# Remove terms that appear in fewer than 50 documents and terms that occur in more than 95% of documents.
dictionary.filter_extremes(no_below=50, no_above=0.95)

In [None]:
bow_corpus = [dictionary.doc2bow(speech) for speech in cleaned_speeches]
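
`doc2bow` turns each token list into a "bag of words": a list of (token id, count) pairs. The sketch below imitates that idea with the standard library on two made-up documents. It is not gensim's implementation, and gensim may assign different ids to the tokens.

```python
from collections import Counter

sample_docs = [["war", "peace", "war"], ["trade", "peace"]]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in sample_docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

# Represent each document as sorted (token id, count) pairs.
sample_bow = [sorted(Counter(token2id[t] for t in doc).items()) for doc in sample_docs]
print(token2id)
print(sample_bow)
```

Word order is discarded in this representation; only which words appear, and how often, is kept.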

In [None]:
%%time
# Train the LDA model
model = gensim.models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=passes
)

In [None]:
# Compute the coherence score using UMass
# u_mass is measured from -14 to 14, higher is better
coherence_model_lda = CoherenceModel(
    model=model,
    corpus=bow_corpus,
    dictionary=dictionary, 
    coherence='u_mass'
)

# Compute Coherence Score using UMass
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
for topic_num in range(0, num_topics):
    word_ids = model.get_topic_terms(topic_num)
    words = []
    for wid, weight in word_ids:
        # index the dictionary directly; its id2token mapping is only filled in lazily
        word = dictionary[wid]
        words.append(word)
    print("Topic {}".format(str(topic_num).ljust(5)), " ".join(words))

In [None]:
pyLDAvis.enable_notebook()
lda_viz = pyLDAvis.gensim_models.prepare(model, bow_corpus, dictionary)

In [None]:
lda_viz

## Where to Go for Help
- Contact LATISresearch@umn.edu