# Saskatchewan Premier Scott Moe speech text analysis

*February 1, 2022*

Let's use a bit of code to count the phrases in the transcript of a speech by Moe. Specifically, we'll count the n-grams (phrases of n consecutive words) that he uses most often. We'll import a few libraries first.

In [8]:
from nltk import ngrams             # Used to split our text into ngrams (phrases of n length).
from collections import Counter     # Used to count ngram occurrences after we split them up.
import pandas as pd                 # A standard data analysis library.
import string                       # Used to remove all punctuation from our text file.

This reads in the text to be processed from a file in our raw data folder. We'll also take this opportunity to trim trailing whitespace, strip out punctuation, and make everything lowercase.

In [9]:
with open('../raw/RAW 2022 MOESPEECH.txt', 'r') as file:
    text = (file
            .read()
            .rstrip()
            .translate(str.maketrans('', '', string.punctuation))
            .lower()
            )
    
print(text[0:100])

well good morning everyone and thank you for joining us here this morning last september during what
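
To see what that cleanup chain does on its own, here is a small sketch run on a made-up sentence (the sample string is an assumption, not a line from the transcript). Note that stripping punctuation also removes apostrophes, so "it's" comes out as "its".

```python
import string

# Hypothetical sample sentence, not taken from the transcript.
sample = "Well, good morning everyone. It's time!  \n"

# The third argument to maketrans lists the characters to delete.
remove_punctuation = str.maketrans('', '', string.punctuation)

# Same order as the cell above: trim the end, drop punctuation, lowercase.
cleaned = sample.rstrip().translate(remove_punctuation).lower()

print(cleaned)  # well good morning everyone its time
```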


Now a few bits of setup. We're going to create an empty list to hold our ngrams. We'll also define a small function that converts a tuple (nltk's ngrams function yields each phrase as a tuple of words) to a string.

In [10]:
# An empty list to collect all of the phrases for counting.
all_bigrams = []

# A small function that will convert a tuple of words to a single string.
def convertTuple(tup):
    # Avoid naming the result `str`, which would shadow Python's built-in type.
    return ' '.join(tup)

Now, the fun stuff. We'll use n to define how many words we want in each ngram. In this example, I've set it to 3, so we're counting three-word phrases. (The variables below keep the name "bigrams", but with n = 3 they hold trigrams.) Then we use nltk's ngrams function to split the text we read in earlier, and our above-defined function to convert all the tuples into strings.

In [15]:
n = 3

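# Start from an empty list so re-running this cell doesn't double-count phrases.
all_bigrams.clear()

bigrams = ngrams(text.split(), n)

for gram in bigrams:
    all_bigrams.append(convertTuple(gram))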
    
print(all_bigrams[:5])

['well good morning', 'good morning everyone', 'morning everyone and', 'everyone and thank', 'and thank you']
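
Under the hood, splitting into ngrams is just a sliding window over the list of words. The sketch below shows the idea in plain Python; `sliding_ngrams` is a hypothetical helper written for illustration, not nltk's actual implementation.

```python
def sliding_ngrams(words, n):
    """Yield every run of n consecutive words as a tuple."""
    # A list of k words contains k - n + 1 windows of length n.
    for start in range(len(words) - n + 1):
        yield tuple(words[start:start + n])

# The first seven words of the cleaned transcript shown earlier.
words = "well good morning everyone and thank you".split()

trigrams = [' '.join(gram) for gram in sliding_ngrams(words, 3)]
print(trigrams)
```

Seven words give five three-word windows, which match the first five phrases printed above.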


Now we use the Counter class we imported above to count occurrences of each ngram in the list we just made. It returns a dictionary-like object that we can then read into a Pandas dataframe.

In [18]:
phrase_counts = Counter(all_bigrams)

df = (pd.DataFrame(phrase_counts, index=["Count"])
      .transpose()
      .sort_values(by="Count", ascending=False)
      )

display(df.head())

| Phrase | Count |
| --- | --- |
| its time for | 28 |
| proof of vaccination | 28 |
| public health orders | 28 |
| whether or not | 24 |
| in this province | 24 |
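
If you only need the top few phrases, Counter can rank them directly without a dataframe. A minimal sketch, using a tiny made-up list (`sample_phrases`) as a stand-in for the full list of phrases:

```python
from collections import Counter

# Made-up stand-in for the full list of phrases built earlier.
sample_phrases = [
    "public health orders", "in this province", "public health orders",
    "whether or not", "public health orders", "in this province",
]

sample_counts = Counter(sample_phrases)

# most_common(k) returns the k highest counts as (phrase, count) pairs.
top_two = sample_counts.most_common(2)
print(top_two)  # [('public health orders', 3), ('in this province', 2)]
```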


This is a pretty quick and dirty way of doing this. You may want to filter out common filler words first (for instance, remove the word "and" before you split into ngrams) so they don't dominate the counts. But from here, you can turn this into a word cloud or whatever else you like!
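
Here is one way that filtering could look, sketched with a hand-picked stop-word set and a made-up sentence (both are assumptions for illustration). It uses `zip` to build two-word phrases so it runs without nltk. Keep in mind that dropping words before splitting makes the remaining words neighbours, so some phrases will join words that were not adjacent in the speech.

```python
from collections import Counter

# Hypothetical stop-word set; extend it to suit your text.
stop_words = {"and", "the", "of"}

# Made-up sample text, already lowercased and free of punctuation.
sample = "thank you and good morning and thank you and good morning"

# Drop stop words first, then pair each remaining word with the next one.
kept = [word for word in sample.split() if word not in stop_words]
filtered_bigrams = [' '.join(pair) for pair in zip(kept, kept[1:])]

filtered_counts = Counter(filtered_bigrams)
print(filtered_counts)
```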

\-30\-