<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Part-3:-Train-bigram-and-trigram-models-and-use-them-on-all-speeches" data-toc-modified-id="Part-3:-Train-bigram-and-trigram-models-and-use-them-on-all-speeches-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Part 3: Train bigram and trigram models and use them on all speeches</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#We-use-SpaCy-to-tokenize-and-POS-tag-each-speech" data-toc-modified-id="We-use-SpaCy-to-tokenize-and-POS-tag-each-speech-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>We use SpaCy to tokenize and POS tag each speech</a></span></li><li><span><a href="#Lazy-load-the-speeches" data-toc-modified-id="Lazy-load-the-speeches-1.0.2"><span class="toc-item-num">1.0.2&nbsp;&nbsp;</span>Lazy load the speeches</a></span></li><li><span><a href="#Save-speeches-alone-to-a-text-file-to-speed-up-processing" data-toc-modified-id="Save-speeches-alone-to-a-text-file-to-speed-up-processing-1.0.3"><span class="toc-item-num">1.0.3&nbsp;&nbsp;</span>Save speeches alone to a text file to speed up processing</a></span></li><li><span><a href="#Create-a-bunch-of-helper-functions-to-help-us-read-speeches-from-the-file,-remove-punctuation-and-whitespace-and-lemmatize-words" data-toc-modified-id="Create-a-bunch-of-helper-functions-to-help-us-read-speeches-from-the-file,-remove-punctuation-and-whitespace-and-lemmatize-words-1.0.4"><span class="toc-item-num">1.0.4&nbsp;&nbsp;</span>Create a bunch of helper functions to help us read speeches from the file, remove punctuation and whitespace and lemmatize words</a></span></li><li><span><a href="#Lemmatize-all-words-in-speeches-and-store-them-in-text-file-to-save-memory" data-toc-modified-id="Lemmatize-all-words-in-speeches-and-store-them-in-text-file-to-save-memory-1.0.5"><span class="toc-item-num">1.0.5&nbsp;&nbsp;</span>Lemmatize all words in speeches and store them in text file to save 
memory</a></span></li><li><span><a href="#Learn-bigrams-in-speeches-and-save-model-to-disk" data-toc-modified-id="Learn-bigrams-in-speeches-and-save-model-to-disk-1.0.6"><span class="toc-item-num">1.0.6&nbsp;&nbsp;</span>Learn bigrams in speeches and save model to disk</a></span></li><li><span><a href="#Identify-bigrams-in-the-speeches-and-save-in-txt-file" data-toc-modified-id="Identify-bigrams-in-the-speeches-and-save-in-txt-file-1.0.7"><span class="toc-item-num">1.0.7&nbsp;&nbsp;</span>Identify bigrams in the speeches and save in txt file</a></span></li><li><span><a href="#Learn-trigrams-in-speeches-and-save-model-to-disk" data-toc-modified-id="Learn-trigrams-in-speeches-and-save-model-to-disk-1.0.8"><span class="toc-item-num">1.0.8&nbsp;&nbsp;</span>Learn trigrams in speeches and save model to disk</a></span></li><li><span><a href="#Identify-trigrams-in-the-speeches-and-save-in-txt-file" data-toc-modified-id="Identify-trigrams-in-the-speeches-and-save-in-txt-file-1.0.9"><span class="toc-item-num">1.0.9&nbsp;&nbsp;</span>Identify trigrams in the speeches and save in txt file</a></span></li><li><span><a href="#Now-process-all-speeches-from-plain-text-to-unigram-(lemmatized),-bigram-and-finally-trigram-representation" data-toc-modified-id="Now-process-all-speeches-from-plain-text-to-unigram-(lemmatized),-bigram-and-finally-trigram-representation-1.0.10"><span class="toc-item-num">1.0.10&nbsp;&nbsp;</span>Now process all speeches from plain text to unigram (lemmatized), bigram and finally trigram representation</a></span></li></ul></li></ul></li></ul></div>

# Analyse all House of Commons speeches since 1970

[Part 1: Get a list of MPs and their affiliations](MP_speeches-Part1.ipynb)

[Part 2: Download all speeches belonging to MPs in list](MP_speeches-Part2.ipynb)

## Part 3: Train bigram and trigram models and use them on all speeches

[Part 4: Train an LDA topic model and process all speeches with it](MP_speeches-Part4.ipynb)

[Part 5: Analyse the results of the LDA model](MP_speeches-Part5.ipynb)

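A bigram (trigram) is a sequence of two (three) words that occur together more often than their individual frequencies would suggest.
A phrase model joins such a sequence into a single token: for example, "Maastricht treaty" becomes maastricht_treaty and "House of Commons" becomes house_of_commons.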

In [2]:
import pandas as pd
import bcolz
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.models.word2vec import LineSentence
import codecs
import os

In [3]:
import os.path
import time

# Block until raw_speeches.h5 exists, checking once a minute
while True:
    if os.path.isfile("raw_speeches.h5"):
        break
    time.sleep(60)

#### We use SpaCy to tokenize and POS tag each speech

In [4]:
if True:
    # Load english language model from spacy
    import spacy
    nlp = spacy.load("en")
    # If it complains, you may need to downgrade pip: pip install pip==9.0.1

In [5]:
# Directory to store Phrase models
from config import INTERMEDIATE_DIRECTORY

#### Lazy load the speeches

In [6]:
speeches = bcolz.open("speeches.bcolz")

#### Save speeches alone to a text file to speed up processing

In [8]:
!mkdir -p $INTERMEDIATE_DIRECTORY

In [6]:
# Save speeches to a txt file first (one speech per line) so they can be processed in batches with less memory
speeches_filepath = os.path.join(INTERMEDIATE_DIRECTORY, "speeches.txt")
# Set to True if you want to run this again
if False:
    with codecs.open(speeches_filepath, "w", encoding="utf_8") as f:
        for speech in speeches["body"]:
            # escape line breaks so that each speech stays on one line;
            # line_speech() un-escapes them when reading the file back
            f.write(speech.replace("\n", "\\n") + "\n")
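The one-speech-per-line format relies on line breaks inside a speech being escaped on writing and restored on reading, which is what `line_speech` expects. A minimal sketch of that round trip, assuming speeches are plain Python strings; `escape_speech`, `unescape_speech` and the demo data are hypothetical and not part of the notebook:

```python
import io

def escape_speech(speech):
    """Replace real line breaks with the two-character sequence backslash + n."""
    return speech.replace("\n", "\\n")

def unescape_speech(line):
    """Drop the trailing line terminator and restore the original line breaks."""
    return line.rstrip("\n").replace("\\n", "\n")

# Hypothetical demo data: the first speech contains a line break
speeches_demo = ["First paragraph.\nSecond paragraph.", "A one-line speech."]

# Write one speech per line into an in-memory file
buffer = io.StringIO()
for speech in speeches_demo:
    buffer.write(escape_speech(speech) + "\n")

# Read back: each line of the file is exactly one speech
buffer.seek(0)
restored = [unescape_speech(line) for line in buffer]
```

One limitation of this simple scheme: a speech that already contains a literal backslash followed by "n" would be changed on the way back.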

#### Create a bunch of helper functions to help us read speeches from the file, remove punctuation and whitespace and lemmatize words

In [7]:
#%%writefile helper_functions.py

def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_speech(filename):
    """
    generator function to read in speeches from the file
    and un-escape the original line breaks in the text
    """
    
    with codecs.open(filename, encoding='utf_8') as f:
        for speech in f:
            yield speech.replace('\\n', '\n')

def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse speeches,
    lemmatize the text, and yield sentences
    """
    
    for parsed_speech in nlp.pipe(line_speech(filename),
                                  batch_size=10000, n_threads=8):
        
        for sent in parsed_speech.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

#### Lemmatize all words in speeches and store them in text file to save memory
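Lemmatization maps each word to its dictionary form (lemma): for example, "was" becomes "be" and "women" becomes "woman". Unlike stemming, it does not simply strip word endings. spaCy also replaces pronouns with the placeholder -PRON-, which is visible in the output below.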
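To make the difference between a lemma and a stripped word ending concrete, here is a toy sketch. It does not use spaCy: `LEMMA_TABLE`, `toy_lemma` and `toy_stem` are hypothetical stand-ins, and a real lemmatizer uses far larger lookup tables and part-of-speech information.

```python
# Toy lookup table standing in for a real lemmatizer (hypothetical data)
LEMMA_TABLE = {"was": "be", "women": "woman", "speeches": "speech", "voted": "vote"}

def toy_lemma(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

def toy_stem(word):
    """Naive suffix stripping, shown only for contrast with lemmatization."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # only strip when a reasonably long remainder is left
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["was", "women", "speeches", "voted"]
lemmas = [toy_lemma(w) for w in words]  # ['be', 'woman', 'speech', 'vote']
stems = [toy_stem(w) for w in words]    # ['was', 'women', 'speeche', 'vot']
```

The stems are not always real words and irregular forms are left untouched, which is why the notebook lemmatizes instead.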

In [8]:
%%time
# this is a bit time consuming (takes about 1h) - make the if statement True
# if you want to execute data prep yourself.
unigram_sentences_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'unigram_sentences_all.txt')
if False:
    with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(speeches_filepath):
            f.write(sentence + '\n')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 16.2 µs


In [9]:
!tail intermediate/unigram_sentences_all.txt

zora 's writing and -PRON- work as a teacher hollywood scriptwriter and a newspaper columnist be all instrumental in -PRON- contribution to the american literary landscape
-PRON- be zora 's literary accomplishment -PRON- style of writing and the subject of the african- american experience that be indispensable in -PRON- major influence on such great contemporary female poet and author such as toni morrison maya angelou and alice walker
after zora 's death in 1960 the popularity of -PRON- writing increase
today zora 's name be highlight in the black female playwrights category and -PRON- have be induct into the women 's hall of fame and florida 's writer 's hall of fame
as a woman a minority and a former english teacher -PRON- pay tribute to zora neale hurston for all of -PRON- achievement and for put woman 's literary accomplishment on the map
-PRON- be not the only one to applaud zora for all that -PRON- achieve for -PRON- writing have also be instrumental in inspire the zora nea

#### Learn bigrams in speeches and save model to disk
Bigrams are any two words that often go together. For example, Maastricht treaty would be converted to maastricht_treaty with such a model.
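As far as I understand it, gensim's `Phrases` decides which pairs to join by scoring co-occurrence counts, and its default scorer follows the formula of Mikolov et al.: (count(a, b) - min_count) * vocabulary size / (count(a) * count(b)). Pairs scoring above a threshold are joined. The sketch below re-implements that idea in plain Python on a toy corpus. It is not gensim's code: the function name, the corpus and `min_count=1` are illustrative assumptions.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score each adjacent pair: (count(a,b) - min_count) * vocab / (count(a) * count(b))."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

# Hypothetical lemmatized sentences
toy_sentences = [
    ["the", "maastricht", "treaty", "be", "sign"],
    ["the", "house", "debate", "the", "maastricht", "treaty"],
    ["the", "treaty", "pass"],
]

scores = score_bigrams(toy_sentences)
best_pair = max(scores, key=scores.get)  # ('maastricht', 'treaty')
```

Here "the maastricht" occurs as often as "maastricht treaty", but scores lower because "the" is frequent everywhere. This is how the score favours genuine collocations over pairs that merely contain a common word.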

In [10]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
bigram_model_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'bigram_model_all')
if False:
    # Open unigram sentences as a stream
    unigram_sentences = LineSentence(unigram_sentences_filepath)
    bigram_model = Phrases(unigram_sentences)
    bigram_model.save(bigram_model_filepath)
else:
    # load the finished model from disk
    bigram_model = Phrases.load(bigram_model_filepath)
# Phraser class is much faster than Phrases
bigram_phraser = Phraser(bigram_model)

CPU times: user 1min 29s, sys: 836 ms, total: 1min 29s
Wall time: 1min 30s


#### Identify bigrams in the speeches and save in txt file

In [11]:
%%time
# this is a bit time consuming (takes about 20 mins) - make the if statement True
# if you want to execute data prep yourself.
bigram_sentences_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'bigram_sentences_all.txt')
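if False:
    # Open unigram sentences as a stream (defined here as well, so this cell
    # also works when the bigram model was loaded from disk instead of trained)
    unigram_sentences = LineSentence(unigram_sentences_filepath)
    with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for unigram_sentence in unigram_sentences:
            bigram_sentence = u' '.join(bigram_phraser[unigram_sentence])
            f.write(bigram_sentence + '\n')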

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 16.2 µs


In [12]:
!tail intermediate/bigram_sentences_all.txt

zora 's writing and -PRON- work as a teacher hollywood scriptwriter and a newspaper_columnist be all instrumental in -PRON- contribution to the american literary landscape
-PRON- be zora 's literary accomplishment -PRON- style of writing and the subject of the african-_american experience that be indispensable in -PRON- major influence on such great contemporary female poet and author such as toni_morrison maya_angelou and alice_walker
after zora 's death in 1960 the popularity of -PRON- writing increase
today zora 's name be highlight in the black female playwrights category and -PRON- have be induct_into the women 's hall of fame and florida 's writer 's hall of fame
as a woman a minority and a former english teacher -PRON- pay_tribute to zora_neale hurston for all of -PRON- achievement and for put woman 's literary accomplishment on the map
-PRON- be not the only one to applaud zora for all that -PRON- achieve for -PRON- writing have also be instrumental in inspire the zora_nea

#### Learn trigrams in speeches and save model to disk
Trigrams are any three words that often go together. For example, House of Commons would be converted to house_of_commons with such a model.
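Because the trigram model is trained on sentences whose bigrams have already been joined, a "trigram" here is a bigram token joined with one more neighbour. The sketch below illustrates the two passes with hard-coded phrase tables. `merge_pairs` and both tables are hypothetical; a real `Phrases` model learns its pairs from counts.

```python
def merge_pairs(tokens, known_pairs, delimiter="_"):
    """Greedily join adjacent tokens that form a known pair, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_pairs:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # both tokens were consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical phrase tables for the first (bigram) and second (trigram) pass
first_pass_pairs = {("zora", "neale")}
second_pass_pairs = {("zora_neale", "hurston")}

sentence = ["pay", "tribute", "to", "zora", "neale", "hurston"]
after_first = merge_pairs(sentence, first_pass_pairs)
# ['pay', 'tribute', 'to', 'zora_neale', 'hurston']
after_second = merge_pairs(after_first, second_pass_pairs)
# ['pay', 'tribute', 'to', 'zora_neale_hurston']
```

A side effect of this design is that the second pass can also join two bigram tokens, so the "trigram" output may contain phrases of up to four words.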

In [13]:
%%time
## Learn a trigram model from bigrammed speeches

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
trigram_model_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_model_all')
if False:
    # Open bigram sentences as a stream
    bigram_sentences = LineSentence(bigram_sentences_filepath)
    trigram_model = Phrases(bigram_sentences)
    trigram_model.save(trigram_model_filepath)
else:
    # load the finished model from disk
    trigram_model = Phrases.load(trigram_model_filepath)
trigram_phraser = Phraser(trigram_model)

CPU times: user 2min 9s, sys: 1.7 s, total: 2min 11s
Wall time: 2min 11s


#### Identify trigrams in the speeches and save in txt file

In [14]:
%%time
## Save speeches as trigrams in txt file

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
trigram_sentences_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_sentences_all.txt')
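if False:
    # Open bigram sentences as a stream (defined here as well, so this cell
    # also works when the trigram model was loaded from disk instead of trained)
    bigram_sentences = LineSentence(bigram_sentences_filepath)
    with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for bigram_sentence in bigram_sentences:
            trigram_sentence = u' '.join(trigram_phraser[bigram_sentence])
            f.write(trigram_sentence + '\n')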
# Open trigrams file as stream
trigram_sentences = LineSentence(trigram_sentences_filepath)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 28.8 µs


In [15]:
!tail intermediate/trigram_sentences_all.txt

zora 's writing and -PRON- work as a teacher hollywood scriptwriter and a newspaper_columnist be all instrumental in -PRON- contribution to the american literary landscape
-PRON- be zora 's literary accomplishment -PRON- style of writing and the subject of the african-_american experience that be indispensable in -PRON- major influence on such great contemporary female poet and author such as toni_morrison maya_angelou and alice_walker
after zora 's death in 1960 the popularity of -PRON- writing increase
today zora 's name be highlight in the black female playwrights category and -PRON- have be induct_into the women 's hall of fame and florida 's writer 's hall of fame
as a woman a minority and a former english teacher -PRON- pay_tribute to zora_neale_hurston for all of -PRON- achievement and for put woman 's literary accomplishment on the map
-PRON- be not the only one to applaud zora for all that -PRON- achieve for -PRON- writing have also be instrumental in inspire the zora_nea

#### Now process all speeches from plain text to unigram (lemmatized), bigram and finally trigram representation
So far we have lemmatized the text and learned the bigram and trigram models on individual sentences. Now we apply all three steps to each complete speech, removing stopwords along the way.

In [16]:
# Add titles, forms of address, the pronoun placeholder and members' last names
# to the stopwords so that they are filtered out
from spacy.en.language_data import STOP_WORDS

extra_stop_words = ["mr.", "mrs.", "ms.", "``", "sir", "madam", "gentleman",
                    "colleague", "gentlewoman", "speaker", "-PRON-"]
last_names = pd.read_hdf("list_of_members.h5", "members").last_name.str.lower().unique()
for word in extra_stop_words + list(last_names):
    STOP_WORDS.add(word)

In [17]:
def clean_text(parsed_speech):
    """
    convert a spaCy-parsed speech into one space-separated string of
    lemmatized tokens with stopwords removed and phrases joined
    """
    # lemmatize the text, removing punctuation and whitespace
    unigram_speech = [token.lemma_ for token in parsed_speech
                      if not punct_space(token)]

    # remove any remaining stopwords
    unigram_speech = [term for term in unigram_speech
                      if term not in STOP_WORDS]

    # apply the bigram and trigram phrase models
    bigram_speech = bigram_phraser[unigram_speech]
    trigram_speech = trigram_phraser[bigram_speech]

    # join the tokens back into a single string (one line per speech)
    trigram_speech = u' '.join(trigram_speech)

    return trigram_speech

In [22]:
clean_text(nlp("I congratulate the gentlewoman from Maryland (Mrs. Morella), the  gentleman from Tennessee (Mr. Gordon), and the gentleman from Michigan  (Mr. Barcia) for their hard work on this legislation. Also, we would  not be here without the assistance and support of the gentleman from  New York (Chairman Boehlert) and his efforts to bring this bill to the  floor. This a timely piece of legislation, Madam Speaker, and I would  urge my colleagues to support the bill.   Madam Speaker, I reserve the balance of my time.   Mr. HALL of Texas. Madam Speaker, I yield such time as he may consume  to the gentleman from Tennessee (Mr. Gordon), who was ranking member on  the Subcommittee on Environment, Technology, and Standards back when  this legislation first began and wrote the electronic authentication  provisions in it. He is now ranking member on the Subcommittee on Space  and Aeronautics.   Mr. HALL of Texas. Madam Speaker, I have no further requests for  time, and I yield back the balance of my time."))

'congratulate maryland tennessee michigan hard work legislation assistance support york chairman effort bring bill floor timely piece legislation urge support bill reserve balance time texas yield time consume tennessee rank_member subcommittee environment technology standards legislation begin write electronic_authentication provision rank_member subcommittee aeronautics texas request time yield balance time'

In [23]:
%%time

# this is a bit time consuming (takes about 2h) - make the if statement True
# if you want to execute data prep yourself.
trigram_speeches_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_transformed_speeches_all.txt')
if True:
    with codecs.open(trigram_speeches_filepath, 'w', encoding='utf_8') as f:  
        for parsed_speech in nlp.pipe(line_speech(speeches_filepath),
                                      batch_size=10000, n_threads=4):
            f.write(clean_text(parsed_speech) + '\n')

CPU times: user 4h 1min 18s, sys: 9.39 s, total: 4h 1min 28s
Wall time: 1h 33min 46s


In [24]:
!tail -n 2 intermediate/trigram_transformed_speeches_all.txt

rise support california loretta duly_elect 46th_district california 10 month ago certify republican secretary state 10 month ago important bring close committee house oversight hear special session evidence close california win 900 vote plurality duly_elect let bring close let serve people district work american people
ros_lehtinen rise honor african-_american 's influential significant voice 20th_century zora_neale_hurston zora renowned distinguished writer interpreter southern african_american culture serve today 40 year death experienced role_model woman nation work contribution american culture literature fit commemorative_stamp recognize zora 's contribution american life beautiful elementary_school congressional_district gifted_artist privilege speak boy girl talented teacher staff daily work play learn zora_neale_hurston come age literature time woman recently right vote recognition female literary writer especially african_american woman unheard zora 's success ability overcom