
# Analyse all House of Commons speeches since 1970

[Part 1: Get a list of MPs and their affiliations](MP_speeches-Part1.ipynb)

[Part 2: Download all speeches belonging to MPs in list](MP_speeches-Part2.ipynb)

## Part 3: Train bigram and trigram models and use them on all speeches

[Part 4: Train an LDA topic model and process all speeches with it](MP_speeches-Part4.ipynb)

[Part 5: Analyse the results of the LDA model](MP_speeches-Part5.ipynb)

Bigrams (trigrams) are sequences of two (three) words that frequently occur together.
For example, such a model converts "Maastricht treaty" to `maastricht_treaty` and "House of Commons" to `house_of_commons`.

In [1]:
import pandas as pd

In [2]:
# Load the list of MPs from Part 1
mps = pd.read_hdf("list_of_mps.h5", "mps")

In [3]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
import codecs
import os

#### We use SpaCy to tokenize and POS tag each speech

In [4]:
if False:
    # Load english language model from spacy
    import spacy
    nlp = spacy.load("en")

In [43]:
# Directory to store Phrase models
from config import INTERMEDIATE_DIRECTORY

#### Load all the speeches and metadata into a pandas dataframe and save into an hdf file

In [7]:
### Uncomment these bash commands to copy all speeches into one csv file.
### This is essential if you want to run the next cell for the first time!

#!echo "body,date,debate_title,mp_constituency,mp_id,mp_name,mp_party,section_id,speech_id,speech_url,subsection_id,time" > ./speeches/speeches.csv
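### Assumption: the per-MP files (mp-*) from Part 2 live in ./speeches/; adjust the path if not.
#!tail -n +2 -q ./speeches/mp-* >> ./speeches/speeches.csv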

In [8]:
%%time
### Change False to True to compute. This may take a while!
if False:
    import pandas as pd
    # Did you run the command above first?
    try:
        speeches = pd.read_csv("./speeches/speeches.csv")
    except FileNotFoundError:
        raise FileNotFoundError("speeches.csv not found. Did you run the bash commands in the cell above?")
    
    # Strip honorifics from names
    import re
    honorifics = r'(Mr|Mrs|Ms|Miss|Advocate|Ambassador|Baron|Baroness|Brigadier|Canon|Captain|Chancellor|Chief|Col|Comdr|Commodore|Councillor|Count|Countess|Dame|Dr|Duke of|Earl|Earl of|Father|General|Group Captain|H R H the Duchess of|H R H the Duke of|H R H The Princess|HE Mr|HE Senora|HE The French Ambassador M|His Highness|His Hon|His Hon Judge|Hon|Hon Ambassador|Hon Dr|Hon Lady|Hon Mrs|HRH|HRH Sultan Shah|HRH The|HRH The Prince|HRH The Princess|HSH Princess|HSH The Prince|Judge|King|Lady|Lord|Lord and Lady|Lord Justice|Lt Cdr|Lt Col|Madam|Madame|Maj|Maj Gen|Major|Marchesa|Marchese|Marchioness|Marchioness of|Marquess|Marquess of|Marquis|Marquise|Master|Mr and Mrs|Mr and The Hon Mrs|President|Prince|Princess|Princessin|Prof|Prof Emeritus|Prof Dame|Professor|Queen|Rabbi|Representative|Rev Canon|Rev Dr|Rev Mgr|Rev Preb|Reverend|Reverend Father|Right Rev|Rt Hon|Rt Hon Baroness|Rt Hon Lord|Rt Hon Sir|Rt Hon The Earl|Rt Hon Viscount|Senator|Sir|Sister|Sultan|The Baroness|The Countess|The Countess of|The Dowager Marchioness of|The Duchess|The Duchess of|The Duke of|The Earl of|The Hon|The Hon Mr|The Hon Mrs|The Hon Ms|The Hon Sir|The Lady|The Lord|The Marchioness of|The Princess|The Reverend|The Rt Hon|The Rt Hon Lord|The Rt Hon Sir|The Rt Hon The Lord|The Rt Hon the Viscount|The Rt Hon Viscount|The Venerable|The Very Rev Dr|Very Reverend|Viscondessa|Viscount|Viscount and Viscountess|Viscountess|W Baron|W/Cdr)'
    # Match any honorific as a whole word followed by a space
    h = re.compile(r"\b" + honorifics + r" \b")
    speeches["mp_name"] = speeches["mp_name"].str.replace(h, "", regex=True)
    speeches["body"] = speeches["body"].fillna("")
    # Concatenate all speeches by a particular MP in a particular debate into one
    speeches = speeches.groupby(['section_id', 'mp_name', "mp_id", 'debate_title', 'date']).apply(lambda x: " ".join(x.body)).reset_index()
    speeches = speeches.rename(columns={0:"body"})
    
    # Split into different parts of the hdf file because HDF seems to struggle with huge data frames
    speeches[:len(speeches) // 2].to_hdf("raw_speeches.h5", "speeches_0", mode="w")
    speeches[len(speeches) // 2:].to_hdf("raw_speeches.h5", "speeches_1", mode="a")
    # Remove from memory because it may be too big and unnecessary
    del speeches

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.48 µs
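
The honorific stripping in the cell above can be tried out on a much shorter list. This is a minimal sketch using only the standard library; the list `HONORIFICS` and the function name `strip_honorifics` are hypothetical and chosen for illustration.

```python
import re

# Hypothetical short list; the notebook uses a far longer one.
HONORIFICS = ["Mr", "Mrs", "Ms", "Dr", "Sir", "Rt Hon"]

# An honorific only counts when it is a whole word followed by a space
# and the start of another word, so "Dr" does not match inside "Drake".
PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(h) for h in HONORIFICS) + r") \b")


def strip_honorifics(name):
    """Remove every honorific from a name, including stacked ones."""
    return PATTERN.sub("", name)


for raw in ["Mr John Smith", "Rt Hon Sir Alan Jones", "Morris Drake"]:
    print(repr(raw), "->", repr(strip_honorifics(raw)))
```

Because the substitution is applied to every match, stacked titles such as "Rt Hon Sir" are removed in a single call.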


#### Save speeches alone to a text file to speed up processing

In [9]:
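# Save speeches to txt file first
speeches_filepath = os.path.join(INTERMEDIATE_DIRECTORY, "speeches.txt")
# Set to True if you want to run this again
if False:
    # `speeches` was deleted above, so reload both halves from the HDF file
    speeches = pd.concat([pd.read_hdf("raw_speeches.h5", "speeches_0"),
                          pd.read_hdf("raw_speeches.h5", "speeches_1")])
    with codecs.open(speeches_filepath, "w", encoding="utf_8") as f:
        for speech in speeches["body"]:
            # Escape line breaks so that each speech stays on a single line
            f.write(speech.replace("\n", "\\n") + "\n")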

#### Create a bunch of helper functions to help us read speeches from the file, remove punctuation and whitespace and lemmatize words

In [52]:
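%%writefile helper_functions.py
import codecs

import spacy

# Load the English language model (shortcut name as used earlier in this notebook)
nlp = spacy.load("en")
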
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_speech(filename):
    """
    generator function to read in speeches from the file
    and un-escape the original line breaks in the text
    """
    
    with codecs.open(filename, encoding='utf_8') as f:
        for speech in f:
            yield speech.replace('\\n', '\n')

def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse speeches,
    lemmatize the text, and yield sentences
    """
    
    for parsed_speech in nlp.pipe(line_speech(filename),
                                  batch_size=10000, n_threads=4):
        
        for sent in parsed_speech.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

Overwriting helper_functions.py
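
`line_speech` un-escapes `\\n` sequences, which only round-trips if line breaks were escaped when the file was written. The following is a minimal sketch of that round trip using only the standard library; `write_speeches` and `read_speeches` are hypothetical names, and unlike `line_speech` the reader here also strips the trailing newline of each line.

```python
import os
import tempfile


def write_speeches(path, speeches):
    """Write one speech per line, escaping embedded line breaks."""
    with open(path, "w", encoding="utf-8") as f:
        for speech in speeches:
            f.write(speech.replace("\n", "\\n") + "\n")


def read_speeches(path):
    """Yield speeches one at a time, restoring the original line breaks."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").replace("\\n", "\n")


speeches = ["First paragraph.\nSecond paragraph.", "A one-line speech."]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "speeches.txt")
    write_speeches(path, speeches)
    restored = list(read_speeches(path))

print(restored == speeches)
```

One caveat of this simple scheme: a speech that already contains a literal backslash followed by `n` would be turned into a line break on reading.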


#### Lemmatize all words in speeches and store them in text file to save memory
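Lemmatization maps each word to its dictionary form (lemma), for example "was" to "be" and "issues" to "issue". Unlike stemming, it does not simply strip word endings. In the output below, spaCy also replaces pronouns with the placeholder `-PRON-`.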
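
To make the difference from plain suffix stripping concrete, here is a toy comparison. It is only a sketch: `LEMMA_TABLE`, `toy_lemmatize` and `toy_stem` are hypothetical, hand-written stand-ins, whereas spaCy relies on much larger language resources and part-of-speech information.

```python
# Hypothetical hand-written lookup table, for illustration only.
LEMMA_TABLE = {
    "was": "be",
    "is": "be",
    "has": "have",
    "issues": "issue",
    "discussed": "discuss",
}


def toy_lemmatize(tokens):
    """Look each lower-cased token up in the table; keep it if unknown."""
    return [LEMMA_TABLE.get(t.lower(), t.lower()) for t in tokens]


def toy_stem(tokens):
    """Crude suffix stripping, shown only for contrast."""
    stems = []
    for token in tokens:
        token = token.lower()
        for suffix in ("ed", "es", "s"):
            # Leave very short words alone so that "was" is not cut to "wa".
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return stems


tokens = ["Issues", "was", "discussed"]
print(toy_lemmatize(tokens))
print(toy_stem(tokens))
```

The lookup maps "was" to "be", which no amount of suffix stripping can do. This is why the lemmatized sentences below read like "the hon gentleman be entirely correct".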

In [10]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
unigram_sentences_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'unigram_sentences_all.txt')
if False:
    with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(speeches_filepath):
            f.write(sentence + '\n')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 17.2 µs


In [30]:
!tail intermediate/unigram_sentences_all.txt

-PRON- just nip to the gentleman’s—
to ask the secretary of state for foreign and commonwealth affairs what recent discussion -PRON- have have with the danish government about the forthcoming referendum on the maastricht treaty
-PRON- know that whenever mean testing be put in place there be a cost because of the bureaucracy that be need to administer -PRON-
do -PRON- right hon
friend have any idea how much this particular method of mean testing will cost
the hon gentleman be entirely correct
-PRON- be particularly unfortunate that some of those who have be discuss at great length quite minor issue in the bill consider that -PRON- contain some serious issue that -PRON- could have be discuss say that -PRON- opposition to the original fur farming legislation be because -PRON- be a private member 's bill when -PRON- should have be a government bill
now -PRON- be a government bill and although -PRON- can not undertake to give the hon
gentleman full detail tomorrow -PRON- can confirm

#### Learn bigrams in speeches and save model to disk
Bigrams are pairs of adjacent words that occur together more often than their individual frequencies would suggest. For example, such a model converts "Maastricht treaty" to `maastricht_treaty`.
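
As a rough illustration of how such pairs can be found, the sketch below scores adjacent word pairs with the formula gensim documents for its default scorer, `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`, and joins pairs whose score exceeds a threshold. This is a simplified, pure-Python stand-in rather than gensim's implementation: `score_bigrams` and `merge_bigrams` are hypothetical names, the vocabulary size is taken to be the number of distinct words, and `min_count=1` and `threshold=2.0` are chosen for the toy corpus (gensim's defaults are `min_count=5` and `threshold=10.0`).

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score every adjacent word pair in a list of tokenised sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)  # simplification: distinct words only
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }


def merge_bigrams(sentence, scores, threshold):
    """Join adjacent pairs whose score exceeds the threshold with '_'."""
    merged = []
    i = 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


sentences = [
    ["the", "maastricht", "treaty", "be", "sign"],
    ["-PRON-", "debate", "the", "maastricht", "treaty"],
    ["the", "bill", "be", "debate"],
    ["the", "government", "be", "right"],
]
scores = score_bigrams(sentences, min_count=1)
merged = merge_bigrams(sentences[0], scores, threshold=2.0)
print(merged)
```

In this toy corpus "the maastricht" occurs as often as "maastricht treaty", but it scores lower because "the" is frequent everywhere, so only `maastricht_treaty` is joined.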

In [25]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
bigram_model_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'bigram_model_all')
if False:
    # Open unigram sentences as a stream
    unigram_sentences = LineSentence(unigram_sentences_filepath)
    bigram_model = Phrases(unigram_sentences)
    bigram_model.save(bigram_model_filepath)
else:
    # load the finished model from disk
    bigram_model = Phrases.load(bigram_model_filepath)

CPU times: user 3.61 s, sys: 208 ms, total: 3.82 s
Wall time: 3.84 s


#### Identify bigrams in the speeches and save in txt file

In [12]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
bigram_sentences_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'bigram_sentences_all.txt')
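if False:
    # Open unigram sentences as a stream (in case the training cell was not re-run)
    unigram_sentences = LineSentence(unigram_sentences_filepath)
    with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for unigram_sentence in unigram_sentences:
            bigram_sentence = u' '.join(bigram_model[unigram_sentence])
            f.write(bigram_sentence + '\n')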

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 17.6 µs


In [29]:
!tail intermediate/bigram_sentences_all.txt

-PRON- just nip to the gentleman’s—
to ask the secretary of state for foreign and commonwealth_affairs what recent_discussion -PRON- have have with the danish government about the forthcoming referendum on the maastricht_treaty
-PRON- know that whenever mean_testing be put in place there be a cost because of the bureaucracy that be need to administer -PRON-
do -PRON- right hon
friend have any idea how much this particular method of mean_testing will cost
the hon gentleman be entirely correct
-PRON- be particularly unfortunate that some of those who have be discuss at great length quite minor issue in the bill consider that -PRON- contain some serious issue that -PRON- could have be discuss say that -PRON- opposition to the original fur_farming legislation be because -PRON- be a private member 's bill when -PRON- should have be a government bill
now -PRON- be a government bill and although -PRON- can not undertake to give the hon
gentleman full detail tomorrow -PRON- can confirm

#### Learn trigrams in speeches and save model to disk
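Trigrams are sequences of three words that frequently occur together. They are learned by training a second phrase model on the bigram output, so that an existing bigram can be joined with a neighbouring word. For example, such a model converts "House of Commons" to `house_of_commons`.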
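
The two-pass idea can be sketched without any trained model. In the example below the phrase sets `first_pass` and `second_pass` are hypothetical stand-ins for what the bigram and trigram models might have learned, and `merge_pairs` is an illustrative helper, not part of gensim.

```python
def merge_pairs(tokens, phrases):
    """Join adjacent tokens with '_' whenever the pair is a known phrase."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase sets standing in for the two trained models.
first_pass = {("of", "commons")}
second_pass = {("house", "of_commons")}

tokens = ["the", "house", "of", "commons", "vote"]
bigrammed = merge_pairs(tokens, first_pass)
trigrammed = merge_pairs(bigrammed, second_pass)
print(bigrammed)
print(trigrammed)
```

Because the second pass treats every token alike, it can also join two existing bigrams into a four-word phrase.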

In [13]:
%%time
## Learn a trigram model from bigrammed speeches

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
trigram_model_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_model_all')
if False:
    # Open bigram sentences as a stream
    bigram_sentences = LineSentence(bigram_sentences_filepath)
    trigram_model = Phrases(bigram_sentences)
    trigram_model.save(trigram_model_filepath)
else:
    # load the finished model from disk
    trigram_model = Phrases.load(trigram_model_filepath)

CPU times: user 4.04 s, sys: 180 ms, total: 4.22 s
Wall time: 4.22 s


#### Identify trigrams in the speeches and save in txt file

In [14]:
%%time
## Save speeches as trigrams in txt file

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
trigram_sentences_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_sentences_all.txt')
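if False:
    # Open bigram sentences as a stream (in case the training cell was not re-run)
    bigram_sentences = LineSentence(bigram_sentences_filepath)
    with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for bigram_sentence in bigram_sentences:
            trigram_sentence = u' '.join(trigram_model[bigram_sentence])
            f.write(trigram_sentence + '\n')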
# Open trigrams file as stream
trigram_sentences = LineSentence(trigram_sentences_filepath)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 20.3 µs


In [31]:
!tail intermediate/trigram_sentences_all.txt

-PRON- just nip to the gentleman’s—
to ask the secretary of state for foreign and commonwealth_affairs what_recent_discussion -PRON- have have with the danish government about the forthcoming referendum on the maastricht_treaty
-PRON- know that whenever mean_testing be put in place there be a cost because of the bureaucracy that be need to administer -PRON-
do -PRON- right hon
friend have any idea how much this particular method of mean_testing will cost
the hon gentleman be entirely correct
-PRON- be particularly unfortunate that some of those who have be discuss at great length quite minor issue in the bill consider that -PRON- contain some serious issue that -PRON- could have be discuss say that -PRON- opposition to the original fur_farming legislation be because -PRON- be a private member 's bill when -PRON- should have be a government bill
now -PRON- be a government bill and although -PRON- can not undertake to give the hon
gentleman full detail tomorrow -PRON- can confirm

#### Now process all speeches from plain text to unigram (lemmatized), bigram and finally trigram representation
Depending on how you ran this notebook, the previous steps may have covered only some of the speeches. The cells below parse every speech again from the plain text, which is not very efficient, so you may want to modify them.

In [37]:
# Stop word list location in spaCy 2+ (spaCy 1.x: spacy.en.language_data.STOP_WORDS)
from spacy.lang.en.stop_words import STOP_WORDS

def clean_text(parsed_speech):
    # lemmatize the text, removing punctuation and whitespace
    unigram_speech = [token.lemma_ for token in parsed_speech
                      if not punct_space(token)]

    # apply the bigram and trigram phrase models
    bigram_speech = bigram_model[unigram_speech]
    trigram_speech = trigram_model[bigram_speech]

    # remove any remaining stopwords
    trigram_speech = [term for term in trigram_speech
                      if term not in STOP_WORDS]

    # join the remaining terms into a single space-separated line
    return u' '.join(trigram_speech)

In [56]:
%%time

# this is a bit time consuming (takes about 2h) - make the if statement True
# if you want to execute data prep yourself.
trigram_speeches_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_transformed_speeches_all.txt')
if False:
    with codecs.open(trigram_speeches_filepath, 'w', encoding='utf_8') as f:  
        for parsed_speech in nlp.pipe(line_speech(speeches_filepath),
                                      batch_size=10000, n_threads=4):
            f.write(clean_text(parsed_speech) + '\n')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 16.9 µs


In [41]:
!tail -n 2 intermediate/trigram_transformed_speeches_all.txt

exactly happen dog lift -PRON- find amazing -PRON- appreciate -PRON- hon friend desperate -PRON- want consider -PRON- previous point -PRON- amendment 10 driver power stop vehicle order blind person dog -PRON- seriously mean -PRON- prepared table amendment bill aim protect blind disabled_people allow driver stop middle order dog blind disabled person dog irritate driver -PRON- seriously propose -PRON- agree thrust -PRON- hon friend 's argument -PRON- right point -PRON- perfectly_possible rational fear dog example person bite close -PRON- attack fear constitute medical_condition -PRON- perfectly_rational understandable respect -PRON- hon friend member hendon_mr._dismore hon member somerton frome_mr._heath think -PRON- possible new_clause -PRON- glimpse real meaning compassionate_conservatism
root_cause problem london_centric policy country -PRON- doubt individual board represent apart_from elitist group individual represent london fact matter whatev provocation -PRON- elect north_east m