<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Part-4:-Train-an-LDA-topic-model-and-process-all-speeches-with-it" data-toc-modified-id="Part-4:-Train-an-LDA-topic-model-and-process-all-speeches-with-it-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Part 4: Train an LDA topic model and process all speeches with it</a></span><ul class="toc-item"><li><span><a href="#Learn-the-dictionary-(list-of-words)-for-the-whole-corpus" data-toc-modified-id="Learn-the-dictionary-(list-of-words)-for-the-whole-corpus-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Learn the dictionary (list of words) for the whole corpus</a></span></li><li><span><a href="#Turn-speeches-into-bag-of-words-representations" data-toc-modified-id="Turn-speeches-into-bag-of-words-representations-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Turn speeches into bag-of-words representations</a></span></li><li><span><a href="#Train-the-LDA-topic-model-on-the-speech-corpus" data-toc-modified-id="Train-the-LDA-topic-model-on-the-speech-corpus-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Train the LDA topic model on the speech corpus</a></span></li><li><span><a href="#Let's-explore-all-the-topics-in-the-LDA-model-we-just-created" data-toc-modified-id="Let's-explore-all-the-topics-in-the-LDA-model-we-just-created-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Let's explore all the topics in the LDA model we just created</a></span></li><li><span><a href="#Visualise-the-LDA-model-using-pyLDAvis" data-toc-modified-id="Visualise-the-LDA-model-using-pyLDAvis-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Visualise the LDA model using pyLDAvis</a></span></li><li><span><a href="#Load-previously-computed-bigram-and-trigram-models" data-toc-modified-id="Load-previously-computed-bigram-and-trigram-models-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Load previously computed bigram and trigram models</a></span></li><li><span><a 
href="#...and-the-helper-functions" data-toc-modified-id="...and-the-helper-functions-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>...and the helper functions</a></span></li><li><span><a href="#Apply-LDA-model-to-all-speeches-in-the-speeches-dataframe-and-save-to-disk" data-toc-modified-id="Apply-LDA-model-to-all-speeches-in-the-speeches-dataframe-and-save-to-disk-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Apply LDA model to all speeches in the speeches dataframe and save to disk</a></span></li></ul></li></ul></div>

# Analyse all House of Commons speeches since 1970

[Part 1: Get a list of MPs and their affiliations](MP_speeches-Part1.ipynb)

[Part 2: Download all speeches belonging to MPs in list](MP_speeches-Part2.ipynb)

[Part 3: Train bigram and trigram models and use them on all speeches](MP_speeches-Part3.ipynb)

## Part 4: Train an LDA topic model and process all speeches with it

[Part 5: Analyse the results of the LDA model](MP_speeches-Part5.ipynb)

In [1]:
import pandas as pd
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.word2vec import LineSentence
import gensim
import os
import warnings
warnings.filterwarnings('ignore')

from config import INTERMEDIATE_DIRECTORY

### Learn the dictionary (list of words) for the whole corpus

In [2]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to relearn the dictionary.
trigram_speeches_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_transformed_speeches_all.txt')
trigram_dictionary_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_dict_all.dict')
if False:
    trigram_speeches = LineSentence(trigram_speeches_filepath)

    # learn the dictionary by iterating over all of the speeches
    trigram_dictionary = Dictionary(trigram_speeches)
    
    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)
else: 
    # load the finished dictionary from disk
    trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

CPU times: user 20 ms, sys: 12 ms, total: 32 ms
Wall time: 31 ms
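
To make the effect of `filter_extremes` and `compactify` more concrete, here is a minimal plain-Python sketch of the same idea on a toy corpus. It is not gensim's implementation: the function name and the toy speeches are made up for illustration, and gensim's additional `keep_n` cap is ignored. The assumption is only that `no_below` is an absolute document count and `no_above` is a fraction of all documents.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in no more
    than `no_above` (a fraction) of all documents, then assign fresh ids."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(tok for tok, df in doc_freq.items()
                  if no_below <= df <= max_docs)
    # contiguous ids with no gaps, analogous to what compactify() guarantees
    return {tok: idx for idx, tok in enumerate(kept)}

# hypothetical toy corpus: one token list per speech
toy_speeches = [
    ["the", "small_business", "tax"],
    ["the", "small_business", "regulation"],
    ["the", "nhs", "tax"],
    ["the", "nhs"],
]
token2id = filter_by_document_frequency(toy_speeches, no_below=2, no_above=0.5)
print(token2id)
```

Here "the" is dropped for being in every speech and "regulation" for appearing in only one, which is the behaviour we rely on to keep the vocabulary manageable.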


### Turn speeches into bag-of-words representations

In [3]:
def trigram_bow_generator(filepath):
    """
    generator function to read speeches from a file
    and yield a bag-of-words representation
    """
    
    for speech in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(speech)

In [4]:
%%time
trigram_bow_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_bow_corpus_all.mm')
# this is a bit time consuming - make the if statement True
# if you want to build the bag-of-words corpus yourself.
if False:
    # generate bag-of-words representations for
    # all speeches and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                       trigram_bow_generator(trigram_speeches_filepath))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 14.5 µs
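
As a rough illustration of what a bag-of-words representation looks like, the sketch below mimics the shape of the `doc2bow` output in plain Python: a sorted list of `(token_id, count)` pairs, with tokens missing from the dictionary silently skipped. The vocabulary and the function name `to_bow` are hypothetical.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# hypothetical vocabulary and speech
token2id = {"business": 0, "cost": 1, "regulation": 2}
speech = ["regulation", "cost", "business", "regulation", "unknown_word"]
bow = to_bow(speech, token2id)
print(bow)
```

Word order is lost in this representation, which is why the phrase models from Part 3 are applied first: terms like small_business survive as single tokens.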


### Train the LDA topic model on the speech corpus
If you want the model to use more or fewer topics, change num_topics below. Bear in mind, however, that you will then have to relabel all the topics yourself.

In [5]:
%%time
## Train the LDA topic model using Gensim
from gensim.models.ldamulticore import LdaMulticore

# this is a bit time consuming (takes about 45 mins) - make the if statement True
# if you want to train the LDA model yourself.
lda_model_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'lda_model_all')
if False:
    # load the finished bag-of-words corpus from disk
    trigram_bow_corpus = MmCorpus(trigram_bow_filepath)
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus,
                           num_topics=100,
                           id2word=trigram_dictionary,
                           workers=4)
    
    lda.save(lda_model_filepath)
else:
    # load the finished LDA model from disk
    lda = LdaMulticore.load(lda_model_filepath)

CPU times: user 324 ms, sys: 100 ms, total: 424 ms
Wall time: 420 ms


### Let's explore all the topics in the LDA model we just created

In [6]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print('{:20} {}'.format('term', 'frequency'))

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print('{:20} {:.3f}'.format(term, round(frequency, 3)))

In [7]:
explore_topic(0, topn=10)

term                 frequency
business             0.101
small_business       0.018
government           0.015
regulation           0.013
cost                 0.013
work                 0.007
sector               0.007
industry             0.007
’s                   0.007
small                0.007


Topic 0 seems to be about business, judging by the top 10 terms. This method of visualising topics is a bit inconvenient though, and since we have 100 topics to label, let's use pyLDAvis instead.

### Visualise the LDA model using pyLDAvis
pyLDAvis is an interactive visualisation tool that lets us navigate through all the topics in the model.

In [8]:
%%time
import pickle
import pyLDAvis
import pyLDAvis.gensim

ldavis_pickle_path = os.path.join(INTERMEDIATE_DIRECTORY, "pyldavis.p")
# Change to True if you want to recalculate the visualisation (takes about 40 mins)
if False:
    trigram_bow_corpus = MmCorpus(trigram_bow_filepath)
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus,
                                              trigram_dictionary, sort_topics=False)
    # use a context manager so the file is closed once the pickle is written
    with open(ldavis_pickle_path, "wb") as f:
        pickle.dump(LDAvis_prepared, f)
else:
    with open(ldavis_pickle_path, "rb") as f:
        LDAvis_prepared = pickle.load(f)

CPU times: user 344 ms, sys: 12 ms, total: 356 ms
Wall time: 356 ms


pyLDAvis numbers topics from 1, so subtract 1 from a topic number in the visualisation to get the index used in pandas.
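
The off-by-one conversion can be sketched as a tiny helper. The function name and the two example labels are hypothetical. The one-to-one mapping assumes the visualisation was prepared with `sort_topics=False`, as in the cell above; otherwise pyLDAvis reorders the topics and the numbers no longer correspond.

```python
def vis_to_model_index(vis_topic_number):
    """Convert a 1-based pyLDAvis topic number to the 0-based model index."""
    if vis_topic_number < 1:
        raise ValueError("pyLDAvis topic numbers start at 1")
    return vis_topic_number - 1

# hypothetical subset of topic labels keyed by model index
example_names = {0: "business", 3: "immigration"}

idx = vis_to_model_index(4)
label = example_names.get(idx, idx)  # fall back to the number if unlabelled
print(idx, label)
```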

In [9]:
pyLDAvis.display(LDAvis_prepared)

In [10]:
#%%writefile topic_names.py
# Dictionary of topic names
topic_names_100 = {
    0: "business",
    3: "immigration",
    4: "counter terrorism",
    5: "syria",
    6: "private housing",
    7: "banking",
    9: "tribunal",
    18: "bbc",
    19: "police force",
    20: "parliamentary terms",
    21: "secretary of state terms",
    23: "local authority",
    25: "domestic violence",
    26: "airport and rail expansion",
    27: "scotland",
    29: "parliamentary terms+",
    30: "wales",
    32: "drugs and alcohol",
    33: "middle east",
    35: "care quality commission",
    36: "speaker of the house",
    39: "nhs",
    41: "farming",
    42: "law",
    43: "development & climate change",
    44: "fishing industry",
    45: "inquiries & reports",
    46: "northern ireland",
    47: "construction",
    49: "animal welfare",
    60: "fraud terminology",
    62: "legislation",
    63: "bill terminology",
    64: "regional stuff",
    65: "elections",
    68: "local services",
    69: "energy",
    70: "welfare reforms",
    71: "european union",
    72: "education",
    73: "money-related terms",
    74: "pensioner income",
    76: "child poverty",
    78: "sports & culture",
    79: "investment",
    84: "armed forces",
    85: "economy",
    86: "house of lords",
    87: "employee's rights",
    92: "nuclear weapons",
    95: "parliamentary terms++",
    99: "child care"
}

def topic_dict(topic_number):
    """
    return name of topic where identified
    """
    
    try:
        return topic_names_100[topic_number]
    except KeyError:
        return topic_number

### Load previously computed bigram and trigram models

In [11]:
%%time
if True:
    # Load the bigram and trigram models so we can apply this to any new text
    bigram_model_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'bigram_model_all')
    trigram_model_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_model_all')
    # load the bigram model from disk
    bigram_model = gensim.models.Phrases.load(bigram_model_filepath)
    # load the trigram model from disk
    trigram_model = gensim.models.Phrases.load(trigram_model_filepath)
    # Phraser class is much faster so use this instead of Phrases
    bigram_phraser = gensim.models.phrases.Phraser(bigram_model)
    trigram_phraser = gensim.models.phrases.Phraser(trigram_model)

CPU times: user 1min 10s, sys: 640 ms, total: 1min 11s
Wall time: 1min 11s
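
For intuition, here is a simplified plain-Python sketch of what applying a phraser does to a token list: adjacent tokens that form a known phrase are joined with an underscore, and running a second pass over the output of the first yields trigrams. This is only an illustration. The real models decide which pairs to join from co-occurrence statistics learned in Part 3, whereas the phrase sets below are hard-coded and hypothetical.

```python
def merge_phrases(tokens, phrases, delimiter="_"):
    """Join adjacent token pairs listed in `phrases` into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # skip both tokens of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# hypothetical phrase sets standing in for the trained models
bigrams = {("small", "business"), ("prime", "minister")}
trigrams = {("small_business", "owner")}

tokens = ["every", "small", "business", "owner", "pay", "tax"]
after_bigrams = merge_phrases(tokens, bigrams)
after_trigrams = merge_phrases(after_bigrams, trigrams)
print(after_trigrams)
```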


### ...and the helper functions

In [12]:
# %load helper_functions.py
import codecs  # needed by line_speech below

def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_speech(filename):
    """
    generator function to read in speeches from the file
    and un-escape the original line breaks in the text
    """
    
    with codecs.open(filename, encoding='utf_8') as f:
        for speech in f:
            yield speech.replace('\\n', '\n')

def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse speeches,
    lemmatize the text, and yield sentences
    """
    
    for parsed_speech in nlp.pipe(line_speech(filename),
                                  batch_size=10000, n_threads=4):
        
        for sent in parsed_speech.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

In [13]:
if True:
    # Load english language model from spacy
    import spacy
    nlp = spacy.load("en")

def clean_text(speech_text):
    """
    Lemmatize the text, apply the bigram and trigram phrase models,
    remove stop words and return the resulting list of tokens
    """
    
    # parse the speech text with spaCy
    parsed_speech = nlp(speech_text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_speech = [token.lemma_ for token in parsed_speech
                      if not punct_space(token)]
    
    # apply the bigram and trigram phrase models
    bigram_speech = bigram_phraser[unigram_speech]
    trigram_speech = trigram_phraser[bigram_speech]
    
    # remove any remaining stopwords
    trigram_speech = [term for term in trigram_speech
                      if term not in spacy.en.language_data.STOP_WORDS]
    
    return trigram_speech

In [14]:
def lda_description(speech_text):
    """
    accept the original text of a speech and (1) parse it with spaCy,
    (2) apply text pre-processing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) return a series containing all the topics and their probabilities
    in the LDA representation, plus the number of tokens (n_words)
    """
    
    # Get clean representation of text
    trigram_speech = clean_text(speech_text)

    #print(trigram_speech)
    # create a bag-of-words representation
    speech_bow = trigram_dictionary.doc2bow(trigram_speech)
    
    # create an LDA representation
    speech_lda = lda[speech_bow]
    
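    # start from zero probability for every topic, then fill in the
    # topics the model returned (named topic_probs so that it does not
    # shadow the topic_dict function defined earlier)
    topic_probs = dict(zip(range(100), [0.0]*100))
    
    for topic_id, probability in speech_lda:
        topic_probs[topic_id] = probability
    topic_probs["n_words"] = len(trigram_speech)
    return pd.Series(topic_probs)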
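
The zero-filling step in the function above can be looked at in isolation. The model returns a sparse list of `(topic_id, probability)` pairs and, as far as we can tell, leaves out topics below a small probability threshold, so the pairs need to be expanded into a fixed-length vector before they can become dataframe columns. A minimal sketch, with a hypothetical function name and made-up model output:

```python
def dense_topic_vector(sparse_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    vector = [0.0] * num_topics
    for topic_id, probability in sparse_topics:
        vector[topic_id] = probability
    return vector

# hypothetical sparse output for a 5-topic model
sparse = [(2, 0.6), (4, 0.3)]
dense = dense_topic_vector(sparse, num_topics=5)
print(dense)
```

Because low-probability topics are omitted, the values in a row will not necessarily sum to 1.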

In [15]:
lda_description("There is a lot of natural gas in the middle east.").rename(lambda x: topic_dict(x)).sort_values(ascending=False)

n_words                         3.000000
energy                          0.435078
middle east                     0.319922
speaker of the house            0.000000
scotland                        0.000000
28                              0.000000
parliamentary terms+            0.000000
wales                           0.000000
31                              0.000000
drugs and alcohol               0.000000
34                              0.000000
care quality commission         0.000000
37                              0.000000
domestic violence               0.000000
38                              0.000000
nhs                             0.000000
40                              0.000000
farming                         0.000000
law                             0.000000
development & climate change    0.000000
fishing industry                0.000000
inquiries & reports             0.000000
northern ireland                0.000000
construction                    0.000000
airport and rail

In [16]:
import pandas as pd
import numpy as np
# Read in details of MPs
mps = pd.read_hdf("list_of_mps.h5", "mps")

### Apply LDA model to all speeches in the speeches dataframe and save to disk

In [17]:
%%time
# This takes a while (~3h) so use cached version if available
# Change to True if you want to recalculate the LDA topics for the speeches
if False:
    import dask.dataframe as dd
    import dask.threaded
    from dask.diagnostics import ProgressBar
    pbar = ProgressBar()
    pbar.register()
    
    # You should probably tweak this line to point to all speech dataframes
    # I ran this section twice using both sets of raw speeches and saved them to two sections of the processed hdf5 file
    speeches = pd.read_hdf("raw_speeches.h5", "speeches_0")
    
    from multiprocessing import Pool

    ## This is better for smaller dataframes
    with Pool(8) as pool:
        speeches = pd.concat([speeches, pd.DataFrame(list(pool.map(lda_description, list(speeches.body))))], axis=1)
    
    # This is better for bigger dataframes but is prone to crashing...
    #speeches = dd.from_pandas(speeches, npartitions=8).map_partitions(lambda x: pd.concat([x, x.body.apply(lda_description, 1)], axis=1)).compute(get=dask.threaded.get)
    
    # And this one to save to the right hdf file. Remember to change it if you don't want to overwrite the file or section.
    speeches.to_hdf("/media/Stuff/processed_speeches_new.h5", "speeches_0")
    
else:
    try:
        del speeches
    except NameError:
        pass
    speeches = pd.read_hdf("/media/Stuff/processed_speeches_new.h5", "speeches_0").drop("body", axis=1)\
        .append([pd.read_hdf("/media/Stuff/processed_speeches_new.h5", "speeches_1").drop("body", axis=1)], ignore_index=True)

CPU times: user 19.9 s, sys: 1.14 s, total: 21.1 s
Wall time: 11min 31s
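
The parallel branch above relies on the pool's `map` returning results in the same order as its inputs, so that each result can be paired with the speech it came from. The sketch below shows that pattern with a thread pool from the standard library as a stand-in for `multiprocessing.Pool`. The function `describe` is a hypothetical placeholder for `lda_description`, and the bodies are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def describe(text):
    """Hypothetical stand-in for lda_description: returns a small feature dict."""
    return {"n_words": len(text.split())}

bodies = ["order order", "the honourable member is right", "hear hear hear"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # map yields results in input order, whichever worker finishes first
    features = list(pool.map(describe, bodies))

# pair each body with its features by position
rows = [dict(body=b, **f) for b, f in zip(bodies, features)]
print(rows)
```

One caveat when doing the same with `pd.concat(..., axis=1)`: pandas aligns on index labels rather than position, so the pairing is only correct if the speeches dataframe has a default 0..n-1 index.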


In [18]:
speeches["date"] = pd.to_datetime(speeches["date"])
speeches["mp_id"] = pd.to_numeric(speeches["mp_id"])
speeches["section_id"] = pd.to_numeric(speeches["section_id"])

In [21]:
speeches.to_hdf("processed_speeches_new.h5", "speeches_0")

The part below was added later because we wanted to split the speeches dataframe in two so that it is easier to load into memory.

In [None]:
# %load separate_speeches_df.py
"""
Method for separating out speeches dataframe into more accessible formats.
We put all the numerical and MP data into an HDF5 table which allows us to query by row
and we put the speeches into a bcolz array which allows us to read directly from disk in an efficient manner
"""
import bcolz
import pandas as pd

# Load speeches
speeches = pd.read_hdf("/media/Stuff/processed_speeches_new.h5", "speeches_0")\
    .append([pd.read_hdf("/media/Stuff/processed_speeches_new.h5", "speeches_1")], ignore_index=True)

# Convert column types
speeches["date"] = pd.to_datetime(speeches["date"])
speeches["mp_id"] = pd.to_numeric(speeches["mp_id"]).astype("category")
speeches["section_id"] = speeches["section_id"].astype(str)
speeches["mp_name"] = speeches["mp_name"].astype(str)
speeches["debate_title"] = speeches["debate_title"].astype(str).astype("category")
speeches["n_words"] = speeches["n_words"].astype("float32")

speeches[list(range(100))] = speeches[list(range(100))].apply(lambda x: x.astype("float32"))

# separate data and speech text into different dataframes
speeches_ = speeches.drop("body", axis=1)
speeches = speeches[["body"]]

# Save data to hdf5
speeches_.to_hdf("speeches.h5", "speeches", mode="w", format="table")

# Save speeches to bcolz array
bcolz.carray(speeches["body"], rootdir="/media/Stuff/speeches.bcolz", chunklen=10000000, cparams=bcolz.cparams(cname="lz4hc"))
