<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Part-4:-Train-an-LDA-topic-model-and-process-all-speeches-with-it" data-toc-modified-id="Part-4:-Train-an-LDA-topic-model-and-process-all-speeches-with-it-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Part 4: Train an LDA topic model and process all speeches with it</a></span><ul class="toc-item"><li><span><a href="#Learn-the-dictionary-(list-of-words)-for-the-whole-corpus" data-toc-modified-id="Learn-the-dictionary-(list-of-words)-for-the-whole-corpus-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Learn the dictionary (list of words) for the whole corpus</a></span></li><li><span><a href="#Turn-speeches-into-bag-of-words-representations" data-toc-modified-id="Turn-speeches-into-bag-of-words-representations-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Turn speeches into bag-of-words representations</a></span></li><li><span><a href="#Train-the-LDA-topic-model-on-the-speech-corpus" data-toc-modified-id="Train-the-LDA-topic-model-on-the-speech-corpus-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Train the LDA topic model on the speech corpus</a></span></li><li><span><a href="#Let's-explore-all-the-topics-in-the-LDA-model-we-just-created" data-toc-modified-id="Let's-explore-all-the-topics-in-the-LDA-model-we-just-created-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Let's explore all the topics in the LDA model we just created</a></span></li><li><span><a href="#Visualise-the-LDA-model-using-pyLDAvis" data-toc-modified-id="Visualise-the-LDA-model-using-pyLDAvis-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Visualise the LDA model using pyLDAvis</a></span></li><li><span><a href="#Load-previously-computed-bigram-and-trigram-models" data-toc-modified-id="Load-previously-computed-bigram-and-trigram-models-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Load previously computed bigram and trigram models</a></span></li><li><span><a 
href="#...and-the-helper-functions" data-toc-modified-id="...and-the-helper-functions-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>...and the helper functions</a></span></li><li><span><a href="#Apply-LDA-model-to-all-speeches-in-the-speeches-dataframe-and-save-to-disk" data-toc-modified-id="Apply-LDA-model-to-all-speeches-in-the-speeches-dataframe-and-save-to-disk-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Apply LDA model to all speeches in the speeches dataframe and save to disk</a></span></li></ul></li></ul></div>

# Analyse all House of Commons speeches since 1970

[Part 1: Get a list of MPs and their affiliations](MP_speeches-Part1.ipynb)

[Part 2: Download all speeches belonging to MPs in list](MP_speeches-Part2.ipynb)

[Part 3: Train bigram and trigram models and use them on all speeches](MP_speeches-Part3.ipynb)

## Part 4: Train an LDA topic model and process all speeches with it

[Part 5: Analyse the results of the LDA model](MP_speeches-Part5.ipynb)

In [1]:
import pandas as pd
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.word2vec import LineSentence
import gensim
import os
import warnings
warnings.filterwarnings('ignore')

from config import INTERMEDIATE_DIRECTORY

### Learn the dictionary (list of words) for the whole corpus

In [2]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to relearn the dictionary.
trigram_speeches_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_transformed_speeches_all.txt')
trigram_dictionary_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_dict_all.dict')
if False:
    trigram_speeches = LineSentence(trigram_speeches_filepath)

    # learn the dictionary by iterating over all of the speeches
    trigram_dictionary = Dictionary(trigram_speeches)
    
    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)
else: 
    # load the finished dictionary from disk
    trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

CPU times: user 36 ms, sys: 20 ms, total: 56 ms
Wall time: 65.7 ms
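To make the filtering step above concrete, here is a minimal pure-Python sketch of the rule that `filter_extremes(no_below=..., no_above=...)` applies: a token is kept only if it appears in at least `no_below` documents and in no more than a fraction `no_above` of all documents. The function `filter_extremes_sketch` and the toy documents are made up for illustration and are not gensim's implementation. As far as I know, gensim additionally caps the vocabulary via `keep_n` (100,000 tokens by default), which this sketch ignores.

```python
from collections import Counter


def filter_extremes_sketch(documents, no_below, no_above):
    """Return the tokens whose document frequency lies within the bounds.

    Hypothetical helper for illustration only, not gensim's code.
    """
    num_docs = len(documents)
    # document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in documents for token in set(doc))
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    }


# toy corpus: every "speech" mentions "minister", so it carries no signal
docs = [
    ["nhs", "hospital", "minister"],
    ["school", "teacher", "minister"],
    ["nhs", "patient", "minister"],
    ["farming", "subsidy", "minister"],
    ["nhs", "school", "minister"],
]

kept = filter_extremes_sketch(docs, no_below=2, no_above=0.8)
print(sorted(kept))  # ['nhs', 'school']
```

Rare tokens such as "hospital" fall below `no_below`, while "minister" is dropped because it appears in every document.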


### Turn speeches into bag-of-words representations

In [3]:
def trigram_bow_generator(filepath):
    """
    generator function to read speeches from a file
    and yield a bag-of-words representation
    """
    
    for speech in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(speech)

In [4]:
%%time
trigram_bow_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_bow_corpus_all.mm')
# this is a bit time consuming - make the if statement True
# if you want to build the bag-of-words corpus yourself.
if False:
    # generate bag-of-words representations for
    # all speeches and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                       trigram_bow_generator(trigram_speeches_filepath))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 16.9 µs
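As a rough illustration of what a bag-of-words representation looks like, the sketch below mimics the shape of `doc2bow` output in plain Python: a sorted list of `(token_id, count)` pairs, with tokens missing from the dictionary silently skipped. The helper `doc2bow_sketch` and the tiny `token2id` mapping are assumptions made for this example only.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs.

    Hypothetical stand-in that only mirrors the output format of
    Dictionary.doc2bow; it is not gensim's implementation.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# made-up vocabulary; in the notebook this comes from trigram_dictionary
token2id = {"climate_change": 0, "energy": 1, "minister": 2}

speech = ["energy", "climate_change", "energy", "unknown_word"]
bow = doc2bow_sketch(speech, token2id)
print(bow)  # [(0, 1), (1, 2)]
```

Word order is discarded and only counts survive, which is why filtered-out tokens never reach the topic model.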


### Train the LDA topic model on the speech corpus
If you want the model to use more or fewer topics, change `num_topics` below. However, bear in mind that you will have to relabel all the topics yourself if you do this.

In [5]:
%%time
## Train the LDA topic model using Gensim
from gensim.models.ldamulticore import LdaMulticore

# this is a bit time consuming (takes about 1h30 mins)- make the if statement True
# if you want to train the LDA model yourself.
lda_model_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'lda_model_all_75')
if True:
    # load the finished bag-of-words corpus from disk
    trigram_bow_corpus = MmCorpus(trigram_bow_filepath)
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus,
                           num_topics=75,
                           id2word=trigram_dictionary,
                           workers=4)
    
    lda.save(lda_model_filepath)
else:
    # load the finished LDA model from disk
    lda = LdaMulticore.load(lda_model_filepath)

CPU times: user 24min 28s, sys: 4min 44s, total: 29min 12s
Wall time: 1h 15min 2s
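Because the number of topics is baked into several file names in this notebook (the saved model, the pyLDAvis pickle and the processed speeches), one possible way to keep them in sync when changing `num_topics` is to derive every path from a single constant. The sketch below is only a suggestion: `NUM_TOPICS`, `artefact_paths` and the `"intermediate"` directory are hypothetical names, not part of the original code.

```python
import os

NUM_TOPICS = 75  # assumed single source of truth for the topic count


def artefact_paths(directory, num_topics):
    """Build the file names that depend on the number of topics.

    Hypothetical helper; the name patterns follow the ones used in
    this notebook.
    """
    return {
        "lda_model": os.path.join(directory, "lda_model_all_{}".format(num_topics)),
        "ldavis_pickle": os.path.join(directory, "pyldavis_{}.p".format(num_topics)),
        "processed_speeches": "processed_speeches_{}.h5".format(num_topics),
    }


# "intermediate" stands in for INTERMEDIATE_DIRECTORY from config.py
paths = artefact_paths("intermediate", NUM_TOPICS)
for name, path in sorted(paths.items()):
    print(name, "->", path)
```

With this layout, retraining with a different topic count cannot silently overwrite or reload artefacts from another run.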


### Let's explore all the topics in the LDA model we just created

In [6]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print('{:20} {}'.format('term', 'frequency'))

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print('{:20} {:.3f}'.format(term, round(frequency, 3)))

In [7]:
explore_topic(0, topn=10)

term                 frequency
health               0.039
health_care          0.024
care                 0.024
patient              0.016
bill                 0.014
medical              0.014
hospital             0.014
service              0.013
medicaid             0.013
provide              0.013


Topic 0 seems to be about health care, judging by the top 10 terms. This method of visualising topics is a bit inconvenient though, and since we have 75 topics to label, let's use pyLDAvis instead.

### Visualise the LDA model using pyLDAvis
pyLDAvis is an interactive visualisation tool that lets us navigate through all the topics in the model.

In [8]:
%%time
import pickle
import pyLDAvis
import pyLDAvis.gensim

ldavis_pickle_path = os.path.join(INTERMEDIATE_DIRECTORY, "pyldavis_75.p")
# Change to True if you want to recalculate the visualisation (takes about 1h)
if True:
    trigram_bow_corpus = MmCorpus(trigram_bow_filepath)
    # sort_topics=False keeps the topic order used by the LDA model
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus,
                                              trigram_dictionary, sort_topics=False)
    # use context managers so the pickle file is always closed
    with open(ldavis_pickle_path, "wb") as f:
        pickle.dump(LDAvis_prepared, f)
else:
    with open(ldavis_pickle_path, "rb") as f:
        LDAvis_prepared = pickle.load(f)


CPU times: user 2h 27min 44s, sys: 4h 32min 49s, total: 7h 34s
Wall time: 1h 5min 6s


Subtract 1 from the topic numbers shown in the visualisation to get the corresponding pandas index.
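The off-by-one rule can be written down as a tiny helper. This is a sketch under the assumption that the visualisation was prepared with `sort_topics=False`, as in the cell above; with the default sorting, pyLDAvis reorders topics by prevalence and a plain subtraction would no longer recover the model's topic id. The function name `vis_to_model_topic` is hypothetical.

```python
def vis_to_model_topic(vis_topic_number):
    """Convert a 1-based topic number from the visualisation
    into the 0-based topic index used by the model and pandas.

    Hypothetical helper; only valid when sort_topics=False was used.
    """
    if vis_topic_number < 1:
        raise ValueError("visualisation topic numbers start at 1")
    return vis_topic_number - 1


# topic "1" in the visualisation is topic 0 in the model
print(vis_to_model_topic(1))   # 0
print(vis_to_model_topic(75))  # 74
```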

In [37]:
LDAvis_prepared_ = LDAvis_prepared


In [None]:
type(lda)

In [41]:
LDAvis_prepared_[0] = LDAvis_prepared_[0].query("Freq > 1.0")


TypeError: 'PreparedData' object does not support item assignment

In [9]:
pyLDAvis.display(LDAvis_prepared)


In [10]:
#%%writefile topic_names.py
# Dictionary of topic names
topic_names_100 = {
    0: "business",
    3: "immigration",
    4: "counter terrorism",
    5: "syria",
    6: "private housing",
    7: "banking",
    9: "tribunal",
    18: "bbc",
    19: "police force",
    20: "parliamentary terms",
    21: "secretary of state terms",
    23: "local authority",
    25: "domestic violence",
    26: "airport and rail expansion",
    27: "scotland",
    29: "parliamentary terms+",
    30: "wales",
    32: "drugs and alcohol",
    33: "middle east",
    35: "care quality commission",
    36: "speaker of the house",
    39: "nhs",
    41: "farming",
    42: "law",
    43: "development & climate change",
    44: "fishing industry",
    45: "inquiries & reports",
    46: "northern ireland",
    47: "construction",
    49: "animal welfare",
    60: "fraud terminology",
    62: "legislation",
    63: "bill terminology",
    64: "regional stuff",
    65: "elections",
    68: "local services",
    69: "energy",
    70: "welfare reforms",
    71: "european union",
    72: "education",
    73: "money-related terms",
    74: "pensioner income",
    76: "child poverty",
    78: "sports & culture",
    79: "investment",
    84: "armed forces",
    85: "economy",
    86: "house of lords",
    87: "employee's rights",
    92: "nuclear weapons",
    95: "parliamentary terms++",
    99: "child care"
}

def topic_dict(topic_number):
    """
    return name of topic where identified
    """
    
    try:
        return topic_names_100[topic_number]
    except KeyError:
        return topic_number


### Load previously computed bigram and trigram models

In [11]:
%%time
if True:
    # Load the bigram and trigram models so we can apply this to any new text (Takes about 3 mins)
    bigram_model_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'bigram_model_all')
    trigram_model_filepath = os.path.join(INTERMEDIATE_DIRECTORY, 'trigram_model_all')
    # load the bigram model from disk
    bigram_model = gensim.models.Phrases.load(bigram_model_filepath)
    # load the trigram model from disk
    trigram_model = gensim.models.Phrases.load(trigram_model_filepath)
    # The Phraser class is much faster, so use it instead of Phrases
    bigram_phraser = gensim.models.phrases.Phraser(bigram_model)
    trigram_phraser = gensim.models.phrases.Phraser(trigram_model)


CPU times: user 3min 40s, sys: 3.12 s, total: 3min 43s
Wall time: 3min 44s


### ...and the helper functions

In [12]:
# %load helper_functions.py
import codecs  # needed by line_speech below

def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_speech(filename):
    """
    generator function to read in speeches from the file
    and un-escape the original line breaks in the text
    """
    
    with codecs.open(filename, encoding='utf_8') as f:
        for speech in f:
            yield speech.replace('\\n', '\n')

def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse speeches,
    lemmatize the text, and yield sentences
    """
    
    for parsed_speech in nlp.pipe(line_speech(filename),
                                  batch_size=10000, n_threads=4):
        
        for sent in parsed_speech.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])


In [13]:
if True:
    # Load english language model from spacy
    import spacy
    nlp = spacy.load("en")

# Load last names and pronouns into stopwords
from spacy.en.language_data import STOP_WORDS

for word in ["mr.", "mrs.", "ms.", "``", "sir", "madam", "gentleman", "colleague", "gentlewoman", "speaker", "-PRON-"] + list(pd.read_hdf("list_of_members.h5", "members").last_name.str.lower().unique()):
    STOP_WORDS.add(word)

def clean_text(speech_text):
    """
    Lemmatize, remove stop words and apply the bigram and trigram phrase
    models, returning the speech as a list of tokens
    """
    
    # parse the speech text with spaCy
    parsed_speech = nlp(speech_text)
    
    # lemmatize the text, removing punctuation and whitespace
    unigram_speech = [token.lemma_ for token in parsed_speech
                      if not punct_space(token)]

    # remove any remaining stopwords
    unigram_speech = [term for term in unigram_speech
                      if term not in STOP_WORDS]
    
    # apply the bigram and trigram phrase models
    bigram_speech = bigram_phraser[unigram_speech]
    trigram_speech = trigram_phraser[bigram_speech]

    return trigram_speech
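To show why the bigram model is applied before the trigram model, here is a simplified pure-Python sketch of phrase merging: adjacent tokens that form a known phrase are joined with an underscore, and a second pass can then join an already merged bigram with its neighbour. The helper `apply_phrases_sketch` and the hand-written phrase sets are assumptions for illustration. The real gensim phrase models learn their phrases from co-occurrence statistics rather than from a fixed set.

```python
def apply_phrases_sketch(tokens, phrases, delimiter="_"):
    """Join adjacent token pairs that appear in `phrases`.

    Hypothetical, simplified stand-in for applying a phrase model.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # both tokens were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# made-up phrase sets; the second pass sees the output of the first
bigrams = {("health", "care"), ("prime", "minister")}
trigrams = {("health_care", "reform")}

tokens = ["prime", "minister", "back", "health", "care", "reform"]
bigram_tokens = apply_phrases_sketch(tokens, bigrams)
trigram_tokens = apply_phrases_sketch(bigram_tokens, trigrams)

print(bigram_tokens)   # ['prime_minister', 'back', 'health_care', 'reform']
print(trigram_tokens)  # ['prime_minister', 'back', 'health_care_reform']
```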


In [14]:
def lda_description(speech_text):
    """
    accept the original text of a speech and (1) parse it with spaCy,
    (2) apply text pre-processing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) return a series containing all the topics and their probabilities in the LDA representation
    """
    
    import numpy as np
    
    # Get clean representation of text
    trigram_speech = clean_text(speech_text)

    # create a bag-of-words representation
    speech_bow = trigram_dictionary.doc2bow(trigram_speech)
    
    # create an LDA representation
    speech_lda = lda[speech_bow]
    
    topic_dict = dict(zip(range(75), [0.0]*75))
    
    for topic in speech_lda:
        topic_dict[topic[0]] = topic[1]
    topic_dict["n_words"] = len(trigram_speech)
    return pd.Series(topic_dict).astype(np.float16)


In [15]:
lda_description("Climate change is a disaster and must be averted")#.rename(lambda x: topic_dict(x)).sort_values(ascending=False)


0          0.0
1          0.0
2          0.0
3          0.0
4          0.0
5          0.0
6          0.0
7          0.0
8          0.0
9          0.0
10         0.0
11         0.0
12         0.0
13         0.0
14         0.0
15         0.0
16         0.0
17         0.0
18         0.0
19         0.0
20         0.0
21         0.0
22         0.0
23         0.0
24         0.0
25         0.0
26         0.0
27         0.0
28         0.0
29         0.0
          ... 
46         0.0
47         0.0
48         0.0
49         0.0
50         0.0
51         0.0
52         0.0
53         0.0
54         0.0
55         0.0
56         0.0
57         0.0
58         0.0
59         0.0
60         0.0
61         0.0
62         0.0
63         0.0
64         0.0
65         0.0
66         0.0
67         0.0
68         0.0
69         0.0
70         0.0
71         0.0
72         0.0
73         0.0
74         0.0
n_words    3.0
Length: 76, dtype: float16
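The series above has one entry per topic because `lda_description` expands the sparse LDA output into a dense vector. As I understand it, indexing a gensim LDA model with a bag-of-words returns only `(topic_id, probability)` pairs above a minimum probability threshold, so most topics are simply absent and must be filled with zeros. The sketch below reproduces that step in plain Python; the function name and the example probabilities are made up.

```python
def dense_topic_vector(sparse_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into a dense list.

    Hypothetical helper mirroring the zero-filling in lda_description.
    """
    dense = [0.0] * num_topics
    for topic_id, probability in sparse_topics:
        dense[topic_id] = probability
    return dense


# made-up sparse output: only two topics pass the threshold
sparse = [(43, 0.62), (69, 0.30)]
dense = dense_topic_vector(sparse, num_topics=75)

print(len(dense))           # 75
print(dense[43], dense[69])  # 0.62 0.3
```

A fixed-length vector is what lets every speech share the same columns when the results are written to a single table.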

### Apply LDA model to all speeches in the speeches dataframe and save to disk

In [16]:
%%time
# This takes a while (~5h) so use the cached version if available
# Change to True if you want to recalculate the LDA topics for the speeches
# MAKE SURE YOU DELETE THE PREVIOUS processed_speeches_75.h5 FILE FIRST,
# because the results are appended to it
if True:
    import bcolz # For lazy loading speeches
    from tqdm import tqdm # For a progress bar
    from multiprocessing import Pool # For spreading out topic modelling over several cores
    import numpy as np

    # Load speech metadata
    speeches_meta = pd.read_hdf("speech_metadata.h5", "metadata")
    # Lazy load array of speeches
    speeches = bcolz.open("speeches.bcolz")
    # Remove old file
    #!rm processed_speeches.h5
    # Store max string lengths for later
    max_str_lengths = dict(zip(["doc_title", "id", "speaker"], map(lambda x: speeches_meta[x].str.len().max(), ["doc_title", "id", "speaker"])))
    # Loop over all the speeches in chunks of 1024 strings. This method keeps memory requirements low. If you have less memory, make CHUNK_SIZE smaller
    CHUNK_SIZE = 1024
    for i in tqdm(range(0, len(speeches), CHUNK_SIZE)):
        # Using multiprocessing, create a pool of 8 workers
        with Pool(8) as pool:
            # Apply lda function to each speech in chunk
            df = pd.DataFrame(list(pool.map(lda_description, speeches[i: i+CHUNK_SIZE])))\
                .assign(n_words=lambda x: x.n_words.astype(np.int16))
            # Align index so that there are no duplicate indices in the final dataframe
            df.index = df.index+i
            # Concatenate the speech metadata with the results of the lda topic distribution
            pd.concat([speeches_meta.iloc[i:i+CHUNK_SIZE], df], axis=1).to_hdf("processed_speeches_75.h5",
                                                                       "speeches", append=True,
                                                                       format="table", min_itemsize=max_str_lengths)
            
    ## Check that arrays are aligned and that the stored values match freshly computed ones. Comment this out to skip the check.
    assert (pd.read_hdf("processed_speeches_75.h5", "speeches").iloc[60, range(75)] - lda_description(speeches[60]).loc[range(75)]).abs().mean() < 1e-4
else:
    try:
        del speeches
    except NameError:
        pass

    speeches = pd.read_hdf("processed_speeches_75.h5", "speeches")

100%|██████████| 450/450 [4:07:22<00:00, 31.01s/it]  


IndexError: positional indexers are out-of-bounds
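The chunked loop in the cell above can be reasoned about separately from the LDA work. The following sketch, using a hypothetical `chunk_bounds` helper and a made-up corpus size, shows how `range(0, n, CHUNK_SIZE)` splits the data, why the last chunk may be shorter, and why the number of iterations is the ceiling of `n / CHUNK_SIZE` (which is what the progress bar total reports).

```python
import math


def chunk_bounds(n_items, chunk_size):
    """Yield (start, stop) index pairs covering range(n_items) in chunks.

    Hypothetical helper for illustration; `stop` is clipped so the
    final chunk never runs past the end of the data.
    """
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


# made-up corpus size, just to show a short final chunk
bounds = list(chunk_bounds(n_items=2500, chunk_size=1024))
print(bounds)  # [(0, 1024), (1024, 2048), (2048, 2500)]

# the number of chunks is the ceiling of n_items / chunk_size
print(len(bounds) == math.ceil(2500 / 1024))  # True
```

Each `start` value doubles as the row offset added to the chunk's index, which is how the notebook keeps the indices of the appended dataframes unique.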

In [29]:
(pd.read_hdf("processed_speeches_75.h5", "speeches").iloc[10, range(75)] - lda_description(speeches[10]).loc[range(75)]).abs().mean()# < 1e-4


0.0