<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Part-2:-Download-all-speeches-belonging-to-MPs-in-list" data-toc-modified-id="Part-2:-Download-all-speeches-belonging-to-MPs-in-list-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Part 2: Download all speeches belonging to MPs in list</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Use-TheyWorkForYou's-API-to-download-all-the-speeches-of-a-particular-MP" data-toc-modified-id="Use-TheyWorkForYou's-API-to-download-all-the-speeches-of-a-particular-MP-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Use TheyWorkForYou's API to download all the speeches of a particular MP</a></span></li><li><span><a href="#Run-the-above-function-in-parallel-for-all-MPs-in-the-list-that-do-not-have-a-speeches-file-yet" data-toc-modified-id="Run-the-above-function-in-parallel-for-all-MPs-in-the-list-that-do-not-have-a-speeches-file-yet-1.0.2"><span class="toc-item-num">1.0.2&nbsp;&nbsp;</span>Run the above function in parallel for all MPs in the list that do not have a speeches file yet</a></span></li></ul></li></ul></li></ul></div>

# Analyse all House of Commons speeches since 1970

[Part 1: Get a list of MPs and their affiliations](MP_speeches-Part1.ipynb)

## Part 2: Download all speeches belonging to MPs in list

[Part 3: Train bigram and trigram models and use them on all speeches](MP_speeches-Part3.ipynb)

[Part 4: Train an LDA topic model and process all speeches with it](MP_speeches-Part4.ipynb)

[Part 5: Analyse the results of the LDA model](MP_speeches-Part5.ipynb)

In [1]:
import pandas as pd

In [2]:
# Load the list of MPs from Part 1
mps = pd.read_hdf("list_of_mps.h5", "mps")

#### Use TheyWorkForYou's API to download all the speeches of a particular MP

In [43]:
mps

Person ID,First name,Last name,Party,Constituency,URI,full_name,clean_name,is_female,party_women,party_mysoc,mp_wikidata,mp_wikidata_id,party_wikidata
10001,Diane,Abbott,Labour,Hackney North and Stoke Newington,https://www.theyworkforyou.com/mp/10001/diane_...,Diane Abbott,dianeabbott,True,Labour,Labour,Diane Abbott_1953,http://www.wikidata.org/entity/Q153454,Labour
10002,Gerry,Adams,Sinn Féin,Belfast West,https://www.theyworkforyou.com/mp/10002/gerry_...,Gerry Adams,gerryadams,False,,Sinn Féin,Gerry Adams_1948,http://www.wikidata.org/entity/Q76139,Sinn Féin
10003,Irene,Adams,Labour,Paisley North,https://www.theyworkforyou.com/mp/10003/mrs_ir...,Irene Adams,ireneadams,True,Labour,Labour,"Irene Adams, Baroness Adams of Craigielea_1947",http://www.wikidata.org/entity/Q334498,Labour
10004,Nick,Ainger,Labour,Carmarthen West and South Pembrokeshire,https://www.theyworkforyou.com/mp/10004/nick_a...,Nick Ainger,nickainger,False,,Labour,Nick Ainger_1949,http://www.wikidata.org/entity/Q325367,Labour
10005,Bob,Ainsworth,Labour,Coventry North East,https://www.theyworkforyou.com/mp/10005/bob_ai...,Bob Ainsworth,bobainsworth,False,,Labour,Bob Ainsworth_1952,http://www.wikidata.org/entity/Q258738,Labour
10006,Peter,Ainsworth,Conservative,East Surrey,https://www.theyworkforyou.com/mp/10006/peter_...,Peter Ainsworth,peterainsworth,False,,Conservative,Peter Ainsworth_1956,http://www.wikidata.org/entity/Q337709,Conservative
10007,Richard,Allan,Liberal Democrat,"Sheffield, Hallam",https://www.theyworkforyou.com/mp/10007/mr_ric...,Richard Allan,richardallan,False,,Liberal Democrat,"Richard Allan, Baron Allan of Hallam_1966",http://www.wikidata.org/entity/Q334455,Liberal Democrat
10008,Graham,Allen,Labour,Nottingham North,https://www.theyworkforyou.com/mp/10008/graham...,Graham Allen,grahamallen,False,,Labour,Graham Allen_1953,http://www.wikidata.org/entity/Q259601,Labour
10009,David,Amess,Conservative,Southend West,https://www.theyworkforyou.com/mp/10009/david_...,David Amess,davidamess,False,,Conservative,David Amess_1952,http://www.wikidata.org/entity/Q259646,Conservative
10010,Michael,Ancram,Conservative,Devizes,https://www.theyworkforyou.com/mp/10010/michae...,Michael Ancram,michaelancram,False,,Conservative,Michael Ancram_1945,http://www.wikidata.org/entity/Q332962,Conservative


In [41]:
def get_mp_speeches(mp_id):
    """Get all speeches of a particular MP from the TheyWorkForYou API and save them to a csv file under speeches/"""
    
    # Store TheyWorkForYou API key in separate config file
    from config import TWFY_API_KEY
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    # Fetch the MP's speeches by TheyWorkForYou person id and collect them in a long-format
    # pandas DataFrame in which each row is one speech at a particular date and time
    all_speeches = pd.DataFrame()
    rows = [1]
    page_no=1
    
    
    # Get date of latest downloaded speech
    try:
        latest_date = pd.read_csv("speeches/mp-{0}.csv".format(mp_id))\
            .assign(date = lambda df: pd.to_datetime(df.date))\
            .sort_values("date", ascending=False).iloc[0].date
        update_speeches = True
        print("Downloading speeches made after {0}".format(latest_date))
    except FileNotFoundError:
        # If we're running this function for the first time, we want to download every speech,
        # so use a cut-off date that is earlier than any speech in the data
        latest_date = pd.to_datetime("1945-01-01")
        update_speeches = False
    
    not_latest_date = True
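    # Stop when the API returns no more rows or when we reach speeches we already have.
    # (With "or", a first-time download would never stop: not_latest_date stays True.)
    while (len(rows) > 0) and not_latest_date:
        # Pass the query as params so that no stray whitespace ends up in the URL
        t = requests.get("https://www.theyworkforyou.com/api/getDebates",
                         params={"key": TWFY_API_KEY,
                                 "type": "commons",
                                 "person": mp_id,
                                 "results_per_page": 1000,
                                 "num": 1000,
                                 "page": page_no,
                                 "output": "js",
                                 "order": "d"})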
        rows = t.json()["rows"]
        speeches = []
        # Loop over each row
        for row in rows:
            if pd.to_datetime(row["hdate"], format="%Y-%m-%d") > latest_date:
                speeches.append({
                        'speech_id':row["gid"],
                        'speech_url':row["listurl"],
                        'mp_name':row["speaker"]["name"],
                        'mp_constituency':row["speaker"]["constituency"],
                        'mp_party':row["speaker"]["party"],
                        'mp_id':row["person_id"],
                        'date':pd.to_datetime(row["hdate"], format="%Y-%m-%d"),
                        'time':row["htime"],
                        'section_id':row["section_id"],
                        'subsection_id':row["subsection_id"],
                        'debate_title':row["parent"]["body"],
                        'body':BeautifulSoup(row["body"], "html5lib").get_text()
                    })
            else:
                not_latest_date = False
                break
                    
        speeches = pd.DataFrame(speeches)

        # Concatenate onto complete speeches dataframe
        all_speeches = pd.concat([all_speeches, speeches], ignore_index=True)
        # Increment page_counter
        page_no += 1
        #print(speeches)
    
    print("Got speeches for MP {0}".format(mp_id))
    # Write to new csv file specifically for mp
    all_speeches.to_csv("speeches/mp-{0}{1}.csv".format(mp_id, "_new" if update_speeches else ""), index=False)
    return True
    #return all_speeches
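
The stopping rule in `get_mp_speeches` can be tried out without network access. The following is a minimal sketch, not the notebook's own code: `collect_new_rows`, `fetch_page` and `fake_pages` are hypothetical names, and the rows are made-up stand-ins for API results. It assumes, as the function above does, that pages arrive newest first and that an empty page means there is nothing left to fetch.

```python
from datetime import date


def collect_new_rows(fetch_page, latest_date):
    """Collect rows newer than latest_date from a newest-first paginated source.

    fetch_page is a hypothetical callable: page number -> list of dicts,
    each with an ISO "hdate" string. An empty list means no more pages.
    """
    collected = []
    page_no = 1
    while True:
        rows = fetch_page(page_no)
        if not rows:
            break  # ran out of pages
        for row in rows:
            if date.fromisoformat(row["hdate"]) <= latest_date:
                return collected  # reached speeches we already have
            collected.append(row)
        page_no += 1
    return collected


# Stand-in for the API: two pages of made-up rows, newest first
fake_pages = {
    1: [{"hdate": "2017-09-14"}, {"hdate": "2017-09-13"}],
    2: [{"hdate": "2017-07-19"}, {"hdate": "2017-07-17"}],
}


def fake_fetch(page_no):
    return fake_pages.get(page_no, [])


new_rows = collect_new_rows(fake_fetch, date(2017, 7, 19))
print([row["hdate"] for row in new_rows])  # ['2017-09-14', '2017-09-13']
```

With a cut-off far in the past, such as `date(1945, 1, 1)`, the same loop returns all four rows and still ends, because the empty third page breaks out of it.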

#### Run the above function in parallel for all MPs in the list that do not have a speeches file yet
This will take a while (~15 mins, depending on your internet connection)
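
The two steps of this section, working out which MPs are still missing and fanning the downloads out over several workers, can be sketched in isolation. This is an illustration under stated assumptions rather than the notebook's method: `downloaded_ids` and `run_in_threads` are hypothetical helpers, the file names and ids are made up, and a thread pool from `concurrent.futures` is swapped in for `multiprocessing.Pool` because the work is mostly waiting on the network.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def downloaded_ids(paths):
    """Extract MP ids from names like speeches/mp-10001.csv or mp-10001_new.csv."""
    ids = set()
    for path in paths:
        match = re.search(r"mp-(\d+)(?:_new)?\.csv$", path)
        if match:
            ids.add(int(match.group(1)))
    return ids


def run_in_threads(func, items, max_workers=16):
    """Apply func to every item using a pool of threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


# Hypothetical inputs standing in for glob.glob("./speeches/mp-*.csv") and mps.index
existing_files = ["./speeches/mp-10001.csv", "./speeches/mp-10002_new.csv"]
all_ids = [10001, 10002, 10003, 10004]

done = downloaded_ids(existing_files)
pending = [mp_id for mp_id in all_ids if mp_id not in done]

# A stand-in for get_mp_speeches so the sketch runs without network access
results = run_in_threads(lambda mp_id: ("fetched", mp_id), pending)
print(pending)  # [10003, 10004]
print(results)  # [('fetched', 10003), ('fetched', 10004)]
```

Matching the id with a regular expression also copes with the `_new` suffix, which a plain `int()` on the file stem would reject.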

In [48]:
%%time
## Download all MP speeches if this is set to True
## This can take a while
## If speeches have been downloaded previously, it will only download newer speeches and put them in speeches/mp-{id}_new.csv
## You will have to merge them into the main csv yourself!
if False:
    # Figure out which MPs we still need to download
    import glob
    import os
    
    
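    # Strip the directory, the extension and any "_new" suffix to recover the MP id
    # (without the last split, int() fails on update files such as mp-10001_new.csv)
    downloaded_mps = [int(os.path.basename(file).split(".")[0].split("-")[1].split("_")[0]) for file in glob.glob("./speeches/mp-*.csv")]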
    #downloaded_mps = [int(file.split("/")[-1].split(".")[0].split("-")[1].split("_")[0]) for file in glob.glob("./speeches/mp-*_new.csv")]
    
    # If you want to download speeches again, uncomment the next line
    #downloaded_mps = []
    
    mps_to_download = [mp for mp in list(mps.index) if mp not in downloaded_mps]
    # Parallelise downloading of MP speeches
    from multiprocessing import Pool

    # Number of worker processes to use to fetch (multiprocessing.Pool starts processes, not threads)
    NUM_THREADS = 16
    # Make list of mp ids
    list_of_mp_ids = mps_to_download
    #list_of_mp_ids = list(mps.query("exists==False")["Person ID"])[:10]

    # Create pool of worker processes
    pool = Pool(NUM_THREADS)
    # Use pool.map to download speeches mp by mp
    results = pool.map(get_mp_speeches, list_of_mp_ids)
    pool.close()
    pool.join()

    # Remove the empty mp files (glob and os are already imported above)
    for file in glob.glob("./speeches/mp-*.csv"):
        if os.path.getsize(file) == 1:
            os.remove(file)

Downloading speeches made after 2005-03-23 00:00:00
Downloading speeches made after 2010-02-24 00:00:00
Downloading speeches made after 1997-01-29 00:00:00
Downloading speeches made after 2010-02-04 00:00:00
Downloading speeches made after 2010-03-04 00:00:00
Downloading speeches made after 2010-03-09 00:00:00
Downloading speeches made after 2001-04-02 00:00:00
Downloading speeches made after 2010-02-22 00:00:00
Downloading speeches made after 2017-04-25 00:00:00
Downloading speeches made after 2015-03-03 00:00:00
Downloading speeches made after 2005-04-05 00:00:00
Downloading speeches made after 2009-11-23 00:00:00
Downloading speeches made after 2010-03-29 00:00:00
Downloading speeches made after 2005-04-05 00:00:00
Downloading speeches made after 2015-03-26 00:00:00
Got speeches for MP 10018
Downloading speeches made after 2005-03-24 00:00:00
Got speeches for MP 10398
Downloading speeches made after 2005-03-23 00:00:00
Got speeches for MP 10390
Got speeches for MP 10410
Downloading 

Downloading speeches made after 2017-09-11 00:00:00
Downloading speeches made after 1994-02-16 00:00:00
Got speeches for MP 14987
Downloading speeches made after 1993-02-15 00:00:00
Got speeches for MP 14142
Downloading speeches made after 2017-03-06 00:00:00
Got speeches for MP 16383
Downloading speeches made after 1993-10-26 00:00:00
Got speeches for MP 16385
Downloading speeches made after 1991-11-05 00:00:00
Got speeches for MP 11783
Downloading speeches made after 1994-07-07 00:00:00
Got speeches for MP 11817
Downloading speeches made after 2014-10-21 00:00:00
Got speeches for MP 14146
Downloading speeches made after 1991-02-19 00:00:00
Got speeches for MP 16391
Downloading speeches made after 1994-07-14 00:00:00
Got speeches for MP 11814
Downloading speeches made after 1995-06-08 00:00:00
Got speeches for MP 11823
Downloading speeches made after 2017-04-19 00:00:00
Got speeches for MP 16386
Got speeches for MP 16397
Downloading speeches made after 1994-03-16 00:00:00
Got speeches

Got speeches for MP 25664
Downloading speeches made after 2017-07-19 00:00:00
Got speeches for MP 25670
Got speeches for MP 25668
Downloading speeches made after 2017-09-14 00:00:00
Downloading speeches made after 2017-09-13 00:00:00
Got speeches for MP 25674
Got speeches for MP 25669
Got speeches for MP 25671
Downloading speeches made after 2017-07-17 00:00:00
Got speeches for MP 25672
Downloading speeches made after 2017-09-07 00:00:00
Got speeches for MP 25673


[Output truncated: the run was interrupted from the keyboard. Each ForkPoolWorker process printed an interleaved traceback ending in KeyboardInterrupt, raised while opening an HTTPS connection for the requests.get call in get_mp_speeches.]
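
An update run writes the newer speeches to `speeches/mp-{id}_new.csv` and leaves merging them into the main file to the reader. One way to do that is sketched below, using only the standard library. `merge_speech_files` is a hypothetical helper, and the sketch assumes that both files share the same columns, that `speech_id` identifies a speech uniquely, and that `date` is stored as an ISO `YYYY-MM-DD` string. The demo rows are made up.

```python
import csv
import os
import tempfile


def merge_speech_files(main_path, new_path):
    """Append rows from new_path to main_path, skipping speech_ids already present."""
    with open(main_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    seen = {row["speech_id"] for row in rows}
    with open(new_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["speech_id"] not in seen:
                rows.append(row)
                seen.add(row["speech_id"])
    # ISO dates sort correctly as plain strings; newest first
    rows.sort(key=lambda row: row["date"], reverse=True)
    with open(main_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo on temporary files with made-up rows
with tempfile.TemporaryDirectory() as tmp:
    main_path = os.path.join(tmp, "mp-10001.csv")
    new_path = os.path.join(tmp, "mp-10001_new.csv")
    with open(main_path, "w", newline="", encoding="utf-8") as f:
        f.write("speech_id,date,body\na,2017-07-17,old speech\n")
    with open(new_path, "w", newline="", encoding="utf-8") as f:
        f.write("speech_id,date,body\nb,2017-09-14,new speech\na,2017-07-17,old speech\n")
    total = merge_speech_files(main_path, new_path)
    with open(main_path, newline="", encoding="utf-8") as f:
        merged_ids = [row["speech_id"] for row in csv.DictReader(f)]

print(total, merged_ids)  # 2 ['b', 'a']
```

The same result could be reached in pandas by concatenating the two frames and dropping duplicate `speech_id` values; the standard-library version is used here only to keep the sketch free of dependencies.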