### Tracking PQ topics

This notebook explores ways of tracking and analysing the topics of parliamentary questions (PQs) asked by MPs and Peers.

In [8]:
import requests
import json
import pandas as pd
import numpy as np
import glob
from tqdm import tqdm
tqdm.pandas()
import os
from datetime import datetime, timedelta
import seaborn as sns
from matplotlib import pyplot as plt
import re

#### Import and clean up data

First we import some data on MPs so that, if we want to, we can find out more about who is asking about what. Then we import all PQs (answered and unanswered) and clean up that data too.

In [9]:
active_p = pd.read_csv('active_members.csv')
former_p = pd.read_csv('former_members.csv')

all_p = pd.concat([active_p, former_p])
all_p = all_p[['id', 'nameListAs', 'gender', 'latestPartyabbreviation']]

id_party_dict = dict(zip(all_p.id, all_p.latestPartyabbreviation))
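
The dictionary above maps each member `id` to a party abbreviation. As a minimal sketch of how such a lookup behaves, the example below uses invented ids and party labels (not values from the real CSV files) and applies the lookup with `Series.map`, filling ids that are missing from the lookup with `'n/a'`. The names `toy_members`, `toy_lookup` and `asking_ids` are hypothetical.

```python
import pandas as pd

# Toy data: these ids and party labels are invented for illustration only.
toy_members = pd.DataFrame({
    'id': [1, 2],
    'latestPartyabbreviation': ['Lab', 'Con'],
})
toy_lookup = dict(zip(toy_members['id'], toy_members['latestPartyabbreviation']))

# Ids that are not in the lookup (here 99) map to NaN, which is then replaced by 'n/a'.
asking_ids = pd.Series([1, 2, 99])
parties = asking_ids.map(toy_lookup).fillna('n/a')
print(parties.tolist())
```

If the same `id` appears in both the active and former member files, `dict(zip(...))` keeps the last occurrence, so the order of concatenation decides which party label wins.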

In [10]:
wpqs = pd.read_csv('tmp/ua_pqs.csv')
wpqs['dateTabled'] = pd.to_datetime(wpqs.dateTabled)
wpqs['heading'] = wpqs.heading.fillna('')
# wpqs = wpqs[['id', 'askingMemberId', 'askingMember', 'house', 'dateTabled', 'questionText', 'answeringBodyName', 'heading']]

# Populate a column with the party abbreviation in the WPQs dataframe ('n/a' if the member id is unknown)
wpqs['latestPartyabbreviation'] = wpqs.askingMemberId.progress_apply(lambda x: id_party_dict.get(x, 'n/a'))

# Make some of the string fields lower case to improve comparability and searchability
wpqs['heading'] = wpqs.heading.progress_apply(lambda x: x.lower())
wpqs['questionText'] = wpqs.questionText.progress_apply(lambda x: x.lower())

# Sometimes the heading is just a generic topic; other times a more specific subject follows a ":" symbol.
# We'll extract the part before the ":" into a 'topic' column.
wpqs['topic'] = wpqs.heading.progress_apply(lambda x: x.split(':')[0])

wpqs['year_month'] = wpqs.dateTabled.dt.to_period('M')

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 385658/385658 [00:00<00:00, 1408613.21it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 385658/385658 [00:00<00:00, 1522448.85it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 385658/385658 [00:00<00:00, 1161905.79it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 385658/385658 [00:00<00:00, 1343606.81it/s]
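
To make the topic extraction concrete, here is a small sketch of the same lower-case-then-split step applied to a few invented headings. The helper name `heading_to_topic` and the example headings are hypothetical and are not taken from the dataset.

```python
def heading_to_topic(heading):
    # Lower-case the heading, then keep the text before the first colon.
    # A heading with no colon is returned whole.
    return heading.lower().split(':')[0]

# Invented example headings, including the empty string used for missing values.
examples = ['Schools: Finance', 'Housing', '']
topics = [heading_to_topic(h) for h in examples]
print(topics)
```

Because missing headings were filled with an empty string beforehand, they end up with an empty topic rather than raising an error.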


In [11]:
# Cleaning up the question text

# Aim: to get the bit of text after the 'to ask the secretary of state for blah blah, if...'

def question_cleaner(question):
    q = re.sub(r',(?=\S)|:', ', ', question)
    q = q.replace("to ask her majesty's government ", "to ask her majesty's government, ").replace("to ask her majesty’s government ", "to ask her majesty's government, ")
    q = q.replace(', and', ' and').replace('foreign, commonwealth and development affairs', 'foreign commonwealth and development affairs').replace('digital, culture, media', 'digital culture media').replace('business, energy and industrial', 'business energy and industrial')
    q = q.replace('levelling up, housing and', 'levelling up housing and').replace('environment, food and rural affairs', 'environment food and rural affairs').replace('culture, media and sport', 'culture media and sport').replace('business, innovation and skills', 'business innovation and skills')
    q = q.replace('digital, culture, media and sport', 'digital culture media and sport')
    q = q.replace('housing, communities and local government', 'housing communities and local government')
    q = q.replace(', representing the church commissioners', ' representing the church commissioners, ') 
    q = q.replace('to ask the chairman of committees ', 'to ask the chairman of committees, ')
    q = q.replace('to ask the leader of the house ', 'to ask the leader of the house, ')
    q = q.replace("to ask her majesty’s government", "to ask her majesty's government, ")
    q = q.replace("to ask the senior deputy speaker ", "to ask the senior deputy speaker, ")
    q = q.replace("her majesty's government ", "her majesty's government, ")
    q = q.replace("to ask the secretary of state for education ", "to ask the secretary of state for education, ")
    q = q.replace("to ask the secretary of state for defence ", "to ask the secretary of state for defence, ")
    q = q.replace("to ask the secretary of state for work and pensions ", "to ask the secretary of state for work and pensions, ")
    q = q.replace("to ask the secretary of state for environment food and rural affairs ", "to ask the secretary of state for environment food and rural affairs, ")
    q = q.replace("to ask the secretary of state for health ", "to ask the secretary of state for health, ")
    q = q.replace("foreign and commonwealth affairs ", "foreign and commonwealth affairs, ")
    q = q.replace("foreign commonwealth and development affairs ", "foreign commonwealth and development affairs, ")
    q = q.replace("the senior deputy speaker ", "the senior deputy speaker, ")
    q = q.replace("secretary of state for the home department,", "secretary of state for the home department, ")
    q = q.replace("to ask mr chancellor of the exchequer ", "to ask mr chancellor of the exchequer, ")
    q = q.replace("to ask the minister of the cabinet office ", "to ask the minister of the cabinet office, ")
    q = q.replace("to ask the minister for the cabinet office ", "to ask the minister for the cabinet office, ")
    q = q.replace("to ask the secretary of state for communities and local government ", "to ask the secretary of state for communities and local government, ")
    q = ' '.join(q.split(', ')[1:])
    cleaned_question = q
    return cleaned_question

wpqs['cleanedQuestion'] = wpqs.questionText.progress_apply(question_cleaner)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 385658/385658 [00:03<00:00, 124671.20it/s]
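
The core of `question_cleaner` is its final step: once the text has been normalised, it is split on `', '` and everything after the first piece is kept. The sketch below isolates that step on an invented question. `split_question` is a hypothetical helper that illustrates the idea and is not a replacement for the full function.

```python
def split_question(text):
    # Split on ', ' as question_cleaner does after its normalisation steps.
    parts = text.split(', ')
    addressee = parts[0]        # the 'to ask the ...' preamble
    body = ' '.join(parts[1:])  # the rest, re-joined with spaces (commas are dropped)
    return addressee, body

# Invented example question, already lower-cased like the questionText column.
sample = ('to ask the secretary of state for education, '
          'what steps his department is taking to support schools.')
addressee, body = split_question(sample)
print(addressee)
print(body)
```

Since the split happens at the first `', '`, a comma inside a department name would cut the preamble short. That is why the cleaner first rewrites multi-word department names without their commas.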


In [12]:
# Variant of question_cleaner (kept commented out). It is identical to the function
# above except for the final step, which keeps the text *before* the first ', '
# (the addressee / department) instead of the substance of the question:
#
#     q = ' '.join(q.split(', ')[:1]) # CHANGE ME TO GET THE SUBSTANCE OF THE QUESTION
#
# wpqs['dept'] = wpqs.questionText.progress_apply(lambda x: question_cleaner(x))

In [13]:
# len(wpqs.dept.unique().tolist())

In [14]:
wpqs.to_csv('cleaned.csv', index=False)

#### Exploratory analysis

Let's make a word cloud of a particular day's PQs. 