### Import Data from CSV

The CSV file is produced in 'project_fletcher_data_processing'.

In [1]:
import os
os.getcwd()

'/home/ubuntu/Notebooks'

In [12]:
import pandas as pd
df_all = pd.read_csv("/home/ubuntu/Notebooks/scrapedxml/debates/debates_00_18")

In [14]:
df_all.shape

(878146, 4)

In [18]:
df_all_1 = df_all[['MP','date','speech']]
df_all_1.head()

Unnamed: 0,MP,date,speech
0,Mr. Mike Gapes,2000-11-22a,If she will make a statement on her Department...
1,Mr. Mike Gapes,2000-11-22a,Is my right hon. Friend aware that the biggest...
2,The Parliamentary Under-Secretary of State for...,2000-11-22a,We have responded to a large number of floods ...
3,The Parliamentary Under-Secretary of State for...,2000-11-22a,Since 1991 we have provided £217 million in hu...
4,Mr. Gapes,2000-11-22a,I am grateful for that reply. Does it not show...


I keep only the columns I need, then remove empty speeches and duplicate speeches.

In [20]:
import numpy as np

df_all = df_all[['MP','date','speech']]
df_all = df_all[df_all.speech != '']
df_all.drop_duplicates(subset = ['speech'], inplace = True)
df_all.replace('',np.nan,inplace = True)
print(df_all.shape)
df_all.head()

(823978, 3)


Unnamed: 0,MP,date,speech
0,Mr. Mike Gapes,2000-11-22a,If she will make a statement on her Department...
1,Mr. Mike Gapes,2000-11-22a,Is my right hon. Friend aware that the biggest...
2,The Parliamentary Under-Secretary of State for...,2000-11-22a,We have responded to a large number of floods ...
3,The Parliamentary Under-Secretary of State for...,2000-11-22a,Since 1991 we have provided £217 million in hu...
4,Mr. Gapes,2000-11-22a,I am grateful for that reply. Does it not show...
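
The cleaning step can be sketched on a toy frame as follows. This is a minimal illustration rather than the notebook's exact code: the `toy` data is made up, and it treats both empty strings and missing values as "no speech", which is an assumption about what should count as missing.

```python
import pandas as pd

# Toy data standing in for the debates table (values are made up).
toy = pd.DataFrame({
    "MP": ["A", "B", "C", "D"],
    "date": ["2000-11-22a", "2000-11-22a", "2000-11-23b", "2000-11-23b"],
    "speech": ["Hello", "", "Hello", None],
})

# A row has text if the speech is neither missing nor blank.
has_text = toy["speech"].notna() & (toy["speech"].str.strip() != "")

# Drop rows without text, then keep the first copy of each distinct speech.
cleaned = (
    toy[has_text]
    .drop_duplicates(subset=["speech"])
    .reset_index(drop=True)
)

print(cleaned.shape)
```

Checking `notna()` explicitly matters because a comparison such as `speech != ''` is true for missing values, so on its own it would let them through.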


I convert the date to a datetime object.

In [24]:
import re
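# Strip the lowercase suffix letters (e.g. '2000-11-22a' -> '2000-11-22')
df_all['date_1'] = df_all['date'].map(lambda x: re.sub('[a-z]', '', str(x)))
# The cleaned strings are hyphen-separated, so the format uses '-' rather than '/'
df_all['date_1'] = pd.to_datetime(df_all['date_1'], format = "%Y-%m-%d")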
df_all.head()

Unnamed: 0,MP,date,speech,date_1
0,Mr. Mike Gapes,2000-11-22a,If she will make a statement on her Department...,2000-11-22
1,Mr. Mike Gapes,2000-11-22a,Is my right hon. Friend aware that the biggest...,2000-11-22
2,The Parliamentary Under-Secretary of State for...,2000-11-22a,We have responded to a large number of floods ...,2000-11-22
3,The Parliamentary Under-Secretary of State for...,2000-11-22a,Since 1991 we have provided £217 million in hu...,2000-11-22
4,Mr. Gapes,2000-11-22a,I am grateful for that reply. Does it not show...,2000-11-22
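
The same conversion can be sketched with vectorised string methods. This is a small example on made-up values; removing only trailing letters (rather than every lowercase letter) and coercing unparseable values to `NaT` are choices made for the illustration.

```python
import pandas as pd

# Example values in the same shape as the 'date' column (made up).
dates = pd.Series(["2000-11-22a", "2000-11-22b", "2001-01-15"])

# Remove any trailing lowercase letters, leaving 'YYYY-MM-DD'.
stripped = dates.str.replace(r"[a-z]+$", "", regex=True)

# Parse with an explicit format; values that do not match become NaT.
parsed = pd.to_datetime(stripped, format="%Y-%m-%d", errors="coerce")

print(parsed)
```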


### Adding science documents

I need a point of comparison for assessing the 'evidence-basedness' of these speeches: some speech or text which is recognised as evidence-based and which is publicly available. I chose the Royal Institution Christmas Lectures, which aim to increase the public understanding of science and so use words like 'evidence', 'cause' etc.

I import them here. Since the actual speaker of each 'science speech' is not of interest, I call them all 'Dr Science'. Later, I average the similarity scores between each MP's speech and each of the Royal Institution lectures, so the lectures as a whole act as the comparator.
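
The averaging idea can be sketched as below. This section does not say which similarity measure is used, so cosine similarity over raw word counts is an assumption here, and `bag_of_words`, `cosine_similarity` and `mean_similarity` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity(speech, comparators):
    """Average the similarity of one speech against every comparator text."""
    speech_vec = bag_of_words(speech)
    scores = [cosine_similarity(speech_vec, bag_of_words(c)) for c in comparators]
    return sum(scores) / len(scores)


# Made-up texts standing in for an MP's speech and the lecture transcripts.
lectures = ["the evidence shows a cause", "we test the evidence"]
score = mean_similarity("what does the evidence show", lectures)
print(round(score, 3))
```

Averaging over all comparator texts means no single lecture dominates the score for a speech.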

In [27]:
for year in [6,8]:
    for no in np.arange(1,6):
        science_lec = pd.read_table('RI_lecture{0}_{1}.txt'.format(year,no),names = ['speech'])
        science_lecture = (pd.Series(['Dr Science', '', science_lec['speech'],''], 
                        index=['MP', 'date','speech','date_1']))
        df_all = df_all.append(science_lecture, ignore_index=True)

In [28]:
df_all.tail()

Unnamed: 0,MP,date,speech,date_1
823983,Dr Science,,0 Royal Institution Christmas Lectures Brea...,
823984,Dr Science,,0 1 Part 1 Professor Chris Bishop Have you ...,
823985,Dr Science,,0 1 Part 1 Professor Chris Bishop Did you k...,
823986,Dr Science,,0 Did you know that in the time it takes me...,
823987,Dr Science,,0 Have you ever wondered why computers are ...,
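
Each speech value above begins with `0`, which suggests the cell holds a whole pandas Series rather than a plain string. A minimal sketch of an alternative follows, assuming each lecture is available as a single string. The texts and the starting frame are made up, `pd.NaT` replaces the empty string for `date_1`, and `pd.concat` is used because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Toy stand-in for the debates table (values are made up).
df = pd.DataFrame({
    "MP": ["Mr. Example"],
    "date": ["2000-11-22a"],
    "speech": ["Some speech."],
    "date_1": [pd.Timestamp("2000-11-22")],
})

# Each lecture as one plain string, e.g. the result of reading a text file.
lecture_texts = ["First lecture transcript.", "Second lecture transcript."]

# One row per lecture; the scalar values are broadcast to every row.
science_rows = pd.DataFrame({
    "MP": "Dr Science",
    "date": "",
    "speech": lecture_texts,
    "date_1": pd.NaT,
})

# Append all lecture rows in one call and renumber the index.
df = pd.concat([df, science_rows], ignore_index=True)
print(df.tail())
```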


### Adding science words

As a baseline model and proof of concept, I created a very simple comparator: a list of words that I associate with people using evidence in their arguments. I tried to do this systematically by looking at key vocabulary in course notes and textbooks, but it ends up being a conglomeration of words which I think are sensible.

In [29]:
science_words = '''evidence study statistics research data expert 
                    science clinical trial significant average 
                    proportion cause probability frequency 
                    distribution mode ratio sample'''

science_words = (pd.Series(['Dr Science', '', science_words, ''], 
                index=['MP', 'date', 'speech','date_1']))

df_all = df_all.append(science_words, ignore_index=True)
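
Because the word list is written as an indented triple-quoted string, it contains newlines and runs of spaces. A small sketch of normalising it into a single space-separated document is shown below; whether the downstream tokeniser needs this is an assumption, and `science_words_text`, `words` and `normalised` are names chosen for the illustration.

```python
science_words_text = '''evidence study statistics research data expert 
                    science clinical trial significant average 
                    proportion cause probability frequency 
                    distribution mode ratio sample'''

# split() with no arguments splits on any whitespace, including newlines.
words = science_words_text.split()

# Rejoin with single spaces to get one clean document string.
normalised = " ".join(words)

print(len(words))
print(normalised)
```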

I then pickle the dataframe, which now includes the science document comparators.

In [33]:
import pickle
df_all.to_pickle('df_all_science_docs.pkl')

In [34]:
# Use a context manager so the file is closed after loading
with open('df_all_science_docs.pkl', 'rb') as pkl_file:
    df_all_science_docs = pickle.load(pkl_file)
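
The save-and-reload round trip can also be done entirely with pandas. The sketch below uses a made-up frame and a temporary directory so it does not touch the real pickle file.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the full dataframe (values are made up).
toy = pd.DataFrame({"MP": ["Dr Science"], "speech": ["evidence study"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)               # write the frame to disk
    restored = pd.read_pickle(path)   # read it back with the matching pandas call

# Check that the reloaded frame matches the original.
print(restored.equals(toy))
```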