# Data Cleaning

Our current data structure looks like this:

![title](../Data-Acquisition/data_structure.png)

I will now perform some minor cleaning before moving on to further data collection.

In [6]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import requests
from bs4 import BeautifulSoup

import pprintpp
pp = pprintpp.PrettyPrinter(indent=4)

plt.style.use('ggplot')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# Members Dataframe

In [9]:
members = pd.read_pickle("../Data-Acquisition/members.pkl")
members.head()

Unnamed: 0,id,display_name,full_title,gender,member_from,member_house,membership_start,membership_end,membership_end_reason,membership_end_reason_notes,status_is_active,status_description,status_notes,status_start_date,party,party_is_lords_main,party_is_lords_spiritual,party_is_independent,party_government_type
0,172,Ms Diane Abbott,Rt Hon Diane Abbott MP,F,Hackney North and Stoke Newington,1,1987-06-11T00:00:00,,,,True,Current Member,,2019-12-12T00:00:00,Labour,True,False,False,3.0
1,3305,Lord Aberconway,The Lord Aberconway,M,Hereditary,2,1953-05-23T00:00:00,1999-11-11T00:00:00,Excluded,,,,,,Conservative,True,True,False,0.0
2,3469,The Duke of Abercorn,His Grace the Duke of Abercorn,M,Hereditary,2,1979-06-04T00:00:00,1999-11-11T00:00:00,Excluded,,,,,,Conservative,True,True,False,0.0
3,3468,Lord Aberdare,The Rt Hon. the Lord Aberdare KBE DL,M,Excepted Hereditary,2,1957-12-18T00:00:00,2005-01-23T00:00:00,Death,,,,,,Conservative,True,True,False,0.0
4,3898,Lord Aberdare,The Lord Aberdare,M,Excepted Hereditary,2,2009-07-20T00:00:00,,,,True,Current Member,,2009-07-20T00:00:00,Crossbench,True,True,False,


In [10]:
members.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4763 entries, 0 to 4762
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           4763 non-null   int64  
 1   display_name                 4763 non-null   object 
 2   full_title                   4763 non-null   object 
 3   gender                       4763 non-null   object 
 4   member_from                  4760 non-null   object 
 5   member_house                 4763 non-null   int64  
 6   membership_start             4763 non-null   object 
 7   membership_end               3292 non-null   object 
 8   membership_end_reason        2474 non-null   object 
 9   membership_end_reason_notes  313 non-null    object 
 10  status_is_active             1471 non-null   object 
 11  status_description           1471 non-null   object 
 12  status_notes                 6 non-null      object 
 13  status_start_date 

In [11]:
for i in members.columns:
    print(i)

id
display_name
full_title
gender
member_from
member_house
membership_start
membership_end
membership_end_reason
membership_end_reason_notes
status_is_active
status_description
status_notes
status_start_date
party
party_is_lords_main
party_is_lords_spiritual
party_is_independent
party_government_type


### Id

✏️ No duplicates, good.

In [13]:
members[members['id'].duplicated()]

Unnamed: 0,id,display_name,full_title,gender,member_from,member_house,membership_start,membership_end,membership_end_reason,membership_end_reason_notes,status_is_active,status_description,status_notes,status_start_date,party,party_is_lords_main,party_is_lords_spiritual,party_is_independent,party_government_type
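An empty result from the filter above is one way to read "no duplicates". A more direct check, sketched here on a toy frame (the three ids are copied from the `head()` output; the frame itself is a stand-in for `members`), is to ask pandas for a yes/no answer and a count:

```python
import pandas as pd

# Toy stand-in for `members`; only the `id` column matters here.
toy = pd.DataFrame({"id": [172, 3305, 3469]})

# True when every id occurs exactly once.
ids_unique = toy["id"].is_unique

# Number of rows whose id repeats an earlier row.
n_dupes = int(toy["id"].duplicated().sum())

print(ids_unique, n_dupes)
```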


### Display Name

In [16]:
members.display_name.value_counts()

The Lord Bishop of Gloucester    4
The Lord Bishop of Ely           4
The Lord Bishop of Durham        4
Lord Sinha                       4
The Lord Bishop of Derby         4
                                ..
Lord Goddard of Stockport        1
Dr Norman Godman                 1
Mr Roger Godsiff                 1
Lord Godson                      1
Lord Zuckerman                   1
Name: display_name, Length: 4425, dtype: int64

In [17]:
members[members["display_name"] == "The Lord Bishop of Gloucester"]

Unnamed: 0,id,display_name,full_title,gender,member_from,member_house,membership_start,membership_end,membership_end_reason,membership_end_reason_notes,status_is_active,status_description,status_notes,status_start_date,party,party_is_lords_main,party_is_lords_spiritual,party_is_independent,party_government_type
1656,2263,The Lord Bishop of Gloucester,The Rt Revd. the Lord Bishop of Gloucester,M,Bishops,2,1981-09-30T00:00:00,1991-10-31T00:00:00,Retired,,,,,,Bishops,True,True,False,
1657,2601,The Lord Bishop of Gloucester,The Rt Rev. the Lord Bishop of Gloucester,M,Bishops,2,1998-01-31T00:00:00,2003-12-31T00:00:00,Retired,,,,,,Bishops,True,True,False,
1658,3906,The Lord Bishop of Gloucester,The Rt Rev. the Lord Bishop of Gloucester,M,Bishops,2,2009-10-27T00:00:00,2014-11-21T00:00:00,Retired,,,,,,Bishops,True,True,False,
1659,4540,The Lord Bishop of Gloucester,The Rt Rev. the Lord Bishop of Gloucester,F,Bishops,2,2015-09-07T00:00:00,,,,True,Current Member,,2015-09-07T00:00:00,Bishops,True,True,False,


✏️ These aren't duplicates: the titles were passed on from one person to another.

💡 **Katie** -- we could rename these as The Lord Bishop of Gloucester 1, The Lord Bishop of Gloucester 2 etc. to represent them as separate entities, but the id column already does that for us. Perhaps we leave the display name as is and treat it as a kind of job title?
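To see how many names are shared across different ids, one option is to count distinct ids per `display_name`. This is only a sketch: the toy frame below reuses the four Gloucester ids and one other row from the outputs above, and stands in for the real `members` frame.

```python
import pandas as pd

# Toy stand-in for `members`, built from rows shown in the outputs above.
toy = pd.DataFrame({
    "id": [2263, 2601, 3906, 4540, 172],
    "display_name": ["The Lord Bishop of Gloucester"] * 4 + ["Ms Diane Abbott"],
})

# Number of distinct ids behind each display name.
ids_per_name = toy.groupby("display_name")["id"].nunique()

# Names held by more than one person, i.e. titles that were passed on.
shared = ids_per_name[ids_per_name > 1]
print(shared)
```

If every repeated name maps to as many ids as it has rows, the id column is doing the disambiguation and no renaming is needed.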

### Full Title

In [18]:
members.full_title.value_counts()

The Lord Sinha                          4
His Grace the Duke of Northumberland    3
The Rt Rev. the Lord Bishop of Derby    3
The Lord Wolverton                      3
The Rt Hon. the Earl of Powis           3
                                       ..
Rt Hon John Glen MP                     1
The Rt Hon. the Lord Glenamara CH       1
The Lord Glenarthur DL                  1
The Lord Glenconner                     1
The Lord Zuckerman                      1
Name: full_title, Length: 4523, dtype: int64

In [20]:
members[members["full_title"] == "The Lord Sinha"]

Unnamed: 0,id,display_name,full_title,gender,member_from,member_house,membership_start,membership_end,membership_end_reason,membership_end_reason_notes,status_is_active,status_description,status_notes,status_start_date,party,party_is_lords_main,party_is_lords_spiritual,party_is_independent,party_government_type
3983,2323,Lord Sinha,The Lord Sinha,M,Hereditary,2,1967-05-11T00:00:00,1989-01-06T00:00:00,Death,,,,,,Non-affiliated,False,True,False,
3984,2440,Lord Sinha,The Lord Sinha,M,Hereditary,2,1999-01-18T00:00:00,1999-11-11T00:00:00,Excluded,,,,,,Non-affiliated,False,True,False,
3985,2947,Lord Sinha,The Lord Sinha,M,Hereditary,2,1989-01-06T00:00:00,1992-07-25T00:00:00,Death,,,,,,Non-affiliated,False,True,False,
3986,3122,Lord Sinha,The Lord Sinha,M,Hereditary,2,1992-07-25T00:00:00,1999-01-19T00:00:00,Death,,,,,,Non-affiliated,False,True,False,


✏️ I think the same applies here: these are inherited titles held by different people, each with their own id.

### Gender

In [23]:
members.gender.value_counts(dropna=False)

M    3971
F     792
Name: gender, dtype: int64

✏️ No NAs, good. The distribution is disappointing: 3,971 men to 792 women (about 17% women).

### Member from

In [26]:
members.member_from.value_counts()

Life peer                 1378
Hereditary                 911
Excepted Hereditary        150
Bishops                    124
Life Peer (judicial)        50
                          ... 
North East Hampshire         1
Battersea North              1
Orkney and Shetland          1
Plymouth, Devonport          1
Kenilworth and Southam       1
Name: member_from, Length: 931, dtype: int64

In [28]:
members[members['member_from'].isna()]

Unnamed: 0,id,display_name,full_title,gender,member_from,member_house,membership_start,membership_end,membership_end_reason,membership_end_reason_notes,status_is_active,status_description,status_notes,status_start_date,party,party_is_lords_main,party_is_lords_spiritual,party_is_independent,party_government_type
2678,2184,Lord Lucas of Chilworth,The Lord Lucas of Chilworth,M,,2,2001-11-10T00:00:00,1999-11-11T00:00:00,Excluded,,,,,,,,,,
3634,3826,The Lord Rennell,The Lord Rennell,M,,2,2006-12-09T00:00:00,1999-11-11T00:00:00,Excluded,,,,,,,,,,
4311,3802,Lord Thomson of Fleet,Lord Thomson of Fleet,M,,2,2006-06-12T00:00:00,1999-11-11T00:00:00,Excluded,,,,,,,,,,


🚨 **Katie** -- should we drop these? All three also have a membership_start later than their membership_end and no party information.
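The three rows above show a start date after the end date. A sketch of how such rows could be flagged across the whole frame follows; the toy frame copies two rows from the outputs above, and the `dates_inconsistent` column name is my own invention, not something in the data.

```python
import pandas as pd

# Toy stand-in for `members`: one suspicious row and one ordinary row.
toy = pd.DataFrame({
    "id": [2184, 3305],
    "membership_start": ["2001-11-10T00:00:00", "1953-05-23T00:00:00"],
    "membership_end": ["1999-11-11T00:00:00", "1999-11-11T00:00:00"],
})

# The dates are stored as ISO strings, so parse them before comparing.
start = pd.to_datetime(toy["membership_start"])
end = pd.to_datetime(toy["membership_end"])

# True where the recorded start falls after the recorded end.
# Rows with a missing end date compare as False and are left alone.
toy["dates_inconsistent"] = start > end
print(toy[toy["dates_inconsistent"]])
```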

### Member house

In [31]:
members.member_house.value_counts(dropna=False)

2    2634
1    2129
Name: member_house, dtype: int64

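✏️ No NAs, good.

💡 **Katie** -- one is Lords, one is Commons perhaps? In the rows shown earlier, 1 goes with an MP (Diane Abbott) and 2 with peers, so 1 = Commons and 2 = Lords looks likely. Need to confirm for the data dictionary.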

### Membership start, end & membership status

In [40]:
print(members['membership_start'].isna().sum())
print(members['membership_end'].isna().sum())

0
1471


✏️ No NAs for membership start, good. Do the NAs for membership end mean those members are currently active? Checking:

In [36]:
members.status_is_active.value_counts(dropna=False)

NaN      3292
True     1426
False      45
Name: status_is_active, dtype: int64

In [39]:
members.status_description.value_counts(dropna=False)

NaN                 3292
Current Member      1426
Leave of Absence      39
Disqualified           3
Suspended              3
Name: status_description, dtype: int64
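The totals line up (1,426 + 39 + 3 + 3 = 1,471, the number of missing end dates), but matching totals don't prove the same rows are involved. A row-level check might look like the sketch below, run here on a made-up toy frame rather than on `members`.

```python
import pandas as pd

# Toy stand-in for `members`; values are illustrative only.
toy = pd.DataFrame({
    "membership_end": [None, None, "1999-11-11T00:00:00", None],
    "status_is_active": [True, False, None, True],
})

has_end = toy["membership_end"].notna()
has_status = toy["status_is_active"].notna()

# If a status is recorded exactly when the end date is missing,
# no row should have both or neither.
both_or_neither = has_end == has_status
print(both_or_neither.sum())

# Same check as a 2x2 table: all rows should sit off the diagonal.
print(pd.crosstab(has_end, has_status))
```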

In [41]:
members.status_notes.value_counts(dropna=False)

NaN                        3292
None                       1465
Member of the Judiciary       3
Suspended                     3
Name: status_notes, dtype: int64

In [42]:
members.membership_end_reason.value_counts(dropna=False)

None                        2289
Death                       1016
Excluded                     629
Retired                      445
Defeated                     254
Standing Down                 63
General Election              24
Resignation (Northstead)      15
Resignation (Chiltern)        15
Non-attendance                 8
Recall Petition                2
Resigned                       1
Election petition              1
Deselected                     1
Name: membership_end_reason, dtype: int64

In [44]:
members.membership_end_reason_notes.value_counts(dropna=False)

None                                                                                               4450
Dissolution of Parliament                                                                           143
Retired under the House of Lords Reform Act 2014                                                    108
House of Lords Reform Act 2014                                                                        3
Resigned under the Constitutional Reform and Governance Act 2010                                      3
                                                                                                   ... 
Became London Mayor 08/05/16                                                                          1
Retired under the House of Lords Reform Act 2014 [Baroness Lockwood died on 29 April 2019]            1
Retired under the House of Lords Reform Act 2014\n \n[Lord Luke died on 2 October 2015]               1
Mr. Adams refused to accept the office of Crown Steward and Bail

💡 **Katie** -- these status and end-reason columns overlap. Perhaps we:
1. Make a *membership_duration* column.
2. Join the five above into one, with labels along these lines:
   - current member / active, for those that are current members
   - inactive - leave of absence
   - inactive - disqualified
   - inactive - suspended
   - inactive - the membership_end_reason value (+ membership_end_reason_notes), for those that don't fall under any of the above but have clear membership start and end dates
3. Drop *status_notes*, *status_description*, *status_is_active*, *membership_end_reason*, *membership_end_reason_notes* (+ *status_start_date*?).
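A rough sketch of steps 1 and 2, on a toy frame with values taken from the outputs above. Several things here are assumptions for illustration: the `membership_duration_days` and `membership_status` column names, the `combine_status` helper, the fallback label "inactive - unknown", and the fixed reference date used for members with no end date.

```python
import pandas as pd

# Toy stand-in for `members`.
toy = pd.DataFrame({
    "membership_start": ["1987-06-11T00:00:00", "1953-05-23T00:00:00", "2009-07-20T00:00:00"],
    "membership_end": [None, "1999-11-11T00:00:00", None],
    "membership_end_reason": [None, "Excluded", None],
    "status_description": ["Current Member", None, "Leave of Absence"],
})

start = pd.to_datetime(toy["membership_start"])
end = pd.to_datetime(toy["membership_end"])

# Assumption: open-ended memberships are measured up to a fixed reference date.
reference_date = pd.Timestamp("2023-04-01")
toy["membership_duration_days"] = (end.fillna(reference_date) - start).dt.days


def combine_status(row):
    """Collapse the status and end-reason columns into a single label."""
    if row["status_description"] == "Current Member":
        return "active"
    if pd.notna(row["status_description"]):
        return "inactive - " + row["status_description"].lower()
    if pd.notna(row["membership_end_reason"]):
        return "inactive - " + row["membership_end_reason"].lower()
    return "inactive - unknown"


toy["membership_status"] = toy.apply(combine_status, axis=1)
print(toy[["membership_duration_days", "membership_status"]])
```

Once a combined column like this has been checked against the originals, the five source columns could be dropped as in step 3.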

### Party

In [46]:
members.party.value_counts(dropna=False)

Conservative                        1779
Labour                              1194
Crossbench                           652
Liberal Democrat                     273
Other                                233
Non-affiliated                       136
Bishops                              124
Scottish National Party               79
Independent                           66
Labour (Co-op)                        58
Social Democratic Party               22
Democratic Unionist Party             22
Ulster Unionist Party                 21
Sinn Féin                             14
Liberal                               12
Plaid Cymru                           11
Independent Labour                     9
Social Democratic & Labour Party       8
Independent Conservative               7
Not known                              5
The Independent Group for Change       5
Green Party                            4
NaN                                    3
Alliance                               3
Alba Party     

✏️ Dropping the rows with no party:

In [56]:
members = members[members.party.notna()]

### Party_is_lords_main / party_is_lords_spiritual / party_is_independent / party_government_type

In [51]:
members.party_is_lords_main.value_counts(dropna=False)

True     4086
False     674
Name: party_is_lords_main, dtype: int64

In [52]:
members.party_is_lords_spiritual.value_counts(dropna=False)

True     2736
False    2024
Name: party_is_lords_spiritual, dtype: int64

In [53]:
members.party_is_independent.value_counts(dropna=False)

False    4694
True       66
Name: party_is_independent, dtype: int64

✏️ I don't know what these flags mean yet, but there are no NAs, which is good.

In [54]:
members.party_government_type.value_counts(dropna=False)

0.0    1779
NaN    1729
3.0    1252
Name: party_government_type, dtype: int64

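💡 **Katie** -- No idea what this means or what insight it would bring. One clue: 0.0 appears 1,779 times, the same as the Conservative count, and 3.0 appears 1,252 times, which equals Labour plus Labour (Co-op) (1,194 + 58), so it may be a code tied to those parties. Should we drop this column?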
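Before deciding, it may help to see which parties carry which code. The sketch below cross-tabulates the two columns on made-up rows; the "missing" label for NaN codes is my own choice, not something in the data.

```python
import pandas as pd

# Toy stand-in for `members`; rows are illustrative only.
toy = pd.DataFrame({
    "party": ["Conservative", "Labour", "Labour (Co-op)", "Crossbench", "Conservative"],
    "party_government_type": [0.0, 3.0, 3.0, None, 0.0],
})

# Turn the codes into labels so that missing values get their own column.
code_label = toy["party_government_type"].map(
    lambda v: "missing" if pd.isna(v) else str(v)
)

# Rows: party. Columns: code. Cells: number of members.
table = pd.crosstab(toy["party"], code_label)
print(table)
```

If each party only ever appears under one code, the column adds nothing beyond `party` and dropping it would lose no information.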

# Votes Dataframe

In [59]:
votes = pd.read_pickle("../Data-Acquisition/votes.pkl")
votes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111692 entries, 0 to 111691
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   member_id          111692 non-null  object
 1   vote_title         111692 non-null  object
 2   vote_date          111692 non-null  object
 3   aye_count          111692 non-null  object
 4   no_count           111692 non-null  object
 5   member_voted_aye   111692 non-null  object
 6   member_was_teller  111692 non-null  object
dtypes: object(7)
memory usage: 6.0+ MB


In [60]:
votes.head()

Unnamed: 0,member_id,vote_title,vote_date,aye_count,no_count,member_voted_aye,member_was_teller
0,4212,Finance (No2) Bill Reasoned Amendment,2023-03-29T18:03:00,211,289,True,False
1,4212,Illegal Migration Bill: Committee of the whole...,2023-03-28T20:37:00,248,301,True,False
2,4212,Illegal Migration Bill: Committee of the whole...,2023-03-28T20:25:00,249,301,True,False
3,4212,Illegal Migration Bill: Committee of the whole...,2023-03-28T20:13:00,248,299,True,False
4,4212,Illegal Migration Bill: Committee of the whole...,2023-03-28T20:01:00,302,242,False,False
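Every column in `votes` is stored as `object`, so counts and dates would need converting before any arithmetic or time-based grouping. A possible conversion is sketched below on a toy frame. Whether the real values are strings or Python objects isn't visible from the output, so the `as_bool` lookup (a hypothetical helper) accepts both.

```python
import pandas as pd

# Toy stand-in for `votes`, with everything held as strings.
toy = pd.DataFrame({
    "member_id": ["4212", "4212"],
    "vote_date": ["2023-03-29T18:03:00", "2023-03-28T20:37:00"],
    "aye_count": ["211", "248"],
    "no_count": ["289", "301"],
    "member_voted_aye": ["True", "True"],
    "member_was_teller": ["False", "False"],
}, dtype=object)

# Accept either real booleans or their string spellings.
as_bool = {"True": True, "False": False, True: True, False: False}

typed = toy.assign(
    member_id=pd.to_numeric(toy["member_id"]),
    vote_date=pd.to_datetime(toy["vote_date"]),
    aye_count=pd.to_numeric(toy["aye_count"]),
    no_count=pd.to_numeric(toy["no_count"]),
    member_voted_aye=toy["member_voted_aye"].map(as_bool),
    member_was_teller=toy["member_was_teller"].map(as_bool),
)
print(typed.dtypes)
```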


In [62]:
votes.vote_title.value_counts()

Finance Bill: Third Reading                                                             292
Motion to sit in private                                                                252
Business of the House motion                                                            208
UK's withdrawal from the EU: Mr Corbyn's amendment (a)                                  189
Trade Bill: Second Reading                                                              189
                                                                                       ... 
Copyright (Rights and Remuneration of Musicians, Etc.) Bill: Second Reading: Closure      9
Mental Health Units (Use of Force) Bill report stage: Amendment 12                        8
Mental Health Units (Use of Force) Bill report stage: Amendment 11                        8
Ten Minute Rule Bill: Recall of MPs (Change of Party Affiliation)                         7
Motion to sit in private                                                        

💡 **Katie** -- Ed and I discussed treating this as an NLP column, to uncover what kinds of topics certain members tend to vote aye or no on :)
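As a very first step towards that idea, word counts split by how a member voted can be built with pandas alone. This is only a sketch on three titles taken from the outputs above, with made-up votes, and it is a plain bag-of-words count rather than a real topic model.

```python
import pandas as pd

# Toy stand-in for one member's rows in `votes`; the votes are made up.
toy = pd.DataFrame({
    "vote_title": [
        "Finance Bill: Third Reading",
        "Trade Bill: Second Reading",
        "Finance (No2) Bill Reasoned Amendment",
    ],
    "member_voted_aye": [True, False, True],
})

tokens = (
    toy.assign(
        # Lower-case each title and keep runs of letters as tokens.
        token=toy["vote_title"].str.lower().str.findall(r"[a-z]+"),
        vote=toy["member_voted_aye"].map({True: "aye", False: "no"}),
    )
    .explode("token")  # one row per (title, token) pair
)

# Rows: word. Columns: aye / no. Cells: number of occurrences.
word_counts = pd.crosstab(tokens["token"], tokens["vote"])
print(word_counts)
```

Procedural words such as "bill" and "reading" will dominate a count like this, so a stop-word list would probably be needed before the counts say much about topics.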

# Written Statements

In [64]:
statements = pd.read_pickle("../Data-Acquisition/written_statements.pkl")
statements.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6255 entries, 0 to 6254
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   statement_id         6255 non-null   int64 
 1   statement_title      6255 non-null   object
 2   statement_text       6255 non-null   object
 3   member_id            6255 non-null   int64 
 4   member_name          0 non-null      object
 5   member_role          6255 non-null   object
 6   statement_date       6255 non-null   object
 7   answering_body_id    6255 non-null   int64 
 8   answering_body_name  6255 non-null   int64 
 9   house                6255 non-null   object
dtypes: int64(4), object(6)
memory usage: 488.8+ KB


✏️ Dropping the empty column member_name:

In [68]:
statements = statements.drop(columns="member_name")
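If other frames turn out to have fully empty columns too, they can be found and dropped without naming each one. A sketch on a toy frame follows; the `house` values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `statements`.
toy = pd.DataFrame({
    "statement_id": [1, 2],
    "member_name": [None, None],
    "house": ["Commons", "Lords"],
})

# Columns in which every value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop them all in one go.
cleaned = toy.dropna(axis="columns", how="all")
print(cleaned.columns.tolist())
```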