![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=SocialStudies/HansardAnalysis/hansard-analysis.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Hansard Analysis

The [Hansard](https://en.wikipedia.org/wiki/Hansard) is a transcript of debates in the Canadian Parliament. It is available from the official [Parliament of Canada website](https://www.parl.ca) as well as other sources such as [Open Parliament](https://openparliament.ca) and [LiPaD: The Linked Parliamentary Data Project](https://www.lipad.ca).

We have downloaded the 2020 files from LiPaD, and can load them by selecting the following code cell and clicking the `▶Run` button.

In [1]:
import pandas as pd
hansard = pd.read_csv('proceedings2020.csv')
print(hansard.shape)
hansard.columns

(4945, 15)


Index(['basepk', 'hid', 'speechdate', 'pid', 'opid', 'speakeroldname',
       'speakerposition', 'maintopic', 'subtopic', 'subsubtopic', 'speechtext',
       'speakerparty', 'speakerriding', 'speakername', 'speakerurl'],
      dtype='object')

There are 15 columns and 4945 rows of data.

## Who Spoke?

Let's have a look at who spoke during these debates.

In [None]:
speakers = hansard.drop_duplicates(subset=['speakername'])[['speakername','speakerparty','speakerriding','speakerurl']]
speakers = speakers.dropna().reset_index().drop(columns=['index'])
print('There were',speakers.shape[0],'speakers from the',speakers['speakerparty'].unique(),'parties.')

We can compare that to the list of Members of Parliament from the [43rd Parliament](https://en.wikipedia.org/wiki/43rd_Canadian_Parliament) that started on December 5, 2019.

In [None]:
members = pd.read_csv('https://www.ourcommons.ca/members/en/search/csv?parliament=43')
print('There were',members.shape[0],'Members from the',members['Political Affiliation'].unique(),'parties.')

So of the 340 Members of Parliament we had 312 unique speakers, meaning that 28 Members are not recorded as speaking during 2020. Let's see if we can identify who are they were.

In [None]:
members['Name'] = members['First Name'] +' '+ members['Last Name']
silent = []
for member in members['Name']:
    if member not in speakers['speakername'].values:
        silent.append(member)
print('That is',len(silent),'Members not recorded as speaking in 2020:')
print(silent)

Of course 35 is not equal to 28, but we will leave it to you to compare the list `silent` to the list from `speakers['speakername'].unique()` if you are interested.

## Who Spoke Most?

We can check how many times each speaker is recorded in the Hansard.

In [None]:
hansard['speakername'].value_counts()

Let's also calculate the length (number of characters) of each of those speeches, and sort them by who said the most.

In [None]:
hansard['speechlength'] = hansard['speechtext'].str.len()
hansard.groupby('speakername').sum().sort_values('speechlength', ascending=False)

We can also plot those speech lengths. Feel free to change the `n = 20` variable to see more or fewer Members.

In [None]:
import plotly.express as px
n = 20
top_length = hansard.groupby('speakername').sum().sort_values('speechlength', ascending=False).head(n)
px.bar(top_length, title='Top '+str(n)+' Hansard Speakers by Speech Length').update(layout_showlegend=False)

As well, we can check how many times a Member from a political party spoke.

In [None]:
px.bar(hansard['speakerparty'].value_counts(),title='Hansard Speaker Frequency by Party').update(layout_showlegend=False)

In [None]:
speech_length_party = hansard.groupby('speakerparty').sum().sort_values('speechlength', ascending=False)
px.bar(speech_length_party, title='Hansard Speech Length by Party').update(layout_showlegend=False)

## Natural Language Processing

We are going to use the Python library [spaCy](https://spacy.io) for natural language processing. It will allow us to remove "stop words", which are common words that can be discarded without reducing meaning. We will also simplify words to their base forms, which is called "[lemmatization](https://en.wikipedia.org/wiki/Lemmatisation)" (e.g. the lemma of "speaking" is "speak").

The following code cell will take a while to run.

In [15]:
try:
    import spacy
    nlp = spacy.load("en_core_web_sm")
except:
    !pip install spacy --user
    !python -m spacy download en_core_web_sm
    import spacy
    nlp = spacy.load("en_core_web_sm")
from spacy.lang.en.stop_words import STOP_WORDS as sw
sw.add('madam')
sw.add('speaker')
sw.add('hon')
sw.add('member')
sw.add('\n')
sw.add('mr.')
sw.add('government')
sw.add('canada')
sw.add('prime')
sw.add('minister')

def tokenize(text):
    speech = []
    try:
        for token in nlp(text):
            if token.is_stop != True and token.is_punct != True and token != '\n':
                speech.append(token.lemma_)
    except:
        pass
    return speech

hansard['tokenized'] = hansard['speechtext'].apply(tokenize)
hansard

Unnamed: 0,basepk,hid,speechdate,pid,opid,speakeroldname,speakerposition,maintopic,subtopic,subsubtopic,speechtext,speakerparty,speakerriding,speakername,speakerurl,tokenized
0,4786557,ca.proc.d.2020-01-27.10733547,2020-01-27,ddcc96da-5bcf-4069-9cd6-da8f76df8b89,229530,Mr. Stéphane Lauzon (Parliamentary Secretary t...,,Speech From The Throne,Resumption of debate on Address in Reply,,"Madam Speaker, I would like to thank my collea...",Liberal,Argenteuil—La Petite-Nation,Stéphane Lauzon,http://www.parl.gc.ca/Parliamentarians/en/memb...,[]
1,4786575,ca.proc.d.2020-01-27.10733601,2020-01-27,cedf1659-ebb1-4f26-b723-5c7c5353988b,214948,The Assistant Deputy Speaker (Mrs. Carol Hughes),,Speech From The Throne,Resumption of debate on Address in Reply,,I must remind the hon. member to address her r...,New Democratic Party,Algoma—Manitoulin—Kapuskasing,Carol Hughes,http://www.parl.gc.ca/Parliamentarians/en/memb...,[]
2,4786602,ca.proc.d.2020-01-27.10733669,2020-01-27,cedf1659-ebb1-4f26-b723-5c7c5353988b,214948,The Assistant Deputy Speaker (Mrs. Carol Hughes),,Speech From The Throne,Resumption of debate on Address in Reply,,I must interrupt the hon. member because her t...,New Democratic Party,Algoma—Manitoulin—Kapuskasing,Carol Hughes,http://www.parl.gc.ca/Parliamentarians/en/memb...,[]
3,4786603,ca.proc.d.2020-01-27.10733670,2020-01-27,00000000-0000-0000-0000-000000000000,248690,"Mr. Paul Manly (Nanaimo—Ladysmith, GP)",,Speech From The Throne,Resumption of debate on Address in Reply,,"Madam Speaker, it is an honour and privilege t...",Green Party,Nanaimo—Ladysmith,Paul Manly,http://www.parl.gc.ca/Parliamentarians/en/memb...,[]
4,4786535,ca.proc.d.2020-01-27.10733478,2020-01-27,3dd9162a-4763-4000-8c8a-706a788f7e50,229209,Hon. Chrystia Freeland (Deputy Prime Minister ...,,,Ways and Means,Notice of Motion,"Mr. Speaker, pursuant to Standing Order 83(1),...",Liberal,University—Rosedale,Chrystia Freeland,http://www.parl.gc.ca/Parliamentarians/en/memb...,[]
5,4786536,ca.proc.d.2020-01-27.p6026577,2020-01-27,,,,stagedirection,Speech From The Throne,Resumption of debate on Address in Reply,,"The House resumed from December 13, 2019, cons...",,,,,[]
6,4786537,ca.proc.d.2020-01-27.10733485,2020-01-27,00000000-0000-0000-0000-000000000000,252986,"Ms. Raquel Dancho (Kildonan—St. Paul, CPC)",,Speech From The Throne,Resumption of debate on Address in Reply,,"Mr. Speaker, I am pleased to split my time wit...",Conservative,Kildonan—St. Paul,Raquel Dancho,http://www.parl.gc.ca/Parliamentarians/en/memb...,[]
7,4786538,ca.proc.d.2020-01-27.10733492,2020-01-27,07cf5767-802c-406c-92a7-7dc92af79b40,229526,Mr. Kevin Lamoureux (Parliamentary Secretary t...,,Speech From The Throne,Resumption of debate on Address in Reply,,"Madam Speaker, the member referred to fighting...",Liberal,Winnipeg North,Kevin Lamoureux,http://www.parl.gc.ca/Parliamentarians/en/memb...,[]
8,4786539,ca.proc.d.2020-01-27.10733494,2020-01-27,00000000-0000-0000-0000-000000000000,252986,Ms. Raquel Dancho,,Speech From The Throne,Resumption of debate on Address in Reply,,"Madam Speaker, I ask the member to consider th...",Conservative,Kildonan—St. Paul,Raquel Dancho,http://www.parl.gc.ca/Parliamentarians/en/memb...,[]
9,4786555,ca.proc.d.2020-01-27.10733540,2020-01-27,4b18a0b6-bb17-4967-84ef-abb5a4dde893,229553,Mr. Greg Fergus,,Speech From The Throne,Resumption of debate on Address in Reply,,"Madam Speaker, I thank the member for Berthier...",Liberal,Hull—Aylmer,Greg Fergus,http://www.parl.gc.ca/Parliamentarians/en/memb...,[]


In [16]:
from collections import Counter
for speaker in hansard['speakername'].unique():
    words = []
    for token_list in hansard[hansard['speakername']==speaker]['tokenized']:
        words.extend(token_list)
    word_freq = Counter(words)
    common_words = word_freq.most_common(5)
    print(speaker, common_words)

Stéphane Lauzon []
Carol Hughes []
Paul Manly []
Chrystia Freeland []
nan []
Raquel Dancho []
Kevin Lamoureux []
Greg Fergus []
John Brassard []
Dan Albas []
Alexandre Boulerice []
Marie-France Lalonde []
Darrell Samson []
Jacques Gourde []
Wayne Easter []
Jeremy Patzer []
Damien Kurek []
Elizabeth May []
Garnett Genuis []
Tom Kmiec []
Julie Dzerowicz []
Bryan May []
Earl Dreeshen []
Kenneth McDonald []
Anthony Housefather []
Marc Dalton []
William Amos []
Jean Yip []
James Cumming []
Majid Jowhari []
Michael Cooper []
Nelly Shin []
Alistair MacGregor []
Michel Boudrias []
Mario Simard []
Brian Masse []
Yves Perron []
Alexis Brunelle-Duceppe []
Denis Trudel []
Martin Champoux []
Heather McPherson []
Lloyd Longfield []
Xavier Barsalou-Duval []
The Speaker []
Some hon. members []
Luc Berthold []
Cathy McLeod []
Peter Julian []
Mark Gerretsen []
Andy Fillmore []
Gérard Deltell []
Brad Vis []
Charlie Angus []
Robert Gordon Kitchen []
Gord Johns []
Gabriel Ste-Marie []
Anthony Rota []
Micha

## Other Datasets

If you want to explore datasets from other timeframes on [LiPaD](https://www.lipad.ca/full/), you can upload the CSV files to a `new_data` folder here, and run the following code to import them into a DataFrame.

```python
import os
import pandas as pd
hansard = pd.DataFrame()
for root, dirs, files in os.walk('new_data'):
    for name in files:
        df = pd.read_csv(os.path.join(root, name))
        hansard = hansard.append(df)
hansard
```

# Conclusion



[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)