![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=SocialStudies/HansardAnalysis/hansard-analysis.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Hansard Analysis

The [Hansard](https://en.wikipedia.org/wiki/Hansard) is a transcript of debates in the Canadian Parliament. It is available from the official [Parliament of Canada website](https://www.parl.ca) as well as other sources such as [Open Parliament](https://openparliament.ca) and [LiPaD: The Linked Parliamentary Data Project](https://www.lipad.ca).

We have downloaded the 2020 files from LiPaD, and can load them by selecting the following code cell and clicking the `▶Run` button.

In [None]:
import pandas as pd
hansard = pd.read_csv('proceedings2020.csv')
print(hansard.shape)
hansard.columns

There are 15 columns and 4945 rows of data.

## Who Spoke?

Let's have a look at who spoke during these debates.

In [None]:
speakers = hansard.drop_duplicates(subset=['speakername'])[['speakername','speakerparty','speakerriding','speakerurl']]
speakers = speakers.dropna().reset_index().drop(columns=['index'])
print('There were',speakers.shape[0],'speakers from the',speakers['speakerparty'].unique(),'parties.')

We can compare that to the list of Members of Parliament from the [43rd Parliament](https://en.wikipedia.org/wiki/43rd_Canadian_Parliament) that started on December 5, 2019.

In [None]:
members = pd.read_csv('https://www.ourcommons.ca/members/en/search/csv?parliament=43')
print('There were',members.shape[0],'Members from the',members['Political Affiliation'].unique(),'parties.')

So of the 340 Members of Parliament we had 312 unique speakers, meaning that 28 Members are not recorded as speaking during 2020. Let's see if we can identify who they were.

In [None]:
members['Name'] = members['First Name'] +' '+ members['Last Name']
silent = []
for member in members['Name']:
    if member not in speakers['speakername'].values:
        silent.append(member)
print('That is',len(silent),'Members not recorded as speaking in 2020:')
print(silent)

Of course 35 is not equal to 28, which means that some names in `speakers` do not exactly match any name in `members`. We will leave it to you to compare the list `silent` to the list from `speakers['speakername'].unique()` if you are interested.
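
If you do want to compare the two lists, one place to start is spelling differences such as accents or extra spaces. We have not checked whether these occur in the Hansard data, so treat the following as a minimal sketch. The names and the `normalize_name` helper are invented for illustration.

```python
import unicodedata

def normalize_name(name):
    """Lowercase a name, strip accents, and collapse repeated whitespace."""
    decomposed = unicodedata.normalize('NFKD', name)
    without_accents = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return ' '.join(without_accents.lower().split())

# Hypothetical names, invented for illustration only
example_members = ['Zoé Example', 'Sam  Placeholder', 'Alex Sample']
example_speakers = ['Zoe Example', 'Sam Placeholder']

# Compare the normalized forms instead of the raw strings
speaker_keys = {normalize_name(name) for name in example_speakers}
still_silent = [name for name in example_members
                if normalize_name(name) not in speaker_keys]
print(still_silent)
```

With the real data you could apply the same helper to `members['Name']` and `speakers['speakername']` before comparing them.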

## Who Spoke Most?

We can check how many times each speaker is recorded in the Hansard.

In [None]:
hansard['speakername'].value_counts()

Let's also calculate the length (number of characters) of each of those speeches, and sort them by who said the most.

In [None]:
hansard['speechlength'] = hansard['speechtext'].str.len()
# sum only the speechlength column, since the other columns contain text
hansard.groupby('speakername')['speechlength'].sum().sort_values(ascending=False)
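
To see what the grouping step does, here is a small sketch on a made-up DataFrame that has the same two columns as the Hansard data. The speakers and speeches are invented, and a missing speech is included to show that it simply adds nothing to the total.

```python
import pandas as pd

# Made-up rows for illustration; the real data comes from proceedings2020.csv
toy = pd.DataFrame({
    'speakername': ['Speaker A', 'Speaker B', 'Speaker A'],
    'speechtext': ['Hello.', 'Thank you, Madam Speaker.', None],
})

# Number of characters in each speech (missing speeches become NaN)
toy['speechlength'] = toy['speechtext'].str.len()

# Add up the lengths for each speaker, largest total first
totals = toy.groupby('speakername')['speechlength'].sum().sort_values(ascending=False)
print(totals)
```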

We can also plot those speech lengths. Feel free to change the `n = 20` variable to see more or fewer Members.

In [None]:
import plotly.express as px
n = 20
# sum only the speechlength column so that the chart shows a single value per speaker
top_length = hansard.groupby('speakername')['speechlength'].sum().sort_values(ascending=False).head(n)
px.bar(top_length, title='Top '+str(n)+' Hansard Speakers by Speech Length').update(layout_showlegend=False)

As well, we can check how many times a Member from a political party spoke.

In [None]:
px.bar(hansard['speakerparty'].value_counts(),title='Hansard Speaker Frequency by Party').update(layout_showlegend=False)

In [None]:
# sum only the speechlength column so that the chart shows a single value per party
speech_length_party = hansard.groupby('speakerparty')['speechlength'].sum().sort_values(ascending=False)
px.bar(speech_length_party, title='Hansard Speech Length by Party').update(layout_showlegend=False)

## Natural Language Processing

We are going to use the Python library [spaCy](https://spacy.io) for natural language processing. It will allow us to remove "stop words", which are common words that can be discarded without reducing meaning. We will also simplify words to their base forms, which is called "[lemmatization](https://en.wikipedia.org/wiki/Lemmatisation)" (e.g. the lemma of "speaking" is "speak").
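
Before using spaCy, it may help to see the two ideas in plain Python. This is only a toy sketch: the stop word set and the lemma dictionary below are tiny hand-written examples, and `toy_tokenize` is a hypothetical helper, whereas spaCy uses a much larger stop word list and a trained language model.

```python
import string

# Tiny hand-written lists for illustration only
toy_stop_words = {'the', 'is', 'to', 'a', 'of'}
toy_lemmas = {'speaking': 'speak', 'members': 'member', 'debates': 'debate'}

def toy_tokenize(text):
    """Lowercase, strip punctuation, drop stop words, and map words to lemmas."""
    tokens = []
    for word in text.lower().split():
        word = word.strip(string.punctuation)
        if word and word not in toy_stop_words:
            # fall back to the word itself when no lemma is listed
            tokens.append(toy_lemmas.get(word, word))
    return tokens

print(toy_tokenize('The member is speaking to the House.'))
```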

The following code cell will take a while to run.

In [None]:
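try:
    import spacy
    nlp = spacy.load("en_core_web_sm")
except (ImportError, OSError):
    # spaCy or its English model is missing, so install them and try again
    !pip install spacy --user
    !python -m spacy download en_core_web_sm
    import spacy
    nlp = spacy.load("en_core_web_sm")
from spacy.lang.en.stop_words import STOP_WORDS as sw
# parliamentary terms that we also want to treat as stop words
sw.update(['madam','speaker','hon','member'])

def tokenize(text):
    speech = []
    # skip values that are not text, such as NaN for a missing speech
    if not isinstance(text, str):
        return speech
    for token in nlp(text):
        # also check the lowercased text so the words added above are removed
        if not token.is_stop and not token.is_punct and token.lower_ not in sw:
            speech.append(token.lemma_)
    return speech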

hansard['tokenized'] = hansard['speechtext'].apply(tokenize)
hansard

In [None]:
speaker = 'Carol Hughes'
ch = []
for token_list in hansard[hansard['speakername']==speaker]['tokenized']:
    ch.extend(token_list)
    #print(token_list)
ch
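
A list of tokens like `ch` can be summarized by counting how often each word appears. The following is a minimal sketch using `collections.Counter` from the Python standard library. The token list is made up for illustration and is not taken from the Hansard.

```python
from collections import Counter

# Hypothetical token list standing in for `ch`
example_tokens = ['budget', 'health', 'budget', 'community', 'health', 'budget']

# Count each token and show the two most frequent ones
word_counts = Counter(example_tokens)
print(word_counts.most_common(2))
```

With the real data, `Counter(ch).most_common(10)` would list the ten most frequent tokens for the selected speaker.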

In [None]:
#hansard['speakername']
for speaker in hansard['speakername'].unique():
    #print(speaker)
    df = hansard[hansard['speakername']==speaker]
    

In [None]:
hansard.columns

## Other Datasets

If you want to explore datasets from other timeframes on [LiPaD](https://www.lipad.ca/full/), you can upload the CSV files to a `new_data` folder here, and run the following code to import them into a DataFrame.

```python
import os
import pandas as pd
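frames = []
for root, dirs, files in os.walk('new_data'):
    for name in files:
        # read each uploaded CSV file into its own DataFrame
        frames.append(pd.read_csv(os.path.join(root, name)))
# DataFrame.append was removed in pandas 2.0, so combine the files with pd.concat
hansard = pd.concat(frames, ignore_index=True)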
hansard
```

# Conclusion



[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)