![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=SocialStudies/HansardAnalysis/hansard-analysis.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Hansard Analysis

The [Hansard](https://en.wikipedia.org/wiki/Hansard) is a transcript of debates in the Canadian Parliament. It is available from the official [Parliament of Canada website](https://www.parl.ca) as well as other sources such as [Open Parliament](https://openparliament.ca) and [LiPaD: The Linked Parliamentary Data Project](https://www.lipad.ca).

We have downloaded the 2020 files from LiPaD, and can load them by selecting the following code cell and clicking the `▶Run` button.

In [None]:
import pandas as pd
import plotly.express as px
from collections import Counter

hansard = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/SocialStudies/HansardAnalysis/proceedings2020.csv')
print(hansard.shape)
hansard.columns

There are 15 columns and 4945 rows of data.

## Who Spoke?

Let's have a look at who spoke during these debates.

In [None]:
speakers = hansard.drop_duplicates(subset=['speakername'])[['speakername','speakerparty','speakerriding','speakerurl']]
speakers = speakers.dropna().reset_index().drop(columns=['index'])
print('There were',speakers.shape[0],'speakers from the',speakers['speakerparty'].unique(),'parties.')

We can compare that to the list of Members of Parliament from the [43rd Parliament](https://en.wikipedia.org/wiki/43rd_Canadian_Parliament) that started on December 5, 2019.

In [None]:
members = pd.read_csv('https://www.ourcommons.ca/members/en/search/csv?parliament=43')
print('There were',members.shape[0],'Members from the',members['Political Affiliation'].unique(),'parties.')

So of the 340 Members of Parliament we had 312 unique speakers, meaning that 28 Members are not recorded as speaking during 2020. Let's see if we can identify who they were.

In [None]:
members['Name'] = members['First Name'] +' '+ members['Last Name']
silent = []
for member in members['Name']:
    if member not in speakers['speakername'].values:
        silent.append(member)
print('That is',len(silent),'Members not recorded as speaking in 2020:')
print(silent)

Of course 35 is not equal to the 28 we expected, but we will leave it to you to compare the list `silent` to the list from `speakers['speakername'].unique()` if you are interested in finding out why.
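
One possible starting point for that comparison is to normalize names before matching them, since differences in accents, capitalization, or spacing would make the same person look like two people. The sketch below is only an illustration under that assumption: the names are made up, and `normalize_name` is a hypothetical helper, not part of the notebook.

```python
import unicodedata

def normalize_name(name):
    """Lowercase a name, strip accents, and collapse extra whitespace."""
    decomposed = unicodedata.normalize('NFKD', name)
    without_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ' '.join(without_accents.lower().split())

# Made-up names for illustration only; these are not taken from the Hansard data
member_names = ['José Example', 'Alex  Sample', 'Pat Placeholder']
speaker_names = ['Jose Example', 'alex sample']

# Compare on the normalized form, but report the original spelling
speaker_keys = {normalize_name(name) for name in speaker_names}
still_silent = [name for name in member_names if normalize_name(name) not in speaker_keys]
print(still_silent)
```

In the notebook itself, `member_names` and `speaker_names` could be replaced with `members['Name']` and `speakers['speakername']`. Any names that remain unmatched would still need to be checked by hand.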

## Who Spoke Most?

We can check how many times each speaker is recorded in the Hansard.

In [None]:
hansard['speakername'].value_counts()

Let's also calculate the length (number of characters) of each of those speeches, and sort them by who said the most.

In [None]:
hansard['speechlength'] = hansard['speechtext'].str.len()
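# sum only the speechlength column, since the other columns are not meaningful to add up
hansard.groupby('speakername')[['speechlength']].sum().sort_values('speechlength', ascending=False)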
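
Character counts are only one way to measure how much someone said. As a hedged alternative, the sketch below counts words instead, using a small made-up DataFrame with the same `speakername` and `speechtext` column names; the `wordcount` column is an assumption introduced here for illustration.

```python
import pandas as pd

# Toy data with made-up speakers; the real notebook would use the hansard DataFrame
toy = pd.DataFrame({
    'speakername': ['Speaker A', 'Speaker B', 'Speaker A'],
    'speechtext': ['Mr. Speaker, I rise today.', 'Thank you.', None],
})

# Split each speech on whitespace and count the pieces; missing text stays missing
toy['wordcount'] = toy['speechtext'].str.split().str.len()

# Total words per speaker (missing values are skipped by sum)
word_totals = toy.groupby('speakername')['wordcount'].sum().sort_values(ascending=False)
print(word_totals)
```

Splitting on whitespace is a rough approximation of a word count, but it is usually close enough for comparing speakers.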

We can also plot those speech lengths. Feel free to change the `n = 20` variable to see more or fewer Members.

In [None]:
n = 20
# keep only the speechlength column so the bar chart shows a single series
top_length = hansard.groupby('speakername')[['speechlength']].sum().sort_values('speechlength', ascending=False).head(n)
px.bar(top_length, title='Top '+str(n)+' Hansard Speakers by Speech Length').update(layout_showlegend=False)

As well, we can check how many times a Member from a political party spoke.

In [None]:
px.bar(hansard['speakerparty'].value_counts(),title='Hansard Speaker Frequency by Party').update(layout_showlegend=False)

In [None]:
# total speech length per party, summing only the speechlength column
speech_length_party = hansard.groupby('speakerparty')[['speechlength']].sum().sort_values('speechlength', ascending=False)
px.bar(speech_length_party, title='Hansard Speech Length by Party').update(layout_showlegend=False)
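
The number of speeches and the total speech length can tell different stories, for example when a party makes many short interventions. The sketch below, using made-up parties and numbers, shows one way to put the count, total, average, and share of speech length side by side; the `share` column is an assumption added for illustration.

```python
import pandas as pd

# Made-up values; the real notebook would use hansard[['speakerparty', 'speechlength']]
toy = pd.DataFrame({
    'speakerparty': ['Party X', 'Party X', 'Party Y'],
    'speechlength': [100, 300, 500],
})

# Number of speeches, total characters, and average characters per speech for each party
party_summary = toy.groupby('speakerparty')['speechlength'].agg(['count', 'sum', 'mean'])

# Each party's fraction of all characters spoken
party_summary['share'] = party_summary['sum'] / party_summary['sum'].sum()
print(party_summary)
```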

## Natural Language Processing

We are going to use the Python library [spaCy](https://spacy.io) for natural language processing. One feature of spaCy is the identification of [parts of speech](https://universaldependencies.org/docs/u/pos) (noun, verb, etc.). We will also simplify words to their base forms, which is called "[lemmatization](https://en.wikipedia.org/wiki/Lemmatisation)" (e.g. the lemma of "speaking" is "speak").

The following code cell creates a new column in our `hansard` DataFrame containing the nouns from the `speechtext` column. It will take about four minutes to run.

In [None]:
try:
    import spacy
    nlp = spacy.load('en_core_web_sm')
except (ImportError, OSError):  # spaCy or its English model is not installed yet
    !pip install spacy --user
    !python -m spacy download en_core_web_sm
    import spacy
    nlp = spacy.load('en_core_web_sm')

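def find_nouns(text):
    """Return the lemmas of all nouns in text (an empty list if the text is missing)."""
    if not isinstance(text, str):  # skip NaN or other non-text values
        return []
    return [token.lemma_ for token in nlp(text) if token.pos_ == 'NOUN']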

hansard['nouns'] = hansard['speechtext'].apply(find_nouns)
print('We now have the columns:', list(hansard.columns))

We can then create a bar graph of the most common nouns in the dataset.

In [None]:
noun_list = []
for row in hansard.itertuples():
    for noun in row.nouns:
        noun_list.append(noun)
nf = pd.DataFrame.from_dict(Counter(noun_list), orient='index')
top_nouns = nf.sort_values(0, ascending=False).head(30)
px.bar(top_nouns, title='Common Nouns in the House of Commons').update_layout(showlegend=False)
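
The counting step can also be looked at on its own. The sketch below uses a few made-up noun lists in place of the `nouns` column and flattens them with `itertools.chain` before counting, which is one alternative to the nested loops.

```python
from collections import Counter
from itertools import chain

# Made-up lists standing in for the rows of the nouns column
noun_lists = [['member', 'bill'], ['bill', 'budget'], [], ['bill']]

# Flatten the list of lists and count each noun
noun_counts = Counter(chain.from_iterable(noun_lists))

# most_common(n) returns the n most frequent (noun, count) pairs
print(noun_counts.most_common(2))
```

With the real data, `noun_lists` could be replaced by `hansard['nouns']`.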

### Other Parts of Speech

Try the following code cell to investigate the frequencies of other [parts of speech](https://universaldependencies.org/docs/u/pos).

In [None]:
pos = 'VERB'

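def find_pos(text):
    """Return the lemmas of all words in text whose part of speech matches pos."""
    if not isinstance(text, str):  # skip NaN or other non-text values
        return []
    return [token.lemma_ for token in nlp(text) if token.pos_ == pos]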
column_name = pos.lower()+'s'
hansard[column_name] = hansard['speechtext'].apply(find_pos)
word_list = []
for i, row in hansard.iterrows():
    for word in row[column_name]:
        word_list.append(word)
common_words = pd.DataFrame.from_dict(Counter(word_list), orient='index').sort_values(0, ascending=False).head(30)
px.bar(common_words, title='Common '+column_name+' in the House of Commons').update_layout(showlegend=False)

## Parts of Speech for Parties or Individual Speakers

Now that we have this NLP dataset, we can see which nouns are most commonly used by each party or speaker.

There are some nouns that are quite common, such as `government` and `member`. Since those seem to be more about the business of debate, we can eliminate them.

In [None]:
pos = 'NOUN'
n = 20
exclude_words = ['government', 'member', 'people', 'time', 'year', 'legislation', 'bill']

for party in hansard['speakerparty'].dropna().unique():
    word_list = []
    for words in hansard[hansard['speakerparty']==party][pos.lower()+'s']:
        for word in words:
            if word not in exclude_words:
                word_list.append(word)
    common_words = pd.DataFrame.from_dict(Counter(word_list), orient='index').sort_values(0, ascending=False).head(n)
    title = 'Top '+str(n)+' '+party+' '+pos.lower()+'s'
    px.bar(common_words, title=title).update_layout(showlegend=False, height=300).update_xaxes(title={'text':'Word'}).show()
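
As a possible alternative to the nested loops, pandas can reshape the lists so that each word gets its own row and then count the words by party. The sketch below is only an illustration with made-up parties and words; `long_form` and `party_counts` are names introduced here, not part of the notebook.

```python
import pandas as pd

# Made-up data standing in for hansard[['speakerparty', 'nouns']]
toy = pd.DataFrame({
    'speakerparty': ['Party X', 'Party X', 'Party Y'],
    'nouns': [['bill', 'budget'], ['budget', 'member'], ['member', 'climate', 'climate']],
})
exclude_words = {'member', 'bill'}  # a set makes membership checks fast

# explode turns each list element into its own row
long_form = toy.explode('nouns').rename(columns={'nouns': 'noun'})

# Drop the excluded words, then count the remaining nouns within each party
long_form = long_form[~long_form['noun'].isin(exclude_words)]
party_counts = long_form.groupby('speakerparty')['noun'].value_counts()
print(party_counts)
```

The result is a Series indexed by party and noun, so `party_counts['Party X'].head(n)` would give the top words for one party.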

In [None]:
speaker = 'Garnett Genuis'
pos = 'NOUN'
n = 25
exclude_words = ['government', 'member', 'people', 'time', 'year', 'legislation', 'bill']

word_list = []
for words in hansard[hansard['speakername']==speaker][pos.lower()+'s']:
    for word in words:
        if word not in exclude_words:
            word_list.append(word)
common_words = pd.DataFrame.from_dict(Counter(word_list), orient='index').sort_values(0, ascending=False).head(n)
title = 'Top '+str(n)+' '+pos.lower()+'s'+' spoken by '+speaker
px.bar(common_words, title=title).update_layout(showlegend=False)

## Other Datasets

We have been looking at the Hansard from 2020. If you want to explore datasets from other timeframes on [LiPaD](https://www.lipad.ca/full/), you can upload the CSV files to a `new_data` folder here, and run the following code to import them into a DataFrame.

```python
folder_name = 'new_data'
import os
import pandas as pd
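frames = []  # one DataFrame per CSV file
for root, dirs, files in os.walk(folder_name):
    for name in files:
        frames.append(pd.read_csv(os.path.join(root, name)))
# DataFrame.append was removed in pandas 2.0, so combine the files with pd.concat
hansard = pd.concat(frames)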
hansard
```

# Conclusion

The Canadian government provides transcripts of debates in the House of Commons, called the [Hansard](https://en.wikipedia.org/wiki/Hansard). In this notebook we imported the Hansard data from 2020 and identified the frequencies of some [parts of speech](https://universaldependencies.org/docs/u/pos) using natural language processing with [spaCy](https://spacy.io).

Perhaps you can try extension activities such as investigating noun frequency by province or territory, identifying the most common [named entities](https://www.geeksforgeeks.org/python-named-entity-recognition-ner-using-spacy), or creating [word clouds](https://github.com/callysto/curriculum-notebooks/blob/master/EnglishLanguageArts/WordClouds/word-clouds.ipynb).

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)