![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=SocialStudies/HansardAnalysis/hansard-analysis.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Hansard Analysis

The [Hansard](https://en.wikipedia.org/wiki/Hansard) is a transcript of debates in the Canadian Parliament. It is available from the official [Parliament of Canada website](https://www.parl.ca) as well as other sources such as [Open Parliament](https://openparliament.ca) and [LiPaD: The Linked Parliamentary Data Project](https://www.lipad.ca).

We have downloaded the 2020 files from LiPaD, and can load them by selecting the following code cell and clicking the `▶Run` button.

In [None]:
import pandas as pd
import plotly.express as px
try:
    from wordcloud import WordCloud
except ImportError:
    !pip install wordcloud
    from wordcloud import WordCloud
from collections import Counter

hansard = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/SocialStudies/HansardAnalysis/proceedings2020.csv')
print(f'There are {hansard.shape[0]} rows and {hansard.shape[1]} columns of data:')
hansard.columns

## Who Spoke?

Let's have a look at who spoke during these debates.

In [None]:
speakers = hansard.drop_duplicates(subset=['speakername'])[['speakername','speakerparty','speakerriding','speakerurl']]
speakers = speakers.dropna().reset_index().drop(columns=['index'])
print('There were',speakers.shape[0],'speakers from the',speakers['speakerparty'].unique(),'parties.')

We can compare that to the list of Members of Parliament from the [43rd Parliament](https://en.wikipedia.org/wiki/43rd_Canadian_Parliament) that started on December 5, 2019.

In [None]:
members = pd.read_csv('https://www.ourcommons.ca/members/en/search/csv?parliament=43')
print('There were',members.shape[0],'Members from the',members['Political Affiliation'].unique(),'parties.')

So of the 338 Members of Parliament we had 312 unique speakers, meaning that 26 Members are not recorded as speaking during 2020. Let's see if we can identify who they were.

In [None]:
members['Name'] = members['First Name'] +' '+ members['Last Name']
silent = []
for member in members['Name']:
    if member not in speakers['speakername'].values:
        silent.append(member)
print('That is',len(silent),'Members not recorded as speaking in 2020:')
print(silent)

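Of course 36 is not equal to 26. The comparison above only counts exact matches, so any difference in how a name is written in the two sources makes that Member look silent, and some of these Members may in fact have spoken. We will leave it to you to compare the list `silent` to the list from `speakers['speakername'].unique()` if you are interested.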
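
One way to make that comparison less sensitive to spelling is to normalize names before comparing them. The following is a minimal sketch on made-up names, not the real data; the helper `normalize_name` and both sample lists are our own assumptions for illustration.

```python
import unicodedata

def normalize_name(name):
    """Lowercase a name, strip accents, and collapse repeated spaces."""
    # NFKD splits an accented letter into a base letter plus a combining mark
    decomposed = unicodedata.normalize('NFKD', name)
    without_accents = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return ' '.join(without_accents.lower().split())

# Hypothetical names standing in for members['Name'] and speakers['speakername']
member_names = ['Jean Tremblay', 'Zoé  Côté', 'Alex Smith']
speaker_names = ['Zoe Cote', 'jean tremblay', 'Pat Jones']

member_set = {normalize_name(n) for n in member_names}
speaker_set = {normalize_name(n) for n in speaker_names}

silent_members = sorted(member_set - speaker_set)      # members with no matching speaker
unmatched_speakers = sorted(speaker_set - member_set)  # speakers missing from the member list
print(silent_members)
print(unmatched_speakers)
```

Looking at both differences helps separate Members who really did not speak from names that simply failed to match.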

## Who Spoke Most?

We can check how many times each speaker is recorded in the Hansard.

In [None]:
hansard_speakers = pd.DataFrame(hansard['speakername'].value_counts())
hansard_speakers

Let's also calculate the length (number of characters) of each of those speeches, and sort them by who said the most.

In [None]:
hansard['speechlength'] = hansard['speechtext'].str.len()
hansard.groupby('speakername').sum(numeric_only=True).sort_values('speechlength', ascending=False)
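
Note that `.str.len()` counts characters, not words. If you would rather compare word counts, one possible approach is to split each speech on whitespace first. This is a sketch on a tiny made-up DataFrame that borrows the Hansard column names; the `wordcount` column is our own addition.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the Hansard data
sample = pd.DataFrame({
    'speakername': ['A', 'B', 'A'],
    'speechtext': ['Mr. Speaker, I rise today.', 'Hear, hear!', None],
})

# Number of characters in each speech (missing text stays missing)
sample['speechlength'] = sample['speechtext'].str.len()
# Number of whitespace-separated words in each speech
sample['wordcount'] = sample['speechtext'].str.split().str.len()

# Missing values are skipped when summing per speaker
totals = sample.groupby('speakername')[['speechlength', 'wordcount']].sum()
print(totals)
```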

We can also visualize how many times each MP spoke with a histogram:

In [None]:
px.histogram(hansard_speakers, title='Histogram of Number of Speeches by Member', labels={'value':'Number of times speaking'}).update(layout_showlegend=False)

The histogram is great at showing big-picture trends, but we can also plot the total speech length for just the top few speakers to compare individual Members. Feel free to change the `n = 20` variable to see more or fewer Members.

In [None]:
n = 20
# Keep only the speechlength column so other numeric columns (such as ID numbers) are not summed and plotted
top_length = hansard.groupby('speakername')[['speechlength']].sum().sort_values('speechlength', ascending=False).head(n)
px.bar(top_length, title='Top '+str(n)+' Hansard Speakers by Speech Length', labels={'speakername': 'Speaker Name', 'value':'Total characters spoken'}).update(layout_showlegend=False)

We can also check how many times Members from each political party spoke, as well as the total length of speeches from each party:

In [None]:
px.bar(hansard['speakerparty'].value_counts(),title='Hansard Speaker Frequency by Party', labels={'index': 'Party', 'value':'Number of speeches'}).update(layout_showlegend=False)

In [None]:
# Keep only the speechlength column so other numeric columns (such as ID numbers) are not summed and plotted
speech_length_party = hansard.groupby('speakerparty')[['speechlength']].sum().sort_values('speechlength', ascending=False)
px.bar(speech_length_party, title='Hansard Speech Length by Party', labels={'speakerparty': 'Party', 'value':'Total characters spoken'}).update(layout_showlegend=False)

### Thinking Proportionally
Now, the above plots are useful in finding out which parties spoke the most, but it would be pretty reasonable to expect the parties with the most members to have the longest or most frequent speeches. In the next step, we'll look at the composition of the 43rd Parliament, and normalize the above two plots to the number of members each party has in Parliament:

In [None]:
seats = pd.DataFrame(list(zip(['Liberal', 'Conservative', 'Bloc Québécois', 'New Democratic Party', 'Green Party', 'Independent'],[157, 121, 32, 24, 3, 1])), columns=['Party', 'Seats']).set_index('Party')
px.bar(seats, x=seats.index, y='Seats', title='Number of Seats in 43rd Parliament, by Party').update(layout_showlegend=False)

In [None]:
freq_norm = pd.DataFrame(hansard['speakerparty'].value_counts()).div(seats['Seats'], axis=0)
px.bar(freq_norm,title='Hansard Speaker Frequency by Party (Normalized by Number of Seats)', labels={'index': 'Party', 'value':'Number of speeches (per seat)'}).update(layout_showlegend=False)

In [None]:
speech_length_party_norm = pd.DataFrame(speech_length_party.div(seats['Seats'], axis=0))
px.bar(speech_length_party_norm, title='Hansard Speech Length by Party (Normalized by Number of Seats)', labels={'index': 'Party', 'value':'Characters spoken (per seat)'}).update(layout_showlegend=False)
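
Dividing by `seats['Seats']` relies on pandas matching rows by their index labels, so the party names in the Hansard data need to match the names in `seats` exactly. The sketch below, with invented party names and numbers, shows what happens when a label has no match: the result is `NaN` rather than an error.

```python
import pandas as pd

# Made-up speech counts, indexed by party name
speech_counts = pd.Series({'Red': 100, 'Blue': 60, 'Grn': 9}, name='count')
# Made-up seat counts; note that 'Green' is spelled differently from 'Grn' above
seat_counts = pd.Series({'Red': 50, 'Blue': 20, 'Green': 3}, name='Seats')

# Division aligns on the index labels, not on row position
per_seat = speech_counts.div(seat_counts)
print(per_seat)

# Labels that appear in only one of the two Series end up as NaN
unmatched = per_seat[per_seat.isna()].index.tolist()
print('No match for:', unmatched)
```

If a party is missing from one of the normalized plots, checking for unmatched labels like this is a reasonable first step.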

### Topics of Importance
We can also look at which topics are addressed the most and the least, as well as which topics a particular Member of Parliament speaks about.

In [None]:
hansard_topics = pd.DataFrame(hansard.groupby('subtopic')['subtopic'].aggregate('count').reset_index(name='count'))
hansard_topics = hansard_topics.sort_values(by=['count']).reset_index()
display(hansard_topics)

We can take a look at the 10 *most discussed* topics, alongside the 10 *least discussed* topics in Parliament.

In [None]:
top_10_fig = px.bar(hansard_topics.tail(10), title="Top 10 Topics spoken in Parliament", y="subtopic", x="count", labels={'subtopic': "Topic"}, orientation='h', color='count')
top_10_fig.update_layout(showlegend=False).update_layout(yaxis_title=None).show()

bot_10_fig = px.bar(hansard_topics.head(10), title="Bottom 10 Topics spoken in Parliament", y="subtopic", x="count", labels={'subtopic': "Topic"}, orientation='h')
bot_10_fig.update_layout(showlegend=False).update_layout(yaxis_title=None).show()

Looking at both bar charts, are there topics that are *not* being addressed as much as you expected? Conversely, are there topics that you think are being addressed too often?

We can also look at which *Members of Parliament* speak on topics that you find *important*. The cell below lists the `subtopic` names that you can choose from.

In [None]:
list_of_topics = hansard_topics['subtopic'].unique()
print(list_of_topics)

The cell above lists all the subtopics discussed in Parliament. You can assign any of them to the `topic` variable in the code cell below to see which Members spoke about that topic.

In [None]:
topic = 'Health'

members_by_topic = pd.DataFrame(hansard.loc[hansard['subtopic'] == topic])
members_by_topic = members_by_topic.drop_duplicates(subset=['speakername'])
members_by_topic = members_by_topic.drop(columns=['basepk', 'hid', 'speechdate', 'pid', 'opid', 'speakerposition', 'subsubtopic', 'speechtext', 'speakeroldname', 'speakerurl', 'speakerriding']).reset_index(drop=True)
if members_by_topic.empty:
    print('No matches. Did you make sure to capitalize and space correctly?')
else:
    display(members_by_topic)
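
If a topic name is slightly misspelled, the exact comparison above finds nothing. As a hedged sketch, Python's standard `difflib` module can suggest close matches from a list of topics. The helper `suggest_topics` and the sample topic names are our own inventions, not part of the Hansard data.

```python
import difflib

def suggest_topics(query, topics, n=3):
    """Return up to n topics that look similar to query, ignoring case."""
    # Map lowercase versions back to the original spelling
    by_lower = {t.lower(): t for t in topics}
    matches = difflib.get_close_matches(query.strip().lower(), list(by_lower), n=n, cutoff=0.6)
    return [by_lower[m] for m in matches]

# Hypothetical topic names standing in for list_of_topics
topics = ['Health', 'Finance', 'Housing', 'Foreign Affairs']
suggestions = suggest_topics(' helth', topics)
print(suggestions)
```

In the notebook you could pass `list_of_topics` as the second argument and print the suggestions whenever `members_by_topic` is empty.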

We can extend this idea by looking at the topics each party spoke about most often, using the `speakerparty` column.

In [None]:
colors = ['red', 'orange', 'green', 'blue', 'lightblue', 'lightseagreen']
# Count speeches per (subtopic, party) pair once, rather than inside the loop
topic_counts = hansard.groupby(['subtopic', 'speakerparty'])['subtopic'].aggregate('count').reset_index(name='count')
topic_counts = topic_counts.sort_values(by=['count'])
for index, party in enumerate(hansard['speakerparty'].dropna().unique()):
    party_topics = topic_counts[topic_counts['speakerparty'] == party]
    fig = px.bar(party_topics.tail(10), title=f"{party}'s Top 10 Topics", y='subtopic', x='count', orientation='h')
    # Cycle through the colour list in case there are more parties than colours
    fig.update_traces(marker_color=colors[index % len(colors)]).update_layout(yaxis_title=None, showlegend=False, height=500).show()


### Questions:

1. Which topics stand out between the different parties of Parliament?
1. Can you think of reasons why different topics are discussed more in certain parties compared to others?
1. What topics would you expect to see in certain parties that aren't seen in the top 10?

## Natural Language Processing

We are going to use the Python library [spaCy](https://spacy.io) for natural language processing. One feature of spaCy is the identification of [parts of speech](https://universaldependencies.org/docs/u/pos) (noun, verb, etc.). We will also simplify words to their base forms, which is called "[lemmatization](https://en.wikipedia.org/wiki/Lemmatisation)" (e.g. the lemma of "speaking" is "speak").

The following code cell creates a new column in our `hansard` DataFrame containing the nouns from the `speechtext` column. It will take about four minutes to run.

In [None]:
try:
    import spacy
    nlp = spacy.load('en_core_web_sm')
except (ImportError, OSError):  # spaCy is not installed, or the language model is missing
    !pip install spacy --user
    !python -m spacy download en_core_web_sm
    import spacy
    nlp = spacy.load('en_core_web_sm')
from IPython.display import clear_output
clear_output()

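def find_nouns(text):
    # Missing speeches are NaN rather than strings, so return an empty list for them
    if not isinstance(text, str):
        return []
    # Keep the lemma (base form) of every token tagged as a noun
    return [token.lemma_ for token in nlp(text) if token.pos_ == 'NOUN']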

hansard['nouns'] = hansard['speechtext'].apply(find_nouns)
print('We now have the columns:', list(hansard.columns))

We can then create a bar graph of the most common nouns in the dataset.

In [None]:
noun_list = []
for row in hansard.itertuples():
    for noun in row.nouns:
        noun_list.append(noun)
nf = pd.DataFrame.from_dict(Counter(noun_list), orient='index')
top_nouns = nf.sort_values(0, ascending=False).head(30)
px.bar(top_nouns, title='Common Nouns in the House of Commons', labels={'index':'Noun', 'value':'Count'}).update_layout(showlegend=False)
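
The nested loop above can also be written more compactly with the standard library. The following is a small sketch on made-up noun lists: `itertools.chain.from_iterable` flattens the per-speech lists and `Counter.most_common` picks the most frequent words. The sample lists are invented for illustration.

```python
from collections import Counter
from itertools import chain

# Made-up lists of nouns, one list per speech, like the 'nouns' column
noun_lists = [['bill', 'house'], ['house', 'minister', 'house'], []]

# chain.from_iterable flattens the lists; Counter tallies each word
noun_counts = Counter(chain.from_iterable(noun_lists))

# most_common(n) returns (word, count) pairs, most frequent first
top_two = noun_counts.most_common(2)
print(top_two)
```

With the real data, `Counter(chain.from_iterable(hansard['nouns']))` should give the same tallies as the loop.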

### Questions:

1. How do these nouns compare to those you use in your everyday life?
1. Can you think of reasons why it would differ?
1. What words would you expect to see here that aren't listed in the top 30?

### Other Parts of Speech

Try the following code cell to investigate the frequencies of other [parts of speech](https://universaldependencies.org/docs/u/pos).

In [None]:
pos = 'PRON'

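def find_pos(text):
    # Missing speeches are NaN rather than strings, so return an empty list for them
    if not isinstance(text, str):
        return []
    # Keep the lemma (base form) of every token tagged with the chosen part of speech
    return [token.lemma_ for token in nlp(text) if token.pos_ == pos]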
column_name = pos.lower()+'s'
hansard[column_name] = hansard['speechtext'].apply(find_pos)
word_list = []
for i, row in hansard.iterrows():
    for word in row[column_name]:
        word_list.append(word)
common_words = pd.DataFrame.from_dict(Counter(word_list), orient='index').sort_values(0, ascending=False).head(30)
px.bar(common_words, title='Common '+column_name+' in the House of Commons', labels={'index':column_name, 'value':'Count'}).update_layout(showlegend=False)

## Parts of Speech for Parties or Individual Speakers

Now that we have this NLP dataset, we can see which nouns are most commonly used by each party or speaker.

There are some nouns that are quite common, such as `government` and `member`. Since those seem to be more about the business of debate, we can eliminate them.

<div class="alert alert-block alert-info">
<b>Optional:</b> The below code cell randomly replaces the name of each party in the dataset with a letter, allowing you to determine the party based on their most used words! Another cell after the plots reveals which party is which letter, but if you want to use the party name in the plots you can comment out (place a <tt>#</tt> at the beginning of each line) the below cell.
</div>

In [None]:
# Randomly obscure party names
import random
letters = ['Party A', 'Party B', 'Party C', 'Party D', 'Party E', 'Party F']
parties = hansard['speakerparty'].dropna().unique().tolist()
random.shuffle(letters)
random.shuffle(parties)

# Pair each shuffled party name with a shuffled letter label
mapping = dict(zip(parties, letters))
        
hansard['speakerparty'] = hansard['speakerparty'].replace(mapping)
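
If you would like the same letters every time you run the notebook, one option is to shuffle with a seeded random generator. The sketch below uses made-up party names; `obscure_labels` is a hypothetical helper, and the `seed` value is arbitrary. It only generates labels A to Z, so it assumes no more than 26 names.

```python
import random

def obscure_labels(names, seed=None):
    """Pair each name with a shuffled 'Party A', 'Party B', ... label."""
    rng = random.Random(seed)  # a fixed seed gives the same mapping on every run
    labels = ['Party ' + chr(ord('A') + i) for i in range(len(names))]
    rng.shuffle(labels)
    return dict(zip(names, labels))

# Made-up party names for illustration
mapping_example = obscure_labels(['Red', 'Blue', 'Green'], seed=1)

# Swap keys and values to look up the real name behind a label
reveal = {label: name for name, label in mapping_example.items()}
print(mapping_example)
print(reveal)
```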

In [None]:
pos = 'NOUN'
n = 20

# These words are so common in how Members speak in the House that they've been excluded
exclude_words = ['government', 'member', 'people', 'time', 'year', 'legislation', 'bill']

for party in hansard['speakerparty'].dropna().unique():
    word_list = []
    for words in hansard[hansard['speakerparty']==party][pos.lower()+'s']:
        for word in words:
            if word not in exclude_words:
                word_list.append(word)
    common_words = pd.DataFrame.from_dict(Counter(word_list), orient='index').sort_values(0, ascending=False).head(n)
    title = 'Top '+str(n)+' '+party+' '+pos.lower()+'s'
    px.bar(common_words, title=title, labels={'index': 'Word', 'value': 'Count'}).update_layout(showlegend=False, height=300).show()

Uncomment the below cell (remove the `#`) to reveal the party names:

In [None]:
# mapping

## By Member

Next up, [look up your own Member of Parliament](https://www.ourcommons.ca/members/en) (or any MP) and see what you can learn about their time in the House of Commons.

In [None]:
speaker = 'Garnett Genuis'
pos = 'NOUN'
n = 25
exclude_words = ['government', 'member', 'people', 'time', 'year', 'legislation', 'bill']

word_list = []
for words in hansard[hansard['speakername']==speaker][pos.lower()+'s']:
    for word in words:
        if word not in exclude_words:
            word_list.append(word)
common_words = pd.DataFrame.from_dict(Counter(word_list), orient='index').sort_values(0, ascending=False).head(n)
title = 'Top '+str(n)+' '+pos.lower()+'s'+' spoken by '+speaker
px.bar(common_words, title=title, labels={'index':pos.capitalize(), 'value':'Count'}).update_layout(showlegend=False)

We can also visualize the words as a word cloud.

In [None]:
import matplotlib.pyplot as plt
wc = WordCloud(background_color='white')
wc.generate_from_frequencies(common_words[0])
plt.imshow(wc)
plt.axis('off')
plt.show()

## Other Datasets

We have been looking at the Hansard from 2020. If you want to explore datasets from other timeframes on [LiPaD](https://www.lipad.ca/full/), you can upload the CSV files to a `new_data` folder here, and run the following code to import them into a DataFrame.

```python
folder_name = 'new_data'
import os
import pandas as pd
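frames = []
for root, dirs, files in os.walk(folder_name):
    for name in files:
        # DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
        frames.append(pd.read_csv(os.path.join(root, name)))
hansard = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()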
hansard
```

## Conclusion

The Canadian government provides transcripts of debates in the House of Commons, called the [Hansard](https://en.wikipedia.org/wiki/Hansard). In this notebook we imported the Hansard data from 2020 and identified the frequencies of some [parts of speech](https://universaldependencies.org/docs/u/pos) using natural language processing with [spaCy](https://spacy.io).

Perhaps you can try extension activities such as investigating noun frequency by province or territory, identifying the most common [named entities](https://www.geeksforgeeks.org/python-named-entity-recognition-ner-using-spacy), or creating [word clouds](https://github.com/callysto/curriculum-notebooks/blob/master/EnglishLanguageArts/WordClouds/word-clouds.ipynb).

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)