![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&urlpath=notebooks/curriculum-notebooks/SocialStudies/OpenParliament/open-parliament.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"></a>

# Open Parliament

The [Hansard](https://en.wikipedia.org/wiki/Hansard) is a transcript of debates in the Canadian Parliament. It is available from the official [Parliament of Canada website](https://www.parl.ca) as well as other sources such as [Open Parliament](https://openparliament.ca) and [LiPaD: The Linked Parliamentary Data Project](https://www.lipad.ca).

Later in this notebook, we'll also be using [openparliament.ca](https://openparliament.ca/), which provides current data on Parliament such as debates and votes.

We have downloaded the 2019 files from LiPaD, and can load them by selecting the following code cell and clicking the `▶Run` button.

In [None]:
import pandas as pd
import plotly.express as px
import numpy as np
import requests
from collections import Counter
import re
import json
try:  # attempt to import BeautifulSoup
    import bs4
    from bs4 import BeautifulSoup
except ImportError:  # install it only if the import fails
    %pip install --user bs4
    import bs4
    from bs4 import BeautifulSoup
try:
    import spacy
    nlp = spacy.load('en_core_web_sm')
except (ImportError, OSError):  # spaCy or its English model is missing
    !pip install spacy --user
    !python -m spacy download en_core_web_sm
    import spacy
    nlp = spacy.load('en_core_web_sm')
#import warnings
#warnings.simplefilter(action='ignore', category=FutureWarning)

hansard = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/SocialStudies/HansardAnalysis/proceedings2020.csv')
print(f'There are {hansard.shape[0]} rows and {hansard.shape[1]} columns of data:')
hansard.columns

###  Topics of Importance
Let's begin by looking at specific topics spoken about in Parliament, as recorded in the [Hansard](https://en.wikipedia.org/wiki/Hansard), and then compare them with current-day topics using data from [Open Parliament](https://openparliament.ca/).

In [None]:
hansard_topics = pd.DataFrame(hansard.groupby('subtopic')['subtopic'].aggregate('count').reset_index(name='count'))
hansard_topics = hansard_topics.sort_values(by=['count']).reset_index(drop=True)  # drop=True avoids a leftover 'index' column
display(hansard_topics)

We can take a look at the top 10 *most spoken* topics, alongside the top 10 *least spoken* topics, in Parliament.
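The ranking behind both charts is a count followed by a sort. Here is a minimal plain-Python sketch of that idea, using a made-up list of subtopics rather than the Hansard data (the labels and the value of `n` are assumptions for illustration):

```python
from collections import Counter

# Hypothetical subtopic labels, one per speech (not real Hansard rows).
subtopics = ['Health', 'Petitions', 'Health', 'Finance', 'Health', 'Petitions']

# Count each label, then sort ascending by count, as the notebook does.
counts = sorted(Counter(subtopics).items(), key=lambda item: item[1])

n = 2
least_spoken = counts[:n]   # comparable to hansard_topics.head(n)
most_spoken = counts[-n:]   # comparable to hansard_topics.tail(n)
print(least_spoken)
print(most_spoken)
```

Because the list is sorted in ascending order, the first rows are the least spoken topics and the last rows are the most spoken ones.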

In [None]:
top_10_fig = px.bar(hansard_topics.tail(10), title="Top 10 Topics spoken in Parliament", y="subtopic", x="count", labels={'subtopic': "Topic"}, orientation='h', color='count')
top_10_fig.update_layout(showlegend=False).update_layout(yaxis_title=None).show()

bot_10_fig = px.bar(hansard_topics.head(10), title="Bottom 10 Topics spoken in Parliament", y="subtopic", x="count", labels={'subtopic': "Topic"}, orientation='h')
bot_10_fig.update_layout(showlegend=False).update_layout(yaxis_title=None).show()

Looking at both bar charts, are certain topics *not* being addressed as much? Conversely, are there topics you think are being addressed too often?

We can also look at which *members of Parliament* speak on topics that you find *important*. In the cells below, try different `subtopic` names to see which members of Parliament talk about your chosen topic!

In [None]:
list_of_topics = hansard_topics['subtopic'].unique()
print(list_of_topics)

The output of the cell above lists all the subtopics spoken about in Parliament. Any of these subtopics can be assigned to the `topic` variable in the code cell below.

You can change the topic to anything you're interested in. For example, instead of `topic = 'Health'` you can input `topic = 'Petitions'`
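The lookup is an exact match on the topic followed by keeping one row per speaker. The sketch below shows that logic in plain Python; the rows are invented stand-ins, and only the column names come from the notebook:

```python
# Hypothetical rows standing in for the Hansard dataframe.
rows = [
    {'speakername': 'Speaker One', 'speakerparty': 'Party A', 'subtopic': 'Health'},
    {'speakername': 'Speaker Two', 'speakerparty': 'Party B', 'subtopic': 'Petitions'},
    {'speakername': 'Speaker One', 'speakerparty': 'Party A', 'subtopic': 'Health'},
    {'speakername': 'Speaker Three', 'speakerparty': 'Party B', 'subtopic': 'Health'},
]

topic = 'Health'

# Keep rows on the chosen topic, then keep only the first row per speaker.
seen = set()
members = []
for row in rows:
    if row['subtopic'] == topic and row['speakername'] not in seen:
        seen.add(row['speakername'])
        members.append((row['speakername'], row['speakerparty']))

print(members if members else 'No matches.')
```

The comparison is case-sensitive, so `'health'` would find nothing here, which is why capitalization and spacing matter when you type a topic.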

In [None]:
topic = 'Health'

members_by_topic = pd.DataFrame(hansard.loc[hansard['subtopic'] == topic]) 
members_by_topic = members_by_topic.drop_duplicates(subset=['speakername']) 
members_by_topic = members_by_topic.drop(columns=['basepk', 'hid', 'speechdate', 'pid', 'opid', 'speakerposition', 'subsubtopic', 'speechtext', 'speakeroldname', 'speakerurl', 'speakerriding']).reset_index(drop=True)
if members_by_topic.empty:
    print('No matches. Did you make sure to capitalize and space correctly?')
else:
    display(members_by_topic)

<div class="alert alert-block alert-info">
<b>Optional:</b> The code cell below randomly replaces the name of each party in the dataframe with a letter, allowing you to guess each party based on its top 10 most spoken topics! Another cell after the plots reveals which party is which letter.

If you want to use the party name in the plots you can comment out the code in the cell below (place a <tt>#</tt> at the beginning of each line) or not run it.
</div>
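One way to picture the anonymizing step: shuffle a list of labels, then pair each party with one label. This is a rough sketch with invented party names, and the fixed seed is there only to make the example repeatable:

```python
import random

# Hypothetical party names; the notebook reads the real ones from `speakerparty`.
parties = ['Party North', 'Party South', 'Party East']
letters = ['Party A', 'Party B', 'Party C', 'Party D']

rng = random.Random(0)  # seeded only so the sketch is repeatable
rng.shuffle(letters)

# Pair each party with one label; zip stops at the shorter list.
mapping = dict(zip(parties, letters))
# Inverting the mapping reveals which label hides which party.
reveal = {label: party for party, label in mapping.items()}
print(mapping)
```

If there were more parties than labels, the extra parties would simply keep their real names.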

In [None]:
import random 
letters = ['Party A', 'Party B', 'Party C', 'Party D', 'Party E', 'Party F']
parties = hansard['speakerparty'].dropna().unique().tolist()
random.shuffle(letters)
random.shuffle(parties)

mapping = dict(zip(parties, letters))  # pair each shuffled party with one shuffled letter
        
hansard['speakerparty'] = hansard['speakerparty'].replace(mapping)
hansard['speakerparty'].unique()

We can look at each party's most important topics using the `speakerparty` column.

In [None]:
colors = ['red', 'orange', 'green', 'blue', 'lightblue', 'lightseagreen']
# Count speeches per (subtopic, party) once, rather than on every pass through the loop
all_party_topics = hansard.groupby(['subtopic', 'speakerparty'])['subtopic'].aggregate('count').reset_index(name='count')
all_party_topics = all_party_topics.sort_values(by=['count'])
for index, party in enumerate(hansard['speakerparty'].dropna().unique()):
    party_topics = all_party_topics[all_party_topics['speakerparty'] == party]
    fig = px.bar(party_topics.tail(10), title=f"{party}'s Top 10 Topics", y='subtopic', x='count', orientation='h')
    # Reuse colors if there are more parties than colors
    fig.update_traces(marker_color=colors[index % len(colors)]).update_layout(yaxis_title=None, showlegend=False, height=500).show()

Uncomment the code line in the cell below (remove the `#`) to reveal the party names:

In [None]:
#mapping

Now that we've uncovered information based on the Hansard, let's utilize Open Parliament's API (**Application Programming Interface**) to obtain information on current-day debate topics. With this, we can compare topics that were spoken about in 2019 to current topics.

The cell below retrieves the topics of recent debates in Parliament. Let's keep just the English portions and visualize the most and least common topics.
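Keeping the English portion means pulling the `'en'` entry out of each bilingual heading and then counting how often each heading appears. Here is a small sketch of that step; the headings are invented examples shaped like the dictionaries the notebook expects in the `h2` field:

```python
from collections import Counter

# Hypothetical `h2` values shaped like the dicts the notebook expects.
headings = [
    {'en': 'Government Orders', 'fr': 'Ordres émanant du gouvernement'},
    {'en': 'Oral Questions', 'fr': 'Questions orales'},
    {'en': 'Government Orders', 'fr': 'Ordres émanant du gouvernement'},
    None,  # a statement with no heading
]

def english_only(value):
    """Return the English text of a bilingual dict, or '' if there is none."""
    return value.get('en', '') if isinstance(value, dict) else ''

topic_counts = Counter(english_only(h) for h in headings)
print(topic_counts.most_common())
```

Note that missing headings become empty strings and are counted too, so an unlabeled slice can show up in a chart built this way.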

In [None]:
debate_info = requests.get('http://api.openparliament.ca/debates/?limit=10&format=json')
data = debate_info.json()
debate_df = pd.DataFrame(data['objects'])

speech_urls = []
for value in debate_df['url'].values:
    speech_urls.append(requests.get(f'http://api.openparliament.ca{value}?format=json').json()['related']['speeches_url'])
combined_topics = pd.DataFrame()
for i, url in enumerate(speech_urls):
    topics = requests.get(f'http://api.openparliament.ca{url}&format=json&limit=100')
    topics_df = pd.DataFrame(topics.json()['objects'])
    extracted_col = topics_df['h2']
    extracted_df = pd.DataFrame({f'Col{i}': extracted_col})
    combined_topics = pd.concat([combined_topics, extracted_df], axis=1)
def extract_translation(value):
    if isinstance(value, dict):
        return value.get('en', '').strip('{}')
    else:
        return ''
combined_topics = combined_topics.map(extract_translation)
flatten_series = combined_topics.stack()
value_counts = flatten_series.value_counts().reset_index()
value_counts.columns = ['Value', 'Count']

common_topics = value_counts.query('Count > 10')
px.pie(common_topics, values='Count', names='Value', title="Common Topics").update_traces(textposition='inside').show()
uncommon_topics = value_counts.query('Count < 5')
px.pie(uncommon_topics, values='Count', names='Value', title='Uncommon Topics').update_traces(textposition='inside').show()

### Questions:

1. Which topics stand out between the different parties of Parliament?
2. What is the significance of studying and analyzing the topics discussed by members of Parliament in Canadian politics?
3. How might the frequency of discussions on specific topics reflect the priorities or concerns of the government and the society?
4. What challenges might arise when analyzing and interpreting data on the topics discussed in Parliament?

### Investigating Canadian Parliament's API

An API, which stands for **Application Programming Interface**, is like a bridge that allows different software applications to communicate and interact with each other. 

Imagine you're at a restaurant. The _menu_ acts as an API because it provides a simplified way for you to interact with the kitchen. Instead of going into the kitchen and asking the chef how to cook your dish, you simply order off the menu. The kitchen staff then uses your order to prepare and serve your meal.

Earlier in the notebook, we used [Open Parliament's](https://openparliament.ca) API in order to obtain information on current-day topics spoken during debates.

Let's obtain information from _openparliament_ by making a request to a specific web address. 
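A request to an API is a web address plus a few query parameters, and the reply is text in JSON format. The sketch below builds the address used in this notebook and parses a made-up reply without touching the network. The sample reply is an assumption, trimmed down to the `objects` list the notebook reads:

```python
import json
from urllib.parse import urlencode

# Base address and query parameters taken from the request in this notebook.
base = 'http://api.openparliament.ca/votes/'
params = {'format': 'json', 'limit': 100}
url = f'{base}?{urlencode(params)}'
print(url)

# Hypothetical, heavily trimmed response body; a real one has many more fields.
sample_body = '{"objects": [{"number": 1, "result": "Passed"}, {"number": 2, "result": "Failed"}]}'
records = json.loads(sample_body)['objects']
print(len(records), records[0]['result'])
```

In the notebook, `requests.get(...)` sends the request and `.json()` does the parsing in one step.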

In [None]:
r = requests.get('http://api.openparliament.ca/votes/?format=json&limit=100')
df = pd.DataFrame(r.json()['objects'])
df

Here we have information on the past 100 votes held in Parliament. However, some of the data isn't in the format we want. We want our dataframe to be _clean_ so that we can use it effectively. Data cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset. This could take the form of removing missing values, standardizing formats, or resolving inconsistencies.

In our first step of data cleaning, let's separate the `description` column into two different columns, `english_desc` and `french_desc`.
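Each `description` value is a small dictionary with an English and a French entry, so splitting it means copying each entry into its own field. Here is a plain-Python sketch of the same idea, using invented records:

```python
# Hypothetical records shaped like the `description` column (a dict per row).
votes = [
    {'number': 1, 'description': {'en': 'Second reading', 'fr': 'Deuxième lecture'}},
    {'number': 2, 'description': {'en': 'Third reading', 'fr': 'Troisième lecture'}},
]

cleaned = []
for vote in votes:
    # Copy every field except the nested description...
    row = {key: value for key, value in vote.items() if key != 'description'}
    # ...then add one flat field per language.
    row['english_desc'] = vote['description']['en']
    row['french_desc'] = vote['description']['fr']
    cleaned.append(row)
print(cleaned[0])
```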

In [None]:
df['english_desc'] = df['description'].apply(lambda x: x['en'])
df['french_desc'] = df['description'].apply(lambda x: x['fr'])
df = df.drop(columns=['description'])
df

Next, let's remove any rows with missing values, such as a `bill_url` of *None*, and extract each bill's name from its `bill_url`.
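The bill name is the last part of the `bill_url`, and a regular expression can pick it out. The sketch below uses invented URL and session pairs that follow the `/bills/<session>/<name>/` pattern used in this notebook:

```python
import re

# Hypothetical (bill_url, session) pairs following the pattern used in the notebook.
pairs = [('/bills/44-1/C-21/', '44-1'), ('/bills/44-1/S-5/', '44-1'), (None, '44-1')]

bill_names = []
for bill_url, session in pairs:
    if bill_url is None:
        continue  # the notebook removes such rows with dropna()
    match = re.search(rf'/bills/{re.escape(session)}/(.*)/', bill_url)
    if match:  # skip URLs that do not fit the pattern instead of raising an error
        bill_names.append(match.group(1))
print(bill_names)
```

The parentheses in the pattern mark the part to keep, and `group(1)` returns it.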

In [None]:
pd.set_option("display.max_rows", None)
temp_fig = df.dropna().reset_index(drop=True)
bill_names = [re.search(f"/bills/{session_name}/(.*)/", bill_url).group(1)
              for bill_url, session_name in zip(temp_fig['bill_url'], temp_fig['session'])]
temp_fig['bill_name'] = bill_names
temp_fig

Perfect! Now we have *clean* data in the correct format.

We can find the total percentage of bills in Parliament that have either **passed** or **failed** alongside the individual bills.  
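The pie chart's percentages come from counting each result and dividing by the total. Here is a quick sketch of that arithmetic with invented results:

```python
from collections import Counter

# Hypothetical vote results, one per row of the cleaned data.
results = ['Passed', 'Passed', 'Failed', 'Passed']

counts = Counter(results)
total = sum(counts.values())
percentages = {result: 100 * count / total for result, count in counts.items()}
print(percentages)
```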

In [None]:
px.pie(temp_fig['result'].value_counts().reset_index(), values='count', names='result', title='Percentage of Bills that have Passed or Failed').show()
px.bar(temp_fig, x='bill_name', y='number', color='result', hover_data=['yea_total', 'nay_total'],  height=400, title='Bills')

Looking at the figures above, is the percentage of bills that pass/fail surprising? Think about the government that has the majority of seats and the bills that are frequently being passed. Is there a correlation between these factors?

Let's take a deeper look at bills that have passed/failed multiple times. This is usually the result of bills having multiple readings or being at different stages, thus being altered at each step.

You can change the `name_of_bill` variable to look at a different bill, for example `name_of_bill = 'C-11'`

In [None]:
name_of_bill = 'C-21'

party_names = ['Green Party of Canada', "Liberal Party of Canada", "Bloc Québécois", "New Democratic Party", "Conservative Party of Canada"]
df_with_bill = temp_fig.loc[temp_fig['bill_name'] == name_of_bill]
if len(df_with_bill) == 0:
    print("No results, use the plots above to find a bill to investigate.")
for index, row in df_with_bill.iterrows():
    r = requests.get(f"http://api.openparliament.ca{row['url']}?format=json")
    data = r.json()
    vote_info = pd.DataFrame(data['party_votes'])
    vote_info = vote_info.drop(columns=['party'])  # keep the result; drop() does not modify in place
    vote_info['party'] = party_names
    voter_percentage = vote_info['vote'].value_counts(normalize=True)
    vote_info = vote_info.style.set_caption(row['english_desc'])
    display(vote_info)
    print("Percentage of parties who voted yes/no:\n", voter_percentage.to_string(),'\n')

Looking again at all the bills, we can take a deeper dive into the different bills voted on in Parliament. Specifically, we can look at how individual members of Parliament voted.

In [None]:
temp_fig[['url', 'english_desc']]

Listed above are the `url` values for the bills and their corresponding descriptions. You can use this list to find a particular bill to explore in the cell below.

Change `bill_to_explore` to take a look at the different bills members of Parliament voted on.

For example, `bill_to_explore = '/votes/44-1/279/'`

In [None]:
bill_to_explore = '/votes/44-1/333/'

r = requests.get(f"http://api.openparliament.ca/votes/ballots/?format=json&vote={bill_to_explore}")
data = r.json()
politician_vote_info = pd.DataFrame(data['objects'])

politician_urls = politician_vote_info['politician_url']
membership_urls = [f"http://api.openparliament.ca{url}?format=json" for url in politician_urls]

responses = [requests.get(url) for url in membership_urls]
data = [response.json() for response in responses]
parties = [d['memberships'][0]['party']['name']['en'] for d in data]
provinces = [d['memberships'][0]['riding']['province'] for d in data]
politician_vote_info['party'] = np.array(parties)
politician_vote_info['province_info'] = np.array(provinces)
politician_vote_info['name'] = politician_vote_info['politician_url'].str.extract("/politicians/(.*)/", expand=False)
politician_vote_info

The description for */votes/44-1/333/* states:
> 3rd reading and adoption of Bill C-21, An Act to amend certain Acts and to make certain consequential amendments (firearms)

Now we can look in more depth at how individual members of Parliament voted on this particular bill, and think about why they may have voted that way.

We can also look at how parties voted on certain bills by combining members of Parliament who share the same party.
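Combining members by party amounts to counting how many times each (party, ballot) pair occurs. Here is a minimal sketch with invented ballots:

```python
from collections import Counter

# Hypothetical ballots: (party, ballot) for each member who voted.
ballots = [('Party A', 'Yes'), ('Party A', 'Yes'), ('Party B', 'No'),
           ('Party A', 'No'), ('Party B', 'No')]

party_counts = Counter(ballots)
for (party, ballot), count in sorted(party_counts.items()):
    print(party, ballot, count)
```

A party whose members split between ballots, like the hypothetical Party A here, shows up as a bar with more than one color in a stacked chart.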

In [None]:
party_counts = politician_vote_info.groupby(['party', 'ballot'])['name'].agg('count').reset_index()
party_counts.rename(columns={"name": "count"}, inplace=True)
px.bar(party_counts, x='party', y='count', color='ballot', title='Ballot votes of each Party')

### Questions:

1. What factors do you think influence how political parties decide to vote on specific bills?
2. How can data science techniques be used to analyze and predict how certain parties may vote on a particular bill?
3. Why is it important for political parties to have a consistent voting pattern on bills in Parliament?
4. In what ways can the study of party voting patterns help citizens understand the political landscape and hold their representatives accountable?

---

### Web Scraping

To get the Hansard data for a specific debate, we will scrape the website https://openparliament.ca directly (earlier we used its API). We use the `requests` module to download the HTML for a web page, and [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) to parse the markup.

The cell below looks for certain elements on the page and collects them: the name of each speaker, their political party, their affiliation, and what they said during the debate.

If you want to change the date of the debate, change `'2023/03/31/'` to another valid date in the format `YYYY/MM/DD/` (keep the trailing slash).
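The scraping cell walks over each statement block, reads the speaker's name, and appends the text to whatever that speaker has already said. The sketch below shows that pattern on a made-up HTML fragment. It swaps in Python's built-in `html.parser` for Beautiful Soup so that it runs without extra installs. The markup is an assumption, and only the class names `pol_name` and `text` are taken from the notebook:

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup; only the class names come from the notebook.
SAMPLE = """
<div class="row statement_browser statement">
  <span class="pol_name">Speaker One</span>
  <div class="text">First remark.</div>
</div>
<div class="row statement_browser statement">
  <span class="pol_name">Speaker One</span>
  <div class="text">Second remark.</div>
</div>
"""

class StatementParser(HTMLParser):
    """Collect what each speaker said, joining text from repeat speakers."""

    def __init__(self):
        super().__init__()
        self.field = None   # which field the current text belongs to
        self.name = None    # most recent speaker name seen
        self.said = {}      # speaker name -> concatenated text

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get('class', '')
        if css_class in ('pol_name', 'text'):
            self.field = css_class

    def handle_endtag(self, tag):
        self.field = None

    def handle_data(self, data):
        if self.field == 'pol_name':
            self.name = data.strip()
        elif self.field == 'text' and self.name is not None:
            self.said[self.name] = self.said.get(self.name, '') + data

parser = StatementParser()
parser.feed(SAMPLE)
print(parser.said)
```

This sketch only handles flat markup like the sample. Real pages nest more tags, which is one reason a library such as Beautiful Soup is the more practical choice.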

In [None]:
dateOfDebate = ('2023/03/31/')

page = requests.get('https://openparliament.ca/debates/' + dateOfDebate + '?singlepage=1').text  #?singlepage=1' gets all of the speakers
data = BeautifulSoup(page, 'html.parser')
debateDict = {'Name': [],
              'Party' : [],
              'Affiliation' : [],
              'Said' : []
             }
for i in data.find_all("div", class_="row statement_browser statement"):
    try:  # getting the name of the speaker
        name = i.find('span', class_='pol_name').text
        name = str(name)
    except AttributeError:
        continue
    try:  # if they have spoken already, we do not find their party or affiliation
        index = debateDict['Name'].index(name)
        indexFound = True
    except ValueError:
        indexFound = False
        try:  # finding the affiliation
            affiliation = i.find('span', class_="pol_affil").text
            affiliation = str(affiliation)
            affiliation = affiliation.replace("						", "")
        except AttributeError:
            affiliation = 'N/A'
        try:  # For speakers without party tags
            party = i.find('p', class_='partytag').text
            party = str(party)
        except AttributeError:
            party = 'N/A'
    said = i.find('div', class_='text').text
    if indexFound:
        debateDict["Said"][index] = debateDict["Said"][index] + said
    else:
        debateDict['Name'].append(name)
        debateDict['Party'].append(party)
        debateDict['Affiliation'].append(affiliation)
        debateDict['Said'].append(said)

if debateDict == {'Name': [], 'Party': [], 'Affiliation': [], 'Said': []}:
    print("Error: Please input a valid date for the variable dateOfDebate.")
else:
    dataFrame = pd.DataFrame.from_dict(debateDict)
    dataFrame['Party'] = dataFrame['Party'].replace('\n', '', regex=True)
    dataFrame['Affiliation'] = dataFrame['Affiliation'].replace('\n', '', regex=True)
    display(dataFrame)

## What Each Party Said

Let's look at the top 25 nouns spoken by each party in Parliament.

You can also alter the variable `n` below to look at the top `n` nouns spoken by a party. 
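Once the nouns have been picked out, finding the top `n` is a matter of filtering out the excluded words and counting the rest. The sketch below uses an invented list of lemmas in place of spaCy's output, so it runs without the language model:

```python
from collections import Counter

# Hypothetical noun lemmas, standing in for the output of spaCy tagging.
lemmas = ['budget', 'member', 'housing', 'budget', 'tax', 'housing', 'budget', 'time']
exclude_words = ['member', 'time']

n = 2
word_counts = Counter(word for word in lemmas if word not in exclude_words)
top_n = word_counts.most_common(n)
print(top_n)
```

Raising `n` returns more words, and adding to `exclude_words` removes very common words that tell us little about a party's priorities.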

In [None]:
n = 25

def find_nouns(text):
    nouns = []
    try:
        for token in nlp(text):
            if token.pos_ == 'NOUN':
                nouns.append(token.lemma_)
    except:
        pass
    return nouns

parties = {'Liberal':'red', 'Conservative':'blue', 'NDP':'orange', 'Bloc':'lightblue', 'Green':'green'}
for party, color in parties.items():
    pos = 'NOUN'
    exclude_words = ['government', 'member', 'people', 'time', 'year', 'legislation', 'bill', 'madam']
    word_list = []
    index = dataFrame[dataFrame["Party"]==party].index.values
    cell_values = ''
    for item in index:
        cell_values = cell_values + dataFrame.iloc[item]["Said"]
    for words in cell_values.split(' '):
        for word in find_nouns(words):
            if word not in exclude_words:
                word_list.append(word)
    common_words = pd.DataFrame.from_dict(Counter(word_list), orient='index').sort_values(0, ascending=False).head(n)
    title = 'Top '+str(n)+' '+pos.lower()+'s'+' spoken by the '+party+' Party'
    fig = px.bar(common_words, title=title, labels={'index':pos.capitalize(), 'value':'Count'}).update_layout(showlegend=False)
    fig.update_traces(marker_color=color).show()

## What Your Representative Said

We can also look at the `Name` column in our dataframe to see which nouns are most common for an individual member of Parliament.

You can change the name of the speaker, for example `speaker = 'Jenny Kwan'`, or the part of speech, e.g. `pos = 'VERB'`.

In [None]:
speaker = 'Clifford Small'
n = 25
pos = 'NOUN'

exclude_words = ['government', 'member', 'people', 'time', 'year', 'legislation', 'bill', 'madam']
word_list = []
index = dataFrame[dataFrame["Name"]==speaker].index.values
cell_value = ''
for item in index:
    cell_value = cell_value + dataFrame.iloc[item]["Said"]
for words in cell_value.split(" "):
    for token in nlp(words):  # tag each word so the chosen `pos` is actually applied
        if token.pos_ == pos and token.lemma_ not in exclude_words:
            word_list.append(token.lemma_)
common_words = pd.DataFrame.from_dict(Counter(word_list), orient='index').sort_values(0, ascending=False).head(n)
title = speaker+"'s "+'Top '+str(n)+' '+pos.title()+'s'
px.bar(common_words, title=title, labels={'index': 'Word', 'value': 'Count'}).update_layout(showlegend=False, height=300)

## By Area

Lastly, we can also find the common nouns of representatives in certain `provinces`, `cities`, or `ridings`.

You can change `area` to any city, riding, or province. For example, `area = 'Edmonton'`.
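The area filter is a plain substring test on each speaker's affiliation text. Here is a brief sketch with invented affiliations; the "riding, province" layout is an assumption for illustration:

```python
# Hypothetical affiliation strings in a "riding, province" style.
affiliations = {
    'Speaker One': 'Riding One, AB',
    'Speaker Two': 'Riding Two, ON',
    'Speaker Three': 'Abbey Riding, BC',
}

area = 'AB'
# Substring test, as in the notebook: case-sensitive, matches anywhere in the text.
matches = [name for name, affil in affiliations.items() if area in affil]
print(matches)
```

Because the test matches anywhere in the text, a short value for `area` can also match part of a riding name, so it is worth checking which speakers were included.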

In [None]:
area = 'AB'
n = 25
pos = 'NOUN'

exclude_words = ['government', 'member', 'people', 'time', 'year', 'legislation', 'bill', 'madam']
word_list = []
cell_values = ''
for item in range(len(dataFrame.index)):
    if area in dataFrame.iloc[item]["Affiliation"]:
        cell_values = cell_values + dataFrame.iloc[item]["Said"]
    else:
        continue
for words in cell_values.split(' '):
    for word in find_nouns(words):
        if word not in exclude_words:
            word_list.append(word)
common_words = pd.DataFrame.from_dict(Counter(word_list), orient='index').sort_values(0, ascending=False).head(n)
title = 'Top '+str(n)+' '+pos+'s'+' spoken by the representatives for '+area
px.bar(common_words, title=title.title(), labels={'index':pos.capitalize(), 'value':'Count'}).update_layout(showlegend=False)

### Questions:

1. What are the benefits and limitations of web scraping as a method to collect data from online sources, such as the debates in the Canadian Parliament?
2. How can the analysis of debates and identification of common nouns be used to compare and contrast the priorities of different political parties over time?
3. Can the analysis of common nouns in the debates help us understand the language and rhetoric used by political parties and its impact on public discourse?
4. What are the potential biases or limitations in analyzing debates and identifying common nouns, and how can they be addressed to ensure the accuracy and reliability of the findings?

# Conclusion

The Canadian government provides transcripts of debates in the House of Commons, called the [Hansard](https://en.wikipedia.org/wiki/Hansard). In this notebook we imported the Hansard data from 2020 and identified the frequencies of some [parts of speech](https://universaldependencies.org/docs/u/pos) using natural language processing with [spaCy](https://spacy.io). We also scraped a debate from openparliament.ca to compare the most common nouns by party, by speaker, and by area.

We also used the Hansard to find out which topics each party prioritized, and tested whether you could identify parties based on the top 10 topics they spoke about.

Lastly, using [openparliament.ca](https://openparliament.ca/), we identified trends in voting on bills: how certain members and parties voted on particular bills, and which bills passed or failed.

Perhaps you can try extension activities such as predicting which bills will pass or fail in Parliament, identifying the most common [named entities](https://www.geeksforgeeks.org/python-named-entity-recognition-ner-using-spacy), or creating [word clouds](https://github.com/callysto/curriculum-notebooks/blob/master/EnglishLanguageArts/WordClouds/word-clouds.ipynb).

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)