# Hansard and Twitter Analysis
The goal of this project is to explore the relationship between debates in Parliament and political tweets. To do this, debate transcripts are scraped from https://openparliament.ca/ using requests and BeautifulSoup, and tweets are collected through the Twitter API. Finally, the data is compared using Plotly Express.

## Importing Modules
To get the data we need, we must first import the modules. Some of them do not come preinstalled, so we install them when the import fails.

This is done with the code:

`try:
    import module
except ImportError:
    !pip install module
    import module`

Below you will see the code.

In [None]:
try:
    import requests
except ImportError:
    !pip install requests
    import requests

try:
    import bs4
    from bs4 import BeautifulSoup
except ImportError:
    # The bs4 module is published on PyPI as beautifulsoup4.
    !pip install beautifulsoup4
    import bs4
    from bs4 import BeautifulSoup

try:
    import spacy
except ImportError:
    !pip install spacy
    !python -m spacy download en_core_web_sm
    import spacy

from collections import Counter

try:
    import twitter
except ImportError:
    !pip install twitter
    import twitter

import json
import re
import plotly.express as px
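
As a complement to the try/except pattern above, here is a minimal sketch of checking which modules are missing before installing anything. The helper name `missing_modules` is hypothetical, and the sketch only handles top-level module names.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Standard-library modules are always present; the last name is deliberately fake.
print(missing_modules(["json", "re", "surely_not_a_real_module_xyz"]))
```

Anything the helper returns could then be installed in one step instead of one try/except block per module.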

## Getting Hansard Data
### Making Soup
To get the Hansard Data we will be scraping from the website https://openparliament.ca. To do this, we use the requests module to send a request. It returns the HTML markup for the web page. To understand the markup, we will be using BS4. This is a module that sorts through the markup and allows us to pull specific data that we need.

In [None]:
dateOfDebate = '2022/10/18/'
page = requests.get('https://openparliament.ca/debates/' + dateOfDebate).content
data = BeautifulSoup(page, 'html.parser')
data

### Storing The Data
We will store the data in two ways: first, as a single string of everything said, and second, as a list with one object per statement, each holding the speaker's name, party, and what was said. Both are defined in the next cell.

In [None]:
class speaker:
  def __init__(self, name, party, said):
    self.name = name
    self.party = party
    self.said = said
speakersList = []

text = ''
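
The same record could also be written as a dataclass, which generates `__init__`, a readable `repr`, and equality checks automatically. This is only a sketch of an alternative, not the class used in this notebook; the name `Speaker` and the sample values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Speaker:
    name: str   # who spoke
    party: str  # party label, or "N/A" when none is shown
    said: str   # text of the statement

# Placeholder values, not taken from a real debate.
example = Speaker(name="Jane Doe", party="N/A", said="Mr. Speaker, I rise today.")
print(example)
```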

### Getting Data
To get our data we use BS4 to sort through the markup of the webpage. First we look at the markup (this can be done by inspecting the webpage in a browser). From there we find the elements that hold our data and use a BS4 method to retrieve them. I used .findAll() (an older alias of .find_all()) to get every statement and .find() to get the speaker's details within each one.

In [None]:
for i in data.findAll("div", class_="row statement_browser statement"):
    name = i.find('span', class_='pol_name').text
    try:
        party = i.find('p', class_='partytag').text
    except AttributeError:
        party = 'N/A\n'
    said = i.find('div', class_='text').text
    speakersList.append(speaker(name, party, said))
    text = text + said
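
To see what the loop above matches, it can help to run the same calls on a small piece of markup. The HTML below is hypothetical: it only mimics the class names used in the cell above and is not copied from openparliament.ca. The second block has no party tag, which is the case the `AttributeError` handler covers; here it is handled with an explicit `None` check instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used above.
sample_html = """
<div class="row statement_browser statement">
  <span class="pol_name">Speaker A</span>
  <p class="partytag">Party X</p>
  <div class="text">First statement.</div>
</div>
<div class="row statement_browser statement">
  <span class="pol_name">Speaker B</span>
  <div class="text">Second statement.</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = []
for block in soup.find_all("div", class_="row statement_browser statement"):
    name = block.find("span", class_="pol_name").text
    party_tag = block.find("p", class_="partytag")
    # find() returns None when the tag is absent, so check before reading .text.
    party = party_tag.text.strip() if party_tag is not None else "N/A"
    said = block.find("div", class_="text").text
    rows.append((name, party, said))

print(rows)
```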

Now we have all of our data from the debate. Next we get the Twitter data.

## Getting Twitter Data
### Limits To The Twitter API
There are many limits to my current access to the Twitter API. I only have Essential access. The limits can be read about here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api

Ideally, I would be able to access the timestamps of tweets, which would allow me to make time graphs. In addition, being able to search by the country of an account would allow me to filter out tweets from other countries, but I do not have that ability. Instead I have limited the hashtags we search to ones that are explicitly Canadian. So #conservative is not included, but #conservativepartyofcanada is. Finally, the number of tweets I can gather from each hashtag is limited, which limits the sample size. All in all, this project is a demonstration of potential rather than an in-depth analysis.

### Hashtags
Below are the hashtags we are searching.

In [None]:
hashtagsList = ["canadianpolitics", "canada", "cdnpoli", "justintrudeau", "trudeaumustgo", "trudeau", 
    "canadian", "ontario", "toronto", "ottawa", "alberta", "andrewscheer", "canadapolitics", "cpc",
    "conservativepartyofcanada", "fucktrudeau", "cdnpolitics", "canadavotes", "ppc", "britishcolumbia",
    "ndp", "cbc", "liberalpartyofcanada", "quebec", "politicscanada", "canadianelection", "erinotoole", 
    "canadiannews", "canpoli", "canadaelection", "montreal", "jagmeetsingh", "makecanadagreatagain"]

### Setup Of The Request
Below is the setup for the request. It reads my bearer token, which connects my code to my app and allows us to make requests through the Twitter API. The token is a secret, so it is loaded from an environment variable rather than written into the notebook. In addition, we have the URL and the headers.

In [None]:
import os

# TWITTER_BEARER_TOKEN is an assumed variable name; set it before running this cell.
bearer_token = os.environ["TWITTER_BEARER_TOKEN"]
url = "https://api.twitter.com/2/tweets/search/recent?query="
headers = {"Authorization": "Bearer {}".format(bearer_token)}
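
A minimal sketch of building the search URL with the query percent-encoded. It assumes the `#` hashtag operator and the `max_results` parameter behave as described in Twitter's recent search documentation. Note that this differs from the setup above, which appends the bare word to the URL and so searches for it as a keyword. The helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.twitter.com/2/tweets/search/recent"

def build_search_url(hashtag, max_results=10):
    """Build a recent-search URL for one hashtag, percent-encoding the query."""
    # '#' must be encoded as %23, otherwise it is read as a URL fragment.
    params = {"query": "#" + hashtag, "max_results": max_results}
    return BASE_URL + "?" + urlencode(params)

print(build_search_url("cdnpoli"))
# → https://api.twitter.com/2/tweets/search/recent?query=%23cdnpoli&max_results=10
```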

### Getting The Data
Here we connect to the Twitter API and request recent tweets for each hashtag listed above. We then parse the JSON response and pull out the text of each tweet, keeping count of the number of tweets collected.

In [None]:
tweetText = ''
tweetAmount = 0
for hashtag in hashtagsList:
    response = requests.get(url + hashtag, headers=headers).json()
    try:
        for tweet in response['data']:
            # Add a space so the last word of one tweet does not merge with the next tweet.
            tweetText = tweetText + tweet['text'] + ' '
            tweetAmount += 1
    except KeyError:
        # Responses with no matches or an error have no 'data' key; skip them.
        continue
print("Number Of Tweets Used: " + str(tweetAmount))
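
The parsing step can be tried without API access by running it on hand-written dictionaries. The sketch below assumes each parsed response is a dictionary whose optional `data` key holds a list of tweets with a `text` field, as in the loop above. The helper `collect_text` and the sample responses are hypothetical.

```python
def collect_text(responses):
    """Join tweet text from parsed JSON responses, skipping ones without 'data'."""
    texts = []
    for response in responses:
        # .get() with a default avoids a KeyError when 'data' is missing.
        for tweet in response.get("data", []):
            texts.append(tweet["text"])
    return " ".join(texts), len(texts)

# Hypothetical parsed responses: one with tweets, one that only has metadata.
fake_responses = [
    {"data": [{"id": "1", "text": "first tweet"}, {"id": "2", "text": "second tweet"}]},
    {"meta": {"result_count": 0}},
]
combined, count = collect_text(fake_responses)
print(count, combined)
```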

### Filtering The Data
The text we get from the Twitter API is one long string containing every tweet. It includes @mentions, URLs, and hashtags, which we are not interested in: the same hashtags and @mentions appear in many tweets that say different things, so including them would throw off our word counts.

To fix this we use RegEx to filter the unwanted text out.

In [None]:
# Remove URLs first, then hashtags and @mentions (\w also covers underscores in handles).
filteredTweetText = ' '.join(re.sub(r"(https?://\S+)|(#\w+)|(@\w+)", " ", tweetText).split())
print("Text Filtered!\nURLs, @mentions, and hashtags have been removed.")
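
To check what a filter like this keeps, it can be run on a single made-up tweet. The pattern in this sketch removes URLs, hashtags, and @mentions, and treats underscores as part of a handle. The helper name `clean_tweet` and the sample text are hypothetical.

```python
import re

# URLs are matched first so that a '#' inside a URL is removed with it.
PATTERN = re.compile(r"https?://\S+|[#@]\w+")

def clean_tweet(text):
    """Drop URLs, hashtags and @mentions, then collapse whitespace."""
    return " ".join(PATTERN.sub(" ", text).split())

sample = "Big vote today #cdnpoli @some_user https://example.com/a#b more text"
print(clean_tweet(sample))
# → Big vote today more text
```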

Now we have all our data from both Twitter and the debate. From here we can start to analyze it.

## Analysis
### Natural Language Processing
To process what is being said, we use spaCy to tag each word with its part of speech. In this case we are looking for common nouns.

Below we have a function that returns the most common nouns.

In [None]:
nlp = spacy.load('en_core_web_sm')

def getNouns(textData, amount):
    """Return the `amount` most common nouns in textData as (nouns, counts)."""
    doc = nlp(textData)
    # Keep nouns only, skipping stop words and punctuation.
    nouns = [token.text
             for token in doc
             if (not token.is_stop and
                 not token.is_punct and
                 token.pos_ == "NOUN")]
    commonNouns = Counter(nouns).most_common(amount)
    # Split the (noun, count) pairs into two parallel lists.
    nounList = [noun for noun, count in commonNouns]
    countList = [count for noun, count in commonNouns]
    return nounList, countList
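
The counting step on its own can be seen without spaCy. This sketch uses a made-up word list in place of the extracted nouns and shows how `Counter.most_common` returns (word, count) pairs that are then split into two sequences.

```python
from collections import Counter

# Made-up word list standing in for the nouns spaCy would extract.
words = ["government", "minister", "government", "bill", "minister", "government"]

most_common = Counter(words).most_common(2)
labels, counts = zip(*most_common)  # split the pairs into two tuples

print(most_common)
# → [('government', 3), ('minister', 2)]
print(list(labels), list(counts))
# → ['government', 'minister'] [3, 2]
```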

### Filtering Common Nouns
Below you can see the most common nouns across both the Hansard transcript and the Twitter data. This is done using the function above.

In [None]:
tweetNouns = getNouns(filteredTweetText, 10)
print(tweetNouns)

In [None]:
debateNouns = getNouns(text, 10)
print(debateNouns)

In [None]:
fig = px.bar(x=debateNouns[0], y=debateNouns[1], 
             title='Common Nouns in the House of Commons', 
             labels={'x':'Noun', 'y':'Count'}).update_layout(showlegend=False)

fig.show()
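
One simple way to relate the two sources is to look at which nouns appear in both top lists. The sketch below assumes each input has the same shape as the value returned by `getNouns`, a pair of (nouns, counts). The helper `shared_nouns` is hypothetical and the numbers are made up, not results from this project.

```python
def shared_nouns(tweet_result, debate_result):
    """Return {noun: (tweet_count, debate_count)} for nouns in both results."""
    tweet_counts = dict(zip(*tweet_result))
    debate_counts = dict(zip(*debate_result))
    return {noun: (tweet_counts[noun], debate_counts[noun])
            for noun in tweet_counts if noun in debate_counts}

# Made-up inputs in the (nouns, counts) shape, for illustration only.
tweet_example = (["people", "government", "time"], [12, 9, 7])
debate_example = (["government", "minister", "people"], [40, 31, 25])

overlap = shared_nouns(tweet_example, debate_example)
print(overlap)
# → {'people': (12, 25), 'government': (9, 40)}
```

With real data, an empty result would itself be informative: it would suggest the two sources were not talking about the same things on that day.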