<br><br><font color="gray">DOING COMPUTATIONAL SOCIAL SCIENCE<br>MODULE 4 <strong>PROBLEM SETS</strong></font>

# <font color="#49699E" size=40>MODULE 4 </font>


# What You Need to Know Before Getting Started

- **Every notebook assignment has an accompanying quiz**. Your work in each notebook assignment will serve as the basis for your quiz answers.
- **You can consult any resources you want when completing these exercises and problems**. Just as in the "real world", if you can't figure out how to do something, look it up. I recommend checking the relevant parts of the assigned reading or searching for inspiration on [https://stackoverflow.com](https://stackoverflow.com).
- **Each problem is worth 1 point**. All problems are equally weighted.
- **The information you need for each problem set is provided in the blue and green cells.** General instructions / the problem set preamble are in the blue cells, and instructions for specific problems are in the green cells. **You have to execute all of the code in the problem set, but you are only responsible for entering code into the code cells that immediately follow a green cell**. You will also recognize those cells because they will be incomplete. You need to replace each blank `▰▰#▰▰` with the code that will make the cell execute properly (where # is a sequentially-increasing integer, one for each blank).
- Most modules will contain at least one question that requires you to load data from disk; **it is up to you to locate the data, place it in an appropriate directory on your local machine, and replace any instances of the `PATH_TO_DATA` variable with a path to the directory containing the relevant data**.
- **The comments in the problem cells contain clues indicating what the following line of code is supposed to do.** Use these comments as a guide when filling in the blanks. 
- **You can ask for help**. If you run into problems, you can reach out to John (john.mclevey@uwaterloo.ca) or Pierson (pbrowne@uwaterloo.ca) for help. You can ask a friend for help if you like, regardless of whether they are enrolled in the course.

Finally, remember that you do not need to "master" this content before moving on to other course materials, as what is introduced here is reinforced throughout the rest of the course. You will have plenty of time to practice and cement your new knowledge and skills.
<div class='alert alert-block alert-danger'>As you complete this assignment, you may encounter variables that can be assigned a wide variety of different names. Rather than forcing you to employ a particular convention, we leave the naming of these variables up to you. During the quiz, use the 'USER_DEFINED' option to fill in any blank that you assigned an arbitrary name to.</div>
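
The problem cells build file paths with expressions such as `PATH_TO_DATA/"can_hansard_leaders.csv"`, which only works if `PATH_TO_DATA` is a path object rather than a plain string. A minimal sketch of one way to define it, assuming a hypothetical `data/hansards` directory (substitute wherever you actually saved the files):

```python
from pathlib import Path

# Hypothetical location: replace with the directory that holds your copy of the data
PATH_TO_DATA = Path("data") / "hansards"

# The "/" operator joins path components when the left-hand side is a Path
leaders_csv = PATH_TO_DATA / "can_hansard_leaders.csv"
print(leaders_csv)
```

If `PATH_TO_DATA` were a string instead, the `/` operator would raise a `TypeError`.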

## Package Imports

In [1]:
import pickle
import os
from posixpath import join
import random
from pyprojroot import here

import pandas as pd
import numpy as np
from scipy.stats import zscore

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

import spacy
from gensim.models.phrases import Phrases, Phraser

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from pprint import pprint

## Question 1:
<div class="alert alert-block alert-info">  
For the first part of this assignment, we're going to work with speech data drawn from Canadian Hansards. To save you the trouble of having to work with the entire set of Hansards from 2017 through to 2020, we've already filtered them down to only include speeches by federal party leaders present in the House of Commons. Once the data has been loaded, we'll run the entire set of speeches through <code>spaCy</code>'s nlp suite, which will prepare us for the subsequent sections!
</div>
<div class="alert alert-block alert-success">
Load the Canadian Hansard dataframe and feed the text of the speeches through <code>spaCy</code>'s nlp pipeline. Retrieve the part of speech of the first word of the speech at index 1 of the resulting list of processed speeches.
</div>

In [None]:
# Load the dataframe of leader speeches
leader_df = pd.read_csv(PATH_TO_DATA/"can_hansard_leaders.csv")

filter_terms = [
    "Justin Trudeau", # Prime Minister, 2015-Present; Leader of the Liberal Party of Canada, 2013-Present
    "Andrew Scheer", # Leader of the Official Opposition, 2017-2019
    "Mario Beaulieu", # Interim Leader of the Bloc Québécois, 2018-2019
    "Jagmeet Singh", # Leader of the New Democratic Party, 2017-Present
    "Elizabeth May", # Leader of the Green Party of Canada, 2006-2019
    "PPC", # Maxime Bernier in his capacity as Leader of the People's Party of Canada, 2018-Present 
]

# Instantiate list for dataset subsets
leader_speeches = []

# Iterate over filter_terms, filter dataset to term, and store resulting dataframe subset in 'leader_speeches'
for t in filter_terms:
    h = leader_df[leader_df["speakeroldname"].str.contains(t, na=False)]
    leader_speeches.append(h)

# Initialize nlp pipeline, taking care to disable named entity recognition
nlp = spacy.load(▰▰0▰▰, disable=[▰▰1▰▰])

# Feed each speech through spaCy's nlp pipeline
corpus = []

for speech in leader_df[▰▰2▰▰]:
    corpus.▰▰3▰▰(▰▰4▰▰(speech))

# Extract part of speech from the first word of the speech at index 1
corpus[▰▰5▰▰][▰▰6▰▰].▰▰7▰▰

## Question 2:
<div class="alert alert-block alert-info">  
Now that we have our output from <code>spaCy</code>, we have access to a wealth of information about each word in our dataset. We'll start by using this information to create a list of lists, where the outer list contains a number of inner lists, and each of the inner lists represents a single speech. We'll populate those inner lists with the lemmatized forms of all the nouns and proper nouns present in their corresponding speeches.   
</div>
<div class="alert alert-block alert-success">
Iterate over the docs to create a list of lists, where the outer list represents the documents and the inner list contains a list of lemmatized nouns and proper nouns. Retrieve the list of lemmatized nouns and proper nouns associated with the speech located at index 2 of the resulting list of speeches.
</div>

In [None]:
# Populate list containing desired parts of speech, in this order:
# Proper Nouns, Nouns
filter_list = [▰▰0▰▰, ▰▰1▰▰]

# Initialize list for containing results
lem_list = []

# Iterate over speeches
▰▰2▰▰ speech ▰▰3▰▰ corpus:
    # Iterate over each word in speech, add the lemmatized word if it matches one of the desired parts of speech
    lem_list.▰▰4▰▰([n.▰▰5▰▰ for n in speech if n.▰▰6▰▰ in filter_list])

# Extract speech from corpus at index 2
lem_list[▰▰7▰▰]

## Question 3:
<div class="alert alert-block alert-info">  
The solution to this question is going to be very similar to that of the previous question. The major difference here is that we're going to add two new types of token to our inner lists: verbs and adjectives.
</div>
<div class="alert alert-block alert-success">
Iterate over the docs to create a list of lists, where the outer list represents the documents and each inner list contains the lemmatized nouns, proper nouns, verbs, and adjectives from one speech. Retrieve the list of lemmatized nouns, proper nouns, verbs, and adjectives associated with the speech located at index 1 of the resulting list of speeches.
</div>

In [None]:
# Populate list containing desired parts of speech, in this order:
# Proper Nouns, Nouns, Verbs, Adjectives
filter_list = [▰▰0▰▰, ▰▰1▰▰, ▰▰2▰▰, ▰▰3▰▰]

# Initialize list for containing results
lem_list_2 = []

# Iterate over speeches
for speech in corpus:
    # Iterate over each word in speech, add the lemmatized word if it matches one of the desired parts of speech
    lem_list_2.▰▰4▰▰([n.▰▰5▰▰ for n in speech if n.▰▰6▰▰ in filter_list])

    
# Extract speech from corpus at index 1
lem_list_2[▰▰7▰▰]
    

## Question 4:

<div class="alert alert-block alert-info">  
One of the more useful parsing options <code>spaCy</code> provides is robust sentence detection, which is often useful for downstream tasks and sometimes even required. One of those tasks is the bigram detection implementation in gensim. Because bigrams are usually common phrases, they often end up being some kind of topic, even if it's not the core topic of the text. In these next problems, you will create a list of the most frequently occurring bigrams for each political leader. 
</div>
<div class="alert alert-block alert-success">
Using the corpus object you created earlier, create a flat list of all of the tokenized sentences in order to train a gensim Phrases model. While you're at it, use the leader_speeches list of dataframes to prepare a nested list of tokenized speech sentences for each political leader, to which you will later apply the trained bigram model. Retrieve the first sentence by the first leader.
</div>
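
The difference between a flat list of sentences and a nested list (one inner list per speaker or speech) comes down to how each batch of sentences is added to the outer list. The sketch below uses hand-written token lists as stand-ins for the spaCy output, so the names and data are purely illustrative:

```python
# Toy stand-ins for tokenized speeches: each speech is a list of sentences,
# and each sentence is a list of word strings
speeches = [
    [["thank", "you"], ["good", "morning"]],  # a speech with two sentences
    [["order", "please"]],                    # a speech with one sentence
]

flat_sentences = []    # every sentence at the top level
nested_by_speech = []  # one inner list of sentences per speech

for speech in speeches:
    flat_sentences.extend(speech)    # adds each sentence individually
    nested_by_speech.append(speech)  # adds the whole speech as a single item

print(len(flat_sentences))    # number of sentences overall
print(len(nested_by_speech))  # number of speeches
```

A flat list of token lists is the shape gensim's `Phrases` expects for training, while the nested version keeps track of which sentences belong together.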

In [None]:
# Initialize list for storing results
sent_list = []

# Iterate over speeches in corpus
for speech in corpus:
    # iterate over the sentences in each speech, and then the tokens from each of those sentences...
    # ... add each token to 'sent_list' *as individual words, not lists*.
    sent_list.▰▰0▰▰([[token.▰▰1▰▰ for token in sent] for sent in speech.▰▰2▰▰])

# Initialize list for storing results
leader_sent_lists = []

# Iterate over the separate leader-specific dataframes in the 'leader_speeches' variable (created in question 1)
▰▰3▰▰ df ▰▰4▰▰ leader_speeches:
    
    # Initialize list for storing results within the loop 
    leader_sentences = []
    
    # Run the individual dataframe (from the loop) through spacy's nlp pipe and iterate over the results
    for speech in df['speechtext']:
        
        # iterate over the sentences in the speech, and then the tokens from each of those sentences...
        # ... add each token to 'leader_sentences' *as individual words, not lists*.
        leader_sentences.▰▰5▰▰([[token.▰▰6▰▰ for token in sent] for sent in ▰▰7▰▰(speech).▰▰8▰▰])
        
    # Add the 'leader_sentences' list to 'leader_sent_lists' *as a list*
    leader_sent_lists.▰▰9▰▰(leader_sentences)

# Retrieve the first sentence spoken by the first leader in the list of lists.
leader_sent_lists[▰▰10▰▰][▰▰11▰▰]


## Question 5:

<div class="alert alert-block alert-info">  
Now we can train a model and use it on the text for each leader!
</div>
<div class="alert alert-block alert-success">
Using the sentences in the corpus, train a gensim Phrases model. Apply that model to the list that contains the list of sentences for each leader. Retrieve the bigrammed first sentence of the first speech.
</div>
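
The model in this problem is trained with `scoring='npmi'`, so its `threshold` is compared against a normalized pointwise mutual information score that ranges from -1 to 1. As a rough sketch of what that score measures (hand-computed here with made-up counts, following the formula given in the gensim documentation rather than calling gensim itself):

```python
import math

def npmi(count_a, count_b, count_ab, total_words):
    """Normalized pointwise mutual information for a candidate bigram."""
    p_a = count_a / total_words
    p_b = count_b / total_words
    p_ab = count_ab / total_words
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Made-up counts: two words that only ever appear together
always_together = npmi(count_a=10, count_b=10, count_ab=10, total_words=1000)

# Made-up counts: two words that co-occur about as often as chance predicts
chance_level = npmi(count_a=100, count_b=100, count_ab=10, total_words=1000)

print(round(always_together, 3), round(chance_level, 3))
```

Pairs that always occur together score near 1 and pairs that co-occur at chance level score near 0, so a threshold of 0.75 keeps only fairly strongly associated word pairs.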

In [None]:
# Train the model using the flat list of sentences ('sent_list') from Question 4
model = ▰▰0▰▰(sent_list, min_count=1, threshold=0.75,
                    scoring='npmi')  # train the model
    
# Create the model applicator
bigrammer = Phraser(model)  

# Initialize list
bigrammed_list = []

# Iterate over the lists in the leader_sent_lists object
▰▰1▰▰ sent_list ▰▰2▰▰ leader_sent_lists:
    
    # Initialize in-loop list
    bigrammed_sents = []
    
    # Iterate over each sentence in 'sent_list'
    ▰▰3▰▰ sent ▰▰4▰▰ sent_list:
        
        # Subscript the bigrammer with the sentence and store result 
        bigrammed_sent = bigrammer[sent]
        
        # Add the bigrammed sentence to list of bigrammed sentences
        bigrammed_sents.▰▰5▰▰(bigrammed_sent)
    
    # Add list of bigrammed sentences from leader to 'bigrammed_list'
    bigrammed_list.▰▰6▰▰(bigrammed_sents)
    
# Extract first sentence from first leader in list of lists ('bigrammed_list')
bigrammed_list[▰▰7▰▰][▰▰8▰▰]


## Question 6:

<div class="alert alert-block alert-info">  
Given that the speeches in your list of lists should - if everything went according to plan - appear in the same order as they did in the list of dataframes we used at the beginning of this part, you should be able to match up the list of bigrammed sentences to tell who's doing the talking. 
</div>

<div class="alert alert-block alert-success">
Print the 10 most common bigrams for each leader using a Pandas series. Hint: each leader has a list of tokenized sentences, where each sentence is a list of tokens and bigram tokens are two words joined by "_". Submit the name of the leader that talked about Donald Trump a lot. <br><br> Note that this question, if properly completed, should result in <b>exactly one</b> leader whose top 10 bigrams contain 'Donald_Trump'. If there are 0, 2, or more than 2 leaders who qualify, that is a strong indication that something has gone awry; try restarting the kernel and running each cell of the assignment exactly once, in order.  
</div>
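
The counting step relies on `value_counts()`, which tallies how often each distinct value appears in a Series and returns the tallies sorted from most to least frequent. A small sketch with invented tokens (not drawn from the Hansard data):

```python
import pandas as pd

# Invented tokens standing in for one leader's detected bigrams
tokens = pd.Series([
    "carbon_tax", "prime_minister", "carbon_tax",
    "middle_class", "prime_minister", "carbon_tax",
])

# Tally each distinct token, most frequent first, and keep the top two
top_two = tokens.value_counts().head(2)
print(top_two)
```

The values being counted must be hashable (strings, for example), so the Series should hold individual tokens rather than lists of tokens.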

In [None]:

# Zip together 'filter_terms' and 'bigrammed_list', and iterate over them
for leader, sent_list in ▰▰0▰▰(filter_terms, bigrammed_list):
    
    # Initialize list
    bigrams = []
    
    # Iterate over sentences in sentence list (from bigrammed_list)
    for sent in sent_list:    
        
        # Iterate over tokens in sentence and extract bigrams...
        # ... which can be identified by the presence of an underscore '_' ...
        # ... and add them to the 'bigrams' list *as a list* 
        bigrams.▰▰1▰▰([token ▰▰2▰▰ token ▰▰3▰▰ sent ▰▰4▰▰ '▰▰5▰▰' ▰▰6▰▰ token])
        
    # Convert list of bigrams into a pandas series
    bigram_series = pd.Series(bigrams)
    
    # Print the leader's top ten most-spoken bigrams
    print(leader + '\n')
    print(bigram_series.value_counts()[:10])
    print('\n')

## Question 7:
<div class="alert alert-block alert-info">  
For the following exercises, you will use a dataframe containing a sample of speeches from the Canadian Hansard data. The speech text is already pre-processed, with bigrams detected and all but nouns, proper nouns, and adjectives filtered out. Using the text from the 'preprocessed' column of the dataframe, you will create a matrix of count vectors, where each row is a speech and the columns are word counts for each item in the vocabulary. You'll then create a dataframe from this matrix, adding a column with the party of the person who made the speech.
</div>
<div class="alert alert-block alert-success">
Begin by reading the CSV into pandas and extracting the pre-processed speeches as a list. Provide this list of speeches to sklearn's CountVectorizer, turning the results into a new dataframe and naming the columns for the terms they contain counts of. 
</div>
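
The vectorizer in this problem is configured with `max_df` and `min_df`, which prune the vocabulary by document frequency: a float is read as a proportion of documents and an integer as an absolute number of documents. A minimal sketch on an invented four-document corpus, with thresholds chosen only to suit that toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini-corpus; "tax" appears in every document
docs = [
    "tax budget economy",
    "tax budget health",
    "tax health climate",
    "tax climate pipeline",
]

# Drop terms found in more than 75% of documents or in fewer than 2 documents
vectorizer = CountVectorizer(max_df=0.75, min_df=2)
vectorizer.fit(docs)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
```

Here "tax" is removed for being too common, while "economy" and "pipeline" are removed for being too rare.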

In [None]:
# read pre-processed dataset
df = pd.read_csv(PATH_TO_DATA/'processed_can_hansards.csv')
# make a list of speeches
speeches = df['preprocessed'].tolist()                 

# initialize the count vectorizer
count_vectorizer = ▰▰0▰▰(max_df=.1,          
                                   min_df=3,
                                   strip_accents='ascii',
                                   )

# apply the vectorizer to the list of speeches
count_matrix = count_vectorizer.▰▰1▰▰(speeches)   

# gather a list of the feature names (terms)
vocabulary = count_vectorizer.▰▰2▰▰()        

# use pandas sparse matrix functionality to make a dataframe
count_df = pd.DataFrame.sparse.from_spmatrix(count_matrix)    

# name the columns according to what they're counting
count_df.▰▰3▰▰ = vocabulary                            

# add the party names from the original dataframe to the new one
count_df['speakerparty'] = df['speakerparty']     

## Question 8:
<div class="alert alert-block alert-info">  
Combining the count vectors for each of the 3 largest political parties in Canada (Liberal, Conservative, NDP) you will produce a 3 row dataframe where the rows are the composition of each party's speeches in the data. Convert the raw counts to proportions of the party's total words, to more easily compare the party's term vectors to each other.
</div>
<div class="alert alert-block alert-success">
Use pandas' groupby to create party groups in the dataframe, with their word counts added together as the aggregation function. Transform the dataframe of counts into a dataframe of proportions (i.e., term count / total words). Add the terms with the largest percentages for each party to a dictionary and print the items in the dictionary.
</div>
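
Converting counts to proportions means dividing every cell in a row by that row's total, so that each party's row sums to 1. A sketch with invented counts for two hypothetical parties:

```python
import pandas as pd

# Invented term counts; rows are hypothetical parties, columns are terms
counts = pd.DataFrame(
    {"budget": [6, 1], "health": [2, 3], "climate": [2, 6]},
    index=["Party A", "Party B"],
)

# Total words per party (one value per row)
row_totals = counts.sum(axis=1)

# Divide each row by its own total; axis=0 aligns the totals with the row index
proportions = counts.div(row_totals, axis=0)
print(proportions)
```

Working with proportions rather than raw counts keeps a party with many speeches from dominating the comparison simply because it has more words.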

In [None]:
# group the speech vectors by party, adding the counts together
party_counts = count_df.▰▰0▰▰('speakerparty').▰▰1▰▰('▰▰2▰▰')      

# transform the count dataframe to proportions by dividing the values by the total words
party_percents = party_counts.div(party_counts.▰▰3▰▰(axis=1), axis=0)   
# Transpose the dataframe so that each party is a column and each term a row
party_percents = party_percents.▰▰4▰▰       

top_words_per_party = {}

# loop through each party in the data, adding their top (term,score) tuples to their dictionary entry 
for party in party_percents.▰▰5▰▰:             
    top = party_percents[party].▰▰6▰▰(10)
    top_words_per_party[party] = list(zip(top.index, top))

# print the keys (party name) and associated values (top terms) in the dictionary
▰▰7▰▰ k, v ▰▰8▰▰ top_words_per_party.▰▰9▰▰():
    print(k.upper())
    for each in ▰▰10▰▰:
        ▰▰11▰▰(each)
    print('\n')

## Question 9:
<div class="alert alert-block alert-info">  
For this problem, you will find the words that most differentiate each party from each of the other two parties, in terms of proportion of total words. The finished code won't produce any visible output, but we'll use the three resulting dataframes (l_to_c, n_to_c, l_to_n) as the basis of a visualization we'll create in the subsequent problem.
</div>
<div class="alert alert-block alert-success">
By subtracting the party term proportion vectors (that you created in the previous problem) from each other, gather the terms that are most associated with each side of the comparison. At this point, these vectors should be the columns of the dataframe. Because higher positive values are more associated with one party in the comparison and negative values with the other party, this only requires three comparisons to look at both ends of the party combinations. 
</div>
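
To see why a single subtraction covers both sides of a comparison, consider the following sketch with invented proportions for two hypothetical parties. After sorting, one end of the result holds the terms favoured by the first party and the other end holds the terms favoured by the second:

```python
import pandas as pd

# Invented term proportions; each column is a hypothetical party
proportions = pd.DataFrame(
    {"Party A": [0.6, 0.2, 0.2], "Party B": [0.1, 0.3, 0.6]},
    index=["budget", "health", "climate"],
)

# Positive values lean toward Party A, negative values toward Party B
difference = proportions["Party A"] - proportions["Party B"]
difference = difference.sort_values(ascending=False)

most_party_a = difference.index[0]   # largest positive difference
most_party_b = difference.index[-1]  # largest negative difference
print(difference)
```
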

In [None]:
# term vector calculations and sorting
lib_to_con = ▰▰0▰▰['Liberal'] - ▰▰1▰▰['Conservative']  
lib_to_con.sort_values(ascending=False, inplace=True)
ndp_to_con = ▰▰2▰▰['NDP'] - ▰▰3▰▰['Conservative']
ndp_to_con.sort_values(ascending=False, inplace=True)
lib_to_ndp = ▰▰4▰▰['Liberal'] - ▰▰5▰▰['NDP']
lib_to_ndp.sort_values(ascending=False, inplace=True)

# combine the top 5 and bottom 5 values of the comparison dataframes into new ones
l_to_c = pd.▰▰6▰▰([lib_to_con.▰▰7▰▰(), lib_to_con.▰▰8▰▰()])
n_to_c = pd.▰▰9▰▰([ndp_to_con.▰▰10▰▰(), ndp_to_con.▰▰11▰▰()])
l_to_n = pd.▰▰12▰▰([lib_to_ndp.▰▰13▰▰(), lib_to_ndp.▰▰14▰▰()])

## Question 10:
<div class="alert alert-block alert-success">
Create a swarm plot to examine the results of the comparison between the Liberals and Conservatives. The x-axis should be the term proportions and the y-axis should be the terms themselves. Use your swarm plot to determine the word that is the most negative (Conservative) on the x-axis and submit it.
</div>

In [None]:

fig, ax = ▰▰0▰▰.subplots(figsize=(6, 4))
# Create a swarmplot 
sns.▰▰1▰▰(x=l_to_c, y=l_to_c.index, color='black', size=4)
# Add a vertical line at 0
ax.▰▰2▰▰(0) 
# add a grid to the plot to make it easier to interpret
plt.grid()  

# keep in mind which party a negative value is associated with, based on which vector was the subtracted one...
ax.set(xlabel=r'($\longleftarrow$ Conservative Party)        (Liberal Party $\longrightarrow$)',
       ylabel='',
       title='Difference of Proportions')
plt.tight_layout()
plt.show()

## Question 11:
<div class="alert alert-block alert-info">  
In this next batch of problems, you'll expand on the concept of the previous ones by creating TF-IDF vectors for each party and comparing them using cosine similarity. Start by creating a TF-IDF dataframe, again with the speaker party column added. Print the terms with the top TF-IDF scores after sorting by each party.
</div>
<div class="alert alert-block alert-success">
Initialize the TfidfVectorizer and implement it in a very similar way to the CountVectorizer above. At the end, print the first 10 TF-IDF scores, sorted highest to lowest, for each party. This will also print the scores for those terms for the other parties.
</div>
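
TF-IDF down-weights terms that appear in many documents. With scikit-learn's default settings (`smooth_idf=True`), the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below checks that formula by hand against a fitted vectorizer on an invented corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus; "tax" appears in every document
docs = ["tax budget", "tax health", "tax climate health"]

vectorizer = TfidfVectorizer().fit(docs)
n_docs = len(docs)

# Recompute each term's IDF by hand using the smoothed formula
manual_idf = {}
for term in vectorizer.vocabulary_:
    doc_freq = sum(term in doc.split() for doc in docs)
    manual_idf[term] = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Compare the hand-computed values with the ones scikit-learn learned
for term, col in sorted(vectorizer.vocabulary_.items()):
    print(term, round(float(vectorizer.idf_[col]), 3), round(float(manual_idf[term]), 3))
```

A term found in every document, like "tax" here, receives the minimum weight, while rarer terms receive larger ones.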

In [None]:
tfidf_vectorizer = ▰▰0▰▰(stop_words="english",
                                   lowercase=True,
                                   max_features = 300,       # not best practice, we do this here in case of resource limitations
                                   strip_accents='ascii')

tfidf_matrix = tfidf_vectorizer.▰▰1▰▰(speeches) 

vocabulary = tfidf_vectorizer.▰▰2▰▰()

tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf_matrix)
tfidf_df.columns = vocabulary

party_scores = tfidf_df.copy()
party_scores['speakerparty'] = df['speakerparty']

# group the speech vectors by party, adding the TF-IDF scores together
party_scores = party_scores.▰▰3▰▰('speakerparty').▰▰4▰▰('▰▰5▰▰')
# Transpose the dataframe so that each party is a column and each term a row
party_scores = party_scores.▰▰6▰▰

for party in party_scores.▰▰7▰▰:
    party_scores.sort_values(by = party, ascending = False, inplace = True)
    print(party + '\n')
    ▰▰8▰▰(party_scores.head(10))
    print('\n')

## Question 12:
<div class="alert alert-block alert-info">  
Next you will calculate pair-wise cosine similarity to compare the vectors for each speaker in the data. You may have noticed that the TF-IDF scores from the last problem wound up being on different scales for each party, with the Liberals having the highest scores because they have the most speeches. This time, you will re-normalize the TF-IDF scores after adding them together, which also makes cosine similarity faster to calculate.
</div>
<div class="alert alert-block alert-success">
Create a new dataframe from the tfidf_matrix that you generated above. Be sure to add speakernames as usual, then filter the dataframe to keep only speeches by speakers with 50 or more speeches. Group the speeches by speaker, aggregating the TF-IDF vectors for each of their speeches, then use sklearn's Normalizer() to prepare the vectors for cosine similarity. Create a cosine similarity matrix by calculating the dot product of the normalized speaker score matrix and its transpose.
</div>
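
Cosine similarity between two vectors is their dot product divided by the product of their lengths. Once every row has been scaled to unit length, that denominator is 1, so multiplying the matrix by its transpose yields all pairwise similarities at once. The sketch below illustrates this with plain NumPy on invented vectors (it scales the rows by hand instead of using sklearn's `Normalizer`):

```python
import numpy as np

# Invented score vectors: rows 0 and 1 point in the same direction,
# row 2 is orthogonal to both
scores = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Scale every row to unit (L2) length
row_lengths = np.linalg.norm(scores, axis=1, keepdims=True)
unit_rows = scores / row_lengths

# Dot products of unit vectors are cosine similarities
similarity = unit_rows @ unit_rows.T
print(similarity.round(3))
```

The diagonal is always 1 (each vector compared with itself), and the matrix is symmetric, which is why only one triangle of it carries unique information.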

In [None]:
# create a new dataframe from the tfidf_matrix
speaker_scores = pd.DataFrame.sparse.from_spmatrix(tfidf_matrix)   

# turn the sparse matrix into a dense one for faster aggregation runtime
speaker_scores = speaker_scores.sparse.to_dense()                  

# add the speaker names to the new dataframe
speaker_scores[▰▰0▰▰] = df[▰▰1▰▰]                  

# keep only speakers with 50 or more speeches to speed things up and to have vectors with a bit more term diversity
speaker_scores = speaker_scores.▰▰2▰▰('▰▰3▰▰').▰▰4▰▰(lambda x: ▰▰5▰▰(x) >= 50)    
# group the speech vectors by speaker and aggregate their values
speaker_scores = speaker_scores.▰▰6▰▰('▰▰7▰▰').▰▰8▰▰('▰▰9▰▰')        

normalize = Normalizer()
# convert the aggregate TF-IDF scores into unit norms
speaker_scores_n = normalize.▰▰10▰▰(speaker_scores)      

# calculate the product of the matrix for pairwise cosine similarities
speaker_matrix = speaker_scores_n @ speaker_scores_n.T         

## Question 13:
<div class="alert alert-block alert-info">  
Identify the 5 most similar speakers and the 5 least similar speakers in the data.
</div>
<div class="alert alert-block alert-success">
Fill the diagonal and lower triangle of the cosine similarity matrix with np.nan values. Create a new dataframe from the matrix and make the speaker names both the index and the column names. Use df.stack() to make the dataframe 1-dimensional for a relatively simple way of finding the largest and smallest values in the whole matrix. Print the 5 highest and 5 lowest cosine comparisons. These will be the members of parliament whose speech topic composition is either most or least similar. 
</div>
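
Masking the diagonal and lower triangle leaves exactly one entry per unordered pair, and stacking turns the remaining grid into a Series indexed by (row, column) pairs. The sketch below shows the idea on an invented 3x3 similarity matrix; it builds the mask with `np.triu` and `DataFrame.where`, which is one alternative to filling values in place:

```python
import numpy as np
import pandas as pd

names = ["A", "B", "C"]  # hypothetical speakers
similarity = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.5],
    [0.2, 0.5, 1.0],
])

# True only strictly above the diagonal, so each pair is kept once
upper_only = np.triu(np.ones(similarity.shape, dtype=bool), k=1)

# Blank out everything else, then flatten to one value per (row, column) pair
pairs = (
    pd.DataFrame(similarity, index=names, columns=names)
    .where(upper_only)
    .stack()
    .dropna()
)
print(pairs.sort_values(ascending=False))
```

The explicit `dropna()` is included because, depending on the pandas version, `stack()` may or may not discard missing values on its own.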

In [None]:
# Fill the speaker_matrix's diagonal with NaN values
np.▰▰0▰▰(speaker_matrix, np.▰▰1▰▰)

speaker_matrix[np.tril_indices(speaker_matrix.shape[0], -1)] = np.nan
speaker_df = pd.DataFrame(speaker_matrix)

speaker_df.▰▰2▰▰ = speaker_scores.▰▰3▰▰
speaker_df.▰▰4▰▰ = speaker_scores.▰▰5▰▰

print(speaker_df.stack().▰▰6▰▰(5))
print('\n')
print(speaker_df.stack().▰▰7▰▰(5))

## Question 14:
<div class="alert alert-block alert-info">  
Print the top-weighted terms for two speakers who were among the most similar to each other.
</div>
<div class="alert alert-block alert-success">
Using the normalized speaker_scores matrix, create a dataframe with speaker names as the index and feature names (terms) as the column names. Use .loc to select the row for the speaker scores you will be examining, and print the 10 most important terms along with their TF-IDF scores. Submit the word that Anthony Rota and Bruce Stanton share as their most important word. 
</div>

In [None]:

speaker_scores_df = pd.DataFrame(speaker_scores_n)
speaker_scores_df.index = speaker_scores.index
speaker_scores_df.columns = vocabulary

top1 = speaker_scores_df.▰▰0▰▰['Anthony Rota'].▰▰1▰▰(10)
top2 = speaker_scores_df.▰▰2▰▰['Bruce Stanton'].▰▰3▰▰(10)

print("Anthony Rota's Top Words \n")
▰▰4▰▰(top1)
print('\n')
print("Bruce Stanton's Top Words \n")
▰▰5▰▰(top2)