# Aggressivity analysis
## Who are the drivers of the polarization?

## Setup

In [1]:
# Notebook config
%config Completer.use_jedi = False

In [2]:
# Built-in
import os
from IPython.display import display

# Third parties
import numpy as np
import pandas as pd
import nltk
from nltk import tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import scipy
import statsmodels.api as sm
import statsmodels.formula.api as smf


In [3]:
# Initialization needed for some modules

# tqdm for pandas
tqdm.pandas()

# NLTK configuration
nltk.download('vader_lexicon')
nltk.download('stopwords')
sia = SentimentIntensityAnalyzer()

# TokenSpace initialization
tokenSpace = tokenize.WhitespaceTokenizer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/olivier/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/olivier/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# Configuration
DATA_PATH = "data"
PKL_PATH = os.path.join(DATA_PATH, "pkl")
CSV_PATH = os.path.join(DATA_PATH, "csv")
RESOURCES_PATH = os.path.join(DATA_PATH, "resources")

In [5]:
# Utils functions

def get_sentiment(row: pd.Series) -> pd.Series:
    """
    Compute the sentiment score of a given row
    """   
    
    row['NLTK_score'] = sia.polarity_scores(row['quotation'])
    return row

def counter(text, columnText, quantity, label):
    """
    Plot the `quantity` most frequent whitespace-separated tokens
    found in column `columnText` of the DataFrame `text`
    """
    # Concatenate every entry of the column into one long string
    allWords = ' '.join(text[columnText].astype('str'))
    tokenPhrase = tokenSpace.tokenize(allWords)
    frequency = nltk.FreqDist(tokenPhrase)
    dfFrequency = pd.DataFrame({"Word": list(frequency.keys()), "Frequency": list(frequency.values())})

    # Keep only the most frequent tokens and plot them
    dfFrequency = dfFrequency.nlargest(columns="Frequency", n=quantity)
    plt.figure(figsize=(15, 3))
    ax = sns.barplot(data=dfFrequency, x="Word", y="Frequency", palette="deep")
    ax.set(ylabel="Count")
    plt.xticks(rotation='horizontal')
    plt.title(f"Most common words for {label}")
    plt.show()
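
`get_sentiment` stores the whole score dictionary in a single column, while the sample rows in this notebook also carry separate `neg`, `neu`, `pos` and `compound` columns. The cell that produced them is not part of this notebook. A minimal sketch of one way to derive them, assuming each dictionary has the four keys returned by `polarity_scores` (`expand_scores` is a hypothetical helper; the toy values are copied from two sample rows of this notebook):

```python
import pandas as pd

def expand_scores(frame: pd.DataFrame, score_col: str = "NLTK_score") -> pd.DataFrame:
    """Hypothetical helper: turn a column of score dicts into one column per key."""
    scores = pd.DataFrame(frame[score_col].tolist(), index=frame.index)
    return frame.join(scores)

# Toy rows; the scores are copied from sample rows shown in this notebook.
toy = pd.DataFrame({
    "quoteID": ["2019-10-21-012130", "2018-06-26-061249"],
    "NLTK_score": [
        {"neg": 0.0, "neu": 0.925, "pos": 0.075, "compound": 0.296},
        {"neg": 0.0, "neu": 0.666, "pos": 0.334, "compound": 0.919},
    ],
})
expanded = expand_scores(toy)
print(expanded[["quoteID", "neg", "neu", "pos", "compound"]])
```

Building the score frame with the original index keeps the join aligned even when the index is not a clean range, which is the case for `df` here.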

In [7]:
# Load df
df = pd.read_pickle(os.path.join(PKL_PATH, "final_subset.pkl"))

### For windows users :
# from pickle5 import pickle
# with open("data/pkl/final_subset.pkl", "rb") as fh:
#   df = pickle.load(fh)

In [8]:
display(df.info())
df.sample(2)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105929 entries, 0 to 6361
Data columns (total 31 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   quoteID               105929 non-null  object 
 1   quotation             105929 non-null  object 
 2   speaker               105929 non-null  object 
 3   qids                  105929 non-null  object 
 4   date                  105929 non-null  object 
 5   numOccurrences        105929 non-null  float64
 6   probas                105929 non-null  object 
 7   urls                  105929 non-null  object 
 8   phase                 105929 non-null  object 
 9   subset                105929 non-null  bool   
 10  id                    82439 non-null   object 
 11  givenName             105929 non-null  object 
 12  familyName            105929 non-null  object 
 13  unaccentedGivenName   105929 non-null  object 
 14  unaccentedFamilyName  105929 non-null  object 
 15  bi

None

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,subset,...,honorificPrefix,honorificSuffix,position,stateName,parties,NLTK score,neg,neu,pos,compound
4889,2019-10-21-012130,chairman schiff deliberately misled the americ...,adam schiff,['Q350843'],2019-10-21 00:00:00,29.0,"[['Adam Schiff', '0.5085'], ['None', '0.3032']...",['http://www.readingeagle.com/ap/article/trump...,E,True,...,,,Representative,CA,Democrat,"{'neg': 0.0, 'neu': 0.925, 'pos': 0.075, 'comp...",0.0,0.925,0.075,0.296
1433,2018-06-26-061249,leader pelosi has always enjoyed the overwhelm...,nancy pelosi,['Q170581'],2018-06-26 19:40:00,1.0,"[['Nancy Pelosi', '0.5077'], ['None', '0.48'],...",['http://www.msn.com/en-us/news/politics/democ...,E,True,...,Ms.,,Representative,CA,Democrat,"{'neg': 0.0, 'neu': 0.666, 'pos': 0.334, 'comp...",0.0,0.666,0.334,0.919


## 1. Aggressivity analysis

Which politicians are the least or the most aggressive?

To answer this, we rank speakers by their mean `compound` sentiment score and look at the most negative ones.

In [8]:
most_agg = df.groupby(["speaker", "parties"]).agg({
    "compound": "mean",
    "speaker": "size", 
}) \
.rename({"speaker": "quotes_count"}, axis=1) \
.sort_values("compound")

The extremes of this ranking are speakers with only one quote to their name, so their mean score is not meaningful. We therefore only consider speakers with at least 100 quotes.

In [9]:
most_agg = most_agg[most_agg["quotes_count"] >= 100]
# Only keep those really negative
threshold = -0.05
most_agg[most_agg["compound"] <= threshold].sort_values("compound")

| speaker | parties | compound | quotes_count |
|---|---|---|---|
| barbara lee | Democrat | -0.211459 | 182 |
| tulsi gabbard | Democrat | -0.139305 | 332 |
| joe walsh | Republican | -0.096052 | 105 |
| elijah cummings | Democrat | -0.095461 | 936 |
| maxine waters | Democrat | -0.090463 | 480 |
| bennie thompson | Democrat | -0.087981 | 129 |
| pramila jayapal | Democrat | -0.073166 | 270 |
| claudia tenney | Republican | -0.07077 | 117 |
| john yarmuth | Democrat | -0.070638 | 226 |
| xavier becerra | Democrat | -0.065034 | 354 |


In [10]:
g = pd.DataFrame(most_agg.groupby("parties").size()).rename({0: "count"}, axis=1)
g["proportion"] = g["count"] / g["count"].sum()
g
# Fewer Republicans than Democrats among speakers with at least 100 quotes

| parties | count | proportion |
|---|---|---|
| Democrat | 85 | 0.544872 |
| Republican | 71 | 0.455128 |
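
The proportions above are computed on `most_agg` after the quote-count filter only: the threshold filter in the preceding cell is displayed but never assigned back. A minimal sketch of the same proportion restricted to speakers below the threshold, on invented toy data (speaker names and scores are made up, and `min_quotes` is lowered to 2 so the example stays small):

```python
import pandas as pd

# Invented toy data: one row per quote.
toy = pd.DataFrame({
    "speaker": ["a", "a", "b", "b", "c", "c", "d"],
    "parties": ["Democrat", "Democrat", "Republican", "Republican",
                "Democrat", "Democrat", "Republican"],
    "compound": [-0.6, -0.2, -0.3, 0.1, 0.4, 0.2, -0.9],
})

min_quotes = 2      # the notebook uses 100
threshold = -0.05   # same threshold as the notebook

# Mean score and number of quotes per speaker.
per_speaker = (
    toy.groupby(["speaker", "parties"])
    .agg(compound=("compound", "mean"), quotes_count=("compound", "size"))
    .reset_index()
)

# Apply both filters and keep the result, then count speakers per party.
aggressive = per_speaker[
    (per_speaker["quotes_count"] >= min_quotes)
    & (per_speaker["compound"] <= threshold)
]
by_party = aggressive.groupby("parties").size().rename("count").to_frame()
by_party["proportion"] = by_party["count"] / by_party["count"].sum()
print(by_party)
```

Speaker `d` has the most negative score but a single quote, so the quote-count filter removes it before the threshold is applied.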


### Other approach?


In [11]:
# ML-based?
# seems to be much more involved

## 1.1 EDA of most aggressive speakers

The following speakers appear to be particularly negative, and thus polarizing, when they mention people from the opposing political camp:
- barbara lee (Democrat)
- tulsi gabbard (Democrat)
- joe walsh (Republican)

In [21]:
df.sample(1)

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,subset,...,honorificPrefix,honorificSuffix,position,stateName,parties,NLTK score,neg,neu,pos,compound
22423,2018-07-07-009398,"given their vision of a politicized judiciary,...",orrin hatch,['Q381157'],2018-07-07 00:00:00,2.0,"[['Orrin Hatch', '0.7522'], ['None', '0.1833']...",['http://www.foxnews.com/politics/2018/07/07/o...,E,True,...,,,Senator,UT,Republican,"{'neg': 0.335, 'neu': 0.514, 'pos': 0.151, 'co...",0.335,0.514,0.151,-0.7761


In [74]:
quotes = df[df["speaker"] == "joe walsh"].sample(3)[["quoteID", "compound", "quotation"]].values
for q in quotes:
    print(f"QID: {q[0]}, Score: {q[1]} \n \"{q[-1]}\"\n")

QID: 2019-04-22-024846, Score: -0.296 
 "it's weird that the democrats in the house, they're in the tough position, they have to decide what to do with this,"

QID: 2017-11-30-108399, Score: -0.4939 
 "the next time you hear a democrat talk about protecting illegals or defending sanctuary cities, remind them that it's because of an illegal and sanctuary cities that kate steinle is no longer alive today."

QID: 2019-09-10-048086, Score: 0.128 
 "it happened with obama and the [ democratic ] party [ in 2012 ] when there was no presidential primary opposition at all,"



**Barbara Lee**
Interesting quotes:
- QID: 2015-09-16-064526 Score: -0.743   
 "it's past time that republicans stop governing by crisis,"
- QID: 2017-12-21-143552 Score: -0.3252   
 "you will not have a united states government as we know it if donald trump has his way."
- QID: 2017-07-21-019041 Score: -0.8452   
 "congress has no business interfering in women's personal health decisions. my republican colleagues need to drop this dangerous crusade and stop trying to turn back the clock on women."

 
Examples of our algorithm's weaknesses:
- QID: 2017-06-29-069197 Score: -0.1796   
 "i've been working on this for years and years and years. i'm just really pleased that republicans and democrats today really understood what i've been saying and i've been explaining for the last 16 years, and that is, this resolution is a blank check for perpetual war,"
    - not directed against Republicans
- QID: 2020-04-16-025948 Score: 0.8074   
 "instead of giving relief to americans who are struggling to make ends meet, senate republicans snuck in tax breaks and corporate giveaways for their wealthy friends,"
    - should be scored as negative
    
=> A large share of her quotes are about Trump.


**Tulsi Gabbard** quotes:
- QID: 2019-09-25-056870, Score: -0.3182   
 "it's important that donald trump is defeated,"
- QID: 2020-01-10-062819, Score: -0.9734  
 "president trump has committed an illegal and unconstitutional act of war, pushing our nation headlong into a war with iran without any authorization from congress -- a war so devastating and costly it would make our wars in iraq and afghanistan look like a picnic,"
 
Problems:
- QID: 2019-05-21-110771, Score: -0.5267  
 "we can and must do so by recognizing that the effects of climate change are threatening people in communities all across the country, whether you're in a republican state or a democratic state. in order to bring about the kind of big change that we need to see, we have to come together and unite toward making the big investments that we need to make."
    - this quote calls for unity, the opposite of polarizing
- QID: 2019-04-30-091725, Score: -0.4927  
 "the most attacks i get are not from republicans,"
    - actually favorable to Republicans

If Democrats almost exclusively mention Trump, what about Republicans?

**Joe Walsh**:
- QID: 2019-08-24-040425, Score: -0.3818  
 "the truth: as practiced by most muslims, islam is not a religion. these muslims are at war w us. barack obama, a muslim, is on their side,"
- QID: 2018-09-04-100175, Score: -0.2023  
 "this whole process is a joke. all the democrats demanding to review all these kavanaugh documents are the very same democrats who announced months ago they were going to oppose kavanaugh no matter what. so why do they need to review anything?"
- QID: 2017-11-30-108399, Score: -0.4939  
 "the next time you hear a democrat talk about protecting illegals or defending sanctuary cities, remind them that it's because of an illegal and sanctuary cities that kate steinle is no longer alive today."

Problems:
..


Almost all quotes mention Trump or his impeachment in one way or another. We should therefore investigate the impact of specific events on the political divide.
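
This observation comes from reading samples. A minimal sketch of how it could be quantified, assuming a plain case-insensitive keyword match on the `quotation` column is good enough (the keyword list and the `keyword_share` helper are hypothetical; the toy quotes are shortened from the examples above):

```python
import re
import pandas as pd

def keyword_share(frame: pd.DataFrame, keywords: list) -> pd.Series:
    """Hypothetical helper: share of each speaker's quotes containing any keyword."""
    # Escape the keywords so they are matched literally.
    pattern = "|".join(re.escape(k) for k in keywords)
    hits = frame["quotation"].str.contains(pattern, case=False, regex=True)
    return hits.groupby(frame["speaker"]).mean()

toy = pd.DataFrame({
    "speaker": ["tulsi gabbard", "tulsi gabbard", "joe walsh"],
    "quotation": [
        "it's important that donald trump is defeated,",
        "the most attacks i get are not from republicans,",
        "this whole process is a joke.",
    ],
})
share = keyword_share(toy, ["trump", "impeach"])
print(share)
```

A keyword match only tells whether a topic is mentioned, not how it is framed, so it complements the sentiment score rather than replacing it.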

Before that, we look at how aggressivity evolved over the years.

## 1.2 Aggressivity by year

In [11]:
# create new "year" column in df
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year

In [12]:
# Sanity check
df.sample(2)

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,subset,...,honorificSuffix,position,stateName,parties,NLTK score,neg,neu,pos,compound,year
3814,2018-03-18-053078,"scott brown said, `he's coming to new zealand? '",elizabeth warren,['Q434706'],2018-03-18 18:36:20,1.0,"[['Elizabeth Warren', '0.7433'], ['None', '0.2...",['https://www.boston.com/news/local-news/2018/...,E,True,...,,Senator,MA,Democrat,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,2018
8525,2016-05-04-044362,i would rather run against crooked hillary cli...,donald trump,"['Q22686', 'Q27947481']",2016-05-04 11:59:44,18.0,"[['Donald Trump', '0.6367'], ['None', '0.2895'...",['http://feeds.businessinsider.com.au/~/152822...,E,True,...,,President,,Republican,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,2016


In [34]:
most_agg = df.groupby(["year", "speaker", "parties"]).agg({
    "compound": "mean",
    "speaker": "size", 
}) \
.rename({"speaker": "quotes_count"}, axis=1) \
.sort_values("compound")

most_agg = most_agg[most_agg["quotes_count"] >= 100]
# Only keep those really negative
threshold = -0.05
most_agg = most_agg[most_agg["compound"] <= threshold].sort_values("year")

In [37]:
g = pd.DataFrame(most_agg.groupby(["year", "parties"]).size()).rename({0: "count"}, axis=1)
# Divide each count by its year's total; transform keeps the (year, parties) index
g["proportion"] = g["count"] / g.groupby("year")["count"].transform("sum")
g

| year | parties | count | proportion |
|---|---|---|---|
| 2015 | Democrat | 1 | 1.0 |
| 2016 | Democrat | 1 | 0.5 |
| 2016 | Republican | 1 | 0.5 |
| 2017 | Democrat | 6 | 0.857143 |
| 2017 | Republican | 1 | 0.142857 |
| 2018 | Democrat | 8 | 0.888889 |
| 2018 | Republican | 1 | 0.111111 |
| 2019 | Democrat | 5 | 0.454545 |
| 2019 | Republican | 6 | 0.545455 |
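
The table above counts speakers who cross the threshold, which leaves very few rows per year. A complementary view, sketched here on invented toy data, is the mean compound score per year and party over all quotes. The column names follow the notebook; the values do not come from the dataset:

```python
import pandas as pd

# Invented toy data: one row per quote.
toy = pd.DataFrame({
    "date": ["2017-03-01", "2017-08-15", "2018-02-10", "2018-11-06", "2018-12-01"],
    "parties": ["Democrat", "Republican", "Democrat", "Democrat", "Republican"],
    "compound": [-0.4, 0.2, -0.1, -0.3, 0.0],
})
toy["year"] = pd.to_datetime(toy["date"]).dt.year

# One row per year, one column per party, mean compound score in each cell.
yearly = toy.pivot_table(index="year", columns="parties", values="compound", aggfunc="mean")
print(yearly)
```

Because every quote contributes to the mean, this view does not depend on the 100-quote cutoff or on the choice of threshold.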


Observations:

**2015**:
- blabla

## 2. Impact of events 

### 2.1 Events dataset

We could not find a single dataset suited to our needs, so we built one ourselves by selecting relevant events from several sources:
- Wikipedia's yearly lists of events in the United States [link](https://en.wikipedia.org/wiki/2015_in_the_United_States).
- ACLED (The Armed Conflict Location & Event Data Project), which covers US events such as protests and shootings. We keep only the most important events from this dataset, i.e. those with the most fatalities [link](https://acleddata.com/).
- Most impactful events in US history as defined by the Encyclopedia Britannica [link](https://www.britannica.com/list/25-decade-defining-events-in-us-history).
- Most impactful events in US history as defined by the BBC [link](https://www.bbc.com/news/world-us-canada-16759233).
- Most impactful events in US history as defined by Time magazine [link](https://time.com/3889533/25-moments-changed-america/).
- Most impactful events across generations, according to a Pew Research Center survey [link](https://www.pewresearch.org/politics/2016/12/15/americans-name-the-10-most-significant-historic-events-of-their-lifetimes/).


The final dataset was handpicked from Wikipedia (see `events_dataset.ipynb`). The actual analysis was done by René.
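
One simple way to relate such an event list to the quotes, sketched under stated assumptions, is to compare the mean compound score in a fixed window before and after each event date. The event date, the scores and the `window_means` helper below are all invented for illustration and are not taken from `events_dataset.ipynb`:

```python
import pandas as pd

def window_means(quotes: pd.DataFrame, event_date: str, days: int = 30) -> dict:
    """Hypothetical helper: mean compound score in the `days` before and after an event."""
    event = pd.Timestamp(event_date)
    delta = pd.Timedelta(days=days)
    before = quotes[(quotes["date"] >= event - delta) & (quotes["date"] < event)]
    after = quotes[(quotes["date"] >= event) & (quotes["date"] <= event + delta)]
    return {"before": before["compound"].mean(), "after": after["compound"].mean()}

# Invented toy quotes around an invented event date.
quotes = pd.DataFrame({
    "date": pd.to_datetime(["2019-09-01", "2019-09-20", "2019-09-25", "2019-10-10"]),
    "compound": [0.2, 0.0, -0.5, -0.3],
})
result = window_means(quotes, "2019-09-24")
print(result)
```

The window length is a free parameter: a short window isolates the event better but leaves fewer quotes to average over.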