# Polarization around climate change: 
## Is it growing as fast as the polar circle is shrinking? 

### The notebook

This notebook is the result of a four-way merge of our individual notebooks, so some things may look off (e.g. missing graphs, cells that cannot run). In that case, please refer to the individual notebooks under /notebooks.

DISCLAIMER: Viewing experience using Visual Studio Code might be suboptimal. We advise the reader to read this notebook using the Jupyter Notebook/Lab interface.

NB: We are submitting a single notebook, but we strongly believe that, for a better reviewing experience, you should review each notebook independently in the following order:
1. json-filtering.ipynb
2. sentiment_analysis.ipynb
3. word_embeddings.ipynb 
4. wikipedia.ipynb

Let us explain the logic of this milestone 2 to make your reading experience easier.

First, json-filtering lays out the preprocessing pipeline that filters the quotes down to those on the theme of climate change. Additionally, we only keep quotes whose speaker was assigned with a probability > 0.9. We go from the full Quotebank dataset to one pickled dataframe of manageable size per year (2015-2020).

The sentiment analysis part presents our study of the current best practices of sentiment analysis. From this study, we pick our sentiment analyzer and do some elementary data exploration on the 2017 data.

Next, the word embeddings part showcases the pipeline put in place to go from quote to word embedding using Word2Vec. It then showcases our capability to visualize the data using word embeddings.

Finally, the wikipedia part showcases our ability to extract features from Wikipedia using QIDs, such as gender, political affiliation or age. Wikipedia data is quite messy, and the heuristics used to extract these features are shown.

# Colab Setup

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm.notebook import trange, tqdm
import bz2
import json
import os
from urllib.parse import urlparse
from importlib import reload
import numpy as np

## FIRST TIME? uncomment this to get started
# if you dont have a token https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
"""
os.chdir('/content/drive/MyDrive/')
!git clone https://USERNAME:TOKEN@github.com/epfl-ada/ada-2021-project-adada-sur-mon-bidet.git
"""

os.chdir('/content/drive/Shareddrives/ADA/ada-2021-project-adada-sur-mon-bidet/')
import helpers.helpers as helpers

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
!git pull

Updating 9b2737e..c1f6241
error: Your local changes to the following files would be overwritten by merge:
	base_climate_dictionary.txt
Please commit your changes or stash them before you merge.
Aborting


In [None]:
data_path = 'Quotebank/'
out_path  = 'output/'

years = range(2015, 2021)

# os.listdir returns files in arbitrary order, so sort before pairing with years
data_files = sorted(os.listdir(data_path))
path_to_files = dict(zip(years, [data_path + f for f in data_files]))
path_to_files

{2015: 'Quotebank/quotes-2015.json.bz2',
 2016: 'Quotebank/quotes-2016.json.bz2',
 2017: 'Quotebank/quotes-2017.json.bz2',
 2018: 'Quotebank/quotes-2018.json.bz2',
 2019: 'Quotebank/quotes-2019.json.bz2',
 2020: 'Quotebank/quotes-2020.json.bz2'}

In [None]:
#Load climate dict
climate_dict = []
with open('base_climate_dictionary.txt', 'r') as f:
    climate_dict = f.read().split("\n")

print(len(climate_dict), climate_dict[:10])

62 ['aerosol', 'agriculture', 'atmosphere', 'agriculture', 'atmosphere', 'biosphere', 'carbon', 'climate', 'climatology', 'coral']



# Preprocessing and Data Filtering

Preprocessing steps are described here; for extensive details and code, please refer to notebooks/preprocessing.ipynb. Most steps were also presented in milestone 2.

## json -> smaller json
We start from the provided Quotebank json files and select quotes that match the following criteria:
 - Only quotations with a good speaker identification confidence are kept
 - Only quotations referring to our chosen subject are kept
 - Only the domain name, or nothing, is kept from the urls

This filtering removes around 88% of the data, leaving us with workable sizes.
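
The three criteria above can be sketched as a streaming filter over one compressed file. This is only a minimal sketch, not the code from notebooks/preprocessing.ipynb: the function names are hypothetical, and we assume each line is a JSON record whose `probas` field is a list of `[speaker, probability-as-string]` pairs sorted by decreasing probability, with unattributed quotes labelled `"None"`.

```python
import bz2
import json
from urllib.parse import urlparse


def keep_quote(record, vocab, min_proba=0.9):
    """Return True if the top speaker is confident enough and the quote is on topic."""
    # Assumption: probas[0] is the most likely speaker, stored as [name, "0.93"].
    speaker, proba = record["probas"][0]
    if speaker == "None" or float(proba) <= min_proba:
        return False
    text = record["quotation"].lower()
    # Substring test against the climate vocabulary
    return any(word in text for word in vocab)


def domains_only(urls):
    """Keep only the domain name of each URL."""
    return [urlparse(u).netloc for u in urls]


def filter_file(in_path, out_path, vocab, min_proba=0.9):
    """Stream one compressed file and write the kept quotes to a smaller one."""
    with bz2.open(in_path, "rt") as src, bz2.open(out_path, "wt") as dst:
        for line in src:
            record = json.loads(line)
            if keep_quote(record, vocab, min_proba):
                record["urls"] = domains_only(record["urls"])
                dst.write(json.dumps(record) + "\n")
```

Streaming line by line keeps memory usage flat, which matters because the yearly files are too large to load at once.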

## json -> pickle

Adds a little pandas-related preprocessing and saves the result as pickle files to reduce loading times.

Additional preprocessing:
*   safety drop na
*   index using quoteid
*   drop irrelevant columns
*   type correctly date and phase
*   normalize quotes to alphanumeric lowercase characters
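
As an illustration of the steps listed above, here is a minimal sketch. The helper names are hypothetical, the choice of `probas` as the irrelevant column is an assumption, and replacing punctuation by a space is inferred from quotes such as "i ll cop that" rather than taken from the actual preprocessing code.

```python
import re

import pandas as pd


def normalize_quote(text):
    """Lowercase and replace every run of non-alphanumeric characters with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower())


def to_clean_frame(records):
    """Turn a list of filtered quote records into a typed, indexed dataframe."""
    df = pd.DataFrame(records).dropna()                 # safety drop na
    df = df.set_index("quoteID")                        # index using quoteid
    df = df.drop(columns=["probas"], errors="ignore")   # assumed irrelevant column
    df["date"] = pd.to_datetime(df["date"])             # type date correctly
    df["phase"] = df["phase"].astype("category")        # type phase correctly
    df["quotation"] = df["quotation"].apply(normalize_quote)
    return df
```

The resulting frame can then be saved with `df.to_pickle(...)`, as done for the files under output/.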

## EDA and FP filtering
We investigate data density:

*   Distribution of occurrence (expecting exponential or power-law)
*   Distribution of quote lengths (expect exponential or power-law)
*   Temporal distribution
*   Topic distribution

Then we check data quality. Further processing is hurt in particular by false positives (FP), so we come up with different ways to reduce their occurrence, sometimes at the cost of many datapoints.

*   Quote quality: reading random quotes!
  * small quotes
  * false positives on climate topic
    * small words: "vegas" contains "gas"
    * "energy" doesn't refer to electricity
    * "nuclear" and "atmosphere" etc are off-topic

## different pickles available

* df, the original version, before FP filtering
* sanitized_df, df with far fewer FP
* sanitized_strict_df, df with as few FP as possible, at the cost of a lot of data points

Each of these versions comes with a "*prefix*_dummies" pickle that records, for each quote, which words of our vocabulary occur in it.

## EDA and looking for problems

In [6]:
df = pd.read_pickle("output/df")
_df = df.sample(n=2000)
df.head()

| quoteID | quotation | speaker | qids | date | numOccurrences | urls | phase |
|---|---|---|---|---|---|---|---|
| 2020-02-21-000455 | 2019 was a landmark year for fiverr as we comp... | Micha Kaufman | [Q26923564] | 2020-02-21 13:00:00 | 1 | [www.fool.com] | E |
| 2020-03-01-005419 | councils and communities are passionate about ... | Linda Scott | [Q19667145, Q469184] | 2020-03-01 16:30:28 | 45 | [cowraguardian.com.au, wauchopegazette.com.au,...] | E |
| 2020-04-01-026038 | i will encourage anyone from the caloundra ele... | Mark McArdle | [Q6768772] | 2020-04-01 15:00:00 | 1 | [www.sunshinecoastdaily.com.au] | E |
| 2020-02-24-028340 | if you re a doctor that cares about the wellbe... | Fiona Stanley | [Q1653736] | 2020-02-24 12:45:00 | 4 | [watoday.com.au, www.theage.com.au, www.smh.co...] | E |
| 2020-03-09-038856 | march has the largest amount of acreage burned... | Michael Guy | [Q11107729] | 2020-03-09 07:37:02 | 7 | [kvia.com, abc17news.com, localnews8.com, www....] | E |


### Occurrences

We check the distribution of quote occurrences. We expect some kind of power law, where most quotes are cited only a few times but some are very popular and reach high occurrence counts.
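
One way to check the power-law expectation, beyond a log-scaled histogram, is to look at the complementary cumulative distribution (the share of quotes cited at least k times) on log-log axes, where a power law appears as a roughly straight line. The helpers below are a sketch with hypothetical names, and the fitted slope is only a rough indicator, not a rigorous power-law test.

```python
import numpy as np


def ccdf(values):
    """Return the sorted unique values and the fraction of samples >= each value."""
    values = np.sort(np.asarray(values))
    # first_idx[i] counts how many samples are strictly smaller than unique[i]
    unique, first_idx = np.unique(values, return_index=True)
    frac_at_least = 1.0 - first_idx / values.size
    return unique, frac_at_least


def loglog_slope(values):
    """Least-squares slope of log(CCDF) against log(value); values must be positive."""
    x, y = ccdf(values)
    slope, _intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope
```

For plotting, `plt.loglog(*ccdf(df["numOccurrences"]))` would draw the curve for the whole dataset without needing to pick bins.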

In [None]:
display(df[["numOccurrences"]].describe())
plt.hist(_df["numOccurrences"],bins=10,log=True)
plt.title('Histogram of climate quote occurrences (cumulative)')
plt.ylabel('count')
plt.xlabel('occurrence')
plt.show()

In [None]:
# most popular climate quote
df.nlargest(10, "numOccurrences")

### Quote lengths
Expecting an exponential distribution

In [None]:
df["quoteLength"] = df["quotation"].apply( len )
df["quoteWC"] = df["quotation"].apply( lambda q : len(q.split()))

_df = df.sample(n=2000)

In [None]:
def hist_lengths(x):
  fig, ax = plt.subplots(2, figsize=(12, 8))
  ax[0].hist(x["quoteLength"], log = True, bins = 30)
  ax[0].set_title("histogram of characters in quotes")
  ax[1].hist(x["quoteWC"], log = True, bins = 30)
  ax[1].set_title("histogram of words in quotes")
  return

hist_lengths(_df)

#### small quotes

We want to make sure even small quotes carry meaning.
Let's check a few of them to see whether we have irrelevant quotes like "Climate climate climate".

In [None]:
def small_quote(df, threshold = 5):
  df["quoteWC"] = df["quotation"].apply( lambda q : len(q.split()))
  return df.query("quoteWC <=" + str(threshold))

display(small_quote(_df)["quotation"].head(10))
small_quote(df, threshold=5)["quoteWC"].value_counts()

small_quote(df, threshold=3)["quotation"][:20]

quoteID
2015-07-27-082359        the seductiveness of cheap gasoline 
2015-06-25-026765                              i ll cop that 
2016-09-27-064724                   it was not the coalition 
2018-09-17-021329                   focus even more energy on
2019-03-27-092439                  the climate is so extreme 
2018-05-10-004043           advancing clean energy solutions 
2015-08-10-009263                  because it s unrefined gas
2019-04-30-023522                    fuel for new innovation 
2015-03-31-102395               world s worst climate villain
2017-06-15-129244    this represents value added agriculture 
Name: quotation, dtype: object

quoteID
2020-02-18-070635            strong sustainable concept 
2020-02-26-058255                       size scope scale
2020-01-07-080448                warming drying climate 
2019-01-15-047739          inclusive sustainable growth 
2019-04-16-041825    irrepressible irresponsible energy 
2019-10-29-012869                  calm cerebral energy 
2019-04-05-023701                         game on vegas 
2019-10-24-000108                        climate change 
2019-10-31-086128        stable sustainable development 
2019-03-10-037534                       the gas chamber 
2019-07-08-002705                         all his energy
2019-10-20-010429              genuine warming charisma 
2019-05-03-081411                     still dodgin cops 
2019-10-30-000115                          dont copy me 
2018-04-29-017901         holistic biodynamic ecosystem 
2018-05-18-085088                     putting his energy
2018-01-08-022335                      fresh new energy 
2018-07-20-058650      

These look fine; no further preprocessing is needed.

### Time distribution

In [None]:
datetime_index = df.reset_index().quoteID.apply(lambda x: x[:10])
date_df = df.set_index(datetime_index)
date_df.index = pd.to_datetime(date_df.index)

In [None]:
occs = date_df.groupby(pd.Grouper(freq="3M"))["numOccurrences"].aggregate(["sum", "count"])
sns.lineplot(data=occs)

### Topic distribution

We would like to see which sub-topics of climate change are prominent, and whether they are evenly represented. For example, if 50% of quotes are about nuclear power, we should keep it in mind in further conclusions.

For further use we generate a dataframe with dummies that can easily be attached if needed. This df is relatively wide and quite sparse.

In [None]:
# generate dummies
dm = df[["quotation"]].copy()

for w in climate_dict:
  dm[w] = dm["quotation"].apply(lambda q : 1 if w in q else 0)

# plot occurrence and top 5
x = dm.describe().loc["mean"]
ax = plt.subplot()  # plt.subplot() returns an Axes, not a Figure
ax.hist(x.values)
ax.set_title("mean occurrence of words in a quote")
ax.set_ylabel("words count")
ax.set_xlabel("occurrence probability of word in a quote")

display(x.nlargest(5))

As expected from the previous analysis, energy quotes are very common, but most are filtered out in the sanitized versions of our preprocessed data. Some vocabulary words are also very rare, which is expected for terms such as "ar4".

In [None]:
def generate_dummies(df, vocab, out=None):
  dm = df[["quotation"]].copy()

  for w in vocab:
    dm[w] = dm["quotation"].apply(lambda q : 1 if w in q else 0)
  
  if(out):
    dm.to_pickle(out)
  return dm

### Climate topic false positives

False positives might be a problem if there are too many, as further conclusions might be biased. First, we browse through at least 100 quotes to identify recurring false positive examples and get an idea of the FP rate.

TOTAL read : 100
of which FP: 20

This is on the high side.
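
Since this rate comes from only 100 hand-read quotes, it is worth quantifying its uncertainty. The sketch below uses the Wilson score interval, which is our choice for illustration and not something computed in the original analysis.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95% confidence)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 20 false positives among the 100 quotes read
low, high = wilson_interval(20, 100)
print(f"FP rate: 20% (95% interval from {low:.0%} to {high:.0%})")
```

With 20 FP out of 100, the interval spans roughly 13% to 29%, so the true rate could be noticeably lower or higher than the point estimate.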

In the following subsections, we replay some FP recurring examples and deal with them to reduce this FP rate.

#### Small words

Example:

"we re going to hop on a plane head to new york and show people what ve**gas** is all about"

==> small words can end up randomly in the middle of other words

Instead of looking for dictionary words as substrings of the quotation, we can look for them in the list of words of the quotation. The disadvantage is that this excludes all words from the same family that don't match exactly (e.g. "gases" for "gas").
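
The difference between the two matching strategies can be seen on a toy example. The function names and the two-word vocabulary are for illustration only.

```python
def substring_match(quote, vocab):
    """True if any vocabulary word appears anywhere in the quote, even inside another word."""
    return any(word in quote for word in vocab)


def whole_word_match(quote, vocab):
    """True only if a vocabulary word appears as a full token of the quote."""
    return not set(quote.split()).isdisjoint(vocab)


vocab = ["gas", "climate"]             # toy vocabulary for illustration
quote = "game on vegas"
print(substring_match(quote, vocab))   # True: "gas" is inside "vegas"
print(whole_word_match(quote, vocab))  # False: no token equals "gas"
```

Note that whole-word matching also rejects "greenhouse gases", which is the loss of same-family words mentioned above, and that it cannot match multi-word vocabulary entries.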

In [None]:
_df = df.sample(n=4000) #work on a random subset

# the number of quotes which contain an exact word from the dictionary, rather than a substring
exact_words = sum([1 if any([w in q.split() for w in climate_dict]) else 0 for q in _df["quotation"]])

# the number of quotes in which there is a small word as a substring but not as a word
def small_words_fp(quote, vocab, small_thresh = 5):
  small_words = [w for w in vocab if len(w) < small_thresh]
  in_string = any([w in quote for w in small_words])
  in_words  = any([w in quote.split() for w in small_words])
  return  in_string and not in_words

small_fp = sum(_df["quotation"].apply(lambda x : small_words_fp(x, climate_dict)))

#Compare the elimination ratio if we are to preprocess using each method
print("exact words elimination ratio : ", (4000 - exact_words)/ 4000)
print("small words elimination ratio : ", (small_fp)/ 4000)

exact words elimination ratio :  0.25
small words elimination ratio :  0.0805


Depending on the seed, the very conservative *exact words only* approach loses 15% or more of the sample (25% in the run shown above), which is more than we are comfortable with. Restricting this sanitization to small words loses about 8%, which is much more reasonable.

In [None]:
# function to remove quotes with small words as substring but not as whole words
def sanitize_small_word_fp(df):
  to_drop = df.loc[df["quotation"].apply(lambda q : small_words_fp(q, climate_dict))].index
  return df.drop(index=to_drop)

sanitize_small_word_fp(df).index.size

379625

#### Energy

Examples:

"players still have to get the same kind of visceral energy that they d get if they had a real audience "

==> Energy has a variety of meanings which don't correspond to our down-to-earth physical meaning, which results in a lot of false positives.

In [None]:
#look in more detail at energy quotes
def energy_only(q, vocab = climate_dict):
  energy = "energy" in q
  others = any([word in q for word in vocab if word != "energy"])
  return energy and not others

energy_only_quotes = df.loc[df["quotation"].apply(lambda q : energy_only(q))]

print("energy only quotes: ", energy_only_quotes.index.size, "  ratio : ", energy_only_quotes.index.size / df.index.size)

<img src="https://i.imgflip.com/5vx8at.jpg" title="made at imgflip.com"/>

Unfortunately, the share of quotes containing only energy is substantial, and around half of them are FP, so we must find another way to isolate the false positives.

Most recurring false positives are about sports, shows, and character traits. The first two can be isolated rather easily, but the last is much harder. We try either to isolate our meaning of energy, or to isolate the wrong meanings of energy, by searching with more specific queries.



In [None]:
## Refined energy queries
better_energy = ["wind energy", "solar energy", "hydro energy", "clean energy",
                 "energy policy", "energy compan", "geothermal energy",
                 "energy sector","energy storage", "renewable energy", "energy consumption"]

better_energy_quotes = energy_only_quotes.loc[energy_only_quotes["quotation"].apply(lambda q : any([w in q for w in better_energy]))]

## Isolating sport and shows energy references
abstract_energy = "league stadium show star play sport song team coach player game audience kid actor actress"
abstract_energy = abstract_energy + " boy girl olympic fans supporters health healthy nutrients"
abstract_energy = abstract_energy.split()
abstract_energy.append("my energy")

abstract_energy_quotes = energy_only_quotes.loc[energy_only_quotes["quotation"].apply(lambda q : any([w in q for w in abstract_energy]))]

print("better energy queries rate amongst energy only quotes: " ,
      better_energy_quotes.index.size / energy_only_quotes.index.size)
print("sport and shows energy quote rate amongst energy quotes : ",
      abstract_energy_quotes.index.size / energy_only_quotes.index.size)

7861 0.11841530466219778
21261 0.32026813286133915
overlap :  6948


Trying to isolate the wrong meanings only captures ~30% of our set, but we know FP make up around 50%, so this leaves a lot of FP.

Being more conservative and keeping only the refined energy queries, we retain 11%. This means losing quite a lot of TP, but removing all energy FP.

In [None]:
def sanitize_energy_conservatory(df):

  def energy_only(q, vocab = climate_dict):
    energy = "energy" in q
    others = any([word in q for word in vocab if word != "energy"])
    return energy and not others

  energy_only_quotes = df.loc[df["quotation"].apply(lambda q : energy_only(q))]
  better_energy = ["wind energy", "solar energy", "hydro energy", "clean energy",
                   "energy policy", "energy compan", "geothermal energy", "energy sector",
                   "energy storage", "renewable energy", "energy consumption"]
  better_energy_quotes = energy_only_quotes.loc[energy_only_quotes["quotation"].apply(lambda q : any([w in q for w in better_energy]))]

  to_drop = set(energy_only_quotes.index)-set(better_energy_quotes.index)
  return df.drop(to_drop)


(df.index.size - sanitize_energy_conservatory(df).index.size) / df.index.size

0.142392081867418

In [None]:
def sanitize_energy_permissive(df):

  def energy_only(q, vocab = climate_dict):
    energy = "energy" in q
    others = any([word in q for word in vocab if word != "energy"])
    return energy and not others

  energy_only_quotes = df.loc[df["quotation"].apply(lambda q : energy_only(q))]

  abstract_energy = "league stadium show star play sport song team coach player " + \
                    "game audience kid actor actress boy girl olympic fans supporters health healthy nutrients"
  abstract_energy = abstract_energy.split()
  abstract_energy.append("my energy")
  abstract_energy_quotes = energy_only_quotes.loc[energy_only_quotes["quotation"].apply(lambda q : any([w in q for w in abstract_energy]))]

  return df.drop(abstract_energy_quotes.index)

(df.index.size - sanitize_energy_permissive(df).index.size) / df.index.size

0.05172917183690749

#### strict removal of FP-prone vocabulary

In [None]:
def w_only(q, w, vocab = climate_dict):
  w_in_quote = w in q
  any_other_in_quote = any([word in q for word in vocab if word != w])
  return w_in_quote and not any_other_in_quote

FPprone = ["cop", "atmosphere", "nuclear", "ecosystem"]

def sanitize_fpprone_vocab(df, words):
  to_remove = pd.Index([])
  for w in words:
    w_only_idx = df.loc[df["quotation"].apply(lambda q : w_only(q, w))].index
    to_remove = to_remove.append(w_only_idx)
  
  return df.drop(to_remove)

(df.index.size - sanitize_fpprone_vocab(df, FPprone).index.size) / df.index.size

0.2066076894254585

## Sanitized df

In [None]:
# dummies for df
generate_dummies(df, climate_dict, out="output/df_dummies");

# less FP and its dummies
sanitized_df = sanitize_energy_conservatory(sanitize_small_word_fp(df))
sanitized_df.to_pickle("output/sanitized_df")
generate_dummies(sanitized_df, climate_dict, out="output/sanitized_dummies");
print("sanitized df keep ratio: ", sanitized_df.index.size / df.index.size)

# even less FP and its dummies
FPprone = ["cop", "atmosphere", "nuclear", "ecosystem"]
sanitized_strict_df = sanitize_fpprone_vocab(sanitized_df, FPprone)
sanitized_strict_df.to_pickle("output/sanitized_strict_df")
generate_dummies(sanitized_strict_df, climate_dict, out="output/sanitized_strict_dummies");
print("sanitized strict df keep ratio: ", (sanitized_strict_df.index.size / df.index.size))

# Who talks about Climate

## Most famous quotes

In [None]:
data = sanitized_strict_df

top5 = data.nlargest(n= 5, columns="numOccurrences")[["quotation", "speaker", "numOccurrences"]]
top5["quotation"].values

array(['i lost so i m going to follow our democratic traditions poison the wells and scorch the earth',
       'this is not an opportunity to go outside and try to have fun with a hurricane ',
       'this is a precautionary measure to ensure we have enough fuel to support lifesaving efforts respond to the storm and restore critical services and critical infrastructure ',
       'this is when the taiwanese people show their calm resilience and love ',
       'pretend assume presume that a major hurricane is going to hit right smack dab in the middle of south carolina and is going to go way inshore '],
      dtype=object)

## Most famous speaker

In [None]:
data = sanitized_strict_df

speakers = data[["speaker", "numOccurrences"]].groupby("speaker").sum()
speakers["speaker"] = speakers.index
speakers.nlargest(n=5, columns="numOccurrences")


| speaker (index) | numOccurrences | speaker |
|---|---|---|
| Narendra Modi | 3566 | Narendra Modi |
| Josh Frydenberg | 3131 | Josh Frydenberg |
| Antonio Guterres | 2931 | Antonio Guterres |
| Scott Morrison | 2604 | Scott Morrison |
| Malcolm Turnbull | 2433 | Malcolm Turnbull |


In [None]:
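# pair each speaker with their single most-cited quote (a column-wise max() would mix unrelated values)
best = data.loc[data.groupby("speaker")["numOccurrences"].idxmax()].set_index("speaker")
speakers["famous_quote"], speakers["famous_quote_occs"] = best["quotation"], best["numOccurrences"]
speakers.nlargest(n=5, columns="numOccurrences")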

In [None]:
speakers.index.size

72441

In [None]:
fig = sns.scatterplot(data=speakers, x="numOccurrences", y="famous_quote_occs", )
fig.set(xlabel="Cumulated quotations", ylabel='Occurrence of best quote')
fig.set_title("speaker quote occurrence relationship");


## WRT time

In [None]:
datetime_index = data.reset_index().quoteID.apply(lambda x: x[:10])
date_df = data.set_index(datetime_index)
date_df.index = pd.to_datetime(date_df.index)

In [None]:
def top_speakers(data, n=5):
  s = data[["speaker", "numOccurrences"]].groupby("speaker").sum()
  return s.nlargest(n, columns="numOccurrences")

grouped_month = date_df.groupby(pd.Grouper(freq="1M"))[["speaker", "numOccurrences"]]

acc = []
for month, group in grouped_month:
  x = top_speakers(group, n=1000).T
  x.index = [month]
  acc.append(x)

In [None]:
monthly_speakers = pd.concat(acc, join="outer").fillna(0)
cum_speakers = monthly_speakers.cumsum()


In [None]:
import bar_chart_race as bcr
bcr.bar_chart_race(cum_speakers, n_bars=6, steps_per_period=5)

# Review of sentiment analysis

The sentiment analysis part presents our study of the current best practices of sentiment analysis. From this study, we pick our sentiment analyzer and do some elementary data exploration on the 2017 data.


### Choice of sentiment analyzer

The classic gold standard lexicon, especially for longer text, is LIWC (Linguistic Inquiry and Word Count) [[1]]. It is a Semantic Orientation (Polarity-based) Lexicon. Sociologists, psychologists, linguists, and computer scientists find LIWC appealing because it has been extensively validated. Also, its straightforward dictionary and simple word lists are easily inspected, understood, and extended if desired. Such attributes make LIWC an attractive option to researchers looking for a reliable lexicon to extract emotional or sentiment polarity from text. 

But LIWC is unable to account for differences in the sentiment intensity of words. For example, “The food here is exceptional” conveys more positive intensity than “The food here is okay”, yet a sentiment analysis tool using LIWC would score them equally (they each contain one positive term). Such distinctions are intuitively valuable for fine-grained sentiment analysis and, in our case, for detecting polarization between two opinions on climate: "I am skeptical about climate" is not as intense as "I hate Greta Thunberg", and we should be able to detect the difference.

Another aspect to take into account is that a given sentiment analyzer performs differently depending on the length of quotes. Our dataset is a mixture of short and long quotes, with a majority of shorter ones.

Ease of use, such as whether the sentiment analyzer needs to be trained, also has to be taken into account.

Given all these factors, we have decided to use VADER [[2]] (Valence Aware Dictionary and sEntiment Reasoner). It requires no training and is built into NLTK.

Reading the paper, we know that VADER is best suited for language used in social media and short text.

VADER is the result of very thorough work. It relies on its own valence-aware sentiment lexicon, built upon other well-established "gold standard" sentiment banks such as ANEW (Affective Norms for English Words) [[4]], which rates sentiment valence on a scale from 1 to 9, LIWC mentioned before, and the General Inquirer (GI) [[3]]. On top of that, it incorporates numerous lexical features common to sentiment expression in microblogs.

In the paper, it was shown that VADER (F1 = 0.96) outperforms individual human raters (F1 = 0.84) at correctly classifying the sentiment of tweets into positive, neutral, or negative classes. Furthermore, it was shown to generalize very well, outperforming other analyzers on text outside of social media and on longer text.


We also went through the ADA lectures on text analysis and noticed that VADER was used there too, further convincing us that it is indeed a quality choice.


### Scoring:

Given a sentence, we can use VADER's polarity_scores() to compute a dictionary of 4 values ('neg', 'neu', 'pos', 'compound').

The 'compound' score is computed by summing the valence scores of each word found in the lexicon, adjusted according to the rules, and then normalized to lie between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of polarity for a given sentence.

The pos, neu, and neg scores are the proportions of the text that fall in each category (so they should add up to 1, up to floating-point rounding). These are the most useful metrics if you want multidimensional measures of sentiment for a given sentence.
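
To make the normalization of the compound score concrete, here is a minimal sketch. It assumes the formula x / sqrt(x² + α) with α = 15, which is our reading of the reference VADER implementation and is not stated elsewhere in this notebook; the function name is hypothetical.

```python
import math


def normalize_compound(valence_sum, alpha=15):
    """Squash an unbounded sum of valences into the range [-1, 1]."""
    # Assumed formula: x / sqrt(x^2 + alpha), then clipped to [-1, 1] for safety
    score = valence_sum / math.sqrt(valence_sum**2 + alpha)
    return max(-1.0, min(1.0, score))


# The mapping is symmetric and saturates for large sums of valences
for total in (-8, -2, 0, 2, 8):
    print(total, round(normalize_compound(total), 3))
```

The saturation means that long, strongly worded quotes cannot push the compound score arbitrarily far: beyond a certain sum of valences, all of them end up close to -1 or +1.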



[1]: https://liwc.wpengine.com/

[2]: http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf

[3]: http://www.wjh.harvard.edu/~inquirer/

[4]: https://csea.phhp.ufl.edu/media/anewmessage.html

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

In [None]:
## Setting up a sample dataframe

sia = SentimentIntensityAnalyzer()  # analyzer must exist before scoring the quotes
df = pd.read_pickle("df2017_0")
df['compound'] = df.quotation.apply(lambda x : sia.polarity_scores(x)['compound'])

In [None]:
def analyze_sentiment(text):
    """
    Given text, outputs VADER polarity scores with explanations

    Args:
        text (string)
    """
    sia = SentimentIntensityAnalyzer()
    polarity_scores = sia.polarity_scores(text)
    print(f"Portion of the text which is negative: {polarity_scores['neg']}.")
    print(f"Portion of the text which is neutral: {polarity_scores['neu']}.")
    print(f"Portion of the text which is positive: {polarity_scores['pos']}.")

    print(f"Normalized weighted average valence score of the text: {polarity_scores['compound']}\n")

In [None]:
def plot_sentiment_hist(df,sentiment='compound',all=False):
    """
    Given the dataframe with quotations, plots the distribution of a given component of polarity_scores() or all components.

    Args:
        df (pd.DataFrame): Dataframe with quotations 
        sentiment (str, optional): Can be 'pos' , 'neg', 'neu' or 'compound'. Defaults to 'compound'. Is ignored if all=True
        all (bool, optional): if true, plot the distribution of all components. Defaults to False.
    """
    
    sia = SentimentIntensityAnalyzer()

    if all:
        f,a = plt.subplots(2,2,figsize=(15,7),sharey=True)
        sentiments = ['neg','neu','pos','compound']
        transformed = df.quotation.apply(lambda x : sia.polarity_scores(x))
        f.suptitle("Distribution of 'neg', 'neu', 'pos' and 'compound' in the given corpus")

        for i, sent in enumerate(sentiments):
            idx = divmod(i,2)
            g = sns.histplot(data=transformed.apply(lambda x: x[sent]), bins='auto',ax=a[idx[0],idx[1]])
            g.set_xlabel(f"{sent} score")
            g.set_yscale('log')
    else:

        transformed = df.quotation.apply(lambda x : sia.polarity_scores(x)[sentiment])
        f, a = plt.subplots(figsize=(15, 5))
        f.suptitle(f"Distribution of {sentiment} sentiment")
        g= sns.histplot(data=transformed,bins='auto',ax=a)
        g.set_xlabel(f"{sentiment} score")
        g.set_yscale('log')

In [None]:
def plot_compound_time_series(df, freq = "W"):
    """
    Given a dataframe assumed to have a "compound" column, plots the time series of multiple aggregates of compound at the given frequency (W = week, M = month)

    Args:
        df (pd.DataFrame): dataframe assumed to have a "compound" column
        freq (str, optional): The frequency of our time series, i.e. the frequency at which we take our aggregates. Defaults to "W".
    """

    sia = SentimentIntensityAnalyzer()

    ## changing the index into datetime

    new_index = df.reset_index().quoteID.apply(lambda x: x[:10])
    new_df = df.set_index(new_index)
    new_df.index = pd.to_datetime(new_df.index)
    
    mean_compound_values_time_series = new_df.compound.groupby(pd.Grouper(freq=freq)).mean()
    sd_compound_values_time_series = new_df.compound.groupby(pd.Grouper(freq=freq)).std()
    x = range(0,len(mean_compound_values_time_series))

    f, a = plt.subplots(1,2,figsize=(15, 5))
   
    g= sns.lineplot(x=x, y=mean_compound_values_time_series, ax = a[0])
    g.set_title(f"The time series of the mean value 'compound' at frequency {freq}")
    g.set_xlabel(f"Time steps at frequency {freq}")
    g.set_ylabel(f"Mean of compound at frequency {freq}")


    g= sns.lineplot(x=x, y=sd_compound_values_time_series, ax = a[1])
    g.set_title(f"The time series of the standard deviation of the value 'compound' at frequency {freq}")
    g.set_xlabel(f"Time steps at frequency {freq}")
    g.set_ylabel(f"Standard deviation of compound at frequency {freq}")




In [None]:
plot_compound_time_series(df)

We can see here that the weekly mean compound score oscillates around 0.3 (slightly positive), but the standard deviation plot shows large variations around this mean. This is the indicator of our more "polarized" quotes. Binning the compound scores (e.g. into [-1, -0.5], [-0.5, 0], [0, 0.5], [0.5, 1], or finer bins) and looking at the most frequent values in these bins should prove insightful and let us capture polarization more easily than aggregates.
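
The binning idea can be sketched with `pd.cut`. The bin labels and helper names below are hypothetical and only serve the illustration.

```python
import pandas as pd

BINS = [-1.0, -0.5, 0.0, 0.5, 1.0]
LABELS = ["strongly negative", "mildly negative", "mildly positive", "strongly positive"]


def bin_compound(compound):
    """Assign each compound score to one of four sentiment bins."""
    # Bins are right-closed, so a score of exactly 0 lands in "mildly negative";
    # include_lowest=True makes a score of exactly -1 fall in the first bin.
    return pd.cut(compound, bins=BINS, labels=LABELS, include_lowest=True)


def bin_shares(compound):
    """Share of quotes falling in each bin, listed in bin order."""
    return bin_compound(compound).value_counts(normalize=True).reindex(LABELS)
```

Applied per week or per month, for instance through `df.groupby(pd.Grouper(freq="M"))`, the shares of the two outer bins would give a simple polarization signal to track over time. Because VADER returns exactly 0 for many neutral quotes, a dedicated neutral bin may be preferable in practice.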

In [None]:
plot_sentiment_hist(df,all=True)

In truly polarized data, we would expect to see a bimodal distribution in the 'compound' distribution. The somewhat uniform distribution of compound might come from two possibilites:

1. The data isn't yet filtered well enough to only capture quotes about climate change
2. Data about climate change isn't polarized enough for us to see a bimodal distribution

Nonetheless, we can see slightly more mass on the positive side, which indicates that the corpus contains more positive quotes than negative ones.
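One rough way to put a number on "bimodal versus uniform" is a moment-based bimodality coefficient, (skewness² + 1) / kurtosis. The sketch below is an assumption on our side and not something the notebook computes: it uses population moments without small-sample correction, and the two input lists are made up. The usual reference values are 1/3 for a normal distribution and 5/9 for a uniform one, with larger values hinting at bimodality.

```python
def bimodality_coefficient(values):
    """(skewness**2 + 1) / kurtosis, computed from population moments (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # plain (non-excess) kurtosis
    return (skewness ** 2 + 1) / kurtosis

# Made-up data: two opposed camps versus evenly spread scores
polarized = [-0.8] * 50 + [0.8] * 50
flat = [i / 50 - 1 for i in range(101)]

bc_polarized = bimodality_coefficient(polarized)
bc_flat = bimodality_coefficient(flat)
print(round(bc_polarized, 3), round(bc_flat, 3))
```

A value close to 5/9 on the real compound scores would be consistent with the roughly uniform histogram above, but this single number should be read together with the histogram rather than instead of it.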

# Word Embedding

The word embedding part showcases the pipeline that takes us from a quote to a word embedding using Word2Vec, and then how we visualize the data using these embeddings.
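To make the "quote to embedding" step concrete, here is a minimal sketch that averages the vectors of the tokens of a quote. Averaging is one common aggregation and is only an assumption here; the actual `w2v.aggregate` helper may work differently. `TOY_VECTORS` is a made-up 3-dimensional vocabulary standing in for the 300-dimensional model.

```python
def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of the known tokens; None if no token is in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Hypothetical toy vocabulary (the real model has 300 dimensions)
TOY_VECTORS = {
    "climate": [1.0, 0.0, 0.0],
    "warming": [0.0, 1.0, 0.0],
    "policy": [0.0, 0.0, 1.0],
}

quote_vec = average_embedding(["climate", "warming", "unknownword"], TOY_VECTORS, dim=3)
empty_vec = average_embedding([], TOY_VECTORS, dim=3)
print(quote_vec, empty_vec)
```

A quote whose tokens are all removed by preprocessing has no vector at all, which is why quotes with zero preprocessed tokens have to be dropped before building the matrix.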

In [None]:
import sys
sys.path.append('./helpers/')
from helpers import get_samples
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import helpers
from importlib import reload
from collections import Counter
from time import time
import visual as viz
import w2v as w2v
import text_tools as tt
reload(helpers)
reload(tt)
reload(w2v)
reload(viz);

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\antom\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
df = get_samples(num_samples=5000, random=True)

In [None]:
df

In [None]:
#### Average length of the text before any preprocessing :

In [None]:
df = df.drop(["qids", "probas", "phase", "quoteID", "urls"], axis=1)
#sns.histplot(data=)

In [None]:
df["quote_len"] = df.quotation.apply(tt.get_tokens).apply(len)

In [None]:
f, a = plt.subplots(figsize=(15, 5))
sns.histplot(data=df, x="quote_len", kde=True);

In [None]:
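# NOTE: np.percentile expects q in [0, 100], so 0.5 is the 0.5th percentile, not the median (that would be q=50).
helpers.CIs(data=df, columns=["quote_len"], funcs=[np.mean, np.std, lambda x : np.percentile(x, 0.5)]).transpose()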

| | mean_low | mean_computed | mean_high | std_low | std_computed | std_high | `<lambda>_low` | `<lambda>_computed` | `<lambda>_high` |
|---|---|---|---|---|---|---|---|---|---|
| quote_len | 27.830818 | 28.391003 | 29.121095 | 21.188133 | 23.612752 | 26.448992 | 5.0 | 5.0 | 5.0 |


In [None]:
from time import time
start = time()

df["prep_quote"] = df.quotation.apply(tt.preprocess_quote)
print(f"It took : {round(time() - start, 2)} seconds")

It took : 7.52 seconds


In [None]:
df["prep_token_nb"] = df.prep_quote.apply(len)
f, a = plt.subplots(figsize=(15, 5))
sns.histplot(data=df, x="prep_token_nb", kde=True);

In [None]:
helpers.CIs(data=df, columns=["prep_token_nb"], funcs=[np.mean, np.std, np.median]).transpose()

| | mean_low | mean_computed | mean_high | std_low | std_computed | std_high | median_low | median_computed | median_high |
|---|---|---|---|---|---|---|---|---|---|
| prep_token_nb | 21.752234 | 22.348119 | 22.982704 | 18.231695 | 20.732495 | 23.824741 | 17.0 | 17.0 | 17.0 |


In [None]:
print(df[df["prep_token_nb"] == df.prep_token_nb.max()]["quotation"].values[0])
print("\n"*3)
print(df[df["prep_token_nb"] == df.prep_token_nb.max()]["prep_quote"].values[0])

March 27, 2020 Letter to Cape Cod Second Homeowners: Cape Cod is home to over 214,000 year-round residents, who appreciate and depend upon our seasonal influx of visitors and second homeowners. It has been our way of life for centuries. During the coronavirus crisis, we all understand the desire to come to your second home on the Cape while sheltering in place. We are asking that if you do so, please help us all to remain safe and healthy by following these actions: Individuals traveling to Cape Cod from off-Cape and out of state are to self-quarantine for 14 days to avoid spreading the virus. Bring items that you will need during your stay, including prescriptions, groceries, cleaning supplies, personal health items and personal protective equipment. While essential service establishments may be open, there are shortages being experienced of key items. Support our restaurants with take-out orders as found on this list https://www.capecodchamber.org / restaurants/restaurants-offering-t

### Word2Vec :

In [None]:
# flatten the token lists of all quotes into a single list
total = [token for prep in df.prep_quote.values for token in prep]

In [None]:
common = Counter(total).most_common(20)

In [None]:
### don't run : really long
### w2v.save_model()

In [None]:
### be careful : tension on RAM
model = w2v.get_model()

In [None]:
model.get_vector(common[0][0]).reshape((1, 300)).shape

(1, 300)

In [None]:
%timeit -n 1 w2v.aggregate(model, df.prep_quote.values[4]) 

The slowest run took 336.03 times longer than the fastest. This could mean that an intermediate result is being cached.
10.7 ms ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
df[df.prep_token_nb == 0]

| | quotation | speaker | date | numOccurrences | quote_len | prep_quote | prep_token_nb |
|---|---|---|---|---|---|---|---|
| 9995 | Why now, 2021? Here's why, | Michael J. Graham | 2020-03-02 21:03:20 | 1 | 9 | [] | 0 |
| 14178 | should be out in 2020 | Kate Hudson | 2020-02-17 00:00:00 | 5 | 5 | [] | 0 |
| 16443 | not having the same face | Angelina Pivarnick | 2020-02-05 23:38:27 | 1 | 5 | [] | 0 |
| 20738 | He can take him on. | Ray Newman | 2020-02-12 01:06:39 | 1 | 6 | [] | 0 |
| 49164 | I didn't want that out, | Clayton Kershaw | 2020-04-15 12:00:00 | 6 | 7 | [] | 0 |


In [None]:
df = df[df.prep_token_nb != 0]

In [None]:
## As Jean Pierre Coff would say: this is rubbish

X = w2v.get_w2c_matrix(model, df, "prep_quote")

In [None]:
viz.show_w2v_words(X)

### Discrimination between years :

In [None]:
df['date'] = pd.to_datetime(df['date'])

In [None]:
split_dates = []
for month_id in range(1, 13):
    split_dates.append(pd.Timestamp(2020, month_id, 1))  # pd.datetime is deprecated
split_dates.append(pd.Timestamp(2021, 1, 1))


### Final Pipeline + Benchmarking :

In [None]:
def benchmark(start, part):
    """Print the time elapsed since `start` for the pipeline step `part`."""
    print(f"It took {round(time() - start, 2)} seconds to {part}")

In [None]:
start = time()

model = w2v.get_model()
benchmark(start, "load model")

startr = time()
df = helpers.get_samples(num_samples=20_000, random=True)
benchmark(startr, "load data")

df = df.drop(["qids", "probas", "phase", "quoteID", "urls"], axis=1) ## get rid of useless cols
df["date"] = pd.to_datetime(df["date"]) ## need date to split it after (need proper typing)

startr = time()
df["prep_quote"] = df.quotation.apply(tt.preprocess_quote) ## preprocess quotes
benchmark(startr, "preprocess")

## discriminate with the month here :
df["month"] = pd.DatetimeIndex(df["date"]).month

## adds random sentiment
fake_sentiments = np.random.randint(0, 2, len(df.index))
df['sentiment'] = fake_sentiments

startr = time()
## get all the datapoints (one per quote) in W2V vector space
vec_spaces, labels = zip(*df.groupby("month").apply(lambda x : w2v.get_w2c_matrix(model, x, "prep_quote", "sentiment")).values)
benchmark(startr, "get matrices")

## plot them all : 
startr = time()
for idx, vec_space in enumerate(vec_spaces):
    viz.show_w2v_words(vec_space, outfilename=f'W2V{idx}.png', colors=viz.get_cmap_from_labels(labels[idx]))
benchmark(startr, "plot")

benchmark(start, "do everything")

In [None]:
helpers.get_cmap_from_labels(labels[0])

array([[1.46200e-03, 4.66000e-04, 1.38660e-02, 1.00000e+00],
       [1.46200e-03, 4.66000e-04, 1.38660e-02, 1.00000e+00],
       [9.87053e-01, 9.91438e-01, 7.49504e-01, 1.00000e+00],
       ...,
       [1.46200e-03, 4.66000e-04, 1.38660e-02, 1.00000e+00],
       [1.46200e-03, 4.66000e-04, 1.38660e-02, 1.00000e+00],
       [1.46200e-03, 4.66000e-04, 1.38660e-02, 1.00000e+00]])

### Evaluate the clusterings :

In [None]:
vec_spaces = df.groupby("month").apply(lambda x : helpers.get_w2c_matrix(model, x, "prep_quote"))
[helpers.show_w2v_words(vec_space, outfilename=f'W2V{idx}.png', colors=helpers.get_color_map(df, "sentiment")) for idx, vec_space in enumerate(vec_spaces)]


In [None]:
reload(helpers);

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\antom\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
### we study January
jvec_space = vec_spaces[0]
jlabels = labels[0]
pos_vecs = jvec_space[jlabels == 1]
neg_vecs = jvec_space[jlabels == 0]

cpos = helpers.get_center_of_mass(pos_vecs)
cneg = helpers.get_center_of_mass(neg_vecs)

In [None]:
helpers.normalized_cut(pos_vecs, neg_vecs, helpers.cosine_sim)

  


0.999817626607963

In [None]:
stats = []
for month in range(4):    
    jvec_space = vec_spaces[month]
    print(vec_spaces[month].shape)
    print(labels[month].shape)
    jlabels = labels[month]
    pos_vecs = jvec_space[jlabels == 1]
    neg_vecs = jvec_space[jlabels == 0]

    cpos = helpers.get_center_of_mass(pos_vecs)
    cneg = helpers.get_center_of_mass(neg_vecs)
    stats.append([helpers.cosine_sim(cpos, cneg), helpers.normalized_cut(pos_vecs, neg_vecs)])
    print(f"Done with month {month + 1}")

(5503, 300)
(5503,)


  


Done with month 1
(5007, 300)
(5007,)
Done with month 2
(4159, 300)
(4160,)


IndexError: boolean index did not match indexed array along dimension 0; dimension is 4159 but corresponding boolean dimension is 4160

#### Benchmark :

In [None]:
sizes = 10 ** np.arange(2, 6)
stats = []
#model = helpers.get_model()
for size in sizes:
    reps = []
    print(f"Starting size : {size}")
    for rep in range(10):
        df = get_samples(num_samples=size, random=True)
        try :
            reps.append(helpers.process(df, model))
        except Exception as e:  # avoid a bare except, and report why the run failed
            print(f"Failed: {e}")
    stats.append(np.mean(reps))
    print(f"Done for size : {size}")

In [None]:
f, a = plt.subplots(figsize=(7, 4))
plt.plot(stats)
plt.plot(n_stats)
a.set_xticklabels(["","1e2","", "1e3","", "1e4", "", "1e5"])
a.set_ylabel("runtime [s]")
a.set_xlabel("number of samples");

In [None]:
n_stats = []
## benchmark with parallelization turned off (par=False)
sizes = 10 ** np.arange(2, 6)
for size in sizes:
    reps = []
    print(f"Starting size : {size}")
    for rep in range(5):
        df = get_samples(num_samples=size, random=True)
        try :
            reps.append(helpers.process(df, model, par=False))
        except Exception as e:  # avoid a bare except, and report why the run failed
            print(f"Failed: {e}")
    n_stats.append(np.mean(reps))
    print(f"Done for size : {size}")

In [None]:
df = get_samples(num_samples=100, random=True)
helpers.process(df, model, par=True)

TypeError: process() got an unexpected keyword argument 'par'

# Wikipedia feature engineering

The Wikipedia part showcases how we extract speaker features from Wikipedia using QIDs, such as gender, political affiliation, and age. Wikipedia data is quite messy, so we also show the heuristics used to extract these features.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from importlib import reload
import matplotlib.pyplot as plt
import urllib.request
import json
import sys
import re
sys.path.append('./helpers/')
sys.path.append('./feature_engineering/')
import names
import helpers

reload(helpers)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/lucastrg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/lucastrg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<module 'helpers' from '/home/lucastrg/FLEP/MA1/ADA/ada-2021-project-adada-sur-mon-bidet/./helpers/helpers.py'>

## Pre-Processing
We remove all quotes without a speaker, then extract the set of all speakers and QIDs from the sampled rows.
We then fetch a JSON of each speaker's whole page, as well as all its PIDs and RIDs (these two IDs are not yet in use).
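As an illustration of what the entity JSON looks like once fetched, the sketch below reads property values (PIDs) out of a Wikidata payload. `SAMPLE_RESPONSE` is a hand-written, heavily abridged stand-in for a real response, so that the snippet runs offline; `claim_values` is an illustrative helper and not part of our code. P21 (sex or gender) and P569 (date of birth) are Wikidata property IDs.

```python
ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{}.json"

# Hand-written, abridged example of the payload structure (not a full response)
SAMPLE_RESPONSE = {
    "entities": {
        "Q42": {
            "id": "Q42",
            "claims": {
                "P21": [{"mainsnak": {"datavalue": {"value": {"id": "Q6581097"}}}}],
                "P569": [{"mainsnak": {"datavalue": {"value": {"time": "+1952-03-11T00:00:00Z"}}}}],
            },
        }
    }
}

def claim_values(payload, qid, pid):
    """Return the raw values of property `pid` for entity `qid` (empty list if absent)."""
    claims = payload["entities"][qid].get("claims", {})
    values = []
    for statement in claims.get(pid, []):
        datavalue = statement["mainsnak"].get("datavalue")
        if datavalue is not None:  # some statements carry no value
            values.append(datavalue["value"])
    return values

print(ENTITY_URL.format("Q42"))
gender_ids = [v["id"] for v in claim_values(SAMPLE_RESPONSE, "Q42", "P21")]
birth_times = [v["time"] for v in claim_values(SAMPLE_RESPONSE, "Q42", "P569")]
print(gender_ids, birth_times)
```

Reading structured claims this way could eventually replace some of the text heuristics used further down, although we have not tested it on our sample.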

In [None]:
df = helpers.get_samples(num_samples=10000, random=True)

In [None]:
df=df[df["speaker"]!="None"]

In [None]:
len(df)

5978

Not so bad ! About 60% of the rows are kept.

In [None]:
df.head()

In [None]:
qids=list(set(df["qids"].to_numpy().sum()))
speakers=list(set(df["speaker"]))

In [None]:
len(speakers)

4898

In [None]:
request_template= "https://www.wikidata.org/wiki/Special:EntityData/{}.json"
request_template2="https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles={}&formatversion=2&rvprop=content&rvslots=*"
request_template3="https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&pageids={}&formatversion=2&rvprop=content&rvslots=*"


In [None]:
invalid_qids=[]
for qid in qids[:10]:
   try :
      with urllib.request.urlopen(request_template.format(qid)) as response:
         data = json.load(response) # parsed entity JSON of the last valid QID
   except urllib.request.HTTPError :
      invalid_qids.append(qid)


      #print(json.dumps(data, indent=2, sort_keys=True))
      

This does not work at the moment and is not really useful, so we might drop it.

In [None]:
invalid_qids=[]
for qid in qids[:10]:
   try :
      with urllib.request.urlopen(request_template3.format(qid_to_rid[qid])) as response:
         raw_data = response
         data = json.load(raw_data)
         data.keys()
   except urllib.request.HTTPError :
      invalid_qids.append(qid)


      #print(json.dumps(data, indent=2, sort_keys=True))
      

NameError: name 'qid_to_rid' is not defined

In [None]:
data

{'batchcomplete': True,
 'query': {'pages': [{'pageid': 1426072190, 'missing': True}]}}

## Wikipedia data fetching
Fetches everything we need to know about a speaker (using their name). Handles one redirection if needed.
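The decision logic applied to each API response can be sketched offline as follows. `classify_page` is an illustrative helper, and the sample pages are hand-written. The redirect pattern here is a deliberate variation on the one used in the notebook: it captures everything up to the closing brackets, so that titles containing digits or parentheses would not be cut short.

```python
import re

# Captures the redirect target up to "]]", "|" or "#"
REDIRECT_RE = re.compile(r"^#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def classify_page(page, speaker):
    """Return (kind, detail) for one entry of response["query"]["pages"]."""
    if page.get("missing", False):
        return "missing", None
    content = page["revisions"][0]["slots"]["main"]["content"]
    redirect = REDIRECT_RE.match(content)
    if redirect:
        return "redirect", redirect.group(1).strip()
    if re.match(r"^'''{}''' may refer to".format(re.escape(speaker)), content):
        return "disambiguation", None
    return "page", content

def make_page(content):
    """Build a hand-written page entry with the same nesting as the API response."""
    return {"revisions": [{"slots": {"main": {"content": content}}}]}

print(classify_page({"missing": True}, "Nobody"))
print(classify_page(make_page("#REDIRECT [[The 1975]]"), "Matty Healy"))
print(classify_page(make_page("'''Bob Miller''' may refer to:\n* ..."), "Bob Miller"))
```

A "redirect" result would then trigger one more request with the returned title, which is the single redirection handled by the cell below.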

In [None]:
invalid_speakers=[]
speaker_content={}
for speaker in speakers[:2000]:
   try :
      with urllib.request.urlopen(request_template2.format(urllib.parse.quote(speaker))) as response:
         raw_data = json.load(response)["query"]["pages"][0]
         
         if raw_data.get("missing",False):
            invalid_speakers.append(speaker)
         else:
            content = raw_data["revisions"][0]["slots"]["main"]["content"]
            if re.search(r"^'''{}''' may refer to".format(re.escape(speaker)),content): #Drop disambiguation pages
               invalid_speakers.append(speaker)

            else:
               redirect = re.search(r"(^#REDIRECT \[\[)([A-Za-z 'À-ÿZİı.-]*)", content) #Allows to fix most redirecting problems
               if redirect:
                  speaker_alt = redirect.group(2)
                  print("Redirect ", speaker ,"->",speaker_alt) #Keeping this print because it is satisfying to watch
                  if speaker_alt:
                     with urllib.request.urlopen(request_template2.format(urllib.parse.quote(speaker_alt))) as response:
                        raw_data = json.load(response)["query"]["pages"][0]
                        if raw_data.get("missing",False):
                           invalid_speakers.append(speaker)
                        else:
                           content = raw_data["revisions"][0]["slots"]["main"]["content"]
                  else :
                     content = "ERROR"
               speaker_content[raw_data["title"]]=content
            
   except urllib.request.HTTPError :
      invalid_speakers.append(speaker)
      

Redirect  Clara Kramer -> Clara's War
Redirect  Jeffrey Mims -> D. Jeffrey Mims
Redirect  Cesar Diaz -> César Díaz
Redirect  Matty Healy -> The 
Redirect  Stephane Dujarric -> Stéphane Dujarric
Redirect  Darion Anderson -> Jake Anderson 
Redirect  Joe Giudice -> Teresa Giudice
Redirect  Bob Miller -> Robert Miller
Redirect  V Srinivasan -> V. Srinivasan
Redirect  Mick Cronin -> Michael Cronin
Redirect  Bill O'Brien -> William O'Brien 
Redirect  Bobby James -> Bob James
Redirect  Kim Kardashian West -> Kim Kardashian
Redirect  Bill Chapman -> William Chapman
Redirect  Georgina Wood -> Georgina Theodora Wood
Redirect  Kareena Kapoor Khan -> Kareena Kapoor
Redirect  Mike White -> Michael White
Redirect  Dave Roberts -> David Roberts
Redirect  Sinead O'Connor -> Sinéad O'Connor
Redirect  Mike Green -> Michael Green
Redirect  Bill Hoffman -> William Hoffman
Redirect  Tedros Adhanom Ghebreyesus -> Tedros Adhanom
Redirect  Danny Garcia -> Daniel García
Redirect  Stephen Townsend -> Stephen J.

In [None]:
len(speaker_content)

1622

We manage to fetch about 80% of the Wikipedia pages we were looking for (1622 pages for the 2000 speakers queried)!
However, a small percentage of the rows considered valid are completely wrong. Since we fetch the JSONs using the speaker's name, we can either have trouble resolving homonyms or simply suffer from badly assigned names (e.g. "Theater Director").

## Political side assignment
Here we guess the political side of each speaker with reasonably good accuracy, using two strategies. If the speaker has a well-filled Wikipedia page, we simply read their current political party. If not, we fall back on a surprisingly decent heuristic: we count the occurrences of words associated with Democrats (e.g. "left-wing", "liberal", ...) and with Republicans, and compare the two counts.

NB: This obviously assumes that each speaker belongs exclusively to one of these two sides (or to none). However, even in the US, some speakers are "in the middle".

It should also be noted that some speakers are not American. We found, however, that our heuristic still matched speakers with conservative views to the Republican side, and vice versa. In the next milestone we will investigate further and perhaps adopt a deeper model.
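A small worked example of the counting heuristic, on made-up sentences, shows how the substring counts behave. `keyword_counts` is an illustrative helper that mirrors the fallback strategy only (it ignores the infobox party line).

```python
DEM_WORDS = ["democrat", "left-wing", "liberal"]
REP_WORDS = ["republican", "conservative", "right-wing"]

def keyword_counts(text):
    """Count substring occurrences of each keyword group in the lowercased text."""
    s = text.lower()
    dem = sum(s.count(w) for w in DEM_WORDS)
    rep = sum(s.count(w) for w in REP_WORDS)
    return dem, rep

# "democratic" contains "democrat", so it counts as a hit
toy_page = ("She joined the Democratic caucus and was described as a liberal; "
            "her opponent was a conservative Republican.")
print(keyword_counts(toy_page))

# A party name can contain a keyword regardless of the party's actual leaning
print(keyword_counts("Member of the Liberal Party of Australia"))
```

The first text yields a tie, which the comparison `dem > rep` resolves in favour of "Republican". The second yields a "Democrat" hit for a centre-right party, so results for non-American speakers deserve a manual check.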

In [None]:
def pol_compass_from_wiki(speakers_content, discrete = True):
    """Yield (speaker, label) pairs guessed from the raw wikitext of each page.

    With discrete=True the label is (party_name, count), otherwise it is
    (dem_share, rep_share, count). count is -1 when the party comes from the
    infobox "| party" line, else it is the number of keyword hits.
    """
    dem_words = ["democrat", "left-wing", "liberal"]
    rep_words = ["republican", "conservative", "right-wing"]

    for speaker in speakers_content:
        yielded = False
        s = speakers_content[speaker].lower()

        # Strategy 1: read the party from the infobox line(s), if any
        for line in s.split("\n"):
            if "| party" in line:
                if any(x in line for x in dem_words):
                    yield speaker, (("Democrat", -1) if discrete else (1, 0, -1))
                    yielded = True
                elif any(x in line for x in rep_words):
                    yield speaker, (("Republican", -1) if discrete else (0, 1, -1))
                    yielded = True

        # Strategy 2: fall back on counting partisan keywords in the whole page
        if not yielded:
            dem = sum(s.count(x) for x in dem_words)
            rep = sum(s.count(x) for x in rep_words)
            total = rep + dem
            if total:
                if discrete:
                    # ties (dem == rep) are labelled "Republican"
                    yield speaker, ("Democrat" if dem > rep else "Republican", total)
                else:
                    yield speaker, (dem / total, rep / total, total)



In [None]:
speaker_wing= dict(pol_compass_from_wiki(speaker_content))
len(speaker_wing)

501

## Gender assignment

To guess the gender of the speakers, we again use two strategies. We first count the occurrences of gendered pronouns in the speaker's page. If we find none, we fall back on a classifier that relies solely on the speaker's name (and thus has pretty bad accuracy, ~70%).
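The pronoun-counting strategy can be sketched as below. This version matches whole words instead of the space-padded substrings used in the notebook, which is a variation on our side: it avoids counting "here" or "hero" as "her". `PRONOUNS`, `pronoun_counts` and `guess_gender` are illustrative names, and the sample sentence is made up.

```python
import re
from collections import Counter

# Same pronouns as in the notebook, matched as whole words
PRONOUNS = {
    "male": {"he", "him"},
    "female": {"she", "her"},
    "other": {"they", "them"},
}

def pronoun_counts(text):
    """Count whole-word pronoun occurrences per label."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for label, words in PRONOUNS.items():
            if word in words:
                counts[label] += 1
    return counts

def guess_gender(text):
    """Most frequent label, or None when no pronoun is found."""
    counts = pronoun_counts(text)
    if not counts:
        return None  # this is where the name-based classifier would take over
    return counts.most_common(1)[0][0]

sample = "She said that her work is here. The hero helped them."
print(pronoun_counts(sample), guess_gender(sample))
```

In case of a tie, `most_common` returns the label that was encountered first, so ties are resolved by text order rather than by a fixed priority.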

In [None]:
from nltk.corpus import names
from nltk import NaiveBayesClassifier as NBC
from nltk import classify
import nltk
nltk.download('names')

import random

[nltk_data] Downloading package names to /home/lucastrg/nltk_data...
[nltk_data]   Package names is already up-to-date!


For the classifier, we use both the whole name and its last letter as features.
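To see why these two features behave differently, here is a hand-rolled Naive Bayes on a handful of made-up names. It is a toy with add-one smoothing and not NLTK's implementation, whose smoothing may differ; `train_toy_nb` and `classify_toy_nb` are illustrative names.

```python
import math
from collections import Counter, defaultdict

def gender_features(word):
    return {"whole name": word, "lastletter": word[-1]}

def train_toy_nb(labelled_features):
    """Collect label counts and per-label feature value counts."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    vocab = defaultdict(set)             # feature -> set of values seen in training
    for features, label in labelled_features:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            vocab[name].add(value)
    return label_counts, value_counts, vocab

def classify_toy_nb(model, features):
    """Return the label with the highest log-probability."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # add-one smoothing, with one extra slot for values never seen in training
            score += math.log((seen + 1) / (count + len(vocab[name]) + 1))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training names, for illustration only
TOY_NAMES = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"), ("Emma", "female"),
             ("John", "male"), ("Peter", "male"), ("Mark", "male"), ("Oliver", "male")]
toy_model = train_toy_nb([(gender_features(n), g) for n, g in TOY_NAMES])

print(classify_toy_nb(toy_model, gender_features("Sofia")))
print(classify_toy_nb(toy_model, gender_features("Victor")))
```

For a name never seen in training, the "whole name" feature contributes the same smoothed probability to every label, so the decision rests on the last letter alone. The whole name only helps for names that appear in the training list.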

In [None]:
def gender_features(word):
    return {"whole name" : word, "lastletter" : word[-1]}

Training set loading and parsing

In [None]:
femaleNames = [ (name, "female") for name in names.words("female.txt") ]
maleNames = [ (name, "male") for name in names.words("male.txt") ]
allNames = maleNames + femaleNames
random.shuffle(allNames)

Actually training the classifier

In [None]:
featureData = [(gender_features(namelist), gender) for (namelist, gender) in allNames ]
test_data = featureData[:500]
train_data = featureData[500:]
classifier = NBC.train(train_data)

In [None]:
def gender_from_wiki(speaker_content):
    # The spaces are important, don't modify.
    # NOTE: " him" is listed twice and is therefore counted twice (" his" may have been intended).
    he_words=[" he ", " him", " him"]
    she_words =[" she ", " her"]
    they_words=[" they ", " them"]

    for speaker in speaker_content:
        s= speaker_content[speaker].lower()

        he= sum(s.count(x) for x in he_words)
        she= sum(s.count(x) for x in she_words)
        they= sum(s.count(x) for x in they_words)
        total = he+she+they

        if total==0:
            # no pronoun found: fall back on the name-based classifier (first name only)
            yield (speaker, classifier.classify(gender_features(speaker.split()[0])))
        elif he == max(he,she,they):
            yield(speaker, "male")
        elif she == max(he,she,they):
            yield(speaker, "female")
        else:
            yield(speaker, "other")
        # Alternative output (disabled): pronoun shares instead of a single label
        # if total: yield speaker, (he/total,she/total, they/total, total)


In [None]:
speaker_gender = dict(gender_from_wiki(speaker_content))
len(speaker_gender)

1622

In [None]:
np.unique(list(speaker_gender.values()), return_counts=True) #Sniff

(array(['female', 'male', 'other'], dtype='<U6'), array([ 364, 1254,    4]))

As we can see, only about 22% of the speakers (364 out of 1622) are labelled female.

## Age assignment
This is much easier: most of the time we can get a solid birth date and compute the age of the speaker. We only use the birth year, since we are more interested in general trends than in precise values.
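As a minimal sketch of the idea, the snippet below extracts the birth year from an infobox line that uses the `birth date` or `birth date and age` templates, and derives an age from it. `BIRTH_RE`, `birth_year` and `age_in` are illustrative names, the sample lines are hand-written, and other templates (for instance year-only ones) are not covered.

```python
import re

# Matches {{birth date|...}} and {{birth date and age|...}}, skipping options such as df=yes
BIRTH_RE = re.compile(
    r"\{\{\s*birth[ _]?date(?:[ _]?and[ _]?age)?\s*\|"
    r"(?:\s*[a-z]+\s*=\s*[a-z]+\s*\|)*\s*(\d{4})",
    re.IGNORECASE,
)

def birth_year(line):
    """Birth year found in an infobox line, or None."""
    match = BIRTH_RE.search(line)
    return int(match.group(1)) if match else None

def age_in(year, line):
    """Age in `year`, using the birth year only (so it can be off by one)."""
    born = birth_year(line)
    return None if born is None else year - born

print(age_in(2022, "| birth_date = {{birth date and age|1970|5|12}}"))
print(age_in(2022, "| birth_date = {{Birth date|df=yes|1952|3|11}}"))
print(age_in(2022, "| birth_date = unknown"))
```

Passing the reference year explicitly keeps the result reproducible when the notebook is rerun later.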

In [None]:
# Infobox templates we know how to parse, tried in this order
BIRTH_PATTERNS = [
    r"^(\|birth_date={{birthdateandage\|(\w*=\w*\|)?)([0-9]*)\|([0-9]*)\|([0-9]*)",
    r"^(\|birth_date={{birthdate\|(\w*=\w*\|)?)([0-9]*)\|([0-9]*)\|([0-9]*)",
    r"^(\|birth_date={{birthyearandage\|(\w*=\w*\|)?)([0-9]*)",
]
REFERENCE_YEAR = 2022  # ages are computed relative to this year

def age_from_wiki(speaker_content):
    for speaker in speaker_content:
        for line in speaker_content[speaker].lower().split("\n"):
            if "birth_date" in line:
                compact = line.replace(" ", "")
                for pattern in BIRTH_PATTERNS:
                    match = re.match(pattern, compact)
                    if match:
                        # group 3 holds the birth year
                        yield (speaker, REFERENCE_YEAR - int(match.group(3)))
                        break


In [None]:
speaker_age = dict(age_from_wiki(speaker_content))

In [None]:
len(speaker_age)

1197

In [None]:
plt.hist(speaker_age.values(), bins=20)
plt.title("Empirical age distribution of the sampled speakers (without filtering)")
plt.xlabel("Age")
plt.ylabel("Count")

In [None]:
wing = []
ages = []
gender = []


for speaker in speaker_content:
    if speaker in speaker_age.keys() and speaker in speaker_gender.keys() and speaker in speaker_wing.keys() and speaker_age[speaker]<120:
 
        ages.append(speaker_age[speaker])
        wing.append(speaker_wing[speaker][0])
        gender.append(speaker_gender[speaker])


In [None]:
plt.hist(wing)
plt.title("Observed split between the 2 parties")
plt.ylabel("Count")

To get a more precise view of the age of the speakers who could have contributed to the climate change debate, we filtered out speakers whose computed age is 120 or more.

In [None]:
plt.hist(ages,bins=20)
plt.title("Empirical age distribution of the sampled speakers (with a bit of filtering)")
plt.xlabel("Age")
plt.ylabel("Count")

In [None]:
big_dict={}
for speaker in speaker_content:
    if speaker in speaker_age.keys() and speaker in speaker_gender.keys() and speaker in speaker_wing.keys() and speaker_age[speaker]<120:
        big_dict[speaker]=(speaker_age[speaker], speaker_gender[speaker], speaker_wing[speaker][0],speaker_wing[speaker][1])
        

In [None]:
df = pd.DataFrame.from_dict(big_dict, orient="index", columns=["age", "gender", "wing", "political_count"])

In [None]:
df.wing = df.wing.astype( "category")
df.gender = df.gender.astype("category")

In [None]:
sns.set(rc={'figure.figsize':(20,12)})
sns.catplot(x="wing", y="age", hue="gender", kind="swarm", data=df, height=9).fig.suptitle("Age and gender distribution for each major political wing")

In [None]:
sns.histplot(data=df, x="age", hue="wing").set_title("Age distribution of each major political wing")

In [None]:
sns.histplot(data=df, x="age", hue="gender").set_title("Age distribution of each assigned gender")