# MILESTONE 2

`File name:` milestone2.py

`Authors:`
- Víctor González
- Alvaro Bautista 
- Alicia Soria 
- Kamil Czerniak

`Date created:` 05/11/2021

`Date last modified:` 12/11/2021

`Python Version:` 3.9.2


---

## Table of contents

**1. INTRODUCTION**
   * Context
   * Project idea
   * Project goals
   * Motivation
   * Feasibility
   
   
   
**2. THE DATA**
   * Quotebank
   * External data
    
    
    
**3. PIPELINE**
   * Load data
   * Examine our data
   * Clean up data
   * Exploring and visualizing
   * Modeling
   * Interpreting
   * Storytelling and communication
    
    
    
**4. CONCLUSIONS**
   * Summary
   * Results
   * Problems encountered
    
    
    
**5. FUTURE LINES**

---

# 1. Introduction

### 1.1. Context

# Brief introduction
Welcome to the Milestone 2 Python notebook.
In this notebook we will answer several research questions surrounding Brexit.
We will mainly use the Quotebank dataset, together with additional databases that enrich the data and support more
complete conclusions.

### 1.2. Project idea


# Explain in clear, reasonable, and thorough way the project idea
Pinpoint and determine the arguments for and against Brexit in different social groups.

### 1.3. Project goals

# Clear project goals
-
-
-

### 1.4. Motivation


In [None]:
# What story do we want to tell, why?
# Critical awareness of the project (social, cultural, political, economic, education. ... impact)

### 1.5. Feasibility

In [None]:
# Justify feasibility given the data

---

# 2. The data

## 2.1. Quotebank

This data source is described best by its makers:

> Quotebank is a dataset of 178 million unique, speaker-attributed quotations that were extracted from 196 million English news articles crawled from over 377 thousand web domains between August 2008 and April 2020. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.

(*Vaucher, Timoté, Spitz, Andreas, Catasta, Michele, & West, Robert. (2021). Quotebank: A Corpus of Quotations from a Decade of News (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4277311 (accessible on November 10, 2021)*)

In our case, we will use data collected between January 2015 and April 2020. We chose 2015 as the start date because this was the year of the General Election in the United Kingdom, in which the Conservative Party (which won a majority in the House of Commons) included the promise of an in-out referendum in its manifesto:

> We will negotiate new rules with the EU, so that people will have to be earning here for a number of years before they can claim benefits, including the tax credits that top up low wages. Instead of something-for-nothing, we will build a system based on the principle of something-for-something. We will then put these changes to the British people in a straight in-out referendum on our membership of the European Union by the end of 2017.

(*The Conservative Party Manifesto 2015, http://ucrel.lancs.ac.uk/wmatrix/ukmanifestos2015/localpdf/Conservatives.pdf (accessible on November 10, 2021)*)

The data source is based on this paper: *Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West
"Quotebank: A Corpus of Quotations from a Decade of News"
Proceedings of the 14th International ACM Conference on Web Search and Data Mining (WSDM), 2021.
https://doi.org/10.1145/3437963.3441760*

### Description of the data

- `quoteID:` Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
- `quotation:` Text of the longest encountered original form of the quotation
- `date:` Earliest occurrence date of any version of the quotation
- `phase:` Corresponding phase of the data in which the quotation first occurred (A-E)
- `probas:` Array representing the probabilities of each speaker having uttered the quotation. The probabilities across different occurrences of the same quotation are summed for each distinct candidate speaker and then normalized
   - `proba:` Probability for a given speaker
   - `speaker:` Most frequent surface form for a given speaker in the articles where the quotation occurred
- `speaker:` Selected most likely speaker. This matches the first speaker entry in `probas`
- `qids:` Wikidata IDs of all aliases that match the selected speaker
- `numOccurrences:` Number of times this quotation occurs in the articles
- `urls:` List of links to the original articles containing the quotation
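
To make the schema above more concrete, the following minimal sketch parses one record and reads the most likely speaker from `probas`. The record is invented for illustration (the quote ID, text, speaker, QID and URL are placeholders, not real Quotebank data), and it assumes that `probas` is serialized as a list of `[speaker, probability]` pairs with the probability stored as a string.

```python
import json

# Hypothetical record: every value below is a placeholder, not real Quotebank data.
sample_line = json.dumps({
    "quoteID": "2016-06-24-000001",
    "quotation": "We will respect the result of the referendum.",
    "speaker": "Jane Doe",
    "qids": ["Q0000000"],
    "date": "2016-06-24 00:00:00",
    "numOccurrences": 3,
    "probas": [["Jane Doe", "0.81"], ["None", "0.19"]],
    "urls": ["https://example.com/article"],
    "phase": "E",
})

record = json.loads(sample_line)

# Assumption: each entry of `probas` is a [speaker, probability] pair,
# with the probability stored as a string.
top_speaker, top_proba = max(record["probas"], key=lambda pair: float(pair[1]))

print(top_speaker, float(top_proba))
```

If the assumption about the `probas` layout holds, `top_speaker` should agree with the `speaker` field of the same record.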

## 2.2. External data

In order to get more context about the speakers, we opted to use the Wikidata dataset. This dataset, meant primarily for use in Wikimedia projects like Wikipedia or Wiktionary, contains properties and references describing an item, e.g., a person or a country. Because Quotebank uses Wikidata QIDs to refer to speakers, we can easily link persons to their attributes in Wikidata.

Wikidata entries can contain an arbitrary number of attributes and references, so we have decided to use only a small number of attributes, which we can then use for demographic analysis. These attributes are:
- gender
- date of birth
- nationality
- occupation
- political party
- academic degree
- political offices the person has run for (candidacy)
- religion

All of these attributes are capable of containing multiple values (e.g., the entry for Angela Merkel marks her nationality as German and East German). 

This dataset, in its entirety, may have a size of about 100 GB, which is why we decided to use a subset provided by the course (named *speaker_attributes.parquet*). This subset contains these attributes (and a couple more that we opted not to use) for all speakers featured in Quotebank. In addition, we were provided with labels of all Wikidata entries used in the mentioned subset (*wikidata_labels_descriptions_quotebank.csv.bz2*), in order to dereference non-speaker attributes (like gender) more easily. 
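
As a rough illustration of what dereferencing means here, the sketch below maps lists of QIDs to labels using a small in-memory table. The `dereference` helper, the toy label table and the toy speaker entry are hypothetical and only stand in for the labels file and the speaker subset; skipping unknown QIDs is a choice made for this sketch, not necessarily what the pipeline does.

```python
from typing import Dict, List, Optional

# Toy label table (QID -> label); it stands in for the labels file.
labels: Dict[str, str] = {
    "Q6581072": "female",
    "Q183": "Germany",
    "Q16957": "German Democratic Republic",
}


def dereference(qids: Optional[List[str]], labels: Dict[str, str]) -> List[str]:
    """Map a possibly missing list of QIDs to labels, skipping unknown QIDs."""
    if qids is None:
        return []
    return [labels[qid] for qid in qids if qid in labels]


# Toy speaker entry: attributes can hold several values or be missing entirely.
speaker = {
    "gender": ["Q6581072"],
    "nationality": ["Q183", "Q16957"],
    "religion": None,
}

resolved = {attribute: dereference(values, labels) for attribute, values in speaker.items()}
print(resolved)
```

Returning an empty list for missing attributes keeps every attribute the same type, which makes later grouping and counting simpler.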

#### References
- Wikidata website: https://www.wikidata.org/wiki/Wikidata:Main_Page
- Google Drive directory with preprocessed Wikidata dataset: https://drive.google.com/drive/folders/1VAFHacZFh0oxSxilgNByb1nlNsqznUf0

---

# 3. Pipeline

# Pipeline we are going to follow
- Load data
- Examine our data
- Clean up data
- Exploring and visualizing
- Modeling
- Interpreting our data
- Storytelling and communication

## 3.1. Load data

We encourage you to use Google Colab for executing this part of the notebook - Google Colab supports Google Drive, which in turn supports linking to external folders, allowing us to use larger datasets without using up storage on our own accounts. In addition, step 3 loads the speaker attributes into memory before handling them - this requires ~6 GB of RAM, which could be an issue on computers with 8 GB of RAM or less.

Each step will generate its own output files, for each year, to limit the strain on following steps and allow analysis of what was removed in each step by comparing pre-step data.
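
One possible way to see what a step removed is to compare the `quoteID` sets of its input and output files. The sketch below does this on two tiny temporary files; the helper names, file names and records are made up for the example.

```python
import bz2
import json
import os
import tempfile


def read_quote_ids(path):
    """Collect the quoteID of every record in a bz2-compressed JSON-lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as source:
        return {json.loads(line)["quoteID"] for line in source}


def write_records(path, records):
    """Write records as one JSON document per line, bz2-compressed."""
    with bz2.open(path, "wt", encoding="utf-8") as target:
        for record in records:
            target.write(json.dumps(record) + "\n")


with tempfile.TemporaryDirectory() as workdir:
    before = os.path.join(workdir, "quotes-demo-step1.json.bz2")
    after = os.path.join(workdir, "quotes-demo-step2.json.bz2")
    write_records(before, [
        {"quoteID": "demo-1", "speaker": "None"},
        {"quoteID": "demo-2", "speaker": "Jane Doe"},
    ])
    write_records(after, [{"quoteID": "demo-2", "speaker": "Jane Doe"}])

    # IDs present before the step but absent after it were removed by the step.
    removed = read_quote_ids(before) - read_quote_ids(after)

print(removed)
```

Only the IDs are held in memory, so this comparison should stay cheap even when the quote texts themselves are large.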

We start by importing libraries that will be used in the pipeline. We also define years to be considered - this will be used to load files for each year and save the output.

In [1]:
import re, json, bz2
import pandas as pd
YEARS = ["2015", "2016", "2017", "2018", "2019", "2020"]


The first step removes quotes that do not mention Brexit and are therefore irrelevant to our analysis. We do this with a case-insensitive regular expression that keeps a quote if it either contains the word "Brexit", or contains one of "leave", "leaving", "exit", "exiting" followed later in the quote by "EU" or "European Union". We use the streaming approach shown [here](https://colab.research.google.com/drive/1NqLFrAWAzKxr2dAWHI7m6Ml3gWGF72cA) to limit RAM usage (each source file is approx. 2 GB in size).

**Input:** Compressed JSON file with quotes from Quotebank from given year, e.g., `quotes-{year}.json.bz2`

**Output:** Compressed JSON file with quotes mentioning Brexit from Quotebank from given year, e.g., `quotes-{year}-step1.json.bz2`

**NOTE**: due to the large number of quotes, this step can take a long time - possibly over an hour. You have been warned.
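
The matching rule described above can be tried on a few sample quotes before running it on the full files. This is a minimal sketch: the helper name `mentions_brexit` and the sample sentences are made up for the example, and it uses `re.search` so that the pattern may match anywhere in the quote.

```python
import re

pattern = re.compile(
    r"(brexit)|((leave|leaving|exit|exiting).*(\W+eu\W+|\W+european union))",
    re.IGNORECASE,
)


def mentions_brexit(quotation: str) -> bool:
    """Return True if the quote matches the Brexit filter."""
    # A trailing space lets the pattern match "EU" at the very end of a quote.
    return pattern.search(quotation + " ") is not None


samples = [
    "Brexit means Brexit",
    "We should leave the EU",
    "Britain is exiting the European Union next year",
    "I love Europe",
    "The EU is leaving us no choice",
]

for sample in samples:
    print(mentions_brexit(sample), "-", sample)
```

The last sample is rejected because the pattern requires "EU" to appear after the verb, so word order matters for the second branch of the rule.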

In [19]:
# Step 1: only keep quotes that contain phrases: Brexit OR ((leaving OR leave OR exiting OR exit) AND (EU or European Union))

regex_text = r"(brexit)|((leave|leaving|exit|exiting).*(\W+eu\W+|\W+european union))" # raw string, so that \W is not treated as a string escape
regex = re.compile(regex_text, re.IGNORECASE)

for year in YEARS:
    path_to_file = f'data/quotes-{year}.json.bz2' 
    path_to_out = f'data/quotes-{year}-step1.json.bz2'

    with bz2.open(path_to_file, 'rb') as s_file:
        with bz2.open(path_to_out, 'wb') as d_file:
            for instance in s_file:
                instance = json.loads(instance)
                quotation = instance['quotation'] + " " # extracting quotation, space needed to match 'EU' at the end of a sentence
                if(regex.search(quotation) is not None): # search, not match: the phrase may appear anywhere in the quote
                    d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing to a new file


The second step removes quotes for which Quobert has not been able to attribute a speaker. Since part of this project is to figure out *who* was for or against leaving the European Union, these quotes would not be useful to our analysis. This step should take significantly less time than the previous one - less than a minute.

**Input:** Compressed JSON file with quotes mentioning Brexit from Quotebank from given year, e.g., `quotes-{year}-step1.json.bz2`

**Output:** Compressed JSON file with quotes mentioning Brexit and with a known speaker from Quotebank from given year, e.g., `quotes-{year}-step2.json.bz2`
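
The streaming filter for this step can be sketched on a tiny temporary file, so that it can be tried without the full dataset. The function name `keep_attributed`, the file names and the demo records are hypothetical; the sketch assumes, as the pipeline does, that unattributed quotes carry the string `"None"` as speaker.

```python
import bz2
import json
import os
import tempfile


def keep_attributed(source_path, target_path):
    """Stream records one line at a time and keep those with a known speaker."""
    kept = 0
    with bz2.open(source_path, "rb") as source, bz2.open(target_path, "wb") as target:
        for line in source:
            record = json.loads(line)
            if record["speaker"] != "None":
                target.write((json.dumps(record) + "\n").encode("utf-8"))
                kept += 1
    return kept


demo_records = [
    {"quoteID": "demo-1", "speaker": "None"},
    {"quoteID": "demo-2", "speaker": "Jane Doe"},
]

with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "quotes-demo-step1.json.bz2")
    target_path = os.path.join(workdir, "quotes-demo-step2.json.bz2")
    with bz2.open(source_path, "wb") as source:
        for record in demo_records:
            source.write((json.dumps(record) + "\n").encode("utf-8"))

    kept = keep_attributed(source_path, target_path)
    with bz2.open(target_path, "rb") as target:
        kept_ids = [json.loads(line)["quoteID"] for line in target]

print(kept, kept_ids)
```

Because only one line is decoded at a time, memory use should stay roughly constant regardless of the file size.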

In [4]:
# Step 2: only keep quotes that have an attributed speaker
for year in YEARS:
    path_to_file = f'data/quotes-{year}-step1.json.bz2' 
    path_to_out = f'data/quotes-{year}-step2.json.bz2'

    with bz2.open(path_to_file, 'rb') as s_file:
        with bz2.open(path_to_out, 'wb') as d_file:
            for instance in s_file:
                instance = json.loads(instance)
                speaker = instance['speaker'] # Get the speaker's label
                if(speaker != "None"):
                    d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing to the new file

The third and final step matches each quote with the attributes of its speaker. As mentioned above, this is done by looking up the Wikidata subset and taking the attributes assigned to the most likely speaker's Wikidata QID. Because some fields may be empty (due to missing data), we make sure that this case is handled correctly.

**Inputs:** 
- compressed JSON file with quotes mentioning Brexit and with a known speaker from Quotebank from given year, e.g., `quotes-{year}-step2.json.bz2`
- Wikidata subset with data regarding speakers from Quotebank, stored as a `.parquet` file, i.e. `speaker_attributes.parquet`
- Wikidata subset with labels and descriptions of all references mentioned in `.parquet` Wikidata subset, i.e. `wikidata_labels_descriptions_quotebank.csv.bz2`

**Output:** Compressed JSON file with quotes mentioning Brexit and with a known speaker from Quotebank from given year, alongside attributes for speaker of each quote, e.g., `quotes-{year}-step3.json.bz2`

**NOTE**: Wikidata `.parquet` file is stored in memory, which could take ~6 GB of your RAM - please consider using Google Colab, which by default provides 12 GB of RAM. 

In [None]:
# Step 3: merge data from Quotebank and Wikidata
# NOTE: Wikidata parquet requires ~6 GB of RAM available - please use Colab for this step
wd_df = pd.read_parquet('data/speaker_attributes.parquet')
wd_df.set_index('id', inplace=True)

wd_desc_df = pd.read_csv('data/wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')

# Attributes copied to each quote; all except date_of_birth hold Wikidata QIDs that are replaced by their labels
ATTRIBUTES = ['gender', 'date_of_birth', 'nationality', 'occupation', 'party', 'academic_degree', 'candidacy', 'religion']
LABELLED_ATTRIBUTES = set(ATTRIBUTES) - {'date_of_birth'}
index_of_wd_df = set(wd_df.index) # Built once, used for fast QID membership checks

for year in YEARS:
    path_to_file = f'data/quotes-{year}-step2.json.bz2' 
    path_to_out = f'data/quotes-{year}-step3.json.bz2'
    with bz2.open(path_to_file, 'rb') as s_file:
        with bz2.open(path_to_out, 'wb') as d_file:
            for instance in s_file:
                instance = json.loads(instance)
                qid = list(index_of_wd_df.intersection(set(instance['qids']))) # Get speaker's QIDs (there can be multiple)
                if(len(qid) == 0):
                    # No speaker QID found in Wikidata: keep the quote without attributes
                    d_file.write((json.dumps(instance)+'\n').encode('utf-8')) 
                    continue
                speaker_row = wd_df.loc[qid].iloc[0] # Attributes of the first matching QID
                for attribute in ATTRIBUTES:
                    values = speaker_row[attribute]
                    if values is None:
                        instance[attribute] = [] # Missing data
                    elif attribute in LABELLED_ATTRIBUTES:
                        instance[attribute] = wd_desc_df['Label'][values].tolist() # Replace QIDs with labels
                    else:
                        instance[attribute] = values.tolist()
                d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing to the new file

## 3.2. Examine our data

# Understanding what is in data. Get acquainted with the data.
- Formats
- Distributions
- Missing values
- Correlations

# Justify feasibility given the data
- Format is appropriate for analysis
- Enough quotes
- Quotes addressing the issue
- Different speakers involved
- Uniformity of speakers (not all from the same speaker)
- Long and short sentences
- Not many missing quotes
- Not many missing speakers
- There shouldn't be many correlations between (I don't know)
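
A few of the checks listed above could be computed along the following lines. This is only a sketch on invented in-memory records (the quotes, speakers and attribute values are placeholders), and the `missing_share` helper is hypothetical; the real checks would run on the step 3 output.

```python
from statistics import median

# Invented records that mimic the shape of the step 3 output.
records = [
    {"quotation": "Brexit means Brexit.", "speaker": "Speaker A", "gender": ["female"]},
    {"quotation": "We should leave the EU as soon as possible.", "speaker": "Speaker B", "gender": []},
    {"quotation": "Leaving the European Union would be a mistake.", "speaker": "Speaker A"},
]


def missing_share(records, attribute):
    """Fraction of records where the attribute is absent or empty."""
    missing = sum(1 for record in records if not record.get(attribute))
    return missing / len(records)


# Quote length in words, as a first look at long versus short sentences.
word_counts = [len(record["quotation"].split()) for record in records]
distinct_speakers = {record["speaker"] for record in records}

print("median length:", median(word_counts))
print("distinct speakers:", len(distinct_speakers))
print("missing gender share:", missing_share(records, "gender"))
```

Treating an empty list the same as an absent key matches how missing attributes are stored by step 3.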

## 3.3. Clean up data

# Generic cleaning
- Identify duplicates
- Remove empty quotes
- Remove empty speakers
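
A minimal sketch of these generic cleaning rules is shown below. It assumes that a duplicate is a quote with the same speaker and the same text ignoring case, which is only one possible definition; the function name `generic_clean` and the sample records are made up for the example.

```python
def generic_clean(records):
    """Drop empty quotes, empty or unattributed speakers, and duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        quotation = record.get("quotation", "").strip()
        speaker = record.get("speaker", "").strip()
        if not quotation or not speaker or speaker == "None":
            continue
        # Assumption: same speaker + same text (case-insensitive) is a duplicate.
        key = (speaker, quotation.casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


sample_records = [
    {"quotation": "Brexit means Brexit.", "speaker": "Speaker A"},
    {"quotation": "brexit means brexit.", "speaker": "Speaker A"},
    {"quotation": "", "speaker": "Speaker B"},
    {"quotation": "We will leave the EU.", "speaker": "None"},
]

cleaned = generic_clean(sample_records)
print(len(cleaned))
```

The first occurrence of each duplicate is the one that is kept, so the input order decides which copy survives.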

# NLP cleaning
- Remove capital letters, punctuation, emojis, links
- Remove quotes not mentioning Brexit
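
The text normalization part could look roughly like the sketch below. The `normalize` helper is hypothetical, and it takes a deliberately blunt approach: after lowercasing and removing links, it keeps only ASCII letters, digits and whitespace, which removes punctuation and emojis but also accented letters and apostrophes.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
NON_ALNUM_PATTERN = re.compile(r"[^a-z0-9\s]")


def normalize(quotation: str) -> str:
    """Lowercase a quote and strip links, punctuation and emojis."""
    text = quotation.lower()
    text = URL_PATTERN.sub(" ", text)        # remove links first, while "://" is intact
    text = NON_ALNUM_PATTERN.sub(" ", text)  # drop punctuation, emojis, other symbols
    return " ".join(text.split())            # collapse repeated whitespace


print(normalize("Brexit means BREXIT!!! 🇬🇧 See https://example.com/news"))
```

Links are removed before punctuation so that a URL is dropped as a whole instead of leaving fragments such as `https` or `example` behind.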

## 3.4. Exploring and visualizing

### Plan for analysis

# For each of the questions include the following (even if, for example, the data is not enriched in the end):
- Various choices of analyses that we thought about but discarded 
- Consider ways to enrich, filter, transform data according to needs 
- Final good choice for analysis. Has to be reasonable and sound
- Complete necessary descriptive statistic tasks

# State and describe all of our questions
- Q1: What percentage of the speakers supported or opposed Brexit?
- Q2: What arguments did the members of each category use to support their beliefs?
- Q3: Who were the main supporters of each of the categories? Analyze them according to age, gender, occupation, ...
- Q4: How did the opinion towards Brexit change during the 5-year span? Did the arguments of each group also change?

#### Q1: What percentage of the speakers supported or opposed Brexit?

#### Q2: What arguments did the members of each category use to support their beliefs?


#### Q3: Who were the main supporters of each of the categories? Analyze them according to age, gender, occupation, ...


#### Q4: How did the opinion towards Brexit change during the 5-year span? Did the arguments of each group also change?

## 3.5. Modeling


## 3.6. Interpreting our data

## 3.7. Storytelling and communication

### Plan for storytelling and communication

In [None]:
# Various choices of communication that we thought about but discarded


In [None]:
# Final good choice for communication. Has to be reasonable and sound.


---

# 4. Conclusions

### 4.1. Summary of the notebook


### 4.2. Results obtained


### 4.3. Problems encountered

---

# 5. Future lines