# MILESTONE 2

`File name:` milestone2.py

`Authors:`
- Víctor González
- Alvaro Bautista 
- Alicia Soria 
- Kamil Czerniak

`Date created:` 05/11/2021

`Date last modified:` 12/11/2021

`Python Version:` 3.9.2


---

## Table of contents

**1. INTRODUCTION**
   * Context
   * Project idea
   * Project goals
   * Motivation
   * Feasibility
   
   
   
**2. THE DATA**
   * Quotebank
   * External data
    
    
    
**3. PIPELINE**
   * Load data
   * Examine our data
   * Clean up data
   * Exploring and visualizing
   * Modeling
   * Interpreting
   * Storytelling and communication
    
    
    
**4. CONCLUSIONS**
   * Summary
   * Results
   * Problems encountered
    
    
    
**5. FUTURE LINES**

---

# 1. Introduction

### 1.1. Context

# Brief introduction
Welcome to the Milestone 2 Python notebook.
In this notebook we answer several research questions about Brexit.
We rely mainly on the Quotebank dataset, complemented by additional databases that enrich the data and support more
complete conclusions.

### 1.2. Project idea


# Explain in clear, reasonable, and thorough way the project idea
Identify the arguments for and against Brexit put forward by different social groups.

### 1.3. Project goals

# Clear project goals
-
-
-

### 1.4. Motivation


In [None]:
# What story do we want to tell, why?
# Critical awareness of the project (social, cultural, political, economic, educational, ... impact)

### 1.5. Feasibility

In [None]:
# Justify feasibility given the data

---

# 2. The data

In [None]:
# Description of the data

- `quoteID:` Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
- `quotation:` Text of the longest encountered original form of the quotation
- `date:` Earliest occurrence date of any version of the quotation
- `phase:` Corresponding phase of the data in which the quotation first occurred (A-E)
- `probas:` Array representing the probabilities of each speaker having uttered the quotation. The probabilities across different occurrences of the same quotation are summed for each distinct candidate speaker and then normalized
  - `proba:` Probability for a given speaker
  - `speaker:` Most frequent surface form for a given speaker in the articles where the quotation occurred
- `speaker:` Selected most likely speaker. This matches the first speaker entry in `probas`
- `qids:` Wikidata IDs of all aliases that match the selected speaker
- `numOccurrences:` Number of times this quotation occurs in the articles
- `urls:` List of links to the original articles containing the quotation 
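
To make the field list above concrete, here is a minimal sketch that parses one record and reads a few fields. The record is invented for illustration and is not taken from Quotebank; the `date` string format and the shape of `probas` (a list of `[speaker, probability]` pairs) are assumptions.

```python
import json

# Hypothetical record written for illustration only (not real Quotebank data).
# The "date" format and the [speaker, probability] layout of "probas" are assumptions.
sample_line = json.dumps({
    "quoteID": "2019-03-29-000001",
    "quotation": "We will leave the European Union.",
    "date": "2019-03-29 00:00:00",
    "phase": "E",
    "probas": [["Jane Doe", "0.81"], ["None", "0.19"]],
    "speaker": "Jane Doe",
    "qids": ["Q0"],
    "numOccurrences": 2,
    "urls": ["http://example.com/a", "http://example.com/b"],
})

record = json.loads(sample_line)

# The first entry of "probas" should name the selected speaker
top_speaker, top_proba = record["probas"][0]

# The first ten characters of quoteID carry the date part (YYYY-MM-DD)
quote_day = record["quoteID"][:10]

print(top_speaker, float(top_proba), quote_day, len(record["urls"]))
```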

### 2.1. Quotebank

In [None]:
# Take into account it goes from 2015 to 2020 and it is not the sampled database

### 2.2. External data

---

# 3. Pipeline

# Pipeline we are going to follow
- Load data
- Examine our data
- Clean up data
- Exploring and visualizing
- Modeling
- Interpreting our data
- Storytelling and communication

## 3.1. Load data

# Capable of handling whole dataset
- Load the whole dataset
- Open it with Google Colab

Notes:
1. Each step writes its own output file for each year. This limits the strain on the following steps and makes it possible to analyze what a step removed by comparing its input and output files.
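
Because every step keeps its own output file, the quotes dropped by a step can be recovered by comparing two consecutive files. The sketch below shows one possible way to do that; the helper names `read_quote_ids`, `removed_quote_ids` and `write_records` are hypothetical, and the demo files are tiny stand-ins created in a temporary directory.

```python
import bz2
import json
import os
import tempfile


def read_quote_ids(path):
    """Return the set of quoteID values in a bz2-compressed JSON-lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return {json.loads(line)["quoteID"] for line in handle}


def removed_quote_ids(path_before, path_after):
    """Return the quoteIDs present before a step but absent after it."""
    return read_quote_ids(path_before) - read_quote_ids(path_after)


def write_records(path, records):
    """Write records as bz2-compressed JSON lines (used only for the demo)."""
    with bz2.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


# Demo with invented records in a temporary directory
demo_dir = tempfile.mkdtemp()
before_path = os.path.join(demo_dir, "quotes-demo-step1.json.bz2")
after_path = os.path.join(demo_dir, "quotes-demo-step2.json.bz2")
write_records(before_path, [{"quoteID": "a"}, {"quoteID": "b"}, {"quoteID": "c"}])
write_records(after_path, [{"quoteID": "a"}, {"quoteID": "c"}])

removed = removed_quote_ids(before_path, after_path)
print(removed)
```

Holding only the IDs in memory keeps the comparison cheap even when the quotations themselves are long.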

In [2]:
import re, json, bz2
import pandas as pd
YEARS = ["2015", "2016", "2017", "2018", "2019", "2020"]

In [19]:
# Step 1: only keep quotes that contain phrases: Brexit OR ((leaving OR leave OR exiting OR exit) AND (EU OR European Union))

# Raw string so that \W reaches the regex engine unchanged
regex_text = r"(brexit)|((leave|leaving|exit|exiting).*(\W+eu\W+|\W+european union))"
regex = re.compile(regex_text, re.IGNORECASE)

for year in YEARS:
    path_to_file = f'data/quotes-{year}.json.bz2'
    path_to_out = f'data/quotes-{year}-step1.json.bz2'

    with bz2.open(path_to_file, 'rb') as s_file:
        with bz2.open(path_to_out, 'wb') as d_file:
            for instance in s_file:
                instance = json.loads(instance) # loading a sample
                quotation = instance['quotation'] + " " # extracting quotation, space needed to match EU at the end of a sentence
                # search() scans the whole quotation; match() would only test its beginning
                if regex.search(quotation) is not None:
                    d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing in a new file

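
The Step 1 filter is meant to keep a quote when the phrase appears anywhere in it. The sketch below applies the same pattern to a few invented quotes using `re.search`, which scans the whole string, and contrasts it with `re.match`, which only tests the beginning. The helper name `mentions_brexit` and the sample quotes are hypothetical.

```python
import re

# Same pattern as in Step 1, written as a raw string
BREXIT_RE = re.compile(
    r"(brexit)|((leave|leaving|exit|exiting).*(\W+eu\W+|\W+european union))",
    re.IGNORECASE,
)


def mentions_brexit(quotation):
    """Return True when the pattern occurs anywhere in the quotation."""
    # The trailing space lets "\W+eu\W+" match when "EU" is the last word
    return BREXIT_RE.search(quotation + " ") is not None


# Invented quotes, for illustration only
sample_quotes = [
    "I think Brexit is a mistake",
    "We should leave the EU",
    "Leaving the European Union will hurt trade",
    "The EU must reform",
    "The weather is nice today",
]

for quote in sample_quotes:
    print(mentions_brexit(quote), "|", quote)

# match() anchors at the start, so a quote that mentions Brexit mid-sentence is missed
starts_with_pattern = BREXIT_RE.match("I think Brexit is a mistake") is not None
```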

In [4]:
# Step 2: only keep quotes that have an attributed speaker
for year in YEARS:
    path_to_file = f'data/quotes-{year}-step1.json.bz2' 
    path_to_out = f'data/quotes-{year}-step2.json.bz2'

    with bz2.open(path_to_file, 'rb') as s_file:
        with bz2.open(path_to_out, 'wb') as d_file:
            for instance in s_file:
                instance = json.loads(instance) # loading a sample
                speaker = instance['speaker'] # extracting speaker
                if(speaker != "None"):
                    d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing in the new file
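
Steps 1 and 2 share the same read, test, write loop and differ only in the condition. A possible refactoring, sketched here under that assumption, passes the condition as a function. The names `filter_step` and `has_speaker` are hypothetical, and the demo runs on a small invented file in a temporary directory.

```python
import bz2
import json
import os
import tempfile


def filter_step(path_in, path_out, keep):
    """Copy the records for which keep(record) is true; return how many were kept."""
    kept = 0
    with bz2.open(path_in, "rb") as s_file, bz2.open(path_out, "wb") as d_file:
        for line in s_file:
            instance = json.loads(line)
            if keep(instance):
                d_file.write((json.dumps(instance) + "\n").encode("utf-8"))
                kept += 1
    return kept


def has_speaker(instance):
    """Step 2 condition: unattributed quotes carry the string "None" as speaker."""
    return instance["speaker"] != "None"


# Demo with invented records
demo_dir = tempfile.mkdtemp()
demo_in = os.path.join(demo_dir, "quotes-demo-step1.json.bz2")
demo_out = os.path.join(demo_dir, "quotes-demo-step2.json.bz2")
with bz2.open(demo_in, "wb") as handle:
    for rec in [{"speaker": "Jane Doe"}, {"speaker": "None"}]:
        handle.write((json.dumps(rec) + "\n").encode("utf-8"))

kept_count = filter_step(demo_in, demo_out, has_speaker)
print(kept_count)
```

Returning the number of kept records gives a quick sanity check on how aggressive each step is.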

In [None]:
# Step 3: merge data from Quotebank and Wikidata
# NOTE: Wikidata parquet requires ~7 GB of RAM available - please use Colab for this step
wd_df = pd.read_parquet('data/speaker_attributes.parquet')
wd_df.set_index('id', inplace=True)

wd_desc_df = pd.read_csv('data/wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')

# Attributes copied from Wikidata; all except date_of_birth are QIDs translated to labels
ATTRIBUTES = ['gender', 'date_of_birth', 'nationality', 'occupation', 'party',
              'academic_degree', 'candidacy', 'religion']
index_of_wd_df = set(wd_df.index) # built once and reused for every year

for year in YEARS:
    path_to_file = f'data/quotes-{year}-step2.json.bz2'
    path_to_out = f'data/quotes-{year}-step3.json.bz2'
    with bz2.open(path_to_file, 'rb') as s_file:
        with bz2.open(path_to_out, 'wb') as d_file:
            for instance in s_file:
                instance = json.loads(instance) # loading a sample
                qid = list(index_of_wd_df.intersection(instance['qids']))
                if len(qid) == 0:
                    # No match in Wikidata (two quotes overall): written unchanged, without speaker attributes
                    d_file.write((json.dumps(instance)+'\n').encode('utf-8'))
                    continue
                speaker_row = wd_df.loc[qid].iloc[0] # attributes of the first matching QID
                for attribute in ATTRIBUTES:
                    values = speaker_row[attribute]
                    if values is None:
                        instance[attribute] = []
                    elif attribute == 'date_of_birth':
                        instance[attribute] = values.tolist() # dates are stored as-is
                    else:
                        instance[attribute] = wd_desc_df['Label'][values].tolist() # QIDs -> labels
                d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing in the new file
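
The core of Step 3 is a two-stage lookup: speaker QID to attribute QIDs, then attribute QIDs to readable labels. The sketch below reproduces that lookup on two miniature tables. The tables, the speaker IDs `Q1` and `Q2`, the label assignments and the helper `labels_for` are all invented for illustration, and `reindex` is used here instead of plain indexing as an assumption-driven choice so that a QID without a label is skipped rather than raising a `KeyError`.

```python
import pandas as pd

# Miniature, invented stand-ins for the speaker attributes and label tables
wd_df = pd.DataFrame(
    {
        "gender": [["Q6581097"], None],
        "occupation": [["Q82955", "Q40348"], ["Q1930187"]],
    },
    index=pd.Index(["Q1", "Q2"], name="id"),
)
wd_desc_df = pd.DataFrame(
    {"Label": ["male", "politician", "lawyer", "journalist"]},
    index=pd.Index(["Q6581097", "Q82955", "Q40348", "Q1930187"], name="QID"),
)


def labels_for(speaker_qid, attribute):
    """Return the readable labels of one attribute for one speaker."""
    values = wd_df.loc[speaker_qid, attribute]
    if values is None:
        return []
    # reindex() yields NaN for QIDs without a label; dropna() discards them
    return wd_desc_df["Label"].reindex(values).dropna().tolist()


print(labels_for("Q1", "occupation"))
print(labels_for("Q2", "gender"))
```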

## 3.2. Examine our data

# Understanding what is in data. Get acquainted with the data.
- Formats
- Distributions
- Missing values
- Correlations

# Justify feasibility given the data
- Format is appropriate for analysis
- Enough quotes
- Quotes addressing the issue
- Different speakers involved
- Uniformity of speakers (not all from the same speaker)
- Long and short sentences
- Not many missing quotes
- Not many missing speakers
- There shouldn't be many correlations between (I don't know yet)
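
Several of the checks listed above (enough quotes, missing speakers, long and short sentences, uniformity of speakers) reduce to simple counts. The following is a minimal sketch of such a summary, assuming the records are already loaded as a list of dictionaries; the function name `describe_quotes`, the output keys and the sample records are hypothetical.

```python
from collections import Counter
from statistics import median

MISSING = (None, "", "None")  # "None" is the string used for unattributed quotes


def describe_quotes(records):
    """Summarize volume, missing values, quote length and speaker concentration."""
    texts = [(r.get("quotation") or "") for r in records]
    speakers = [r.get("speaker") for r in records]
    attributed = Counter(s for s in speakers if s not in MISSING)
    n_attributed = sum(attributed.values())
    top_count = attributed.most_common(1)[0][1] if attributed else 0
    return {
        "total_quotes": len(records),
        "missing_speakers": sum(1 for s in speakers if s in MISSING),
        "empty_quotes": sum(1 for t in texts if not t.strip()),
        "median_words": median(len(t.split()) for t in texts) if texts else 0,
        "distinct_speakers": len(attributed),
        # Share of attributed quotes that come from the single most quoted speaker
        "top_speaker_share": top_count / n_attributed if n_attributed else 0.0,
    }


# Invented records, for illustration only
sample_records = [
    {"speaker": "A", "quotation": "Brexit means Brexit"},
    {"speaker": "A", "quotation": "We will leave the EU"},
    {"speaker": "None", "quotation": "Nobody knows"},
    {"speaker": "B", "quotation": ""},
]
summary = describe_quotes(sample_records)
print(summary)
```

A high `top_speaker_share` would signal that the sample is dominated by one speaker.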

## 3.3. Clean up data

# Generic cleaning
- Identify duplicates
- Remove empty quotes
- Remove empty speakers

# NLP cleaning
- Lowercase the text and remove punctuation, emojis, and links
- Remove quotes not mentioning Brexit
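
The cleaning steps above could be sketched as follows. This is one possible implementation, not a final choice: the emoji ranges in `EMOJI_RE` are an approximation, duplicates are detected per speaker on the cleaned text, and the names `clean_quote` and `clean_records` are hypothetical.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Approximate emoji coverage: pictographs plus miscellaneous symbols and dingbats
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_quote(text):
    """Drop links and emojis, lowercase, strip ASCII punctuation, collapse spaces."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(text.split())


def clean_records(records):
    """Drop empty quotes, missing speakers and per-speaker duplicates."""
    seen = set()
    kept = []
    for record in records:
        speaker = record.get("speaker")
        cleaned = clean_quote(record.get("quotation") or "")
        if speaker in (None, "", "None") or not cleaned:
            continue
        key = (speaker, cleaned)
        if key in seen:
            continue
        seen.add(key)
        kept.append({**record, "clean_quotation": cleaned})
    return kept


# Invented records, for illustration only
sample_records = [
    {"speaker": "A", "quotation": "Brexit means BREXIT! See https://example.com \U0001F600"},
    {"speaker": "A", "quotation": "brexit means brexit see"},
    {"speaker": "None", "quotation": "Leave now"},
    {"speaker": "C", "quotation": "   "},
]
cleaned_records = clean_records(sample_records)
print(cleaned_records)
```

Keeping the original `quotation` next to `clean_quotation` preserves the raw text for later qualitative reading.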

## 3.4. Exploring and visualization

### Plan for analysis

# For each question, include the following (even if, for example, the data ends up not being enriched):
- Various choices of analyses that we thought about but discarded 
- Consider ways to enrich, filter, transform data according to needs 
- Final good choice for analysis. Has to be reasonable and sound
- Complete necessary descriptive statistic tasks

# State and describe all of our questions
- Q1: What percentage of the speakers supported or opposed Brexit?
- Q2: What arguments did the members of each category use to support their beliefs? 
- Q3: Who were the main supporters of each of the categories? Analyze them according to age, gender, occupation, ...
- Q4: How did opinion on Brexit change over the 5-year span? Did the arguments of each group also change?

#### Q1: What percentage of the speakers supported or opposed Brexit?
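
One way this question could be approached, assuming each quote has already received a stance label, is to assign every speaker their most frequent stance and then compute the share of speakers per stance. The `stance` field does not exist in the data and is purely hypothetical here, as are the function name and the sample records. How the labels would be obtained is left open.

```python
from collections import Counter, defaultdict


def speaker_stance_shares(records):
    """Percentage of speakers per stance, using each speaker's most frequent stance."""
    per_speaker = defaultdict(Counter)
    for record in records:
        per_speaker[record["speaker"]][record["stance"]] += 1
    # On a tie, most_common() returns the stance seen first for that speaker
    majority = Counter(c.most_common(1)[0][0] for c in per_speaker.values())
    total = sum(majority.values())
    return {stance: 100 * count / total for stance, count in majority.items()}


# Invented records with a hypothetical "stance" label
sample_records = [
    {"speaker": "A", "stance": "for"},
    {"speaker": "A", "stance": "for"},
    {"speaker": "A", "stance": "against"},
    {"speaker": "B", "stance": "against"},
    {"speaker": "C", "stance": "for"},
    {"speaker": "D", "stance": "for"},
]
shares = speaker_stance_shares(sample_records)
print(shares)
```

Counting speakers rather than quotes prevents a few very vocal speakers from dominating the percentage.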

#### Q2: What arguments did the members of each category use to support their beliefs? 


#### Q3: Who were the main supporters of each of the categories? Analyze them according to age, gender, occupation, ...
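
A starting point for this question could be to count distinct speakers per attribute value, using the list-valued fields added in Step 3 (for example `occupation` or `gender`). The sketch below assumes records shaped like the Step 3 output; the function name and the sample records are hypothetical.

```python
from collections import defaultdict


def speakers_by_attribute(records, attribute):
    """Count distinct speakers for each value of a list-valued attribute."""
    groups = defaultdict(set)
    for record in records:
        # Speakers without any value are grouped under "unknown"
        for value in record.get(attribute) or ["unknown"]:
            groups[value].add(record["speaker"])
    return {value: len(speakers) for value, speakers in groups.items()}


# Invented records, for illustration only
sample_records = [
    {"speaker": "A", "occupation": ["politician", "lawyer"]},
    {"speaker": "A", "occupation": ["politician", "lawyer"]},
    {"speaker": "B", "occupation": ["politician"]},
    {"speaker": "C", "occupation": []},
]
occupation_counts = speakers_by_attribute(sample_records, "occupation")
print(occupation_counts)
```

A speaker with several occupations is counted once in each of them, so the counts can add up to more than the number of speakers.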


#### Q4: How did opinion on Brexit change over the 5-year span? Did the arguments of each group also change?
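
A first view of change over time could be the number of quotes per month. Since `quoteID` starts with the date in `YYYY-MM-DD` form, its first seven characters give the month. The sketch below relies only on that prefix; the function name and the sample IDs are hypothetical.

```python
from collections import Counter


def quotes_per_month(records):
    """Count quotes per month, using the YYYY-MM prefix of quoteID."""
    counts = Counter(record["quoteID"][:7] for record in records)
    return dict(sorted(counts.items()))


# Invented IDs, for illustration only
sample_records = [
    {"quoteID": "2016-06-23-000001"},
    {"quoteID": "2016-06-24-000002"},
    {"quoteID": "2019-03-29-000003"},
]
monthly_counts = quotes_per_month(sample_records)
print(monthly_counts)
```

The same grouping could be applied per stance or per social group once those labels are available.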

## 3.5. Modeling


## 3.6. Interpreting our data

## 3.7. Storytelling and communication

### Plan for storytelling and communication

In [None]:
# Various choices of communication that we thought about but discarded


In [None]:
# Final good choice for communication. Has to be reasonable and sound.


---

# 4. Conclusions

### 4.1. Summary of the notebook


### 4.2. Results obtained


### 4.3. Problems encountered

---

# 5. Future lines