# Data loading and pre-processing notebook 
As explained in the abstract, the project analyses quotes on topics such as same-sex relationships and feminism. 
The aim of this notebook is to obtain a clean dataset that contains only the quotes relevant to these topics. 
To do that, we perform the following steps: 
- Data extraction: The Quotebank dataset is organised by year, and each yearly file is too big to process at once. We split each year into chunks, then filtered each chunk to keep only the quotes containing keywords we had previously defined. 
- Wikidata handling: We labelled the speakers of the extracted quotes with their nationality, gender, occupation and date of birth.
- Final merging: The last step produces the final cleaned dataset with the quotes of interest and all labels from Wikidata.


In [None]:
# get files from Google Drive 
from google.colab import drive  
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# import libraries 
import bz2
import json
import pandas as pd
import numpy as np 
import os
from matplotlib import pyplot as plt
import re
import nltk
from tqdm import tqdm 
from nltk.stem import WordNetLemmatizer
import spacy 
from collections import Counter

In [None]:
YEARS = np.arange(2015, 2021)  # define dataset parameters
PROJECT_PATH = "/content/drive/Shareddrives/ADA/" 

## 1) Data extraction 
The dataset is large, so we divided it into parts to avoid running out of RAM. For each year, we extracted the fields of interest: ID, Quote, Speaker and numOccurrences. Each chunk holds about 1e6 quotes and is saved to its own JSON file.

In [None]:
# creates path string to parsed file 
get_file_name = lambda year, part: os.path.join(PROJECT_PATH, f"Data/Parsed data/IdQuotes{year}_{part}.json")

In [None]:
for year in tqdm(YEARS): 
  path_file = os.path.join(PROJECT_PATH, f"Quotebank/quotes-{year}.json.bz2")
  decompressed_file = bz2.BZ2File(path_file, "r")  
  quote_idx_dataset = {"ID": [], "Quote": [], "Speaker": [], "numOccurrences": []}
  count_parts = 0  # count of parts by which we divide a dataset 
    
  for idx, line in enumerate(decompressed_file):
      cur_quote = json.loads(line)

      # complete dictionary 
      quote_idx_dataset["ID"].append(cur_quote["quoteID"])
      quote_idx_dataset["Quote"].append(cur_quote["quotation"])
      quote_idx_dataset["Speaker"].append(cur_quote["speaker"])
      quote_idx_dataset["numOccurrences"].append(cur_quote["numOccurrences"])
      
      # divide by parts of 1e6 quotes and save them to json   
      if idx > (count_parts + 1) * 1e6:  
        pd.DataFrame(quote_idx_dataset).to_json(get_file_name(year, count_parts))
        count_parts += 1  
        quote_idx_dataset = {"ID": [], "Quote": [], "Speaker": [], "numOccurrences": []}
  
  print(year, "is done!")
  pd.DataFrame(quote_idx_dataset).to_json(get_file_name(year, count_parts))
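
The chunking logic above can also be written as a small generator. This is a minimal sketch, not the code that produced the files in this notebook: `iter_chunks` and `CHUNK_SIZE` are hypothetical names, and the sketch yields chunks of exactly `size` items, with a possibly shorter last chunk.

```python
from itertools import islice

CHUNK_SIZE = 1_000_000  # assumed chunk size, matching the 1e6 used above


def iter_chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:  # the iterator is exhausted
            return
        yield chunk


# toy usage: 7 items split into chunks of 3
toy_chunks = list(iter_chunks(range(7), 3))
```

With the real data, the iterable would be the decompressed file and each chunk would be converted to a DataFrame before saving.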

To limit the dataset size, we kept only the quotes that contain keywords related to our topics. 

### Keyword list

In [None]:
KEY_WORDS = "women's rights|sex equality|women's role|role of women|women's liberation|\
    gender|feminism|feminists|marriage equality|gay|same-sex|same-sex marriage|\
    gay marriage|homosexual marriage|same-gender marriage|lesbian|lesbians|gays|\
    lesbian|sexism|bisexual|transgender|transsexual|queer|questioning sexual identity|\
    questioning gender identity|intersex|asexual|pansexual|LGBT|LGBTQ|LGBTQI|LGBTQIA|\
    LGBTQIAA|LGBTQI2A|LGBTQIAAP|LGBTQI2AP|LGBTTQIAAP|LGBTTQI2AP|LGBT+|LGBTQ+|LGBTQI+|\
    LGBTQIA+|LGBTQIAA+|LGBTQI2A+|LGBTQIAAP+|LGBTQI2AP+|LGBTTQIAAP+|LGBTTQI2AP+| polysexual|\
    demisexual|gayprideasexual|asexuals|asexuality|bisexual|bisexuals|bisexuality|cisgender|\
    cisgenders|cisgenderism|demisexual|demisexuals|demisexuality|gay|gays|gaypride|gender \
    fluidity|heteronormative|heteronormatives|heteronormativity|hetero sexual|hetero sexuals|\
    heterosexual|heterosexuals|heterosexuality|homo sexual|homo sexuals|homosexual|homo sexuals|\
    homosexuality|inter sex|intersex|intersexual|intersexuals|intersexuality|lesbian|lesbians|\
    lgbt|lgbt+|lgbtq|lgbtq+|lgbtqi|lgbtqi+|lgbtqi2a|lgbtqi2a+|lgbtqi2ap|lgbtqi2ap+|lgbtqia|\
    lgbtqia+|lgbtqiaa|lgbtqiaa+|lgbtqiaap|lgbtqiaap+|lgbttqi2ap|lgbttqi2ap+|lgbttqiaap|\
    lgbttqiaap+|pan sexual|pan sexuals|pansexual|pansexuals|pansexuality|poly sexual|poly\
    sexuals|polysexual|polysexuals|polysexuality|queer|queers|questioning gender identity|\
    questioning sexual identity|trans gender|trans genders|transgender|transgenders|\
    transgenderism|trans misoginy|transmisoginy|trans phobia|transphobia|transphobic|\
    trans sexual|trans sexuals|transsexual|transsexuals|transsexuality|gender roles|\
    misandry|misogyny|patriarchy|sexism|woman empowerement|toxic masculinity"
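
Two details of a pattern written as one long string are easy to miss: a backslash continuation inside a string literal keeps the indentation of the next line as part of the keyword, and `+` acts as a regex quantifier rather than a literal plus sign. The sketch below shows one way to avoid both, assuming the keywords are kept in a Python list. `EXAMPLE_KEYWORDS` and `build_keyword_pattern` are hypothetical names, and the list is only a short excerpt.

```python
import re

# hypothetical short excerpt; the real list would hold every keyword above
EXAMPLE_KEYWORDS = ["women's rights", "same-sex marriage", "LGBT+", "gender fluidity"]


def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword literally."""
    # longest keywords first, so a long phrase is tried before a shorter one
    unique = sorted({k.strip() for k in keywords}, key=len, reverse=True)
    # re.escape makes characters such as "+" match themselves
    return re.compile("|".join(re.escape(k) for k in unique), flags=re.IGNORECASE)


pattern = build_keyword_pattern(EXAMPLE_KEYWORDS)
matches_plus = bool(pattern.search("support for the lgbt+ community"))
matches_none = bool(pattern.search("a quote about football"))
```

With pandas, the pattern source could then be used as `cur_data["Quote"].str.contains(pattern.pattern, case=False)`.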

In [None]:
def part_with_keywords(year, quote_part):
    """
    Return the rows of one dataset part whose quote contains a keyword. 
    params: year - year of the dataset; 
    params: quote_part - index of the part within that year.
    """
    cur_data = pd.read_json(get_file_name(year, quote_part))
    contains_key_words = cur_data["Quote"].str.contains(KEY_WORDS) # boolean mask of quotes containing keywords
    return cur_data.iloc[np.where(contains_key_words)].reset_index()

In [None]:
# number of parts into which data is divided 
PARTS = {2015: 21, 2016: 14, 2017: 27, 2018: 28, 2019: 22, 2020: 6}

We save the file to Google Drive every step of the way in case the runtime is restarted.

In [None]:
for year in YEARS: 
  # collect data for the year 
  all_data = [part_with_keywords(year, quote_part) for quote_part in range(PARTS[year])]  
  full = pd.concat(all_data)
  full.reset_index(inplace=True)
  full.to_json(os.path.join(PROJECT_PATH, f"Data/key_words{year}.json"))

The Google Colab session was stopped here, so we load results from Google Drive. 

### full_keywords = dataframe containing the quotes of interest for all years

In [None]:
# concatenate data for all years 
full_keywords = pd.concat([pd.read_json(os.path.join(PROJECT_PATH, f"Data/key_words{year}.json")) \
                           for year in YEARS], ignore_index=True)

In [None]:
# check data and its shape
print("Shape of a dataset with quotes related to our topics is", full_keywords.shape)
full_keywords.head()  # data filtered by keywords

Shape of a dataset with quotes related to our topics is (352789, 6)


Unnamed: 0,level_0,index,ID,Quote,Speaker,numOccurrences
0,3,330,2015-02-09-001217,a proud transgender woman and for me to be vio...,,1
1,4,419,2015-06-17-018173,I did feel I was gently being prodded into the...,Colin Mathura-Jeffree,2
2,22,1532,2015-04-17-007557,Because my brother and sister are both gay and...,roseanne barr,1
3,23,1877,2015-03-09-028376,"I think Sam died with a smile on his face, kno...",Ingrid Newkirk,4
4,24,1952,2015-03-10-009758,But my question with that [ is ] when you look...,Tra Thomas,2


In [None]:
full_keywords.to_json(os.path.join(PROJECT_PATH, "keywords_full.json"))

## 2) Wikidata handling

We enriched the dataset with speaker attributes from Wikidata, such as nationality and gender (other attributes can be added the same way).

In [None]:
# go to folder which contains parquet files
%cd /content/drive/Shareddrives/ADA/Project\ datasets/speaker_attributes.parquet/

/content/drive/.shortcut-targets-by-id/1VAFHacZFh0oxSxilgNByb1nlNsqznUf0/Project datasets/speaker_attributes.parquet


The Wikidata speaker attributes are split into 16 parquet files, so we load all of them in order to look up the speakers of our quotes. 

In [None]:
all_wiki_data = []  # concatenate all wiki data  
for wiki_part in tqdm(range(16)): 
  wiki_data = pd.read_parquet(f"part-000{wiki_part:02d}-0d587965-3d8f-41ce-9771-5b8c9024dce9-c000.snappy.parquet") 
  # preprocess wikidata 
  wiki_data.rename(columns={"label": "Speaker"}, inplace=True)
    
  # we decrease size of wikidata in order to accelerate search of speakers     
  wiki_data.drop(["aliases", "lastrevid", "ethnic_group", "US_congress_bio_ID", "party", "id", \
                  "candidacy", "type"], axis=1, inplace=True)
  wiki_data.drop_duplicates(["Speaker"], inplace=True)  
  # we noticed that some names in Quotebank are written in capital letters,
  # so we normalize the case 
  wiki_data.Speaker = wiki_data.Speaker.str.lower()
  all_wiki_data.append(wiki_data)

100%|██████████| 16/16 [00:39<00:00,  2.44s/it]


In [None]:
wiki_data = pd.concat(all_wiki_data)
full_keywords.Speaker = full_keywords.Speaker.str.lower()
wiki_data.head()

Unnamed: 0,date_of_birth,nationality,gender,occupation,academic_degree,Speaker,religion
0,[+1732-02-22T00:00:00Z],"[Q161885, Q30]",[Q6581097],"[Q82955, Q189290, Q131512, Q1734662, Q294126, ...",,george washington,[Q682443]
1,[+1952-03-11T00:00:00Z],[Q145],[Q6581097],"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,douglas adams,
2,[+1868-08-23T00:00:00Z],[Q31],[Q6581097],"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q...",,paul otlet,
3,[+1946-07-06T00:00:00Z],[Q30],[Q6581097],"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",,george w. bush,"[Q329646, Q682443, Q33203]"
4,[+1599-06-06T00:00:00Z],[Q29],[Q6581097],[Q1028181],,diego velázquez,


In [None]:
# Wikidata contains many people with the same name; drop_duplicates keeps the first entry for each name 
wiki_data.drop_duplicates("Speaker", inplace=True) 
wiki_data.shape

(7281692, 7)

In [None]:
full_wiki = pd.merge(wiki_data, full_keywords, on=['Speaker'])

In [None]:
# speakers to check for a missing Wikidata match (depends on keywords);
# drop_duplicates() keeps one row per distinct name across both frames
no_wikidata = pd.concat([full_keywords.Speaker,full_wiki.Speaker]).drop_duplicates()
no_wikidata

0                          none
1         colin mathura-jeffree
2                 roseanne barr
3                ingrid newkirk
4                    tra thomas
                  ...          
352736            mario falcone
352738              john church
352752           mary patterson
352774              hank wilson
352786            nicholas phan
Name: Speaker, Length: 43661, dtype: object

In [None]:
no_wikidata = full_keywords.iloc[no_wikidata.index]
no_wikidata.to_json(os.path.join(PROJECT_PATH, "Data/no_wikidata.json"))
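
One way to isolate the quotes whose speaker has no Wikidata entry is a left merge with `indicator=True`. The following is a minimal sketch on toy data, not the code that produced `no_wikidata.json`: `toy_quotes` and `toy_wiki` are hypothetical stand-ins for `full_keywords` and `wiki_data`.

```python
import pandas as pd

# toy stand-ins for full_keywords and wiki_data (hypothetical values)
toy_quotes = pd.DataFrame({
    "Speaker": ["douglas adams", "jane doe", "douglas adams", None],
    "Quote": ["q1", "q2", "q3", "q4"],
})
toy_wiki = pd.DataFrame({"Speaker": ["douglas adams"], "gender": [["Q6581097"]]})

# a left merge keeps every quote; the "_merge" column records where each row was found
merged = toy_quotes.merge(toy_wiki[["Speaker"]], on="Speaker", how="left", indicator=True)

# "left_only" rows are quotes whose speaker is absent from the Wikidata frame
unmatched = merged.loc[merged["_merge"] == "left_only", ["Speaker", "Quote"]]
unmatched_speakers = unmatched["Speaker"].dropna().unique().tolist()
```

Quotes without any speaker also end up in `unmatched`, so they are dropped before listing the unmatched names.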

In [None]:
print("shape without none speakers", full_wiki.shape)
full_wiki.head()  # data filtered by keywords with unlabelled wikidata fields 

shape without none speakers (209000, 11)


Unnamed: 0,date_of_birth,nationality,gender,occupation,academic_degree,Speaker,religion,index,ID,Quote,numOccurrences
0,[+1952-03-11T00:00:00Z],[Q145],[Q6581097],"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,douglas adams,,272571,2015-05-21-068797,To claim that homosexual behavior is wrong wou...,1
1,[+1952-03-11T00:00:00Z],[Q145],[Q6581097],"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,douglas adams,,125117,2017-11-12-089989,"You can't get away with saying, `If you try to...",1
2,[+1946-07-06T00:00:00Z],[Q30],[Q6581097],"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",,george w. bush,"[Q329646, Q682443, Q33203]",119789,2015-03-20-081663,"There should be protections, and so, as it rel...",2
3,[+1946-07-06T00:00:00Z],[Q30],[Q6581097],"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",,george w. bush,"[Q329646, Q682443, Q33203]",669032,2015-06-21-004378,But there is one area where I have done much t...,1
4,[+1946-07-06T00:00:00Z],[Q30],[Q6581097],"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",,george w. bush,"[Q329646, Q682443, Q33203]",69752,2016-05-08-005838,Bush is set to accept the award at an event wh...,1


In [None]:
full_wiki.reset_index(inplace=True)
full_wiki.to_json(os.path.join(PROJECT_PATH, "Data/wiki_non_labelled.json"))

The raw Wikidata attributes are stored as QIDs, which have to be mapped to human-readable labels.   

In [None]:
full_wiki = pd.read_json(os.path.join(PROJECT_PATH, "Data/wiki_non_labelled.json"))

In [None]:
labels_qid_path = "../wikidata_labels_descriptions_quotebank.csv.bz2"
labels_qid = pd.read_csv(labels_qid_path)
# drop the descriptions to shrink the label table and speed up QID lookups 
labels_qid.drop(["Description"], axis=1, inplace=True)

To speed up the mapping from QIDs to labels, we collect all unique QIDs that occur in our data and keep only the matching rows of the label table.

In [None]:
# all unique ids
nationalities = np.unique(np.hstack(full_wiki["nationality"].dropna().to_numpy())) 
genders = np.unique(np.hstack(full_wiki["gender"].dropna()))
occupations =  np.unique(np.hstack(full_wiki["occupation"].dropna().to_numpy()))
all_id = np.concatenate([genders, nationalities, occupations])  # unique id 
labels_qid = labels_qid[labels_qid['QID'].isin(all_id)]  # mini dataframe

### Attribute to each speaker its gender, nationality and occupation 

In [None]:
for column in ["gender", "nationality", "occupation"]:  # wikidata features which we want to label 
    # Some speakers have many nationalities, so we have to iterate lists of different lengths
    full_wiki[column] = full_wiki[column].dropna().apply(lambda k: 
                                [labels_qid[labels_qid.QID == elem]["Label"] for elem in k])
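
The same lookup can be sketched with a plain dictionary, which avoids filtering the label table once per QID. This is an illustrative alternative under stated assumptions, not the mapping used above: `qid_to_label` is a hypothetical excerpt of the label table, `labels_for` is a hypothetical helper, and missing values are assumed to be `None`.

```python
# hypothetical excerpt of the QID -> label table
qid_to_label = {
    "Q6581097": "male",
    "Q30": "United States of America",
    "Q82955": "politician",
}


def labels_for(qids, mapping):
    """Translate a list of QIDs into labels; pass missing values through."""
    if qids is None:
        return None
    # unknown QIDs are kept as-is so that no information is silently lost
    return [mapping.get(qid, qid) for qid in qids]


example = labels_for(["Q82955", "Q999999999"], qid_to_label)
```

With the notebook's frames, the dictionary could be built as `dict(zip(labels_qid["QID"], labels_qid["Label"]))` and applied with `full_wiki[column].apply(labels_for, mapping=qid_to_label)`.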

Finally, we obtain a dataset that combines the quotes with information about their speakers. 

In [None]:
print("Shape of a labelled dataset with quotes related to our topics is", full_wiki.shape)
full_wiki.head()  # data filtered by keywords with some labelled wikidata fields 

Shape of a labelled dataset with quotes related to our topics is (209000, 12)


Unnamed: 0,level_0,date_of_birth,nationality,gender,occupation,academic_degree,Speaker,religion,index,ID,Quote,numOccurrences
0,0,[+1952-03-11T00:00:00Z],[[United Kingdom]],[[male]],"[[playwright], [screenwriter], [novelist], [ch...",,douglas adams,,272571,2015-05-21-068797,To claim that homosexual behavior is wrong wou...,1
1,1,[+1952-03-11T00:00:00Z],[[United Kingdom]],[[male]],"[[playwright], [screenwriter], [novelist], [ch...",,douglas adams,,125117,2017-11-12-089989,"You can't get away with saying, `If you try to...",1
2,2,[+1946-07-06T00:00:00Z],[[United States of America]],[[male]],"[[politician], [motivational speaker], [autobi...",,george w. bush,"[Q329646, Q682443, Q33203]",119789,2015-03-20-081663,"There should be protections, and so, as it rel...",2
3,3,[+1946-07-06T00:00:00Z],[[United States of America]],[[male]],"[[politician], [motivational speaker], [autobi...",,george w. bush,"[Q329646, Q682443, Q33203]",669032,2015-06-21-004378,But there is one area where I have done much t...,1
4,4,[+1946-07-06T00:00:00Z],[[United States of America]],[[male]],"[[politician], [motivational speaker], [autobi...",,george w. bush,"[Q329646, Q682443, Q33203]",69752,2016-05-08-005838,Bush is set to accept the award at an event wh...,1


In [None]:
full_wiki.to_json(os.path.join(PROJECT_PATH, "Data/wiki_labelled.json"))

### Getting the final clean dataset : result_data.json

In [None]:
# from mapping values {QID: label} to array of labels 
for col in ["occupation", "nationality", "gender"]:
  full_wiki[col] = [np.hstack([list(k.values()) for k in elem]) if elem is not None else None for elem in full_wiki[col].values] 

In [None]:
# drop extra columns 
full_wiki.drop(["level_0", "academic_degree", "religion", "index"], axis=1, inplace=True)

In [None]:
# from date of birth to year of birth 
full_wiki.date_of_birth = [elem[0].split("-")[0][1:] if elem is not None else None for elem in full_wiki.date_of_birth]

In [None]:
# from id to year of citation 
full_wiki["quote_year"] = [elem.split("-")[0] for elem in full_wiki.ID]

In [None]:
# from id to month of citation 
full_wiki["quote_month"] = [elem.split("-")[1] for elem in full_wiki.ID]
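
The two cells above rely on the ID layout `year-month-day-number`. As a small sketch of that parsing step, the hypothetical helper `split_quote_id` below returns both values as integers; it assumes every ID follows this layout.

```python
def split_quote_id(quote_id):
    """Return (year, month) as integers from an ID like '2015-05-21-068797'."""
    year, month, _day, _number = quote_id.split("-", 3)
    return int(year), int(month)


example_year, example_month = split_quote_id("2015-05-21-068797")
```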

In [None]:
# ID is not necessary anymore 
full_wiki.drop(["ID"], axis=1, inplace=True)

In [None]:
full_wiki.head()  # check data 

Unnamed: 0,date_of_birth,nationality,gender,occupation,Speaker,Quote,numOccurrences,quote_year,quote_month
0,1952,[United Kingdom],[male],"[playwright, screenwriter, novelist, children'...",douglas adams,To claim that homosexual behavior is wrong wou...,1,2015,5
1,1952,[United Kingdom],[male],"[playwright, screenwriter, novelist, children'...",douglas adams,"You can't get away with saying, `If you try to...",1,2017,11
2,1946,[United States of America],[male],"[politician, motivational speaker, autobiograp...",george w. bush,"There should be protections, and so, as it rel...",2,2015,3
3,1946,[United States of America],[male],"[politician, motivational speaker, autobiograp...",george w. bush,But there is one area where I have done much t...,1,2015,6
4,1946,[United States of America],[male],"[politician, motivational speaker, autobiograp...",george w. bush,Bush is set to accept the award at an event wh...,1,2016,5


In [None]:
full_wiki.to_json(os.path.join(PROJECT_PATH, "Data/result_data.json"))

This dataset contains 9 features and 209000 samples.

## Data problems 

The speakers' names do not share a common format across quotes: some contain accented characters such as "à" or "é", some are abbreviated to initials such as "A.", some include titles like "duke", and some are inverted or written with commas.

1) We noticed quotes that are similar but not identical because of differences in punctuation, bracketed insertions, etc. 

In [None]:
full_wiki.Quote.iloc[55]

"Sometimes, I think we're hurt. We hurt our boys by calling something toxic masculinity. I do. And I don't find putting those two words together [ works ] because women can be pretty f *** ng toxic. It's toxic people. We have our good angles, and we have our bad ones,"

In [None]:
full_wiki.Quote.iloc[56]

"Sometimes, I think we're hurt. We hurt our boys by calling something toxic masculinity. I do. And I don't find [ that ] putting those two words together... because women can be pretty fucking toxic,"

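Near-duplicates of this kind could be flagged by comparing normalised versions of the quotes. The following is a minimal sketch using the standard library's `difflib`; `normalise_quote` and `similarity` are hypothetical helpers, and any threshold on the score would have to be tuned on the real data.

```python
import re
from difflib import SequenceMatcher


def normalise_quote(text):
    """Lowercase and keep only letters, digits and single spaces."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised quotes."""
    return SequenceMatcher(None, normalise_quote(a), normalise_quote(b)).ratio()


q1 = "Sometimes, I think we're hurt. We hurt our boys by calling something toxic masculinity."
q2 = "sometimes I think we're hurt; we hurt our boys by calling something toxic masculinity"
score_same = similarity(q1, q2)
score_diff = similarity(q1, "Bush is set to accept the award at an event")
```

Comparing all pairs of quotes is quadratic in the number of quotes, so in practice the comparison would likely be restricted, for example to quotes from the same speaker.
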
2) Speaker names in the quotes dataset are not normalized. We normalized the case, but some names contain accented characters (à, é, etc.), some are abbreviated to initials (A., B., etc.), some include titles like "duke", and some are inverted and written with commas.  

In [None]:
no_wiki = pd.read_json(os.path.join(PROJECT_PATH, "Data/no_wikidata.json")).dropna()["Speaker"]

In [None]:
no_wiki.iloc[np.where(no_wiki.str.contains(",."))]

4668                      james pickens , jr. .
7355           catherine , duchess of cambridge
8199                        george bush , sr. .
9007                          alan ladd , jr. .
10265                          armstrong , jack
                          ...                  
321273                      frank white , jr. .
330201                     lou chibbaro , jr. .
346138    prince daniel , duke of västergötland
349951            prince harry , duke of sussex
350429              sophie , countess of wessex
Name: Speaker, Length: 62, dtype: object
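
Part of this variation could be removed with a simple normalisation step before matching against Wikidata. The sketch below is only an illustration: `normalise_speaker` is a hypothetical helper, it does not handle inverted names or titles, and letters that have no Unicode decomposition (such as "ø") are left unchanged.

```python
import re
import unicodedata


def normalise_speaker(name):
    """Lowercase, strip accents, and tidy spacing around commas and dots."""
    # decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    plain = re.sub(r"\s+([,.])", r"\1", plain.lower())  # remove spaces before "," and "."
    plain = re.sub(r"\.{2,}", ".", plain)               # collapse repeated dots
    return re.sub(r"\s+", " ", plain).strip()


example_a = normalise_speaker("george bush , sr. .")
example_b = normalise_speaker("Prince Daniel , Duke of Västergötland")
```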

3) Some quotes are not in English. Some of them are translated in different ways in different quotes, which duplicates data. 

In [None]:
full_wiki.Quote.iloc[95]

'Beimano ki barbaadi ka waqt shuru ho gaya hai aur imandaro ki takleefein kum hongi,'