# Mounting the Google Drive

You can mount your Google Drive in Colab if you need additional storage or want to use files stored there. To do so, run the following code cell (click the play button or use the keyboard shortcut 'Command/Ctrl+Enter'):

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Setup

In [2]:
PATH_ROOT = '/content/drive/MyDrive/ADA'
PATH_PARQUET = PATH_ROOT + '/Project datasets'
PATH_QUOTEBANK = PATH_ROOT + '/Quotebank'

In [3]:
# Comment out the following line if you're not using Google Colab.
# On Google Colab we need an older version of pandas to be able to
# read large data in chunks.
!pip install pandas==1.0.5



In [4]:
# Dependency to read Parquet
!pip install pyarrow



In [5]:
import pandas as pd
import numpy as np
import seaborn as sns

# Sample task

You can read the data directly from the Google Drive you mounted above. Make sure you mounted the drive to which you saved the shortcut to the Quotebank data. We will go through a simple task with this data: extracting the domain names of the URLs for each sample from the year 2020.

## Extracting the domain names (Skip this for now, will be useful for getting newspapers)

This is an example of how to extract domain names from a sample. To do that, we can use the *tld* library. To install it:

In [None]:
!pip install tld

The following function then returns the domain name. It takes a URL as its argument:

In [None]:
from tld import get_tld

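def get_domain(url):
    # as_object=True returns an object exposing the parts of the URL
    res = get_tld(url, as_object=True)
    # res.domain is the domain name (e.g. 'thehill'); res.tld is only the suffix (e.g. 'com')
    return res.domain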
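
For comparison, the host part of a URL can be pulled out with the standard library alone. The sketch below does not use the *tld* library: `get_hostname` is a hypothetical helper that only strips a leading `www.`, so unlike *tld* it cannot separate a suffix such as `co.uk` from the registered domain.

```python
from urllib.parse import urlparse

def get_hostname(url):
    """Hypothetical helper: return the host of a URL without a leading 'www.'."""
    # urlparse(...).hostname is None when the string has no network location
    host = urlparse(url).hostname or ''
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host

print(get_hostname('http://thehill.com/opinion/international/'))
print(get_hostname('https://www.example.co.uk/news'))
```

This is enough for a quick look at which sites appear in the data, but a suffix-aware library remains the safer choice for grouping by newspaper.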


Now we have to read the data. Each sample has a property 'urls', which contains a list of links to the original articles containing the quotation. We will extract the domain name of each link and save a new file that contains the samples with the extracted domains. The path_to_out variable below points to a folder on the mounted drive (the folder must already exist); you can optionally change it if you prefer to save the file elsewhere, for example in Colab's local storage. To generate the new file, run this cell:

In [None]:
import bz2
import json

path_to_file = '/content/drive/MyDrive/ADA/Quotebank/quotes-2020.json.bz2' 
path_to_out = '/content/drive/MyDrive/ADA/output/quotes-2020-domains.json.bz2'

with bz2.open(path_to_file, 'rb') as s_file:
    with bz2.open(path_to_out, 'wb') as d_file:
        for instance in s_file:
            instance = json.loads(instance)  # load one sample (one JSON object per line)
            urls = instance['urls']  # list of links for this sample
            # extract the domain name of every link
            domains = [get_domain(url) for url in urls]
            instance['domains'] = domains  # update the sample with the domain names
            d_file.write((json.dumps(instance) + '\n').encode('utf-8'))  # write to the new file

This cell should take around 25 minutes to finish running. Once it is done, you will be able to see the file (*quotes-2020-domains.json.bz2*) in the file explorer on the left side.
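
The same read-transform-write pattern can be tried on a tiny file before committing to the full run. The following is a minimal sketch under stated assumptions: the records are made up, `add_hosts` is a hypothetical function, and `urlparse(...).hostname` stands in for the *tld*-based `get_domain`.

```python
import bz2
import json
import os
import tempfile
from urllib.parse import urlparse

def add_hosts(path_in, path_out):
    """Stream a bz2-compressed JSON-lines file and add a 'hosts' list to each record."""
    n = 0
    with bz2.open(path_in, 'rt', encoding='utf-8') as src, \
         bz2.open(path_out, 'wt', encoding='utf-8') as dst:
        for line in src:  # one JSON object per line, never the whole file in memory
            record = json.loads(line)
            record['hosts'] = [urlparse(u).hostname for u in record['urls']]
            dst.write(json.dumps(record) + '\n')
            n += 1
    return n

# Tiny made-up input so the sketch runs without the Quotebank files
tmp_dir = tempfile.mkdtemp()
toy_in = os.path.join(tmp_dir, 'toy.json.bz2')
toy_out = os.path.join(tmp_dir, 'toy-hosts.json.bz2')
with bz2.open(toy_in, 'wt', encoding='utf-8') as f:
    f.write(json.dumps({'quoteID': 'a', 'urls': ['http://thehill.com/x']}) + '\n')
    f.write(json.dumps({'quoteID': 'b', 'urls': []}) + '\n')

n_written = add_hosts(toy_in, toy_out)
with bz2.open(toy_out, 'rt', encoding='utf-8') as f:
    toy_records = [json.loads(line) for line in f]
print(n_written, toy_records[0]['hosts'])
```

Opening the files in text mode (`'rt'`/`'wt'`) avoids the explicit `.encode('utf-8')` step; the binary-mode version in the cell above is equivalent.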

You are all set, good luck! :)

# Processing in Chunks

In [6]:
def process_chunk(chunk):
    # Report the size and the columns of one chunk
    print(f'Processing chunk with {len(chunk)} rows')
    print(chunk.columns)

# The following doesn't work with older versions of pandas
'''with pd.read_json(path_or_buf=path_to_out, lines=True, compression='bz2', chunksize=1000000) as df_reader:
    for chunk in df_reader:
        process_chunk(chunk)'''

# With chunksize set, read_json returns an iterator over DataFrames, not a single DataFrame
df = pd.read_json(path_or_buf=PATH_QUOTEBANK + '/quotes-2020.json.bz2', orient="records", compression='bz2', lines=True, chunksize=5)

# Keep only the first chunk (5 rows) to inspect the columns
df1 = next(iter(df))
df1

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2020-01-28-000082,[ D ] espite the efforts of the partners to cr...,,[],2020-01-28 08:04:05,1,"[[None, 0.7272], [Prime Minister Netanyahu, 0....",[http://israelnationalnews.com/News/News.aspx/...,E
1,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,[Q367796],2020-01-16 12:00:13,1,"[[Sue Myrick, 0.8867], [None, 0.0992], [Ron Wy...",[http://thehill.com/opinion/international/4782...,E
2,2020-02-10-000142,... He (Madhav) also disclosed that the illega...,,[],2020-02-10 23:45:54,1,"[[None, 0.8926], [Prakash Rai, 0.1074]]",[https://indianexpress.com/article/business/ec...,E
3,2020-02-15-000053,"... [ I ] f it gets to the floor,",,[],2020-02-15 14:12:51,2,"[[None, 0.581], [Andy Harris, 0.4191]]",[https://patriotpost.us/opinion/68622-trump-bu...,E
4,2020-01-24-000168,[ I met them ] when they just turned 4 and 7. ...,Meghan King Edmonds,[Q20684375],2020-01-24 20:37:09,4,"[[Meghan King Edmonds, 0.5446], [None, 0.2705]...",[https://people.com/parents/meghan-king-edmond...,E
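
Chunked reading pays off when each chunk can be reduced to a small result that is combined afterwards. The sketch below illustrates this on made-up records held in memory; the data and the name `speaker_counts` are assumptions for illustration, not part of the Quotebank files.

```python
import io
from collections import Counter

import pandas as pd

# Made-up JSON-lines records mimicking two Quotebank columns
toy_lines = '\n'.join([
    '{"quoteID": "q1", "speaker": "A"}',
    '{"quoteID": "q2", "speaker": "B"}',
    '{"quoteID": "q3", "speaker": "A"}',
    '{"quoteID": "q4", "speaker": "C"}',
    '{"quoteID": "q5", "speaker": "A"}',
])

# chunksize=2 yields DataFrames of at most 2 rows each
reader = pd.read_json(io.StringIO(toy_lines), lines=True, chunksize=2)

speaker_counts = Counter()
for chunk in reader:
    # Reduce each chunk to per-speaker counts, then merge into the running total
    speaker_counts.update(chunk['speaker'].value_counts().to_dict())

print(speaker_counts.most_common(1))
```

Only the running `Counter` stays in memory between chunks, so the approach carries over to files that do not fit in memory as a whole.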


# Cleaning

# Reading wikidata labels

In [6]:
df_wikidata_labels = pd.read_csv(PATH_PARQUET + '/wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')
df_wikidata_labels

QID,Label,Description
Q31,Belgium,country in western Europe
Q45,Portugal,country in southwestern Europe
Q75,Internet,global system of connected computer networks
Q148,People's Republic of China,sovereign state in East Asia
Q155,Brazil,country in South America
...,...,...
Q106302506,didgeridooist,musician who plays the didgeridoo
Q106341153,biochemistry teacher,teacher of biochemistry at any level
Q106368830,2018 Wigan Metropolitan Borough Council electi...,
Q106369692,2018 Wigan Metropolitan Borough Council electi...,


# Reading a subset of Parquet

In [8]:
path_to_parquet_file = PATH_PARQUET + '/speaker_attributes.parquet/part-00000-0d587965-3d8f-41ce-9771-5b8c9024dce9-c000.snappy.parquet'

df_parquet_subset = pd.read_parquet(path_to_parquet_file)

In [9]:
df_parquet_subset.head()

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Q161885, Q30]",[Q6581097],1395141751,,W000178,"[Q82955, Q189290, Q131512, Q1734662, Q294126, ...",[Q327591],,Q23,George Washington,"[Q698073, Q697949]",item,[Q682443]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[Q145],[Q6581097],1395737157,[Q7994501],,"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Q31],[Q6581097],1380367296,,,"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[Q30],[Q6581097],1395142029,,,"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",[Q29468],,Q207,George W. Bush,"[Q327959, Q464075, Q3586276, Q4450587]",item,"[Q329646, Q682443, Q33203]"
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Q29],[Q6581097],1391704596,,,[Q1028181],,,Q297,Diego Velázquez,,item,


# Reading speakers parquet file

In [7]:
df_parquet = pd.read_parquet(PATH_PARQUET + '/speaker_attributes.parquet')
df_parquet.head()

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Q161885, Q30]",[Q6581097],1395141751,,W000178,"[Q82955, Q189290, Q131512, Q1734662, Q294126, ...",[Q327591],,Q23,George Washington,"[Q698073, Q697949]",item,[Q682443]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[Q145],[Q6581097],1395737157,[Q7994501],,"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Q31],[Q6581097],1380367296,,,"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[Q30],[Q6581097],1395142029,,,"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",[Q29468],,Q207,George W. Bush,"[Q327959, Q464075, Q3586276, Q4450587]",item,"[Q329646, Q682443, Q33203]"
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Q29],[Q6581097],1391704596,,,[Q1028181],,,Q297,Diego Velázquez,,item,


In [8]:
df_parquet.id.is_unique

True

The `id` values are unique, so it is safe to use the QIDs in the Parquet file as the index.
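
Had the check returned False, setting the index would silently create duplicate labels. A small sketch on made-up rows (the frame `toy_speakers` is hypothetical) shows how `verify_integrity=True` makes `set_index` fail loudly instead:

```python
import pandas as pd

# Made-up speaker rows with a duplicated id
toy_speakers = pd.DataFrame({
    'id': ['Q23', 'Q42', 'Q42'],
    'label': ['George Washington', 'Douglas Adams', 'Douglas Adams'],
})

print(toy_speakers['id'].is_unique)

# verify_integrity=True raises ValueError when the new index has duplicates
try:
    toy_speakers.set_index('id', verify_integrity=True)
    duplicate_error = None
except ValueError as err:
    duplicate_error = err

# One possible remedy: drop the repeated ids before indexing
deduplicated = toy_speakers.drop_duplicates(subset='id').set_index('id', verify_integrity=True)
print(list(deduplicated.index))
```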

In [8]:
df_parquet.set_index(keys='id', inplace=True)
df_parquet.head()

id,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,label,candidacy,type,religion
Q23,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Q161885, Q30]",[Q6581097],1395141751,,W000178,"[Q82955, Q189290, Q131512, Q1734662, Q294126, ...",[Q327591],,George Washington,"[Q698073, Q697949]",item,[Q682443]
Q42,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[Q145],[Q6581097],1395737157,[Q7994501],,"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,,Douglas Adams,,item,
Q1868,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Q31],[Q6581097],1380367296,,,"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q...",,,Paul Otlet,,item,
Q207,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[Q30],[Q6581097],1395142029,,,"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",[Q29468],,George W. Bush,"[Q327959, Q464075, Q3586276, Q4450587]",item,"[Q329646, Q682443, Q33203]"
Q297,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Q29],[Q6581097],1391704596,,,[Q1028181],,,Diego Velázquez,,item,


## Getting different gender values (Just messing around with parquet and wikidata)

In [11]:
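def get_array_length(x):
  # Number of values in the list; None counts as 0
  if x is not None:
    return len(x)
  return 0

def get_second_gender(x):
  # Second value if there are exactly two values, otherwise the placeholder 'Q0'
  if x is None:
    return 'Q0'
  if len(x) == 2:
    return x[1]
  else:
    return 'Q0'

def get_first_gender(x):
  # First value if there is one, otherwise the placeholder 'Q0'
  if x is None:
    return 'Q0'
  if len(x) > 0:
    return x[0]
  else:
    return 'Q0'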

In [12]:
# How many have more than one gender value
len(df_parquet[df_parquet.gender.apply(get_array_length) > 1])

1380

In [13]:
# Group all the first gender values in one set and all the second values in another, then take the set difference.
# A non-empty result means that some gender values appear only in second position,
# so the first values alone do not cover all possible gender values.
set(df_parquet.gender.apply(get_second_gender).unique()).difference(set(df_parquet.gender.apply(get_first_gender).unique()))

{'Q15145782', 'Q15145783', 'Q281833', 'Q3277905', 'Q51415', 'Q8964773'}
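
The logic of this check is easier to follow on a handful of rows. The sketch below uses made-up identifiers (`QA`, `QB`, `QC`) and plain Python sets; it mirrors the helper functions above, including the `'Q0'` placeholder for missing values.

```python
# Made-up gender lists; None and [] stand for missing values
toy_genders = [['QA'], ['QB', 'QC'], None, ['QA', 'QB'], []]

# First value of each list, 'Q0' when there is none
first_values = {g[0] if g else 'Q0' for g in toy_genders}
# Second value only when the list has exactly two entries, otherwise 'Q0'
second_values = {g[1] if g is not None and len(g) == 2 else 'Q0' for g in toy_genders}

# Values that occur in second position but never in first position
only_second = second_values - first_values
print(only_second)
```

Here `QC` never appears first, so looking at first values alone would miss it; an empty difference is what would justify using only the first values.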

In [14]:
gender_qids = pd.concat([df_parquet.gender.apply(get_first_gender), df_parquet.gender.apply(get_second_gender)]).drop_duplicates()

In [31]:
# Getting all different gender values
df_wikidata_labels.merge(right = gender_qids, how='right', left_on='QID', right_on='gender')[['gender', 'Label']]

Unnamed: 0,gender,Label
0,Q281833,Taira no Kiyomori
1,Q505371,agender
2,Q1097630,intersex
3,Q1289754,neutrois
4,Q1775415,feminine
5,Q3177577,muxe
6,Q6581097,male
7,Q27679766,transmasculine
8,Q44148,male organism
9,Q179294,eunuch


We have speakers who are demiboys...
But why does Q281833 (Taira no Kiyomori) appear among the gender values?

In [16]:
#df_parquet_labels_2 = pd.read_csv(PATH_TO_PARQUET + '/wikidata_labels_descriptions.csv.bz2', compression='bz2', index_col='QID')
#df_parquet_labels_2

**Note**: You can use the pandas [explode](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) function to transform each element of a list-like into a row, replicating the index values.

In [36]:
df_gender_exploded = df_parquet.explode('gender')
df_gender_exploded[df_gender_exploded.gender == 'Q281833']

id,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,label,candidacy,type,religion
Q710537,,[+1157-01-01T00:00:00Z],[Q17],Q281833,1386969157,,,[Q38142],,,Taira no Shigehira,,item,


Finding: Taira no Shigehira (Q710537) has a bad gender value, Q281833, which refers to another person (Taira no Kiyomori).
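
To see how explode replicates the index, here is a small sketch on a made-up frame (the identifiers `QA`, `QB` and the ids `Q1` to `Q3` are placeholders, not real Wikidata entries):

```python
import pandas as pd

# Made-up speakers: one value, two values, and a missing gender list
toy = pd.DataFrame(
    {'gender': [['QA'], ['QA', 'QB'], None]},
    index=pd.Index(['Q1', 'Q2', 'Q3'], name='id'),
)

# Each list element becomes its own row; the id is repeated for multi-valued rows
toy_exploded = toy.explode('gender')
print(toy_exploded)

# After exploding, a plain equality filter finds every speaker carrying a given value
matches = toy_exploded[toy_exploded['gender'] == 'QB']
print(list(matches.index))
```

The speaker `Q2` appears twice after exploding, which is what makes the filter work regardless of the position of the value in the original list.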


# Group speakers based on gender (from 2020)

# Sample 10000 random speakers from 2020 dataset

In [9]:
sample_10000_speakers = df_parquet.sample(n=10000)
sample_10000_speakers

id,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,label,candidacy,type,religion
Q33106702,,[+1942-07-17T00:00:00Z],,[Q6581072],1263799499,,,,,,Martine Betch,,item,
Q96465173,,,[Q974],[Q6581072],1369842897,,,[Q82955],,,MBELU Elisabeth,,item,
Q86493487,,,,[Q6581097],1363488691,,,[Q1650915],,,Daniel Royzman,,item,
Q6097767,,[+1872-00-00T00:00:00Z],,[Q6581097],895921488,,,,,,Mehmed Celaleddin Saygun,,item,
Q16065852,,[+1798-01-01T00:00:00Z],,[Q6581097],1336183976,,,[Q333634],,,Jonathan Edwards Ryland,,item,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q47500497,[Cláudio Tavares],[+1964-02-05T00:00:00Z],[Q155],[Q6581097],1331048472,,,[Q13382576],,,Cláudio Mello Tavares,,item,
Q65031945,,[+1958-05-11T00:00:00Z],,,1257787258,,,[Q11774891],,,,,item,
Q10395491,,[+1977-02-19T00:00:00Z],[Q155],[Q6581097],1394214951,,,"[Q33999, Q177220, Q36834]",,,Zéu Britto,,item,
Q88140681,,,,,1354584194,,,[Q1650915],,,Alfredis González Hernández,,item,


In [None]:
selected_columns=['quoteID', 'quotation', 'speaker', 'qids']
quotes_speaker_2020 = pd.DataFrame(columns = selected_columns)

# Read the 2020 quotes in chunks of 500000 rows
df_2020_dataset = pd.read_json(path_or_buf=PATH_QUOTEBANK + '/quotes-2020.json.bz2', compression='bz2', lines=True, chunksize=500000)

# Count the total number of quotes; collecting the selected columns is disabled for now
length = 0
for chunk in df_2020_dataset:
  #quotes_speaker_2020 = pd.concat([quotes_speaker_2020, chunk[selected_columns]], ignore_index=True)
  length += len(chunk)
length

**Some speakers have multiple qids**!
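
One way to collect the selected columns and then spot such quotes is sketched below on made-up records. It gathers the per-chunk slices in a list and concatenates once at the end, which avoids copying the growing frame on every iteration; whether the full 2020 data fits in memory this way is not something this sketch checks.

```python
import io

import pandas as pd

# Made-up JSON-lines records mimicking the Quotebank columns used above
toy_lines = '\n'.join([
    '{"quoteID": "q1", "quotation": "a", "speaker": "A", "qids": ["Q1"], "phase": "E"}',
    '{"quoteID": "q2", "quotation": "b", "speaker": "B", "qids": ["Q2", "Q3"], "phase": "E"}',
    '{"quoteID": "q3", "quotation": "c", "speaker": "C", "qids": ["Q4"], "phase": "E"}',
])
selected_columns = ['quoteID', 'quotation', 'speaker', 'qids']

reader = pd.read_json(io.StringIO(toy_lines), lines=True, chunksize=2)

# Keep only the needed columns of each chunk, then concatenate once
parts = [chunk[selected_columns] for chunk in reader]
toy_quotes = pd.concat(parts, ignore_index=True)

# Quotes whose speaker name maps to more than one QID
ambiguous = toy_quotes[toy_quotes['qids'].apply(len) > 1]
print(len(toy_quotes), list(ambiguous['quoteID']))
```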

# Extracting topics from quotes (2020) 
