# Retrieve metadata about speakers

This notebook retrieves metadata about the speakers. First, it uses the provided parquet file to retrieve extra information (occupation, date of birth, gender, religion) about each speaker.

## Using provided parquet file

Pandas requires pyarrow to read parquet files; it can be installed with `conda install pyarrow -c conda-forge`.
You can then load the file as a pandas dataframe with `df = pd.read_parquet(<path_to_file>)`.

In [1]:
# Import the required modules
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
from IPython.display import display, HTML

In [3]:
# Small adjustments to the default plot style so that figures are readable and colorblind-friendly
# Note: in matplotlib >= 3.6 this style is named 'seaborn-v0_8-colorblind'
plt.style.use('seaborn-colorblind')
plt.rcParams.update({'font.size': 12.5,
                     'figure.figsize': (10, 7)})

### Import data


In [4]:
# Define the path of the folder containing the data
# TO BE MODIFIED to point to your own local data folder
path = '../ADA_project_data/'
df_metadata = pd.read_parquet(path+'speaker_attributes.parquet')
df_metadata.head()

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,,,,[Q6581097],985453603,,,,,,Q45441526,Cui Yan,,item,
1,,,[Q9903],[Q6581097],1008699604,,,,,,Q45441555,Guo Ziyi,,item,
2,,,[Q9903],[Q6581097],1008699709,,,,,,Q45441562,Wan Zikui,,item,
3,,,[Q9903],[Q6581097],1008699728,,,,,,Q45441563,Lin Pei,,item,
4,,,[Q9683],[Q6581097],985261661,,,,,,Q45441565,Guan Zhen,,item,


In [5]:
#Import the dataset sample
path = '../ADA_project_data/'
df_sample = pd.read_json(path+'Sample.json.bz2',compression = 'bz2',lines = True)
df_sample.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2015-11-11-109291,They'll call me lots of different things. Libe...,Chris Christie,[Q63879],2015-11-11 00:55:12,1,"[[Chris Christie, 0.7395], [Bobby Jindal, 0.15...",[http://thehill.com/blogs/ballot-box/259760-ch...,E
1,2015-11-04-105046,"The choices are not that easy,",Dr. John,"[Q511074, Q54593093]",2015-11-04 18:13:06,2,"[[Dr. John, 0.5531], [None, 0.4469]]",[http://delawareonline.com/story/news/health/2...,E
2,2015-09-11-070666,It's kind of the same way it's been with the R...,Niklas Kronwall,[Q722939],2015-09-11 19:54:00,1,"[[Niklas Kronwall, 0.7119], [None, 0.2067], [H...",[http://redwings.nhl.com/club/news.htm?id=7787...,E
3,2015-01-12-082489,"We're now going back to the frozen tundra, and...",Frances McDormand,[Q204299],2015-01-12 01:40:00,3,"[[Frances McDormand, 0.484], [None, 0.4495], [...",[http://feeds.people.com/~r/people/headlines/~...,E
4,2015-11-09-033345,I had a chuckle: They were showing a video of ...,Kris Draper,[Q948695],2015-11-09 00:57:45,3,"[[Kris Draper, 0.8782], [None, 0.1043], [Serge...",[http://ca.rd.yahoo.com/sports/rss/nfl/SIG=13u...,E
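
If a Quotebank file does not fit comfortably in memory, it can be read in chunks instead of in one call. The sketch below is only an illustration under assumptions: the helper `count_rows_with_qids`, the tiny demo file, and the chunk size are all hypothetical, and the demo rows merely imitate the structure of `Sample.json.bz2`.

```python
import bz2
import json
import os
import tempfile

import pandas as pd


def count_rows_with_qids(json_bz2_path, chunksize=2):
    """Stream a bz2-compressed JSON-lines file and count rows with at least one QID."""
    total, with_qid = 0, 0
    reader = pd.read_json(json_bz2_path, lines=True, compression='bz2',
                          chunksize=chunksize)
    for chunk in reader:
        total += len(chunk)
        with_qid += int(chunk['qids'].apply(len).gt(0).sum())
    return total, with_qid


# Tiny made-up file standing in for Sample.json.bz2 (illustrative rows only)
demo_rows = [
    {'quoteID': 'a', 'speaker': 'Chris Christie', 'qids': ['Q63879']},
    {'quoteID': 'b', 'speaker': 'None', 'qids': []},
    {'quoteID': 'c', 'speaker': 'Dr. John', 'qids': ['Q511074', 'Q54593093']},
]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json.bz2')
with bz2.open(demo_path, 'wt', encoding='utf-8') as f:
    for row in demo_rows:
        f.write(json.dumps(row) + '\n')

total_rows, rows_with_qid = count_rows_with_qids(demo_path)
print(total_rows, rows_with_qid)
```

In practice a much larger `chunksize` would be used; the value of 2 is chosen here only so that the demo file spans more than one chunk.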


### How to translate items into meaningful names?

Understanding items: items and their data are interconnected.

In Wikidata, items are used to represent all the things in human knowledge, including topics, concepts, and objects. For example, the "1988 Summer Olympics", "love", "Elvis Presley", and "gorilla" are all items in Wikidata. 
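
The metadata columns store item identifiers (QIDs such as `Q6581097`) rather than readable names, so a lookup from QID to label is needed. The following is a minimal sketch, assuming a small hand-written dictionary `qid_labels`; both the dictionary and the helper `translate_qids` are hypothetical, and a real lookup table would have to be loaded from a label file or from Wikidata.

```python
import pandas as pd

# Hypothetical lookup table; in practice it would be loaded from a label file
qid_labels = {
    'Q6581097': 'male',
    'Q6581072': 'female',
    'Q30': 'United States of America',
    'Q82955': 'politician',
}


def translate_qids(qids, labels=qid_labels):
    """Map a list of QIDs to labels; unknown QIDs are kept as-is, None stays None."""
    if qids is None:
        return None
    return [labels.get(q, q) for q in qids]


# Made-up rows imitating the structure of the speaker metadata
demo = pd.DataFrame({
    'label': ['Barack Obama', 'Hillary Clinton', 'Cui Yan'],
    'gender': [['Q6581097'], ['Q6581072'], ['Q6581097']],
    'nationality': [['Q30'], ['Q30'], None],
})
for col in ['gender', 'nationality']:
    demo[col + '_label'] = demo[col].apply(translate_qids)
print(demo[['label', 'gender_label', 'nationality_label']])
```

Keeping unknown QIDs unchanged makes gaps in the lookup table easy to spot after translation.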

#### Importing the sample to join metadata with quotations

Join the quotation data with the metadata:

In [48]:
# Draw a sub-sample of 50,000 quotations in order to group them by speaker
df_sub = df_sample.sample(50_000)
s1 = len(df_sub)

# For the moment, keep only rows that have at least one QID associated
# Remove this filter once the wrangled data is used
# .copy() avoids a SettingWithCopyWarning when adding a column below
df_sub = df_sub[df_sub['qids'].apply(len) > 0].copy()

s2 = len(df_sub)

print('{:.3f} % of rows have no associated QID'.format((1 - s2 / s1) * 100))
# Keep the first QID listed for each quotation
df_sub['qid_unique'] = df_sub['qids'].apply(lambda x: x[0])
df_sub['qid_unique']

34.424 % of rows have no associated QID


913958       Q728063
145250      Q1194856
1245292    Q30015089
709231     Q12150206
445402     Q52435478
             ...    
513342      Q7324127
810869      Q4716898
1186579     Q2349264
228595           Q76
467213     Q11894442
Name: qid_unique, Length: 32788, dtype: object
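
Taking the first QID is a simplification: some speaker names map to several QIDs (for example, `Dr. John` has two in the sample output above). A possible way to quantify how often this happens is sketched below on made-up rows; the `demo` dataframe and the `summary` names are illustrative only and do not reflect real counts.

```python
import pandas as pd

# Made-up rows mirroring the structure of df_sample (not real counts)
demo = pd.DataFrame({
    'speaker': ['Chris Christie', 'Dr. John', 'None', 'Niklas Kronwall'],
    'qids': [['Q63879'], ['Q511074', 'Q54593093'], [], ['Q722939']],
})

# Number of candidate QIDs attached to each quotation
n_qids = demo['qids'].apply(len)
summary = pd.Series({
    'no_qid': int((n_qids == 0).sum()),
    'one_qid': int((n_qids == 1).sum()),
    'several_qids': int((n_qids > 1).sum()),
})
print(summary)
print('Share of ambiguous rows among rows with a QID: {:.1%}'.format(
    summary['several_qids'] / (summary['one_qid'] + summary['several_qids'])))
```

If the ambiguous share turns out to be large on the real data, the first-QID rule may deserve a closer look before the metadata is interpreted.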

In [64]:
# Group quotations by speaker (QID) and count them

df_grouped = df_sub.groupby('qid_unique')['quoteID'] \
    .count() \
    .reset_index(name='count') \
    .sort_values(['count'], ascending=False)

df_grouped.head(20)

# Note: the same speaker can appear under different names (e.g. Donald Trump and President Trump)

# for qid, quotations in df_sub.groupby('qid_unique')['quotation']:
#     print('Speaker: ', qid)
#     print('Quotations: ', list(quotations))


Unnamed: 0,qid_unique,count
8581,Q22686,377
22011,Q76,77
13169,Q359442,59
262,Q1058,57
14654,Q450675,53
19800,Q6294,48
19763,Q6279,43
685,Q11673,40
14382,Q434706,36
22637,Q83106,35
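
Grouping on the QID rather than on the speaker name is what merges the different spellings of one person. The sketch below illustrates the difference on made-up rows (the `demo` dataframe is hypothetical); it also lists the QIDs that appear under more than one name.

```python
import pandas as pd

# Made-up rows: the same person quoted under two different names
demo = pd.DataFrame({
    'quoteID': ['q1', 'q2', 'q3', 'q4'],
    'speaker': ['Donald Trump', 'President Trump', 'Donald Trump', 'Barack Obama'],
    'qid_unique': ['Q22686', 'Q22686', 'Q22686', 'Q76'],
})

# Counting by name splits one person into several entries
by_name = demo.groupby('speaker')['quoteID'].count()
# Counting by QID merges them
by_qid = demo.groupby('qid_unique')['quoteID'].count()
# QIDs that appear under more than one name
names_per_qid = demo.groupby('qid_unique')['speaker'].nunique()

print(by_name.to_dict())
print(by_qid.to_dict())
print(names_per_qid[names_per_qid > 1])
```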


In [70]:
# Merge with the metadata on the Wikidata QID

# Earlier attempt (merging on the speaker name), kept for reference:
# df_merged = df_group.merge(df,how='inner',on='speaker',right_on='label')
df_merged = df_grouped.merge(df_metadata, how='inner', left_on='qid_unique', right_on='id') \
    .sort_values(['count'], ascending=False)
df_merged.head(10)

Unnamed: 0,qid_unique,count,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,Q76,77,"[Barack Hussein Obama II, Barack Obama II, Bar...",[+1961-08-04T00:00:00Z],[Q30],[Q6581097],1395141963,"[Q49085, Q6935055, Q12826303, Q6392846]",O000167,"[Q82955, Q40348, Q15958642, Q28532974, Q372436]",[Q29552],"[Q1765120, Q1540185]",Q76,Barack Obama,"[Q3586671, Q45578, Q4226, Q4791768, Q17067714,...",item,"[Q23540, Q1062789, Q960252, Q426316]"
1,Q359442,59,[Bernard Sanders],[+1941-09-08T00:00:00Z],[Q30],[Q6581097],1392561607,[Q7325],S000033,"[Q82955, Q1930187, Q154549, Q1622272, Q36180, ...","[Q6542163, Q29552, Q327591, Q327591, Q29552]",[Q1765120],Q359442,Bernie Sanders,,item,[Q9268]
2,Q450675,53,"[Jorge Mario Bergoglio, Francisco, Pope Franci...",[+1936-12-17T00:00:00Z],"[Q237, Q414]",[Q6581097],1396054797,[Q1056744],,"[Q63173086, Q593644, Q104050302, Q36180, Q1234...",,[Q1233889],Q450675,Francis,,item,[Q9592]
3,Q6294,48,"[Hillary Rodham Clinton, Hillary Rodham, Hilla...",[+1947-10-26T00:00:00Z],[Q30],[Q6581072],1393846565,,C001041,"[Q82955, Q40348, Q193391, Q36180, Q18814623, Q...","[Q29552, Q29468]","[Q163727, Q1540185]",Q6294,Hillary Clinton,"[Q699872, Q4791768]",item,[Q33203]
4,Q11673,40,[Andrew Mark Cuomo],[+1957-12-06T00:00:00Z],[Q30],[Q6581097],1393197828,[Q974693],,"[Q40348, Q82955]",[Q29552],[Q1540185],Q11673,Andrew Cuomo,"[Q22023432, Q65047185]",item,[Q1841]
5,Q434706,36,"[Elizabeth Ann Warren, Senator Warren, Elisabe...",[+1949-06-22T00:00:00Z],[Q30],[Q6581072],1394365746,[Q49078],W000817,"[Q185351, Q82955, Q37226, Q1622272, Q40348, Q1...",[Q29552],"[Q163727, Q1540185]",Q434706,Elizabeth Warren,[Q28220813],item,[Q33203]
6,Q457786,33,"[Rodrigo Roa Duterte, Rodrigo ""Rody"" Roa Duter...",[+1945-03-28T00:00:00Z],[Q928],[Q6581097],1348817804,[Q1290600],,[Q82955],[Q7140531],,Q457786,Rodrigo Duterte,,item,
7,Q43723,32,"[Binyamin Netanyahu, Bibi, Bibi Netanyahu, Ben...",[+1949-10-21T00:00:00Z],[Q801],[Q6581097],1391756559,[Q7325],,"[Q82955, Q193391, Q372436, Q47064, Q15958642, ...",[Q187009],"[Q787674, Q950900]",Q43723,Benjamin Netanyahu,"[Q2917048, Q2480394, Q2689039]",item,[Q9268]
8,Q473239,28,"[Michael Richard ""Mike"" Pompeo, Michael Richar...",[+1963-12-30T00:00:00Z],[Q30],[Q6581097],1393216542,,P000602,"[Q40348, Q43845, Q2961975, Q189290, Q82955]",[Q29468],,Q473239,Mike Pompeo,,item,[Q178169]
9,Q180589,27,"[Boris, Alexander Boris de Pfeffel Johnson, Bo...",[+1964-06-19T00:00:00Z],"[Q30, Q145]",[Q6581097],1395092363,[Q7994501],,"[Q1930187, Q82955, Q1607826, Q36180, Q11774202...",[Q9626],,Q180589,Boris Johnson,"[Q30173038, Q30325756, Q428598, Q590740, Q3586...",item,[Q6423963]
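
In the outputs above, the most frequent QID in the grouped table (`Q22686`, 377 quotations) is absent from the merged table, which suggests that the inner merge dropped it. One way to check which QIDs find no match is a left merge with `indicator=True`. This is a sketch on small stand-in dataframes (`grouped`, `metadata`), not the notebook's actual data, and it does not explain why a given QID is missing.

```python
import pandas as pd

# Small stand-ins for df_grouped and df_metadata (counts taken from the outputs above)
grouped = pd.DataFrame({'qid_unique': ['Q22686', 'Q76', 'Q359442'],
                        'count': [377, 77, 59]})
metadata = pd.DataFrame({'id': ['Q76', 'Q359442'],
                         'label': ['Barack Obama', 'Bernie Sanders']})

# A left merge keeps every grouped row; '_merge' records where each row was found
checked = grouped.merge(metadata, how='left', left_on='qid_unique',
                        right_on='id', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print(unmatched[['qid_unique', 'count']])
print('Quotations lost by an inner merge:', unmatched['count'].sum())
```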


In [75]:
# Move the 'label' column to the first position

# Remove the 'label' column from the dataframe and keep its values
first_column = df_merged.pop('label')

# Re-insert it at position 0: insert(position, column_name, values)
df_merged.insert(0, 'label', first_column)



In [76]:
df_merged.head(50)

Unnamed: 0,label,qid_unique,count,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,candidacy,type,religion
0,Barack Obama,Q76,77,"[Barack Hussein Obama II, Barack Obama II, Bar...",[+1961-08-04T00:00:00Z],[Q30],[Q6581097],1395141963,"[Q49085, Q6935055, Q12826303, Q6392846]",O000167,"[Q82955, Q40348, Q15958642, Q28532974, Q372436]",[Q29552],"[Q1765120, Q1540185]",Q76,"[Q3586671, Q45578, Q4226, Q4791768, Q17067714,...",item,"[Q23540, Q1062789, Q960252, Q426316]"
1,Bernie Sanders,Q359442,59,[Bernard Sanders],[+1941-09-08T00:00:00Z],[Q30],[Q6581097],1392561607,[Q7325],S000033,"[Q82955, Q1930187, Q154549, Q1622272, Q36180, ...","[Q6542163, Q29552, Q327591, Q327591, Q29552]",[Q1765120],Q359442,,item,[Q9268]
2,Francis,Q450675,53,"[Jorge Mario Bergoglio, Francisco, Pope Franci...",[+1936-12-17T00:00:00Z],"[Q237, Q414]",[Q6581097],1396054797,[Q1056744],,"[Q63173086, Q593644, Q104050302, Q36180, Q1234...",,[Q1233889],Q450675,,item,[Q9592]
3,Hillary Clinton,Q6294,48,"[Hillary Rodham Clinton, Hillary Rodham, Hilla...",[+1947-10-26T00:00:00Z],[Q30],[Q6581072],1393846565,,C001041,"[Q82955, Q40348, Q193391, Q36180, Q18814623, Q...","[Q29552, Q29468]","[Q163727, Q1540185]",Q6294,"[Q699872, Q4791768]",item,[Q33203]
4,Andrew Cuomo,Q11673,40,[Andrew Mark Cuomo],[+1957-12-06T00:00:00Z],[Q30],[Q6581097],1393197828,[Q974693],,"[Q40348, Q82955]",[Q29552],[Q1540185],Q11673,"[Q22023432, Q65047185]",item,[Q1841]
5,Elizabeth Warren,Q434706,36,"[Elizabeth Ann Warren, Senator Warren, Elisabe...",[+1949-06-22T00:00:00Z],[Q30],[Q6581072],1394365746,[Q49078],W000817,"[Q185351, Q82955, Q37226, Q1622272, Q40348, Q1...",[Q29552],"[Q163727, Q1540185]",Q434706,[Q28220813],item,[Q33203]
6,Rodrigo Duterte,Q457786,33,"[Rodrigo Roa Duterte, Rodrigo ""Rody"" Roa Duter...",[+1945-03-28T00:00:00Z],[Q928],[Q6581097],1348817804,[Q1290600],,[Q82955],[Q7140531],,Q457786,,item,
7,Benjamin Netanyahu,Q43723,32,"[Binyamin Netanyahu, Bibi, Bibi Netanyahu, Ben...",[+1949-10-21T00:00:00Z],[Q801],[Q6581097],1391756559,[Q7325],,"[Q82955, Q193391, Q372436, Q47064, Q15958642, ...",[Q187009],"[Q787674, Q950900]",Q43723,"[Q2917048, Q2480394, Q2689039]",item,[Q9268]
8,Mike Pompeo,Q473239,28,"[Michael Richard ""Mike"" Pompeo, Michael Richar...",[+1963-12-30T00:00:00Z],[Q30],[Q6581097],1393216542,,P000602,"[Q40348, Q43845, Q2961975, Q189290, Q82955]",[Q29468],,Q473239,,item,[Q178169]
9,Boris Johnson,Q180589,27,"[Boris, Alexander Boris de Pfeffel Johnson, Bo...",[+1964-06-19T00:00:00Z],"[Q30, Q145]",[Q6581097],1395092363,[Q7994501],,"[Q1930187, Q82955, Q1607826, Q36180, Q11774202...",[Q9626],,Q180589,"[Q30173038, Q30325756, Q428598, Q590740, Q3586...",item,[Q6423963]


#### Representativeness of the data

In [91]:
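# Check the distribution of the data according to several criteria

# Number of quotations per nationality (first listed nationality only).
# This cell was originally labelled as a histogram of speakers' ages.
# Some speakers have no nationality (None), which made the original
# x[0] lookup fail with: TypeError: 'NoneType' object is not subscriptable
df_merged['nationality_first'] = df_merged['nationality'].apply(
    lambda x: x[0] if x is not None and len(x) > 0 else None)
country = df_merged.groupby('nationality_first')['count'].sum() \
    .sort_values(ascending=False)
country.head(20)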
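
For an age distribution, the `date_of_birth` column needs parsing first: its values look like `+1961-08-04T00:00:00Z` and may be missing. The sketch below extracts a birth year with a hypothetical helper `birth_year` and derives an approximate age. The reference year of 2015 is an assumption based on the dates visible in the sample output, and the `demo` rows are illustrative.

```python
import pandas as pd


def birth_year(dates):
    """Return the year of the first listed date of birth, or None if unavailable."""
    if dates is None or len(dates) == 0:
        return None
    first = dates[0]                      # e.g. '+1961-08-04T00:00:00Z'
    sign = -1 if first.startswith('-') else 1
    return sign * int(first.lstrip('+-').split('-')[0])


# Made-up rows imitating the merged dataframe
demo = pd.DataFrame({
    'label': ['Barack Obama', 'Bernie Sanders', 'Cui Yan'],
    'date_of_birth': [['+1961-08-04T00:00:00Z'], ['+1941-09-08T00:00:00Z'], None],
    'count': [77, 59, 1],
})

# to_numeric turns missing years into NaN so that arithmetic works
demo['birth_year'] = pd.to_numeric(demo['date_of_birth'].apply(birth_year),
                                   errors='coerce')
reference_year = 2015  # assumption: year of the quotations in the sample
demo['age'] = reference_year - demo['birth_year']
print(demo[['label', 'birth_year', 'age', 'count']])
```

Speakers without a date of birth end up with a missing age and would have to be dropped or reported separately before plotting.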

## Additional metadata from Wikidata

Script to retrieve metadata for Quotebank entries: