# Female citations in the UK's leading newspapers

This notebook serves as a first presentation of our project for milestone 2. It is structured and written in such a way that we can directly continue working on it for milestone 3.

## Content
1. [Setup](#setup)   
    1.1 [Imports](#imports)  
    1.2 [Data paths](#data_paths)   
    1.3 [Utility functions](#utility_functions)   
2. [Data preparation](#data_prep)    
    2.1 [Column and row selection](#cols_rows_select)  
    2.2 [Newspaper selection](#newspaper_select)           
    2.3 [Filtering raw data](#filter_raw_data)

3. [Additional data set: speaker attributes](#speaker_attr)    
    3.1 [Explore attributes](#explore_attr)  
    3.2 [Retrieve quotebank wikidata labels](#reatrieve_labels)           
    3.3 [Filtering raw data](#filter_raw_data)   

4. [Data exploration and cleaning](#data_explore_clean)  
    4.1 [Import prepared data](#import_prep_data)   
    4.2 [Set index](#set_index)    
    4.3 [Save cleaned data frame as pickle](#save_pickle)   
5. [Research questions](#research_question)     
    5.1 [Load pickled dataframes](#load_pickle)

## 1. Setup
<a id="setup"></a>

### 1.1 Imports
<a id="imports"></a>

In [None]:
import pandas as pd
import numpy as np
import json
import bz2
import matplotlib.pyplot as plt

from tqdm import tqdm
from collections import Counter

### 1.2 Data paths
<a id="data_paths"></a>

**Important**: The raw and prepared data are stored locally in the root folders _Quotebank_ and _Filtered data_. To execute the section [Data preparation](#data_prep), the raw data in the folder _Quotebank_ is needed. This section has to be executed only once. When using Google Colab, the ```use_colab``` variable has to be set to ```True```, so that the paths can be accessed directly from our shared drive.

You can download the raw data here (EPFL google account required): [Quotebank](), [Speakers]() \
The cleaned data can be found using this link:
[Cleaned data]()

In [None]:
# Comment out the paths of the files which aren't stored locally
# In Colab everything should be available
RAW_QUOTES_2020_PATH = 'Quotebank/quotes-2020.json.bz2' 
QUOTES_2020_PATH = 'Filtered data/quotes-2020-gb.json.bz2' 

#RAW_QUOTES_2019_PATH = 'Quotebank/quotes-2019.json.bz2' 
#QUOTES_2019_PATH = 'Filtered data/quotes-2019-gb.json.bz2' 

#RAW_QUOTES_2018_PATH = 'Quotebank/quotes-2018.json.bz2' 
#QUOTES_2018_PATH = 'Filtered data/quotes-2018-gb.json.bz2' 

#RAW_QUOTES_2017_PATH = 'Quotebank/quotes-2017.json.bz2' 
#QUOTES_2017_PATH = 'data/quotes-2017-gb.json.bz2' 

#RAW_QUOTES_2016_PATH = 'raw_data/quotes-2016.json.bz2' 
#QUOTES_2016_PATH = 'data/quotes-2016-gb.json.bz2' 

#RAW_QUOTES_2015_PATH = 'raw_data/quotes-2015.json.bz2' 
#QUOTES_2015_PATH = 'data/quotes-2015-gb.json.bz2'

# Additional data set
SPEAKER_ATTRIBUTES_PATH = 'Project datasets/speaker_attributes.parquet'
LABELS_WIKIDATA_PATH = 'Project datasets/wikidata_labels_descriptions_quotebank.csv.bz2'

In [None]:
# Change to True if you want to use google colab
use_colab = True

We mount the drive and go to the right directory.

In [None]:
# Import epfl google drive!
if use_colab:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    
    %cd /content/drive/Shareddrives/ADA-project

Mounted at /content/drive
/content/drive/Shareddrives/ADA-project


We have to install an older version of pandas in order to be able to use the 'chunksize' feature in colab.

In [None]:
if use_colab:
    !pip install pandas==1.0.5

    # Reimport
    import pandas as pd
    print(pd.__version__)

1.0.5


### 1.3 Utility functions
<a id="utility_functions"></a>

In [None]:
def load_mini_version_of_data(path_to_file, chunksize, nb_chunks):
    """
    Returns a mini dataframe built from the first chunks of a bz2-compressed JSON lines file.
    :path_to_file: file path as string
    :chunksize: number of rows per chunk
    :nb_chunks: number of chunks to read
    :return: pandas.DataFrame with at most chunksize*nb_chunks rows
    """
    
    curr_chunk = 0
    chunk_list = []
    
    if use_colab:
        # With the pandas version installed in Colab we iterate over the reader directly
        for chunk in pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=chunksize):
            if curr_chunk == nb_chunks:
                break
            curr_chunk = curr_chunk + 1
            chunk_list.append(chunk)
    else:
        # Locally the reader is used as a context manager
        with pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=chunksize) as df_reader:
            for chunk in df_reader:
                if curr_chunk == nb_chunks:
                    break
                curr_chunk = curr_chunk + 1
                chunk_list.append(chunk)
    
    df = pd.concat(chunk_list)
    return df
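
As a side note, the chunk counting above could also be written with `itertools.islice`, which avoids the manual counter. The following is only a minimal sketch: the name `load_first_chunks` and the small demo file are hypothetical and not part of this notebook, and we assume that the reader returned by `pd.read_json(..., chunksize=...)` can be iterated directly, as is done in the Colab branch above.

```python
import bz2
import itertools
import json
import os
import tempfile

import pandas as pd


def load_first_chunks(path_to_file, chunksize, nb_chunks):
    """Hypothetical variant: read at most nb_chunks chunks of chunksize rows."""
    reader = pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=chunksize)
    # islice stops pulling chunks from the reader after nb_chunks of them
    chunks = list(itertools.islice(reader, nb_chunks))
    return pd.concat(chunks)


# Tiny made-up file so that the sketch runs without the Quotebank data
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json.bz2')
with bz2.open(demo_path, 'wt', encoding='utf-8') as f:
    for i in range(5):
        f.write(json.dumps({'quoteID': f'demo-{i}', 'speaker': 'None'}) + '\n')

# Two chunks of two rows each are read, the fifth row is never loaded
demo_df = load_first_chunks(demo_path, chunksize=2, nb_chunks=2)
print(len(demo_df))
```

Unlike the function above, this sketch does not depend on the `use_colab` flag.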

## 2. Data preparation
<a id="data_prep"></a>

The Quotebank data set is too large to load directly into a dataframe. This section provides all the steps to filter the data we need for our analysis. The filtering and preparation are based on our research question. Please check the README for details. Further explanations are given under [Research question](#research_question).

The data preparation can be done on a per-year basis of the Quotebank data set.

### 2.1 Column and row selection
<a id="cols_rows_select"></a>

In [None]:
# A quick look at a small subset of the data of the selected year
year_sample_df = load_mini_version_of_data(RAW_QUOTES_2020_PATH, 10000, 10)
year_sample_df.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2020-01-28-000082,[ D ] espite the efforts of the partners to cr...,,[],2020-01-28 08:04:05,1,"[[None, 0.7272], [Prime Minister Netanyahu, 0....",[http://israelnationalnews.com/News/News.aspx/...,E
1,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,[Q367796],2020-01-16 12:00:13,1,"[[Sue Myrick, 0.8867], [None, 0.0992], [Ron Wy...",[http://thehill.com/opinion/international/4782...,E
2,2020-02-10-000142,... He (Madhav) also disclosed that the illega...,,[],2020-02-10 23:45:54,1,"[[None, 0.8926], [Prakash Rai, 0.1074]]",[https://indianexpress.com/article/business/ec...,E
3,2020-02-15-000053,"... [ I ] f it gets to the floor,",,[],2020-02-15 14:12:51,2,"[[None, 0.581], [Andy Harris, 0.4191]]",[https://patriotpost.us/opinion/68622-trump-bu...,E
4,2020-01-24-000168,[ I met them ] when they just turned 4 and 7. ...,Meghan King Edmonds,[Q20684375],2020-01-24 20:37:09,4,"[[Meghan King Edmonds, 0.5446], [None, 0.2705]...",[https://people.com/parents/meghan-king-edmond...,E


In [None]:
# How many quotations don't have an assigned speaker?
sum(year_sample_df['speaker'] == 'None')

34316

The cell above shows that around 1/3 of the sampled quotations (34,316 of 100,000 rows) have a 'None' speaker. As we want to conduct a gender-based study, we will not need these rows. Eliminating them will drastically reduce the size of the data we have to analyse.

Furthermore, the columns which aren't of interest for our study are:\
**phase**: not relevant to our analysis\
**probas**: we only use the speaker with the highest probability (note that 'None' speakers are already discarded)
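
The row and column selection described above can be illustrated on a toy data frame. This is a minimal sketch with made-up rows (only the column names come from the Quotebank sample above); the actual filtering of the full files is done line by line in section 2.3.

```python
import pandas as pd

# Made-up rows that mimic the structure of the Quotebank sample
toy_quotes_df = pd.DataFrame({
    'quoteID': ['toy-1', 'toy-2', 'toy-3'],
    'speaker': ['None', 'Sue Myrick', 'None'],
    'probas': [[['None', 0.7]], [['Sue Myrick', 0.9]], [['None', 0.6]]],
    'phase': ['E', 'E', 'E'],
})

# Keep only the quotations with an assigned speaker
has_speaker = toy_quotes_df['speaker'] != 'None'

# Drop the two columns we do not need for the study
reduced_df = toy_quotes_df.loc[has_speaker].drop(columns=['probas', 'phase'])
print(reduced_df)
```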

### 2.2 Newspaper selection
<a id="newspaper_select"></a>

In a first analysis we pick quotations from 3 of the UK's top 12 newspapers by combined print and digital reach. See [this]() statistic for further details.

In [None]:
# List of selected newspapers and their urls
newspapers_list = [['The Sun', 'thesun.co.uk'], 
                  ['The Guardian', 'theguardian.com'],
                  ['The Times', 'thetimes.co.uk']]

# Dataframe
newspapers_df = pd.DataFrame(newspapers_list, columns = ['name', 'website_url'])
newspapers_df.head()

Unnamed: 0,name,website_url
0,The Sun,thesun.co.uk
1,The Guardian,theguardian.com
2,The Times,thetimes.co.uk


### 2.3 Filtering raw data
<a id="filter_raw_data"></a>

Following the reasoning above, we can extract the information needed from the compressed file of a year of quotations. Let's create a helper function that checks the URLs of each quotation and keeps the quotations from the selected newspapers:

In [None]:
def filter_data(path_in, path_out):
    """
    Loops through all instances of a bz2-compressed JSON lines file, keeps the
    quotations with a speaker that appear in one of the selected newspapers
    and writes them to path_out.
    """
    with bz2.open(path_in, 'rb') as s_file:
        with bz2.open(path_out, 'wb') as d_file:
            for instance in s_file:
                instance = json.loads(instance) # loading a sample
                if instance['speaker'] == 'None':
                    continue
                urls = instance['urls'] # extracting list of links
                newspapers = []
                for url in urls:
                    for name, website_url in zip(newspapers_df['name'], newspapers_df['website_url']):
                        if website_url in url:
                            newspapers.append(name)
                # If none of the selected newspapers is found we skip the instance
                if not newspapers:
                    continue
                instance['newspapers'] = newspapers # updating the sample with the newspaper names
                # We remove unnecessary columns
                instance.pop('probas')
                instance.pop('phase')
                d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing in the new file
                

In [None]:
#filter_data(RAW_QUOTES_2020_PATH,QUOTES_2020_PATH)
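
The filter above matches a newspaper whenever its domain occurs anywhere in the URL string. A stricter alternative, shown here only as a sketch, is to compare against the host name of the URL. The names `NEWSPAPER_DOMAINS` and `match_newspapers` are hypothetical and not used elsewhere in this notebook. Note that, unlike `filter_data`, this variant lists each newspaper at most once per quotation.

```python
from urllib.parse import urlparse

# Same newspapers as in newspapers_list, keyed by domain (hypothetical helper structure)
NEWSPAPER_DOMAINS = {
    'thesun.co.uk': 'The Sun',
    'theguardian.com': 'The Guardian',
    'thetimes.co.uk': 'The Times',
}


def match_newspapers(urls):
    """Return the names of the selected newspapers whose host appears in urls."""
    matched = []
    for url in urls:
        # hostname is None for malformed URLs, so fall back to an empty string
        host = (urlparse(url).hostname or '').lower()
        for domain, name in NEWSPAPER_DOMAINS.items():
            # Accept the bare domain and any subdomain such as www.
            if host == domain or host.endswith('.' + domain):
                if name not in matched:
                    matched.append(name)
    return matched


example_urls = [
    'https://www.theguardian.com/sport/2020/jan/31/example',
    'http://example.com/?ref=thesun.co.uk',  # domain only appears in the query string
]
print(match_newspapers(example_urls))
```

With substring matching the second URL would count as The Sun, whereas the host comparison ignores it.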

In [None]:
# We check that the new file contains the right data
filtered_sample_df = load_mini_version_of_data(QUOTES_2020_PATH, 10000, 10)
filtered_sample_df.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,urls,newspapers
0,2020-01-31-008580,As you reach or have reached the apex of your ...,Keyon Dooling,[Q304349],2020-01-31 19:07:55,1,[https://www.theguardian.com/sport/2020/jan/31...,[The Guardian]
1,2020-01-20-006469,At the same time we want to remain friends wit...,Tim Martin,"[Q20670776, Q20713880, Q7803899, Q7803900]",2020-01-20 09:08:24,4,[https://www.dailystar.co.uk/real-life/wethers...,[The Sun]
2,2020-04-03-006933,Been home-schooling a 6-year-old and 8-year-ol...,Shonda Rhimes,[Q242329],2020-04-03 16:00:00,1,[http://www.thetimes.co.uk/edition/magazine/ca...,[The Times]
3,2020-04-15-018814,I am now in agreement that we should move forw...,David Boies,[Q5231515],2020-04-15 15:46:38,1,[https://www.thesun.co.uk/news/11403669/jeffre...,[The Sun]
4,2020-02-16-014286,I don't want to make a career out of [ remakin...,Ramiro Gomez,"[Q30693403, Q43130877]",2020-02-16 15:00:32,1,[https://www.theguardian.com/artanddesign/2020...,[The Guardian]


Let's check that there are no 'None' speakers:

In [None]:
filtered_sample_df[filtered_sample_df.speaker=='None'].empty

True

Now let us do this filtering for the remaining data of years 2015-2019.

In [None]:
#filter_data(RAW_QUOTES_2019_PATH,QUOTES_2019_PATH)
#filter_data(RAW_QUOTES_2018_PATH,QUOTES_2018_PATH)
#filter_data(RAW_QUOTES_2017_PATH,QUOTES_2017_PATH)
#filter_data(RAW_QUOTES_2016_PATH,QUOTES_2016_PATH)
#filter_data(RAW_QUOTES_2015_PATH,QUOTES_2015_PATH)

## 3. Additional data set: speaker attributes
<a id="speaker_attr"></a>

The filtered Quotebank years are ready. The next step consists of integrating speaker attributes from our additional data set.

### 3.1 Explore attributes
<a id="explore_attr"></a>


In [None]:
# Load speaker attributes in df
speakers_df = pd.read_parquet(SPEAKER_ATTRIBUTES_PATH)
speakers_df.head()

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Q161885, Q30]",[Q6581097],1395141751,,W000178,"[Q82955, Q189290, Q131512, Q1734662, Q294126, ...",[Q327591],,Q23,George Washington,"[Q698073, Q697949]",item,[Q682443]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[Q145],[Q6581097],1395737157,[Q7994501],,"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Q31],[Q6581097],1380367296,,,"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[Q30],[Q6581097],1395142029,,,"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",[Q29468],,Q207,George W. Bush,"[Q327959, Q464075, Q3586276, Q4450587]",item,"[Q329646, Q682443, Q33203]"
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Q29],[Q6581097],1391704596,,,[Q1028181],,,Q297,Diego Velázquez,,item,


We would like to do an analysis based on the gender of the speakers. Thus we are not interested in speakers without a gender entry. We first check how many there are and then remove them.


In [None]:
# Fraction of lines with null gender : 
speakers_df[speakers_df.gender.isnull()].size / speakers_df.size

0.21536937853557775
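
The null fractions computed cell by cell in this section could also be obtained for all columns at once. The sketch below uses a made-up `toy_speakers_df` (hypothetical, not the real speaker attributes) and relies on the fact that the mean of a boolean column equals the fraction of `True` values.

```python
import pandas as pd

# Made-up speaker attributes with some missing values
toy_speakers_df = pd.DataFrame({
    'id': ['Q1', 'Q2', 'Q3', 'Q4'],
    'gender': [['Q6581097'], None, ['Q6581072'], None],
    'nationality': [['Q145'], None, None, None],
})

# isnull() gives a boolean frame; mean() per column is the fraction of nulls
null_fractions = toy_speakers_df.isnull().mean().sort_values(ascending=False)
print(null_fractions)
```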

In [None]:
# Remove null gender
speakers_df = speakers_df.drop(speakers_df[speakers_df.gender.isnull()].index)

We aren't interested in a lot of those columns. First of all, we'll remove the ones that are unnecessary for our analysis.

In [None]:
# Remove unnecessary columns
speakers_df = speakers_df.drop(['lastrevid', 'US_congress_bio_ID', 'party', 'candidacy', 'type'], axis=1) 

We also check for duplicated speaker ids. The result below is empty, so there are no duplicates.

In [None]:
# Duplicates
duplicates = speakers_df[speakers_df.duplicated(subset='id', keep='first')] 
duplicates.head()

Unnamed: 0,aliases,date_of_birth,nationality,gender,ethnic_group,occupation,academic_degree,id,label,religion


In addition to the gender column, we are also interested in the columns academic_degree and nationality. Let's check how many 'None' values we have and remove them.

In [None]:
# Fraction of lines with null nationality
speakers_df[speakers_df.nationality.isnull()].size / speakers_df.size

0.48707104255798245

In [None]:
# Fraction of lines with null academic degree :
speakers_df[speakers_df.academic_degree.isnull()].size / speakers_df.size

In [None]:
# Fraction of lines with null date of birth
speakers_df[speakers_df.date_of_birth.isnull()].size / speakers_df.size

0.31948139495609096

In [None]:
# Fraction of lines with null ethnic_group
speakers_df[speakers_df.ethnic_group.isnull()].size / speakers_df.size

0.9819298862868723

In [None]:
# Fraction of lines with null religion
speakers_df[speakers_df.religion.isnull()].size / speakers_df.size

0.9727195451474893

In [None]:
# Fraction of lines with null label
speakers_df[speakers_df.label.isnull()].size / speakers_df.size

0.055411365683404636

The rows with a null gender were already removed above. We now remove the ethnic_group, religion and academic_degree columns, as they are mostly empty.

In [None]:
# Remove ethnic group column as it has above 50% of null values
speakers_df = speakers_df.drop('ethnic_group', axis=1)

In [None]:
# Remove religion column as it has above 50% of null values : 
speakers_df = speakers_df.drop('religion', axis=1)

In [None]:
# Remove academic degree column as it has above 50% of null values : 
speakers_df = speakers_df.drop('academic_degree', axis=1)

In [None]:
speakers_df.head()

Unnamed: 0,aliases,date_of_birth,nationality,gender,occupation,id,label
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Q161885, Q30]",[Q6581097],"[Q82955, Q189290, Q131512, Q1734662, Q294126, ...",Q23,George Washington
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[Q145],[Q6581097],"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",Q42,Douglas Adams
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Q31],[Q6581097],"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q...",Q1868,Paul Otlet
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[Q30],[Q6581097],"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",Q207,George W. Bush
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Q29],[Q6581097],[Q1028181],Q297,Diego Velázquez


### 3.2 Retrieve quotebank wikidata labels
<a id="reatrieve_labels"></a>



In [None]:
labels = pd.read_csv(LABELS_WIKIDATA_PATH, compression='bz2', index_col='QID')
labels.head()

Unnamed: 0_level_0,Label,Description
QID,Unnamed: 1_level_1,Unnamed: 2_level_1
Q31,Belgium,country in western Europe
Q45,Portugal,country in southwestern Europe
Q75,Internet,global system of connected computer networks
Q148,People's Republic of China,sovereign state in East Asia
Q155,Brazil,country in South America


In [None]:
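# Note: count() counts all entries (null or not) and size counts both columns,
# so this ratio is always 0.5; labels.Label.isnull().mean() would give the null fraction
labels.Label.isnull().count()/labels.size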

0.5

In [None]:
#labels = labels.drop(labels[labels.Label.isnull()].index)
labels.head()

Unnamed: 0_level_0,Label,Description
QID,Unnamed: 1_level_1,Unnamed: 2_level_1
Q31,Belgium,country in western Europe
Q45,Portugal,country in southwestern Europe
Q75,Internet,global system of connected computer networks
Q148,People's Republic of China,sovereign state in East Asia
Q155,Brazil,country in South America


In [None]:
labels.loc['Q6581097']['Label']

'male'

In [None]:
#Replacing gender QUIDs by value
#for index, row in speakers_df.iterrows():
#  quid = row.loc['gender'][0]
#  row.loc['gender'] = labels.loc[quid]['Label']
#--> Not useful, replace only the ones needed

### Test and first exploration with the 2019 New York Times data

We first looked at the sample with quotes from the 2019 New York Times data for an easier overall observation.

In [None]:
df_quotesNY = pd.read_json('quotes-2019-nytimes.json', lines=True)
df_quotesNY.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2019-04-17-024782,"It is not a low-income immigration,",James Fisher,"[Q16213953, Q20707104, Q43143598, Q58886302, Q...",2019-04-17 13:31:18,1,"[[James Fisher, 0.7475], [None, 0.2525]]",[https://www.nytimes.com/2019/04/17/realestate...,E
1,2019-04-02-001128,a champion figure skater switching to roller s...,John Updike,[Q105756],2019-04-02 14:58:33,2,"[[John Updike, 0.5856], [None, 0.4144]]",[https://www.nytimes.com/2019/04/02/opinion/vl...,E
2,2019-05-09-055187,It makes it much more difficult for him to mak...,,[],2019-05-09 18:11:29,1,"[[None, 0.6493], [President Bill Clinton, 0.27...",[http://mobile.nytimes.com/2019/05/09/world/as...,E
3,2019-10-31-056366,"It puts me in a predicament,",Xavier Becerra,[Q1855840],2019-10-31 16:45:15,3,"[[Xavier Becerra, 0.9065], [None, 0.0909], [St...",[http://www.nytimes.com/2019/10/31/technology/...,E
4,2019-01-04-001792,A Pile of Leaves.,,[],2019-01-04 10:00:07,1,"[[None, 0.8737], [Jason Fulford, 0.1263]]",[https://www.nytimes.com/2019/01/04/books/revi...,E


In [None]:
#Filter null speaker 
df_quotesNY = df_quotesNY.drop(df_quotesNY[df_quotesNY.speaker == 'None'].index)
df_quotesNY.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2019-04-17-024782,"It is not a low-income immigration,",James Fisher,"[Q16213953, Q20707104, Q43143598, Q58886302, Q...",2019-04-17 13:31:18,1,"[[James Fisher, 0.7475], [None, 0.2525]]",[https://www.nytimes.com/2019/04/17/realestate...,E
1,2019-04-02-001128,a champion figure skater switching to roller s...,John Updike,[Q105756],2019-04-02 14:58:33,2,"[[John Updike, 0.5856], [None, 0.4144]]",[https://www.nytimes.com/2019/04/02/opinion/vl...,E
3,2019-10-31-056366,"It puts me in a predicament,",Xavier Becerra,[Q1855840],2019-10-31 16:45:15,3,"[[Xavier Becerra, 0.9065], [None, 0.0909], [St...",[http://www.nytimes.com/2019/10/31/technology/...,E
5,2019-08-15-002017,A Senator we can call our own.,Tom Rath,[Q7817334],2019-08-15 22:36:33,1,"[[Tom Rath, 0.7598], [None, 0.1993], [Warren R...",[http://www.nytimes.com/2019/08/15/us/politics...,E
8,2019-07-22-032883,"It's a success, a relief and a technical feat,",Florence Parly,[Q3074013],2019-07-22 02:37:50,21,"[[Florence Parly, 0.9262], [None, 0.0738]]",[http://www.breitbart.com/news/french-submarin...,E


We would like to determine the number of genders present in the data.

In [None]:
speaker_genders = speakers_df['gender']
speaker_genders.value_counts()

TypeError: ignored

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 1709, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'numpy.ndarray'


[Q6581097]              5418464
[Q6581072]              1684170
[Q1052281]                  887
[Q48270]                    307
[Q2449503]                  228
                         ...   
[Q6581097, Q6581072]          1
[Q179294, Q6581097]           1
[Q179294, Q6581097]           1
[Q179294, Q6581097]           1
[Q6581097, Q6581072]          1
Name: gender, Length: 1411, dtype: int64
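
The `TypeError` and the repeated entries in the output above come from the fact that each gender entry is a numpy array, which is unhashable. One possible workaround, sketched here on a made-up series (`toy_genders` is hypothetical), is to turn each array into a string before counting, or to count every gender QID individually with `explode`.

```python
import numpy as np
import pandas as pd

# Made-up gender column: each entry is an array of gender QIDs
toy_genders = pd.Series([
    np.array(['Q6581097']),
    np.array(['Q6581072']),
    np.array(['Q6581097']),
    np.array(['Q6581097', 'Q6581072']),
])

# Option 1: count each combination of genders, using a hashable string key
combination_counts = toy_genders.apply(lambda qids: ', '.join(qids)).value_counts()

# Option 2: count each gender QID on its own, one row per array element
single_counts = toy_genders.explode().value_counts()

print(combination_counts)
print(single_counts)
```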

In [None]:
print(labels.loc['Q6581097']['Label'],
      labels.loc['Q6581072']['Label'],
      labels.loc['Q1052281']['Label'],
      labels.loc['Q48270']['Label'],
      labels.loc['Q2449503']['Label'])

male female transgender female non-binary transgender male


In [None]:
#ALL ALIASES
#Goal : struct dictionary{key = quote_id, value = dict{key=speaker_id, value = [genders per speaker]}}

# Retrieve New York times speakers' gender, in the order of the quotations df
dict_quote_id_genders = {}

for NY_index, NY_row in df_quotesNY.iterrows(): #Loop over all quotes
  quote_id = NY_row.quoteID #Future key in dict_quote_id_genders
  id_aliases = NY_row.loc['qids'] 
  dict_genders_speaker = {} #Dict of genders per alias speaker
  for id_speaker in id_aliases:
    list_genders_speaker = [] #List of genders per alias speaker
    id_genders_speaker = speakers_df.loc[speakers_df.id==id_speaker]['gender'].to_numpy() #List of all genders ID per speaker
    for id_g in id_genders_speaker:
      gender= labels.loc[id_g]['Label'].values #Gender value
      list_genders_speaker.append(gender)
    dict_genders_speaker[id_speaker] = list_genders_speaker
    print(dict_genders_speaker)
  dict_quote_id_genders[quote_id] = dict_genders_speaker

{'Q16213953': [array(['male'], dtype=object)]}
{'Q16213953': [array(['male'], dtype=object)], 'Q20707104': [array(['male'], dtype=object)]}
{'Q16213953': [array(['male'], dtype=object)], 'Q20707104': [array(['male'], dtype=object)], 'Q43143598': [array(['male'], dtype=object)]}
{'Q16213953': [array(['male'], dtype=object)], 'Q20707104': [array(['male'], dtype=object)], 'Q43143598': [array(['male'], dtype=object)], 'Q58886302': [array(['male'], dtype=object)]}
{'Q16213953': [array(['male'], dtype=object)], 'Q20707104': [array(['male'], dtype=object)], 'Q43143598': [array(['male'], dtype=object)], 'Q58886302': [array(['male'], dtype=object)], 'Q6133913': [array(['male'], dtype=object)]}
{'Q105756': [array(['male'], dtype=object)]}
{'Q1855840': [array(['male'], dtype=object)]}
{'Q7817334': [array(['male'], dtype=object)]}
{'Q3074013': [array(['female'], dtype=object)]}
{'Q7812406': [array(['male'], dtype=object)]}
{'Q977546': [array(['male'], dtype=object)]}
{'Q50049': [array(['female'], 

KeyboardInterrupt: ignored
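
The loop above filters the whole `speakers_df` once per speaker id, which is probably why the cell had to be interrupted. A possible speed-up, given here only as a sketch on made-up stand-ins (`toy_speakers_df`, `toy_labels` and `genders_for_quote` are hypothetical), is to build plain dictionaries once and to do constant-time lookups afterwards.

```python
import pandas as pd

# Small stand-ins for speakers_df and labels
toy_speakers_df = pd.DataFrame({
    'id': ['Q105756', 'Q3074013'],
    'gender': [['Q6581097'], ['Q6581072']],
})
toy_labels = pd.DataFrame(
    {'Label': ['male', 'female']},
    index=pd.Index(['Q6581097', 'Q6581072'], name='QID'),
)

# Build both lookup tables once, outside of the loop over the quotes
gender_by_speaker = toy_speakers_df.set_index('id')['gender'].to_dict()
label_by_qid = toy_labels['Label'].to_dict()


def genders_for_quote(qids):
    """Map each speaker qid of a quote to the list of its gender labels."""
    return {
        qid: [label_by_qid.get(g, g) for g in gender_by_speaker.get(qid, [])]
        for qid in qids
    }


print(genders_for_quote(['Q105756', 'Q3074013']))
```

Speaker ids that are missing from the lookup table get an empty list, and gender QIDs without a label are kept as they are.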

In [None]:
#ONLY FIRST ALIAS
#Goal : struct dictionary{key = quote_id, value = ([genders],[nationalities])}

# Retrieve quotes speakers' gender, in the order of the quotations df
# Input : df_quotes with only first qid in new column qid (instead of qids)
def gender_nat_speaker_dict(df_quotes):
  dict_quote_id_genders = {}

  for q_index, q_row in tqdm(df_quotes.head(10).iterrows(), total=min(10, df_quotes.shape[0])): #Loop over the first 10 quotes only (test run)
    quote_id = q_row.quoteID #Future key in dict_quote_id_genders
    id_speaker = q_row.loc['qid'] 
    list_genders_speaker = [] #List of genders per alias speaker
    list_nat_speaker = [] #List of nationalities per alias speaker
    id_genders_speaker = speakers_df.loc[speakers_df.id==id_speaker]['gender'].to_numpy() #List of all genders ID per speaker
    id_nat_speaker = speakers_df.loc[speakers_df.id==id_speaker]['nationality'].to_numpy() #List of all nationalities ID per speaker
    for id_g in id_genders_speaker:
      gender= labels.loc[id_g]['Label'].values #Gender value
      list_genders_speaker.append(gender)
    for id_n in id_nat_speaker:
      if(id_n is not None):
        nat= labels.loc[id_n]['Label'].values #Nationality value
        list_nat_speaker.append(nat)
    dict_quote_id_genders[quote_id] = (list_genders_speaker, list_nat_speaker)
  return dict_quote_id_genders

In [None]:
#ATTRIBUTES OF A SINGLE SPEAKER
#Goal : tuple (speaker_id, [[labels of first attribute], [labels of second attribute], ...])

# Retrieve the labels of the requested attributes for one speaker
# Input : id_speaker (single qid) and attributes (list of speakers_df column names)
def list_attribute_speaker(id_speaker, attributes):
  list_attr_speaker = []
  for attr in attributes:
    list_attr = []
    id_attr_speaker = speakers_df.loc[speakers_df.id==id_speaker][attr].to_numpy() #List of all attribute ID per speaker
    for id_a in id_attr_speaker:
      if(id_a is not None):
        attr_labels = labels.loc[id_a]['Label'].values #Attribute values (own name to avoid shadowing the loop variable attr)
        list_attr.append(attr_labels[0])
    list_attr_speaker.append(list_attr)
  return (id_speaker, list_attr_speaker)

In [None]:
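# Note: df_quotesNY only has a 'qids' column; the 'qid' column exists only after
# applying get_single_qid (defined further below), hence the KeyError
dict_genders = {}
dict_genders = dict(df_quotesNY['qid'].apply(lambda i: list_attribute_speaker(i, ['gender', 'nationality'])).tolist())
dict_genders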

KeyError: ignored

In [None]:
#ONLY FIRST ALIAS
#Goal : add directly in df gender and nationality

# Retrieve quotes speakers' gender, in the order of the quotations df
# Input : df_quotes with only first qid in new column qid (instead of qids)
def gender_nat_speaker(df_quotes):
  df_result = df_quotes.copy()
  list_genders_speakers = [] #List of genders alias speakers
  list_nat_speakers = [] #List of nationalities alias speakers
  for q_index, q_row in df_result.iterrows(): #Loop over all quotes
    id_speaker = q_row.loc['qid'] #Speaker id of the current quote
    list_genders_speaker = [] #List of genders per alias speaker
    list_nat_speaker = [] #List of nationalities per alias speaker
    id_genders_speaker = speakers_df.loc[speakers_df.id==id_speaker]['gender'].to_numpy() #List of all genders ID per speaker
    id_nat_speaker = speakers_df.loc[speakers_df.id==id_speaker]['nationality'].to_numpy() #List of all nationalities ID per speaker
    for id_g in id_genders_speaker:
      gender= labels.loc[id_g]['Label'].values #Gender value
      list_genders_speaker.append(gender)
    for id_n in id_nat_speaker:
      if(id_n is not None):
        nat= labels.loc[id_n]['Label'].values #Nationality value
        list_nat_speaker.append(nat)
    list_genders_speakers.append(list_genders_speaker)
    list_nat_speakers.append(list_nat_speaker)
  df_result.insert(4, "gender", list_genders_speakers)
  df_result.insert(5, 'nationality', list_nat_speakers)
  return df_result

In [None]:
def get_single_qid(df_quotes):

  df_copy =  df_quotes.copy()
  df_result =  df_quotes.copy()
  for idx,row in df_copy.iterrows():  
    id_speaker = row.loc['qids'][0] 
    df_result.at[idx,'qids'] = id_speaker

  df_result = df_result.rename(columns={"qids": "qid"})
  return df_result
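
The same result can probably be obtained without an explicit loop over the rows. The sketch below is an assumption on our side rather than the version used in this notebook: `first_qid` and `toy_quotes_df` are hypothetical names, and quotes with an empty `qids` list get `None` instead of raising an `IndexError`.

```python
import pandas as pd


def first_qid(qids):
    """Return the first qid of a quote, or None if the list is empty."""
    return qids[0] if len(qids) > 0 else None


# Made-up quotes; the qids are taken from the sample output above
toy_quotes_df = pd.DataFrame({
    'quoteID': ['toy-1', 'toy-2', 'toy-3'],
    'qids': [['Q16213953', 'Q20707104'], ['Q105756'], []],
})

# Add the single qid as a new column and drop the original list column
single_qid_df = (
    toy_quotes_df
    .assign(qid=toy_quotes_df['qids'].apply(first_qid))
    .drop(columns='qids')
)
print(single_qid_df)
```

Note that here the `qid` column is appended at the end, whereas `get_single_qid` keeps it at the position of `qids`.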

In [None]:
# Add gender column
# df_quotesNY.insert(3, 'gender', speakerGender)

## 4. Data exploration and cleaning
<a id="data_explore_clean"></a>

### 4.1 Import prepared data
<a id="import_prep_data"></a>

### 4.2 Set the index
<a id="set_index"></a>

### 4.3 Save cleaned data frame as pickle
<a id="save_pickle"></a>

## 5. Research questions
<a id="research_question"></a>

### 5.1 Load pickled dataframes
<a id="load_pickle"></a>