<a href="https://colab.research.google.com/github/morwald/ada_project/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of gender distribution in UK's leading newspapers

This notebook serves as a first presentation of our project for milestone 2. It is structured and written in such a way that we can continue working on it directly for milestone 3.

## Content
1. [Setup](#setup)   
    1.1 [Imports](#imports)  
    1.2 [Data paths](#data_paths)   
    1.3 [Utility functions](#utility_functions)   
2. [Data preparation](#data_prep)    
    2.1 [Columns and rows selection](#cols_rows_select)  
    2.2 [Newspaper selection](#newspaper_select)  
    2.3 [Filtering raw data](#filter_raw_data)
3. [Additional data set: speaker attributes](#speaker_attr)    
    3.1 [Explore attributes](#explore_attr)  
    3.2 [Merge speaker attributes](#merge_speaker_attr)                
4. [Research questions](#research_question)

## 1. Setup
<a id="setup"></a>

### 1.1 Imports
<a id="imports"></a>

In [2]:
import pandas as pd
import numpy as np
import json
import bz2
import tqdm

from tqdm.notebook import tqdm_notebook

### 1.2 Data paths
<a id="data_paths"></a>

**Important**: When using Google Colab, the ```use_colab``` variable has to be set to `True`. This way the files can be accessed directly from our shared drive. If you want to work locally, the raw and filtered data have to be stored in the root folder under _Quotebank_ and _Filtered data_. Executing the [Data preparation](#data_prep) section requires the raw data in the folder _Quotebank_, but this section has to be executed only once.

You can download the raw data here: [Quotebank](https://zenodo.org/record/4277311#.YYzk6_oo9hE)

The additional dataset can be found here: [Speakers](https://drive.google.com/drive/folders/1VAFHacZFh0oxSxilgNByb1nlNsqznUf0)

In [39]:
# Comment the files which aren't locally stored
# In Colab everything should be available
RAW_QUOTES_2020_PATH = 'Quotebank/quotes-2020.json.bz2' 
QUOTES_2020_PATH = 'Filtered data/quotes-2020-gb.json.bz2' 

#RAW_QUOTES_2019_PATH = 'Quotebank/quotes-2019.json.bz2' 
QUOTES_2019_PATH = 'Filtered data/quotes-2019-gb.json.bz2' 

#RAW_QUOTES_2018_PATH = 'Quotebank/quotes-2018.json.bz2' 
QUOTES_2018_PATH = 'Filtered data/quotes-2018-gb.json.bz2' 

#RAW_QUOTES_2017_PATH = 'Quotebank/quotes-2017.json.bz2' 
QUOTES_2017_PATH = 'Filtered data/quotes-2017-gb.json.bz2' 

#RAW_QUOTES_2016_PATH = 'Quotebank/quotes-2016.json.bz2' 
QUOTES_2016_PATH = 'Filtered data/quotes-2016-gb.json.bz2' 

#RAW_QUOTES_2015_PATH = 'Quotebank/quotes-2015.json.bz2' 
QUOTES_2015_PATH = 'Filtered data/quotes-2015-gb.json.bz2'

# Additional data set
SPEAKER_ATTRIBUTES_PATH = 'Project datasets/speaker_attributes.parquet'
LABELS_WIKIDATA_PATH = 'Project datasets/wikidata_labels_descriptions_quotebank.csv.bz2'

In [4]:
# Change to true if you want to use google colab
use_colab = True

We mount the drive and change to our shared directory if necessary.

In [5]:
# Import with EPFL google drive!
if use_colab:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    
    %cd /content/drive/Shareddrives/ADA-project

Mounted at /content/drive
/content/drive/Shareddrives/ADA-project


We have to install an older version of pandas in order to be able to use the 'chunksize' feature in colab.

In [6]:
if use_colab:
    !pip install pandas==1.0.5

    # Reimport
    import pandas as pd
    print(pd.__version__)

1.0.5


### 1.3 Utility functions
<a id="utility_functions"></a>

In [7]:
def load_mini_version_of_data(path_to_file, chunksize, nb_chunks):
    """
    Returns a mini dataframe from a bz2 compressed json file.
    :path_to_file: file path as string
    :chunksize: number of rows per chunk
    :nb_chunks: how many chunks to read
    :return: pandas.DataFrame with at most chunksize*nb_chunks rows
    """
    
    curr_chunk = 0
    chunk_list = []
    
    if use_colab:
        # With pandas 1.0.5 (Colab) the reader is iterated directly
        for chunk in pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=chunksize):
            if curr_chunk == nb_chunks:
                break
            curr_chunk = curr_chunk + 1
            chunk_list.append(chunk)
    else:
        # Newer pandas versions allow using the reader as a context manager
        with pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=chunksize) as df_reader:
            for chunk in df_reader:
                if curr_chunk == nb_chunks:
                    break
                curr_chunk = curr_chunk + 1
                chunk_list.append(chunk)
    
    df = pd.concat(chunk_list)
    return df
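
The same idea of reading only the beginning of a compressed file can be sketched with the standard library alone. The snippet below is a minimal illustration, not part of the pipeline: `read_first_records` and the toy file are hypothetical, and the records are made up.

```python
import bz2
import json
import os
import tempfile
from itertools import islice


def read_first_records(path_to_file, nb_records):
    """Return the first nb_records JSON objects of a bz2 compressed JSON-lines file."""
    with bz2.open(path_to_file, 'rt', encoding='utf-8') as f:
        # islice stops iterating after nb_records lines
        return [json.loads(line) for line in islice(f, nb_records)]


# Toy file with made-up records, only for illustration
toy_path = os.path.join(tempfile.mkdtemp(), 'toy-quotes.json.bz2')
with bz2.open(toy_path, 'wt', encoding='utf-8') as f:
    for i in range(100):
        f.write(json.dumps({'quoteID': f'toy-{i}', 'speaker': 'None'}) + '\n')

sample = read_first_records(toy_path, 5)
print(len(sample), sample[0]['quoteID'])
```

This returns plain dictionaries rather than a DataFrame, which is enough for a quick look at the raw records.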

## 2. Data preparation
<a id="data_prep"></a>

The Quotebank dataset is too large to load directly into a dataframe. This section provides all the steps to filter the data we need for our analysis. The filtering and preparation is done based on our research question. Please check the README for details. Further explanations are given under [Research question](#research_question).

The data preparation can be done on a per-year basis of the Quotebank data set.

### 2.1 Column and row selection
<a id="cols_rows_select"></a>

In [8]:
# A quick look at a small subset of the data of the selected year
year_sample_df = load_mini_version_of_data(RAW_QUOTES_2020_PATH, 10000, 10)
year_sample_df.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2020-01-28-000082,[ D ] espite the efforts of the partners to cr...,,[],2020-01-28 08:04:05,1,"[[None, 0.7272], [Prime Minister Netanyahu, 0....",[http://israelnationalnews.com/News/News.aspx/...,E
1,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,[Q367796],2020-01-16 12:00:13,1,"[[Sue Myrick, 0.8867], [None, 0.0992], [Ron Wy...",[http://thehill.com/opinion/international/4782...,E
2,2020-02-10-000142,... He (Madhav) also disclosed that the illega...,,[],2020-02-10 23:45:54,1,"[[None, 0.8926], [Prakash Rai, 0.1074]]",[https://indianexpress.com/article/business/ec...,E
3,2020-02-15-000053,"... [ I ] f it gets to the floor,",,[],2020-02-15 14:12:51,2,"[[None, 0.581], [Andy Harris, 0.4191]]",[https://patriotpost.us/opinion/68622-trump-bu...,E
4,2020-01-24-000168,[ I met them ] when they just turned 4 and 7. ...,Meghan King Edmonds,[Q20684375],2020-01-24 20:37:09,4,"[[Meghan King Edmonds, 0.5446], [None, 0.2705]...",[https://people.com/parents/meghan-king-edmond...,E


In [9]:
# How many quotations don't have an assigned speaker?
sum(year_sample_df['speaker'] == 'None')

34316

The cell above shows that around 1/3 of the quotations in the sample (34316 of 100000) have a 'None' speaker. As we want to make a gender-based study, we will not need these rows. This elimination will drastically reduce the size of the data we have to analyse.

Furthermore, the columns which aren't of interest for our study are:\
**phase**: Not relevant for our analysis. \
**probas**: We will select the speaker with the highest probability (note that 'None' speakers are already neglected). See section... for further clarification.
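
As a rough illustration of what selecting the speaker with the highest probability means, the sketch below picks the top entry of a `probas` list. The helper `top_speaker` is hypothetical, the values are the visible part of the second sample row above, and the `float()` call assumes that probabilities may be stored as strings in the raw file.

```python
def top_speaker(probas):
    """Return the (speaker, probability) pair with the highest probability."""
    # float() covers both numeric and string-encoded probabilities (assumption)
    pairs = [(name, float(p)) for name, p in probas]
    return max(pairs, key=lambda pair: pair[1])


# Visible part of the probas list of the second sample row above
probas = [['Sue Myrick', 0.8867], ['None', 0.0992]]
speaker, proba = top_speaker(probas)
print(speaker, proba)
```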

### 2.2 Newspaper selection
<a id="newspaper_select"></a>
In a first analysis we pick quotations from 3 of the 12 UK newspapers with the largest combined print and digital reach. See [this](https://www.statista.com/statistics/246077/reach-of-selected-national-newspapers-in-the-uk/) statistic for more details.

In [10]:
# List of selected newspapers and their urls
newspapers_list = [['The Sun', 'thesun.co.uk'], 
                  ['The Guardian', 'theguardian.com'],
                  ['The Times', 'thetimes.co.uk']]

# Dataframe
newspapers_df = pd.DataFrame(newspapers_list, columns = ['name', 'website_url'])
newspapers_df.head()

Unnamed: 0,name,website_url
0,The Sun,thesun.co.uk
1,The Guardian,theguardian.com
2,The Times,thetimes.co.uk


### 2.3 Filtering raw data
<a id="filter_raw_data"></a>

Following the reasoning above we can extract the information needed from the compressed file of a year of quotations. Let's create a helper function which checks the urls of each quotation and keeps the quotations published by the selected newspapers:

In [11]:
def filter_data(path_in, path_out):
    """
    Filter a raw Quotebank file and write the kept quotations to a new file.
    Keeps quotations that have a speaker and appear in at least one selected newspaper.
    :path_in: path of the raw bz2 compressed json file
    :path_out: path of the filtered bz2 compressed json file
    """
    with bz2.open(path_in, 'rb') as s_file, bz2.open(path_out, 'wb') as d_file:
        for line in s_file:
            instance = json.loads(line) # loading a sample
            if instance['speaker'] == 'None':
                continue
            # Collect the selected newspapers that appear in the list of links
            newspapers = []
            for url in instance['urls']:
                for name, website_url in zip(newspapers_df['name'], newspapers_df['website_url']):
                    if website_url in url:
                        newspapers.append(name)
            # Skip quotations which don't appear in any selected newspaper
            if not newspapers:
                continue
            instance['newspapers'] = newspapers # updating the sample with the newspaper names
            # We remove unnecessary columns
            instance.pop('probas')
            instance.pop('phase')
            d_file.write((json.dumps(instance) + '\n').encode('utf-8')) # writing in the new file
                

In [14]:
#filter_data(RAW_QUOTES_2020_PATH,QUOTES_2020_PATH)
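
The filter above matches a newspaper when its domain appears anywhere in the url string. A stricter variant could compare against the host name only. The following is a minimal sketch of that alternative, not what the notebook uses: `match_newspapers` and `NEWSPAPER_DOMAINS` are hypothetical names, the example urls are made up, and unlike `filter_data` it lists each newspaper at most once.

```python
from urllib.parse import urlparse

# Same names and domains as in newspapers_list
NEWSPAPER_DOMAINS = {
    'The Sun': 'thesun.co.uk',
    'The Guardian': 'theguardian.com',
    'The Times': 'thetimes.co.uk',
}


def match_newspapers(urls):
    """Return the selected newspapers whose domain hosts at least one of the urls."""
    matched = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        for name, domain in NEWSPAPER_DOMAINS.items():
            # Accept the bare domain and any subdomain such as www.
            is_match = host == domain or host.endswith('.' + domain)
            if is_match and name not in matched:
                matched.append(name)
    return matched


# Made-up urls for illustration
print(match_newspapers([
    'https://www.theguardian.com/sport/some-article',
    'https://example.org/?ref=thesun.co.uk',
]))
```

The second url mentions a selected domain only in its query string, so it would pass the substring test but not the host name test.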

In [12]:
# We check that the new file contains the right data
filtered_sample_df = load_mini_version_of_data(QUOTES_2020_PATH, 10000, 10)
filtered_sample_df.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,urls,newspapers
0,2020-01-31-008580,As you reach or have reached the apex of your ...,Keyon Dooling,[Q304349],2020-01-31 19:07:55,1,[https://www.theguardian.com/sport/2020/jan/31...,[The Guardian]
1,2020-01-20-006469,At the same time we want to remain friends wit...,Tim Martin,"[Q20670776, Q20713880, Q7803899, Q7803900]",2020-01-20 09:08:24,4,[https://www.dailystar.co.uk/real-life/wethers...,[The Sun]
2,2020-04-03-006933,Been home-schooling a 6-year-old and 8-year-ol...,Shonda Rhimes,[Q242329],2020-04-03 16:00:00,1,[http://www.thetimes.co.uk/edition/magazine/ca...,[The Times]
3,2020-04-15-018814,I am now in agreement that we should move forw...,David Boies,[Q5231515],2020-04-15 15:46:38,1,[https://www.thesun.co.uk/news/11403669/jeffre...,[The Sun]
4,2020-02-16-014286,I don't want to make a career out of [ remakin...,Ramiro Gomez,"[Q30693403, Q43130877]",2020-02-16 15:00:32,1,[https://www.theguardian.com/artanddesign/2020...,[The Guardian]


Let's verify that there are no 'None' speakers:

In [13]:
filtered_sample_df[filtered_sample_df.speaker=='None'].empty

True

In [None]:
# Note: get_single_qid is defined in section 3.2, that cell has to be run first
filtered_sample_df = get_single_qid(filtered_sample_df)

Now let us do this filtering for the remaining data of years 2015-2019.

In [None]:
#filter_data(RAW_QUOTES_2019_PATH,QUOTES_2019_PATH)
#filter_data(RAW_QUOTES_2018_PATH,QUOTES_2018_PATH)
#filter_data(RAW_QUOTES_2017_PATH,QUOTES_2017_PATH)
#filter_data(RAW_QUOTES_2016_PATH,QUOTES_2016_PATH)
#filter_data(RAW_QUOTES_2015_PATH,QUOTES_2015_PATH)

## 3. Additional data set: speaker attributes
<a id="speaker_attr"></a>

The filtered Quotebank years are ready. The next step consists of integrating speaker attributes from our additional data set.

### 3.1 Explore attributes
<a id="explore_attr"></a>


In [8]:
# Load speaker attributes in df
speakers_df = pd.read_parquet(SPEAKER_ATTRIBUTES_PATH)
speakers_df.head()

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Q161885, Q30]",[Q6581097],1395141751,,W000178,"[Q82955, Q189290, Q131512, Q1734662, Q294126, ...",[Q327591],,Q23,George Washington,"[Q698073, Q697949]",item,[Q682443]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[Q145],[Q6581097],1395737157,[Q7994501],,"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Q31],[Q6581097],1380367296,,,"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[Q30],[Q6581097],1395142029,,,"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",[Q29468],,Q207,George W. Bush,"[Q327959, Q464075, Q3586276, Q4450587]",item,"[Q329646, Q682443, Q33203]"
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Q29],[Q6581097],1391704596,,,[Q1028181],,,Q297,Diego Velázquez,,item,


Let us first check for duplicates. There seem to be none.

In [9]:
# Duplicates
duplicates = speakers_df[speakers_df.duplicated(subset='id', keep='first')] 
duplicates.head()

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion


We are mostly interested in the column gender. But let us see which ones we could keep. How many 'None' values do we have?

In [10]:
# Fraction of lines with null values:
print('gender: ' + str(speakers_df[speakers_df.gender.isnull()].size / speakers_df.size))
print('nationality: ' + str(speakers_df[speakers_df.nationality.isnull()].size / speakers_df.size))
print('occupation: ' + str(speakers_df[speakers_df.occupation.isnull()].size / speakers_df.size))
print('academic_degree: ' + str(speakers_df[speakers_df.academic_degree.isnull()].size / speakers_df.size))
print('ethnic_group: ' + str(speakers_df[speakers_df.ethnic_group.isnull()].size / speakers_df.size))
print('religion ' + str(speakers_df[speakers_df.religion.isnull()].size / speakers_df.size))

gender: 0.21536937853557775
nationality: 0.5896797928352544
occupation: 0.29625691573337004
academic_degree: 0.9889581261268106
ethnic_group: 0.9856023328670853
religion 0.9783254845609769


Unfortunately we have to drop all attributes apart from gender, occupation and nationality. The other attributes checked above have more than 97% 'None' values.

In [11]:
# Keep the columns gender nationality and occupation
speakers_df = speakers_df[['id', 'gender', 'nationality', 'occupation']]

We will remove all 'None' genders as this is the most important attribute for our analysis.

In [12]:
# Remove null genders
speakers_df = speakers_df.dropna(subset=['gender'])
speakers_df.head()

Unnamed: 0,id,gender,nationality,occupation
0,Q23,[Q6581097],"[Q161885, Q30]","[Q82955, Q189290, Q131512, Q1734662, Q294126, ..."
1,Q42,[Q6581097],[Q145],"[Q214917, Q28389, Q6625963, Q4853732, Q1884422..."
2,Q1868,[Q6581097],[Q31],"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q..."
3,Q207,[Q6581097],[Q30],"[Q82955, Q15982858, Q18814623, Q1028181, Q1408..."
4,Q297,[Q6581097],[Q29],[Q1028181]


### 3.2 Merge speaker attributes
<a id="merge_speaker_attr"></a>

Now we can merge the speaker attributes into our filtered dataset. First of all we need to load the labels corresponding to the Wikidata ids (QIDs).



In [44]:
labels_df = pd.read_csv(LABELS_WIKIDATA_PATH, compression='bz2', index_col='QID')

# We know there will be a missing label in our dataset
# We will just add this one
labels_df.loc['Q6363085'] = labels_df.loc['Q380075']

labels_df.head()

Unnamed: 0_level_0,Label,Description
QID,Unnamed: 1_level_1,Unnamed: 2_level_1
Q31,Belgium,country in western Europe
Q45,Portugal,country in southwestern Europe
Q75,Internet,global system of connected computer networks
Q148,People's Republic of China,sovereign state in East Asia
Q155,Brazil,country in South America


Let's see what the fraction of 'None' labels is:

In [14]:
# The mean of the boolean mask is the fraction of rows without a label
# (count() would count every row, whether the label is missing or not)
labels_df.Label.isnull().mean()

We will merge every label which is available.

To efficiently look up the attributes we create dictionaries to access the desired ids.

In [None]:
speaker_genders = speakers_df['gender']
speaker_genders.value_counts()

[Q6581097]              5418464
[Q6581072]              1684170
[Q1052281]                  887
[Q48270]                    307
[Q2449503]                  228
                         ...   
[Q6581097, Q6581072]          1
[Q179294, Q6581097]           1
[Q179294, Q6581097]           1
[Q179294, Q6581097]           1
[Q6581097, Q6581072]          1
Name: gender, Length: 1411, dtype: int64

We print the labels of the five most frequent genders:

In [None]:
print(labels_df.loc['Q6581097']['Label'],
      labels_df.loc['Q6581072']['Label'],
      labels_df.loc['Q1052281']['Label'],
      labels_df.loc['Q48270']['Label'],
      labels_df.loc['Q2449503']['Label'])

male female transgender female non-binary transgender male
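
Since every gender entry is a list of QIDs, it can help to translate the lists into readable labels before counting them. The sketch below shows one possible way with the standard library. The QID to label pairs are the ones printed above, while `toy_genders` is a made-up sample and its counts say nothing about the real data.

```python
from collections import Counter

# QID -> label pairs as printed above
GENDER_LABELS = {
    'Q6581097': 'male',
    'Q6581072': 'female',
    'Q1052281': 'transgender female',
    'Q48270': 'non-binary',
    'Q2449503': 'transgender male',
}

# Toy gender column: each entry is a list of QIDs (made-up sample)
toy_genders = [['Q6581097'], ['Q6581072'], ['Q6581097'], ['Q6581097', 'Q6581072']]

# Lists are not hashable, so each one becomes a tuple of labels before counting
counts = Counter(tuple(GENDER_LABELS[q] for q in qids) for qids in toy_genders)
print(counts.most_common())
```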


In [46]:
# Create dictionary of labels and the attributes genders, nationalities and occupations
labels_dict = labels_df['Label'].to_dict()
genders_dict = pd.Series(speakers_df.gender.values, index=speakers_df.id).to_dict()
nationalities_dict = pd.Series(speakers_df.nationality.values, index=speakers_df.id).to_dict()
occupations_dict = pd.Series(speakers_df.occupation.values, index=speakers_df.id).to_dict()

We need a function to translate from a qid to the corresponding labels. And we need a function to get the first qid.

In [32]:
def translate_qid2label(id_speaker, attr_dict):
    """
    Retrieve the labels of one attribute of a speaker.
    :id_speaker: qid of the speaker (column qid instead of qids)
    :attr_dict: dictionary mapping speaker qids to the qids of the attribute
    :return: list of labels, empty list if the speaker is unknown, None if the attribute is missing
    """
    list_attr = []

    if id_speaker in attr_dict:
        id_attr_speaker = attr_dict[id_speaker]
        if id_attr_speaker is not None:
            for id_a in id_attr_speaker:
                list_attr.append(labels_dict[id_a]) # Attribute value
        else:
            list_attr = None

    return list_attr
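
To make the three possible return values of the lookup concrete, here is a small standalone sketch with hand-made dictionaries. `lookup_labels` is a hypothetical stand-in that takes the label dictionary as an argument, `Q1868`, `Q6581097` and `Q31` are values shown earlier in the notebook, and `QX` and `QY` are invented ids.

```python
def lookup_labels(qid, attr_dict, labels):
    """Toy version of the lookup: speaker qid -> attribute qids -> readable labels."""
    if qid not in attr_dict:
        return []      # speaker unknown
    attr_ids = attr_dict[qid]
    if attr_ids is None:
        return None    # speaker known, attribute missing
    return [labels[a] for a in attr_ids]


# Small hand-made dictionaries, QX and QY are invented ids
toy_labels = {'Q6581097': 'male', 'Q31': 'Belgium'}
toy_genders = {'Q1868': ['Q6581097'], 'QX': ['Q6581097']}
toy_nationalities = {'Q1868': ['Q31'], 'QX': None}

print(lookup_labels('Q1868', toy_genders, toy_labels))
print(lookup_labels('Q1868', toy_nationalities, toy_labels))
print(lookup_labels('QX', toy_nationalities, toy_labels))
print(lookup_labels('QY', toy_genders, toy_labels))
```

An empty list and `None` carry different meanings here, so later analysis steps may want to treat them separately.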

In [33]:
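# We want to transform the 'qids' column into a single 'qid' element
def get_single_qid(df_quotes):
    """
    Replace the list column qids by a column qid holding the first qid of each list.
    :df_quotes: quotations DataFrame with a column qids
    :return: copy of the DataFrame with the column qid instead of qids
    """
    df_result = df_quotes.copy()
    # .str[0] takes the first element of each list (NaN for an empty list)
    df_result['qids'] = df_result['qids'].str[0]
    return df_result.rename(columns={'qids': 'qid'})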

Now we can put all these together and apply the above function to all our speaker attributes and quotes.


In [34]:
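def add_speaker_attributes(path_to_file):
    """
    Load a filtered quotes file and add the gender, nationality and occupations of each speaker.
    :path_to_file: path of the filtered bz2 compressed json file
    :return: pandas.DataFrame with the additional attribute columns
    """
    df = pd.read_json(path_to_file, lines=True, compression='bz2')
    # First we get a single qid:
    df = get_single_qid(df)
    # Register progress_apply on pandas objects to show a progress bar
    tqdm_notebook.pandas()
    genders = df['qid'].progress_apply(lambda i: translate_qid2label(i, genders_dict))
    nationalities = df['qid'].progress_apply(lambda i: translate_qid2label(i, nationalities_dict))
    occupations = df['qid'].progress_apply(lambda i: translate_qid2label(i, occupations_dict))
    df.insert(3, 'gender', genders)
    df.insert(4, 'nationality', nationalities)
    df.insert(5, 'occupations', occupations)

    return df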

In [35]:
# Merge speaker attributes for filtered quotes of 2020
df_2020 = add_speaker_attributes(QUOTES_2020_PATH)
df_2020.head()

Unnamed: 0,quoteID,quotation,speaker,gender,nationality,occupations,qid,date,numOccurrences,urls,newspapers
0,2020-01-31-008580,As you reach or have reached the apex of your ...,Keyon Dooling,[male],[United States of America],[basketball player],Q304349,2020-01-31 19:07:55,1,[https://www.theguardian.com/sport/2020/jan/31...,[The Guardian]
1,2020-01-20-006469,At the same time we want to remain friends wit...,Tim Martin,[male],,[American football player],Q20670776,2020-01-20 09:08:24,4,[https://www.dailystar.co.uk/real-life/wethers...,[The Sun]
2,2020-04-03-006933,Been home-schooling a 6-year-old and 8-year-ol...,Shonda Rhimes,[female],[United States of America],"[film director, screenwriter, writer, film pro...",Q242329,2020-04-03 16:00:00,1,[http://www.thetimes.co.uk/edition/magazine/ca...,[The Times]
3,2020-04-15-018814,I am now in agreement that we should move forw...,David Boies,[male],[United States of America],[lawyer],Q5231515,2020-04-15 15:46:38,1,[https://www.thesun.co.uk/news/11403669/jeffre...,[The Sun]
4,2020-02-16-014286,I don't want to make a career out of [ remakin...,Ramiro Gomez,[male],,[artist],Q30693403,2020-02-16 15:00:32,1,[https://www.theguardian.com/artanddesign/2020...,[The Guardian]


In [47]:
# Merge speaker attributes of all other years
df_2016 = add_speaker_attributes(QUOTES_2016_PATH)
df_2017 = add_speaker_attributes(QUOTES_2017_PATH)
df_2018 = add_speaker_attributes(QUOTES_2018_PATH)
df_2019 = add_speaker_attributes(QUOTES_2019_PATH)

## 4. Research questions
<a id="research_question"></a>