<a href="https://colab.research.google.com/github/morwald/ada_project/blob/main/main_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Female citations in the UK's leading newspapers

This notebook serves as a first presentation of our project for milestone 2. It is structured so that we can build on it directly for milestone 3.

## Content
1. [Setup](#setup)   
    1.1 [Imports](#imports)  
    1.2 [Data paths](#data_paths)   
    1.3 [Utility functions](#utility_functions)   
2. [Data preparation](#data_prep)    
    2.1 [Column and row selection](#cols_rows_select)  
    2.2 [Newspaper selection](#newspaper_select)  
    2.3 [Filtering raw data](#filter_raw_data)   
    2.4 [Merging speaker information](#merging_speakers)   
3. [Data exploration and cleaning](#data_explore_clean)  
    3.1 [Import prepared data](#import_prep_data)   
    3.2 [Set index](#set_index)    
    3.3 [Save cleaned data frame as pickle](#save_pickle)   
4. [Research questions](#research_question)     
    4.1 [Load pickled dataframes](#load_pickle)

## 1. Setup
<a id="setup"></a>

### 1.1 Imports
<a id="imports"></a>

In [None]:
import pandas as pd
import json
import bz2
import matplotlib.pyplot as plt
from urllib.parse import urlparse

### 1.2 Data paths
<a id="data_paths"></a>

**Important**: The raw and prepared data are stored locally in the root folders _raw_data_ and _data_. To execute the section [Data preparation](#data_prep) the raw data is needed. This section has to be executed only once.

You can download the raw data here (EPFL google account required): [Quotebank](), [Speakers]()

The cleaned data can be found using this link:
[Cleaned data]()

We mount the drive and go to the right directory.

In [None]:
# Mount the EPFL Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
%cd /content/drive/Shareddrives/ADA-project

/content/drive/Shareddrives/ADA-project


We install an older version of pandas (1.0.5) in order to be able to use the `chunksize` feature in Colab.

In [None]:
!pip install pandas==1.0.5




In [None]:
import pandas as pd
import numpy as np

In [None]:
pd.__version__

'1.0.5'

We define the paths to the files.

In [66]:
# Paths of the files
RAW_QUOTES_2020_PATH = 'Quotebank/quotes-2020.json.bz2' 
QUOTES_2020_PATH = 'Filtered data/quotes-2020-gb.json.bz2' 

RAW_QUOTES_2019_PATH = 'Quotebank/quotes-2019.json.bz2' 
QUOTES_2019_PATH = 'Filtered data/quotes-2019-gb.json.bz2' 

RAW_QUOTES_2018_PATH = 'Quotebank/quotes-2018.json.bz2' 
QUOTES_2018_PATH = 'Filtered data/quotes-2018-gb.json.bz2' 

RAW_QUOTES_2017_PATH = 'Quotebank/quotes-2017.json.bz2' 
QUOTES_2017_PATH = 'Filtered data/quotes-2017-gb.json.bz2' 

RAW_QUOTES_2016_PATH = 'Quotebank/quotes-2016.json.bz2' 
QUOTES_2016_PATH = 'Filtered data/quotes-2016-gb.json.bz2' 

RAW_QUOTES_2015_PATH = 'Quotebank/quotes-2015.json.bz2' 
QUOTES_2015_PATH = 'Filtered data/quotes-2015-gb.json.bz2'

### 1.3 Utility functions
<a id="utility_functions"></a>

This version of the function is for the older version of pandas (1.0.5) in Google Colab.

In [None]:
def load_mini_version_of_data_old_pandas(path_to_file, chunksize, nb_chunks):
    """
    Returns a mini dataframe from a bz2-compressed JSON-lines file.
    :path_to_file: file path as string
    :chunksize: number of rows per chunk
    :nb_chunks: number of chunks to read
    :return: pandas.DataFrame with at most chunksize*nb_chunks rows
    """
    chunk_list = []

    # Read the file lazily, chunk by chunk, and stop after nb_chunks chunks
    reader = pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=chunksize)
    for curr_chunk, chunk in enumerate(reader):
        if curr_chunk == nb_chunks:
            break
        chunk_list.append(chunk)

    df = pd.concat(chunk_list)
    return df
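
For a quick look at the first few records without pandas, the compressed file can also be streamed line by line. The sketch below is an alternative under that assumption, not the loader used in this notebook: `peek_records` is a hypothetical helper name, and the demo file with made-up records is written to a temporary directory only so that the example runs on its own.

```python
import bz2
import json
import os
import tempfile
from itertools import islice


def peek_records(path_to_file, n_records):
    """Return the first n_records JSON objects of a bz2-compressed JSON-lines file."""
    # 'rt' decompresses on the fly and yields text lines, so only the
    # requested lines are read from the file.
    with bz2.open(path_to_file, 'rt', encoding='utf-8') as f:
        return [json.loads(line) for line in islice(f, n_records)]


# Tiny stand-in file (made-up records) to keep the sketch self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json.bz2')
with bz2.open(demo_path, 'wt', encoding='utf-8') as f:
    for i in range(5):
        f.write(json.dumps({'quoteID': i, 'speaker': 'None'}) + '\n')

first_two = peek_records(demo_path, 2)
print(first_two)
```

Because the records come back as plain dictionaries, this is only suited to inspection; the pandas loader above remains the way to get a dataframe.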

## 2. Data preparation
<a id="data_prep"></a>

The Quotebank dataset is too large to load directly into a dataframe. This section provides all the steps to filter the data we need for our analysis. The filtering and preparation are done based on our research question. Please check the README for details. Further explanations are given under [Research question](#research_question).

The data preparation can be done on a per-year basis of the Quotebank dataset.

### 2.1 Column and row selection
<a id="cols_rows_select"></a>

In [None]:
# A quick look at a small subset of the data of the selected year
year_sample_df = load_mini_version_of_data_old_pandas(RAW_QUOTES_2020_PATH, 10000, 10)
year_sample_df.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2020-01-28-000082,[ D ] espite the efforts of the partners to cr...,,[],2020-01-28 08:04:05,1,"[[None, 0.7272], [Prime Minister Netanyahu, 0....",[http://israelnationalnews.com/News/News.aspx/...,E
1,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,[Q367796],2020-01-16 12:00:13,1,"[[Sue Myrick, 0.8867], [None, 0.0992], [Ron Wy...",[http://thehill.com/opinion/international/4782...,E
2,2020-02-10-000142,... He (Madhav) also disclosed that the illega...,,[],2020-02-10 23:45:54,1,"[[None, 0.8926], [Prakash Rai, 0.1074]]",[https://indianexpress.com/article/business/ec...,E
3,2020-02-15-000053,"... [ I ] f it gets to the floor,",,[],2020-02-15 14:12:51,2,"[[None, 0.581], [Andy Harris, 0.4191]]",[https://patriotpost.us/opinion/68622-trump-bu...,E
4,2020-01-24-000168,[ I met them ] when they just turned 4 and 7. ...,Meghan King Edmonds,[Q20684375],2020-01-24 20:37:09,4,"[[Meghan King Edmonds, 0.5446], [None, 0.2705]...",[https://people.com/parents/meghan-king-edmond...,E


In [None]:
# How many quotations don't have an assigned speaker?
sum(year_sample_df['speaker'] == 'None')

34316

The cell above shows that around 1/3 of the quotations in the sample have a 'None' speaker. As we want to conduct a gender-based study, we will not need these rows. This elimination will drastically reduce the size of the data we have to analyse.
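
If the sample indeed holds 10 chunks of 10,000 rows, the count of 34316 corresponds to roughly 34% of the quotations. The share can also be read off directly as the mean of the boolean mask. The snippet below is a minimal sketch on a made-up toy dataframe (`toy_df` stands in for `year_sample_df`):

```python
import pandas as pd

# Made-up toy sample; in the notebook this would be year_sample_df.
toy_df = pd.DataFrame(
    {'speaker': ['None', 'Sue Myrick', 'None', 'Meghan King Edmonds']}
)

# The mean of a boolean mask is the fraction of True values.
none_share = (toy_df['speaker'] == 'None').mean()
print(f"{none_share:.0%} of the toy quotations have no assigned speaker")
```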

Furthermore, the columns which aren't of interest for our study are:\
**phase**: not relevant to our analysis\
**probas**: as we will select the speaker with the highest probability (note that 'None' speakers are already neglected)

### 2.2 Newspaper selection
<a id="newspaper_select"></a>
For a first analysis we will pick quotations from 3 of the 12 UK newspapers with the largest combined print and digital reach. See [this]() statistic for further details.

In [22]:
# List of selected newspapers and their urls
newspapers_list = [['The Sun', 'thesun.co.uk'], 
              ['The Guardian', 'theguardian.com'],
              ['The Times', 'thetimes.co.uk']]

# Dataframe
newspapers_df = pd.DataFrame(newspapers_list, columns = ['name', 'website_url'])
newspapers_df.head()

Unnamed: 0,name,website_url
0,The Sun,thesun.co.uk
1,The Guardian,theguardian.com
2,The Times,thetimes.co.uk


### 2.3 Filtering raw data
<a id="filter_raw_data"></a>

Following the reasoning above, we can extract the information we need from the compressed file of one year of quotations. Let's create a helper function that checks the urls of each quotation against the selected newspapers:

In [68]:
def filter_data(path_in, path_out):
    """
    Streams a bz2-compressed JSON-lines file and writes the quotations that have
    an assigned speaker and appear in one of the selected newspapers.
    :path_in: path of the raw Quotebank file
    :path_out: path of the filtered output file
    """
    with bz2.open(path_in, 'rb') as s_file:
        with bz2.open(path_out, 'wb') as d_file:
            for line in s_file:
                instance = json.loads(line)  # one quotation per line
                if instance['speaker'] == 'None':
                    continue
                # collect the names of the selected newspapers found in the urls
                newspapers = []
                for url in instance['urls']:
                    for name, website_url in zip(newspapers_df['name'], newspapers_df['website_url']):
                        if website_url in url:
                            newspapers.append(name)
                # skip quotations that appear in none of the selected newspapers
                if not newspapers:
                    continue
                instance['newspapers'] = newspapers
                # we remove unnecessary columns
                instance.pop('probas')
                instance.pop('phase')
                d_file.write((json.dumps(instance) + '\n').encode('utf-8'))
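
The substring test `website_url in url` also matches urls that merely mention a newspaper's domain in their path or query string. A stricter check could compare host names with `urlparse`, which the notebook already imports. This is only a sketch: `matches_domain` is a hypothetical helper that is not part of the pipeline above, and the example urls are made up.

```python
from urllib.parse import urlparse


def matches_domain(url, website_url):
    """Return True if the host of `url` is `website_url` or one of its subdomains."""
    # hostname is lower-cased and stripped of port and credentials;
    # it is None when the url has no network location.
    host = urlparse(url).hostname or ''
    return host == website_url or host.endswith('.' + website_url)


# Made-up urls for illustration.
on_site = matches_domain('https://www.thesun.co.uk/news/some-article/', 'thesun.co.uk')
mentioned_only = matches_domain('https://example.com/?ref=thesun.co.uk', 'thesun.co.uk')
print(on_site, mentioned_only)
```

Swapping this check into the inner loop would change which quotations are kept, so the filtered files would have to be regenerated.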
                

In [69]:
filter_data(RAW_QUOTES_2020_PATH,QUOTES_2020_PATH)

In [70]:
# We check that the new file contains the right data
filtered_sample_df = load_mini_version_of_data_old_pandas(QUOTES_2020_PATH, 10000, 10)
filtered_sample_df.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,urls,newspapers
0,2020-01-31-008580,As you reach or have reached the apex of your ...,Keyon Dooling,[Q304349],2020-01-31 19:07:55,1,[https://www.theguardian.com/sport/2020/jan/31...,[The Guardian]
1,2020-01-20-006469,At the same time we want to remain friends wit...,Tim Martin,"[Q20670776, Q20713880, Q7803899, Q7803900]",2020-01-20 09:08:24,4,[https://www.dailystar.co.uk/real-life/wethers...,[The Sun]
2,2020-04-03-006933,Been home-schooling a 6-year-old and 8-year-ol...,Shonda Rhimes,[Q242329],2020-04-03 16:00:00,1,[http://www.thetimes.co.uk/edition/magazine/ca...,[The Times]
3,2020-04-15-018814,I am now in agreement that we should move forw...,David Boies,[Q5231515],2020-04-15 15:46:38,1,[https://www.thesun.co.uk/news/11403669/jeffre...,[The Sun]
4,2020-02-16-014286,I don't want to make a career out of [ remakin...,Ramiro Gomez,"[Q30693403, Q43130877]",2020-02-16 15:00:32,1,[https://www.theguardian.com/artanddesign/2020...,[The Guardian]


Let's check that there are no 'None' speakers and no missing ('NaN') newspapers:

In [71]:
filtered_sample_df[filtered_sample_df.speaker=='None'].empty

True

In [72]:
filtered_sample_df[filtered_sample_df.newspapers.isna()].empty

True
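
Missing values in pandas are not the string 'NaN', so an equality test against that string cannot detect them; `isna()` is the usual check. The toy illustration below uses made-up rows and is a sketch, not part of the pipeline:

```python
import pandas as pd

# Made-up rows: the second one has no newspaper at all.
toy_df = pd.DataFrame({'newspapers': [['The Sun'], None, ['The Times']]})

# Equality with the string 'NaN' does not flag the missing entry.
string_hits = int((toy_df['newspapers'] == 'NaN').sum())

# isna() flags it.
missing_hits = int(toy_df['newspapers'].isna().sum())

print(string_hits, missing_hits)
```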

Now let us do this filtering for the remaining data of years 2015-2019.

In [38]:
filter_data(RAW_QUOTES_2019_PATH,QUOTES_2019_PATH)
filter_data(RAW_QUOTES_2018_PATH,QUOTES_2018_PATH)
filter_data(RAW_QUOTES_2017_PATH,QUOTES_2017_PATH)
filter_data(RAW_QUOTES_2016_PATH,QUOTES_2016_PATH)
filter_data(RAW_QUOTES_2015_PATH,QUOTES_2015_PATH)


KeyboardInterrupt: ignored
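
The remaining years follow the same path pattern as the constants in section 1.2, so the calls could also be driven by a loop. The sketch below only builds and prints the paths: `year_paths` is a hypothetical helper, and the call to `filter_data` is left commented out so that the block runs on its own.

```python
def year_paths(year):
    """Return (raw_path, filtered_path) for one year, following the notebook's naming."""
    raw_path = f'Quotebank/quotes-{year}.json.bz2'
    filtered_path = f'Filtered data/quotes-{year}-gb.json.bz2'
    return raw_path, filtered_path


# Years 2019 down to 2015, in the same order as the calls above.
for year in range(2019, 2014, -1):
    raw_path, filtered_path = year_paths(year)
    print(raw_path, '->', filtered_path)
    # filter_data(raw_path, filtered_path)  # uncomment inside the notebook
```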

### 2.4 Merging speaker information
<a id="merging_speakers"></a>
**Lorena**

## 3. Data exploration and cleaning
<a id="data_explore_clean"></a>
**Marie**

### 3.1 Import prepared data
<a id="import_prep_data"></a>

### 3.2 Set the index
<a id="set_index"></a>

In [None]:
# The index is now unique
df.index.is_unique

True

### 3.3 Save cleaned data frame as pickle
<a id="save_pickle"></a>

## 4. Research questions
<a id="research_question"></a>

### 4.1 Load pickled dataframes
<a id="load_pickle"></a>