<a href="https://colab.research.google.com/github/morwald/ada_project/blob/master/gender_topics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of gender distribution in the UK's leading newspapers
# Gender representation

## Content
1. [Setup](#setup)   
    1.1 [Global](#global_setup)  
    1.2 [Local](#local_setup)

## 1. Setup
<a id="setup"></a>

### 1.1 Global
<a id="global_setup"></a>

In [1]:
# Set to True to run this notebook on Google Colab
use_colab = True

# Mount the shared EPFL Google Drive that holds the project data
if use_colab:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)  # public API instead of the private drive._mount
    %cd /content/drive/Shareddrives/ADA-project
    #!pip install pandas==1.0.5 # downgrade pandas for chunk processing support

Mounted at /content/drive
/content/drive/Shareddrives/ADA-project


In [2]:
# Defined paths for the data
from scripts.path_defs import *

# Defined newspapers and urls
from scripts.newspapers import *

# Globally used functions
from scripts.utility_functions import load_mini_version_of_data
from scripts.utility_functions import convert_to_1Dseries
from scripts.utility_functions import process_data_in_chunks

### 1.2 Local 
<a id="local_setup"></a>

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json
import bz2


In [4]:
! pip install empath

Collecting empath
  Downloading empath-0.89.tar.gz (57 kB)
Building wheels for collected packages: empath
  Building wheel for empath (setup.py) ... done
  Created wheel for empath: filename=empath-0.89-py3-none-any.whl size=57821 sha256=2643afe09129cb191143fd67f536eef1ffc0f9eb4f484d7763517afce03cd23d
  Stored in directory: /root/.cache/pip/wheels/2b/78/a8/37d4505eeae79807f4b5565a193f7cfcee892137ad37591029
Successfully built empath
Installing collected packages: empath
Successfully installed empath-0.89


In [5]:
from empath import Empath
lexicon = Empath()

In [6]:
quotes_df = pd.read_json(MERGED_QUOTES_2020_PATH, lines=True, compression='bz2')
doc_complete = quotes_df.quotation.tolist()
doc_complete[:3]  # preview the first three quotes

['[ The delay ] will have an impact [ on Slough ] but that might be mitigated by the fact we are going to have this Western Rail Link to Heathrow. It looks like that may come in sooner than Crossrail.',
 "And for the record, Eamonn Holmes made me laugh, he lightened a very emotional moment and I'm very happy that he did.",
 'And help he always did. For someone who preferred to be behind the scenes, he was at the center of absolutely everything.']

In [7]:
doc_complete[1]

"And for the record, Eamonn Holmes made me laugh, he lightened a very emotional moment and I'm very happy that he did."

In [8]:
lexicon.create_category("metoo",["metoo","#metoo","consent", "harassment", "sexual assault", "sexual misconduct"], model='nytimes')



In [9]:
cat = lexicon.analyze(doc_complete[1], normalize=True)


In [10]:
# Keep only the categories with a nonzero normalized score
cate = {key: value for key, value in cat.items() if value > 0}
list(cate.keys())

['wedding',
 'cheerfulness',
 'suffering',
 'optimism',
 'childish',
 'celebration',
 'sadness',
 'emotional',
 'party',
 'positive_emotion']
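
The filter above keeps every category with a nonzero score, in no particular order. As a minimal sketch, assuming the analysis result is a plain dict that maps category names to normalized scores, the categories could also be ranked by score. The helper `top_topics` and the values in `example_scores` are hypothetical and only serve as an illustration; they are not part of Empath or of the dataset.

```python
def top_topics(scores, k=3):
    """Return the k highest-scoring (category, score) pairs with a nonzero score."""
    nonzero = {name: value for name, value in scores.items() if value > 0}
    # sorted() is stable, so categories with equal scores keep their original order
    ranked = sorted(nonzero.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up scores for illustration only
example_scores = {"optimism": 0.05, "party": 0.0, "sadness": 0.1, "wedding": 0.05}
ranked = top_topics(example_scores, k=2)
print(ranked)
```

Ranking makes it easier to keep only the dominant topics of a quote instead of every category that matched a single word.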

In [17]:
def get_topics(quote):
    """Return the names of all Empath categories with a nonzero score for the quote."""
    dic_topics = lexicon.analyze(quote, normalize=True)
    categories = {key: value for key, value in dic_topics.items() if value > 0}
    return list(categories.keys())

In [18]:
# Row-by-row analysis of every quote; slow on the full dataset (this run was interrupted)
quotes_df['topics'] = quotes_df['quotation'].apply(get_topics)


KeyboardInterrupt: ignored
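
The `apply` call above was interrupted before it finished. One possible way to keep partial results and report progress is to walk through the quotes in fixed-size batches. The following is only a sketch under stated assumptions: `batches` and `toy_get_topics` are hypothetical helpers, and `toy_get_topics` is a stand-in for the real `get_topics` so that the example runs without Empath.

```python
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def toy_get_topics(quote):
    # Hypothetical stand-in for get_topics: labels a quote by its word count
    return ["long"] if len(quote.split()) > 3 else ["short"]


quotes = ["a b c d e", "hi", "one two three four", "ok then", "x"]
results = []
for batch in batches(quotes, size=2):
    results.extend(toy_get_topics(q) for q in batch)
    print(f"processed {len(results)} quotes")
```

Because `results` grows batch by batch, an interrupted run still leaves the topics computed so far available for inspection.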

In [19]:
def add_topics(path_in, path_males, path_females, path_others):
    """Annotate each quote in path_in with its Empath topics and split the result by speaker gender.

    Input and output files are bz2-compressed JSON lines. Quotes without any topic are dropped.
    """
    with bz2.open(path_in, 'rb') as s_file, \
         bz2.open(path_males, 'wb') as male_file, \
         bz2.open(path_females, 'wb') as female_file, \
         bz2.open(path_others, 'wb') as other_file:
        for line in s_file:
            instance = json.loads(line)  # one quote record per line

            # Keep only the categories with a nonzero normalized score
            categories = lexicon.analyze(instance['quotation'], normalize=True)
            cat = {key: value for key, value in categories.items() if value > 0}
            instance['topics'] = list(cat.keys())
            instance['proba_topics'] = list(cat.values())

            if not instance['topics']:  # skip quotes without any topic
                continue

            # Treat a missing gender as an empty list so the quote goes to the "others" file
            gender = instance['gender'] or []
            record = (json.dumps(instance) + '\n').encode('utf-8')
            if 'female' in gender:
                female_file.write(record)
            elif 'male' in gender:
                male_file.write(record)
            else:
                other_file.write(record)
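
`add_topics` tests for `'female'` before `'male'`. With list-valued gender fields such as `['male']`, the `in` operator is an exact membership test, but on a plain string it becomes a substring test, and `'male' in 'female'` is `True`. The sketch below shows one way to make the routing robust to lists, strings and missing values. It is an illustration under assumptions: `gender_bucket` is a hypothetical helper, the records are made up, and the files are written to a temporary directory rather than to the project paths.

```python
import bz2
import json
import os
import tempfile


def gender_bucket(gender):
    """Map a gender field (list, string, or None) to 'female', 'male' or 'other'."""
    if isinstance(gender, str):
        labels = [gender]
    else:
        labels = list(gender or [])
    # Exact label comparison avoids the substring trap described above
    if "female" in labels:
        return "female"
    if "male" in labels:
        return "male"
    return "other"


# Made-up records for illustration only
records = [
    {"quotation": "q1", "gender": ["male"]},
    {"quotation": "q2", "gender": ["female"]},
    {"quotation": "q3", "gender": None},
    {"quotation": "q4", "gender": "female"},
]

with tempfile.TemporaryDirectory() as tmp:
    names = ("female", "male", "other")
    paths = {name: os.path.join(tmp, f"{name}.json.bz2") for name in names}
    files = {name: bz2.open(path, "wt", encoding="utf-8") for name, path in paths.items()}
    try:
        for record in records:
            files[gender_bucket(record["gender"])].write(json.dumps(record) + "\n")
    finally:
        for handle in files.values():
            handle.close()

    # Read the files back and count the records routed to each bucket
    counts = {}
    for name, path in paths.items():
        with bz2.open(path, "rt", encoding="utf-8") as handle:
            counts[name] = sum(1 for _ in handle)

print(counts)
```

Keeping the routing decision in one small function also makes it easy to test separately from the file handling.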
                           

In [None]:
add_topics(MERGED_QUOTES_2019_PATH, 'Data with topics/QUOTES_MALES_2019', 'Data with topics/QUOTES_FEMALES_2019', 'Data with topics/QUOTES_OTHERS_2019')

In [None]:
topics_males = pd.read_json('Data with topics/QUOTES_MALES_2020', lines=True,compression='bz2')

In [None]:
topics_males.head()

| | quoteID | quotation | speaker | date | numOccurrences | urls | newspapers | qid | gender | nationality | occupation | topics | proba_topics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-17-000357 | [ The delay ] will have an impact [ on Slough ... | Dexter Smith | 2020-01-17 13:03:00 | 1 | [http://www.sloughexpress.co.uk/gallery/slough... | [Daily Express] | Q5268447 | [male] | [Bermuda] | [cricketer] | [dispute, violence, communication, injury, str... | [0.024390243902439, 0.024390243902439, 0.02439... |
| 1 | 2020-02-07-005251 | And for the record, Eamonn Holmes made me laug... | Phillip Schofield | 2020-02-07 20:30:49 | 2 | [https://www.dailystar.co.uk/showbiz/breaking-... | [Daily Star] | Q7185804 | [male] | [United Kingdom, New Zealand] | [television presenter] | [wedding, cheerfulness, suffering, optimism, c... | [0.045454545454545005, 0.045454545454545005, 0... |
| 2 | 2020-01-31-008580 | As you reach or have reached the apex of your ... | Keyon Dooling | 2020-01-31 19:07:55 | 1 | [https://www.theguardian.com/sport/2020/jan/31... | [The Guardian] | Q304349 | [male] | [United States of America] | [basketball player] | [pride, strength] | [0.045454545454545005, 0.09090909090909001] |
| 3 | 2020-01-20-006469 | At the same time we want to remain friends wit... | Tim Martin | 2020-01-20 09:08:24 | 4 | [https://www.dailystar.co.uk/real-life/wethers... | [Daily Star, The Sun] | Q20670776 | [male] | | [American football player] | [help, social_media, trust, friends, achieveme... | [0.043478260869565, 0.043478260869565, 0.04347... |
| 4 | 2020-02-11-011721 | But for an 18-month period he didn't change. H... | Tyson Fury | 2020-02-11 19:57:51 | 1 | [https://www.mirror.co.uk/3am/celebrity-news/p... | [Daily Mirror] | Q1000592 | [male] | [United Kingdom, Ireland] | [boxer] | [school] | [0.09090909090909001] |
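
The `topics` and `proba_topics` columns are parallel lists: the score at position i belongs to the topic at position i. A minimal sketch of how they could be paired again and how topic frequencies could be counted is given below. The rows are made-up stand-ins with rounded scores, not actual records from the dataset.

```python
from collections import Counter

# Made-up rows for illustration only
rows = [
    {"topics": ["pride", "strength"], "proba_topics": [0.045, 0.091]},
    {"topics": ["school"], "proba_topics": [0.091]},
    {"topics": ["pride"], "proba_topics": [0.02]},
]

topic_counts = Counter()
paired = []
for row in rows:
    # zip re-pairs each topic with the score stored at the same position
    scores = dict(zip(row["topics"], row["proba_topics"]))
    paired.append(scores)
    topic_counts.update(scores.keys())

print(paired[0])
print(topic_counts.most_common(2))
```

The same counting, run once per gender file, would give topic frequencies that can be compared across groups.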
