<a href="https://colab.research.google.com/github/morwald/ada_project/blob/master/gender_topics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of gender distribution in UK's leading newspapers
# Gender representation

## Content
1. [Setup](#setup)   
    1.1 [Global](#global_setup)  
    1.2 [Local](#local_setup)

## 1. Setup
<a id="setup"></a>

### 1.1 Global
<a id="global_setup"></a>

In [1]:
# Set to True to run this notebook in Google Colab
use_colab = True

# Mount the shared EPFL Google Drive
if use_colab:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    %cd /content/drive/Shareddrives/ADA-project
    !pip install pandas==1.0.5 # downgrade pandas for chunk processing support

Mounted at /content/drive
/content/drive/Shareddrives/ADA-project


In [2]:
# Defined paths for the data
from scripts.path_defs import *

# Defined newspapers and urls
from scripts.newspapers import *

# Globally used functions
from scripts.utility_functions import load_mini_version_of_data
from scripts.utility_functions import convert_to_1Dseries
from scripts.utility_functions import process_data_in_chunks
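
The project's `process_data_in_chunks` helper lives in `scripts/utility_functions.py` and is not shown here. As a rough illustration of the general idea of chunked processing, the following minimal sketch reads a bz2-compressed JSON-lines file in fixed-size batches using only the standard library. The name `iter_json_chunks` and its `chunk_size` parameter are hypothetical and are not part of the project's code.

```python
import bz2
import json
from itertools import islice


def iter_json_chunks(path, chunk_size=1000):
    """Yield lists of at most `chunk_size` parsed records from a bz2 JSON-lines file.

    Hypothetical helper for illustration; not the project's process_data_in_chunks.
    """
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        while True:
            # islice pulls at most chunk_size lines without loading the whole file
            lines = list(islice(handle, chunk_size))
            if not lines:
                break
            yield [json.loads(line) for line in lines]
```

Each yielded chunk is a plain list of dictionaries, so it could be passed to `pd.DataFrame(chunk)` if a DataFrame is needed for further processing.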

### 1.2 Local 
<a id="local_setup"></a>

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json
import bz2


In [5]:
! pip install empath

Collecting empath
  Downloading empath-0.89.tar.gz (57 kB)
Building wheels for collected packages: empath
  Building wheel for empath (setup.py) ... done
  Created wheel for empath: filename=empath-0.89-py3-none-any.whl size=57821 sha256=5756f7a79aff2b9ab397eb0ae060e1ffda8c8457bf8bbd9da54e1a2ea9ed92b0
  Stored in directory: /root/.cache/pip/wheels/2b/78/a8/37d4505eeae79807f4b5565a193f7cfcee892137ad37591029
Successfully built empath
Installing collected packages: empath
Successfully installed empath-0.89


In [6]:
from empath import Empath
lexicon = Empath()

In [11]:
quotes_df = pd.read_json(MERGED_QUOTES_2020_PATH, lines=True, compression='bz2')
doc_complete = quotes_df.quotation.tolist()

doc_complete[:3]

['[ The delay ] will have an impact [ on Slough ] but that might be mitigated by the fact we are going to have this Western Rail Link to Heathrow. It looks like that may come in sooner than Crossrail.',
 "And for the record, Eamonn Holmes made me laugh, he lightened a very emotional moment and I'm very happy that he did.",
 'And help he always did. For someone who preferred to be behind the scenes, he was at the center of absolutely everything.']

In [16]:
doc_complete[1]

"And for the record, Eamonn Holmes made me laugh, he lightened a very emotional moment and I'm very happy that he did."

In [17]:
lexicon.create_category("metoo",["metoo","#metoo","consent", "harassment", "sexual assault", "sexual misconduct"])

["consent", "harassment", "claims", "death_sentence", "permission", "obligation", "violation", "complaint", "consequence", "interference", "disobedience", "disregard", "liable", "ownership", "policy", "partnership", "commission", "restriction", "punishments", "fraud", "conduct", "spouse", "affairs", "limitation", "free_will", "refusal", "restrictions", "obligations", "execution", "creditor", "penalty", "relations", "termination", "association", "testimony", "treason", "offender", "privileges", "conditions", "expressly", "necessity", "violating", "superiors", "regards", "witness", "death_penalty", "prejudice", "associates", "liability", "permit", "compensation", "authorization", "justification", "crimes", "claim", "discrimination", "debtor", "authorized", "adultery", "benefit", "terminate", "discretion", "felony", "conception", "testator", "imposed", "Therefore", "imprisonment", "repercussions", "regard", "involvement", "other_means", "thereby", "principles", "punishment", "misbehavior"

In [18]:
cat = lexicon.analyze(doc_complete[1], normalize=True)


In [29]:
cate = { key : value for key,value in cat.items() if value > 0}
list(cate.keys())

['wedding',
 'cheerfulness',
 'suffering',
 'optimism',
 'childish',
 'celebration',
 'sadness',
 'emotional',
 'party',
 'positive_emotion']
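
To make the filtering step above more concrete, here is a simplified stand-in for lexicon-based scoring. It is not Empath's actual implementation: the two-category `TOY_LEXICON` and the function `analyze_toy` are hypothetical, and the sketch assumes that normalization means dividing each category count by the number of whitespace-separated tokens.

```python
from collections import Counter

# Hypothetical two-category lexicon; Empath ships with far larger built-in categories.
TOY_LEXICON = {
    "cheerfulness": {"laugh", "happy", "smile"},
    "sadness": {"cry", "emotional", "grief"},
}


def analyze_toy(text, lexicon=TOY_LEXICON, normalize=True):
    """Count category hits per whitespace token; optionally divide by token count."""
    tokens = text.lower().split()
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    scores = {category: float(counts[category]) for category in lexicon}
    if normalize and tokens:
        scores = {category: value / len(tokens) for category, value in scores.items()}
    return scores


scores = analyze_toy("he made me laugh and I am happy")
# Same filter as in the cell above: keep only categories with a non-zero score
nonzero = {key: value for key, value in scores.items() if value > 0}
print(nonzero)  # {'cheerfulness': 0.25}
```

Two of the eight tokens match the `cheerfulness` word list, which gives the score of 0.25; categories without a match are dropped by the filter.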

In [55]:
def add_topics(path_in, path_males, path_females, path_others):
    """Tag each quote with its Empath topics and write it to a per-gender bz2 file."""
    counter = 0
    with bz2.open(path_in, 'rb') as s_file, \
         bz2.open(path_males, 'wb') as male_file, \
         bz2.open(path_females, 'wb') as female_file, \
         bz2.open(path_others, 'wb') as other_file:
        for line in s_file:
            instance = json.loads(line)  # one quote record per line

            quote = instance['quotation']
            gender = instance['gender']  # list of gender labels

            # Keep only the Empath categories with a non-zero normalized score
            categories = lexicon.analyze(quote, normalize=True)
            cat = {key: value for key, value in categories.items() if value > 0}
            instance['topics'] = list(cat.keys())
            instance['proba_topics'] = list(cat.values())

            if not instance['topics']:  # skip quotes without any topic
                continue

            record = (json.dumps(instance) + '\n').encode('utf-8')
            if 'female' in gender:
                female_file.write(record)
            elif 'male' in gender:
                male_file.write(record)
            else:
                other_file.write(record)

            # Stop early once more than 100 quotes have been written
            counter += 1
            if counter > 100:
                break
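
The gender routing in `add_topics` relies on Python's `in` operator, whose behavior depends on the type of `gender`. The sketch below isolates that logic in a hypothetical helper, `route_by_gender`, assuming `gender` is a list of labels such as `['female']`. With a list, `in` tests for an exact element match, so a label like `'transgender female'` falls through to the "other" branch.

```python
def route_by_gender(gender):
    """Return 'female', 'male', or 'other' using the same membership tests as add_topics."""
    if "female" in gender:
        return "female"
    if "male" in gender:
        return "male"
    return "other"


print(route_by_gender(["female"]))              # female
print(route_by_gender(["male"]))                # male
print(route_by_gender(["transgender female"]))  # other
```

If `gender` were a plain string instead of a list, `in` would perform a substring test, and `'male' in 'female'` would be true. Checking `'female'` first keeps the routing correct in that case as well.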

In [56]:
add_topics(MERGED_QUOTES_2020_PATH, 'Data with topics/QUOTES_MALES_2020', 'Data with topics/QUOTES_FEMALES_2020', 'Data with topics/QUOTES_OTHERS_2020')

In [58]:
topics_others = pd.read_json('Data with topics/QUOTES_OTHERS_2020', lines=True, compression='bz2')

In [59]:
topics_others.head()

| | quoteID | quotation | speaker | date | numOccurrences | urls | newspapers | qid | gender | nationality | occupation | topics | proba_topics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-11-025317 | He was firing some banter back. Kind of left m... | Jamie Clayton | 2020-02-11 13:18:13 | 2 | [https://www.irishmirror.ie/showbiz/celebrity-... | [Daily Mirror] | Q6146739 | [transgender female] | [United States of America] | [model, actor, television actor, film actor] | [childish, surprise, shape_and_size, weapon, c... | [0.055555555555555004, 0.055555555555555004, 0... |
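
The `topics` and `proba_topics` columns in the output above are parallel lists, so the score at position `i` belongs to the topic at position `i`. A minimal sketch of how the two could be paired up again and ranked is shown below. The helper `top_topics` and the example record with its made-up scores are hypothetical and only mirror the shape of one row.

```python
def top_topics(record, k=3):
    """Return the k (topic, score) pairs with the highest scores."""
    pairs = zip(record["topics"], record["proba_topics"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical record with made-up scores, shaped like one row of the output above
example = {
    "topics": ["childish", "surprise", "weapon"],
    "proba_topics": [0.05, 0.10, 0.02],
}
print(top_topics(example, k=2))  # [('surprise', 0.1), ('childish', 0.05)]
```

Despite the column name, the values come from `lexicon.analyze(..., normalize=True)`, so they are normalized category scores rather than probabilities from a fitted model.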
