## Header

This notebook follows the steps of our pipeline. We first build a list of keywords by combining [web scraping](#Webscraping) with a personal list of keywords. 
With this list of keywords we [select](#Dataset-selection-from-Quotebank-database) a subset of the Quotebank database. This subset is the starting dataset for our project. 

## General libraries

In [6]:
import pandas as pd
import numpy as np

## Webscraping

We decided to scrape the **usnews.com** website because it has topic pages that list all articles on a given topic.   
For example, the page https://www.usnews.com/topics/subjects/feminism lists, in its *the latest* column, the usnews.com articles that are relevant to the topic of feminism. Below, we call these topic pages primary URLs. We then visit each listed article through its URL (a secondary URL) and retrieve its content to build a corpus of text relevant to our topic. The corpus is saved in *Articles-Contents.txt*.   
The corpus is used to retrieve bigrams. We decided not to count unigrams because they are too general for our purpose: 'women', for example, gives a lot of results that are not always of interest. The quote "a woman, a woman, a woman." from an unknown speaker, for instance, isn't relevant to us.
The bigrams complement a manual list of keywords used to select our quotes of interest. The web-scraped keywords help ensure that we don't miss frequent bigrams and remove some of the bias that exists in a personal keyword list.
Note that at this point we use only usnews.com as a source, and because we were not able to mimic the infinite scrolling, only a limited list of articles per topic is available.

### Libraries

In [2]:
import json 
import requests #http library
import nltk #natural language processing library
nltk.download('stopwords') #common english words to ignore 
from bs4 import BeautifulSoup #extraction from HTML and XML files
from collections import Counter #dictionary subclass for counting hashable objects

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aminamatt/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Functions

In [72]:
def get_urls_usnews(URL):
    '''
    Description: Retrieve the URLs of the articles listed on a usnews.com topic page
    Input: the primary URL string, i.e. the URL of the page listing the relevant articles
    Output: a list of URL strings pointing to the relevant articles
    Requirements: Requests, BeautifulSoup libraries
    Use: This function is meant to scrape the usnews.com website.
    If you want to adapt it to another website, the class name should be adapted.
    '''
    
    headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
    response = requests.get(URL,headers=headers) #http request with a user-agent string to avoid blocking from server
    soup = BeautifulSoup(response.text, 'html.parser') #parse the document with html format
    latest = soup.find('div',{'class':"LoadMoreWrapper__Container-zwyk5c-0 himujt"}) #get all the elements within 'the latest'category
   

    #Find all the urls in the articles of latest category
    list_of_urls = []
    try:
        for a in latest.find_all('a'):
            list_of_urls.append(a['href'])
    except (AttributeError, KeyError): #no 'the latest' container on the page, or a link without href
        print("An exception occurred")
    usnews_urls = list(set(list_of_urls))
    
    return usnews_urls

In [73]:
def article_from_url(url):
    '''
    Description: Retrieve the article content from a URL and remove the copyright mention
    Input: URL string of a single article
    Output: string with all the article text
    Requirements: Requests, BeautifulSoup, Json
    Use: This function is meant to scrape the usnews.com website.
    If you want to adapt it to another website, the copyright sentence should be adapted.
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
    response = requests.get(url,headers=headers) #http request with a user-agent string to avoid blocking from server
    soup = BeautifulSoup(response.text, 'html.parser') #parse the document with html format
    #find the article in the html page
    jsonArticle = json.loads(soup.find(type="application/ld+json").string)
    text=jsonArticle['articleBody']
    #remove the copyright sentence so that it does not appear in the most frequent bigrams
    clean_text = text.replace('.Copyright 2021 The&nbsp;Associated Press. All rights reserved. This material may not be published, broadcast, rewritten or redistributed.',' ').replace('Associated Press',' ').replace('quot',' ')
    
    return clean_text
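
The extraction step above relies on the article text being embedded in a `application/ld+json` script tag under the `articleBody` key. The following is a minimal, standard-library-only sketch of that idea, run on a made-up HTML string instead of a live page. The names `JsonLdExtractor`, `article_body_from_html` and `sample_html` are hypothetical and only serve the illustration; the notebook itself uses BeautifulSoup.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def article_body_from_html(html):
    """Return the 'articleBody' of the first JSON-LD block that has one, else ''."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed metadata instead of failing
        if isinstance(data, dict) and "articleBody" in data:
            return data["articleBody"]
    return ""


# Hypothetical page, made up for illustration (not real usnews.com markup)
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "NewsArticle", "articleBody": "First sentence. Second sentence."}</script>
</head><body><p>Advertisement</p></body></html>
"""
sample_body = article_body_from_html(sample_html)
print(sample_body)
```

Returning an empty string when no `articleBody` is found is a design choice of this sketch: it lets a loop over many URLs continue when one page has no usable metadata.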

In [74]:
def get_all_articles(usnews_urls):
    '''
    Description: Loop over a list of URLs and call the article_from_url function on each
    Input: list of urls strings
    Output: one string with all articles contents appended
    Requirements : Requests, BeautifulSoup, Json
    Use: see article_from_url
    '''
    all_articles = ''
    for url in usnews_urls:
        all_articles = all_articles +' '+article_from_url(url)
    
    return all_articles

In [75]:
def ngram_frequency(text):
    '''
    Description: Count the frequency of bigrams in the text
    Input: A single string containing the text of interest 
    Output: nltk FreqDist mapping each bigram (string,string) to its count (integer) in the text
    Requirement: Nltk with stopwords, Counter 
    Use: this function is set to find bigrams, it can be extended to other n-grams
    '''
    
    #separate the text into words 
    allWords = nltk.tokenize.word_tokenize(text) 
    
    #get rid of 1-letter and 2-letter words
    allLongWords = []
    for word in allWords:
        if len(word) > 2: 
            allLongWords.append(word)   
    #get rid of common english words
    stopwords = nltk.corpus.stopwords.words('english') #list of words such as a, the, and etc..
    allWordExceptStop =[]
    for w in allLongWords:
        if w.lower() not in stopwords:
            allWordExceptStop.append(w)
    #create a list of bigrams words in the text. Can be adapted to n-grams zipping more words
    bigrams = zip(allWordExceptStop, allWordExceptStop[1:])
    #calculate the frequency of each bigram 
    bigramsFreq = nltk.FreqDist(bigrams) 
    return bigramsFreq
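
One consequence of the function above is that bigrams are built after short words and stopwords have been removed, so a pair can join two words that were not adjacent in the original text. The sketch below illustrates this on a toy sentence. It is a simplified stand-in, not the notebook's function: it uses a regular expression instead of `nltk.tokenize.word_tokenize`, and `TOY_STOPWORDS` is a tiny made-up stopword set rather than nltk's English list.

```python
import re
from collections import Counter

# Tiny stand-in for nltk's English stopword list (assumption, illustration only)
TOY_STOPWORDS = {"the", "and", "for", "are", "not", "but", "that", "with"}


def toy_bigram_counts(text):
    """Count adjacent word pairs after dropping short words and stopwords."""
    words = re.findall(r"[A-Za-z'-]+", text)  # crude tokenizer, punctuation is dropped
    kept = [w for w in words if len(w) > 2 and w.lower() not in TOY_STOPWORDS]
    return Counter(zip(kept, kept[1:]))


toy_text = "Women and men pay the same. Women and men pay more."
toy_counts = toy_bigram_counts(toy_text)
print(toy_counts.most_common(2))
```

Here `('Women', 'men')` is counted twice although the text only contains "Women and men". This is worth keeping in mind when such bigrams are later matched literally against quotes.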

### Initialization

**Usnews.com** has a long list of [topics](https://www.usnews.com/topics/subjects). We decided to focus on political women's rights topics and chose the following 5 links. We also tried running the bigram frequency with women's health and women's history included, but too many words related to health or history came up.

In [76]:
#list of primary links containing articles of interest
URL_TOPIC_LIST = ['https://www.usnews.com/topics/subjects/feminism',
            'https://www.usnews.com/topics/subjects/gender',
            'https://www.usnews.com/topics/subjects/gender_bias',
            'https://www.usnews.com/topics/subjects/sexism',
            'https://www.usnews.com/topics/subjects/women\'s rights' ]

### Retrieving the articles of interest

With the functions defined above we scrape the topic pages for article references and retrieve the article contents. At each step the functions apply a selection to skip all other content, i.e. advertisements, galleries, recommended articles, etc.

In [77]:
all_articles = ''
for url_topic in URL_TOPIC_LIST:
    #Retrieve all urls for latest articles in the specific feminism subject page
    usnews_topic_urls = get_urls_usnews(url_topic)
    
    #Retrieve all the articles contents for the latest articles
    all_articles_topic =  get_all_articles(usnews_topic_urls)
    
    #append articles to create one text
    all_articles = all_articles +' '+all_articles_topic

An exception occurred


In [78]:
print(all_articles[0:1250])

  MADRID (AP) — A Spanish foundation on Wednesday awarded one of the country’s most prestigious awards to U.S. writer and activist Gloria Steinem.The jury that decides the Princess of Asturias Awards announced that Steinem has won its annual prize for communication and humanities.It praised 87-year-old Steinem’s long career in journalism, her bestselling books and her dedication to feminism since the 1960s, ensuring her place as “one of the most significant and iconic figures of the women’s rights movement” in the United States.The citation singled out her contribution to the legalization of abortion, pay equality and equal rights, as well as her fight against the death penalty, female genital mutilation and child abuse.The 50,000-euro award ($61,000) is one of eight prizes, including in the arts, social sciences and sports, handed out annually by a foundation named for Spanish Crown Princess Leonor  By ERALDO PERES and DIANE JEANTET,  GOIANIA, BRAZIL (AP) — Struck with grief, tens of 

In [79]:
len(all_articles)

173557

In [80]:
#Export all the articles of interest in a single text file
with open("generated_data/Articles-Contents.txt", "w") as text_file:
    text_file.write(all_articles)

### Frequency computation for bigrams 

In [81]:
#Counting bigram frequencies for all articles of interest
usNewsFEMbigramFreq = ngram_frequency(all_articles)

In [82]:
MAX = 50

#Visualize the most common bigrams
for word, frequency in usNewsFEMbigramFreq.most_common(MAX):
        print('%s;%d' % (word, frequency))

('Los', 'Angeles');23
('gender', 'equality');22
('New', 'York');19
('Black', 'women');13
('child', 'care');13
('White', 'House');12
('Angeles', 'County');12
('health', 'care');11
('men', 'pay');11
('percentage', 'men');11
('Women', 'pay');10
('pay', 'percentage');10
('Hillary', 'Clinton');8
('Best', 'Countries');8
('vice', 'president');8
('sexual', 'harassment');8
('women', 'girls');7
('United', 'States');7
('girls', 'women');7
('Washington', 'D.C.');7
('electoral', 'system');7
('Donald', 'Trump');7
('Supreme', 'Court');6
('coronavirus', 'pandemic');6
('share', 'women');6
('one', 'highest');6
('women', 'according');6
('rates', 'women');6
('Countries', 'rankings');6
('Middle', 'East');6
('Board', 'Supervisors');6
('female', 'mayors');6
('young', 'people');5
('Iron', 'John');5
('York', 'Times');5
('first', 'time');5
('five', 'years');5
('Ford', 'Foundation');5
('social', 'media');5
('gender', 'stereotypes');5
('Hayes', 'said');5
('COVID-19', 'crisis');5
('COVID-19', 'pandemic');5
('sex',

The list of most common bigrams also contains many Named Entities (NE) such as cities and persons. 
We can see *'Los', 'Angeles'* and *'Donald', 'Trump'* among the common bigrams.
Here we take a naive approach and use word capitalization to identify and ignore these names. Note that there are more advanced ways to recognize NE (for example the Stanford NER library), but we believe they would be overkill for our usage.
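
To make the trade-off of the capitalization heuristic concrete, the sketch below applies it to a few counts copied from the output above and reports both the kept and the dropped bigrams. The helper name `split_by_capitalization` is hypothetical. The sketch shows that the heuristic also discards bigrams such as *Black women* that are not named entities but contain a capitalized word.

```python
from collections import Counter

# A few counts copied from the most-common-bigrams output above
sample_counts = Counter({
    ("Los", "Angeles"): 23,
    ("gender", "equality"): 22,
    ("New", "York"): 19,
    ("Black", "women"): 13,
    ("child", "care"): 13,
})


def split_by_capitalization(counts, top_n):
    """Split the top_n bigrams into (kept, dropped) using the first letter of each word."""
    kept, dropped = [], []
    for (first, second), _count in counts.most_common(top_n):
        label = f"{first} {second}"
        if first[0].isupper() or second[0].isupper():
            dropped.append(label)  # treated as a named entity
        else:
            kept.append(label)
    return kept, dropped


kept_bigrams, dropped_bigrams = split_by_capitalization(sample_counts, top_n=5)
print(kept_bigrams)
print(dropped_bigrams)
```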

### Final List

In [83]:
bigram_final_list = []
MAX = 100
for word, frequency in usNewsFEMbigramFreq.most_common(MAX):
    if not word[0][0].isupper() and not word[1][0].isupper(): #ignore the Named Entities
        bigram_final_list.append(word[0]+' '+word[1])

bigram_final_list

['gender equality',
 'child care',
 'health care',
 'men pay',
 'percentage men',
 'pay percentage',
 'vice president',
 'sexual harassment',
 'women girls',
 'girls women',
 'electoral system',
 'coronavirus pandemic',
 'share women',
 'one highest',
 'women according',
 'rates women',
 'female mayors',
 'young people',
 'first time',
 'five years',
 'social media',
 'gender stereotypes',
 'sex discrimination',
 'public schools',
 'states women',
 'death rate',
 'state budgets',
 'gender gap',
 'women representation',
 'became first',
 "women n't",
 "n't matter",
 'lose weight',
 'women rights',
 'women men',
 'two years',
 'women movement',
 'six years',
 '100 million',
 'gender-based violence',
 'women still',
 'best states',
 'top five',
 'top states',
 'states plus',
 'based gender',
 'public school',
 'federal government',
 'education health',
 'proportional electoral',
 'female candidates',
 'regions say',
 'metro area',
 'entirely female',
 'across country',
 'largest cities',


We use this list (in its 150-word-long version) to extend our personal list of bigrams. However, perhaps because of the corpus size, some bigrams are still not of interest. For example, *health care* and *vice president* are ignored because the former is too general and the latter irrelevant.

In [84]:
selected_usnews_keywords = ['gender equality','child care','men pay','percentage men',
              'pay percentage','sexual harassment','women girls','girls women',
              'rates women','women according','female mayors','share women','women movement',
              'see women','gender stereotypes','gender gap', 'women representation','sex discrimination',
              'women rights','woman time','based gender', 'female candidates','gender-based violence','entirely female']
            
#Personal keywords list 
my_bigrams = ['women\'s right','Equal opportunities','Equal rights','Equal status',
           'equal pay','gender gap','Gender discrimination','Gender equality','Sexual harrasment','Women empowerment',
            'women victim','women immigration','Women emancipation','women\'s participation','Western women','non-western woman',
              'Muslim women','Muslim woman', 'Equal wages','Gender equality',
             'gender equity','Men and women', 'women and men', 'women oppression','abortion', 'niqab ban',
           'struggle of girls','struggle of women', 'war against women','oppression of girls',
            'oppression of women','women oppression','women\'s opression','liberate women','religious oppresion',
           'abuse of women','Male oppression','Female oppression','Exploitation of women',
           'Indigenous women','Patriarchal culture']

all_bigrams = my_bigrams + selected_usnews_keywords
all_bigrams

["women's right",
 'Equal opportunities',
 'Equal rights',
 'Equal status',
 'equal pay',
 'gender gap',
 'Gender discrimination',
 'Gender equality',
 'Sexual harrasment',
 'Women empowerment',
 'women victim',
 'women immigration',
 'Women emancipation',
 "women's participation",
 'Western women',
 'non-western woman',
 'Muslim women',
 'Equal wages',
 'Gender equality',
 'gender equity',
 'Men and women',
 'women and men',
 'women oppression',
 'abortion',
 'niqab ban',
 'struggle of girls',
 'struggle of women',
 'war against women',
 'oppression of girls',
 'oppression of women',
 'women oppression',
 "women's opression",
 'liberate women',
 'religious oppresion',
 'abuse of women',
 'Male oppression',
 'Female oppression',
 'Exploitation of women',
 'Indigenous women',
 'Patriarchal culture',
 'gender equality',
 'child care',
 'men pay',
 'percentage men',
 'pay percentage',
 'sexual harassment',
 'women girls',
 'girls women',
 'rates women',
 'women according',
 'female mayors',
 '

The next step saves the keywords to a text file and reads them back. This is done once to save important information, but the notebook could be run directly without the export and import.

In [85]:
#Export all the keywords in a single text file
with open("generated_data/Keywords.txt", "w") as text_file:
    text_file.write(','.join(all_bigrams)) #no trailing comma, so no empty keyword when reading back

In [86]:
#Import all the keywords from the text file
KEYWORDS_LIST = [] 
# opening the text file
with open("generated_data/Keywords.txt", "r") as file:
    # reading each line
    for line in file:
        # splitting the line into keywords
        for word in line.split(','):
            # storing the non-empty keywords
            if word.strip():
                KEYWORDS_LIST.append(word)
#KEYWORDS_LIST
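
When a comma-separated keyword file is read back with `split(',')`, a trailing separator or newline can leave an empty or whitespace-only entry. This matters here because an empty entry joined into a regex alternation matches every string. The sketch below shows one possible round trip that avoids it. The helpers `save_keywords` and `load_keywords` are hypothetical names, and a temporary directory stands in for `generated_data/`.

```python
import os
import tempfile


def save_keywords(path, keywords):
    """Write the keywords as one comma-separated line, without a trailing comma."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(",".join(keywords))


def load_keywords(path):
    """Read the keywords back, dropping empty or whitespace-only entries."""
    with open(path, "r", encoding="utf-8") as handle:
        return [word.strip() for word in handle.read().split(",") if word.strip()]


demo_keywords = ["gender gap", "equal pay", "women's right"]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "Keywords.txt")
    save_keywords(demo_path, demo_keywords)
    reloaded = load_keywords(demo_path)

print(reloaded == demo_keywords)
```

This format assumes that no keyword contains a comma itself, which holds for the lists in this notebook.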

## Dataset selection from Quotebank database

### Libraries

In [87]:
import os 

### Functions

In [88]:
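#Count, for each word of the vocabulary, the number of quotes in the chunk that contain it
#Input: chunk (dataframe with a 'quotation' column), vocabulary (list of strings)
#Output: numpy array with one count per vocabulary word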
def process_chunk(chunk, vocabulary):
    print(f'Processing chunk with {len(chunk)} rows')
    #print(chunk.columns)
    occurences = np.zeros(len(vocabulary))
    for index, word in enumerate(vocabulary):
        occurences[index] = np.sum(chunk['quotation'].str.contains(word)) 
    return occurences

#Select quotes containing keywords
def select_quotes_chunk(chunk, keywords):
    print(f'Processing chunk with {len(chunk)} rows')
    return chunk[chunk['quotation'].str.contains('|'.join(keywords),case=False)]

#Use the selection function on each chunk of the full dataset 
def select_quotes_one_year(path_to_file, vocabulary, chunksize = 10 ** 4):
    with pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=chunksize) as df_reader:
        for index, chunk in enumerate(df_reader):
            if not index==0:
                selected_df = pd.concat([selected_df, select_quotes_chunk(chunk, vocabulary)])
            else: 
                selected_df = select_quotes_chunk(chunk, vocabulary)
    return selected_df

#Use the selection function on each chunk of the full dataset 
#Dumps the selected quotes into a new json file
def select_and_dump(path_to_file, vocabulary, chunksize = 10 ** 4, year = 'replace_me'):
    with pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=chunksize) as df_reader:
        for index, chunk in enumerate(df_reader):
            #Dump selected quotes
            selected_df = select_quotes_chunk(chunk, vocabulary)
            pickle_file_name = year + '_chunk_' + str(index) + '.pkl'
            selected_df.to_pickle('files/'+pickle_file_name)
            #if not index==0:
                #selected_df = pd.concat([selected_df, select_quotes_chunk(chunk, vocabulary)])
            #else: 
               # selected_df = select_quotes_chunk(chunk, vocabulary)
    return selected_df


import random, string

def randomword(length):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(length))
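
The chunked selection above can be illustrated without pandas on a small, made-up file. The sketch below is an assumption-laden stand-in, not the notebook's implementation: `iter_chunks`, `select_quotes`, the demo records and the file name `quotes-demo.json.bz2` are all hypothetical. It also escapes each keyword with `re.escape`, which the notebook does not do. The keywords listed in this notebook do not appear to contain regex metacharacters, so this is only a precaution.

```python
import bz2
import itertools
import json
import os
import re
import tempfile


def iter_chunks(path, chunksize):
    """Yield lists of at most `chunksize` parsed records from a bz2 JSON-lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        while True:
            chunk = [json.loads(line) for line in itertools.islice(handle, chunksize)]
            if not chunk:
                return
            yield chunk


def select_quotes(path, keywords, chunksize=2):
    """Keep the records whose 'quotation' contains any keyword, ignoring case."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    selected = []
    for chunk in iter_chunks(path, chunksize):
        selected.extend(r for r in chunk if pattern.search(r["quotation"]))
    return selected


# Made-up records, for illustration only (not Quotebank data)
demo_records = [
    {"quoteID": "a", "quotation": "We must close the gender gap."},
    {"quoteID": "b", "quotation": "The weather is nice today."},
    {"quoteID": "c", "quotation": "Equal Pay is not negotiable."},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "quotes-demo.json.bz2")
    with bz2.open(demo_path, "wt", encoding="utf-8") as handle:
        for record in demo_records:
            handle.write(json.dumps(record) + "\n")
    demo_selected = select_quotes(demo_path, ["gender gap", "equal pay"])

print([r["quoteID"] for r in demo_selected])
```

Only one chunk is held in memory at a time, which is the reason for reading the yearly files in chunks.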

### Constants 

In [10]:
DATA_FOLDER = 'data/'
QUOTEBANK_2020 = DATA_FOLDER+ "quotes-2020.json.bz2"
QUOTEBANK_2019 = DATA_FOLDER+ "quotes-2019.json.bz2"
QUOTEBANK_2017 = DATA_FOLDER+ "quotes-2017.json.bz2"
QUOTEBANK_2015 = DATA_FOLDER+ "quotes-2015.json.bz2"
QUOTEBANK_2018 = DATA_FOLDER+ "quotes-2018.json.bz2"
QUOTEBANK_2016 = DATA_FOLDER+ "quotes-2016.json.bz2"

PATH = 'generated_data/'

PARQUET_FILE = PATH +  "speaker_attributes.parquet"

KEYWORDS_LIST = ('women\'s right','Equal opportunities','Equal rights','Equal status','equal pay',
              'gender gap','Gender discrimination','Gender equality','Sexual harrassment',
              'Women empowerment','women victim','women immigration','Women emancipation',
              'women\'s participation','Western women','non-western woman','Muslim women',
              'Equal wages','Gender equality','gender equity','Men and women','women and men',
              'women oppression','niqab ban','struggle of girls','struggle of women','war against women',
              'oppression of girls','oppression of women','women oppression','women\'s opression','liberate women',
              'religious oppresion','abuse of women','Male oppression','Female oppression','Exploitation of women',
              'Indigenous women','Patriarchal culture','gender equality','child care','men pay','percentage men',
              'pay percentage','sexual harassment','women girls','girls women',
              'rates women','women according','female mayors','share women','women movement',
              'see women','gender stereotypes','gender gap',
              'women representation','sex discrimination','states women',
              'women rights','woman time',
              'based gender',
              'proportional electoral','female candidates','gender-based violence','entirely female','cities female')

### Selecting and pickling the quotes of interest

Note: This code has to be run only once to create the pickle files containing the quotes of interest. For further use, the dataframes are loaded directly from the pickle files.
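
The "compute once, then load from pickle" workflow can be sketched as a small helper. This is an illustration under assumptions, not the notebook's code: `load_or_compute` and `expensive_selection` are hypothetical names, a plain list stands in for a dataframe, and a temporary directory stands in for `generated_data/`.

```python
import os
import pickle
import tempfile


def load_or_compute(pickle_path, compute):
    """Load the result from pickle_path if it exists, else compute and save it."""
    if os.path.exists(pickle_path):
        with open(pickle_path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(pickle_path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def expensive_selection():
    """Stand-in for a long-running selection over one year of quotes."""
    calls.append(1)
    return [{"quoteID": "a", "quotation": "gender gap"}]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "QOI_demo.pkl")
    first = load_or_compute(cache_path, expensive_selection)
    second = load_or_compute(cache_path, expensive_selection)  # served from the pickle

print(len(calls), first == second)
```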

In [2]:
dataframesNames = ('QOI_2015_DF','QOI_2016_DF','QOI_2017_DF','QOI_2018_DF','QOI_2019_DF','QOI_2020_DF')

In [3]:
# %time QOI_2015_DF = select_quotes_one_year(QUOTEBANK_2015,KEYWORDS_LIST,10 ** 4)
# %time QOI_2016_DF = select_quotes_one_year(QUOTEBANK_2016,KEYWORDS_LIST,10 ** 4)
# %time QOI_2017_DF = select_quotes_one_year(QUOTEBANK_2017,KEYWORDS_LIST,10 ** 4)
# %time QOI_2018_DF = select_quotes_one_year(QUOTEBANK_2018,KEYWORDS_LIST,10 ** 4)
# %time QOI_2019_DF = select_quotes_one_year(QUOTEBANK_2019,KEYWORDS_LIST,10 ** 4)
# %time QOI_2020_DF = select_quotes_one_year(QUOTEBANK_2020,KEYWORDS_LIST,10 ** 4)

#for i in range(len(dataframesNames)):
#    dataframes[i].to_pickle('generated_data/'+dataframesNames[i]+'.pkl')

In [7]:
PATH = 'generated_data/'

In [8]:
# Concatenate the dataframes from each pickle file into a single dataframe.
df = pd.concat([pd.read_pickle(PATH+ fp +'.pkl') for fp in dataframesNames], ignore_index=True)

In [13]:
df["keywords"] = df["quotation"].apply(lambda x : [keyword for keyword in KEYWORDS_LIST if keyword in x])
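
Note that the `in` test in the cell above is case-sensitive, whereas the selection step matches keywords with `case=False`. A quote selected through a differently cased match can therefore end up with an empty keyword list. The sketch below shows one possible case-insensitive variant. The function name `tag_keywords` and the demo inputs are hypothetical, and lowercasing the tags is a choice of this sketch.

```python
def tag_keywords(quotation, keywords):
    """Return the lowercased keywords found in the quotation, ignoring case, without duplicates."""
    lowered = quotation.lower()
    seen = set()
    tags = []
    for keyword in keywords:
        key = keyword.lower()
        if key in lowered and key not in seen:
            seen.add(key)
            tags.append(key)
    return tags


demo_tags = tag_keywords(
    "Gender Equality and equal pay go together.",
    ["Gender equality", "gender equality", "equal pay", "niqab ban"],
)
print(demo_tags)
```

With a dataframe, such a function could be passed to `df["quotation"].apply(...)` in the same way as the lambda above.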

In [14]:
df.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,keywords
0,2015-03-09-004706,Anything less than women winning 50 per cent o...,Katy Gallagher,[Q463507],2015-03-09 12:30:00,1,"[[Katy Gallagher, 0.5872], [None, 0.4128]]",[http://www.smh.com.au/act-news/women-need-to-...,E,[gender equality]
1,2015-04-24-025718,I'd like to congratulate all the winners and f...,Helena Morrissey,[Q23762081],2015-04-24 15:33:00,1,"[[Helena Morrissey, 0.8706], [None, 0.1294]]",[http://www.cipd.co.uk/PM/peoplemanagement/b/w...,E,[gender equality]
2,2015-07-16-044620,I think what Deepika has spoken in the video m...,Kalki Koechlin,[Q3192216],2015-07-16 16:41:07,1,"[[Kalki Koechlin, 0.6377], [None, 0.3623]]",[http://www.pinkvilla.com/entertainmenttags/ka...,E,[Women empowerment]
3,2015-09-11-052815,if advocating for equal pay for equal work is ...,Hillary Clinton,[Q6294],2015-09-11 14:17:08,1,"[[Hillary Clinton, 0.8831], [None, 0.1105], [D...",[http://www.wrn.com/2015/09/hillary-clinton-ra...,E,[equal pay]
4,2015-04-23-037713,Men and women are understandably upset if they...,Jim McDermott,"[Q321457, Q6196778]",2015-04-23 21:52:22,1,"[[Jim McDermott, 0.629], [John F. Kerry, 0.190...",[http://www.atlanticcouncil.org/en/blogs/new-a...,E,[Men and women]


In [95]:
print(f'The dataframe has {len(df)} entries')

The dataframe has 87161 entries


In [96]:
pd.options.display.max_colwidth = 200
df.head()['quotation']

0    Anything less than women winning 50 per cent of new seats will be a loss not only for a progressive city's progress towards true gender equality but it would also be a loss for good governance in ...
1    I'd like to congratulate all the winners and finalists on their success. They have demonstrated clear leadership by moving women's progression from a `diversity' initiative to a core business prio...
2    I think what Deepika has spoken in the video makes sense. I do understand the counter argument too where everyone has been saying that had men said the same lines about having sex outside marriage...
3                                                                                    if advocating for equal pay for equal work is playing the gender card, deal me in. I am ready to play as hard as I can.
4      Men and women are understandably upset if they see a company close down and jobs lost. It's only natural people would look around and in their distress they find something o

Since the dates in the quotes don't seem to be a problem, and our current method to parse dates sometimes gives a memory overflow, we currently won't be removing them but will remove only the html 