# *Analysis of speech behaviours between genders*

## Context

In this project, we analyze data from Quotebank. Quotebank, as the name suggests, is an open corpus of 178 million quotations dating from 2008 to 2020. We focus only on the most recent quotations, from 2015 to 2020.

We are interested in using this dataset to answer the following question: Do speech behaviours related to confidence and uncertainty vary between men and women?

To answer this question, we'll go through the following points:

1. To what extent can we observe the differences in communicative acts in relation to gender within a professional area? Are there noticeable differences between those professional areas?
2. What are the roles of environment (nationality), culture/tradition (religion, ethnic groups), and education (whether the speaker obtained an academic degree) in determining those differences in speech between men and women? How are the lines drawn between the language we use and the environment around us?
3. Has there been a possible change over time (from 2015 to 2020)?

To access information about the speakers (e.g., their gender), we use the open-source data from Wikidata (https://www.wikidata.org/wiki/Wikidata:Main_Page).

To analyse speech uncertainty, we use an uncertainty detection classifier, adapted from the following paper "P. A. Jean, S. Harispe, S. Ranwez, P. Bellot, and J. Montmain, “[Uncertainty detection in natural language: A probabilistic model](https://www.researchgate.net/publication/303842922)” ACM Int. Conf. Proceeding Ser., vol. 13-15-June, no. June, 2016, doi: 10.1145/2912845.2912873".

## Table of contents

[1. Pre-processing of the data](#pre-processing) 
- [Imports](#1imports)
- [Pathways](#1pathways)
- [Functions](#1functions)
- [Merging files from wikidata into one file containing the 9 million speakers](#1merging)
- [1.1 Loading and pre-processing of Quotebank data](#1.1)
- [1.2 Analysis of the quotes from Quotebank](#1.2)
- [1.3 Loading wikidata labels](#1.3) 
- [1.4 Pre-processing of wikidata](#1.4)
- [1.5 Exploratory Data Analysis of wikidata](#1.5)

[2. Creation of our sub data frames](#dataframes)
- [Imports](#2imports)
- [Functions](#2functions)
- [2.1 Creation of professional fields](#2.1)
- [2.2 Creation of sub-dataframes](#2.2) 
- [2.3 Creation of Data Frames with English Quotes only](#2.3)
- [2.4 Saving of all sub data frames](#2.4)

[3. Classification of the quotes](#classifier)
- [Pathways](#3pathways)
- [Functions](#3functions)
- [3.1 Reading of all sub data frames](#3.1)
- [3.2 Creation of the text files](#3.2)
- [3.3 Use of the uncertainty detection classifier](#3.3) 

[4. Results](#results)
- [Imports](#4imports)
- [Functions](#4functions)
- [4.1 Gender distribution across occupations](#4.1)
- [4.2 Analysis of the gender distribution per professions](#4.2) 
- [4.3 Background influence](#4.3)
- [4.4 Possible variation from 2015 to 2020](#4.4)

[5. Statistical analysis](#statanalysis)
- [5.1 Analysis of the gender distribution](#5.1)
- [5.2 Analysis of the gender distribution per profession](#5.2)
- [5.3 Analysis of background influence regarding gender distribution](#5.3) 
- [5.4 Possible variation from 2015 to 2020](#5.4)

[6. Conclusion](#conclusion)

## 1. Pre-processing of the data <a class = anchor id="pre-processing"></a>

### Imports <a class = anchor id="1imports"></a>

Let's start by importing our libraries. 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from langdetect import detect, DetectorFactory
import requests
import plotly
import plotly.graph_objects as go
from plotly.subplots import make_subplots

### Pathways <a class = anchor id="1pathways"></a>

To run this project, download the directories and files from the following drive: https://drive.google.com/drive/folders/1UgvnLUFhs14NDcZYH6NuZx2f_YC5i06N?usp=sharing. All directories must be placed in the same directory as this notebook. The folders "Data", "Data_parquet" and "Images" are needed for [1. Pre-processing of the data](#pre-processing); the directories "Classifier" and "txt_files" are needed for [3. Classification of the quotes](#classifier); the "Small Data" directory is needed for [5. Statistical analysis](#statanalysis).

In [2]:
# DATA_PATH contains all data frames
DATA_PATH = 'Data/'
# PATH_PARQUET contains all the data from wikidata
PATH_PARQUET = 'Data_parquet/'
# IMAGE_PATH contains all saved images
IMAGE_PATH = 'Images/'

### Functions <a class = anchor id="1functions"></a>

The following functions are needed for the pre-processing.

In [3]:
def saving_wikidata(path):
    """
    Merges all the wikidata parquet files into one dataset saved as a pickle. 
    This makes the data quicker and easier to load afterwards.
    Input:
        * path : pathway where to save the pickle file
    """
    parts = []
    for i in range(1, 16):
        # File indices are zero-padded to five digits (part-00001 ... part-00015)
        DATA_FILE = 'part-{:05d}-0d587965-3d8f-41ce-9771-5b8c9024dce9-c000.snappy.parquet'.format(i)
        parts.append(pd.read_parquet(PATH_PARQUET + DATA_FILE))
    # Concatenate once at the end instead of growing a data frame in the loop
    wikidata_all = pd.concat(parts)
    wikidata_all.to_pickle(path)

def det(x):
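    """
    Detects the language of a quote.
    Input:
        * x : one quote
    Output:
        * lang : language of the quote ('Other' if the detection fails)
    """
    # Fixing the seed makes langdetect deterministic (`seed` is a global set in the notebook)
    DetectorFactory.seed = seed
    try:
        lang = detect(x)
    except Exception:
        lang = 'Other'
    return lang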

### Merging files from wikidata into one file containing the 9 million speakers <a class = anchor id="1merging"></a>

Here, we create the file "wikidata_all.pck" containing all the raw data from wikidata. We already ran this cell and the file can be found in "Data". As this is a huge dataset, we save it as a pickle, which is much less computationally costly to load than re-reading the raw parquet files each time.

In [4]:
""" Only intended to be run once
saving_wikidata(DATA_PATH + "wikidata_all.pck")""";

### 1.1 Loading and pre-processing of Quotebank data <a class = anchor id="1.1"></a>

The function `clean_quotebank` (in [Functions](#2functions) below) performs a first cleaning of the Quotebank dataset by discarding quotes from unknown speakers, as they are not useful for our study. We also drop the quotes from uncertain speakers, i.e. when the probability of the most likely speaker does not exceed 0.5. As this is a computationally costly operation, the cleaning is performed at the same time as we create our Quotebank sub dataframes ([2. Creation of our sub data frames](#dataframes)).
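The two cleaning rules can be illustrated on a few invented rows. This is only a minimal sketch: the toy values below are assumptions made for the illustration (including probabilities stored as strings), and the actual function is defined in the Functions section.

```python
import pandas as pd

# Toy rows mimicking the two Quotebank fields used by the cleaning step (values are invented)
toy_quotes = pd.DataFrame({
    "speaker": ["Jane Doe", "None", "John Roe"],
    "probas": [
        [["Jane Doe", "0.82"], ["None", "0.18"]],
        [["None", "0.61"], ["Jane Doe", "0.39"]],
        [["John Roe", "0.45"], ["None", "0.40"], ["Jane Doe", "0.15"]],
    ],
})

# Probability attached to the most likely speaker (first entry of `probas`)
top_proba = toy_quotes["probas"].apply(lambda p: float(p[0][1]))

# Rule 1: the speaker must be known; rule 2: the top probability must exceed 0.5
known_speaker = ~toy_quotes["speaker"].isin(["None", None])
kept = toy_quotes[known_speaker & (top_proba > 0.5)]
print(kept["speaker"].tolist())
```

Only the first toy row survives: the second has an unknown speaker and the third has a top probability below the threshold.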

### 1.2 Analysis of the quotes from Quotebank <a class = anchor id="1.2"></a>

We take a random sample of our dataframe to analyze the language distribution in the dataset. More information about this dataframe (saved as "df_no_conditions_2020.pck" in Data) is provided in [2. Creation of our sub data frames](#dataframes). We assume that this random sample is representative of the whole dataset.

In [4]:
df_no_conditions_2020 = pd.read_pickle(DATA_PATH + "df_no_conditions_2020.pck")

For reproducibility, a seed is fixed for the entire notebook (`seed` = 2). 

In [3]:
seed = 2

In [6]:
small_df = df_no_conditions_2020.sample(n = 300, random_state = seed)

Here is a glimpse of some quotes.

In [7]:
print(small_df.shape)
small_df.head(3)

(300, 24)


| | quoteID | quotation | speaker | qids | date | numOccurrences | probas | urls | phase | aliases | ... | ethnic_group | US_congress_bio_ID | occupation | party | academic_degree | id | label | candidacy | type | religion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65382 | 2020-01-16-037739 | I will seek to mend fences with our neighbouri... | John Campion | Q24572630 | 2020-01-16 06:01:00 | 1 | [[John Campion, 0.6249], [None, 0.3751]] | [http://www.shropshirestar.com/news/crime/2020... | E | [John-Paul Campion] | ... | | | [Q82955] | [Q9626] | | Q24572630 | John Campion | | item | |
| 27097 | 2020-02-13-119432 | We're training at Aston Villa again, mid journ... | Chris Beech | Q5105871 | 2020-02-13 17:27:16 | 1 | [[Chris Beech, 0.6725], [None, 0.238], [Dean S... | [https://www.birminghammail.co.uk/sport/footba... | E | | ... | | | [Q937857] | | | Q5105871 | Chris Beech | | item | |
| 25289 | 2020-01-23-019530 | Exports of firearms and related items that do ... | R. Clarke Cooper | Q7273550 | 2020-01-23 22:15:25 | 1 | [[R. Clarke Cooper, 0.8201], [None, 0.1799]] | [https://www.cnbc.com/2020/01/23/gun-exports-g... | E | [René Clarke Cooper] | ... | | | [Q11986654, Q189290, Q82955] | [Q29468] | | Q7273550 | R. Clarke Cooper | | item | [Q682443] |


We use langdetect to identify the language of each quotation.

In [8]:
small_df['quote_language'] = small_df['quotation'].apply(lambda x: det(x))

Let's print all languages contained in `small_df`.

In [9]:
small_df['quote_language'].unique()

array(['en', 'fr', 'et', 'ca', 'sw'], dtype=object)

We see that not all quotes are in English. Therefore, in part [2.3 Creation of Data Frames with English Quotes only](#2.3), we will select only the English quotes from our data frames.

Before the visualisation, we define a color palette that is used consistently throughout the project and is meant to remain readable for color-blind readers.

In [6]:
colors=["#FF0000", "#FF1B1B", "#FF3637", "#FF5252", "#FF6D6E", "#FE8889", "#FEA3A5", "#FEBFC0", "#FEDADC", "#FEF5F7"]

Let's visualise the languages in this sample of our dataset.

In [11]:
language_dist = small_df.groupby('quote_language').size()
language_dist = language_dist.div(language_dist.sum(axis = 0), axis = 0)

In [12]:
df_lang = pd.DataFrame()
df_lang['languages'] = language_dist.sort_values(ascending = False)
df_lang.round(decimals = 3)

| quote_language | languages |
|---|---|
| en | 0.983 |
| fr | 0.007 |
| ca | 0.003 |
| et | 0.003 |
| sw | 0.003 |


The language names are then retrieved from the "List of ISO 639-1 codes" (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).

In [13]:
df_lang.index = ['English', 'French', 'Catalan', 'Estonian', 'Swahili']
df_lang.round(decimals = 3)

| | languages |
|---|---|
| English | 0.983 |
| French | 0.007 |
| Catalan | 0.003 |
| Estonian | 0.003 |
| Swahili | 0.003 |


We now display the languages in a plot.

In [14]:
trace_lang = go.Bar(x = df_lang.index, y = df_lang['languages'], marker_color = colors[0])

my_layout = {
    'title': 'Language distribution of the quotations',
    'xaxis': {'title': 'Languages'},
    'yaxis': {'title': 'Share of quotes (log scale)'},
    }

fig = go.Figure()
fig.add_trace(trace_lang)
fig.update_layout(my_layout, title_x = 0.5)
fig.update_yaxes(type = "log")

fig.show()

We save the image for the data story. All images have already been created and can be found in the folder "Images".

In [15]:
"""html_path = IMAGE_PATH + "language_dist_plot.html"
fig.write_html(html_path)""";

We observe that the vast majority of the sampled quotes are in English (about 98%), but the dataset also contains quotes in other languages such as French and Swahili.
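As a side note, the shares computed above with `groupby(...).size()` followed by a division can also be obtained in one step with `value_counts(normalize=True)`. The sketch below uses ten invented language codes, not the actual sample.

```python
import pandas as pd

# Hypothetical detected languages for ten quotes (not taken from the dataset)
detected = pd.Series(["en", "en", "fr", "en", "en", "en", "sw", "en", "en", "en"])

# Share of each language, sorted from most to least frequent
shares = detected.value_counts(normalize=True)
print(shares.round(3))
```

With such small samples, rare languages are represented by one or two quotes only, so their shares should be read as rough orders of magnitude.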

### 1.3 Loading wikidata labels <a class = anchor id="1.3"></a>

Wikidata entities are encoded as identifiers called QIDs. To map them back to human-readable labels, we load the following dataset.

In [5]:
qid_label = pd.read_csv(DATA_PATH+'wikidata_labels_descriptions_quotebank.csv.bz2',
                        compression = 'bz2', index_col = 'QID')

Let's have a quick look at this data frame.

In [16]:
qid_label.head()

| QID | Label | Description |
|---|---|---|
| Q31 | Belgium | country in western Europe |
| Q45 | Portugal | country in southwestern Europe |
| Q75 | Internet | global system of connected computer networks |
| Q148 | People's Republic of China | sovereign state in East Asia |
| Q155 | Brazil | country in South America |
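Since most Wikidata fields hold lists of QIDs, a small helper can make them readable. The sketch below is only an illustration: `qids_to_labels` is a hypothetical helper (it is not part of the notebook), and `toy_labels` is a three-row stand-in for `qid_label` built from the rows shown above.

```python
import pandas as pd

# Small stand-in for `qid_label`, built from three of the rows shown above
toy_labels = pd.DataFrame(
    {"Label": ["Belgium", "Portugal", "Brazil"]},
    index=pd.Index(["Q31", "Q45", "Q155"], name="QID"),
)

def qids_to_labels(qids, labels):
    """Hypothetical helper: translate a list of QIDs into labels, keeping unknown QIDs as-is."""
    return [labels["Label"].get(q, q) for q in qids]

print(qids_to_labels(["Q155", "Q31", "Q999999999"], toy_labels))
```

Falling back to the raw QID avoids a `KeyError` when an identifier is absent from the label file.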


### 1.4 Pre-processing of wikidata <a class = anchor id="1.4"></a>

To be able to find information about the speakers, we use wikidata. Let's start by creating a data frame of all the speakers from our pickle file "wikidata_all.pck".

In [17]:
wikidata_all = pd.read_pickle(DATA_PATH + "wikidata_all.pck")
wikidata_all.shape

(8583613, 15)

We now check for possible duplicate speakers.

In [18]:
wikidata_all['id'].is_unique

True

As this returns `True`, we can see that there are no duplicate speakers in wikidata. We'll now create a new dataframe (`cleaned_wikidata`), that will contain a clean version of the data (without modifying the original one).
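Had the check returned `False`, the duplicated entries could have been inspected and removed as sketched below. The toy data frame and its values are invented for the illustration.

```python
import pandas as pd

# Toy speakers table in which the id "Q2" appears twice (invented values)
toy_wiki = pd.DataFrame({
    "id": ["Q1", "Q2", "Q2", "Q3"],
    "label": ["A", "B", "B (copy)", "C"],
})

print(toy_wiki["id"].is_unique)

# Show every row involved in a duplication, then keep the first occurrence only
duplicates = toy_wiki[toy_wiki["id"].duplicated(keep=False)]
print(duplicates)
deduplicated = toy_wiki.drop_duplicates(subset="id", keep="first")
print(deduplicated["id"].is_unique)
```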

In [19]:
cleaned_wikidata = wikidata_all.copy(deep = True)

Let's visualise this data frame.

In [20]:
cleaned_wikidata.head()

| | aliases | date_of_birth | nationality | gender | lastrevid | ethnic_group | US_congress_bio_ID | occupation | party | academic_degree | id | label | candidacy | type | religion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | [Q6581097] | 985453603 | | | | | | Q45441526 | Cui Yan | | item | |
| 1 | | | [Q9903] | [Q6581097] | 1008699604 | | | | | | Q45441555 | Guo Ziyi | | item | |
| 2 | | | [Q9903] | [Q6581097] | 1008699709 | | | | | | Q45441562 | Wan Zikui | | item | |
| 3 | | | [Q9903] | [Q6581097] | 1008699728 | | | | | | Q45441563 | Lin Pei | | item | |
| 4 | | | [Q9683] | [Q6581097] | 985261661 | | | | | | Q45441565 | Guan Zhen | | item | |


We now remove the rows where values that are mandatory for our study, namely the speaker's name and gender, are missing.

In [21]:
cleaned_wikidata = cleaned_wikidata[~cleaned_wikidata.label.isin([None, "None"])]
print(cleaned_wikidata.shape)
cleaned_wikidata = cleaned_wikidata[~cleaned_wikidata.gender.isin([None, "None"])]
print(cleaned_wikidata.shape)

(8113215, 15)
(6288640, 15)


We see that 470,398 speakers had no name and that, among the remaining ones, 1,824,575 had no gender. Our cleaned data frame contains 6,288,640 speakers.
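These counts follow directly from the three shapes printed above, as the short computation below shows.

```python
# Row counts taken from the shapes printed by the cells above
rows_raw = 8_583_613
rows_with_label = 8_113_215
rows_with_label_and_gender = 6_288_640

missing_label = rows_raw - rows_with_label
missing_gender = rows_with_label - rows_with_label_and_gender
kept_share = rows_with_label_and_gender / rows_raw

print(missing_label, missing_gender, f"{kept_share:.1%}")
```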

### 1.5 Exploratory Data Analysis of wikidata <a class = anchor id="1.5"></a>

We will now analyse the genders present in wikidata.

In [22]:
cleaned_wikidata['gender'] = cleaned_wikidata['gender'].apply(lambda x: x[0])
print("There are {} different unique genders in wikidata".format(cleaned_wikidata['gender'].unique().shape[0]))
cleaned_wikidata['gender'].unique()

There are 32 different unique genders in wikidata


array(['Q6581097', 'Q6581072', 'Q179294', 'Q1052281', 'Q48270',
       'Q2449503', 'Q18116794', 'Q12964198', 'Q15145779', 'Q189125',
       'Q859614', 'Q1097630', 'Q44148', 'Q1289754', 'Q301702',
       'Q106299064', 'Q27679684', 'Q15145778', 'Q52261234', 'Q207959',
       'Q505371', 'Q7130936', 'Q43445', 'Q96000630', 'Q27679766',
       'Q1984232', 'Q93954933', 'Q746411', 'Q48279', 'Q3177577',
       'Q1775415', 'Q6636'], dtype=object)

Let's compute and then plot their distribution.

In [23]:
genders_dist = cleaned_wikidata.groupby('gender').size()
genders_dist = genders_dist.div(genders_dist.sum(axis = 0), axis = 0)
genders_dist.index = qid_label.loc[genders_dist.index].Label.values

In [24]:
df_genders = pd.DataFrame()
df_genders['genders'] = genders_dist.sort_values(ascending=False)

In [25]:
trace_genders = go.Bar(x = df_genders.index, y = df_genders['genders'], marker_color = colors[0])

my_layout = {
    'title': 'Gender distribution of the speakers',
    'xaxis': {'title': 'Genders'},
    'yaxis': {'title': 'Speakers (log scale)'},
    }

fig = go.Figure()
fig.add_trace(trace_genders)
fig.update_layout(my_layout)
fig.update_layout(title_x = 0.5)
fig.update_yaxes(type = "log")
fig.update_xaxes(tickangle = 45)

fig.show()

The figure is then saved.

In [26]:
"""html_path = IMAGE_PATH + "genders_dist_plot.html"
fig.write_html(html_path)""";

Let's now print the descriptions of those genders.

In [28]:
qid_label.loc[cleaned_wikidata['gender'].unique()]

| QID | Label | Description |
|---|---|---|
| Q6581097 | male | to be used in "sex or gender" (P21) to indicat... |
| Q6581072 | female | to be used in "sex or gender" (P21) to indicat... |
| Q179294 | eunuch | castrated male human |
| Q1052281 | transgender female | female person who was assigned a different gen... |
| Q48270 | non-binary | range of gender identities that are not exclus... |
| Q2449503 | transgender male | person assigned to the female sex at birth who... |
| Q18116794 | genderfluid | gender identity which doesn't conform to fixed... |
| Q12964198 | genderqueer | range of gender identities that are not exclus... |
| Q15145779 | cisgender female | female person who was assigned female at birth |
| Q189125 | transgender person | person whose gender identity is different from... |


We define the male and female genders for further analysis.

In [7]:
male_label = "Q6581097"
female_label = "Q6581072"

Let's compute the proportion of non-male and non-female genders.

In [30]:
genders_dist[(genders_dist.index != 'male') & (genders_dist.index != 'female')].sum()

0.0002668303480561776

We can see that less than 0.03% of the speakers are labelled as neither male nor female. For simplicity, we ignore those cases in our study.

In [31]:
cleaned_wikidata = cleaned_wikidata[cleaned_wikidata.gender.isin([male_label, female_label])]

We now create an occupation data frame where everyone's occupation is known (not `None`).

In [35]:
"""Only intended to be run once 
wiki_occupation = cleaned_wikidata[~cleaned_wikidata['occupation'].isin([None])]""";

Then, we create a background data frame that keeps the speakers whose occupation is known and for whom at least one of religion, nationality, ethnic_group and academic_degree is known.

In [36]:
""" Only intended to be run once 
wiki_background = wiki_occupation[~(wiki_occupation['religion'].isin([None, "None"]) & 
                        wiki_occupation['nationality'].isin([None, "None"]) &
                        wiki_occupation['ethnic_group'].isin([None, "None"]) & 
                        wiki_occupation['academic_degree'].isin([None, "None"]))]""";
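The mask used for `wiki_background` negates a conjunction, so it removes a speaker only when all four background fields are missing. The sketch below contrasts this with the stricter "all fields known" filter on an invented data frame; it assumes missing values are stored as `None` and uses `isna()` instead of `isin([None, "None"])` for brevity.

```python
import pandas as pd

# Three invented speakers with different amounts of background information
toy = pd.DataFrame({
    "religion":        [None, "Q1", None],
    "nationality":     [None, None, None],
    "ethnic_group":    [None, None, None],
    "academic_degree": [None, None, "Q2"],
}, index=["no_background", "religion_only", "degree_only"])

background_cols = ["religion", "nationality", "ethnic_group", "academic_degree"]
missing = toy[background_cols].isna()

at_least_one_known = ~missing.all(axis=1)   # same logic as ~(A & B & C & D)
all_known = ~missing.any(axis=1)            # stricter alternative

print(toy[at_least_one_known].index.tolist())
print(toy[all_known].index.tolist())
```

On this toy example the first filter keeps two speakers, whereas the stricter one keeps none, which illustrates how much data the stricter rule could discard.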

We save those data frames as pickle for further analysis. We already ran the following cell and the files can be found in "Data".

In [37]:
""" Only intended to be run once 
wiki_occupation.to_pickle(DATA_PATH + "wiki_occupation.pck")
wiki_background.to_pickle(DATA_PATH + "wiki_background.pck")""";

Now that all the pre-processing is done, we can start our analysis. 

## 2. Creation of our sub data frames <a class = anchor id="dataframes"></a>

First, we focus on our initial question: to what extent can we observe differences in communicative acts in relation to gender within a professional area, and are there noticeable differences between those professional areas?

### Imports <a class = anchor id="2imports"></a>

First we import Empath, a tool for analyzing text across lexical categories.

In [32]:
from empath import Empath

### Functions <a class = anchor id="2functions"></a>

Let's define the functions needed for this part.

In [33]:
def add_columns(column, target, init_df, name_column):
    """
    Checks if a target ('politician', 'male', 'female' etc...) is in a certain column.
    If it is, we return True in an additional column (name_column).
    Inputs:
        * column : name of column to search for target
        * target : collection of items of interest (e.g. QIDs)
        * init_df : initial data frame
        * name_column : name of new column of booleans
    Output:
        * final_df : copy of the data frame with the new column of booleans
    """
    final_df = init_df.copy(deep = True)
    target_set = set(target)
    # True if at least one value listed in `column` belongs to the target
    final_df[name_column] = final_df[column].apply(
        lambda x: x is not None and any(item in target_set for item in x))
    return final_df

def extracting_sub_df(quotebank, wikidata, column):
    """
    Creates a sub dataframe with information from quotebank and wikidata.
    We only take the rows in column which are True.
    Inputs:
        * quotebank : data frame extracted from quotebank
        * wikidata : data frame extracted from wikidata
        * column : column on which we base the merge
    Output:
        * sub_df : merged dataframe 
    """
    merged_df = pd.merge(quotebank, wikidata, left_on = 'speaker', right_on = 'label')
    merged_df['qids'] = merged_df['qids'].apply(lambda x : x[0])
    merged_df = merged_df[merged_df['qids'] == merged_df['id']]
    sub_df = merged_df[merged_df[column] == True]
    return sub_df

def clean_quotebank(df):
    """ 
    Cleans quotebank dataset by dropping quotes from unknown speakers and
    quotes where the speaker is uncertain (p < 0.5).
    Input:
        * df : quotebank data frame to clean  
    Output:
        * df_copy : cleaned data frame
    """ 
    df_copy = df.copy(deep = True)
    df_copy = df_copy[~df_copy.speaker.isin(['None', None])]
    df_copy =  df_copy[df_copy['probas'].apply(lambda x: x[0][1]).values.astype(float) > 0.5]
    return df_copy 

def create_df_with_conditions(column, conditions, wikidata, columns_temp, start, stop):
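    """
    From quotebank data, extracts and returns one data frame per condition,
    each with only the rows that respect that condition.
    Inputs:
        * column : name of column where the conditions are applied
        * conditions : list of conditions of interest (four are expected)
        * wikidata : wiki database
        * columns_temp : names of the new columns of booleans, one per condition
        * start : first year (two digits), for example 15 for 2015
        * stop : last year + 1, for example 16 if you only want 2015
    Output:
        * four data frames, one per condition, each with only the rows which contain the item
    """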
    wiki_plus = wikidata.copy(deep = True)
    for j in range(len(conditions)):
        wiki_plus = add_columns(column, conditions[j], wiki_plus, columns_temp[j])
    sub_df = {}
    for g in range(len(columns_temp)):
        sub_df[str(g)] = pd.DataFrame()
    for i in range(start,stop):
        DATA_FILE = 'quotes-20{}.json.bz2'.format(i)
        with pd.read_json(DATA_PATH + DATA_FILE, lines = True, compression ='bz2', chunksize = 100000) as df_reader:
            for chunk in df_reader:
                # Clean each chunk once, then reuse it for every condition
                cleaned_chunk = clean_quotebank(chunk)
                for k in range(len(columns_temp)):
                    sub_df[str(k)] = pd.concat([sub_df[str(k)], extracting_sub_df(cleaned_chunk,
                                                                                  wiki_plus, columns_temp[k])])
    for l in range(len(columns_temp)):
        sub_df[str(l)] = sub_df[str(l)].drop(columns_temp, axis = 1)
    return sub_df['0'], sub_df['1'], sub_df['2'], sub_df['3']

def merging(quotebank, wikidata):
    """
    Creates a sub dataframe with information from quotebank and wikidata.
    Inputs:
        * quotebank : data frame extracted from quotebank
        * wikidata : data frame extracted from wikidata
    Output:
        * sub_df : merged data frame 
    """
    merged_df = pd.merge(quotebank, wikidata, left_on = 'speaker', right_on = 'label')
    merged_df['qids'] = merged_df['qids'].apply(lambda x : x[0])
    merged_df = merged_df[merged_df['qids'] == merged_df['id']]
    return merged_df

def create_df(wikidata, start, stop):
    """
    From quotebank data, samples quotes and returns them merged with the speakers' wikidata information.
    Inputs:
        * wikidata : wiki database
        * start : first year (two digits), for example 15 for 2015. Has to be between 15 and 20
        * stop : last year + 1, for example 16 if you only want 2015. Has to be between 16 and 21
    Output:
        * sub_df : data frame with the sampled quotes whose speaker is found in wikidata
    """
    sub_df = pd.DataFrame()
    for i in range(start, stop):
        DATA_FILE = 'quotes-20{}.json.bz2'.format(i)
        with pd.read_json(DATA_PATH + DATA_FILE, lines = True, compression ='bz2', chunksize = 100000) as df_reader:
            for chunk in df_reader:
                # Sample at most 10000 rows, as the last chunk of a file may be smaller
                new_chunk = chunk.sample(n = min(10000, len(chunk)), random_state = seed)
                sub_df = pd.concat([sub_df, merging(clean_quotebank(new_chunk), wikidata)])
    return sub_df

def return_english_df(df):
    """
    Returns a data frame containing the rows of data frame if the quote is in English.
    Input:
        * df : data frame with quotes in all languages
    Output:
        * english : data frame with only English quotes 
    """
    english = df.copy(deep = True)
    english['language'] = english['quotation'].apply(det)
    english = english[english['language'] == 'en']
    # drop() returns a new data frame, so the result has to be assigned
    english = english.drop('language', axis = 1)
    return english
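The functions above read each yearly file in chunks and filter every chunk before storing it, which keeps memory usage bounded. The pattern can be sketched on an in-memory file as follows; the four JSON lines are invented, and the filtered chunks are collected in a list and concatenated once at the end, which is a common alternative to growing a data frame inside the loop.

```python
import io
import pandas as pd

# Four JSON lines standing in for a (much larger) Quotebank file
raw = io.StringIO(
    '{"quoteID": "a", "speaker": "Jane Doe"}\n'
    '{"quoteID": "b", "speaker": "None"}\n'
    '{"quoteID": "c", "speaker": "John Roe"}\n'
    '{"quoteID": "d", "speaker": "None"}\n'
)

parts = []
with pd.read_json(raw, lines=True, chunksize=2) as reader:
    for chunk in reader:
        # Filter each chunk before keeping it, so that only useful rows stay in memory
        parts.append(chunk[chunk["speaker"] != "None"])

result = pd.concat(parts, ignore_index=True)
print(result["quoteID"].tolist())
```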

### 2.1 Creation of professional fields <a class = anchor id="2.1"></a>

Let's create data frames for different professional fields, so that we can compare quotations of men and women without the bias of the professional background. We start by choosing four professional fields: arts, science, economy and politics. We then create a list of professions related to each field, using the `occupation` list from Empath complemented with other professions we felt were relevant. Finally, we manually assign each profession/occupation to a professional field, after checking its presence and relevance in `qid_label`.

In [34]:
lexicon = Empath()
lexicon.cats["occupation"][:15]

['psychologist',
 'waiter',
 'electrician',
 'server',
 'worker',
 'manager',
 'chemist',
 'interview',
 'technician',
 'nanny',
 'accountant',
 'owner',
 'retirement',
 'hairdresser',
 'cashier']

In [35]:
# Patterns are split with adjacent string literals: a backslash continuation inside
# a string would add the next line's indentation to the pattern
art_professions = qid_label[qid_label['Label'].str.contains(
    'cineast|painter|musician|sculptor|architect|dancer|'
    'philosoph|writer|actor|actress|choreographer|music interpreter|singer|photographer|entertainer',
    na = False)]

scientific_professions = qid_label[qid_label['Label'].str.contains(
    'scientific|researcher|mathematician|doctor|'
    'astronomer|biologist|chemist|physicist|physician|psychologist|engineer|anatomist|neurologist|'
    'pediatrician|veterinarian|pharmacist|obstetrician|gynecologist|therapist|dentist|surgeon|nurse|'
    'psychiatrist|Scientific',
    na = False)]

economic_professions = qid_label[qid_label['Label'].str.contains(
    'economist|banke|financ|CEO|CTO|chairman|auditor|'
    'stockbroker|insurer|business manager|retail merchandizer|pricing analyst|statistician|financial consultant|'
    'salesperson|risk analyst|Data analyst|accountant|economic researcher|Investm|actuary',
    na = False)]

political_professions = qid_label[qid_label['Label'].str.contains('politician|president|minister|government \
        accountant General', na = False)]
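A pitfall worth keeping in mind when a long pattern is split over several lines: a backslash continuation inside a string literal keeps the indentation of the next line, so the alternative that follows the break starts with spaces and no longer matches. The sketch below reproduces this on a two-word pattern with the standard `re` module.

```python
import re

# A backslash continuation *inside* a string literal keeps the next line's indentation
broken = 'dancer| \
        philosoph'

# Adjacent string literals inside parentheses are concatenated without extra whitespace
fixed = (
    'dancer|'
    'philosoph'
)

print(repr(broken))
print(bool(re.search(broken, 'philosopher')), bool(re.search(fixed, 'philosopher')))
```

Printing the pattern with `repr` is a quick way to spot such stray whitespace before applying it to `qid_label`.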

### 2.2 Creation of sub-dataframes  <a class = anchor id="2.2"></a>

To create the sub data frames, we need to read the pickle files that we previously saved. Note that the creation and saving of our sub pickle files is already done; those files can be found in "Data".

In [42]:
""" Only intended to be run once
wiki_occupation = pd.read_pickle(DATA_PATH + "wiki_occupation.pck")
wiki_background = pd.read_pickle(DATA_PATH + "wiki_background.pck")""";

The merge is done on `label` in _Wikidata_ and `speaker` in _Quotebank_. As several speakers may share the same name, one `speaker` often matches several `label` entries; we keep the one whose `id` (in Wikidata) equals the first entry of `qids` (`qids`[0] in Quotebank).
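The disambiguation step can be illustrated on a toy example with two homonyms. The quotation and the second QID (`Q0000001`) are invented for the illustration; only the merge-then-filter logic mirrors the `merging` function.

```python
import pandas as pd

# One quote whose speaker name matches two Wikidata entries ("Q0000001" is a made-up QID)
toy_quotes = pd.DataFrame({
    "quotation": ["We will win the next game."],
    "speaker": ["Chris Beech"],
    "qids": [["Q5105871", "Q0000001"]],
})
toy_wiki = pd.DataFrame({
    "id": ["Q5105871", "Q0000001"],
    "label": ["Chris Beech", "Chris Beech"],
    "occupation": [["Q937857"], ["Q82955"]],
})

# Merging on the name yields one row per homonym
merged = pd.merge(toy_quotes, toy_wiki, left_on="speaker", right_on="label")
print(len(merged))

# Keep only the Wikidata entry whose id equals the first QID listed by Quotebank
first_qid = merged["qids"].apply(lambda q: q[0])
merged = merged[first_qid == merged["id"]]
print(merged[["speaker", "id"]])
```

A possible alternative, not used here, would be to merge directly on the first QID, which would avoid creating the intermediate rows for homonyms.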

#### Creation of sub-dataframes per professions (from 2015 to 2020)

We now want to combine the information of the speakers from wikidata with our professional fields to create data frames of specific speakers' professions.

In [43]:
""" Only intended to be run once 
df_per_year = {}
conditions = [political_professions.index, art_professions.index, scientific_professions.index,
              economic_professions.index]
columns = ['ispolitician', 'isartist', 'isscientist','iseconomist']
for i in range(15,21):
    # Parentheses allow the assignment targets to span several lines
    (df_per_year['df_politicians_20' + str(i)], df_per_year['df_artists_20' + str(i)],
     df_per_year['df_scientists_20' + str(i)], df_per_year['df_economists_20' + str(i)]) = \
        create_df_with_conditions('occupation', conditions, wiki_occupation, columns, i, i + 1)""";

#### Creation of a general background sub-dataframes (from 2015 to 2020)

We now create our data frames from the more detailed wikidata, where the `occupation` is known and at least one of `religion`, `nationality`, `ethnic_group` and `academic_degree` is known for every speaker.

In [44]:
""" Only intended to be run once
df_no_conditions_per_year = {}
for i in range(15,21):
    df_no_conditions_per_year['20' + str(i)] = create_df(wiki_background, i, i + 1)""";

### 2.3 Creation of Data Frames with English Quotes only <a class = anchor id="2.3"></a>

In [45]:
"""Only intended to be run once
df_english = {}
for i in range(15, 21):
    df_english['df_politicians_english_20' + str(i)] = return_english_df(df_per_year['df_politicians_20' + str(i)])
    df_english['df_artists_english_20' + str(i)] = return_english_df(df_per_year['df_artists_20' + str(i)])
    df_english['df_scientists_english_20' + str(i)] = return_english_df(df_per_year['df_scientists_20' + str(i)])
    df_english['df_economists_english_20' + str(i)] = return_english_df(df_per_year['df_economists_20' + str(i)])
    df_english['df_no_conditions_20' + str(i)] = return_english_df(df_no_conditions_per_year['20' + str(i)])""";

### 2.4 Saving of all sub data frames <a class = anchor id="2.4"></a>

Finally, we generate pickle files storing all the English quotations of both genders for each professional field, from 2015 to 2020, as well as the files containing the speakers with more background information (e.g. "df_no_conditions_2020.pck"). As mentioned above, the saving is already done and the files can be found in "Data".

#### Saving of sub-dataframes per professions and general background sub-dataframes (from 2015 to 2020)

In [46]:
""" Only intended to be run once
for i in range(15, 21):
    df_english['df_politicians_english_20' + str(i)].to_pickle(DATA_PATH + "politicians_english_20" + str(i) + ".pck")
    df_english['df_artists_english_20' + str(i)].to_pickle(DATA_PATH + "artists_english_20" + str(i) + ".pck")
    df_english['df_scientists_english_20' + str(i)].to_pickle(DATA_PATH + "scientists_english_20" + str(i) + ".pck")
    df_english['df_economists_english_20' + str(i)].to_pickle(DATA_PATH + "economists_english_20" + str(i) + ".pck")
    df_english['df_no_conditions_20' + str(i)].to_pickle(DATA_PATH + "df_no_conditions_english_20" + str(i) + ".pck")
    """;

Our specific data frames are now saved and ready for the classification.

## 3. Classification of the quotes <a class = anchor id="classifier"></a>

To distinguish between uncertain and certain quotations, we use the uncertainty detection classifier from the following paper "P. A. Jean, S. Harispe, S. Ranwez, P. Bellot, and J. Montmain, “[Uncertainty detection in natural language: A probabilistic model](https://www.researchgate.net/publication/303842922)” ACM Int. Conf. Proceeding Ser., vol. 13-15-June, no. June, 2016, doi: 10.1145/2912845.2912873". Its public git repository is: https://github.com/PAJEAN/uncertaintyDetection.

Uncertainty is expressed by speculative verbs (such as suggest or presume), adjectives and adverbs (such as probably or possibly), auxiliary verbs (must, should), or the use of certain tenses or moods (subjunctive, conditional). The classifier is a machine learning method that automatically detects uncertainty in natural language. It is inspired by binary classification methods and is based on an optimal feature selection. It can be trained on three corpora: Bioscope (a corpus in the biomedical domain containing 1,871 sentences), WikiWeasel (a generic corpus composed of paragraphs extracted from Wikipedia) and SFU (17,263 sentences extracted from various resources (movies, books, etc.)). Here, we only use SFU to train our classifier. In milestone 3, we will try to obtain a better classifier by training it on all three corpora. From the set of extracted features, a vector representation is defined for each sentence. The most certain and uncertain sentences are then extracted from our test data.
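To give an intuition of what lexical uncertainty cues look like, here is a deliberately naive lookup based on the example words listed above. It is a toy illustration only, not the classifier from the cited paper, which learns its features from annotated corpora; `UNCERTAINTY_CUES` and `find_cues` are names introduced for this sketch.

```python
import re

# Small hand-picked cue list based on the examples in the paragraph above.
# This is a toy lexicon lookup, NOT the classifier from the cited paper.
UNCERTAINTY_CUES = {"suggest", "presume", "probably", "possibly", "must", "should"}

def find_cues(sentence):
    """Return the uncertainty cues found in a sentence, in order of appearance."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t in UNCERTAINTY_CUES]

print(find_cues("We should probably wait for the results."))
print(find_cues("The results are in."))
```

Such a lookup ignores context and grammatical mood (for instance the conditional), which is one reason to rely on a trained classifier instead.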

However, as this classifier was created six years ago, it only runs on Python 2. As we are using Python 3, we adapted it for our project. We describe below how to run it.

### Pathways <a class = anchor id="3pathways"></a>

As already mentioned in [1. Pre-processing of the data](#pre-processing), the data folder "txt_files" must be downloaded and added in the same directory as this notebook.

In [8]:
PATH_TXT = 'txt_files/'

### Functions <a class = anchor id="3functions"></a>

In [37]:
def quotes_to_txt(file_name, df):
    """
    Changes quotes dataframe to a text file.
    Inputs:
        * file_name : text name
        * df : dataframe to convert
    """
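    quotes = df.quotation.astype(str)
    # One quote per line, prefixed by its positional index in the dataframe.
    # The "with" statement closes the file, so no explicit close() is needed.
    with open(file_name, "w", encoding = "utf-8") as f:
        for ind, quote in enumerate(quotes):
            f.write(str(ind) + " " + quote + "\n")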

### 3.1 Reading of all sub data frames <a class = anchor id="3.1"></a>

We start by reading our sub data frames.

#### Sub data frames per professions

We start by creating a dictionary containing the data frames of all the professions per year (`df_per_year`).

In [8]:
df_per_year = {}
for i in range(15, 21): # years 2015 to 2020
    df_per_year['df_politicians_20' + str(i)] = pd.read_pickle(DATA_PATH + "politicians_english_20" + str(i) + ".pck")
    df_per_year['df_artists_20' + str(i)] = pd.read_pickle(DATA_PATH + "artists_english_20" + str(i) + ".pck")
    df_per_year['df_scientists_20' + str(i)] = pd.read_pickle(DATA_PATH + "scientists_english_20" + str(i) + ".pck")
    df_per_year['df_economists_20' + str(i)] = pd.read_pickle(DATA_PATH + "economists_english_20" + str(i) + ".pck")

We create a small version to avoid memory errors.

In [25]:
"""Only intended to be run once
small_df_per_year = {}

for i in range(15, 21):
    print(i)
    small_df_per_year['df_politicians_20' + str(i)]= pd.read_pickle(DATA_PATH + \
        "politicians_english_20" + str(i) + ".pck").sample(n = 10000, random_state = seed)
    small_df_per_year['df_artists_20' + str(i)]= pd.read_pickle(DATA_PATH + \
        "artists_english_20" + str(i) + ".pck").sample(n = 10000, random_state = seed)
    small_df_per_year['df_scientists_20' + str(i)]= pd.read_pickle(DATA_PATH + \
        "scientists_english_20" + str(i) + ".pck").sample(n = 10000, random_state = seed)
    small_df_per_year['df_economists_20' + str(i)]= pd.read_pickle(DATA_PATH + \
        "economists_english_20" + str(i) + ".pck").sample(n = 10000, random_state = seed)
        
for i in range(15, 21):
    small_df_per_year['df_politicians_20' + str(i)].to_pickle(DATA_PATH + "small_politicians_20" + str(i) + ".pck")
    small_df_per_year['df_artists_20' + str(i)].to_pickle(DATA_PATH + "small_artists_english_20" + str(i) + ".pck")
    small_df_per_year['df_scientists_20' + str(i)].to_pickle(DATA_PATH + "small_scientists_20" + str(i) + ".pck")
    small_df_per_year['df_economists_20' + str(i)].to_pickle(DATA_PATH + "small_economists_20" + str(i) + ".pck")""";

#### Sub data frame with no conditions on profession

We now create a dictionary containing the data frames of background information of the speakers per year (`df_no_conditions_per_year`).

In [9]:
df_no_conditions_per_year = {}

for i in range(15, 21):
    df_no_conditions_per_year['20' + str(i)] = pd.read_pickle(DATA_PATH + \
                                                    "df_no_conditions_english_20" + str(i) + ".pck")

We also create a data frame with all background information of all years.

In [10]:
df_no_conditions_all_years = df_no_conditions_per_year['2015']

for i in range(16,21):
    df_no_conditions_all_years = pd.concat([df_no_conditions_all_years, df_no_conditions_per_year['20' + str(i)]])

### 3.2 Creation of the text files <a class = anchor id="3.2"></a>

#### Sub data frames per professions

To be able to run the classifier, we first need to create txt files with all the quotes and their indices for each professional field.

In [None]:
""" Only intended to be run once
for i in range(15,21):
    quotes_to_txt(PATH_TXT + "politicians_20" + str(i)+ ".txt", df_per_year['df_politicians_20' + str(i)])
    quotes_to_txt(PATH_TXT + "artists_20" + str(i)+ ".txt", df_per_year['df_artists_20' + str(i)])
    quotes_to_txt(PATH_TXT + "scientists_20" + str(i)+ ".txt", df_per_year['df_scientists_20' + str(i)])
    quotes_to_txt(PATH_TXT + "economists_20" + str(i)+ ".txt", df_per_year['df_economists_20' + str(i)])""";

#### Sub data frame with no conditions on professions

In [None]:
""" Only intended to be run once
for i in range(15,21):
    quotes_to_txt(PATH_TXT + "df_no_conditions_20" + str(i) + ".txt", df_no_conditions_per_year['20' + str(i)])""";

All text files have already been saved and can be found in "txt_files". As we have generated the txt files, we can now use them with the classifier.
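
As a reminder of the format the rest of the notebook relies on, each line of these files starts with the positional index of the quote, followed by a space and the quote itself. The sketch below writes two made-up quotes in that format and reads the indices back; the file name and the quotes are illustrative only.

```python
import os
import tempfile

quotes = ["I think this could work.", "We won the election."]  # made-up quotes

# Write one "index quote" line per quote, as quotes_to_txt does.
path = os.path.join(tempfile.mkdtemp(), "example_2015.txt")
with open(path, "w", encoding="utf-8") as f:
    for ind, quote in enumerate(quotes):
        f.write(str(ind) + " " + quote + "\n")

# Read the file back and split each line into its index and its quote.
with open(path, "r", encoding="utf-8") as f:
    parsed = [line.rstrip("\n").split(" ", 1) for line in f]

indices = [int(index) for index, _ in parsed]
print(indices)
```

A quote that itself contains a line break would span several lines in such a file, so this is worth checking if an index ever fails to parse.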

### 3.3 Use of the uncertainty detection classifier <a class = anchor id="3.3"></a>

After creating the files, we use the classifier to separate the uncertain quotes from the certain ones. To continue, you must have downloaded the "Classifier" directory (see [1. Pre-processing of the data](#pre-processing)) and added its files to the Classifier folder of the GitHub repository.

#### How to run the classifier?

To run the file MUD.py, you will need the following in your ADA environment:
* Python 3.8
* nltk library
* numpy library
* scikit-learn library (imported as `sklearn`)

Once you have downloaded and installed all the libraries in your environment, you are good to go.  

To run the program, open a terminal in Jupyter Notebook or an Anaconda prompt. Go to the main folder of the project repository, then to the Classifier folder, and run the following command: **python MUD.py w Input/name_of_the_file.txt**

It is really important that you have the **ADA environment activated** to run this line.   

The following files can be passed to the classifier (in place of name_of_the_file):
* politicians_year.txt
* artists_year.txt
* scientists_year.txt
* economists_year.txt
* df_no_conditions_year.txt

With year between 2015 and 2020.
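
Since the same command has to be repeated for every file and every year, the list of commands can be generated rather than typed by hand. The following is a sketch only: it assumes that the input files have been copied into the `Input` folder under the names listed above, and it only prints the commands (it does not launch the classifier).

```python
# File name prefixes listed above; "year" is replaced by 2015 ... 2020.
prefixes = ["politicians", "artists", "scientists", "economists", "df_no_conditions"]
years = range(2015, 2021)

commands = [
    "python MUD.py w Input/{}_{}.txt".format(prefix, year)
    for prefix in prefixes
    for year in years
]

for command in commands:
    print(command)
```

As the result files always carry the same two names, they presumably need to be renamed or moved after each run before the next command is launched.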

#### What does it return?

The classifier will run for approximately 10 to 20 minutes depending on the file you give it. Once it has finished, you will find two files in the Data/Results folder, ***certainty_sentences*** and ***uncertainty_sentences***.  
The file ***certainty_sentences*** is a txt file with all the quotes that have been considered as "certain" by the classifier; the file ***uncertainty_sentences*** is a txt file with all the quotes that have been considered as "uncertain" by the classifier.  

You can find all these files under the following names in the txt_files folder : 
* uncertainty_politicians_year.txt
* uncertainty_artists_year.txt
* uncertainty_scientists_year.txt
* uncertainty_economists_year.txt
* uncertainty_df_no_conditions_year.txt

With year between 2015 and 2020.

In the next section, we will load those files and start the statistical analysis.

## 4. Results <a class = anchor id="results"></a>

### Imports <a class = anchor id="4imports"></a>

In [11]:
import statsmodels.formula.api as smf
import plotly.offline as offline
from plotly.graph_objs import *
from plotly.offline import init_notebook_mode, iplot

### Functions <a class = anchor id="4functions"></a>

We define some functions important for our analysis.

In [19]:
def extract_lines(txt_file):
    """
    Extracts the lines from a text file.
    Input:
        * txt_file : text file
    Output:
        * lines : all lines from text file
    """
    lines = []
    with open(txt_file, "r", encoding = "utf8") as file:
        for line in file:
            lines.append(line) 
    return lines

def extract_indices(lines):
    """
    Extracts the indices from every line.
    Input:
        * lines : all lines from text file
    Output:
        * indices : the indices from each lines
    """
    indices = []
    for i in range(0,len(lines)):
        index = ''
        line = lines[i]
        for j in range(0,len(line)):
            char = line[j]
            if char.isspace():
                break
            else:
                index += char
        indices.append(index)
    return indices

def create_df_from_txt(txt_file, df_profession):
    """
    Creates a sub data frame from df_profession containing the rows defined in the txt_file.
    Inputs:
        * txt_file : text file
        * df_profession : data frame containing all the quotes from a field
    Output:
        * df_uncertain_profession : dataframe containing the uncertain quotes from this field
    """
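    # iloc expects integer positions, so the indices read from the text file are cast to int
    indices = [int(index) for index in extract_indices(extract_lines(txt_file))]
    df_uncertain_profession = df_profession.iloc[indices]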
    return df_uncertain_profession


def linear_reg(data, formula):
    """
     Inputs:
        * data : data to perfrom linear regression on
        * formula : contains which feature we want to predict and the features that we use for the prediction
    Output:
        * res : statistical summary of the linear regression model
    """   
    model_intercept = smf.ols(formula = formula, data = data)
    # Fits the model (find the optimal coefficients, adding a random seed ensures consistency)
    np.random.seed(seed)
    res = model_intercept.fit()
    return res

def create_gender_dist(df_profession, df_uncertain_profession):
    """
    Computes the gender distribution of uncertain speakers for a certain profession.
    Inputs:
        * df_profession : data frame containing all the quotes from a professional field
        * df_uncertain_profession : dataframe containing the uncertain quotes from this professional field
    Output:
        * gender_dist : the gender distribution of uncertain speakers for a certain profession in percentage
    """
    gender_dist = df_uncertain_profession.groupby('gender').size()/df_profession.groupby('gender').size() * 100
    return gender_dist

def calculate_percentage(uncertain, certain, influence):
    """
    Computes the gender ratio of uncertain speakers by gender for a certain influence.
    Inputs:
        * uncertain : data frame containing all the uncertain quotes of the data frame df_no_conditions_all_year
        * certain : dataframe containing all the certain quotes of the data frame df_no_conditions_all_year
        * influence : feature we want to take in the data frame (like nationality, religion, academic_degree)
    Output:
        * uncertain_quotes : data frame containing ratio of uncertain quotes by gender over all quotes of an influence
    """
    uncertain_quotes = uncertain.copy(deep = True)
    certain_quotes = certain.copy(deep = True)

    # We get rid of all the None in our influence.
    certain_quotes = certain_quotes[~certain_quotes[influence].isin([None, "None"])]
    uncertain_quotes = uncertain_quotes[~uncertain_quotes[influence].isin([None, "None"])]

    # Because some of the influence that we use are array we need to take only the first 
    # argument of the array.
    uncertain_quotes[influence] = uncertain_quotes[influence].apply(lambda x: x[0])
    certain_quotes[influence] = certain_quotes[influence].apply(lambda x: x[0])

    uncertain_quotes[influence] = qid_label.loc[uncertain_quotes[influence]].Label.values
    certain_quotes[influence] = qid_label.loc[certain_quotes[influence]].Label.values

    occur = uncertain_quotes.groupby(influence)[influence].agg('count').sort_values(ascending = False)
    occur_certain = certain_quotes.groupby(influence)[influence].agg('count').sort_values(ascending = False)

    uncertain_quotes = uncertain_quotes[uncertain_quotes[influence].isin(occur.index)]
    certain_quotes = certain_quotes[certain_quotes[influence].isin(occur_certain.index)]

    fem = uncertain_quotes[uncertain_quotes['gender'] == 'Q6581072']
    fem_certain = certain_quotes[certain_quotes['gender'] == 'Q6581072']

    mal = uncertain_quotes[uncertain_quotes['gender'] == 'Q6581097']
    mal_certain = certain_quotes[certain_quotes['gender'] == 'Q6581097']

    ratio_fem_uncertain = fem[influence].value_counts()/(fem[influence].value_counts() + fem_certain[influence].value_counts())
    ratio_mal_uncertain = mal[influence].value_counts()/(mal[influence].value_counts() + mal_certain[influence].value_counts())
    diff_mal_female = ratio_fem_uncertain - ratio_mal_uncertain

    uncertain_quotes = pd.concat([ratio_fem_uncertain, ratio_mal_uncertain, diff_mal_female], axis = 1, sort = True)
    uncertain_quotes.columns = ["Female", "Male", "Diff"]
    uncertain_quotes[influence] = uncertain_quotes.index    
    return uncertain_quotes
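
To make the quantity returned by `calculate_percentage` explicit: for each category of the chosen feature and for each gender, it is the number of uncertain quotes divided by the total number of quotes (uncertain plus certain), and `Diff` is the female ratio minus the male ratio. Below is a plain-Python sketch of this computation on made-up counts; the helper `uncertain_ratio` is hypothetical and is not part of the notebook.

```python
from collections import Counter

# Made-up (category, gender) counts of uncertain and certain quotes.
uncertain = Counter({("France", "Female"): 30, ("France", "Male"): 20,
                     ("Chile", "Male"): 5})
certain = Counter({("France", "Female"): 70, ("France", "Male"): 80,
                   ("Chile", "Male"): 15, ("Chile", "Female"): 10})

def uncertain_ratio(category, gender):
    """Share of uncertain quotes among all quotes, or None without any quote."""
    n_uncertain = uncertain[(category, gender)]  # a Counter returns 0 if absent
    total = n_uncertain + certain[(category, gender)]
    return n_uncertain / total if total else None

ratios = {}
for category in ["France", "Chile"]:
    female = uncertain_ratio(category, "Female")
    male = uncertain_ratio(category, "Male")
    diff = None if female is None or male is None else female - male
    ratios[category] = {"Female": female, "Male": male, "Diff": diff}

print(ratios)
```

In the pandas version, a category that appears in only one of the two `value_counts()` results yields `NaN` when the two Series are added, because pandas aligns them on the index. Such categories therefore drop out of the comparison, whereas the sketch above counts a missing entry as zero.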

### 4.1 Gender distribution <a class = anchor id="4.1"></a>

#### Across occupations & in the general dataframe (containing background information)

To get a general idea of the distribution of male and female speakers, we start by counting the quotes attributed to men and to women in each profession (each row of the data frames is one quote).

In [23]:
gender_dist_prof = {}
gender_dist_prof['politicians'] = df_per_year['df_politicians_2015'].groupby('gender').size()
gender_dist_prof['artists'] = df_per_year['df_artists_2015'].groupby('gender').size()
gender_dist_prof['scientists'] = df_per_year['df_scientists_2015'].groupby('gender').size()
gender_dist_prof['economists'] = df_per_year['df_economists_2015'].groupby('gender').size()


for i in range(16, 21): 
    gender_dist_prof['politicians'] += df_per_year['df_politicians_20' + str(i)].groupby('gender').size()
    gender_dist_prof['artists'] += df_per_year['df_artists_20' + str(i)].groupby('gender').size()
    gender_dist_prof['scientists'] += df_per_year['df_scientists_20' + str(i)].groupby('gender').size()
    gender_dist_prof['economists'] += df_per_year['df_economists_20' + str(i)].groupby('gender').size()

In [24]:
df_prof = pd.DataFrame(gender_dist_prof)
df_prof.index = ['Female', 'Male']
df_prof

Unnamed: 0,politicians,artists,scientists,economists
Female,1990398,2624796,464902,570330
Male,7453866,5523509,1557733,2554092


The same is done for the general data frame.

In [27]:
gender_dist_background = df_no_conditions_per_year['2015'].groupby('gender').size()

for i in range(16,21):
    gender_dist_background += df_no_conditions_per_year['20' + str(i)].groupby('gender').size()

In [28]:
df_back = pd.DataFrame()
df_back['background'] = gender_dist_background
df_back.index = ['Female', 'Male']
df_back

Unnamed: 0,background
Female,1196564
Male,5494955


Let's now plot the gender distributions per profession and for the background dataframe.

In [29]:
trace_back = go.Bar(x = df_back.index, y = df_back['background'], name = "All professions", marker_color = colors[0])
trace_pol = go.Bar(x = df_prof.index, y = df_prof['politicians'], name = "Politicians", marker_color = colors[3])
trace_art = go.Bar(x = df_prof.index, y = df_prof['artists'], name = "Artists", marker_color = colors[5])
trace_sci = go.Bar(x = df_prof.index, y = df_prof['scientists'], name = "Scientists", marker_color = colors[7])
trace_eco = go.Bar(x = df_prof.index, y = df_prof['economists'], name = "Economists", marker_color = colors[9])

my_layout = {
    'title': 'Gender distribution across the professional fields and in our general dataframe',
    'xaxis': {'title': 'Genders'},
    'yaxis': {'title': 'Speakers'},
    }

fig = go.Figure()
fig.add_trace(trace_back)
fig.add_trace(trace_pol)
fig.add_trace(trace_art)
fig.add_trace(trace_sci)
fig.add_trace(trace_eco)
fig.update_layout(my_layout,title_x = 0.5)

fig.show()

We save the plot

In [30]:
html_path = IMAGE_PATH + "gender_dist_plot_across_prof_and_back.html"
fig.write_html(html_path)

We can observe a majority of males in every occupation category, as well as in the general (background) dataframe.

#### Summary : computation of the ratios

In [31]:
politicians_female_ratio = gender_dist_prof['politicians'][female_label]/gender_dist_prof['politicians'].sum()
artist_female_ratio = gender_dist_prof['artists'][female_label]/gender_dist_prof['artists'].sum()
scientists_female_ratio = gender_dist_prof['scientists'][female_label]/gender_dist_prof['scientists'].sum()
economists_female_ratio = gender_dist_prof['economists'][female_label]/gender_dist_prof['economists'].sum()
background_female_ratio = gender_dist_background[female_label]/gender_dist_background.sum()

print("The female ratios for the different occupation groups are:", "\n", 
      "{:.2f} in politicians,".format(politicians_female_ratio),
      "\n",  "{:.2f} in artists,".format(artist_female_ratio), "\n", 
      "{:.2f} in scientists,".format(scientists_female_ratio),
      "\n", "{:.2f} in economists,".format(economists_female_ratio), "\n",
      "and {:.2f} across all different occupations.".format(background_female_ratio))

The female ratios for the different occupation groups are: 
 0.21 in politicians, 
 0.32 in artists, 
 0.23 in scientists, 
 0.18 in economists, 
 and 0.18 across all different occupations.


This is an important imbalance in the dataset, which we will take into account in our study and future analysis. Note that, among the four professions, the strongest imbalance (18% of women) is found among economists.
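
The ratios printed above can be recomputed directly from the two count tables, which makes them easy to check. The snippet below only reuses the counts displayed earlier in this section.

```python
# (female, male) counts copied from the two tables above.
counts = {
    "politicians": (1990398, 7453866),
    "artists": (2624796, 5523509),
    "scientists": (464902, 1557733),
    "economists": (570330, 2554092),
    "background": (1196564, 5494955),
}

# Female ratio = female count / (female count + male count).
female_ratio = {name: female / (female + male) for name, (female, male) in counts.items()}

for name, ratio in female_ratio.items():
    print("{}: {:.2f}".format(name, ratio))
```

Before rounding, the ratio is slightly lower in the background data frame (about 0.179) than among economists (about 0.183).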

### 4.2 Analysis of the genders distribution per profession <a class = anchor id="4.2"></a>

Let's compute the gender distributions of uncertainty per profession (e.g., how many quotes by female artists are uncertain compared to the total number of quotes by female artists). We start by creating a dictionary containing the number of quotes from each profession per gender.

In [11]:
number_by_gender_all_years = {}
number_by_gender_all_years['politicians'] = df_per_year['df_politicians_2015'].groupby('gender').size()
number_by_gender_all_years['artists'] = df_per_year['df_artists_2015'].groupby('gender').size()
number_by_gender_all_years['scientists'] = df_per_year['df_scientists_2015'].groupby('gender').size()
number_by_gender_all_years['economists'] = df_per_year['df_economists_2015'].groupby('gender').size()

for i in range(16,21): 
    number_by_gender_all_years['politicians'] += df_per_year['df_politicians_20' + str(i)].groupby('gender').size()
    number_by_gender_all_years['artists'] += df_per_year['df_artists_20' + str(i)].groupby('gender').size()
    number_by_gender_all_years['scientists'] += df_per_year['df_scientists_20' + str(i)].groupby('gender').size()
    number_by_gender_all_years['economists'] += df_per_year['df_economists_20' + str(i)].groupby('gender').size()

We then create our dictionary (`df_uncertain_professions_per_year`) containing our uncertain quotes per profession per year. This will also be used later in part [4.4 Possible variation from 2015 to 2020](#4.4).

In [10]:
df_uncertain_professions_per_year = {}
for i in range(15,21):
    df_uncertain_professions_per_year['df_uncertain_politicians_20' + str(i)] = create_df_from_txt(PATH_TXT + \
                            "uncertainty_politicians_20" + str(i) + ".txt", df_per_year['df_politicians_20' + str(i)])
    
    df_uncertain_professions_per_year['df_uncertain_artists_20' + str(i)] = create_df_from_txt(PATH_TXT + \
                            "uncertainty_artists_20" + str(i) + ".txt", df_per_year['df_artists_20' + str(i)])
    
    df_uncertain_professions_per_year['df_uncertain_scientists_20' + str(i)] = create_df_from_txt(PATH_TXT + \
                            "uncertainty_scientists_20" + str(i) + ".txt", df_per_year['df_scientists_20' + str(i)])
    
    df_uncertain_professions_per_year['df_uncertain_economists_20' + str(i)] = create_df_from_txt(PATH_TXT + \
                            "uncertainty_economists_20" + str(i) + ".txt", df_per_year['df_economists_20' + str(i)])

We now create a dictionary containing the number of uncertain quotes per profession (all years combined).

In [13]:
number_uncertain_by_gender_all_years = {}
number_uncertain_by_gender_all_years['politicians'] = \
                        df_uncertain_professions_per_year['df_uncertain_politicians_2015'].groupby('gender').size()
number_uncertain_by_gender_all_years['artists'] = \
                            df_uncertain_professions_per_year['df_uncertain_artists_2015'].groupby('gender').size()
number_uncertain_by_gender_all_years['scientists'] = \
                            df_uncertain_professions_per_year['df_uncertain_scientists_2015'].groupby('gender').size()
number_uncertain_by_gender_all_years['economists'] = \
                            df_uncertain_professions_per_year['df_uncertain_economists_2015'].groupby('gender').size()

for i in range(16,21):
    number_uncertain_by_gender_all_years['politicians'] += df_uncertain_professions_per_year \
                                                    ['df_uncertain_politicians_20' + str(i)].groupby('gender').size()
    number_uncertain_by_gender_all_years['artists'] += df_uncertain_professions_per_year \
                                                    ['df_uncertain_artists_20' + str(i)].groupby('gender').size()
    number_uncertain_by_gender_all_years['scientists'] += df_uncertain_professions_per_year \
                                                    ['df_uncertain_scientists_20' + str(i)].groupby('gender').size()
    number_uncertain_by_gender_all_years['economists'] += df_uncertain_professions_per_year \
                                                    ['df_uncertain_economists_20' + str(i)].groupby('gender').size()
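
The yearly counts are accumulated here with `+=` on pandas Series, which aligns the two operands on the `gender` index. The plain-Python sketch below reproduces the same aggregation with `collections.Counter` on made-up yearly counts; the numbers are illustrative only, while the labels are the Wikidata identifiers used for female and male in `calculate_percentage`.

```python
from collections import Counter

# Made-up number of uncertain quotes per gender and per year.
yearly_counts = {
    2015: {"Q6581072": 120, "Q6581097": 480},
    2016: {"Q6581072": 150, "Q6581097": 510},
    2017: {"Q6581097": 495},  # no female quote this year
}

# Sum the yearly counts; a gender missing in one year simply adds nothing.
total = Counter()
for year, counts in yearly_counts.items():
    total.update(counts)

print(dict(total))
```

With pandas Series, a label missing from one of the operands gives `NaN` for that label after `+`, so `Series.add(other, fill_value = 0)` is a possible alternative if some year lacks a gender.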

We now compute, for each profession and each gender, the percentage of quotes that are uncertain.

In [14]:
gender_dist_ratio = {}
gender_dist_ratio['politicians'] = number_uncertain_by_gender_all_years['politicians']/ \
                                                                        number_by_gender_all_years['politicians']*100
gender_dist_ratio['artists'] = number_uncertain_by_gender_all_years['artists']/ \
                                                                        number_by_gender_all_years['artists']*100
gender_dist_ratio['scientists'] = number_uncertain_by_gender_all_years['scientists']/ \
                                                                        number_by_gender_all_years['scientists']*100
gender_dist_ratio['economists'] = number_uncertain_by_gender_all_years['economists']/ \
                                                                        number_by_gender_all_years['economists']*100

Lastly, we plot those gender distributions per occupation.

In [15]:
df_ratio_prof = pd.DataFrame(gender_dist_ratio)
df_ratio_prof.index = ['Female', 'Male']
df_ratio_prof.round(decimals = 2)

Unnamed: 0,politicians,artists,scientists,economists
Female,28.5,29.96,30.5,30.09
Male,28.86,30.03,30.73,30.57


In [16]:
trace_pol = go.Bar(x = df_ratio_prof.index, y = df_ratio_prof['politicians'], name = "Politicians", 
                   marker_color = colors[3])
trace_art = go.Bar(x = df_ratio_prof.index, y = df_ratio_prof['artists'], name = "Artists", 
                   marker_color = colors[5])
trace_sci = go.Bar(x = df_ratio_prof.index, y = df_ratio_prof['scientists'], name = "Scientists", 
                   marker_color = colors[7])
trace_eco = go.Bar(x = df_ratio_prof.index, y = df_ratio_prof['economists'], name = "Economists", 
                   marker_color = colors[9])

my_layout = {
    'title': 'Relative % of uncertain speakers within a gender for each profession',
    'xaxis': {'title': 'Genders'},
    'yaxis': {'title': '% of uncertain speakers'},
    }

fig = go.Figure()
fig.add_trace(trace_pol)
fig.add_trace(trace_art)
fig.add_trace(trace_sci)
fig.add_trace(trace_eco)
fig.update_layout(my_layout, title_x=0.5)

fig.show()

The plot is then saved.

In [17]:
html_path = IMAGE_PATH + "percentage_uncertain_plot_across_prof.html"
fig.write_html(html_path)

### 4.3 Background influence <a class = anchor id="4.3"></a>

Let's have a look at our second question, which investigates the roles that culture/traditions, education and place of living play in determining those differences in speech between men and women.

First, we create the dictionary of uncertain quotes per year.

In [13]:
df_uncertain_no_conditions_per_year = {}

for i in range(15,21):
    df_uncertain_no_conditions_per_year['20' + str(i)] = create_df_from_txt(PATH_TXT + \
                                                                "uncertainty_df_no_conditions_20" + str(i) + ".txt",
                                                                df_no_conditions_per_year['20' + str(i)])

We also create a data frame containing all uncertain quotes.

In [14]:
df_uncertain_no_conditions_all_years = df_uncertain_no_conditions_per_year['2015']

for i in range(16,21):
    df_uncertain_no_conditions_all_years = pd.concat([df_uncertain_no_conditions_all_years,
         df_uncertain_no_conditions_per_year['20' + str(i)]])

We do the same thing for the certain quotes.

In [15]:
df_certain_no_conditions_per_year = {}

for i in range(15,21):
    df_certain_no_conditions_per_year['20' + str(i)] = df_no_conditions_per_year['20' + str(i)] \
    [~df_no_conditions_per_year['20'+ str(i)].quoteID \
                                             .isin(df_uncertain_no_conditions_per_year['20'+ str(i)].quoteID)].copy()
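
The certain quotes are thus obtained as the complement of the uncertain ones, matched on `quoteID`. In plain Python, the same idea is a set membership test; the identifiers below are made up.

```python
# Made-up quote identifiers for one year.
all_quote_ids = ["q-001", "q-002", "q-003", "q-004"]
uncertain_quote_ids = {"q-002", "q-004"}  # a set makes the lookup fast

# A quote is treated as certain when its identifier is not among the uncertain ones.
certain_quote_ids = [qid for qid in all_quote_ids if qid not in uncertain_quote_ids]

print(certain_quote_ids)
```

This relies on `quoteID` being unique within a year: the two classes then partition the data frame, and their sizes add up to the total number of quotes.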

In [16]:
df_certain_no_conditions_all_years = df_certain_no_conditions_per_year['2015']

for i in range(16,21):
    df_certain_no_conditions_all_years = pd.concat([df_certain_no_conditions_all_years,
         df_certain_no_conditions_per_year['20' + str(i)]])

Let's first look at our features to see which ones could relate to culture, education or country.

In [17]:
df_uncertain_no_conditions_all_years.columns

Index(['quoteID', 'quotation', 'speaker', 'qids', 'date', 'numOccurrences',
       'probas', 'urls', 'phase', 'aliases', 'date_of_birth', 'nationality',
       'gender', 'lastrevid', 'ethnic_group', 'US_congress_bio_ID',
       'occupation', 'party', 'academic_degree', 'id', 'label', 'candidacy',
       'type', 'religion', 'language'],
      dtype='object')

We see that `nationality` and `religion` could give cultural, traditional and surrounding environmental background information on the speaker, while `academic_degree` gives information on the educational aspect of the quote's author.

#### Nationality

We will first take a look at the `nationality` feature. To do so, we compute, for each country, the difference between the percentage of uncertain females and the percentage of uncertain males.

In [18]:
influence = 'nationality'

In [20]:
init_notebook_mode(connected=True)

data_slider = []
years = ['2015', '2016', '2017', '2018', '2019', '2020']


scl = [[0.0, '#000000'],[0.2, colors[2]],[0.4, colors[4]], 
       [0.6, colors[6]],[0.8, colors[8]],[1.0, colors[9]]]

for year in years:
    # Selection of the year
    df_percentage = calculate_percentage(df_uncertain_no_conditions_per_year[year], df_certain_no_conditions_per_year[year],influence)
    for col in df_percentage.columns:  # transform into string
        df_percentage[col] = df_percentage[col].astype(str)
    ### create the dictionary with the data for the current year
    data_one_year = dict(type='choropleth',locations = df_percentage['nationality'],
                        z=df_percentage['Diff'].astype(float),
                        locationmode='country names',
                        colorscale = scl,
                        )
    data_slider.append(data_one_year)

##  Creation of step for slider
steps = []
for i in range(len(data_slider)):
    step = dict(method='restyle',
                args=['visible', [False] * len(data_slider)],
                label='Year {}'.format(2015 + i)) # label to be displayed for each step (year)
    step['args'][1][i] = True
    steps.append(step)
    
#  creation of the 'sliders' object from the 'steps' 
sliders = [dict(active=0, steps=steps)]  
# Set up the layout (including slider option)
layout = dict(geo=dict(scope='world',
                       projection={'type': 'equirectangular'},
                       showcountries = True),
                       title = 'Difference of uncertainty between Females and Males over years (2015 to 2020)',
                       title_x = 0.5,
              sliders=sliders)

# Creation of the figure object:
fig = dict(data=data_slider, layout=layout) 

# to plot in the notebook
iplot(fig)

#To save the map
offline.plot(fig, auto_open=True, 
             image_width=4000, image_height=2000, 
             filename='Images/world_map.html', validate=True)

'Images/world_map.html'
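
The slider works by toggling trace visibility: there is one choropleth trace per year, and each slider step passes to `restyle` a list of booleans in which only the trace of the selected year is `True`. The sketch below builds these masks on their own, without plotly, to show the pattern.

```python
years = ["2015", "2016", "2017", "2018", "2019", "2020"]

steps = []
for i, year in enumerate(years):
    # One boolean per trace: only the trace of the current year is shown.
    visible = [j == i for j in range(len(years))]
    steps.append({"method": "restyle",
                  "args": ["visible", visible],
                  "label": "Year {}".format(year)})

print(steps[2]["label"], steps[2]["args"][1])
```

These masks only take effect once the slider is moved. Setting the initial `visible` attribute of every trace except the first to `False` is a common way to avoid several years being drawn on top of each other when the figure first loads; whether this is needed here depends on how the figure renders.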

#### Religion

We now look at the `religion` feature.

In [21]:
influence = 'religion'

In [22]:
df = calculate_percentage(df_uncertain_no_conditions_all_years, df_certain_no_conditions_all_years, influence)

We now focus on 5 of the major religions, best representing the global population.

In [24]:
labels = ['Certain speakers','Uncertain speakers']

# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows = 2, cols = 5, column_titles = ['Christianity', 'Hinduism', 'Islam','Atheism', 'Judaism'],
                    row_titles = ['Female', 'Male'], specs = [[{"type": "pie"}, {"type": "pie"}, {"type": "pie"}, 
                    {"type": "pie"}, {"type": "pie"}], [{"type": "pie"}, {"type": "pie"}, {"type": "pie"}, 
                    {"type": "pie"}, {"type": "pie"}]])

fig.update_annotations(font_size=12)

# Females
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.religion == 'Christianity']['Female'].values[0],
                                                df[df.religion == 'Christianity']['Female'].values[0]]), 1, 1)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.religion == 'Hinduism']['Female'].values[0],
                                                df[df.religion == 'Hinduism']['Female'].values[0]]), 1, 2)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.religion == 'Islam']['Female'].values[0],
                                                df[df.religion == 'Islam']['Female'].values[0]]), 1, 3)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.religion == 'atheism']['Female'].values[0],
                                               df[df.religion == 'atheism']['Female'].values[0]]), 1, 4)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.religion == 'Judaism']['Female'].values[0],
                                               df[df.religion == 'Judaism']['Female'].values[0]]), 1, 5)

# Males
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.religion == 'Christianity']['Male'].values[0],
       df[df.religion == 'Christianity']['Male'].values[0]]), 2, 1)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.religion == 'Hinduism']['Male'].values[0],
       df[df.religion == 'Hinduism']['Male'].values[0]]),2, 2)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.religion == 'Islam']['Male'].values[0],
       df[df.religion == 'Islam']['Male'].values[0]]), 2, 3)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.religion == 'atheism']['Male'].values[0],
       df[df.religion == 'atheism']['Male'].values[0]]),2, 4)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.religion == 'Judaism']['Male'].values[0],
       df[df.religion == 'Judaism']['Male'].values[0]]), 2, 5)

fig.update_traces(hole = .6, marker = dict(colors = [colors[2], colors[6]]))

fig.update_layout(title_text = "% of certain and uncertain speakers in the main 5 religions", title_x=0.5)

fig.show()
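
The ten `add_trace` calls follow a regular pattern: row 1 holds the female ratios, row 2 the male ratios, and each column is one religion. A loop over the labels can express the same layout. The sketch below only builds the `(row, col, values)` triples that would be passed to `go.Pie` and `fig.add_trace`; the ratios in `example_ratios` are placeholders, not results.

```python
religions = ["Christianity", "Hinduism", "Islam", "atheism", "Judaism"]
genders = ["Female", "Male"]

# Placeholder uncertainty ratios, for illustration only.
example_ratios = {(religion, gender): 0.25 for religion in religions for gender in genders}

pie_specs = []
for row, gender in enumerate(genders, start=1):
    for col, religion in enumerate(religions, start=1):
        uncertain = example_ratios[(religion, gender)]
        # Same order as `labels`: certain share first, uncertain share second.
        pie_specs.append((row, col, [1 - uncertain, uncertain]))

print(pie_specs[0], pie_specs[-1])
```

The lower-case `atheism` matches the label as it is used in the cell above.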

We save the plot for the data story.

In [25]:
"""html_path = IMAGE_PATH + "percentage_religions.html"
fig.write_html(html_path)""";

#### Academic Degree

Finally, we look at the `academic_degree` feature.

In [26]:
influence = 'academic_degree'

In [27]:
df = calculate_percentage(df_uncertain_no_conditions_all_years, df_certain_no_conditions_all_years, influence)

In [28]:
labels = ['Certain speakers','Uncertain speakers']

# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows = 2, cols = 5, column_titles = ["Bachelor's degree", 'Bachelor of Arts', 
                    'Doctor of Philosophy','Bachelor of Science', 'Doctorate'], row_titles = ['Female', 'Male'],
                     specs = [[{"type": "pie"}, {"type": "pie"}, {"type": "pie"}, {"type": "pie"}, {"type": "pie"}], 
                             [{"type": "pie"}, {"type": "pie"}, {"type": "pie"}, {"type": "pie"}, {"type": "pie"}]])

fig.update_annotations(font_size=12)

# Females
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.academic_degree == "bachelor's degree"]['Female'].values[0],
                                    df[df.academic_degree == "bachelor's degree"]['Female'].values[0]]), 1, 1)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.academic_degree == 'Bachelor of Arts']['Female'].values[0],
                                    df[df.academic_degree == 'Bachelor of Arts']['Female'].values[0]]), 1, 2)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.academic_degree == 'Doctor of Philosophy']['Female']. \
                                    values[0], df[df.academic_degree == 'Doctor of Philosophy']['Female'].values[0]]),
                                    1, 3)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.academic_degree == 'Bachelor of Science']['Female']. \
                                    values[0], df[df.academic_degree == 'Bachelor of Science']['Female'].values[0]]), 
                                    1, 4)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.academic_degree == 'doctorate']['Female'].values[0],
                                    df[df.academic_degree == 'doctorate']['Female'].values[0]]), 1, 5)

# Males
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.academic_degree == "bachelor's degree"]['Male'].values[0],
                                    df[df.academic_degree == "bachelor's degree"]['Male'].values[0]]), 2, 1)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.academic_degree == 'Bachelor of Arts']['Male'].values[0],
                                    df[df.academic_degree == 'Bachelor of Arts']['Male'].values[0]]), 2, 2)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.academic_degree == 'Doctor of Philosophy']['Male']. \
                                    values[0], df[df.academic_degree == 'Doctor of Philosophy']['Male'].values[0]]), 
                                    2, 3)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.academic_degree == 'Bachelor of Science']['Male'].values[0],
                                    df[df.academic_degree == 'Bachelor of Science']['Male'].values[0]]), 2, 4)
fig.add_trace(go.Pie(labels = labels, values = [1 - df[df.academic_degree == 'doctorate']['Male'].values[0],
                                    df[df.academic_degree == 'doctorate']['Male'].values[0]]), 2, 5)

fig.update_traces(hole = .5, marker = dict(colors = [colors[2], colors[6]]))

fig.update_layout \
        (title_text = "% of certain and uncertain speakers in the main 5 categories of academic degrees", title_x=0.5)

fig.show()

We save the figure.

In [29]:
"""html_path = IMAGE_PATH + "percentage_academic_degree.html"
fig.write_html(html_path)""";
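
The ten `add_trace` calls above differ only in the degree, the gender and the grid position. As a minimal sketch of how such a grid could be generated in a loop, the helper below builds one `(values, row, col)` tuple per pie. The name `pie_grid` and the `shares` values are hypothetical and purely illustrative; they are not taken from the analysis.

```python
def pie_grid(shares, categories, genders):
    """Build one (values, row, col) tuple per pie, with one row per gender."""
    specs = []
    for row, gender in enumerate(genders, start=1):
        for col, category in enumerate(categories, start=1):
            uncertain = shares[(category, gender)]
            # Same ordering as the labels: certain share first, uncertain share second
            specs.append(([1 - uncertain, uncertain], row, col))
    return specs

# Hypothetical shares of uncertain speakers, for illustration only
shares = {
    ("bachelor's degree", "Female"): 0.30,
    ("doctorate", "Female"): 0.25,
    ("bachelor's degree", "Male"): 0.28,
    ("doctorate", "Male"): 0.27,
}
specs = pie_grid(shares, ["bachelor's degree", "doctorate"], ["Female", "Male"])

for values, row, col in specs:
    # With plotly, this is where one would call:
    # fig.add_trace(go.Pie(labels = labels, values = values), row, col)
    print(row, col, values)
```

Keeping the grid logic separate from the plotting call makes it easy to check the positions before drawing anything.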

### 4.4 Possible variation from 2015 to 2020 <a class = anchor id="4.4"></a>

Finally, we want to see whether those distributions vary between years. Let's look at how the gender difference in speech uncertainty changes between 2015 and 2020.

#### Across occupations

Here again, we separate the different professional fields. We use the previously created dictionary containing all uncertain quotes from each profession per year (`df_uncertain_professions_per_year`) from part [4.2 Analysis of the genders distribution per professions](#4.2).

In [18]:
gender_dist_ratio_professions_per_year = {}
for i in range(15,21): # years 2015 to 2020
    gender_dist_ratio_professions_per_year['politicians_20' + str(i)] = \
                                            create_gender_dist(df_per_year['df_politicians_20' + str(i)],
                                            df_uncertain_professions_per_year['df_uncertain_politicians_20' + str(i)])
    gender_dist_ratio_professions_per_year['artists_20' + str(i)] = \
                                            create_gender_dist(df_per_year['df_artists_20' + str(i)],
                                            df_uncertain_professions_per_year['df_uncertain_artists_20' + str(i)])
    gender_dist_ratio_professions_per_year['scientists_20' + str(i)] = \
                                            create_gender_dist(df_per_year['df_scientists_20' + str(i)],
                                            df_uncertain_professions_per_year['df_uncertain_scientists_20' + str(i)])
    gender_dist_ratio_professions_per_year['economists_20' + str(i)] = \
                                            create_gender_dist(df_per_year['df_economists_20' + str(i)],
                                            df_uncertain_professions_per_year['df_uncertain_economists_20' + str(i)])

We visualise our ratio in percentages in a table.

In [24]:
df_years = pd.DataFrame(gender_dist_ratio_professions_per_year)
df_years.index = ['Female', 'Male']

We plot the change in the distributions of uncertain speakers from 2015 to 2020 in our four professional fields, after rearranging our data into a dataframe with one row per year.

In [21]:
df = pd.DataFrame(columns = ['female_politicians', 'female_artists', 'female_scientists', 'female_economists', 
                             'male_politicians', 'male_artists', 'male_scientists', 'male_economists'], 
                  index = ['2015', '2016', '2017', '2018', '2019', '2020'])

genders = ['Female', 'Male']
professions = ['politicians', 'artists', 'scientists', 'economists']

for i in range(15,21) : 
    year = '20' + str(i)
    for gender in genders:
        for profession in professions:
            # .at expects scalar labels, so we address rows and columns by label rather than by boolean mask
            df.at[year, gender.lower() + '_' + profession] = df_years.loc[gender, profession + '_' + year]

In [41]:
df.astype(float).round(2)

| Year | female_politicians | female_artists | female_scientists | female_economists | male_politicians | male_artists | male_scientists | male_economists |
|---|---|---|---|---|---|---|---|---|
| 2015 | 28.69 | 29.12 | 30.27 | 30.81 | 28.71 | 29.3 | 30.7 | 30.55 |
| 2016 | 28.04 | 29.02 | 29.92 | 31.03 | 28.54 | 29.34 | 30.21 | 31.5 |
| 2017 | 28.43 | 29.8 | 29.94 | 31.35 | 28.68 | 29.68 | 30.19 | 31.74 |
| 2018 | 28.68 | 30.33 | 30.75 | 30.99 | 28.99 | 30.27 | 30.69 | 31.33 |
| 2019 | 28.33 | 30.7 | 31.15 | 31.42 | 29.05 | 31.04 | 31.38 | 31.33 |
| 2020 | 29.09 | 30.61 | 31.3 | 29.9 | 29.42 | 31.39 | 32.17 | 30.38 |
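
The rearrangement from one column per profession and year to one row per year can also be expressed as a reshape. The following is a minimal sketch, assuming a frame shaped like `df_years` (genders as rows, `<profession>_<year>` as columns); the toy frame only reuses a few values from the table above and is not the full dataset.

```python
import pandas as pd

# Toy stand-in for df_years: rows are genders, columns are '<profession>_<year>'
toy_years = pd.DataFrame(
    {
        "politicians_2015": [28.69, 28.71],
        "politicians_2016": [28.04, 28.54],
        "artists_2015": [29.12, 29.30],
        "artists_2016": [29.02, 29.34],
    },
    index=["Female", "Male"],
)

# Split each column name into (profession, year) to get a two-level column index
toy_years.columns = pd.MultiIndex.from_tuples(
    [tuple(name.rsplit("_", 1)) for name in toy_years.columns],
    names=["profession", "year"],
)

# Transpose so that (profession, year) index the rows, then move the
# profession level back into the columns: one row per year remains
wide = toy_years.T.unstack("profession")

# Flatten the (gender, profession) column pairs into names like 'female_politicians'
wide.columns = [f"{gender.lower()}_{profession}" for gender, profession in wide.columns]
print(wide)
```

This avoids writing one assignment per gender and profession, at the cost of a slightly less explicit transformation.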


In [51]:
# Females
trace_fem_pol = go.Scatter(x = df.index, y = df['female_politicians'], name = "Female politicians", 
                   marker_color = colors[3], line = dict(width = 3))
trace_fem_art = go.Scatter(x = df.index, y = df['female_artists'], name = "Female artists", 
                   marker_color = colors[5], line = dict(width = 3))
trace_fem_sci = go.Scatter(x = df.index, y = df['female_scientists'], name = "Female scientists", 
                   marker_color = colors[7], line = dict(width = 3))
trace_fem_eco = go.Scatter(x = df.index, y = df['female_economists'], name = "Female economists", 
                   marker_color = colors[9], line = dict(width = 3))

# Males
trace_mal_pol = go.Scatter(x = df.index, y = df['male_politicians'], name = "Male politicians", 
                   marker_color = colors[3], line = dict(dash='dash', width = 3))
trace_mal_art = go.Scatter(x = df.index, y = df['male_artists'], name = "Male artists", 
                   marker_color = colors[5], line = dict(dash='dash', width = 3))
trace_mal_sci = go.Scatter(x = df.index, y = df['male_scientists'], name = "Male scientists", 
                   marker_color = colors[7], line = dict(dash='dash', width = 3))
trace_mal_eco = go.Scatter(x = df.index, y = df['male_economists'], name = "Male economists", 
                   marker_color = colors[9], line = dict(dash='dash', width = 3))

my_layout = {
    'title': 'Relative % of uncertain speakers within a gender for each profession <br> from 2015 to 2020',
    'xaxis': {'title': 'Years'},
    'yaxis': {'title': '% of uncertain speakers'},
    }

fig = go.Figure()
fig.add_trace(trace_fem_pol)
fig.add_trace(trace_fem_art)
fig.add_trace(trace_fem_sci)
fig.add_trace(trace_fem_eco)
fig.add_trace(trace_mal_pol)
fig.add_trace(trace_mal_art)
fig.add_trace(trace_mal_sci)
fig.add_trace(trace_mal_eco)                        
fig.update_layout(my_layout, title_x=0.5)

fig.show()

We then save the figure.

In [52]:
"""html_path = IMAGE_PATH + "percentage_uncertain_plot_across_prof_across_time.html"
fig.write_html(html_path)""";

Again, we see that the genders' distributions differ slightly depending on the profession.

#### Across the general dataframe (containing background information)

We compute our Female/Male ratios. We use the `df_uncertain_no_conditions_per_year` already defined in [4.3 Background influence](#4.3).

In [30]:
gender_dist_ratio_no_conditions_per_year = {}

for i in range(15,21):
    gender_dist_ratio_no_conditions_per_year['20' + str(i)] = create_gender_dist(df_no_conditions_per_year['20' + \
                                                          str(i)], df_uncertain_no_conditions_per_year['20' + str(i)])

The uncertainty percentages per gender per year are displayed in the following table.

In [31]:
df_ratio_back_years = pd.DataFrame()

for i in range(15,21):
    df_ratio_back_years['20' + str(i)] = pd.DataFrame(gender_dist_ratio_no_conditions_per_year['20' + str(i)])

df_ratio_back_years.index = ['Female', 'Male']
df_ratio_back_years.round(decimals = 2)

| Gender | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
|---|---|---|---|---|---|---|
| Female | 28.87 | 28.26 | 28.96 | 29.49 | 29.64 | 29.78 |
| Male | 28.77 | 28.68 | 29.06 | 29.39 | 29.94 | 30.31 |
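
To read the table more easily, one can compute the male minus female gap for each year and the change of each gender between 2015 and 2020. This is a small arithmetic sketch on the rounded percentages displayed above, not an additional analysis of the underlying data.

```python
years = ["2015", "2016", "2017", "2018", "2019", "2020"]

# Percentages of uncertain speakers, copied from the table above
female = [28.87, 28.26, 28.96, 29.49, 29.64, 29.78]
male = [28.77, 28.68, 29.06, 29.39, 29.94, 30.31]

# Male minus female gap per year, in percentage points
gap = {year: round(m - f, 2) for year, f, m in zip(years, female, male)}

# Change between the first and the last year for each gender
change_female = round(female[-1] - female[0], 2)
change_male = round(male[-1] - male[0], 2)

for year in years:
    print(year, gap[year])
print("change 2015 -> 2020:", change_female, "(female)", change_male, "(male)")
```

On these rounded figures, the gap changes sign from year to year and stays close to or below half a percentage point.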


We again create a dataframe containing the same data as `df_ratio_back_years`, rearranged with one row per year.

In [32]:
df = pd.DataFrame(columns = ['female', 'male'], 
                  index = ['2015', '2016', '2017', '2018', '2019', '2020'])

for i in range(15,21) : 
    year = '20' + str(i)
    # .at expects scalar labels, so we address rows and columns by label rather than by boolean mask
    df.at[year, 'female'] = df_ratio_back_years.loc['Female', year]
    df.at[year, 'male'] = df_ratio_back_years.loc['Male', year]
    

Finally, we visualise our distributions from 2015 to 2020.

In [33]:
trace_back_female = go.Scatter(x = df.index, y = df['female'], marker_color = colors[3], name = "Females", line = dict(width = 3))
trace_back_male = go.Scatter(x = df.index, y = df['male'], marker_color = colors[3], line = dict(dash = 'dash', width = 3),
                             name = "Males")


my_layout = {
    'title': 'Relative % of uncertain speakers from 2015 to 2020 in our general dataframe',
    'xaxis': {'title': 'Years'},
    'yaxis': {'title': '% of uncertain speakers'},
    }

fig = go.Figure()
fig.add_trace(trace_back_female)
fig.add_trace(trace_back_male)
fig.update_layout(my_layout, title_x = 0.5)

fig.show()

The figure is saved.

In [34]:
"""html_path = IMAGE_PATH + "percentage_uncertain_plot_across_back_across_time.html"
fig.write_html(html_path)""";

## 5. Statistical analysis <a class = anchor id="statanalysis"></a>

We now need to assess the statistical significance of our findings.

We make the following assumptions about our data:
- The observations were selected randomly from the populations of our choice.
- The proportions of males and females are unbalanced (fewer females), but the ratio is roughly constant across the different backgrounds and professions, so we assume that the parameters are independent and that there is no multicollinearity.
- We also assume that there is no autocorrelation between certainty in 2015 and certainty in the following years.

As a result, we decide to use OLS (Ordinary Least Squares) regression, as it provides a simple model of the relationship between the dependent and independent variables and its coefficients are easily interpretable. We will now perform linear regressions on the certainty label obtained from the Pajean uncertainty classifier to identify important features for certainty prediction.

It is important to keep in mind that the classifier has an F-score of 62.8, so some of the results we obtain could be due to chance.

We will consider features with p-values above 0.05 as not statistically significant.

### 5.1 Analysis of the genders distribution <a class = anchor id="5.1"></a>

First, we want to build a dataset labeled with uncertainty, on which we perform a linear regression to identify important features and possible correlations between them.

In [35]:
df_certain_no_conditions_all_years = df_no_conditions_all_years[~df_no_conditions_all_years.quoteID.isin(
                                                                df_uncertain_no_conditions_all_years.quoteID)].copy()

We label certain and uncertain quotes. Note that, despite its name, `uncertainty_label` is 1 for certain quotes and 0 for uncertain ones.

In [36]:
df_certain_no_conditions_all_years['uncertainty_label'] = 1
df_uncertain_no_conditions_all_years['uncertainty_label'] = 0
my_df = pd.concat([df_uncertain_no_conditions_all_years, df_certain_no_conditions_all_years], ignore_index = True)

We label the genders.

In [37]:
my_df['gender'] = qid_label.loc[my_df['gender']].Label.values

We run a first linear regression using gender as the only variable.

In [38]:
res = linear_reg(data = my_df, formula = 'uncertainty_label ~  C(gender, Treatment(reference="male"))')
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:      uncertainty_label   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     16.54
Date:                Fri, 17 Dec 2021   Prob (F-statistic):           4.76e-05
Time:                        19:24:04   Log-Likelihood:            -4.2420e+06
No. Observations:             6691519   AIC:                         8.484e+06
Df Residuals:                 6691517   BIC:                         8.484e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                                                       coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------

In this first analysis, which uses the whole dataset with no conditions, the model estimates that a quote from a male speaker is certain with probability 0.7045.
The speaker being female is correlated with a +0.19 percentage point change in the probability of the quote being certain (a quote from a female speaker is certain with probability 0.7064).
The p-value is below 0.001, so this result is statistically significant. Furthermore, the 95% CI (confidence interval) is entirely positive, between 0.1 and 0.3 percentage points.
It is important to note that our model cannot properly predict uncertainty using only the gender feature, as we obtain an R-squared close to 0.
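
With a single 0/1 variable, the OLS intercept is the mean of the label in the reference group and the coefficient is the difference between the two group means, which is how the 0.7045 and +0.19 figures above are read. The following is a minimal sketch on toy data using the closed-form formulas for simple linear regression; the data and the helper name `ols_simple` are hypothetical and only illustrate the interpretation.

```python
def ols_simple(x, y):
    """Closed-form OLS for y = intercept + slope * x, plus the R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = slope * sxy / syy  # explained sum of squares over total sum of squares
    return intercept, slope, r_squared

# Toy data: x = 1 for female and 0 for male; y = 1 for a certain quote
x = [0] * 10 + [1] * 5
y = [1] * 7 + [0] * 3 + [1] * 4 + [0] * 1

intercept, slope, r_squared = ols_simple(x, y)

male_mean = 7 / 10    # share of certain quotes in the reference group
female_mean = 4 / 5   # share of certain quotes in the other group

print("intercept:", round(intercept, 4), "reference mean:", male_mean)
print("slope:", round(slope, 4), "difference of means:", round(female_mean - male_mean, 4))
print("R-squared:", round(r_squared, 4))
```

Even with a 10 percentage point difference between the two toy groups, the R-squared stays close to 0, because a single binary variable explains little of the variance of a 0/1 label.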

### 5.2 Analysis of the gender distribution per professions <a class = anchor id="5.2"></a>

We build `df_all_professions_all_years`, a dataset that contains all quotes from the professions [artists, economists, politicians, scientists] for all years. To avoid memory errors, we use samples of our datasets. The samples are loaded below.


In [23]:
small_df_per_year = {}
for i in range(15, 21): # years 2015 to 2020
    print(i)
    small_df_per_year['df_politicians_20' + str(i)] = pd.read_pickle(DATA_PATH + "small_politicians_20" + str(i) + \
                                                                     ".pck")
    small_df_per_year['df_artists_20' + str(i)] = pd.read_pickle(DATA_PATH + "small_artists_english_20" + str(i) + \
                                                                     ".pck")
    small_df_per_year['df_scientists_20' + str(i)] = pd.read_pickle(DATA_PATH + "small_scientists_20" + str(i) + \
                                                                     ".pck")
    small_df_per_year['df_economists_20' + str(i)] = pd.read_pickle(DATA_PATH + "small_economists_20" + str(i) + \
                                                                     ".pck")

15
16
17
18
19
20


Then, we create one data frame per profession.

In [24]:
df_professions_all_years = {} # we start from the 2015 samples
df_professions_all_years['politicians'] = small_df_per_year['df_politicians_2015']
df_professions_all_years['artists'] = small_df_per_year['df_artists_2015']
df_professions_all_years['scientists'] = small_df_per_year['df_scientists_2015']
df_professions_all_years['economists'] = small_df_per_year['df_economists_2015']

# We concatenate over different years
for i in range(16,21): # years 2016 to 2020
    print(i)
    df_professions_all_years['politicians'] = pd.concat([df_professions_all_years['politicians'], 
                                                              small_df_per_year['df_politicians_20' + str(i)]])
    
    df_professions_all_years['artists'] = pd.concat([df_professions_all_years['artists'], 
                                                              small_df_per_year['df_artists_20' + str(i)]])
    
    df_professions_all_years['scientists'] = pd.concat([df_professions_all_years['scientists'], 
                                                              small_df_per_year['df_scientists_20' + str(i)]])
    
    df_professions_all_years['economists'] = pd.concat([df_professions_all_years['economists'], 
                                                              small_df_per_year['df_economists_20' + str(i)]])

16
17
18
19
20


Those data frames are then concatenated to create `df_all_professions_all_years`.

In [25]:
occupations = ['artists', 'economists', 'politicians', 'scientists']

df_all_professions_all_years = pd.DataFrame()

# We concatenate over different professions
for occupation in occupations:
    df_professions_all_years[occupation]['occupation_label'] = occupation
    
df_all_professions_all_years = pd.concat([df_professions_all_years[occupation] for occupation in occupations])


We label certain and uncertain quotes.

In [26]:
df_certain_all_professions_all_years = df_all_professions_all_years[~df_all_professions_all_years.quoteID.isin(
                                                                df_uncertain_no_conditions_all_years.quoteID)].copy()
df_uncertain_all_professions_all_years = df_all_professions_all_years[df_all_professions_all_years.quoteID.isin(
                                                                df_uncertain_no_conditions_all_years.quoteID)].copy()
df_certain_all_professions_all_years['uncertainty_label'] = 1
df_uncertain_all_professions_all_years['uncertainty_label'] = 0
my_professions_df = pd.concat([df_uncertain_all_professions_all_years, df_certain_all_professions_all_years], 
                              ignore_index = True)

As the occupations are categorical, we split them into multiple features with one-hot encoding. Each profession is associated with its own column. For example, an artist quote looks as follows:

In [27]:
an_artist_quote_data = {"Quote": ["An artist quote"], "Artist":[1], "Politician":[0], "Economist":[0], "Scientist":[0]} 
an_artist_quote = pd.DataFrame(an_artist_quote_data)
an_artist_quote

| | Quote | Artist | Politician | Economist | Scientist |
|---|---|---|---|---|---|
| 0 | An artist quote | 1 | 0 | 0 | 0 |


We now perform this on all our data.

In [28]:
X = pd.get_dummies(my_professions_df['occupation_label'])
my_professions_df = my_professions_df.join(X)

In [29]:
my_professions_df['gender'] = qid_label.loc[my_professions_df['gender']].Label.values

Now that we have our labeled dataset to perform linear regression, let's start!

We fit a linear regression of the uncertainty label on the variables *politicians*, *gender* and their interaction.
For *gender*, the reference is male; for *politicians*, it is 0 (not a politician).
The same analysis can be performed independently for each profession.

In [30]:
res_politicians_gender = linear_reg(data = my_professions_df, formula = 'uncertainty_label ~ \
    politicians * C(gender, Treatment(reference="male"))')

print(res_politicians_gender.summary())

                            OLS Regression Results                            
Dep. Variable:      uncertainty_label   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     57.13
Date:                Thu, 16 Dec 2021   Prob (F-statistic):           6.61e-37
Time:                        15:51:06   Log-Likelihood:                 8622.8
No. Observations:              240000   AIC:                        -1.724e+04
Df Residuals:                  239996   BIC:                        -1.720e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                                                                   coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------

In our model, a reference quote (from a male who is not a politician) is associated with a probability of 0.9469 of being certain.
We can immediately note that the base rate of certainty is about 24 percentage points higher on this dataset restricted to artists, politicians, economists and scientists (0.9469, against 0.7045 on the general dataset).

Compared to that reference, a female who is not a politician is correlated with a -0.61 percentage point change in the probability of having issued a certain quote.
Similarly, a male who is a politician is correlated with a -1.58 percentage point change in the probability of having issued a certain quote.

A female politician would be correlated with a +1.16 percentage point change in the probability of having issued a certain quote compared to the reference (a male who is not a politician).
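
The four cases discussed above can be assembled from the coefficients of a model with two 0/1 variables and their interaction. The following is a minimal sketch using the figures quoted in the text. The interaction coefficient is not read from the regression table: it is inferred here under the assumption that +1.16 is the total difference between a female politician and the reference. The helper name `predicted_certainty` is hypothetical.

```python
def predicted_certainty(female, politician, coef):
    """Fitted probability for a model with two 0/1 variables and their interaction."""
    return (coef["intercept"]
            + coef["female"] * female
            + coef["politician"] * politician
            + coef["female:politician"] * female * politician)

# Figures quoted in the text
coef = {"intercept": 0.9469, "female": -0.0061, "politician": -0.0158}

# Assumption: +0.0116 is the total difference from the reference, so the
# interaction term is whatever remains once both main effects are removed
coef["female:politician"] = 0.0116 - coef["female"] - coef["politician"]

for female in (0, 1):
    for politician in (0, 1):
        p = predicted_certainty(female, politician, coef)
        print("female =", female, "politician =", politician, "->", round(p, 4))
```

The main effects alone describe a female non-politician and a male politician; the female politician case also needs the interaction term.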




Now we will use all occupation features to compare their relative correlations.

In [31]:
res_all_gender = linear_reg(data = my_professions_df, formula = 'uncertainty_label ~ \
    scientists * C(gender, Treatment(reference="male"))+\
        politicians * C(gender, Treatment(reference="male"))+artists * C(gender, Treatment(reference="male"))')

print(res_all_gender.summary())

                            OLS Regression Results                            
Dep. Variable:      uncertainty_label   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     163.9
Date:                Thu, 16 Dec 2021   Prob (F-statistic):          7.69e-243
Time:                        15:51:07   Log-Likelihood:                 9109.3
No. Observations:              240000   AIC:                        -1.820e+04
Df Residuals:                  239992   BIC:                        -1.812e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                                                   coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------

Note:

   - The model is getting better but still has no real predictive value: R-squared = 0.005
   - Being a scientist is correlated with the largest increase in certainty probability: +3.02%
   - Being female is correlated with the largest drop in certainty probability: -1.32%, independently of the speaker's profession.
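
The formula above includes only three of the four occupation dummies, which leaves economists as the reference category. As a minimal sketch of why one dummy is left out when the model has an intercept: the four one-hot columns always sum to 1, so the fourth column carries no additional information. The toy labels and the helper name `one_hot` are hypothetical.

```python
occupations = ["artists", "economists", "politicians", "scientists"]

def one_hot(label, categories):
    """Return a 0/1 list with a single 1 at the position of the label."""
    return [1 if label == category else 0 for category in categories]

# Toy occupation labels, for illustration only
labels = ["artists", "scientists", "economists", "politicians", "artists"]
rows = [one_hot(label, occupations) for label in labels]

# With all four dummies, every row sums to 1, exactly like the intercept column
row_sums = [sum(row) for row in rows]

# Dropping the reference category removes this redundancy:
# an economist is then encoded as all zeros
kept = [category for category in occupations if category != "economists"]
reduced = [one_hot(label, kept) for label in labels]

print(row_sums)
print(reduced)
```

Each remaining coefficient is then read as a difference from the economists baseline.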
    

### 5.3 Analysis of background influence regarding gender distribution <a class = anchor id="5.3"></a>

We build separate datasets to group the quotes per `nationality`, `religion` and `academic_degree`, as the intersection of those datasets would be too small to be properly analysed.

In [32]:
df_nat, df_rel, df_eth, df_aca = pd.DataFrame(), pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

We store our datasets in a dictionary and, for each category, keep the most represented values.

In [33]:
dict_df = {
  "nationality": df_nat,
  "religion": df_rel,
  "ethnic_group": df_eth,
  "academic_degree": df_aca
}

For each influence, a selected list of values is kept for further analysis.

In [37]:
for influence in influences:
    print(influence)
    dict_df[influence] = my_df.copy(deep = True)
    dict_df[influence] = dict_df[influence][~dict_df[influence][influence].isin([None, "None"])]
    dict_df[influence][influence] = dict_df[influence][influence].apply(lambda x: x[0])
    dict_df[influence][influence] = qid_label.loc[dict_df[influence][influence]].Label.values
    if influence == 'nationality':
        dict_df[influence] = dict_df[influence][dict_df[influence]['nationality'].isin([
            'United States of America', 'United Kingdom', 'India', 'Canada', 'Australia'])]
        X = pd.get_dummies(dict_df[influence]['nationality'])
        dict_df[influence] = dict_df[influence].join(X)
        dict_df[influence].rename(columns = {'United States of America': 'USA', 'United Kingdom': 'UK'}, 
                                  inplace = True)
    if influence == 'religion':
        dict_df[influence] = dict_df[influence][dict_df[influence]['religion'].isin([
            'Catholicism', 'Judaism', 'Hinduism', 'Islam', 'atheism', 'Anglicansim', 'agnosticism'])]
        X = pd.get_dummies(dict_df[influence]['religion'])
        dict_df[influence] = dict_df[influence].join(X)
    if influence == 'academic_degree':
        dict_df[influence] = dict_df[influence][dict_df[influence]['academic_degree'].isin([
            'Bachelor of Arts', 'Bachelor of Science', 'Juris Doctor', 'Doctor of Philosophy', "bachelor's degree"])]
        X = pd.get_dummies(dict_df[influence]['academic_degree'])
        dict_df[influence] = dict_df[influence].join(X)
        dict_df[influence].rename(columns = {'Bachelor of Arts': 'Bachelor_of_Arts', 
                                             'Bachelor of Science': 'Bachelor_of_Science',
                                             'Juris Doctor': 'Juris_Doctor', 
                                             'Doctor of Philosophy': 'Doctor_of_Philosophy',
                                             "bachelor's degree": "bachelor_degree"}, 
                                  inplace = True)                    

nationality
religion
ethnic_group
academic_degree
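
The three branches of the loop above repeat the same steps: keep the rows whose value is in a selected list, one-hot encode the column, join the dummies and rename them to formula-friendly names. The following is a minimal sketch of how this could be factored into a helper. The name `filter_and_encode` and the toy data are hypothetical, and `dtype = int` is an assumption made here so that the dummies are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

def filter_and_encode(df, column, kept_values, renames=None):
    """Keep the rows whose `column` is in `kept_values` and add one dummy column per value."""
    subset = df[df[column].isin(kept_values)]
    dummies = pd.get_dummies(subset[column], dtype=int)
    # join aligns on the index, so the index of df must be unique
    return subset.join(dummies).rename(columns=renames or {})

# Toy data, for illustration only
toy = pd.DataFrame({
    "nationality": ["United Kingdom", "India", "France", "India"],
    "uncertainty_label": [1, 0, 1, 1],
})

encoded = filter_and_encode(
    toy, "nationality", ["United Kingdom", "India"], renames={"United Kingdom": "UK"}
)
print(encoded)
```

The renaming step matters because column names containing spaces cannot be used directly as terms in a regression formula.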


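Let's have a brief overview of the mean of `uncertainty_label` (the share of certain quotes) for four of these nationalities. Note that the printed `std` is the standard deviation of the nationality indicator column itself, not of `uncertainty_label` within that nationality: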

In [35]:
print('United States of America: mean: ', dict_df['nationality'][dict_df['nationality']['USA'] == 1]\
            ['uncertainty_label'].mean(), "std: ", dict_df['nationality']['USA'].std(),
      '\nUnited Kingdom: mean:', dict_df['nationality'][dict_df['nationality']['UK'] == 1]\
            ['uncertainty_label'].mean(),"std: ", dict_df['nationality']['UK'].std(),
      '\nIndia: mean: ', dict_df['nationality'][dict_df['nationality']['India'] == 1]\
            ['uncertainty_label'].mean(), "std: ", dict_df['nationality']['India'].std(),
      '\nCanada: mean: ', dict_df['nationality'][dict_df['nationality']['Canada'] == 1]\
            ['uncertainty_label'].mean(),"std: ", dict_df['nationality']['Canada'].std())

United States of America: mean:  0.712707136676242 std:  0.4866571075234479 
United Kingdom: mean: 0.7069002279847135 std:  0.38707031947458437 
India: mean:  0.6160903141767206 std:  0.2386860366518763 
Canada: mean:  0.7036667910248894 std:  0.2561332709851297
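
For a 0/1 label, the standard deviation within a group follows directly from its mean p as sqrt(p(1 - p)). The following short sketch applies this to the means printed above. It uses the population formula, so values computed on the data with the pandas default (`ddof = 1`) would differ very slightly.

```python
import math

# Share of certain quotes per nationality, copied from the printed output
mean_certain = {
    "United States of America": 0.712707136676242,
    "United Kingdom": 0.7069002279847135,
    "India": 0.6160903141767206,
    "Canada": 0.7036667910248894,
}

def binary_std(p):
    """Population standard deviation of a 0/1 variable whose mean is p."""
    return math.sqrt(p * (1 - p))

label_std = {country: binary_std(p) for country, p in mean_certain.items()}

for country, std in label_std.items():
    print(f"{country}: std of uncertainty_label ~ {std:.4f}")
```

Since the spread of a binary label is fully determined by its mean, it adds no information beyond the share of certain quotes.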


Let's look at some possible analyses!

In [46]:
res_nationality_gender = linear_reg(data = dict_df['nationality'], formula = 'uncertainty_label ~ \
    USA * C(gender, Treatment(reference="male")) + UK * C(gender, Treatment(reference="male")) +\
        Canada * C(gender, Treatment(reference="male")) + India * C(gender, Treatment(reference="male"))')

print(res_nationality_gender.summary())

                            OLS Regression Results                            
Dep. Variable:      uncertainty_label   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     1489.
Date:                Thu, 16 Dec 2021   Prob (F-statistic):               0.00
Time:                        16:02:53   Log-Likelihood:            -3.3650e+06
No. Observations:             5316609   AIC:                         6.730e+06
Df Residuals:                 5316599   BIC:                         6.730e+06
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                                                              coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------

Note:

    - India is correlated with a large drop in certainty probability: -8.51%.

We now perform a linear regression on religion features related to gender.

In [None]:
res_religion_gender = linear_reg(data = dict_df['religion'], formula = 'uncertainty_label ~ \
    Catholicism * C(gender, Treatment(reference="male")) + Judaism * C(gender, Treatment(reference="male")) + \
        Hinduism * C(gender, Treatment(reference="male")) + Islam * C(gender, Treatment(reference="male")) + \
            atheism * C(gender, Treatment(reference="male"))')

print(res_religion_gender.summary())

                            OLS Regression Results                            
Dep. Variable:      uncertainty_label   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     235.4
Date:                Wed, 15 Dec 2021   Prob (F-statistic):               0.00
Time:                        17:07:36   Log-Likelihood:            -3.0636e+05
No. Observations:              480782   AIC:                         6.127e+05
Df Residuals:                  480770   BIC:                         6.129e+05
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                                                                   coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------

Note:

    - Lowest correlated certainty: Hinduism => -7.51% and Female Islam => -2.34%
    - Highest: Catholicism and Judaism

Next, we do a linear regression on academic degree features related to gender:

In [None]:
res_degree_gender = linear_reg(data = dict_df['academic_degree'], formula = 'uncertainty_label ~ \
    bachelor_degree * C(gender, Treatment(reference="male")) + Bachelor_of_Arts * C(gender, \
    Treatment(reference="male")) + Bachelor_of_Science * C(gender, Treatment(reference="male")) + \
    Doctor_of_Philosophy * C(gender, Treatment(reference="male"))')

print(res_degree_gender.summary())

                            OLS Regression Results                            
Dep. Variable:      uncertainty_label   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     55.72
Date:                Wed, 15 Dec 2021   Prob (F-statistic):          3.87e-102
Time:                        17:14:46   Log-Likelihood:            -1.1760e+05
No. Observations:              195519   AIC:                         2.352e+05
Df Residuals:                  195509   BIC:                         2.353e+05
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                                                                            coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------

Notes:

    - Lowest: Doctor of Philosophy
    - Highest: Female Doctor of Philosophy and Female Bachelor degree

### 5.4 Possible variation from 2015 to 2020 <a class = anchor id="5.4"></a>

Finally, we analyse a possible variation from 2015 to 2020. To start, we add a year label to the dataset.

In [None]:
my_df['year_label'] = pd.DatetimeIndex(my_df['date']).year - 2015

my_df['year_label'].unique()

array([0, 1, 2, 3, 4, 5], dtype=int64)

A value of 0 corresponds to the year 2015 (first year in our dataset) and other values correspond to the number of years elapsed since 2015.

For example, 2016 is encoded as 1.

We now perform a linear regression on years and gender.

In [48]:
res_year_gender = linear_reg(data = my_df, formula = 'uncertainty_label ~ \
    year_label * C(gender, Treatment(reference="male"))')

print(res_year_gender.summary())

                            OLS Regression Results                            
Dep. Variable:      uncertainty_label   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     369.0
Date:                Wed, 15 Dec 2021   Prob (F-statistic):          1.10e-239
Time:                        15:08:59   Log-Likelihood:            -4.2415e+06
No. Observations:             6691519   AIC:                         8.483e+06
Df Residuals:                 6691515   BIC:                         8.483e+06
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                                                                  coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------

Notes: 

    - It is interesting to observe that each year elapsed since 2015 correlates with a 0.35% drop in the certainty probability of the quotes.
    - On the other hand, we can see that quotes by women are becoming slightly more certain, by 0.1% each year.

This suggests that quotes are generally becoming more uncertain year after year, whereas quotes by women are getting slightly more certain.
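
As a reading aid for formulas of the form `year_label * C(gender, ...)`: with treatment coding, a model with a full interaction fits one line per gender. The `year_label` coefficient is the yearly slope for the reference group (men), and the interaction coefficient is the difference between the women's slope and the men's slope. The sketch below checks this on synthetic numbers (invented for illustration, not our results) using only the standard library. `fit_line` and `interaction_terms` are hypothetical helpers, not part of the notebook, and the dictionary keys are simplified labels rather than the exact term names printed in the regression summary.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def interaction_terms(rows, reference="male", other="female"):
    """Express two per-group line fits as treatment-coded interaction terms."""
    fits = {}
    for group in (reference, other):
        xs = [year for year, gender, _ in rows if gender == group]
        ys = [value for _, gender, value in rows if gender == group]
        fits[group] = fit_line(xs, ys)
    (a_ref, b_ref), (a_oth, b_oth) = fits[reference], fits[other]
    return {
        "intercept": a_ref,                          # reference group at year 0
        "year_slope_reference": b_ref,               # yearly slope, reference group
        "gender_offset": a_oth - a_ref,              # shift at year 0
        "year_gender_interaction": b_oth - b_ref,    # difference in slopes
        "year_slope_other": b_oth,                   # sum of the two slope terms
    }


# Synthetic (year_label, gender, certainty) rows, invented for illustration.
rows = [(year, "male", 0.70 - 0.004 * year) for year in range(6)]
rows += [(year, "female", 0.71 - 0.002 * year) for year in range(6)]

terms = interaction_terms(rows)
for name, value in terms.items():
    print(f"{name}: {value:+.4f}")
```

In this synthetic case the interaction term is positive while the slope for the non-reference group is still negative, because the group's own yearly trend is the sum of the reference slope and the interaction term.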

### Statistical analysis summary:

We performed linear regressions using the certainty label obtained from the Pajean uncertainty classifier to identify important features for certainty prediction. It is important to keep in mind that the classifier has an F-score of 62.8, so some of the results we obtain could be due to chance.
We considered features with a p-value above 0.05 not statistically significant.
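
The significance rule above can be sketched as a small filter. This is a minimal illustration, not the code used in this notebook: `significant_terms` is a hypothetical helper, and the coefficients and p-values below are made up.

```python
def significant_terms(params, pvalues, alpha=0.05):
    """Return the coefficients whose p-value is below alpha."""
    return {name: coef for name, coef in params.items() if pvalues[name] < alpha}


# Hypothetical coefficients and p-values, invented for illustration only.
example_params = {"Intercept": 0.5, "term_a": -0.08, "term_b": 0.004}
example_pvalues = {"Intercept": 0.0, "term_a": 0.001, "term_b": 0.412}

kept = significant_terms(example_params, example_pvalues)
print(kept)
```

Assuming `linear_reg` returns a statsmodels results object fitted from a formula, `res.params` and `res.pvalues` are pandas Series indexed by term name, so they could be passed to this helper in place of the dictionaries.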

Considering all quotes, our first finding was that the average certainty of a man's quote was 0.70, and that a woman's quote is correlated with a 0.19% increase in certainty probability. This could be slightly unintuitive at first.

Then, we selected a smaller portion of the quotes, keeping only quotes from artists, economists, politicians and scientists.
Our first surprise was that within this population, the base probability of a quote being certain was 93%, more than 20 percentage points higher than for quotes from any individual.
In this subset, scientists were estimated to be the most certain, with a 3% increase in estimated quote certainty.
We also found that the feature lowering the certainty probability the most in our model was being female, with a 1.3% drop in certainty probability regardless of the woman's profession.

Going back to our full dataset, we analysed the backgrounds of the speakers.
We found that Indians' quotes are correlated with a substantial 8.5% drop in certainty probability. Similarly, Hindus' quotes are associated with a 7.5% fall in certainty. Holders of a Doctor of Philosophy degree are also associated with lower certainty (-7.5%).

Interactions between background and gender were rarely significant. This could be due to the strongly unequal proportions of men and women in the different background categories. Our most representative case of interaction was a woman with a Bachelor of Science. In our model, having a Bachelor of Science is correlated with a 1.7% drop in the probability of a certain quote, and being a woman with a Bachelor of Science with a 4.5% drop.

Finally, we found that quotes were seemingly getting more uncertain (-0.35% per year), whereas women's quotes are becoming slightly more certain (+0.1% per year).

## 6. Conclusion <a class = anchor id="conclusion"></a>

(OLD TEXT)

Through this notebook, we aimed to analyse the speech difference between women and men using the Quotebank dataset. We started from the hypothesis that women speak less confidently than men and in a more uncertain way. To verify this claim, we conducted an analysis with the help of a classifier which distinguishes uncertain quotations from certain quotations. We also used Wikidata as a supplementary data source to study the speakers of the quotations more closely.  
We split the data frame by `occupation`, `religion`, `nationality` and `education` to measure the impact of each factor and to reduce bias. For our initial question, it seems that there is no significant difference between men and women when compared within the same field of work. However, it seems that women in some cultures or at some education levels do show more speech uncertainty than men. It is important to follow up these initial suggestions with a robust statistical analysis including hypothesis testing.  
As a continuation of our milestone 2, it would be interesting to dive deeper into the statistical analysis of our findings, as well as to generalize to our whole Quotebank dataset from 2015 to 2020.

+ conclusions M3?

+  further extensions on the project.?