# ADA - Project Milestone 2: 
# *Analysis of speech behaviours between genders*

## Context

In this project, we analyze data from Quotebank, an open corpus of 178 million quotations collected between 2008 and 2020. We focus only on the most recent quotations, from 2015 to 2020.

We are interested in using this dataset to answer the following question: Do speech behaviours related to confidence and uncertainty vary between men and women?

To answer this question, we'll go through the following points:

1. To what extent can we observe the differences in communicative acts in relation to gender within a professional area? Are there noticeable differences between those professional areas?
2. What are the roles of environment (nationality), culture/tradition (religion, ethnic groups), and education (whether the speaker obtained an academic degree) in determining those differences in speech between men and women? How are the lines drawn between the language we use and the environment around us?
3. Has there been a possible change over time (from 2015 to 2020)?

To access information about the speakers (e.g., their gender), we use the open data from [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page).

To analyse speech uncertainty, we use an uncertainty detection classifier adapted from the following paper: P. A. Jean, S. Harispe, S. Ranwez, P. Bellot, and J. Montmain, "[Uncertainty detection in natural language: A probabilistic model](https://www.researchgate.net/publication/303842922)," ACM Int. Conf. Proceeding Ser., vol. 13-15-June, no. June, 2016, doi: 10.1145/2912845.2912873.

## Table of contents

[1. Pre-processing of the data](#pre-processing) 
- [Imports](#1imports)
- [Pathways](#1pathways)
- [Functions](#1functions)
- [Merging files from wikidata into one file containing the 9 million speakers](#1merging)
- [1.1 Loading and pre-processing of Quotebank data](#1.1)
- [1.2 Analysis of the quotes from Quotebank](#1.2)
- [1.3 Loading wikidata labels](#1.3) 
- [1.4 Pre-processing of wikidata](#1.4)
- [1.5 Exploratory Data Analysis of wikidata](#1.5)

[2. Creation of our sub data frames](#dataframes)
- [Functions](#2functions)
- [2.1 Creation of professional fields](#2.1)
- [2.2 Creation of sub-dataframes](#2.2) 
- [2.3 Saving of all sub data frames](#2.3)

[3. Classification of the quotes](#classifier)
- [Pathways](#3pathways)
- [Functions](#3functions)
- [3.1 Reading of all sub data frames](#3.1)
- [3.2 Creation of the text files](#3.2)
- [3.3 Use of the uncertainty detection classifier](#3.3) 

[4. Statistical analysis](#stat_analysis)
- [Functions](#4functions)
- [4.1 Creation of the uncertain and certain dataframes](#4.1)
- [4.2 Gender distribution across occupations](#4.2)
- [4.3 Analysis of the gender distribution per professions](#4.3) 
- [4.4 Background influence](#4.4)
- [4.5 Possible variation from 2015 to 2020](#4.5)

[5. First interpretation of results and upcoming steps](#interpretation)

## 1. Pre-processing of the data <a class = anchor id="pre-processing"></a>

### Imports <a class = anchor id="1imports"></a>

Let's start by importing our libraries. 

In [1]:
import pandas as pd
import numpy as np
from langdetect import detect

In [2]:
# DATA_PATH contains all data frames
DATA_PATH = 'Data/'
# PATH_PARQUET contains all the data from wikidata
PATH_PARQUET = 'Data_parquet/'

In [3]:
def saving_wikidata(path):
    """
    Merges all the Wikidata parquet files into one dataset saved as a pickle file.
    This makes the data quicker and easier to load and use.
    Inputs:
        * path : pathway where to save the pickle file
    """
    wikidata_all = pd.DataFrame()
    for i in range(1,16):
        if i < 10:
            DATA_FILE = 'part-0000{}-0d587965-3d8f-41ce-9771-5b8c9024dce9-c000.snappy.parquet'.format(i)
        else:
            DATA_FILE = 'part-000{}-0d587965-3d8f-41ce-9771-5b8c9024dce9-c000.snappy.parquet'.format(i)
        wikidata = pd.read_parquet(PATH_PARQUET + DATA_FILE)
        wikidata_all = pd.concat([wikidata_all, wikidata])
    wikidata_all.to_pickle(path)
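
The parquet part files follow a fixed naming scheme in which only the zero-padded part number changes. As a minimal sketch, the names can be generated with a single format specifier; `part_file_names` is a hypothetical helper that is not part of the notebook, and the suffix is copied from the cell above.

```python
# Suffix copied from the file names used in saving_wikidata
SUFFIX = '0d587965-3d8f-41ce-9771-5b8c9024dce9-c000.snappy.parquet'

def part_file_names(first=1, last=15):
    """Return the parquet part names for part numbers first..last (inclusive)."""
    # {:05d} pads the part number with zeros to five digits (1 -> 00001, 10 -> 00010)
    return ['part-{:05d}-{}'.format(i, SUFFIX) for i in range(first, last + 1)]

names = part_file_names()
print(names[0])
print(names[-1])
```

Each name can then be appended to `PATH_PARQUET` before calling `pd.read_parquet`.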

In [4]:
qid_label = pd.read_csv(DATA_PATH+'wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')

In [11]:
def add_columns(column, target, init_df, name_column):
    """
    Checks if a target ('politician', 'male', 'female' etc...) is in a certain column.
    If it is, we return True in an additional column (name_column).
    Inputs:
        * column : name of column to search for target
        * target : item of interest 
        * init_df : initial data frame
        * name_column : name of new column of booleans
    Outputs:
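        * final_df : copy of the data frame with the new column of booleans
    """
    final_df = init_df.copy(deep = True)
    # Flag the rows where at least one item of the column belongs to the target.
    # (np.any(x) would reduce the items to a single value before the membership test.)
    target_set = set(target)
    final_df[name_column] = final_df[column].apply(
        lambda x: isinstance(x, (list, tuple, np.ndarray)) and any(item in target_set for item in x))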
    return final_df


def extracting_sub_df(quotebank, wikidata, column):
    """
    Creates a sub dataframe with information from quotebank and wikidata.
    We only take the rows in column which are True.
    Inputs:
        * quotebank : data frame extracted from quotebank
        * wikidata : data frame extracted from wikidata
        * column : name of the column of booleans used to filter the merged rows
    Outputs :
        * sub_df : merged dataframe 
    """
    merged_df = pd.merge(quotebank, wikidata, left_on = 'speaker', right_on = 'label')
    merged_df['qids'] = merged_df['qids'].apply(lambda x : x[0])
    merged_df = merged_df[merged_df['qids'] == merged_df['id']]
    sub_df = merged_df[merged_df[column] == True]
    return sub_df


def clean_quotebank(df):
    """ 
    Cleans the quotebank dataset by dropping quotes from unknown speakers and
    quotes where the speaker is uncertain (p<0.5)
    Inputs:
        * df : quotebank data frame to clean  
    Outputs:
        * df_copy : cleaned data frame
    """ 
    df_copy = df.copy(deep = True)
    df_copy = df_copy[~df_copy.speaker.isin(['None', None])]
    df_copy =  df_copy[df_copy['probas'].apply(lambda x: x[0][1]).values.astype(float) > 0.5]
    return df_copy 


def create_df_with_conditions(column, condition, wikidata, column_temp, start, stop):
    """
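    From quotebank data, extracts and returns a data frame with only the rows that satisfy the condition.
    Inputs:
        * column : name of column where the condition is applied
        * condition : condition of interest 
        * wikidata : wiki database
        * column_temp : name of the temporary column of booleans
        * start : first year, for example 15 for 2015. Has to be between 15 and 21
        * stop : last year + 1, for example 16 if you only want 2015. Has to be between 15 and 21
    Outputs:
        * sub_df : data frame with only the rows that contain the item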
    """
    wiki_plus = add_columns(column, condition, wikidata, column_temp)
    sub_df = pd.DataFrame()
    # Milestone 2: (20,21) and (15,16) = only for 2020 quotes; Milestone 3: (15,21) = quotes from 2015 to 2020
    for i in range(start,stop):
        DATA_FILE = 'quotes-20{}.json.bz2'.format(i)
        with pd.read_json(DATA_PATH + DATA_FILE, lines = True, compression ='bz2', chunksize = 100000) as df_reader:
            for chunk in df_reader:
                sub_df = pd.concat([sub_df, extracting_sub_df(clean_quotebank(chunk), wiki_plus, column_temp)])
    sub_df = sub_df.drop(column_temp, axis = 1)
    return sub_df


def merging(quotebank, wikidata):
    """
    Creates a sub dataframe with information from quotebank and wikidata.
    Inputs:
        * quotebank : data frame extracted from quotebank
        * wikidata : data frame extracted from wikidata
    Outputs :
        * sub_df : merged data frame 
    """
    merged_df = pd.merge(quotebank, wikidata, left_on = 'speaker', right_on = 'label')
    merged_df['qids'] = merged_df['qids'].apply(lambda x : x[0])
    merged_df = merged_df[merged_df['qids'] == merged_df['id']]
    return merged_df


def create_df(wikidata, start, stop):
    """
    From quotebank data, extracts and returns a data frame with only the rows where all the information is present.
    Inputs:
        * wikidata : wiki database
        * start : first year, for example 15 for 2015. Has to be between 15 and 21
        * stop : last year + 1, for example 16 if you only want 2015. Has to be between 15 and 21
    Outputs:
        * sub_df : data frame with only the rows that have all the information of interest (without 'None')
    """
    sub_df = pd.DataFrame()
    # Milestone 2: (20,21) = only for 2020 quotes; Milestone 3: (15,21) = quotes from 2015 to 2020
    for i in range(start, stop):
        DATA_FILE = 'quotes-20{}.json.bz2'.format(i)
        with pd.read_json(DATA_PATH + DATA_FILE, lines = True, compression ='bz2', chunksize = 100000) as df_reader:
            for chunk in df_reader:
                sub_df = pd.concat([sub_df, merging(clean_quotebank(chunk), wikidata)])
    return sub_df
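
For reference, the speaker filter applied by `clean_quotebank` can be pictured on plain Python records. The sketch below assumes the Quotebank layout in which `probas` is a list of `[speaker, probability]` pairs sorted by decreasing probability, with the probabilities stored as strings. The helper `keep_confident` and the sample records are invented for illustration.

```python
def keep_confident(records, threshold=0.5):
    """Keep records with a known speaker whose top probability exceeds threshold."""
    kept = []
    for rec in records:
        if rec['speaker'] in ('None', None):
            continue  # quote not attributed to anyone
        top_proba = float(rec['probas'][0][1])  # probability of the most likely speaker
        if top_proba > threshold:
            kept.append(rec)
    return kept

# Invented sample records, for illustration only
sample = [
    {'speaker': 'Jane Doe', 'probas': [['Jane Doe', '0.91'], ['None', '0.09']]},
    {'speaker': 'None', 'probas': [['None', '0.70'], ['John Roe', '0.30']]},
    {'speaker': 'John Roe', 'probas': [['John Roe', '0.45'], ['None', '0.40']]},
]
confident = keep_confident(sample)
print([rec['speaker'] for rec in confident])
```

Only the first record survives: the second has no attributed speaker and the third falls below the 0.5 threshold.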

In [12]:
# Each pattern is assembled from adjacent string literals so that no indentation
# whitespace and no empty alternative ends up inside the regular expression.
art_pattern = ('cineast|painter|musician|sculptor|architect|dancer|'
               'philosoph|writer|actor|actress|choreographer|music interpreter|singer|photographer|entertainer')
art_professions = qid_label[qid_label['Label'].str.contains(art_pattern, na=False)]

scientific_pattern = ('scientific|researcher|mathematician|doctor|'
                      'astronomist|biologist|chemist|physicist|physician|psychologist|engineer|anatomist|neurologist|'
                      'pediatrician|veterinarian|pharmacist|obstetrician|gynecologist|therapist|dentist|surgeon|nurse|'
                      'psychiatrist|Scientific')
scientific_professions = qid_label[qid_label['Label'].str.contains(scientific_pattern, na=False)]

# No trailing '|': an empty alternative would match every label
economic_pattern = ('economist|banker|financ|chairman|auditor|insurer|'
                    'CEO|chief executive officer|CTO|chief technology officer|CIO|chief investment officer|business manager|'
                    'stockbroker|retail merchandizer|pricing analyst|statistician|marketing consultant|sales consultant|actuary|'
                    'tax consultant|salesperson|risk analyst|data analyst|accountant|economic researcher|Investm|investor')
economic_professions = qid_label[qid_label['Label'].str.contains(economic_pattern, na=False)]

political_pattern = 'politician|president|minister|government accountant General'
political_professions = qid_label[qid_label['Label'].str.contains(political_pattern, na=False)]
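
Long alternation patterns like these are sensitive to stray characters: whitespace inside an alternative has to match literally, and an empty alternative (for example after a trailing `|`) matches every string. A minimal sketch, using only the standard `re` module and a hypothetical helper `build_pattern`, of assembling such a pattern from a list of keywords:

```python
import re

def build_pattern(keywords):
    """Join the non-empty, stripped keywords into one alternation pattern."""
    # Keywords are treated as regex fragments, so prefixes such as 'financ' still work
    cleaned = [kw.strip() for kw in keywords if kw.strip()]
    return '|'.join(cleaned)

keywords = ['economist', 'banker', ' salesperson', '']  # illustrative subset
pattern = build_pattern(keywords)

# An empty alternative matches any label, so a trailing '|' selects everything
matches_everything = re.search('economist|', 'painter') is not None
# The cleaned pattern only matches labels that contain one of the keywords
matches_painter = re.search(pattern, 'painter') is not None
matches_banker = re.search(pattern, 'investment banker') is not None
print(pattern, matches_everything, matches_painter, matches_banker)
```

The resulting string can be passed to `str.contains` in the same way as the patterns above.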

In [13]:
""" Only intended to be run once """
wiki_occupation = pd.read_pickle(DATA_PATH + "wiki_occupation.pck")
wiki_background = pd.read_pickle(DATA_PATH + "wiki_background.pck")

Second, we create the data frames for the year 2015. As noted above, this cell has already been run.

In [31]:
# Change the data (two function arguments)
""" Only intended to be run once """
df_politicians_2015 = create_df_with_conditions('occupation', political_professions.index, wiki_occupation, 'ispolitician', 15, 16)
df_artists_2015 = create_df_with_conditions('occupation', art_professions.index, wiki_occupation, 'isartist', 15, 16)
df_scientists_2015 = create_df_with_conditions('occupation', scientific_professions.index, wiki_occupation, 'isscientist', 15, 16)
df_economists_2015 = create_df_with_conditions('occupation', economic_professions.index, wiki_occupation, 'iseconomist', 15, 16)
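
The occupation flag behind these calls is meant to mark the speakers for whom at least one occupation QID belongs to the set of target QIDs. A minimal sketch of that membership test on plain Python data; `has_target_occupation` is a hypothetical helper, and the identifiers are placeholders rather than real Wikidata QIDs.

```python
def has_target_occupation(occupations, targets):
    """Return True if at least one occupation identifier is in targets."""
    if occupations is None:  # speaker without a recorded occupation
        return False
    return any(qid in targets for qid in occupations)

targets = {'QID_politician', 'QID_minister'}  # placeholder identifiers
speakers = {
    'speaker_a': ['QID_lawyer', 'QID_politician'],
    'speaker_b': ['QID_painter'],
    'speaker_c': None,
}
flags = {name: has_target_occupation(occ, targets) for name, occ in speakers.items()}
print(flags)
```

A speaker with several occupations is kept as soon as one of them is in the target set, whatever its position in the list.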

#### Creation of a general background sub-dataframe (2015)

We now create our data frame from the more detailed Wikidata subset, in which the occupation, religion, nationality, ethnic_group and academic_degree are known for every speaker.

In [32]:
# Change the data
""" Only intended to be run once """
df_without_conditions_2015 = create_df(wiki_background, 15, 16)

In [33]:
# Change the data
""" Only intended to be run once """
#2015
df_politicians_2015.to_pickle(DATA_PATH + "politicians_2015.pck")
df_artists_2015.to_pickle(DATA_PATH + "artists_2015.pck")
df_scientists_2015.to_pickle(DATA_PATH + "scientists_2015.pck")
df_economists_2015.to_pickle(DATA_PATH + "economists_2015.pck")

In [34]:
# Change the date
""" Only intended to be run once """
df_without_conditions_2015.to_pickle(DATA_PATH + "df_without_conditions_2015.pck")

 # Restart the kernel here to free up memory (re-run the imports and the saving function)

In [None]:
# Change the year
df_politicians_2015 = pd.read_pickle(DATA_PATH + "politicians_2015.pck")
df_artists_2015 = pd.read_pickle(DATA_PATH + "artists_2015.pck")
df_scientists_2015 = pd.read_pickle(DATA_PATH + "scientists_2015.pck")
df_economists_2015 = pd.read_pickle(DATA_PATH + "economists_2015.pck")
df_without_conditions_2015 = pd.read_pickle(DATA_PATH + "df_without_conditions_2015.pck")

In [None]:
df_politicians_2015['quote_language'] = df_politicians_2015['quotation'].apply(lambda x: detect(x) if len(x) > 50 else 'not_a_language')
df_politicians_2015 = df_politicians_2015[df_politicians_2015['quote_language'] == 'en']

df_artists_2015['quote_language'] = df_artists_2015['quotation'].apply(lambda x: detect(x) if len(x) > 50 else 'not_a_language')
df_artists_2015 = df_artists_2015[df_artists_2015['quote_language'] == 'en']

df_scientists_2015['quote_language'] = df_scientists_2015['quotation'].apply(lambda x: detect(x) if len(x) > 50 else 'not_a_language')
df_scientists_2015 = df_scientists_2015[df_scientists_2015['quote_language'] == 'en']

df_economists_2015['quote_language'] = df_economists_2015['quotation'].apply(lambda x: detect(x) if len(x) > 50 else 'not_a_language')
df_economists_2015 = df_economists_2015[df_economists_2015['quote_language'] == 'en']

df_without_conditions_2015['quote_language'] = df_without_conditions_2015['quotation'].apply(lambda x: detect(x) if len(x) > 50 else 'not_a_language')
df_without_conditions_2015 = df_without_conditions_2015[df_without_conditions_2015['quote_language'] == 'en']
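
The five filters above repeat the same rule: detect the language of quotes longer than 50 characters and label shorter ones `not_a_language`. A minimal sketch of that rule as one reusable function follows. `label_language` is a hypothetical helper, and the detector is passed in as an argument so that `langdetect.detect` or any stand-in can be supplied. Falling back to `not_a_language` when the detector raises is an added assumption, not the behaviour of the original cell.

```python
def label_language(quote, detector, min_length=50):
    """Return the detected language code, or 'not_a_language' for short or undetectable quotes."""
    if len(quote) <= min_length:
        return 'not_a_language'
    try:
        return detector(quote)
    except Exception:
        # Assumption: treat detection failures like quotes that are too short
        return 'not_a_language'

def fake_detector(text):
    """Stand-in for langdetect.detect, used only to keep this sketch self-contained."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError('no features in text')
    return 'en'

long_quote = 'We are not sure yet whether the new policy will have the expected effect.'
short_label = label_language('Too short.', fake_detector)
long_label = label_language(long_quote, fake_detector)
digits_label = label_language('1234567890 ' * 6, fake_detector)
print(short_label, long_label, digits_label)
```

With the real detector, the same rule could be applied to a data frame through `df['quotation'].apply(lambda q: label_language(q, detect))`.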

In [None]:
# Change the data
""" Only intended to be run once """
#2015
df_politicians_2015.to_pickle(DATA_PATH + "politicians_2015.pck")
df_artists_2015.to_pickle(DATA_PATH + "artists_2015.pck")
df_scientists_2015.to_pickle(DATA_PATH + "scientists_2015.pck")
df_economists_2015.to_pickle(DATA_PATH + "economists_2015.pck")
df_without_conditions_2015.to_pickle(DATA_PATH + "df_without_conditions_2015.pck")

In [35]:
PATH_TXT = 'txt_files/'

In [36]:
def quotes_to_txt(file_name, df):
    """
    Writes the quotes of a data frame to a text file,
    one quote per line, prefixed by its row position.
    Inputs:
        * file_name : name of the text file
        * df : dataframe to convert
    """
    quotes = df.quotation.astype(str)
    # The with-statement closes the file automatically
    with open(file_name, "w", encoding = "utf-8") as f:
        for ind, quote in enumerate(quotes):
            f.write(str(ind) + " " + quote + "\n")
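
Each line of the resulting file starts with the row position, followed by a space and the quote. As a minimal sketch, such a file can be read back into (index, quote) pairs with the standard library. `read_quotes_txt` is a hypothetical helper, the sample quotes are invented, and the sketch assumes that no quote contains a newline character.

```python
import os
import tempfile

def read_quotes_txt(file_name):
    """Read 'index quote' lines back into a list of (index, quote) tuples."""
    pairs = []
    with open(file_name, encoding='utf-8') as f:
        for line in f:
            # Split on the first space only: the index comes first, the rest is the quote
            ind, quote = line.rstrip('\n').split(' ', 1)
            pairs.append((int(ind), quote))
    return pairs

# Write a small sample file using the same line format as quotes_to_txt
quotes = ['I think it might work.', 'This will certainly happen.']
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    for ind, quote in enumerate(quotes):
        f.write(str(ind) + ' ' + quote + '\n')

pairs = read_quotes_txt(path)
print(pairs)
```

The index is the position of the quote in the data frame at the time of writing, so it can be used to line the classifier output back up with the rows.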

### 3.2 Creation of the text files <a class = anchor id="3.2"></a>

In [39]:
# Change the date
""" Only intended to be run once """
quotes_to_txt(PATH_TXT + "politicians_2015.txt", df_politicians_2015)
quotes_to_txt(PATH_TXT + "artists_2015.txt", df_artists_2015)
quotes_to_txt(PATH_TXT + "scientists_2015.txt", df_scientists_2015)
quotes_to_txt(PATH_TXT + "economists_2015.txt", df_economists_2015)

#### Sub data frame without conditions on professions

In [40]:
# Change the date
""" Only intended to be run once """
quotes_to_txt(PATH_TXT + "df_without_conditions_2015.txt", df_without_conditions_2015)