## Milestone 2: Hand in 


In this notebook we start the project on **the representation of women in media**.

This notebook includes:
- the data cleaning process
- the merging with additional datasets
- the basic data analysis including plots

*Note*: To limit computation, we only kept the quotations from 2019. A time-dependent analysis would require including the quotes of every year.


Let us remind ourselves of some of the **research questions** we wish to answer:

- Is the representation **equal between males and females** in media?
- How does the distribution of quotes based on gender vary across countries and across domains?
- How does the distribution of quotes between genders evolve in time, geographically and thematically?
- Is there a tendency in each category for males to have **longer quotes** than females?
- Are males more likely to be **quoted in highly respected media**?
- Are there any **blind spots** in media where females are especially neglected?
- Is there a difference in how females/males at a certain **age** are quoted?
- Are countries known to promote gender equality more likely to reflect this in media compared to the rest of the world?


Extra ideas (draft, Selene):
- Can we identify quoted speakers who triggered a change in trend? For example, a woman is quoted once on topic A, and afterwards we see an increase in women quoted on topic A. Are there such turning points in media?
  - Comment (Lavi): Not sure about this, we could discuss.



#### **Data cleaning**

# Import packages

### LAVI

In [2]:
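# Libraries to import

from journal_API_wikidata import *
import pandas as pd
import numpy as np
# Plotting libraries used by the plot_* functions below
import matplotlib.pyplot as plt
import seaborn as sns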


# Cleaning process
### See notebook `milestone_2_BLABLA.ipynb`

# Load cleaned data

### LAVI (cleaning process in another notebook)
with the wikidata_utils process

Check for inconsistency

#### **Addition of datasets**
The following function adds the country of origin of the newspapers in which the quotes were found. This will be used to analyse geographic tendencies in how different genders are quoted across the world.

- Media countries (Lavi)
- Countries to continent

# Analysis

## Number of quotes per gender

#### In absolute value

In [None]:
def plot_quotes_number_abs(df):
    f = plt.figure(figsize=(12,6))
    ax = sns.countplot(data=df, x='gender')
    plt.xlabel('Gender')
    plt.ylabel('Number of quotes')
    plt.title('Number of quotes depending on gender')

In [None]:
plot_quotes_number_abs(df_2016)

#### In relative value
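This section is still empty. One possible way to get relative values is to normalise the gender counts so they sum to 1. The sketch below assumes only the `gender` column already used above; the helper name and toy data are made up for illustration:

```python
import pandas as pd

def quotes_share_per_gender(df):
    """Share of quotes per gender, as fractions summing to 1."""
    return df['gender'].value_counts(normalize=True)

# Toy data for illustration only
toy = pd.DataFrame({'gender': ['male', 'male', 'male', 'female']})
shares = quotes_share_per_gender(toy)
```

The resulting Series can be passed to a bar plot in the same way as the absolute counts.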

##### Statistical tests over the years

## Age

In [None]:
def dateofbirth_to_timestamp(df):
    df['date_of_birth'] = extract_element_from_series(df['date_of_birth'])
    # Strip the '+' and 'Z' characters so the string matches the format below
    df['date_of_birth'] = df['date_of_birth'].replace(to_replace=r'[+Z]', value='', regex=True)
    df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], format='%Y-%m-%dT%H:%M:%S', errors='coerce')
    return df

In [None]:
def compute_age(df):
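    # Age in completed years as of today (not as of the quotation date)
    now = pd.Timestamp.now()
    birth = df['date_of_birth']
    birthday_pending = (birth.dt.month > now.month) | ((birth.dt.month == now.month) & (birth.dt.day > now.day))
    df['age'] = now.year - birth.dt.year - birthday_pending.astype(int)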
    return df

In [None]:
def compute_age_range(df):
    bins = np.arange(0,110,step=10)
    df["age_range"] = pd.cut(df["age"], bins)
    return df

In [None]:
df_2016 = dateofbirth_to_timestamp(df_2016)
df_2016 = compute_age(df_2016)
df_2016 = compute_age_range(df_2016)

In [None]:
def plot_quotes_age(df, age_threshold):
    f = plt.figure(figsize=(16,6))
    ax = sns.countplot(data=df[df["age"]<age_threshold], x='age_range', hue ='gender')
    plt.xlabel('Age intervals')
    plt.ylabel('Number of quotes')
    plt.title('Number of quotes depending on age and gender')
    #labels = ['(0,10]','(10,20]','(20,30]','(30,40]','(40,50]','(50,60]','(60,70]','(70,80]','(80,90]','(90,100]']
    #ax.set_xticklabels(labels)

In [None]:
plot_quotes_age(df_2016, 100)

## Countries

In [None]:
def plot_quotes_country(df, threshold_nber):
    df_citizenship_count = df.groupby(['gender', 'citizenship'])['quoteID'].count().to_frame(name='count').reset_index()
    df_citizenship_count = df_citizenship_count[df_citizenship_count['count']>threshold_nber]
    
    f = plt.figure(figsize=(18,6))
    ax = sns.barplot(data=df_citizenship_count, x='citizenship',y='count', hue='gender')
    plt.xlabel('Citizenship')
    plt.ylabel('Number of quotes')
    plt.title('Number of quotes above '+ str(threshold_nber) +' depending on gender and citizenship')
    locs, labels = plt.xticks()
    plt.setp(labels, rotation=90)
    ax.set_yscale('log')
    return ax

In [None]:
ax = plot_quotes_country(df_2016, threshold_nber = 300)

## Continents

In [None]:
def add_continent(df, countries_to_continent):
    df = pd.merge(df, countries_to_continent, left_on='citizenship', right_on='Country', copy=False)
    df = df.drop('Country', axis=1)
    return df

In [None]:
df_2016 = add_continent(df_2016, countries_to_continent)
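Note that `pd.merge` performs an inner join by default, so `add_continent` drops every quote whose citizenship has no row in `countries_to_continent`. A small sketch of how the loss could be checked beforehand, using the same column names as above; the helper name and the toy frames are illustrative assumptions:

```python
import pandas as pd

def unmatched_citizenships(df, countries_to_continent):
    """Count quotes whose citizenship has no row in the mapping."""
    merged = pd.merge(
        df, countries_to_continent,
        left_on='citizenship', right_on='Country',
        how='left', indicator=True,  # adds a '_merge' column
    )
    missing = merged.loc[merged['_merge'] == 'left_only', 'citizenship']
    return missing.value_counts(dropna=False)

# Toy data for illustration only
toy_quotes = pd.DataFrame({'citizenship': ['France', 'France', 'Atlantis']})
toy_mapping = pd.DataFrame({'Country': ['France'], 'Continent': ['Europe']})
lost = unmatched_citizenships(toy_quotes, toy_mapping)
```

If the returned Series is not empty, the listed citizenships either need an entry in the mapping or should be reported as excluded from the continent plot.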

In [None]:
def plot_quotes_continent(df):
    f = plt.figure(figsize=(12,6))
    ax = sns.countplot(data=df, x='Continent', hue='gender')
    plt.xlabel('Continent')
    plt.ylabel('Number of quotes')
    plt.title('Number of quotes depending on gender and continent')

In [None]:
plot_quotes_continent(df_2016)

## Number of quotes per journal

In [None]:
df_2016['sitenames'] = extract_element_from_series(df_2016['sitenames'])

In [None]:
def plot_quotes_journals(df):
    f = plt.figure(figsize=(14,6))
    ax = sns.countplot(data=df, x='sitenames', hue='gender', order=df['sitenames'].value_counts().index)
    plt.xlabel('Media')
    plt.ylabel('Number of quotes')
    plt.title('Number of quotes depending on gender and media')
    locs, labels = plt.xticks()
    plt.setp(labels, rotation=45,  horizontalalignment='right')

In [None]:
plot_quotes_journals(df_2016)

## Number of quotes per category 

In [None]:
def transform_tags(df):
    col_tags = []
    # Iterate over the values directly so the result does not depend on the index labels
    for array in df['tags']:
        tags = [var for var in array if var]
        if tags : 
            tags = tags[0][0]
        else :
            tags = 'undefined'
        col_tags.append(tags)
    return col_tags

In [None]:
df_2016['tags'] = transform_tags(df_2016)

In [None]:
def plot_quotes_categories(df):
    f = plt.figure(figsize=(14,6))
    ax = sns.countplot(data=df, x='tags', hue='gender', order=df['tags'].value_counts().index)
    plt.xlabel('Category')
    plt.ylabel('Number of quotes')
    plt.title('Number of quotes depending on gender and category')
    locs, labels = plt.xticks()
    plt.setp(labels, rotation=45,  horizontalalignment='right')

In [None]:
plot_quotes_categories(df_2016)

## Length of quotes
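This section is still to be written. As a starting point for the research question on quote length, the sketch below summarises the number of words per quote for each gender. It assumes the quote text lives in a column named `quotation`, which is an assumption about the dataset; the helper name and toy data are illustrative:

```python
import pandas as pd

def quote_length_by_gender(df, text_col='quotation'):
    """Summary of quote length in words, per gender."""
    # Split on whitespace and count the resulting tokens
    n_words = df[text_col].str.split().str.len()
    return n_words.groupby(df['gender']).agg(['count', 'mean', 'median'])

# Toy data for illustration only
toy = pd.DataFrame({
    'gender': ['male', 'male', 'female'],
    'quotation': ['one two three', 'one', 'one two three four'],
})
length_summary = quote_length_by_gender(toy)
```

Reporting the median next to the mean is a deliberate choice here, since quote lengths are likely to be skewed by a few very long quotes.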

## Quotes per journal

#### **Basic data analysis**

NOTE FOR US: check the README for the plots and the remaining tasks.

*Note*: For this hand-in we will not look into questions computed over the years, since we only focus on a subset of the whole data.

Comparison of the number of male versus female quotes (use plotting functions so that they can be rerun for each year)
Lisa : 
- Overall difference in count of males versus females, over all years.
- Overall count of male versus female quotes, per year.
- Overall count of male versus female quotes, per category.
- Overall count of male versus female quotes, per year and per category.
- Overall count of male versus female quotes, per country/geographical location.
- Overall count of male versus female quotes, per year and per country/geographical location.
Arthur : 
- Compare length of quotes 

Tests for statistical significance
Arthur : 
- Perform statistical tests to see if the difference between counts of males and females per year is statistically significant.
- Perform statistical tests to see if the difference per year and per category is statistically significant.
- Perform statistical tests to see if the difference per year and per location is statistically significant.
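
The tests are not chosen yet. As one possible starting point, the sketch below runs a two-sided one-sample z-test of whether the share of female quotes differs from 0.5, using only the standard library. The function name and the counts are made up for illustration. The test also treats quotes as independent, which is a simplification since one speaker can be quoted many times:

```python
import math

def proportion_z_test(n_female, n_total, p0=0.5):
    """Two-sided z-test of H0: share of female quotes == p0.

    Uses the normal approximation, reasonable for large counts.
    Returns (z statistic, p-value).
    """
    p_hat = n_female / n_total
    std_err = math.sqrt(p0 * (1 - p0) / n_total)
    z = (p_hat - p0) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts for illustration only
z, p_value = proportion_z_test(n_female=400, n_total=1000)
```

The same function could be applied to the counts of each year, category or location, keeping in mind that running many tests calls for a multiple-comparison correction.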
