# Research question 4

Does our per-country sentiment analysis of opinions on gender equality match related indexes for the year 2019 (latest report)? Examples of such indexes are the European Institute for Gender Equality index (https://eige.europa.eu/gender-equality-index/2021) and the Gender Inequality Index from the United Nations Development Programme Human Development Report (http://hdr.undp.org/en/content/gender-inequality-index-gii).


## Data loading 

### Merging dataframes to have sentiment analysis on 'feminism' dataset

In [136]:
import pandas as pd
import numpy as np

In [137]:
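# Quotes with sentiment scores, and the subset of quotes related to feminism
df_filtered = pd.read_pickle('CleanDF.pkl')
df_feminism = pd.read_json('feminism_part.json')
# nationality and gender are stored as lists; keep only the first entry of each
df_feminism["nationality"] = [elem[0] for elem in df_feminism["nationality"]]
df_feminism["gender"] = [elem[0] for elem in df_feminism["gender"]]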

In [138]:
df_filtered.head(5)

Date,date_of_birth,nationality,gender,occupation,Speaker,Quote,numOccurrences,quote_year,quote_month,Sentiment,Sentiment Label
2015-11-01,1971,United States of America,female,[LGBTIQ+ rights activist],sarah kate ellis,This loss is a wake-up call that despite remar...,2,2015,11,-0.876,Negative
2015-06-01,1971,United States of America,female,[LGBTIQ+ rights activist],sarah kate ellis,"She didn't see it, she hadn't heard of it, she...",1,2015,6,0.0387,Positive
2015-04-01,1971,United States of America,female,[LGBTIQ+ rights activist],sarah kate ellis,As a journalist and anchor who reaches million...,1,2015,4,0.4939,Positive
2015-02-01,1971,United States of America,female,[LGBTIQ+ rights activist],sarah kate ellis,By empowering people to talk about their gende...,6,2015,2,0.7003,Positive
2015-01-01,1971,United States of America,female,[LGBTIQ+ rights activist],sarah kate ellis,"By investing in this dangerous programming, TL...",133,2015,1,-0.561,Negative


In [142]:
df_filtered.shape

(114746, 11)

In [143]:
df_feminism.head(5)

Unnamed: 0,index,date_of_birth,nationality,gender,occupation,Speaker,Quote,numOccurrences,quote_year,quote_month
0,1137,1973,United States of America,male,[film producer],chad griffin,This kind of violence is often motivated by an...,3,2015,11
1,1233,1973,United States of America,male,[film producer],chad griffin,Transgender women of color are facing an epide...,2,2015,11
2,1264,1973,United States of America,male,[film producer],chad griffin,At a time when transgender people are finally ...,1,2015,11
3,1363,1973,United States of America,male,[film producer],chad griffin,Each of these women died simply for being them...,2,2016,2
4,1566,1973,United States of America,male,[film producer],chad griffin,It is crucial that we know these stories in or...,2,2017,11


In [144]:
df_feminism.shape

(13556, 10)

In [145]:
df_feminism = df_feminism[df_feminism["quote_year"] == 2019]

In [146]:
df_feminism = df_feminism.drop(['index', 'date_of_birth', 'gender', 'occupation', 'numOccurrences', 'quote_year', 'quote_month'], axis=1)


In [147]:
df_filtered = df_filtered.drop(['date_of_birth', 'gender', 'occupation', 'numOccurrences', 'quote_year', 'quote_month', 'Sentiment Label'], axis=1)

In [148]:
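# Right merge on the quote text: keep every 2019 feminism quote and attach its sentiment score
new_df = df_filtered.merge(df_feminism, on=["Quote"], how="right")
# Drop the duplicated left-hand columns and the quote text, which is no longer needed
new_df = new_df.drop(["nationality_x", "Speaker_x", "Quote"], axis=1)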

In [149]:
new_df.shape

(2473, 3)
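The right merge on `Quote` above can silently duplicate rows if a quote appears several times in the scored dataframe, and it leaves `NaN` sentiments for quotes without a match. A minimal sketch of how this could be checked is given below. The toy dataframes `scored` and `feminism` are hypothetical stand-ins for `df_filtered` and `df_feminism`, and deduplicating on `Quote` is an assumption, not a step of the original analysis.

```python
import pandas as pd

# Hypothetical toy data standing in for df_filtered and df_feminism
scored = pd.DataFrame({
    "Quote": ["a", "b", "b", "c"],
    "Sentiment": [0.1, -0.2, -0.2, 0.5],
})
feminism = pd.DataFrame({
    "Quote": ["b", "c", "d"],
    "nationality": ["India", "Canada", "Australia"],
})

# Assumption: one sentiment score per distinct quote is enough
scored_unique = scored.drop_duplicates(subset="Quote")

# indicator=True adds a "_merge" column showing where each row came from;
# validate raises a MergeError if the left keys are not unique
merged = scored_unique.merge(
    feminism, on="Quote", how="right", indicator=True, validate="one_to_many"
)

# Quotes present only on the right side have no sentiment score
n_unmatched = int((merged["_merge"] == "right_only").sum())
print(len(merged), n_unmatched)
```

On the real data, comparing `len(merged)` with the number of 2019 rows in `df_feminism` would show whether the merge added or lost rows.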

In [150]:
new_df.head(5)

Unnamed: 0,Sentiment,nationality_y,Speaker_y
0,-0.5994,United States of America,chad griffin
1,-0.6684,United States of America,hillary clinton
2,-0.9485,United States of America,hillary clinton
3,-0.296,United States of America,hillary clinton
4,-0.0754,United States of America,hillary clinton


`new_df` is the dataframe containing all 2019 quotes related to feminism, with the speaker's nationality and the sentiment score of each quote. Let's compute the mean score per country, rank the countries, and compare the ranking to the indexes.

### Aggregating sentiment by country and rank

In [151]:
list_countries = ['United States of America', 'United Kingdom', 'Australia','Canada','India']
new_df = new_df[new_df.nationality_y.isin(list_countries)]
new_df.shape

(2139, 3)

In [152]:
new_df.nationality_y.value_counts()

United States of America    1398
United Kingdom               378
India                        154
Canada                       110
Australia                     99
Name: nationality_y, dtype: int64

We will consider these 5 countries for the analysis, as they account for most of the dataset (2139 of 2473 quotes). The other nationalities have too few quotes to be meaningful.
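The country list above is hard-coded. As an alternative, the selection could be derived from the quote counts with an explicit threshold. The sketch below runs on hypothetical toy data, and `MIN_QUOTES` is an illustrative cutoff, not a value used in the original analysis. It also reports the standard deviation and the count next to the mean, which helps judge how reliable each country's mean is.

```python
import pandas as pd

# Hypothetical toy data with the same column names as new_df
toy = pd.DataFrame({
    "nationality_y": ["India"] * 4 + ["Canada"] * 3 + ["France"] * 1,
    "Sentiment": [0.2, 0.1, 0.0, 0.3, -0.1, -0.2, 0.0, 0.9],
})

MIN_QUOTES = 3  # illustrative threshold (assumption)

# Keep only nationalities with at least MIN_QUOTES quotes
counts = toy["nationality_y"].value_counts()
kept_countries = counts[counts >= MIN_QUOTES].index.tolist()
subset = toy[toy["nationality_y"].isin(kept_countries)]

# Mean sentiment per country, with spread and sample size
summary = subset.groupby("nationality_y")["Sentiment"].agg(["mean", "std", "count"])
print(summary.sort_values("mean"))
```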

In [153]:
new_df.groupby("nationality_y")['Sentiment'].agg('mean').sort_values()

nationality_y
Canada                     -0.124876
Australia                  -0.066536
United States of America   -0.064667
United Kingdom             -0.054222
India                       0.118381
Name: Sentiment, dtype: float64

Ranked from most positive to most negative mean sentiment, the countries are:

India > UK > USA > Australia > Canada

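The index ranks for these countries are listed below (a lower rank means more gender equality). They appear to come from the UNDP Gender Inequality Index (http://hdr.undp.org/en/content/gender-inequality-index-gii) rather than from https://eige.europa.eu/gender-equality-index/2021, since the EIGE index only covers EU member states: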

- Canada : 19
- Australia : 25 
- UK : 31 
- USA : 46 
- India : 123


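The index ranking does not match the sentiment ranking for the year 2019: India has the most positive mean sentiment but by far the worst index rank, while Canada has the most negative mean sentiment and the best index rank. We therefore chose not to include this analysis in our data story.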
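One way to put a number on this comparison is a Spearman rank correlation between the mean sentiments and the index ranks. The sketch below hard-codes the five means and the five ranks reported above. Computing Spearman's rho as the Pearson correlation of the ranks is our own choice here, not part of the original analysis, and with only five countries the value is purely descriptive.

```python
import pandas as pd

# Mean sentiment per country, copied from the groupby output above
mean_sentiment = pd.Series({
    "Canada": -0.124876,
    "Australia": -0.066536,
    "United States of America": -0.064667,
    "United Kingdom": -0.054222,
    "India": 0.118381,
})

# Index ranks listed above (lower rank = more gender equality)
index_rank = pd.Series({
    "Canada": 19,
    "Australia": 25,
    "United Kingdom": 31,
    "United States of America": 46,
    "India": 123,
})

# Spearman's rho is the Pearson correlation of the two rank vectors;
# Series.corr aligns the two series on their country labels
rho = mean_sentiment.rank().corr(index_rank.rank())
print(f"Spearman rho = {rho:.2f}")
```

A positive rho here means that a more positive mean sentiment goes with a numerically larger, that is worse, index rank, which is the opposite of what a match between sentiment and gender equality would look like.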