## Regression analysis on nationalities and genders

In the following analysis, the overall sentiment score on gender norms is regressed on each speaker's gender and nationality, restricted to the six most frequent nationalities.


In [43]:
import seaborn as sns 
from matplotlib import pyplot as plt 
import pandas as pd
import numpy as np
import nltk 
from collections import Counter
import statsmodels.formula.api as smf


In [38]:
df = pd.read_pickle('CleanDF.pkl')

In [39]:
list_countries = ['United States of America', 'United Kingdom', 'Australia','Canada','India','Ireland']
df = df[df.nationality.isin(list_countries)]
df.head(5)

Date,date_of_birth,nationality,gender,occupation,Speaker,Quote,numOccurrences,quote_year,quote_month,Sentiment,Sentiment Label
2015-11-01,1971,United States of America,female,[LGBTIQ+ rights activist],sarah kate ellis,This loss is a wake-up call that despite remar...,2,2015,11,-0.876,Negative
2015-06-01,1971,United States of America,female,[LGBTIQ+ rights activist],sarah kate ellis,"She didn't see it, she hadn't heard of it, she...",1,2015,6,0.0387,Positive
2015-04-01,1971,United States of America,female,[LGBTIQ+ rights activist],sarah kate ellis,As a journalist and anchor who reaches million...,1,2015,4,0.4939,Positive
2015-02-01,1971,United States of America,female,[LGBTIQ+ rights activist],sarah kate ellis,By empowering people to talk about their gende...,6,2015,2,0.7003,Positive
2015-01-01,1971,United States of America,female,[LGBTIQ+ rights activist],sarah kate ellis,"By investing in this dangerous programming, TL...",133,2015,1,-0.561,Negative
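The country list above is written out by hand. As a minimal sketch, assuming the nationality column holds one plain string per row, such a list could also be derived from frequency counts. The toy DataFrame, the single-letter country names, and `top_n` are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Toy data standing in for the notebook's DataFrame (hypothetical values).
toy = pd.DataFrame({"nationality": ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]})

top_n = 2  # the notebook keeps six nationalities; two are enough for the toy data

# value_counts() sorts by descending frequency, so head(top_n) keeps the most common labels.
top_countries = toy["nationality"].value_counts().head(top_n).index.tolist()

# Keep only the rows whose nationality is in the selected list.
filtered = toy[toy["nationality"].isin(top_countries)]

print(top_countries)
print(len(filtered))
```

Deriving the list this way keeps the filter in step with the data if the underlying file changes.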


In [40]:
df = df.drop(columns = ['occupation', 'Speaker','Quote', 'numOccurrences','quote_month', 'Sentiment Label', 'quote_year'])
df.head(5)

Date,date_of_birth,nationality,gender,Sentiment
2015-11-01,1971,United States of America,female,-0.876
2015-06-01,1971,United States of America,female,0.0387
2015-04-01,1971,United States of America,female,0.4939
2015-02-01,1971,United States of America,female,0.7003
2015-01-01,1971,United States of America,female,-0.561


In [41]:
df.loc[~df.gender.isin(["male", "female"]), "gender"] = "Other"

In [42]:
df = pd.concat([df, pd.get_dummies(df.gender).add_suffix("_onehot")], axis=1)  # one hot encoding
df = pd.concat([df, pd.get_dummies(df.nationality).add_suffix("_onehot")], axis=1)  # one hot encoding
df.head(5)

Date,date_of_birth,nationality,gender,Sentiment,Other_onehot,female_onehot,male_onehot,Australia_onehot,Canada_onehot,India_onehot,Ireland_onehot,United Kingdom_onehot,United States of America_onehot
2015-11-01,1971,United States of America,female,-0.876,0,1,0,0,0,0,0,0,1
2015-06-01,1971,United States of America,female,0.0387,0,1,0,0,0,0,0,0,1
2015-04-01,1971,United States of America,female,0.4939,0,1,0,0,0,0,0,0,1
2015-02-01,1971,United States of America,female,0.7003,0,1,0,0,0,0,0,0,1
2015-01-01,1971,United States of America,female,-0.561,0,1,0,0,0,0,0,0,1
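The one-hot columns of a single variable always sum to 1 in every row, which is why one level per variable has to be left out of a regression that also has an intercept. The following is a minimal sketch on a made-up toy frame, not the notebook's data. Note that recent pandas versions return boolean dummy columns by default, so `dtype=int` is passed here to get the 0/1 values shown in the output above.

```python
import pandas as pd

# Toy gender column (hypothetical values mirroring the three categories used above).
toy = pd.DataFrame({"gender": ["male", "female", "Other", "female"]})

# One column per category; dtype=int gives 0/1 instead of True/False.
full = pd.get_dummies(toy["gender"], dtype=int).add_suffix("_onehot")

# Each row belongs to exactly one category, so the dummies sum to 1 in every row.
row_sums = full.sum(axis=1)

# Dropping one column removes that redundancy and makes it the reference level.
reduced = full.drop(columns=["male_onehot"])

print(full)
print(reduced.columns.tolist())
```

Dropping `male_onehot` here mirrors the choice of male speakers as the reference category in the regression formula.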


In [54]:
df = df.rename(columns={"United Kingdom_onehot": "United_Kingdom_onehot"})

In [58]:
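# Collect the dummy columns created above (names ending in "_onehot").
onehot_cols = [col for col in df.columns if col.endswith("_onehot")]
features = onehot_cols + ["date_of_birth"]
y = df.Sentiment.to_numpy()

# male_onehot and the United States dummy are left out of the formula,
# so male speakers from the United States form the reference group.
mod = smf.ols(formula='Sentiment ~ female_onehot + Other_onehot + Australia_onehot + Canada_onehot + India_onehot + Ireland_onehot + United_Kingdom_onehot',
              data=df)

res = mod.fit()
print(res.summary())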

                            OLS Regression Results                            
Dep. Variable:              Sentiment   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     13.80
Date:                Sun, 12 Dec 2021   Prob (F-statistic):           5.61e-18
Time:                        19:29:44   Log-Likelihood:                -79746.
No. Observations:              100830   AIC:                         1.595e+05
Df Residuals:                  100822   BIC:                         1.596e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 0.11

## Analysis

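The reference group for the regression consists of male speakers from the United States of America, because the dummies for male speakers and for the United States are left out of the formula.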

- All the coefficients that are significant are positive, which indicates a higher sentiment score than in the reference group.
- The intercept is the estimated mean sentiment score for the reference group (American male speakers), approximately 0.12.
- All other nationalities and genders have a significantly higher sentiment score on this topic than American male speakers.
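To illustrate how the intercept and coefficients relate to the reference group, here is a minimal sketch on made-up numbers. It uses plain NumPy least squares rather than the `statsmodels` model fitted above, and the group names `ref`, `g1` and `g2` are hypothetical. With a single categorical variable, the intercept equals the reference-group mean exactly and each coefficient is the difference between a group's mean and that reference mean. In the notebook's additive model with two variables, the intercept is instead the fitted value for the reference combination.

```python
import numpy as np

# Made-up sentiment scores for three hypothetical groups.
scores = {"ref": [0.1, 0.3], "g1": [0.4, 0.6], "g2": [0.0, 0.2]}
y = np.array(scores["ref"] + scores["g1"] + scores["g2"])

# Design matrix: intercept column, dummy for g1, dummy for g2.
# The reference group has no dummy of its own.
X = np.array(
    [
        [1, 0, 0],
        [1, 0, 0],
        [1, 1, 0],
        [1, 1, 0],
        [1, 0, 1],
        [1, 0, 1],
    ],
    dtype=float,
)

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b_g1, b_g2 = coef

print(intercept)  # mean of the reference group
print(b_g1)       # mean(g1) - mean(ref)
print(b_g2)       # mean(g2) - mean(ref)
```

A positive coefficient therefore means the group scores higher on average than the reference group, and a negative one means it scores lower.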