### Data Dictionary

For this set of exercises, we examine the data from a 2014 PNAS paper that analyzed success rates from funding agencies in the Netherlands and concluded:

"our results reveal gender bias favoring male applicants over female applicants in the prioritization of their 'quality of researcher' (but not 'quality of proposal') evaluations and success rates, as well as in the language used in instructional and evaluation materials."

A response was published a few months later, titled "No evidence that gender contributes to personal research funding success in The Netherlands: A reaction to Van der Lee and Ellemers", which concluded:

"However, the overall gender effect borders on statistical significance, despite the large sample. Moreover, their conclusion could be a prime example of Simpson’s paradox; if a higher percentage of women apply for grants in more competitive scientific disciplines (i.e., with low application success rates for both men and women), then an analysis across all disciplines could incorrectly show 'evidence' of gender inequality."

Who is right here: the original paper or the response? Here, you will examine the data and come to your own conclusion.

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

%matplotlib inline
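# sns.set() resets the style to its default, so pass the style and the
# font scale in a single call instead of calling set_style() first.
sns.set(style='dark', font_scale=1.5)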

import warnings
warnings.filterwarnings('ignore')

pd.options.display.max_columns = None
#pd.options.display.max_rows = None

### Data Exploration

In [2]:
df = pd.read_csv("research.csv")

In [3]:
df

,discipline,applications_total,applications_men,applications_women,awards_total,awards_men,awards_women,success_rates_total,success_rates_men,success_rates_women
0,Chemical sciences,122,83,39,32,22,10,26.2,26.5,25.6
1,Physical sciences,174,135,39,35,26,9,20.1,19.3,23.1
2,Physics,76,67,9,20,18,2,26.3,26.9,22.2
3,Humanities,396,230,166,65,33,32,16.4,14.3,19.3
4,Technical sciences,251,189,62,43,30,13,17.1,15.9,21.0
5,Interdisciplinary,183,105,78,29,12,17,15.8,11.4,21.8
6,Earth/life sciences,282,156,126,56,38,18,19.9,24.4,14.3
7,Social sciences,834,425,409,112,65,47,13.4,15.3,11.5
8,Medical sciences,505,245,260,75,46,29,14.9,18.8,11.2


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   discipline           9 non-null      object 
 1   applications_total   9 non-null      int64  
 2   applications_men     9 non-null      int64  
 3   applications_women   9 non-null      int64  
 4   awards_total         9 non-null      int64  
 5   awards_men           9 non-null      int64  
 6   awards_women         9 non-null      int64  
 7   success_rates_total  9 non-null      float64
 8   success_rates_men    9 non-null      float64
 9   success_rates_women  9 non-null      float64
dtypes: float64(3), int64(6), object(1)
memory usage: 848.0+ bytes


In [5]:
df.describe(include='all')

,discipline,applications_total,applications_men,applications_women,awards_total,awards_men,awards_women,success_rates_total,success_rates_men,success_rates_women
count,9,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0
unique,9,,,,,,,,,
top,Social sciences,,,,,,,,,
freq,1,,,,,,,,,
mean,,313.666667,181.666667,132.0,51.888889,32.222222,19.666667,18.9,19.2,18.888889
std,,236.871801,110.232708,129.68616,28.802971,16.037283,13.96424,4.688283,5.598437,5.263185
min,,76.0,67.0,9.0,20.0,12.0,2.0,13.4,11.4,11.2
25%,,174.0,105.0,39.0,32.0,22.0,10.0,15.8,15.3,14.3
50%,,251.0,156.0,78.0,43.0,30.0,17.0,17.1,18.8,21.0
75%,,396.0,230.0,166.0,65.0,38.0,29.0,20.1,24.4,22.2
max,,834.0,425.0,409.0,112.0,65.0,47.0,26.3,26.9,25.6


In [6]:
df.shape

(9, 10)

In [7]:
df.columns

Index(['discipline', 'applications_total', 'applications_men',
       'applications_women', 'awards_total', 'awards_men', 'awards_women',
       'success_rates_total', 'success_rates_men', 'success_rates_women'],
      dtype='object')

In [8]:
######################### QUESTION 1 a ###########################
# Construct a table of gender (men/women) by award status (awarded/not)
#  using the total numbers across all disciplines.

In [9]:
df.groupby(by='awards_total').sum()

awards_total,applications_total,applications_men,applications_women,awards_men,awards_women,success_rates_total,success_rates_men,success_rates_women
20,76,67,9,18,2,26.3,26.9,22.2
29,183,105,78,12,17,15.8,11.4,21.8
32,122,83,39,22,10,26.2,26.5,25.6
35,174,135,39,26,9,20.1,19.3,23.1
43,251,189,62,30,13,17.1,15.9,21.0
56,282,156,126,38,18,19.9,24.4,14.3
65,396,230,166,33,32,16.4,14.3,19.3
75,505,245,260,46,29,14.9,18.8,11.2
112,834,425,409,65,47,13.4,15.3,11.5
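
Grouping by `awards_total`, as in the cell above, keys on a column whose values are unique per discipline, so it only reorders the rows and pools nothing. A minimal sketch of the two-by-two table sums the per-gender counts over all disciplines first. The frame name `counts` and the column labels `awarded` and `not_awarded` are illustrative choices, not part of the original exercise; the numbers are copied from the table printed earlier so that the snippet runs on its own.

```python
import pandas as pd

# Per-discipline counts copied from the table printed earlier in the notebook.
# `counts` is a stand-in name; inside the notebook, `df` holds the same columns.
counts = pd.DataFrame({
    "applications_men":   [83, 135, 67, 230, 189, 105, 156, 425, 245],
    "applications_women": [39, 39, 9, 166, 62, 78, 126, 409, 260],
    "awards_men":         [22, 26, 18, 33, 30, 12, 38, 65, 46],
    "awards_women":       [10, 9, 2, 32, 13, 17, 18, 47, 29],
})

# Pool all disciplines by summing each column.
totals = counts.sum()

# Rows are genders, columns are award status (illustrative labels).
two_by_two = pd.DataFrame(
    {
        "awarded": [totals["awards_men"], totals["awards_women"]],
        "not_awarded": [
            totals["applications_men"] - totals["awards_men"],
            totals["applications_women"] - totals["awards_women"],
        ],
    },
    index=pd.Index(["men", "women"], name="gender"),
)
print(two_by_two)
```

With the notebook's `df` in place of `counts`, the same code should apply unchanged.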


In [10]:
######################### QUESTION 2 a ###########################
# Use the table from Question 1 to compute the percentages of men
#  awarded versus women awarded.
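
One way to get the two percentages is to divide the awarded count by the row total for each gender. The sketch below hard-codes the pooled counts, which are the column sums of the per-discipline table printed earlier in the notebook; the names `two_by_two` and `pct_awarded` are assumptions made for illustration.

```python
import pandas as pd

# Pooled counts over all nine disciplines (column sums of the printed table):
# men: 1635 applications, 290 awards; women: 1188 applications, 177 awards.
two_by_two = pd.DataFrame(
    {"awarded": [290, 177], "not_awarded": [1345, 1011]},
    index=pd.Index(["men", "women"], name="gender"),
)

# Percentage awarded = awarded / (awarded + not awarded), per gender.
pct_awarded = 100 * two_by_two["awarded"] / two_by_two.sum(axis=1)
print(pct_awarded.round(1))
```

This gives roughly 17.7% for men and 14.9% for women, a pooled gap of just under three percentage points.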

In [11]:
######################### QUESTION 3 a ###########################
# Run a chi-squared test on the two-by-two table to determine whether the
#  difference in the two success rates is significant. (The hint in the original
#  exercise refers to R's chisq.test and tidy; in Python,
#  scipy.stats.chi2_contingency performs the same test.)
# What is the p-value of the difference in funding rate?
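
A possible Python route for the test is `scipy.stats.chi2_contingency`, sketched below under the assumption that SciPy is installed. The pooled counts are the column sums of the per-discipline table printed earlier, and the variable names are illustrative. For a two-by-two table this function applies Yates' continuity correction by default, which is also the default behaviour of R's `chisq.test`.

```python
from scipy.stats import chi2_contingency

# Rows: men, women. Columns: awarded, not awarded.
# Counts are pooled over all nine disciplines of the printed table.
table = [
    [290, 1345],
    [177, 1011],
]

# correction=True (the default) applies Yates' continuity correction,
# which only affects tables with one degree of freedom such as this one.
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi-squared = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")
```

Passing `correction=False` gives the uncorrected statistic, which yields a somewhat smaller p-value; which variant to report is a judgment call that the exercise leaves open.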

In [12]:
######################### QUESTION 4 a/b/c ###########################
# There may be an association between gender and funding. But can we infer causation here?
#  Is gender bias causing this observed difference? The response to the original paper claims
#  that what we see here is similar to the UC Berkeley admissions example. Specifically they 
#  state that this "could be a prime example of Simpson's paradox; if a higher percentage of
#  women apply for grants in more competitive scientific disciplines, then an analysis across
#  all disciplines could incorrectly show 'evidence' of gender inequality."

# To settle this dispute, use the dataset (called 'dat' in the original exercise, loaded here
#  as df) to check if this is a case of Simpson's paradox,
#  plot the success rates versus disciplines, which have been ordered by overall success, with
#  colors to denote the genders and size to denote the number of applications.
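
The plot described above could be sketched with plain matplotlib as follows. The frame name `counts`, the colors, and the figure size are arbitrary choices for illustration; the data are copied from the table printed earlier so that the snippet runs on its own, and marker area is set directly to the number of applications per gender.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Columns copied from the table printed earlier; `counts` is a stand-in for `df`.
counts = pd.DataFrame({
    "discipline": [
        "Chemical sciences", "Physical sciences", "Physics", "Humanities",
        "Technical sciences", "Interdisciplinary", "Earth/life sciences",
        "Social sciences", "Medical sciences",
    ],
    "applications_men":    [83, 135, 67, 230, 189, 105, 156, 425, 245],
    "applications_women":  [39, 39, 9, 166, 62, 78, 126, 409, 260],
    "success_rates_total": [26.2, 20.1, 26.3, 16.4, 17.1, 15.8, 19.9, 13.4, 14.9],
    "success_rates_men":   [26.5, 19.3, 26.9, 14.3, 15.9, 11.4, 24.4, 15.3, 18.8],
    "success_rates_women": [25.6, 23.1, 22.2, 19.3, 21.0, 21.8, 14.3, 11.5, 11.2],
})

# Order disciplines from highest to lowest overall success rate.
ordered = counts.sort_values("success_rates_total", ascending=False)

fig, ax = plt.subplots(figsize=(12, 6))
for gender, color in [("men", "tab:blue"), ("women", "tab:orange")]:
    ax.scatter(
        ordered["discipline"],
        ordered[f"success_rates_{gender}"],
        s=ordered[f"applications_{gender}"],  # marker area = number of applications
        color=color,
        alpha=0.6,
        label=gender,
    )

ax.set_xlabel("discipline (ordered by overall success rate)")
ax.set_ylabel("success rate (%)")
ax.tick_params(axis="x", labelrotation=45)
ax.legend(title="gender")
fig.tight_layout()
```

If the response's argument holds, the largest markers for women should cluster toward the right of the plot, in the disciplines with the lowest overall success rates, while the per-discipline gaps between men and women point in both directions.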