# Replication: Dion, Sumner, Mitchell (2018): Gendered Citation Patterns across Political Science and Social Science Methodology Fields

This notebook guides the reader through the replication of the quantitative analysis in Dion, Sumner and Mitchell's article [**Gendered Citation Patterns across Political Science and Social Science Methodology Fields**](https://www.cambridge.org/core/journals/political-analysis/article/gendered-citation-patterns-across-political-science-and-social-science-methodology-fields/5E8E92DB7454BCAE41A912F9E792CBA7). It allows an interactive exploration of the authors' analysis of citation patterns in social science journals.

In the first section, I summarise the authors' main argument. I then walk through the analysis of the dataset used in the article and show the steps needed to reproduce Tables 1 to 3. I conclude with short comments on the argument and the analysis.

## Summary

In their article in **Political Analysis**, Dion, Sumner and Mitchell (2018) aim to explain gender-biased citation behaviour in the social sciences, building on prior research addressing the gender citation gap in different scientific disciplines. While other researchers examine differences in the citation behaviour of male and female researchers over entire academic careers, Dion et al. (2018) take a network approach: they analyse "who cites whom?" rather than merely asking "who is cited the most?"

Their aim is to model the probability that the authors of an article referenced in a social science journal are all women, as opposed to all men or a mixed-sex team. The main explanatory variable is the sex of the author(s) citing the respective article (male vs. female vs. mixed sex). Other researchers have already demonstrated the presence of such an effect.

The distinct contribution of the article by Dion et al. (2018) is to analyse whether this effect is more or less pronounced depending on how many female researchers are active within a field of social science. They hypothesise that:

- H1: the effect of the sex of citing authors on the sex of the cited author will be more strongly pronounced in methodological fields of political science, because there are fewer female researchers active in these fields.
- H2: the effect of the sex of citing authors on the sex of the cited author will be more strongly pronounced in fields of social science with fewer active female researchers.

They select five journals (APSR, Politics \& Gender, Econometrica, Political Analysis, Sociological Methods \& Research). Within political science, Politics \& Gender represents a field with a higher share of female researchers and Political Analysis a field with a higher share of male researchers. By comparing methodological journals from economics (Econometrica), sociology (Sociological Methods \& Research) and political science (Political Analysis), they aim to show these patterns across social science disciplines, with economics being more dominated by men than political science and sociology.

In the following, I walk through the replication of the analysis by Dion et al. (2018).

## Replication

We need a range of R packages to run the replication code.


In [2]:
# load necessary packages
library(tidyverse)
library(MASS)
library(foreign)
library(IRdisplay)
library(optimx)
library(rms)
library(kableExtra)

# to hide a message when summarising data:
options(dplyr.summarise.inform = FALSE)

"package 'tidyverse' was built under R version 4.2.2"
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.5.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
"package 'MASS' was built under R version 4.2.2"

Attaching package: 'MASS'


The following object is masked from 'package:dplyr':

    select


"package 'IRdisplay' was built under R version 4.2.2"
"package 'optimx' was built under R version 4.2.2"
"package 'rms' was built under R version 4.2.2"

### Descriptive Analysis

Dion et al. (2018) collected data from the [Web of Science](https://www.webofscience.com/wos/woscc/basic-search). They provide a dataset containing all references made by articles published in APSR, Politics & Gender, Political Analysis, Econometrica and Sociological Methods & Research.

One row in the data frame corresponds to one reference made by an article published in one of the five journals between 2007 and 2016. The columns denote the following:

- **newartid**: Journal and title of the published (citing) article
- **newjnlid**: Name of the journal in which **newartid** is published
- **authorteam**: Whether authors of **newartid** are all-male, all-female or mixed ("Male", "Female", "Mixed")
- **refteam**: Whether the authors of the reference cited by **newartid** are all-male, all-female or mixed ("Male", "Female", "Mixed")
- **reffemonly**: Whether the authors of the reference cited by **newartid** are all-female (0,1)
- **refauthcomplete**: Whether the sex of all authors of a reference could be determined (0,1)

Have a glimpse at the data below:

In [3]:
# Data can be found under
# https://www.cambridge.org/core/journals/political-analysis/article/gendered-citation-patterns-across-political-science-and-social-science-methodology-fields/5E8E92DB7454BCAE41A912F9E792CBA7#supplementary-materials-tab

df <- read.dta("Data/DSM2018PAreplication.dta")

df %>%
  group_by(newjnlid) %>%
  slice(1)

newartid,newjnlid,authorteam,refteam,reffemonly,refauthcomplete
<fct>,<fct>,<fct>,<fct>,<dbl>,<dbl>
"APSR Deliberation, Democracy, and the Rule of Reason in Aristotle's Politics",APSR,Male,Male,0.0,1
"Politics & Gender The Roll Call Behavior of Men and Women in the U.S. House of Representatives, 1937-2008",Politics & Gender,Mixed,,,0
Political Analysis Proportionally Difficult: Testing for Nonproportional Hazards in Cox Models,Political Analysis,Male,,,0
Econometrica Instrumental Variable Models for Discrete Outcomes,Econometrica,Male,Male,0.0,1
Soc. Methods & Res. A smoothing cohort model in age-period-cohort analysis with applications to homicide arrest rates and lung cancer mortality rates,Soc. Methods & Res.,Male,,,1


#### Replicating Table 1

Table 1 in Dion et al. (2018) shows the number of original articles as well as the authors' sex by journal. We can use the following additional data provided in the replication file to analyse this.

In [4]:
# Data can be found under
# https://www.cambridge.org/core/journals/political-analysis/article/gendered-citation-patterns-across-political-science-and-social-science-methodology-fields/5E8E92DB7454BCAE41A912F9E792CBA7#supplementary-materials-tab

df_articles <- read.dta("Data/DSM2018PAreplication_articlesonly.dta")

##Table 1

df_articles %>%
  group_by(newjnlid, authorteam) %>% # group data by journal and sex of the article's authors
  na.omit() %>% # remove missing values
  summarise(N = n()) %>% # count articles per journal and author-team sex
  mutate(Percent = round(100 * N / sum(N), 2)) %>% # calculate percentages within each journal
  kable("html", caption = "Table 1: Distribution of author genders by article, 2007–2016.") %>% # create nice table
  as.character() %>% # make table readable by markdown
  display_html() # show it


newjnlid,authorteam,N,Percent
APSR,Male only,324,69.83
APSR,Female only,67,14.44
APSR,Mixed,73,15.73
Politics & Gender,Male only,27,7.94
Politics & Gender,Female only,266,78.24
Politics & Gender,Mixed,47,13.82
Political Analysis,Male only,220,74.58
Political Analysis,Female only,8,2.71
Political Analysis,Mixed,67,22.71
Econometrica,Male only,465,76.99


Comparing the output above with Table 1 in Dion et al. (2018), we see that the counts and percentages match, so we are able to replicate the table. Again we see that the gender composition of author teams differs across journals (APSR vs. Politics & Gender vs. Political Analysis) and fields (Econometrica vs. Political Analysis vs. Sociological Methods & Research).

#### Replicating Table 2

Table 2 in Dion et al. (2018) summarises the distribution of sexes of referenced authors across the various journals. See below the code to replicate Table 2.

In [5]:
##Table 2

df %>%
  filter(refauthcomplete == 1 & !is.na(refteam)) %>% # keep only references whose authors' gender is completely known and not missing
  group_by(newjnlid, refteam) %>% # group data by journal and sex of referenced authors
  summarise(N = n()) %>% # count references per journal and sex of referenced authors
  mutate(Percent = round(100 * N / sum(N), 2)) %>% # calculate percentages within each journal
  kable("html", caption = "Table 2: Distribution of reference author genders, 2007–2016.")  %>% # create nice table
  as.character() %>% # make table readable by markdown
  display_html() # show it 

newjnlid,refteam,N,Percent
APSR,Male,11617,74.24
APSR,Female,2203,14.08
APSR,Mixed,1828,11.68
Politics & Gender,Male,1649,27.98
Politics & Gender,Female,3405,57.77
Politics & Gender,Mixed,840,14.25
Political Analysis,Male,4650,78.93
Political Analysis,Female,322,5.47
Political Analysis,Mixed,919,15.6
Econometrica,Male,9226,84.88


Table 2 shows how many references from articles in each journal cite work written by all-male, mixed or all-female author teams. Again, we are able to replicate the numbers from Table 2 in Dion et al. (2018) exactly. The distribution of cited authors' sex varies across journals: in Political Analysis, only 5.47% of references are to work by all-female author teams, whereas in Politics & Gender almost 58% are.
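
As a quick sanity check, the percentages can be recomputed from the counts alone. The sketch below retypes the counts for the three journals whose rows are fully shown in the output above; the data frame `counts` is built by hand for illustration and is not part of the replication files. It uses base R only.

```r
# Counts copied by hand from the Table 2 output above (three journals only)
counts <- data.frame(
  newjnlid = rep(c("APSR", "Politics & Gender", "Political Analysis"), each = 3),
  refteam  = rep(c("Male", "Female", "Mixed"), times = 3),
  N        = c(11617, 2203, 1828, 1649, 3405, 840, 4650, 322, 919)
)

# Total number of references per journal, repeated on every row of that journal
journal_total <- ave(counts$N, counts$newjnlid, FUN = sum)

# Share of each reference-team type within its journal, in percent
counts$Percent <- round(100 * counts$N / journal_total, 2)
counts
```

The recomputed `Percent` column should agree with the output of the pipeline above. The APSR and Political Analysis totals (15648 and 5891) also equal the number of observations reported for those journals in the regression table further below.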

### Replicating logistic regression analysis

Dion et al. (2018) use logistic regression with clustered and robust standard errors to analyse the association between the sex of an article's authors and the sex of the authors they reference. They estimate 6 different models to test their hypotheses. They expect effects of different sizes across journals with different distributions of authors' gender, and use these to examine the presence and strength of the "Matilda effect" as described in their article: the more balanced the authorship sex within a journal, the weaker the effect between citing and cited authors' gender will be.
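
In my own notation (the symbols are not taken from the article), each per-journal model can be written as follows. Let $y_{ir}$ equal 1 if reference $r$ in citing article $i$ has only female authors (`reffemonly`), and let $\text{Female}_i$ and $\text{Mixed}_i$ indicate an all-female or mixed citing author team, with all-male teams as the baseline:

$$
\mathrm{logit}\,\Pr(y_{ir} = 1) = \beta_0 + \beta_1 \text{Female}_i + \beta_2 \text{Mixed}_i
$$

Under H1 and H2, $\beta_1$ should be larger in journals from fields with fewer active female researchers. Because all references of one article share the same $i$, the standard errors are clustered, presumably at the level of the citing article.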

Before starting the regression, I make some adjustments to the data:

1. I keep only rows in which the sex of all cited authors is known
2. I remove all rows with missing data
3. I create two dummy variables that equal 1 if the author team is all female or mixed, respectively
4. I select the journal, the gender of the references, the article name and the two dummy variables

Afterwards, I apply a function that runs a logistic regression with clustered standard errors separately for the articles of each journal.
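
The sourced helper functions are not shown here, so the following is only a minimal sketch of the general technique, not the code in `sources/logistic_function.R`. It simulates stand-in data (all values are invented for illustration), fits a logit with base R's `glm()`, and computes cluster-robust standard errors by hand with a sandwich estimator that sums the score contributions within each article. The $G/(G-1)$ small-sample correction is an assumption on my part.

```r
# Simulated stand-in data: NOT the replication dataset
set.seed(1)
n_articles <- 200
article <- rep(seq_len(n_articles), each = 10)   # 10 references per article
team <- sample(c("Male", "Female", "Mixed"), n_articles,
               replace = TRUE, prob = c(0.70, 0.15, 0.15))
Female <- as.numeric(team[article] == "Female")
Mixed  <- as.numeric(team[article] == "Mixed")
reffemonly <- rbinom(length(article), 1, plogis(-2 + 0.9 * Female + 0.2 * Mixed))

# Ordinary logistic regression
fit <- glm(reffemonly ~ Female + Mixed, family = binomial())

# Per-observation score of the logit log-likelihood: x_i * (y_i - p_i)
X <- model.matrix(fit)
score <- X * (reffemonly - fitted(fit))

# Sum scores within clusters (articles), then build the sandwich
score_by_cluster <- rowsum(score, group = article)
G <- nrow(score_by_cluster)
bread <- vcov(fit)                     # inverse information matrix
meat  <- crossprod(score_by_cluster)   # sum over clusters of s_g s_g'
vc_cluster <- (G / (G - 1)) * bread %*% meat %*% bread

se_naive   <- sqrt(diag(vcov(fit)))
se_cluster <- sqrt(diag(vc_cluster))
round(cbind(Estimate = coef(fit), "Naive SE" = se_naive, "Cluster SE" = se_cluster), 3)
```

Clustering matters here because references from the same article are unlikely to be independent draws; the point estimates are unchanged and only the standard errors differ.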

In [7]:
# load source code necessary for analysis
source("sources/logistic_function.R")
source("sources/execute_logistic_per_journal.R")

df_ana <- df %>% 
  filter(refauthcomplete == 1) %>% # 1.
  na.omit() %>% # 2.
  mutate(Female = ifelse(authorteam == "Female", 1, 0), # 3.
         Mixed = ifelse(authorteam == "Mixed", 1, 0)) %>% # 3.
  dplyr::select(newjnlid, reffemonly, newartid, Female, Mixed) # 4.


In [8]:
models <- do.call("cbind", lapply(X = unique(df_ana$newjnlid), df = df_ana, FUN = logistic_per_journal))

models %>%
  rownames_to_column(" ") %>%
  kable("html", caption = "Table 3: Logistic Regression Estimates per journal: Effect of gender of citing author on gender of cited authors (1=female)")  %>%
  as.character() %>%
  display_html()

Term,APSR,Politics & Gender,Political Analysis,Econometrica,Soc. Methods & Res.
Intercept,-2.07 (0.05),-0.01 (0.11),-2.84 (0.09),-3.18 (0.06),-2.46 (0.1)
Female,0.99 (0.16),0.53 (0.12),0.42 (0.38),1.14 (0.22),0.76 (0.28)
Mixed,0.21 (0.13),-0.15 (0.16),-0.08 (0.16),0.07 (0.14),0.06 (0.18)
Pseudo R2,-0.026,-0.0165,-7e-04,-0.0106,-0.0078
NullLL,-6359,-4007,-1249,-1951,-1185
LL,-6198,-3942,-1248,-1931,-1175
Clusters,464,332,295,604,232
Observations,15648,5883,5891,10869,4053


First and foremost, we are able to replicate most of the output in Table 3 in Dion et al. (2018). This extends to the pooled model in Table 3, as the next section shows.

What remains are slight differences in the pseudo $R^2$ values.
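
The pseudo $R^2$ values in the table above are negative, which McFadden's measure, $1 - LL / LL_0$, cannot be when the fitted log-likelihood exceeds the null log-likelihood. The sketch below recomputes the measure from the rounded log-likelihoods in the table. The vector names are mine, and the "inverted" variant is only my guess at how negative values of this size could arise; I have not checked it against `sources/logistic_function.R`.

```r
# Rounded log-likelihoods copied from the per-journal table above
null_ll <- c(APSR = -6359, PG = -4007, PA = -1249, Econ = -1951, SMR = -1185)
ll      <- c(APSR = -6198, PG = -3942, PA = -1248, Econ = -1931, SMR = -1175)

# McFadden's pseudo R2: 1 - LL / LL_null
mcfadden <- 1 - ll / null_ll

# Hypothetical variant with the ratio inverted, which yields negative values
inverted <- 1 - null_ll / ll

round(cbind(McFadden = mcfadden, Inverted = inverted), 4)
```

With these rounded inputs, the inverted variant gives roughly -0.026 for APSR, in line with the table, while McFadden's formula gives roughly 0.025.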

#### Replicating pooled model

The code below works much like the code described above. The only difference is that the `logistic_pooled` function also estimates coefficients for journal dummies, so that the model includes journal fixed effects.
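
Written out in my own notation (an assumption, since the sourced function is not shown), the pooled specification would look roughly like this, with $y_{ir}$ equal to 1 if reference $r$ in citing article $i$ has only female authors. APSR appears to serve as the reference journal, since the output table below has no APSR coefficient:

$$
\mathrm{logit}\,\Pr(y_{ir} = 1) = \beta_0 + \beta_1 \text{Female}_i + \beta_2 \text{Mixed}_i + \gamma_1 \text{PG}_i + \gamma_2 \text{PA}_i + \gamma_3 \text{Econ}_i + \gamma_4 \text{SMR}_i
$$

Here $\beta_1$ and $\beta_2$ are restricted to be equal across journals, while the $\gamma$ terms shift the baseline log-odds for each journal.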

In [21]:
source("sources/execute_logistic_pooled.R")

df_pooled <- df %>%
  filter(refauthcomplete == 1) %>%
  dplyr::select(newjnlid, authorteam, reffemonly, newartid) %>%
  na.omit() %>%
  mutate(Female = ifelse(authorteam == "Female", 1, 0),
         Mixed = ifelse(authorteam == "Mixed", 1, 0),
         APSR = ifelse(newjnlid == "APSR", 1, 0),
         PG = ifelse(newjnlid == "Politics & Gender", 1, 0),
         PA = ifelse(newjnlid == "Political Analysis", 1, 0),
         Econ. = ifelse(newjnlid == "Econometrica", 1, 0),
         SMR = ifelse(newjnlid == "Soc. Methods & Res.", 1, 0)) %>%
  dplyr::select(-authorteam)

pooled_model <- logistic_pooled(df = df_pooled)

pooled_model  %>%
  rownames_to_column(" ") %>%
  kable("html", caption = "Table 3: Pooled Model")  %>%
  as.character() %>%
  display_html()

Term,Pooled
Intercept,-2.02 (0.05)
Female,0.86 (0.1)
Mixed,0.11 (0.08)
P&G,1.73 (0.1)
PA,-0.89 (0.09)
Econ,-1.14 (0.07)
SMR,-0.47 (0.1)
Pseudo R2,-0.27964
NullLL,-18566
LL,-14509


We can again replicate the findings from Dion et al. (2018).
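
To make the logit coefficients easier to read, they can be converted into odds ratios and predicted probabilities. The sketch below uses the rounded pooled estimates from the table above and assumes that APSR is the reference journal (the table has no APSR row), so the probabilities refer to references in APSR articles. The object names are mine.

```r
# Rounded pooled estimates copied from the table above
b <- c(Intercept = -2.02, Female = 0.86, Mixed = 0.11)

# Odds ratios relative to all-male citing teams
odds_ratio <- exp(b[c("Female", "Mixed")])

# Predicted probability that a reference is to work by an all-female team,
# by sex of the citing team (reference journal only)
p_hat <- c(
  Male   = plogis(b[["Intercept"]]),
  Female = plogis(b[["Intercept"]] + b[["Female"]]),
  Mixed  = plogis(b[["Intercept"]] + b[["Mixed"]])
)

round(odds_ratio, 2)
round(p_hat, 3)
```

Because the model is nonlinear, the same coefficient implies different changes in probability for journals with a different baseline.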



### Comments

Although the authors use a large amount of data to analyse the effect of the sex of a scholar (or group of scholars) on citing the article of a female researcher or research group, the test of their hypotheses consists solely of comparing pairs of effect sizes from 5 models. Hence, although there is evidence of gender-biased citation behaviour in social science journals, the evidence for a "Matilda effect", that is, a moderation of this effect by the share of female researchers in a given field or publishing in a given journal, remains rather weak. Modelling an interaction between the number of female researchers in a given field and the sex of the citing authors could yield more insights.
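
One way to make such pairwise comparisons more formal is a simple z-test for the difference between two coefficients. The sketch below is my own illustration, not a procedure from the article. It uses the rounded `Female` estimates and standard errors from the per-journal table above and assumes the two estimates are independent, which seems plausible because the journals contain different articles.

```r
# Female coefficients and clustered SEs copied from the per-journal table
est <- c(APSR = 0.99, PG = 0.53, PA = 0.42, Econ = 1.14, SMR = 0.76)
se  <- c(APSR = 0.16, PG = 0.12, PA = 0.38, Econ = 0.22, SMR = 0.28)

# z-test for the difference between two independent coefficient estimates
compare_coefs <- function(a, b) {
  diff <- est[[a]] - est[[b]]
  z <- diff / sqrt(se[[a]]^2 + se[[b]]^2)
  c(diff = diff, z = z, p = 2 * pnorm(-abs(z)))
}

round(rbind(
  "PA vs PG"   = compare_coefs("PA", "PG"),
  "Econ vs PG" = compare_coefs("Econ", "PG")
), 3)
```

Because the inputs are rounded to two decimals, the resulting test statistics are approximate.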


## References

Dion, M. L., Sumner, J. L., & Mitchell, S. M. (2018). Gendered citation patterns across political science and social science methodology fields. *Political Analysis*, 26(3), 312-327.

Dion, M. L., Sumner, J. L., & Mitchell, S. M. (2018). Replication Data for: Gendered Citation Patterns across Political Science and Social Science Methodology Fields. Harvard Dataverse, V1, UNF:6:CInBeM5eziTIPGjBVTpr4A==. https://doi.org/10.7910/DVN/R7AQT1