Probabilistically assign gender proportions of first/last author pairs in bibliography entries

Network courtesy Ann Sizemore Blevins


Diversity Statement and Code Notebook


Motivated by the work of J. D. Dworkin, K. A. Linn, E. G. Teich, P. Zurn, R. T. Shinohara, and D. S. Bassett (2020). bioRxiv. doi:

For .pdf and .tex templates of the statement, see the /diversityStatement directory in this repository.

A .bib file containing the references used in the statement can be found in /diversityStatement/bibfile.bib

Diversity statement template


Recent work in neuroscience and other fields has identified a bias in citation practices such that papers from women and other minorities are under-cited relative to the number of such papers in the field [1, 2, 3, 4, 5, 6]. Here we sought to proactively consider choosing references that reflect the diversity of the field in thought, form of contribution, gender, and other factors. We used automatic classification of gender based on the first names of the first and last authors [1, 7], with possible combinations including male/male, male/female, female/male, and female/female. Excluding self-citations to the first and last authors of our current paper, the references contain A% male/male, B% male/female, C% female/male, D% female/female, and E% unknown categorization. We look forward to future work that could help us to better understand how to support equitable practices in science.


For the top 5 neuroscience journals (Nature Neuroscience, Neuron, Brain, Journal of Neuroscience, and Neuroimage), the expected gender proportions in reference lists as reported by Dworkin et al. are 58.4% for male/male, 9.4% for male/female, 25.5% for female/male, and 6.7% for female/female. Expected proportions were calculated by randomly sampling papers from 28,505 articles in the 5 journals, estimating gender breakdowns using probabilistic name classification tools, and regressing on relevant article variables such as publication date, journal, number of authors, review article status, and first- and last-author seniority. See Dworkin et al. for more details.
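As a rough illustration of how these benchmarks might be used, the sketch below subtracts the expected percentages from a reference list's observed percentages. The function name `compare_to_expected` and the observed values are hypothetical and are not part of the notebook. The sketch assumes the observed percentages are computed over classified references only, since the benchmarks have no "unknown" category.

```python
# Expected percentages reported by Dworkin et al. for the five journals above.
EXPECTED = {
    "male/male": 58.4,
    "male/female": 9.4,
    "female/male": 25.5,
    "female/female": 6.7,
}


def compare_to_expected(observed):
    """Return observed minus expected, in percentage points, for each category."""
    return {
        category: round(observed.get(category, 0.0) - expected, 1)
        for category, expected in EXPECTED.items()
    }


# Hypothetical observed percentages for one manuscript's reference list.
observed = {
    "male/male": 50.0,
    "male/female": 12.0,
    "female/male": 28.0,
    "female/female": 10.0,
}
difference = compare_to_expected(observed)
for category, delta in difference.items():
    print(f"{category}: {delta:+.1f} percentage points")
```

A negative value means the category appears less often in the reference list than the benchmark, and a positive value means it appears more often.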


The goal of the coding notebook is to clean your .bib file to only contain references that you have cited in your manuscript. This cleaned .bib will then be used to generate a data table of full first names that will be used to query the probabilistic gender classifier, Gender API. Proportions of the predicted gender for first and last author pairs (male/male, male/female, female/male, and female/female) will be calculated.
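The final step described above, turning per-reference gender predictions into proportions, can be sketched as follows. This is a minimal illustration and not the notebook's code: the function name `pair_proportions` and the input format (one tuple of predicted genders per reference, with `None` for an unclassified name) are assumptions.

```python
from collections import Counter

CATEGORIES = ["male/male", "male/female", "female/male", "female/female", "unknown"]


def pair_proportions(pairs):
    """Percent of references in each first/last author gender category.

    `pairs` is a list of (first_author_gender, last_author_gender) tuples.
    Each gender is "male", "female", or None when it could not be classified.
    """
    labels = []
    for first, last in pairs:
        if first is None or last is None:
            labels.append("unknown")
        else:
            labels.append(f"{first}/{last}")
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    return {category: 100.0 * counts[category] / total for category in CATEGORIES}


# Hypothetical predictions for four references.
example = [("male", "male"), ("female", "male"), ("male", "male"), ("female", None)]
proportions = pair_proportions(example)
for category in CATEGORIES:
    print(f"{category}: {proportions[category]:.1f}%")
```

The resulting percentages correspond to the A% through E% placeholders in the statement template.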

  1. Obtain a .bib file of your manuscript's reference list. You can do this with common reference managers. Please export your .bib in an output style that uses full first names (rather than only first initials) and full author lists (rather than author lists abbreviated with "et al.").

  2. Launch the Binder environment. Please refresh the page if the Binder does not load after 5-10 mins.


  3. Open the notebook cleanBib.ipynb. Follow the instructions above each code block. It can take 10 minutes to 1 hour to complete all of the instructions, depending on the state and size of your .bib file. We expect that the most time-consuming step will be manually modifying the .bib file to find missing author names, fill incomplete entries, and fix formatting errors. These problems arise because the automated methods of reference managers and Google Scholar sometimes cannot retrieve full information, for example when a journal provides only an author's first initial instead of their full first name.
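To give a sense of what cleaning a .bib file against an .aux file involves, here is a minimal sketch that is independent of the notebook's actual implementation. It relies on the fact that LaTeX records cited keys in the .aux file as `\citation{...}` lines. The function names, the sample strings, and the regular-expression approach to splitting entries are assumptions. The sketch keeps only entries whose key is cited and does not handle `@string` or `@preamble` blocks.

```python
import re

CITATION_RE = re.compile(r"\\citation\{([^}]*)\}")
ENTRY_KEY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(aux_text):
    r"""Collect the citation keys listed in \citation{...} lines of a .aux file."""
    keys = set()
    for group in CITATION_RE.findall(aux_text):
        keys.update(key.strip() for key in group.split(",") if key.strip())
    return keys


def keep_cited_entries(bib_text, keys):
    """Return only the .bib entries whose key appears in `keys`."""
    # Split just before every "@" that starts a line, so each chunk is one entry.
    chunks = re.split(r"(?m)^(?=@)", bib_text)
    kept = []
    for chunk in chunks:
        match = ENTRY_KEY_RE.match(chunk)
        if match and match.group(1) in keys:
            kept.append(chunk.strip())
    return "\n\n".join(kept) + "\n" if kept else ""


# Hypothetical file contents for illustration.
aux_text = r"""\relax
\citation{smith2019,lee2020}
\bibdata{refs}
\citation{smith2019}
"""
bib_text = """@article{smith2019,
  author = {Smith, Jane},
  title = {A cited paper}
}

@book{unused2001,
  author = {Doe, John},
  title = {A book that is never cited}
}

@article{lee2020,
  author = {Lee, Min},
  title = {Another cited paper}
}
"""
keys = cited_keys(aux_text)
cleaned = keep_cited_entries(bib_text, keys)
print(cleaned)
```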


Input | Output
.bib file(s) (REQUIRED) | cleanBib.csv: table of author first names, titles, and .bib keys
.aux file (OPTIONAL) | Authors.csv: table of author first names, estimated gender classification, and confidence
.tex file (OPTIONAL) | yourTexFile_gendercolor.tex: your .tex file modified to compile a .pdf with in-line citations color-coded by gender pairs

Color-coded .tex file, Eli Cornblath


  • Why do I receive an error when running the code?
    • The most common errors are due to misformatted .bib files. Errors will usually provide an indication of the line or type of problem in the .bib file. They will require you to manually correct the .bib file of formatting errors or incomplete entries. If you cannot resolve an error, please open an issue and attach the error pasted into a .txt or a screenshot of the error. We will try to help resolve it.
  • Will this method work on non-Western names?
  • Are self-citations included?
    • We do not include self-citations by default. We define self-citations as those including your first or last author as a co-author.
  • What if a reference has only 1 author?
    • We count that author as both the first and last author.
  • What about gender-neutral names?
    • We exclude names that cannot be classified with >70% confidence. These are reported in the Diversity Statement as "unknown."
  • Should I include the diversity statement references in the gender proportion calculation?
    • Please do not include the diversity statement references. The descriptive statistic of primary interest is of your citation practices.
  • What is a .bib file?
    • The .bib file is a bibliography with tagged entry fields used by LaTeX to format a typeset manuscript's reference list and its in-line citations. If you are not using LaTeX to write your manuscript, common reference managers that are linked to Microsoft Word or Google Docs also allow you to export .bib files (see Instructions, Step 1).
  • What is an .aux file?
    • The .aux file is generated when you compile the .tex file to build your manuscript. It is linked to the .bib file(s) used to populate your manuscript's reference list and records the citations used.
  • I have an idea to advance this project, suggestions about how to improve the notebook, and/or found a bug. Can I contribute?
    • Yes, please open an issue or pull request. We welcome feedback on any pain points in running this code notebook. If you contribute, please credit yourself alphabetically in the Contributors section as part of the pull request.
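The classification rules in the FAQ above (self-citations excluded, a single author counted as both first and last author, and names classified with confidence of 70% or less reported as "unknown") can be sketched as one function. This is an illustration under stated assumptions, not the notebook's code: the function name, the "First Last" name format, the confidence scale of 0 to 1, and the `predictions` dictionary are all hypothetical, and no call to Gender API is made.

```python
def categorize_entry(authors, own_authors, predictions, threshold=0.70):
    """Return a gender-pair category for one reference, or None for a self-citation.

    authors:      author names in order, each written as "First Last" (assumed format)
    own_authors:  names of the first and last authors of your own manuscript
    predictions:  hypothetical lookup of first name -> (gender, confidence in 0..1)
    """
    # Self-citation: your first or last author appears anywhere in the author list.
    if any(name in own_authors for name in authors):
        return None

    # A single author serves as both first and last author.
    first_author, last_author = authors[0], authors[-1]

    genders = []
    for name in (first_author, last_author):
        first_name = name.split()[0]
        gender, confidence = predictions.get(first_name, (None, 0.0))
        # Keep only classifications above the confidence threshold.
        genders.append(gender if gender and confidence > threshold else None)

    if None in genders:
        return "unknown"
    return "/".join(genders)


# Hypothetical classifier output and manuscript authors.
predictions = {
    "Alex": ("male", 0.55),
    "Maria": ("female", 0.98),
    "John": ("male", 0.99),
}
own_authors = {"Dana Example"}

two_authors = categorize_entry(["Maria Garcia", "John Smith"], own_authors, predictions)
one_author = categorize_entry(["Maria Garcia"], own_authors, predictions)
low_confidence = categorize_entry(["Alex Kim", "John Smith"], own_authors, predictions)
self_citation = categorize_entry(
    ["John Smith", "Dana Example", "Maria Garcia"], own_authors, predictions
)
print(two_authors, one_author, low_confidence, self_citation)
```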

Other Resources


[1] J. D. Dworkin, K. A. Linn, E. G. Teich, P. Zurn, R. T. Shinohara, and D. S. Bassett, “The extent and drivers of gender imbalance in neuroscience reference lists,” bioRxiv, 2020.

[2] D. Maliniak, R. Powers, and B. F. Walter, “The gender citation gap in international relations,” International Organization, vol. 67, no. 4, pp. 889–922, 2013.

[3] N. Caplar, S. Tacchella, and S. Birrer, “Quantitative evaluation of gender bias in astronomical publications from citation counts,” Nature Astronomy, vol. 1, no. 6, p. 0141, 2017.

[4] P. Chakravartty, R. Kuo, V. Grubbs, and C. McIlwain, “#CommunicationSoWhite,” Journal of Communication, vol. 68, no. 2, pp. 254–266, 2018.

[5] Y. Thiem, K. F. Sealey, A. E. Ferrer, A. M. Trott, and R. Kennison, “Just Ideas? The Status and Future of Publication Ethics in Philosophy: A White Paper,” tech. rep., 2018.

[6] M. L. Dion, J. L. Sumner, and S. M. Mitchell, “Gendered citation patterns across political science and social science methodology fields,” Political Analysis, vol. 26, no. 3, pp. 312–327, 2018.

[7] D. Zhou, E. J. Cornblath, J. Stiso, E. G. Teich, J. D. Dworkin, A. S. Blevins, and D. S. Bassett, “Gender diversity statement and code notebook v1.0,” Feb. 2020.



  • Ann Sizemore Blevins
  • Eli Cornblath
  • Jordan Dworkin
  • Jeni Stiso
  • Erin Teich
  • Dale Zhou


  • 3/16/2020

    • fix bug with CrossRef title confirmation
    • added README instructions on exporting .bib with a style that includes authors' full first names (not just initials) when possible
    • added a sleep timer for CrossRef API queries
    • added another self-citation check from the CrossRef search results
  • 2/17/2020

    • streamlined instructions
    • added repository photo for social media (thanks, Ann!)
    • move instructions into Jupyter notebook
    • added code to automatically remove unused .bib entries instead of needing user to manually remove them (thanks, Eli and Erin!)
    • made removing self-citations default
    • added FAQ
    • added screenshots to instructions
    • added error message to request users remove entries with duplicate IDs. Not automated in case duplicate entry key refers to different references.
    • throw error if entries are incomplete or blank
    • fixed handling of optional middle initial for self-citations
    • added SOS notebook support to put all code and instructions into 1 notebook so users don't have to manually change kernel
    • added optional entry for co-first or co-last authors
    • added optional code block to color-code .tex file's citation keys by gender pair classifications
    • added code to search Crossref API to automatically complete some incomplete .bib entries (thanks, Jeni!)
    • add another self-citation check after manual editing
  • 1/19/2020

    • added code to output a column with article titles to make it easier to manually search which bib entries need manual editing
    • added code to output another column that optionally checks for self-citations