# Clean Bibliography

The goal of this notebook is to clean your `.bib` file so that it contains only the references you have cited in your paper, with full first names for their authors. The full first names will then be used to query the probabilistic gender classifier, [Gender API](https://gender-api.com). 

The only required file you need is your manuscript's bibliography in `.bib` format. __Your `.bib` must only contain references cited in the manuscript__. Otherwise, the estimated gender proportions will be inaccurate. 

If you are not using LaTeX, collect and organize only the references you have cited in your manuscript using your reference manager of choice (e.g. Mendeley, Zotero, EndNote, ReadCube, etc.) and export that selected bibliography as a `.bib` file. __Please export your .bib in an output style that uses full first names (rather than only first initials) and using the full author lists (rather than abbreviated author lists with "et al.").__

   * [Export `.bib` from Mendeley](https://blog.mendeley.com/2011/10/25/howto-use-mendeley-to-create-citations-using-latex-and-bibtex/)
   * [Export `.bib` from Zotero](https://libguides.mit.edu/ld.php?content_id=34248570)
   * [Export `.bib` from EndNote](https://www.reed.edu/cis/help/LaTeX/EndNote.html). Note: Please export full first names by either [choosing an output style that does so by default (e.g. in MLA style)](https://canterbury.libguides.com/endnote/basics-output) or by [customizing an output style.](http://bibliotek.usn.no/cite-and-write/endnote/how-to-use/how-to-show-the-author-s-full-name-in-the-reference-list-article185897-28181.html)
   * [Export `.bib` from Read Cube Papers](https://support.papersapp.com/support/solutions/articles/30000024634-how-can-i-export-references-from-readcube-papers-)
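
Before uploading, it can help to check whether an exported `.bib` still contains initials-only names or truncated author lists. The following is a minimal sketch, not part of the notebook: the function name `find_author_problems` is hypothetical, and the regular expressions assume simple `author = {...}` fields without nested braces.

```python
import re

# Assumption: entries look like "@type{key, ... author = {A and B}, ...}".
ENTRY_START = re.compile(r'@\w+\s*\{\s*([^,\s]+)\s*,')
AUTHOR_FIELD = re.compile(r'author\s*=\s*\{([^{}]*)\}', re.IGNORECASE)


def find_author_problems(bib_text):
    """Return {citation_key: [issues]} for entries whose author field looks incomplete."""
    problems = {}
    starts = list(ENTRY_START.finditer(bib_text))
    for i, match in enumerate(starts):
        # An entry runs from its key to the start of the next entry.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bib_text)
        field = AUTHOR_FIELD.search(bib_text[match.end():end])
        if field is None:
            continue
        issues = []
        for author in re.split(r'\s+and\s+', field.group(1).strip()):
            if author.lower() == 'others':
                issues.append('truncated author list')
                continue
            # "Last, First Middle" or "First Middle Last"
            if ',' in author:
                given = author.split(',', 1)[1]
            else:
                given = ' '.join(author.split()[:-1])
            tokens = given.replace('.', ' ').split()
            if not tokens or len(tokens[0]) < 2:
                issues.append('initials only: ' + author)
        if issues:
            problems[match.group(1)] = issues
    return problems


sample = """@article{smith2020,
  author = {Smith, J. and Jones, Mary Ann},
  title = {A title},
}
@article{lee2019,
  author = {Lee, Min and others},
  title = {Another title},
}
@article{ok2021,
  author = {Garcia, Ana and Brown, Tom},
  title = {Fine},
}
"""
report = find_author_problems(sample)
for key, issues in report.items():
    print(key, issues)
```

Entries reported here are the ones most likely to need a different output style or a manual fix in your reference manager.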

For those working in LaTeX, an optional `.aux` file lets the notebook automatically filter your `.bib` so that it contains only entries that are cited in your manuscript.
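
To illustrate the idea behind the `.aux` filter: a BibTeX-style `.aux` file records each citation as a `\citation{key1,key2}` line, so the cited keys can be compared against the keys in the `.bib`. The sketch below is a simplified stand-in for that comparison (the notebook itself calls `checkcites.lua`); the function names are hypothetical, and it assumes the BibTeX `.aux` format rather than a biber/biblatex workflow.

```python
import re


def cited_keys_from_aux(aux_text):
    """Collect every key that appears in a \\citation{...} line of a .aux file."""
    keys = set()
    for match in re.finditer(r'\\citation\{([^}]*)\}', aux_text):
        for key in match.group(1).split(','):
            key = key.strip()
            if key and key != '*':  # "\nocite{*}" is written as "\citation{*}"
                keys.add(key)
    return keys


def split_used_unused(bib_keys, aux_text):
    """Partition .bib keys into those cited in the manuscript and those not cited."""
    cited = cited_keys_from_aux(aux_text)
    used = [k for k in bib_keys if k in cited]
    unused = [k for k in bib_keys if k not in cited]
    return used, unused


aux_example = (
    "\\relax\n"
    "\\citation{smith2020,lee2019}\n"
    "\\bibdata{library}\n"
    "\\citation{smith2020}\n"
)
used, unused = split_used_unused(['smith2020', 'lee2019', 'garcia2021'], aux_example)
print("used:", used)
print("unused:", unused)
```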

| Input                 | Output                                                                                                                        |
|-----------------------|-------------------------------------------------------------------------------------------------------------------------------|
| `.bib` file(s) **(REQUIRED)**   | `cleanedBib.csv`: table of author first names, titles, and .bib keys                                                          |
| `.aux` file (OPTIONAL)| `Authors.csv`: table of author first names, estimated gender classification, and confidence                                   |
| `.tex` file (OPTIONAL)| `yourTexFile_gendercolor.tex`: your `.tex` file modified to compile .pdf with in-line citations color-coded by gender pairs   |

## Import libraries, set paths, check settings

### Upload your `.bib` file(s) and optionally an `.aux` file generated from compiling your LaTeX manuscript and your `.tex` file

![upload button](img/upload.png)

![confirm upload button](img/confirmUpload.png)

Then, run the code block below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)

In [None]:
import numpy as np
import bibtexparser
from bibtexparser.bparser import BibTexParser
import glob
import subprocess
import os
from pybtex.database.input import bibtex
import csv
from pylatexenc.latex2text import LatexNodes2Text 
import unicodedata
import re
import pandas as pd
from habanero import Crossref
import string
from time import sleep


def checkcites_output(aux_file):
    '''take in aux file for tex document, return list of citation keys
    that are in .bib file but not in document'''

    result = subprocess.run(['texlua', 'checkcites.lua', aux_file[0]], stdout=subprocess.PIPE)
    result = result.stdout.decode('utf-8')
    unused_array_raw = result.split('\n')
    # process array of unused references + other output 
    unused_array_final = list()
    for x in unused_array_raw:
        if len(x) > 0: # if line is not empty
            if x[0] == '-':  # and if first character is a '-', it's a citation key
                unused_array_final.append(x[2:]) # truncate '- '            
    if "------------------------------------------------------------------------" in unused_array_final:
        return(result)
    else:
        return(unused_array_final)


def removeMiddleName(line):
    arr = line.split()
    last = arr.pop()
    n = len(arr)
    if n == 4:
        first, middle = ' '.join(arr[:2]), ' '.join(arr[2:])
    elif n == 3:
        first, middle = arr[0], ' '.join(arr[1:])
    elif n == 2:
        first, middle = arr
    elif n==1:
        return line
    return(str(first + ' ' + middle))


def returnFirstName(line):
    arr = line.split()
    n = len(arr)
    if n == 4:
        first, middle = ' '.join(arr[:2]), ' '.join(arr[2:])
    elif n == 3:
        first, middle = arr[0], ' '.join(arr[1:])
    elif n == 2:
        first, middle = arr
    elif n==1:
        return line
    return(str(middle))


def convertLatexSpecialChars(latex_text):
    return LatexNodes2Text().latex_to_text(latex_text)


def convertSpecialCharsToUTF8(text):
    data = LatexNodes2Text().latex_to_text(text)
    return unicodedata.normalize('NFD', data).encode('ascii', 'ignore').decode('utf-8')


def namesFromXref(doi, title, authorPos):
    '''Use DOI and article titles to query Crossref for author list'''
    if authorPos == 'first':
        idx = 0
    elif authorPos == 'last':
        idx = -1
    # get cross ref data
    authors = ['']
    # first try DOI
    if doi != "":
        works = cr.works(query = title, select = ["DOI","author"], limit=1, filter = {'doi': doi})
        if works['message']['total-results'] > 0:
            authors = works['message']['items'][0]['author']
    elif title != '': 
        works = cr.works(query = f'title:"{title}"', select = ["title","author"], limit=10)
        cnt = 0
        name = ''
        # check that you grabbed the proper paper
        if works['message']['items'][cnt]['title'][0].lower() == title.lower():
            authors = works['message']['items'][0]['author']

    # check the all fields are available
    if not 'given' in authors[idx]:
        name = ''
    else:
        # trim initials
        name = authors[idx]['given'].replace('.',' ').split()[0]

    return name


cr = Crossref()
homedir = '/home/jovyan/'
bib_files = glob.glob(homedir + '*.bib')
paper_aux_file = glob.glob(homedir + '*.aux')
paper_bib_file = 'library_paper.bib'
try:
    tex_file = glob.glob(homedir + "*.tex")[0]
except IndexError:
    tex_file = None  # only needed for the optional color-coding step
    print('No .tex file found.')

### Define the _first_ and _last_ author of your paper.

For example: 
```
yourFirstAuthor = 'Teich, Erin G.'
yourLastAuthor = 'Bassett, Danielle S.'
```

And optionally, define any co-first or co-last author(s), making sure to keep the square brackets to define a list.

For example:
```
optionalEqualContributors = ['Dworkin, Jordan', 'Stiso, Jennifer']
```

or 

```
optionalEqualContributors = ['Dworkin, Jordan']
```
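
The `LastName, FirstName OptionalMiddleInitial` format matters because self-citations are detected by matching these names against each reference's author list. As a rough sketch of that matching, assuming a simplified rule that keeps only the last name and the first given name (the helper names `drop_middle_initial` and `is_self_citation` are hypothetical, and the notebook's own `removeMiddleName` treats multi-word names differently):

```python
def drop_middle_initial(name):
    """Reduce 'Last, First M.' to 'Last, First'; names without a middle part are unchanged."""
    last, _, given = name.partition(',')
    given_tokens = given.split()
    if not given_tokens:
        return last.strip()
    return f"{last.strip()}, {given_tokens[0]}"


def is_self_citation(author_field, citing_authors):
    """True if any citing author appears in a BibTeX-style author string."""
    return any(drop_middle_initial(a) in author_field for a in citing_authors)


citing = ['Teich, Erin G.', 'Bassett, Danielle S.']
print(drop_middle_initial('Bassett, Danielle S.'))
print(is_self_citation('Stiso, Jennifer and Bassett, Danielle S.', citing))
print(is_self_citation('Smith, John and Jones, Mary', citing))
```

Because the match is a plain substring test, a name spelled differently in the `.bib` (for example with initials only) would not be detected, which is one reason full first names are requested.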

Then, run the code block below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)

In [None]:
yourFirstAuthor = 'LastName, FirstName OptionalMiddleInitial'
yourLastAuthor = 'LastName, FirstName OptionalMiddleInitial'
optionalEqualContributors = ['LastName, FirstName OptionalMiddleInitial', 'LastName, FirstName OptionalMiddleInitial']

if (yourFirstAuthor == 'LastName, FirstName OptionalMiddleInitial') or (yourLastAuthor == 'LastName, FirstName OptionalMiddleInitial'):
    raise ValueError("Please enter your manuscript's first and last author names")

if paper_aux_file:
    # compare against a list (not a tuple) so that the untouched placeholder is recognized
    if optionalEqualContributors == ['LastName, FirstName OptionalMiddleInitial', 'LastName, FirstName OptionalMiddleInitial']:
        citing_authors = np.array([yourFirstAuthor, yourLastAuthor])
    else:
        # flatten so that every citing author is a separate string
        citing_authors = np.array([yourFirstAuthor, yourLastAuthor] + list(optionalEqualContributors))
    unused_in_paper = checkcites_output(paper_aux_file) # get citations in library not used in paper
    print(unused_in_paper)
    print("Unused citations: ", unused_in_paper.count('=>'))
    
    
    parser = BibTexParser()
    parser.ignore_nonstandard_types = False
    parser.common_strings = True
    
    bib_data = None
    for bib_file in bib_files:
        with open(bib_file) as bibtex_file:
            if bib_data is None:
                bib_data = bibtexparser.bparser.BibTexParser(common_strings=True, ignore_nonstandard_types=False).parse_file(bibtex_file)
            else:
                bib_data_extra = bibtexparser.bparser.BibTexParser(common_strings=True, ignore_nonstandard_types=False).parse_file(bibtex_file)
                bib_data.entries_dict.update(bib_data_extra.entries_dict)
                bib_data.entries.extend(bib_data_extra.entries)
    
    all_library_citations = list(bib_data.entries_dict.keys())
    print("All citations: ", len(all_library_citations))
    
    for k in all_library_citations:
        if re.search('\\b'+ k + '\\b', unused_in_paper.replace('\n',' ').replace('=>',' ')) != None:
            del bib_data.entries_dict[k] # remove from entries dictionary if not in paper
            
    in_paper_mask = [re.search('\\b'+ bib_data.entries[x]['ID'] + '\\b', unused_in_paper.replace('\n',' ').replace('=>',' ')) == None for x in range(len(bib_data.entries))]
    bib_data.entries = [bib_data.entries[x] for x in np.where(in_paper_mask)[0]] # replace entries list with entries only in paper
    del bib_data.comments
    
    duplicates = []
    for key in bib_data.entries_dict.keys():
        count = str(bib_data.entries).count("'ID\': \'"+ key + "\'")
        if count > 1:
            duplicates.append(key)
            
    if len(duplicates) > 0:
        raise ValueError("In your .bib file, please remove duplicate entries or duplicate entry ID keys for:", ' '.join(map(str, duplicates)))

    if os.path.exists(paper_bib_file):
        os.remove(paper_bib_file)
    
    with open(paper_bib_file, 'w') as bibtex_file:
        bibtexparser.dump(bib_data, bibtex_file)
    
    # define first author and last author names of citing paper -- will exclude citations of these authors
    # beware of latex symbols within author names
    # in_paper_citations = list(bib_data.entries_dict.keys())
    in_paper_citations = [bib_data.entries[x]['ID'] for x in range(len(bib_data.entries))] # get list of citation keys in paper
    
    # extract author list for every cited paper
    cited_authors = [bib_data.entries_dict[x]['author'] for x in in_paper_citations]
    # find citing authors in cited author list
    # using nested list comprehension, make a citing author -by- citation array of inclusion
    self_cite_mask = np.array([[str(citing_author) in authors for authors in cited_authors] for citing_author in citing_authors])
    self_cite_mask = np.any(self_cite_mask,axis=0) # collapse across citing authors such that any coauthorship by either citing author -> exclusion
    
    print("Self-citations: ", [bib_data.entries[x]['ID'] for x in np.where(self_cite_mask)[0]]) # print self citations
    for idx,k in enumerate(in_paper_citations):
        if self_cite_mask[idx]:
            del bib_data.entries_dict[k] # delete citation from dictionary if self citation
    bib_data.entries = [bib_data.entries[x] for x in np.where(np.invert(self_cite_mask))[0]] # replace entries list with entries that aren't self citations
    
    paper_bib_file_excl_sc = os.path.splitext(paper_bib_file)[0] + '_noselfcite.bib'
    
    if os.path.exists(paper_bib_file_excl_sc):
        os.remove(paper_bib_file_excl_sc)
    
    with open(paper_bib_file_excl_sc, 'w') as bibtex_file:
        bibtexparser.dump(bib_data, bibtex_file)
        
noselfcite_files = glob.glob(homedir + '*_noselfcite.bib') # os.path.exists() does not expand wildcards
if noselfcite_files:
    ID = noselfcite_files
else:
    ID = glob.glob(homedir + '*bib')
    with open(ID[0]) as bibtex_file:
        bib_data = bibtexparser.bparser.BibTexParser(common_strings=True, ignore_nonstandard_types=False).parse_file(bibtex_file)
    duplicates = []
    for key in bib_data.entries_dict.keys():
        count = str(bib_data.entries).count("'ID\': \'"+ key + "\'")
        if count > 1:
            duplicates.append(key)
            
    if len(duplicates) > 0:
        raise ValueError("In your .bib file, please remove duplicate entries or duplicate entry ID keys for:", ' '.join(map(str, duplicates)))

FA = []
LA = []
parser = bibtex.Parser()
bib_data = parser.parse_file(ID[0])
counter = 1
nameCount = 0
outPath = homedir + 'cleanedBib.csv'

if os.path.exists(outPath):
    os.remove(outPath)

with open(outPath, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['Article', 'FA', 'LA', 'Title', 'SelfCite', 'CitationKey'])

for key in bib_data.entries.keys():
    diversity_bib_titles = ['The extent and drivers of gender imbalance in neuroscience reference lists','The gender citation gap in international relations','Quantitative evaluation of gender bias in astronomical publications from citation counts', '\# CommunicationSoWhite', '{Just Ideas? The Status and Future of Publication Ethics in Philosophy: A White Paper}','Gendered citation patterns across political science and social science methodology fields','Gender Diversity Statement and Code Notebook v1.0']
    if bib_data.entries[key].fields['title'] in diversity_bib_titles:
        continue
        
    try:
        author = bib_data.entries[key].persons['author']
    except:
        author = bib_data.entries[key].persons['editor']
    FA = author[0].rich_first_names
    LA = author[-1].rich_first_names
    FA = convertLatexSpecialChars(str(FA)[7:-3]).translate(str.maketrans('', '', string.punctuation)).replace('Protected',"").replace(" ",'')
    LA = convertLatexSpecialChars(str(LA)[7:-3]).translate(str.maketrans('', '', string.punctuation)).replace('Protected',"").replace(" ",'')

    # check that we got a name (not an initial) from the bib file, if not try using the title in the crossref API
    try:
        title = bib_data.entries[key].fields['title'].replace(',', '').replace(',', '').replace('{','').replace('}','')
    except:
        title = ''
    try:
        doi =  bib_data.entries[key].fields['doi']
    except:
        doi = ''
    if FA == '' or len(FA.split('.')[0]) <= 1:
        while True:
            try:
                FA = namesFromXref(doi, title, 'first')
            except UnboundLocalError:
                sleep(1)
                continue
            break
    if LA == '' or len(LA.split('.')[0]) <= 1:
        while True:
            try:
                LA = namesFromXref(doi, title, 'last')
            except UnboundLocalError:
                sleep(1)
                continue
            break

    if (yourFirstAuthor!='LastName, FirstName OptionalMiddleInitial') and (yourLastAuthor!='LastName, FirstName OptionalMiddleInitial'):
        selfCiteCheck1 = [s for s in author if removeMiddleName(yourLastAuthor) in str([convertLatexSpecialChars(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertLatexSpecialChars(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
        selfCiteCheck1a = [s for s in author if removeMiddleName(yourLastAuthor) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertSpecialCharsToUTF8(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
        selfCiteCheck1b = [s for s in author if removeMiddleName(yourLastAuthor) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), LA]).replace("'", "")]

        selfCiteCheck2 = [s for s in author if removeMiddleName(yourFirstAuthor) in str([convertLatexSpecialChars(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertLatexSpecialChars(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
        selfCiteCheck2a = [s for s in author if removeMiddleName(yourFirstAuthor) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertSpecialCharsToUTF8(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
        selfCiteCheck2b = [s for s in author if removeMiddleName(yourFirstAuthor) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), FA]).replace("'", "")]

        nameCount = 0
        if optionalEqualContributors != ['LastName, FirstName OptionalMiddleInitial', 'LastName, FirstName OptionalMiddleInitial']:
            for name in optionalEqualContributors:
                selfCiteCheck3 = [s for s in author if removeMiddleName(name) in str([convertLatexSpecialChars(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertLatexSpecialChars(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
                selfCiteCheck3a = [s for s in author if removeMiddleName(name) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertSpecialCharsToUTF8(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
                if len(selfCiteCheck3)>0:
                    nameCount += 1
                if len(selfCiteCheck3a)>0:
                    nameCount += 1
        selfCiteChecks = [selfCiteCheck1, selfCiteCheck1a, selfCiteCheck1b, selfCiteCheck2, selfCiteCheck2a, selfCiteCheck2b]
        if sum([len(check) for check in selfCiteChecks]) + nameCount > 0:
            selfCite = 'Y'
            if len(FA) < 2:
                print(str(counter) + ": " + key + "\t\t  <-- self-citation <--  ***NAME MISSING OR POSSIBLY INCOMPLETE***")
            else:
                print(str(counter) + ": " + key + "  <-- self-citation")
        else:
            selfCite= 'N'
            if len(FA) < 2:
                print(str(counter) + ": " + key + "\t\t  <--  ***NAME MISSING OR POSSIBLY INCOMPLETE***")
            else:
                print(str(counter) + ": " + key)
    else:
        selfCite = 'NA'
        
    with open(outPath, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        if selfCite=='N':
            writer.writerow([counter, convertSpecialCharsToUTF8(FA), convertSpecialCharsToUTF8(LA), title, selfCite, key])
    counter += 1

## Estimate gender of authors from cleaned bibliography

### Checkpoint for cleaned bibliography and using Gender API to estimate genders by first names
After registering for a [Gender API](https://gender-api.com/) account (a free account is available), use your 500 free monthly search credits by __pasting your API key into the line of code indicated below__:

```genderAPI_key <- '&key=YOUR ACCOUNT KEY HERE'```

[You can find your key in your account's profile page.](https://gender-api.com/en/account/overview#my-api-key)
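
For orientation, the classification rule applied to each API response later in this notebook can be sketched as follows. This is an illustration only: it assumes a response has already been parsed into a dictionary with the `gender` and `accuracy` fields that the notebook's query code reads, and the function names are hypothetical.

```python
def to_category(response, threshold=70):
    """Map one parsed response to 'W', 'M', or 'U' using the notebook's 70% cutoff."""
    if response['accuracy'] >= threshold:
        # As in the notebook, anything not labelled "female" above the cutoff becomes 'M'.
        return 'W' if response['gender'] == 'female' else 'M'
    return 'U'


def pair_category(first_author_response, last_author_response):
    """Combine first- and last-author labels into a pair such as 'WM' or 'MU'."""
    return to_category(first_author_response) + to_category(last_author_response)


# Made-up responses for illustration; no API request is sent here.
print(pair_category({'gender': 'female', 'accuracy': 98},
                    {'gender': 'male', 'accuracy': 95}))
print(pair_category({'gender': 'male', 'accuracy': 55},
                    {'gender': 'female', 'accuracy': 70}))
```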

__NOTE__: If any of your `cleanedBib.csv` entries are incomplete or contain first initials, the code will not continue to the stage that will use your limited free credits. 

Please manually edit the cleanedBib.csv by downloading the file, modifying it, and re-uploading it. Alternatively, you can edit the file directly within the Binder environment by clicking the `Edit` button, making modifications, and saving the file.

![edit button](img/manualEdit.png)

Common issues include: 

* Bibliography entry did not include a last author because the author list was truncated by "and Others" or "et al." 
* Some older journal articles only provide first initials rather than full first names, in which case you will need to search online (e.g. via Google) to identify that person. 
* In rare cases where the author cannot be identified even after searching by hand, replace the first name with "UNKNOWNNAME" so that the classifier will estimate the gender as unknown. 
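
The completeness check that gates the API step can also be reproduced in Python to preview which rows still need attention. The sketch below mirrors the rule used in the next code cell (a name is flagged if it has fewer than two characters or contains a period). It assumes the column names and the `|` quote character that this notebook uses when writing `cleanedBib.csv`; the function name `incomplete_rows` is hypothetical.

```python
import csv
import io


def incomplete_rows(csv_text):
    """Return (citation key, column) pairs whose first name is empty or looks like an initial."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=',', quotechar='|')
    flagged = []
    for row in reader:
        for column in ('FA', 'LA'):
            name = (row[column] or '').strip()
            if len(name) < 2 or '.' in name:
                flagged.append((row['CitationKey'], column))
    return flagged


# Made-up rows for illustration.
example = (
    "Article,FA,LA,Title,SelfCite,CitationKey\n"
    "1,Erin,Danielle,Some title,N,teich2021\n"
    "2,J,Mary,Another title,N,smith2020\n"
    "3,Ana,,Third title,N,garcia2019\n"
)
flagged = incomplete_rows(example)
print(flagged)
```

To run it on the real file, read the file contents into a string first, for example with `open('cleanedBib.csv').read()`.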

__NOTE__: your free account has 500 queries per month. This box contains the code that will use your limited API credits/queries if it runs without error. Re-running all code repeatedly will repeatedly use these credits.
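
One way to limit credit use, shown here only as a sketch, is to look up each distinct first name once and reuse the stored answer. The `lookup` argument below is a hypothetical stand-in for whatever function sends the real request; the example uses a fake lookup so that no credits are spent.

```python
def classify_names(names, lookup, cache=None):
    """Call lookup() once per distinct name; return results in input order plus the cache."""
    cache = {} if cache is None else cache
    for name in names:
        if name not in cache:
            cache[name] = lookup(name)
    return [cache[name] for name in names], cache


calls = []


def fake_lookup(name):
    # Stand-in for a real API request; records each call so usage can be counted.
    calls.append(name)
    return {'name': name, 'gender': 'unknown', 'accuracy': 0}


results, cache = classify_names(['Ana', 'Tom', 'Ana', 'Ana'], fake_lookup)
print(len(results), "results from", len(calls), "lookups")

# Passing the cache back in avoids repeating lookups on a re-run.
results, cache = classify_names(['Tom', 'Mary'], fake_lookup, cache)
print(len(calls), "lookups in total")
```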

Then, run the code blocks below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)

In [None]:
genderAPI_key <- '&key=YOUR ACCOUNT KEY HERE'

names=read.csv("/home/jovyan/cleanedBib.csv",stringsAsFactors=F)
setwd('/home/jovyan/')

require(rjson)
gendFA=NULL;gendLA=NULL
gendFA_conf=NULL;gendLA_conf=NULL

namesIncompleteFA=NULL
namesIncompleteLA=NULL
incompleteKeys=list()
incompleteRows=list()

for(i in 1:nrow(names)){
  if (nchar(names$FA[i])<2 || grepl("\\.", names$FA[i])){
    namesIncompleteFA[i] = i+1
    incompleteKeys = c(incompleteKeys, names$CitationKey[i])
    incompleteRows = c(incompleteRows, i+1)
  }
  namesIncompleteFA = namesIncompleteFA[!is.na(namesIncompleteFA)]
    
  if (nchar(names$LA[i])<2 || grepl("\\.", names$LA[i])){
    namesIncompleteLA[i] = i+1
    incompleteKeys = c(incompleteKeys, names$CitationKey[i])
    incompleteRows = c(incompleteRows, i+1)
  }
  namesIncompleteLA = namesIncompleteLA[!is.na(namesIncompleteLA)]
}

write.table(incompleteKeys[2:length(incompleteKeys)], "incompleteKeys.csv", sep=",",  col.names=FALSE)
write.table(incompleteRows[2:length(incompleteRows)], "incompleteRows.csv", sep=",",  col.names=FALSE)

if (length(namesIncompleteFA)>0 || length(namesIncompleteLA)>0){
    print(paste("STOP: Please revise incomplete full first names or empty cells in these rows: ", paste(unique(c(namesIncompleteFA, namesIncompleteLA)))))
    stop("Do not continue without revising the incomplete names in the cleanedBib.csv rows indicated above.")
}

In [None]:
if os.path.exists('incompleteRows.csv'):
    nameCount = 0
    df = pd.read_table('cleanedBib.csv', sep=',')
    df = pd.DataFrame(df)
    selfCite = []
    selfKey = []
    counter=0

    with open('incompleteRows.csv', newline='') as f:
        reader = csv.reader(f)
        incompleteRows = [int(y) for x in list(reader) for y in x]
        incompleteRows = incompleteRows[1:]
        
    with open('incompleteKeys.csv', newline='') as f:
        reader = csv.reader(f)
        incompleteKeys = [y for x in list(reader) for y in x]
        incompleteKeys = incompleteKeys[1:]

    for incompleteRow in incompleteRows:
        FA = df.iloc[incompleteRow-2]['FA']
        LA = df.iloc[incompleteRow-2]['LA']

        if (yourFirstAuthor!='LastName, FirstName OptionalMiddleInitial') and (yourLastAuthor!='LastName, FirstName OptionalMiddleInitial'):
            if FA in returnFirstName(removeMiddleName(yourFirstAuthor)) or FA in returnFirstName(removeMiddleName(convertSpecialCharsToUTF8(yourFirstAuthor))) or LA in returnFirstName(removeMiddleName(yourLastAuthor)) or LA in returnFirstName(removeMiddleName(convertSpecialCharsToUTF8(yourLastAuthor))):
                nameCount += 1
                selfCite.append(incompleteRow)
                selfKey.append(incompleteKeys[counter])
        if optionalEqualContributors != ['LastName, FirstName OptionalMiddleInitial', 'LastName, FirstName OptionalMiddleInitial']:
            for name in optionalEqualContributors:
                if FA in returnFirstName(removeMiddleName(name)) or FA in returnFirstName(removeMiddleName(convertSpecialCharsToUTF8(name))) or LA in returnFirstName(removeMiddleName(name)) or LA in returnFirstName(removeMiddleName(convertSpecialCharsToUTF8(name))):
                    nameCount += 1
                    selfCite.append(incompleteRow)
                    selfKey.append(incompleteKeys[counter])
        counter += 1
    if nameCount > 0:
        print("WARNING: Before continuing, please check and remove the manually modified citations that are self-citations for row(s): " + str(selfCite) + " for the citation key(s): " + str(selfKey))
    else:
        print("Please proceed to the next code block.")
else:
    print("Please proceed to the next code block.")
                    

In [None]:
for(i in 1:nrow(names)){
  ### get probabilistic genders for the ith article from GenderAPI
  tfa=names$FA[i]
  tla=names$LA[i]
  
  json_file_fa=paste0("https://gender-api.com/get?name=",tfa,
                      genderAPI_key)
  json_data_fa=fromJSON(file=json_file_fa)
  
  ### Only query the server once if the first/last authors are the same
  if(tla!=tfa){
    json_file_la=paste0("https://gender-api.com/get?name=",tla,
                        genderAPI_key)
    json_data_la=fromJSON(file=json_file_la)
  }else{
    json_data_la=json_data_fa
    json_file_la=json_file_fa
  }
  
  ### Locate and save gender probabilities from json query
  if(json_data_fa$accuracy>=70){
    ### If probability is above 70%, assigned "W" or "M" to author
    gendFA[i]=ifelse(json_data_fa$gender=="female","W","M")
    gendFA_conf[i]=json_data_fa$accuracy
  }else{
    ### If not, assign "U" for unknown, and potentially fill these in manually
    gendFA[i]="U"
    gendFA_conf[i]=json_data_fa$accuracy
  }
  ### Do the same for last authors
  if(json_data_la$accuracy>=70){
    gendLA[i]=ifelse(json_data_la$gender=="female","W","M")
    gendLA_conf[i]=json_data_la$accuracy
  }else{
    gendLA[i]="U"
    gendLA_conf[i]=json_data_la$accuracy
  }
  
  ### Take a quick break before sending the server another request
  Sys.sleep(sample(1:2,1))
  print(i)
}

### Add new columns to data.frame to save for later use
names$FA_bin=gendFA; names$FA_conf=gendFA_conf
names$LA_bin=gendLA; names$LA_conf=gendLA_conf


### Pull names that the query server wasn't sure about
unknownFAs=names$FA[names$FA_bin=="U"]
unknownLAs=names$LA[names$LA_bin=="U"]
unknownFAs; unknownLAs

### At this stage, you can manually enter the gender of any authors
### if you can find pronouns or other signifiers online

# e.g. names$FA_bin[names$FA=="Romy"]="W"


### Create column of gender categories (i.e., MM, WM, MW, WW)
### Built from the data.frame columns so that manual edits above are included
names$GendCat=paste0(names$FA_bin,names$LA_bin)

## Describe the proportions of genders in your reference list and compare it to published base rates in neuroscience.

The output will provide a frequency count for male-male, male-female, female-male, and female-female. Your reference proportions will be displayed in the 1st row in comparison to expected proportions in the field of neuroscience in the 2nd row (these row values do not include unknown authors and so do not add to 1). We print the percent difference relative to expected proportions for neuroscience. Positive values mean overcitation, whereas negative values mean undercitation. 

OPTIONALLY: Modify Authors.csv, re-upload your manually modified Authors.csv, uncomment 

```names<-read.csv('Authors.csv')```
```names$GendCat=paste0(names$FA_bin,names$LA_bin)```

and rerun the box below. At this stage, you can manually enter the gender of any authors if you can find pronouns or other signifiers online. 

This box does NOT contain code that will use your limited API credits/queries.

Then, run the code block below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)

### Additional info about the neuroscience benchmark
For the top 5 neuroscience journals (Nature Neuroscience, Neuron, Brain, Journal of Neuroscience, and Neuroimage), the expected gender proportions in reference lists as reported by [Dworkin et al.](https://www.biorxiv.org/content/10.1101/2020.01.03.894378v1.full.pdf) are 58.4% for male-male, 9.4% for male-female, 25.5% for female-male, and 6.7% for female-female, where the first label refers to the first author and the second to the last author. Expected proportions were calculated by randomly sampling papers from 28,505 articles in the 5 journals, estimating gender breakdowns using probabilistic name classification tools, and regressing for relevant article variables like publication date, journal, number of authors, review article or not, and first-/last-author seniority. See [Dworkin et al.](https://www.biorxiv.org/content/10.1101/2020.01.03.894378v1.full.pdf) for more details. 
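
The percent difference reported by the next cell is (observed - expected) / expected × 100, computed after dropping any pair that contains an unknown author. A small Python sketch of that arithmetic, using the benchmark proportions quoted above and a made-up list of categories (the function name `citation_gap` is hypothetical):

```python
# Expected proportions from the benchmark above (first author, last author).
EXPECTED = {'MM': 0.584, 'MW': 0.094, 'WM': 0.255, 'WW': 0.067}


def citation_gap(categories):
    """Percent over- or under-citation per category, ignoring pairs with an unknown ('U')."""
    known = [c for c in categories if 'U' not in c]
    if not known:
        raise ValueError("No reference has both genders classified.")
    gaps = {}
    for category, expected in EXPECTED.items():
        observed = known.count(category) / len(known)
        gaps[category] = round((observed - expected) * 100 / expected, 2)
    return gaps


# Made-up reference list: 10 classified pairs plus 2 with an unknown author.
example = ['MM'] * 5 + ['MW'] * 2 + ['WM'] * 2 + ['WW'] + ['UM', 'WU']
gaps = citation_gap(example)
print(gaps)
```

In this example half of the classified references are MM, which is below the 58.4% benchmark, so the MM gap is negative (undercitation relative to the benchmark).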

In [None]:
# load manually modified results (OPTIONAL)
#names<-read.csv('Authors.csv')
#names$GendCat=paste0(names$FA_bin,names$LA_bin)
##########################
# Tables and proportions #
##########################

#Get the overall counts and proportions for each category
table(names$GendCat)
round(table(names$GendCat)/sum(table(names$GendCat)),3)
tab1<- round(table(names$GendCat, exclude=c("MU", "UM", "UU", "WU", "UW"))*sum(table(names$GendCat))/
               sum(table(names$GendCat, exclude=c("MU", "UM", "UU", "WU", "UW"))),3)
tab1<- rbind(tab1, c(0.584*sum(table(names$GendCat)), 0.094*sum(table(names$GendCat)),
                     0.255*sum(table(names$GendCat)), 0.067*sum(table(names$GendCat))))

# Output table will show the observed (your) reference proportions in the first row
# The second row displays estimated expected proportions in neuroscience from:
# https://www.biorxiv.org/content/10.1101/2020.01.03.894378v1.full.pdf

# Get proportions without unknowns
checkProportions <- round(table(names$GendCat, exclude=c("MU", "UM", "UU", "WU", "UW"))/sum(table(names$GendCat, exclude=c("MU", "UM", "UU", "WU", "UW"))),3)

# Check gap between observed and expected
# Expected proportions in neuroscience were 58.4% for MM, 25.5% for WM, 9.4% for MW, and 6.7% for WW
checkProportions <- rbind(checkProportions, c(0.584, 0.094, 0.255, 0.067))
checkProportions
gap <- round((checkProportions[1,]-checkProportions[2,])*100/checkProportions[2,], 2)
gap

# Write
write.csv(names,"Authors.csv")

### (OPTIONAL) Color-code your .tex file using the estimated gender classifications

Running this code-block will optionally output your uploaded `.tex` file with color-coding for gender pair classifications. You can find the [example below's pre-print here.](https://www.biorxiv.org/content/10.1101/664250v1)

![Color-coded .tex file, Eli Cornblath](img/texColors.png)
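
The core of the color-coding step is a regular-expression substitution that wraps each cited key in `\textcolor`. The sketch below shows that idea on a single line of LaTeX, using the same colors as the cell that follows. It is a simplified illustration: the function name `color_citations` and the example keys are hypothetical, only plain `\cite{...}` commands are handled, and keys missing from the lookup fall back to black instead of raising an error.

```python
import re

COLORS = {'MM': 'red', 'MW': 'blue', 'WW': 'green', 'WM': 'magenta'}


def color_citations(line, key_to_category):
    """Wrap every key inside \\cite{...} in a \\textcolor matching its gender category."""
    def recolor(match):
        keys = [k.strip() for k in match.group(1).split(',')]
        parts = []
        for key in keys:
            # Unknown or unlisted categories are shown in black.
            color = COLORS.get(key_to_category.get(key, 'UU'), 'black')
            parts.append('\\textcolor{%s}{\\cite{%s}}' % (color, key))
        return '\\textsuperscript{,}'.join(parts)

    return re.sub(r'\\cite\{(.*?)\}', recolor, line)


categories = {'smith2020': 'MM', 'garcia2019': 'WW'}
print(color_citations('As shown before \\cite{smith2020, garcia2019}.', categories))
```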

In [None]:
cite_gender = pd.read_csv(homedir+'Authors.csv') # output of getReferenceGends.ipynb
cite_gender.index = cite_gender.CitationKey
cite_gender['Color'] = '' # what color to make each gender category
colors = {'MM':'red','MW':'blue','WW':'green','WM':'magenta','UU':'black',
'MU':'black','UM':'black','UW':'black','WU':'black'}
for idx in cite_gender.index: # loop through each citation key and set color
    cite_gender.loc[idx,'Color'] = colors[cite_gender.loc[idx,'GendCat']]
cite_gender.loc[cite_gender.index[cite_gender.SelfCite=='Y'],'Color'] = 'black' # make self citations black

# tex_file comes from glob and already includes homedir, so it is used as-is
with open(tex_file) as fin:
    texdoc=fin.readlines()
with open(tex_file[:-4]+'_gendercolor.tex','w') as fout:
    for i in range(len(texdoc)):
        s = texdoc[i]
        cite_instances = re.findall(r'\\cite\{.*?\}',s)
        cite_keys = re.findall(r'\\cite\{(.*?)\}',s)
        cite_keys = [x.split(',') for x in cite_keys]
        cite_keys_sub = [['\\textcolor{' + cite_gender.loc[x.strip(),'Color'] + '}{\\cite{'+x.strip()+'}}' for x in cite_instance] for cite_instance in cite_keys]
        cite_keys_sub = ['\\textsuperscript{,}'.join(x) for x in cite_keys_sub]
        for idx,cite_instance in enumerate(cite_instances):
            s = s.replace(cite_instances[idx],cite_keys_sub[idx])
        fout.write(s)
        # place color key after abstract
        if '\\section*{Introduction}\n' in s:            
            l = ['\\textcolor{' + colors[k] + '}{'+k+'}' for k in colors.keys()]
            fout.write('\tKey: '+ ', '.join(l)+'.\n')