# Clean Bibliography

The goal of this notebook is to clean your `.bib` file so that it contains only references that you have cited in your paper. This cleaned `.bib` will then be used to generate a data table of names that will be used to query the probabilistic gender classifier, [Gender API](https://gender-api.com). 

The only required file you need is your manuscript's bibliography in `.bib` format. __Your `.bib` must only contain references cited in the manuscript__. Otherwise, the estimated gender proportions will be inaccurate. 

If you are not using LaTeX, collect and organize only the references you have cited in your manuscript using your reference manager of choice (e.g. Mendeley, Zotero, EndNote, ReadCube, etc.) and export that selected bibliography as a `.bib` file. For those working in LaTeX, we can use an optional `.aux` file to automatically filter your `.bib` so that it contains only entries that are cited in your manuscript.
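As a rough illustration of why the `.aux` file is enough to filter a bibliography: when a manuscript is compiled with BibTeX, the `.aux` file records each cited key in `\citation{...}` lines. The notebook itself delegates this step to `checkcites.lua`; the sketch below is only a simplified stand-in, and the function name `cited_keys_from_aux`, the sample `.aux` text, and the key `Unused2020` are invented for the example.

```python
import re

def cited_keys_from_aux(aux_text):
    """Collect the citation keys recorded in an .aux file's citation commands."""
    keys = set()
    for match in re.finditer(r'\\citation\{([^}]*)\}', aux_text):
        for key in match.group(1).split(','):
            key = key.strip()
            if key:
                keys.add(key)
    return keys

# Invented sample .aux content, for illustration only
sample_aux = (
    "\\relax\n"
    "\\citation{Buzsaki2004,Cole2017}\n"
    "\\citation{Cole2017a}\n"
    "\\bibdata{refs}\n"
)

cited = cited_keys_from_aux(sample_aux)
library_keys = ['Buzsaki2004', 'Cole2017', 'Cole2017a', 'Unused2020']
unused = [k for k in library_keys if k not in cited]

print(sorted(cited))
print(unused)
```

Keys are compared as whole strings, so `Cole2017` and `Cole2017a` are treated as different entries.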

| Input                | Output                                                                                                                                     |
|----------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
| **.bib file(s)**     | cleanedBib.csv: table of author first names, titles, and .bib keys                                                                         |
| .aux file (OPTIONAL) | Authors.csv: table of author first names, estimated gender classification, and confidence                                                  |
| .tex file (OPTIONAL) | yourTexFile_gendercolor.tex: your .tex file modified to compile with references color-coded by gender pair, plus a color legend (OPTIONAL) |

## Import libraries, set paths, check settings

### Upload your `.bib` file(s) and optionally an `.aux` file generated from compiling your LaTeX manuscript and your `.tex` file

![upload button](img/upload.png)

![confirm upload button](img/confirmUpload.png)

In [123]:
import numpy as np
import bibtexparser
from bibtexparser.bparser import BibTexParser
import glob
import subprocess
import os
from pybtex.database.input import bibtex
import csv
from pylatexenc.latex2text import LatexNodes2Text 
import unicodedata
import re
import pandas as pd
from habanero import Crossref


def checkcites_output(aux_file):
    # take in aux file for tex document, return list of citation keys
    # that are in .bib file but not in document

    result = subprocess.run(['texlua', 'checkcites.lua', aux_file[0]], stdout=subprocess.PIPE)
    result = result.stdout.decode('utf-8')
    unused_array_raw = result.split('\n')
    # process array of unused references + other output 
    unused_array_final = list()
    for x in unused_array_raw:
        if len(x) > 0: # if line is not empty
            if x[0] == '-':  # and if first character is a '-', it's a citation key
                unused_array_final.append(x[2:]) # truncate '- '            
    if "------------------------------------------------------------------------" in unused_array_final:
        return(result)
    else:
        return(unused_array_final)


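def removeMiddleName(line):
    # expects 'LastName, FirstName OptionalMiddle' and drops the final token
    # (the middle name or initial) when one is present
    arr = line.split()
    last = arr.pop()
    n = len(arr)
    if n == 4:
        first, middle = ' '.join(arr[:2]), ' '.join(arr[2:])
    elif n == 3:
        first, middle = arr[0], ' '.join(arr[1:])
    elif n == 2:
        first, middle = arr
    else:
        # no middle name, or an unexpected number of tokens: return the name unchanged
        return line
    return(str(first + ' ' + middle))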


def convertLatexSpecialChars(latex_text):
    return LatexNodes2Text().latex_to_text(latex_text)


def convertSpecialCharsToUTF8(text):
    data = LatexNodes2Text().latex_to_text(text)
    return unicodedata.normalize('NFD', data).encode('ascii', 'ignore').decode('utf-8')

def namesFromXref(doi, title, authorPos):
    if authorPos == 'first':
        idx = 0
    elif authorPos == 'last':
        idx = -1
    # get cross ref data
    authors = ['']
    # first try DOI
    if doi != "":
        works = cr.works(query = title, select = ["DOI","author"], cursor_max=1, filter = {'doi': doi})
        if works['message']['total-results'] > 0:
            authors = works['message']['items'][0]['author']
    elif title != '': 
        works = cr.works(query = f'title:"{title}"', select = ["title","author"], cursor_max=10)
        # check that you grabbed the proper paper: keep the authors of the first
        # returned item whose title matches exactly (case-insensitive)
        for item in works['message']['items']:
            if (item.get('title') or [''])[0].lower() == title.lower():
                authors = item.get('author', authors)
                break

    # check that all fields are available
    if not authors or 'given' not in authors[idx]:
        name = ''
    else:
        # trim initials
        name = authors[idx]['given'].replace('.',' ').split()[0]

    return name

cr = Crossref()
homedir = 'C:\\Users\\jenis\\Documents\\cleanBib\\'
bib_files = glob.glob(homedir + '*.bib')
paper_aux_file = glob.glob(homedir + '*.aux')
paper_bib_file = 'ecog_methods.bib'
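try:
    tex_file = glob.glob(homedir + "*.tex")[0]
except IndexError:
    # glob returned an empty list; the optional color-coding step needs a .tex file
    tex_file = None
    print('No .tex file found.')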
        

No .tex file found.
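The helper `convertSpecialCharsToUTF8` above reduces accented names to plain ASCII so that, for example, an accented and an unaccented spelling of the same surname can be compared. The sketch below isolates just the Unicode step using only the standard library; it assumes the LaTeX markup has already been converted to Unicode text (the notebook uses `pylatexenc` for that), and the function name `strip_accents` is invented for this example.

```python
import unicodedata

def strip_accents(text):
    """Decompose accented characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with errors='ignore' discards every non-ASCII code point,
    # including the combining accents produced by the decomposition.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('Buzsáki'))
print(strip_accents('Müller'))
```

One caveat of this approach: characters that do not decompose into an ASCII base letter plus a combining mark are removed outright rather than transliterated, so names containing such characters may come out shortened.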


### Define the _first_ and _last_ author of your paper.

For example: 
```
yourFirstAuthor = 'Teich, Erin G.'
yourLastAuthor = 'Bassett, Danielle S.'
```

And optionally, define any co-first or co-last author(s), making sure to keep the square brackets to define a list.

For example:
```
optionalEqualContributors = ['Dworkin, Jordan', 'Stiso, Jennifer']
```

or 

```
optionalEqualContributors = ['Dworkin, Jordan']
```
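The exact spelling of these names matters because, when an `.aux` file is supplied, the cell below tests whether each citing author's string occurs verbatim inside the `author` field of every entry. A minimal sketch of that substring test follows; the function name `is_self_citation`, the keys `KeyA` to `KeyC`, and the co-author names are invented for illustration.

```python
def is_self_citation(author_field, citing_authors):
    """True if any citing author's name appears verbatim in the BibTeX author field."""
    return any(name in author_field for name in citing_authors)

citing_authors = ['Teich, Erin G.', 'Bassett, Danielle S.']

# Invented author fields, written the way BibTeX joins authors with ' and '
entries = {
    'KeyA': 'Teich, Erin G. and Doe, Jane',
    'KeyB': 'Doe, Jane and Roe, Richard',
    'KeyC': 'Bassett, D. S. and Doe, Jane',
}

flags = {key: is_self_citation(authors, citing_authors) for key, authors in entries.items()}
print(flags)
```

`KeyC` is not flagged even though it is plausibly a self-citation, because the entry abbreviates the first name. This is one reason to write the names here the same way they appear in your `.bib` file.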

In [132]:
yourFirstAuthor = 'Stiso, Jennifer'
yourLastAuthor = 'Bassett, Danielle S'
optionalEqualContributors = ['Stiso, Jennifer']

if (yourFirstAuthor == 'LastName, FirstName OptionalMiddleInitial') or (yourLastAuthor == 'LastName, FirstName OptionalMiddleInitial'):
    raise ValueError("Please enter your manuscript's first and last author names")

if paper_aux_file:
    if optionalEqualContributors == ('LastName, FirstName OptionalMiddleInitial', 'LastName, FirstName OptionalMiddleInitial'):
        citing_authors = np.array([yourFirstAuthor, yourLastAuthor])
    else:
        # unpack the list so that every citing author is its own string element
        citing_authors = np.array([yourFirstAuthor, yourLastAuthor, *optionalEqualContributors])
    unused_in_paper = checkcites_output(paper_aux_file) # get citations in library not used in paper
    print(unused_in_paper)
    print("Unused citations: ", unused_in_paper.count('=>'))
    
    parser = BibTexParser()
    parser.ignore_nonstandard_types = False
    parser.common_strings = True
    
    bib_data = None
    for bib_file in bib_files:
        with open(bib_file) as bibtex_file:
            if bib_data is None:
                bib_data = bibtexparser.bparser.BibTexParser(common_strings=True, ignore_nonstandard_types=False).parse_file(bibtex_file)
                # bib_data = bibtexparser.load(bibtex_file, parser)
            else:
                bib_data_extra = bibtexparser.bparser.BibTexParser(common_strings=True, ignore_nonstandard_types=False).parse_file(bibtex_file)
                # bib_data_extra = bibtexparser.load(bibtex_file, parser)
                bib_data.entries_dict.update(bib_data_extra.entries_dict)
                bib_data.entries.extend(bib_data_extra.entries)
    
    all_library_citations = list(bib_data.entries_dict.keys())
    print("All citations: ", len(all_library_citations))
    
    for k in all_library_citations:
        if k in unused_in_paper:
            del bib_data.entries_dict[k] # remove from entries dictionary if not in paper
    
    #in_paper_mask = [x not in unused_in_paper for x in all_library_citations] # get mask of citations in paper
    in_paper_mask = [bib_data.entries[x]['ID'] not in unused_in_paper for x in range(len(bib_data.entries))]
    bib_data.entries = [bib_data.entries[x] for x in np.where(in_paper_mask)[0]] # replace entries list with entries only in paper
    del bib_data.comments
    
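    duplicates = []
    for key in bib_data.entries_dict.keys():
        # count exact ID matches; a substring count would flag 'Cole2017' because of 'Cole2017a'
        count = sum(entry['ID'] == key for entry in bib_data.entries)
        if count > 1:
            duplicates.append(key)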
            
    if len(duplicates) > 0:
        raise ValueError("In your .bib file, please remove duplicate entries or duplicate entry ID keys for:", ' '.join(map(str, duplicates)))

    if os.path.exists(paper_bib_file):
        os.remove(paper_bib_file)
    
    with open(paper_bib_file, 'w') as bibtex_file:
        bibtexparser.dump(bib_data, bibtex_file)
    
    # define first author and last author names of citing paper -- will exclude citations of these authors
    # beware of latex symbols within author names
    # in_paper_citations = list(bib_data.entries_dict.keys())
    in_paper_citations = [bib_data.entries[x]['ID'] for x in range(len(bib_data.entries))] # get list of citation keys in paper
    
    # extract author list for every cited paper
    cited_authors = [bib_data.entries_dict[x]['author'] for x in in_paper_citations]
    # find citing authors in cited author list
    # using nested list comprehension, make a citing author -by- citation array of inclusion
    self_cite_mask = np.array([[citing_author in authors for authors in cited_authors] for citing_author in citing_authors])
    self_cite_mask = np.any(self_cite_mask,axis=0) # collapse across citing authors such that any coauthorship by either citing author -> exclusion
    
    print("Self-citations: ", [bib_data.entries[x]['ID'] for x in np.where(self_cite_mask)[0]]) # print self citations
    for idx,k in enumerate(in_paper_citations):
        if self_cite_mask[idx]:
            del bib_data.entries_dict[k] # delete citation from dictionary if self-citation
    bib_data.entries = [bib_data.entries[x] for x in np.where(np.invert(self_cite_mask))[0]] # replace entries list with entries that aren't self citations
    
    paper_bib_file_excl_sc = os.path.splitext(paper_bib_file)[0] + '_noselfcite.bib'
    
    if os.path.exists(paper_bib_file_excl_sc):
        os.remove(paper_bib_file_excl_sc)
    
    with open(paper_bib_file_excl_sc, 'w') as bibtex_file:
        bibtexparser.dump(bib_data, bibtex_file)
        
noselfcite_bibs = glob.glob(homedir + '*_noselfcite.bib') # os.path.exists() does not expand wildcards
if noselfcite_bibs:
    ID = noselfcite_bibs
else:
    ID = glob.glob(homedir + '*.bib')
    with open(ID[0]) as bibtex_file:
        bib_data = bibtexparser.bparser.BibTexParser(common_strings=True, ignore_nonstandard_types=False).parse_file(bibtex_file)
    duplicates = []
    for key in bib_data.entries_dict.keys():
        count = str(bib_data.entries).count("'ID\': \'"+ key + "\'")
        if count > 1:
            duplicates.append(key)
            
    if len(duplicates) > 0:
        raise ValueError("In your .bib file, please remove duplicate entries or duplicate entry ID keys for:", ' '.join(map(str, duplicates)))

FA = []
LA = []
parser = bibtex.Parser()
bib_data = parser.parse_file(ID[0])
counter = 1
nameCount = 0
outPath = homedir + 'cleanedBib.csv'

if os.path.exists(outPath):
    os.remove(outPath)

with open(outPath, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['Article', 'FA', 'LA', 'Title', 'SelfCite', 'CitationKey'])

for key in bib_data.entries.keys():
    try:
        author = bib_data.entries[key].persons['author']
    except KeyError:
        # entries without an author field fall back to the editor list
        author = bib_data.entries[key].persons['editor']
    FA = author[0].rich_first_names
    LA = author[-1].rich_first_names
    FA = convertLatexSpecialChars(str(FA)[7:-3]).replace("', Protected('","").replace("'), '", "")
    LA = convertLatexSpecialChars(str(LA)[7:-3]).replace("', Protected('","").replace("'), '", "")


    # check that we got a name (not an initial) from the bib file, if not try using the title in the crossref API
    try:
        title = bib_data.entries[key].fields['title'].replace(',', '').replace('{','').replace('}','')
    except KeyError:
        title = ''
    try:
        doi =  bib_data.entries[key].fields['doi']
    except KeyError:
        doi = ''
    if FA == '' or len(FA.split('.')[0]) <= 1:
        FA = namesFromXref(doi, title, 'first')
    if LA == '' or len(LA.split('.')[0]) <= 1:
        LA = namesFromXref(doi, title, 'last')


    if (yourFirstAuthor!='LastName, FirstName OptionalMiddleInitial') and (yourLastAuthor!='LastName, FirstName OptionalMiddleInitial'):
        selfCiteCheck1 = [s for s in author if removeMiddleName(yourLastAuthor) in str([convertLatexSpecialChars(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertLatexSpecialChars(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
        selfCiteCheck1a = [s for s in author if removeMiddleName(yourLastAuthor) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertSpecialCharsToUTF8(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
        
        selfCiteCheck2 = [s for s in author if removeMiddleName(yourFirstAuthor) in str([convertLatexSpecialChars(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertLatexSpecialChars(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
        selfCiteCheck2a = [s for s in author if removeMiddleName(yourFirstAuthor) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertSpecialCharsToUTF8(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
        nameCount = 0
        if optionalEqualContributors != ('LastName, FirstName OptionalMiddleInitial', 'LastName, FirstName OptionalMiddleInitial'):
            for name in optionalEqualContributors:
                selfCiteCheck3 = [s for s in author if removeMiddleName(name) in str([convertLatexSpecialChars(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertLatexSpecialChars(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
                selfCiteCheck3a = [s for s in author if removeMiddleName(name) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace("', Protected('","").replace("'), '", ""), convertSpecialCharsToUTF8(str(s.rich_first_names)[7:-3]).replace("', Protected('","").replace("'), '", "")]).replace("'", "")]
                if len(selfCiteCheck3)>0:
                    nameCount += 1
                if len(selfCiteCheck3a)>0:
                    nameCount += 1
        selfCiteChecks = [selfCiteCheck1, selfCiteCheck1a, selfCiteCheck2, selfCiteCheck2a]
        if sum([len(check) for check in selfCiteChecks]) + nameCount > 0:
            selfCite = 'Y'
            if len(FA) < 2:
                print(str(counter) + ": " + key + "\t\t  <-- self-citation <--  ***NAME MISSING OR POSSIBLY INCOMPLETE***")
            else:
                print(str(counter) + ": " + key + "  <-- self-citation")
        else:
            selfCite= 'N'
            if len(FA) < 2:
                print(str(counter) + ": " + key + "\t\t  <--  ***NAME MISSING OR POSSIBLY INCOMPLETE***")
            else:
                print(str(counter) + ": " + key)
    else:
        selfCite = 'NA'

    with open(outPath, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        if selfCite=='N':
            writer.writerow([counter, FA, LA, title, selfCite, key])
    counter += 1


1: Tian
2: Dabney2019
3: Kucyi
4: Zhou  <-- self-citation
5: Lachaux2012
6: Vidaurre2018
7: DeCheveigne2019
8: Crone2011
9: Goyal2018
10: Kramer2011
11: Chu2012
12: Cole2018
13: Voytek2012
14: Cole2017a
15: Das2017
16: Lachaux2002
17: Lebedev2016
18: Catanese2016
19: Tooley2018  <-- self-citation
20: Haller2018
21: VanderMeij2018
22: Pesaran2018
23: Akam2014
24: Roach2018a
25: Fox2018
26: Horn2017
27: Staba2002
28: Buzsaki2004
29: Onojima2018
30: Rossini2017
31: Schnitzler2005
32: Buzsaki2012
33: Holdgraf2017
34: Causality2017
35: Peterson2017
36: Peterson
37: DuprelaTour2017
38: Matsui2017
39: Bahramisharif2017
40: Miller		  <--  ***NAME MISSING OR POSSIBLY INCOMPLETE***
41: Atasoy
42: Schalk2017
43: Schwalb
44: Wiener2017
45: Centre2017
46: Grossman2017
47: Cole2017
48: Bruns2004
49: ChandranKS2016		  <--  ***NAME MISSING OR POSSIBLY INCOMPLETE***
50: Miller2009
51: Gerber2016
52: Grashow2009
53: Mercier2016
54: Klausberger2008		  <--  ***NAME MISSING OR POSSIBLY INCOMPLETE***
55: Co
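Several keys in the list above are prefixes of other keys (for example `Cole2017` and `Cole2017a`, or `Miller` and `Miller2009`), so any duplicate check has to compare whole keys rather than search for substrings. A minimal sketch of an exact-match check with `collections.Counter` follows; the function name `find_duplicate_keys` is invented, and the repeated `Cole2017` in the sample list is added only to show a positive result.

```python
from collections import Counter

def find_duplicate_keys(entry_ids):
    """Return the keys that occur more than once, comparing whole keys only."""
    counts = Counter(entry_ids)
    return [key for key, n in counts.items() if n > 1]

# Keys taken from the output above, plus one deliberately repeated key
ids = ['Cole2017', 'Cole2017a', 'Miller', 'Miller2009', 'Cole2017']

print(find_duplicate_keys(ids))
```

Only `Cole2017` is reported; `Cole2017a` and `Miller2009` are not mistaken for duplicates of their shorter neighbors.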

## Estimate gender of authors from cleaned bibliography

### Checkpoint for cleaned bibliography and using Gender API to estimate genders by first names
After registering for a [Gender API](https://gender-api.com/) account (a free account is available), use your 500 free monthly search credits by pasting your API key into the line of code indicated below:

```
genderAPI_key <- '&key=YOUR ACCOUNT KEY HERE'
```

[You can find your key in your account's profile page.](https://gender-api.com/en/account/overview#my-api-key)

If any of your cleanedBib.csv entries are incomplete or contain first initials, the code will not continue to the stage that uses your limited free credits. Please manually edit cleanedBib.csv by downloading the file, modifying it, and re-uploading it. A common issue is a bibliography entry that lacks a last author because the author list was truncated by "and Others" or "et al." Some older journal articles provide only first initials rather than full first names, in which case you will need to search online to identify the person. In rare cases where the author cannot be identified even after searching by hand, replace the first name with "UNKNOWNNAME" so that the classifier will estimate the gender as unknown. 
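Before spending any credits, it can help to list which rows still need manual attention. The sketch below is one way to do that with the standard `csv` module, assuming the column names and the `|` quote character used when cleanedBib.csv is written earlier in this notebook. It applies the same "empty or single-letter before the first period" test as the cleaning cell; the function name `incomplete_rows` and the sample rows are invented.

```python
import csv
import io

def incomplete_rows(csv_text):
    """Return (CitationKey, column) pairs whose first name is empty or only an initial."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=',', quotechar='|')
    flagged = []
    for row in reader:
        for column in ('FA', 'LA'):
            name = row[column].strip()
            # 'K.' -> 'K', '' -> '': one character or fewer counts as incomplete
            if len(name.split('.')[0]) <= 1:
                flagged.append((row['CitationKey'], column))
    return flagged

# Invented sample in the cleanedBib.csv layout
sample = (
    "Article,FA,LA,Title,SelfCite,CitationKey\n"
    "1,Gyorgy,Andreas,Example title one,N,KeyA\n"
    "2,K.,,Example title two,N,KeyB\n"
)

print(incomplete_rows(sample))
```

To run it on the real file, read the file's text first, for example with `open(outPath, newline='').read()`.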

__NOTE__: your free account has 500 queries per month. This box contains the code that will use your limited API credits/queries if it runs without error. Re-running all code repeatedly will repeatedly use these credits.

### Describe the proportions of genders in your reference list and compare it to published base rates in neuroscience.

The output will provide a frequency count for male-male, male-female, female-male, and female-female. Your reference proportions will be displayed next to expected proportions in the field of neuroscience. We print the proportion difference relative to expected proportions for neuroscience.
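To make the comparison concrete, here is a small sketch of how such proportions and relative differences could be computed. It rests on several assumptions: the category labels `MM`, `MW`, `WM`, `WW` are borrowed from the color-coding cell later in this notebook, pairs with an unknown gender are left out of the denominator, and the relative difference is taken as (observed - expected) / expected. The expected values are placeholders chosen for easy arithmetic, not the published neuroscience base rates, and the citation list is invented.

```python
from collections import Counter

LABELS = ('MM', 'MW', 'WM', 'WW')

def category_proportions(categories, labels=LABELS):
    """Proportion of each gender-pair category among the classified citations."""
    counts = Counter(c for c in categories if c in labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no classified citations to summarize")
    return {label: counts[label] / total for label in labels}

def relative_difference(observed, expected):
    """(observed - expected) / expected for each category."""
    return {k: (observed[k] - expected[k]) / expected[k] for k in expected}

# Invented classifications; 'UU' (unknown) is ignored in the proportions
cats = ['MM', 'MM', 'MM', 'MM', 'MW', 'MW', 'WM', 'WW', 'UU']
observed = category_proportions(cats)

# PLACEHOLDER expected proportions, not real base rates
expected = {'MM': 0.25, 'MW': 0.25, 'WM': 0.25, 'WW': 0.25}
diff = relative_difference(observed, expected)

print(observed)
print(diff)
```

With these placeholder numbers, a value of `1.0` for `MM` would mean that category is cited at twice its expected rate, and `-0.5` would mean half the expected rate.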

OPTIONALLY: Modify Authors.csv, re-upload your manually modified Authors.csv, uncomment #names<-read.csv('Authors.csv'), and rerun the second box/section. This box does NOT contain code that will use your limited API credits/queries.

### (OPTIONAL) Color-code your .tex file using the estimated gender classifications

Running this code-block will optionally output your uploaded `.tex` file with color-coding for gender pair classifications.

In [None]:
cite_gender = pd.read_csv(homedir+'Authors.csv') # output of getReferenceGends.ipynb
cite_gender.index = cite_gender.CitationKey
cite_gender['Color'] = '' # what color to make each gender category
colors = {'MM':'red','MW':'blue','WW':'green','WM':'magenta','UU':'black',
'MU':'black','UM':'black','UW':'black','WU':'black'}
for idx in cite_gender.index: # loop through each citation key and set color
    cite_gender.loc[idx,'Color'] = colors[cite_gender.loc[idx,'GendCat']]
cite_gender.loc[cite_gender.index[cite_gender.SelfCite=='Y'],'Color'] = 'black' # make self citations black

# tex_file comes from glob and already includes homedir, so it is used as-is
with open(tex_file) as fin:
    texdoc=fin.readlines()
with open(tex_file[:-4]+'_gendercolor.tex','w') as fout:
    for i in range(len(texdoc)):
        s = texdoc[i]
        # raw strings avoid invalid-escape warnings for \{ and \}
        cite_instances = re.findall(r'\\cite\{.*?\}',s)
        cite_keys = re.findall(r'\\cite\{(.*?)\}',s)
        cite_keys = [x.split(',') for x in cite_keys]
        cite_keys_sub = [['\\textcolor{' + cite_gender.loc[x.strip(),'Color'] + '}{\\cite{'+x.strip()+'}}' for x in cite_instance] for cite_instance in cite_keys]
        cite_keys_sub = ['\\textsuperscript{,}'.join(x) for x in cite_keys_sub]
        for idx,cite_instance in enumerate(cite_instances):
            s = s.replace(cite_instances[idx],cite_keys_sub[idx])
        fout.write(s)
        # place color key after abstract
        if '\\section*{Introduction}\n' in s:            
            l = ['\\textcolor{' + colors[k] + '}{'+k+'}' for k in colors.keys()]
            fout.write('\tKey: '+ ', '.join(l)+'.\n')
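The rewriting step in the cell above can be tried on a single line without any files. The sketch below wraps each key of a `\cite{...}` command in `\textcolor`, joining multiple keys with `\textsuperscript{,}` as the cell does. It is a simplified variant rather than the notebook's exact logic: it uses `re.sub` with a replacement function, and it falls back to a default color for keys missing from the lookup, whereas the cell above would raise a `KeyError`. The function name `color_citations`, the keys, and the sample sentence are invented.

```python
import re

def color_citations(line, key_colors, default='black'):
    """Wrap every key of each \\cite{...} in \\textcolor using key_colors."""
    def recolor(match):
        keys = [k.strip() for k in match.group(1).split(',')]
        parts = ['\\textcolor{%s}{\\cite{%s}}' % (key_colors.get(k, default), k)
                 for k in keys]
        return '\\textsuperscript{,}'.join(parts)
    # Only plain \cite{...} is matched, as in the cell above
    return re.sub(r'\\cite\{(.*?)\}', recolor, line)

key_colors = {'KeyA': 'red', 'KeyB': 'green'}  # invented keys
line = 'Prior work \\cite{KeyA, KeyB} showed this.'
colored = color_citations(line, key_colors)

print(colored)
```

The compiled document also needs a package that provides `\textcolor`, such as `xcolor`, in the preamble.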