# Clean Bibliography

The goal of this notebook is to clean your `.bib` file so that it contains only the references you have cited in your paper, with full first names for their authors. The full first names are then used to query the probabilistic gender classifier [Gender API](https://gender-api.com), and the full names are used to estimate race probabilistically with the [ethnicolr package](https://ethnicolr.readthedocs.io/).

The only required file you need is your manuscript's bibliography in `.bib` format. __Your `.bib` must only contain references cited in the manuscript__. Otherwise, the estimated proportions will be inaccurate.

If you intend to analyze the reference list of a published paper instead of your own manuscript in progress, search the paper on [Web of Knowledge](http://apps.webofknowledge.com/) (you will need institutional access). Next, [download the .bib file from Web of Science following these instructions, but start from Step 4 and on Step 6 select BibTeX instead of Plain Text](https://github.com/jdwor/gendercitation/blob/master/Step0_PullingWOSdata.pdf).

If you are not using LaTeX, collect and organize only the references you have cited in your manuscript using your reference manager of choice (e.g. Mendeley, Zotero, EndNote, ReadCube, etc.) and export that selected bibliography as a `.bib` file. __Please try to export your .bib in an output style that uses full first names (rather than only first initials) and full author lists (rather than abbreviated author lists with "et al.").__ If only first initials are included, our code will automatically retrieve about 70% of the full names using the article title or DOI.

   * [Export `.bib` from Mendeley](https://blog.mendeley.com/2011/10/25/howto-use-mendeley-to-create-citations-using-latex-and-bibtex/)
   * [Export `.bib` from Zotero](https://libguides.mit.edu/ld.php?content_id=34248570)
   * [Export `.bib` from EndNote](https://www.reed.edu/cis/help/LaTeX/EndNote.html). Note: Please export full first names by either [choosing an output style that does so by default (e.g. in MLA style)](https://canterbury.libguides.com/endnote/basics-output) or by [customizing an output style.](http://bibliotek.usn.no/cite-and-write/endnote/how-to-use/how-to-show-the-author-s-full-name-in-the-reference-list-article185897-28181.html)
   * [Export `.bib` from ReadCube Papers](https://support.papersapp.com/support/solutions/articles/30000024634-how-can-i-export-references-from-readcube-papers-)
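
As a rough pre-check before uploading, a sketch like the following can flag author names that carry only initials. It assumes names are written as `Last, First`, as in the examples in this notebook; `has_full_first_name` is a hypothetical helper and is not part of the notebook's `preprocessing` module.

```python
import re

def has_full_first_name(author):
    """Return True if a 'Last, First M.' style name has a spelled-out first name."""
    if ',' not in author:
        return False  # unexpected format; flag for manual review
    given = author.split(',', 1)[1].strip()
    # Split on periods, hyphens, and spaces so 'D.S.' and 'J.-P.' count as initials.
    pieces = [p for p in re.split(r'[.\s-]+', given) if p]
    if not pieces:
        return False
    return len(pieces[0]) > 1

authors = ['Teich, Erin G.', 'Bassett, D. S.', 'Dworkin, J']
needs_lookup = [a for a in authors if not has_full_first_name(a)]
print(needs_lookup)
```

Names flagged this way are the ones worth fixing by hand in your reference manager before exporting.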

If you are working in LaTeX, you can also upload an optional `.aux` file, which the notebook uses to check automatically that your `.bib` contains only entries that are cited in your manuscript.
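
To illustrate the idea, the sketch below collects the keys recorded in an `.aux` file and reports `.bib` keys that were never cited. It assumes a BibTeX-based workflow, where each citation is written to the `.aux` file as a `\citation{key1,key2}` line. The names `cited_keys` and `unused_keys` and the example keys are invented for illustration; this is not the notebook's own `find_unused_cites`.

```python
import re

def cited_keys(aux_text):
    """Collect every key recorded in \\citation{...} lines of an .aux file."""
    keys = set()
    for group in re.findall(r'\\citation\{([^}]*)\}', aux_text):
        keys.update(k.strip() for k in group.split(',') if k.strip())
    return keys

def unused_keys(bib_keys, aux_text):
    """Return the .bib keys that never appear in the .aux file, sorted."""
    return sorted(set(bib_keys) - cited_keys(aux_text))

# Hypothetical .aux content and .bib keys.
aux_example = "\\relax\n\\citation{teich2020,dworkin2020}\n\\citation{teich2020}\n"
print(unused_keys(['teich2020', 'dworkin2020', 'old2001'], aux_example))
```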

| Input                 | Output                                                                                                                        |
|-----------------------|-------------------------------------------------------------------------------------------------------------------------------|
| `.bib` file(s) **(REQUIRED)**    | `cleanedBib.csv`: table of author first names, titles, and .bib keys                                                            |
| `.aux` file (OPTIONAL)| `predictions.csv`: table of author first names, estimated gender classification, and confidence                                   |
| `.tex` file (OPTIONAL) | `race_gender_citations.pdf`: heat map of your citations broken down by probabilistic gender and race estimations |
|                       | `yourTexFile_gendercolor.tex`: your `.tex` file modified to compile .pdf with in-line citations colored-coded by gender pairs |

## 1. Import functions

Upload your `.bib` file(s) and, _optionally_, the `.aux` file generated by compiling your LaTeX manuscript and your `.tex` file.

![upload button](img/upload.png)

![confirm upload button](img/confirmUpload.png)

Then, run the code block below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)

In [None]:
import glob
from habanero import Crossref
import sys
import os
from pathlib import Path
wd = Path(os.getcwd())
sys.path.insert(1, f'{wd.parent.parent.absolute()}/utils')
from preprocessing import *
from ethnicolr import pred_fl_reg_name
import tensorflow as tf
import seaborn as sns

cr = Crossref()
homedir = '/home/jovyan/'
bib_files = glob.glob(homedir + '*.bib')
paper_aux_file = glob.glob(homedir + '*.aux')
paper_bib_file = 'library_paper.bib'
try:
    tex_file = glob.glob(homedir + "*.tex")[0]
except IndexError:
    tex_file = None  # lets later cells detect that no .tex file was uploaded
    print('No optional .tex file found.')

## 2. Define the _first_ and _last_ author of your paper.

For example: 
```
yourFirstAuthor = 'Teich, Erin G.'
yourLastAuthor = 'Bassett, Danielle S.'
```

And optionally, define any co-first or co-last author(s), making sure to keep the square brackets to define a list.

For example:
```
optionalEqualContributors = ['Dworkin, Jordan', 'Stiso, Jennifer']
```

or 

```
optionalEqualContributors = ['Dworkin, Jordan']
```
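
A small sanity check on the `LastName, FirstName` format can catch typos before the names are passed to `get_names`. This is an illustrative sketch only; `split_author` is a hypothetical helper, not a function from the notebook.

```python
def split_author(name):
    """Split 'LastName, FirstName M.' into (last, first-and-middle)."""
    last, sep, given = name.partition(',')
    if not sep or not last.strip() or not given.strip():
        raise ValueError(f"Expected 'LastName, FirstName', got: {name!r}")
    return last.strip(), given.strip()

print(split_author('Bassett, Danielle S.'))
```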

If you are analyzing a published paper's reference list from Web of Science, set the variable `checkingPublishedArticle` to `True`:
```
checkingPublishedArticle = True
```

Then, run the code block below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)

In [None]:
yourFirstAuthor = 'LastName, FirstName OptionalMiddleInitial'
yourLastAuthor = 'LastName, FirstName OptionalMiddleInitial'
optionalEqualContributors = ['LastName, FirstName OptionalMiddleInitial', 'LastName, FirstName OptionalMiddleInitial']
checkingPublishedArticle = False

if paper_aux_file:
    find_unused_cites(paper_aux_file)

bib_data = get_bib_data(bib_files[0])
if checkingPublishedArticle:
    get_names_published(homedir, bib_data, cr)
else:
    # find and print duplicates
    bib_data = get_duplicates(bib_data, bib_files[0])
    # get names, remove CDS, find self cites
    get_names(homedir, bib_data, yourFirstAuthor, yourLastAuthor, optionalEqualContributors, cr)

In [None]:
bib_check(homedir)

## 3. Estimate gender and race of authors from cleaned bibliography

### Checkpoint for cleaned bibliography and using Gender API to estimate genders and race by names
After registering for a [Gender API](https://gender-api.com/) account (a free account is available), use your 500 free monthly search credits by __pasting your API key into the line indicated below__ (replace only YOUR ACCOUNT KEY HERE):

```genderAPI_key = '&key=YOUR ACCOUNT KEY HERE'```

[You can find your key in your account's profile page.](https://gender-api.com/en/account/overview#my-api-key)

__NOTE__: Please edit your .bib file using information printed by the code and provided in cleanedBib.csv. Edit directly within the Binder environment by clicking the .bib file (as shown below), making modifications, and saving the file (as shown below).

![open button](img/openBib.png)

![save button](img/saveBib.png)

Common issues include:

* Bibliography entry did not include a last author because the author list was truncated by "and Others" or "et al."
* Some older journal articles provide only first initials rather than full first names, in which case you will need to search online (e.g. via Google) to identify the author.
* In rare cases where the author cannot be identified even after searching by hand, replace the first name with "UNKNOWNNAMES" so that the classifier will estimate the gender as unknown.

__NOTE__: your free account has 500 queries per month. This box contains the code that will use your limited API credits/queries if it runs without error. Re-running all code repeatedly will repeatedly use these credits.
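
The credit estimate printed by the cell below is based on the number of distinct names, so storing results is what saves credits. The sketch below shows that bookkeeping under stated assumptions: a plain list of first names and a hypothetical `lookup` function standing in for the real API call. No request is made and no credits are spent.

```python
def classify_with_cache(names, lookup, cache=None):
    """Query `lookup` once per distinct name and reuse stored results."""
    cache = {} if cache is None else cache
    results = []
    for name in names:
        if name not in cache:
            cache[name] = lookup(name)  # one lookup per new name
        results.append(cache[name])
    return results, cache

calls = []

def fake_lookup(name):
    # Stand-in for the API request; records each call instead of spending credits.
    calls.append(name)
    return 'unknown'

first_names = ['Erin', 'Danielle', 'Erin', 'Jordan', 'Danielle']
results, cache = classify_with_cache(first_names, fake_lookup)
print(len(calls), 'lookups for', len(first_names), 'names')
```

Passing the returned `cache` back in on a later call would avoid repeating lookups for names that were already classified.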

Then, run the code blocks below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)

In [None]:
genderAPI_key = '&key=YOUR ACCOUNT KEY HERE'

# TODO: Remove in the PR that gets rid of argparse. 
# The following saves the api key to a txt file just to be reloaded by the next cell
with open("genderAPIkey.txt", 'w') as f:
    f.write(genderAPI_key)

# Check your credit balance
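# genderAPI_key already starts with '&key=', so strip the '&' to avoid sending 'key=' twice
url = "https://gender-api.com/get-stats?" + genderAPI_key.lstrip('&')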
response = urlopen(url)
decoded = response.read().decode('utf-8')
decoded_json = json.loads(decoded)
print('Remaining credits: %s'%decoded_json["remaining_requests"])
print('This should use (at most) %d credits, '%len(np.unique(authors_full_list)) + \
      'saving you approx %d'%(len(authors_full_list)-len(np.unique(authors_full_list))) + \
      ' credits if results are stored.')

## 4. Describe the proportions of genders in your reference list and compare them to published base rates in neuroscience.

__NOTE__: your free GenderAPI account has 500 queries per month. This box contains the code that will use your limited API credits/queries if it runs without error. Re-running all code repeatedly will repeatedly use these credits.

Run the code blocks below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)

In [None]:
from ethnicolr import pred_fl_reg_name
# Reload the API key saved by the previous cell, closing the file afterwards
with open("genderAPIkey.txt", "r") as f:
    genderAPI_key = f.readline().replace('\n', '')

import tensorflow as tf
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

mm, wm, mw, ww, WW, aw, wa, aa, citation_matrix = get_pred_demos((yourFirstAuthor+' '+yourLastAuthor).replace(',',''), homedir, bib_data, genderAPI_key)
statement, statementLatex = print_statements(mm, wm, mw, ww, WW, aw, wa, aa)

## 5. Print the Diversity Statement and visualize your results

The example template can be copied and pasted into your manuscript. We have included it in our methods or references section. If you are using LaTeX, [the bibliography file can be found here](https://github.com/dalejn/cleanBib/blob/master/diversityStatement/).

### Additional info about the neuroscience benchmark
For the top 5 neuroscience journals (Nature Neuroscience, Neuron, Brain, Journal of Neuroscience, and Neuroimage), the expected gender proportions in reference lists as reported by [Dworkin et al.](https://www.biorxiv.org/content/10.1101/2020.01.03.894378v1.full.pdf) are 58.4% for man/man, 9.4% for man/woman, 25.5% for woman/man, and 6.7% for woman/woman. Expected proportions were calculated by randomly sampling papers from 28,505 articles in the 5 journals, estimating gender breakdowns using probabilistic name classification tools, and regressing for relevant article variables like publication date, journal, number of authors, review article or not, and first-/last-author seniority. See [Dworkin et al.](https://www.biorxiv.org/content/10.1101/2020.01.03.894378v1.full.pdf) for more details. 

Using a similar random draw model regressing for relevant variables, the expected race proportions in reference lists as reported by Bertolero et al. were 51.8% for white/white, 12.8% for white/author-of-color, 23.5% for author-of-color/white, and 11.9% for author-of-color/author-of-color. 
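
As a worked example of how a reference list can be compared with these benchmarks, the sketch below computes the relative over- or under-citation, (observed - expected) / expected x 100, for each gender category. The expected values are the Dworkin et al. percentages quoted above; the observed counts are invented purely for illustration, and `relative_difference` is a hypothetical helper name.

```python
# Expected percentages from Dworkin et al., as quoted above
# (MM = man/man, MW = man/woman, WM = woman/man, WW = woman/woman).
expected = {'MM': 58.4, 'MW': 9.4, 'WM': 25.5, 'WW': 6.7}

# Hypothetical counts of citations per category in a reference list.
observed_counts = {'MM': 30, 'MW': 5, 'WM': 10, 'WW': 5}

def relative_difference(obs_pct, exp_pct):
    """Percent over-citation (positive) or under-citation (negative) vs. the benchmark."""
    return 100 * (obs_pct - exp_pct) / exp_pct

total = sum(observed_counts.values())
for category, exp_pct in expected.items():
    obs_pct = 100 * observed_counts[category] / total
    rel_diff = relative_difference(obs_pct, exp_pct)
    print(f'{category}: observed {obs_pct:.1f}%, expected {exp_pct:.1f}%, '
          f'relative difference {rel_diff:+.1f}%')
```

A positive value means the category is cited more often than the benchmark predicts, and a negative value means it is cited less often.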

This box does NOT contain code that will use your limited API credits/queries.

Run the code block below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)

In [None]:
print('Plain text template:')
print(statement)
print('\n')
print('LaTeX template:')
print(statementLatex)

cmap = sns.diverging_palette(220, 10, as_cmap=True)
names = ['white_m','api_m','hispanic_m','black_m','white_w','api_w','hispanic_w','black_w']
plt.close()
sns.set(style='white')
fig, axes = plt.subplots(ncols=2,nrows=1,figsize=(7.5,4))
axes = axes.flatten()
plt.sca(axes[0])
heat = sns.heatmap(np.around((citation_matrix/citation_matrix.sum())*100,2),annot=True,ax=axes[0],annot_kws={"size": 8},cmap=cmap,vmax=1,vmin=0)
axes[0].set_ylabel('first author',labelpad=0)  
heat.set_yticklabels(names,rotation=0)
axes[0].set_xlabel('last author',labelpad=1)  
heat.set_xticklabels(names,rotation=90) 
heat.set_title('percentage of citations')  

citation_matrix_sum = citation_matrix / np.sum(citation_matrix) 

expected = np.load(os.path.join(homedir, 'data', 'expected_matrix_florida.npy'))
expected = expected/np.sum(expected)

percent_overunder = np.ceil( ((citation_matrix_sum - expected) / expected)*100)
plt.sca(axes[1])
heat = sns.heatmap(np.around(percent_overunder,2),annot=True,ax=axes[1],fmt='g',annot_kws={"size": 8},vmax=50,vmin=-50,cmap=cmap)
axes[1].set_ylabel('',labelpad=0)  
heat.set_yticklabels('')
axes[1].set_xlabel('last author',labelpad=1)  
heat.set_xticklabels(names,rotation=90) 
heat.set_title('percentage over/under-citations')
plt.tight_layout()

plt.savefig('/home/jovyan/race_gender_citations.pdf')


paper_df.to_csv('/home/jovyan/predictions.csv')

In [None]:
# Plot a histogram #
names <- read.csv('/home/jovyan/predictions.csv', header=T)
total_citations <- nrow(na.omit(names))/2
names$GendCat <- gsub("female", "W", names$GendCat, fixed=T)
names$GendCat <- gsub("male", "M", names$GendCat, fixed=T)
names$GendCat <- gsub("unknown", "U", names$GendCat, fixed=T)
gend_cats <- unique(names$GendCat)  # get a vector of all the gender categories in your paper

# Create an empty data frame that will be used to plot the histogram. This will have the gender category (e.g., WW, MM) in the first column and the percentage (e.g., number of WW citations divided by total number of citations * 100) in the second column #
dat_for_plot <- data.frame(gender_category = NA,
                           number = NA,
                           percentage = NA)


### Loop through each gender category from your paper, calculate the citation percentage of each gender category, and save the gender category and its citation percentage in dat_for_plot data frame ###
if (length(names$GendCat) != 1) {
  
  for (i in 1:length(gend_cats)){
    
    # Create an empty temporary data frame that will be bound to the dat_for_plot data frame
    temp_df <- data.frame(gender_category = NA,
                          number = NA,
                          percentage = NA)
    
    # Get the gender category, the number of citations with that category, and calculate the percentage of citations with that category
    gend_cat <- gend_cats[i]
    number_gend_cat <- length(names$GendCat[names$GendCat == gend_cat])
    perc_gend_cat <- (number_gend_cat / total_citations) * 100
    
    # Bind this information to the original data frame
    temp_df$gender_category <- gend_cat
    temp_df$number <- number_gend_cat
    temp_df$percentage <- perc_gend_cat
    dat_for_plot <- rbind(dat_for_plot, temp_df)
    
  }
  
}


# Create a data frame with only the WW, MW, WM, MM categories and their base rates - to plot percent citations relative to benchmarks
dat_for_baserate_plot <- subset(dat_for_plot, gender_category == 'WW' | gender_category == 'MW' | gender_category == 'WM' | gender_category == 'MM')
baserate <- c(6.7, 9.4, 25.5, 58.4)
dat_for_baserate_plot$baserate <- baserate[order(c(which(dat_for_baserate_plot$gender_category == 'WW'), which(dat_for_baserate_plot$gender_category == 'MW'), which(dat_for_baserate_plot$gender_category == 'WM'), which(dat_for_baserate_plot$gender_category == 'MM')))]
dat_for_baserate_plot$citation_rel_to_baserate <- dat_for_baserate_plot$percentage - dat_for_baserate_plot$baserate


# Plot the Histogram of Number of Papers per category against predicted gender category #

library(ggplot2)

dat_for_plot = dat_for_plot[-1:-2,]

dat_for_plot$gender_category <- factor(dat_for_plot$gender_category, levels = dat_for_plot$gender_category)
ggplot(dat_for_plot, aes(x = gender_category, y = number, fill = gender_category)) +
  geom_bar(stat = 'identity', width = 0.75, na.rm = TRUE, show.legend = TRUE) + 
  scale_x_discrete(limits = c('WW', 'MW', 'WM', 'MM', 'UW', 'UM', 'WU', 'MU', 'UU')) +
  geom_text(aes(label = number), vjust = -0.3, color = 'black', size = 2.5) +
  theme(legend.position = 'right') + theme_minimal() +
  xlab('Predicted gender category') + ylab('Number of papers') + ggtitle("") + theme_classic(base_size=15)


# Plot the Histogram of % citations relative to benchmarks against predicted gender category
ggplot(dat_for_baserate_plot, aes(x = gender_category, y = citation_rel_to_baserate, fill = gender_category)) +
  geom_bar(stat = 'identity', width = 0.75, na.rm = TRUE, show.legend = TRUE) +
  scale_x_discrete(limits = c('WW', 'MW', 'WM', 'MM')) +
  geom_text(aes(label = round(citation_rel_to_baserate, digits = 2)), vjust = -0.3, color = 'black', size = 2.5) +
  theme(legend.position = 'right') + theme_minimal() +
  xlab('Predicted gender category') + ylab('% of citations relative to benchmarks') + ggtitle("") + theme_classic(base_size=15)

### (OPTIONAL) Color-code your .tex file using the estimated gender classifications

Running this code block outputs a copy of your uploaded `.tex` file with citations color-coded by gender pair classification. The pre-print used for the example below [can be found here](https://www.biorxiv.org/content/10.1101/664250v1).

![Color-coded .tex file, Eli Cornblath](img/texColors.png)
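
The core of the color-coding step is a regular-expression substitution that wraps each cited key in `\textcolor`. The sketch below shows that idea on a single line, assuming the manuscript loads a package that provides `\textcolor` (for example `xcolor`) and using an invented key-to-color mapping. `color_cites` is a hypothetical helper; unlike the notebook cell, it joins keys with a plain comma and falls back to black for unknown keys.

```python
import re

def color_cites(line, key_to_color, default='black'):
    """Wrap every key inside \\cite{...} in \\textcolor{<color>}{\\cite{key}}."""
    def recolor(match):
        keys = [k.strip() for k in match.group(1).split(',')]
        return ', '.join(
            '\\textcolor{%s}{\\cite{%s}}' % (key_to_color.get(k, default), k)
            for k in keys
        )
    return re.sub(r'\\cite\{(.*?)\}', recolor, line)

colors_by_key = {'teich2020': 'green', 'dworkin2020': 'red'}  # hypothetical keys
print(color_cites('See \\cite{teich2020, dworkin2020}.', colors_by_key))
```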

In [None]:
cite_gender = pd.read_csv(homedir+'Authors.csv') # output of getReferenceGends.ipynb
cite_gender.index = cite_gender.CitationKey
cite_gender['Color'] = '' # what color to make each gender category
colors = {'MM':'red','MW':'blue','WW':'green','WM':'magenta','UU':'black',
'MU':'black','UM':'black','UW':'black','WU':'black'}
for idx in cite_gender.index: # loop through each citation key and set color
    cite_gender.loc[idx,'Color'] = colors[cite_gender.loc[idx,'GendCat']]
cite_gender.loc[cite_gender.index[cite_gender.SelfCite=='Y'],'Color'] = 'black' # make self citations black

with open(tex_file) as fin:  # tex_file from glob already includes homedir
    texdoc = fin.readlines()
with open(tex_file[:-4]+'_gendercolor.tex','w') as fout:
    for i in range(len(texdoc)):
        s = texdoc[i]
        cite_instances = re.findall(r'\\cite\{.*?\}',s)
        cite_keys = re.findall(r'\\cite\{(.*?)\}',s)
        cite_keys = [x.split(',') for x in cite_keys]
        cite_keys_sub = [['\\textcolor{' + cite_gender.loc[x.strip(),'Color'] + '}{\\cite{'+x.strip()+'}}' for x in cite_instance] for cite_instance in cite_keys]
        cite_keys_sub = ['\\textsuperscript{,}'.join(x) for x in cite_keys_sub]
        for idx,cite_instance in enumerate(cite_instances):
            s = s.replace(cite_instances[idx],cite_keys_sub[idx])
        fout.write(s)
        # place color key after abstract
        if '\\section*{Introduction}\n' in s:            
            l = ['\\textcolor{' + colors[k] + '}{'+k+'}' for k in colors.keys()]
            fout.write('\tKey: '+ ', '.join(l)+'.\n')