## Find seminar speakers on bioRxiv
This tool extracts the first or last authors of bioRxiv preprints that match the keywords you enter. You can use it to find researchers outside your network to invite as speakers or to cite.

Enter the keywords you'd like to search for, and whether you are looking for trainees (first authors) or PIs (last authors), in cell 2 (labeled `In [2]` below). Run each cell (Shift+Enter). The notebook prints a list of authors ordered by number of relevant preprints and saves it to a CSV at the end.  
*Note: Be specific, as the API returns at most 250 results.*

If you'd like a list of predicted female or minority authors, follow the instructions for the last 6 code blocks. Each section prints the authors matching the prediction, again ordered by number of relevant preprints, and adds the prediction as a column in the output CSV.

**Important caveats for using the gender & ethnicity APIs**
1. The Gender-API predicts gender as a binary based on first name. Gender is not a binary and names do not equal identity.
2. The NamSor API predicts a single ethnicity based on full name. The groups it uses are overly broad (only 4 groups: Asian, Hispanic/Latinx, Black/Non-Latinx, White/Non-Latinx), it does not account for multiple ethnicities, and names do not equal ethnicity.
3. Remember that these APIs can only provide guesses to help start a search for more diverse speakers. They cannot tell you the gender or ethnicity of any author.

If you use this tool, please cite the original paper which created the Rxivist API:  
*Abdill RJ, Blekhman R. "Tracking the popularity and outcomes of all bioRxiv preprints." eLife (2019). doi: 10.7554/eLife.45133.*  
[Full API documentation](https://rxivist.org/docs)

In [20]:
#imports
import urllib.request
import json
import pandas as pd
import re

In the cell below, fill in the keywords you'd like to search for and whether you are searching for trainees or PIs.

For example:  
`keywords = 'sharp wave ripple'`  
`trainee = True`  
returns a list of first authors on papers that include the terms sharp, wave, and ripple in the title or abstract.

`keywords = 'GABA'`  
`trainee = False`  
returns a list of last authors on papers that include the term GABA in the title or abstract.

In [2]:
keywords = 'Put your keyword here'
trainee = False

In [5]:
#download all papers with the search term
#replace whitespace with %20 for the URL (raw string avoids an invalid-escape warning)
keywords = re.sub(r'\s','%20',keywords)
api_link = 'https://api.rxivist.org/v1/papers?q=' \
    + keywords + '&timeframe=alltime&metric=downloads&page_size=250'
with urllib.request.urlopen(api_link) as url:
    papers = json.loads(url.read().decode())
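
The cell above escapes only whitespace, so keywords containing characters such as `&` or `#` could still break the query string. A minimal sketch of an alternative, assuming the same endpoint and parameters as above: `build_search_url` is a hypothetical helper (not part of the notebook or the Rxivist API) that lets the standard library encode every parameter.

```python
from urllib.parse import urlencode, quote

def build_search_url(keywords, page_size=250):
    """Return an Rxivist paper-search URL for the keywords (hypothetical helper)."""
    query = urlencode(
        {
            "q": keywords,
            "timeframe": "alltime",
            "metric": "downloads",
            "page_size": page_size,
        },
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return "https://api.rxivist.org/v1/papers?" + query

print(build_search_url("sharp wave ripple"))
```

Unlike the cell above, this helper leaves `keywords` itself unchanged, so any later cell that expects `%20` in `keywords` would need adjusting.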

In [9]:
#build lists of first & last authors with preprint counts
authors_dict = dict()

for p in papers['results']:
    if trainee:
        if p['authors'][0]['id'] in authors_dict:
            authors_dict[p['authors'][0]['id']] += 1
        else:
            authors_dict[p['authors'][0]['id']] = 1
    else:
        if p['authors'][-1]['id'] in authors_dict:
            authors_dict[p['authors'][-1]['id']] += 1
        else:
            authors_dict[p['authors'][-1]['id']] = 1
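
The counting loop above can be written more compactly with `collections.Counter`. The sketch below assumes the response shape used in that cell (`papers['results']`, each with an `authors` list of dicts carrying an `id`); `count_authors` and the sample data are hypothetical and only for illustration.

```python
from collections import Counter

def count_authors(papers, trainee):
    """Count keyword-matching preprints per author id (hypothetical helper)."""
    position = 0 if trainee else -1  # first author for trainees, last author for PIs
    return Counter(
        p["authors"][position]["id"]
        for p in papers["results"]
        if p["authors"]  # skip records with an empty author list
    )

# Made-up sample in the shape the cell above expects
sample = {"results": [
    {"authors": [{"id": 1}, {"id": 9}]},
    {"authors": [{"id": 2}, {"id": 9}]},
    {"authors": [{"id": 1}]},
]}
print(count_authors(sample, trainee=True))
print(count_authors(sample, trainee=False))
```

A `Counter` is a `dict` subclass, so it could stand in for `authors_dict` in the later cells without other changes.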

In [17]:
#extract author information
#NB this section calls the API once for each author, so it takes a while
authors = list()
auth_list = authors_dict.keys()
    
for author_id in auth_list:
    api_link = "https://api.rxivist.org/v1/authors/"+str(author_id)
    with urllib.request.urlopen(api_link) as url:
        author_info = json.loads(url.read().decode())
    #use a placeholder when the author has no email on record
    if not author_info['emails']:
        author_info['emails'].append('No email listed')
    #one row per author; column order matches the data frame built in the next cell
    authors.append([author_info['name'],author_info['institution'],
                    author_info['emails'][-1],author_info['orcid'],
                    authors_dict[author_id],len(author_info['articles']),
                    author_info['ranks'][0]['downloads']])
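
Pulling the row-building step into a function makes it possible to check the column order without calling the API. This is a sketch only: `author_row` is a hypothetical helper, the sample record is made up, and the field names are the ones the cell above reads from the author response.

```python
def author_row(author_info, keyword_count):
    """Flatten one author record into the row layout used for the table (hypothetical helper)."""
    emails = author_info.get("emails") or ["No email listed"]
    ranks = author_info.get("ranks") or [{}]
    return [
        author_info.get("name"),
        author_info.get("institution"),
        emails[-1],                             # most recently listed email
        author_info.get("orcid"),
        keyword_count,                          # preprints matching the keywords
        len(author_info.get("articles", [])),   # total preprints
        ranks[0].get("downloads"),
    ]

# Made-up record for illustration
sample_author = {
    "name": "A. Example",
    "institution": "Example University",
    "emails": [],
    "orcid": None,
    "articles": [{}, {}],
    "ranks": [{"downloads": 120}],
}
print(author_row(sample_author, keyword_count=3))
```

Using `.get` means a missing field yields `None` instead of stopping the loop partway through the author list.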

In [56]:
#prints an author list sorted by # relevant preprints
auth_df = pd.DataFrame(authors, columns=['Name','Institution','Email','ORCID',
                                    'Keyword Preprints','Total Preprints','Downloads'])
auth_df = auth_df.sort_values('Keyword Preprints', ascending=False)
print(auth_df)

                     Name                                        Institution  \
107          Thad A. Polk   Department of Psychology, University of Michigan   
11        Etienne Sibille  Department of Pharmacology and Toxicology, Uni...   
40          Michel Loreau  Center for Biodiversity Theory and Modelling, ...   
5    Bernardo L. Sabatini  Harvard Medical School, Howard Hughes Medical ...   
6             Eric Gouaux                                               OHSU   
..                    ...                                                ...   
85      Suresh Jesuthasan                  Lee Kong Chian School of Medicine   
86      Henning Sprekeler  Modelling of Cognitive Processes, Berlin Insti...   
87              Assaf Tal                      Weizmann Institute of Science   
88         Veronica Egger  Neurophysiology, Institute of Zoology, Univers...   
230         Y. Albert Pan                                      Virginia Tech   

                                       

### APIs
Below are code blocks to use the gender & ethnicity APIs to make predictions about the author lists. Use with caution.

### Gender API
If you'd like a list of predicted female authors, run the following 3 blocks.

Register for a [Gender-API account](https://gender-api.com/) and add your API key in the code block below.

*Note: Free accounts are limited to 500 names per month.*

In [None]:
#input your key
genderAPI_key = 'YOUR ACCOUNT KEY HERE'

In [58]:
#get first names of all authors
first_names = list()
for name in auth_df['Name']:
    fn = re.findall(r'^\S+',name)
    fn = fn[0]
    #if the first token is an initial (under 3 characters), try the next name
    if len(fn)<3:
        fn = re.findall(r'\s\S+\s',name)
        if len(fn)>0:
            fn = re.sub(r'\s','',fn[0])
        else:
            fn = ''
    first_names.append(fn)
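
The same first-name rule can be expressed as a small function and tried on a few names before any API quota is spent. `extract_first_name` is a hypothetical helper that mirrors the logic above using `str.split` instead of regular expressions.

```python
def extract_first_name(name):
    """Return the first token, or the second token when the first looks like an initial."""
    tokens = name.split()
    if not tokens:
        return ""
    if len(tokens[0]) >= 3:
        return tokens[0]
    # The first token is short (for example "Y."); use the next token only if a
    # further token follows it, so the surname is never returned as a first name.
    return tokens[1] if len(tokens) >= 3 else ""

for example in ["Thad A. Polk", "Y. Albert Pan", "J. Smith"]:
    print(repr(extract_first_name(example)))
```

Two-letter given names such as "Li" are treated as initials by this rule, so those authors may be matched on a middle name or skipped.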

In [None]:
import requests

#build a data frame of only predicted female authors
headers={
   'X-RapidAPI-Host': 'gender-api.com',
   'X-RapidAPI-Key': genderAPI_key
 }

gender_list = list()

for name in first_names:
    #query API; params= URL-encodes the name
    gender = requests.get('https://gender-api.com/get',
                          params={'name': name}, headers=headers)
    #.get() records None if the response has no 'gender' field (e.g. an empty name)
    gender_list.append(gender.json().get('gender'))

#keep rows predicted female (DataFrame.append was removed in pandas 2.0)
female_auth_df = auth_df[[g=='female' for g in gender_list]]

auth_df['Gender'] = gender_list

print('Authors predicted to be female identifying')
print(female_auth_df)
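
Because free accounts are limited to 500 names per month, it may help to query each distinct first name only once and reuse the result for every author who shares it. A minimal sketch of that bookkeeping, with no network calls: both helpers are hypothetical, and `fake_predictions` stands in for real API responses.

```python
def names_to_query(first_names):
    """Return distinct, non-empty first names in first-seen order (hypothetical helper)."""
    return list(dict.fromkeys(name for name in first_names if name))

def expand_predictions(first_names, predictions):
    """Map per-name predictions back onto the full author list; None where none exists."""
    return [predictions.get(name) for name in first_names]

first_names_demo = ["Maria", "", "John", "Maria"]
to_query = names_to_query(first_names_demo)

# Stand-in for real API responses; these labels are made up for the demo
fake_predictions = {"Maria": "female", "John": "male"}

print(to_query)
print(expand_predictions(first_names_demo, fake_predictions))
```

The expanded list has one entry per author, in the same order as `first_names`, so it could be assigned to a data frame column directly.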

### Ethnicity API

If you'd like a list of predicted Black & Latinx authors, run the following 3 blocks. This uses the [Namsor-Client](https://pypi.org/project/namsor-client/) package.

Register for a [NamSor API account](https://v2.namsor.com/NamSorAPIv2/index.html) and add your API key in the code block below.

*Note: Free accounts are limited to 500 names per month.*

In [None]:
#input your key
namsor_KEY = "YOUR ACCOUNT KEY HERE"

In [74]:
!pip install namsor-client
from namsorclient import NamsorClient

In [80]:
client = NamsorClient(namsor_KEY)

# find female-identifying authors (same as GenderAPI)
# female_auth_df = pd.DataFrame(columns=['Name','Institution','Email',
#                                     'Keyword Preprints','Total Preprints','Downloads'])
# for i, name in enumerate(auth_df['Name']):
#     gender = client.genderFull(name)
#     if gender.likely_gender=='female':
#         female_auth_df = female_auth_df.append(auth_df.iloc[i])
        
# print('Authors predicted to be female identifying')
# print(female_auth_df)

# find minority authors
ethnicity_list = list()
for name in auth_df['Name']:

    #separate names into first and last
    fn = re.findall(r'^\S+',name)
    fn = fn[0]
    #if the first token is an initial (under 3 characters), try the next name
    if len(fn)<3:
        fn = re.findall(r'\s\S+\s',name)
        if len(fn)>0:
            fn = re.sub(r'\s','',fn[0])
        else:
            fn = ''

    ln = re.findall(r'\S+$',name)
    ln = ln[0]

    ethnicity = client.usRaceEthnicity(fn,ln)
    # options: ['HL', 'A', 'W_NL', 'B_NL']
    ethnicity_list.append(ethnicity.race_ethnicity)

#keep rows predicted Hispanic/Latinx or Black (DataFrame.append was removed in pandas 2.0)
minority_auth_df = auth_df[[e in ('HL','B_NL') for e in ethnicity_list]]

auth_df['Ethnicity'] = ethnicity_list

print('Authors predicted to be Black or Latinx')
print(minority_auth_df)


### Save to a CSV

In [None]:
csvfile = re.sub('%20','_',keywords) + '_bioRxiv_speaker_finder.csv'
auth_df.to_csv(csvfile)
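
Keywords containing characters other than spaces (slashes, quotes, colons) can produce awkward or invalid file names. A sketch of a more defensive file name builder, assuming ASCII keywords: `csv_filename` is a hypothetical helper, and the suffix is the one used in the cell above.

```python
import re
from urllib.parse import unquote

def csv_filename(keywords, suffix="_bioRxiv_speaker_finder.csv"):
    """Build a filesystem-friendly CSV name from the search keywords (hypothetical helper)."""
    text = unquote(keywords)  # undo %20-style escapes if the keywords were URL-encoded
    # Collapse every run of non-alphanumeric characters into one underscore
    stem = re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")
    return (stem or "search") + suffix

print(csv_filename("sharp%20wave%20ripple"))
print(csv_filename("GABA / glutamate"))
```

The result could be passed to `auth_df.to_csv(...)` in place of `csvfile`; non-ASCII letters are dropped by this pattern, which is a limitation of the sketch.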