## Find seminar speakers on bioRxiv
This tool extracts the first and last authors of bioRxiv preprints or PubMed manuscripts relevant to a subject area you enter. You can use it to find researchers outside your network to invite as speakers or to cite.

This notebook consists of 6 parts, all of which are optional. Fill in the variables as specified & run each cell (Shift+Enter or Play button). Each section prints a list of authors sorted by number of relevant manuscripts. More detailed instructions are provided in a markdown cell before each part.

**Parts**
1. Search bioRxiv. User inputs: keyword, trainee selection
2. Search Pubmed. User inputs: same as above, plus email
3. Fill in the blanks using ORCID. User inputs: API client ID & key
4. Gender API. User inputs: API key
5. Ethnicity API. User inputs: API key
6. Print to CSV

**Important caveats for using the gender & ethnicity APIs**
1. The Gender-API predicts gender as a binary based on first name. Gender is not a binary and names do not equal identity.
2. The NamSor API predicts a single ethnicity based on full name. The groups it uses are overly broad (only 4 groups: Asian, Hispanic/Latinx, Black/Non-Latinx, White/Non-Latinx), it does not account for multiple ethnicities, and names do not equal ethnicity.
3. Remember that these APIs can only provide guesses to help start a search for more diverse speakers. They cannot tell you the gender or ethnicity of any author.

If you use this tool, please cite the original paper which created the Rxivist API:  
*Abdill RJ, Blekhman R. "Tracking the popularity and outcomes of all bioRxiv preprints." eLife (2019). doi: 10.7554/eLife.45133.*  
[Full Rxivist API documentation](https://rxivist.org/docs)

In [None]:
#imports
!pip install biopython
from Bio import Entrez
import urllib.request
import requests
import json
import pandas as pd
import re

In the cell below, fill in the keywords you'd like to search for and whether you are searching for trainees (first authors) or PIs (last authors). Be specific, as the Rxivist API used to search bioRxiv returns at most 250 results per request.

For example:  
`keywords = 'sharp wave ripple'
trainee = True`  
returns a list of first authors on papers that include the terms sharp, wave, and ripple in the title or abstract.

`keywords = 'GABA'
trainee = False`  
returns a list of last authors on papers that include the term GABA in the title or abstract.
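
As a minimal sketch of how a keyword string ends up in the Rxivist request used in Part 1, the snippet below assembles the search URL with standard percent-encoding. The helper name `build_rxivist_url` is hypothetical and not part of the notebook; the endpoint and query parameters are the ones the notebook itself uses.

```python
from urllib.parse import quote

# Endpoint and query parameters as used in Part 1 of this notebook
RXIVIST_PAPERS = 'https://api.rxivist.org/v1/papers'

def build_rxivist_url(keywords, page_size=250):
    """Hypothetical helper: assemble the paper-search URL for a keyword string."""
    # quote() percent-encodes spaces and other characters that are unsafe in URLs
    return (RXIVIST_PAPERS + '?q=' + quote(keywords)
            + '&timeframe=alltime&metric=downloads&page_size=' + str(page_size))

print(build_rxivist_url('sharp wave ripple'))
```

Using `quote` rather than replacing only whitespace also covers keywords that contain characters such as `&` or `#`.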

In [None]:
keywords = 'Put your keyword here'
trainee = False

### Part 1: BioRxiv

In [None]:
#download all papers with the search term
#replace whitespace with %20 so the keywords are valid in a URL
keywords_url = re.sub(r'\s', '%20', keywords)
api_link = 'https://api.rxivist.org/v1/papers?q=' \
    + keywords_url + '&timeframe=alltime&metric=downloads&page_size=250'
with urllib.request.urlopen(api_link) as url:
    papers = json.loads(url.read().decode())

In [None]:
#count keyword preprints per first (trainee) or last (PI) author
authors_dict = dict()
#first author if searching for trainees, otherwise last author
position = 0 if trainee else -1

for p in papers['results']:
    author_id = p['authors'][position]['id']
    authors_dict[author_id] = authors_dict.get(author_id, 0) + 1

In [None]:
#extract author information
#NB this section calls the API once for each author, so it takes a while
authors = list()

for author_id in authors_dict:
    api_link = "https://api.rxivist.org/v1/authors/" + str(author_id)
    with urllib.request.urlopen(api_link) as url:
        author_info = json.loads(url.read().decode())
    #use a placeholder when no email is listed
    if not author_info['emails']:
        author_info['emails'].append('No email listed')
    #Rxivist returns one full-name string; splitting on the last space
    #is a heuristic to fill the separate last/first name columns
    first_name, _, last_name = author_info['name'].rpartition(' ')
    authors.append([last_name, first_name, author_info['institution'],
                    author_info['emails'][-1], author_info['orcid'] or '',
                    authors_dict[author_id], 'bioRxiv', len(author_info['articles']),
                    author_info['ranks'][0]['downloads']])

In [None]:
#build data frame
auth_df = pd.DataFrame(authors, columns=['Last Name','First Name','Institution',
                                         'Email','ORCID','Keyword Manuscripts',
                                         'Source','Total Preprints','Preprint Downloads'])

#print an author list sorted by # relevant preprints
auth_df = auth_df.sort_values('Keyword Manuscripts', ascending=False)
print(auth_df)

### Part 2: Pubmed
Expand your search to include articles published from 2016 onward (the start year is set in the query below). Like the Rxivist API, this does not need a key, but you must supply your email so that PubMed can contact you if your requests become excessive.

Code adapted from [this blog post](https://marcobonzanini.com/2015/01/12/searching-pubmed-with-python/) using the [Entrez API from BioPython](https://biopython.org/DIST/docs/api/Bio.Entrez-module.html).
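
The search term sent to PubMed combines the keywords with a publication-date range. A minimal sketch of that string construction follows; the helper name `build_pubmed_query` and its `start_year` parameter are hypothetical, introduced only to show how the start year could be changed.

```python
def build_pubmed_query(keywords, start_year=2016):
    """Hypothetical helper: PubMed query for keywords in the title or abstract."""
    # "3000" serves as an open-ended upper bound, as in the notebook's own query
    date_range = '("{}"[Date - Publication] : "3000"[Date - Publication])'.format(start_year)
    return '({}[Title/Abstract]) AND ({})'.format(keywords, date_range)

print(build_pubmed_query('sharp wave ripple'))
print(build_pubmed_query('GABA', start_year=2020))
```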

In [None]:
email = 'Put your email here'

In [None]:
#build an API request for up to 500 papers
Entrez.email = email
query = '('+keywords+'[Title/Abstract]) AND (("2016"[Date - Publication] : "3000"[Date - Publication]))'
handle = Entrez.esearch(db='pubmed', sort='relevance', 
                        retmax='500', retmode='xml', term=query)
results = Entrez.read(handle)
id_list = ','.join(results['IdList'])

#fetch information relevant to each of the queried papers
handle = Entrez.efetch(db='pubmed', retmode='xml', id=id_list)
papers = Entrez.read(handle)

In [None]:
#extract author information
authors = list()

for paper in papers['PubmedArticle']:
    author_list = paper['MedlineCitation']['Article']['AuthorList']

    #pick the first or last author based on user entry
    author_info = author_list[0] if trainee else author_list[-1]

    #extract institution, ORCID, & email if available
    #(auth_email avoids overwriting the email variable set above)
    affiliation = ''
    auth_email = ''
    if author_info['AffiliationInfo']:
        affiliation = author_info['AffiliationInfo'][0]['Affiliation']
        found = re.findall(r'\S+@\S+', affiliation)
        if found:
            #drop the period that usually ends the affiliation string
            auth_email = found[0].rstrip('.')
    auth_orcid = ''
    if author_info['Identifier']:
        auth_orcid = str(author_info['Identifier'][0])
        if 'orcid.org' not in auth_orcid:
            auth_orcid = 'https://orcid.org/' + auth_orcid

    #add to list; the preprint-only columns are left empty (NaN)
    #.get() covers group authors that have no LastName/ForeName
    authors.append([author_info.get('LastName', ''), author_info.get('ForeName', ''),
                    affiliation, auth_email, auth_orcid, 1, 'Pubmed',
                    float('nan'), float('nan')])

In [None]:
#build data frame
auth_df2 = pd.DataFrame(authors, columns=['Last Name','First Name','Institution',
                                         'Email','ORCID','Keyword Manuscripts',
                                         'Source','Total Preprints','Preprint Downloads'])

#if authors have >1 keyword paper, add that to their counts
#find duplicates
dups = auth_df2.pivot_table(index=['First Name', 'Last Name'], aggfunc='size')
#deduplicate
auth_df2 = auth_df2[~auth_df2.duplicated(subset=['Last Name','First Name'])]
for name in dups[dups>1].axes[0]:
    mult_paper_auth = (auth_df2['First Name']==name[0]) & (auth_df2['Last Name']==name[1])
    auth_df2.loc[mult_paper_auth,'Keyword Manuscripts'] = dups[name]

#print an author list sorted by # relevant manuscripts
auth_df2 = auth_df2.sort_values('Keyword Manuscripts', ascending=False)
print(auth_df2)

In [None]:
#combine bioRxiv & Pubmed lists
#(pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
auth_df = pd.concat([auth_df, auth_df2], ignore_index=True)

### Part 3: Fill in the blanks using ORCID
Get an [ORCID API client ID & secret](https://support.orcid.org/hc/en-us/articles/360006897174) and add them in the code block below.

You can learn more about how to [search for an ORCID](https://members.orcid.org/api/tutorial/search-orcid-registry) and [find info about an author given their ORCID](https://members.orcid.org/api/tutorial/read-orcid-records) in the API documentation.
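
Two string-handling steps recur in this part: building a name search and pulling the bare iD out of an ORCID URI. The sketch below illustrates both under stated assumptions. The helper names `orcid_search_url` and `bare_orcid` are hypothetical; the search URL follows the pattern used later in this notebook, and the example iD is a placeholder value.

```python
import re
from urllib.parse import quote

def orcid_search_url(given, family):
    """Hypothetical helper: public-API search URL for one author name."""
    # quote() percent-encodes spaces in multi-part names
    return ('https://pub.orcid.org/v3.0/search/?q='
            + 'family-name:' + quote(family)
            + '+AND+given-names:' + quote(given) + '&rows=1')

def bare_orcid(uri):
    """Return the bare iD from an ORCID URI, or '' if none is found."""
    # four hyphen-separated blocks of four; the final check character may be X
    match = re.search(r'\d{4}-\d{4}-\d{4}-\d{3}[\dX]', uri)
    return match.group(0) if match else ''

print(orcid_search_url('Mary Ann', 'Smith'))
print(bare_orcid('https://orcid.org/0000-0002-1825-0097'))
```

A name-only search can match the wrong person, so treat the first returned iD as a candidate to verify rather than a confirmed identity.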

In [None]:
#input your client ID & key
ORCIDAPI_ID = 'YOUR ACCOUNT ID HERE'
ORCIDAPI_key = 'YOUR ACCOUNT KEY HERE'

In [None]:
# build a request for a token
payload = {'client_id': ORCIDAPI_ID,
                   'client_secret': ORCIDAPI_key,
                   'scope': '/read-public',
                   'grant_type': 'client_credentials'
                   }
url = 'https://orcid.org/oauth/token'
headers = {'Accept': 'application/json'}
response = requests.post(url, data=payload, headers=headers, timeout=None)
response.raise_for_status()
token = response.json()['access_token']

# set up headers for searches
# the token is sent as a standard OAuth bearer token
headers = {'Accept': 'application/vnd.orcid+json',
           'Authorization': 'Bearer ' + token}

In [None]:
# fill in missing ORCIDs
#NB this section calls the API once for each author, so it takes a while
for index, row in auth_df.iterrows():
    #only search for authors without an ORCID (empty string or None)
    if not row['ORCID']:
        given = re.sub(' ','%20',row['First Name'])
        family = re.sub(' ','%20',row['Last Name'])
#         affil = re.sub(' ','%20',row['Institution'])
        
        #build search
        url = "https://pub.orcid.org/v3.0/search/?q=" \
            + "family-name:" + family + "+AND+given-names:" + given \
            + "&rows=1"
#             + "+AND+affiliation-org-name:" + affil + "&rows=1"
        auth_id = requests.get(url, headers=headers, timeout=None)
        
        #take the first returned ORCID, if any
        result = auth_id.json().get('result')
        if result:
            auth_df.loc[index, 'ORCID'] = result[0]['orcid-identifier']['uri']

In [None]:
# get email & total works from ORCID
for index, row in auth_df.iterrows():
    # if we have an ORCID, first remove the url portion
    if row['ORCID']:
        orcid = re.findall(r'\d+-\S+-\S+-\S+', row['ORCID'])

        url = "https://pub.orcid.org/v2.1/" + ''.join(orcid) + "/record"
        record = requests.get(url, headers=headers, timeout=None).json()
        emails = record['person']['emails']['email']
        works = record['activities-summary']['works']['group']

        # fill in a missing email & add the number of works
        if row['Email'] in ('', 'No email listed') and len(emails) > 0:
            auth_df.loc[index, 'Email'] = emails[0]['email']
        if len(works) > 0:
            auth_df.loc[index, 'Total Works'] = len(works)

In [None]:
# print the updated list
print(auth_df)

### Part 4: Gender API
If you'd like a list of authors predicted to be female, run the following 2 blocks.

Register for a [Gender-API account](https://gender-api.com/) and add your API key in the code block below.

*Note: Free accounts are limited to 500 names per month.*
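
Because the monthly quota is counted per name, one way to stretch it is to query each distinct first name only once. The sketch below shows that idea with a stand-in lookup function. `predict_unique` and `fake_lookup` are hypothetical names, and no real API is contacted; in practice the lookup would wrap the request made in the cell below.

```python
def predict_unique(names, lookup):
    """Call lookup() once per distinct name and map results back to every entry.

    lookup is any function taking a first name and returning a label;
    here it stands in for a real Gender-API request.
    """
    cache = {}
    labels = []
    for name in names:
        # normalise so 'Maria', 'maria' and 'Maria ' share one query
        key = name.strip().lower()
        if key not in cache:
            cache[key] = lookup(key)
        labels.append(cache[key])
    return labels, len(cache)

# Stand-in lookup for demonstration only; it does not contact any API
calls = []
def fake_lookup(name):
    calls.append(name)
    return 'unknown'

labels, n_queries = predict_unique(['Maria', 'maria', 'Chen', 'Maria '], fake_lookup)
print(n_queries, 'queries for', len(labels), 'names')
```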

In [None]:
#input your key
genderAPI_key = 'YOUR ACCOUNT KEY HERE'

In [None]:
#build a data frame of only predicted female authors
headers={
   'X-RapidAPI-Host': 'gender-api.com',
   'X-RapidAPI-Key': genderAPI_key
 }

gender_list = list()

for name in auth_df['First Name']:
    #query API (params= URL-encodes the name)
    gender = requests.get('https://gender-api.com/get', params={'name': name},
                          headers=headers)
    gender_list.append(gender.json()['gender'])

auth_df['Gender'] = gender_list

#keep rows predicted female
#(boolean filtering replaces DataFrame.append, removed in pandas 2.0)
female_auth_df = auth_df[auth_df['Gender']=='female']

print('Authors predicted to be female')
print(female_auth_df)

### Part 5: Ethnicity API

If you'd like a list of authors predicted to be Black or Latinx, run the following 3 blocks. This uses the [Namsor-Client](https://pypi.org/project/namsor-client/) package.

Register for a [NamSor API account](https://v2.namsor.com/NamSorAPIv2/index.html) and add your API key in the code block below.

*Note: Free accounts are limited to 500 names per month.*

In [None]:
#input your key
namsor_KEY = "YOUR ACCOUNT KEY HERE"

In [None]:
!pip install namsor-client
from namsorclient import NamsorClient

In [None]:
client = NamsorClient(namsor_KEY)

# find female-identifying authors (same as GenderAPI)
# female_auth_df = pd.DataFrame(columns=['Name','Institution','Email',
#                                     'Keyword Preprints','Total Preprints','Downloads'])
# for i, name in enumerate(auth_df['Name']):
#     gender = client.genderFull(name)
#     if gender.likely_gender=='female':
#         female_auth_df = female_auth_df.append(auth_df.iloc[i])
        
# print('Authors predicted to be female identifying')
# print(female_auth_df)

# find minority authors
ethnicity_list = list()

for i, row in auth_df.iterrows():
    ethnicity = client.usRaceEthnicity(row['First Name'], row['Last Name'])
    # options: ['HL', 'A', 'W_NL', 'B_NL']
    ethnicity_list.append(ethnicity.race_ethnicity)

auth_df['Ethnicity'] = ethnicity_list

# keep rows predicted Hispanic/Latinx or Black/Non-Latinx
# (boolean filtering replaces DataFrame.append, removed in pandas 2.0)
minority_auth_df = auth_df[auth_df['Ethnicity'].isin(['HL', 'B_NL'])]

print('Authors predicted to be Black or Latinx')
print(minority_auth_df)

### Part 6: Save to a CSV

In [None]:
#replace whitespace in the keywords to build the file name
csvfile = re.sub(r'\s', '_', keywords) + '_bioRxiv_speaker_finder.csv'
auth_df.to_csv(csvfile)