# Instructions

This script takes an excel file of SEATRAC authors and a date range and searches PubMed for papers from those authors published during those dates. Output is a semicolon-separated list of PubMed IDs which can be pasted directly into the PubMed search box to get a page listing those papers.

1. Upload "20231114_SEATRAC_Member_PubSearch.xlsx":
    * Click the folder symbol on the left, then the upload symbol. If you get a warning about saving elsewhere click "ok".
    * If you're using a newer version ensure these columns still exist: 'Last Name', 'First Name', 'Middle Initial', 'Primary Institution (Choose ONE)'
2. Update the search dates:
    * In the "Inputs" code box below, edit the text within quotes
    * Click the play button to run this chunk of code, saving the inputs
3. Do the search:
    * Click the play button of the 'Do PubMed Search' code box (should take 1-2min to finish)

#### Extra info:

* PubMed API (NCBI Entrez E-utilities) help: https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Introduction
* More help: https://pubmed.ncbi.nlm.nih.gov/help/
* List of PubMed "tags": https://www.ncbi.nlm.nih.gov/pmc/about/userguide/

# Inputs

In [19]:
start_date = '2023/07/01'
end_date = '2023/09/30'

excel_file = '/content/20231114_SEATRAC_Member_PubSearch.xlsx'

# Do PubMed Search

In [20]:
#########
# Setup #
#########

print('Loading packages and data')

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Create URL prefix Using my NCBI account API key (api_key) and retrieving a 
# maximum of 10,000 records (retmax)
prefix = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&api_key=e8c1951d5b35792885126d8a9597398a1a07&retmax=10000&usehistory=y&term='

# Load input excel sheet
pi_list = pd.read_excel(excel_file)

####################################
# Format affiliations and concepts #
####################################

print('Formatting affiliations and concepts')

affiliations_list = pi_list['Primary Institution (Choose ONE)'].drop_duplicates().tolist()
affiliations = '("' + '"[ad] OR "'.join(affiliations_list) + '")'

concept = '("tubercul*"[tw] OR "Antitubercul*"[tw] OR "Anti-Tubercul*"[tw] OR "osteotubercul*"[tw] OR "nephrotubercul*"[tw] OR "anthracosilicotubercul*"[tw] OR "coniotubercul*"[tw] OR "Tuberculin"[tw] OR "tb"[tw] OR "xdr-tb"[tw] OR "xdrtb"[tw] OR "mdr-tb"[tw] OR "mdrtb"[tw] OR "phthisis"[tw] OR "pneumonophthisis"[tw] OR "pneumophthisiology"[tw] OR "silicotubercul*"[tw] OR "bazin disease"[tw] OR "erythema induratum"[tw] OR "white swelling"[tw] OR "king`s evil"[tw] OR "scrofula"[tw] OR "pott disease"[tw] OR "koch`s disease"[tw] OR "Interferon-gamma Release Test"[tw] OR "Tuberculosis"[Mesh] OR "Mycobacterium tuberculosis"[Mesh] OR "Antitubercular Agents"[Mesh] OR "Tuberculin Test"[Mesh] OR "Interferon-gamma Release Tests"[Mesh] OR "Tuberculosis Vaccines"[Mesh])'
concept_list = concept.\
    replace('[Mesh]', '[mh]').\
    split(' OR ')
concepts = ' OR '.join(concept_list)

#######################
# Format author names #
#######################

print('Formatting author names (searching for recent papers to decide whether to use middle initial)')

# Step 1: Determine if need to use middle initial

pi_list['LastFirst'] = '"' + pi_list['Last Name'] + ' ' + pi_list['First Name'].str[0] + '"' + '[au]'
pi_list['LastFirstMiddle'] = '"' + pi_list['Last Name'] + ' ' + pi_list['First Name'].str[0] + pi_list['Middle Initial'].str[0] + '"' + '[au]'

# For name in LastFirst, how many papers show up published within the last four 
# years? Show authors with no papers (probably need middle initial)
pubdate_4yr = '2019/11/21:2023/11/21[pdat]'

has_papers = set()
no_papers = list()
for author in pi_list['LastFirst'].drop_duplicates().to_list():
    url = prefix + pubdate_4yr + '+AND+' + affiliations + author
    page = requests.get(url).text
    result = BeautifulSoup(page, 'xml')
    IDlist = [i.text for i in result.find_all('Id')]
    num_papers = len(IDlist)
    if num_papers > 0:
        has_papers.add(author)
    else:
        no_papers.append(author)

# Now look with LastFirstMiddle
new_pi_list = pi_list[pi_list['LastFirst'].isin(no_papers)].sort_values(by='LastFirstMiddle')
new_has_papers = set()
new_no_papers = list()
for author in new_pi_list['LastFirstMiddle'].drop_duplicates().dropna().to_list():
    url = prefix + pubdate_4yr + '+AND+' + affiliations + author
    page = requests.get(url).text
    result = BeautifulSoup(page, 'xml')
    IDlist = [i.text for i in result.find_all('Id')]
    num_papers = len(IDlist)
    if num_papers > 0:
        new_has_papers.add(author)
    else:
        new_no_papers.append(author)

# Step 2: Put together final author list

# Combine and include authors who didn't have papers with LastFirst but also don't have a MI
author_set = has_papers.union(new_has_papers)
authors_list = list(author_set) + ['"Sorri Y"[au]', '"Connolly A"[au]', '"Ghassemieh B"[au]']
authors = '(' + ' OR '.join(authors_list) + ')'  # Make authors into string

#############
# Do search #
#############

print('Doing PubMed search')

date_range = start_date + ':' + end_date + '[pdat]'
url = prefix + date_range + '+AND+' + affiliations + '+AND+' + authors + '+AND+' + concepts
page = requests.get(url).text
result = BeautifulSoup(page, 'xml')
IDlist = [i.text for i in result.find_all('Id')]

out = ';'.join(list(IDlist))

print('Done!\n')
print(f'PubMed IDs of papers from SEATRAC authors published between {start_date} and {end_date}:\n')
print(out)

Loading packages and data
Formatting affiliations and concepts
Formatting author names (searching for recent papers to decide whether to use middle initial)
Doing PubMed search
PubMed IDs of papers from SEATRAC authors published between 2023/07/01 and 2023/09/30:

37773037;37708378;37696247;37693490;37676852;37556423;37461439;37400753;37379655;37336104;36945395
