# Text Source PubMed
PubMed is a database that indexes a large body of articles in medical and biological research. It can be accessed programmatically through web APIs. PubMed is not the publisher, however, so it can provide full text only for certain open access articles. Abstracts are available for practically all articles.

The code below shows how to search PubMed, fetch abstracts, and fetch full text. The retrieved text can then be analyzed or processed further with NLP tools, LLMs, and similar.

June 16, 2023, Anders Ohrn

## Preliminaries
A few installs, adjustments to the Python path, imports, and constants are handled first. The first two code blocks may not be needed, or should be adapted to the user's setup.

In [2]:
!python -m pip install requests
!python -m pip install numexpr



In [3]:
import sys
sys.path.append('/Users/andersohrn/opt/anaconda3/lib/python3.8/site-packages')
print (sys.path)

['/Users/andersohrn/Development/das_wort', '/Users/andersohrn/opt/anaconda3/lib/python38.zip', '/Users/andersohrn/opt/anaconda3/lib/python3.8', '/Users/andersohrn/opt/anaconda3/lib/python3.8/lib-dynload', '', '/Users/andersohrn/river_chatgpt/river_gpt/lib/python3.8/site-packages', '/Users/andersohrn/opt/anaconda3/lib/python3.8/site-packages']


In [4]:
from typing import Optional, List
from datetime import date

import json
import requests
from xml.etree import ElementTree as ET

In [5]:
PUBMED_APIs = {
    'standard' : {
        'base_url': 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/',
        'apis': {
            'search' : 'esearch.fcgi',
            'fetch' : 'efetch.fcgi'
        }
    },
    'open_access': {
        'base_url': 'https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/'
    }
}

## Search PubMed API
PubMed searches can be done through an API. The query syntax is documented at length here: https://www.ncbi.nlm.nih.gov/books/NBK25499/. Many of NIH's databases use these Entrez utilities. For PubMed only a subset is relevant.

A basic search term is constructed by the helper function, which can be extended if needed. Advanced custom search terms can be created here: https://pubmed.ncbi.nlm.nih.gov/advanced/.

The next code cell creates a string `search_term` that represents a query to the PubMed database.

In [6]:
def create_basic_search_term(all_fields: Optional[str]=None,
                             author: Optional[str]=None,
                             title: Optional[str]=None,
                             abstract: Optional[str]=None,
                             published_after_date: Optional[date]=None):
    # Each argument that is given becomes one field-tagged clause of the query
    term = []
    if all_fields is not None:
        term.append('{}[All Fields]'.format(all_fields))
    if author is not None:
        term.append('{}[Author]'.format(author))
    if title is not None:
        term.append('{}[Title]'.format(title))
    if abstract is not None:
        term.append('{}[Title/Abstract]'.format(abstract))
    if published_after_date is not None:
        # Open-ended date range; assumes PubMed's YYYY/MM/DD:YYYY range syntax,
        # with 3000 as a far-future upper bound
        term.append('{}:3000[Date - Publication]'.format(
            published_after_date.strftime('%Y/%m/%d')))

    if len(term) == 0:
        raise ValueError('No input values given')

    # The clauses are combined with a logical AND
    return '+AND+'.join(term)

search_term = create_basic_search_term(title='HER2-positive', author='Akihito Kawazoe')
#search_term = create_basic_search_term(title='HER2-positive', author='Smith')
search_term = create_basic_search_term(all_fields='HER2')

Given the search term, the function `do_search_via_api` builds the request URL, calls the external web API, and returns the payload, which is stored in `pubmed_object`.

A search can return a very large number of IDs. The parameters `retmax` and `retstart` control how many IDs are returned and at which offset in the result list the returned IDs start.
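
Search terms contain brackets and spaces, which are safer to send percent-encoded. The block below is a minimal sketch of one way to build the same kind of URL with the standard library. The helper name `build_search_url` is hypothetical and not part of the notebook, and the term is written with plain spaces instead of the `+AND+` joiner, since `urlencode` converts spaces to `+` itself.

```python
from urllib.parse import urlencode

# Base URL and endpoint as given in the PUBMED_APIs constant
ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def build_search_url(term: str, retmax: int = 20, retstart: int = 0) -> str:
    """Hypothetical helper: build an esearch URL with encoded parameters."""
    params = {'db': 'pubmed', 'term': term, 'retmax': retmax, 'retstart': retstart}
    # urlencode escapes brackets and turns spaces into '+'
    return '{}?{}'.format(ESEARCH_URL, urlencode(params))


example_url = build_search_url('HER2[All Fields] AND Smith[Author]', retmax=100)
print(example_url)
```

Passing a dictionary through the `params` argument of `requests.get` would encode the values in a similar way.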

In [7]:
def do_search_via_api(search_term: str, retmax: int=20, retstart:int=0):
    url = '{}{}?db=pubmed'.format(
        PUBMED_APIs['standard']['base_url'], 
        PUBMED_APIs['standard']['apis']['search']
    )
    url_full = '{}&term={}&retmax={}&retstart={}'.format(url, search_term, retmax, retstart)
    print('URL to call: {}'.format(url_full))
    
    r = requests.get(url=url_full)
    if r.status_code != 200:
        # Report the failure; the raw payload is still returned below
        print('Status code {} received!'.format(r.status_code))
        print('Error message: {}'.format(r.content))
        
    return r.content

#pubmed_object = do_search_via_api(search_term)
pubmed_object = do_search_via_api(search_term, retmax=100)
#pubmed_object = do_search_via_api(search_term, retmax=20, retstart=20)

URL to call: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=HER2-positive&retmax=100&retstart=0
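
When a search matches more IDs than fit in one response, `retstart` can be stepped through in increments of `retmax`. The following is a small sketch of how the offsets could be computed. The helper `page_offsets` is a hypothetical name, and the pause between calls in the commented usage is an assumption about polite API use, not something the notebook prescribes.

```python
from typing import List


def page_offsets(n_found: int, page_size: int = 100) -> List[int]:
    """Hypothetical helper: the retstart value for each page of results."""
    if page_size <= 0:
        raise ValueError('page_size must be positive')
    return list(range(0, n_found, page_size))


offsets = page_offsets(250, page_size=100)
print(offsets)

# Each offset could then be passed on to the search function, for example:
# for retstart in offsets:
#     payload = do_search_via_api(search_term, retmax=100, retstart=retstart)
#     time.sleep(0.5)  # assumed pause between calls
```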


The returned object is an XML document. Its structure can be inspected after parsing, or often by opening the URL in a web browser. The part of main interest is the list of PubMed IDs.

In [8]:
root = ET.fromstring(pubmed_object)
n_found = root.find('./Count').text
n_retmax = root.find('./RetMax').text
n_retstart = root.find('./RetStart').text
pubmed_ids = [x.text for x in root.findall('.//Id')]

print('Found {} matching PubMed objects. Return {} of them.'.format(n_found, n_retmax))
print('List of returned PubMed IDs: {}'.format(pubmed_ids))

Found 9282 matching PubMed objects. Return 100 of them.
List of returned PubMed IDs: ['37348349', '37346044', '37345001', '37343894', '37342184', '37341815', '37340843', '37338513', '37337299', '37336650', '37335880', '37333824', '37332338', '37332328', '37331637', '37331526', '37331014', '37329891', '37328285', '37327700', '37327257', '37326930', '37325934', '37324228', '37322978', '37318852', '37318379', '37317607', '37314204', '37304544', '37303973', '37302750', '37302393', '37300985', '37299548', '37296873', '37295176', '37291377', '37289321', '37289145', '37286557', '37284715', '37279780', '37276871', '37275938', '37274273', '37273150', '37268625', '37268505', '37267721', '37267036', '37266600', '37266479', '37265453', '37264567', '37264417', '37264298', '37264265', '37264184', '37263678', '37263554', '37258548', '37258524', '37258523', '37256703', '37255389', '37254964', '37251081', '37247580', '37246901', '37246414', '37240498', '37233208', '37232839', '37232838', '37232822', '3
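
To make the structure of the search payload easier to see without a network call, the sketch below parses a hand-written XML sample shaped like the elements used above (`Count`, `RetMax`, `RetStart`, `Id`). The sample IDs are made up, the `IdList` wrapper is how the IDs are assumed to be nested, and `parse_search_result` is a hypothetical helper.

```python
from xml.etree import ElementTree as ET

# Hand-written sample shaped like a search response; the IDs are made up
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<eSearchResult>
  <Count>3</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""


def parse_search_result(payload: bytes) -> dict:
    """Hypothetical helper: collect counts and IDs into a dictionary."""
    root = ET.fromstring(payload)
    return {
        # findtext returns the default when an element is missing
        'count': int(root.findtext('./Count', default='0')),
        'retmax': int(root.findtext('./RetMax', default='0')),
        'retstart': int(root.findtext('./RetStart', default='0')),
        'ids': [x.text for x in root.findall('./IdList/Id')],
    }


parsed = parse_search_result(sample_xml)
print(parsed)
```

Converting the counts to integers makes them directly usable for paging through the remaining results.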

## Abstract Fetching
For practically all articles, an abstract is freely available. It can be retrieved for articles given their PubMed IDs. This uses the PubMed API, but a different endpoint than the one for search. The function `get_abstracts_via_api` joins a list of PubMed IDs with the other parts of the URL, issues the request to the web API and (if successful) returns the PubMed abstract object.

In [187]:
def get_abstracts_via_api(ids: List[str]):
    url = '{}{}?db=pubmed'.format(
        PUBMED_APIs['standard']['base_url'], 
        PUBMED_APIs['standard']['apis']['fetch']
    )
    url_full = '{}&rettype=Abstract&id={}'.format(url, ','.join(ids))
    print('URL to call: {}'.format(url_full))
    
    r = requests.get(url=url_full)
    print ('API call completed with status code {}'.format(r.status_code))
        
    return r.content

pubmed_abstract_object = get_abstracts_via_api(ids=pubmed_ids)

URL to call: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=Abstract&id=34912120
API call completed with status code 200
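
Since all IDs are joined into a single URL, a long list produces a very long request. One option is to fetch the abstracts in batches. This is a sketch only: the helper `chunk_ids` is hypothetical, and the batch size of 200 is an assumption, loosely based on the E-utilities guidance that larger ID lists are better sent with HTTP POST.

```python
from typing import Iterator, List


def chunk_ids(ids: List[str], chunk_size: int = 200) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive slices of the ID list."""
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]


# Made-up IDs, only to show the batch sizes
batches = list(chunk_ids([str(i) for i in range(450)], chunk_size=200))
print([len(batch) for batch in batches])

# Each batch could then be fetched separately, for example:
# payloads = [get_abstracts_via_api(ids=batch) for batch in batches]
```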


The abstract object is considerably more convoluted than the search object. It contains a lot of additional information, such as author metadata, references, and journal data. It is an XML document that requires a fair bit of inspection before the desired data can be retrieved. Browsers can typically display the XML, so click the HTTPS URL from the previous code block to view it.

In the next code block only the AbstractText is obtained. This is complicated by the fact that abstracts can contain references and other markup, which are represented as tags within the text. Other issues are possible.

In [188]:
root = ET.fromstring(pubmed_abstract_object)

In [189]:
# Function to extract text recursively from an element
def extract_text(element):
    text = element.text or ''
    for child in element:
        text += extract_text(child)
    text += element.tail or ''
    return text


# Find the AbstractText element and extract its text content, associate with the PubMed ID key
abstract_texts = {}
pubmed_articles = root.findall('.//PubmedArticle')
for article in pubmed_articles:
    pubmed_element = article.find(".//ArticleId[@IdType='pubmed']")
    
    abstract_text = []
    for abstract_element in article.findall('.//Abstract'):
        text = ''
        for abstract_text_element in abstract_element.findall('./AbstractText'):
            text += extract_text(abstract_text_element)
        
        abstract_text.append(text)
    abstract_texts[pubmed_element.text] = '\n'.join(abstract_text)
    

# Print the extracted texts
print (abstract_texts)

{'34912120': 'Human epidermal growth factor receptor 2 (HER2,\xa0also known as ERBB2) amplification or overexpression occurs in approximately 20% of advanced gastric or gastro-oesophageal junction adenocarcinomas1-3. More than a decade ago, combination therapy with the anti-HER2 antibody trastuzumab and chemotherapy became the standard first-line treatment for patients with these types of tumours4. Although adding the anti-programmed death 1 (PD-1) antibody pembrolizumab to chemotherapy does not significantly improve efficacy in advanced HER2-negative gastric cancer5, there are preclinical6-19 and clinical20,21 rationales for adding pembrolizumab in HER2-positive disease. Here we describe results of the protocol-specified first interim analysis of the randomized, double-blind, placebo-controlled phase III KEYNOTE-811 study of pembrolizumab plus trastuzumab and chemotherapy for unresectable or metastatic, HER2-positive gastric or gastro-oesophageal junction adenocarcinoma22 ( https://cl
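
The effect of inline tags can be seen on a small hand-written sample. The sketch below uses `itertext()` from the standard library, which walks the text inside an element including the text of nested tags, as an alternative to the recursive function above. It also keeps the `Label` attribute that structured abstracts carry on each `AbstractText`. The sample XML is made up, and `abstract_sections` is a hypothetical helper.

```python
from xml.etree import ElementTree as ET

# Made-up abstract with inline <sup> and <i> tags and labelled sections
sample = (
    '<Abstract>'
    '<AbstractText Label="BACKGROUND">HER2 is amplified in some tumours'
    '<sup>1-3</sup>. Trastuzumab is <i>standard</i> care.</AbstractText>'
    '<AbstractText Label="METHODS">Patients were randomized.</AbstractText>'
    '</Abstract>'
)


def abstract_sections(abstract_element):
    """Hypothetical helper: return (label, text) pairs for one Abstract."""
    sections = []
    for element in abstract_element.findall('./AbstractText'):
        # itertext() yields the element's own text and all nested text in order
        body = ''.join(element.itertext()).strip()
        sections.append((element.get('Label'), body))
    return sections


sections = abstract_sections(ET.fromstring(sample))
print(sections)
```

Note that the reference markers end up fused to the preceding word ("tumours1-3"), which matches what the extracted abstract above shows.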

## Full Text Open Access
The full text of an article is typically copyrighted and owned by a publishing house such as Elsevier or Springer. Open access to the full text is therefore not guaranteed. PubMed does, however, compile an open access subset of articles, which is accessible through a different API.

In [191]:
def get_fulltext_via_api(pid: str):
    url = '{}'.format(PUBMED_APIs['open_access']['base_url'])
    url_full = '{}{}/unicode'.format(url, pid)
    print('URL to call: {}'.format(url_full))
    
    r = requests.get(url=url_full)
    print ('API call completed with status code {}'.format(r.status_code))
        
    return r.content

fulltext_data = get_fulltext_via_api('34912120')
fulltext_data = json.loads(fulltext_data)

URL to call: https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/34912120/unicode
API call completed with status code 200


The `fulltext_data` dictionary contains the full text (if the article is in the open access subset). It is a large and deeply nested dictionary from which text can be extracted. The code below is an illustrative example of how a particular subset of the text can be retrieved for further processing by NLP tools or LLMs.

In [192]:
text_total = []
for passage in fulltext_data['documents'][0]['passages']:
    section_type = passage['infons']['section_type']
    if section_type in ['TABLE', 'REF']:
        continue
        
    text_paragraph = passage['text']
    text_total.append(text_paragraph)
full_text = '\n'.join(text_total)
print (full_text)

Combined PD-1 and HER2 blockade for HER2-positive gastric cancer
SUMMARY
Human epidermal growth factor receptor 2 (ERBB2/HER2) amplification or overexpression occurs in approximately 20% of advanced gastric or gastroesophageal junction adenocarcinomas. Over a decade ago, combination therapy with the anti–HER2 antibody trastuzumab and chemotherapy became the standard first-line treatment for these patients. Although adding the anti–programmed death 1 (PD-1) antibody pembrolizumab to chemotherapy does not significantly improve efficacy in advanced HER2-negative gastric cancer, there are preclinical and clinical rationales for adding pembrolizumab in HER2-positive disease. Here we describe results of the protocol-specified first interim analysis of the randomized, double-blind, placebo-controlled phase III KEYNOTE-811 study of pembrolizumab plus trastuzumab and chemotherapy for unresectable or metastatic, HER2-positive gastric or gastroesophageal junction adenocarcinoma (ClinicalTrials.go
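
For downstream processing it can be convenient to keep the text grouped by section instead of as one long string. The sketch below does this on a hand-written dictionary that mimics the keys used above (`documents`, `passages`, `infons`, `section_type`, `text`). The sample content and the section names `TITLE` and `INTRO` are made up, and `text_by_section` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up data that mimics the nesting of the full-text dictionary
sample_fulltext = {
    'documents': [{
        'passages': [
            {'infons': {'section_type': 'TITLE'}, 'text': 'A made-up title'},
            {'infons': {'section_type': 'INTRO'}, 'text': 'First paragraph.'},
            {'infons': {'section_type': 'INTRO'}, 'text': 'Second paragraph.'},
            {'infons': {'section_type': 'REF'}, 'text': 'Some reference'},
        ]
    }]
}


def text_by_section(fulltext, skip=('TABLE', 'REF')):
    """Hypothetical helper: join passage texts per section type."""
    sections = defaultdict(list)
    for passage in fulltext['documents'][0]['passages']:
        section_type = passage['infons'].get('section_type', 'UNKNOWN')
        if section_type in skip:
            continue
        sections[section_type].append(passage['text'])
    return {name: '\n'.join(parts) for name, parts in sections.items()}


grouped = text_by_section(sample_fulltext)
print(grouped)
```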