# Text Source PubMed
Sales words

In [6]:
!python -m pip install openai
!python -m pip install pandas
!python -m pip install scipy
!python -m pip install requests
!python -m pip install numexpr



In [7]:
import sys
sys.path.append('/Users/andersohrn/opt/anaconda3/lib/python3.8/site-packages')
print (sys.path)

['/Users/andersohrn/Development/das_wort', '/Users/andersohrn/opt/anaconda3/lib/python38.zip', '/Users/andersohrn/opt/anaconda3/lib/python3.8', '/Users/andersohrn/opt/anaconda3/lib/python3.8/lib-dynload', '', '/Users/andersohrn/river_chatgpt/river_gpt/lib/python3.8/site-packages', '/Users/andersohrn/opt/anaconda3/lib/python3.8/site-packages', '/Users/andersohrn/opt/anaconda3/lib/python3.8/site-packages']


In [121]:
from typing import Optional, List
from datetime import date

import json
import requests
import openai
import pandas as pd
from xml.etree import ElementTree as ET

In [98]:
PUBMED_APIs = {
    'standard' : {
        'base_url': 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/',
        'apis': {
            'search' : 'esearch.fcgi',
            'fetch' : 'efetch.fcgi'
        }
    },
    'open_access': {
        'base_url': 'https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/'
    }
}

## Search PubMed API
PubMed can be searched through an API. Queries use a special syntax, documented at length here: https://www.ncbi.nlm.nih.gov/books/NBK25499/. Many of NIH's databases use these Entrez utilities; for PubMed, only a subset is relevant.

A basic search term is constructed by the helper function below, which can be extended if needed. Advanced custom search terms can be created here: https://pubmed.ncbi.nlm.nih.gov/advanced/.

The next code cell creates a string `search_term` that represents a query to the PubMed database.

In [105]:
def create_basic_search_term(author: Optional[str]=None,
                             title: Optional[str]=None,
                             abstract: Optional[str]=None,
                             published_after_date: Optional[date]=None):
    """Join the given fields into a PubMed query string using field tags."""
    term = []
    if author is not None:
        term.append('{}[Author]'.format(author))
    if title is not None:
        term.append('{}[Title]'.format(title))
    if abstract is not None:
        term.append('{}[Title/Abstract]'.format(abstract))
    if published_after_date is not None:
        # Open-ended range from the given date (inclusive) up to the year 3000,
        # the usual PubMed idiom for "published on or after this date".
        date_from = published_after_date.strftime('%Y/%m/%d')
        term.append('{}:3000[Date - Publication]'.format(date_from))

    if len(term) == 0:
        raise ValueError('No input values given')

    return '+AND+'.join(term)

search_term = create_basic_search_term(title='HER2-positive', author='Akihito Kawazoe')

Given the search term, the function `do_search_via_api` joins the search term to the appropriate parts in order to generate a URL that calls the external Web API and returns the payload, stored in `pubmed_object`.

In [106]:
def do_search_via_api(search_term: str):
    url = '{}{}?db=pubmed'.format(
        PUBMED_APIs['standard']['base_url'], 
        PUBMED_APIs['standard']['apis']['search']
    )
    url_full = '{}&term={}'.format(url, search_term)
    print('URL to call: {}'.format(url_full))
    
    r = requests.get(url=url_full)
    if r.status_code != 200:
        print('Status code {} received!'.format(r.status_code))
        print('Error message: {}'.format(r.content))
        
    return r.content

pubmed_object = do_search_via_api(search_term)

URL to call: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Akihito Kawazoe[Author]+AND+HER2-positive[Title]
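
The URL printed above contains a raw space and square brackets, which are not valid in a URL until they are percent-encoded. The following is a minimal sketch of building the same query with explicit encoding through the standard library. The names `BASE_URL` and `build_search_url` are illustrative and not part of the notebook, and the parts are joined with ` AND ` rather than `+AND+` because the encoder already turns each space into `+`.

```python
from urllib.parse import urlencode

# Same endpoint as PUBMED_APIs['standard'] in the notebook, written out in full
BASE_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def build_search_url(term_parts):
    # Join with ' AND '; urlencode converts spaces to '+' and brackets to %5B / %5D
    term = ' AND '.join(term_parts)
    query = urlencode({'db': 'pubmed', 'term': term})
    return '{}?{}'.format(BASE_URL, query)

example_url = build_search_url(['Akihito Kawazoe[Author]', 'HER2-positive[Title]'])
print(example_url)
```

Encoding the query explicitly makes the printed URL identical to what is actually sent, which helps when a search has to be reproduced by hand.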


The response is an XML document. Its structure can be inspected after parsing, or sometimes by opening the URL in a web browser. The part of interest is the list of PubMed IDs.

In [114]:
root = ET.fromstring(pubmed_object)
n_found = root.find('./Count').text
n_retmax = root.find('./RetMax').text
n_retstart = root.find('./RetStart').text
pubmed_ids = [x.text for x in root.findall('.//Id')]

print('Found {} matching PubMed objects. Return {} of them.'.format(n_found, n_retmax))
print('List of returned PubMed IDs: {}'.format(pubmed_ids))

Found 1 matching PubMed objects. Return 1 of them.
List of returned PubMed IDs: ['34912120']
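
The `Count`, `RetMax` and `RetStart` values matter once a search matches more records than one response returns. The sketch below parses a hand-written sample (the IDs and counts are made up) and computes the offsets that would still have to be requested, on the assumption that they are passed to the search endpoint as the `retstart` query parameter. The helper names `parse_search_response` and `remaining_offsets` are illustrative.

```python
from xml.etree import ElementTree as ET

# Hand-written sample that mimics the layout of a search response; values are made up
sample_response = """<eSearchResult>
  <Count>45</Count>
  <RetMax>20</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
  </IdList>
</eSearchResult>"""

def parse_search_response(xml_text):
    # Pull the counters and the ID list out of the XML, converting counters to int
    root = ET.fromstring(xml_text)
    return {
        'count': int(root.findtext('./Count')),
        'retmax': int(root.findtext('./RetMax')),
        'retstart': int(root.findtext('./RetStart')),
        'ids': [x.text for x in root.findall('.//Id')],
    }

def remaining_offsets(count, retmax, retstart=0):
    # Offsets of the pages that remain after the current one
    if retmax <= 0:
        return []
    return list(range(retstart + retmax, count, retmax))

parsed = parse_search_response(sample_response)
offsets = remaining_offsets(parsed['count'], parsed['retmax'], parsed['retstart'])
print(parsed['ids'], offsets)
```

For the single-hit search in this notebook the list of remaining offsets would be empty, so no further requests are needed.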


## Abstract Fetching
An abstract is freely available for practically all articles and can be retrieved given the article's PubMed ID. This uses the PubMed API, but a different endpoint than search. The function `get_abstracts_via_api` joins a list of PubMed IDs with the other parts of the API URL, issues the request to the Web API and returns the response content, which on success is the PubMed abstract object.

In [115]:
def get_abstracts_via_api(ids: List[str]):
    url = '{}{}?db=pubmed'.format(
        PUBMED_APIs['standard']['base_url'], 
        PUBMED_APIs['standard']['apis']['fetch']
    )
    url_full = '{}&rettype=Abstract&id={}'.format(url, ','.join(ids))
    print('URL to call: {}'.format(url_full))
    
    r = requests.get(url=url_full)
    print ('API call completed with status code {}'.format(r.status_code))
        
    return r.content

pubmed_abstract_object = get_abstracts_via_api(ids=pubmed_ids)

URL to call: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=Abstract&id=34912120
API call completed with status code 200


The abstract object is considerably more complex than the search object. It contains a lot of additional information, such as author metadata, references and journal data. It is an XML document that requires a fair bit of inspection to find the desired data. Browsers can typically display the XML, so click the HTTPS URL from the previous code block.

In the next code block only the AbstractText is obtained. This is complicated by the fact that abstracts can contain references, which are represented as tags within the text. Other issues are possible.

In [116]:
root = ET.fromstring(pubmed_abstract_object)

In [117]:
# Function to extract text recursively from an element
def extract_text(element):
    text = element.text or ''
    for child in element:
        text += extract_text(child)
    text += element.tail or ''
    return text


# Find the AbstractText element and extract its text content
abstract_texts = []
abstract_text_elements = root.findall('.//AbstractText')
for element in abstract_text_elements:
    abstract_texts.append(extract_text(element))

# Print the extracted texts
print(abstract_texts)

['Human epidermal growth factor receptor 2 (HER2,\xa0also known as ERBB2) amplification or overexpression occurs in approximately 20% of advanced gastric or gastro-oesophageal junction adenocarcinomas1-3. More than a decade ago, combination therapy with the anti-HER2 antibody trastuzumab and chemotherapy became the standard first-line treatment for patients with these types of tumours4. Although adding the anti-programmed death 1 (PD-1) antibody pembrolizumab to chemotherapy does not significantly improve efficacy in advanced HER2-negative gastric cancer5, there are preclinical6-19 and clinical20,21 rationales for adding pembrolizumab in HER2-positive disease. Here we describe results of the protocol-specified first interim analysis of the randomized, double-blind, placebo-controlled phase III KEYNOTE-811 study of pembrolizumab plus trastuzumab and chemotherapy for unresectable or metastatic, HER2-positive gastric or gastro-oesophageal junction adenocarcinoma22 ( https://clinicaltrials
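
As an alternative to the recursive helper `extract_text`, the standard library's `Element.itertext()` walks the nested tags directly. The sketch below runs on a hand-written sample, not on real PubMed output, and assumes that structured abstracts carry a `Label` attribute on each `AbstractText` element. The function name `abstract_sections` is illustrative.

```python
from xml.etree import ElementTree as ET

# Hand-written sample with inline tags inside the abstract text; content is made up
sample_xml = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">HER2 is amplified in some tumours<sup>1</sup>.</AbstractText>
          <AbstractText Label="RESULTS">Response improved with <i>combination</i> therapy.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def abstract_sections(xml_text):
    # Return (label, text) pairs; the label is None for unstructured abstracts
    root = ET.fromstring(xml_text)
    sections = []
    for element in root.findall('.//AbstractText'):
        label = element.get('Label')
        # itertext() yields the text of the element and all of its descendants
        text = ''.join(element.itertext()).strip()
        sections.append((label, text))
    return sections

for label, text in abstract_sections(sample_xml):
    print(label, '|', text)
```

Reference markers such as the superscript `1` end up fused to the preceding word, which matches the `adenocarcinomas1-3` pattern visible in the output above.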

## Full Text Open Access
The full text of an article is copyrighted and owned by its publishing house (Elsevier, Springer etc.). Open access to the full text is therefore not guaranteed. PubMed does, however, compile an open access subset of articles, which is accessible through a different API.

In [134]:
def get_fulltext_via_api(pid: str):
    url = '{}'.format(PUBMED_APIs['open_access']['base_url'])
    url_full = '{}{}/unicode'.format(url, pid)
    print('URL to call: {}'.format(url_full))
    
    r = requests.get(url=url_full)
    print ('API call completed with status code {}'.format(r.status_code))
        
    return r.content

fulltext_data = get_fulltext_via_api('34912120')
fulltext_data = json.loads(fulltext_data)
print (fulltext_data)

URL to call: https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/34912120/unicode
API call completed with status code 200
{'source': 'PMC', 'date': '20221221', 'key': 'pmc.key', 'infons': {}, 'documents': [{'id': '8959470', 'infons': {'license': 'author_manuscript'}, 'passages': [{'offset': 0, 'infons': {'article-id_doi': '10.1038/s41586-021-04161-3', 'article-id_manuscript': 'NIHMS1760542', 'article-id_pmc': '8959470', 'article-id_pmid': '34912120', 'fpage': '727', 'issue': '7890', 'license': '\n          This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law.\n        ', 'lpage': '730', 'name_0': 'surname:Janjigian;given-names:Yelena Y.', 'name_1': 'surname:Kawazoe;given-names:Akihito', 'name_10': 'surname:Wyrwicz;given-names:Lucjan S.', 'name_11': 'surname:Xu;given-names:Jianming', 'name_12': 'surname:Shitara;given-names:Kohei', 'name_13': 'surname:Qin;given-names:Shukui', 'name_14': 'surnam
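
Because open access is not guaranteed, `json.loads` may be handed a payload that is not a full-text document. The sketch below guards against that, assuming (not verified here) that an unavailable article yields either a non-JSON body or a document without entries. The function name `parse_fulltext_payload` is illustrative.

```python
import json

def parse_fulltext_payload(payload):
    # Return the parsed dictionary, or None if the payload is not a usable document
    try:
        data = json.loads(payload)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return None
    # Expect the same top-level layout as the output shown above
    if not isinstance(data, dict) or not data.get('documents'):
        return None
    return data

print(parse_fulltext_payload(b'{"documents": [{"id": "1"}]}'))
print(parse_fulltext_payload(b'not a JSON body'))
```

Returning `None` lets the caller fall back to the abstract when the full text is not in the open access subset.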

The `fulltext_data` dictionary contains the full text (if the article is in the open access subset). It is a large and deeply nested dictionary from which text can be extracted. The code below is an illustrative example.

In [135]:
text_total = []
for passage in fulltext_data['documents'][0]['passages']:
    # .get avoids a KeyError for passages that carry no section type
    section_type = passage['infons'].get('section_type')
    if section_type in ['TABLE', 'REF']:
        continue
        
    text_paragraph = passage['text']
    text_total.append(text_paragraph)
full_text = '\n'.join(text_total)
print (full_text)

Combined PD-1 and HER2 blockade for HER2-positive gastric cancer
SUMMARY
Human epidermal growth factor receptor 2 (ERBB2/HER2) amplification or overexpression occurs in approximately 20% of advanced gastric or gastroesophageal junction adenocarcinomas. Over a decade ago, combination therapy with the anti–HER2 antibody trastuzumab and chemotherapy became the standard first-line treatment for these patients. Although adding the anti–programmed death 1 (PD-1) antibody pembrolizumab to chemotherapy does not significantly improve efficacy in advanced HER2-negative gastric cancer, there are preclinical and clinical rationales for adding pembrolizumab in HER2-positive disease. Here we describe results of the protocol-specified first interim analysis of the randomized, double-blind, placebo-controlled phase III KEYNOTE-811 study of pembrolizumab plus trastuzumab and chemotherapy for unresectable or metastatic, HER2-positive gastric or gastroesophageal junction adenocarcinoma (ClinicalTrials.go
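
Instead of one flat string, the passages can be grouped by their section type, which keeps the title, abstract and body apart for later processing. The sketch below uses a small hand-written dictionary that mirrors only the keys used above (`documents`, `passages`, `infons`, `section_type`, `text`). The sample values, the `UNKNOWN` fallback and the function name `group_text_by_section` are assumptions for illustration.

```python
from collections import defaultdict

# Hand-written sample mirroring the nesting of fulltext_data; content is made up
sample = {
    'documents': [{
        'passages': [
            {'infons': {'section_type': 'TITLE'}, 'text': 'An example title'},
            {'infons': {'section_type': 'ABSTRACT'}, 'text': 'First abstract paragraph.'},
            {'infons': {'section_type': 'ABSTRACT'}, 'text': 'Second abstract paragraph.'},
            {'infons': {'section_type': 'REF'}, 'text': 'A cited reference'},
            {'infons': {}, 'text': 'Passage without a section type'},
        ]
    }]
}

def group_text_by_section(data, skip=('TABLE', 'REF')):
    # Collect passage texts per section type, skipping tables and references
    grouped = defaultdict(list)
    for document in data.get('documents', []):
        for passage in document.get('passages', []):
            section = passage.get('infons', {}).get('section_type', 'UNKNOWN')
            if section in skip:
                continue
            grouped[section].append(passage.get('text', ''))
    return {section: '\n'.join(parts) for section, parts in grouped.items()}

sections = group_text_by_section(sample)
print(sections['ABSTRACT'])
```

With the real `fulltext_data`, the same call would return one text block per section type present in the article.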