# Parse PubMed data in order to get plain text of relevant papers

Through the URLs below, PubMed data can be obtained interactively. The idea here is to select papers based on keywords (and perhaps dates, ...) in a first query that results in PMC IDs. After that, the full text of those publications can be obtained one-by-one.

These texts can be analyzed by scispaCy language models that include POS tagging for biological/scienti

In [26]:
import requests
import pandas as pd
import xml.etree.ElementTree as et
import regex as re

In [41]:
# Get the search results as XML (the URL should eventually result from a query)
response = requests.get("https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=covid-19")

data = response.content.decode()
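
The search URL above hard-codes a single keyword. A minimal sketch of assembling such a URL from keywords and an optional date range follows. The helper name `build_search_url` and its parameters are hypothetical; the `FIRST_PDATE:[... TO ...]` filter and the `format` and `pageSize` parameters reflect my understanding of the Europe PMC REST interface and should be checked against its documentation before use.

```python
from urllib.parse import urlencode

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_search_url(keywords, start=None, end=None, page_size=25):
    # Hypothetical helper: combine the keywords with AND and, optionally,
    # restrict by first publication date (assumed Europe PMC query syntax).
    query = " AND ".join(f'"{k}"' for k in keywords)
    if start and end:
        query += f" AND FIRST_PDATE:[{start} TO {end}]"
    params = {"query": query, "format": "xml", "pageSize": page_size}
    # urlencode takes care of escaping quotes, spaces and brackets.
    return f"{BASE}?{urlencode(params)}"

url = build_search_url(["covid-19", "vaccine"], "2021-01-01", "2021-12-31")
print(url)
```

The resulting string can be passed to `requests.get` in place of the literal URL used above.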

In [42]:
# Parse XML and extract sensible tags
root = et.XML(data)

IDs = []
for pmc in root.iter("pmcid"):
    IDs.append(pmc.text)

print(IDs)

len(IDs)

['PMC8607916', 'PMC8562401', 'PMC8563347', 'PMC8590950', 'PMC8575090', 'PMC8563592', 'PMC8493542', 'PMC8610737', 'PMC8498363', 'PMC8571104', 'PMC8610734', 'PMC8495000', 'PMC8558759', 'PMC8613003', 'PMC8612724', 'PMC8600803', 'PMC8476974', 'PMC8595926', 'PMC8590489', 'PMC8612713', 'PMC7801096', 'PMC8479323', 'PMC7784826']


23

In [12]:
# get full text per paper
# process that
# NLP that with spacy, after loading language model


In [99]:
URL = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{IDs[5]}/fullTextXML"

response = requests.get(URL)
data = response.content.decode()
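
The cell above fetches a single paper. To obtain the texts one-by-one for all IDs, a loop along the following lines could be used. This is only a sketch: `full_text_url`, `fetch_full_texts` and the `fetch` and `pause` parameters are hypothetical names, and the pause between requests is a courtesy assumption, not a documented requirement. Papers without an available full text may return an HTTP error, which this sketch does not handle.

```python
import time

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest"

def full_text_url(pmcid):
    # Same URL pattern as in the cell above.
    return f"{BASE}/{pmcid}/fullTextXML"

def fetch_full_texts(pmcids, fetch=None, pause=0.0):
    # `fetch` maps a URL to an XML string; by default it uses requests.
    if fetch is None:
        import requests

        def fetch(url):
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # fail loudly on HTTP errors
            return response.content.decode()

    texts = {}
    for pmcid in pmcids:
        texts[pmcid] = fetch(full_text_url(pmcid))
        time.sleep(pause)  # optional delay between consecutive requests
    return texts
```

With the list from the search step this would be called as `texts = fetch_full_texts(IDs, pause=0.5)`, giving a dictionary from PMC ID to raw XML.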

# Full XML parsed 'manually'

Given the `fullTextXML`, we do the following:
- Extract only the body: everything between `<body>` and `</body>`, which runs from the introduction to just before the acknowledgements (in the cases I've seen).
- Remove tags but keep their content for some specific tags, e.g. `italic`, `bold`, `sup`, `sub`, `underline`, `title`, ...
- Remove some other tags including their full content, e.g. `xref`.
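
The difference between the two kinds of removal in the list above can be seen on a toy string. The snippet is made up for illustration and is not taken from a real paper.

```python
import re

snippet = '<p>Cells were <italic>washed</italic> twice <xref rid="r1">[1]</xref>.</p>'

# Kind 1: drop only the tag itself and keep what is inside it.
kept = re.sub(r'</?italic(?:\s[^>]*)?>', '', snippet)

# Kind 2: drop the tag together with everything inside it.
dropped = re.sub(r'<xref(?:\s[^>]*)?>[\s\S]*?</xref>', '', kept)

print(kept)
print(dropped)
```

After the first substitution the word "washed" survives without its markup; after the second, the citation marker `[1]` is gone along with its `xref` tag.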

In [100]:
tag_remove = ['italic', 'bold', 'sup', 'sub', 'underline', 'title', 'sec', 'p',
             'list', 'list-item']
total_remove = ['xref', 'table-wrap', 'fig', 'label']

def extract_clean(text, tag_remove=tag_remove, total_remove=total_remove):
    # Takes an XML string and returns the text between <body> and </body> where:
    # - every tag in tag_remove is stripped, but its content is kept,
    # - every tag in total_remove is dropped together with its content,
    # - hexadecimal character references such as &#x0003e; are removed,
    # - line breaks are replaced by spaces.
    # Opening tags may carry attributes, so they are matched with a regex:
    # e.g. '<xref(?:\s[^>]*)?>' also matches tags like <xref rid="ppat.100...>
    
    assert isinstance(text, str)
    
    if '<body>' not in text: 
        print("Can't find the body, returning nothing!")
        return ''
    
    body_pos = text.find("<body>") + len("<body>")
    end_pos = text.find("</body>")
    
    text = text[body_pos:end_pos]
    
    for t in tag_remove:
        # Raw f-strings keep the regex escapes intact; (?:\s[^>]*)? stops
        # e.g. 'p' from also matching other tags that merely start with 'p'.
        text = re.sub(rf'<{t}(?:\s[^>]*)?>', ' ', text)
        text = re.sub(rf'</{t}>', ' ', text)
        
    for t in total_remove:
        text = re.sub(rf'<{t}(?:\s[^>]*)?>[\s\S]*?</{t}>', ' ', text)
        
    text = re.sub(r'&#x[0-9A-Fa-f]+;', ' ', text)
    text = text.replace('\n', ' ')
    
    return text
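
Removing character references outright also removes characters that carry meaning, such as apostrophes or comparison signs. An alternative, which is not what this notebook does, is to decode them with `html.unescape` from the standard library. The example string below is made up for illustration.

```python
import html

raw = "HLC&#x02019;s highlights, p &#x0003c; 0.05"

# Decode numeric character references into the characters they stand for.
decoded = html.unescape(raw)

print(decoded)
```

If this route is taken, decoding should probably happen after all tags have been removed, since a decoded `<` or `>` could otherwise be mistaken for markup by the tag-removal patterns.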

In [101]:
result = extract_clean(data)

In [102]:
result


' In March 2020, our world entered the COVID-19 era [ ], and we have all since been experiencing waves of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic [ ]. The effect of COVID-19 on our populations and health care systems has been, and continues to be, significant. The global pandemic has also affected scientific medical publishing, with an acknowledged disrupting effect, including a faster track to peer review and on-line publication for COVID-19 submissions but with stress on the peer review and publication of non-COVID-19 research [ ]. At the recent virtual 69 th  Annual Scientific Meeting of the Cardiac Society of Australia and New Zealand (CSANZ),  Heart, Lung and Circulation  (HLC) presented a joint, invited session with the  European Heart Journal  about their respective journal highlights in 2020, chaired by HLC s Editor-in-Chief Professor A. Robert Denniss. This editorial presents an overview of HLC s highlights in the 2020 volume (Volume 29), incl