# Pubmed scraper

27 05 22

---

## Description

Given a set of search terms, find all the articles that are matched from the PubMed database. Useful for creating a training set of text for biomedical NLP.


## Steps

### 1: Find all matches to a given search result

Given a search string scrape the pages that contain the search results for this search. This step retrieves the title and URL for each of the articles

### 2: Scrape the abstract for each article

Get the abstract for each of the articles that have been found in step 1.

### 3. Return a dataframe with the results

Put it all into a nice data frame with each row containing text from an abstract.

In [4]:
import requests
import pandas as pd
from lxml import html

## LXML paths to extract links, titles and abstracts

In [5]:
path_link = './/div[@class="docsum-wrap"]/div/a/@href'
path_title = './/div[@class="docsum-wrap"]/div/a/text()'
path_abstract = './/div[@class="abstract-content selected"]/p/text()'

## 1. Scrape main pages to get links

In [6]:
# extract relevant links
url_base = 'https://pubmed.ncbi.nlm.nih.gov'

In [7]:
search_text = 'clinical trials'.replace(' ', '+')

In [8]:
url = f'https://pubmed.ncbi.nlm.nih.gov/?term={search_text}?page='

In [9]:
urls = [f'{url}{n}' for n in range(0, 100)]

links = []
titles = []

for url in urls:
    print('Scraping: ' + url)
    page = requests.get(url)
    tree = html.fromstring(page.text.replace('<b>', '').replace('</b>', ''))
    titles_page = [title.strip() for title in tree.xpath(path_title)]
    links_page = [url_base + link for link in tree.xpath(path_link)]
    links += links_page
    titles += titles_page

Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=0
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=1
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=2
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=3
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=4
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=5
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=6
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=7
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=8
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=9
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=10
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=11
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=12
Scraping: https://pubmed.ncbi.nlm.nih.gov/?term=clinical+trials?page=13
Sc

## 2. Scrape all the links of articles

In [10]:
abstracts = []

for url in links:
    print(url)
    page = requests.get(url)
    tree = html.fromstring(page.text)
    paragraph = tree.xpath(path_abstract)
    abstracts.append(paragraph)

https://pubmed.ncbi.nlm.nih.gov/32101663/
https://pubmed.ncbi.nlm.nih.gov/28885881/
https://pubmed.ncbi.nlm.nih.gov/25157702/
https://pubmed.ncbi.nlm.nih.gov/27283590/
https://pubmed.ncbi.nlm.nih.gov/30580575/
https://pubmed.ncbi.nlm.nih.gov/30280658/
https://pubmed.ncbi.nlm.nih.gov/27283591/
https://pubmed.ncbi.nlm.nih.gov/27552521/
https://pubmed.ncbi.nlm.nih.gov/31813166/
https://pubmed.ncbi.nlm.nih.gov/26884379/
https://pubmed.ncbi.nlm.nih.gov/32101663/
https://pubmed.ncbi.nlm.nih.gov/28885881/
https://pubmed.ncbi.nlm.nih.gov/30280658/
https://pubmed.ncbi.nlm.nih.gov/27283590/
https://pubmed.ncbi.nlm.nih.gov/30580575/
https://pubmed.ncbi.nlm.nih.gov/21518313/
https://pubmed.ncbi.nlm.nih.gov/33935593/
https://pubmed.ncbi.nlm.nih.gov/27283591/
https://pubmed.ncbi.nlm.nih.gov/31813166/
https://pubmed.ncbi.nlm.nih.gov/27552521/
https://pubmed.ncbi.nlm.nih.gov/32101663/
https://pubmed.ncbi.nlm.nih.gov/28885881/
https://pubmed.ncbi.nlm.nih.gov/28700715/
https://pubmed.ncbi.nlm.nih.gov/27

https://pubmed.ncbi.nlm.nih.gov/27552521/
https://pubmed.ncbi.nlm.nih.gov/25271097/
https://pubmed.ncbi.nlm.nih.gov/19762075/
https://pubmed.ncbi.nlm.nih.gov/33838758/
https://pubmed.ncbi.nlm.nih.gov/32101663/
https://pubmed.ncbi.nlm.nih.gov/33933206/
https://pubmed.ncbi.nlm.nih.gov/33307546/
https://pubmed.ncbi.nlm.nih.gov/32876697/
https://pubmed.ncbi.nlm.nih.gov/28885881/
https://pubmed.ncbi.nlm.nih.gov/30415628/
https://pubmed.ncbi.nlm.nih.gov/34351722/
https://pubmed.ncbi.nlm.nih.gov/34371522/
https://pubmed.ncbi.nlm.nih.gov/30280658/
https://pubmed.ncbi.nlm.nih.gov/31462531/
https://pubmed.ncbi.nlm.nih.gov/32678530/
https://pubmed.ncbi.nlm.nih.gov/33933206/
https://pubmed.ncbi.nlm.nih.gov/32876697/
https://pubmed.ncbi.nlm.nih.gov/25157702/
https://pubmed.ncbi.nlm.nih.gov/27283590/
https://pubmed.ncbi.nlm.nih.gov/30415628/
https://pubmed.ncbi.nlm.nih.gov/8918275/
https://pubmed.ncbi.nlm.nih.gov/30280658/
https://pubmed.ncbi.nlm.nih.gov/27552521/
https://pubmed.ncbi.nlm.nih.gov/272

https://pubmed.ncbi.nlm.nih.gov/25157702/
https://pubmed.ncbi.nlm.nih.gov/30580575/
https://pubmed.ncbi.nlm.nih.gov/34351721/
https://pubmed.ncbi.nlm.nih.gov/28844192/
https://pubmed.ncbi.nlm.nih.gov/27283591/
https://pubmed.ncbi.nlm.nih.gov/27552521/
https://pubmed.ncbi.nlm.nih.gov/25271097/
https://pubmed.ncbi.nlm.nih.gov/33838758/
https://pubmed.ncbi.nlm.nih.gov/32678530/
https://pubmed.ncbi.nlm.nih.gov/27283590/
https://pubmed.ncbi.nlm.nih.gov/30415628/
https://pubmed.ncbi.nlm.nih.gov/28845751/
https://pubmed.ncbi.nlm.nih.gov/28844192/
https://pubmed.ncbi.nlm.nih.gov/30280658/
https://pubmed.ncbi.nlm.nih.gov/27552521/
https://pubmed.ncbi.nlm.nih.gov/24820247/
https://pubmed.ncbi.nlm.nih.gov/19762075/
https://pubmed.ncbi.nlm.nih.gov/34463700/
https://pubmed.ncbi.nlm.nih.gov/33933206/
https://pubmed.ncbi.nlm.nih.gov/28885881/
https://pubmed.ncbi.nlm.nih.gov/30280658/
https://pubmed.ncbi.nlm.nih.gov/33838758/
https://pubmed.ncbi.nlm.nih.gov/31562798/
https://pubmed.ncbi.nlm.nih.gov/30

https://pubmed.ncbi.nlm.nih.gov/28902590/
https://pubmed.ncbi.nlm.nih.gov/34463700/
https://pubmed.ncbi.nlm.nih.gov/27283590/
https://pubmed.ncbi.nlm.nih.gov/32876697/
https://pubmed.ncbi.nlm.nih.gov/30580575/
https://pubmed.ncbi.nlm.nih.gov/27283591/
https://pubmed.ncbi.nlm.nih.gov/27552521/
https://pubmed.ncbi.nlm.nih.gov/24820247/
https://pubmed.ncbi.nlm.nih.gov/23543580/
https://pubmed.ncbi.nlm.nih.gov/27509100/
https://pubmed.ncbi.nlm.nih.gov/30707445/
https://pubmed.ncbi.nlm.nih.gov/28902590/
https://pubmed.ncbi.nlm.nih.gov/27283590/
https://pubmed.ncbi.nlm.nih.gov/30280658/
https://pubmed.ncbi.nlm.nih.gov/33631065/
https://pubmed.ncbi.nlm.nih.gov/27283591/
https://pubmed.ncbi.nlm.nih.gov/31483963/
https://pubmed.ncbi.nlm.nih.gov/31562798/
https://pubmed.ncbi.nlm.nih.gov/34251506/
https://pubmed.ncbi.nlm.nih.gov/23245607/
https://pubmed.ncbi.nlm.nih.gov/34496632/
https://pubmed.ncbi.nlm.nih.gov/33757798/
https://pubmed.ncbi.nlm.nih.gov/34351722/
https://pubmed.ncbi.nlm.nih.gov/33

https://pubmed.ncbi.nlm.nih.gov/32227756/
https://pubmed.ncbi.nlm.nih.gov/34463700/
https://pubmed.ncbi.nlm.nih.gov/31557429/
https://pubmed.ncbi.nlm.nih.gov/34546300/
https://pubmed.ncbi.nlm.nih.gov/30700403/
https://pubmed.ncbi.nlm.nih.gov/28902590/
https://pubmed.ncbi.nlm.nih.gov/32101663/
https://pubmed.ncbi.nlm.nih.gov/32678530/
https://pubmed.ncbi.nlm.nih.gov/32955177/
https://pubmed.ncbi.nlm.nih.gov/33933206/
https://pubmed.ncbi.nlm.nih.gov/32876697/
https://pubmed.ncbi.nlm.nih.gov/28885881/
https://pubmed.ncbi.nlm.nih.gov/30415628/
https://pubmed.ncbi.nlm.nih.gov/34371522/
https://pubmed.ncbi.nlm.nih.gov/30280658/
https://pubmed.ncbi.nlm.nih.gov/28845751/
https://pubmed.ncbi.nlm.nih.gov/32678530/
https://pubmed.ncbi.nlm.nih.gov/32222134/
https://pubmed.ncbi.nlm.nih.gov/27283591/
https://pubmed.ncbi.nlm.nih.gov/21991949/
https://pubmed.ncbi.nlm.nih.gov/35139274/
https://pubmed.ncbi.nlm.nih.gov/28118559/
https://pubmed.ncbi.nlm.nih.gov/34588162/
https://pubmed.ncbi.nlm.nih.gov/33

https://pubmed.ncbi.nlm.nih.gov/32955177/
https://pubmed.ncbi.nlm.nih.gov/33307546/
https://pubmed.ncbi.nlm.nih.gov/32876697/
https://pubmed.ncbi.nlm.nih.gov/25157702/
https://pubmed.ncbi.nlm.nih.gov/34351722/
https://pubmed.ncbi.nlm.nih.gov/28845751/
https://pubmed.ncbi.nlm.nih.gov/30280658/
https://pubmed.ncbi.nlm.nih.gov/33631065/
https://pubmed.ncbi.nlm.nih.gov/25271097/
https://pubmed.ncbi.nlm.nih.gov/34000257/


## 3. Combine all the abstracts and write the output

In [11]:
# remove spaces and 
texts = [text.strip() for abstract in abstracts for text in abstract if text.strip() != '']

In [12]:
# write the data
df = pd.DataFrame(data={'text': texts})