# Web Scraping on PubMed

In this project, we scrape scientific papers on [PubMed](https://pubmed.ncbi.nlm.nih.gov/). PubMed is a free database of over 30 million scientific references and abstracts on biomedical and life sciences topics. It is managed by the United States National Library of Medicine and the National Institutes of Health.

The scraping was performed in March 2020 using BeautifulSoup4 and Selenium. Because the structure and underlying HTML of websites change frequently with upgrades, especially for popular and heavily used sites such as PubMed, I cannot guarantee that this code will work without modification in the future.

The script scrapes the [search page](https://www.ncbi.nlm.nih.gov/pubmed?term=Study%5BText%20Word%5D) for the text word *'study'*. At the time of scraping, this search yielded almost 10 million references. Due to limited resources, we scraped only 1300 pages, or 26000 references. Each reference contains the title, the author list, the journal in which it was published and the date of publication. Because scientific journals charge access fees for their publications, I was not able to download the full text of each paper for free; only the abstract and list of keywords are included in each reference. Each reference is also uniquely identified by its `PMID`, or PubMed ID.

In [1]:
import bs4
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import requests
import csv
import pandas as pd
from datetime import datetime

In [2]:
# helper functions
def write_html(s,link):
    # save raw page source to a file; the file is closed even if the write fails
    with open(link, "w") as f:
        f.write(s)

def write_csv_list(data,link):
    with open(link, 'w', newline='') as csvfile:
        fieldnames = ['Index','Titles']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        index = 1
        for row in data:
            writer.writerow({'Index': index, 'Titles': row})
            index+=1

def write_csv_df(df,link):
    df.to_csv(link, index=False)

Scraping is performed through function `crawl`, yielding a dataframe with 8 columns:

   * `Titles`: Reference title
   * `Authors`: List of co-authors
   * `Journal`: Scientific journal in which reference is published
   * `Date`: Date of official publication
   * `PMID`: PubMed identification number
   * `Free Article`: Boolean indicating whether the article is available for free
   * `Abstract`: Summary of the paper
   * `Keywords`: List of relevant keywords (may be empty)
    
The `crawl` function calls two other functions: `num_pages`, which scrapes the total number of result pages for the search, and `get_manuscripts`, which extracts the references from each result page.

In [3]:
def crawl(url, page_start=1, page_limit=None, driver=None):
    page_sources = list()
    manuscripts = list()
    column_names = ['Titles','Authors','Journal','Date','PMID','Free Article','Abstract','Keywords']

    # headless chrome driver; chrome_link must point to a local chromedriver binary
    if driver is None:
        options = webdriver.ChromeOptions()
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--headless')
        chrome_link = '/Users/chauvu/Documents/Chau/DataScience/bin/chromedriver'
        driver = webdriver.Chrome(chrome_link, options=options)

    driver.get(url) # first page

    # starting page not 1: for debugging
    if(page_start > 1):
        buttons = driver.find_elements_by_id("pageno")
        buttons[0].clear()
        buttons[0].send_keys(str(page_start))
        buttons[0].send_keys(Keys.RETURN)
    
    # scrape first page
    first_page = driver.page_source
    page_sources.append(first_page)
    soup = bs4.BeautifulSoup(first_page, 'lxml')
    # page_limit, when given, caps the last page number to visit
    page_count = num_pages(soup)
    number_pages = page_count if page_limit is None else min(page_count, page_limit)

    # click "Next" to load each subsequent page and store its source
    for page in range(page_start, number_pages):
        buttons = driver.find_elements_by_partial_link_text("Next ")
        buttons[0].click()
        page_sources.append(driver.page_source)

    # scrape each page
    for page in page_sources:
        soup = bs4.BeautifulSoup(page, 'lxml')
        manuscripts += get_manuscripts(soup)

    manuscripts_df = pd.DataFrame(manuscripts, columns=column_names)
    write_csv_df(manuscripts_df, '../Data/pubmed/manuscripts.csv')
    return manuscripts_df
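
The loop bounds in `crawl` are easy to misread, so here is a minimal sketch of which pages get collected. The helper `pages_to_visit` is hypothetical (it is not part of the script), and the page count of 500000 is only an assumption derived from roughly 10 million results at 20 per page. It mirrors the logic above: the starting page is always stored, and `page_limit` acts as the last page number rather than as a number of pages.

```python
def pages_to_visit(page_count, page_start=1, page_limit=None):
    """Return the page numbers a crawl with these arguments would collect.

    Hypothetical helper mirroring the loop bounds in `crawl`:
    `page_limit` caps the last page number (not the number of pages),
    and one "Next" click is needed for every page after `page_start`.
    """
    last_page = page_count if page_limit is None else min(page_count, page_limit)
    # the starting page is always collected, even if it lies beyond last_page
    return [page_start] + list(range(page_start + 1, last_page + 1))


pages = pages_to_visit(500000, page_limit=1300)
print(len(pages), len(pages) * 20)  # pages collected, references at 20 per page
print(pages_to_visit(500000, page_start=5, page_limit=10))
```

With `page_start=5` and `page_limit=10` the crawl covers pages 5 through 10, that is six pages, which is worth keeping in mind when resuming a run for debugging.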

Function `num_pages` scrapes the search page for the total number of pages generated. Each search page contains 20 results.

In [4]:
def num_pages(soup):
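    tag_page = soup.findAll("h3", {"class": "page"})
    # the last child of the last pager heading holds the text with the page total
    try: str_pages = tag_page[-1].contents[-1]
    except IndexError: return 1 # no pager found: only 1 page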
    number_pages = [int(p) for p in str_pages.split() if p.isdigit()]
    return number_pages[0]
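
The digit extraction in `num_pages` keeps only whitespace-separated tokens made up entirely of digits, so a total written with a thousands separator (for example `1,300`) would be dropped. Below is a slightly more forgiving sketch using only the standard library. The function name `parse_page_count` and the sample pager strings are assumptions for illustration; I have not checked them against the current PubMed markup.

```python
import re


def parse_page_count(pager_text):
    """Extract the total page count from pager text such as ' of 1300'.

    Hypothetical variant of the parsing step in `num_pages`; it takes the
    last number in the text and falls back to 1 when none is found.
    """
    # drop thousands separators so '1,300' is read as 1300
    numbers = re.findall(r"\d+", pager_text.replace(",", ""))
    return int(numbers[-1]) if numbers else 1


print(parse_page_count(" of 1300"))
print(parse_page_count("Page 1 of 1,300"))
print(parse_page_count(""))
```

Taking the last number rather than the first is a design choice: it also works if the text includes the current page number before the total.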

Function `get_manuscripts` receives `soup`, a BeautifulSoup object, as input. This is the parsed HTML of one search page, containing 20 result tags. Each tag is parsed to obtain the details of one manuscript.

In [5]:
def get_manuscripts(soup):
    tags = soup.findAll("div", {"class": "rprt"})
    manuscripts = list()
    for tag in tags:
        manuscript = list()
        # get title
        title = tag.findAll("p", {"class": "title"})[0].get_text()
        manuscript.append(title)
        
        # get author
        author = tag.findAll("p", {"class": "desc"})[0].get_text()
        manuscript.append(author)
        
        # get journal details
        details = tag.findAll("p", {"class": "details"})[0].get_text().replace(';','.').replace(':','.').replace('-','.').split('.')
        journal = details[0]
        manuscript.append(journal)
        date = details[1].strip()
        # date is 'YYYY Mon DD', 'YYYY Mon' or 'YYYY'; keep only the year
        n_parts = len(date.split())
        date_format = '%Y %b %d' if n_parts > 2 else '%Y %b' if n_parts > 1 else '%Y'
        try:
            manuscript.append(datetime.strptime(date, date_format).strftime('%Y'))
        except ValueError: manuscript.append(0) # unable to read date, temp = 0
        
        # get PMID
        pmid = tag.find("dd")
        manuscript.append(pmid.get_text())
        
        # get free status
        free = tag.find("a", {"class": "status_icon nohighlight"})
        manuscript.append(free is not None)
        
        # abstract: looked up inside this result's tag; empty string if it has none
        abstract_info = tag.findAll("div", {"class": "abstr"})
        try: abstract = abstract_info[0].findAll("div", {"class": ""})[0].get_text() if len(abstract_info)>0 else ''
        except IndexError: abstract = ''
        manuscript.append(abstract)

        # keywords: looked up inside this result's tag; empty string if it has none
        keywords_info = tag.findAll("div", {"class": "keywords"})
        keywords = keywords_info[0].findAll("p")[0].get_text() if len(keywords_info)>0 else ''
        manuscript.append(keywords)

        manuscripts.append(manuscript)
    return manuscripts
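
The journal and date extraction in `get_manuscripts` is the most fragile step, so here is a standalone sketch of the same idea that can be tried without a browser. The helper names `split_details` and `parse_year` and the sample citation strings are made up for illustration; the real citation lines on PubMed may be formatted differently.

```python
from datetime import datetime


def split_details(details_text):
    """Split a citation line into (journal, raw date string)."""
    # same normalisation as get_manuscripts: treat ; : - as field breaks
    for sep in ";:-":
        details_text = details_text.replace(sep, ".")
    parts = details_text.split(".")
    journal = parts[0].strip()
    date = parts[1].strip() if len(parts) > 1 else ""
    return journal, date


def parse_year(date):
    """Return the year as a string, or 0 when the date cannot be read."""
    # try the most specific format first, then fall back
    for fmt in ("%Y %b %d", "%Y %b", "%Y"):
        try:
            return datetime.strptime(date, fmt).strftime("%Y")
        except ValueError:
            continue
    return 0


# made-up citation line in the assumed 'Journal. Date;Volume(Issue):Pages.' layout
journal, date = split_details("J Example. 2020 Mar 7;12(3):45-67.")
print(journal, "|", date, "|", parse_year(date))

# a hyphen in the journal name is also treated as a field break
print(split_details("Hyphen-Journal. 2020 Mar 7;1(1):1-2."))
```

The second call shows a limitation of splitting on punctuation: the journal name is cut at the hyphen and the date field no longer holds a date, so the year falls back to `0`. The same would happen for a journal name containing a period.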

## Results of web scrape

In [6]:
url = 'https://www.ncbi.nlm.nih.gov/pubmed?term=Study%5BText%20Word%5D'
manuscripts = crawl(url, page_limit=1300)
manuscripts = pd.read_csv('../Data/pubmed/manuscripts.csv')

We scraped 1300 pages of search results for the text word 'study' and collected 26000 manuscript references. Each reference includes the title, author list, publication journal and date, abstract and keywords. All 26000 manuscripts were published in 2020. Because they were published so recently, journal embargoes likely still apply, which would explain why the articles are not available for free on PubMed.

In [7]:
manuscripts.head()

|  | Titles | Authors | Journal | Date | PMID | Free Article | Abstract | Keywords |
|---|---|---|---|---|---|---|---|---|
| 0 | Characteristics of the isocitrate dehydrogenas... | Qu CX, Ji HM, Shi XC, Bi H, Zhai LQ, Han DW. | Brain Behav | 2020 | 32146731 | False | OBJECTIVES: To explore the characteristics of ... | Chinese gliomas; IDH mutation; TERT promoter m... |
| 1 | Demographics, Natural History and Treatment Ou... | Savage P, Winter M, Parker V, Harding V, Sita-... | BJOG | 2020 | 32146729 | False | OBJECTIVE: To investigate the demographics, na... | Choriocarcinoma; chemotherapy; demographics; i... |
| 2 | Integrated seed proteome and phosphoproteome a... | Sinha A, Haider T, Narula K, Ghosh S, Chakrabo... | Proteomics | 2020 | 32146728 | False | Nutrient dynamics in storage organs is a compl... | 2DE; chickpea; mass spectrometry; nutrient; pr... |
| 3 | Is R(+)-Baclofen the best option for the futur... | Echeverry-Alzate V, Jeanblanc J, Sauton P, Blo... | Addict Biol | 2020 | 32146727 | False | For several decades, studies conducted to eval... | GABAB receptor; R(+)-Baclofen; RS(±)-Baclofen;... |
| 4 | Association between the dimensions of the maxi... | Zhang B, Wei Y, Cao J, Xu T, Zhen M, Yang G, C... | J Periodontol | 2020 | 32146722 | False | BACKGROUND: The information of the association... | Cone-beam computed tomography; molars; mucosal... |


In [8]:
manuscripts.tail()

|  | Titles | Authors | Journal | Date | PMID | Free Article | Abstract | Keywords |
|---|---|---|---|---|---|---|---|---|
| 25995 | Uncovering Multidimensional Poverty Experience... | Zhang L, Han WJ. | Fam Process | 2020 | 32097500 | False |  | ECLS-K; Externalizing behaviors; Internalizing... |
| 25996 | Dietary advanced glycation end products and th... | Peterson LL, Park S, Park Y, Colditz GA, Anbar... | Cancer | 2020 | 32097496 | False | BACKGROUND: Advanced glycation end products (A... | advanced glycation end products; breast cancer... |
| 25997 | How community therapists describe adapting evi... | Kim JJ, Brookman-Frazee L, Barnett ML, Tran M,... | J Community Psychol | 2020 | 32097494 | False | The study sought to (a) characterize the types... |  |
| 25998 | Detection of prey odors underpins dietary spec... | Aidan Manubay J, Powell S. | J Anim Ecol | 2020 | 32097493 | False | Deciphering the mechanisms that underpin dieta... | \nEciton\n; Army Ants; Coexistence; Diet; Ecol... |
| 25999 | Statin use and risk of joint replacement due t... | Sarmanova A, Doherty M, Kuo C, Wei J, Abhishek... | Rheumatology (Oxford) | 2020 | 32097491 | False | OBJECTIVE: Statins are reported to have a pote... | TJR; TKR; cohort study; joint replacement; ost... |


Let's take a look at the first manuscript reference. This manuscript investigates mutations in the isocitrate dehydrogenase (IDH) gene, which encodes an enzyme, and in the telomerase reverse transcriptase (TERT) promoter in gliomas in a cohort of Chinese patients. It is published in the journal *Brain and Behavior* (abbreviated `Brain Behav`). The manuscript provides 6 keywords, which give a quick overview of the article even without reading the abstract.

   * `Chinese gliomas`: indicates the study concerns gliomas (a type of brain tumor) in Chinese patients
   * `IDH mutation`: isocitrate dehydrogenase mutation
   * `TERT promoter mutation`: mutation in the promoter of the telomerase reverse transcriptase gene, the regulatory region that controls its transcription
   * `Mutation frequencies`: how often each mutation occurs in the cohort
   * `Overall survival analysis`: relationship between mutation and survival
   * `Sanger sequencing`: technique to sequence the isocitrate dehydrogenase gene

In [9]:
m = manuscripts.iloc[0]
print(m)
print('\nTitle: {}'.format(m['Titles']))
print('\nKeywords: {}'.format(m['Keywords']))
print('\nAbstract: {}'.format(m['Abstract']))

Titles          Characteristics of the isocitrate dehydrogenas...
Authors              Qu CX, Ji HM, Shi XC, Bi H, Zhai LQ, Han DW.
Journal                                               Brain Behav
Date                                                         2020
PMID                                                     32146731
Free Article                                                False
Abstract        OBJECTIVES: To explore the characteristics of ...
Keywords        Chinese gliomas; IDH mutation; TERT promoter m...
Name: 0, dtype: object

Title: Characteristics of the isocitrate dehydrogenase gene and telomerase reverse transcriptase promoter mutations in gliomas in Chinese patients.

Keywords: Chinese gliomas; IDH mutation; TERT promoter mutation; mutation frequencies; overall survival analysis; sanger sequencing

Abstract: OBJECTIVES: To explore the characteristics of IDH and TERT promoter mutations in gliomas in Chinese patients.METHODS: A total of 124 Chinese patients with g
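
The `Keywords` field printed above is a single semicolon-separated string, and some entries in the dataset carry stray whitespace (row 25998 starts with `\nEciton\n`). A minimal sketch for turning the field into a clean list follows; the helper name `split_keywords` is hypothetical. It also assumes that empty cells come back from `pd.read_csv` as `NaN` rather than as empty strings, which is the pandas default.

```python
def split_keywords(keywords):
    """Turn a semicolon-separated keyword string into a clean list."""
    if not isinstance(keywords, str):
        return []  # e.g. NaN read back from the CSV for an empty cell
    # strip whitespace and newlines around each keyword, skip empty pieces
    return [k.strip() for k in keywords.split(";") if k.strip()]


first = split_keywords(
    "Chinese gliomas; IDH mutation; TERT promoter mutation; "
    "mutation frequencies; overall survival analysis; sanger sequencing"
)
print(len(first), first)
print(split_keywords("\nEciton\n; Army Ants; Coexistence"))
```

Applied to the whole column, this could be written as `manuscripts['Keywords'].apply(split_keywords)`.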

Overall, 26000 manuscript references were scraped. The dataset is available for viewing and download in our `Data/pubmed/` folder. Due to limited resources, we were only able to scrape a small fraction of the available manuscripts. In principle, the script could be run in the future to scrape all of the roughly 30 million references available through PubMed. The script scrapes HTML with BeautifulSoup rather than using an API; PubMed has an [API](https://www.ncbi.nlm.nih.gov/home/develop/api/) for access to the database and tools, which could potentially make the process easier and more efficient.
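
As a starting point for the API route, here is a hedged sketch that only builds a search URL for the NCBI E-utilities `esearch` endpoint; it does not send any request. This is not what the script above does. The function name `build_esearch_url` is my own, and the parameter values (20 results per request, JSON output) are assumptions chosen to match the page size used in this project.

```python
from urllib.parse import urlencode

# E-utilities search endpoint; returns matching PMIDs for a query
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retstart=0, retmax=20):
    """Build an ESearch request URL for PubMed (no request is sent here)."""
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # query, e.g. 'study[Text Word]'
        "retstart": retstart,  # index of the first result to return
        "retmax": retmax,      # number of PMIDs per request
        "retmode": "json",     # ask for a JSON response
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


# first "page" of 20 results, then the second
print(build_esearch_url("study[Text Word]"))
print(build_esearch_url("study[Text Word]", retstart=20))
```

Paging through results this way replaces the "Next" button clicks with a `retstart` offset. NCBI publishes usage guidelines for these endpoints, so any real crawl should check them before sending requests in bulk.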