## Scraping data from PubMed using Selenium and BeautifulSoup

In this notebook I will be scraping data from [PubMed](https://pubmed.ncbi.nlm.nih.gov/about/) - a free resource for search and retrieval of medical and life sciences literature. This is usually the first stop for a researcher interested in a specific topic. The aim is to collect as much information as possible about articles that show up in the query results of a specific topic. 

My topic of choice is 'stem cell therapies'. I chose this topic because I have been working at a biotechnology company that supplies materials for stem cell researchers. The products I have been working on are for pulmonary (lung) researchers. I wanted to use this opportunity to step back and understand what the field looks like from afar.

In [1]:
# Import necessary packages
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import shutil
import time
import numpy as np
import pandas as pd
import re
import random

### Query 'Stem Cell Therapies' Using Selenium

[Selenium](https://www.selenium.dev/documentation/en/) is a tool to automate web browsing. You can open a browser, go to a specific url and type and click away. This is me figuring out the basics by opening PubMed's home page and entering the search term 'stem cell therapies':

In [2]:
# Set up driver
driver = webdriver.Chrome()

# Connect to pubmed
url = 'https://pubmed.ncbi.nlm.nih.gov/'
driver.get(url)

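# Locate the search bar from html
# (the find_element_by_* helpers were removed in newer Selenium 4 releases;
# find_element with an explicit strategy works in both Selenium 3 and 4)
search_bar = driver.find_element('xpath', '//*[@id="id_term"]')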

# Input search term into the search bar
search_term = 'stem cell therapies'
search_bar.send_keys(search_term) # I can type into the search bar!
search_bar.send_keys(Keys.RETURN) # And press ENTER to search

#### Search result
Scrolling to the bottom of the page, we see that there are 218,709 results (publications) about 'stem cell therapies'! 


The default setting for PubMed sorts the results by 'Best match'. Each result is shown with a linked title, authors, journal title and doi, PMID (PubMed's internal ID for the publication), publication type, and a snippet of its abstract. The titles are linked to a page within PubMed with more comprehensive information about the publication. 


One can scroll down and press `Show more` to show more pages of the search results, or add `&page=` and the desired page number at the end of the URL. Each page contains 10 publications. The `Jump to page` link shows that there are 21,871 pages. However, PubMed limits the maximum page number to 1000, so we will only be able to scrape 1000 pages of 10 articles (10,000 articles) at most.

The link to the PubMed page for each article follows this pattern: `https://pubmed.ncbi.nlm.nih.gov/` followed by the publication's PMID. There, one can find the article title, a comprehensive list of authors, the affiliations each author has, PMID, DOI, the full abstract, short form of the journal title, and date of publication. Towards the bottom, there are links to similar articles, the list of articles that cited the publication, MeSH terms (keywords) associated with the publication, and the number of references the publication used. 
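
The two URL patterns described above, and the page arithmetic, can be sketched in a few lines. This is only an illustration and not part of the scrape itself: the helper names `search_page_url` and `article_url` are hypothetical, and the PMID used below is a made-up placeholder.

```python
import math
from urllib.parse import quote

BASE_URL = 'https://pubmed.ncbi.nlm.nih.gov/'
RESULTS_PER_PAGE = 10   # publications shown per results page
MAX_PAGE = 1000         # highest page number PubMed lets us request

def search_page_url(term, page):
    """Build the URL for one page of search results (hypothetical helper)."""
    return f'{BASE_URL}?term={quote(term)}&page={page}'

def article_url(pmid):
    """Build the URL of an article page from its PMID (hypothetical helper)."""
    return f'{BASE_URL}{pmid}/'

n_results = 218_709
n_pages = math.ceil(n_results / RESULTS_PER_PAGE)        # pages PubMed reports
n_scrapable = min(n_pages, MAX_PAGE) * RESULTS_PER_PAGE  # articles we can reach

print(search_page_url('stem cell therapies', 2))
print(article_url('12345678'))  # placeholder PMID, not a real article
print(n_pages, n_scrapable)
```

Percent-encoding the search term with `quote` has the same effect as replacing spaces with `%20` by hand, and it also covers other characters that are not safe in a URL.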

### Scraping Using Selenium and BeautifulSoup

Selenium is great for controlling the browser, but not as helpful for extracting information about each web page that I've identified above. For this, I will use BeautifulSoup which is fantastic at extracting information from the web page's HTML code. 

I will loop through the pages (1 through 1000) and at each page, I will scrape the PMIDs for each publication. Then looping through each PMID, I will scrape the key information as noted above for each article.
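
To illustrate the extraction step without opening a browser, here is a small sketch that runs BeautifulSoup on a hand-written HTML snippet. The snippet only imitates the structure the scraper relies on (an `a` tag with class `docsum-title` carrying a `data-article-id` attribute); the titles and IDs are invented.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a results page; titles and IDs are invented
sample_html = """
<div class="search-results">
  <a class="docsum-title" href="/11111111/" data-article-id="11111111">First title</a>
  <a class="docsum-title" href="/22222222/" data-article-id="22222222">Second title</a>
  <a class="other-link" href="/help/">Help</a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Keep only the result links, then read the PMID stored on each one
links = soup.find_all('a', class_='docsum-title')
pmids = [link['data-article-id'] for link in links]
titles = [link.text.strip() for link in links]

print(pmids)
print(titles)
```

In the real scrape, `sample_html` is replaced by `driver.page_source`, so Selenium does the browsing and BeautifulSoup does the parsing.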

In [None]:
%%time

# Define arguments
pages = np.arange(1, 1001, 1)
search_term = 'stem cell therapy'
search_term = search_term.replace(' ', '%20')

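# Define source and destination paths
# (shutil.move does not expand '~' on its own, so resolve it explicitly)
import os
source = os.path.expanduser('~/Downloads/PubMed_Timeline_Results_by_Year.csv')
destination = os.path.expanduser(f'~/Desktop/{search_term}_npubs.csv')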

# Create empty dictionary to store article information
article_dict = {'article_id':[],
                'title':[],
                'publication_type':[],
                'abstract':[],
                'journal_title':[],
                'citation':[],
                'n_authors':[],
                'affiliations':[],
                'n_affiliations':[],
                'n_citations':[],
                'keywords':[],
                'n_references':[]}

# Create empty list to store ids of articles that raised errors
error_articles = []

# Loop through pages to obtain article_ids pertaining to search term
for p in pages:
    
    # Create empty list to store all article ids
    article_ids = []
    
    try:
        # Define url
        url = f'https://pubmed.ncbi.nlm.nih.gov/?term={search_term}&page={p}'
        
        # Reuse the browser opened earlier instead of launching a new one per page
        driver.get(url)
        
        # Download csv for n_publications over the years
        if p == 1:
            # Find download csv button and click
            # (find_element with an explicit strategy works in Selenium 3 and 4)
            n_pubs = driver.find_element('xpath', '//*[@id="side-download-results-by-year-button"]')
            n_pubs.click()
            # Wait for download
            time.sleep(2)
            # Move into data folder in project folder
            shutil.move(source, destination)
        
        # Make html soup of the page
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        # Create a list of the search results
        docsum = soup.find_all('a', class_='docsum-title')
        
        # Extract only the article id numbers and append to article_ids
        article_ids.extend(link['data-article-id'] for link in docsum)

        # Status update
        print(f'{len(article_ids)} Articles found on Page {p}...                    ', end='\r')                       
            
        # Wait
        time.sleep(random.randint(0, 5)/10)
        
        # Loop through each article page
        for i in range(len(article_ids)):
    
            try:
    
                # Set url
                url = f'https://pubmed.ncbi.nlm.nih.gov/{article_ids[i]}/'
    
                # Reuse the existing browser rather than opening a new one per article
                driver.get(url)
    
                # Make html soup
                soup = BeautifulSoup(driver.page_source, 'html.parser')
    
                # Get article information
                # Title
                # (soup.find returns None when nothing matches, so .text raises AttributeError)
                try:
                    title = soup.find('title').text
                except AttributeError:
                    title = ''
                # Publication Type
                try:
                    pub_type = soup.find(class_='publication-type').text
                except AttributeError:
                    pub_type = ''
                # Abstract
                try:
                    abstract = soup.find(class_="abstract-content selected").text
                except AttributeError:
                    abstract = ''
                # Journal Title
                journal_info = soup.find(class_='journal-actions dropdown-block')
                journal_title = journal_info.find('button')['title']
                # Citation for Date
                try:
                    citation = soup.find('span', class_='cit').text
                except AttributeError:
                    citation = ''
                # Authors Info
                authors_info = soup.find('div', class_='authors-list')
                authors = authors_info.find_all('a', class_='full-name')
                n_authors = len(authors)
                # Affiliation Info (for institution)
                affs = authors_info.find_all('a', class_='affiliation-link')
                affiliations = [aff['title'] for aff in affs]
                n_affiliations = len(affiliations)
                # Number of citations
                try:
                    cited_by = soup.find('em', class_='amount').text
                except AttributeError:  # no citation count on the page
                    cited_by = ''
                # Keywords deemed by Pubmed
                s = str(soup.find('div', class_='mesh-terms keywords-section'))
                keywords_long = re.findall(r'Toggle dropdown menu for keyword [\w\s /\*]+', s)
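                # str.lstrip removes a set of characters, not a prefix, and would eat
                # the start of keywords such as 'Treatment Outcome'; slice the prefix off instead
                prefix = 'Toggle dropdown menu for keyword '
                keywords = [keyword[len(prefix):] for keyword in keywords_long]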
                # Number of References
                try:
                    n_refs = soup.find('div', class_='refs-list').find('button').text
                except AttributeError:  # no reference list on the page
                    n_refs = ''
    
                # Add info into article_dict
                article_dict['article_id'].append(article_ids[i])
                article_dict['title'].append(title)
                article_dict['publication_type'].append(pub_type)
                article_dict['abstract'].append(abstract)
                article_dict['journal_title'].append(journal_title)
                article_dict['citation'].append(citation)
                article_dict['n_authors'].append(n_authors)
                article_dict['affiliations'].append(affiliations)
                article_dict['n_affiliations'].append(n_affiliations)
                article_dict['n_citations'].append(cited_by)
                article_dict['keywords'].append(keywords)
                article_dict['n_references'].append(n_refs)
        
                # Status update
                print(f'Currently on Page {p} / {max(pages)}: Article {i+1} / {len(article_ids)} done          ', end='\r')
        
                # Wait
                time.sleep(random.randint(0, 5)/10)
    
            # Exception message
            except Exception as ex:
                template = "An exception of type {0} occurred. Arguments:\n{1!r}"
                message = template.format(type(ex).__name__, ex.args)
                print(message, f'Page {p}, Article {i+1}')
                
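                # Roll back any partial entries so all columns stay the same length
                n_complete = min(len(values) for values in article_dict.values())
                for values in article_dict.values():
                    del values[n_complete:]

                # Record the article so it can be revisited later
                error_articles.append(article_ids[i])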
        
        # Status update
        print(f'Page {p} / {len(pages)} done          ', end='\r')
        
        # Wait
        time.sleep(random.randint(0, 5)/10)
        
    except Exception as ex:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(ex).__name__, ex.args)
        print(message, f'Page {p}')
        # https://stackoverflow.com/questions/9823936/python-how-do-i-know-what-type-of-exception-occurred
        
print('All done!')

In [11]:
# Create a dataframe from results and look at its shape
pd.DataFrame(article_dict).shape

(9926, 12)

#### Scrape result

I was able to collect information on 9,926 of the 10,000 publications that I crawled (99.26%). The errors are most likely due to articles lacking information about authors, affiliations, or keywords, as I forgot to wrap those lookups in a try/except block. I can check this in the future from the `error_articles` list.

The scrape took 14.5 hours. Although I was able to do other things in the meantime (the beauty of automation!), I hope to find a way to cut this time down. For now, I will not troubleshoot this any further so that I can move on with my analysis.
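
A rough back-of-the-envelope estimate helps show where the time goes. Only the 14.5 hours comes from the run itself; the page counts below are an assumption (every results page holding 10 articles, so about 1,000 results pages plus 10,000 article pages were loaded).

```python
total_hours = 14.5
results_pages = 1000        # assumed: one load per results page
article_pages = 1000 * 10   # assumed: 10 articles attempted per results page

total_seconds = total_hours * 3600
page_loads = results_pages + article_pages
seconds_per_load = total_seconds / page_loads

print(f'{page_loads} page loads, about {seconds_per_load:.1f} s each')
```

Under these assumptions each load costs a little under 5 seconds. The random waits average only 0.25 seconds each, so most of the time goes to the browser loading pages rather than to deliberate waiting.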

In [9]:
# Save the results into a csv
pd.DataFrame(article_dict).to_csv('data/pubmedscrape_full.csv')

In [12]:
# Save error pages into a csv
pd.DataFrame({'errors':error_articles}).to_csv('error_articles.csv')

A few things to note:

- Status update message can be improved
- Would be helpful to print `x% of job done` type of message
- Could really use helper functions here
- Need to add try and except for `keywords`, `authors`, `affiliations`
- Would love to compile this into a script
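
As a starting point for the items above, here is a minimal sketch of what such helpers might look like. The names `safe_text`, `strip_prefix`, and `progress_message` are hypothetical, and `FakeTag` merely stands in for a BeautifulSoup tag so that the example runs on its own.

```python
def safe_text(node, default=''):
    """Return node.text stripped, or `default` when the lookup returned None."""
    return node.text.strip() if node is not None else default

def strip_prefix(text, prefix):
    """Remove `prefix` from the start of `text` if present (a prefix, not a character set)."""
    return text[len(prefix):] if text.startswith(prefix) else text

def progress_message(page, n_pages, article, n_articles):
    """Format an 'x% of job done' message, assuming every page holds n_articles."""
    done = (page - 1) * n_articles + article
    total = n_pages * n_articles
    return f'{100 * done / total:.1f}% done (page {page}/{n_pages}, article {article}/{n_articles})'

class FakeTag:
    """Stand-in for a BeautifulSoup tag; only the .text attribute is imitated."""
    def __init__(self, text):
        self.text = text

print(safe_text(FakeTag('  Review  ')))
print(repr(safe_text(None)))
print(strip_prefix('Toggle dropdown menu for keyword Treatment Outcome',
                   'Toggle dropdown menu for keyword '))
print(progress_message(page=26, n_pages=1000, article=10, n_articles=10))
```

With a helper like `safe_text`, each repeated try/except around a `soup.find(...)` lookup could shrink to a single call, which would also make it easier to guard the author and affiliation lookups.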

### Summary

Using Selenium and BeautifulSoup, I was able to automate data collection and compile information from 9,926 publications about 'stem cell therapies' from PubMed. The current method is not perfect: it does not use helper functions or run as a standalone script, and it will need to be made more efficient in the future.