# Searching for papers in PubMed using Python

You can execute this code in <a href="https://colab.research.google.com/github/edu9as/web-scraping/blob/master/Searching-For-Papers-In-PubMed.ipynb">**Google Colab**</a>

PubMed is a free search engine for the contents of the MEDLINE database, as well as journals not included in it. It is widely used in the scientific community, particularly in the life sciences.

In this notebook, I use web scraping to find the PubMed URL of a given paper and to obtain the abstract, title, authors and journal of publication of all the papers related to a topic of interest.

## 1. Load libraries

Only three libraries are needed here: **requests**, **BeautifulSoup** and **time**.

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import time

## 2. Examine PubMed conditions for crawling

Because we are scraping the PubMed domain, we first have to look at its ```robots.txt``` file:

In [2]:
print(requests.get("https://pubmed.ncbi.nlm.nih.gov/robots.txt").text)

User-agent: *
Crawl-delay: 1
Disallow: /api
Disallow: /rss
Disallow: /advanced/adv-suggestions/
Disallow: /terms/
Disallow: /addToHistory/
Disallow: /deleteHistory/
Disallow: /downloadHistory/
Disallow: /deleteHistoryRecord/
Disallow: /historyCacheExists/
Disallow: /*/references/
Disallow: /*/citations/
Disallow: /*/export/
Disallow: /*/citedby/
Disallow: /*/similar/
Disallow: /*/adj-nav/
Disallow: /ajax/
Disallow: /clipboard/
Disallow: /clipboard-next-page/
Disallow: /health/
Disallow: /deep-health-abstract/
Disallow: /deep-health-search/
Disallow: /deep-health-auth/
Disallow: /error/400/
Disallow: /error/403/
Disallow: /error/404/
Disallow: /error/500/
Disallow: /results-export-ids/
Disallow: /results-export-search-data/
Disallow: /results-export-email-by-search-data/
Disallow: /results-export-search-by-year/
Disallow: /send-email/
Disallow: /list-existing-collections/
Disallow: /add-to-existing-collection/
Disallow: /create-and-add-to-new-collection/
Disallow: /toggle-favorites-coll

We are only making requests of query results and PubMed publication entries, so we are free to get the information we want. However, we must wait 1 second after making any request to this domain.
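
The same check can be sketched in code. The example below uses Python's standard `urllib.robotparser` on a few rules copied from the output above, so it runs without any network request. The two paths being tested are made up for illustration. As far as I know, this parser matches rules by path prefix, so the wildcard rules (such as `/*/citations/`) are left out.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the robots.txt output above
rules = """User-agent: *
Crawl-delay: 1
Disallow: /api
Disallow: /ajax/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "https://pubmed.ncbi.nlm.nih.gov"

# Query results pages are not covered by any Disallow rule
print(parser.can_fetch("*", base + "/?term=trpa1"))   # True

# Anything under /api is disallowed (illustrative path)
print(parser.can_fetch("*", base + "/api/whatever"))  # False

# Seconds to wait between requests
print(parser.crawl_delay("*"))                        # 1
```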

## 3. Define some functions

In this section, I am creating some functions that will be useful when dealing with PubMed webpages.

### 3.1. Scrape PubMed query results webpage

Given the URL of a PubMed query results page, this function stores the URL of each paper on the page in a dictionary, and the title, author(s) and journal of each paper in a tuple of lists.

In [3]:
def scrape_pubmed_query(url, file):
    """
    Requests a PubMed query results webpage and extracts the URLs, title,
    author(s) and journal of publication of all the papers.

    Parameters
    ----------
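    url : string
        URL for querying PubMed.
    file : string or boolean
        Output file name, or False. If truthy, the extracted text is encoded
        (see prepare_text).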

    Returns
    -------
    urls_dict : dictionary
        A dictionary that associates the zero-based position of each paper in
        the webpage (as a string, '0' to '9') with the relative URL of that
        paper's PubMed entry.
        
    summary : tuple
        Tuple containing three lists. These lists contain the titles, authors
        and journals of the papers in the query results webpage.

    """
    pubmed = get_webpage(url)
    soup = bs(pubmed, "html.parser")
    
    papers = soup.find_all("div", class_="docsum-content")
    urls = [paper.find("a", class_="docsum-title").get("href")
              for paper in papers]
    
    urls_dict = {str(i): urls[i] for i in range(len(urls))}
    
    
    titles = [prepare_text(paper.find("a", class_="docsum-title"), 
                           encode = file)
              for paper in papers]
    authors = [prepare_text(paper.find("span", class_="full-authors"),
                            encode = file)
              for paper in papers]
    journals = [prepare_text(paper.find("span", 
                                        class_="short-journal-citation"), 
                             encode = file)
              for paper in papers]
    
    summary = (titles, authors, journals)
    
    return urls_dict, summary

### 3.2. Print table from list of query results

This function will render a nicely formatted table showing the title, author(s) and journal of the hits of the query.

In [4]:
def print_table_from_list(summary, line = 105):
    """
    Given a tuple with the lists containing the papers information, renders a
    table with a nice format.

    Parameters
    ----------
    summary : tuple
        Tuple containing the lists with the papers information (title, author 
        and journal of publication).
    line : integer, optional
        Maximum line character length for printing the table. The default is
        105 characters long.

    Returns
    -------
    None.

    """
    
    titles, authors, journals = summary
    line = line // 3 * 3  # Length of each line in the table (multiple of 3)
    col = line // 3  # Length of each cell in the table
    
    print("#"*line)  # Top border
    
    print("|    | Title " + (col - 11) * " " + 
          " | Author(s)" + " "*(col-14) + 
          " | Journal" + " "*(col-12) + " |")  # Headers
    
    print("-"*line)  # Header line separator
    
    for i, row in enumerate(zip(titles, authors, journals)):  # Paper info
        print("| ", end = "")  # Left border
        print(i + 1, end = (2-len(str(i + 1)))*" " + " | ")  # Paper number
        
        for cell in row:  # Print row
            # Truncate the text to the cell width and pad it with spaces
            cell = cell.strip()[:col - 5].ljust(col - 5)
            print(cell, end = " | ")
        print("")
        
    print("#"*line)  # Bottom border
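
The truncate-and-pad step used for each cell can be tried on its own. This is a minimal sketch: `format_cell` is a hypothetical helper, not part of the notebook, and the width of 21 characters corresponds to calling the function above with `line = 80`.

```python
def format_cell(text, width):
    """Hypothetical helper: truncate text to `width` characters, then pad."""
    return text.strip()[:width].ljust(width)


row = ("TRPA1: a molecular view.",
       "Meents JE, Ciotu CI, Fischer MJM.",
       "J Neurophysiol. 2019.")

# Every cell ends up exactly 21 characters wide
print("| " + " | ".join(format_cell(cell, 21) for cell in row) + " |")
# | TRPA1: a molecular vi | Meents JE, Ciotu CI,  | J Neurophysiol. 2019. |
```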

### 3.3. Search term (or group of terms) in PubMed

This is the longest function in this notebook. It has two modes:

- Default (```crawl = False```): the papers that match the query are printed page by page as a table (defined in **3.2.**) until the user chooses a paper. Then, the PubMed URL of that publication is printed to the console and returned by this function.

- Crawling (```crawl = True```): the abstracts of all the papers are returned in a list of tuples. If desired, the title, author(s) and journal of each paper are also included in the tuples. The information can also be stored in a file.

In [5]:
def pubmed_search_term(term, crawl = False, title = False, author = False,
                       journal = False, file = False):
    """
    Query PubMed and get the desired information. Two modes: default for 
    point requests and "crawl" to get information about multiple papers at the
    same time.

    Parameters
    ----------
    term : string
        The term, terms set or PubMed query to search for papers of interest.
    crawl : boolean, optional
        Set to False for point requests, and True to get information about
        multiple papers at the same time. The default is False.
    title : boolean, optional
        Whether to include the paper title or not in the output. Only for 
        crawling mode. The default is False.
    author : boolean, optional
        Whether to include the paper authors or not in the output. Only for 
        crawling mode. The default is False.
    journal : boolean, optional
        Whether to include the paper journal or not in the output. Only for 
        crawling mode. The default is False.
    file : string, optional
        The name of the file to be written with the required information, if 
        desired. The default is False.

    Returns
    -------
    abstract_list : list
        List with the required information about the papers of interest. In
        default mode, if a paper is chosen, its PubMed URL (a string) is
        returned instead.

    """
    pubmed_url = "https://pubmed.ncbi.nlm.nih.gov"
    pubmed_query = pubmed_url + "/?term={}"
    
    term = term.replace(" ", "+")
    i = 0
    url = pubmed_query.format(term)
    abstract_list = []
    
    try:
        while True:
            i += 1
            search_url = url if i == 1 else url + "&page=" + str(i)
            urls, summary = scrape_pubmed_query(search_url, file)
            
            if len(urls) == 0:
                print("No more papers for your query!")
                break
            
            print_table_from_list(summary, line = 80)
            if not crawl:
                paper_num = input("Leave blank to see next results page,\n"
                                  "enter 'ABS' to show all abstracts\n"
                                  "or choose a paper (enter number): ")
                print("")
            else:
                paper_num = "ABS"
            
            if paper_num.isdecimal() and 1 <= int(paper_num) <= len(urls):
                pmid = urls[str(int(paper_num) - 1)]
                url = pubmed_url + pmid
                print("The PubMed URL for this paper is: {}".format(url))
                return url
            
            elif paper_num == "ABS":
                for n, paper in enumerate(urls.values()):
                    page = get_webpage(pubmed_url + paper)
                    time.sleep(1)  # Crawler delay in PubMed domain
                    
                    try:
                        if crawl:
                            output = ("page{}, paper{}".format(i,n+1),
                                      show_abstract(page, crawl, file))
                            if file:
                                beg = 2
                                end = -1
                            else:
                                beg = 0
                                end = None
                            for m, info in enumerate((journal, author, title)):
                                if info:
                                    output = ((output[0],) + 
                                              (summary[2-m][n][beg:end],) + 
                                              output[1:])
                                
                            print(str(i)+str(n+1), end = ", ")
                            abstract_list.append(output)
                            continue
                        
                        print("\n\n"
                              "#######\n"
                              "## {} ##\n"
                              "#######".format(str(n + 1)))
                        
                        show_abstract(page)
                        
                        if input("Go to SciHub? ").lower() in ("yes", "y", "si"):
                            print(get_doi(page))
                            time.sleep(3)
                            
                    except AttributeError:
                        print("No abstract for paper {}".format(n+1),
                              end = ", ")
                        
                print("")
                
    except KeyboardInterrupt:
        print("OK, it's time to quit...")
        
    if crawl and file:
        # Ask again until a valid file name (a string) is provided
        while not isinstance(file, str):
            file = input("Please enter a valid file name: ")
        # One paper per line, with the fields separated by ";"
        with open(file, "w", encoding="utf-8") as f:
            for paper in abstract_list:
                f.write(";".join(paper) + "\n")
        return abstract_list
        
    else:
        return abstract_list
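
The function above builds the search URL by replacing spaces with `+` and appending `&page=N`. A slightly more defensive sketch is shown here, assuming the same `?term=...&page=...` URL pattern: `urllib.parse.urlencode` also escapes characters such as brackets or ampersands. `build_search_url` is a hypothetical helper and the queries are only examples.

```python
from urllib.parse import urlencode

PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def build_search_url(term, page=1):
    """Hypothetical helper: build the URL of one page of PubMed results."""
    params = {"term": term}
    if page > 1:
        params["page"] = page
    return PUBMED_URL + "?" + urlencode(params)


print(build_search_url("trpa1 pain"))
# https://pubmed.ncbi.nlm.nih.gov/?term=trpa1+pain

print(build_search_url("trpa1[Title] AND pain", page=2))
# https://pubmed.ncbi.nlm.nih.gov/?term=trpa1%5BTitle%5D+AND+pain&page=2
```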

### 3.4. Show abstract of a given paper

In crawling mode, this function returns the abstract of a paper after formatting it with the function defined in **3.5**. Otherwise, it prints the abstract to the console.

In [6]:
def show_abstract(page, crawl = False, file = False):
    """
    Parses the HTML file of a PubMed entry and finds the abstract of the paper.

    Parameters
    ----------
    page : string
        HTML file of the paper whose abstract is to be found.
    crawl : boolean, optional
        If True, the abstract is output but not printed to the console. The
        default is False.
    file : string, optional
        The name of the file where the results of the crawling will be written. 
        The default is False.

    Returns
    -------
    abstract : string
        The abstract of the paper of interest, with an adequate format for the
        task.

    """
    soup = bs(page, "html.parser")
    abstract = soup.find("div", id="enc-abstract")
    if crawl and file:
        abstract = prepare_text(abstract, 
                                replace_for_csv = True)
        return abstract
    elif crawl:
        abstract = prepare_text(abstract, encode = False)
        return abstract
    else:
        print("{}".format(abstract.text.strip()))

### 3.5. Prepare text

Prepares the text as needed: if the text is to be written to a file, it is encoded as UTF-8 first. In the output file, the fields are separated by a character (```sep```), so any occurrence of that character in the original text must be replaced to avoid problems when opening the file.

In [7]:
def prepare_text(soup, replace_for_csv = False, sep = ";", encode = True):
    """
    Prepares a text from a bs4.BeautifulSoup, removing leading and trailing
    spaces and newlines by default.

    Parameters
    ----------
    soup : bs4.BeautifulSoup
        BeautifulSoup object where the desired text is enclosed.
    replace_for_csv : boolean, optional
        Whether the text has to be prepared to be written to a (csv) file or
        not. The default is False.
    sep : string, optional
        The separator in the output file. It prevents the future file from
        being wrongly formatted by replacing the future separator by ',.'. The
        default is ";".
    encode : boolean, optional
        Whether to encode or not the text as UTF-8. It prevents encoding errors
        when writing a new file. The default is True.

    Returns
    -------
    text : string
        The text, ready to be printed/stored.

    """
    text = soup.text.strip()
    
    if replace_for_csv:
        text = text.replace(sep, ",.").replace("\t", " ").replace("\n", " ")
        
    if encode:
        # str() of a bytes object gives its repr, e.g. "b'text'"
        text = str(text.encode("utf-8"))
        
    return text
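
To see what these two steps do to a string, here is a small standalone sketch. The author string is only an illustration. It also shows why the crawling code in **3.3** slices the encoded fields with `[2:-1]`.

```python
text = "Zygmunt PM; Högestätt ED"

# Replace the separator so the field cannot be split when the file is read
safe = text.replace(";", ",.")
print(safe)           # Zygmunt PM,. Högestätt ED

# str() of a bytes object returns its repr, including the b'...' wrapper
encoded = str(safe.encode("utf-8"))
print(encoded)        # b'Zygmunt PM,. H\xc3\xb6gest\xc3\xa4tt ED'

# Slicing [2:-1] drops the leading b' and the trailing '
print(encoded[2:-1])  # Zygmunt PM,. H\xc3\xb6gest\xc3\xa4tt ED
```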

### 3.6. Get webpage

To avoid writing "requests.get(...).text" every time I request a webpage, I preferred to write this function.

In [8]:
def get_webpage(url):
    """
    Requests a given webpage and returns its text content.

    Parameters
    ----------
    url : string
        URL of the webpage to be searched.

    Returns
    -------
    text : string
        Text of the required webpage.

    """
    text = requests.get(url).text
    return text
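
Since every request to PubMed should be followed by a 1-second pause, one option is to enforce the delay around the fetching function itself. The sketch below is an assumption about how this could be done, not what the notebook does: `Throttle` and `polite_get` are hypothetical names, and `fetch` can be any function that takes a URL, such as `get_webpage`.

```python
import time


class Throttle:
    """Hypothetical helper: keep calls at least `delay` seconds apart."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self._last = None  # Time of the previous call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(delay=1.0)  # Crawl delay taken from robots.txt


def polite_get(url, fetch):
    """Call fetch(url) only after the crawl delay has elapsed."""
    throttle.wait()
    return fetch(url)
```

With this in place, `polite_get(url, get_webpage)` would return the same text as `get_webpage(url)`, waiting first whenever the previous request was made less than one second ago.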

### 3.7. Get DOI of a publication

Finds the DOI of a given paper. The DOI can be useful for later tasks, such as locating the full text of the paper.

In [9]:
def get_doi(paper_page):
    """
    Finds the DOI of the desired paper.

    Parameters
    ----------
    paper_page : string
        HTML file of the paper of interest.

    Returns
    -------
    doi : string
        The DOI of the paper.

    """
    scihub_url = "https://sci-hub.st/"
    soup = bs(paper_page, "html.parser")
    
    doi = soup.find("a", {"class": "id-link", "data-ga-action": "DOI"}).text
    url = scihub_url + doi
    
    return doi

### 3.8. Print an introduction to the console

In [10]:
def get_intro():
    """
    Prints an introduction to the console when the script is run.

    Returns
    -------
    None.

    """
    intro = """
    Hi! Welcome to my script. By running this code, first you will have to 
    introduce some words to query in PubMed (if you know some structured PubMed 
    query, you can also introduce it).\n
    \n
    As a result, a table with the first results will be printed.
    \n
    You can enter a number to obtain a paper's PubMed URL,\n
    leave a blank space to look for more results,\n
    or enter 'ABS' to show the abstracts of all the papers in the table.\n
    Within this last mode, you can obtain the SciHub URL for each paper\n
    if you want to read the whole paper.\n"""
    
    print(intro)

## 4. Obtain information from PubMed

Let's scrape PubMed using the previous functions!

### 4.1. A single paper

In [11]:
get_intro()
pubmed_search_term(input("Please, enter your query: "))


    Hi! Welcome to my script. By running this code, first you will have to 
    introduce some words to query in PubMed (if you know some structured PubMed 
    query, you can also introduce it).

    

    As a result, a table with the first results will be printed.
    

    You can enter a number to obtain a paper's PubMed URL,

    leave a blank space to look for more results,

    or enter 'ABS' to show the abstracts of all the papers in the table.

    Within this last mode, you can obtain the SciHub URL for each paper

    if you want to read the whole paper.

Please, enter your query: trpa1
##############################################################################
|    | Title                 | Author(s)             | Journal               |
------------------------------------------------------------------------------
| 1  | TRPA1: a molecular vi | Meents JE, Ciotu CI,  | J Neurophysiol. 2019. | 
| 2  | TRPA1.                | Zygmunt PM, Högestätt | Handb Exp Pharmacol.  |

'https://pubmed.ncbi.nlm.nih.gov/24756722/'

### 4.2. Multiple papers

To avoid overloading PubMed for a simple demonstration, I interrupted the crawl with KeyboardInterrupt (Ctrl+C) after 43 papers had been fetched.

In [12]:
abstracts = pubmed_search_term(input("Please, enter your query: "),
                               crawl = True,
                               title = True, 
                               author = True,
                               journal = True)

Please, enter your query: trpa1
##############################################################################
|    | Title                 | Author(s)             | Journal               |
------------------------------------------------------------------------------
| 1  | TRPA1: a molecular vi | Meents JE, Ciotu CI,  | J Neurophysiol. 2019. | 
| 2  | TRPA1.                | Zygmunt PM, Högestätt | Handb Exp Pharmacol.  | 
| 3  | Mammalian Transient R | Talavera K, Startek J | Physiol Rev. 2020.    | 
| 4  | The TRPA1 channel in  | Nassini R, Materazzi  | Rev Physiol Biochem P | 
| 5  | ROS/TRPA1/CGRP signal | Jiang L, Ma D, Grubb  | J Headache Pain. 2019 | 
| 6  | TRPA1 channel contrib | Conklin DJ, Guo Y, Ny | Am J Physiol Heart Ci | 
| 7  | TRPA1 Channel as a Re | Logashina YA, Korolko | Biochemistry (Mosc).  | 
| 8  | [Roles of TRPA1 in Pa | So K.                 | Yakugaku Zasshi. 2020 | 
| 9  | Structural Insights i | Suo Y, Wang Z, Zubcev | Neuron. 2020.         | 
| 10 | Role

Let's look at the output of this function when searching for multiple papers (only the first 3 publications are shown):

In [13]:
for paper in abstracts[:3]:  # Only the first 3 papers
    print("# "+ paper[0])
    print("Title:\n" + paper[1])
    print("Author(s):\n" + paper[2])
    print("Journal:\n" + paper[3])
    print("Abstract:\n" + paper[4])
    print("\n")
    

# page1, paper1
Title:
TRPA1: a molecular view.
Author(s):
Meents JE, Ciotu CI, Fischer MJM.
Journal:
J Neurophysiol. 2019.
Abstract:
The transient receptor potential ankyrin 1 (TRPA1) ion channel is expressed in pain-sensing neurons and other tissues and has become a major target in the development of novel pharmaceuticals. A remarkable feature of the channel is its long list of activators, many of which we are exposed to in daily life. Many of these agonists induce pain and inflammation, making TRPA1 a major target for anti-inflammatory and analgesic therapies. Studies in human patients and in experimental animals have confirmed an important role for TRPA1 in a number of pain conditions. Over the recent years, much progress has been made in elucidating the molecular structure of TRPA1 and in discovering binding sites and modulatory sites of the channel. Because the list of published mutations and important molecular sites is steadily growing and because it has become difficult to see