<a href="https://colab.research.google.com/github/mauro-nievoff/MultiCaRe_Dataset/blob/main/1_How_to_Query_Case_Reports_from_PubMed_using_BioPython.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Query Case Reports from PubMed using Biopython (Bio.Entrez)

When it comes to biomedical literature, [PubMed](https://pubmed.ncbi.nlm.nih.gov/) is the most comprehensive database, but querying it can be a complex task. In this notebook we will use [Biopython](https://biopython.org/), a versatile bioinformatics toolkit, to search for and retrieve relevant open-access articles from PubMed.

We will discuss:
- How to create a search string
- How to retrieve article IDs
- How to run a large query

## 📓 General Concepts about PubMed Searches

Generally speaking, a PubMed search should be as specific and as sensitive as possible: it should return as many relevant articles as possible while minimizing the inclusion of irrelevant ones.

Let's see an example. If we wanted to get all the case reports published in 2022 related to hypertension, we would write something like this:

```("high blood pressure"[Title/Abstract] OR "Hypertension"[Mesh]) AND case reports[Publication Type] AND 2022[Date - Publication]```

This search string includes:
- pertinent strings (using quotation marks if the match should be exact), such as `high blood pressure` or `2022`
- the field in which they should be searched (between square brackets), such as `[Mesh]` or `[Publication Type]`
- boolean operators (grouped with parentheses as necessary), such as `AND` or `OR`
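
As a minimal sketch, the three ingredients above can also be assembled programmatically. The helper names `tag` and `any_of` are hypothetical (they are not part of PubMed or Biopython), and double quotes are used here to request an exact phrase match.

```python
def tag(term, field, exact=False):
    """Attach a PubMed field tag to a term, quoting it for an exact match."""
    term = f'"{term}"' if exact else term
    return f"{term}[{field}]"

def any_of(*clauses):
    """Join clauses with OR and wrap them in parentheses."""
    return "(" + " OR ".join(clauses) + ")"

# Topic block: either wording may match.
topic = any_of(
    tag("high blood pressure", "Title/Abstract", exact=True),
    tag("Hypertension", "Mesh", exact=True),
)

# Combine the topic with the publication type and year filters.
query = " AND ".join([
    topic,
    tag("case reports", "Publication Type"),
    tag("2022", "Date - Publication"),
])
print(query)
```

Building the string from small pieces makes it easier to add or remove a term during the refinement process without breaking the parentheses.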

To find the most appropriate string for a specific use case, follow an iterative process: run a query, analyze the results, and modify the search string if necessary. These are the kinds of questions to keep in mind while refining the search:
- Is it necessary to add a new filter?
- Is there any other relevant term that should be included in the string?
- Is there any specific term associated with irrelevant articles that should be used together with the boolean operator `NOT`?

## 🔍 Creating a Search String for Case Report Articles

The `create_search_string()` function builds a search string that retrieves as many relevant case reports as possible for a given clinical use case (e.g. a specific disease, symptom or therapy). When no use case is provided, the generic string `'case'` is used in order to get any case report.

In [None]:
def create_search_string(clinical_usecase = ''):

  if clinical_usecase:
    search = clinical_usecase
  else:
    search = 'case'

  cr_filter_search_string = f"({search}[All Fields] AND case reports[Publication Type] NOT animal[filter])"

  case_synonyms = ['case study', 'case studies', 'case series', 'case report', 'case reports', 'clinical case', 'clinical cases', 'case presentation', 'case presentations']
  case_search_string = '('
  for idx, synonym in enumerate(case_synonyms):
    case_search_string += synonym + '[Title/Abstract]'
    if idx != len(case_synonyms) -1:
      case_search_string += ' OR '
    else:
      case_search_string += ')'

  cr_term_search_string = f"(({search}[All Fields]) AND {case_search_string} NOT case reports[Publication Type] NOT animal[filter])" # Animal case reports are excluded.

  search_string = f"({cr_filter_search_string} OR {cr_term_search_string}) AND ffrft[Filter]" # ffrft is used to retrieve only free full-text articles.

  return search_string

In [None]:
search_string = create_search_string()

In [None]:
search_string

'((case[All Fields] AND case reports[Publication Type] NOT animal[filter]) OR ((case[All Fields]) AND (case study[Title/Abstract] OR case studies[Title/Abstract] OR case series[Title/Abstract] OR case report[Title/Abstract] OR case reports[Title/Abstract] OR clinical case[Title/Abstract] OR clinical cases[Title/Abstract] OR case presentation[Title/Abstract] OR case presentations[Title/Abstract]) NOT case reports[Publication Type] NOT animal[filter])) AND ffrft[Filter]'
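
The synonym block in the middle of this string can also be produced with `str.join`, which avoids tracking the index of the last element. This is only an alternative sketch of that one step, using the same synonym list as above.

```python
case_synonyms = ['case study', 'case studies', 'case series', 'case report', 'case reports',
                 'clinical case', 'clinical cases', 'case presentation', 'case presentations']

# Tag every synonym with the Title/Abstract field, then join the clauses with OR.
case_search_string = '(' + ' OR '.join(f'{s}[Title/Abstract]' for s in case_synonyms) + ')'
print(case_search_string)
```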

## 💻 Getting Article IDs with Biopython

Now that we have our search string, we will get all the relevant PubMed article IDs (PMIDs) using Biopython. For that, we need to install the library and import Entrez.

In [None]:
%%capture
!pip install biopython

In [None]:
from Bio import Entrez

Before using Entrez, set the email address of your NCBI account along with an API key (available at https://www.ncbi.nlm.nih.gov/account/settings/). The API key is optional, but it raises the number of requests allowed per second.

In [None]:
Entrez.email = "your@email.com"
Entrez.api_key = "your_api_key"

Once everything is set up, the code below can be run to get the list of PMIDs given a search string.

In [None]:
# Search PubMed and parse the XML response into a dictionary-like record.
handle = Entrez.esearch(db="pubmed", term=search_string, retmode="xml", retmax=10000)
record = Entrez.read(handle)
handle.close()
pmid_list = record["IdList"]

In [None]:
print(f"A total of {len(pmid_list)} PMIDs were retrieved.")

A total of 9999 PMIDs were retrieved.
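
A count this close to `retmax` suggests that the result list may have been cut off. The parsed ESearch record also carries a `"Count"` field with the total number of matches, so one way to check is to compare it with the number of IDs returned. The function name `is_truncated` and the example values below are illustrative assumptions, not part of Biopython.

```python
def is_truncated(record):
    """Return True when the search matched more records than were returned."""
    total = int(record["Count"])        # Total matches reported by ESearch (a string).
    retrieved = len(record["IdList"])   # IDs actually present in this response.
    return total > retrieved

# Illustrative stand-in for a parsed ESearch result (made-up values).
example_record = {"Count": "25000", "IdList": ["1", "2", "3"]}
print(is_truncated(example_record))
```

Calling `is_truncated(record)` on the real record from the cell above would indicate whether the query needs to be split.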


PMIDs can be easily mapped to PubMed Central IDs (PMCIDs) with the function below.

In [None]:
def get_pmcid(pmid):
  # Fetch the PubMed record and look for a PMC identifier among its article IDs.
  pmid_handle = Entrez.efetch(db="pubmed", id=pmid, retmode="xml")
  pmid_record = Entrez.read(pmid_handle)
  pmid_handle.close()
  article_ids = pmid_record['PubmedArticle'][0]['PubmedData']['ArticleIdList']
  pmcid = 'not_found' # Default value, which also covers an empty ID list.
  for e in article_ids:
    if e.attributes['IdType'] == 'pmc':
      pmcid = str(e)
      break
  return pmcid

In [None]:
get_pmcid(pmid = '36709280')

'PMC9884407'
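
Mapping thousands of PMIDs one request at a time is slow. A possible alternative, sketched below, is to pass several IDs to `Entrez.efetch` in a single call and read each article's PMID from `MedlineCitation`. The helper names (`chunked`, `extract_pmcids`, `get_pmcids`) and the batch size of 200 are assumptions chosen for illustration, and `get_pmcids` assumes that `Entrez.email` has already been set as shown earlier.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def extract_pmcids(parsed):
    """Map each PMID in a parsed efetch result to its PMCID (or 'not_found')."""
    mapping = {}
    for article in parsed['PubmedArticle']:
        pmid = str(article['MedlineCitation']['PMID'])
        mapping[pmid] = 'not_found'
        for e in article['PubmedData']['ArticleIdList']:
            if e.attributes['IdType'] == 'pmc':
                mapping[pmid] = str(e)
                break
    return mapping

def get_pmcids(pmids, batch_size=200):
    """Fetch PMCIDs for many PMIDs, one request per batch (needs network access)."""
    from Bio import Entrez  # Imported here so the helpers above work without Biopython.
    mapping = {}
    for batch in chunked(list(pmids), batch_size):
        handle = Entrez.efetch(db="pubmed", id=",".join(batch), retmode="xml")
        mapping.update(extract_pmcids(Entrez.read(handle)))
        handle.close()
    return mapping
```

For example, `get_pmcids(pmid_list)` would return a dictionary keyed by PMID, with `'not_found'` for articles that have no PMC identifier.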

## ✅ How to Run a Large Query

Entrez will return a maximum of 10,000 results per query. This is usually enough, but if you need more, one possible workaround is to split the time period of your search into shorter periods and run a separate query for each one.

The function `get_pmids()` returns all the article IDs for a given query between the specified start and end years (by default, the period 2018-2022 is used).

In [None]:
def get_pmids(search_string, start_year = 2018, end_year = 2022):

  pmid_list = []

  # Run one query per year so that each query is more likely to stay under the 10,000-result cap.
  for year in range(int(start_year), int(end_year)+1):
    # Parentheses keep the year filter applied to the whole search string.
    query = f"({search_string}) AND {year}[Date - Publication]"
    handle = Entrez.esearch(db="pubmed", term=query, retmode="xml", retmax=10000)
    record = Entrez.read(handle)
    handle.close()
    pmid_list += record["IdList"]

  pmid_list = list(set(pmid_list)) # Remove duplicates across years.
  return pmid_list

In [None]:
full_pmid_list = get_pmids(search_string)

In [None]:
print(f"A total of {len(full_pmid_list)} PMIDs were retrieved.")

A total of 49734 PMIDs were retrieved.


If we wanted to get even more PMIDs, we could use more granular date filters (by month or by day).
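
As a rough sketch of the monthly variant, the snippet below generates one date-range filter per calendar month using PubMed's `YYYY/MM/DD:YYYY/MM/DD` range syntax. The function name `monthly_date_filters` is hypothetical and not part of Biopython.

```python
import calendar

def monthly_date_filters(start_year, end_year):
    """Build one PubMed date-range filter per calendar month."""
    filters = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            last_day = calendar.monthrange(year, month)[1]  # Accounts for leap years.
            filters.append(
                f"{year}/{month:02d}/01:{year}/{month:02d}/{last_day:02d}[Date - Publication]"
            )
    return filters

filters = monthly_date_filters(2022, 2022)
print(len(filters))   # Number of monthly filters for one year.
print(filters[1])     # Filter for February 2022.
```

Each filter could then be appended to the search string with `AND`, in place of the single year used in `get_pmids()`. Deduplicating the combined list remains a sensible precaution.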