# BioMed: Information Retrieval - BioMedical Information Retrieval System

---

**Group:**
- Reyes Castro, Didier Yamil (didier.reyes.castro@alumnos.upm.es)
- Rodriguez Fernández, Cristina ()

**Course:** BioMedical Informatics - 2025/26

**Institution:** Polytechnic University of Madrid (UPM)

**Date:** November 2026

---

## Goal

Develop an Information Retrieval system, specifically a **binary text classifier**, that identifies scientific articles in the PubMed database related to a given set of abstracts within a defined research topic. Here the topic is defined by a collection of 1,308 manuscripts that report the polyphenol composition of various foods.

## Setup and Installation

In [None]:
%pip install pandas requests

In [1]:
import requests
import time

import pandas as pd

## **Task 1:** 

Retrieve from PubMed the abstracts associated with each publication in publications.xlsx (loaded below from publications.csv, assumed to be a CSV export of that spreadsheet).

In [2]:
dataset = pd.read_csv('publications.csv')
dataset.head()

| | id | authors | year_of_publication | title | abbreviation | journal_name | journal_volume | journal_issue | pages | created_at | updated_at |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1216 | Aaby K., Wrolstad R.E., Ekeberg D., Skrede G. | 2007 | Polyphenol composition and antioxidant activit... | AABY 2007 | Journal of Agricultural and Food Chemistry | 55 | 13.0 | 5156-5166 | 2012-12-01 22:21:08 UTC | 2015-04-14 04:25:30 UTC |
| 1 | 1052 | Abd El Mohsen M.M., Kuhnle G., Rechner A.R., S... | 2002 | Uptake and metabolism of epicatechin and its a... | ABD EL MOHSEN 2002 | Free Radic Biol Med | 33 | 12.0 | 1693-702 | 2015-04-13 21:45:29 UTC | 2015-04-14 04:25:30 UTC |
| 2 | 356 | Abdel-Aal E.-S.M., Hucl P. | 2003 | Composition and stability of anthocyanins in b... | ABDEL-AAL 2003 | Journal of Agricultural and Food Chemistry | 51 | | 2174-2180 | 2015-04-13 21:45:25 UTC | 2015-04-14 04:25:30 UTC |
| 3 | 458 | Abdel-Aal E.-S. M., Young C., Rabalski I. | 2006 | Anthocyanin composition in black, blue, pink, ... | ABDEL-AAL 2006 | Journal of Agricultural and Food Chemistry | 54 | | 4696-4704 | 2006-04-09 12:07:36 UTC | 2015-04-14 04:25:31 UTC |
| 4 | 332 | Abril M., Negueruela A.I., Perez C., Juan T., ... | 2005 | Preliminary study of resveratrol content in Ar... | Apr-05 | Food Chemistry | 92 | 4.0 | 729-736 | 2015-04-13 21:45:25 UTC | 2015-04-13 21:45:25 UTC |


In [3]:
import os

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
ESEARCH_URL = BASE_URL + "esearch.fcgi"
FETCH_URL = BASE_URL + "efetch.fcgi"

# Read the NCBI API key from the environment instead of hardcoding it in the notebook.
# NCBI_API_KEY is an assumed variable name. Without a key, NCBI allows only 3 requests per second.
API_KEY = os.environ.get("NCBI_API_KEY")

# Step 1: Search for the PMID of the article by title
def search_pmid_by_title(title, api_key=None):
    params = {
        "db": "pubmed",
        "term": f"{title}[Title]",
        "retmode": "json",
        "api_key": api_key  # requests omits parameters whose value is None
    }

    try:
        response = requests.get(ESEARCH_URL, params=params, timeout=30)
        response.raise_for_status()
        # .get() avoids a KeyError when the response has no 'esearchresult' key
        result = response.json().get('esearchresult', {})
    except (requests.exceptions.RequestException, ValueError) as e:
        print(f"Error during request for title '{title}': {e}")
        return None

    id_list = result.get('idlist', [])
    if id_list:
        return id_list[0]  # Keep only the first hit

    print(f"Found {result.get('count', '0')} PMIDs for title: {title}. Skipping...")
    return None

# Step 2: Fetch article abstract by PMID
def fetch_abstract_by_pmid(pmid, api_key=None):
    params = {
        "db": "pubmed",
        "id": pmid,
        "retmode": "text",
        "rettype": "abstract",
        "api_key": api_key
    }

    try:
        response = requests.get(FETCH_URL, params=params, timeout=30)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching abstract for PMID '{pmid}': {e}")
        return None

# Process each article in the dataset
abstracts = []
for i, article in dataset.iterrows():
    title = article['title']
    pmid = search_pmid_by_title(title, API_KEY)
    abstract = fetch_abstract_by_pmid(pmid, API_KEY) if pmid else None
    # Use the same placeholder for a missing PMID and for a failed fetch
    abstracts.append(abstract or "Abstract not found")

    print("Sleeping for 0.1...")
    time.sleep(0.1)  # Pause 0.1 s per article; NCBI allows 10 requests per second with an API key

# Add abstracts to the dataset
dataset['abstract'] = abstracts

# Save the updated dataset
dataset.to_csv('publications_v2.csv', index=False)

Sleeping for 0.1...
Found 0 PMIDs for title: Uptake and metabolism of epicatechin and its access to the brain after oral ingestion. Skipping...
Sleeping for 0.1...
Sleeping for 0.1...
Sleeping for 0.1...
Found 0 PMIDs for title: Preliminary study of resveratrol content in Aragon red and rose wines. Skipping...
Sleeping for 0.1...
Sleeping for 0.1...
Found 0 PMIDs for title: Enhancement of total phenolics and antioxidant properties of some tropical green leafy vegetables by steam cooking. Skipping...
Sleeping for 0.1...
Sleeping for 0.1...
Sleeping for 0.1...
Sleeping for 0.1...
Found 0 PMIDs for title: Correlation of tocopherol, tocotrienol, gamma-oryzanol and total polyphenol content in rice bran with different antioxidant capacity assays. Skipping...
Sleeping for 0.1...
Sleeping for 0.1...
Sleeping for 0.1...
Found 0 PMIDs for title: Functional attributes of soybean seeds and products, with reference to isoflavone content and antioxidant activity. Skipping...
Sleeping for 0.1...
...

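Many titles return no PMID, so their abstracts cannot be retrieved. Possible causes are an exact-title query that is too strict, or articles that are not indexed in PubMed. To be raised with the professor.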
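
One thing that might be worth trying before asking is a looser query for the titles that return nothing. The sketch below only builds candidate ESearch terms and sends no requests. The helper names `first_author_surname` and `build_search_terms` are hypothetical, the six-word title prefix is an arbitrary choice, and whether these terms actually recover the missing PMIDs has not been checked. The field tags `[Title]`, `[Author]` and `[dp]` (date of publication) are standard PubMed search tags.

```python
import re

def first_author_surname(authors):
    """Return the surname of the first author in a string like 'Aaby K., Wrolstad R.E.'."""
    first = authors.split(",")[0]
    # Drop tokens that look like initials (they contain a period)
    return " ".join(tok for tok in first.split() if "." not in tok)

def build_search_terms(title, authors, year, prefix_words=6):
    """Return candidate ESearch terms: the original query first, then looser title matches."""
    # Replace punctuation with spaces, then collapse repeated whitespace
    plain_title = re.sub(r"[^\w\s]", " ", title)
    plain_title = re.sub(r"\s+", " ", plain_title).strip()
    # Use only the first words of the title, narrowed by first author and year
    prefix = " ".join(plain_title.split()[:prefix_words])
    surname = first_author_surname(authors)
    return [
        f"{title}[Title]",
        f"{plain_title}[Title]",
        f"{prefix}[Title] AND {surname}[Author] AND {year}[dp]",
    ]

terms = build_search_terms(
    "Preliminary study of resveratrol content in Aragon red and rose wines",
    "Abril M., Negueruela A.I.",
    2005,
)
for term in terms:
    print(term)
```

Each term could be passed as the `term` parameter of the existing ESearch request, stopping at the first one that returns a PMID. The third term matches less of the title, so its first hit should be checked against the expected title before the abstract is accepted.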

## **Task 2:**

Use the E-utilities to search for articles whose content is not relevant to this task. This dataset should be the same size as the set of relevant documents.

In [11]:
import os

# NCBI_API_KEY is an assumed environment variable name; it keeps the key out of the notebook
API_KEY = os.environ.get("NCBI_API_KEY")

def search_irrelevant_articles(term, count, api_key):
    print(f"Fetching {count} irrelevant articles...")

    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": count,
        "api_key": api_key
    }

    try:
        response = requests.get(ESEARCH_URL, params=params, timeout=30)
        response.raise_for_status()
        # .get() avoids a KeyError when the response has no 'esearchresult' key
        result = response.json().get('esearchresult', {})
    except (requests.exceptions.RequestException, ValueError) as e:
        print(f"Error during request for irrelevant articles: {e}")
        return []

    id_list = result.get('idlist', [])
    if not id_list:
        print(f"Found {result.get('count', '0')} irrelevant articles.")
    return id_list

irrelevant_pmids_list = search_irrelevant_articles("cancer[Title]", len(dataset), API_KEY)
irrelevant_abstracts = []
for pmid in irrelevant_pmids_list:
    abstract = fetch_abstract_by_pmid(pmid, API_KEY)
    # fetch_abstract_by_pmid returns None on failure; store the placeholder so the filter below can drop it
    irrelevant_abstracts.append(abstract or "Abstract not found")
    print("Sleeping for 0.1...")
    time.sleep(0.1)  # Pause 0.1 s per request; NCBI allows 10 requests per second with an API key


# Save irrelevant abstracts to a new dataset
irrelevant_dataset = pd.DataFrame({'pmid': irrelevant_pmids_list, 'abstract': irrelevant_abstracts})

# Drop entries whose abstract could not be fetched
irrelevant_dataset = irrelevant_dataset[irrelevant_dataset['abstract'] != "Abstract not found"]
irrelevant_dataset.to_csv('irrelevant_publications.csv', index=False)

Fetching 1308 irrelevant articles...
Sleeping for 0.1...
...

In [None]:
irrelevant_dataset

| | pmid | abstract |
|---|---|---|
| 0 | 41123968 | 1. |
| 1 | 41123956 | 1. |
| 2 | 41123928 | 1. |
| 3 | 41123909 | 1. |
| 4 | 41123893 | 1. |
| ... | ... | ... |
| 1303 | 41098933 | 1. Cureus. 2025 Oct 14;17(10):e94537. doi: 10.... |
| 1304 | 41098932 | 1. Cureus. 2025 Oct 14;17(10):c360. doi: 10.77... |
| 1305 | 41098902 | 1. Front Cell Infect Microbiol. 2025 Sep 30;15... |
| 1306 | 41098869 | 1. Contemp Oncol (Pozn). 2025;29(3):297-315. d... |
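
The last rows show that the text returned by efetch starts with a record number and the citation line, so the `abstract` column holds more than the abstract itself. One possible alternative, sketched below, is to request `retmode="xml"` and read only the `AbstractText` elements. This assumes the usual PubMed XML layout (`PubmedArticle` / `MedlineCitation` / `Article` / `Abstract` / `AbstractText`). `extract_abstracts` is a hypothetical helper and `SAMPLE_XML` is a hand-written fixture, not a real PubMed response.

```python
import xml.etree.ElementTree as ET

def extract_abstracts(xml_text):
    """Map each PMID in an efetch XML response to its abstract text (None if absent)."""
    root = ET.fromstring(xml_text)
    abstracts = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        # Structured abstracts carry several AbstractText elements (background, results, ...)
        parts = [
            "".join(node.itertext()).strip()
            for node in article.iterfind("MedlineCitation/Article/Abstract/AbstractText")
        ]
        abstracts[pmid] = " ".join(p for p in parts if p) or None
    return abstracts

# Hypothetical, hand-written response used only to exercise the parser
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">111</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second <i>part</i>.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">222</PMID>
      <Article></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

parsed = extract_abstracts(SAMPLE_XML)
print(parsed)
```

Records without an `Abstract` element come back as `None`, which makes them easy to drop before building the classifier's dataset.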


In [None]:
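# Some abstracts in the irrelevant dataset are just "1."; drop them and search again.
# Caution: str.match treats '1.' as a regular expression anchored at the start of the string
# ("1" followed by any character), so this filter matches every row and the result below is empty.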
irrelevant_dataset_cleaned = irrelevant_dataset[~irrelevant_dataset['abstract'].str.match('1.')]

irrelevant_dataset_cleaned

Empty DataFrame
Columns: [pmid, abstract]
Index: []
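
The empty result above follows from how the pattern is interpreted: `Series.str.match` applies `re.match`, so `'1.'` means "a 1 followed by any character" at the start of the string, and every efetch text record starts with `1. `. A minimal sketch of a stricter check is shown below using the standard `re` module. `is_placeholder` is a hypothetical helper, and it assumes the affected rows really contain nothing but `1.` and whitespace. If the table display only cut them at a line break, the full text should be inspected first.

```python
import re

# Matches strings that consist only of the record number "1." and optional whitespace
PLACEHOLDER = re.compile(r"1\.\s*")

def is_placeholder(abstract):
    """Return True when the fetched text is only '1.' with no citation or abstract."""
    return PLACEHOLDER.fullmatch(abstract) is not None

examples = ["1.", "1. \n", "1. Cureus. 2025 Oct 14;17(10):e94537."]
for text in examples:
    # The unescaped pattern matches all three; the strict check matches only the first two
    print(repr(text), re.match("1.", text) is not None, is_placeholder(text))
```

In the notebook this could be applied as `irrelevant_dataset[~irrelevant_dataset['abstract'].map(is_placeholder)]`.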


In [None]:
# Fetch additional irrelevant abstracts to top the dataset up to the size of the relevant one
# (the API key is read from the assumed NCBI_API_KEY environment variable; requires `import os`)
# api_key = os.environ.get("NCBI_API_KEY")
# new_irrelevant_pmids_list = search_irrelevant_articles("pneumonia[Title]", len(dataset) - len(irrelevant_dataset), api_key)
# new_irrelevant_abstracts = []
# for pmid in new_irrelevant_pmids_list:
#     abstract = fetch_abstract_by_pmid(pmid, api_key)
#     new_irrelevant_abstracts.append(abstract)
#     print("Sleeping for 0.1...")
#     time.sleep(0.1)  # Pause 0.1 s per request; NCBI allows 10 requests per second with an API key
# 
# # Append the new abstracts to the irrelevant dataset
# new_irrelevant_dataset = pd.DataFrame({'pmid': new_irrelevant_pmids_list, 'abstract': new_irrelevant_abstracts})
# irrelevant_dataset = pd.concat([irrelevant_dataset, new_irrelevant_dataset], ignore_index=True)
# irrelevant_dataset.to_csv('irrelevant_publications_v2.csv', index=False)
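
The loops in this notebook send one efetch request per PMID. EFetch also accepts a comma-separated list of UIDs in its `id` parameter, so the same abstracts could be fetched in far fewer requests. The sketch below only builds the parameter dictionaries and sends nothing. `efetch_batches` is a hypothetical helper and the batch size of 200 is an assumption, chosen because NCBI's guidance suggests switching to HTTP POST for larger UID lists.

```python
def efetch_batches(pmids, batch_size=200):
    """Yield efetch parameter dicts, each covering up to batch_size PMIDs."""
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        yield {
            "db": "pubmed",
            "id": ",".join(str(p) for p in batch),  # comma-separated UID list
            "retmode": "xml",
        }

# Small demonstration with made-up identifiers
batches = list(efetch_batches([1, 2, 3, 4, 5], batch_size=2))
for params in batches:
    print(params["id"])
```

Each dictionary could be passed as `params` to `requests.get(FETCH_URL, ...)`, with the API key added in the same way as in the functions above. A batched response holds several records, so it needs to be split per PMID before it is stored.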