# BioMed: Information Retrieval - BioMedical Information Retrieval System

---

**Group:**
- Reyes Castro, Didier Yamil (didier.reyes.castro@alumnos.upm.es)
- Rodriguez Fernández, Cristina ()

**Course:** BioMedical Informatics - 2025/26

**Institution:** Polytechnic University of Madrid (UPM)

**Date:** November 2026

---

## Goal

The goal is to develop an information retrieval system, specifically a **binary text classifier**, that identifies scientific articles in the PubMed database related to a given set of abstracts within a defined research topic. Here, the topic is represented by a collection of 1,308 manuscripts reporting the polyphenol composition of various foods.

## Setup and Installation

In [None]:
%pip install pandas requests

In [11]:
import requests
import time

import pandas as pd

## **Task 1:** 

Retrieve from PubMed the abstract associated with each publication in the publications list (`publications.xlsx` in the task statement; the cell below reads it as `publications.csv`).

In [12]:
dataset = pd.read_csv('publications.csv')
dataset.head()

| Unnamed: 0 | id | authors | year_of_publication | title | abbreviation | journal_name | journal_volume | journal_issue | pages | created_at | updated_at |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1216 | Aaby K., Wrolstad R.E., Ekeberg D., Skrede G. | 2007 | Polyphenol composition and antioxidant activit... | AABY 2007 | Journal of Agricultural and Food Chemistry | 55 | 13.0 | 5156-5166 | 2012-12-01 22:21:08 UTC | 2015-04-14 04:25:30 UTC |
| 1 | 1052 | Abd El Mohsen M.M., Kuhnle G., Rechner A.R., S... | 2002 | Uptake and metabolism of epicatechin and its a... | ABD EL MOHSEN 2002 | Free Radic Biol Med | 33 | 12.0 | 1693-702 | 2015-04-13 21:45:29 UTC | 2015-04-14 04:25:30 UTC |
| 2 | 356 | Abdel-Aal E.-S.M., Hucl P. | 2003 | Composition and stability of anthocyanins in b... | ABDEL-AAL 2003 | Journal of Agricultural and Food Chemistry | 51 | | 2174-2180 | 2015-04-13 21:45:25 UTC | 2015-04-14 04:25:30 UTC |
| 3 | 458 | Abdel-Aal E.-S. M., Young C., Rabalski I. | 2006 | Anthocyanin composition in black, blue, pink, ... | ABDEL-AAL 2006 | Journal of Agricultural and Food Chemistry | 54 | | 4696-4704 | 2006-04-09 12:07:36 UTC | 2015-04-14 04:25:31 UTC |
| 4 | 332 | Abril M., Negueruela A.I., Perez C., Juan T., ... | 2005 | Preliminary study of resveratrol content in Ar... | Apr-05 | Food Chemistry | 92 | 4.0 | 729-736 | 2015-04-13 21:45:25 UTC | 2015-04-13 21:45:25 UTC |
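
The titles in this table are what gets sent to PubMed as the search term. A free-text term is matched against all fields, so a title can easily return zero or several records. One possible way to make the match stricter is to restrict the term to the title field with PubMed's `[Title]` tag. The sketch below only builds the query string; `build_title_query` is a hypothetical helper, not part of the notebook, and the set of characters it strips is an assumption about what may be read as query syntax.

```python
import re

def build_title_query(title: str) -> str:
    """Return a search term restricted to PubMed's title field (hypothetical helper)."""
    # Assumption: brackets, parentheses, colons and quotes may be interpreted
    # as query syntax, so they are replaced with spaces.
    cleaned = re.sub(r'[\[\]():"]', " ", title)
    # Collapse runs of whitespace left behind by the substitution.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return f"{cleaned}[Title]"

print(build_title_query("Composition and stability of anthocyanins in blue-grained wheat"))
```

The returned string could be passed as the `term` parameter of an ESearch request. Whether this recovers more PMIDs than the plain title would have to be checked on the actual dataset.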


In [13]:
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
ESEARCH_URL = BASE_URL + "esearch.fcgi"
FETCH_URL = BASE_URL + "efetch.fcgi"

# Step 1: Search for the PMID of the article by title
def search_pmid_by_title(title):
    params = {
        "db": "pubmed",
        "term": title,
        "retmode": "json"
    }
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    data = response.json()

    # Accept the result only if the search returns exactly one PMID
    result = data.get('esearchresult', {})
    if result.get('count') != '1':
        print(f"PMID not found for title: {title}")
        return None

    return result['idlist'][0]

# Step 2: Fetch article abstract by PMID
def fetch_abstract_by_pmid(pmid):
    params = {
        "db": "pubmed",
        "id": pmid,
        "retmode": "text",
        "rettype": "abstract"
    }
    response = requests.get(FETCH_URL, params=params, timeout=30)
    return response.text

# Process each article in the dataset
abstracts = []
for i, article in dataset.iterrows():
    title = article['title']
    pmid = search_pmid_by_title(title)
    if pmid:
        abstract = fetch_abstract_by_pmid(pmid)
        abstracts.append(abstract)
        print(f"Fetched abstract for article: {title}")
    else:
        print(f"Failed to fetch abstract for article: {title}")
        abstracts.append("Abstract not found")

    print("Sleeping for 1...")
    time.sleep(1)  # Pause 1 s per article: NCBI allows at most 3 requests per second without an API key, and each article needs up to 2

# Add abstracts to the dataset
dataset['abstract'] = abstracts

# Save the updated dataset
dataset.to_csv('publications_with_abstracts.csv', index=False)

Fetched abstract for article: Polyphenol composition and antioxidant activity in strawberry purees  impact of achene level and storage
Sleeping for 1...
PMID not found for title: Uptake and metabolism of epicatechin and its access to the brain after oral ingestion
Failed to fetch abstract for article: Uptake and metabolism of epicatechin and its access to the brain after oral ingestion
Sleeping for 1...
Fetched abstract for article: Composition and stability of anthocyanins in blue-grained wheat
Sleeping for 1...
PMID not found for title: Anthocyanin composition in black, blue, pink, purple, and red cereal grains
Failed to fetch abstract for article: Anthocyanin composition in black, blue, pink, purple, and red cereal grains
Sleeping for 1...
PMID not found for title: Preliminary study of resveratrol content in Aragon red and rose wines
Failed to fetch abstract for article: Preliminary study of resveratrol content in Aragon red and rose wines
Sleeping for 1...
Fetched abstract for arti
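
The log shows that a noticeable share of titles yields no unique PMID. To quantify this, the `"Abstract not found"` placeholder written by the loop can be counted. The following is a minimal sketch, assuming the placeholder string is exactly the one used in the loop; `summarize_retrieval` is a hypothetical helper, and the example list is illustrative input, not a result from the real run.

```python
from collections import Counter

# Same marker the retrieval loop appends when no PMID is found
PLACEHOLDER = "Abstract not found"

def summarize_retrieval(abstracts):
    """Count found vs. missing abstracts in any iterable of strings."""
    counts = Counter("missing" if a == PLACEHOLDER else "found" for a in abstracts)
    total = counts["found"] + counts["missing"]
    # Guard against an empty input to avoid dividing by zero
    rate = counts["found"] / total if total else 0.0
    return {"found": counts["found"], "missing": counts["missing"], "rate": rate}

# Illustrative input only
example = ["some abstract text", "Abstract not found", "Abstract not found"]
print(summarize_retrieval(example))
```

Because the function accepts any iterable, it could also be called on the notebook's column directly, for example `summarize_retrieval(dataset['abstract'])`.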