# BioMed: Information Retrieval - BioMedical Information Retrieval System

---

**Group:**
- Reyes Castro, Didier Yamil (didier.reyes.castro@alumnos.upm.es)
- Rodriguez Fernández, Cristina ()

**Course:** BioMedical Informatics - 2025/26

**Institution:** Polytechnic University of Madrid (UPM)

**Date:** November 2026

---

## Goal

To develop an Information Retrieval system — specifically, a **binary text classifier** — to identify scientific articles in the PubMed database that are related to a given set of abstracts within a defined research topic. In this case, the focus is on a collection of 1,308 manuscripts containing information on the polyphenol composition of various foods.

## Setup and Installation

In [1]:
# !pip install scikit-learn pandas requests transformers torch datasets numpy

In [3]:
import requests
import time
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback
from datasets import Dataset
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix, classification_report

## **Task 1:**

Retrieve from PubMed the abstracts associated with each publication in publications.xlsx

(Takes about 21 minutes with an API key.)

In [4]:
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
ESEARCH_URL = BASE_URL + "esearch.fcgi"
FETCH_URL = BASE_URL + "efetch.fcgi"
DS_WITH_PMID = 'publications_pmid.csv'
PMID_ABSTRACTS = 'publications_pmid_abstract.csv'
ELSEVIER_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"
ELSEVIER_ABSTRACTS = 'publications_abstract_pubmed_elsevier.csv'


In [5]:
dataset = pd.read_csv('publications.csv')
dataset

Unnamed: 0,id,authors,year_of_publication,title,abbreviation,journal_name,journal_volume,journal_issue,pages,created_at,updated_at
0,1216,"Aaby K., Wrolstad R.E., Ekeberg D., Skrede G.",2007,Polyphenol composition and antioxidant activit...,AABY 2007,Journal of Agricultural and Food Chemistry,55,13,5156-5166,2012-12-01 22:21:08 UTC,2015-04-14 04:25:30 UTC
1,1052,"Abd El Mohsen M.M., Kuhnle G., Rechner A.R., S...",2002,Uptake and metabolism of epicatechin and its a...,ABD EL MOHSEN 2002,Free Radic Biol Med,33,12,1693-702,2015-04-13 21:45:29 UTC,2015-04-14 04:25:30 UTC
2,356,"Abdel-Aal E.-S.M., Hucl P.",2003,Composition and stability of anthocyanins in b...,ABDEL-AAL 2003,Journal of Agricultural and Food Chemistry,51,,2174-2180,2015-04-13 21:45:25 UTC,2015-04-14 04:25:30 UTC
3,458,"Abdel-Aal E.-S. M., Young C., Rabalski I.",2006,"Anthocyanin composition in black, blue, pink, ...",ABDEL-AAL 2006,Journal of Agricultural and Food Chemistry,54,,4696-4704,2006-04-09 12:07:36 UTC,2015-04-14 04:25:31 UTC
4,332,"Abril M., Negueruela A.I., Perez C., Juan T., ...",2005,Preliminary study of resveratrol content in Ar...,Apr-05,Food Chemistry,92,4,729-736,2015-04-13 21:45:25 UTC,2015-04-13 21:45:25 UTC
...,...,...,...,...,...,...,...,...,...,...,...
1303,816,"Zielinski H., Kozlowska H., Lewczuk B.",2001,Bioactive compounds in the cereal grains befor...,ZIELINSKI 2001,Innovative Food Science and Emerging Technologies,2,,159-169,2015-04-13 21:45:28 UTC,2015-04-13 21:45:28 UTC
1304,497,"Zielinski H., Michalska A., Piskula M.K., Kozl...",2006,Antioxidants in thermally treated buckwheat gr...,ZIELINSKI 2006,Molecular Nutrition and Food Research,50,,824-832,2015-04-13 21:45:26 UTC,2015-04-14 13:51:47 UTC
1305,743,"Zimmermann R., Bauermann U., Morales F.",2006,Effects of growing site and nitrogen fertiliza...,ZIMMERMANN 2006,Journal of the Science of Food and Agriculture,86,,415-419,2015-04-13 21:45:27 UTC,2015-04-13 21:45:27 UTC
1306,203,"Zuo Y., Wang C., Zhan J.",2002,"Separation, characterization and quantitation ...",ZUO 2002,Journal of Agricultural and Food Chemistry,50,13,3789-3794,2015-04-13 21:45:24 UTC,2015-04-14 13:51:48 UTC


In [10]:
# Step 1: Search for the PMID of the articles by title
def search_pmid(article):

    title = article['title']
    params = {
        "db": "pubmed",
        "term": f"{title}",
        "retmode": "json",
        "field": "title"
    }

    try:

        # Trying to find the PMID
        response = requests.get(ESEARCH_URL, params=params, timeout=10)
        response.raise_for_status()
        data = response.json()

        if len(data["esearchresult"]["idlist"]) >= 1:
            pmid = data['esearchresult']['idlist'][0]
            print(f"> Found PMID for article: {pmid}")
            return pmid

        print(f"> No PMID found for article.")
        return None

    except requests.exceptions.RequestException as e:
        print(f"> ERROR during request for article: {e}")
        return None

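ds_pmid = dataset.copy()
ds_pmid['pmid'] = None  # Create the column up front so missing PMIDs stay as None
for idx, article in ds_pmid.iterrows():
    print(f"[{idx + 1}/{len(ds_pmid)}] Searching PMID for: {article['title']}")
    pmid = search_pmid(article)
    ds_pmid.at[idx, 'pmid'] = pmid
    time.sleep(0.34)  # NCBI allows 3 requests/second without an API key

ds_pmid.to_csv(DS_WITH_PMID, index=False)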

[1/1308] Searching PMID for: Polyphenol composition and antioxidant activity in strawberry purees  impact of achene level and storage
> ERROR during request for article: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /entrez/eutils/esearch.fcgi?db=pubmed&term=Polyphenol+composition+and+antioxidant+activity+in+strawberry+purees++impact+of+achene+level+and+storage&retmode=json&field=title (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7dd4689ad8b0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
[2/1308] Searching PMID for: Uptake and metabolism of epicatechin and its access to the brain after oral ingestion
> ERROR during request for article: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /entrez/eutils/esearch.fcgi?db=pubmed&term=Uptake+and+metabolism+of+epicatechin+and+its+access+to+the+brain+after+oral+ingestion&retmode=json&field=

In [None]:
print("Number of articles with PMID found:", ds_pmid['pmid'].notnull().sum())
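The log above shows requests failing with "Network is unreachable", and `search_pmid` gives up after a single attempt. A minimal sketch of retrying with exponential backoff follows. The helper name `with_retries` and its parameters are hypothetical and not part of the original notebook; it accepts any zero-argument callable.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep,
                 retry_on=(Exception,)):
    """Run `call()` up to `attempts` times, doubling the wait after each failure.

    Hypothetical helper, not part of the original notebook. `sleep` is
    injectable so the function can be exercised without really waiting.
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
```

One way to use it would be to wrap the `requests.get` call inside `search_pmid`, passing `retry_on=(requests.exceptions.RequestException,)` so that only network errors are retried.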

In [9]:
# Step 2: Fetch article abstract by PMID
def fetch_abstract_by_pmid(pmid):
    params = {
        "db": "pubmed",
        "id": f"{pmid}",
        "retmode": "text",
        "rettype": "abstract",
    }

    try:
        response = requests.get(FETCH_URL, params=params, timeout=10)
        response.raise_for_status()
        print(f"> Fetched abstract!!")
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"> ERROR fetching abstract for PMID '{pmid}': {e}")
        return None

ds_pmid_abstract = ds_pmid.copy()
for idx, article in ds_pmid_abstract.iterrows():
    pmid = article['pmid']
    if pd.notnull(pmid):
        print(f"[{idx + 1}/{len(ds_pmid_abstract)}] Fetching abstract for PMID: {pmid}")
        abstract = fetch_abstract_by_pmid(pmid)
        ds_pmid_abstract.at[idx, 'abstract'] = abstract
    else:
        print(f"[{idx + 1}/{len(ds_pmid_abstract)}] No PMID available, skipping abstract fetch.")
        ds_pmid_abstract.at[idx, 'abstract'] = None

NameError: name 'ds_pmid' is not defined
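With `retmode="text"` and `rettype="abstract"`, EFetch returns the whole formatted citation (journal line, title, authors, affiliations, abstract, PMID), so the `abstract` column ends up holding more than the abstract itself. The sketch below assumes the request is made with `retmode="xml"` for a single PMID instead, and pulls out only the `AbstractText` elements. The function name `extract_abstract` and the sample record are illustrative, not taken from the notebook.

```python
import xml.etree.ElementTree as ET


def extract_abstract(pubmed_xml):
    """Return the abstract text from an EFetch XML response, or None if absent.

    Hypothetical helper. Structured abstracts carry several AbstractText
    elements, each optionally labelled (BACKGROUND, METHODS, ...).
    """
    root = ET.fromstring(pubmed_xml)
    parts = []
    for node in root.iter("AbstractText"):
        text = "".join(node.itertext()).strip()  # keeps text inside <i>, <sub>, ...
        if not text:
            continue
        label = node.get("Label")
        parts.append(f"{label}: {text}" if label else text)
    return " ".join(parts) if parts else None


# Hand-written sample in the shape of a PubMed record (not real data).
SAMPLE = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
<PMID>0</PMID><Article><ArticleTitle>Example</ArticleTitle>
<Abstract>
<AbstractText Label="BACKGROUND">Berries contain <i>polyphenols</i>.</AbstractText>
<AbstractText Label="RESULTS">Storage reduced them.</AbstractText>
</Abstract></Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""

print(extract_abstract(SAMPLE))
```

Storing only the abstract text keeps citation metadata such as journal names and author lists out of the classifier input, where it could otherwise act as a shortcut feature.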

In [None]:
ds_pmid_abstract.to_csv(PMID_ABSTRACTS, index=False)
print("Number of articles with abstract fetched:", ds_pmid_abstract['abstract'].notnull().sum())
print("Number of articles without abstract fetched:", ds_pmid_abstract['abstract'].isnull().sum())

In [None]:
def search_scopus(article_row, api_key):

    try:

        title = article_row.get('title', '').replace('"', '') # Remove quotes for query
        headers = {"Accept": "application/json"}
        params = {
            "query": f"TITLE-ABS-KEY(\"{title}\")",
            "apiKey": api_key
        }

        response = requests.get(ELSEVIER_SEARCH_URL,
                                headers=headers, params=params, timeout=10)
        response.raise_for_status()
        data = response.json()

        if 'search-results' in data and data['search-results']['entry']:
            try:
                return data['search-results']['entry'][0]['prism:url']
            except KeyError:
                return None
        return None

    except requests.exceptions.RequestException as e:
        print(f"> Scopus ERROR: {e}")
        return None

In [None]:
import os

# Read the Elsevier API key from the environment instead of hard-coding it.
# The variable name ELSEVIER_API_KEY is an assumption; use whatever name you export.
ELSEVIER_API_KEY = os.environ.get("ELSEVIER_API_KEY")

ds_elsevier = pd.read_csv(PMID_ABSTRACTS)

counters = {
    'total_missing': ds_elsevier['abstract'].isnull().sum(),
    'elsevier_found': 0,
    'failed': 0
}

for i, row in ds_elsevier.iterrows():

    if not pd.isnull(row['abstract']):
        continue

    print(f"[{i + 1}/{len(ds_elsevier)}] Searching ELSEVIER for abstract of article: {row['title']}")
    abstract_url = search_scopus(row, ELSEVIER_API_KEY)

    if abstract_url:
        response = requests.get(abstract_url,
                                headers={"Accept": "application/json",
                                         "X-ELS-APIKey": ELSEVIER_API_KEY},
                                timeout=10)
        if response.status_code == 200:
            data = response.json()
            try:
                abstract_text = data['abstracts-retrieval-response']['coredata']['dc:description']
                ds_elsevier.at[i, 'abstract'] = abstract_text
                print("> Found abstract via ELSEVIER!")
                counters['elsevier_found'] += 1
            except KeyError:
                print("> Abstract not found in ELSEVIER response.")
                counters['failed'] += 1
        else:
            # Count non-200 responses as failures too
            print(f"> ELSEVIER request failed with status {response.status_code}.")
            counters['failed'] += 1
    else:
        print("> No Scopus match found.")
        counters['failed'] += 1

    time.sleep(1) # Polite 1-second delay

ds_elsevier.to_csv(ELSEVIER_ABSTRACTS, index=False)
print("Summary of ELSEVIER abstract search:")
print(f"Total missing abstracts at start: {counters['total_missing']}")
print(f"Abstracts found via ELSEVIER: {counters['elsevier_found']}")
print(f"Failed attempts: {counters['failed']}")
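Both `search_pmid` and `search_scopus` accept the first hit without checking that it is the same article, so a loosely matching title can attach the wrong abstract. A minimal sketch of a guard based on string similarity is shown below. The helper names are hypothetical, and the 0.9 threshold is an arbitrary starting point rather than a tuned value.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase and collapse punctuation/whitespace so formatting differences are ignored."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def titles_match(expected, found, threshold=0.9):
    """True when two titles are near-identical after normalisation.

    Hypothetical helper; the 0.9 threshold is an arbitrary starting point.
    """
    ratio = SequenceMatcher(None, normalize_title(expected),
                            normalize_title(found)).ratio()
    return ratio >= threshold
```

Applying the check requires the title of the returned record, which neither search function currently fetches. Note that the normalisation drops non-ASCII letters, which may matter for titles with accented characters.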

TODO (@CRISTINA): Find the ~100 articles that are still missing, by hand if necessary. The latest CSV is "publications_abstract_pubmed_elsevier.csv"; start from there.

After that, the rest of the code should work, although the final dataset and some variables need to be changed. If it does not, ask @DIDIER.



In [6]:
import os

# --- Configuration ---

# Final file after trying PubMed and Elsevier (mentioned in the TODO cell)
FINAL_DATASET_FILE = 'publications_abstract_pubmed_elsevier.csv'

# Output file name for the articles without an abstract
OUTPUT_FILE = 'documentos_sin_abstract.csv'

# ---------------------

print(f"Looking for file: {FINAL_DATASET_FILE}...")

if os.path.exists(FINAL_DATASET_FILE):
    print("File found. Loading data...")

    try:
        # Load the dataset holding the results of all search attempts
        final_df = pd.read_csv(FINAL_DATASET_FILE)

        print(f"Data loaded. Total articles: {len(final_df)}")

        # Keep the rows whose 'abstract' column is still empty (NaN)
        missing_abstracts_df = final_df[final_df['abstract'].isnull()]

        num_missing = len(missing_abstracts_df)

        if num_missing > 0:
            print(f"Found {num_missing} documents with a missing abstract.")

            # Save these documents (with all their fields) to a new CSV
            missing_abstracts_df.to_csv(OUTPUT_FILE, index=False, encoding='utf-8')

            print(f"Success! Saved the {num_missing} documents to: {OUTPUT_FILE}")

            # Optional: show the first 5 articles found
            print("\nFirst 5 articles without an abstract:")
            print(missing_abstracts_df.head())
        else:
            print("No documents with missing abstracts were found in the file.")

    except Exception as e:
        print(f"An error occurred while processing the file: {e}")

else:
    print(f"Error: could not find the file '{FINAL_DATASET_FILE}'.")
    print("Make sure you have run the previous notebook cells (Task 1, including the PubMed and Elsevier searches) to generate this file.")

Looking for file: publications_abstract_pubmed_elsevier.csv...
File found. Loading data...
Data loaded. Total articles: 1308
Found 105 documents with a missing abstract.
Success! Saved the 105 documents to: documentos_sin_abstract.csv

First 5 articles without an abstract:
      id                                            authors  \
33   167                  Amiot M.J., Aubert S., Nicolas J.   
46   584  Aparicio-Fernandez X., Manzo-Bonilla L., Loarc...   
49   143                 Arena E., Fallico B., Maccarone E.   
96   104                                 Bassi D., Selli R.   
102  115              Begona Barroso, M.  Werken van de, G.   

    year_of_publication                                              title  \
33                 1993  Phenolic composition and browning susceptibili...   
46                 2005  Comparison of antimutagenic activity of phenol...   
49                 2001  Evaluation of the antioxidant ca

## **Task 2:**

Use the E-utilities API to retrieve articles whose content is not relevant to this task. The irrelevant set should be the same size as the relevant set.

In [7]:
def get_articles_pmids_for_title(title, count, api_key=None):

    params = {
        "db": "pubmed",
        "term": f"{title}[Title]",
        "retmode": "json",
        "retmax": count,
        "api_key": api_key
    }

    try:
        response = requests.get(ESEARCH_URL, params=params)
        response.raise_for_status()
        data = response.json()

        if 'esearchresult' in data and data['esearchresult']['count'] != '0':
            return data['esearchresult']['idlist']
        else:
            print(f"Found {data['esearchresult']['count']} irrelevant articles.")
            return []

    except requests.exceptions.RequestException as e:
        print(f"Error during request for irrelevant articles: {e}")
        return []


In [8]:
IRRELEVANT_PUBLICATIONS = 'irrelevant_publications.csv'

final_df = pd.read_csv('publications_abstract_pubmed_elsevier.csv')

irrelevant_pmids_list = get_articles_pmids_for_title("cancer", len(final_df))

irrelevant_abstracts = []
for pmid in irrelevant_pmids_list:

    article_info = {
        'pmid': pmid,
        'abstract': None
    }

    article_info['abstract'] = fetch_abstract_by_pmid(pmid)
    irrelevant_abstracts.append(article_info)

    # NCBI allows 3 requests/second without an API key and 10 with one.
    # fetch_abstract_by_pmid sends no key, so wait ~0.34 s between requests;
    # lower this to 0.1 only if an API key is added to the request.
    time.sleep(0.34)

# Save irrelevant abstracts to a new dataset
irrelevant_df = pd.DataFrame(irrelevant_abstracts)
irrelevant_df.to_csv(IRRELEVANT_PUBLICATIONS, index=False)

Error during request for irrelevant articles: Expecting value: line 1 column 1 (char 0)


In [None]:
irrelevant_df
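A title search for "cancer" can still return papers about polyphenols (for example, studies of polyphenols and cancer), which would then be labelled irrelevant. The sketch below shows one possible safeguard: exclude topic terms in the query itself, and drop any candidate PMID that already appears in the relevant set. The helper names and the excluded terms are illustrative assumptions, not part of the original notebook.

```python
def build_negative_term(topic, excluded_terms):
    """Build a PubMed query for `topic` in the title, excluding the given terms.

    Hypothetical helper; [Title] and [Title/Abstract] are standard PubMed field tags.
    """
    term = f"{topic}[Title]"
    if excluded_terms:
        excluded = " OR ".join(f"{t}[Title/Abstract]" for t in excluded_terms)
        term += f" NOT ({excluded})"
    return term


def drop_known_relevant(candidate_pmids, relevant_pmids):
    """Remove candidates that are already in the relevant set, keeping order."""
    relevant = {str(p) for p in relevant_pmids}
    return [p for p in candidate_pmids if str(p) not in relevant]


print(build_negative_term("cancer", ["polyphenol*", "flavonoid*"]))
```

Because `get_articles_pmids_for_title` appends `[Title]` to its argument, the full term built here would have to be passed to ESearch directly rather than through that function.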

## **Task 4:**

Implement the chosen retrieval system using the programming language of their choice. If the information retrieval system is based on machine learning techniques, the student must split the existing datasets (relevant and non-relevant documents) into three distinct groups (training, validation, and testing) to carry out the model training.

**CHOSEN RETRIEVAL SYSTEM:** BioBERT-based Binary Text Classifier

In [None]:
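# Adding target variable 'relevance'
# relevant_df is not defined earlier in the notebook; this assumes the relevant set
# is the PubMed + Elsevier file produced in Task 1.
relevant_df = pd.read_csv(ELSEVIER_ABSTRACTS)
relevant_df['relevance'] = 1
irrelevant_df['relevance'] = 0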

# Combining relevant and irrelevant datasets and maintaining only abstract and relevance columns
features = ['abstract', 'relevance']
combined_df = pd.concat([relevant_df[features], irrelevant_df[features]], ignore_index=True)

# Remove any rows where the abstract is missing (e.g., API fetch failed)
combined_df.dropna(subset=['abstract'], inplace=True)
combined_df.reset_index(drop=True, inplace=True)

# Saving
combined_df.to_csv('combined_publications.csv', index=False)

print("Class distribution:")
print(combined_df['relevance'].value_counts())

combined_df

The steps below follow the Hugging Face guide to fine-tuning BERT for text classification: https://huggingface.co/docs/transformers/en/tasks/sequence_classification

- Train/validation/test split: 80%/10%/10%

In [None]:
RANDOM_STATE = 42

train_df, test_df = train_test_split(combined_df,
                                     test_size=0.2,
                                     stratify=combined_df["relevance"],
                                     random_state=RANDOM_STATE)

val_df, test_df = train_test_split(test_df,
                                   test_size=0.5,
                                   stratify=test_df["relevance"],
                                   random_state=RANDOM_STATE)

print(f"Training size: {len(train_df)}")
print(f"Validation size: {len(val_df)}")
print(f"Test size: {len(test_df)}")

- Convert Pandas DataFrame to HuggingFace Dataset

In [None]:
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

- Tokenization of abstracts using BioBERT tokenizer

In [None]:
BERT_MODEL_NAME = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL_NAME)

def tokenize(examples):
    return tokenizer(examples["abstract"],
                     padding="max_length",
                     truncation=True,
                     max_length=512 # Maximum length for BERT models
                    )

train_dataset = train_dataset.map(tokenize, batched=True)
val_dataset = val_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# Renaming the target column to 'labels' as expected by HuggingFace Trainer
train_dataset = train_dataset.rename_column("relevance", "labels")
val_dataset = val_dataset.rename_column("relevance", "labels")
test_dataset = test_dataset.rename_column("relevance", "labels")

- Loading BioBERT model for binary text classification (relevant vs irrelevant)

In [None]:
id2label = {0: "irrelevant", 1: "relevant"}
label2id = {"irrelevant": 0, "relevant": 1}

model = AutoModelForSequenceClassification.from_pretrained(BERT_MODEL_NAME,
                                                           num_labels=2,
                                                           id2label=id2label,
                                                           label2id=label2id)

- Defining evaluation metrics
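For reference, the ranking metrics implemented in the following cell can be written as below. The notation is an assumption introduced here: documents are sorted by descending score, $\mathrm{rel}_i \in \{0, 1\}$ is the label of the document at rank $i$, $n$ is the number of documents, $R = \sum_i \mathrm{rel}_i$ is the number of relevant documents, and $P$ and $\mathrm{Rec}$ are the thresholded precision and recall. The formula for $P@k$ assumes $k \le n$.

```latex
\begin{align*}
P@k &= \frac{1}{k} \sum_{i=1}^{k} \mathrm{rel}_i \\
\text{R-Precision} &= P@R \\
\mathrm{AP} &= \frac{1}{R} \sum_{i=1}^{n} \mathrm{rel}_i \cdot P@i \\
\mathrm{RR} &= \frac{1}{\min\{\, i : \mathrm{rel}_i = 1 \,\}} \\
F_\beta &= \frac{(1+\beta^2)\, P \cdot \mathrm{Rec}}{\beta^2 P + \mathrm{Rec}}
\end{align*}
```

With $\beta = 2$, as used for the `F2` entry, recall is weighted more heavily than precision.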

In [None]:
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score
)
import numpy as np

# ---- For ranked results ----

def precision_at_k(y_true, y_score, k):
    order = np.argsort(y_score)[::-1]
    y_true_sorted = np.array(y_true)[order]
    top_k = y_true_sorted[:k]
    return np.mean(top_k)

def r_precision(y_true, y_score):
    R = int(np.sum(y_true))
    return precision_at_k(y_true, y_score, R) if R > 0 else 0

def average_precision(y_true, y_score):
    order = np.argsort(y_score)[::-1]
    y_true_sorted = np.array(y_true)[order]
    precisions = []
    relevant = 0
    for i, val in enumerate(y_true_sorted, start=1):
        if val == 1:
            relevant += 1
            precisions.append(relevant / i)
    return np.mean(precisions) if precisions else 0

def reciprocal_rank(y_true, y_score):
    order = np.argsort(y_score)[::-1]
    y_true_sorted = np.array(y_true)[order]
    for i, val in enumerate(y_true_sorted, start=1):
        if val == 1:
            return 1 / i
    return 0

def f_measure(precision, recall, beta=1.0):
    return (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall) if (precision + recall) else 0


# ---- metrics ----

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = logits[:, 1] - logits[:, 0]  # logit margin; ranks documents exactly like the softmax positive-class probability
    predictions = np.argmax(logits, axis=-1)

    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary", zero_division=0)

    # IR Metrics
    p_at_5 = precision_at_k(labels, probs, 5)
    p_at_10 = precision_at_k(labels, probs, 10)
    r_prec = r_precision(labels, probs)
    avg_prec = average_precision(labels, probs)
    rr = reciprocal_rank(labels, probs)
    auroc = roc_auc_score(labels, probs)
    f_beta_2 = f_measure(precision, recall, beta=2.0)  # example

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "F2": f_beta_2,
        "AUROC": auroc,
        "P@5": p_at_5,
        "P@10": p_at_10,
        "R-Precision": r_prec,
        "AvgPrecision": avg_prec,
        "ReciprocalRank": rr
    }
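To make the ranking metrics concrete, here is a small worked example on a made-up five-document ranking. It is a plain-Python mirror of the functions above, written so that it runs without numpy; the labels and scores are invented for illustration.

```python
def rank_labels(y_true, y_score):
    """Return the labels reordered by descending score."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    return [y_true[i] for i in order]


def precision_at_k(ranked, k):
    """Fraction of relevant documents among the top k."""
    top_k = ranked[:k]
    return sum(top_k) / len(top_k)


def average_precision(ranked):
    """Mean of the precision values taken at each relevant document's rank."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked, start=1):
        if rel == 1:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0


def reciprocal_rank(ranked):
    """1 / rank of the first relevant document, or 0 if there is none."""
    for i, rel in enumerate(ranked, start=1):
        if rel == 1:
            return 1 / i
    return 0.0


# Toy data (made up): relevant documents end up at ranks 1 and 3.
labels = [0, 1, 0, 1, 0]
scores = [0.8, 0.9, 0.6, 0.7, 0.5]
ranked = rank_labels(labels, scores)   # [1, 0, 1, 0, 0]

print(precision_at_k(ranked, 2))       # 0.5
print(average_precision(ranked))       # (1/1 + 2/3) / 2
print(reciprocal_rank(ranked))         # 1.0
```

Since there are two relevant documents, R-Precision equals P@2 here, which is also 0.5.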


- Setting the training arguments

In [None]:
training_args = TrainingArguments(
    output_dir="./biobert_pubmed_classifier",

    # Training hyperparameters
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,

    # Optimiser settings
    weight_decay=0.01,

    # Evaluation settings
    # Note: recent transformers releases rename this argument to `eval_strategy`.
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=100,

    # Model selection
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,

    # Performance
    fp16=torch.cuda.is_available(),
    dataloader_num_workers=4,

    seed=RANDOM_STATE,
    push_to_hub=False,
    report_to="none"
)

- Actual training using Trainer API

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

trainer.train()

- Evaluating on the test set

In [None]:
predictions_output = trainer.predict(test_dataset)
predictions = np.argmax(predictions_output.predictions, axis=-1)
true_labels = predictions_output.label_ids

# Calculate all metrics
test_metrics = compute_metrics((predictions_output.predictions, true_labels))

print("\nTest Set Results:")
print(f"Accuracy:  {test_metrics['accuracy']:.4f}")
print(f"Precision: {test_metrics['precision']:.4f}")
print(f"Recall:    {test_metrics['recall']:.4f}")
print(f"F1-Score:  {test_metrics['f1']:.4f}")

print("\nRanked Results:")
print(f"AUROC:          {test_metrics['AUROC']:.4f}")
print(f"P@5:            {test_metrics['P@5']:.4f}")
print(f"P@10:           {test_metrics['P@10']:.4f}")
print(f"R-Precision:    {test_metrics['R-Precision']:.4f}")
print(f"Avg Precision:  {test_metrics['AvgPrecision']:.4f}")
print(f"ReciprocalRank: {test_metrics['ReciprocalRank']:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(
    true_labels,
    predictions,
    target_names=['Irrelevant', 'Relevant'],
    digits=4
))

# Confusion matrix
print("\nConfusion Matrix:")
cm = confusion_matrix(true_labels, predictions)
print(cm)
print(f"\nTrue Negatives:  {cm[0][0]} (correctly identified irrelevant)")
print(f"False Positives: {cm[0][1]} (incorrectly marked relevant)")
print(f"False Negatives: {cm[1][0]} (missed relevant papers)")
print(f"True Positives:  {cm[1][1]} (correctly identified relevant)")
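`trainer.predict` returns raw logits, not probabilities. The sketch below shows the softmax conversion for a two-class head in plain Python, using made-up logits. It also illustrates that the positive-class probability depends only on the difference between the two logits, so a ranking by the positive logit alone can order documents differently from a ranking by probability.

```python
import math


def positive_probability(logit_neg, logit_pos):
    """Softmax probability of the positive class for a two-class classifier."""
    m = max(logit_neg, logit_pos)            # subtract the max for numerical stability
    e_neg = math.exp(logit_neg - m)
    e_pos = math.exp(logit_pos - m)
    return e_pos / (e_neg + e_pos)


# Made-up logits as (irrelevant, relevant) pairs.
doc_a = (2.5, 3.0)   # larger positive logit, but small margin
doc_b = (-1.0, 2.0)  # smaller positive logit, but large margin

p_a = positive_probability(*doc_a)
p_b = positive_probability(*doc_b)
print(f"doc_a: {p_a:.3f}, doc_b: {p_b:.3f}")
```

Here `doc_b` receives the higher probability even though `doc_a` has the larger positive logit, which is why ranked metrics should be computed from probabilities or logit margins.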

- Saving the trained model

In [None]:
model_save_path = './final_biobert_classifier'
trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)