<div style="text-align: center;">
    <h3>Applied Data Science Project</h3>
    <h2><b>Patient Preference Studies Classification System</b></h2>
    <h1><b>Class Prediction System</b></h1>
    <h5>Francesco Giuseppe Gillio</h5>
    <h5>César Augusto Seminario Yrigoyen</h5>
</div>

<div style="text-align: center;">
    <img src="https://upload.wikimedia.org/wikipedia/it/4/47/Logo_PoliTo_dal_2021_blu.png" width="250">
</div>

https://github.com/adsp-polito/2024-P8-PPS

**Requirements**

In [1]:
import gc
import sys
import torch
import joblib

import numpy as np
import pandas as pd

from typing import List, Tuple
from transformers import AutoTokenizer, AutoModel

In [2]:
np.random.seed(42)
torch.manual_seed(42)

<torch._C.Generator at 0x7f8f54815bd0>

**Class Prediction Function (Patient Preference Study or Not)**

In [3]:
def predict(
    input: str,
    title: str = 'title',
    abstract: str = 'abstract',
    threshold: float = 0.3875,
    weights: Tuple[float, float] = (0.4375, 0.5625)
) -> pd.DataFrame:
    """
    classifies academic papers as patient preference studies (or not) by combining
    the predictions of two machine learning classifiers, each fed with text
    embeddings from a transformer model

    parameters:
        input (str): path to the dataset file (csv or xlsx format)
        title (str): column in the dataset that contains the titles of the papers
        abstract (str): column in the dataset that contains the abstracts of the papers
        threshold (float): decision threshold on the weighted positive-class probability; papers at or
                           above it are classified as relevant to patient preference studies
        weights (tuple): two floats that weight the predicted probabilities of the two classifiers
                         (pubmed-knn-pipeline first, biomed-svc-pipeline second) in the final decision

    returns:
        pd.DataFrame: the original dataset with an additional column, 'PPS',
                      with binary values: relevant (1) or non-relevant (0)
    """

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    def remove(model):
        """
        deallocates memory and clears the gpu cache

        parameters:
            model: a transformer model to get embeddings for the text inputs (title and abstract)
        """
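        # `del` alone only drops this local reference, so move the weights off the gpu first
        model.to('cpu')
        del model
        gc.collect()
        torch.cuda.empty_cache()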

    def get(
        row: pd.Series,
        base: str,
        model: AutoModel,
        tokenizer: AutoTokenizer
    ) -> np.ndarray:
        """
        computes an embedding for a single paper: encodes the title and the abstract separately
        with a transformer model, then combines the two vectors (concatenation or weighted sum,
        depending on the base model)

        parameters:
            row (pd.Series): a single row from the dataset
            base (str): the identifier of the transformer model architecture
            model (AutoModel): the pre-trained transformer model for text embeddings
            tokenizer (AutoTokenizer): the tokenizer of the transformer model

        returns:
            np.ndarray: the sentence embeddings of the paper
        """

        title = [str(row['title']) if isinstance(row['title'], str) else '.']
        abstract = [str(row['abstract']) if isinstance(row['abstract'], str) else '.']

        def meanpooling(
            output: Tuple[torch.Tensor, ...],
            mask: torch.Tensor
        ) -> torch.Tensor:
            """
            computes the mean of the token-level embeddings, weighted by the attention mask
            so that padding tokens are ignored

            parameters:
                output (tuple): the output tuple from the transformer model
                mask (torch.Tensor): the attention mask

            returns:
                torch.Tensor: the embedding vector of the input text
            """
            embeddings = output[0]
            mask = mask.unsqueeze(-1).expand(embeddings.size()).float()
            return torch.sum(embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

        def tokenize(
            text: List[str],
            tokenizer: AutoTokenizer
        ) -> dict:
            """
            tokenizes the input text for model inference

            parameters:
                text (list): a list of text strings
                tokenizer (AutoTokenizer): the tokenizer of the transformer model

            returns:
                dict: a dictionary with tokens, ready for the model to process
            """
            inputs = tokenizer(
                text,
                padding=True,
                truncation=True,
                return_tensors='pt',
                max_length=512
            )
            inputs = {
                key: value.to(device)
                for key, value in inputs.items()
            }
            return inputs

        def encode(
            text: List[str],
            model: AutoModel,
            tokenizer: AutoTokenizer,
            pooling: bool
        ) -> torch.Tensor:
            """
            transforms the input text into an embedding with a transformer model, either by
            mean pooling over the hidden states or by taking the model's pooler output

            parameters:
                text (list): the text (e.g., title or abstract of a paper)
                model (AutoModel): the transformer model to get embeddings
                tokenizer (AutoTokenizer): the tokenizer to preprocess the text
                pooling (bool): if True, mean-pool the hidden states; if False, use the pooler output

            returns:
                torch.Tensor: the final embedding of the input text
            """
            inputs = tokenize(text, tokenizer)
            with torch.no_grad():
                output = model(**inputs)
            if pooling:
                embeddings = meanpooling(output, inputs['attention_mask'])
            else:
                embeddings = output.pooler_output
            return embeddings

        if base == 'NeuML/pubmedbert-base-embeddings':
            title = encode(title, model, tokenizer, pooling=False)
            abstract = encode(abstract, model, tokenizer, pooling=False)
            embeddings = torch.cat((title, abstract), dim=-1)
        elif base == 'microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract':
            title = encode(title, model, tokenizer, pooling=True)
            abstract = encode(abstract, model, tokenizer, pooling=True)
            embeddings = 0.2 * title + 0.8 * abstract
        else:
            raise ValueError(f"unknown base model: {base}")
        return embeddings.cpu().numpy()

    !git clone https://github.com/adsp-polito/2024-P8-PPS.git
    print(f"\nreading dataset from {input}...")
    if input.endswith('.csv'):
        dataset = pd.read_csv(input)
    elif input.endswith('.xlsx'):
        dataset = pd.read_excel(input)
    else:
        raise ValueError("unknown input format")
    print(f"dataset size: {len(dataset)}")
    dataset = dataset.rename(
        columns={
            title: 'title',
            abstract: 'abstract'
        }
    )
    if 'title' not in dataset.columns or 'abstract' not in dataset.columns:
        raise ValueError("provide a file with 'title' and 'abstract' columns")
    results = {}
    for base, desc, classifier in [
        ('NeuML/pubmedbert-base-embeddings',
         'pubmed-knn-pipeline',
         '/content/2024-P8-PPS/PPS-BC/models/pubmed-knn-pipeline.joblib'
         ),
        ('microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract',
         'biomed-svc-pipeline',
         '/content/2024-P8-PPS/PPS-BC/models/biomed-svc-pipeline.joblib'
         )
    ]:
        print(f"\nloading {base}...")
        tokenizer = AutoTokenizer.from_pretrained(base)
        model = AutoModel.from_pretrained(base)
        model = model.to(device)
        print(f"loading {desc}...")
        pipeline = joblib.load(classifier)
        embeddings = list()
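        # enumerate gives the row position, so the progress stays correct for any dataframe index
        for position, (_, row) in enumerate(dataset.iterrows(), start=1):
            percentage = position / len(dataset) * 100
            sys.stdout.write(f"\rencoding data... {percentage:.2f}%")
            embeddings.append(get(row, base, model, tokenizer))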
        print()
        remove(model)
        embeddings = np.vstack(embeddings)
        print("computing predictions...")
        preds = pipeline.predict_proba(embeddings)
        results[base] = preds
    print("\nprocessing soft majority vote...")
    alpha = results['NeuML/pubmedbert-base-embeddings']
    beta = results['microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract']
    a, b = weights
    probs = (alpha * a) + (beta * b)
    preds = (probs[:, 1] >= threshold).astype(int)
    dataset['PPS'] = preds
    return dataset
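
The `meanpooling` helper inside `predict` averages token embeddings while skipping padded positions. As a minimal sketch of the same idea, the NumPy version below reproduces the computation on a toy batch; the function name `masked_mean_pool` and the toy values are illustrative assumptions, not part of the project code.

```python
import numpy as np


def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: shape (batch, tokens, hidden)
    attention_mask:   shape (batch, tokens), 1 for real tokens and 0 for padding
    """
    # broadcast the mask over the hidden dimension
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    # sum only the vectors of real tokens
    summed = (token_embeddings * mask).sum(axis=1)
    # number of real tokens per sequence, clipped to avoid division by zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# toy batch (hypothetical values): two real tokens followed by one padded token
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [[2. 3.]] - the padded token does not contribute
```

Without the mask, the padded vector would pull the average toward 100, which is why the notebook divides by the number of real tokens rather than by the sequence length.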

**Test the Class Prediction System on the Project Datasets:**

In [4]:
data = predict(
    input = 'https://raw.githubusercontent.com/adsp-polito/2024-P8-PPS/refs/heads/main/data/DB_clinical_areas.xlsx',
    title = 'Title',
    abstract = 'Abstract'
)
areas = data[data['PPS'] == 1]
areas[['title', 'abstract', 'PPS']].head()

Cloning into '2024-P8-PPS'...
remote: Enumerating objects: 1010, done.
remote: Counting objects: 100% (122/122), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 1010 (delta 56), reused 31 (delta 5), pack-reused 888 (from 1)
Receiving objects: 100% (1010/1010), 1.11 GiB | 17.11 MiB/s, done.
Resolving deltas: 100% (288/288), done.
Updating files: 100% (556/556), done.

reading dataset from https://raw.githubusercontent.com/adsp-polito/2024-P8-PPS/refs/heads/main/data/DB_clinical_areas.xlsx...
dataset size: 2192

loading NeuML/pubmedbert-base-embeddings...


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


loading pubmed-knn-pipeline...


encoding data... 100.00%
computing predictions...

loading microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract...


loading biomed-svc-pipeline...
encoding data... 100.00%
computing predictions...

processing soft majority vote...


| index | title | abstract | PPS |
|---|---|---|---|
| 3 | How do study design features and participant c... | Research about the decision to participate in ... | 1 |
| 5 | What matters most to patients with multiple my... | Given the rapid increase in novel treatments f... | 1 |
| 6 | Patient preference for trigger finger treatment. | Trigger finger is a common disorder of the han... | 1 |
| 8 | Women's preference to apply shared decision-ma... | To analyse women's stated preferences for esta... | 1 |
| 10 | Discrete choice experiment to investigate pref... | Cardiac rehabilitation (CR) is offered to peop... | 1 |
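
The `PPS` column comes from the last step of `predict`: a weighted sum of the two classifiers' class probabilities, followed by a threshold on the positive-class column. The sketch below replays that step in isolation with the notebook's default `weights` and `threshold`; the function name `soft_vote` and the probability values are made up for illustration.

```python
import numpy as np


def soft_vote(probs_a, probs_b, weights=(0.4375, 0.5625), threshold=0.3875):
    """Combine two (n_samples, 2) probability arrays into binary labels.

    Column 1 is treated as the positive class (PPS), as in `predict`.
    """
    a, b = weights
    combined = a * np.asarray(probs_a) + b * np.asarray(probs_b)
    return (combined[:, 1] >= threshold).astype(int)


# hypothetical predict_proba outputs, columns = [not PPS, PPS]
knn_probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
svc_probs = np.array([[0.8, 0.2], [0.5, 0.5], [0.5, 0.5]])

labels = soft_vote(knn_probs, svc_probs)
print(labels)  # [0 1 1]
```

The third paper has a combined positive-class probability of 0.4125, below 0.5 but above the default threshold of 0.3875, so it is still labelled as a patient preference study. A threshold under 0.5 therefore trades some precision for recall.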


In [4]:
data = predict(
    input = 'https://raw.githubusercontent.com/adsp-polito/2024-P8-PPS/refs/heads/main/data/DB_interventions.xlsx',
    title = 'Title',
    abstract = 'Abstract'
)
interventions = data[data['PPS'] == 1]
interventions[['title', 'abstract', 'PPS']].head()

Cloning into '2024-P8-PPS'...
remote: Enumerating objects: 1010, done.
remote: Counting objects: 100% (122/122), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 1010 (delta 56), reused 31 (delta 5), pack-reused 888 (from 1)
Receiving objects: 100% (1010/1010), 1.11 GiB | 27.40 MiB/s, done.
Resolving deltas: 100% (288/288), done.
Updating files: 100% (556/556), done.

reading dataset from https://raw.githubusercontent.com/adsp-polito/2024-P8-PPS/refs/heads/main/data/DB_interventions.xlsx...
dataset size: 2192

loading NeuML/pubmedbert-base-embeddings...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


loading pubmed-knn-pipeline...


encoding data... 100.00%
computing predictions...

loading microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract...


loading biomed-svc-pipeline...
encoding data... 100.00%
computing predictions...

processing soft majority vote...


| index | title | abstract | PPS |
|---|---|---|---|
| 28 | methodology derive screening programme experim... | involving user healthcare become increasingly ... | 1 |
| 80 | personality effect chinese public vaccination ... | objective aim investigate difference public va... | 1 |
| 122 | attribute nonattendance vaccine based chinese ... | global coronavirus pandemic well controlled va... | 1 |
| 137 | application elicitation method design quantify... | express different option healthcare context me... | 1 |
| 143 | healthcare service hypertension china | aimed support evidenceinformed policymaking pa... | 1 |


In [4]:
data = predict(
    input = 'https://raw.githubusercontent.com/adsp-polito/2024-P8-PPS/refs/heads/main/data/articles-2023.csv',
    title = 'Title',
    abstract = 'Abstract'
)
articles = data[data['PPS'] == 1]
articles[['title', 'abstract', 'PPS']].head()

Cloning into '2024-P8-PPS'...
remote: Enumerating objects: 1010, done.
remote: Counting objects: 100% (122/122), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 1010 (delta 56), reused 31 (delta 5), pack-reused 888 (from 1)
Receiving objects: 100% (1010/1010), 1.11 GiB | 26.56 MiB/s, done.
Resolving deltas: 100% (288/288), done.
Updating files: 100% (556/556), done.

reading dataset from https://raw.githubusercontent.com/adsp-polito/2024-P8-PPS/refs/heads/main/data/articles-2023.csv...
dataset size: 1215

loading NeuML/pubmedbert-base-embeddings...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


loading pubmed-knn-pipeline...


encoding data... 100.00%
computing predictions...

loading microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract...


loading biomed-svc-pipeline...
encoding data... 100.00%
computing predictions...

processing soft majority vote...


| index | title | abstract | PPS |
|---|---|---|---|
| 2 | Stakeholders' preferences for the design and d... | This systematic review aimed to synthesise evi... | 1 |
| 3 | Discrete Choice Experiments in Health State Va... | BACKGROUND: Discrete choice experiments (DCEs)... | 1 |
| 8 | Patient preferences in chronic immune-mediated... | OBJECTIVES: Immune-mediated inflammatory disea... | 1 |
| 9 | Discrete choice experiment versus swing-weight... | INTRODUCTION: Limited evidence exists for how ... | 1 |
| 11 | Cancer Survivor Preferences for Models of Brea... | BACKGROUND AND OBJECTIVE: It is critical to ev... | 1 |


**Code to Run:**

In [None]:
"""
data = predict(
    input=...,     # str: path to the dataset file (csv or xlsx format)
    title=...,     # str: column in the dataset that contains the titles of the papers
    abstract=...   # str: column in the dataset that contains the abstracts of the papers
)
"""
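
Once `predict` has returned, the flagged papers can be filtered and exported in the same way as in the test cells. The sketch below assumes a small hand-made DataFrame as a stand-in for the real output, so it runs without downloading any model; the rows and the in-memory buffer are hypothetical.

```python
import io

import pandas as pd

# hypothetical stand-in for the DataFrame returned by predict(...)
data = pd.DataFrame({
    'title': ['paper a', 'paper b', 'paper c'],
    'abstract': ['abstract a', 'abstract b', 'abstract c'],
    'PPS': [1, 0, 1],
})

# keep only the papers classified as patient preference studies
relevant = data[data['PPS'] == 1]
print(f"{len(relevant)} of {len(data)} papers flagged as PPS")

# export the selection as csv text (a file path would work in place of the buffer)
buffer = io.StringIO()
relevant[['title', 'abstract']].to_csv(buffer, index=False)
print(buffer.getvalue())
```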