<img src="https://github.com/user-attachments/assets/15fd14c0-8d62-4781-bb51-20f84afa8a48" width="100">

**Author:** [Bowen Xu](mailto:bowenxu@g.harvard.edu)

This notebook is part of the [VeritasTrial](https://github.com/VeritasTrial/ac215_VeritasTrial) project. It is not meant to be run directly in a local environment; rather, it showcases our work on constructing the dataset and finetuning the embedding model.

In [None]:
! pip install transformers==4.45.2 sentence-transformers==3.1.1
! pip install datasets



In [None]:
import pandas as pd

df = pd.read_json(
    "/content/drive/MyDrive/VeritasTrial/ctg-triplets_w_neg_v3.2.jsonl",
    lines=True,
)
print(df.head())

                         query                       pos  \
0  refractory multiple myeloma               [selinexor]   
1  refractory multiple myeloma            [lenalidomide]   
2  refractory multiple myeloma      [methylprednisolone]   
3  refractory multiple myeloma               [selinexor]   
4  refractory multiple myeloma  [lenalidomiderefractory]   

                                                 neg      category  
0                                [insulin, adjuvant]  intervention  
1                      [cyclophosphamide, exenatide]  intervention  
2                       [benzoquinone, pioglitazone]  intervention  
3  [blended cognitive behavioural therapy cbt, co...       keyword  
4                      [natural history, repository]       keyword  


In [None]:
# Take a random subset of the entire dataset
sub_df = df.sample(n=200000, random_state=42)

Convert the dataset into a format compatible with `sentence-transformers` finetuning.
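
As a minimal sketch of the target layout, the pure-Python snippet below flattens one raw record (a `query` string plus `pos` and `neg` lists, shaped like the preview above) into anchor, positive, and negative strings. The helper name `to_triplet_row` and the sample record are illustrative assumptions, not part of the project code.

```python
def to_triplet_row(record, sep=","):
    """Flatten one raw record into an (anchor, positive, negative) row of strings."""
    return {
        "anchor": record["query"],
        "positive": sep.join(record["pos"]),
        "negative": sep.join(record["neg"]),
    }


# Sample record shaped like the first row of the preview above
record = {
    "query": "refractory multiple myeloma",
    "pos": ["selinexor"],
    "neg": ["insulin", "adjuvant"],
    "category": "intervention",
}
row = to_triplet_row(record)
print(row)
```

The `category` field is dropped here because only the three text columns are passed on to training.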

In [None]:
from datasets import Dataset

anchors = sub_df["query"]
positives = sub_df["pos"].apply(lambda x: ",".join(x))
negatives = sub_df["neg"].apply(lambda x: ",".join(x))

dataset = Dataset.from_dict(
    {"anchor": anchors, "positive": positives, "negative": negatives}
)

# Hold out 25% for testing, then 50,000 of the remaining 150,000 rows for validation
train_val_test = dataset.train_test_split(test_size=0.25, seed=42)
train_val = train_val_test["train"].train_test_split(test_size=50_000, seed=42)

train_set = train_val["train"]
val_set = train_val["test"]
test_set = train_val_test["test"]
print(val_set)

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 50000
})
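
The split sizes can be checked with a little arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that an integer `test_size` is taken as an absolute row count; the helper `split_sizes` is hypothetical and only mirrors that assumed rule.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional or absolute test size.

    Assumption: fractional sizes are rounded up to a whole number of rows.
    """
    if isinstance(test_size, float):
        n_test = math.ceil(test_size * n_rows)
    else:
        n_test = int(test_size)
    return n_rows - n_test, n_test


# 200,000 sampled rows: hold out 25% for testing first
n_rest, n_test = split_sizes(200_000, 0.25)
# Then take 50,000 of the remaining rows for validation
n_train, n_val = split_sizes(n_rest, 50_000)
print(n_train, n_val, n_test)  # 100000 50000 50000
```

Under the same rule, a fraction of 0.33 applied to 150,000 rows would give 49,500 validation rows rather than the 50,000 printed above.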


Finetune the embedding model.
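
To make the training objective concrete, here is a minimal pure-Python sketch of a triplet loss for a single triplet. It assumes Euclidean distance and a margin of 5, which to our understanding are the defaults of `TripletLoss`; the function names and toy vectors are illustrative only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """max(d(a, p) - d(a, n) + margin, 0) for a single triplet."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


# Best case for unit-length vectors: positive equals the anchor, negative is opposite
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
negative = [-1.0, 0.0]
loss_value = triplet_loss(anchor, positive, negative)
print(loss_value)  # 3.0
```

If the model's embeddings are unit-normalized (an assumption about its output layer), two embeddings can be at most a distance of 2 apart, so with a margin of 5 the loss cannot drop below 3 even for perfectly separated triplets.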

In [None]:
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import TripletEvaluator
from sentence_transformers.losses import TripletLoss

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
loss = TripletLoss(model)

# Define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="models/bge-small-en-triplet",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-6,
    warmup_ratio=0.1,
    fp16=True,
    bf16=False,
    eval_strategy="steps",
    eval_steps=6250,
    save_strategy="steps",
    save_steps=6250,
    save_total_limit=2,
    logging_steps=6250,
)

# Evaluate the base model
dev_evaluator = TripletEvaluator(
    anchors=val_set["anchor"],
    positives=val_set["positive"],
    negatives=val_set["negative"],
    name="evaluate each epoch",
)
dev_evaluator(model)

# Train model
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_set,
    eval_dataset=val_set,
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()

# Evaluate on the test set
test_evaluator = TripletEvaluator(
    anchors=test_set["anchor"],
    positives=test_set["positive"],
    negatives=test_set["negative"],
    name="evaluate trained model BGE",
)
print(f"Test result: {test_evaluator(model)}")

# Save the model
model.save_pretrained(
    "/content/drive/MyDrive/VeritasTrial/models/bge-small-en-triplet/final"
)

| Step | Training Loss | Validation Loss | Evaluate each epoch Cosine Accuracy | Evaluate each epoch Dot Accuracy | Evaluate each epoch Manhattan Accuracy | Evaluate each epoch Euclidean Accuracy | Evaluate each epoch Max Accuracy |
|---|---|---|---|---|---|---|---|
| 6250 | 3.5801 | 3.022929 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 12500 | 3.0875 | 3.007486 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 18750 | 3.0556 | 3.004479 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 25000 | 3.0446 | 3.003514 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 31250 | 3.0406 | 3.003256 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
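
The logged step counts (6250 through 31250) line up with one evaluation per epoch. The arithmetic below is a sketch under stated assumptions: 100,000 training rows (200,000 sampled rows minus 50,000 test and 50,000 validation rows), a single device, and no gradient accumulation.

```python
import math

train_rows = 100_000  # assumption: 200,000 sampled - 50,000 test - 50,000 validation
batch_size = 16       # per_device_train_batch_size, assuming a single device
epochs = 5            # num_train_epochs
warmup_ratio = 0.1

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
# Roughly the first 10% of steps are used for learning-rate warmup
warmup_steps = round(total_steps * warmup_ratio)
print(steps_per_epoch, total_steps, warmup_steps)  # 6250 31250 3125
```

With these numbers, `eval_steps=6250` and `save_steps=6250` each correspond to one pass over the training set.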


Test result: {'evaluate trained model BGE_cosine_accuracy': 1.0, 'evaluate trained model BGE_dot_accuracy': 0.0, 'evaluate trained model BGE_manhattan_accuracy': 1.0, 'evaluate trained model BGE_euclidean_accuracy': 1.0, 'evaluate trained model BGE_max_accuracy': 1.0}
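
For intuition about the accuracy values reported above, the sketch below computes a cosine-based triplet accuracy: the fraction of triplets whose anchor is more similar to the positive than to the negative. This is a simplified illustration on toy vectors with hypothetical helper names, not the evaluator's actual implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets with sim(a, p) > sim(a, n)."""
    correct = sum(
        cosine_similarity(a, p) > cosine_similarity(a, n) for a, p, n in triplets
    )
    return correct / len(triplets)


toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # positive is closer: counted as correct
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),  # negative is closer: counted as incorrect
]
accuracy = triplet_accuracy(toy_triplets)
print(accuracy)  # 0.5
```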


In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "/content/drive/MyDrive/VeritasTrial/models/bge-small-en-triplet/final"
)
model.encode("sentence")

array([-1.27077778e-03,  4.88708680e-03,  1.11971991e-02, -6.71610236e-03,
        2.60409061e-03, -6.96671661e-03, -1.30945724e-02,  8.07860494e-03,
        7.84309208e-03, -6.97842741e-04,  6.42183647e-02, -9.23632860e-01,
       -1.41344750e-02,  9.62376036e-03, -1.40070040e-02, -2.31561670e-03,
       -5.29550435e-03,  5.65767568e-03, -1.21716224e-03, -8.67642835e-03,
        1.31368674e-02, -1.10656768e-02, -1.42651722e-02,  1.41321141e-02,
        1.55304875e-02, -7.17377337e-03,  7.14751845e-03, -1.18336352e-02,
       -1.22135375e-02, -1.63918938e-02,  6.33910298e-03, -4.19591088e-03,
        1.93549693e-02, -1.22456197e-02,  7.22857472e-03, -1.04754763e-02,
       -1.38508929e-02, -6.31692586e-03,  3.47027858e-03,  1.34539204e-02,
        5.74229343e-04,  3.42611782e-03,  9.21294186e-03, -7.58994464e-03,
        5.19093405e-03, -1.16328551e-02, -9.69778374e-03,  2.25887690e-02,
       -9.56988521e-03,  2.11314801e-02, -2.35076770e-02,  1.17298132e-02,
        1.94369145e-02,  