## Large Language Models for Scientometrics

**Large Language Models:**

Large Language Models (**LLMs**) can process data from different modalities and perform well on tasks ranging from information extraction and question answering to math, coding, and, more recently, reasoning. Training these models involves many choices: datasets and data mixes, architectures, and alignment strategies **[1]**. That complexity might suggest picking a specialized model for each task, but **LLMs** are increasingly treated as general-purpose task solvers.

For this hands-on session we use the Reproducibility dataset from the paper <u>Laying Foundations to Quantify the "Effort of Reproducibility"</u> **[2]** to preference-tune answers with the **Direct Preference Optimization (DPO)** algorithm. Unlike RL-based preference-tuning methods, *DPO* applies maximum likelihood directly to the preference dataset and performs reward modeling implicitly. It targets the same objective as those methods, reward maximization under a **KL** divergence constraint, but without fitting an explicit reward model or running an RL loop. In that sense *DPO* is RL-free: it reduces to a simple classification loss over a dataset $D$ of **chosen** and **rejected** responses. Learn more about *DPO* from the original paper **[3]**.
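
To make the shape of $D$ concrete, the sketch below shows what one preference record could look like. The field names follow the `prompt`/`chosen`/`rejected` convention used later in this notebook; the text values and the `is_valid` helper are hypothetical placeholders for illustration, not rows or code from the actual dataset.

```python
# One hypothetical preference record: the same prompt paired with a
# preferred ("chosen") and a dispreferred ("rejected") response.
record = {
    "prompt": "What made this paper easy to reproduce? <description>",
    "chosen": "1. Availability of Code",
    "rejected": "The paper was easy to reproduce.",
}

# A preference dataset D is a collection of such records.
D = [record]

def is_valid(rec):
    """Check that a record has the three string fields used in this notebook."""
    return set(rec) == {"prompt", "chosen", "rejected"} and all(
        isinstance(v, str) for v in rec.values()
    )

print(all(is_valid(r) for r in D))
```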

**Reference(s)**:
<br>
**[1]** [A Survey of Large Language Models](https://arxiv.org/abs/2303.18223)
<br>
**[2]** [Laying Foundations to Quantify the “Effort of Reproducibility”](https://ieeexplore.ieee.org/abstract/document/10266070)
<br>
**[3]** [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/pdf/2305.18290)

**Other Resources**:
<br>
**[R-1]** [Direct Preference Optimization (DPO) for LLM Alignment (From Scratch)](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb)
<br>
**[R-2]** [Preference Tuning for Summarization using Synthetic Data](https://github.com/anyscale/templates/tree/1939a34a54a0efeb1e86917d1175d92b50f482e6/templates/fine-tune-llm_v2/end-to-end-examples/fine-tune-preference#step-2-fine-tuning)

<img src="https://images.ctfassets.net/cnu0m8re1exe/sIyPeDxgpIluQqQWK8nhS/67004d28ebbce2ca1f654a7a0afd92b3/SciSci.png" align="center" width=400 height=500>

>(Credit: Davide Bonazzi) from [*Discover Magazine*](https://www.discovermagazine.com/the-sciences/the-science-of-science)

**Table of Contents**:
- Setup
- Prepare Preference Dataset for **Direct Preference Optimization (DPO)**
- API & Local Models setup
- Preference tuning via **Direct Preference Optimization (DPO)**
- Inference check of $\pi_{LLMSciSci}$ against $\pi_{LLM-instruct}$

### 1. Setup

In [None]:
# @title 1.1 Install necessary libraries

# install transformers (latest from source)
print(f"Installing latest transformers...")
!pip install -q git+https://github.com/huggingface/transformers

# install tiktoken
print(f"Installing tiktoken...")
!pip install -q tiktoken

# install outlines
print(f"Installing outlines...")
!pip install -q outlines

# install huggingface-trl
print(f"Installing huggingface-trl...")
!pip install -q trl

# install flash attention
print(f"Installing flash-attention-2...")
!pip install -q flash-attn --no-build-isolation

# install bitsandbytes
print(f"Installing bitsandbytes...")
!pip install -q -U bitsandbytes

# install openai
print(f"Installing openai...")
!pip install -q openai

In [None]:
# @title 1.2 Import necessary libraries
# This Source Code Form is subject to the terms of the MIT
# License. If a copy of the same was not distributed with this
# file, You can obtain one at
# https://github.com/Northwestern-CSSI/LLMSciSci/blob/main/LICENSE.

import os
import gc
import bs4
import time
import json
import torch
import urllib3
import pathlib
import tiktoken
import numpy as np
import pandas as pd
import polars as pl
import openai as oai
import seaborn as sns
from pprint import pprint
from peft import LoraConfig
from ast import literal_eval
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup as BS
from collections import defaultdict
from outlines import models, generate
from typing import List, Optional, Union
from datasets import Dataset, DatasetDict
from collections import Counter, OrderedDict
from pydantic import BaseModel, create_model, RootModel
from transformers import (
    AutoTokenizer,
    Gemma3ForCausalLM,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig,
    set_seed
)
from trl import (
    DPOConfig,
    DPOTrainer,
    ModelConfig,
    ScriptArguments,
    TrlParser,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE

<img src="https://github.com/akhilpandey95/LLMSciSci/blob/main/media/LLMSciSci_dataset.png?raw=true" width=650 height=475>

In [None]:
# @title 1.3 Load `ReScience` dataset - [Download Data](https://drive.google.com/drive/folders/1qLCC5ZiDWoRtMQyBTeMxPrPlJLxcVgMN?usp=sharing)
!ls -lah ./drive/MyDrive/CSSI/Lecture

# set the directory
os.chdir("./drive/MyDrive/CSSI/Lecture")

# read rescience
rescience = pl.read_csv("./data/ReScience_JCDL-23.csv")

# show shape and columns
print("-------------------------------")
print(f"Data shape: {rescience.shape}")
print("-------------------------------")

"""
Data columns: ['author', 'title', 'doi', 'article_type', 'lang', 'pdf_url', 'keywords', 'review_url',
              'code_url', 'volume', 'issue', 'year', 'abstract', 'easy', 'difficult', 'gs_citations',
              'gs_scholar_url', 'original_pdf_url', 'original_article_url', 'reason_for_easiness',
              'reason_for_difficulty', 'limitations_results', 'scope_of_reproducibility',
              'original_abstract', 'orig_art_sciparse_full_text', 'orig_art_pdfminer_full_text',
              'original_sections', 'no_hyp', 'no_alg', 'no_images', 'no_equations', 'no_tables',
              'is_meth_pres', 'is_intro_pres', 'link_to_code_available', 'mean_readability',
              'hyp_available_in_text', 'easiness_longform', 'difficult_longform',
              'list_for_limitations', 'list_for_diff', 'list_for_easiness', 'more_than_one_easy']
"""

# metadata
meta_data_columns = ["doi", "title", "review_url", "easy", "difficult",
                     "scope_of_reproducibility", "reason_for_easiness", "reason_for_difficulty"]
print(f"Rescience Metadata columns of interest: {meta_data_columns}")
print("-------------------------------")

# sneak peek of the data
print(rescience.select(meta_data_columns).head())

### 2. Prepare Preference Dataset for **Direct Preference Optimization (DPO)**

<img src="https://github.com/akhilpandey95/LLMSciSci/blob/main/media/LLMSciSci_DPO_dataset.png?raw=true" width=700 height=450>

In [None]:
# @title 2.1 Load raw preference data from `GPT`, `Gemini` and `Llama` responses
# read the gemini labelling data
gemini_effortly = pl.read_csv("./data/gemini_effortly_labels_gamma.csv")

# read the gpt labelling data
gpt_effortly = pl.read_csv("./data/gpt4_effortly_labels_beta.csv")

# read the llama labelling data
llama_effortly = pl.read_csv("./data/llama3_effortly_labels_beta.csv")

# show a preview of the response(s)
print("---------------------------")
print(f"Response from gemini")
print(gemini_effortly.select("easy_gemini_response")[0].item())
print("---------------------------")
print(f"Response from gpt")
print(gpt_effortly.select("easy_gpt_response")[0].item())
print("---------------------------")
print(f"Response from llama")
print(llama_effortly.select("easy_llama3_response")[0].item())
print("---------------------------")

In [None]:
# @title 2.2 helper class to build the $D_{ReproEffortDataset}$ preference dataset

# helper function to load/initialize the prompt
def process_prompt(raw_text, tokenizer, device, task="easy", prompt_type="prompt"):
    """
    Given raw input text generate a prompt that will
    be supplied to a preference dataset loader.

    Parameters
    ------------
    arg1 | raw_text: str
        Raw input text without prompt template
    arg2 | tokenizer: transformers.tokenization_utils_fast.PreTrainedTokenizerFast
        Tokenizer from the model
    arg3 | device: str
        Device name for the inputs and attention masks to sit on
    arg4 | task: str[OPTIONAL]
        Task type "What was easy ?" or "What was difficult ?"
    arg5 | prompt_type: str[OPTIONAL]
        String flag to be applied at the top of messages to create "prompt"
        "chosen" or "rejected" chat responses for the preference dataset

    Returns
    ------------
        List[str]: templated text, one item per conversation (a single-item list here)
    """
    # init
    prompt = None
    messages = []
    add_generation_prompt = True
    sys_prompt, user_prompt, input_text = None, None, None

    # init system prompt available
    sys_prompt = """
    You are a research assistant working on understanding the
    spectrum of outputs researchers outline when reproducing
    academic articles.
    """

    # what was easy ?
    if task == "easy":
      # init user prompt for the task
      user_prompt="""
      **Task:** You are given brief descriptions that made it easy for researcher
      to reproduce original articles. Your goal is to analyze the brief description
      and classify them into one or more from the following five categories,
      which include:

      1. Availability of Code
      2. Supporting Artifacts
      3. Readability of Full Text
      4. Experimental Setup or Environment
      5. Cannot extract concrete factors that Eased Reproducibility.
      """

      # init the prompt
      input_text = """
      **What was easy:**
      ```plaintext
      EASY_DESCRIPTION
      ```
      """
    else:
      # init user prompt for the task
      user_prompt="""
      **Task:** You are given brief descriptions that made it difficult for
      researcher to reproduce original articles. Your goal is to analyze the
      description and classify them into one or more from the following
      five categories:

      1. Missing Algorithm step or Architecture details
      2. Missing nuance details
      3. Unclear notation or documentation in the codebase
      4. Insufficient Math/Equations
      5. Cannot extract concrete factors that made it difficult for reproducibility.
      """

      # init the prompt
      input_text = """
      **What was difficult:**
      ```
      DIFF_DESCRIPTION
      ```
      """

    # apply chat template on the chosen/rejected response
    if prompt_type in ("chosen", "rejected"):
      # set the chosen/rejected response for the preferences
      messages.append([{"role": "assistant", "content": raw_text}])

      # no generation prompt after a completed assistant response
      add_generation_prompt=False

      # apply prompt and remove the system prompt
      prompt = tokenizer.apply_chat_template(messages, \
                                            tokenize=False, \
                                            use_system_prompt=add_generation_prompt, \
                                            add_generation_prompt=add_generation_prompt)
    else:
      # adjust and replace EASY_DESCRIPTION or DIFF_DESCRIPTION based on task
      if task == "easy":
        input_text = input_text.replace("EASY_DESCRIPTION", raw_text)
      else:
        input_text = input_text.replace("DIFF_DESCRIPTION", raw_text)

      # set the prompt for the preferences
      messages.append([
          {"role": "system", "content": sys_prompt},
          {"role": "user", "content": user_prompt + input_text}
      ])

      # apply prompt template
      prompt = tokenizer.apply_chat_template(messages, \
                                            tokenize=False, \
                                            use_system_prompt=add_generation_prompt, \
                                            add_generation_prompt=add_generation_prompt)

    # return the processed prompt
    return prompt

# utility class to create the preference dataset
class ReproEffortPrefDataset:
    def __init__(self, raw_data, tokenizer, device="cpu"):
        """
        Given raw text prepare preference dataset.

        Parameters
        ------------
        arg1 | raw_data: polars.DataFrame or pandas.DataFrame or List[dict]
            ML reproducibility challenge data processed with the following columns:
              - "easy": Raw prompt text for the easy task.
              - "easy_gpt_response": Chosen response for the easy task.
              - "easy_gemini_response": Rejected response for the easy task.
              - "difficult": Raw prompt text for the difficult task.
              - "diff_gpt_response": Chosen response for the difficult task.
              - "diff_gemini_response": Rejected response for the difficult task.
        arg2 | tokenizer: transformers.tokenization_utils_fast.PreTrainedTokenizerFast
            Tokenizer from the model to apply chat template.
        arg3 | device: str[OPTIONAL]
            Device name (e.g., "cpu" or "cuda") for processing.

        Returns
        ------------
            None
        """
        # polars df ? convert it to a pandas DataFrame.
        if hasattr(raw_data, "to_pandas"):
            raw_data = raw_data.to_pandas()
        # pd df ? convert to list of dicts
        if isinstance(raw_data, pd.DataFrame):
            self.raw_data = raw_data.to_dict("records")
        elif isinstance(raw_data, list):
            self.raw_data = raw_data
        else:
            raise ValueError("ERR[ReproEffortPrefDataset]: Unsupported raw_data type; must be a pl.DataFrame, pd.DataFrame, or list[dict].")

        # init default arguments
        self.tokenizer = tokenizer
        self.device = device

    # helper function to build the dataset object
    def build_dataset(self, test_size=0.2, seed=2025):
        """
        Build a unified preference dataset with
        "What was easy ?" and "What was difficult ?" texts.

        Parameters
        ------------
        arg1 | test_size: float[OPTIONAL]
            Set the size of the test set, defaults to 0.2 or 20% of the dataset.
        arg2 | seed: int[OPTIONAL]
            Seed parameter for reproducibility, defaults to 2025.

        Returns
        ------------
            datasets.DatasetDict with "train" and "test" splits; each row is {"prompt": str, "chosen": str, "rejected": str}
        """
        # init list to store results
        records = []

        # set the keys for processing
        task_types = ["easy", "difficult"]
        chosen_types = ["easy_gpt_response", "diff_gpt_response"]
        # rejected_types = ["easy_llama3_response", "diff_llama3_response"]
        rejected_types = ["easy_gemini_response", "diff_gemini_response"]

        # iterate and combine "easy" and "difficult" tasks
        for sample in self.raw_data:
            # process each task
            for idx, task in enumerate(task_types):
                # set the prompt
                prompt = process_prompt(sample[task], self.tokenizer, self.device, task=task, prompt_type="prompt")[0]

                # set the chosen response
                chosen = process_prompt(sample[chosen_types[idx]], self.tokenizer, self.device, task=task, prompt_type="chosen")[0]

                # set the rejected response
                rejected = process_prompt(sample[rejected_types[idx]], self.tokenizer, self.device, task=task, prompt_type="rejected")[0]

                # append the records
                records.append({
                    "prompt": prompt,
                    "chosen": chosen,
                    "rejected": rejected
                })

        # merge records
        combined_data = {
            "prompt": [r["prompt"] for r in records],
            "chosen": [r["chosen"] for r in records],
            "rejected": [r["rejected"] for r in records],
        }

        # init hf dataset and perform train/test split.
        dataset = Dataset.from_dict(combined_data)

        # shuffle the dataset (seeded so the split is reproducible)
        dataset = dataset.shuffle(seed=seed)

        # train test split
        dataset_split = dataset.train_test_split(test_size=test_size, seed=seed)

        # return final dataset
        return dataset_split

### 3. API & Local Models setup

For the commercial models you need to set up an account and obtain an API key to run some of the experiments in this notebook.

<hr>

**Pre-requisites for commercial models**
<br>
**OpenAI**: https://platform.openai.com/settings/organization/api-keys
<hr>

**Pre-requisites for local models**
<br>
The experiments and widgets in the notebook require `data/` and `models/`. Since `data/` is loaded, we need model weights which can be downloaded here:
- [Models](https://drive.google.com/drive/folders/1aNT1SNA7Lz9kMgt5p1yGWST1T6D2Dmbd?usp=sharing)

In [None]:
# @title 3.1 Local model Catalog
!ls -lah models/

In [None]:
# @title 3.2 Load model client or model-tokenizer pair
# helper function to load/initialize the model
def load_model(model_name, device):
    """
    Given a model path, load tokenizer-model
    pair and return the objects tagged to the
    given device (cpu/cuda)

    Parameters
    ------------
    arg1 | model_name: str
        Use model catalog to load local model weights
    arg2 | device: str
        Requested device; overridden below by auto-detection (cuda, then mps, then cpu)

    Returns
    ------------
        Tuple(AutoModelForCausalLM, AutoTokenizer): the loaded (model, tokenizer) pair
    """
    # device for acceleration
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"

    # local models
    local_models = ["llama3.2-1b", "llama3.2-3b", "llama3.1-8b", "qwen2.5-1.5b", "r1-distill-qwen-1.5b"]

    # pathlib for models
    model_path = pathlib.Path("/content/drive/MyDrive/CSSI/Lecture")

    # set the model-id
    model_catalog = {
        "llama-3.2-1b": model_path/f"models/Llama3.2-1B-Instruct/",
        "llama-3.2-3b": model_path/f"models/Llama3.2-3B-Instruct/hf/",
        "llama-3.1-8b": model_path/f"models/Meta-Llama-3.1-8B-Instruct/hf/",
        "gemma-3-4b": model_path/f"models/gemma-3-4b-it/",
        "qwen-2.5-1.5b": model_path/f"models/Qwen2.5-1.5B-Instruct/"
    }

    # set a model-id
    model_id = model_catalog[model_name]

    # log
    print("----------------------------------")
    print(f"Using {device} to load {model_id}")
    print("----------------------------------")

    # get model-tokenizer pair
    start = time.time()
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    # based on model size switch quantization config
    if model_name == "llama3.1-70b" or model_name == "r1-distill-llama-70b":
        # 4-bit quantization config
        bnb_4bit = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_compute_dtype=torch.bfloat16,
          bnb_4bit_quant_storage=torch.bfloat16
        )

        # 4 bit quantization
        model = AutoModelForCausalLM.from_pretrained(model_id, \
                                                   quantization_config=bnb_4bit, \
                                                   trust_remote_code=True, \
                                                   low_cpu_mem_usage=True, \
                                                   attn_implementation="sdpa", \
                                                   device_map=device)
    elif model_name == "gemma-3-4b":
        model = Gemma3ForCausalLM.from_pretrained(model_id, \
                                              trust_remote_code=True, \
                                              torch_dtype=torch.bfloat16, \
                                              low_cpu_mem_usage=True, \
                                              attn_implementation="sdpa", \
                                              device_map=device)
    else:
      # 4-bit quantization config (defined for reference; not passed to from_pretrained below)
      bnb_4bit = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_use_double_quant=True,
          bnb_4bit_quant_type="nf4",
          bnb_4bit_compute_dtype=torch.bfloat16
      )

      # load bfloat16 weights
      model = AutoModelForCausalLM.from_pretrained(model_id, \
                                                   trust_remote_code=True, \
                                                   torch_dtype=torch.bfloat16, \
                                                   low_cpu_mem_usage=True, \
                                                   attn_implementation="flash_attention_2", \
                                                   device_map=device)

    # is it a llama tokenizer ?
    if "llama" in model_name:
        # pad token if needed
        tokenizer.add_special_tokens({"pad_token": "<|finetune_right_pad_id|>"})
        print(f"Setting <|finetune_right_pad_id|> token for {model_id}")
        model.resize_token_embeddings(len(tokenizer))

        # llama prompt template
        llama_template = r"""
        {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
        """

        # set the chat template
        tokenizer.chat_template = llama_template

    # gemma eos token
    if "gemma" in model_name:
        # set the EOS token id (direct lookup, so a prepended BOS token cannot be picked up)
        tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("<end_of_turn>")

    # load time
    end = time.time()
    print(f"Model-tokenizer Load Time: {end - start} seconds")
    print("----------------------------------")

    # return the pair
    return model, tokenizer
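
For readers unfamiliar with Jinja, the plain-Python sketch below mirrors what the Llama chat template inside `load_model` appears to render. The helper `render_llama_chat` and the `bos_token` default are assumptions made for illustration, and the whitespace surrounding the template string is ignored; the tokenizer's own `apply_chat_template` remains the source of truth. Note that, as written, the template always appends an assistant header, whatever `add_generation_prompt` is set to.

```python
def render_llama_chat(messages, bos_token="<|begin_of_text|>"):
    """Mirror the Jinja chat template in plain Python (illustrative only)."""
    parts = []
    for i, message in enumerate(messages):
        content = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
        # only the first message is prefixed with the BOS token
        if i == 0:
            content = bos_token + content
        parts.append(content)
    # the template unconditionally ends with an assistant header
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(render_llama_chat([
    {"role": "system", "content": "You are a research assistant."},
    {"role": "user", "content": "What was easy?"},
]))
```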

In [None]:
# @title 3.3 Load policy model for $\pi_{LLMSciSci}$
# load the model for policy update
# model, tokenizer = load_model("qwen-2.5-1.5b", "cuda")
model, tokenizer = load_model("llama-3.2-1b", "cuda")

### 4. Preference tuning via **Direct Preference Optimization (DPO)**

$$
L_{DPO}(\pi_{LLMSciSci}: \pi_{LLM-instruct})
\;=\; - \,\mathbb{E}_{\bigl(x,\,r^+,\,r^-\bigr) \sim D_{ReproEffortDataset}}
\Bigl[
\log \,\sigma\!\Bigl(
r_\theta(x,r^+) \;-\; r_\theta(x,r^-)
\Bigr)
\Bigr]
$$

$$
r_\theta(x, r)
\;=\;
\beta \,\log \frac{\pi_{LLMSciSci}(r \,\vert\, x)}{\pi_{LLM-instruct}(r \,\vert\, x)}
$$

where
- $r^+$ is the chosen (preferred) response and $r^-$ is the rejected response for prompt $x$.
- $r_{\theta}$ is the implicit reward: the $\beta$-scaled log-ratio of the probabilities that $\pi_{LLMSciSci}$ and $\pi_{LLM-instruct}$ assign to a response. The loss rewards a larger margin between the *chosen* and *rejected* responses on $D_{ReproEffortDataset}$.
- $\pi_{LLM-instruct}$ is the instruct-tuned open-weight reference model.
- $\pi_{LLMSciSci}$ is the policy model being preference-tuned on $D_{ReproEffortDataset}$.
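
As a rough numerical illustration of the two equations above, the sketch below evaluates the loss for a single preference pair in plain Python. The function name `dpo_pair_loss` and the log-probability values are assumptions made for this example; in practice the log-probabilities would be summed over the response tokens scored by each model.

```python
import math

def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (x, r+, r-) triple, given sequence log-probabilities."""
    # implicit rewards: beta * log(pi_policy / pi_ref)
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# policy identical to the reference: margin is 0, loss is log(2)
print(dpo_pair_loss(-12.0, -15.0, -12.0, -15.0))

# policy shifts probability toward the chosen response: loss drops below log(2)
print(dpo_pair_loss(-10.0, -17.0, -12.0, -15.0))
```

The loss only depends on how the policy's log-probabilities move relative to the reference, which is why the reference model stays frozen during training.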

In [None]:
# @title 4.1 Init `ReproEffortPrefDataset` dataset object, train/test split

# # combine the GPT and Llama datasets
# raw_data = gpt_effortly.join(
#     llama_effortly.select(["doi", "easy", "difficult", "easy_llama3_response", "easy_llama3_label", "diff_llama3_response", "diff_llama3_label"]),
#     on="doi",
#     how="inner"
# )

# combine the GPT and Gemini datasets
raw_data = gpt_effortly.join(
    gemini_effortly.select(["doi", "easy", "difficult", "easy_gemini_response", "easy_gemini_label", "diff_gemini_response", "diff_gemini_label"]),
    on="doi",
    how="inner"
)

# final shape of the raw data
print(f"-----------------------------------")
print(f"Shape of raw_data: {raw_data.shape}")
print(f"-----------------------------------")
print("Columns in raw_data:")
print(f"-----------------------------------")
pprint(raw_data.columns)
print(f"-----------------------------------")

# convert raw data to pandas
raw_data_pd = raw_data.to_pandas()

# create the preference dataset
dataset_obj = ReproEffortPrefDataset(raw_data_pd, tokenizer, device="cuda")
dataset = dataset_obj.build_dataset(test_size=0.2, seed=2025)
print(f"-----------------------------------")
print(f"ReproEffortPrefDataset:")
print(dataset)

In [None]:
# @title 4.2 Preview of the preference dataset for "What was easy ?" and "What was difficult ?" tasks

# print sample cut of the dataset
print("-----------------------------------")
print("Prompt for the preference dataset..")
print(dataset["train"][0]["prompt"])
print("-----------------------------------")
print("CHOSEN response:")
print(dataset["train"][0]["chosen"])
print("-----------------------------------")
print("REJECTED response:")
print(dataset["train"][0]["rejected"])
print("-----------------------------------")

In [None]:
# @title 4.3 Tokenomics to decide `max_seq_length` and `prompt_length`
# gather the train and test datasets
train_dataset = dataset["train"]
test_dataset = dataset["test"]

# lets find the p95 length of the prompt
prompt_length = int(np.percentile([len(tokenizer(x)["input_ids"]) for x in train_dataset["prompt"]], 95))
max_seq_length_chosen = int(np.percentile([len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) for x in train_dataset], 95))
max_seq_length_rejected = int(np.percentile([len(tokenizer(x["prompt"] + x["rejected"])["input_ids"]) for x in train_dataset], 95))
max_seq_length = max(max_seq_length_chosen, max_seq_length_rejected)

# filter datasets to remove samples that are too long
train_dataset = train_dataset.filter(lambda x: len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) <= max_seq_length)
test_dataset = test_dataset.filter(lambda x: len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) <= max_seq_length)
print(f"len(train_dataset): {len(train_dataset)}")
print(f"len(test_dataset): {len(test_dataset)}")

# round the lengths up to the next even number
prompt_length = ((prompt_length + 1) // 2) * 2
max_seq_length = ((max_seq_length + 1) // 2) * 2
print(f"p95 prompt length: {prompt_length}")
print(f"p95 prompt + response length: {max_seq_length}")

# prompt_length = 512
# max_seq_length = 512
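
The length budgeting in this cell can be tried on toy numbers without a tokenizer. The sketch below uses a simple nearest-rank percentile and a hypothetical `round_up_even` helper, both assumptions made for illustration; `np.percentile` interpolates by default, so its result can differ slightly from this version.

```python
import math

def percentile_nearest_rank(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def round_up_even(n):
    """Round an integer up to the next even number (same arithmetic as the cell above)."""
    return ((n + 1) // 2) * 2

# hypothetical token counts for 20 prompts
token_counts = list(range(101, 121))
p95 = percentile_nearest_rank(token_counts, 95)
print(p95, round_up_even(p95))
```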

In [None]:
# @title 4.4 Train the first $\pi_{LLMSciSci}$ policy via $L_{DPO}$ using the $\sigma$ loss

# # LoRA config
# peft_config = LoraConfig(
#     lora_alpha=128,
#     lora_dropout=0.05,
#     r=256,
#     bias="none",
#     target_modules="all-linear",
#     task_type="CAUSAL_LM",
# )

# # dpo params
# dpo_args = {
#     "beta": 0.3,
#     "loss_type": "sigmoid"
# }

# # args
# training_args = DPOConfig(output_dir="llmscisci-DPO-sigmoid-beta-0.3", \
#                           run_name="rn-llmscisci-DPO-sigmoid-beta-0.3", \
#                           logging_steps=10, \
#                           num_train_epochs=10, \
#                           max_length=max_seq_length, \
#                           max_prompt_length=prompt_length, \
#                           beta=dpo_args["beta"], \
#                           loss_type=dpo_args["loss_type"], \
#                           label_names=["chosen", "rejected"])

# # init DPO trainer
# trainer = DPOTrainer(model=model, \
#                      peft_config=peft_config, \
#                      args=training_args, \
#                      processing_class=tokenizer, \
#                      train_dataset=train_dataset)

# # train
# trainer.train()

# # save model weights
# trainer.save_model()

In [None]:
# @title 4.5 Train the first $\pi_{LLMSciSci}$ policy via $L_{DPO}$ using the $WPO$ loss

# # LoRA config
# peft_config = LoraConfig(
#     lora_alpha=128,
#     lora_dropout=0.05,
#     r=256,
#     bias="none",
#     target_modules="all-linear",
#     task_type="CAUSAL_LM",
# )

# # dpo params
# dpo_args = {
#     "beta": 0.1,                            # The beta factor in DPO loss. Higher beta means less divergence
#     "loss_type": "sigmoid"                  # The loss type for DPO.
# }

# # args
# training_args = DPOConfig(output_dir="llmscisci-DPO-WPO-gamma", \
#                           run_name="rn-llmscisci-DPO-WPO-gamma", \
#                           use_weighting=True, \
#                           logging_steps=10, \
#                           num_train_epochs=10, \
#                           max_length=max_seq_length, \
#                           max_prompt_length=prompt_length, \
#                           beta=dpo_args["beta"], \
#                           loss_type=dpo_args["loss_type"], \
#                           label_names=["chosen", "rejected"])

# # init DPO trainer
# trainer = DPOTrainer(model=model, \
#                      peft_config=peft_config, \
#                      args=training_args, \
#                      processing_class=tokenizer, \
#                      train_dataset=train_dataset)

# # train
# trainer.train()

# # save model weights
# trainer.save_model()

In [None]:
# @title 4.6 Train the first $\pi_{LLMSciSci}$ policy via $L_{DPO}$ using the $rDPO$ loss with $\epsilon = 0.2$, label_smoothing=0.05

# # LoRA config
# peft_config = LoraConfig(
#     lora_alpha=128,
#     lora_dropout=0.05,
#     r=256,
#     bias="none",
#     target_modules="all-linear",
#     task_type="CAUSAL_LM",
# )

# # dpo params
# dpo_args = {
#     "beta": 0.3,
#     "loss_type": "robust"
# }

# # args
# training_args = DPOConfig(output_dir="llmscisci-DPO-rob_ep_0.3-tmp", \
#                           run_name="rn-llmscisci-DPO-rob_ep_0.3-tmp", \
#                           label_smoothing=0.05, \
#                           logging_steps=10, \
#                           num_train_epochs=10, \
#                           max_length=max_seq_length, \
#                           max_prompt_length=prompt_length, \
#                           beta=dpo_args["beta"], \
#                           loss_type=dpo_args["loss_type"], \
#                           label_names=["chosen", "rejected"])

# # init DPO trainer
# trainer = DPOTrainer(model=model, \
#                      peft_config=peft_config, \
#                      args=training_args, \
#                      processing_class=tokenizer, \
#                      train_dataset=train_dataset)

# # train
# trainer.train()

# # save model weights
# trainer.save_model()

In [None]:
# @title 4.7 Train the first $\pi_{LLMSciSci}$ policy via $L_{DPO}$ using the $hinge$ loss with $\beta = 0.05$

# LoRA config
peft_config = LoraConfig(
    lora_alpha=128,
    lora_dropout=0.05,
    r=256,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# dpo params
dpo_args = {
    "beta": 0.05,
    "loss_type": "hinge"
}

# args
training_args = DPOConfig(output_dir="llmscisci-DPO-best", \
                          run_name="rn-llmscisci-DPO-best", \
                          logging_steps=10, \
                          num_train_epochs=10, \
                          max_length=max_seq_length, \
                          max_prompt_length=prompt_length, \
                          beta=dpo_args["beta"], \
                          loss_type=dpo_args["loss_type"], \
                          label_names=["chosen", "rejected"])

# init DPO trainer
trainer = DPOTrainer(model=model, \
                     peft_config=peft_config, \
                     args=training_args, \
                     processing_class=tokenizer, \
                     train_dataset=train_dataset)

# train
trainer.train()

# save model weights
trainer.save_model()
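
The training cells in this section switch between the `sigmoid`, `robust`, and `hinge` loss types. The sketch below writes each as a function of $h$, the difference between the chosen and rejected policy-to-reference log-ratios. The formulas reflect our understanding of how TRL defines these variants and should be treated as an assumption; check the TRL source for the installed version. The function names and sample values are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def sigmoid_loss(h, beta):
    # standard DPO loss
    return -log_sigmoid(beta * h)

def hinge_loss(h, beta):
    # zero once the scaled margin beta * h reaches 1
    return max(0.0, 1.0 - beta * h)

def robust_loss(h, beta, eps):
    # eps is the assumed probability that a preference label is flipped
    return (-(1 - eps) * log_sigmoid(beta * h)
            + eps * log_sigmoid(-beta * h)) / (1 - 2 * eps)

for h in (-5.0, 0.0, 5.0, 30.0):
    print(h,
          round(sigmoid_loss(h, 0.3), 4),
          round(hinge_loss(h, 0.05), 4),
          round(robust_loss(h, 0.3, 0.05), 4))
```

With `beta=0.05`, the hinge loss reaches zero once $h \ge 20$, after which that pair no longer contributes a gradient.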

### 5. Inference check $\pi_{LLMSciSci}$ against $\pi_{LLM-instruct}$ on sample outputs

In [None]:
# @title 5.1 Load reference model for $\pi_{LLM-Instruct}$
# load reference model without policy update
ref_model, _ = load_model("llama-3.2-1b", "cuda")

In [None]:
# @title 5.2 Sample inference check on $D_{test}$
from IPython.display import display, Latex

# seed for reproducibility
set_seed(2025)

# set top_p and temperature to none
ref_model.generation_config.temperature=None
ref_model.generation_config.top_p=None
trainer.model.generation_config.temperature=None
trainer.model.generation_config.top_p=None

# inputs, attention mask, and shape
idx = 0
input_encoded = tokenizer(test_dataset["prompt"][idx], padding=True, return_tensors="pt")
input_encoded_ids = input_encoded["input_ids"].to("cuda")
input_encoded_attn_mask = input_encoded["attention_mask"].to("cuda")
ref_input_encoded_ids = input_encoded["input_ids"].to("cuda")
ref_input_encoded_attn_mask = input_encoded["attention_mask"].to("cuda")
input_shape = len(input_encoded["input_ids"][0])

# model outputs on test prompts
outputs = trainer.model.generate(
    input_ids=input_encoded_ids,
    attention_mask=input_encoded_attn_mask,
    max_new_tokens=512,
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>")
)

# reference model outputs on test prompts
ref_outputs = ref_model.generate(
    input_ids=ref_input_encoded_ids,
    attention_mask=ref_input_encoded_attn_mask,
    max_new_tokens=512,
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>")
)

print("------------------------------------------------------")
# print("Input:")
# print(test_dataset["prompt"][idx])
# print("------------------------------------------------------")
print("Policy model response")
display(Latex(r'\pi_{LLMSciSci}:'))
output = tokenizer.decode(outputs[0][input_shape:], skip_special_tokens=True)
print(output)
print("------------------------------------------------------")
print("Reference model response (llama-3.2-1b)")
display(Latex(r'\pi_{LLM-instruct}:'))
ref_output = tokenizer.decode(ref_outputs[0][input_shape:], skip_special_tokens=True)
print(ref_output)
print("------------------------------------------------------")
print("Chosen response:")
print(test_dataset["chosen"][idx])
print("------------------------------------------------------")

In [None]:
# del model, tokenizer, trainer
# torch.cuda.empty_cache()
# gc.collect()