## Large Language Models for Scientometrics

**Large Language Models:**

Large Language Models (**LLMs**) can process data from different modalities and perform well on tasks ranging from information extraction and question answering to math, coding, and, more recently, reasoning. Training these models involves many choices, including datasets and data mixes, architectures, and alignment strategies **[1]**, which might intuitively suggest picking a specific model for each task. In practice, however, **LLMs** are increasingly treated as general-purpose task solvers.

For this hands-on session we use the Reproducibility dataset from the paper <u>Laying Foundations to Quantify the "Effort of Reproducibility"</u> **[2]** to preference tune answers using the **Group Relative Policy Optimization (GRPO)** algorithm. *GRPO* **[3]** was introduced by DeepSeek to improve mathematical reasoning. It samples a group of completions for each prompt and scores every completion relative to the rest of its group, which removes the need for a separate value model.
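
To make the "group relative" idea concrete, here is a minimal sketch of how rewards within one group of sampled completions can be turned into advantages. It assumes the population standard deviation and a small `eps` for numerical stability; the function name `group_relative_advantages` is hypothetical and this is not the exact code used by any particular trainer.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Assumption: population standard deviation plus a small eps; other
    implementations may use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# four completions for one prompt, two of which earned the reward
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and the rest get a negative one. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.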

**Reference(s)**:
<br>
**[1]** [A Survey of Large Language Models](https://arxiv.org/abs/2303.18223)
<br>
**[2]** [Laying Foundations to Quantify the “Effort of Reproducibility”](https://ieeexplore.ieee.org/abstract/document/10266070)
<br>
**[3]** [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/pdf/2402.03300)

**Other Resources**:

<img src="https://images.ctfassets.net/cnu0m8re1exe/sIyPeDxgpIluQqQWK8nhS/67004d28ebbce2ca1f654a7a0afd92b3/SciSci.png" align="center" width=400 height=500>

>(Credit: Davide Bonazzi) from [*Discover Magazine*](https://www.discovermagazine.com/the-sciences/the-science-of-science)

**Table of Contents**:
- Setup
- Prepare Preference Dataset for **Group Relative Policy Optimization (GRPO)**
- API & Local Models setup
- Preference tuning via **Group Relative Policy Optimization (GRPO)**

### 1. Setup

In [None]:
# @title 1.1 Install necessary libraries

# install transformers
print(f"Installing latest transformers...")
!pip install -q git+https://github.com/huggingface/transformers

# install tiktoken
print(f"Installing tiktoken...")
!pip install -q tiktoken

# install outlines
print(f"Installing outlines...")
!pip install -q outlines

# install huggingface-trl
print(f"Installing huggingface-trl...")
!pip install -q trl

# install flash attention
print(f"Installing flash-attention-2...")
!pip install -q flash-attn --no-build-isolation

# install bitsandbytes
print(f"Installing bitsandbytes...")
!pip install -q -U bitsandbytes

# install openai
print(f"Installing openai...")
!pip install -q openai

In [None]:
# @title 1.2 Import necessary libraries
# This Source Code Form is subject to the terms of the MIT
# License. If a copy of the same was not distributed with this
# file, You can obtain one at
# https://github.com/Northwestern-CSSI/LLMSciSci/blob/main/LICENSE.

import os
import gc
import bs4
import time
import json
import torch
import urllib3
import pickle
import pathlib
import tiktoken
import transformers
import numpy as np
import pandas as pd
import polars as pl
import openai as oai
import seaborn as sns
from google import genai
from pprint import pprint
from peft import LoraConfig
from ast import literal_eval
from google.genai import types
from tqdm.notebook import tqdm
from pydantic import BaseModel
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup as BS
from collections import defaultdict
from outlines import models, generate
from typing import List, Optional, Union
from datasets import Dataset, DatasetDict
from collections import Counter, OrderedDict
from transformers import BitsAndBytesConfig, set_seed
from pydantic import BaseModel, create_model, RootModel
from transformers import (
    AutoTokenizer,
    Gemma3ForCausalLM,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig,
    set_seed
)
from trl import (
    GRPOConfig,
    GRPOTrainer,
    ModelConfig,
    ScriptArguments,
    TrlParser,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)

In [None]:
# @title 1.3 Load `ReScience` dataset - [Download Data](https://drive.google.com/drive/folders/1qLCC5ZiDWoRtMQyBTeMxPrPlJLxcVgMN?usp=sharing)
!ls -lah ./drive/MyDrive/CSSI/Lecture

# set the directory
os.chdir("./drive/MyDrive/CSSI/Lecture")

# read rescience
rescience = pl.read_csv("./data/ReScience_JCDL-23.csv")

# show shape and columns
print("-------------------------------")
print(f"Data shape: {rescience.shape}")
print("-------------------------------")

"""
Data columns: ['author', 'title', 'doi', 'article_type', 'lang', 'pdf_url', 'keywords', 'review_url',
              'code_url', 'volume', 'issue', 'year', 'abstract', 'easy', 'difficult', 'gs_citations',
              'gs_scholar_url', 'original_pdf_url', 'original_article_url', 'reason_for_easiness',
              'reason_for_difficulty', 'limitations_results', 'scope_of_reproducibility',
              'original_abstract', 'orig_art_sciparse_full_text', 'orig_art_pdfminer_full_text',
              'original_sections', 'no_hyp', 'no_alg', 'no_images', 'no_equations', 'no_tables',
              'is_meth_pres', 'is_intro_pres', 'link_to_code_available', 'mean_readability',
              'hyp_available_in_text', 'easiness_longform', 'difficult_longform',
              'list_for_limitations', 'list_for_diff', 'list_for_easiness', 'more_than_one_easy']
"""

# metadata
meta_data_columns = ["doi", "title", "review_url", "easy", "difficult",
                     "scope_of_reproducibility", "reason_for_easiness", "reason_for_difficulty"]
print(f"Rescience Metadata columns of interest: {meta_data_columns}")
print("-------------------------------")

# sneak peek of the data
print(rescience.select(meta_data_columns).head())

### 2. Prepare Preference Dataset for **Group Relative Policy Optimization (GRPO)**

<img src="https://github.com/akhilpandey95/LLMSciSci/blob/main/media/LLMSciSci_GRPO_dataset.png?raw=true" width=700 height=450>

In [None]:
# @title 2.1 Load raw preference data from `GPT`, `Gemini` and `Llama` responses
# read the gemini labelling data
gemini_effortly = pl.read_csv("./data/gemini_effortly_labels_gamma.csv")

# read the gpt labelling data
gpt_effortly = pl.read_csv("./data/gpt4_effortly_labels_beta.csv")

# read the llama labelling data
llama_effortly = pl.read_csv("./data/llama3_effortly_labels_beta.csv")

# show a preview of the response(s)
print("---------------------------")
print(f"Response from gemini")
print(gemini_effortly.select("easy_gemini_response")[0].item())
print("---------------------------")
print(f"Response from gpt")
print(gpt_effortly.select("easy_gpt_response")[0].item())
print("---------------------------")
print(f"Response from llama")
print(llama_effortly.select("easy_llama3_response")[0].item())
print("---------------------------")

In [None]:
# @title 2.2 helper functions to generate synthetic COT using `Gemini`
# structured response for synthetic COT traces
class ReproEffortCOT(BaseModel):
    cot_trace: str

# main routine function to generate synthetic COT
def easy_diff_COT_generate(text: str, label: str, task: str):
    """
    Processing routine to take a raw text and
    label to generate and return synthetic COT
    from reasoning models.

    --------------------
    Parameters:
        arg1 | text: str
            Raw text from "What was easy ?" or "What was difficult ?" sections
        arg2 | label: str
            Parsed ground truth label for "What was easy ?" or "What was difficult ?" sections
        arg3 | task: str
            Task type "easy" or "difficult"

    --------------------
    Returns:
        Dictionary
            dict
    """
    # default system prompt
    sys_prompt = """
    You are an intelligent research assistant working on elaborating
    your thought process about evaluating the effort of reproducing
    academic articles.
    """

    # default rules
    rules = """
    **Rules:**
    1. PUT ALL of your reasoning trace within steps, like step 1, step 2 etc.
    2. The generated reasoning trace must connect how humans arrived at the "Ground truth" sections.
    3. Ideally, the generated chain of thought should look like internal monologue and not robotic steps.
    """

    # init user prompt
    user_prompt, input_text, response_data = None, None, None

    # what was easy ?
    if task == "easy":
        # init user prompt for the task
        user_prompt="""
        **Task:** You are given brief descriptions that made it easy for researcher
        to reproduce original articles. Your goal is to analyze the brief description
        and generate reasoning chains of thought that helped people classify them
        into one or more from the following five categories, which include:

        1. Availability of Code
        2. Supporting Artifacts
        3. Readability of Full Text
        4. Experimental Setup or Environment
        5. Cannot extract concrete factors that Eased Reproducibility.
        """

        # init the prompt
        input_text = """
        **What was easy:**
        ```plaintext
        EASY_DESCRIPTION
        ```

        **Ground truth:**
        ```plaintext
        EASY_LABEL
        ```
        """

        # replace inputs
        input_text = input_text.replace("EASY_DESCRIPTION", text).replace("EASY_LABEL", label)
    else:
        # init user prompt for the task
        user_prompt="""
        **Task:** You are given brief descriptions that made it difficult for
        researcher to reproduce original articles. Your goal is to analyze the
        brief description and generate reasoning chains of thought that helped
        people classify them into one or more from the following five categories,
        which include:

        1. Missing Algorithm step or Architecture details
        2. Missing nuance details
        3. Unclear notation or documentation in the codebase
        4. Insufficient Math/Equations
        5. Cannot extract concrete factors that made it difficult for reproducibility.
        """

        # init the prompt
        input_text = """
        **What was difficult:**
        ```
        DIFF_DESCRIPTION
        ```

        **Ground truth:**
        ```plaintext
        DIFF_LABEL
        ```
        """

        # replace inputs
        input_text = input_text.replace("DIFF_DESCRIPTION", text).replace("DIFF_LABEL", label)

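    # choose the model; read the API key from the environment instead of hardcoding it
    # (assumes the GEMINI_API_KEY environment variable is set)
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])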

    # config to run model
    generate_content_config = types.GenerateContentConfig(
            max_output_tokens=2048,
            safety_settings=[
                types.SafetySetting(
                    category="HARM_CATEGORY_CIVIC_INTEGRITY",
                    threshold="OFF",
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_HARASSMENT",
                    threshold="OFF",
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_HATE_SPEECH",
                    threshold="OFF",
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
                    threshold="OFF",
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_DANGEROUS_CONTENT",
                    threshold="OFF",
                ),
            ],
            response_mime_type="text/plain",
        )

    # capture prompt response from Gemini
    response = client.models.generate_content(model = 'gemini-2.0-flash-thinking-exp-01-21', \
                                              contents = sys_prompt + user_prompt + input_text + rules, \
                                              config=generate_content_config
                                              )

    # check if the response could be fetched
    if not response.text:
        # throw an exception
        raise ValueError("[ERR]easy_diff_COT_generate: Empty response from the model")

    # update the response
    response_data = response.text.strip()

    # return the response
    return response_data

In [None]:
# @title 2.3 Run `easy_diff_COT_generate()`, pickle results

# init dict to store results
results = defaultdict(dict)
records = []

"""
- run easy_diff_COT_generate() on all entries in gpt_effortly dataframe
- save the results into a pickled object and save it to `./data/`
"""
# # set the keys for processing
# task_types = ["easy", "difficult"]
# chosen_types = ["easy_gpt_response", "diff_gpt_response"]

# # iterate over all of the samples in easy
# for idx in tqdm(range(len(gpt_effortly))):

#     for i, task in enumerate(task_types):
#         # get the text
#         text = gpt_effortly.select(task)[idx].item()
#         label = gpt_effortly.select(chosen_types[i])[idx].item()

#         # generate reasoning trace
#         cot = easy_diff_COT_generate(text, label, task)

#         # store the results
#         records.append({
#             "idx": (2*idx) + i,
#             "doi": gpt_effortly.select("doi")[idx].item(),
#             "task": task,
#             "label": label,
#             "cot": cot
#         })

# load the pickle file
with open("./data/easy_diff_COT.pkl", "rb") as f:
    records = pickle.load(f)

# shape of the records
print(f"# of records: {len(records)}")
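
The cell above only loads the pickled records; the step that writes them is described but not shown. A minimal sketch of that round trip follows, assuming the records are a plain list of dictionaries with the keys used in the commented-out loop. The helper names and the DOI value are placeholders, and the demo writes to a temporary directory rather than `./data/`.

```python
import pathlib
import pickle
import tempfile


def save_records(records, path):
    """Pickle the list of COT records, creating the parent directory if needed."""
    path = pathlib.Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(records, f)


def load_records(path):
    """Load a previously pickled list of COT records."""
    with open(path, "rb") as f:
        return pickle.load(f)


# placeholder record with the same keys as the generation loop
demo_records = [
    {"idx": 0, "doi": "10.0000/example", "task": "easy",
     "label": "[1, 0, 0, 0, 0]", "cot": "step 1: ..."},
]

with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "data" / "easy_diff_COT.pkl"
    save_records(demo_records, target)
    restored = load_records(target)

print(f"# of records: {len(restored)}")
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.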

In [None]:
# @title 2.4 helper class to build the $D_{ReproEffortDataset}$ preference dataset

# helper function to load/initialize the prompt
def process_prompt(raw_text, tokenizer, device, task="easy", prompt_type="prompt"):
    """
    Given raw input text generate a prompt that will
    be supplied to a preference dataset loader.

    Parameters
    ------------
    arg1 | raw_text: str
        Raw input text without prompt template
    arg2 | tokenizer: transformers.tokenization_utils_fast.PreTrainedTokenizerFast
        Tokenizer from the model
    arg3 | device: str
        Device name for the inputs and attention masks to sit on
    arg4 | task: str[OPTIONAL]
        Task type "What was easy ?" or "What was difficult ?"
    arg5 | prompt_type: str[OPTIONAL]
        String flag to be applied at the top of messages to create "prompt"
        or "completions" chat responses for the GRPO preference dataset

    Returns
    ------------
        Text
    """
    # init
    prompt = None
    messages = []
    add_generation_prompt = True
    sys_prompt, user_prompt, input_text = None, None, None

    # init system prompt available
    sys_prompt = """You are a research assistant working on understanding the
    spectrum of outputs researchers outline when reproducing
    academic articles.
    Respond in the following format:
    <think>
    ...
    </think>
    <label>
    ...
    </label>
    """

    # what was easy ?
    if task == "easy":
      # init user prompt for the task
      user_prompt="""**Task:** You are given brief descriptions that made it easy for researcher
      to reproduce original articles. Your goal is to analyze the brief description
      and classify them into one or more from the following five categories,
      which include:

      1. Availability of Code
      2. Supporting Artifacts
      3. Readability of Full Text
      4. Experimental Setup or Environment
      5. Cannot extract concrete factors that Eased Reproducibility.
      """

      # init the prompt
      input_text = """
      **What was easy:**
      <text>
      EASY_DESCRIPTION
      </text>
      """
    else:
      # init user prompt for the task
      user_prompt="""**Task:** You are given brief descriptions that made it difficult for
      researcher to reproduce original articles. Your goal is to analyze the
      description and classify them into one or more from the following
      five categories:

      1. Missing Algorithm step or Architecture details
      2. Missing nuance details
      3. Unclear notation or documentation in the codebase
      4. Insufficient Math/Equations
      5. Cannot extract concrete factors that made it difficult for reproducibility.
      """

      # init the prompt
      input_text = """
      **What was difficult:**
      <text>
      DIFF_DESCRIPTION
      </text>
      """

    # apply chat template on the chosen/rejected response
    if prompt_type == "completion":
      # init the completion prompt
      input_text = "<think>\nCOT_REASONING_TRACE\n</think>\n<label>\nGROUND_TRUTH_LABEL\n</label>"

      # add the input text
      input_text = input_text.replace("COT_REASONING_TRACE", raw_text[0]).replace("GROUND_TRUTH_LABEL", raw_text[1])

      # gemma needs user/assistant and not just assistant
      if isinstance(tokenizer, transformers.models.gemma.tokenization_gemma_fast.GemmaTokenizerFast):
          # set the chosen response for the preferences
          messages.append([{"role": "user", "content": ""}, {"role": "assistant", "content": input_text}])
      else:
          # set the chosen response for the preferences
          messages.append([{"role": "assistant", "content": input_text}])

      # apply prompt template
      add_generation_prompt=False

      # apply prompt and remove the system prompt
      prompt = tokenizer.apply_chat_template(messages, \
                                            tokenize=False, \
                                            use_system_prompt=add_generation_prompt, \
                                            add_generation_prompt=add_generation_prompt)
      # prompt = messages
    else:
      # adjust and replace EASY_DESCRIPTION or DIFF_DESCRIPTION based on task
      if task == "easy":
        input_text = input_text.replace("EASY_DESCRIPTION", raw_text)
      else:
        input_text = input_text.replace("DIFF_DESCRIPTION", raw_text)

      # set the prompt for the preferences
      messages.append([
          {"role": "system", "content": sys_prompt},
          {"role": "user", "content": user_prompt + input_text}
      ])

      # apply prompt template
      prompt = tokenizer.apply_chat_template(messages, \
                                            tokenize=False, \
                                            add_generation_prompt=add_generation_prompt)
      # prompt = messages

    # return the processed prompt
    return prompt

# utility class to create the preference dataset
class ReproEffortPrefDataset:
    def __init__(self, raw_data, tokenizer, records, device="cpu"):
        """
        Given raw text prepare preference dataset.

        Parameters
        ------------
        arg1 | raw_data: polars.DataFrame or pandas.DataFrame or List[dict]
            ML reproducibility challenge data processed with the following columns:
              - "easy": Raw prompt text for the easy task.
              - "easy_gpt_response": Chosen response for the easy task.
              - "easy_llama3_response": Rejected response for the easy task.
              - "difficult": Raw prompt text for the difficult task.
              - "difficult_gpt_response": Chosen response for the difficult task.
              - "diff_llama3_response": Rejected response for the difficult task.
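        arg2 | tokenizer: transformers.tokenization_utils_fast.PreTrainedTokenizerFast
            Tokenizer from the model to apply chat template.
        arg3 | records: List[dict]
            Synthetic COT records, ordered so that index (2 * idx) + i holds the
            trace for sample idx and task i (0 = "easy", 1 = "difficult").
        arg4 | device: str[OPTIONAL]
            Device name (e.g., "cpu" or "cuda") for processing.

        Returns
        ------------
            None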
        """
        # polars df ? convert it to a pandas DataFrame.
        if hasattr(raw_data, "to_pandas"):
            raw_data = raw_data.to_pandas()
        # pd df ? convert to list of dicts
        if isinstance(raw_data, pd.DataFrame):
            self.raw_data = raw_data.to_dict("records")
        elif isinstance(raw_data, list):
            self.raw_data = raw_data
        else:
            raise ValueError("ERR[ReproEffortPrefDataset]: Unsupported raw_data type; must be a pl.DataFrame, pd.DataFrame, or list[dict].")

        # init default arguments
        self.tokenizer = tokenizer
        self.device = device

    # helper function to build the dataset object
    def build_dataset(self, test_size=0.2, seed=2025):
        """
        Build a unified preference dataset with
        "What was easy ?" and "What was difficult ?" texts.

        Parameters
        ------------
        arg1 | test_size: float[OPTIONAL]
            Set the size of the test set, defaults to 0.2 or 20% of the dataset.
        arg2 | seed: int[OPTIONAL]
            Seed parameter for reproducibility, defaults to 2025.

        Returns
        ------------
            datasets.DatasetDict with "train" and "test" splits; each row has
            "prompt", "completion", "task", "type", and "idx" fields
        """
        # init list to store results
        results = []

        # set the keys for processing
        task_types = ["easy", "difficult"]
        chosen_types = ["y_easy_gpt4", "y_diff_gpt4"]

        # iterate and combine "easy" and "difficult" tasks
        for idx, sample in enumerate(self.raw_data):
            # process each task
            for i, task in enumerate(task_types):
                # set the prompt
                prompt = process_prompt(sample[task], self.tokenizer, self.device, task=task, prompt_type="prompt")[0]

                # set the completions
                chosen = process_prompt((records[(2*idx) + i]["cot"], sample[chosen_types[i]]), \
                                        self.tokenizer, self.device, task=task, prompt_type="completion")[0]

                # append the records
                results.append({
                    "prompt": prompt,
                    "completion": chosen,
                    "task": task,
                    "type": chosen_types[i],
                    "idx": sample["doi"]
                })

        # merge records
        combined_data = {
            "prompt": [r["prompt"] for r in results],
            "completion": [r["completion"] for r in results],
            "task": [r["task"] for r in results],
            "type": [r["type"] for r in results],
            "idx": [r["idx"] for r in results]
        }

        # init hf dataset and perform train/test split.
        dataset = Dataset.from_dict(combined_data)

        # shuffle the dataset (seeded so the split is reproducible)
        dataset = dataset.shuffle(seed=seed)

        # train test split
        dataset_split = dataset.train_test_split(test_size=test_size, seed=seed)

        # return final dataset
        return dataset_split
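
`build_dataset()` pairs each sample with its COT trace purely by position, using index `(2 * idx) + i`. If the records and the raw data ever fall out of order, prompts would silently be matched with the wrong reasoning. The sketch below is one way to check that alignment before building the dataset; `record_index` and `check_alignment` are hypothetical helpers, and they assume each record carries the `"doi"` and `"task"` keys written by the generation loop.

```python
TASK_TYPES = ["easy", "difficult"]


def record_index(sample_idx, task):
    """Position of the COT record for a given sample and task."""
    return 2 * sample_idx + TASK_TYPES.index(task)


def check_alignment(raw_data, records):
    """Return (sample_idx, task) pairs whose record is missing or mismatched."""
    mismatches = []
    for idx, sample in enumerate(raw_data):
        for task in TASK_TYPES:
            pos = record_index(idx, task)
            rec = records[pos] if pos < len(records) else None
            if rec is None or rec["doi"] != sample["doi"] or rec["task"] != task:
                mismatches.append((idx, task))
    return mismatches


# toy data with placeholder DOIs
raw = [{"doi": "a"}, {"doi": "b"}]
recs = [
    {"doi": "a", "task": "easy"}, {"doi": "a", "task": "difficult"},
    {"doi": "b", "task": "easy"}, {"doi": "b", "task": "difficult"},
]
print(check_alignment(raw, recs))
```

An empty list means every sample lines up with its two records; any returned pair points at the first place to look.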

### 3. API & Local Models setup

For the commercial models you need to set up an account and obtain an API key to run some of the experiments in this notebook.

<hr>

**Pre-requisites for commercial models**
<br>
**OpenAI**: https://platform.openai.com/settings/organization/api-keys
<hr>

**Pre-requisites for local models**
<br>
The experiments and widgets in the notebook require `data/` and `models/`. Since `data/` is loaded, we need model weights which can be downloaded here:
- [Models](https://drive.google.com/drive/folders/1aNT1SNA7Lz9kMgt5p1yGWST1T6D2Dmbd?usp=sharing)

In [None]:
# @title 3.1 Local model Catalog
!ls -lah models/

In [None]:
# @title 3.2 Load model client or model-tokenizer pair
# helper function to load/initialize the model
def load_model(model_name, device):
    """
    Given a model path, load tokenizer-model
    pair and return the objects tagged to the
    given device (cpu/cuda)

    Parameters
    ------------
    arg1 | model_name: str
        Use model catalog to load local model weights
    arg2 | device: str
        Hardware acceleration, defaults to "cpu" if any errors arise

    Returns
    ------------
        Tuple(model, tokenizer) loaded from the local model catalog
    """
    # accelerator centric attention implementation
    attn_implementation = "sdpa"

    # device for acceleration
    if torch.cuda.is_available():
        device = "cuda"
        # attn_implementation = "flash_attention_2"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"

    # local models
    local_models = ["llama3.2-1b", "llama3.2-3b", "llama3.1-8b", "qwen2.5-1.5b", "r1-distill-qwen-1.5b"]

    # pathlib for models
    model_path = pathlib.Path("/content/drive/MyDrive/CSSI/Lecture")

    # set the model-id
    model_catalog = {
        "llama-3.2-1b": model_path/f"models/Llama3.2-1B-Instruct/",
        "llama-3.2-3b": model_path/f"models/Llama3.2-3B-Instruct/hf/",
        "llama-3.1-8b": model_path/f"models/Meta-Llama-3.1-8B-Instruct/hf/",
        "gemma-3-1b": model_path/f"models/gemma-3-1b-it/",
        "gemma-3-4b": model_path/f"models/gemma-3-4b-it/",
        "qwen-2.5-1.5b": model_path/f"models/Qwen2.5-1.5B-Instruct/"
    }

    # set a model-id
    model_id = model_catalog[model_name]

    # log
    print("----------------------------------")
    print(f"Using {device} to load {model_id}")
    print("----------------------------------")

    # get model-tokenizer pair
    start = time.time()
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    tokenizer.padding_side  = 'left'

    # based on model size switch quantization config
    if model_name == "llama3.1-70b" or model_name == "r1-distill-llama-70b":
        # 4-bit quantization config
        bnb_4bit = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_compute_dtype=torch.bfloat16,
          bnb_4bit_quant_storage=torch.bfloat16
        )

        # 4 bit quantization
        model = AutoModelForCausalLM.from_pretrained(model_id, \
                                                   quantization_config=bnb_4bit, \
                                                   trust_remote_code=True, \
                                                   low_cpu_mem_usage=True, \
                                                   attn_implementation=attn_implementation, \
                                                   device_map=device)
    elif model_name == "gemma-3-4b" or model_name == "gemma-3-1b":
        model = Gemma3ForCausalLM.from_pretrained(model_id, \
                                              trust_remote_code=True, \
                                              torch_dtype=torch.bfloat16, \
                                              low_cpu_mem_usage=True, \
                                              attn_implementation=attn_implementation, \
                                              device_map=device)
    else:
      # 4-bit quantization config (defined here but not passed to from_pretrained below)
      bnb_4bit = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_use_double_quant=True,
          bnb_4bit_quant_type="nf4",
          bnb_4bit_compute_dtype=torch.bfloat16
      )

      # load bfloat16 weights
      model = AutoModelForCausalLM.from_pretrained(model_id, \
                                                   trust_remote_code=True, \
                                                   torch_dtype=torch.bfloat16, \
                                                   low_cpu_mem_usage=True, \
                                                   attn_implementation=attn_implementation, \
                                                   device_map=device)

    # is it a llama tokenizer ?
    if "llama" in model_name:
        # pad token if needed
        tokenizer.add_special_tokens({"pad_token": "<|finetune_right_pad_id|>"})
        print(f"Setting <|finetune_right_pad_id|> token for {model_id}")
        model.resize_token_embeddings(len(tokenizer))

        # llama prompt template
        llama_template = r"""
        {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
        """

        # set the chat template
        tokenizer.chat_template = llama_template

    # gemma eos token
    if "gemma" in model_name:
        # set the EOS tokenid
        tokenizer.eos_token_id = tokenizer.encode("<end_of_turn>")[0]

        # set the chat template
        gemma_template = r"""
        {% if messages[0]['role'] == 'system' %}
            {% set system_message = messages[0]['content'] | trim + '\n\n' %}
            {% set messages = messages[1:] %}
        {% else %}
            {% set system_message = '' %}
        {% endif %}

        {% for message in messages %}
            {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
            {% endif %}

            {% if loop.index0 == 0 %}
                {% set content = system_message + message['content'] %}
            {% else %}
                {% set content = message['content'] %}
            {% endif %}

            {% if (message['role'] == 'assistant') %}
                {% set role = 'model' %}
            {% else %}
                {% set role = message['role'] %}
            {% endif %}

            {{ '<start_of_turn>' + role + '\n' + content | trim + '<end_of_turn>\n' }}
        {% endfor %}

        {% if add_generation_prompt %}
            {{'<start_of_turn>model\n'}}
        {% endif %}
        """

        # set the chat template
        tokenizer.chat_template = gemma_template

    # load time
    end = time.time()
    print(f"Model-tokenizer Load Time: {end - start} seconds")
    print("----------------------------------")

    # return the pair
    return model, tokenizer
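
The Gemma chat template set in `load_model()` is dense Jinja, so it can help to see the same turn logic in plain Python. The sketch below is a hand-written approximation of that logic: the system message is folded into the first user turn, `assistant` is renamed to `model`, and roles must alternate. The function `render_gemma_turns` is hypothetical, and the real template may emit extra whitespace that this version does not reproduce.

```python
def render_gemma_turns(messages, add_generation_prompt=True):
    """Approximate the turn structure of the Gemma chat template in plain Python."""
    system_message = ""
    if messages and messages[0]["role"] == "system":
        # the system prompt is prepended to the first user turn
        system_message = messages[0]["content"].strip() + "\n\n"
        messages = messages[1:]

    parts = []
    for i, message in enumerate(messages):
        # even positions must be user turns, odd positions assistant turns
        if (message["role"] == "user") != (i % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/user/assistant/...")
        content = message["content"]
        if i == 0:
            content = system_message + content
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{content.strip()}<end_of_turn>\n")

    if add_generation_prompt:
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


rendered = render_gemma_turns([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
])
print(rendered)
```

This also shows why `process_prompt()` adds an empty user turn before the assistant completion for Gemma: a conversation that starts with an assistant message fails the alternation check.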

In [None]:
# @title 3.3 Load policy model for $\pi_{LLMSciSci}$

# set model and device name
# device = "cpu"
device = "cuda"
model_name = "gemma-3-1b"
# model_name = "qwen-2.5-1.5b"
# model_name = "llama-3.2-1b"

# load the model for policy update
model, tokenizer = load_model(model_name, device)

### 4. Preference tuning via **Group Relative Policy Optimization (GRPO)**

In [None]:
# @title 4.1 Init `ReproEffortPrefDataset` dataset object, train/test split

# final shape of the raw data
print(f"-----------------------------------")
print(f"Shape of raw_data: {gpt_effortly.shape}")
print(f"-----------------------------------")
print("Columns in raw_data:")
print(f"-----------------------------------")
pprint(gpt_effortly.columns)
print(f"-----------------------------------")

# convert raw data to pandas
raw_data_pd = gpt_effortly.to_pandas()

# create the preference dataset
dataset_obj = ReproEffortPrefDataset(raw_data_pd, tokenizer, records, device=device)
dataset = dataset_obj.build_dataset(test_size=0.2, seed=2025)
print(f"-----------------------------------")
print(f"ReproEffortPrefDataset:")
print(dataset)

In [None]:
# @title 4.2 Preview of the reasoning CoTs for "What was easy ?" and "What was difficult ?" tasks

# print sample cut of the dataset
print("-----------------------------------")
print("Prompt for the preference dataset..")
print(dataset["train"][-1]["prompt"])
print("-----------------------------------")
print("Reasoning:")
print(dataset["train"][-1]["completion"])
print("-----------------------------------")

In [None]:
# @title 4.3 Tokenomics to decide `max_seq_length` and `prompt_length`
# gather the train and test datasets
train_dataset = dataset["train"]
test_dataset = dataset["test"]

# lets find the p95 length of the prompt
prompt_length = int(np.percentile([len(tokenizer(x)["input_ids"]) for x in train_dataset["prompt"]], 95))
max_seq_length = int(np.percentile([len(tokenizer(x["prompt"] + \
                                                  x["completion"])["input_ids"]) \
                                    for x in train_dataset], 95))

# filter datasets to remove samples that are too long
train_dataset = train_dataset.filter(lambda x: len(tokenizer(x["prompt"] + \
                                                             x["completion"])["input_ids"]) <= max_seq_length)
test_dataset = test_dataset.filter(lambda x: len(tokenizer(x["prompt"] + \
                                                           x["completion"])["input_ids"]) <= max_seq_length)
print(f"len(train_dataset): {len(train_dataset)}")
print(f"len(test_dataset): {len(test_dataset)}")

# round the lengths up to the next multiple of 2
prompt_length = ((prompt_length + 1) // 2) * 2
max_seq_length = ((max_seq_length + 1) // 2) * 2
print(f"p95 prompt length (rounded up): {prompt_length}")
print(f"p95 prompt + completion length (rounded up): {max_seq_length}")
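
As a rough illustration of the length budget computed in 4.3, the sketch below repeats the two steps on toy token counts using only the standard library: a percentile cut-off, then rounding up to a multiple. The helper names `nearest_rank_percentile` and `round_up_to_multiple` and the toy lengths are assumptions made for this example. `np.percentile` interpolates linearly by default, so its values can differ slightly from the nearest-rank rule used here.

```python
def nearest_rank_percentile(values, q):
    """Return the q-th percentile (integer q, 0 < q <= 100) by the nearest-rank rule."""
    ordered = sorted(values)
    # 1-based rank = ceil(q * n / 100), computed with integers to avoid float error
    rank = (q * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def round_up_to_multiple(n, multiple=2):
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


# hypothetical token counts for 20 prompt + completion pairs, one clear outlier
toy_lengths = [101, 110, 118, 120, 127, 133, 135, 141, 150, 151,
               158, 160, 163, 170, 177, 185, 193, 200, 241, 900]

p95 = nearest_rank_percentile(toy_lengths, 95)
budget = round_up_to_multiple(p95, 2)

# samples longer than the p95 cut-off are dropped, as in the filter step
kept = [n for n in toy_lengths if n <= p95]

print(f"p95: {p95}, rounded budget: {budget}, kept: {len(kept)}/{len(toy_lengths)}")
```

The single 900-token outlier is the only sample removed, which is the point of using a percentile instead of the maximum.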

In [None]:
# @title 4.4 Build rewards for $GRPO$
import re
from ast import literal_eval as le
from sklearn.metrics import hamming_loss as hl

# helper function for getting the label out
def extract_text(text: str, tag: str) -> str:
    label = text.split("<" + tag + ">")[-1]
    label = label.split("</" + tag + ">")[0]
    return label.strip()

# helper function to check that the extracted label is a list of 0/1 integers
# of the expected length (several positions may be 1 at once)
def is_valid_onehot(label_str: str, expected_length: int = 5) -> bool:
    try:
        label = le(label_str)
        if isinstance(label, list) and len(label) == expected_length:
            return all(isinstance(x, int) and x in (0, 1) for x in label)
    except Exception:
        pass
    return False

# hard reward function to check overall answer format
def format_reward_func(completions, **kwargs):
    # the inline (?sm) flags already make "." match newlines, so no extra flags are passed
    pattern = r"(?sm).*<think>.*?</think>\s*<label>.*?</label>.*"
    matches = [re.match(pattern, content) for content in completions]
    return [1.0 if match else 0.0 for match in matches]

# soft reward function to reward one hot labels
def label_reward_func(completions, **kwargs):
    # regex pattern to capture results
    # pattern = r"(?s).*<think>.*?</think>\s*<label>.*?</label>.*"
    pattern = r"(?sm).*<think>.*?</think>\s*<label>.*?</label>.*"

    # init results
    rewards = []

    # reward 0.5 when the format matches and the label is a valid 0/1 list
    for content in completions:
        try:
            if re.match(pattern, content) and is_valid_onehot(extract_text(content, "label")):
                rewards.append(0.5)
            else:
                rewards.append(0.0)
        except Exception:
            # anything unexpected (e.g. non-string content) earns no reward
            rewards.append(0.0)

    # return rewards
    return rewards

# stepwise label rewards function to encourage progress
def stepwise_label_reward_func(completions, **kwargs):
    """
    Stepwise reward:

    1) +0.125 if there exists text within <label>...</label>.
    2) +0.125 if that text consists only of 0's and 1's (ignoring brackets, commas, and whitespace).
    3) +0.125 if the text starts with '[' and ends with ']'.
    4) +0.625 if the text passes the is_valid_onehot() check.
    """

    # This pattern requires a <think>...</think> block (possibly empty)
    # followed by <label>...</label> in the content (dot matches newlines).
    pattern = r"(?sm).*<think>.*?</think>\s*<label>.*?</label>.*"

    rewards = []
    for content in completions:
        total_reward = 0.0
        try:
            # Make sure the content at least matches the structure
            # of having <label>...</label> after <think>...</think>
            if re.match(pattern, content):
                # Extract whatever is between <label> and </label>
                label_str = extract_text(content, "label")

                # 1) Check that we indeed have label text.
                if label_str is not None and label_str.strip():
                    total_reward += 0.125

                    # 2) Check if the extracted text has only 0's and 1's
                    #    (ignoring brackets, commas, and whitespace).
                    stripped_str = re.sub(r'[\[\],\s]', '', label_str)
                    if re.fullmatch(r'[01]+', stripped_str):
                        total_reward += 0.125

                    # 3) Check if the label_str starts with '[' and ends with ']'
                    if label_str.startswith('[') and label_str.endswith(']'):
                        total_reward += 0.125

                    # 4) Check if it passes the is_valid_onehot() condition
                    if is_valid_onehot(label_str):
                        total_reward += 0.625

            # accrue rewards
            rewards.append(total_reward)
        except Exception:
            # If something failed, we just assign zero
            rewards.append(0.0)

    # return rewards
    return rewards

# reward function to check correctness through hamming loss of the generated label
def hamming_loss_label_reward(prompts, completions, task, ltype, dois, **kwargs):
    # init results
    rewards = []

    # regex pattern to capture results
    pattern = r"(?sm).*<think>.*?</think>\s*<label>.*?</label>.*"

    # iterate over all samples
    for completion, lt, doi in zip(completions, ltype, dois):
        # malformed completions get no correctness reward
        # (previously re.match(...) returned None here and raised AttributeError)
        if not re.match(pattern, completion):
            rewards.append(0.0)
            continue

        # get ground truth label for this paper and label type
        y_true = gpt_effortly.filter(pl.col("doi") == doi).select(lt).item()

        # compare the predicted label against the ground truth
        y_hat = extract_text(completion, "label")
        if is_valid_onehot(y_hat):
            rewards.append(1 - hl(le(y_true), le(y_hat)))
        else:
            rewards.append(0.0)

    # return rewards
    return rewards
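
The correctness reward above is one minus the Hamming loss, i.e. the fraction of label positions that agree with the ground truth. The sketch below works through that arithmetic with the standard library only. `hamming_reward` is an illustrative name, not a function from this notebook; for equal-length binary lists it should agree with `1 - sklearn.metrics.hamming_loss(y_true, y_pred)`.

```python
def hamming_reward(y_true, y_pred):
    """Return 1 - Hamming loss for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("labels must be non-empty and of equal length")
    # count the positions where prediction and ground truth disagree
    mismatches = sum(t != p for t, p in zip(y_true, y_pred))
    return 1.0 - mismatches / len(y_true)


truth = [1, 1, 1, 0, 0]

exact = hamming_reward(truth, [1, 1, 1, 0, 0])      # all five positions agree
one_off = hamming_reward(truth, [1, 0, 1, 0, 0])    # one of five positions differs
all_zeros = hamming_reward(truth, [0, 0, 0, 0, 0])  # constant guess, three differ
inverted = hamming_reward(truth, [0, 0, 0, 1, 1])   # every position differs

print(exact, one_off, all_zeros, inverted)
```

A constant all-zeros guess still earns partial credit here because two positions happen to agree, which is worth keeping in mind when checking for reward hacking.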

In [None]:
# @title 4.5 Testing the rewards on samples

print(f"Format rewards for five examples in training set:")
print(format_reward_func(completions=train_dataset["completion"])[:5])
print(f"Label rewards for five examples in training set:")
print(label_reward_func(train_dataset["completion"])[:5])
print(f"Hamming loss reward for five examples in training set:")
print(hamming_loss_label_reward(train_dataset["prompt"], train_dataset["completion"], train_dataset["task"], train_dataset["type"], train_dataset["idx"])[:5])
print(f"Stepwise label reward for five examples in training set:")
print(stepwise_label_reward_func(train_dataset["completion"])[:5])

In [None]:
# cot length rewards function to encourage longer chain of thought
def cot_trace_length_reward(completions, **kwargs):
    # init rewards
    rewards = []

    # regex pattern
    pattern = r"(?sm).*<think>.*?</think>\s*<label>.*?</label>.*"

    # iterate over the completions to calculate length reward
    for completion in completions:
        try:
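            # extract the chain-of-thought text inside <think>...</think>
            # (re.match returns None for malformed completions; the resulting
            # AttributeError is caught below and scored as 0)
            cot_content = re.match(pattern, completion).string
            cot_content = extract_text(cot_content, "think")

            # use the token count of the chain of thought as the raw reward
            token_length = len(tokenizer(cot_content, padding=True, return_tensors="pt")["input_ids"][0])
            rewards.append(token_length)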
        except Exception:
            # empty rewards
            rewards.append(0)

    # return rewards
    return rewards

# combine stepwise rewards and cot length reward
def cond_cot_steplabel_reward(completions, **kwargs):
    """
    Only add a length-based reward if the label score reaches a validity threshold.
    final_reward = label_reward + alpha * length_reward (label_score >= validity_threshold)
                 = label_reward                         (otherwise)

    `alpha` (scaling factor for the length reward) and `validity_threshold`
    (minimum stepwise label score that unlocks the length reward) are fixed
    inside the function body rather than passed as arguments.

    :param completions: list of generated outputs
    :param kwargs: any extra arguments your sub-reward functions need
    :return: a list of float final rewards
    """
    # set params
    alpha = 0.1
    validity_threshold = 1.0

    # 1) Get the label reward for each completion
    label_scores = stepwise_label_reward_func(completions, **kwargs)

    # 2) Get the chain-of-thought length reward for each completion
    length_scores = cot_trace_length_reward(completions, **kwargs)

    # 3) Combine them conditionally
    final_rewards = []
    for label_score, length_score in zip(label_scores, length_scores):
        if label_score >= validity_threshold:
            # Label is valid enough -> add length reward
            final_reward = label_score + alpha * length_score
        else:
            # Label isn't valid -> no length reward
            final_reward = label_score
        final_rewards.append(final_reward)

    return final_rewards

print(f"COT trace length completion reward:")
print(cot_trace_length_reward(train_dataset["completion"])[:5])

print(f"Combining COT length reward with Stepwise label reward:")
print(cond_cot_steplabel_reward(train_dataset["completion"])[:5])
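
Because `cot_trace_length_reward` returns a raw token count, the term `alpha * length_score` can dwarf the label reward: with `alpha = 0.1`, a 300-token trace adds 30.0 on top of a maximum label score of 1.0. One possible mitigation, shown here as an assumption and not as what this notebook does, is to cap and normalize the length term. The names `capped_length_bonus` and `combine_rewards`, and the values `target_length=256` and `max_bonus=0.25`, are hypothetical.

```python
def capped_length_bonus(token_length, target_length=256, max_bonus=0.25):
    """Scale the length bonus linearly up to target_length, then hold it at max_bonus."""
    if token_length <= 0 or target_length <= 0:
        return 0.0
    return max_bonus * min(token_length, target_length) / target_length


def combine_rewards(label_score, token_length, validity_threshold=1.0):
    """Add the capped length bonus only when the label score reaches the threshold."""
    if label_score >= validity_threshold:
        return label_score + capped_length_bonus(token_length)
    return label_score


half_target = combine_rewards(1.0, 128)     # valid label, trace at half the target length
very_long = combine_rewards(1.0, 5000)      # valid label, bonus saturates at max_bonus
invalid_long = combine_rewards(0.375, 5000) # invalid label, no length bonus at all

print(half_target, very_long, invalid_long)
```

With a cap, padding the chain of thought past `target_length` gains nothing, so the label term stays the dominant part of the reward.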

In [None]:
# @title 4.6 Ablations and safety testing to mitigate reward hacking
a = """
-----------------------------------
Reasoning:
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>assistant
<think>
```
step 1: Okay, let's break down this description to see why it's considered easy to reproduce.  First sentence: "The original paper is well‐structured and easy to follow, with the principal ideas behind the proposed algorithms being very intuitive."  Hmm, "well-structured and easy to follow" clearly speaks to how readable the paper is.  It's not about code or data yet, but how the paper itself is written.  This sounds like it directly addresses category 3, 'Readability of Full Text'.  Intuitive algorithms also contribute to understandability, reinforcing this point. So, for Readability, it's a definite 'Yes'.

step 2: Moving to the second sentence: "Additionally, the datasets used in the ex‐ periments are publicly available, small in size, and the authors provide their code on GitHub."  Ah, datasets and code! These are tangible things needed for reproduction beyond just reading the paper.  "Datasets publicly available" is great.  Publicly available datasets are definitely 'Supporting Artifacts' (category 2).  The fact they are "small in size" is an added bonus for ease of use, but the key is their availability.  So, Supporting Artifacts gets a 'Yes'.

step 3:  And then, "authors provide their code on GitHub."  This is super explicit.  'Availability of Code' (category 1) is a direct match.  GitHub is a common platform for sharing code, making it easily accessible.  So, 'Availability of Code' is also a 'Yes'.

step 4: Now, let's think about 'Experimental Setup or Environment' (category 4). The description mentions datasets and code, which are components of an experiment, but it doesn't actually describe *how* the experiment was set up, what environment was used (software versions, hardware, specific settings beyond the code itself).  It's more about the *inputs* (data) and the *implementation* (code) being available, rather than a detailed description of the experimental process or environment itself in the description provided.  So, based on *this* description, I can't say there's information specifically easing reproduction related to experimental setup *as a distinct category*. Therefore, for 'Experimental Setup or Environment', it's a 'No'.

step 5: Finally, 'Cannot extract concrete factors that Eased Reproducibility' (category 5).  Wait a minute, we've already extracted several concrete factors!  We found 'Readability of Full Text', 'Availability of Code', and 'Supporting Artifacts (datasets)'.  These are all very concrete reasons why this paper is easy to reproduce based on the description.  So, it's definitely *not* the case that we cannot extract concrete factors.  Therefore, for 'Cannot extract concrete factors that Eased Reproducibility', it's a 'No'.

step 6: Let me summarize.  Based on the description:
- 'Availability of Code': Yes (code on GitHub)
- 'Supporting Artifacts': Yes (publicly available datasets)
- 'Readability of Full Text': Yes (well-structured, easy to follow)
- 'Experimental Setup or Environment': No (no specific details mentioned in this description)
- 'Cannot extract concrete factors that Eased Reproducibility': No (we extracted several factors)

This matches the ground truth perfectly! Confidence level: 5/5.  The description is indeed packed with reasons why reproducibility is easy, hitting multiple categories.
```
</think>
<label>
[1, 1, 1, 0, 0]
</label><|im_end|>
"""

test = [a]
stepwise_label_reward_func(test)
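
The single well-formed sample above only exercises the happy path. To probe for reward hacking it may help to also score deliberately malformed completions. The sketch below is self-contained: `parse_label` is a simplified stand-in written for this example (not one of the notebook's reward functions), and the probe strings are invented.

```python
import re
from ast import literal_eval

# a <think> block followed by a <label> block; the label body is captured
PATTERN = re.compile(r"<think>.*?</think>\s*<label>(.*?)</label>", re.DOTALL)


def parse_label(completion, expected_length=5):
    """Return the label list if the completion is well formed, else None."""
    match = PATTERN.search(completion)
    if match is None:
        return None
    try:
        label = literal_eval(match.group(1).strip())
    except (ValueError, SyntaxError):
        return None
    if not isinstance(label, list) or len(label) != expected_length:
        return None
    # bool is a subclass of int, so compare exact types to reject True/False
    if not all(type(x) is int and x in (0, 1) for x in label):
        return None
    return label


probes = {
    "well_formed": "<think>ok</think>\n<label>[1, 1, 1, 0, 0]</label>",
    "missing_think": "<label>[1, 1, 1, 0, 0]</label>",
    "too_long": "<think>ok</think><label>[1, 1, 1, 0, 0, 1]</label>",
    "floats": "<think>ok</think><label>[1.0, 1.0, 1.0, 0.0, 0.0]</label>",
    "booleans": "<think>ok</think><label>[True, True, True, False, False]</label>",
    "not_a_list": "<think>ok</think><label>1 1 1 0 0</label>",
    "empty_think": "<think></think><label>[0, 0, 0, 0, 0]</label>",
}

results = {name: parse_label(text) for name, text in probes.items()}
for name, parsed in results.items():
    print(f"{name:>14}: {parsed}")
```

Two probes are worth a closer look. The `empty_think` case parses successfully, so a format-only reward cannot tell an empty reasoning trace from a real one. The `booleans` case is rejected here only because of the exact type comparison; a check based on `isinstance(x, int)` would accept it, since `isinstance(True, int)` is `True` in Python.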

In [None]:
# @title 4.7 Train $\pi_{LLMSciSci}$ using the $L_{GRPO}$ policy optimization
# from trl import GRPOConfig, GRPOTrainer
# from transformers import GenerationConfig
# from peft import LoraConfig, get_peft_model

# # LoRA config
# alpha = 4
# r = 4
# peft_config = LoraConfig(
#     lora_alpha=alpha,
#     lora_dropout=0.05,
#     r=r,
#     bias="none",
#     target_modules="all-linear",
#     task_type="CAUSAL_LM",
# )

# # args
# training_args = GRPOConfig(
#     output_dir="llmscisci-GRPO-gemma", \
#     run_name="GRPO/rn-llmscisci-GRPO-gemma", \
#     num_generations = 4,
#     learning_rate = 5e-6,
#     max_prompt_length = prompt_length,
#     max_completion_length = max_seq_length - prompt_length,
#     logging_steps=1, \
#     report_to=["wandb"], \
#     num_train_epochs=1, \
#     max_grad_norm = 20, \
#     label_names=["completion"]
# )

# # init GRPO trainer
# trainer = GRPOTrainer(
#     model = model,
#     processing_class = tokenizer,
#     reward_funcs = [
#         format_reward_func,
#         label_reward_func,
#         stepwise_label_reward_func
#     ],
#     args = training_args,
#     train_dataset = train_dataset,
# )
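
To see how the reward functions feed into GRPO, recall that the DeepSeekMath paper [3] scores each completion relative to the other completions sampled for the same prompt: the advantage is the reward minus the group mean, divided by the group standard deviation. The sketch below reproduces that normalization for one group of `num_generations = 4` completions. It is an illustration under assumptions, not the trainer's code: the use of the population standard deviation, the `eps` value, and the toy reward totals are choices made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# hypothetical summed rewards (format + label + stepwise) for four completions:
# two fully valid (1.0 + 0.5 + 1.0), one with a malformed label, one with no format
group_rewards = [2.5, 1.375, 0.0, 2.5]
advantages = group_relative_advantages(group_rewards)

# if all completions score the same, every advantage is zero
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print([round(a, 3) for a in advantages])
print(flat_advantages)
```

The flat group shows why reward functions that almost always return the same value contribute little: with no spread inside the group there is no signal to separate better completions from worse ones.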

In [None]:
# train
# trainer.train()

# save model weights
# trainer.save_model()

### 5. Inference check $\pi_{LLMSciSci}$ against $\pi_{LLM-instruct}$ on sample outputs

In [None]:
# @title 5.1 Load reference model for $\pi_{LLM-Instruct}$
# load the reference model (no policy update) for the sample inference check
# ref_model, _ = load_model(model_name, device)

In [None]:
# @title 5.2 Sample inference check on $D_{test}$
# from IPython.display import display, Latex

# # seed for reproducibility
# set_seed(2025)

# # set top_p and temperature to none
# ref_model.generation_config.temperature=None
# ref_model.generation_config.top_p=None
# trainer.model.generation_config.temperature=None
# trainer.model.generation_config.top_p=None

# # inputs, attention mask, and shape
# idx = 0
# input_encoded = tokenizer(test_dataset["prompt"][idx], padding=True, return_tensors="pt")
# input_encoded_ids = input_encoded["input_ids"].to("cuda")
# input_encoded_attn_mask = input_encoded["attention_mask"].to("cuda")
# ref_input_encoded_ids = input_encoded["input_ids"].to("cuda")
# ref_input_encoded_attn_mask = input_encoded["attention_mask"].to("cuda")
# input_shape = len(input_encoded["input_ids"][0])

# # model outputs on test prompts
# outputs = trainer.model.generate(
#     input_ids=input_encoded_ids,
#     attention_mask=input_encoded_attn_mask,
#     max_new_tokens=512,
#     do_sample=False,
#     pad_token_id=tokenizer.pad_token_id,
#     eos_token_id=tokenizer.encode("<|im_end|>")
# )

# # reference model outputs on test prompts
# ref_outputs = ref_model.generate(
#     input_ids=ref_input_encoded_ids,
#     attention_mask=ref_input_encoded_attn_mask,
#     max_new_tokens=512,
#     do_sample=False,
#     pad_token_id=tokenizer.pad_token_id,
#     eos_token_id=tokenizer.encode("<|im_end|>")
# )

# print("LLMSciSci-GRPO Policy model response")
# output = tokenizer.decode(outputs[0][input_shape:], skip_special_tokens=True)
# print(output)
# print("------------------------------------------------------")
# print("Reference model response (qwen-2.5-1.5b)")
# ref_output = tokenizer.decode(ref_outputs[0][input_shape:], skip_special_tokens=True)
# print(ref_output)
# print("------------------------------------------------------")
# print("Correct response:")
# print(test_dataset["completion"][idx])
# print("------------------------------------------------------")

In [None]:
# del model, tokenizer, trainer
# torch.cuda.empty_cache()
# gc.collect()