<a href="https://colab.research.google.com/github/amanzoni1/fine_tuning/blob/main/LLama3_RL_GRPO_Reasoning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama 3.1 Reinforcement Learning & GRPO with Reasoning

In [1]:
#@title Colab Install { display-mode: "form" }
%%capture
# Install Unsloth and vLLM (only vLLM is version-pinned)
!pip install --no-deps unsloth vllm==0.8.5.post1

# Core dependencies for LoRA, TRL, and bitsandbytes on Colab
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo

# Common NLP libraries
!pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub transformers==4.51.3

# vLLM extra requirements (skip numpy/transformers/xformers to avoid conflicts)
import requests, re
reqs = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
filtered = re.sub(rb"(transformers|numpy|xformers)[^\n]*\n", b"", reqs)
with open("vllm_requirements.txt","wb") as f:
    f.write(filtered)
!pip install -r vllm_requirements.txt

In [2]:
import torch
import numpy as np
import pandas as pd
from unsloth import FastLanguageModel
from datasets import load_dataset, Dataset
from transformers import TextStreamer
from trl import SFTTrainer, SFTConfig

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 06-05 07:56:57 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 06-05 07:56:57 [__init__.py:239] Automatically detected platform cuda.


In [3]:
model       = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"
sft_dataset = "openai/gsm8k"
rl_dataset  = "xlangai/spider"

In [4]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name            = model,
    max_seq_length        = 512,
    load_in_4bit          = True,
    fast_inference        = True,
    max_lora_rank         = 16,
)

==((====))==  Unsloth 2025.5.9: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/meta-llama-3.1-8b-bnb-4bit with actual GPU utilization = 49.53%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 512. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 1.2 GB. Also swap space = 0 GB.
INFO 06-05 07:57:35 [config.py:717] This model supports multiple tasks: {'classify', 'reward', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

INFO 06-05 07:57:41 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-05 07:57:41 [cuda.py:289] Using XFormers backend.
INFO 06-05 07:57:42 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-05 07:57:42 [model_runner.py:1108] Starting to load model unsloth/meta-llama-3.1-8b-bnb-4bit...
INFO 06-05 07:57:43 [loader.py:1187] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 06-05 07:57:43 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

INFO 06-05 07:58:41 [weight_utils.py:281] Time spent downloading weights for unsloth/meta-llama-3.1-8b-bnb-4bit: 57.630114 seconds
INFO 06-05 07:58:41 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 06-05 07:59:34 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 06-05 07:59:35 [model_runner.py:1140] Model loading took 5.4442 GiB and 112.470382 seconds
INFO 06-05 07:59:47 [worker.py:287] Memory profiling takes 10.87 seconds
INFO 06-05 07:59:47 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.50) = 7.30GiB
INFO 06-05 07:59:47 [worker.py:287] model weights take 5.44GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.59GiB; the rest of the memory reserved for KV Cache is 1.24GiB.
INFO 06-05 07:59:48 [executor_base.py:112] # cuda blocks: 632, # CPU blocks: 0
INFO 06-05 07:59:48 [executor_base.py:117] Maximum concurrency for 512 tokens per request: 19.75x
INFO 06-05 07:59:48 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out

Capturing CUDA graph shapes:   0%|          | 0/19 [00:00<?, ?it/s]

INFO 06-05 08:00:34 [model_runner.py:1592] Graph capturing finished in 46 secs, took 0.47 GiB
INFO 06-05 08:00:34 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 58.30 seconds
Unsloth: Just some info: will skip parsing ['post_feedforward_layernorm', 'q_norm', 'k_norm', 'pre_feedforward_layernorm']


tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    random_state               = 111,
    r                          = 16,
    lora_alpha                 = 32,
    bias                       = "none",
    use_gradient_checkpointing = "unsloth",
    target_modules             = ["q_proj", "k_proj", "v_proj", "o_proj",
                                  "gate_proj", "up_proj", "down_proj"],
)

Unsloth 2025.5.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [6]:
# Define special tokens and system prompt
reasoning_start = "<REASONING>"
reasoning_end   = "</REASONING>"
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = (
    "You are given a problem.\n"
    "Think over it and describe your step‐by‐step reasoning.\n"
    f"Enclose reasoning between {reasoning_start} and {reasoning_end}.\n"
    f"Finally, give your answer between {solution_start} and {solution_end}"
)

In [7]:
# Build and assign chat_template to the tokenizer

chat_template = (
    # If the very first message is a SYSTEM role, print it + <eos>:
    "{% if messages[0]['role'] == 'system' %}"
      "{{ messages[0]['content'] + eos_token }}"
      "{% set rest = messages[1:] %}"
    "{% else %}"
      # Otherwise, inject our system_prompt + <eos>:
      "{{ '{system_prompt}' + eos_token }}"
      "{% set rest = messages %}"
    "{% endif %}"

    # Now loop over the remaining messages (either user or assistant):
    "{% for m in rest %}"
      "{% if m['role'] == 'user' %}"
        "{{ m['content'] }}"
      "{% else %}"  # assistant
        "{{ m['content'] + eos_token }}"
      "{% endif %}"
    "{% endfor %}"

    # If we asked for “add_generation_prompt,” append <REASONING> to the end:
    "{% if add_generation_prompt %}"
      "{{ '{reasoning_start}' }}"
    "{% endif %}"
)

chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")

tokenizer.chat_template = chat_template

In [8]:
# Quick sanity check of the template
example_messages = [
    {"role": "user",
     "content": "Which country has the highest population density?"},
    {"role": "assistant",
     "content": (
         f"{reasoning_start}"
         "I know that country X is small in area but has a huge population, "
         "so its people per square kilometer is extremely high."
         f"{reasoning_end}"
         f"{solution_start}Monaco{solution_end}"
     )},
    {"role": "user",
     "content": "Which planet is farthest from the Sun?"},
]


print("Rendered example:\n")
print(tokenizer.apply_chat_template(example_messages, tokenize=False, add_generation_prompt = True))

Rendered example:

You are given a problem.
Think over it and describe your step‐by‐step reasoning.
Enclose reasoning between <REASONING> and </REASONING>.
Finally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>Which country has the highest population density?<REASONING>I know that country X is small in area but has a huge population, so its people per square kilometer is extremely high.</REASONING><SOLUTION>Monaco</SOLUTION><|end_of_text|>Which planet is farthest from the Sun?<REASONING>
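
The rendering rules of the template can be mirrored in plain Python, which makes the turn structure easier to see than in the Jinja string. This is only a sketch: `render_chat` is a hypothetical helper, `DEFAULT_SYSTEM` is a shortened stand-in for `system_prompt`, and the EOS string is assumed to be the `<|end_of_text|>` token seen in the rendered output above.

```python
# Assumed values; in the notebook they come from the tokenizer and earlier cells.
EOS = "<|end_of_text|>"
REASONING_START = "<REASONING>"
DEFAULT_SYSTEM = "You are given a problem."  # shortened stand-in for system_prompt


def render_chat(messages, add_generation_prompt=False):
    """Mimic the Jinja template: system + EOS, raw user turns, assistant + EOS."""
    if messages and messages[0]["role"] == "system":
        out = messages[0]["content"] + EOS
        rest = messages[1:]
    else:
        # No system message supplied, so the default one is injected.
        out = DEFAULT_SYSTEM + EOS
        rest = messages

    for m in rest:
        if m["role"] == "user":
            out += m["content"]            # user turns carry no EOS
        else:
            out += m["content"] + EOS      # assistant turns end with EOS

    if add_generation_prompt:
        out += REASONING_START             # cue the model to start reasoning
    return out


rendered = render_chat(
    [{"role": "user", "content": "2 + 2?"}],
    add_generation_prompt=True,
)
print(rendered)
```

Note that user turns are appended without any delimiter, so the only separators the model sees are the EOS tokens and the reasoning/solution tags.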


In [9]:
# Load the dataset
dataset = load_dataset(sft_dataset, "main", split="train")
dataset

README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Dataset({
    features: ['question', 'answer'],
    num_rows: 7473
})

In [10]:
print("=== Raw GSM8K columns ===")
print(dataset.column_names)
print("\n=== First raw example ===")
print(dataset[0])

=== Raw GSM8K columns ===
['question', 'answer']

=== First raw example ===
{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}


In [11]:
# Define formatting + token-count function
def format_and_count_gsm8k(example):
    question = example["question"].strip()
    reasoning = example["answer"].split("####")[0].replace("\n", " ").strip()
    final_ans = example["answer"].split("####")[1].strip()

    messages = [
        {"role": "system",    "content": system_prompt},
        {"role": "user",      "content": question},
        {"role": "assistant", "content": (
            f"{reasoning_start}{reasoning}{reasoning_end}"
            f"{solution_start}{final_ans}{solution_end}"
        )}
    ]

    enc = tokenizer.apply_chat_template(messages, tokenize=True)
    if isinstance(enc, dict):
        token_len = len(enc["input_ids"])
    else:
        token_len = len(enc)

    text_str = tokenizer.apply_chat_template(messages, tokenize=False)

    return {
        "token_len": token_len,
        "text": text_str,
        "Messages": messages
    }
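
The `####` split used by the formatting function can be checked in isolation on the first raw GSM8K example printed earlier. The sketch below uses a hypothetical helper, `split_gsm8k_answer`, and `str.partition` instead of `split` so that a missing marker raises a clear error; that error handling is an addition, not part of the notebook's function.

```python
# First raw GSM8K answer, copied from the dataset preview above.
raw_answer = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)


def split_gsm8k_answer(answer):
    """Split a GSM8K answer into (single-line reasoning, final answer)."""
    reasoning, marker, final = answer.partition("####")
    if not marker:
        raise ValueError("no '####' marker found in answer")
    # Flatten the reasoning to one line, as the notebook's formatter does.
    return reasoning.replace("\n", " ").strip(), final.strip()


reasoning, final_ans = split_gsm8k_answer(raw_answer)
print(repr(reasoning))
print(repr(final_ans))
```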

In [12]:
# Apply formatting to the dataset
dataset = dataset.map(
    format_and_count_gsm8k,
    remove_columns=dataset.column_names,
)

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

In [13]:
# Sanity check: print a few "text" examples before filtering
print("\n=== Few formatted text examples (first 3) ===")
for i in range(3):
    print(f"\n--- Example {i} ---")
    print(dataset[i]["text"])


=== Few formatted text examples (first 3) ===

--- Example 0 ---
You are given a problem.
Think over it and describe your step‐by‐step reasoning.
Enclose reasoning between <REASONING> and </REASONING>.
Finally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<REASONING>Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.</REASONING><SOLUTION>72</SOLUTION><|end_of_text|>

--- Example 1 ---
You are given a problem.
Think over it and describe your step‐by‐step reasoning.
Enclose reasoning between <REASONING> and </REASONING>.
Finally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?<REASONING>Weng earns 12/60 = $<<12/60=0.2>

In [14]:
# Token-length statistics and dataset filtered for training
lengths = np.array(dataset["token_len"])
print("\nToken-length percentiles (50/90/99):", np.percentile(lengths, [50, 90, 99]))

threshold = 200
sft_ds_filtered     = dataset.filter(lambda ex: ex["token_len"] <= threshold)
sft_ds_filtered     = sft_ds_filtered.select(range(100))
sft_ds_filtered_out = dataset.filter(lambda ex: ex["token_len"] >  threshold)

print(f"\nRemaining for training (≤{threshold} tokens): {len(sft_ds_filtered)} / {len(dataset)}")


Token-length percentiles (50/90/99): [210. 299. 390.]


Filter:   0%|          | 0/7473 [00:00<?, ? examples/s]

Filter:   0%|          | 0/7473 [00:00<?, ? examples/s]


Remaining for training (≤200 tokens): 100 / 7473


In [15]:
# Drop extra columns so dataset contains only "text"
sft_dataset = sft_ds_filtered.remove_columns(["token_len", "Messages"])
print("\n=== Final dataset ===")
print(sft_dataset)


=== Final dataset ===
Dataset({
    features: ['text'],
    num_rows: 100
})


In [16]:
# Define training arguments
sft_config = SFTConfig(
    seed                        = 111,
    do_train                    = True,
    num_train_epochs            = 2,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 2,
    learning_rate               = 2e-4,
    lr_scheduler_type           = "linear",
    warmup_ratio                = 0.03,
    weight_decay                = 0.01,
    logging_strategy            = "steps",
    logging_steps               = 5,
    report_to                   = "none",
)

In [17]:
# Instantiate SFTTrainer
trainer = SFTTrainer(
    model         = model,
    args          = sft_config,
    train_dataset = sft_dataset,
    tokenizer     = tokenizer,
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/100 [00:00<?, ? examples/s]

In [18]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100 | Num Epochs = 2 | Total steps = 50
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
5,1.6672
10,0.8701
15,0.7716
20,0.6991
25,0.7737
30,0.6378
35,0.5837
40,0.5606
45,0.5789
50,0.519


TrainOutput(global_step=50, training_loss=0.7661662244796753, metrics={'train_runtime': 162.7359, 'train_samples_per_second': 1.229, 'train_steps_per_second': 0.307, 'total_flos': 893591679762432.0, 'train_loss': 0.7661662244796753})
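
The figures in the training banner can be reproduced by hand. A LoRA adapter of rank `r` on a `d_in × d_out` linear layer adds `r * (d_in + d_out)` trainable weights. The sketch below assumes the standard Llama-3.1-8B shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers), which the notebook itself does not state, and also recomputes the step count from the batch settings.

```python
import math

# Assumed Llama-3.1-8B shapes (not printed anywhere in the notebook).
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 4096, 1024, 14336, 32
RANK = 16  # r from get_peft_model

# (d_in, d_out) for every module listed in target_modules.
projections = {
    "q_proj":    (HIDDEN, HIDDEN),
    "k_proj":    (HIDDEN, KV_DIM),
    "v_proj":    (HIDDEN, KV_DIM),
    "o_proj":    (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj":   (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in projections.values())
total = per_layer * LAYERS
print(total)  # 41943040, matching "Trainable parameters" in the banner

# Optimizer steps: 100 examples, batch 2 x accumulation 2, 2 epochs.
steps_per_epoch = math.ceil(100 / (2 * 2))
total_steps = steps_per_epoch * 2
print(total_steps)  # 50, matching "Total steps"
```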

In [19]:
# Pick one example’s first two “system + user” messages
prompt_messages = sft_ds_filtered_out[0]["Messages"][:2]

# Render into a single string and append <REASONING> for generation:
text = tokenizer.apply_chat_template(
    prompt_messages,
    tokenize=False,
    add_generation_prompt=True,  # append the final <REASONING>
)

In [20]:
# Stream the model’s generations (CoT + solution)
streamer = TextStreamer(tokenizer, skip_prompt=False)

_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    do_sample      = False,  # greedy decoding; temperature is ignored when not sampling
    max_new_tokens = 512,
    streamer       = streamer,
)

<|begin_of_text|>You are given a problem.
Think over it and describe your step‐by‐step reasoning.
Enclose reasoning between <REASONING> and </REASONING>.
Finally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read?<REASONING>Yesterday, Julie read 12*2=<<12*2=24>>24 pages. So far, she has read 12+24=36 pages. She has 120-36=<<120-36=84>>84 pages left. Half of 84 is 84/2=<<84/2=42>>42 pages. So, Julie should read 42 pages tomorrow.</REASONING><SOLUTION>42</SOLUTION><|end_of_text|>


In [21]:
del dataset
del sft_ds_filtered
del sft_ds_filtered_out
del sft_dataset

import gc
gc.collect()
torch.cuda.empty_cache()

In [22]:
dataset = load_dataset(rl_dataset, split = "train")
dataset

README.md:   0%|          | 0.00/5.51k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/831k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/126k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1034 [00:00<?, ? examples/s]

Dataset({
    features: ['db_id', 'query', 'question', 'query_toks', 'query_toks_no_value', 'question_toks'],
    num_rows: 7000
})

In [23]:
# Peek at three examples to make sure "question" and "query" look right.
for i in range(3):
    print(f"\n─ Example {i} ─")
    print("Question :", dataset[i]["question"])
    print("Gold SQL :", dataset[i]["query"])


─ Example 0 ─
Question : How many heads of the departments are older than 56 ?
Gold SQL : SELECT count(*) FROM head WHERE age  >  56

─ Example 1 ─
Question : List the name, born state and age of the heads of departments ordered by age.
Gold SQL : SELECT name ,  born_state ,  age FROM head ORDER BY age

─ Example 2 ─
Question : List the creation year, name and budget of each department.
Gold SQL : SELECT creation ,  name ,  budget_in_billions FROM department


In [24]:
# Convert each row into a simple {"prompt": ..., "gold_sql": ...} pair
def to_grpo_input(ex):
    return {
        "prompt": ex["question"].strip(),       # the NL question
        "gold_sql": ex["query"].strip()         # the “correct” SQL
    }

dataset = dataset.map(to_grpo_input, remove_columns=dataset.column_names)

print("\nColumns now:", dataset.column_names)
print("Example 0:", dataset[0])

Map:   0%|          | 0/7000 [00:00<?, ? examples/s]


Columns now: ['prompt', 'gold_sql']
Example 0: {'prompt': 'How many heads of the departments are older than 56 ?', 'gold_sql': 'SELECT count(*) FROM head WHERE age  >  56'}


In [25]:
# Render each “prompt” through your chat template → produce a new "text" field
def add_chat_text(ex):
    messages = [
        {"role": "system",    "content": system_prompt},
        {"role": "user",      "content": ex["prompt"]},
    ]
    # Render into a single string + append "<REASONING>"
    text_str = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return {"text": text_str, "gold_sql": ex["gold_sql"]}

rl_dataset = dataset.map(add_chat_text, remove_columns=["prompt"])

print("\nColumns now:", rl_dataset.column_names)

Map:   0%|          | 0/7000 [00:00<?, ? examples/s]


Columns now: ['gold_sql', 'text']


In [26]:
# Final sanity check
print("\n=== Final GRPO-ready dataset ===")
print(rl_dataset)

# Inspect three rows to confirm everything:
for i in range(3):
    print(f"\n── GRPO Entry {i} ──")
    print("  text     :", repr(rl_dataset[i]["text"]))
    print("  gold_sql :", rl_dataset[i]["gold_sql"])


=== Final GRPO-ready dataset ===
Dataset({
    features: ['gold_sql', 'text'],
    num_rows: 7000
})

── GRPO Entry 0 ──
  text     : 'You are given a problem.\nThink over it and describe your step‐by‐step reasoning.\nEnclose reasoning between <REASONING> and </REASONING>.\nFinally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>How many heads of the departments are older than 56 ?<REASONING>'
  gold_sql : SELECT count(*) FROM head WHERE age  >  56

── GRPO Entry 1 ──
  text     : 'You are given a problem.\nThink over it and describe your step‐by‐step reasoning.\nEnclose reasoning between <REASONING> and </REASONING>.\nFinally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>List the name, born state and age of the heads of departments ordered by age.<REASONING>'
  gold_sql : SELECT name ,  born_state ,  age FROM head ORDER BY age

── GRPO Entry 2 ──
  text     : 'You are given a problem.\nThink over it and describe your step‐by‐step reasoning.\nEn

In [None]:
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from vllm import SamplingParams
import sqlite3, re


In [None]:
# 1) Build an “</SOLUTION> + optional EOS/whitespace” pattern
solution_end_regex = (
    r"</SOLUTION>"
  + r"[\s]*"
  + "(?:" + re.escape(tokenizer.eos_token) + ")?"
)

# 2) Build a single regex that matches “</REASONING> … <SOLUTION> … </SOLUTION>” at the end
match_full_format = re.compile(
    rf"{reasoning_end}.*?"
    rf"{solution_start}(.+?){solution_end_regex}"
    rf"[\s]*$",
    flags = re.MULTILINE | re.DOTALL
)

def reward_format(completions, **kwargs):
    """
    Returns a list of floats, one per generated completion.
    We award:
      • +3.0 if '</REASONING>…<SOLUTION>…</SOLUTION>' (optionally followed by
        whitespace and EOS) matches at the end of a line. Because the regex uses
        re.MULTILINE, this is usually, but not necessarily, the end of the string.
      • +0.5 if it has exactly one </REASONING>, exactly one <SOLUTION>, and exactly one
        </SOLUTION>, but not in the expected order/position.
      • -1.0 otherwise (missing tags or too many of them).
    """
    scores = []
    for c in completions:
        resp = c[0]["content"]

        # Case A: Perfect “</REASONING> … <SOLUTION> … </SOLUTION>” at string end
        if match_full_format.search(resp):
            scores.append(3.0)
            continue

        # Case B: each tag appears exactly once, but not in the ideal order/position
        cnt_rend = resp.count(reasoning_end)
        cnt_sst  = resp.count(solution_start)
        cnt_send = resp.count(solution_end)
        if cnt_rend == 1 and cnt_sst == 1 and cnt_send == 1:
            scores.append(0.5)
        else:
            scores.append(-1.0)
    return scores
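
A quick way to sanity-check the three reward tiers is to run the same logic on hand-written completions. The sketch below is self-contained, so it hard-codes the tags and assumes the EOS token is `<|end_of_text|>` (in the notebook it comes from `tokenizer.eos_token`); the sample completions are made up for illustration.

```python
import re

# Assumed values; the notebook takes these from earlier cells and the tokenizer.
REASONING_END = "</REASONING>"
SOLUTION_START = "<SOLUTION>"
SOLUTION_END = "</SOLUTION>"
EOS = "<|end_of_text|>"

full_format = re.compile(
    rf"{REASONING_END}.*?{SOLUTION_START}(.+?){SOLUTION_END}"
    rf"[\s]*(?:{re.escape(EOS)})?[\s]*$",
    flags=re.MULTILINE | re.DOTALL,
)


def score_format(text):
    """Same three tiers as reward_format, applied to a single string."""
    if full_format.search(text):
        return 3.0
    tags = (REASONING_END, SOLUTION_START, SOLUTION_END)
    if all(text.count(tag) == 1 for tag in tags):
        return 0.5
    return -1.0


samples = {
    "well_formed": "count rows</REASONING><SOLUTION>SELECT 1</SOLUTION>" + EOS,
    "wrong_order": "<SOLUTION>SELECT 1</SOLUTION> count rows</REASONING>",
    "no_tags": "SELECT 1",
}
scores = {name: score_format(text) for name, text in samples.items()}
print(scores)
```

The `wrong_order` sample shows why the middle tier exists: all tags are present once, so the completion earns partial credit even though the solution cannot be extracted by the full pattern.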

In [None]:
def run_sql_on_db(db_id: str, sql: str):
    """Return a list of row tuples if the SQL runs; otherwise re-raise the error."""
    # Expects Spider's SQLite files to already exist under ./spider_databases/.
    conn = sqlite3.connect(f"./spider_databases/{db_id}.sqlite")
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Close the connection on both the success and the error path.
        conn.close()

def reward_sql_correctness(completions, gold_sql, db_id, **kwargs):
    """
    Returns one score per completion. GRPOTrainer passes the extra dataset columns
    to reward functions as keyword arguments (lists aligned with `completions`),
    so the training dataset must keep both a "gold_sql" and a "db_id" column.
      1) 0.0 if no valid <SOLUTION>…</SOLUTION> extraction or SQL execution fails.
      2) Otherwise the sum of:
         • base_points = 5.0 if gen_res == gold_res, 0.0 otherwise.
         • exact_str_bonus = 2.0 if stripped SQL matches exactly.
         • efficiency = +0.5 if generated SQL has fewer whitespace-separated
           tokens than gold; -0.5 if it has more than 20% more.
    """
    scores = []
    responses = [c[0]["content"] for c in completions]
    for response, gold, db in zip(responses, gold_sql, db_id):
        # 1) Extract what’s inside <SOLUTION>…</SOLUTION>
        m = match_full_format.search(response)
        if m is None:
            scores.append(0.0)
            continue
        gen_sql = m.group(1).strip()

        # 2) Execute both on the same Spider DB
        try:
            gold_res = run_sql_on_db(db, gold)
            gen_res  = run_sql_on_db(db, gen_sql)
        except Exception:
            # Invalid syntax or runtime error → zero reward
            scores.append(0.0)
            continue

        # 3) Compare results (list equality, so row order matters)
        base_points = 5.0 if (gold_res == gen_res) else 0.0

        # 4) Exact-string bonus
        exact_str_bonus = 2.0 if (gen_sql == gold.strip()) else 0.0

        # 5) Efficiency: compare lengths in whitespace-separated tokens
        len_gen  = len(gen_sql.split())  # or len(gen_sql) if you prefer char count
        len_gold = len(gold.split())
        if len_gen < len_gold:
            efficiency = 0.5
        elif len_gen > 1.2 * len_gold:
            efficiency = -0.5
        else:
            efficiency = 0.0

        total = base_points + exact_str_bonus + efficiency
        scores.append(total)

    return scores
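
The execution-match idea behind the base reward can be tried without the Spider files by using an in-memory SQLite database. This is a minimal sketch: the `head` table below is a hypothetical toy schema with made-up rows, not Spider's real `department_management` database, and `rows_for` is an illustrative helper.

```python
import sqlite3


def rows_for(conn, sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Toy stand-in for a Spider database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE head (name TEXT, age INTEGER);
    INSERT INTO head VALUES ('A', 60), ('B', 40), ('C', 70);
    """
)

gold      = "SELECT count(*) FROM head WHERE age  >  56"
candidate = "SELECT COUNT(name) FROM head WHERE age > 56"   # different text, same result
wrong     = "SELECT count(*) FROM head"                     # runs, but wrong result

matches_candidate = rows_for(conn, gold) == rows_for(conn, candidate)
matches_wrong     = rows_for(conn, gold) == rows_for(conn, wrong)

# Malformed SQL raises, which the reward function maps to a score of 0.0.
try:
    rows_for(conn, "SELEC oops")
    invalid_raised = False
except sqlite3.Error:
    invalid_raised = True

conn.close()
print(matches_candidate, matches_wrong, invalid_raised)
```

Comparing result sets rather than SQL strings is what lets a differently worded but equivalent query earn the base points; the exact-string bonus then rewards matching the gold query verbatim.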