## HW2: Deep Personalization

In the previous homework, you experimented with alignment methods including few-shot prompting, instruction tuning, RLHF, and DPO. Those experiments used general chat data to align a pre-trained model into a generally helpful assistant.

However, helpfulness is often subjective. For example, while RLHF models tend to produce longer outputs, you may prefer short and concise responses when generating emails. In this assignment, you'll explore machine learning techniques to personalize model behavior to individual preferences.

### Overview of LLM Personalization

LLM Personalization encompasses the methodical application of machine learning techniques to customize large language model behavior based on user-specific data.

<img src="assets/overview.jpg" alt="Overview" width="800">

A classic methodology is supervised fine-tuning (SFT) from CS224N (remember the sonnet generation assignment? 🙂). This method was widely used, especially before ChatGPT.

<img src="assets/sft.jpg" alt="SFT" width="800">

The problem with SFT is that it requires the user to label desired outputs. This process is not only time-consuming but often infeasible for non-technical users. In this assignment, you'll explore alternative personalization methods that lower this human-effort barrier. Through hands-on experimentation, you'll analyze the pros and cons of different approaches.

You will use **Tinker**, the cutting-edge training API developed by Thinking Machines Lab, for this assignment. Besides following this handout, **we highly recommend checking out [Tinker Cookbook](https://tinker-docs.thinkingmachines.ai/) to understand Tinker's abstractions**.

**Note that Tinker handles the heavy computation for forward and backward passes. As a result, the remaining code can run on your laptop even during model training.**

You also need to use [wandb](https://wandb.ai/site) to monitor the training process. Create an account if you don't have one yet, and make sure to add your API key to the `.env` file.

### Setup & Data Collection

1. **Install Tinker**: Run the following cells to install the Tinker API.

2. **Collect your personalized data**: Make a copy of `data.example.json` and rename this file to `data.json`. Add 10 data points in the format `{"input": "...", "output": "..."}` for email generation. The outputs should come from your own emails to better reflect your personal writing style.
  
    **Recall the lecture on "Data, Data, and Data": the same logic applies here. Make sure the emails you choose are representative of your style and cover diverse topics. Make sure each "input" instruction provides all the necessary context (e.g., names, intent), just like the request you would write when asking an LLM to draft the email for you.**
   
   - **Bonus (10 points):** While we expect most students to work on the email generation task, we encourage you to explore creative data sources for personalization. If you choose to collect `{"input": "...", "output": "..."}` pairs from a different domain (not email generation), you may need to modify some parts of the provided code below.
   
   - **If you're attempting the bonus**, describe your data source here:
     
     **TODO: Add your answer**

Hint: You may want to use a programmatic approach to convert inputs and outputs into the required JSON format, which will help you avoid the hassle of dealing with \n escape characters.
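
A minimal sketch of such a programmatic approach, assuming your instructions and emails are available as plain Python strings. The helper name `save_demonstrations` and the example content are hypothetical; `json.dump` takes care of escaping newlines and quotes.

```python
import json


def save_demonstrations(pairs, path="data.json"):
    """Write (input, output) string pairs as a list of {"input", "output"} records."""
    records = [{"input": inp, "output": out} for inp, out in pairs]
    with open(path, "w", encoding="utf-8") as f:
        # json.dump escapes newlines and quotes, so no manual "\n" handling is needed.
        json.dump(records, f, indent=2, ensure_ascii=False)
    return records


# Hypothetical example; replace with your own instructions and emails.
example_pairs = [
    (
        "Write an email to Alex asking to move our Friday meeting to Monday.",
        """Hi Alex,

Could we move Friday's meeting to Monday? The same time works for me.

Thanks,
Sam""",
    ),
]
```

Calling `save_demonstrations(example_pairs)` would overwrite `data.json` in the current directory, so pass a different `path` while experimenting.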

In [None]:
%pip install wandb
%pip install tinker
%pip install git+https://github.com/thinking-machines-lab/tinker-cookbook.git

In [26]:
import json
import re
import os
import tinker
from dotenv import load_dotenv

In [None]:
load_dotenv()
service_client = tinker.ServiceClient()
print("Available models:")
for item in service_client.get_server_capabilities().supported_models:
    print("- " + item.model_name)

Available models:
- meta-llama/Llama-3.1-70B
- meta-llama/Llama-3.1-8B
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-3.2-1B
- meta-llama/Llama-3.2-3B
- meta-llama/Llama-3.3-70B-Instruct
- Qwen/Qwen3-235B-A22B-Instruct-2507
- Qwen/Qwen3-30B-A3B
- Qwen/Qwen3-30B-A3B-Base
- Qwen/Qwen3-30B-A3B-Instruct-2507
- Qwen/Qwen3-32B
- Qwen/Qwen3-4B-Instruct-2507
- Qwen/Qwen3-8B
- Qwen/Qwen3-8B-Base


In [None]:
demonstrations = json.load(open("data.json"))
print(len(demonstrations), "demonstrations loaded.")

5 demonstrations loaded.


### Get Baseline Results (10 points total)

**Implement inference code (10 points)**

In the following experiments, we will use `Qwen/Qwen3-4B-Instruct-2507` as the baseline policy. After finishing the assignment, we highly recommend trying out larger models available on Tinker.

Use the inference code to obtain baseline results by prompting `Qwen/Qwen3-4B-Instruct-2507` with your inputs directly. `Qwen/Qwen3-4B-Instruct-2507` is a model that has been aligned with general human feedback. Examine its outputs to see whether they satisfy your personal preferences.

In [None]:
from tinker import types
from tinker_cookbook import renderers
from tinker_cookbook.model_info import get_recommended_renderer_name
from tinker_cookbook.tokenizer_utils import get_tokenizer


class TinkerSampler():
    """A simple wrapper around Tinker ServiceClient to do sampling."""
    def __init__(
        self,
        model_name: str,
        model_path: str | None = None,  # tinker://..., obtained from Tinker training job
        temperature: float = 0.9,
        max_tokens=1024,
        top_p=1,
        top_k=-1,  # -1 means no limit
    ):
        tokenizer = get_tokenizer(model_name)
        renderer_name = get_recommended_renderer_name(model_name)
        # Read https://tinker-docs.thinkingmachines.ai/rendering to understand what renderer is
        self.renderer = renderers.get_renderer(name=renderer_name, tokenizer=tokenizer)
        self.sampling_params = types.SamplingParams(
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=self.renderer.get_stop_sequences(),
        )
        self.sampling_client = service_client.create_sampling_client(
            model_path=model_path,
            base_model=model_name,
        )
        
    async def generate(self, messages: list[renderers.Message]) -> renderers.Message:
        # TODO: add your code here (10 points)
        pass

In [None]:
baseline_sampler = TinkerSampler(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
)

baseline_results = []
for i, item in enumerate(demonstrations):
    input_text, expected_output = item["input"], item["output"]
    print(f"Sampling {i+1}/{len(demonstrations)}")
    messages = [renderers.Message(role="user", content=input_text)]
    output = await baseline_sampler.generate(messages)
    baseline_results.append({"input": input_text, "expected_output": expected_output, "output": output["content"]})

print("Input: ", baseline_results[0]["input"])
print(("=" * 50))
print("Expected Output: ", baseline_results[0]["expected_output"])
print(("=" * 50))
print("Output: ", baseline_results[0]["output"])

# The cell only prints the first example. Read through all baseline results in results/baseline_results.json
os.makedirs("results", exist_ok=True)
with open("results/baseline_results.json", "w") as f:
    json.dump(baseline_results, f, indent=2)

### Method 1: Prompt Engineering (10 points total)

<img src="assets/prompting.jpg" alt="Prompting" width="800">

Prompt engineering requires no model training. Instead, you craft instructions that guide the model to produce outputs matching your preferences by describing your desired behavior or providing in-context examples.

If you need to refresh your understanding of prompt engineering techniques (such as few-shot examples, chain-of-thought prompting, etc.), we recommend revisiting HW1.

**Engineer a system prompt (5 points)**

Craft a system prompt that guides the model to generate emails in your personal style (or for your own selected task). Experiment with different prompt strategies to see which produces the most personalized outputs.
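
One possible prompt strategy is to combine a written style description with a few in-context examples. The sketch below assumes demonstrations in the `{"input": ..., "output": ...}` format from `data.json`; the function name `build_few_shot_system_prompt` and the prompt wording are illustrative, not a prescribed template.

```python
def build_few_shot_system_prompt(style_description, demonstrations, num_examples=2):
    """Combine a style description with the first few demonstrations as in-context examples."""
    parts = [style_description.strip(), "", "Here are examples of emails I wrote:"]
    for i, demo in enumerate(demonstrations[:num_examples], start=1):
        parts.append(
            f"\nExample {i}\nRequest: {demo['input']}\nEmail:\n{demo['output']}"
        )
    return "\n".join(parts)
```

If you evaluate on the same demonstrations that appear in the prompt, the model can simply copy them, so consider holding out the examples you evaluate on.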

In [None]:
system_prompt = """
TODO: add your system prompt here
"""

# Explicitly require the model to only output the answer without any extra text
system_prompt += "\n\nMake sure to follow the instructions carefully and do not output anything else (such as \"Sure! Here's ...\", \"If you want ...\")."

prompt_engineering_results = []
for i, item in enumerate(demonstrations):
    input_text, expected_output = item["input"], item["output"]
    print(f"Sampling {i+1}/{len(demonstrations)}")
    messages = [
        renderers.Message(role="system", content=system_prompt),
        renderers.Message(role="user", content=input_text),
    ]
    output = await baseline_sampler.generate(messages)
    prompt_engineering_results.append({"input": input_text, "expected_output": expected_output, "output": output["content"]})

print("Input: ", prompt_engineering_results[0]["input"])
print(("=" * 50))
print("Expected Output: ", prompt_engineering_results[0]["expected_output"])
print(("=" * 50))
print("Output: ", prompt_engineering_results[0]["output"])

# The cell only prints the first example. Read through all prompt engineering results in results/prompt_engineering_results.json
os.makedirs("results", exist_ok=True)
with open("results/prompt_engineering_results.json", "w") as f:
    json.dump(prompt_engineering_results, f, indent=2)

**Analyze the Pros & Cons of Personalization with Prompting. (5 points)**

Give your answer in writeup.md

### Method 2: SFT with Synthetic Data (20 points total)

<img src="assets/sft_with_synthetic_data.jpg" alt="SFT with Synthetic Data" width="800">

We can leverage more powerful models to synthesize training data for smaller models. This approach allows us to transfer the capabilities of large models into smaller, more efficient models without requiring manually labeled data. Larger models also tend to follow your engineered prompt more faithfully, so they can generate high-quality outputs that serve as training targets for personalizing smaller models.

**Step 1 - Synthesize inputs (5 points):** Use the provided code snippet to synthesize 100 input prompts similar to those in your collected data. Carefully review the quality of the generated prompts and adjust the synthesis parameters as needed before proceeding.

*Note: If you're using a data source different from email generation (bonus track), you'll need to modify the code snippet accordingly.*

**Step 2 - Synthesize outputs:** Now use the system prompt you engineered in Method 1 and a large LLM (e.g., `Qwen/Qwen3-235B-A22B-Instruct-2507`) to generate synthetic outputs for these input prompts. If the synthetic output quality is inadequate, consider implementing advanced techniques such as [chain-of-thought prompting](https://www.promptingguide.ai/techniques/cot), [self-critique](https://arxiv.org/abs/2305.11738), or other approaches that allocate more test-time compute to improve quality.

Before proceeding, carefully review several samples to check data quality. Address any systematic issues you identify.

**Step 3 - Train via SFT (15 points):** Complete `sft.py` to fine-tune `Qwen/Qwen3-4B-Instruct-2507` using supervised fine-tuning on your synthesized dataset.

**Step 4 - Evaluate the checkpoint:** You may be surprised by how quickly training completes! Note that the model you trained is significantly smaller than the model used for data synthesis. How well can this smaller model generate personalized outputs? Use your inference code to sample outputs from the SFT checkpoint and evaluate its performance.
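
For the data-quality review in Step 2, a small helper can count how often common problems occur before you read samples by hand. This is a rough sketch: the patterns are hypothetical and should be adapted to the issues you actually observe. The `ERROR:` prefix matches the one used by the output-synthesis cell in this notebook.

```python
import re

# Hypothetical pattern for chatty openings such as "Sure! Here's ...".
PREAMBLE_PATTERN = re.compile(
    r"^(sure|certainly|of course|here's|here is)\b", re.IGNORECASE
)


def flag_issues(output):
    """Return a list of issue labels for one synthetic output."""
    text = output.strip()
    issues = []
    if not text:
        issues.append("empty")
    if text.startswith("ERROR:"):
        issues.append("generation_error")
    if PREAMBLE_PATTERN.match(text):
        issues.append("chatty_preamble")
    return issues


def summarize_issues(pairs):
    """Count issue labels over a list of (prompt, output) pairs."""
    counts = {}
    for _, output in pairs:
        for issue in flag_issues(output):
            counts[issue] = counts.get(issue, 0) + 1
    return counts
```

A summary like this only surfaces patterns you already anticipated, so it complements reading the samples rather than replacing it.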

In [None]:
# Step 1 (Synthesize inputs)
# You may need to change SEEDS and SYNTHESIZE_INPUT_PROMPT if you choose your own task rather than email writing.

SEEDS = [
    "research project communication",
    "turn down request",
    "open source outreach",
    "job & career context",
    "social events",
    "time-sensitive crisis communications",
    "reimbursement request",
    "communication with health care providers (e.g., dentist, insurance companies, etc.)",
    "legal & compliance",
    "cold emailing"
]

SYNTHESIZE_INPUT_PROMPT = """\
Generate 10 new email writing instruction prompts based on the provided examples and the given seed. Give your answer in JSON format as follows. Do not output any other text.

Output format:
```json
{
    "prompt_1": "Prompt 1",
    "prompt_2": "Prompt 2",
    "prompt_3": "Prompt 3",
    "prompt_4": "Prompt 4",
    "prompt_5": "Prompt 5",
    "prompt_6": "Prompt 6",
    "prompt_7": "Prompt 7",
    "prompt_8": "Prompt 8",
    "prompt_9": "Prompt 9",
    "prompt_10": "Prompt 10"
}
```

----
Examples:

<examples>

Seed: <seed>

Output:
"""
MAX_RETRIES = 3

data_synthesis_sampler = TinkerSampler(
    model_name="Qwen/Qwen3-235B-A22B-Instruct-2507",  # Use a stronger model for data synthesis
    temperature=1.0,
    max_tokens=2048,
)

def collect_synthetic_input(output_str: str):
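    # str.strip(chars) removes a set of characters, not a substring, so use replace() to drop the token.
    s = output_str.split("Output:")[-1].replace("<|endoftext|>", "").strip()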
    regex=r"```(?:[a-zA-Z0-9_+-]*\n)?([\s\S]*?)```"
    result = re.search(regex, s)
    if result:
        s = result.group(1).strip()
    else:
        raise ValueError("No JSON found in the output")
    result = json.loads(s)
    return [result[f"prompt_{i}"] for i in range(1, 11)]

examples = "\n".join([
    f"Input: {d['input']}\nOutput: {d['output']}" for d in demonstrations[:3]
])  # Use first 3 demonstrations as examples

synthetic_inputs = []
for seed in SEEDS:
    print(f"Collecting synthetic inputs for seed: {seed}")
    for _ in range(MAX_RETRIES):
        messages = [renderers.Message(
            role="user",
            content=SYNTHESIZE_INPUT_PROMPT.replace("<examples>", examples).replace("<seed>", seed)
        )]
        output = await data_synthesis_sampler.generate(messages)
        try:
            synthetic_inputs.extend(collect_synthetic_input(str(output["content"])))
            break
        except Exception as e:
            print(f"Error collecting synthetic input: {e}")
            continue

print(f"Collected {len(synthetic_inputs)} synthetic inputs")
print("Examples:")
print(synthetic_inputs[0])
print(synthetic_inputs[-1])

In [None]:
# Step 2 (Synthesize outputs)

from concurrent.futures import ThreadPoolExecutor
import asyncio

MAX_CONCURRENT_REQUESTS = 30

async def process_all_prompts_threadpool(synthetic_inputs):
    def sync_process_prompt(prompt):
        try:
            messages = [
                renderers.Message(role="system", content=system_prompt),  # Add the system prompt you have engineered that guides the model to output in the desired style
                renderers.Message(role="user", content=prompt)
            ]
            output = asyncio.run(data_synthesis_sampler.generate(messages))
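            # replace() removes the end-of-text token; strip("<|endoftext|>") would also eat trailing letters such as "t" or "d".
            return (prompt, output["content"].replace("<|endoftext|>", "").strip())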
        except Exception as e:
            return (prompt, f"ERROR: {str(e)}")
    
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_REQUESTS) as executor:
        futures = [executor.submit(sync_process_prompt, prompt) for prompt in synthetic_inputs]
        
        results = []
        for future in futures:
            results.append(future.result())
    
    return results
    
synthetic_input_output_pairs = await process_all_prompts_threadpool(synthetic_inputs)

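# Failed requests are returned with an "ERROR: " prefix, so exclude them from the success count.
num_successful = sum(1 for _, out in synthetic_input_output_pairs if not out.startswith("ERROR: "))
print(f"\nTotal successful pairs: {num_successful}/{len(synthetic_input_output_pairs)}")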
print("Examples:")
print(f"Prompt: {synthetic_input_output_pairs[0][0]}")
print(f"Output: {synthetic_input_output_pairs[0][1]}\n\n")
print(f"Prompt: {synthetic_input_output_pairs[-1][0]}")
print(f"Output: {synthetic_input_output_pairs[-1][1]}")

In [None]:

# Fix systematic issues in the synthetic data.
# For example, if you find the model often outputs "Sure! Here's ..." at the beginning of the output, you can add code to remove that.
# Or if you find the model forgets to include "Best regards, [Your Name]" at the end of the email, you can add code to append that.
# Overall, it's a good practice to go through the data and improve its quality as you can.
def fix_output(output: str):
    # TODO: add your code here if needed
    return output

synthetic_input_output_pairs = [(prompt, fix_output(output)) for prompt, output in synthetic_input_output_pairs]

print("Examples:")
print(f"Prompt: {synthetic_input_output_pairs[0][0]}")
print(f"Output: {synthetic_input_output_pairs[0][1]}\n\n")
print(f"Prompt: {synthetic_input_output_pairs[-1][0]}")
print(f"Output: {synthetic_input_output_pairs[-1][1]}")

In [35]:
synthetic_data_path = "results/synthetic_personalized_data.jsonl"
with open(synthetic_data_path, "w") as f:
    for prompt, output in synthetic_input_output_pairs:
        messages = {
            "messages": [
                {
                    "role": "user",
                    "content": prompt
                },
                {
                    "role": "assistant",
                    "content": output
                }
            ]
        }
        f.write(json.dumps(messages) + "\n")

print(f"Synthetic data saved to {synthetic_data_path}")

Synthetic data saved to results/synthetic_personalized_data.jsonl


**Step 3: SFT**

Complete sft.py and train "Qwen/Qwen3-4B-Instruct-2507" with the synthetic data before you proceed to the next cell.

Under the root directory, launch the training with `python -m scripts.sft {other arguments}`.

Training takes only a few minutes because we synthesize a small amount of training data, but you should already be able to see the model's behavior change!

**Add the wandb link for your run in writeup.md (15 points).**

In [None]:
# Step 4: Launch the checkpoint for sampling
# After running SFT with scripts/sft.py, we will see output like this:
# tinker_cookbook.checkpoint_utils:75 [INFO] Saved checkpoints: {'state_path': 'tinker://61ac731e-53e2-43de-b76f-ab1fa1c6b0cc/weights/final', 'sampler_path': 'tinker://61ac731e-53e2-43de-b76f-ab1fa1c6b0cc/sampler_weights/final'}
# The link after "sampler_path" is the model_path we will use below.

sft_model_name = "Qwen/Qwen3-4B-Instruct-2507"
sft_model_path = "tinker://"  # TODO: add your model path here

sft_model_sampler = TinkerSampler(
    model_name=sft_model_name,
    model_path=sft_model_path,
    temperature=1.0,
    max_tokens=2048,
)

sft_model_outputs = []
for i, item in enumerate(demonstrations):
    input_text, expected_output = item["input"], item["output"]
    print(f"Sampling {i+1}/{len(demonstrations)}")
    messages = [
        renderers.Message(role="user", content=input_text),
    ]
    output = await sft_model_sampler.generate(messages)
    sft_model_outputs.append({"input": input_text, "expected_output": expected_output, "output": output["content"]})

print("Input: ", sft_model_outputs[0]["input"])
print(("=" * 50))
print("Expected Output: ", sft_model_outputs[0]["expected_output"])
print(("=" * 50))
print("Output: ", sft_model_outputs[0]["output"])

os.makedirs("results", exist_ok=True)
with open("results/sft_model_outputs.json", "w") as f:
    json.dump(sft_model_outputs, f, indent=2)

**Analyze the Pros & Cons of Personalization via SFT with Synthetic Data. (5 points)**

Add your answer in writeup.md

### Method 3: Reinforcement Learning (60 points total)

<img src="assets/rl.jpg" alt="RL" width="800">

In HW1, you experimented with one approach to RL-based personalization by labeling your own preference pairs and training the model with DPO. Here, we introduce **RLAIF (Reinforcement Learning from AI Feedback)**, a method that replaces human preference labels with AI-generated feedback. Instead of manually comparing outputs, we train a reward model to automatically evaluate which outputs better match our criteria. This dramatically reduces human labeling effort while still enabling preference-based learning.

#### Step 1: Create a Reward Function using LLM-as-a-Judge

For subjective tasks like email generation, a typical approach is to use an LLM-as-a-judge as the reward function. We will use the [pairwise preference collection](https://huggingface.co/datasets/prometheus-eval/Preference-Collection) from [Prometheus Eval](https://github.com/prometheus-eval/prometheus-eval). Unlike other pairwise preference datasets, Prometheus Eval includes explicit rubrics that ground the preference judgments. This is particularly suitable for personalization tasks, since personalized preferences may differ from general preferences, and we can specify our personalized requirements directly in the rubric.

**Step 1.1 - Train the reward model (10 points):** We have defined preference data types in `rubric_preference_types.py` (**you DON'T need to change it**). Train the reward model based on `Qwen/Qwen3-30B-A3B-Instruct-2507` using `train_rubric_rm.py` by running `python -m scripts.train_rubric_rm {other arguments}` under the root directory.

⚠️ **Important:** Training the reward model takes over 1 hour, as the dataset contains 199,760 instances. We strongly recommend running the script in [tmux](https://tmuxcheatsheet.com/) to ensure your job continues running even if you disconnect. **Don't leave this step until the last minute!**

**Step 1.2 - Design and validate your rubric (10 points):** Design a rubric to evaluate your personalized emails. Use your trained reward model to compare the baseline output and SFT checkpoint output. Verify that the reward model's judgments align with your own preferences. If the reward model performs poorly, consider adjusting your rubric or returning to Step 1.1 to tune hyperparameters. 

*Note: The Tinker API uses LoRA for parameter-efficient fine-tuning. If you're interested in learning more about tuning hyperparameters in LoRA setups, check out this [blog post](https://thinkingmachines.ai/blog/lora/).*
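
To verify in Step 1.2 that the reward model's judgments align with your own, one simple check is the agreement rate between its preferences and labels you assign by hand. The sketch below assumes both are encoded as in the grading cell (1 for output A, -1 for output B, 0 for no clear result); the function name `agreement_rate` is hypothetical.

```python
def agreement_rate(ai_preferences, human_preferences):
    """Fraction of examples where the grader and the human pick the same side.

    Both lists use the encoding 1 = output A, -1 = output B, 0 = no clear result.
    """
    if len(ai_preferences) != len(human_preferences):
        raise ValueError("Both preference lists must have the same length.")
    if not ai_preferences:
        return 0.0
    matches = sum(1 for a, h in zip(ai_preferences, human_preferences) if a == h)
    return matches / len(ai_preferences)
```

With only a handful of demonstrations the rate is noisy, so reading the grader's written explanations alongside it is still worthwhile.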

In [None]:
# Step 1.1: After training, you will see output like this:
# tinker_cookbook.checkpoint_utils:75 [INFO] Saved checkpoints: {'state_path': 'tinker://aaef6db5-0e20-41ec-8023-7df145aa30b8/weights/final', 'sampler_path': 'tinker://aaef6db5-0e20-41ec-8023-7df145aa30b8/sampler_weights/final'}

# Step 1.2: Design rubric to evaluate personalized email.

from rubric_preference_types import PrometheusEvalComparisonRendererFromChatRenderer, PrometheusEvalComparison

rubric = """
TODO: add your personalized rubric here
"""

rm_model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
rm_model_path = "tinker://"  # TODO: add your model path here

grader_sampler = TinkerSampler(
    model_name=rm_model_name,
    model_path=rm_model_path,
    temperature=1.0,
    max_tokens=2048,
)
pairwise_renderer = PrometheusEvalComparisonRendererFromChatRenderer(convo_renderer=grader_sampler.renderer)

baseline_results = json.load(open("results/baseline_results.json"))
sft_model_outputs = json.load(open("results/sft_model_outputs.json"))

ai_graded_preference = []
for i, (baseline_result, sft_model_result) in enumerate(zip(baseline_results, sft_model_outputs)):
    print(f"Evaluating {i+1}/{len(baseline_results)}")
    prompt = baseline_result["input"]
    baseline_output = baseline_result["output"]
    sft_output = sft_model_result["output"]
    
    messages = pairwise_renderer._comparison_to_convo(
        PrometheusEvalComparison(
            prompt_conversation=[renderers.Message(role="user", content=prompt)],
            completion_A=[renderers.Message(role="assistant", content=baseline_output)],
            completion_B=[renderers.Message(role="assistant", content=sft_output)],
            rubric=rubric,
            reference=None
        )
    )
    response = await grader_sampler.generate(messages)
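    # replace() removes the end-of-text token; strip() with a string argument strips characters, not a substring.
    response = str(response["content"]).replace("<|endoftext|>", "").strip()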
    preference = 1 if "[RESULT] A" in response else (-1 if "[RESULT] B" in response else 0)
    ai_graded_preference.append({
        "input": prompt,
        "baseline_output": baseline_output,
        "sft_output": sft_output,
        "preference": preference,
        "grader_response": response
    })

print(f"Baseline preferred: {sum(1 for r in ai_graded_preference if r['preference'] == 1)}")
print(f"SFT preferred: {sum(1 for r in ai_graded_preference if r['preference'] == -1)}")

#### Step 2: Synthesize More Prompts

In RLAIF, we don't need to provide ground truth outputs for input prompts. Similar to Method 2, synthesize additional inputs and split them into train and dev sets.
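
A minimal sketch of one way to split synthesized prompts, assuming they are plain strings. Removing duplicates, shuffling with a fixed seed, and capping the dev set at half of the data are illustrative choices rather than requirements of the assignment; the function name `split_train_dev` is hypothetical.

```python
import random


def split_train_dev(prompts, dev_size=500, seed=0):
    """Deduplicate, shuffle, and split prompts into (train, dev) lists."""
    unique = list(dict.fromkeys(prompts))  # drop exact duplicates, keep first occurrence
    rng = random.Random(seed)              # fixed seed so the split is reproducible
    rng.shuffle(unique)
    # Assumption: never let the dev set take more than half of the data.
    dev_size = min(dev_size, len(unique) // 2)
    return unique[dev_size:], unique[:dev_size]
```

Deduplicating before the split keeps identical prompts from landing in both the train and dev sets.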

In [None]:
# Step 2: synthesize more prompts
# Note #1: If you have chosen to use your own data source, you may need to modify the data synthesis prompt accordingly.
# Note #2: This cell takes around 5 minutes to run.
import random

async def synthesize_more_inputs_threadpool(run_count: int):
    def sync_synthesize_inputs():
        try:
            selected_seed = random.choice(SEEDS)
            selected_example_dict = random.choice(demonstrations)
            selected_example = f"Input: {selected_example_dict['input']}\nOutput: {selected_example_dict['output']}"
            messages = [renderers.Message(
                role="user",
                content=SYNTHESIZE_INPUT_PROMPT.replace("<examples>", selected_example).replace("<seed>", selected_seed)
                )
            ]
            output = asyncio.run(data_synthesis_sampler.generate(messages))
            return collect_synthetic_input(str(output["content"]))
        except Exception as e:
            return None
    
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_REQUESTS) as executor:
        futures = [executor.submit(sync_synthesize_inputs) for _ in range(run_count)]
        results = []
        for future in futures:
            result = future.result()
            if result:
                print(f"Collected {len(result)} inputs")
                results.extend(result)
    
    return results

more_synthetic_inputs = await synthesize_more_inputs_threadpool(200)
print(more_synthetic_inputs[0])
print(more_synthetic_inputs[-1])


In [None]:
rl_data = {
    "train": {
        "data": more_synthetic_inputs[:-500],
        "output_file": "results/rl_train_data.jsonl"
    },
    "dev": {
        "data": more_synthetic_inputs[-500:],
        "output_file": "results/rl_dev_data.jsonl"
    }
}

for split in ["train", "dev"]:
    with open(rl_data[split]["output_file"], "w") as f:
        for prompt in rl_data[split]["data"]:
            if split == "train":
                # Add the system prompt to give the policy a better prior in RL training
                prompt_conversation = [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": prompt}
                ]
            else:
                prompt_conversation = [
                    {"role": "user", "content": prompt}
                ]
            d = {
                "prompt_conversation": prompt_conversation,
                "reference": None,
                "rubric": rubric
            }
            f.write(json.dumps(d) + "\n")
    print(f"{len(rl_data[split]['data'])} {split} data saved to {rl_data[split]['output_file']}")

#### Step 3: Run the RLAIF Loop (20 points)

**Complete rubric_preference_env.py to define the RL logic. (10 points)**

For a given prompt $p$, the RLAIF procedure operates as follows:
1. **Sample Generation**: The current policy samples `group_size` outputs $o_1, \ldots, o_g$.
2. **Pairwise Tournament**: We employ a tournament structure to compare each pair of outputs within the group using the rubric-based reward model (RM).
3. **Scoring**: For each pairwise comparison, the winning output receives a score of 1, while the losing output receives a score of -1.
4. **Reward Aggregation**: The final reward for a sample $o_i$ is its accumulated score across all tournament matches.
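
The scoring and aggregation steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the structure of `rubric_preference_env.py`: `judge` is a hypothetical stand-in for the reward model call, each unordered pair is compared once, and a result of 0 (no clear winner) leaves both scores unchanged.

```python
from itertools import combinations


def tournament_rewards(outputs, judge):
    """Accumulate +1 / -1 scores over all pairwise comparisons within a group.

    judge(a, b) returns 1 if a wins, -1 if b wins, and 0 if there is no clear result.
    """
    rewards = [0] * len(outputs)
    for i, j in combinations(range(len(outputs)), 2):
        result = judge(outputs[i], outputs[j])
        rewards[i] += result  # winner gains 1 (or nothing on a tie)
        rewards[j] -= result  # loser loses 1 (or nothing on a tie)
    return rewards


def prefer_shorter(a, b):
    """Toy judge for illustration only: the shorter output wins."""
    if len(a) == len(b):
        return 0
    return 1 if len(a) < len(b) else -1
```

Because every comparison adds 1 to one sample and subtracts 1 from the other, the rewards within a group always sum to zero.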

Note: Even though you don't need to change `rubric_preference_types.py`, we suggest reading it carefully, as this will help you complete `rubric_preference_env.py`.


**Complete rl_with_rubric_rm.py by writing an evaluator to monitor the model behavior change during the training stage. (10 points)** 

We use the reward model to compare the output of the initial model checkpoint with the output of the current checkpoint to see whether the policy generates more personalized outputs as the RL run progresses. You can monitor the dev set reward curve and inspect the actual policy outputs to tune hyperparameters.

Launch the training by running `python -m scripts.rl_with_rubric_rm {other parameters}` under the root directory.

**Add the wandb link for your run in writeup.md (10 points).**

#### Step 4: Evaluate the Checkpoint

Use your inference code to sample outputs from the RL checkpoint and evaluate the personalization quality.

In [None]:
# Step 4: Launch the checkpoint for sampling
# After running RLAIF with scripts/rl_with_rubric_rm.py, we will see output like this:
# tinker_cookbook.checkpoint_utils:75 [INFO] Saved checkpoints: {'state_path': 'tinker://24b18c15-0234-4bc0-9bd3-58eba5bfc210/weights/final', 'sampler_path': 'tinker://24b18c15-0234-4bc0-9bd3-58eba5bfc210/sampler_weights/final'}
# The link after "sampler_path" is the model_path we will use below.

rl_model_name = "Qwen/Qwen3-4B-Instruct-2507"
rl_model_path = "tinker://"  # TODO: add your model path here

rl_model_sampler = TinkerSampler(
    model_name=rl_model_name,
    model_path=rl_model_path,
    temperature=1.0,
    max_tokens=2048,
)

rl_model_outputs = []
for i, item in enumerate(demonstrations):
    input_text, expected_output = item["input"], item["output"]
    print(f"Sampling {i+1}/{len(demonstrations)}")
    messages = [
        renderers.Message(role="user", content=input_text),
    ]
    output = await rl_model_sampler.generate(messages)
    rl_model_outputs.append({"input": input_text, "expected_output": expected_output, "output": output["content"]})

print("Input: ", rl_model_outputs[0]["input"])
print(("=" * 50))
print("Expected Output: ", rl_model_outputs[0]["expected_output"])
print(("=" * 50))
print("Output: ", rl_model_outputs[0]["output"])

os.makedirs("results", exist_ok=True)
with open("results/rl_model_outputs.json", "w") as f:
    json.dump(rl_model_outputs, f, indent=2)

**Analyze the Pros & Cons of Personalization with RLAIF (10 points)**

Provide a thorough analysis that addresses the following questions:

1. Did you observe any reward hacking behavior where the policy produces outputs that score highly according to the reward model but don't actually match your personal preferences? Provide specific examples if observed. 
   
   *For more background on reward hacking, we recommend reading this [blog post](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/).*

2. Describe the approaches you tried to improve the results. What worked well? What didn't? Consider discussing rubric refinements, hyperparameter adjustments, or prompt engineering changes.

3. Based on your experiments, what inherent limitations did you identify with RLAIF-based personalization? Consider factors such as data efficiency, scalability, alignment quality, or the challenge of specifying preferences through rubrics.

**Add your answer in writeup.md**

### Bonus: Demonstration-Iterated Task Optimization (DITTO) (30 points)

Congratulations on completing the main assignment! You've explored three personalization approaches: prompt engineering, supervised fine-tuning with synthetic data, and reinforcement learning. However, developing a trustworthy reward model remains challenging, which has motivated approaches like Direct Preference Optimization (DPO) that eliminate the need for explicit reward models. The primary challenge with applying DPO to personalization lies in obtaining sufficient pairwise comparisons from individual users. As you may have experienced in HW1, the labeling process can be tedious and mentally demanding.

Our recent work [1] introduces **DITTO (Demonstration-Iterated Task Optimization)**, which addresses this limitation by efficiently generating online comparison data. DITTO treats users' demonstrations as preferred examples and contrasts them against outputs from the base LLM and its intermediate training checkpoints, thereby creating the necessary preference pairs without extensive manual labeling.

<img src="assets/ditto.jpg" alt="DITTO" width="800">

**Your task:** Implement DITTO (Algorithm 1 in the paper) using Tinker in `scripts/ditto_dpo.py` and launch a training run with your collected seed data. Compare DITTO's performance against the three methods you implemented earlier (prompt engineering, SFT with synthetic data, and RLAIF). In your analysis, discuss:
- How does DITTO's performance compare to other methods?
- What advantages does DITTO offer in terms of data efficiency and ease of use?
- Are there any limitations or trade-offs you observed?

**Add your answer in writeup.md**

**Hints:**
1. **Implementation Scope:** You only need to implement the DPO component. For the initialization step "$\pi_0 \leftarrow \text{SFT}(\pi_{\text{ref}}, \mathcal{D}_{E}), t = 0$", use `scripts/sft.py` you completed previously. When configuring the SFT step, follow the paper's guidance on hyperparameters: "For a dataset, we train with SFT until BCE train loss on a given batch approaches 1.00 (early stopping); ideally, we want an LLM to not overfit entirely to demos before DPO."
2. **Core Algorithm:** The key innovation of DITTO is constructing preference pairs dynamically during training. As detailed in Section 3.2 ("A Practical Algorithm"), implement the following preference pair distribution:
 - 70% on-policy comparisons (i.e., ground truth vs. current policy output)
 - 20% replay comparisons (i.e., ground truth vs. old checkpoint output)
 - 10% intermodel comparisons (i.e., new checkpoint output vs. old checkpoint output)
 Your main task is implementing this data construction logic. The DPO update mechanism itself is standardized and available through the [official implementation](https://github.com/thinking-machines-lab/tinker-cookbook/blob/main/tinker_cookbook/preference/train_dpo.py) in Tinker Cookbook.
3. **Training and Evaluation:** As its name implies, DITTO is designed for data efficiency, so train using only the 10 demonstrations you collected initially. When evaluating DITTO against your earlier implementations, ensure a fair comparison by testing on new, previously unseen inputs.
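
The 70/20/10 mix from hint 2 can be sketched as a weighted draw over comparison types. This covers only the sampling of which kind of pair to build; constructing the actual pairs from demonstrations and checkpoint outputs is left to your implementation. The names `COMPARISON_WEIGHTS` and `sample_comparison_types` are hypothetical.

```python
import random

# Mix of comparison types described in hint 2.
COMPARISON_WEIGHTS = {
    "on_policy": 0.7,   # ground truth vs. current policy output
    "replay": 0.2,      # ground truth vs. old checkpoint output
    "intermodel": 0.1,  # new checkpoint output vs. old checkpoint output
}


def sample_comparison_types(num_pairs, rng):
    """Draw a comparison type for each preference pair according to the weights."""
    kinds = list(COMPARISON_WEIGHTS)
    weights = [COMPARISON_WEIGHTS[kind] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=num_pairs)
```

Passing a seeded `random.Random` instance keeps the draw reproducible across runs. For small batches the realized proportions will deviate from 70/20/10, so fixed per-batch counts may be a reasonable alternative.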

**References:**

[1] Shaikh, O., Lam, M. S., Hejna, J., Shao, Y., Cho, H., Bernstein, M. S., & Yang, D. (2024). [Aligning Language Models with Demonstrated Feedback](https://arxiv.org/pdf/2406.00888). *ICLR 2025*.