In [1]:
!pip show torch transformers datasets accelerate

Name: torch
Version: 2.2.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Users/emmanuelochiba/venvs/myllm310/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate
---
Name: transformers
Version: 4.35.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /Users/emmanuelochiba/venvs/myllm310/lib/python3.10/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 
---
Name: datasets
Versi

# Reinforcement Pre-Training (RPT) – Prototype on M1 Mac

This notebook is a lightweight prototype of **Reinforcement Pre-Training (RPT)**, inspired by the paper:

**"Reinforcement Pre-Training" — Qingxiu Dong et al., Microsoft Research, 2024**

---

## 🧠 What Is RPT?

Reinforcement Pre-Training reframes **next-token prediction** as a **reinforcement learning problem**. Instead of always optimizing via cross-entropy loss, RPT introduces **verifiable token-level rewards** — rewarding the model when it gets the next token right (or does something desirable).

- This bridges **language modeling** and **reinforcement learning**
- Enables better **reasoning**, **adaptivity**, and **scalability**
- Improves alignment with downstream tasks that don’t rely on cross-entropy

---

## 🧪 What This Notebook Does

This is a **minimal, educational prototype** designed to:
- Run efficiently on an M1 Mac (no GPUs required)
- Use a small `DistilGPT2` model from Hugging Face
- Train on a **toy dataset** (custom text examples)
- Apply **reward-based token learning**, simulating the RPT logic

⚠️ Note: This is not a full reproduction, but a learning-focused approximation to help internalize the core ideas of RPT and build intuition around token-level rewards in transformer pretraining.

### Load & Tokenize Dataset

We’re using a `.jsonl` file made from *The Big Bang Theory* transcripts, reformatted into prompt–completion pairs.

Each entry is designed to simulate a conversational prompt and Sheldon-like reply — to eventually fine-tune a fun LLM called **SheldonGPT** (more on that soon 😉). We're keeping things M1-friendly by loading in-memory, disabling dataset caching, and tokenizing each item individually.

In [2]:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import json
import numpy as np
import warnings

# Suppress NumPy 2.0 copy warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 doesn't have a pad token, so we use eos

  from .autonotebook import tqdm as notebook_tqdm
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(


In [3]:
# Load JSONL data (SheldonGPT fine-tuning format)
with open("sheldon_finetune.jsonl", "r") as f:
    data = [json.loads(line) for line in f]

### Build custom dataset class

In [4]:
class CustomTextDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.processed_data = []
        
        for item in data:
            text = f"{item['prompt']}<|endoftext|>{item['completion']}<|endoftext|>"
            encoded = tokenizer(
                text,
                truncation=True,
                padding="max_length",
                max_length=max_length,
                return_tensors="pt"
            )
            self.processed_data.append({
                "input_ids": encoded["input_ids"].squeeze(0),
                "attention_mask": encoded["attention_mask"].squeeze(0)
            })
    
    def __len__(self):
        return len(self.processed_data)
    
    def __getitem__(self, idx):
        return self.processed_data[idx]

### DataLoader + Safe Collation

In [5]:
def custom_collate_fn(batch):
    input_ids = torch.stack([item["input_ids"] for item in batch])
    attention_mask = torch.stack([item["attention_mask"] for item in batch])
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask
    }

dataset = CustomTextDataset(data, tokenizer)
dataloader = DataLoader(
    dataset, 
    batch_size=2, 
    shuffle=True,
    collate_fn=custom_collate_fn
)

print("✅ Dataset loaded successfully!")

✅ Dataset loaded successfully!


### Load Model & Set Up Training

In [6]:
DEVICE = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = GPT2LMHeadModel.from_pretrained("distilgpt2").to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

### Reinforcement Pre-Training Loop (Prototype)

In [7]:
for step, batch in enumerate(dataloader):
    if step > 200: break  # Quick loop for M1

    input_ids = batch["input_ids"].to(DEVICE)
    labels = input_ids.clone()

    outputs = model(input_ids, labels=labels)
    logits = outputs.logits

    # Predictions and reward shaping
    predictions = torch.argmax(logits, dim=-1)
    rewards = (predictions[:, 1:] == labels[:, 1:]).float()

    # Shift for next-token alignment
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    shift_rewards = rewards.contiguous()

    # Reward-weighted loss
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction='none'
    )
    weighted_loss = (loss * shift_rewards.view(-1)).mean()

    weighted_loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if step % 20 == 0:
        print(f"Step {step}, RPT Loss: {weighted_loss.item():.4f}")

print("🏁 Training completed!")

Step 0, RPT Loss: 0.0138
Step 20, RPT Loss: 0.0069
Step 40, RPT Loss: 0.0069
Step 60, RPT Loss: 0.0135
Step 80, RPT Loss: 0.0000
Step 100, RPT Loss: 0.0000
Step 120, RPT Loss: 0.0000
Step 140, RPT Loss: 0.0000
Step 160, RPT Loss: 0.0450
Step 180, RPT Loss: 0.0000
Step 200, RPT Loss: 0.0107
🏁 Training completed!


### What We Did

✅ Built a mini prototype of **Reinforcement Pre-Training (RPT)** using `distilgpt2` and a custom dataset.

✅ Instead of pure next-token loss, we rewarded the model for **token-level correctness**, using the idea of "verifiable supervision."

✅ Used a lightweight M1-friendly loop with clean collation and batching (no HuggingFace Datasets cache errors).

---

💡 The final model isn't production-ready, but this experiment **demonstrates how we might inject RL-style rewards directly into language modeling pretraining**.

🧪 Next: Try larger models, smarter rewards (e.g., BLEU, BERTScore), and scale!

#llm #rpt #deep_learning #finetuning #openai #huggingface #research