# Training Agents with Reinforcement Learning

### About Me
- will brown from ai twitter (@willccbb)
- research lead @ prime intellect
- spent 2yrs @ morgan stanley doing LLM applications
- did phd @ columbia on multi-agent learning theory
- i work on agentic RL stuff (see [willccbb/verifiers](https://github.com/willccbb/verifiers) on github)

### About the Full Course (Production-Ready Agent Engineering: From MCP to RL)

https://maven.com/will-brown-kyle-corbitt/agents-mcp-rl

- runs june 16 - july 4
- co-teaching with kyle corbitt (@corbtt), ceo of openpipe
- agent stuff AND rl stuff
    - two sides of the same coin
    - course starts with practical 
    - builds towards RL finetuning for OSS models
    - most patterns have analogues which can be applied to closed/API models

### Today's Lightning Lesson
- revisiting search agent setup
- making local file search tools for agents
- trivia Q&A generation grounded in real-world wiki articles
- evaluating agents with LLM judges
- synthetic data for SFT warmup
- RL crash course via `verifiers`

In [109]:
# count number of files in data/wiki
import os
print(len(os.listdir("data/wiki")))
first_ten_files = os.listdir("data/wiki")[:10]
print(first_ten_files)

2590
['İlkay Gündoğan.md', 'Zoë Kravitz.md', 'Zodiac.md', 'Zlatko Dalić.md', 'Zinedine Zidane.md', 'Zambia.md', 'Zack Snyder.md', 'Zachary Levi.md', 'Zac Efron.md', 'ZZ Top.md']


In [18]:
# Initialize/load the collection
import re
import os
import chromadb
from chromadb.utils import embedding_functions
import hashlib

# Setup
WIKI_DIR = "data/wiki"  # Path relative to notebook location
CHROMA_DB_DIR = ".chroma_db"  # Directory for persistent ChromaDB storage

# Create persistent ChromaDB client
db_client = chromadb.PersistentClient(path=CHROMA_DB_DIR)

# Create embedding function using OpenAI
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ.get("OPENAI_API_KEY"),
    model_name="text-embedding-3-small"
)

def init_collection():
    """Initialize ChromaDB collection with wiki page titles"""
    try:
        # Try to get existing collection
        collection = db_client.get_collection("wiki_titles", embedding_function=openai_ef)
        return collection
    except Exception:
        # Create new collection and index all titles
        collection = db_client.create_collection("wiki_titles", embedding_function=openai_ef)
        
        # Get all wiki files
        wiki_files = [f for f in os.listdir(WIKI_DIR) if f.endswith('.md')]
        
        # Add documents to collection
        documents = []
        ids = []
        metadatas = []
        
        for filename in wiki_files:
            # Create page ID from filename (remove .md extension)
            title = filename[:-3]
            # normalize the title: spaces -> underscores, then lowercase
            page_id = title.replace(' ', '_').lower()
            
            documents.append(title)
            ids.append(page_id)
            metadatas.append({"page_id": page_id, "title": title})

        # Add in batches of 100
        batch_size = 100
        for i in range(0, len(documents), batch_size):
            collection.add(
                documents=documents[i:i+batch_size],
                ids=ids[i:i+batch_size],
                metadatas=metadatas[i:i+batch_size]
            )
        
        return collection

# Initialize collection on notebook load
collection = init_collection()
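
The page IDs above come from a simple title normalization, and each ID is expected to be unique within the collection. A minimal sketch of that rule plus a collision check follows. `find_id_collisions` and the sample titles are hypothetical, added only for illustration; they are not part of the notebook or of ChromaDB.

```python
def title_to_page_id(title: str) -> str:
    """Mirror the notebook's rule: spaces become underscores, then lowercase."""
    return title.replace(" ", "_").lower()

def find_id_collisions(titles: list[str]) -> dict[str, list[str]]:
    """Return every page ID that more than one title maps to."""
    by_id: dict[str, list[str]] = {}
    for title in titles:
        by_id.setdefault(title_to_page_id(title), []).append(title)
    return {pid: ts for pid, ts in by_id.items() if len(ts) > 1}

# Hypothetical titles for illustration (the last one is not from data/wiki)
sample_titles = ["ZZ Top", "Zac Efron", "zac efron"]
collisions = find_id_collisions(sample_titles)
print(collisions)
```

Running a check like this over the real filenames before indexing would show whether any two titles differ only by case or by spaces versus underscores.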

In [110]:
import chromadb

db_client = chromadb.PersistentClient(path=".chroma_db")

# count number of entries in wiki_titles collection
print(db_client.get_collection("wiki_titles").count())

# get all collections
print(db_client.list_collections())

# create collection
#collection = db_client.create_collection("wiki_titles", embedding_function=openai_ef)


2590
[Collection(name=wiki_titles)]


In [111]:
def search_pages(query: str) -> list[dict]:
    """Search for top 10 relevant articles using title embedding similarity.
    
    Args:
        query (str): The query to search for.

    Returns:
        list[dict]: A list of dicts with page_id and title.

    Examples:
        "basketball" -> [{"page_id": "basketball", "title": "Basketball"}, {"page_id": "basketball_rules", "title": "Basketball Rules"}, ...]
    """
    results = collection.query(
        query_texts=[query],
        n_results=10
    )
    
    # Format results
    output = []
    for i in range(len(results['ids'][0])):
        output.append({
            "page_id": results['ids'][0][i],
            "title": results['metadatas'][0][i]['title'] # type: ignore
        })
    
    return output

# test search_pages
print(search_pages("basketball"))

[{'page_id': 'basketball_positions', 'title': 'Basketball positions'}, {'page_id': 'baseball', 'title': 'Baseball'}, {'page_id': 'basketball_wives', 'title': 'Basketball Wives'}, {'page_id': 'reggie_jackson__basketball,_born_1990', 'title': 'Reggie Jackson _basketball, born 1990'}, {'page_id': "united_states_men's_national_basketball_team", 'title': "United States men's national basketball team"}, {'page_id': 'chicago_bulls', 'title': 'Chicago Bulls'}, {'page_id': 'blake_griffin', 'title': 'Blake Griffin'}, {'page_id': 'jeremy_lin', 'title': 'Jeremy Lin'}, {'page_id': 'lamelo_ball', 'title': 'LaMelo Ball'}, {'page_id': '1984_nba_draft', 'title': '1984 NBA draft'}]
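
Note that "Baseball" ranks second for the query "basketball": the search compares embeddings of titles only, so near-matching titles score well. A toy sketch of similarity ranking is below. The three-dimensional vectors are made up for illustration, and cosine similarity is an assumption here; the metric Chroma actually uses depends on how the collection is configured.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], title_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k titles whose vectors are most similar to the query."""
    scored = sorted(
        title_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [title for title, _ in scored[:k]]

# Made-up vectors standing in for real title embeddings
toy_vectors = {
    "Basketball positions": [0.9, 0.1, 0.0],
    "Baseball": [0.7, 0.3, 0.0],
    "Zambia": [0.0, 0.1, 0.9],
}
ranked = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(ranked)
```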


In [112]:
def view_sections(page_id: str) -> list[dict]:
    """View the sections of a page.
    
    Args:
        page_id (str): The ID of the page to view.

    Returns:
        list[dict]: A list of dicts with section_id and section_name.

    Examples:
        "basketball" -> [{"section_id": "basketball:history", "section_name": "History"}, ...]
    """
    # Find the file for this page_id
    results = collection.get(ids=[page_id])
    if not results['ids']:
        raise ValueError(f"Page not found: {page_id}")
    
    filename = results['metadatas'][0]['title'] + '.md'  # type: ignore
    filepath = os.path.join(WIKI_DIR, filename) # type: ignore
    
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()
    
    sections = []
    
    lines = content.split('\n')
    for i, line in enumerate(lines):
        if line.startswith('#'):
            # Extract section name (remove # and whitespace)
            section_name = line.lstrip('#').strip()
            # Create section ID
            section_id = f"{page_id}:{section_name.lower().replace(' ', '_')}"
            sections.append({
                "section_id": section_id,
                "section_name": section_name,
                "start_line": i
            })
    
    # If no sections found, return the whole page as one section
    if not sections:
        sections.append({
            "section_id": f"{page_id}:full",
            "section_name": "Full Page",
            "start_line": 0
        })
    
    return [{"section_id": s["section_id"], "section_name": s["section_name"]} 
            for s in sections]


# test view_sections
view_sections("baseball")

[{'section_id': 'baseball:baseball', 'section_name': 'Baseball'},
 {'section_id': 'baseball:rules_and_gameplay',
  'section_name': 'Rules and gameplay'},
 {'section_id': 'baseball:personnel', 'section_name': 'Personnel'},
 {'section_id': 'baseball:players', 'section_name': 'Players'},
 {'section_id': 'baseball:managers_and_coaches',
  'section_name': 'Managers and coaches'},
 {'section_id': 'baseball:umpires', 'section_name': 'Umpires'},
 {'section_id': 'baseball:strategy', 'section_name': 'Strategy'},
 {'section_id': 'baseball:tactics', 'section_name': 'Tactics'},
 {'section_id': 'baseball:pitching_and_fielding',
  'section_name': 'Pitching and fielding'},
 {'section_id': 'baseball:batting_and_baserunning',
  'section_name': 'Batting and baserunning'},
 {'section_id': 'baseball:history', 'section_name': 'History'},
 {'section_id': 'baseball:in_the_united_states',
  'section_name': 'In the United States'},
 {'section_id': 'baseball:establishment_of_professional_leagues',
  'section_nam

In [113]:
def read_section(section_id: str) -> str:
    """Read a section of a page.
    
    Args:
        section_id (str): The ID of the section to read.

    Returns:
        str: The content of the section.
        
    Examples:
        "baseball:finnish_baseball" -> "Finnish baseball is a sport that is played in Finland..."
    """
    # Parse section_id
    if ':' not in section_id:
        raise ValueError("Invalid section_id format. Expected: page_id:section_name")
    
    page_id, section_name_id = section_id.split(':', 1)
    
    # Get the file
    results = collection.get(ids=[page_id])
    if not results['ids']:
        raise ValueError(f"Page not found: {page_id}")
    
    filename = results['metadatas'][0]['title'] + '.md' # type: ignore
    filepath = os.path.join(WIKI_DIR, filename)
    
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()
    
    lines = content.split('\n')
    
    # Special case for "full" section
    if section_name_id == "full":
        return content
    
    # Find the section
    section_start = None
    section_end = None
    
    for i, line in enumerate(lines):
        if line.startswith('#'):
            current_section = line.lstrip('#').strip().lower().replace(' ', '_')
            if current_section == section_name_id and section_start is None:
                section_start = i
            elif section_start is not None and section_end is None:
                section_end = i
                break
    
    # If section found
    if section_start is not None:
        if section_end is None:
            section_end = len(lines)
        return '\n'.join(lines[section_start:section_end])
    else:
        raise ValueError(f"Section not found: {section_id}")
    
print(read_section("baseball:finnish_baseball"))

#### Finnish baseball

Finnish baseball, known as pesäpallo, is a combination of traditional ball-batting team games and North American baseball, invented by ["Tahko" Pihkala](Lauri)(Lauri Pihkala) in the 1920s. The basic idea of pesäpallo is similar to that of baseball: the offense tries to score by hitting the ball successfully and running through the bases, while the defense tries to put the batter and runners out. One of the most important differences between pesäpallo and baseball is that the ball is pitched vertically, which makes hitting the ball, as well as controlling the power and direction of the hit, much easier. This gives the offensive game more variety, speed, and tactical aspects compared to baseball.
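
One behavior worth knowing: `read_section` stops at the next heading of any level, so reading a parent section such as `baseball:history` returns only the text before its first subsection. If you wanted subsections included, a possible variant is sketched below. `read_section_with_subsections` is a hypothetical helper that works on an in-memory string rather than the wiki files, and the sample document is invented.

```python
def heading_level(line: str) -> int:
    """Number of leading '#' characters, or 0 if the line is not a heading."""
    if not line.startswith("#"):
        return 0
    return len(line) - len(line.lstrip("#"))

def read_section_with_subsections(markdown: str, section_slug: str) -> str:
    """Return a section plus all nested subsections below it."""
    lines = markdown.split("\n")
    start, level = None, 0
    for i, line in enumerate(lines):
        lvl = heading_level(line)
        if lvl == 0:
            continue
        slug = line.lstrip("#").strip().lower().replace(" ", "_")
        if start is None:
            if slug == section_slug:
                start, level = i, lvl
        elif lvl <= level:
            # A heading at the same or a shallower level ends the section
            return "\n".join(lines[start:i])
    if start is None:
        raise ValueError(f"Section not found: {section_slug}")
    return "\n".join(lines[start:])

# Invented document for illustration
doc = (
    "# Baseball\n\nIntro.\n\n"
    "## History\n\nOverview.\n\n"
    "### In the United States\n\nDetails.\n\n"
    "## Rules\n\nNine innings."
)
history = read_section_with_subsections(doc, "history")
print(history)
```

The trade-off is context length: including subsections can return much more text per tool call, which may or may not be what you want the agent to learn.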



In [52]:
import random
import json
import asyncio
import nest_asyncio
from tqdm.notebook import tqdm
from openai import AsyncOpenAI

# Enable asyncio in Jupyter
nest_asyncio.apply()

# Initialize async OpenAI client
openai_client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Semaphore to limit concurrent requests
semaphore = asyncio.Semaphore(3)

async def generate_questions_for_file(filepath: str, n_questions: int = 5) -> list[dict]:
    """
    Generate N question-answer pairs for a given wiki file using gpt-4.1-mini.
    Returns list of dicts with question, answer, and filename.
    """
    async with semaphore:  # Limit concurrent requests
        # Read file content directly
        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read()
        
        filename = os.path.basename(filepath)
        
        # Prompt for gpt-4.1-mini
        prompt = f"""Given the following article content, generate {n_questions} question-answer pairs.

Requirements:
- Questions should be one sentence about a specific fact contained in the article
    - They should be framed as a general trivia question (the question reader will not see the article OR title OR any other information about the article)
    - Questions should be "fair game" for advanced pub trivia -- requiring potentially deep obscure knowledge or factual recall or search, but "self-contained" (without making reference to the article)
- Answers should be just a few words (1-5 words typically)
- Return as a JSON object with a "questions" list containing dicts with "question" and "answer" fields

Article content:
{content[:50000]}

Schema: 
{{
    "questions": [
        {{
            "question": "question text",
            "answer": "answer text"
        }},
        ...
    ]
}}

Return ONLY the JSON object, no other text."""
        
        # Call gpt-4.1-mini
        response = await openai_client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that generates factual question-answer pairs from articles."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"},
            temperature=0.7
        )
        
        # Parse response
        try:
            response_content = response.choices[0].message.content
            if not response_content:
                return []
            response_json = json.loads(response_content)
            # Handle different possible JSON structures
            if isinstance(response_json, list):
                qa_pairs = response_json
            elif isinstance(response_json, dict):
                # Try common keys
                qa_pairs = response_json.get('pairs', response_json.get('questions', response_json.get('data', [])))
                if not isinstance(qa_pairs, list):
                    qa_pairs = []
            else:
                qa_pairs = []
        except json.JSONDecodeError:
            return []
        
        # Add metadata to each pair
        results = []
        for pair in qa_pairs:
            results.append({
                "question": pair["question"],
                "answer": pair["answer"],
                "filename": filename
            })
        
        return results

async def generate_random_questions(n_pages: int = 3, questions_per_page: int = 3) -> list[dict]:
    """
    Generate questions for N random wiki pages using parallel processing.
    Works directly with files, no database needed.
    Returns consolidated list of all question-answer pairs.
    """
    # Get all wiki files directly from directory
    wiki_files = [f for f in os.listdir(WIKI_DIR) if f.endswith('.md')]
    
    # Sample random files
    selected_files = random.sample(wiki_files, min(n_pages, len(wiki_files)))
    
    # Create tasks for parallel processing
    tasks = []
    for filename in selected_files:
        filepath = os.path.join(WIKI_DIR, filename)
        task = generate_questions_for_file(filepath, questions_per_page)
        tasks.append(task)
    
    # Execute all tasks in parallel with progress bar
    all_results = []
    with tqdm(total=len(tasks), desc="Generating questions") as pbar:
        for coro in asyncio.as_completed(tasks):
            try:
                result = await coro
                all_results.append(result)
                pbar.update(1)
            except Exception as e:
                pbar.update(1)
                continue

    # Flatten results
    all_questions = []
    for questions in all_results:
        all_questions.extend(questions)

    return all_questions

# Example usage
async def main():
    n_pages = 150
    questions = await generate_random_questions(n_pages=n_pages, questions_per_page=5)
    print(f"\nGenerated {len(questions)} total questions")
    
    for i, q in enumerate(questions): 
        print(f"\n{i+1}. Q: {q['question']}")
        print(f"   A: {q['answer']}")
        print(f"   File: {q['filename']}")
    
    return questions

# Run the async function
questions = await main()

Generating questions:   0%|          | 0/150 [00:00<?, ?it/s]


Generated 605 total questions

1. Q: What is the primary artery that supplies blood to the vagina?
   A: Vaginal artery
   File: Vagina.md

2. Q: Which gland in females is a homologue of the male prostate and is implicated in the debated existence of the G-spot?
   A: Skene's gland
   File: Vagina.md

3. Q: What is the primary type of bacteria dominating the vaginal microbiota in healthy women of reproductive age?
   A: Lactobacillus
   File: Vagina.md

4. Q: What is the name of the thin mucous membrane that partially covers the vaginal opening in humans?
   A: Hymen
   File: Vagina.md

5. Q: Which nerve provides the main nerve supply to the lower part of the vagina?
   A: Pudendal nerve
   File: Vagina.md

6. Q: What is the birth name of the singer known professionally as Sade Adu?
   A: Helen Folasade Adu
   File: Sade _singer.md

7. Q: In which Nigerian city was Sade Adu born?
   A: Ibadan
   File: Sade _singer.md

8. Q: Which Sade album was the best-selling debut by a British fema
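
The generation cell caps concurrency with `asyncio.Semaphore(3)` while consuming results through `asyncio.as_completed`. A small sketch of that pattern with the API call replaced by a short sleep is below. `run_demo` and `fake_request` are hypothetical names used only for this illustration.

```python
import asyncio

async def run_demo(n_tasks: int = 10, limit: int = 3) -> int:
    """Run n_tasks fake requests and return the peak number in flight."""
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def fake_request(i: int) -> int:
        nonlocal active, peak
        async with semaphore:  # at most `limit` bodies run at once
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for an API call
            active -= 1
        return i

    results = []
    # Results arrive in completion order, not submission order
    for coro in asyncio.as_completed([fake_request(i) for i in range(n_tasks)]):
        results.append(await coro)
    return peak

peak = asyncio.run(run_demo())
print(peak)
```

Inside a notebook with a running event loop, `await run_demo()` does the same job as `asyncio.run(run_demo())`.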

In [None]:
old_questions = questions

In [78]:
len(old_questions)

2885

In [82]:
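# safely deduplicate list of dicts
def deduplicate_dicts(list_of_dicts: list[dict]) -> list[dict]:
    """Deduplicate a list of dicts; dicts with identical items count as duplicates."""
    # Dicts are unhashable, so compare them as tuples of (key, value) pairs
    unique_items = set(tuple(d.items()) for d in list_of_dicts)
    return [dict(items) for items in unique_items]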

# deduplicate questions
all_questions = deduplicate_dicts(old_questions)

# back to dicts
all_questions = [dict(d) for d in all_questions]

# to hf dataset 
from datasets import Dataset
dataset = Dataset.from_list(all_questions)
dataset.push_to_hub("wiki-trivia-questions")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/willcb/wiki-trivia-questions/commit/bd2138213717fa69d5df34514169d6bec0f5b6b3', commit_message='Upload dataset', commit_description='', oid='bd2138213717fa69d5df34514169d6bec0f5b6b3', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/willcb/wiki-trivia-questions', endpoint='https://huggingface.co', repo_type='dataset', repo_id='willcb/wiki-trivia-questions'), pr_revision=None, pr_num=None)

In [83]:
all_questions[0]

{'question': "What is the meaning of the term 'Utsuro-bune' in Japanese?",
 'answer': 'Hollow ship',
 'filename': 'Utsuro-bune.md'}
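
Deduplicating through a `set` does not keep the original order of the questions. If order matters (for example, to keep questions from the same page together), a sketch of an order-preserving variant is below. `deduplicate_preserving_order` is a hypothetical helper, not something from the notebook; the sample rows are copied from the outputs above.

```python
def deduplicate_preserving_order(rows: list[dict]) -> list[dict]:
    """Drop repeated dicts, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        # Sort items so key order does not affect equality
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"question": "What is the meaning of the term 'Utsuro-bune' in Japanese?",
     "answer": "Hollow ship", "filename": "Utsuro-bune.md"},
    {"question": "In which Nigerian city was Sade Adu born?",
     "answer": "Ibadan", "filename": "Sade _singer.md"},
    {"question": "What is the meaning of the term 'Utsuro-bune' in Japanese?",
     "answer": "Hollow ship", "filename": "Utsuro-bune.md"},
]
unique_rows = deduplicate_preserving_order(rows)
print(len(unique_rows))
```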

In [None]:
# to hf dataset 
from datasets import Dataset

dataset = Dataset.from_list(all_questions)
# duplicates were already dropped by deduplicate_dicts above
dataset.push_to_hub("wiki-trivia-questions")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/347 [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/willcb/wiki-trivia-questions/commit/5352198b0c2c3274d0963a8822e22a7343c9cdd8', commit_message='Upload dataset', commit_description='', oid='5352198b0c2c3274d0963a8822e22a7343c9cdd8', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/willcb/wiki-trivia-questions', endpoint='https://huggingface.co', repo_type='dataset', repo_id='willcb/wiki-trivia-questions'), pr_revision=None, pr_num=None)

In [61]:
import importlib

# reload verifiers
import verifiers as vf
importlib.reload(vf)

<module 'verifiers' from '/workspace/verifiers/verifiers/__init__.py'>

In [91]:
import verifiers as vf

system_prompt = """
You are a search agent who has access to the following tools for searching over a set of Wikipedia articles:
- search_pages(query: str) -> list[dict]: Search the wiki for pages that match the query.
- view_sections(page_id: str) -> list[dict]: View the sections of a page.
- read_section(section_id: str) -> str: Read a section of a page.

{tool_descriptions}

You will be given a question, and you must use the tools to find the answer to the question.

You may make up to 10 tool calls before giving your final answer.

In each turn, respond in the following format:

<think>
[your thoughts here]
</think>
<tool>
{{
    "name": "search_pages", # name of the tool to call
    "args": {{
        "query": "query" # arguments to pass to the tool
    }}
}}
</tool>

When you have found the answer, respond in the following format:
<think>
[your thoughts here]
</think>
<answer>
[final answer here]
</answer>
"""

tools = [
    search_pages,
    view_sections,
    read_section,
]


from openai import OpenAI
from verifiers.rubrics.judge_rubric import JudgeRubric
judge_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
judge_model = "gpt-4.1-nano"
judge_rubric = JudgeRubric(
    judge_client=judge_client,
    judge_model=judge_model
)

vf_env = vf.ToolEnv(
    dataset=dataset,
    system_prompt=system_prompt,
    tools=tools,
    max_turns=11,
)
vf_env.rubric = vf.RubricGroup(rubrics=[judge_rubric, vf_env.rubric])

Setting TOKENIZERS_PARALLELISM=false for forked processes.


return_description: list[dict]: A list of dicts with page_id and title. (list)
return_description: list[dict]: A list of dicts with section_id and section_name. (list)
return_description: str: The content of the section. (str)


Map (num_proc=32):   0%|          | 0/2261 [00:00<?, ? examples/s]

2025-06-12 20:03:56 - verifiers.rubrics.RubricGroup - INFO - Initialized RubricGroup with 2 rubrics
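
The system prompt asks the model to emit either a `<tool>` block containing JSON or an `<answer>` block. To make that format concrete, here is a rough sketch of how one assistant turn could be parsed. This is not the parser `verifiers` uses internally; `parse_turn` and its return shape are assumptions made for illustration.

```python
import json
import re

TOOL_RE = re.compile(r"<tool>\s*(.*?)\s*</tool>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)

def parse_turn(text: str) -> dict:
    """Classify one assistant message as a final answer, a tool call, or an error."""
    answer = ANSWER_RE.search(text)
    if answer:
        return {"type": "answer", "content": answer.group(1)}
    tool = TOOL_RE.search(text)
    if tool is None:
        return {"type": "error", "content": "no <tool> or <answer> block found"}
    try:
        call = json.loads(tool.group(1))
    except json.JSONDecodeError:
        return {"type": "error", "content": "tool block is not valid JSON"}
    if not isinstance(call, dict) or "name" not in call:
        return {"type": "error", "content": "tool block must be an object with a name"}
    return {"type": "tool", "name": call["name"], "args": call.get("args", {})}

message = """<think>
I should search first.
</think>
<tool>
{"name": "search_pages", "args": {"query": "Finnish baseball"}}
</tool>"""
parsed = parse_turn(message)
print(parsed)
```

Returning an error record instead of raising lets the environment feed the problem back to the model as the next message, which is one common way to handle malformed calls.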


In [None]:
from openai import OpenAI

client = OpenAI(api_key=os.getenv("DEEPINFRA_API_KEY"), base_url=os.getenv("DEEPINFRA_API_URL"))
model = "deepseek-ai/DeepSeek-V3-0324"

results = vf_env.evaluate(
    client=client,
    model=model,
    num_samples=10,
    max_concurrent=10
)

2025-06-12 20:04:22 - verifiers.envs.ToolEnv - INFO - eval_dataset is not set, falling back to train dataset


Running 10 rollouts: 100%|██████████| 10/10 [00:26<00:00,  2.69s/it]
Evaluating 10 rollouts: 100%|██████████| 10/10 [00:00<00:00, 14.92it/s]
Evaluating 10 rollouts: 100%|██████████| 10/10 [00:00<00:00, 234.09it/s]


In [None]:
from openai import OpenAI

client = OpenAI(api_key=os.getenv("DEEPSEEK_API_KEY"), base_url=os.getenv("DEEPSEEK_API_URL"))
model = "deepseek-chat"

results = vf_env.evaluate(
    client=client,
    model=model,
    num_samples=200,
    max_concurrent=20
)
dataset_dsv3 = vf_env.make_dataset(results)
# save to hub
dataset_dsv3.push_to_hub("V3-wiki-trivia-tool-use")
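
For SFT warmup you generally want only good rollouts in the training set. One option is to filter on the judge reward before building the dataset. The sketch below assumes `results` is a dict of parallel lists keyed by `question`, `answer`, `completion`, and reward names, which matches how it is indexed in the results cell; `filter_correct_rollouts` is a hypothetical helper, and whether `make_dataset` already offers similar filtering is not shown here.

```python
def filter_correct_rollouts(
    results: dict,
    reward_key: str = "judge_reward_func",
    threshold: float = 1.0,
) -> list[dict]:
    """Keep rollouts whose reward under reward_key meets the threshold."""
    kept = []
    rows = zip(
        results["question"],
        results["answer"],
        results["completion"],
        results[reward_key],
    )
    for question, answer, completion, reward in rows:
        if reward >= threshold:
            kept.append({"question": question, "answer": answer, "completion": completion})
    return kept

# Invented mini-results for illustration
results_example = {
    "question": ["q1", "q2", "q3"],
    "answer": ["a1", "a2", "a3"],
    "completion": [
        [{"role": "assistant", "content": "<answer>a1</answer>"}],
        [{"role": "assistant", "content": "<answer>wrong</answer>"}],
        [{"role": "assistant", "content": "<answer>a3</answer>"}],
    ],
    "judge_reward_func": [1.0, 0.0, 1.0],
}
sft_rows = filter_correct_rollouts(results_example)
print(len(sft_rows))
```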

In [106]:
# Results 

print("\nRewards:")
for k, v in results.items():
    if "reward" in k:
        print(k, v)

question = results["question"][0]
answer = results["answer"][0]
rollout = results["completion"][0]

print(f"\nQuestion: {question}")
print(f"\nAnswer: {answer}")
print(f"\nRollout:")
for i, msg in enumerate(rollout):
    print(f"{i+1}. {msg['role']}: {msg['content']}")



Rewards:
judge_reward_func [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
reward [3.4, 3.4, 3.4, 3.4, 3.4, 2.4, 3.4, 3.4, 0.4, 3.4]
correct_answer_reward_func [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
tool_execution_reward_func [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
format_reward_func [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
search_pages_reward_func [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
search_pages_count_reward_func [1.0, 5.0, 4.0, 1.0, 4.0, 3.0, 2.0, 2.0, 1.0, 1.0]
search_pages_attempt_reward_func [1.0, 5.0, 4.0, 1.0, 4.0, 3.0, 2.0, 2.0, 1.0, 1.0]
view_sections_reward_func [1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
view_sections_count_reward_func [1.0, 0.0, 1.0, 1.0, 2.0, 4.0, 1.0, 1.0, 1.0, 1.0]
view_sections_attempt_reward_func [1.0, 0.0, 1.0, 1.0, 2.0, 4.0, 1.0, 1.0, 1.0, 1.0]
read_section_reward_func [1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
read_section_count_reward_func [1.0, 0.0, 4.0, 1.0, 4.0, 1.0, 2.0, 2.0, 1.
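
Per-rollout reward lists are hard to eyeball, so a quick mean per reward function helps. The sketch below uses the first two lists printed above as sample input; `summarize_rewards` is a hypothetical helper.

```python
from statistics import mean

def summarize_rewards(results: dict) -> dict[str, float]:
    """Average each per-rollout reward list, keyed by reward name."""
    return {k: mean(v) for k, v in results.items() if "reward" in k}

# Values copied from the output above
sample_results = {
    "judge_reward_func": [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "reward": [3.4, 3.4, 3.4, 3.4, 3.4, 2.4, 3.4, 3.4, 0.4, 3.4],
}
summary = summarize_rewards(sample_results)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

On this 10-sample run the judge marks 8 of 10 answers correct. Ten samples is a small eval, so treat the number as a smoke test rather than a benchmark.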

In [89]:
judge_client = OpenAI(api_key=os.getenv("DEEPINFRA_API_KEY"), base_url=os.getenv("DEEPINFRA_API_URL"))
judge_model = "google/gemma-3-12b-it"

# test inference
response = judge_client.chat.completions.create(
    model=judge_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

print(response.choices[0].message.content)

The capital of France is **Paris**. 🇫🇷
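
Once the judge endpoint responds, the remaining pieces of an LLM judge are a grading prompt and a rule for turning the reply into a reward. A bare-bones sketch is below. The template and both helpers are assumptions for illustration; the prompt that `JudgeRubric` actually uses is not shown in this notebook.

```python
# Hypothetical grading template, not the one used by JudgeRubric
JUDGE_TEMPLATE = """Question: {question}
Reference answer: {reference}
Model answer: {response}

Does the model answer match the reference answer? Reply with "yes" or "no"."""

def build_judge_prompt(question: str, reference: str, response: str) -> str:
    """Fill the grading template for one rollout."""
    return JUDGE_TEMPLATE.format(question=question, reference=reference, response=response)

def verdict_to_reward(judge_reply: str) -> float:
    """Map a yes/no reply to a 1.0/0.0 reward; anything else counts as 0.0."""
    return 1.0 if judge_reply.strip().lower().startswith("yes") else 0.0

prompt = build_judge_prompt(
    question="What is the meaning of the term 'Utsuro-bune' in Japanese?",
    reference="Hollow ship",
    response="It means 'hollow ship'.",
)
print(prompt)
print(verdict_to_reward("Yes."))
```

The string returned by `build_judge_prompt` would go in as the user message of a `judge_client.chat.completions.create` call like the test call above.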



In [103]:
client = OpenAI(api_key=os.getenv("DEEPSEEK_API_KEY"), base_url=os.getenv("DEEPSEEK_API_URL"))
model = "deepseek-chat"

results = vf_env.evaluate(
    client=client,
    model=model,
    num_samples=10,
    max_concurrent=10
)


2025-06-12 20:15:24 - verifiers.envs.ToolEnv - INFO - eval_dataset is not set, falling back to train dataset
Running 10 rollouts: 100%|██████████| 10/10 [01:26<00:00,  8.63s/it]
Evaluating 10 rollouts: 100%|██████████| 10/10 [00:02<00:00,  3.73it/s]
Evaluating 10 rollouts: 100%|██████████| 10/10 [00:00<00:00, 383.40it/s]


In [104]:
# Results 

print("\nRewards:")
for k, v in results.items():
    if "reward" in k:
        print(k, v)

question = results["question"][0]
answer = results["answer"][0]
rollout = results["completion"][0]

print(f"\nQuestion: {question}")
print(f"\nAnswer: {answer}")
print(f"\nRollout:")
for i, msg in enumerate(rollout):
    print(f"{i+1}. {msg['role']}: {msg['content']}")



Rewards:
judge_reward_func [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0]
reward [3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 0.4, 3.4]
correct_answer_reward_func [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
tool_execution_reward_func [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
format_reward_func [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
search_pages_reward_func [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
search_pages_count_reward_func [1.0, 5.0, 2.0, 1.0, 4.0, 2.0, 2.0, 4.0, 1.0, 1.0]
search_pages_attempt_reward_func [1.0, 5.0, 2.0, 1.0, 4.0, 2.0, 2.0, 4.0, 1.0, 1.0]
view_sections_reward_func [1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
view_sections_count_reward_func [1.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0]
view_sections_attempt_reward_func [1.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0]
read_section_reward_func [1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0]
read_section_count_reward_func [1.0, 0.0, 4.0, 1.0, 3.0, 2.0, 1.0, 0.0, 1.

## COURSE DISCOUNT

### Use promo code 'LIGHTNING' for 20% off 

https://maven.com/will-brown-kyle-corbitt/agents-mcp-rl?promoCode=LIGHTNING

- Course runs June 16 - July 4
- Lectures on Tuesdays/Thursdays @ 5PM ET
- Office hours throughout, more will be added
- Lectures are recorded, can be watched async any time (including after the course)
- Weekly take-home projects
- Agents, MCP, Evals, Tool Calling, Reinforcement Learning, GRPO, and more

![PROMOCODE](images/lightning_qr.png)