# Final Project Starting Guide

Hello everyone, and welcome to the final project! This notebook restates the rules and guidelines and gives you some starting points.

### What we provide

In this project, we will provide you with:
- This starting guide
- A working API that you can access from the ASU network (i.e., on campus or with VPN)
- A starting development dataset that you can use to develop your agent. It contains 1,000 instances, each with the fields {domain, input, expected_output}
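
A reasonable first step with the development data is to count instances per domain, so you know what kinds of requests your agent will see. The following is a minimal sketch, assuming the data is a JSON list of objects with the three fields named above. The helper names and the sample records are made up for illustration, and no file name is implied by this guide.

```python
import json
from collections import Counter

def load_instances(path):
    """Load a JSON file holding a list of {domain, input, expected_output} dicts.

    Assumption: the development data is stored as one JSON list.
    """
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def count_by_domain(instances):
    """Return a Counter mapping each domain to its number of instances."""
    return Counter(item["domain"] for item in instances)

# Made-up records standing in for the real development data
sample = [
    {"domain": "math", "input": "2 + 2", "expected_output": "4"},
    {"domain": "math", "input": "3 * 3", "expected_output": "9"},
    {"domain": "logic", "input": "If A then B. A. Therefore?", "expected_output": "B"},
]
domain_counts = count_by_domain(sample)
print(domain_counts.most_common())
```

With the real data, you would pass the result of `load_instances(...)` to `count_by_domain` instead of `sample`.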

### Your goal

In this project, you will implement an inference-time agent to solve reasoning requests like those in the development data. Grading is effort-based: you will get full credit if you produce the minimum deliverables below, subject to the rules and requirements below.

#### Minimum Deliverables

1. A working agent loop (in the form of a GitHub project) that the TA can run and that implements *at least three* inference-time algorithms or techniques.
2. Outputs from your agent on the released test data (see important dates).
3. A short one-page report on how your agent works, with pointers to the important techniques (references to code blocks).
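
As one illustration of what an inference-time technique could look like, here is a minimal sketch of self-consistency: sample several answers and keep the majority. It is only an example, not a required or endorsed choice. `self_consistency` and the `ask` callable are hypothetical names, and the canned replies stand in for real model samples.

```python
from collections import Counter

def self_consistency(question, ask, n_samples=5):
    """Sample several answers and return (most common answer, its vote count).

    `ask` is any callable mapping a question string to an answer string,
    for example a thin wrapper around the provided API with temperature > 0.
    """
    answers = [ask(question).strip().lower() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes

# Stand-in for a sampled model: returns canned replies one at a time
canned = iter(["8", "7", "8", "8 ", "9"])
answer, votes = self_consistency("3n + 5 > 26?", lambda q: next(canned))
print(answer, votes)
```

Each sample is one LLM call, so `n_samples` counts toward the per-question call limit in the rules below.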

#### Rules and Requirements
1. You must use only our provided API to access LLMs, meaning that you cannot use any other LLM in any other way within your agent loop. Some exceptions may be made if you call certain external tools (e.g., Google search) that use some LLMs internally. Please discuss any exceptions with us to avoid penalties of up to 50% of the project grade.
2. You must not hardcode a full delegation to an external tool (e.g., google_search(input_problem)). Such delegations must be automatically selected/decided by your agent. Hardcoded delegations will lead to a zero.
3. You cannot use Cursor or any AI coding aids to implement the final project. You can, however, ask LLMs (or other online resources) for conceptual clarification or code examples. Your final project should not contain any blocks of code (i.e., > 3 lines) that are written by AI. Violations will lead to a zero.
4. Your agent should run efficiently, with <20 LLM calls per question. Exceptions may be made if you have a complicated agent, but please discuss this with us. Up to 10% of the project grade may be deducted if we observe very inefficient LLM usage that does not clearly benefit performance.
5. Your agent must run without any requests to paid services (a service counts as paid if the TA has to pay to run it, regardless of whether you actually pay for it). Violations will lead to a zero.
6. You must submit a GitHub project link as your code submission. All changes must be tracked, and each commit should stay within 100 lines of +/- and have a good message. Up to 25% of the project grade will be deducted if we observe "magic commits" or too few commits.
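
The call limit in rule 4 can be enforced mechanically rather than by inspection. The following is a minimal sketch, assuming you wrap whatever function performs the API call. `CallBudget` and `fake_llm` are hypothetical names, not part of the provided code.

```python
class CallBudget:
    """Counts LLM calls for one question and refuses to exceed a limit."""

    def __init__(self, call_fn, max_calls=19):
        self.call_fn = call_fn      # the function that actually calls the LLM
        self.max_calls = max_calls  # 19 keeps the agent under 20 calls
        self.used = 0

    def __call__(self, *args, **kwargs):
        if self.used >= self.max_calls:
            raise RuntimeError(f"LLM call budget of {self.max_calls} exhausted")
        self.used += 1
        return self.call_fn(*args, **kwargs)

# Stand-in for the real API call
fake_llm = lambda prompt: f"echo: {prompt}"

budgeted = CallBudget(fake_llm, max_calls=2)
budgeted("a")
budgeted("b")
try:
    budgeted("c")          # third call exceeds the budget of 2
    exceeded = False
except RuntimeError:
    exceeded = True
print(budgeted.used, exceeded)
```

Creating one `CallBudget` per question keeps the count scoped to that question.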


### Suggestions
1. Start early, please.
2. Think about how to judge whether your output is good enough compared to the provided expected_outputs. We will not release how we actually evaluate your outputs, so you will have to anticipate our evaluation.
3. Start with a basic implementation, and iterate based on mistakes and feedback.
4. Find more development data, or create your own cases to stress-test your agent.
5. You are free to modify any code provided in this starting guide, or to not use it at all.

### Important dates
- **Release of final test data**: 11/25/2025
- **Deadline for submitting all deliverables**: 12/05/2025

### Extra credit
The top 20 projects (ranked by performance metrics on the test data and, at the TA's discretion, implementation quality) will receive extra credit. The actual credit will be between 1% and 7.5%, depending on the ranking.

In [7]:
# %% Minimal setup
# If needed (uncomment in a notebook): pip install requests python-dotenv

# To build an OpenAI-style agent loop, we first need tool calling.

import os, json, textwrap, re, time
import requests

API_KEY  = os.getenv("OPENAI_API_KEY", "cse476")
API_BASE = os.getenv("API_BASE", "http://10.4.58.53:41701/v1")  
MODEL    = os.getenv("MODEL_NAME", "bens_model")              

def call_model_chat_completions(prompt: str,
                                system: str = "You are a helpful assistant. Reply with only the final answer—no explanation.",
                                model: str = MODEL,
                                temperature: float = 0.0,
                                timeout: int = 60) -> dict:
    """
    Calls an OpenAI-style /v1/chat/completions endpoint and returns:
    { 'ok': bool, 'text': str or None, 'raw': dict or None, 'status': int, 'error': str or None, 'headers': dict }
    """
    url = f"{API_BASE}/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type":  "application/json",
    }
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user",   "content": prompt}
        ],
        "temperature": temperature,
        "max_tokens": 128,
    }

    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=timeout)
        status = resp.status_code
        hdrs   = dict(resp.headers)
        if status == 200:
            data = resp.json()
            text = data.get("choices", [{}])[0].get("message", {}).get("content", "")
            return {"ok": True, "text": text, "raw": data, "status": status, "error": None, "headers": hdrs}
        else:
            # try best-effort to surface error text
            err_text = None
            try:
                err_text = resp.json()
            except Exception:
                err_text = resp.text
            return {"ok": False, "text": None, "raw": None, "status": status, "error": str(err_text), "headers": hdrs}
    except requests.RequestException as e:
        return {"ok": False, "text": None, "raw": None, "status": -1, "error": str(e), "headers": {}}
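
Because `call_model_chat_completions` reports failures through its `ok` field rather than raising, a thin wrapper can retry transient errors. The following is a minimal sketch with exponential backoff. `call_with_retries` and `flaky` are hypothetical names, and the `sleep` parameter exists only so the example can run without waiting.

```python
import time

def call_with_retries(call_fn, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry `call_fn(prompt)` with exponential backoff until it reports ok=True.

    `call_fn` must return a dict with an 'ok' key, like call_model_chat_completions.
    Returns the last response, whether or not it succeeded.
    """
    response = None
    for attempt in range(max_attempts):
        response = call_fn(prompt)
        if response.get("ok"):
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ...
    return response

# Stand-in that fails twice, then succeeds
attempts = []
def flaky(prompt):
    attempts.append(prompt)
    ok = len(attempts) >= 3
    return {"ok": ok, "text": "45" if ok else None, "status": 200 if ok else -1}

delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky, "What is 17 + 28?", sleep=delays.append)
print(result["text"], len(attempts), delays)
```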


In [15]:
def agent_loop(question: str, tools: list, history: list = None, max_steps: int = 6):
    if history is None:
        history = []
    
    # System prompt that teaches model about tools and final answer format
    tool_descriptions = "\n".join([f"- {t['name']}: {t.get('description', 'No description')}" for t in tools])
    system_prompt = f"""You are a helpful assistant with access to these tools:
{tool_descriptions}

When you want to use a tool, say: USE TOOL: tool_name(arguments)
When you have the final answer, say: FINAL ANSWER: your answer here

IMPORTANT: Always end with FINAL ANSWER: followed by just the answer, nothing else.
"""
    
    for step in range(max_steps):
        # Construct the prompt with history
        if history:
            history_text = "Previous steps:\n" + "\n".join(history)
            prompt = f"Question: {question}\n\n{history_text}\n\nWhat do you want to do next? Remember to say FINAL ANSWER: when done."
        else:
            prompt = f"Question: {question}\n\nWhat do you want to do next?"
        
        # Call the model
        response = call_model_chat_completions(prompt, system=system_prompt)
        if not response['ok']:
            print(f"Model call failed: {response['error']}")
            return None
        
        model_output = response['text'].strip()
        print(f"Step {step + 1}: {model_output}")
        
        # Check for final answer (case insensitive)
        lower_output = model_output.lower()
        if "final answer:" in lower_output:
            # Extract everything after "final answer:"
            idx = lower_output.find("final answer:")
            final_answer = model_output[idx + len("final answer:"):].strip()
            print(f">>> Final answer: {final_answer}")
            return final_answer
        
        # Check if the model wants to call a tool
        tool_called = False
        for tool in tools:
            if tool['name'].lower() in lower_output:
                tool_result = tool['function'](model_output)
                history.append(f"Step {step + 1}: Called {tool['name']}, got result: {tool_result}")
                print(f"    Tool result: {tool_result}")
                tool_called = True
                break
        
        if not tool_called:
            # No tool called - if the output contains a number, treat the first one as the answer
            match = re.search(r"-?\d+(?:\.\d+)?", model_output)
            if match:
                numbers = match.group(0)
                print(f">>> Extracted answer: {numbers}")
                return numbers
            else:
                print(f">>> Treating as final answer: {model_output}")
                return model_output
    
    print("Max steps reached without final answer")
    # Return the last tool result if available
    if history:
        last = history[-1]
        if "result:" in last:
            return last.split("result:")[-1].strip()
    return None
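
`agent_loop` dispatches a tool whenever its name appears anywhere in the model output, so a sentence that merely mentions "calculator" would trigger it. The following is a minimal sketch of stricter parsing, assuming the model follows the `USE TOOL: tool_name(arguments)` format from the system prompt on a single line. `parse_tool_call` is a hypothetical helper.

```python
import re

# Matches e.g. "USE TOOL: calculator(2 + 3 * 4)".
# The greedy (.*) runs to the last ")" on the same line, so nested parentheses survive.
TOOL_CALL_RE = re.compile(r"USE TOOL:\s*(\w+)\s*\((.*)\)", re.IGNORECASE)

def parse_tool_call(model_output):
    """Return (tool_name, argument_string), or None if there is no explicit tool call."""
    match = TOOL_CALL_RE.search(model_output)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).strip()

print(parse_tool_call("USE TOOL: calculator((1 + 2) * 3)"))
print(parse_tool_call("I would not use the calculator here."))
```

Multi-line arguments, such as fenced Python code, would still need their own extraction step.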

In [10]:
def calculator(text):
    """Extract and evaluate a math expression from the model output."""
    # Simple approach: look for "calculator(" and extract until ")"
    text_lower = text.lower()
    
    if "calculator(" in text_lower:
        # Find start position after "calculator("
        start = text_lower.find("calculator(") + len("calculator(")
        # Find the closing parenthesis
        end = text.find(")", start)
        if end != -1:
            expr = text[start:end].strip()
        else:
            expr = ""
    else:
        # Fallback: just try to find numbers and operators
        expr = ""
        for char in text:
            if char in "0123456789+-*/.() ":
                expr += char
        expr = expr.strip()
    
    if not expr:
        return "Error: No expression found"
    
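    try:
        # Replace ^ with ** for exponentiation
        expr = expr.replace("^", "**")
        # Caution: eval on model output is risky. Empty builtins block name lookups
        # such as __import__, which reduces (but does not eliminate) the risk.
        result = eval(expr, {"__builtins__": {}}, {})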
        return str(result)
    except Exception as e:
        return "Error: " + str(e)


# Tool 2: Python script runner - executes Python code with a restricted set of builtins
def python_runner(text):
    """Extract and run Python code from the model output."""
    code = ""
    
    # Method 1: Look for a ```python ... ``` fenced block
    if "```python" in text.lower():
        # Find where the code starts after ```python
        start_marker = "```python"
        start = text.lower().find(start_marker) + len(start_marker)
        # Find where it ends 
        end = text.find("```", start)
        if end != -1:
            code = text[start:end].strip()
    
    # Method 2: Look for "python_runner:" 
    elif "python_runner:" in text.lower():
        start = text.lower().find("python_runner:") + len("python_runner:")
        code = text[start:].strip()
    
    # Method 3: Look for single backticks `code`
    elif "`" in text:
        start = text.find("`") + 1
        end = text.find("`", start)
        if end != -1:
            code = text[start:end].strip()
    
    if not code:
        return "Error: No code found"
    
    # Run the extracted code safely
    try:
        local_vars = {}
        safe_builtins = {
            "print": print, 
            "len": len, 
            "range": range,
            "int": int, 
            "float": float, 
            "str": str,
            "list": list, 
            "dict": dict, 
            "sum": sum,
            "min": min, 
            "max": max, 
            "abs": abs,
            "round": round, 
            "sorted": sorted
        }
        exec(code, {"__builtins__": safe_builtins}, local_vars)
        if "result" in local_vars:
            return str(local_vars["result"])
        elif local_vars:
            return str(local_vars)
        else:
            return "Code ran successfully (no output)"
    except Exception as e:
        return "Error: " + str(e)


# Define the tools list
tools = [
    {
        "name": "calculator",
        "description": "Evaluates math expressions. Use: calculator(2 + 3 * 4)",
        "function": calculator
    },
    {
        "name": "python_runner", 
        "description": "Runs Python code. Use: ```python your_code_here ```",
        "function": python_runner
    }
]

print("Testing calculator:", calculator("calculator(412 * 4 + 10)"))
print("Testing python_runner:", python_runner("```python\nresult = sum([1,2,3,4,5])\n```"))

Testing calculator: 1658
Testing python_runner: 15
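
The `calculator` tool passes model output to `eval`, which will run any Python expression it is given. One alternative is to walk the parsed syntax tree and allow only arithmetic. The following is a minimal sketch of that idea. `safe_arithmetic` is a hypothetical helper, not a drop-in replacement, and very large exponents can still be slow.

```python
import ast
import operator

# Only these arithmetic operators are allowed
_BINARY_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_arithmetic(expr):
    """Evaluate +, -, *, /, %, ** over numeric literals; raise ValueError otherwise."""
    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        # Names, calls, comparisons, attribute access, etc. are all rejected
        raise ValueError(f"Unsupported expression: {ast.dump(node)}")

    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError as e:
        raise ValueError(f"Unsupported expression: {e}") from e
    return walk(tree.body)

print(safe_arithmetic("412 * 4 + 10"))
```

Inputs such as `3*n + 5 > 26` or `__import__('os')` raise `ValueError` instead of being executed.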


In [16]:
# testing loop
print(agent_loop("calculate (123*23+23+5)", tools=tools))

Step 1: USE TOOL: calculator(123*23+23+5)
FINAL ANSWER: 2919
>>> Final answer: 2919
2919


## 1) Smoke test: direct inference

We’ll do a single request with a strict instruction to answer briefly.  
*If you see an auth error, set `OPENAI_API_KEY` and (if needed) `API_BASE`/`MODEL_NAME`.*


In [2]:
# %% Direct call example
demo_prompt = "What is 17 + 28? Answer with just the number."
result = call_model_chat_completions(demo_prompt)
print("OK:", result["ok"], "HTTP:", result["status"])
print("MODEL SAYS:", (result["text"] or "").strip())

# Optional: Inspect rate-limit headers if your provider exposes them
for k in ["x-ratelimit-remaining-requests", "x-ratelimit-limit-requests", "x-request-id"]:
    if k in result["headers"]:
        print(f"{k}: {result['headers'][k]}")


OK: True HTTP: 200
MODEL SAYS: 45


## 2) A tiny test set (3 questions)

We’ll cover:
1. **Math reasoning** — inequality solving,
2. **Common sense** — buoyancy/ice & water,
3. **Logic** — a classic race-position puzzle.

We also tightly constrain the required answer forms to enable simple auto‑grading.


In [17]:
# %% Define three tests: input + expected
tests = [
    {
        "id": "math_inequality",
        "type": "numeric",  # grader will prefer numeric extraction
        "prompt": "Solve for the smallest integer n such that 3n + 5 > 26. Answer with just the integer.",
        "expected": "8",    # Because 3n > 21 => n > 7, smallest integer is 8
    },
    {
        "id": "commonsense_ice",
        "type": "text",
        "prompt": (
            "You place an ice cube in a glass of water and mark the water level. "
            "After the ice melts, does the water level rise, fall, or stay the same? "
            "Answer with exactly one of: 'rise', 'fall', 'stay the same'."
        ),
        "expected": "stay the same",
    },
    {
        "id": "logic_race",
        "type": "text",
        "prompt": (
            "In a race, you pass the person in second place. What position are you now in? "
            "Answer with a single word like 'first', 'second', 'third'."
        ),
        "expected": "second",
    },
]


In [18]:
for test in tests:
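    # Pass only the prompt; passing the whole dict would leak the expected answer into the prompt
    print(agent_loop(test["prompt"], tools))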

Step 1: USE TOOL: calculator(3*n + 5 > 26)
FINAL ANSWER: 8
>>> Final answer: 8
8
Step 1: The ice cube displaces water equal to its own weight when it is floating. When it melts, the volume of water it produces is the same as the volume of water it displaced while floating. Therefore, the water level remains the same.

FINAL ANSWER: stay the same
>>> Final answer: stay the same
stay the same
Step 1: In a race, if you pass the person in second place, you take their position, which is second. Therefore, you are now in second place.

FINAL ANSWER: second
>>> Final answer: second
second

## 3) Minimal evaluator

We provide some example code to decide whether the agent outputs match the expected outputs, just to give you an idea of how evaluations can be done. You are free to use this code, or not.

In [20]:
# %% Simple normalization and evaluation helpers
def normalize_text(s: str) -> str:
    s = (s or "").strip().lower()
    # Replace punctuation (except hyphens and apostrophes) with spaces, then collapse whitespace
    s = re.sub(r"[^\w\s\-']", " ", s)
    s = re.sub(r"\s+", " ", s).strip()

    # Map common synonyms used in these tests
    synonyms = {
        "unchanged": "stay the same",
        "no change": "stay the same",
        "same": "stay the same",
        "second place": "second",
        "2nd": "second",
        "first place": "first",
        "third place": "third",
    }
    return synonyms.get(s, s)

def extract_number(s: str):
    # Returns first number occurrence as string if found, else None
    if not s:
        return None
    m = re.search(r"[-+]?\d+(\.\d+)?", s)
    return m.group(0) if m else None

def grade(expected: str, got: str, kind: str) -> bool:
    if kind == "numeric":
        exp_num = extract_number(expected)
        got_num = extract_number(got)
        return (exp_num is not None) and (got_num == exp_num)
    else:
        return normalize_text(got) == normalize_text(expected)

def evaluate_tests_with_agent(tests, tools, verbose=True):
    rows = []
    for t in tests:
        print(f"\n--- Testing: {t['id']} ---")
        
        # Use agent_loop instead of direct call
        got = agent_loop(t["prompt"], tools=tools, history=None)
        got = (got or "").strip()
        
        is_correct = grade(t["expected"], got, t["type"])
        rows.append({
            "id": t["id"],
            "expected": t["expected"],
            "got": got,
            "correct": is_correct,
        })
        
        # Small delay between tests
        time.sleep(0.2)

    # Print a summary report
    print("\n" + "="*50)
    print("RESULTS SUMMARY")
    print("="*50)
    correct = sum(1 for x in rows if x["correct"])
    print(f"Score: {correct}/{len(rows)} correct")
    for x in rows:
        mark = "✅" if x["correct"] else "❌"
        print(f"{mark} {x['id']}: expected={x['expected']!r}, got={x['got']!r}")
    return rows

# Run evaluation with agent loop
results = evaluate_tests_with_agent(tests, tools)


--- Testing: math_inequality ---
Step 1: USE TOOL: calculator(3*n + 5 > 26)
    Tool result: Error: name 'n' is not defined
Step 2: USE TOOL: python_runner
```python
# Solve for the smallest integer n such that 3n + 5 > 26
n = 1
while 3 * n + 5 <= 26:
    n += 1
n
```
    Tool result: {'n': 8}
Step 3: The smallest integer $ n $ that satisfies the inequality $ 3n + 5 > 26 $ is $ n = 8 $.

FINAL ANSWER: 8
>>> Final answer: 8

--- Testing: commonsense_ice ---
Step 1: The ice cube displaces water equal to its own weight when it is floating. When it melts, the volume of
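
The numeric branch of `grade` compares the extracted numbers as strings, so `'8'` and `'8.0'` would not match. The following is a minimal sketch of comparing by value instead. `numbers_match` is a hypothetical helper, and like `extract_number` it only looks at the first number in each string.

```python
import math
import re

def numbers_match(expected, got, rel_tol=1e-9):
    """Compare the first number found in each string by value, not by spelling."""
    pattern = r"[-+]?\d+(?:\.\d+)?"
    exp_match = re.search(pattern, expected or "")
    got_match = re.search(pattern, got or "")
    if exp_match is None or got_match is None:
        return False
    return math.isclose(float(exp_match.group(0)), float(got_match.group(0)), rel_tol=rel_tol)

print(numbers_match("8", "8.0"))    # same value, different spelling
print(numbers_match("8", "n = 8"))  # number embedded in text
print(numbers_match("8", "7"))      # different value
```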

In [21]:
def self_evaluate(question, prediction, expected_answer, model=MODEL):
    """
    Use the model itself as a strict grader.
    Returns True if the model says the prediction matches the expected answer; else False.
    Falls back to a simple normalized string compare if the model's reply is malformed.
    """
    import re

    system = "You are a strict grader. Reply with exactly True or False. No punctuation. No explanation."
    prompt = f"""You are grading a question-answer pair.

Return exactly True if the PREDICTION would be accepted as correct for the EXPECTED_ANSWER.
Otherwise, return False.

QUESTION:
{question}

PREDICTION:
{prediction}

EXPECTED_ANSWER:
{expected_answer}

Answer with exactly: True or False
"""

    r = call_model_chat_completions(
        prompt,
        system=system,
        model=model,
        temperature=0.0,
    )

    reply = (r.get("text") or "").strip().lower()
    if reply.startswith("true"):
        return True
    if reply.startswith("false"):
        return False

    # Fallback: simple normalization-based equality
    norm = lambda s: re.sub(r"\s+", " ", (s or "").strip().lower())
    return norm(prediction) == norm(expected_answer)


In [22]:
def self_evaluate_tests(tests, model=MODEL, grader_model=None, sleep_sec=0.2, verbose=True):
    """
    Run the tests by querying the model for each prompt, then use LLM-as-a-judge
    (self_evaluate) to determine correctness.

    Args:
        tests: list of dicts with keys: id, prompt, expected (and optionally type)
        model: model used to generate predictions
        grader_model: model used to judge correctness (defaults to `model` if None)
        sleep_sec: small delay between calls to be polite to the API
        verbose: if True, print a summary line per test

    Returns:
        rows: list of dicts with fields:
              id, expected, got, correct, status, error
    """
    import time

    judge_model = grader_model or model
    rows = []

    for t in tests:
        # 1) Get model prediction
        r = call_model_chat_completions(
            t["prompt"],
            system="You are a careful solver. Reply ONLY with the final answer, nothing else.",
            model=model,
            temperature=0.0,
        )
        got = (r.get("text") or "").strip()

        # 2) LLM-as-a-judge: strict True/False
        is_correct = self_evaluate(
            question=t["prompt"],
            prediction=got,
            expected_answer=t["expected"],
            model=judge_model,
        )

        row = {
            "id": t.get("id", "<unnamed>"),
            "expected": t["expected"],
            "got": got,
            "correct": bool(is_correct),
            "status": r.get("status"),
            "error": r.get("error"),
        }
        rows.append(row)

        if verbose:
            mark = "✅" if is_correct else "❌"
            print(f"{mark} {row['id']}: expected={row['expected']!r}, got={row['got']!r} (HTTP {row['status']})")
            if row["error"]:
                print("   error:", row["error"])

        if sleep_sec:
            time.sleep(sleep_sec)

    return rows

# Example:
results_llm_judge = self_evaluate_tests(tests, verbose=True, model=MODEL, grader_model=MODEL)


❌ math_inequality: expected='8', got='4' (HTTP 200)
✅ commonsense_ice: expected='stay the same', got='stay the same' (HTTP 200)
✅ logic_race: expected='second', got='second' (HTTP 200)
