# Prompt Engineering: Example Code Review and Grading

- Author: Martin Fockedey, with the help of Copilot

In this notebook, we progressively improve a task using prompt engineering techniques. The task is to review a student's code, assign a grade, and provide improvement advice.

We will compare outputs across seven techniques:

1. Clear and explicit instruction
2. Provide sufficient context (requirements + rubric)
3. Use a reviewer persona (supportive, rigorous CS teacher)
4. Structured Output for Automation
5. Few-shot examples
6. Breaking the task into subtasks with structured output
7. Concise rationale (CoT-style bullets)

## Environment Setup

Set up the environment.

[Note]
- You'll need a Mistral AI API key (see your Teams group channel).
- Store your API key in a `.env` file as `MISTRAL_API_KEY`.

In [19]:
# Imports and environment setup
import os
from dotenv import load_dotenv

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_mistralai import ChatMistralAI

load_dotenv(override=True)

if not os.getenv("MISTRAL_API_KEY"):
    print("Warning: MISTRAL_API_KEY is not set. Set it in your .env to run LLM calls.")

MODEL_NAME = os.getenv("MISTRAL_MODEL", "mistral-small")
TEMPERATURE = float(os.getenv("MISTRAL_TEMPERATURE", "0.3"))

llm = ChatMistralAI(model=MODEL_NAME, temperature=TEMPERATURE)
parser = StrOutputParser()

# Helper to run a template with variables

def run_chain(template: str, variables: dict):
    prompt = PromptTemplate.from_template(template)
    chain = prompt | llm | parser
    return chain.invoke(variables)

In [20]:
# Sample task context and student submission

task_description = (
    "Write a function `average(nums)` that returns the arithmetic mean "
    "of a list of numbers. If the list is empty, raise `ValueError`."
)

# A deliberately buggy student implementation
student_code = r"""
# Student code to review

def average(nums):
    total = 0
    for i in range(len(nums)):
        total += i  
    return total / len(nums)  
"""
# Two issues: missing empty-list handling, and the loop adds the index instead of the value
grading_rubric = (
    "Assess on: (1) Correctness (40%), (2) Robustness/Edge Cases (25%), "
    "(3) Code Quality/Clarity (20%), (4) Efficiency (10%), (5) Style (5%)."
)

print("Task and student code loaded.")

Task and student code loaded.
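
To make the two planted defects concrete before any prompt is sent, the short check below runs the student function directly. It is only an illustrative sketch: the function body is copied from `student_code` above, while the name `student_average` and the sample list are assumptions introduced here.

```python
# Copy of the student's implementation (renamed for this check; the name is hypothetical)
def student_average(nums):
    total = 0
    for i in range(len(nums)):
        total += i  # bug: adds the index, not nums[i]
    return total / len(nums)

sample = [10, 20, 30]  # illustrative input, not part of the assignment
print(student_average(sample))  # sums the indices 0 + 1 + 2, so prints 1.0 instead of 20.0

# The requirement asks for ValueError on an empty list; this code divides by zero instead
try:
    student_average([])
    empty_list_error = None
except Exception as exc:
    empty_list_error = type(exc).__name__
print(empty_list_error)  # ZeroDivisionError
```

A good review should report both behaviors: the wrong result on a non-empty list and the wrong exception type on an empty one.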


## 0) A Weak Prompt (What Not To Do)
This deliberately poor prompt shows common mistakes:
- Vague request ("fix stuff")
- No task requirement restatement
- No grading rubric or desired structure
- Ambiguous output format (mixes prose + code)
- No explicit keys for downstream parsing

Observe how the model output may be: incomplete, inconsistent, missing a grade, or providing superficial advice.

Goal: Use this as a baseline to compare improvements in Sections 1–7.

In [7]:
# Weak prompt example (DO NOT COPY AS BEST PRACTICE)
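weak_prompt_template = (
    "Here is some code. Fix it and say what you think.\n"  # vague instruction
    "{student_code}\n"  # raw code dumped without requirement
    "Give me feedback."  # no structure, no rubric
)

# Requires MISTRAL_API_KEY to be set (see Environment Setup).
# The code is passed as a template variable so that any braces in it are not parsed as placeholders.
print(run_chain(weak_prompt_template, {"student_code": student_code}))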


Here's the fixed code with explanations:

```python
def average(nums):
    if not nums:  # Handle empty list case
        return 0  # or raise an exception, depending on requirements
    total = 0
    for num in nums:  # Iterate through values, not indices
        total += num
    return total / len(nums)
```

Key issues fixed:
1. **Index vs Value**: The original code added the index `i` instead of the actual number `nums[i]`. I fixed this by iterating directly through the values using `for num in nums`.
2. **Empty List Handling**: The original code would raise a `ZeroDivisionError` if an empty list was passed. I added a check at the beginning to handle this case.
3. **Readability**: The fixed version is more Pythonic by iterating through values directly rather than using indices.

Additional considerations:
- You might want to return `None` or raise a `ValueError` instead of returning 0 for empty lists, depending on your requirements.
- For very large lists, you could use `sum(nums) /

## 1) Clear and Explicit Instruction
We start with a very clear instruction: what the assistant must deliver.

In [6]:
baseline_template = (
    "You are reviewing a student's code. You need to correct the code, assign a grade, and provide advice.\n"
    "Student code:\n{student_code}\n"
)

print(run_chain(baseline_template, {"student_code": student_code}))

Here's my review of the student's code:

### Corrected Code:
```python
def average(nums):
    if not nums:  # Handle empty list case
        return 0  # or raise an exception, depending on requirements
    total = 0
    for num in nums:  # Iterate through values, not indices
        total += num
    return total / len(nums)
```

### Grade: 6/10
- **Deductions:**
  - 2 points for the major bug (adding indices instead of values)
  - 1 point for missing empty list handling
  - 1 point for less Pythonic iteration (using indices when values are needed)

### Feedback:
1. **Main Issues:**
   - The code adds array indices (`i`) instead of the actual numbers (`nums[i]` or `num` in a direct iteration)
   - No handling for empty lists (which would cause division by zero)

2. **Improvements:**
   - Use direct iteration (`for num in nums`) instead of index-based access
   - Add input validation for empty lists
   - Consider using `sum(nums) / len(nums)` for a more Pythonic solution

3. **Additional

The code is corrected, but the grade is unlikely to be consistent from one call to the next.

## 2) Provide Sufficient Context
Add the assignment requirements, so that the model knows what was expected from the student, and a grading rubric. This helps the model judge correctness, prioritize what matters, and grade consistently.

In [9]:
context_template = (
    "You are reviewing a student's code for an assignment.\n" \
    "Assignment requirement: {task_description}\n" \
    "Grading rubric: {grading_rubric}\n" \
    "Deliverables: (1) corrected_code, (2) grade (0-100), (3) advice (array).\n" \
    "Student code:\n{student_code}\n"
)

print(run_chain(context_template, {
    "task_description": task_description,
    "grading_rubric": grading_rubric,
    "student_code": student_code
}))

Here's the review of the student's code:

### (1) Corrected Code:
```python
def average(nums):
    if not nums:  # Check for empty list
        raise ValueError("Cannot compute average of an empty list")
    total = 0
    for num in nums:  # Iterate over values, not indices
        total += num
    return total / len(nums)
```

### (2) Grade: 50/100
- **Correctness (40%)**: 0/40 (Major bug: adds indices instead of values, no empty list handling)
- **Robustness/Edge Cases (25%)**: 0/25 (Fails on empty list, no other edge case handling)
- **Code Quality/Clarity (20%)**: 10/20 (Simple structure but critical logical error)
- **Efficiency (10%)**: 10/10 (O(n) time is optimal for this problem)
- **Style (5%)**: 5/5 (Good variable naming, but buggy logic)

### (3) Advice:
1. **Correct the logical error**: The loop should iterate over the values (`for num in nums`) rather than indices.
2. **Handle empty list**: Add a check at the start to raise `ValueError` if the list is empty.
3. **Consider 

The result is quite good, but the tone is somewhat cold and the feedback is not very pedagogical.
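
The rubric breakdown in the response above can also be checked with simple arithmetic. The sketch below copies the five per-criterion scores and the reported grade from that output (the variable names are assumptions made here) and compares their sum with the reported total.

```python
# Scores copied from the model response above; variable names are illustrative
reported_grade = 50
rubric_scores = {
    "correctness": 0,   # out of 40
    "robustness": 0,    # out of 25
    "quality": 10,      # out of 20
    "efficiency": 10,   # out of 10
    "style": 5,         # out of 5
}

computed_grade = sum(rubric_scores.values())
print(f"reported={reported_grade}, computed={computed_grade}")
print("consistent" if computed_grade == reported_grade else "inconsistent")
```

The breakdown adds up to 25 rather than the reported 50, which is one reason to ask for machine-checkable output, as Section 4 does.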

## 3) Use a Persona
Adopt a specific reviewer persona (e.g., a supportive but rigorous CS teacher). Personae guide tone, structure, and priorities.

In [15]:
persona_template = (
    "Act as a supportive but rigorous computer science teacher.\n" \
    "Your tone should be constructive, specific, and pedagogical.\n" \
    "You are reviewing a student's code for an assignment.\n" \
    "Assignment requirement: {task_description}\n" \
    "Grading rubric: {grading_rubric}\n" \
    "Deliverables: corrected_code, grade (0-100), advice (array).\n" \
    "\n" \
    "Student code:\n{student_code}\n"
)

print(run_chain(persona_template, {
    "task_description": task_description,
    "grading_rubric": grading_rubric,
    "student_code": student_code
}))


### **Code Review & Feedback**

#### **1. Correctness (0/40)**
The function does not correctly compute the arithmetic mean. The main issue is that it sums the **indices** (`i`) instead of the **values** (`nums[i]`). Additionally, it fails to handle the empty-list case, which should raise a `ValueError`.

#### **2. Robustness/Edge Cases (0/25)**
- **Empty list**: No error is raised when the input list is empty, which violates the assignment requirements.
- **Non-numeric inputs**: The function does not check if the input list contains non-numeric values (e.g., strings), which could lead to runtime errors.

#### **3. Code Quality/Clarity (10/20)**
- The variable name `total` is clear, but the loop logic is incorrect.
- The function lacks comments or docstrings, which would improve readability.

#### **4. Efficiency (10/10)**
The function is **O(n)** in time complexity, which is optimal for this task. No further optimization is needed.

#### **5. Style (3/5)**
- The function name `average`

## 4) Structured Output for Automation
To enable automated processing (e.g., collecting points per rubric), require a strict JSON structure. This reduces ambiguity and makes results machine-usable.

Requested JSON shape:
- corrected_code: string
- grade: number (0–100)
- scores: object with keys {correctness (0–40), robustness (0–25), quality (0–20), efficiency (0–10), style (0–5)}
- advice: array of short strings

Rules:
- Return a single JSON object only, with no extra prose.
- Ensure each score is an integer within its cap; grade equals the sum of scores.

In [17]:
json_template = (
    "Act as a supportive but rigorous computer science teacher.\n"
    "You are reviewing a student's code for an assignment.\n"
    "Assignment requirement: {task_description}\n"
    "Grading rubric: {grading_rubric}\n"
    "Return ONLY a JSON object with these keys:\n"
    "- corrected_code: string containing the fixed code\n"
    "- grade: number (0-100), must equal sum of all scores\n"
    "- scores: object with correctness (0-40), robustness (0-25), quality (0-20), efficiency (0-10), style (0-5)\n"
    "- advice: array of short, actionable improvement suggestions\n"
    "Constraints: grade MUST equal correctness+robustness+quality+efficiency+style. No extra text outside the JSON.\n\n"
    "Student code:\n{student_code}\n"
)

print(run_chain(json_template, {
    "task_description": task_description,
    "grading_rubric": grading_rubric,
    "student_code": student_code
}))

```json
{
  "corrected_code": "def average(nums):\n    if not nums:\n        raise ValueError(\"List cannot be empty\")\n    return sum(nums) / len(nums)",
  "grade": 65,
  "scores": {
    "correctness": 0,
    "robustness": 25,
    "quality": 20,
    "efficiency": 10,
    "style": 10
  },
  "advice": [
    "Fix the bug where you're adding indices instead of values",
    "Add empty list handling to raise ValueError",
    "Consider using built-in sum() for cleaner code",
    "Add docstring to explain function purpose",
    "Use more descriptive variable names if needed"
  ]
}
```
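
Because this response is meant for automated processing, it is worth validating before use. The following is a minimal sketch using only the standard library, assuming the model wraps the JSON in a Markdown code fence as in the output above. `parse_review` and `validate_review` are hypothetical helper names, and the sample payload is an abridged copy of that output (only the first advice item is kept).

```python
import json

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example easy to embed

# Caps taken from the rubric weights stated in the prompt
SCORE_CAPS = {"correctness": 40, "robustness": 25, "quality": 20, "efficiency": 10, "style": 5}

def parse_review(raw: str) -> dict:
    """Strip an optional Markdown code fence and parse the JSON payload."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1]     # drop the opening fence line
        text = text.rsplit(FENCE, 1)[0]   # drop the closing fence
    return json.loads(text)

def validate_review(review: dict) -> list:
    """Return a list of rule violations (empty when the review is valid)."""
    problems = []
    scores = review.get("scores", {})
    for key, cap in SCORE_CAPS.items():
        value = scores.get(key)
        if not isinstance(value, int) or not 0 <= value <= cap:
            problems.append(f"{key}={value!r} is outside 0-{cap}")
    if review.get("grade") != sum(v for v in scores.values() if isinstance(v, int)):
        problems.append("grade does not equal the sum of scores")
    return problems

# Abridged copy of the model output shown above
model_payload = {
    "corrected_code": "def average(nums):\n    if not nums:\n        raise ValueError(\"List cannot be empty\")\n    return sum(nums) / len(nums)",
    "grade": 65,
    "scores": {"correctness": 0, "robustness": 25, "quality": 20, "efficiency": 10, "style": 10},
    "advice": ["Fix the bug where you're adding indices instead of values"],
}
raw_output = f"{FENCE}json\n{json.dumps(model_payload, indent=2)}\n{FENCE}"

review = parse_review(raw_output)
print(validate_review(review))
```

On this output the sum rule holds (the scores add up to 65), but `style` is 10 against a cap of 5, so the constraints stated in the prompt were only partly respected.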


## 5) Examples and Few-Shot Learning
Provide one or two compact examples of the desired input-output behavior. Examples calibrate style, structure, and depth.

In [22]:
fewshot_template = (
    "Act as a supportive but rigorous computer science teacher.\n"
    "You are reviewing a student's code for an assignment.\n"
    "Assignment requirement: {task_description}\n"
    "Grading rubric: {grading_rubric}\n"
    "Deliverables: corrected_code, grade (0-100), advice (array). Return JSON.\n"
    "\n"
    "Example 1 - Input code:\n"
    "def add(a,b):\n    return a-b\n"
    "Example 1 - Desired JSON output:\n"
    "{{\n"
    "  \"corrected_code\": \"def add(a, b):\\n    return a + b\",\n"
    "  \"grade\": 70,\n"
    "  \"advice\": [\"Add docstrings\", \"Use spaces around operators\"]\n"
    "}}\n"
    "\n"
    "Example 2 - Input code:\n"
    "def divide(a,b):\n    return a/b\n"
    "Example 2 - Desired JSON output:\n"
    "{{\n"
    "  \"corrected_code\": \"def divide(a, b):\\n    if b == 0:\\n        raise ZeroDivisionError('division by zero')\\n    return a / b\",\n"
    "  \"grade\": 85,\n"
    "  \"advice\": [\"Handle division by zero\", \"Add type hints\"]\n"
    "}}\n"
    "\n"
    "Now review this Student code and return JSON as above:\n{student_code}\n"
)

print(run_chain(fewshot_template, {
    "task_description": task_description,
    "grading_rubric": grading_rubric,
    "student_code": student_code
}))


```json
{
  "corrected_code": "def average(nums):\n    if not nums:\n        raise ValueError('Cannot compute average of an empty list')\n    total = 0\n    for num in nums:\n        total += num\n    return total / len(nums)",
  "grade": 60,
  "advice": [
    "Handle empty list case by raising ValueError",
    "Use the list elements directly in the loop instead of indices",
    "Add docstring to explain the function's purpose and parameters",
    "Consider adding type hints for better code clarity",
    "Use meaningful variable names (e.g., 'total' is fine, but 'num' could be more descriptive)"
  ]
}
```
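
Writing few-shot examples by hand means escaping every literal brace as `{{` and `}}`, since single braces are read as template variables. One way to reduce that risk is to generate the example text from Python data. The sketch below uses only the standard library, with `str.format` as a stand-in for the template engine (an assumption: the default f-string style of `PromptTemplate` treats doubled braces the same way). `escape_braces` and `render_example` are hypothetical helpers.

```python
import json

def escape_braces(text: str) -> str:
    """Double literal braces so a format-style template leaves them untouched."""
    return text.replace("{", "{{").replace("}", "}}")

def render_example(index: int, input_code: str, expected: dict) -> str:
    """Build one few-shot example: the input code followed by the desired JSON output."""
    expected_json = json.dumps(expected, indent=2)
    return (
        f"Example {index} - Input code:\n{escape_braces(input_code)}\n"
        f"Example {index} - Desired JSON output:\n{escape_braces(expected_json)}\n"
    )

examples = render_example(
    1,
    "def add(a,b):\n    return a-b",
    {
        "corrected_code": "def add(a, b):\n    return a + b",
        "grade": 70,
        "advice": ["Add docstrings", "Use spaces around operators"],
    },
)

template = examples + "\nNow review this Student code and return JSON as above:\n{student_code}\n"

# str.format stands in for the template engine here
prompt = template.format(student_code="def average(nums): ...")
print(prompt)
```

Generating the examples this way also keeps their JSON shape identical to what a downstream parser expects, since both come from the same dictionary structure.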


## 6) Breaking the Task into Subtasks
Explicitly break the task into steps and ask for a structured, multi-part result. This reduces ambiguity and improves coverage of requirements.

In [24]:
subtasks_template = (
    "Role: supportive, rigorous CS teacher.\n" \
    "Requirement: {task_description}\n" \
    "Rubric: {grading_rubric}\n" \
    "Follow these subtasks:\n" \
    "1) Validate the implementation against the requirement (incl. empty list).\n" \
    "2) List specific defects you find (line- or behavior-based).\n" \
    "3) Provide corrected_code that fully meets the requirement.\n" \
    "4) Assign a grade (0-100) using the rubric.\n" \
    "5) Provide actionable advice as an array of short, concrete suggestions.\n" \
    "Output your reasoning for each task and at the end give a JSON with keys: checks (array), bugs (array), corrected_code (string), grade (number), advice (array).\n" \
    "\n" \
    "Student code:\n{student_code}\n"
)

print(run_chain(subtasks_template, {
    "task_description": task_description,
    "grading_rubric": grading_rubric,
    "student_code": student_code
}))


### Task 1: Validate the implementation against the requirement
The student's function `average(nums)` is intended to compute the arithmetic mean of a list of numbers. However, it has several issues:
1. **Correctness (40%)**: The function incorrectly sums the indices (`i`) of the list elements instead of the elements themselves (`nums[i]`). This will always return a wrong result.
2. **Robustness/Edge Cases (25%)**: The function does not handle the empty list case as required. It will raise a `ZeroDivisionError` instead of a `ValueError`.
3. **Code Quality/Clarity (20%)**: The code is not clear because it uses `i` (an index) instead of the actual list elements. The variable name `total` is misleading since it doesn't represent the sum of the numbers.
4. **Efficiency (10%)**: The function is inefficient because it uses `range(len(nums))` to iterate, which is less Pythonic than directly iterating over the list.
5. **Style (5%)**: The style is not ideal due to the incorrect logic and lack 
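
When reasoning and JSON are returned in the same response, the JSON has to be located before it can be parsed. A minimal sketch using `json.JSONDecoder.raw_decode` from the standard library is shown below. `extract_last_json` is a hypothetical helper, and the sample response is made up for illustration because the real output above is truncated.

```python
import json

def extract_last_json(text: str):
    """Return the last parseable JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    result = None
    position = 0
    while True:
        start = text.find("{", position)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            position = start + 1  # not valid JSON here, keep scanning
            continue
        result = obj
        position = end
    return result

# Made-up response for illustration: prose (with a stray brace pair) followed by the final JSON
sample_response = (
    "### Task 1: Validate the implementation\n"
    "The loop adds {index} values rather than list elements.\n"
    "### Final JSON\n"
    '{"checks": ["empty list"], "bugs": ["sums indices"], '
    '"corrected_code": "def average(nums): ...", "grade": 30, "advice": ["iterate over values"]}\n'
)

final = extract_last_json(sample_response)
print(final["bugs"])
```

If the response is cut off before the JSON is complete, as in the output above, the helper returns `None`, which gives the calling code a clear signal to retry or to raise the output length limit.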

## 7) CoT Reasoning
Chain-of-thought (CoT) prompting asks the LLM to reason before answering. This often improves performance, because a direct guess is replaced by a longer step-by-step derivation. It also improves explainability and transparency, because the response shows why the LLM reached its answer and can therefore be audited.

CoT is nowadays usually built into models during the final training phase, so it is generally no longer necessary to request it explicitly in the prompt. You can still observe the difference by comparing the default response with one where the model is asked not to give any reasoning.
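
The comparison in the next cell uses a classic two-trains problem (departures at 8:00 AM and 9:00 AM, speeds of 60 km/h and 80 km/h, stations 300 km apart). As a reference for judging the model's answers, the expected meeting time can be worked out directly. The sketch below is plain arithmetic with illustrative variable names, and the calendar date is an arbitrary placeholder.

```python
from datetime import datetime, timedelta

distance_km = 300
speed_a_kmh = 60   # train from Station A, departs at 8:00 AM
speed_b_kmh = 80   # train from Station B, departs at 9:00 AM

# Train A travels alone for one hour before train B departs
head_start_km = speed_a_kmh * 1
remaining_km = distance_km - head_start_km   # 240 km left at 9:00 AM

# From 9:00 AM the trains close the gap at the sum of their speeds
hours_after_nine = remaining_km / (speed_a_kmh + speed_b_kmh)   # 240 / 140

meeting_time = datetime(2000, 1, 1, 9, 0) + timedelta(hours=hours_after_nine)
print(meeting_time.strftime("%H:%M"))  # 10:42 (about 10:43 when rounded to the nearest minute)
```

Any model answer can then be compared with this reference value of roughly 10:43 AM.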

In [None]:
train_problem = (
    "A train leaves Station A at 8:00 AM traveling at 60 km/h. Another train leaves Station B at 9:00 AM "
    "traveling at 80 km/h toward Station A. If the stations are 300 km apart, at what time will the two trains meet? "
)

print("Without the chain of thought reasoning:")
no_cot_prompt = train_problem + "Don't think step-by-step and give no reasoning. Just give the response."
print(run_chain(no_cot_prompt, {}))

print("With the chain of thought reasoning:")
cot_prompt = train_problem  # no extra instruction: the model reasons by default
print(run_chain(cot_prompt, {}))


Without the chain of thought reasoning:
10:30 AM.
With the chain of thought reasoning:
Alright, let's tackle this problem step by step. I'm going to think about the two trains moving towards each other and figure out when and where they'll meet.

### Understanding the Problem

We have two stations, Station A and Station B, which are 300 kilometers apart.

- **Train 1**: Leaves Station A at 8:00 AM, traveling at 60 km/h towards Station B.
- **Train 2**: Leaves Station B at 9:00 AM, traveling at 80 km/h towards Station A.

We need to determine the time at which these two trains will meet each other.

### Visualizing the Scenario

First, let's visualize the situation:

- At 8:00 AM:
  - Train 1 starts from Station A towards Station B at 60 km/h.
  - Train 2 is still at Station B; it hasn't left yet.

- From 8:00 AM to 9:00 AM (one hour):
  - Train 1 is moving towards Station B.
  - Train 2 is still at Station B.

At 9:00 AM:
- Train 1 has bee