# Code-based evaluation

Code-based grading, often referred to as “unit testing,” “heuristic-based evaluation,” “rule-based evaluation,” or “programmatic evaluation,” relies on predefined code—typically using string matching, regular expressions, or other heuristics—to assess model outputs. This approach is ideal in scenarios where exact matches or specific key phrases define correctness, as it’s both fast and reliable.

### Steps:

- **Define Unit Objectives**: Break down what you want to evaluate into specific, testable objectives. Important: each objective must be expressible in code.

- **Implement Code Checks**: Write code that verifies whether the model’s output meets each objective.

- **Iterate and Refine**: Continuously improve your evaluation criteria and code based on the model’s performance and edge cases.
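As a minimal sketch of the three steps, assume a hypothetical objective: a support reply must mention an order ID of the form `ORD-` plus six digits and must not apologize. The names `has_order_id`, `no_apology`, and `run_checks` are illustrative and not part of any library.

```python
import re

# Step 1: objectives, each phrased so that code can decide pass/fail.
#   - the reply mentions an order ID such as "ORD-123456"
#   - the reply does not contain an apology

# Step 2: one small check per objective.
def has_order_id(output: str) -> bool:
    return re.search(r"\bORD-\d{6}\b", output) is not None

def no_apology(output: str) -> bool:
    return re.search(r"\b(sorry|apologi[sz]e)\b", output, re.IGNORECASE) is None

CHECKS = {"has_order_id": has_order_id, "no_apology": no_apology}

def run_checks(output: str) -> dict:
    """Run every check and return a name -> bool mapping."""
    return {name: check(output) for name, check in CHECKS.items()}

# Step 3: inspect failures on real outputs, then tighten or relax the checks.
print(run_checks("Your order ORD-123456 has shipped."))
print(run_checks("Sorry, I could not find that order."))
```

Keeping each objective in its own function makes it easy to add, remove, or adjust a single check without touching the others.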

### Tips:

- **Start Here**: Code-based evaluation is a great starting point for evaluating your LLM application. It’s straightforward, modular, and allows for quick feedback.

- **Refine Your Criteria**: This process often reveals limitations in your evaluation criteria, helping you think critically about what constitutes a “good” response.

- **Keep It Simple**: Focus on keeping evaluations simple and modular, which will make them easier to maintain.

- **Integrate with CI/CD**: These unit tests can seamlessly fit into your CI/CD pipeline or act as guardrails, ensuring your application’s outputs meet basic standards before deployment.
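One way to wire such checks into CI is to write them as ordinary test functions. The sketch below assumes pytest-style discovery (functions named `test_*`) and uses a hard-coded sample in place of a real model call; `SAMPLE_OUTPUT`, `header_is_short`, and `guarded` are hypothetical names.

```python
SAMPLE_OUTPUT = "feat(auth): add token refresh endpoint"

def header_is_short(output: str, limit: int = 72) -> bool:
    """Pass when the first line exists and fits within `limit` characters."""
    lines = output.splitlines()
    return bool(lines) and len(lines[0]) <= limit

# CI use: a test runner collects these functions and fails the build on an assert.
def test_header_is_short():
    assert header_is_short(SAMPLE_OUTPUT)

def test_long_header_is_rejected():
    assert not header_is_short("x" * 100)

# Guardrail use: return the output only when it passes, else a fallback.
def guarded(output: str, fallback: str = "chore: update files") -> str:
    return output if header_is_short(output) else fallback
```

The same check serves both roles: as a test it blocks a deployment, and as a guardrail it replaces a bad output at run time.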

## Setup

Run the code cells below to set up your Colab notebook.

In [1]:
!pip install -qq google-generativeai weave

In [2]:
!git clone https://github.com/wandb/eval-course

Cloning into 'eval-course'...
remote: Enumerating objects: 94, done.
remote: Counting objects: 100% (94/94), done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 94 (delta 46), reused 68 (delta 25), pack-reused 0 (from 0)
Receiving objects: 100% (94/94), 1.06 MiB | 14.85 MiB/s, done.
Resolving deltas: 100% (46/46), done.


In [3]:
import sys
sys.path.append("/content/eval-course/notebooks/utils/")

import os
import re
import getpass
import weave
import pandas as pd

# utility script
from llm_client import LLMClient

import nest_asyncio
nest_asyncio.apply()

In [4]:
import google.generativeai as genai

os.environ["GOOGLE_API_KEY"] = getpass.getpass("Please enter your GOOGLE API KEY with Gemini access: ")

Please enter your GOOGLE API KEY with Gemini access: ··········


In [5]:
# initialize weave for tracing and evaluation
weave_client = weave.init(project_name="eval-course/eval-course")

Please login to Weights & Biases (https://wandb.ai/) to continue:

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: ··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

Logged in as Weights & Biases user: ayut.
View Weave data at https://wandb.ai/eval-course/eval-course/weave


## Generate Commit Messages from Code Diffs

Imagine you’re working on a project with multiple engineers actively contributing to the same codebase. In a high-velocity environment like this, it’s crucial to maintain clear, informative commit messages to document code changes. Proper commit messages help track code evolution, make debugging easier, and support knowledge transfer across team members.

In this example, **we’ll explore using LLMs to automatically generate commit messages based on code diffs**. Automating this process can save time and maintain consistency, but *it’s essential that the generated commit messages meet certain standards*.

We’ll start by generating commit messages for a sample code diff. Then, we’ll demonstrate how to use code-based evaluation to assess whether these messages meet our standards, using simple checks to ensure quality and relevance.

### Part 1: Commit Generator Application

In [6]:
MODEL = "gemini-1.5-flash-002"
MODEL_CLIENT = "gemini"

In [15]:
class CommitMessageGenerator(weave.Model):
    model: LLMClient = LLMClient(model_name=MODEL, client_type=MODEL_CLIENT)
    prompt_template: str = """Generate a clear and descriptive commit message for the following code changes.
    Format the commit message in the conventional commits style:
    <type>(<scope>): <description>

    [optional body]

    Code diff:
    {code_diff}

    Focus on:
    - What changed?
    - Why it changed?
    - Any breaking changes
    """

    @weave.op()
    def predict(self, diff: str) -> str:
        prompt = self.prompt_template.format(code_diff=diff)
        response = self.model.predict(user_prompt=prompt)
        return response.strip()

In [16]:
diff_example_1 = """
diff --git a/src/auth.py b/src/auth.py
index abc123..def456 100644
--- a/src/auth.py
+++ b/src/auth.py
@@ -10,6 +10,12 @@ class AuthManager:
     def validate_token(self, token):
         return self.jwt.decode(token, self.secret_key)

+    def refresh_token(self, old_token):
+        if not self.validate_token(old_token):
+            raise InvalidTokenError
+        user_data = self.jwt.decode(old_token)
+        return self.generate_token(user_data)
"""

commit_msg_generator = CommitMessageGenerator()
commit_msg_1 = commit_msg_generator.predict(diff_example_1)
print(commit_msg_1)

🍩 https://wandb.ai/eval-course/eval-course/r/call/0192fcdd-ee97-7d63-a585-e412b379af14
```
feat(auth): Add token refresh functionality

Adds a `refresh_token` method to the `AuthManager` class. This allows clients to obtain a new JWT by providing a valid existing token.  This improves the user experience by extending session lifetimes without requiring repeated logins.  The new method validates the old token; if invalid, it raises an `InvalidTokenError`.
```
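The printed message above appears inside a Markdown code fence. If the model itself emits that fence (an assumption; it could also be a rendering artifact), any check that reads the first line would see the fence rather than the commit header. A minimal sketch of a normalization step, using a hypothetical helper `strip_code_fence`:

```python
import re

FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example

# Opening fence with an optional language tag, the wrapped body, closing fence.
_FENCED = re.compile(r"^`{3}[\w-]*[ \t]*\n(.*?)\n?`{3}\s*$", re.DOTALL)

def strip_code_fence(text: str) -> str:
    """Return the body of a fully fenced string; otherwise return it stripped."""
    stripped = text.strip()
    match = _FENCED.match(stripped)
    return match.group(1).strip() if match else stripped

raw = f"{FENCE}\nfeat(auth): Add token refresh functionality\n\nBody text.\n{FENCE}"
print(strip_code_fence(raw).split("\n")[0])
```

Such a step could run at the end of `predict` or at the start of each check; which placement is better depends on whether the fence is considered part of the model's answer.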


### Part 2: Code-based evaluation

In this section, we will define a few objective criteria and write programmatic functions (no LLMs involved) to evaluate the quality of the commit messages.

At a high level, a good commit message should:

- Summarize the changes accurately and concisely.

- Highlight key functions, methods, or modules affected.

- Be free of unnecessary information or “fluff.”

Below, we convert these high-level concepts into unit objectives.

Do these objectives capture the full extent of the "quality" of the generated commit messages? In this case, they do not.

The main selling point is speed: a few criteria can be written as functions quickly, and they run quickly.

In [10]:
# Define objectives as functions

# @weave.op()
def follows_conventional_format(model_output: str) -> bool:
    """Check if commit message follows conventional commit format"""
    conv_commit_pattern = r'^(feat|fix|perf|refactor|style|test|docs|build|ci|chore)(\([a-z-]+\))?: .+'
    return bool(re.match(conv_commit_pattern, model_output.split('\n')[0]))


# @weave.op()
def length_appropriate(model_output: str) -> bool:
    """Check if the first line of the commit message is between 10 and 72 characters long"""
    first_line = model_output.split('\n')[0]
    return 10 <= len(first_line) <= 72


# @weave.op()
def contains_key_components(model_output: str) -> bool:
    """Check if commit message contains key components (what and why)"""
    return (
        any(word in model_output.lower() for word in ["add", "update", "fix", "remove", "implement"]) and
        ("to" in model_output.lower() or "for" in model_output.lower() or "because" in model_output.lower())
    )


# @weave.op()
def no_generic_terms(model_output: str) -> bool:
    """Check if commit message avoids generic terms"""
    generic_terms = ["stuff", "things", "updated", "fixed", "changed"]
    return not any(term in model_output.lower() for term in generic_terms)


# @weave.op()
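def has_imperative_mood(model_output: str) -> bool:
    """Check if commit message uses imperative mood (starts with verb)"""
    # Guard against an empty first line, which would otherwise raise IndexError.
    words = model_output.split('\n')[0].split()
    if not words:
        return False
    # Note: in a conventional-commit header the first word is the type prefix, e.g. "feat(auth):".
    first_word = words[0].lower()
    imperative_verbs = ["add", "update", "fix", "remove", "implement", "change", "refactor", "optimize", "delete", "create"]
    return first_word in imperative_verbs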


# @weave.op()
def has_proper_capitalization(model_output: str) -> bool:
    """Check if commit message follows proper capitalization (first letter capitalized, no period)"""
    first_line = model_output.split('\n')[0]
    # bool(first_line) guards against an empty first line, where first_line[0] would raise IndexError.
    return (bool(first_line) and first_line[0].isupper() and
            not first_line.endswith('.'))


# @weave.op()
def has_scope_if_needed(model_output: str) -> bool:
    """Check if commit message includes scope when appropriate"""
    first_line = model_output.split('\n')[0]
    type_with_scope = r'^(feat|fix|refactor)\([a-z-]+\): '
    type_without_scope = r'^(docs|test|style|chore): '
    return bool(re.match(type_with_scope, first_line) or re.match(type_without_scope, first_line))


# @weave.op()
def has_detailed_body_if_complex(model_output: str) -> bool:
    """Check if commit message has detailed body for complex changes"""
    lines = model_output.split('\n')
    # Complex changes indicated by certain keywords
    complex_indicators = ["refactor", "breaking", "deprecate", "remove", "!:"]
    is_complex = any(indicator in lines[0].lower() for indicator in complex_indicators)

    if is_complex:
        # Should have at least one line of body text after blank line
        return len(lines) >= 3 and lines[1].strip() == "" and any(line.strip() for line in lines[2:])
    return True
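Two properties of the checks above are worth knowing. `has_imperative_mood` and `has_proper_capitalization` inspect the very start of the first line, which in a header such as `feat(auth): Add ...` is the lowercase type prefix. The keyword checks use substring matching, so `"to"` also matches inside `"token"`. The sketch below shows one possible alternative, not what produced the results in this notebook; `split_header`, `description_is_imperative`, and `contains_word` are hypothetical helpers.

```python
import re
from typing import Optional, Tuple

HEADER = re.compile(r"^(?P<type>[a-z]+)(\((?P<scope>[a-z-]+)\))?(?P<bang>!)?: (?P<desc>.+)$")
IMPERATIVE_VERBS = {"add", "update", "fix", "remove", "implement",
                    "change", "refactor", "optimize", "delete", "create"}

def split_header(message: str) -> Optional[Tuple[str, Optional[str], str]]:
    """Return (type, scope, description) for a conventional header, else None."""
    lines = message.strip().splitlines()
    if not lines:
        return None
    match = HEADER.match(lines[0])
    if match is None:
        return None
    return match.group("type"), match.group("scope"), match.group("desc")

def description_is_imperative(message: str) -> bool:
    """Check the first word of the description, not the type prefix."""
    parts = split_header(message)
    if parts is None:
        return False
    words = parts[2].split()
    return bool(words) and words[0].lower() in IMPERATIVE_VERBS

def contains_word(text: str, word: str) -> bool:
    """Whole-word, case-insensitive match."""
    return re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) is not None

msg = "feat(auth): Add token refresh functionality"
print(split_header(msg))
print(description_is_imperative(msg))
# Whole-word match versus the substring test used in the checks above.
print(contains_word(msg, "to"), "to" in msg.lower())
```

Parsing the header once and reusing its parts keeps the individual checks short, at the cost of coupling them to one header grammar.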

I have synthetically generated a dataset of code diffs. Let's load it and see what it looks like.

In practice, you can build these diffs from your existing code.
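One way to build such a dataset from an existing repository is to read each commit's patch with `git show`. The sketch below is an assumption about workflow, not how the dataset in this notebook was produced; `collect_diffs` and `to_rows` are hypothetical helpers, and the `{"diff": ...}` row shape mirrors the `diff` field read from the dataset in the next cell.

```python
import subprocess
from typing import Dict, List

def collect_diffs(repo_path: str, max_commits: int = 10) -> List[str]:
    """Return the patch text of the most recent commits in `repo_path`."""
    shas = subprocess.run(
        ["git", "-C", repo_path, "log", f"--max-count={max_commits}", "--pretty=format:%H"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    diffs = []
    for sha in shas:
        # An empty --pretty format drops the commit header and leaves the patch.
        patch = subprocess.run(
            ["git", "-C", repo_path, "show", "--pretty=format:", sha],
            capture_output=True, text=True, check=True,
        ).stdout
        diffs.append(patch)
    return diffs

def to_rows(diffs: List[str], max_chars: int = 8000) -> List[Dict[str, str]]:
    """Drop empty or oversized diffs and wrap the rest as {"diff": ...} rows."""
    return [{"diff": d.strip()} for d in diffs if d.strip() and len(d) <= max_chars]

# Example with in-memory diffs; inside a repository, use to_rows(collect_diffs(".")).
print(to_rows(["diff --git a/a.py b/a.py\n+x = 1\n", "   ", "y" * 9000]))
```

The size cap is an arbitrary illustration; very large diffs may need truncation or summarization before they fit in a prompt.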

In [12]:
code_diffs_dataset = weave.ref('weave:///eval-course/eval-course-dev/object/code-diffs:JJTbwBlIr6YqYARd7Yt3epxHkYLwXf7u5YxiYy2vJ7w').get()
print("Total number of samples: ", len(code_diffs_dataset.rows))

print(code_diffs_dataset.rows[0]["diff"], sep="\n")

Total number of samples:  10

diff --git a/src/user.py b/src/user.py
index abc123..def456 100644
--- a/src/user.py
+++ b/src/user.py
@@ -10,6 +10,9 @@ class User:
     def get_name(self):
         return self.name
 
+    def get_email(self):
+        return self.email
+



Note that we do not need "gold" standard commit messages here. We have the user input in the form of code diffs, and we evaluate the quality of the LLM-generated commit messages directly with the criteria defined above. This is one of the main advantages of code-based evaluation.

Below, I collect all the code-based criteria under one `Scorer`. The `summarize` method runs at the end of the scoring process to aggregate the scores. If you don't write this method, `auto_summarize` is called by default. The example below shows how to structure your code evaluation logic along with custom aggregation logic.

In [17]:
from typing import Optional
from weave import Scorer


class CodeDiffScorer(Scorer):
    @weave.op()
    def score(self, model_output: str) -> dict:
        result = {
            "follows_conventional_format": follows_conventional_format(model_output),
            "length_appropriate": length_appropriate(model_output),
            "contains_key_components": contains_key_components(model_output),
            "no_generic_terms": no_generic_terms(model_output),
            "has_imperative_mood": has_imperative_mood(model_output),
            "has_proper_capitalization": has_proper_capitalization(model_output),
            "has_scope_if_needed": has_scope_if_needed(model_output),
            "has_detailed_body_if_complex": has_detailed_body_if_complex(model_output),
        }
        return result

    @weave.op()
    def summarize(self, score_rows: list) -> Optional[dict]:
        if not score_rows:
            return None

        # Initialize counters for each metric with weight 1
        metrics = {
            'follows_conventional_format': {'weight': 1, 'count': 0},
            'length_appropriate': {'weight': 1, 'count': 0},
            'contains_key_components': {'weight': 1, 'count': 0},
            'no_generic_terms': {'weight': 1, 'count': 0},
            'has_imperative_mood': {'weight': 1, 'count': 0},
            'has_proper_capitalization': {'weight': 1, 'count': 0},
            'has_scope_if_needed': {'weight': 1, 'count': 0},
            'has_detailed_body_if_complex': {'weight': 1, 'count': 0}
        }

        # Sum up scores for each metric
        total = len(score_rows)
        for row in score_rows:
            for metric in metrics:
                if row[metric]:
                    metrics[metric]['count'] += 1

        # Calculate weighted average score
        weighted_sum = sum(
            (metrics[metric]['count'] / total) * metrics[metric]['weight']
            for metric in metrics
        )
        total_weights = sum(metrics[metric]['weight'] for metric in metrics)
        code_eval_score = weighted_sum / total_weights

        summary = {'code_eval_score': code_eval_score}

        return summary

code_evaluator = CodeDiffScorer()

Let's run the evaluation.

In [18]:
import asyncio
from weave import Evaluation

# Create evaluation
evaluation = Evaluation(
    dataset=code_diffs_dataset,
    scorers=[code_evaluator]
)

# Run evaluation
asyncio.run(evaluation.evaluate(CommitMessageGenerator()))

🍩 https://wandb.ai/eval-course/eval-course/r/call/0192fcde-1c50-71a0-a970-fbf8d2c9ecc2


{'CodeDiffScorer': {'code_eval_score': 0.325},
 'model_latency': {'mean': 9.362474131584168}}
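The single `code_eval_score` averages all eight checks with equal weight, so it does not show which checks fail most often. Below is a minimal sketch of a per-metric breakdown that could be returned from `summarize` alongside the aggregate. The rows are made-up illustrations, not results from this run, and `per_metric_pass_rate` and `unweighted_mean` are hypothetical helpers.

```python
from typing import Dict, List

def per_metric_pass_rate(score_rows: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of rows passing each metric; empty input gives an empty dict."""
    if not score_rows:
        return {}
    total = len(score_rows)
    return {
        metric: sum(bool(row[metric]) for row in score_rows) / total
        for metric in score_rows[0]
    }

def unweighted_mean(rates: Dict[str, float]) -> float:
    """With every weight equal to 1, the weighted average reduces to this mean."""
    return sum(rates.values()) / len(rates) if rates else 0.0

# Illustrative rows only.
rows = [
    {"length_appropriate": True, "has_imperative_mood": False},
    {"length_appropriate": True, "has_imperative_mood": False},
    {"length_appropriate": False, "has_imperative_mood": False},
    {"length_appropriate": True, "has_imperative_mood": True},
]
rates = per_metric_pass_rate(rows)
print(rates)
print(unweighted_mean(rates))
```

A breakdown like this helps separate checks that the model genuinely fails from checks whose definition may need refining.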