# Code-based evaluation

Code-based grading relies on predefined code, often using string matching and regular expressions, to assess model outputs. This typically involves checking if the response exactly matches a correct answer or includes specific key phrases. When applicable, this approach is ideal because it is extremely fast and dependable. 

- Ideally, start here.
- Break what you want to evaluate into "unit" objectives, and write code that checks whether the model output satisfies each one.
- This exercise forces you to think through the evaluation criteria and will likely reveal shortcomings in the evaluation rubric.
- Code-based evaluation also goes by "unit testing", "heuristic-based evaluation", "rule-based evaluation", and "programmatic evaluation". These terms are more or less synonymous.
- Code-based evaluation is also the easiest kind to implement. Keep each check simple and modular.
- These "unit" evaluations fit easily into a CI/CD pipeline.
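
As a minimal sketch of such "unit" objectives, the two checks below cover the cases mentioned above: an exact match against a known answer and a search for required key phrases. The function names and sample strings are hypothetical, chosen only for illustration. Because each check is a plain function returning a boolean, it can be asserted in an ordinary test runner such as pytest, which is one way to run it in CI/CD.

```python
import re


def exact_match(model_output: str, expected: str) -> bool:
    """Return True if the output equals the expected answer, ignoring case and outer whitespace."""
    return model_output.strip().lower() == expected.strip().lower()


def contains_key_phrases(model_output: str, phrases: list[str]) -> bool:
    """Return True if every phrase appears in the output as a whole word or phrase."""
    return all(
        re.search(rf"\b{re.escape(phrase)}\b", model_output, flags=re.IGNORECASE)
        for phrase in phrases
    )


def test_unit_objectives() -> None:
    # Hypothetical outputs; a test runner such as pytest would collect this function.
    assert exact_match("  Paris ", "paris")
    assert contains_key_phrases("Add refresh token support", ["refresh token"])
    # Word boundaries keep "token" from matching inside "tokenizer".
    assert not contains_key_phrases("Add tokenizer", ["token"])


test_unit_objectives()
```

Matching on word boundaries rather than raw substrings is a design choice: it avoids accidental hits inside longer words, at the cost of missing inflected forms.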

In [2]:
%load_ext autoreload
%autoreload 2

In [13]:
import os
import re
import weave
import pandas as pd

from dotenv import load_dotenv
load_dotenv()  # TODO: replace with getpass

import google.generativeai as genai
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

import nest_asyncio
nest_asyncio.apply()

In [8]:
# initialize weave
weave_client = weave.init(project_name="eval-course/eval-course-dev")

Logged in as Weights & Biases user: ayut.
View Weave data at https://wandb.ai/eval-course/eval-course-dev/weave


## Generate commit messages from diffs

Imagine a task in which you use an LLM to generate commit messages from code diffs. This saves some time on any project, and it is especially useful for fast-moving projects where multiple engineers work on the same codebase. Good commit messages are important for the health of a codebase.

Let's use this example to show how code-based evaluation can be useful and how to go about it. We will first see the application in action and then write the code evaluators.

### Part 1: Commit generator application

In [32]:
class CommitMessageGenerator(weave.Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-1.5-flash")
    prompt_template: str = """
    Generate a clear and descriptive commit message for the following code changes.
    Format the commit message in the conventional commits style:
    <type>(<scope>): <description>
    
    [optional body]
    
    Code diff:
    {diff_content}
    
    Focus on:
    - What changed
    - Why it changed
    - Any breaking changes
    """

    @weave.op()
    def predict(self, diff: str) -> str:
        response = self.model.generate_content(self.prompt_template.format(diff_content=diff))
        return response.text.strip()

In [33]:
diff_example_1 = """
diff --git a/src/auth.py b/src/auth.py
index abc123..def456 100644
--- a/src/auth.py
+++ b/src/auth.py
@@ -10,6 +10,12 @@ class AuthManager:
     def validate_token(self, token):
         return self.jwt.decode(token, self.secret_key)
 
+    def refresh_token(self, old_token):
+        if not self.validate_token(old_token):
+            raise InvalidTokenError
+        user_data = self.jwt.decode(old_token)
+        return self.generate_token(user_data)
"""

commit_msg_generator = CommitMessageGenerator()
commit_msg_1 = commit_msg_generator.predict(diff_example_1)
print(commit_msg_1)

🍩 https://wandb.ai/eval-course/eval-course-dev/r/call/0192dccc-8c79-7cc2-ba03-94e916562f73
```
feat(auth): Add refresh token functionality

Adds a `refresh_token` method to the `AuthManager` class. This method allows users to obtain a new access token using their existing refresh token.

The refresh token endpoint validates the provided refresh token, and if valid, generates a new access token with the same user data. 
```
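
The printed message above appears to be wrapped in Markdown code fences. If the fences are part of the model output, any check that inspects the first line of the output would see the fence rather than the commit header. One way to handle this is to strip a wrapping fence before scoring. The helper below is a minimal sketch; `strip_code_fences` is a hypothetical name and is not part of the original application.

```python
import re


def strip_code_fences(text: str) -> str:
    """Remove one wrapping Markdown code fence, if present; otherwise return the stripped text."""
    stripped = text.strip()
    # Opening fence with an optional language tag, then the body, then a closing fence.
    match = re.fullmatch(r"```[^\n]*\n(.*?)\n?```", stripped, flags=re.DOTALL)
    return match.group(1).strip() if match else stripped


# Hypothetical output shaped like the sample above.
raw = "```\nfeat(auth): Add refresh token functionality\n\nAdds a `refresh_token` method.\n```"
print(strip_code_fences(raw).split("\n")[0])
# prints: feat(auth): Add refresh token functionality
```

Whether to strip the fence in the application or in the evaluators is a design choice. Stripping in the application keeps the evaluators scoring exactly what a user would receive.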


### Part 2: Code based evaluation

In this section, we will define a few objective criteria and write programmatic functions (no LLMs involved) to evaluate the quality of the commit messages.

In [42]:
# Define objectives as functions
# @weave.op()
def follows_conventional_format(model_output: str) -> bool:
    """Check if commit message follows conventional commit format"""
    conv_commit_pattern = r'^(feat|fix|perf|refactor|style|test|docs|build|ci|chore)(\([a-z-]+\))?: .+'
    return bool(re.match(conv_commit_pattern, model_output.split('\n')[0]))


# @weave.op()
def length_appropriate(model_output: str) -> bool:
    """Check if commit message length is appropriate (between 10-72 chars)"""
    first_line = model_output.split('\n')[0]
    return 10 <= len(first_line) <= 72


# @weave.op() 
def contains_key_components(model_output: str) -> bool:
    """Check if commit message contains key components (what and why)"""
    return (
        any(word in model_output.lower() for word in ["add", "update", "fix", "remove", "implement"]) and
        ("to" in model_output.lower() or "for" in model_output.lower() or "because" in model_output.lower())
    )


# @weave.op()
def no_generic_terms(model_output: str) -> bool:
    """Check if commit message avoids generic terms"""
    generic_terms = ["stuff", "things", "updated", "fixed", "changed"]
    return not any(term in model_output.lower() for term in generic_terms)


# @weave.op()
def has_imperative_mood(model_output: str) -> bool:
    """Check if commit message uses imperative mood (first word is a known verb)"""
    words = model_output.split('\n')[0].split()
    if not words:  # empty or whitespace-only first line
        return False
    imperative_verbs = ["add", "update", "fix", "remove", "implement", "change", "refactor", "optimize", "delete", "create"]
    return words[0].lower() in imperative_verbs


# @weave.op()
def has_proper_capitalization(model_output: str) -> bool:
    """Check if commit message follows proper capitalization (first letter capitalized, no period)"""
    first_line = model_output.split('\n')[0]
    # Guard against an empty first line before indexing into it
    return (bool(first_line) and first_line[0].isupper() and
            not first_line.endswith('.'))


# @weave.op()
def has_scope_if_needed(model_output: str) -> bool:
    """Check if commit message includes scope when appropriate"""
    first_line = model_output.split('\n')[0]
    type_with_scope = r'^(feat|fix|refactor)\([a-z-]+\): '
    type_without_scope = r'^(docs|test|style|chore): '
    return bool(re.match(type_with_scope, first_line) or re.match(type_without_scope, first_line))


# @weave.op()
def has_detailed_body_if_complex(model_output: str) -> bool:
    """Check if commit message has detailed body for complex changes"""
    lines = model_output.split('\n')
    # Complex changes indicated by certain keywords
    complex_indicators = ["refactor", "breaking", "deprecate", "remove", "!:"]
    is_complex = any(indicator in lines[0].lower() for indicator in complex_indicators)

    if is_complex:
        # Should have at least one line of body text after blank line
        return len(lines) >= 3 and lines[1].strip() == "" and any(line.strip() for line in lines[2:])
    return True
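
Note that `has_imperative_mood` and `has_proper_capitalization` look at the very start of the first line. On a message that follows the conventional format, that start is the lowercase type prefix, so the first token is something like `feat(auth):` rather than a verb. One possible variant, sketched below, judges the description that follows the prefix instead. This assumes the intent is to evaluate the description; `PREFIX_PATTERN`, `get_description`, and `description_is_imperative` are hypothetical names, not part of the evaluators above.

```python
import re

# Assumed prefix shape: a lowercase type, an optional scope, an optional "!", then ": ".
PREFIX_PATTERN = re.compile(r"^[a-z]+(\([a-z-]+\))?!?: ")
IMPERATIVE_VERBS = {"add", "update", "fix", "remove", "implement",
                    "change", "refactor", "optimize", "delete", "create"}


def get_description(model_output: str) -> str:
    """Return the first line with any conventional-commit prefix removed."""
    lines = model_output.strip().split("\n")
    return PREFIX_PATTERN.sub("", lines[0], count=1).strip()


def description_is_imperative(model_output: str) -> bool:
    """Check whether the description (not the type prefix) starts with a listed verb."""
    words = get_description(model_output).split()
    return bool(words) and words[0].lower() in IMPERATIVE_VERBS


print(description_is_imperative("feat(auth): Add refresh token functionality"))  # True
print(description_is_imperative("feat(auth): Added refresh token functionality"))  # False
```

A fixed verb list stays simple and fast, but it will reject valid imperative verbs that are not listed, so the list would likely need to grow with the project.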

I have synthetically generated a dataset of code diffs. Let's load it and see what it looks like.

In [43]:
# Load the dataset
code_diffs_dataset = weave.ref('code-diffs:v0').get()
print("Total number of samples: ", len(code_diffs_dataset.rows))

print(code_diffs_dataset.rows[0]["diff"])

Total number of samples:  10

diff --git a/src/user.py b/src/user.py
index abc123..def456 100644
--- a/src/user.py
+++ b/src/user.py
@@ -10,6 +10,9 @@ class User:
     def get_name(self):
         return self.name
 
+    def get_email(self):
+        return self.email
+



Note that we do not need "gold" standard commit messages here. We only have the input, in the form of code diffs, and we evaluate the commit messages generated by the LLM directly against the criteria defined above. Not needing reference answers is one of the advantages of code-based evaluation.

Below, I collect all the code-based criteria under one `Scorer`. The `summarize` method runs at the end of the scoring process to aggregate the scores. If you don't define this method, `auto_summarize` is called by default. The example shows how to structure your evaluation logic together with custom aggregation logic.

In [50]:
from typing import Optional
from weave import Scorer


class CodeDiffScorer(Scorer):
    @weave.op()
    def score(self, model_output: str) -> dict:
        result = {
            "follows_conventional_format": follows_conventional_format(model_output),
            "length_appropriate": length_appropriate(model_output),
            "contains_key_components": contains_key_components(model_output),
            "no_generic_terms": no_generic_terms(model_output),
            "has_imperative_mood": has_imperative_mood(model_output),
            "has_proper_capitalization": has_proper_capitalization(model_output),
            "has_scope_if_needed": has_scope_if_needed(model_output),
            "has_detailed_body_if_complex": has_detailed_body_if_complex(model_output),
        }
        return result

    @weave.op()
    def summarize(self, score_rows: list) -> Optional[dict]:
        if not score_rows:
            return None
            
        # Initialize counters for each metric with weight 1
        metrics = {
            'follows_conventional_format': {'weight': 1, 'count': 0},
            'length_appropriate': {'weight': 1, 'count': 0},
            'contains_key_components': {'weight': 1, 'count': 0}, 
            'no_generic_terms': {'weight': 1, 'count': 0},
            'has_imperative_mood': {'weight': 1, 'count': 0},
            'has_proper_capitalization': {'weight': 1, 'count': 0},
            'has_scope_if_needed': {'weight': 1, 'count': 0},
            'has_detailed_body_if_complex': {'weight': 1, 'count': 0}
        }
        
        # Sum up scores for each metric
        total = len(score_rows)
        for row in score_rows:
            for metric in metrics:
                if row[metric]:
                    metrics[metric]['count'] += 1
                    
        # Calculate weighted average score
        weighted_sum = sum(
            (metrics[metric]['count'] / total) * metrics[metric]['weight']
            for metric in metrics
        )
        total_weights = sum(metrics[metric]['weight'] for metric in metrics)
        code_eval_score = weighted_sum / total_weights
        
        summary = {'code_eval_score': code_eval_score}

        return summary
    
code_evaluator = CodeDiffScorer()
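
The aggregation in `summarize` is a weighted mean of per-metric pass rates: each metric's pass rate is multiplied by its weight, and the sum is divided by the total weight. With every weight set to 1, this reduces to a plain average. The standalone sketch below uses hypothetical rows and weights to show how non-uniform weights shift the score; `weighted_pass_rate` is an illustrative helper, not part of the scorer.

```python
def weighted_pass_rate(score_rows: list[dict], weights: dict[str, float]) -> float:
    """Weighted mean of per-metric pass rates, mirroring the summarize logic above."""
    total = len(score_rows)
    if total == 0:
        raise ValueError("score_rows must not be empty")
    weighted_sum = sum(
        weight * sum(bool(row[metric]) for row in score_rows) / total
        for metric, weight in weights.items()
    )
    return weighted_sum / sum(weights.values())


# Hypothetical scores for two messages and two metrics.
rows = [
    {"follows_conventional_format": True, "length_appropriate": False},
    {"follows_conventional_format": True, "length_appropriate": True},
]
uniform = {"follows_conventional_format": 1, "length_appropriate": 1}
format_heavy = {"follows_conventional_format": 3, "length_appropriate": 1}

print(weighted_pass_rate(rows, uniform))       # (1.0 + 0.5) / 2 = 0.75
print(weighted_pass_rate(rows, format_heavy))  # (3 * 1.0 + 1 * 0.5) / 4 = 0.875
```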

Let's run the evaluation.

In [51]:
import asyncio
from weave import Evaluation

# Create evaluation
evaluation = Evaluation(
    dataset=code_diffs_dataset,
    scorers=[code_evaluator]
)

# Run evaluation
asyncio.run(evaluation.evaluate(CommitMessageGenerator()))

🍩 https://wandb.ai/eval-course/eval-course-dev/r/call/0192dce4-f8ba-7d50-b1a5-44f3a77cb754


{'CodeDiffScorer': {'code_eval_score': 0.375},
 'model_latency': {'mean': 4.428258943557739}}
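
A single aggregate such as `code_eval_score` does not show which checks are failing. A per-metric breakdown can be sketched as below and could, for instance, be returned from `summarize` alongside the aggregate. The rows are hypothetical and are not results from the run above; `per_metric_pass_rate` is an illustrative name.

```python
from collections import Counter


def per_metric_pass_rate(score_rows: list[dict]) -> dict[str, float]:
    """Return the fraction of rows that pass each metric."""
    if not score_rows:
        return {}
    passes = Counter()
    for row in score_rows:
        for metric, passed in row.items():
            # Adding 0 for a failure still registers the metric in the counter.
            passes[metric] += int(bool(passed))
    return {metric: count / len(score_rows) for metric, count in passes.items()}


# Hypothetical per-example scores in the shape returned by a scorer's `score` method.
example_rows = [
    {"follows_conventional_format": False, "contains_key_components": True},
    {"follows_conventional_format": True, "contains_key_components": True},
]

# List the weakest metrics first.
for metric, rate in sorted(per_metric_pass_rate(example_rows).items(), key=lambda kv: kv[1]):
    print(f"{metric}: {rate:.2f}")
```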