# RM-Gallery Workflow Demonstration

This notebook demonstrates the complete workflow of RM-Gallery, from data input to final analysis results.

In [None]:
# Import required modules
import asyncio
import os
from typing import List, Dict, Any

# Load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()

from rm_gallery.core.graders.base_grader import BaseGrader, GraderMode, GraderResult, GraderScore
from rm_gallery.core.graders.llm_grader import LLMGrader
from rm_gallery.core.models.openai_chat_model import OpenAIChatModel
from rm_gallery.core.models.schema.template import PromptTemplate
from rm_gallery.core.models.schema.message import ChatMessage
from rm_gallery.core.runner.grading_runner import GradingRunner, GraderConfig
from rm_gallery.core.analyzer.base_analyzer import BaseAnalyzer, BaseAnalysis
from rm_gallery.core.analyzer.accuracy_analyzer import AccuracyAnalyzer
from rm_gallery.core.analyzer.weighted_average_analyzer import WeightedAverageAnalyzer
from rm_gallery.core.models.base_chat_model import BaseChatModel

## 1. Define Custom Graders

First, we'll define some simple custom graders for demonstration purposes.

In [None]:
class AccuracyGrader(BaseGrader):
    """A simple accuracy grader that checks if answer matches expected value."""
    
    def __init__(self):
        super().__init__(
            name="accuracy_grader",
            mode=GraderMode.POINTWISE,
            description="Evaluates accuracy of answers"
        )
    
    async def aevaluate(self, **kwargs) -> GraderResult:
        answer = kwargs.get("answer", "")
        expected = kwargs.get("expected", None)
        
        if expected is not None:
            score = 1.0 if str(answer) == str(expected) else 0.0
            reason = "Correct answer" if score == 1.0 else "Incorrect answer"
        else:
            score = 0.5  # Default score when no expected value
            reason = "No expected value provided"
            
        return GraderScore(
            name=self.name,
            score=score,
            reason=reason
        )

class RelevanceGrader(BaseGrader):
    """A toy relevance grader that scores the ratio of answer length to query length, capped at 1."""
    
    def __init__(self):
        super().__init__(
            name="relevance_grader",
            mode=GraderMode.POINTWISE,
            description="Evaluates relevance of answers based on length"
        )
    
    async def aevaluate(self, **kwargs) -> GraderResult:
        answer = kwargs.get("answer", "")
        query = kwargs.get("query", "")
        
        # Simple relevance based on answer length and query length
        answer_length = len(str(answer))
        query_length = len(str(query))
        
        # Cap the length ratio at 1.0 so the score stays in [0, 1]
        if query_length > 0:
            score = min(1.0, answer_length / query_length)
        else:
            score = 0.5
            
        reason = f"Answer length: {answer_length}, Query length: {query_length}"
        
        return GraderScore(
            name=self.name,
            score=score,
            reason=reason
        )
    

template = PromptTemplate(
    messages=[
        ChatMessage(
            role="system",
            content="You are an expert evaluator of AI assistant responses. "
                    "Rate the helpfulness of the answer on a scale from 0 to 1, "
                    "where 0 is completely unhelpful and 1 is perfectly helpful. "
        ),
        ChatMessage(
            role="user",
            content="""Question: {query}\nAnswer: {answer}\n\n
Please rate the helpfulness of the answer on a scale from 0 to 1.
Response JSON format:
```json
{{
    "score": 0.8,
    "reason": "Explanation of the score"
}}
```
"""
        )
    ]
)

# Custom LLMGrader implementation
class HelpfulnessGrader(LLMGrader):
    """Custom LLM-based grader for evaluating helpfulness using OpenAI API."""

    def __init__(self, model: BaseChatModel):
        super().__init__(
            name="helpfulness_grader",
            mode=GraderMode.POINTWISE,
            description="Evaluates helpfulness of answers using LLM as a judge",
            model=model,
            template=template,
        )
    
    async def aevaluate(self, query: str, answer: str, **kwargs) -> GraderResult:
        return await super().aevaluate(query=query, answer=answer, **kwargs)

## 2. Prepare Input Data

Let's prepare some sample data for evaluation.

In [None]:
# Sample data for evaluation
data = [
    {
        "query": "What is the capital of France?",
        "answer": "Paris",
        "expected": "Paris",
        "label": 1.0,
    },
    {
        "query": "What is 2+2?",
        "answer": "5",
        "expected": "4",
        "label": 0.0,
    },
    {
        "query": "Who wrote Romeo and Juliet?",
        "answer": "William Shakespeare",
        "expected": "William Shakespeare",
        "label": 1.0,
    }
]

print("Input data:")
for i, item in enumerate(data):
    print(f"  Sample {i+1}: {item}")

Input data:
  Sample 1: {'query': 'What is the capital of France?', 'answer': 'Paris', 'expected': 'Paris', 'label': 1.0}
  Sample 2: {'query': 'What is 2+2?', 'answer': '5', 'expected': '4', 'label': 0.0}
  Sample 3: {'query': 'Who wrote Romeo and Juliet?', 'answer': 'William Shakespeare', 'expected': 'William Shakespeare', 'label': 1.0}


## 3. Configure and Run Graders

Now we'll set up the graders and run the evaluation process.

In [None]:
# Create grader configurations
model = OpenAIChatModel(model="qwen3-32b")


grader_configs = [
    GraderConfig(grader=AccuracyGrader()),
    GraderConfig(grader=RelevanceGrader()),
    GraderConfig(grader=HelpfulnessGrader(model=model))
]

# Create the grading runner
runner = GradingRunner(grader_configs=grader_configs, max_concurrency=5)

# Run the evaluation
results = await runner.arun(data)

print("Evaluation results:")
for i, sample_results in enumerate(results):
    print(f"  Sample {i+1}:")
    for grader_name, result in sample_results.items():
        score = getattr(result, "score", 0.0)
        print(f"    {grader_name}: score={score}, reason='{result.reason}'")

Evaluation results:
  Sample 1:
    accuracy_grader: score=1.0, reason='Correct answer'
    relevance_grader: score=0.16666666666666666, reason='Answer length: 5, Query length: 30'
    helpfulness_grader: score=1.0, reason='The answer is correct and directly addresses the question without any unnecessary information.'
  Sample 2:
    accuracy_grader: score=0.0, reason='Incorrect answer'
    relevance_grader: score=0.08333333333333333, reason='Answer length: 1, Query length: 12'
    helpfulness_grader: score=0.0, reason='The answer is incorrect. 2+2 equals 4, not 5. Providing an incorrect answer to a simple arithmetic question is unhelpful.'
  Sample 3:
    accuracy_grader: score=1.0, reason='Correct answer'
    relevance_grader: score=0.7037037037037037, reason='Answer length: 19, Query length: 27'
    helpfulness_grader: score=1.0, reason='The answer is completely correct and directly addresses the question. William Shakespeare is indeed the author of Romeo and Juliet, and the respons
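
The `max_concurrency=5` argument above caps how many grader calls run at once. How `GradingRunner` enforces this internally is not shown in this notebook; the sketch below is only one common way to get the same effect with plain `asyncio`, using a semaphore. The names `run_bounded` and `fake_grade` are hypothetical and are not part of RM-Gallery.

```python
import asyncio


async def run_bounded(coros, max_concurrency=5):
    """Await all coroutines with at most `max_concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(coro):
        # Each coroutine must acquire a slot before it starts running
        async with semaphore:
            return await coro

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(c) for c in coros))


async def fake_grade(sample_id):
    # Hypothetical stand-in for a real grader call
    await asyncio.sleep(0.01)
    return {"sample": sample_id, "score": 1.0}


async def demo():
    return await run_bounded([fake_grade(i) for i in range(10)], max_concurrency=5)


# In a script; inside a notebook cell, use `demo_results = await demo()` instead
demo_results = asyncio.run(demo())
print(len(demo_results))
```

Because `asyncio.gather` preserves input order, results can still be matched back to samples by position even though calls finish at different times.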

## 4. Analyze Results

Finally, let's analyze the results using different analyzers.

In [None]:
# Create analyzers
accuracy_analyzer = AccuracyAnalyzer()

# Prepare weights for weighted average
weights = {
    "accuracy_grader": 0.5,
    "relevance_grader": 0.3,
    "helpfulness_grader": 0.2
}

weighted_analyzer = WeightedAverageAnalyzer(weights=weights)

# Run analysis
accuracy_analysis = accuracy_analyzer.analyze(data, results, target_grader="helpfulness_grader")
weighted_analysis = weighted_analyzer.analyze(data, results)

print("Analysis results:")
print(f"  Accuracy Analysis: {accuracy_analysis}")
print(f"  Weighted Average Analysis: {weighted_analysis}")

if hasattr(accuracy_analysis, 'accuracy'):
    print(f"  Overall Accuracy: {accuracy_analysis.accuracy:.2%}")
    

if hasattr(weighted_analysis, 'weighted_results'):
    print("  Weighted Scores per Sample:")
    for i, result_dict in enumerate(weighted_analysis.weighted_results):
        weighted_result = result_dict["weighted_result"]
        print(f"    Sample {i+1}: {weighted_result.score:.2f} - {weighted_result.reason}")

Analysis results:
  Accuracy Analysis: name='Accuracy Analysis' metadata={'explanation': 'Correctly predicted 3 out of 3 samples (100.00% accuracy)'} accuracy=1.0
  Weighted Average Analysis: name='weighted_average' metadata={} weighted_results=[{'weighted_result': GraderScore(name='weighted_average', reason='Weighted average score calculated from 3 evaluators. accuracy_grader: 1.0 (weight: 0.5); relevance_grader: 0.16666666666666666 (weight: 0.3); helpfulness_grader: 1.0 (weight: 0.2)', metadata={'weights': {'accuracy_grader': 0.5, 'relevance_grader': 0.3, 'helpfulness_grader': 0.2}, 'component_scores': {'accuracy_grader': 1.0, 'relevance_grader': 0.16666666666666666, 'helpfulness_grader': 1.0}}, score=0.75)}, {'weighted_result': GraderScore(name='weighted_average', reason='Weighted average score calculated from 3 evaluators. accuracy_grader: 0.0 (weight: 0.5); relevance_grader: 0.08333333333333333 (weight: 0.3); helpfulness_grader: 0.0 (weight: 0.2)', metadata={'weights': {'accuracy_
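
The weighted score for Sample 1 can be checked by hand from the component scores and weights in the output above: 0.5 × 1.0 + 0.3 × 0.1667 + 0.2 × 1.0 = 0.75. The short sketch below reproduces that arithmetic. It assumes the combination is a plain weighted mean divided by the total weight; the weights here sum to 1, so the division does not change the result. `weighted_average` and `demo_weights` are illustrative names, not RM-Gallery functions.

```python
demo_weights = {
    "accuracy_grader": 0.5,
    "relevance_grader": 0.3,
    "helpfulness_grader": 0.2,
}


def weighted_average(scores, weights):
    """Weighted mean of per-grader scores, normalized by the total weight used."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Component scores for Sample 1, taken from the evaluation output
sample_1 = {
    "accuracy_grader": 1.0,
    "relevance_grader": 5 / 30,
    "helpfulness_grader": 1.0,
}

print(round(weighted_average(sample_1, demo_weights), 2))  # 0.75
```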

## 5. Custom LLMGrader Explanation

The custom LLMGrader we implemented shows how to build a fully custom grader that uses LLMs for evaluation:

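1. **Model Integration**: We used `OpenAIChatModel` to call a chat model (here `qwen3-32b`) through an OpenAI-style chat API
2. **Template-based Prompting**: We created a `PromptTemplate` with system and user messages that carry the evaluation instructions for the LLM
3. **Structured Output**: We instructed the LLM to respond in a specific JSON format for easy parsing
4. **Error Handling**: `HelpfulnessGrader.aevaluate` simply delegates to `LLMGrader.aevaluate`, so this notebook adds no fallback of its own; any handling of failed LLM calls is whatever the base class provides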

Key features of our custom implementation:

```python
# The grader uses a specific template to ensure consistent evaluation
template = PromptTemplate(
    messages=[
        ChatMessage(role="system", content="Evaluation instructions..."),
        ChatMessage(role="user", content="Question: {query}\nAnswer: {answer}"),
    ]
)

# The LLM is instructed to return a specific JSON format
# {"score": 0.8, "reason": "Explanation of the score"}
```
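
To make the structured-output step concrete, here is a minimal sketch of how a reply in that JSON format could be turned into a score and a reason. It is not the parser `LLMGrader` uses (that code is not shown in this notebook); `parse_score_reply` is a hypothetical helper, and clamping the score to [0, 1] is an assumption based on the 0 to 1 scale in the prompt.

```python
import json
import re

# Built programmatically so the triple backticks do not end this code fence
FENCE = "`" * 3


def parse_score_reply(text):
    """Extract (score, reason) from an LLM reply, with or without a json fence."""
    fenced = re.search(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, text, re.DOTALL)
    payload = fenced.group(1) if fenced else text.strip()
    data = json.loads(payload)
    # Clamp to the 0..1 range requested in the prompt (assumption)
    score = min(1.0, max(0.0, float(data["score"])))
    return score, str(data.get("reason", ""))


reply = f'{FENCE}json\n{{"score": 0.8, "reason": "Explanation of the score"}}\n{FENCE}'
print(parse_score_reply(reply))  # (0.8, 'Explanation of the score')
```

A reply that is not valid JSON raises `json.JSONDecodeError` here, which leaves the caller free to decide how to handle it.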

This approach gives you full control over:
- The LLM model being used
- The prompt template and evaluation criteria
- How the LLM response is parsed and converted to a GraderResult
- Error handling and fallback behavior
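
As an illustration of the last point, fallback behavior could be layered around any awaitable grader call. The sketch below is an assumption-laden example, not RM-Gallery code: `with_fallback` is a hypothetical helper, and it returns a plain dict rather than a `GraderScore` so that it runs without the library.

```python
import asyncio


async def with_fallback(coro, default_score=0.0, timeout=30.0):
    """Await `coro`; on error or timeout, return a default score with the reason."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except Exception as exc:  # includes asyncio.TimeoutError
        return {"score": default_score, "reason": f"Grader failed: {exc!r}"}


async def failing_grader():
    # Hypothetical grader call that simulates an API failure
    raise RuntimeError("LLM call failed")


# In a script; inside a notebook cell, use `await with_fallback(...)` instead
fallback_result = asyncio.run(with_fallback(failing_grader(), default_score=0.5))
print(fallback_result["score"])  # 0.5
```

Whether a failed call should map to a default score or be excluded from analysis is a design choice; a default score keeps every sample in the results but can skew averages.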

## Conclusion

This notebook demonstrated the complete workflow of RM-Gallery:

1. **Data Input**: We prepared sample data with queries, answers, expected values, and labels.
2. **Evaluation Execution**: We configured and ran multiple graders concurrently using GradingRunner.
3. **Result Organization**: Results were organized by sample, with each grader's output captured.
4. **Analysis**: We used analyzers to compute overall metrics and weighted averages.
5. **Custom LLMGrader**: We showed how to implement a fully custom LLM-based grader.

This modular approach allows for flexible evaluation pipelines where different graders can be combined and analyzed in various ways. The combination of rule-based graders (like AccuracyGrader) and LLM-based graders (like our custom HelpfulnessGrader) provides both precision and nuanced evaluation capabilities.