# Using Flow Judge with LangChain Evaluators

## Introduction to Flow Judge and LangChain Integration

Flow Judge is an open-source language model optimized for evaluating AI systems. This tutorial demonstrates how to integrate Flow Judge with LangChain. By the end of this notebook, you'll understand how to create custom metrics, run evaluations, and analyze results using both Flow Judge and LangChain tools.  

A key component of this integration is the custom `FlowJudgeLangChainEvaluator` class we created. This class extends LangChain's `StringEvaluator`, allowing Flow Judge to be seamlessly integrated into LangChain workflows. By implementing this custom evaluator, we can use Flow Judge metrics in the same way as LangChain's built-in evaluators, making it easy to incorporate Flow Judge's capabilities into existing LangChain-based evaluation pipelines.

With `FlowJudgeLangChainEvaluator`, Flow Judge's custom metrics plug directly into LangChain workflows, combining the flexibility of those metrics with the convenience and standardization of LangChain's framework.
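
To make the subclassing idea concrete, here is a minimal sketch of a custom `StringEvaluator`. It is not the actual `FlowJudgeLangChainEvaluator` implementation: `KeywordEvaluator` is a hypothetical toy class that only illustrates the pattern of overriding `_evaluate_strings` and returning a dictionary with `score` and `reasoning`. The fallback base class is a stand-in that is used only when LangChain is not installed.

```python
from typing import Any, Optional

try:
    from langchain.evaluation import StringEvaluator
except ImportError:
    # Minimal stand-in (not the real class) so the sketch runs without LangChain.
    class StringEvaluator:
        def evaluate_strings(self, *, prediction, reference=None, input=None, **kwargs):
            return self._evaluate_strings(
                prediction=prediction, reference=reference, input=input, **kwargs
            )


class KeywordEvaluator(StringEvaluator):
    """Hypothetical toy evaluator: 1 if the prediction contains a keyword, else 0."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword.lower()

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Case-insensitive substring check stands in for a real judgment.
        found = self.keyword in prediction.lower()
        status = "found" if found else "not found"
        return {
            "score": int(found),
            "reasoning": f"Keyword '{self.keyword}' {status} in prediction.",
        }


toy_evaluator = KeywordEvaluator("password")
toy_result = toy_evaluator.evaluate_strings(
    prediction="Click 'Forgot Password' to reset it."
)
print(toy_result["score"], toy_result["reasoning"])
```

Callers only use the public `evaluate_strings` method, so any evaluator built this way can be swapped for another without changing the surrounding pipeline.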


## `Flow-Judge-v0.1`

`Flow-Judge-v0.1` is an open-source, lightweight (3.8B) language model optimized for LLM system evaluations and crafted for accuracy, speed, and customization.

Read the technical report [here](https://www.flow-ai.com/blog/flow-judge).


## LangChain evaluators

LangChain is a powerful framework for developing applications using large language models.

Refer to the [LangChain evaluation module API reference](https://python.langchain.com/v0.2/api_reference/langchain/evaluation.html#) for more detailed information about their evaluation module.
 
LangChain's evaluation module offers built-in evaluators for evaluating the outputs of chains and LLMs.

In this notebook, we will demonstrate how to utilize `Flow-Judge-v0.1` custom metrics together with LangChain's framework.

## System Requirements

Flow Judge requires a GPU with at least 2.3GB of VRAM. If you're using a non-Ampere GPU, please use the `Flow-Judge-v0.1_HF_no_flsh_attn` model instead of the default one.
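
The requirement above can be turned into a quick pre-flight check. The sketch below is only an illustration: `check_gpu` is a hypothetical helper, and it assumes that "non-Ampere" means a GPU older than Ampere, that is, a CUDA compute capability below 8.0. If PyTorch is installed, the two arguments could be read from `torch.cuda.get_device_properties(0).total_memory` and `torch.cuda.get_device_capability(0)`.

```python
from typing import Dict, Tuple

MIN_VRAM_GB = 2.3         # minimum VRAM stated in this tutorial
FLASH_ATTN_MIN_MAJOR = 8  # assumption: Ampere starts at compute capability 8.0


def check_gpu(vram_gb: float, compute_capability: Tuple[int, int]) -> Dict[str, bool]:
    """Report whether the GPU has enough memory and needs the no-flash-attention model."""
    return {
        "enough_vram": vram_gb >= MIN_VRAM_GB,
        "use_no_flash_attn_model": compute_capability[0] < FLASH_ATTN_MIN_MAJOR,
    }


# Example values for a hypothetical 16 GB card with compute capability 7.5.
print(check_gpu(16.0, (7, 5)))
```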

## Install dependencies

In [1]:
#!pip install langchain langchain_openai

## OpenAI API key

You need to provide an OpenAI API key to use the LangChain evaluators with GPT-4.


In [1]:
import os

os.environ["OPENAI_API_KEY"] = "sk-proj-.."
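
As an alternative to writing the key into the notebook, it can be read from the environment and requested interactively only when missing. This is a small sketch using the standard library; `ensure_openai_key` is a hypothetical helper name, not part of LangChain or Flow Judge.

```python
import os
from getpass import getpass


def ensure_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting once if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so it does not end up in the cell output.
        key = getpass(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `ensure_openai_key()` before loading the LangChain evaluator keeps the secret out of the saved notebook.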


## Model

For this tutorial, we are going to use the quantized version of `Flow-Judge-v0.1`. Under the hood, `flow-judge` uses the vLLM engine to run the model.


In [2]:
from flow_judge.models.model_factory import ModelFactory

model = ModelFactory.create_model("Flow-Judge-v0.1-AWQ")

INFO 10-07 12:20:30 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 10-07 12:20:30 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='flowaicom/Flow-Judge-v0.1-AWQ', speculative_config=None, tokenizer='flowaicom/Flow-Judge-v0.1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_mod

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 10-07 12:20:32 model_runner.py:926] Loading model weights took 2.1861 GB
INFO 10-07 12:20:33 gpu_executor.py:122] # GPU blocks: 2442, # CPU blocks: 682


## Criteria Based Evaluation 

### Comparing LangChain and Flow-Judge Evaluators

In this example, we'll explore two different approaches to evaluating the helpfulness of AI-generated responses:

1. **LangChain's Criteria-Based Evaluator**: 
   LangChain provides a built-in evaluator designed to assess responses against specific criteria. In this example, we'll use the "helpfulness" criterion.

2. **Flow-Judge Custom Metric**:
   As an alternative, we'll demonstrate how to use Flow-Judge to create a custom metric for evaluating helpfulness. Our custom metric will use a binary scale, offering a straightforward yet effective way to gauge response quality.

By comparing these two methods, we'll gain insights into the flexibility and capabilities of both LangChain and Flow-Judge for assessing LLM outputs.

Let's dive in and see how these evaluators perform in practice!

In [3]:
# Example query and responses for evaluating helpfulness with LangChain

query = "I'm having trouble logging into my account. What should I do?"

response = '''I'm sorry to hear you're having trouble logging in. Here are some steps you can try:

1. Double-check that you're using the correct email address and password.
2. If you've forgotten your password, click on the 'Forgot Password' link on the login page to reset it.
3. Clear your browser's cache and cookies, then try logging in again.
4. Make sure your internet connection is stable.
5. If you're still having issues, please provide me with the error message you're seeing, or any specific problems you encounter during the login process.

If none of these steps work, I'd be happy to escalate this to our technical support team. They can take a closer look at your account and help resolve any underlying issues.'''

In [4]:
from IPython.display import Markdown, display

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="helpfulness")

# This evaluation will use gpt-4 to evaluate the response. 
eval_result = evaluator.evaluate_strings(
    prediction=response,
    input=query
)

# Single quotes inside the f-string keep this valid on Python versions before 3.12.
display(Markdown(f"**Score:** {eval_result['score']}"))
display(Markdown(f"**Reasoning:** {eval_result['reasoning']}"))

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


**Score:** 1

**Reasoning:** The criterion for this task is helpfulness. The submission should be helpful, insightful, and appropriate.

Looking at the submission, it provides a detailed step-by-step guide on what the user can do if they're having trouble logging into their account. It starts by suggesting the user to double-check their login credentials, which is a common issue when users have trouble logging in. This is helpful and appropriate.

Next, it suggests the user to use the 'Forgot Password' feature if they've forgotten their password. This is also helpful and appropriate, as it's a common feature on most login pages.

The submission then suggests clearing the browser's cache and cookies, and ensuring a stable internet connection. These are insightful suggestions, as they might not be the first things a user thinks of when they're having trouble logging in.

Finally, the submission offers to escalate the issue to the technical support team if none of the suggested steps work. This is helpful, as it assures the user that further assistance is available if needed.

Based on this analysis, the submission is helpful, insightful, and appropriate. It provides a comprehensive guide on what to do when having trouble logging in, and offers further assistance if needed.

Y

This example uses LangChain's off-the-shelf helpfulness evaluator, which relies on GPT-4 to rate the response. It returns a score of 1, which is correct in this case, along with its reasoning. Although GPT-4 is a powerful evaluator, cost and privacy concerns mean it is not always feasible to use it for evaluations.

Now let's see how we can use the `FlowJudgeLangChainEvaluator` to achieve the same result by using the flow-judge model to rate the example.  

We will first create a custom metric for helpfulness. For this example, we will use a binary scale to rate the response as helpful or not. Custom metrics can be tailored to evaluate responses against specific criteria and scoring scales, which makes them a powerful tool for building use-case-specific evaluation pipelines. Please refer to the [custom metrics tutorial](2_custom_evaluation_criteria.ipynb) for more examples of how to create custom metrics.


In [5]:
from flow_judge.metrics import CustomMetric, RubricItem

# Define a binary helpfulness metric
helpfulness_metric = CustomMetric(
    name="binary_helpfulness",
    criteria="Evaluate if the response is helpful in addressing the user's query",
    rubric=[
        RubricItem(score=0, description="The response is not helpful in addressing the user's query."),
        RubricItem(score=1, description="The response is helpful in addressing the user's query.")
    ],
    required_inputs=["input"],  # Use `input`, the standard variable name for LangChain evaluators.
    required_output="prediction",  # Use `prediction`, the standard variable name for LangChain evaluators. It is also a required field.
)


> **Note:** LangChain evaluators typically use the following input variables in their evaluations:  
> - `prediction`: The LLM's response. This is always required.
> - `input`: The user's query. This is optional.
> - `reference`: The reference answer to the query. Some evaluators also map context to the `reference` variable. This is optional.
>
> Flow Judge metrics have required inputs and outputs. To stay consistent with LangChain, always assign the output/response to the `prediction` variable; `FlowJudgeLangChainEvaluator` maps it to the metric's required output. The remaining inputs are optional from LangChain's perspective and should be mapped to the metric's required inputs.
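
The mapping described in the note can be pictured with a short sketch. This is an illustration of the idea only, not the code inside `FlowJudgeLangChainEvaluator`; `map_eval_kwargs` and its return shape are hypothetical.

```python
from typing import Any, Dict, List


def map_eval_kwargs(
    required_inputs: List[str],
    required_output: str,
    prediction: str,
    **kwargs: Any,
) -> Dict[str, Dict[str, str]]:
    """Split LangChain-style keyword arguments into metric inputs and output."""
    # Every input the metric requires must be passed as a keyword argument.
    missing = [name for name in required_inputs if kwargs.get(name) is None]
    if missing:
        raise ValueError(f"Missing required inputs: {missing}")
    inputs = {name: kwargs[name] for name in required_inputs}
    # The prediction always becomes the metric's required output.
    output = {required_output: prediction}
    return {"inputs": inputs, "output": output}


mapped = map_eval_kwargs(
    required_inputs=["input"],
    required_output="prediction",
    prediction="Try resetting your password.",
    input="I can't log in.",
)
print(mapped)
```

Failing early on a missing input gives a clearer error than letting the model evaluate an incomplete prompt.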


In [6]:
from flow_judge.integrations.langchain import FlowJudgeLangChainEvaluator

helpfulness_evaluator = FlowJudgeLangChainEvaluator(metric=helpfulness_metric, model=model)

eval_result = helpfulness_evaluator.evaluate_strings(
    prediction=response,
    input=query
)

# Single quotes inside the f-string keep this valid on Python versions before 3.12.
display(Markdown(f"**Score:** {eval_result['score']}"))
display(Markdown(f"**Reasoning:** {eval_result['reasoning']}"))

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.85s/it, est. speed input: 413.91 toks/s, output: 69.61 toks/s]


**Score:** 1

**Reasoning:** The response provided is highly helpful in addressing the user's query about trouble logging into their account. It offers a clear, step-by-step approach to troubleshoot the issue, covering common problems such as incorrect login credentials, forgotten passwords, and technical issues like browser cache and internet connection. Additionally, it provides an option to escalate the issue to technical support if the suggested steps do not resolve the problem. This comprehensive and user-friendly approach effectively addresses the user's concern and provides actionable solutions.

We have now demonstrated how to use the `FlowJudgeLangChainEvaluator` to evaluate the same example response using a custom metric.  

Next, let's look at a different example: evaluating the correctness of QA responses based on context.

## QA Evaluations

In this example, we compare two approaches for evaluating question-answering (QA) responses:

1. **LangChain's Context QA Evaluator**
2. **Flow-Judge Custom QA Metric**

### LangChain QA Evaluation

LangChain's built-in "context_qa" evaluator provides a binary assessment:

- Score: CORRECT/INCORRECT
- Reasoning: Brief explanation of the judgment

### Flow-Judge QA Evaluation

Our custom Flow-Judge metric offers a more nuanced evaluation:

- Score: 1-3 scale
   - 1: Incorrect or missing most key points
   - 2: Partially correct with some missing information or minor inaccuracies
   - 3: Fully correct and complete
- Reasoning: Detailed explanation of the score and how well the response aligns with the provided context

### Comparison

Both evaluators assess the correctness of the response given the query and the provided context. While LangChain provides a straightforward CORRECT/INCORRECT judgment, Flow-Judge offers a more granular assessment with its 1-3 scale.

Key differences:
1. **Scoring granularity**: Flow-Judge's 3-point scale allows for more nuanced feedback compared to LangChain's binary output.
2. **Reasoning detail**: Flow-Judge typically provides more comprehensive explanations, which can be valuable for understanding subtle quality differences between responses.
3. **Customization**: The Flow-Judge metric can be easily adjusted to focus on specific aspects of QA performance, offering flexibility for various use cases.

This comparison demonstrates how Flow-Judge can provide more detailed insights into response quality, which can be particularly useful for fine-tuning QA systems or conducting in-depth analyses of model outputs.
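
When scores from evaluators with different scales are compared side by side, it can help to rescale them to a common range first. The sketch below shows one simple option, min-max normalization to [0, 1]; `normalize_score` is a hypothetical helper and not part of either library.

```python
def normalize_score(score: float, min_score: float, max_score: float) -> float:
    """Rescale a score from [min_score, max_score] to [0, 1]."""
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")
    if not min_score <= score <= max_score:
        raise ValueError("score is outside the given scale")
    return (score - min_score) / (max_score - min_score)


print(normalize_score(1, 0, 1))  # top of a binary scale -> 1.0
print(normalize_score(2, 1, 3))  # middle of a 1-3 scale -> 0.5
```

Normalization makes the numbers comparable, but a middle score on a 3-point rubric still carries information that a binary judgment cannot express.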

In [7]:
# Example: Customer Service QA with Context

# User Query
query = "What can I do if I'm not happy with my new TechGadget Pro?"

# Context
context = """
TechGadget Pro smartphone features:
- 6.5-inch OLED display, 5G capable
- 128GB or 256GB storage
- Triple camera system
- Water-resistant (IP68)
- Price starts at $799

Return policy: 30-day full refund for undamaged products.
For software issues: Try restarting and updating before contacting support.
"""

# Customer Service Response
response = """
I understand you're not satisfied with your TechGadget Pro. Here are some options for you:
1. You can return the device within 14 days for a full refund, as long as it's undamaged.
2. If you're experiencing software issues, try restarting the device and check for any available updates.
3. For hardware problems, please visit one of our authorized service centers.
4. We offer a trade-in program if you'd prefer to upgrade to a different model.
Could you tell me more about what specific issues you're facing with the TechGadget Pro?
"""



### LangChain's Context QA Evaluator

In [8]:
qa_evaluator = load_evaluator("context_qa")

eval_result = qa_evaluator.evaluate_strings(
    prediction=response,
    input=query, 
    reference=context  # The context QA eval chain uses `reference` as the context
)

# Single quotes inside the f-string keep this valid on Python versions before 3.12.
display(Markdown(f"**Score:** {eval_result['score']}"))
display(Markdown(f"**Reasoning:** {eval_result['reasoning']}"))


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


**Score:** 0

**Reasoning:** INCORRECT

### Flow-Judge Custom QA Evaluator

In [9]:
correctness_metric = CustomMetric(
    name="context_correctness",
    criteria="Evaluate the correctness of the response based on the given context",
    rubric=[
        RubricItem(score=1, description="The response is mostly incorrect or contradicts the information in the context."),
        RubricItem(score=2, description="The response is partially correct but misses some key information from the context or contains minor inaccuracies."),
        RubricItem(score=3, description="The response is fully correct and accurately reflects the information provided in the context.")
    ],
    required_inputs=["input", "context"],
    required_output="prediction"
)


In [10]:
flow_judge_correctness_evaluator = FlowJudgeLangChainEvaluator(model=model, metric=correctness_metric)

# Evaluate using Flow-Judge evaluator
correctness_result = flow_judge_correctness_evaluator.evaluate_strings(
    input=query,
    context=context,
    prediction=response
)

# Single quotes inside the f-string keep this valid on Python versions before 3.12.
display(Markdown(f"**Score:** {correctness_result['score']}"))
display(Markdown(f"**Reasoning:** {correctness_result['reasoning']}"))

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.19s/it, est. speed input: 276.10 toks/s, output: 71.45 toks/s]


**Score:** 2

**Reasoning:** The response provided by the AI system is mostly correct but contains a significant inaccuracy that affects its overall score. 

1. The return policy is correctly stated as a 30-day full refund for undamaged products.
2. The suggestion to restart the device and check for updates is appropriate for software issues.
3. The advice to visit an authorized service center for hardware problems is also correct.
4. The mention of a trade-in program is not mentioned in the given context and is an additional option that was not provided.

The main issue is the incorrect return period stated in the response. The context specifies a 30-day return policy, but the response incorrectly states a 14-day period. This is a critical error as it directly contradicts the information provided in the context.

Given these points, the response is partially correct but misses some key information from the context and contains a significant inaccuracy.

As we can see, the Flow Judge evaluator provides a more granular score and more detailed reasoning about the correctness of the response.
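
Because both evaluators expose the same `evaluate_strings` method, the single-example calls above extend naturally to a small dataset. The following sketch assumes only that method and a `score` key in each result; `evaluate_batch` is a hypothetical helper, and `StubEvaluator` is a stand-in used so the example runs without a model or an API key.

```python
from statistics import mean
from typing import Any, Dict, List


def evaluate_batch(evaluator: Any, examples: List[Dict[str, str]]) -> Dict[str, Any]:
    """Run evaluate_strings on each example and summarize the scores."""
    results = [evaluator.evaluate_strings(**example) for example in examples]
    scores = [result["score"] for result in results]
    return {"results": results, "mean_score": mean(scores), "n": len(scores)}


class StubEvaluator:
    """Stand-in for illustration only: scores 1 for any non-empty prediction."""

    def evaluate_strings(self, *, prediction: str, **kwargs: Any) -> Dict[str, Any]:
        score = 1 if prediction.strip() else 0
        return {"score": score, "reasoning": "stub"}


summary = evaluate_batch(
    StubEvaluator(),
    [
        {"prediction": "Reset your password.", "input": "I can't log in."},
        {"prediction": "", "input": "Where is my order?"},
    ],
)
print(summary["mean_score"], summary["n"])
```

Passing an evaluator such as `flow_judge_correctness_evaluator` in place of the stub should work the same way, provided each example dictionary contains the keys the metric requires. Note that `mean` raises an error on an empty list of examples.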

## Summary 

This notebook demonstrated how to integrate Flow Judge with LangChain, combining the strengths of both frameworks. We showed how to create custom metrics using Flow Judge and seamlessly incorporate them into LangChain workflows. By comparing Flow Judge's approach with LangChain's built-in evaluators, we highlighted the benefits of using Flow Judge for more detailed and customizable insights into response quality.

A key component of this integration is the custom `FlowJudgeLangChainEvaluator` class we created. This class allows Flow Judge to be easily integrated into existing LangChain pipelines, enabling users to leverage Flow Judge's powerful evaluation capabilities within the familiar LangChain ecosystem.

By using Flow Judge within LangChain, developers can:
1. Create highly customized evaluation metrics tailored to specific use cases
2. Obtain more granular and detailed feedback on model outputs
3. Seamlessly incorporate Flow Judge's evaluation capabilities into existing LangChain-based projects

This integration demonstrates how Flow Judge can enhance LangChain's functionality, providing users with more flexible and powerful tools for evaluating AI-generated responses.