# Using Flow Judge with LangChain

## Introduction to Flow Judge and LangChain Integration

Flow Judge is an open-source language model optimized for evaluating AI systems. This tutorial demonstrates how to integrate Flow Judge with LangChain. By the end of this notebook, you'll understand how to create custom metrics, run evaluations, and analyze results using both Flow Judge and LangChain tools.  

A key component of this integration is the custom `FlowJudgeLangChainEvaluator` class we created. It extends LangChain's `StringEvaluator`, so Flow Judge metrics can be used in the same way as LangChain's built-in evaluators and incorporated into existing LangChain workflows.

## `Flow-Judge-v0.1`

`Flow-Judge-v0.1` is an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. It is designed for accuracy, speed, and customization.

Read the technical report [here](https://www.flow-ai.com/blog/flow-judge).


## LangChain evaluators

LangChain is a powerful framework for developing applications using large language models.

Refer to the [LangChain evaluation module API reference](https://python.langchain.com/v0.2/api_reference/langchain/evaluation.html#) for more detailed information about their evaluation module.
 
LangChain's evaluation module offers built-in evaluators for evaluating the outputs of chains and LLMs. In this notebook, we will demonstrate how to utilize `Flow-Judge-v0.1` custom metrics together with LangChain's framework.


## Install dependencies

In [3]:
try:
    import langchain  # noqa: F401  (only checks that the package is installed)
except ImportError as e:
    print("langchain is not installed. ")
    print("Please run `pip install langchain` to install it.")
    print("\nAfter installation, restart the kernel and run this cell again.")
    raise SystemExit(f"Stopping execution due to missing langchain dependency: {e}")

try:
    from langchain_openai import ChatOpenAI
except ImportError as e:
    print("langchain_openai is not installed. ")
    print("Please run `pip install langchain_openai` to install it.")
    print("\nAfter installation, restart the kernel and run this cell again.")
    raise SystemExit(f"Stopping execution due to missing langchain_openai dependency: {e}")

## OpenAI API key

You need to provide an OpenAI API key to use the LangChain evaluators with gpt-4.


In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-proj-..."
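
Hardcoding a key in a notebook makes it easy to leak. As a minimal sketch of an alternative, the hypothetical helper below (`ensure_api_key` is not part of LangChain or Flow Judge) reuses a key that is already set in the environment and only prompts for one when it is missing. The `prompt_fn` parameter is an assumption added so the prompt can be swapped out.

```python
import os
from getpass import getpass


def ensure_api_key(var_name: str = "OPENAI_API_KEY", prompt_fn=getpass) -> str:
    """Return the key stored in `var_name`, prompting only if it is unset.

    `ensure_api_key` is a hypothetical helper for illustration. By default it
    uses `getpass`, so the key is not echoed to the notebook output.
    """
    key = os.environ.get(var_name)
    if not key:
        key = prompt_fn(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key


# Usage (uncomment in an interactive session):
# ensure_api_key()
```

This keeps the key out of the saved notebook while leaving `os.environ["OPENAI_API_KEY"]` populated for the LangChain evaluators.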


## Model

For this tutorial, we are going to use the default vLLM version of `Flow-Judge-v0.1`.


In [1]:
from flow_judge import Vllm

model = Vllm()



INFO 10-09 14:24:39 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 10-09 14:24:39 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='flowaicom/Flow-Judge-v0.1-AWQ', speculative_config=None, tokenizer='flowaicom/Flow-Judge-v0.1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_mod

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 10-09 14:24:40 model_runner.py:926] Loading model weights took 2.1861 GB
INFO 10-09 14:24:42 gpu_executor.py:122] # GPU blocks: 2442, # CPU blocks: 682




Next, we will create a custom metric. Custom metrics can be tailored to evaluate responses against specific criteria and scoring scales, which makes them a powerful tool for building use-case-specific evaluation pipelines. In this notebook, the custom metric rates correctness on a 1-3 scale.


## QA Evaluations

In this example, we compare two approaches for evaluating question-answering (QA) responses:

1. LangChain's Context QA Evaluator
2. Flow-Judge Custom QA Metric

### LangChain QA Evaluation

LangChain's built-in "context_qa" evaluator provides a binary assessment:

- Score: 0/1
- Reasoning: CORRECT/INCORRECT

### Flow-Judge QA Evaluation

With Flow-Judge we can create custom metrics that offer a more nuanced evaluation. For this example we will create a correctness evaluator that judges responses on a 1-3 scale. Please refer to the [custom metrics tutorial](2_custom_evaluation_criteria.ipynb) for more examples of how to create custom metrics.

- Score: 1-3 scale
   - 1: Incorrect or missing most key points
   - 2: Partially correct with some missing information or minor inaccuracies
   - 3: Fully correct and complete
- Reasoning: Detailed explanation of the score and how well the response aligns with the context
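
To compare a 1-3 rubric score with a binary 0/1 score, it can help to put both on a common scale. The sketch below is illustrative only: `normalize_score` and `to_binary` are hypothetical helpers, not part of Flow Judge or LangChain, and treating only the top score as a pass is an assumption you may want to adjust.

```python
def normalize_score(score: int, min_score: int = 1, max_score: int = 3) -> float:
    """Linearly rescale a rubric score to the range [0.0, 1.0]."""
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")
    if not min_score <= score <= max_score:
        raise ValueError(f"score {score} is outside [{min_score}, {max_score}]")
    return (score - min_score) / (max_score - min_score)


def to_binary(score: int, threshold: int = 3) -> int:
    """Collapse a rubric score to 0/1; scores at or above `threshold` pass.

    The default threshold of 3 (fully correct only) is an assumption.
    """
    return int(score >= threshold)


print(normalize_score(2))  # 0.5
print(to_binary(2))        # 0
```

With the default threshold, a "partially correct" score of 2 collapses to 0, which is the strictest way to line up the two scales.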

In [11]:
# Example: Customer Service QA with Context

# User Query
query = "What can I do if I'm not happy with my new TechGadget Pro?"

# Context
context = """
TechGadget Pro smartphone features:
- 6.5-inch OLED display, 5G capable
- 128GB or 256GB storage
- Triple camera system
- Water-resistant (IP68)
- Price starts at $799

Return policy: 30-day full refund for undamaged products.
For software issues: Try restarting and updating before contacting support.
"""

# Customer Service Response
response = """
I understand you're not satisfied with your TechGadget Pro. Here are some options for you:
1. You can return the device within 14 days for a full refund, as long as it's undamaged.
2. If you're experiencing software issues, try restarting the device and check for any available updates.
3. For hardware problems, please visit one of our authorized service centers.
4. We offer a trade-in program if you'd prefer to upgrade to a different model.
Could you tell me more about what specific issues you're facing with the TechGadget Pro?
"""



### LangChain's Context QA Evaluator

In [9]:
from IPython.display import Markdown, display
from langchain.evaluation import load_evaluator 

# Load the langchain evaluator for context qa
qa_evaluator = load_evaluator("context_qa")

# Evaluate the response
eval_result = qa_evaluator.evaluate_strings(
    prediction=response,
    input=query, 
    reference=context # QA evalchain maps the reference as context
)

display(Markdown(f"**Score:** {eval_result['score']}"))
display(Markdown(f"**Reasoning:** {eval_result['reasoning']}"))


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


**Score:** 0

**Reasoning:** INCORRECT

This example of the off-the-shelf LangChain QA evaluator uses gpt-4 to rate the response. It returns a score of 0 and the reasoning that the response is incorrect. While gpt-4 is a powerful evaluator, cost and privacy concerns mean it is not always feasible to use it for evaluations.

Now let's see how we can use the `FlowJudgeLangChainEvaluator` to run the same evaluation, with the Flow-Judge model rating the example.

### Flow-Judge Custom QA Evaluator

In [6]:
from flow_judge import CustomMetric, RubricItem

correctness_metric = CustomMetric(
    name="context_correctness",
    criteria="Evaluate the correctness of the response based on the given context",
    rubric=[
        RubricItem(score=1, description="The response is mostly incorrect or contradicts the information in the context."),
        RubricItem(score=2, description="The response is partially correct but misses some key information from the context or contains minor inaccuracies."),
        RubricItem(score=3, description="The response is fully correct and accurately reflects the information provided in the context.")
    ],
    required_inputs=["query", "context"],
    required_output="response" # see note below for output 
)


>**Note:** LangChain evaluators typically use the following input variables:
> - `prediction`: The LLM's response (always required)
> - `input`: The user's query (optional)
> - `reference`: The reference answer or context (optional)
>
>**Flow Judge metric requirements:** Flow Judge metrics have specific required inputs and outputs.
>
>To maintain consistency when using LangChain evaluators with Flow Judge metrics:
>
>1. Always assign the output/response to the `prediction` variable.
>2. The `FlowJudgeLangChainEvaluator` will automatically map `prediction` to the required output of the metric.
>3. Map other inputs as follows:
>   - If the Flow Judge metric requires an input corresponding to the user's query, map it to `input`.
>   - If the metric requires a reference or context input, map it to `reference`.
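
The mapping convention in the note can be sketched in plain Python. This is not the actual `FlowJudgeLangChainEvaluator` implementation: `build_metric_payload` and its `*_key` parameters are hypothetical, and the returned dictionary is only a stand-in for whatever structure the evaluator builds internally.

```python
from typing import Optional


def build_metric_payload(
    prediction: str,
    input: Optional[str] = None,  # name mirrors LangChain's `input` argument
    reference: Optional[str] = None,
    *,
    input_key: str = "query",
    reference_key: str = "context",
    output_key: str = "response",
) -> dict:
    """Map LangChain-style arguments onto a metric's named inputs and output.

    Hypothetical helper: the key names default to the ones used by the
    correctness metric in this notebook.
    """
    payload = {"inputs": {}, "output": {output_key: prediction}}
    if input is not None:
        payload["inputs"][input_key] = input
    if reference is not None:
        payload["inputs"][reference_key] = reference
    return payload


payload = build_metric_payload(
    prediction="You can return it within 30 days.",
    input="What is the return policy?",
    reference="Return policy: 30-day full refund.",
)
print(payload["inputs"])  # keys: "query" and "context"
print(payload["output"])  # key: "response"
```

Optional arguments that are not supplied are simply left out, so a metric that only needs a query would receive a payload without a `context` entry.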



In [12]:
from flow_judge.integrations.langchain import FlowJudgeLangChainEvaluator 

# Initialize the FlowJudgeLangChainEvaluator with the model and metric
flow_judge_correctness_evaluator = FlowJudgeLangChainEvaluator(model=model, metric=correctness_metric)

# Evaluate using Flow-Judge evaluator
correctness_result = flow_judge_correctness_evaluator.evaluate_strings(
    query=query,
    context=context,
    prediction=response
)

display(Markdown(f"**Score:** {correctness_result['score']}"))
display(Markdown(f"**Reasoning:** {correctness_result['reasoning']}"))

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.36s/it, est. speed input: 261.49 toks/s, output: 71.69 toks/s]


**Score:** 2

**Reasoning:** The response provided by the AI system is mostly correct but contains a significant inaccuracy that affects its overall quality. 

1. The return policy information is incorrect. The context clearly states that the return policy is a 30-day full refund for undamaged products, not 14 days as mentioned in the response. This is a major error as it provides incorrect information to the user.

2. The advice for software issues is correct and aligns with the context.

3. The suggestion to visit an authorized service center for hardware problems is not mentioned in the context and seems like an additional service that might not be available.

4. The trade-in program is not mentioned in the context and appears to be an unsolicited suggestion.

5. The request for more details about specific issues is appropriate and helpful.

Overall, while the response contains some correct information and helpful suggestions, the significant error in the return policy information and the inclusion of unmentioned services make it only partially correct.

### Comparison

Both evaluators assessed the correctness of the response against the given query and context. While LangChain provides a straightforward 0/1 judgment, Flow-Judge offers a more granular assessment with its 1-3 scale.

Key differences:
1. **Scoring granularity**: Flow-Judge's 3-point scale allows for more nuanced feedback compared to LangChain's binary output. This is fully customizable so you can choose the scoring granularity that best fits your use case.
2. **Reasoning detail**: Flow-Judge provides comprehensive explanations, which can be valuable for understanding subtle quality differences between responses.
3. **Customization**: The Flow-Judge metric can be easily adjusted to focus on specific aspects of QA performance, offering flexibility for various use cases.

This comparison demonstrates how Flow-Judge can provide more detailed insights into response quality, which can be particularly useful for fine-tuning QA systems or conducting in-depth analyses of model outputs.
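
When many responses are evaluated, the per-response scores can be rolled up for analysis. The sketch below assumes each result is a dictionary with a numeric `"score"` key, as in the outputs shown in this notebook; `summarize_scores` is a hypothetical helper, and the sample results are made up purely for illustration.

```python
from collections import Counter
from statistics import mean


def summarize_scores(results: list) -> dict:
    """Return the count, mean, and per-score distribution of evaluation results."""
    scores = [r["score"] for r in results]
    if not scores:
        raise ValueError("results must contain at least one score")
    return {
        "count": len(scores),
        "mean": mean(scores),
        "distribution": dict(sorted(Counter(scores).items())),
    }


# Made-up example results, not real evaluation output.
sample_results = [{"score": 2}, {"score": 3}, {"score": 3}, {"score": 1}]
print(summarize_scores(sample_results))
# {'count': 4, 'mean': 2.25, 'distribution': {1: 1, 2: 1, 3: 2}}
```

The distribution is often more informative than the mean alone, since it shows how many responses fall into each rubric level.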

## Summary

In this notebook, we explored how Flow Judge can work alongside LangChain for evaluating LLM responses. Here are the key takeaways:

1. Custom metrics: We created tailored evaluation criteria using Flow Judge.
2. Integration: The `FlowJudgeLangChainEvaluator` class lets us use Flow Judge within LangChain workflows.
3. Comparison: We saw how Flow Judge's approach offers more detailed insights compared to LangChain's built-in evaluators.

Benefits of using Flow Judge with LangChain:
- More customizable evaluation metrics
- Granular feedback on model outputs
- Easy integration with existing LangChain projects

Overall, this combination gives you flexibility and power when assessing LLM-generated responses.