# Quickstart

This tutorial demonstrates how to use the `flow-judge` library to perform language model-based evaluations using Flow-Judge-v0.1 models.

## Running an evaluation

Running an evaluation is as simple as:

In [1]:
from flow_judge import Vllm, Llamafile, Hf, EvalInput, FlowJudge

from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display


# If you are running on an Ampere GPU or newer, create a model using vLLM
# model = Vllm()

# If other applications are taking up VRAM, you can reduce usage by setting gpu_memory_utilization to a lower value.
# model = Vllm(gpu_memory_utilization=0.70)

# Or, on a GPU older than Ampere, create a model using Hugging Face Transformers without flash attention
# model = Hf(flash_attn=False)

# Or, without an Nvidia GPU (for example on an Apple Silicon Mac), create a model using Llamafile
model = Llamafile()


# Initialize the judge
faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model
)

# Sample to evaluate
query = """Please read the technical issue that the user is facing and help me create a detailed solution based on the context provided."""
context = """# Customer Issue:
I'm having trouble when uploading a git lfs tracked file to my repo: (base)  bernardo@bernardo-desktop  ~/repos/lm-evaluation-harness  ↱ Flow-Judge-v0.1_evals  git push
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

# Documentation:
Configuring Git Large File Storage
Once Git LFS is installed, you need to associate it with a large file in your repository.

Platform navigation
Mac
Windows
Linux
If there are existing files in your repository that you'd like to use GitHub with, you need to first remove them from the repository and then add them to Git LFS locally. For more information, see "Moving a file in your repository to Git Large File Storage."

If there are referenced Git LFS files that did not upload successfully, you will receive an error message. For more information, see "Resolving Git Large File Storage upload failures."

Open Terminal.

Change your current working directory to an existing repository you'd like to use with Git LFS.

To associate a file type in your repository with Git LFS, enter git lfs track followed by the name of the file extension you want to automatically upload to Git LFS.

For example, to associate a .psd file, enter the following command:

$ git lfs track "*.psd"
> Tracking "*.psd"
Every file type you want to associate with Git LFS will need to be added with git lfs track. This command amends your repository's .gitattributes file and associates large files with Git LFS.

Note: We strongly suggest that you commit your local .gitattributes file into your repository.

Relying on a global .gitattributes file associated with Git LFS may cause conflicts when contributing to other Git projects.
Including the .gitattributes file in the repository allows people creating forks or fresh clones to more easily collaborate using Git LFS.
Including the .gitattributes file in the repository allows Git LFS objects to optionally be included in ZIP file and tarball archives.
Add a file to the repository matching the extension you've associated:

git add path/to/file.psd
Commit the file and push it to GitHub:

git commit -m "add file.psd"
git push
You should see some diagnostic information about your file upload:

> Sending file.psd
> 44.74 MB / 81.04 MB  55.21 % 14s
> 64.74 MB / 81.04 MB  79.21 % 3s"""
response = """# Resolving Git LFS Data Quota Issue

## Problem
The user is encountering an error when trying to push Git LFS tracked files:
```
This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
```

## Solution

1. **Understand the issue**
   - The error message indicates that the repository has exceeded its data quota for Git LFS (Large File Storage).
   - This is likely due to the size or number of large files being tracked by Git LFS.

2. **Check current LFS usage**
   - Run the following command to see the current LFS objects:
     ```
     git lfs ls-files
     ```
   - This will show you which files are being tracked by Git LFS and their sizes.

3. **Review Git LFS pricing and quotas**
   - Visit GitHub's pricing page to understand the current LFS storage and bandwidth limits for your account type.
   - Free accounts typically have lower limits compared to paid plans.

4. **Options to resolve the issue**

   a. **Purchase more data packs**
      - If you need to keep all current LFS objects and continue using LFS extensively:
        1. Go to your GitHub account settings.
        2. Navigate to the Billing section.
        3. Look for options to purchase additional data packs for Git LFS.

   b. **Optimize repository**
      - If purchasing more data is not an option, consider optimizing your repository:
        1. Remove unnecessary large files from LFS tracking.
        2. Use `.gitignore` to exclude large files that don't need version control.
        3. Consider using alternative storage solutions for very large assets.

   c. **Clean up LFS cache**
      - Sometimes, cleaning up the LFS cache can help:
        ```
        git lfs prune
        ```

5. **Reconfigure Git LFS tracking**
   - Review your `.gitattributes` file to ensure only necessary file types are tracked:
     ```
     git lfs track
     ```
   - Modify tracking as needed:
     ```
     git lfs track "*.psd"  # Track PSD files
     git lfs untrack "*.zip"  # Stop tracking ZIP files
     ```

6. **Commit changes and retry push**
   - After making necessary changes:
     ```
     git add .gitattributes
     git commit -m "Update Git LFS tracking"
     git push
     ```

7. **If issues persist**
   - Contact GitHub support for further assistance.
   - They may be able to provide more specific guidance based on your account and repository details.

Remember to regularly monitor your Git LFS usage to avoid hitting quotas in the future. Consider setting up alerts or regularly checking your GitHub account's storage usage statistics."""

# Create an EvalInput
# We want to evaluate the response to the customer issue based on the context and the user instructions
eval_input = EvalInput(
    inputs=[
        {"query": query},
        {"context": context},
    ],
    output={"response": response},
)

# Run the evaluation
result = faithfulness_judge.evaluate(eval_input, save_results=False)

INFO:root:Starting llamafile server...
INFO:root:Downloading llamafile to /home/bernardo/.cache/flow-judge/flow-judge.llamafile
Downloading: 100%|██████████| 2.45G/2.45G [00:41<00:00, 63.8MiB/s]
INFO:root:Llamafile path: /home/bernardo/.cache/flow-judge/flow-judge.llamafile
INFO:root:Starting llamafile server with command: sh -c '/home/bernardo/.cache/flow-judge/flow-judge.llamafile --server --host 127.0.0.1 --port 8085 -c 8192 -ngl 34 --temp 0.1 -n 2000 --threads 24 --nobrowser -b 32 --parallel 1 --cont-batching'
INFO:root:Subprocess started with PID: 15533
INFO:openai._base_client:Retrying request to /models in 0.454612 seconds
INFO:root:import_cuda_impl: initializing gpu module...
INFO:root:extracting /zip/llama.cpp/ggml.h to /home/bernardo/.llamafile/v/0.8.13/ggml.h
INFO:root:extracting /zip/llamafile/compcap.cu to /home/bernardo/.llamafile/v/0.8.13/compcap.cu
INFO:root:extracting /zip/llamafile/llamafile.h to /home/bernardo/.llamafile/v/0.8.13/llamafile.h
INFO:root:extracting /zip

In [3]:
# Display the result
display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))

__Feedback:__
The response provided is highly consistent with the given context and does not contain any hallucinated or fabricated information. It directly addresses the customer's issue of exceeding the Git LFS data quota and offers a detailed solution based on the information provided in the context. The solution includes checking current LFS usage, reviewing Git LFS pricing and quotas, and various options to resolve the issue, such as purchasing more data packs or optimizing the repository. The steps outlined are all supported by the context, including commands like `git lfs ls-files`, `git lfs prune`, and `git lfs track`. The response also suggests reviewing the `.gitattributes` file and committing changes, which are all mentioned in the context. There are no significant deviations or fabrications from the provided information.

Therefore, the response is completely consistent with and faithful to the provided context.

__Score:__
5

# Models

`flow-judge` supports different model configurations, that is, different libraries for running inference with the models. We currently support:
- vLLM sync & async (default engine, mode sync)
- Hugging Face
- Llamafile
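
The hardware guidance in the quickstart comments can be condensed into a small decision helper. This is only a sketch: `pick_backend` is a hypothetical function, not part of `flow-judge`, and it returns the class name as a string so that it runs without the library installed. It relies on the fact that Ampere GPUs report CUDA compute capability 8.0 or higher.

```python
from typing import Optional, Tuple


def pick_backend(
    has_nvidia_gpu: bool,
    compute_capability: Optional[Tuple[int, int]] = None,
) -> str:
    """Suggest a flow_judge model class name (hypothetical helper)."""
    if not has_nvidia_gpu:
        # No Nvidia GPU, e.g. an Apple Silicon Mac.
        return "Llamafile"
    # Ampere and newer GPUs have compute capability >= (8, 0).
    if compute_capability is not None and compute_capability >= (8, 0):
        return "Vllm"
    # Older Nvidia GPU: the quickstart uses Hf(flash_attn=False).
    return "Hf"


print(pick_backend(False))
print(pick_backend(True, (8, 6)))
print(pick_backend(True, (7, 5)))
```

On a real machine, the compute capability would have to be obtained from your GPU tooling; that lookup is deliberately left out here.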

# Metrics

A judge is initialized with a metric and a model.

We include some common metrics in the library, such as:
- RESPONSE_FAITHFULNESS_3POINT
- RESPONSE_FAITHFULNESS_5POINT
- RESPONSE_COMPREHENSIVENESS_3POINT
- RESPONSE_COMPREHENSIVENESS_5POINT

You can also implement your own metrics and use them with the judge.

Note that each metric has required inputs and a required output, as shown below:

In [4]:
RESPONSE_FAITHFULNESS_5POINT.print_required_keys()

Metric: Response Faithfulness (5-point Likert)
Required inputs: query, context
Required output: response


`flow-judge` checks under the hood that the keys you provide match the keys the metric requires. This ensures that the prompt is formatted correctly.

When you define a custom metric, you should specify the required keys as well.
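
To make the key check concrete, here is a minimal sketch of what such a validation could look like. It illustrates the idea only and is not the library's actual implementation: `check_keys` and its parameters are hypothetical names, and the input shape mirrors the `inputs` and `output` arguments passed to `EvalInput` above.

```python
from typing import Dict, List


def check_keys(
    inputs: List[Dict[str, str]],
    output: Dict[str, str],
    required_inputs: List[str],
    required_output: str,
) -> None:
    """Raise ValueError if the provided keys differ from the required ones."""
    # Collect every key across the list of single-key input dicts.
    provided = {key for item in inputs for key in item}
    missing = set(required_inputs) - provided
    unexpected = provided - set(required_inputs)
    if missing or unexpected:
        raise ValueError(
            f"Input keys mismatch: missing={sorted(missing)}, "
            f"unexpected={sorted(unexpected)}"
        )
    if required_output not in output:
        raise ValueError(f"Output must contain the key {required_output!r}")


# Keys matching the requirements printed above pass silently.
check_keys(
    inputs=[{"query": "..."}, {"context": "..."}],
    output={"response": "..."},
    required_inputs=["query", "context"],
    required_output="response",
)
```

Failing early on a mismatch is preferable to formatting a prompt with a missing or misnamed field.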

## Running batched evaluations

The `FlowJudge` class also supports batch evaluation. This is useful when you want to evaluate multiple samples at once in Evaluation-Driven Development.

In [5]:
# Read the sample data
import json
with open("sample_data/csr_assistant.json", "r") as f:
    data = json.load(f)

# Create a list of inputs and outputs
inputs_batch = [
    [
        {"query": sample["query"]},
        {"context": sample["context"]},
    ]
    for sample in data
]
outputs_batch = [{"response": sample["response"]} for sample in data]

# Create a list of EvalInput
eval_inputs_batch = [EvalInput(inputs=inputs, output=output) for inputs, output in zip(inputs_batch, outputs_batch)]

# Run the batch evaluation
results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=False)

INFO:root:{"function":"launch_slot_with_data","level":"INFO","line":886,"msg":"slot is processing task","slot_id":0,"task_id":214,"tid":"11681088","timestamp":1728369888}
INFO:root:{"function":"update_slots","level":"INFO","line":1912,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":214,"tid":"11681088","timestamp":1728369888}
INFO:root:{"function":"print_timings","level":"INFO","line":313,"msg":"prompt eval time     =     586.47 ms /  1257 tokens (    0.47 ms per token,  2143.33 tokens per second)","n_tokens_second":2143.3321397513937,"num_prompt_tokens_processed":1257,"slot_id":0,"t_prompt_processing":586.47,"t_token":0.46656324582338904,"task_id":214,"tid":"11681088","timestamp":1728369893}
INFO:root:{"function":"print_timings","level":"INFO","line":327,"msg":"generation eval time =    4293.86 ms /   364 runs   (   11.80 ms per token,    84.77 tokens per second)","n_decoded":364,"n_tokens_second":84.77217013495078,"slot_id":0,"t_token":11.796324175824177,"t_token_generatio

INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"


In [6]:
# Visualizing the results
for i, result in enumerate(results):
    display(Markdown(f"__Sample {i+1}:__"))
    display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))
    display(Markdown("---"))

__Sample 1:__

__Feedback:__
The response provided by the AI system is mostly consistent with the given context, but it contains some minor inaccuracies and omissions. 

The response correctly identifies Git Large File Storage (LFS) as the solution to the problem and mentions the need to install and set up LFS, which aligns with the context provided. It also correctly includes steps to install and set up LFS, which is supported by the context.

However, the response introduces some fabricated details and minor inaccuracies:
1. It suggests adding large files directly with `git add large-file.ext`, which is not mentioned in the context. The context only provides a generic command `git add large-file.ext` without specifying the file extension.
2. It mentions adding a `.gitattributes` file, which is not referenced in the context. The context does not provide any information about this file.
3. The response includes specific commands like `git lfs track "*.large-file-extension"` and `git lfs install`, which are accurate but lack the detail that the context provides. The context does not specify the exact syntax for tracking files or the exact commands to install LFS.

While the response is generally on the right track, these minor inaccuracies and omissions prevent it from being fully consistent with the provided context. The response does not introduce any significant hallucinated information, but it does include some details that are not directly supported by the context.

Overall, the response is mostly consistent with the provided context but has some minor issues that prevent it from achieving a perfect score.

__Score:__
3

---

__Sample 2:__

__Feedback:__
The response provided by the AI system is mostly consistent with the given context but contains some significant inconsistencies and misinterpretations. 

1. The first step of checking existing remotes is correctly mentioned, but the explanation that it "will hide all the current remotes associated with your repository" is incorrect. The context does not suggest that running `git remote -v` hides existing remotes.

2. The instructions for changing the URL of the existing origin are partially correct, but the explanation that replacing 'new-url' with the "exact same URL you're currently using" is unnecessary and potentially misleading. The context simply states to replace 'new-url' with the desired URL.

3. The steps for adding a new remote with a different name are incorrect. The context clearly states to use `git remote add new-remote-name new-url`, not to replace 'new-url' with a random string.

4. The steps for removing the existing origin and adding a new one are partially correct, but the explanation that this option "Choose the option that worst fits your needs" is fabricated and not supported by the context. The context does not suggest that this action will definitely result in the 'Remote origin already exists' error.

Overall, while the response does include some correct information from the context, it also introduces several inaccuracies and misinterpretations that deviate from the provided information. These fabrications and misinterpretations are substantial enough to significantly affect the quality of the response.

Therefore, the response does not fully meet the criteria for a "Yes" score as it contains a substantial amount of hallucinated or fabricated details that deviate from the context.

__Score:__
2

---

__Sample 3:__

__Feedback:__
The response provided by the AI system is highly consistent with the given context and does not contain any hallucinated or fabricated information. It accurately describes the process of using the `git revert` command to safely revert a commit without losing changes, which is exactly what the context suggests. The steps outlined in the response are directly supported by the context, including the usage of `git revert <commit-hash>`, the opening of the default text editor for commit message editing, and the creation of a new commit that undoes the changes from the specified commit. Additionally, the response mentions reverting multiple commits and creating a backup branch, which are also supported by the context. There are no significant deviations or fabrications from the provided information.

Therefore, the response is completely consistent with and faithful to the provided context.

__Score:__
5

---

__Sample 4:__

__Feedback:__
The response provided by the AI system is completely inconsistent with the given context. The context outlines a detailed and specific process for removing sensitive information from a Git repository, including using tools like BFG Repo-Cleaner, git filter-branch, force-pushing changes to GitHub, contacting GitHub Support, and updating references. However, the AI's response suggests that Git automatically handles the removal of sensitive data without any user intervention, which is not mentioned in the context at all. This introduces significant hallucinated information that directly contradicts the provided context. Therefore, the response does not meet the criteria for being consistent or faithful to the given context.

Specifically, the AI's response contradicts several key points from the context:
1. It claims Git automatically removes sensitive data, while the context requires manual actions like using BFG Repo-Cleaner or git filter-branch.
2. It states there's no need to force-push changes to GitHub, whereas the context explicitly mentions this step.
3. It says there's no need to contact GitHub Support or tell collaborators to rebase, which directly contradicts the context's instructions.
4. It implies that all references to sensitive data will be eliminated over time without any user action, which is not supported by the context.

Given these substantial contradictions and the introduction of fabricated information not present in the context, the response fails to accurately address the user's query based on the provided information.

__Score:__
1

---

__Sample 5:__

__Feedback:__
The response provided is highly consistent with the given context and does not contain any hallucinated or fabricated information. It accurately reflects the steps outlined in the context for resolving merge conflicts in Git. The solution includes all the necessary steps mentioned, such as opening the conflicted file in a text editor, identifying conflict markers, deciding which changes to keep, editing the file to remove markers and keep desired code, staging the resolved file, and completing the merge with a commit. Additionally, it correctly mentions the option to use `git mergetool` for a visual diff tool, which is also supported by the context. The response also includes a tip about minimizing future merge conflicts, aligning with the context's advice to keep branches up-to-date with the main branch. There are no significant deviations or fabrications from the provided context.

Therefore, the response is completely faithful to the provided context, with all details directly supported by it.

__Score:__
5

---

__Sample 6:__

__Feedback:__
The response provided by the AI system is highly consistent with the given context. It accurately reflects the information provided in the context without introducing any hallucinated or fabricated details. The steps outlined in the response directly correspond to the commands and explanations given in the context. Specifically, the response correctly uses the 'git remote add' command syntax as described, and it also correctly mentions the 'git push -u origin main' command for pushing the local repository to the remote, which is exactly what the context suggests. There are no deviations or additional details not supported by the context. Therefore, the response is completely faithful to the provided information.

The response does not introduce any new concepts or commands that were not mentioned in the context. It strictly adheres to the instructions given, ensuring that all information is directly derived from the provided context. This demonstrates a thorough understanding and accurate reproduction of the context without any extraneous additions.

Given that the response fully aligns with the context provided, there are no inconsistencies or fabrications to note. Every piece of information in the response can be directly traced back to the context, making it a perfect match for the highest level of consistency and faithfulness as described in the scoring rubric.

__Score:__
5

---
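
Reading six feedback blocks by eye works for a small batch, but for larger ones a numeric summary is often more practical. The sketch below aggregates scores from any objects exposing a `score` attribute, as the results above do. `StubResult` is a hypothetical stand-in used so the example runs without a model; the scores are the six shown above.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class StubResult:
    """Hypothetical stand-in exposing the attributes used in the loop above."""

    feedback: str
    score: int


def summarize(results):
    """Return the sample count, mean score, and score distribution."""
    scores = [r.score for r in results]
    return {
        "n": len(scores),
        "mean": mean(scores),
        "distribution": dict(sorted(Counter(scores).items())),
    }


# Scores copied from the six samples above.
stub_results = [StubResult(feedback="", score=s) for s in (3, 2, 5, 1, 5, 5)]
summary = summarize(stub_results)
print(summary)
```

In the notebook, `summarize(results)` could be called directly on the list returned by `batch_evaluate`, assuming each element has an integer `score`.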

### Saving the results

When running batched evaluations, it is usually a good idea to save the results to a file for future reference and reproducibility. Saving is the default behavior of the evaluate methods; the earlier cells turned it off with `save_results=False`.

In [9]:
# Run the batch evaluation
results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=True)

INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:8085/v1/chat/completions "HTTP/1.1 200 OK"
INFO:flow_judge.flow_judge:Saving results to output/
INFO:flow_judge.utils.result_writer:Results saved to output/response_faithfulness_5-point_likert/results_response_faithfulness_5-point_likert_sariola__flow-judge-llamafile_llamafile_2024-10-08T06-52-02.259.jsonl


In [10]:
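from pathlib import Path

output_dir = Path("output")
# iterdir() yields entries in arbitrary order, so this is simply the first
# metric directory found, not necessarily the one from the latest run.
first_metric_dir = next(output_dir.iterdir())

print(f"Contents of {output_dir}:")
print(list(output_dir.iterdir()))

print(f"\nContents of {first_metric_dir}:")
print(list(first_metric_dir.iterdir()))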

Contents of output:
[PosixPath('output/context_relevancy'), PosixPath('output/faithfulness'), PosixPath('output/response_faithfulness_5-point_likert')]

Contents of output/context_relevancy:
[PosixPath('output/context_relevancy/results_context_relevancy_flowaicom__Flow-Judge-v0.1_transformers_2024-10-03T15-03-05.783.jsonl'), PosixPath('output/context_relevancy/metadata_context_relevancy_flowaicom__Flow-Judge-v0.1-AWQ_vllm_2024-10-03T07-36-07.558.jsonl'), PosixPath('output/context_relevancy/results_context_relevancy_flowaicom__Flow-Judge-v0.1_vllm_2024-10-03T07-13-35.678.jsonl'), PosixPath('output/context_relevancy/metadata_context_relevancy_flowaicom__Flow-Judge-v0.1-AWQ_vllm_2024-10-03T07-34-39.713.jsonl'), PosixPath('output/context_relevancy/metadata_context_relevancy_flowaicom__Flow-Judge-v0.1_transformers_2024-10-03T15-03-05.783.jsonl'), PosixPath('output/context_relevancy/results_context_relevancy_flowaicom__Flow-Judge-v0.1_vllm_2024-10-03T07-12-22.975.jsonl'), PosixPath('output/c

Each evaluation run generates two files:
- `results_....jsonl`: Contains the evaluation results.
- `metadata_....jsonl`: Contains metadata about the evaluation for reproducibility.

These files are saved in the `output` directory, in a subdirectory named after the metric (for example `output/response_faithfulness_5-point_likert/`).
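
To read a saved run back, one option is to pick the most recently modified results file in a metric directory and parse it line by line. This is a sketch under assumptions: `latest_results_file` and `load_jsonl` are hypothetical helpers, not part of `flow-judge`, and the fields inside each record are not shown in this tutorial, so the code treats every line as an opaque JSON object.

```python
import json
from pathlib import Path


def latest_results_file(metric_dir: Path) -> Path:
    """Return the most recently modified results_*.jsonl file in metric_dir."""
    candidates = list(metric_dir.glob("results_*.jsonl"))
    if not candidates:
        raise FileNotFoundError(f"No results_*.jsonl files in {metric_dir}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


def load_jsonl(path: Path) -> list:
    """Parse a JSON Lines file into a list, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Example usage, assuming the directory produced by the run above exists:
# metric_dir = Path("output") / "response_faithfulness_5-point_likert"
# records = load_jsonl(latest_results_file(metric_dir))
# print(len(records))
```

Selecting by modification time avoids depending on the arbitrary order in which `iterdir()` returns directory entries.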