# Quickstart

This tutorial demonstrates how to use the `flow-eval` library to perform language model-based evaluations using `Flow-Judge-v0.1` or any other model supported by the library.


## Running an evaluation

Running an evaluation is as simple as:

In [1]:
from flow_eval.lm.models import Vllm, Llamafile, Hf, Baseten
from flow_eval.lm.metrics import RESPONSE_FAITHFULNESS_5POINT
from flow_eval.core import EvalInput
from flow_eval import LMEvaluator
from IPython.display import Markdown, display


# If you are running on an Ampere GPU or newer, create a model using vLLM
model = Vllm()

# If you have other applications open taking up VRAM, you can use less VRAM by setting gpu_memory_utilization to a lower value.
# model = Vllm(gpu_memory_utilization=0.70)

# Or, if you are not running on an Ampere GPU or newer, create a model using Hugging Face Transformers without flash attention
# model = Hf(flash_attn=False)

# Or create a model using Baseten if you don't want to run locally.
# As a prerequisite:
#  - Sign up to Baseten
#  - Generate an API key at https://app.baseten.co/settings/api_keys
#  - Set the API key as an environment variable and initialize the model:
# import os
# os.environ["BASETEN_API_KEY"] = "your_api_key"
# model = Baseten()

# Or create a model using Llamafile if you don't have an Nvidia GPU, for example on an Apple Silicon Mac
# model = Llamafile()

# Initialize the judge
faithfulness_evaluator = LMEvaluator(
    eval=RESPONSE_FAITHFULNESS_5POINT,
    model=model
)

# Sample to evaluate
query = """Please read the technical issue that the user is facing and help me create a detailed solution based on the context provided."""
context = """# Customer Issue:
I'm having trouble when uploading a git lfs tracked file to my repo: (base)  bernardo@bernardo-desktop  ~/repos/lm-evaluation-harness  ↱ Flow-Judge-v0.1_evals  git push
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

# Documentation:
Configuring Git Large File Storage
Once Git LFS is installed, you need to associate it with a large file in your repository.

Platform navigation
Mac
Windows
Linux
If there are existing files in your repository that you'd like to use GitHub with, you need to first remove them from the repository and then add them to Git LFS locally. For more information, see "Moving a file in your repository to Git Large File Storage."

If there are referenced Git LFS files that did not upload successfully, you will receive an error message. For more information, see "Resolving Git Large File Storage upload failures."

Open Terminal.

Change your current working directory to an existing repository you'd like to use with Git LFS.

To associate a file type in your repository with Git LFS, enter git lfs track followed by the name of the file extension you want to automatically upload to Git LFS.

For example, to associate a .psd file, enter the following command:

$ git lfs track "*.psd"
> Tracking "*.psd"
Every file type you want to associate with Git LFS will need to be added with git lfs track. This command amends your repository's .gitattributes file and associates large files with Git LFS.

Note: We strongly suggest that you commit your local .gitattributes file into your repository.

Relying on a global .gitattributes file associated with Git LFS may cause conflicts when contributing to other Git projects.
Including the .gitattributes file in the repository allows people creating forks or fresh clones to more easily collaborate using Git LFS.
Including the .gitattributes file in the repository allows Git LFS objects to optionally be included in ZIP file and tarball archives.
Add a file to the repository matching the extension you've associated:

git add path/to/file.psd
Commit the file and push it to GitHub:

git commit -m "add file.psd"
git push
You should see some diagnostic information about your file upload:

> Sending file.psd
> 44.74 MB / 81.04 MB  55.21 % 14s
> 64.74 MB / 81.04 MB  79.21 % 3s"""
response = """# Resolving Git LFS Data Quota Issue

## Problem
The user is encountering an error when trying to push Git LFS tracked files:
```
This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
```

## Solution

1. **Understand the issue**
   - The error message indicates that the repository has exceeded its data quota for Git LFS (Large File Storage).
   - This is likely due to the size or number of large files being tracked by Git LFS.

2. **Check current LFS usage**
   - Run the following command to see the current LFS objects:
     ```
     git lfs ls-files
     ```
   - This will show you which files are being tracked by Git LFS and their sizes.

3. **Review Git LFS pricing and quotas**
   - Visit GitHub's pricing page to understand the current LFS storage and bandwidth limits for your account type.
   - Free accounts typically have lower limits compared to paid plans.

4. **Options to resolve the issue**

   a. **Purchase more data packs**
      - If you need to keep all current LFS objects and continue using LFS extensively:
        1. Go to your GitHub account settings.
        2. Navigate to the Billing section.
        3. Look for options to purchase additional data packs for Git LFS.

   b. **Optimize repository**
      - If purchasing more data is not an option, consider optimizing your repository:
        1. Remove unnecessary large files from LFS tracking.
        2. Use `.gitignore` to exclude large files that don't need version control.
        3. Consider using alternative storage solutions for very large assets.

   c. **Clean up LFS cache**
      - Sometimes, cleaning up the LFS cache can help:
        ```
        git lfs prune
        ```

5. **Reconfigure Git LFS tracking**
   - Review your `.gitattributes` file to ensure only necessary file types are tracked:
     ```
     git lfs track
     ```
   - Modify tracking as needed:
     ```
     git lfs track "*.psd"  # Track PSD files
     git lfs untrack "*.zip"  # Stop tracking ZIP files
     ```

6. **Commit changes and retry push**
   - After making necessary changes:
     ```
     git add .gitattributes
     git commit -m "Update Git LFS tracking"
     git push
     ```

7. **If issues persist**
   - Contact GitHub support for further assistance.
   - They may be able to provide more specific guidance based on your account and repository details.

Remember to regularly monitor your Git LFS usage to avoid hitting quotas in the future. Consider setting up alerts or regularly checking your GitHub account's storage usage statistics."""

# Create an EvalInput
# We want to evaluate the response to the customer issue based on the context and the user instructions
eval_input = EvalInput(
    inputs=[
        {"query": query},
        {"context": context},
    ],
    output={"response": response},
)

# Run the evaluation
result = faithfulness_evaluator.evaluate(eval_input, save_results=False)

INFO:datasets:PyTorch version 2.4.0 available.


INFO 01-23 19:28:27 awq_marlin.py:90] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 01-23 19:28:27 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='flowaicom/Flow-Judge-v0.1-AWQ', speculative_config=None, tokenizer='flowaicom/Flow-Judge-v0.1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), 

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 01-23 19:28:30 model_runner.py:1025] Loading model weights took 2.1717 GB
INFO 01-23 19:28:31 gpu_executor.py:122] # GPU blocks: 3084, # CPU blocks: 682


Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.14s/it, est. speed input: 299.97 toks/s, output: 44.11 toks/s]


In [2]:
# Display the result
display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))

__Feedback:__
The response provided is mostly consistent with the context given. It addresses the main issue of the user encountering a data quota error when pushing Git LFS tracked files. The solution outlines steps to understand the issue, check current LFS usage, review pricing and quotas, and options to resolve the issue, such as purchasing more data packs or optimizing the repository.

However, there are a few minor inconsistencies and additions that are not explicitly mentioned in the context:

1. The command `git lfs ls-files` to check current LFS usage is not mentioned in the context. While it is a logical step, it is not part of the provided documentation.

2. The suggestion to use `.gitignore` to exclude large files is not mentioned in the context. This is a reasonable suggestion but not part of the given information.

3. The command `git lfs prune` to clean up the LFS cache is not mentioned in the context. This is a useful suggestion but not part of the provided information.

Despite these minor additions, the vast majority of the content in the response is supported by the context. The response does not contain any significant hallucinated or fabricated information that contradicts the context.

Therefore, the response is mostly consistent with the provided context, with only minor and inconsequential inconsistencies.

__Score:__
4

## Models

`flow-eval` supports several model configurations. A configuration refers to the library used to run inference with the model. We currently support:
- vLLM, sync and async (the default engine, in sync mode)
- Hugging Face
- Llamafile
- Baseten (hosted, for when you don't want to run locally)
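
The comments in the first cell describe when each backend fits. As a rough sketch of that decision, the helper below returns the name of the model class to try. `pick_backend` is hypothetical and not part of `flow-eval`; it assumes that "Ampere or newer" corresponds to CUDA compute capability 8.0 or higher, and the preference for Baseten over Llamafile when an API key is present is an illustrative choice, not a library rule.

```python
from typing import Optional, Tuple


def pick_backend(
    compute_capability: Optional[Tuple[int, int]] = None,
    has_baseten_key: bool = False,
) -> str:
    """Suggest a model class name following the guidance in this tutorial.

    compute_capability is the CUDA (major, minor) pair of the local Nvidia GPU,
    or None when no Nvidia GPU is available.
    """
    if compute_capability is not None:
        major, _minor = compute_capability
        # Ampere and newer GPUs report compute capability 8.0 or higher.
        return "Vllm" if major >= 8 else "Hf"
    # No local Nvidia GPU: hosted inference if a key is set, otherwise Llamafile.
    return "Baseten" if has_baseten_key else "Llamafile"


print(pick_backend((8, 6)))  # Ampere-class GPU
print(pick_backend((7, 5)))  # pre-Ampere GPU
print(pick_backend(None))    # no Nvidia GPU
```

When the sketch suggests `Hf`, the first cell pairs it with `flash_attn=False`.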

## Metrics

A judge is initialized with a metric and a model.

We include some common metrics in the library, such as:
- RESPONSE_FAITHFULNESS_3POINT
- RESPONSE_FAITHFULNESS_5POINT
- RESPONSE_COMPREHENSIVENESS_3POINT
- RESPONSE_COMPREHENSIVENESS_5POINT

But you can also implement your own metrics and use them with the judge.

Note that each metric has required inputs and outputs, as you can see below:

In [4]:
RESPONSE_FAITHFULNESS_5POINT.keys()

{'input_columns': ['query', 'context'],
 'output_column': 'response',
 'expected_output_column': None}

`flow-eval` checks under the hood that the keys in your `EvalInput` match the keys the metric requires. This ensures the prompt is formatted correctly.

When you define a custom metric, you should specify the required keys as well.
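
To make the key check concrete, here is a minimal stand-alone sketch of that kind of validation. It does not reproduce `flow-eval`'s internal implementation: `missing_or_extra_keys` is a hypothetical helper, and the dictionaries simply mirror the `keys()` output and the `EvalInput` arguments shown in this tutorial.

```python
from typing import Any, Dict, List, Set, Tuple


def missing_or_extra_keys(
    required: Dict[str, Any],
    inputs: List[Dict[str, str]],
    output: Dict[str, str],
) -> Tuple[Set[str], Set[str]]:
    """Compare the keys supplied in a sample with the keys a metric requires.

    Returns (missing, extra): required keys that were not supplied, and
    supplied keys that the metric does not use.
    """
    expected = set(required["input_columns"]) | {required["output_column"]}
    supplied = {key for item in inputs for key in item} | set(output)
    return expected - supplied, supplied - expected


# Same shape as the RESPONSE_FAITHFULNESS_5POINT.keys() output above.
required_keys = {
    "input_columns": ["query", "context"],
    "output_column": "response",
    "expected_output_column": None,
}

missing, extra = missing_or_extra_keys(
    required_keys,
    inputs=[{"query": "..."}, {"context": "..."}],
    output={"response": "..."},
)
print(f"missing={missing}, extra={extra}")
```

A sample that omits `context`, for example, would come back with `context` in the first set, which is the kind of mismatch the library guards against.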

## Running batched evaluations

The `LMEvaluator` class also supports batch evaluation. This is useful when you want to evaluate multiple samples at once in Evaluation-Driven Development.

In [5]:
# Read the sample data
import json
with open("sample_data/csr_assistant.json", "r") as f:
    data = json.load(f)

# Create a list of inputs and outputs
inputs_batch = [
    [
        {"query": sample["query"]},
        {"context": sample["context"]},
    ]
    for sample in data
]
outputs_batch = [{"response": sample["response"]} for sample in data]

# Create a list of EvalInput
eval_inputs_batch = [EvalInput(inputs=inputs, output=output) for inputs, output in zip(inputs_batch, outputs_batch)]

# Run the batch evaluation
results = faithfulness_evaluator.batch_evaluate(eval_inputs_batch, save_results=False)

Processed prompts: 100%|██████████| 6/6 [00:08<00:00,  1.48s/it, est. speed input: 871.29 toks/s, output: 175.07 toks/s]


In [6]:
# Visualizing the results
for i, result in enumerate(results):
    display(Markdown(f"__Sample {i+1}:__"))
    display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))
    display(Markdown("---"))

__Sample 1:__

__Feedback:__
The response provided is mostly consistent with the context given. It correctly suggests using Git Large File Storage (LFS) to resolve the issue of pushing large files to a Git repository, which aligns with the context information. The steps outlined for installing and setting up Git LFS are also accurate and relevant to the context.

However, there are a few minor inconsistencies and fabrications:

1. The response suggests tracking large files using `git lfs track "*.large-file-extension"`, but the context does not provide specific instructions for tracking large files.

2. The response includes commands like `git add .gitattributes` and `git add large-file.ext`, which are not explicitly mentioned or implied in the context.

3. The response assumes the user is working with a specific file extension (`.large-file-extension`), which is not addressed in the context.

4. The response mentions pushing changes using `git push origin main`, which is not discussed or implied in the context.

While the response is largely based on the context provided, these minor inconsistencies and fabrications prevent it from being completely faithful to the given information.

__Score:__
3

---

__Sample 2:__

__Feedback:__
The response provided is mostly consistent with the given context, but it contains some significant inconsistencies and misleading information. 

1. The first step is correctly mentioned as checking existing remotes using `git remote -v`, which is consistent with the context.
2. The second step suggests changing the URL of the existing origin using `git remote set-url origin new-url`. This is also consistent with the context.
3. The third step mentions adding a new remote with a different name using `git remote add new-remote-name new-url`. This is consistent with the context.
4. The fourth step is where the response starts to deviate from the context. It suggests removing a remote with a different name and adding a new one using `git remote remove origin` followed by `git remote add origin new-url`. This is not only inconsistent but also potentially harmful advice, as it suggests removing the origin which is the default remote name.

Additionally, the response contains misleading information and instructions that are not supported by the context:
- It suggests that running `git remote -v` will hide all current remotes, which is incorrect.
- It advises to replace 'new-url' with the exact same URL, which contradicts the purpose of changing the URL.
- It mentions using 'new-remote-name' with an existing remote, which is not supported by the context.
- It suggests that after making these changes, one will definitely encounter the error again, which is not necessarily true and is misleading.

Overall, while the response includes some correct information from the context, it also introduces significant misinformation and instructions that deviate from the provided context.

__Score:__
2

---

__Sample 3:__

__Feedback:__
The response provided is mostly consistent with the context given. It accurately describes the `git revert` command and provides a step-by-step guide on how to use it, which aligns well with the context information. The response also mentions the importance of having a backup, which is a good practice as suggested in the context.

However, there are a few minor inconsistencies and additions that are not explicitly mentioned in the context:

1. The response suggests using `git log` to find the commit hash, which is a reasonable suggestion but not mentioned in the context.
2. It provides additional information about reverting multiple commits using a range, which is a useful tip but not explicitly stated in the context.
3. The response suggests creating a backup branch, which is good practice but not mentioned in the context.

These minor additions and suggestions, while helpful, are not directly supported by the context and could be considered as slight deviations from the strict context provided. Therefore, the response is mostly consistent but not perfectly faithful to the context.

__Score:__
4

---

__Sample 4:__

__Feedback:__
The response provided by the AI system is significantly inconsistent with the given context. The context outlines a detailed, step-by-step process for removing sensitive data from a Git repository, including using tools like BFG Repo-Cleaner or git filter-branch, force-pushing changes to GitHub, contacting GitHub Support, advising collaborators to rebase, and using git commands to remove old references.

However, the response suggests that no action is needed, that Git automatically handles the removal of sensitive data, and that no specialized tools or steps are required. This directly contradicts the context, which provides specific tools and steps to follow. The response introduces a substantial amount of fabricated information that deviates from the context, suggesting that the AI system has hallucinated or fabricated details not present in the original context.

Therefore, based on the evaluation criteria and scoring rubric, the response is mostly inconsistent with the provided context.

__Score:__
2

---

__Sample 5:__

__Feedback:__
The response is mostly consistent with the provided context, with only minor and inconsequential inconsistencies. The response accurately describes the steps to resolve merge conflicts, including opening conflicted files, identifying conflict markers, deciding which changes to keep, editing the files, staging the resolved files, and committing the changes. It also mentions the option to use `git mergetool` for a visual diff tool, which aligns with the context.

However, there are a few minor issues:
1. The response suggests using `git add <filename>` instead of `git add <filename>`, which is a minor typographical error.
2. The response includes a tip about minimizing merge conflicts by keeping branches up-to-date, which, while helpful, is not part of the original context.

These minor issues do not significantly detract from the overall consistency and accuracy of the response with the provided context.

__Score:__
4

---

__Sample 6:__

__Feedback:__
The response provided by the AI system is highly consistent with the context given. It accurately describes the process of adding a remote repository and pushing local changes to a remote repository using Git. The steps outlined in the response are directly supported by the context, which provides the necessary commands and syntax.

1. The first step of adding the remote repository is correctly described using the 'git remote add' command, with the correct syntax and example provided.
2. The second step of pushing the local branch to the remote repository is accurately described using the 'git push -u origin main' command.

There are no hallucinated or fabricated details in the response. All information provided is directly supported by the context. The response is clear, concise, and faithful to the original context, making it easy for the user to follow the instructions.

Overall, the response meets the highest standard of consistency and faithfulness to the provided context.

__Score:__
5

---
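
Reading six feedback blocks one by one does not scale, so it can help to summarize the scores. The sketch below is an illustration only: `ScoredResult` is a hypothetical stand-in that assumes nothing about the library's result objects beyond a numeric `score` attribute, and the demo values are the six scores printed above.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, Tuple


@dataclass
class ScoredResult:
    """Stand-in for an evaluation result; only `score` is used here."""

    score: int


def summarize_scores(results: Iterable) -> Tuple[float, Dict[int, int]]:
    """Return the mean score and a score -> count mapping, sorted by score."""
    scores = [r.score for r in results]
    return mean(scores), dict(sorted(Counter(scores).items()))


# Scores of the six samples shown above.
demo_results = [ScoredResult(s) for s in (3, 2, 4, 2, 4, 5)]
average, distribution = summarize_scores(demo_results)
print(f"Mean score: {average:.2f}")      # Mean score: 3.33
print(f"Distribution: {distribution}")   # Distribution: {2: 2, 3: 1, 4: 2, 5: 1}
```

If the objects returned by `batch_evaluate` expose `score` as a number, as the display code above suggests, `summarize_scores(results)` should work on them directly.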

### Saving the results

When running batched evaluations, it's usually recommended to save the results to a file for future reference and reproducibility. This is the default behavior of the evaluate methods, which is why the earlier cells passed `save_results=False` explicitly.

In [7]:
# Run the batch evaluation
results = faithfulness_evaluator.batch_evaluate(eval_inputs_batch, save_results=True)

Processed prompts: 100%|██████████| 6/6 [00:06<00:00,  1.09s/it, est. speed input: 1180.37 toks/s, output: 204.62 toks/s]
INFO:flow_eval.core.base:Saving results to output/
INFO:flow_eval.core.io:Results saved to output/Response_Faithfulness_5-point_Likert/results_Response_Faithfulness_5-point_Likert_flowaicomFlow-Judge-v01-AWQ_vllm_2025-01-23T18-29-45.567.jsonl


In [8]:
from pathlib import Path

output_dir = Path("output")
# Pick the most recently modified run directory
latest_run = max(output_dir.iterdir(), key=lambda p: p.stat().st_mtime)

print(f"Contents of {output_dir}:")
print(list(output_dir.iterdir()))

print(f"\nContents of {latest_run}:")
print(list(latest_run.iterdir()))

Contents of output:
[PosixPath('output/Response_Faithfulness_5-point_Likert')]

Contents of output/Response_Faithfulness_5-point_Likert:
[PosixPath('output/Response_Faithfulness_5-point_Likert/results_Response_Faithfulness_5-point_Likert_flowaicomFlow-Judge-v01-AWQ_vllm_2025-01-23T18-29-45.567.jsonl'), PosixPath('output/Response_Faithfulness_5-point_Likert/metadata_Response_Faithfulness_5-point_Likert_flowaicomFlow-Judge-v01-AWQ')]


Each evaluation run generates two files:
- `results_....jsonl`: Contains the evaluation results, in JSON Lines format.
- `metadata_....json`: Contains metadata about the evaluation for reproducibility.

These files are saved in a per-metric subdirectory of the `output` directory (here, `output/Response_Faithfulness_5-point_Likert`).
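
To read saved results back, a small loader built on the standard library is usually enough. The sketch below assumes only what the `.jsonl` extension implies, one JSON object per line, and makes no assumption about the field names inside each record. Both helper functions are hypothetical, not part of `flow-eval`.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


def latest_results_file(run_dir: Path) -> Optional[Path]:
    """Return the most recently modified results_*.jsonl file in run_dir, if any."""
    if not run_dir.is_dir():
        return None
    candidates = list(run_dir.glob("results_*.jsonl"))
    return max(candidates, key=lambda p: p.stat().st_mtime) if candidates else None


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Parse a JSON Lines file into a list of records, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Directory name taken from the listing above.
run_dir = Path("output") / "Response_Faithfulness_5-point_Likert"
results_path = latest_results_file(run_dir)
if results_path is not None:
    records = load_jsonl(results_path)
    print(f"Loaded {len(records)} records from {results_path.name}")
```

Inspecting `records[0].keys()` is a quick way to see which fields a run actually stores before building any analysis on top of them.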