# Quickstart

This tutorial demonstrates how to use the `flow-judge` library to perform language model-based evaluations using Flow-Judge-v0.1 models.

## Running an evaluation

Running an evaluation is as simple as:

In [1]:
from flow_judge.models.model_factory import ModelFactory
from flow_judge.flow_judge import EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# Create a model using ModelFactory.
# The "Flow-Judge-v0.1_HF_no_flsh_attn" config is intended for non-Ampere GPUs (no flash attention).
model = ModelFactory.create_model("Flow-Judge-v0.1_HF_no_flsh_attn")

# Initialize the judge
faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model
)

# Sample to evaluate
query = """Please read the technical issue that the user is facing and help me create a detailed solution based on the context provided."""
context = """# Customer Issue:
I'm having trouble when uploading a git lfs tracked file to my repo: (base)  bernardo@bernardo-desktop  ~/repos/lm-evaluation-harness  ↱ Flow-Judge-v0.1_evals  git push                                            
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

# Documentation:
Configuring Git Large File Storage
Once Git LFS is installed, you need to associate it with a large file in your repository.

Platform navigation
Mac
Windows
Linux
If there are existing files in your repository that you'd like to use GitHub with, you need to first remove them from the repository and then add them to Git LFS locally. For more information, see "Moving a file in your repository to Git Large File Storage."

If there are referenced Git LFS files that did not upload successfully, you will receive an error message. For more information, see "Resolving Git Large File Storage upload failures."

Open Terminal.

Change your current working directory to an existing repository you'd like to use with Git LFS.

To associate a file type in your repository with Git LFS, enter git lfs track followed by the name of the file extension you want to automatically upload to Git LFS.

For example, to associate a .psd file, enter the following command:

$ git lfs track "*.psd"
> Tracking "*.psd"
Every file type you want to associate with Git LFS will need to be added with git lfs track. This command amends your repository's .gitattributes file and associates large files with Git LFS.

Note: We strongly suggest that you commit your local .gitattributes file into your repository.

Relying on a global .gitattributes file associated with Git LFS may cause conflicts when contributing to other Git projects.
Including the .gitattributes file in the repository allows people creating forks or fresh clones to more easily collaborate using Git LFS.
Including the .gitattributes file in the repository allows Git LFS objects to optionally be included in ZIP file and tarball archives.
Add a file to the repository matching the extension you've associated:

git add path/to/file.psd
Commit the file and push it to GitHub:

git commit -m "add file.psd"
git push
You should see some diagnostic information about your file upload:

> Sending file.psd
> 44.74 MB / 81.04 MB  55.21 % 14s
> 64.74 MB / 81.04 MB  79.21 % 3s"""
response = """# Resolving Git LFS Data Quota Issue

## Problem
The user is encountering an error when trying to push Git LFS tracked files:
```
This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
```

## Solution

1. **Understand the issue**
   - The error message indicates that the repository has exceeded its data quota for Git LFS (Large File Storage).
   - This is likely due to the size or number of large files being tracked by Git LFS.

2. **Check current LFS usage**
   - Run the following command to see the current LFS objects:
     ```
     git lfs ls-files
     ```
   - This will show you which files are being tracked by Git LFS and their sizes.

3. **Review Git LFS pricing and quotas**
   - Visit GitHub's pricing page to understand the current LFS storage and bandwidth limits for your account type.
   - Free accounts typically have lower limits compared to paid plans.

4. **Options to resolve the issue**

   a. **Purchase more data packs**
      - If you need to keep all current LFS objects and continue using LFS extensively:
        1. Go to your GitHub account settings.
        2. Navigate to the Billing section.
        3. Look for options to purchase additional data packs for Git LFS.

   b. **Optimize repository**
      - If purchasing more data is not an option, consider optimizing your repository:
        1. Remove unnecessary large files from LFS tracking.
        2. Use `.gitignore` to exclude large files that don't need version control.
        3. Consider using alternative storage solutions for very large assets.

   c. **Clean up LFS cache**
      - Sometimes, cleaning up the LFS cache can help:
        ```
        git lfs prune
        ```

5. **Reconfigure Git LFS tracking**
   - Review your `.gitattributes` file to ensure only necessary file types are tracked:
     ```
     git lfs track
     ```
   - Modify tracking as needed:
     ```
     git lfs track "*.psd"  # Track PSD files
     git lfs untrack "*.zip"  # Stop tracking ZIP files
     ```

6. **Commit changes and retry push**
   - After making necessary changes:
     ```
     git add .gitattributes
     git commit -m "Update Git LFS tracking"
     git push
     ```

7. **If issues persist**
   - Contact GitHub support for further assistance.
   - They may be able to provide more specific guidance based on your account and repository details.

Remember to regularly monitor your Git LFS usage to avoid hitting quotas in the future. Consider setting up alerts or regularly checking your GitHub account's storage usage statistics."""

# Create an EvalInput
# We want to evaluate the response to the customer issue based on the context and the user instructions
eval_input = EvalInput(
    inputs=[
        {"query": query},
        {"context": context},
    ],
    output={"response": response},
)

# Run the evaluation
result = faithfulness_judge.evaluate(eval_input, save_results=False)


In [2]:
# Display the result
display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))

__Feedback:__
The response provided is mostly consistent with the context given, but it introduces some minor fabrications and assumptions that are not explicitly supported by the context.

The response correctly identifies the problem of exceeding the Git LFS data quota and provides a general solution outline. It also includes some specific steps that are not directly mentioned in the context, such as running `git lfs ls-files` to check current LFS usage and reviewing Git LFS pricing and quotas. These additions are helpful but not directly derived from the given context.

The response does accurately reflect some information from the context, such as the need to purchase more data packs if the user has a free account and wants to keep using LFS extensively. It also mentions optimizing the repository by removing unnecessary large files and using `.gitignore` to exclude large files, which aligns with the context's suggestion to "remove them from the repository and then add them to Git LFS locally."

However, the response introduces some assumptions and additional steps that are not explicitly mentioned in the context. For example, it suggests contacting GitHub support if issues persist, which is not mentioned in the provided context. It also includes a recommendation to regularly monitor Git LFS usage and set up alerts, which, while potentially useful, is not stated in the context.

Overall, while the response is mostly consistent with the context and provides valuable information, it introduces some minor fabrications and assumptions that are not directly supported by the given context.

__Score:__
3

# Models

`flow-judge` supports different model configurations. A configuration determines which library is used to run inference with the models. We currently support:
- vLLM (default)
- Hugging Face

You can check the available models and choose the one that best fits your needs. By default, we run inference with a quantized model using the vLLM engine.

In [4]:
from flow_judge.models.model_configs import get_available_configs
get_available_configs()

['Flow-Judge-v0.1-AWQ',
 'Flow-Judge-v0.1',
 'Flow-Judge-v0.1_HF',
 'Flow-Judge-v0.1_HF_no_flsh_attn',
 'Flow-Judge-v0.1-AWQ-Async']
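
One way to choose among these names is to state an order of preference and fall back to the next entry when a configuration is not available. The sketch below shows that pattern on the list printed above. `pick_config` is a hypothetical helper, not part of `flow-judge`, and the preference order is only an example.

```python
def pick_config(available, preferences):
    """Return the first preferred config name that is actually available."""
    for name in preferences:
        if name in available:
            return name
    raise ValueError(f"None of {preferences} found in {available}")


# List copied from the get_available_configs() output above
available = [
    "Flow-Judge-v0.1-AWQ",
    "Flow-Judge-v0.1",
    "Flow-Judge-v0.1_HF",
    "Flow-Judge-v0.1_HF_no_flsh_attn",
    "Flow-Judge-v0.1-AWQ-Async",
]

# Example preference order; adjust it to your hardware and installed extras
chosen = pick_config(
    available,
    ["Flow-Judge-v0.1_HF", "Flow-Judge-v0.1_HF_no_flsh_attn"],
)
print(chosen)
```

The chosen name can then be passed to `ModelFactory.create_model`, as in the first example.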

# Metrics

A judge is initialized with a metric and a model.

We include some common metrics in the library, such as:
- RESPONSE_FAITHFULNESS_3POINT
- RESPONSE_FAITHFULNESS_5POINT
- RESPONSE_COMPREHENSIVENESS_3POINT
- RESPONSE_COMPREHENSIVENESS_5POINT

But you can also implement your own metrics and use them with the judge.

Note that metrics have required inputs and outputs as you can see below:

In [5]:
RESPONSE_FAITHFULNESS_5POINT.print_required_keys()

Metric: Response Faithfulness (5-point Likert)
Required inputs: query, context
Required output: response


Under the hood, `flow-judge` checks that the keys of each `EvalInput` match the keys required by the metric. This ensures that the prompt is formatted correctly.

When you define a custom metric, you should specify the required keys as well.
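
To illustrate what such a key check involves, the sketch below validates an input/output pair against a metric's required keys. The function `check_required_keys` and its error messages are hypothetical and do not reproduce `flow-judge`'s internal validation. The data layout mirrors the `inputs` list and `output` dict passed to `EvalInput` above.

```python
def check_required_keys(inputs, output, required_inputs, required_output):
    """Raise ValueError if the sample's keys do not match the required keys.

    Hypothetical helper for illustration only; flow-judge performs its own
    validation internally.
    """
    # `inputs` is a list of single-key dicts, e.g. [{"query": ...}, {"context": ...}]
    input_keys = {key for item in inputs for key in item}
    missing = set(required_inputs) - input_keys
    extra = input_keys - set(required_inputs)
    if missing or extra:
        raise ValueError(
            f"Input keys do not match: missing={sorted(missing)}, unexpected={sorted(extra)}"
        )
    if required_output not in output:
        raise ValueError(f"Output is missing required key: {required_output!r}")


# Keys taken from the print_required_keys() output above
required_inputs = ["query", "context"]
required_output = "response"

# A complete sample passes silently
check_required_keys(
    inputs=[{"query": "How do I push?"}, {"context": "Use git push."}],
    output={"response": "Run git push."},
    required_inputs=required_inputs,
    required_output=required_output,
)

# A sample without "context" is rejected
try:
    check_required_keys(
        inputs=[{"query": "How do I push?"}],
        output={"response": "Run git push."},
        required_inputs=required_inputs,
        required_output=required_output,
    )
except ValueError as err:
    error_message = str(err)
    print(error_message)
```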

## Running batched evaluations

The `FlowJudge` class also supports batch evaluation. This is useful when you want to evaluate multiple samples at once in Evaluation-Driven Development.

In [6]:
# Read the sample data
import json
with open("sample_data/csr_assistant.json", "r") as f:
    data = json.load(f)

# Create a list of inputs and outputs
inputs_batch = [
    [
        {"query": sample["query"]},
        {"context": sample["context"]},
    ]
    for sample in data
]
outputs_batch = [{"response": sample["response"]} for sample in data]

# Create a list of EvalInput
eval_inputs_batch = [
    EvalInput(inputs=inputs, output=output)
    for inputs, output in zip(inputs_batch, outputs_batch)
]

# Run the batch evaluation
results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=False)

Processed prompts: 100%|██████████| 6/6 [00:07<00:00,  1.21s/it, est. speed input: 1071.99 toks/s, output: 197.56 toks/s]


In [8]:
# Visualizing the results
for i, result in enumerate(results):
    display(Markdown(f"__Sample {i+1}:__"))
    display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))
    display(Markdown("---"))

__Sample 1:__

__Feedback:__
The response provided is mostly consistent with the given context, but it introduces a few minor inconsistencies and fabrications. 

1. The response correctly mentions the use of Git Large File Storage (LFS) to resolve the issue, which aligns with the context provided.
2. It accurately describes the steps to install and set up Git LFS, which is supported by the context.
3. The response introduces a step to track large files using `git lfs track "*.large-file-extension"`, which is a plausible extension of the context but not explicitly mentioned. This is a minor fabrication.
4. The response mentions adding a .gitattributes file and committing large files, which is consistent with the context but not explicitly stated.
5. The response concludes with instructions to push changes, which is a standard Git operation but not specifically mentioned in the context.

Overall, the response is mostly consistent with the context, with only minor and inconsequential inconsistencies or fabrications.

__Score:__
4

---

__Sample 2:__

__Feedback:__
The response provided is mostly consistent with the context but contains some significant inconsistencies and fabrications that deviate from the given information. 

1. The first step of checking existing remotes is correctly mentioned, but the explanation is misleading. The context does not state that this will "hide all the current remotes," which is a fabrication.
2. The second step about changing the URL of the existing origin is partially correct but includes a fabrication. The context does not mention replacing 'new-url' with the "exact same URL you're currently using," which is incorrect.
3. The third step about removing a remote with a different name is partially correct but includes a fabrication. The context does not suggest replacing 'new-remote-name' with the name of an existing remote or 'new-url' with any random string.
4. The fourth step about removing the existing origin and adding a new one is correct but includes a fabrication. The context does not suggest choosing the option that "worst fits your needs."

Overall, while the response does include some correct information from the context, it also introduces several fabrications and misleading statements that significantly deviate from the provided information. Therefore, the response is mostly inconsistent with the context.

__Score:__
2

---

__Sample 3:__

__Feedback:__
The response provided is mostly consistent with the given context and does not contain significant hallucinated or fabricated information. The response accurately describes the `git revert` command and its usage, which is directly supported by the context. It also mentions the `git reset` command, although it does not elaborate on it as the context does. However, this is a minor omission since the primary focus of the response is on reverting commits using `git revert`.

The response includes additional helpful information such as creating a backup branch before performing the operation, which, while not explicitly mentioned in the context, is a good practice that aligns with the advice to ensure safety when making significant changes to Git history.

Overall, the response is informative and mostly faithful to the context, with only minor additions that do not contradict the provided information.

__Score:__
4

---

__Sample 4:__

__Feedback:__
The response provided is significantly inconsistent with the given context. The context outlines a series of specific steps to remove sensitive information from a Git repository, including using tools like BFG Repo-Cleaner or git filter-branch, force-pushing changes, contacting GitHub Support, advising collaborators to rebase, and running specific Git commands to remove old references.

However, the generated response contradicts this information by stating that no action is needed and that Git will automatically handle the removal of sensitive data. This claim is not supported by the context and introduces substantial fabricated information. The response also incorrectly suggests that force-pushing, contacting GitHub Support, and other mentioned actions are unnecessary, which directly contradicts the context provided.

Given the significant amount of hallucinated and fabricated information, the response fails to accurately reflect the context and provides misleading advice to the user.

__Score:__
1

---

__Sample 5:__

__Feedback:__
The response provided is mostly consistent with the given context and does not contain significant hallucinated or fabricated information. Here is a detailed breakdown:

1. **Consistency with Context**: The response accurately follows the steps outlined in the context for resolving merge conflicts. It correctly mentions opening the conflicted file in a text editor, identifying conflict markers, deciding which changes to keep, editing the file, saving it, staging the resolved file, and committing the changes. These steps are directly supported by the context provided.

2. **Additional Information**: The response includes a tip about minimizing merge conflicts in the future by keeping branches up-to-date with the main branch, which is not explicitly mentioned in the context but is a logical and relevant piece of advice. This addition, while not fabricated, is not part of the original context.

3. **Minor Inconsistencies**: The response does not contain any significant hallucinated or fabricated information. The additional tip about keeping branches up-to-date is a minor addition that does not detract from the overall consistency with the provided context.

Overall, the response is mostly consistent with the context, with only a minor addition that does not contradict the provided information.

__Score:__
4

---

__Sample 6:__

__Feedback:__
The response provided is highly consistent with the given context. It accurately reflects the information provided in the context about adding a remote repository to a local Git repository. The response correctly uses the 'git remote add' command syntax and provides an example that matches the context exactly. It also correctly mentions the 'git push -u origin main' command to push the local repository to the remote, which is in line with the context.

There are no hallucinated or fabricated details in the response. Every piece of information provided is directly supported by the context. The response is clear, concise, and faithfully represents the technical instructions given in the context.

Therefore, the response meets the highest standard of consistency and faithfulness to the provided context.

__Score:__
5

---
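
For larger batches, a summary is often easier to read than the individual feedback texts. The following sketch aggregates scores into a mean and a distribution. It relies only on each result exposing a numeric `score` attribute, as used in the loop above. The `FakeResult` class is a stand-in so the snippet runs without a model, and `summarize_scores` is a hypothetical helper rather than part of `flow-judge`.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class FakeResult:
    """Stand-in for an evaluation result; holds only the attributes used here."""
    feedback: str
    score: int


def summarize_scores(results):
    """Return the mean score and a score -> count mapping for a batch."""
    scores = [r.score for r in results]
    return {
        "mean": mean(scores),
        "distribution": dict(sorted(Counter(scores).items())),
    }


# Scores matching the six samples displayed above
fake_results = [FakeResult(feedback="...", score=s) for s in (4, 2, 4, 1, 4, 5)]
summary = summarize_scores(fake_results)
print(summary)
```

With a real run, calling `summarize_scores(results)` on the list returned by `batch_evaluate` should work the same way, assuming every result carries a numeric score.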

### Saving the results

When running batched evaluation, it's usually recommended to save the results to a file for future reference and reproducibility. This is the default behavior of the evaluate methods.

In [9]:
# Run the batch evaluation
results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=True)

INFO:flow_judge.flow_judge:Saving results to output/
INFO:flow_judge.utils.result_writer:Results saved to output/response_faithfulness_5-point_likert/results_response_faithfulness_5-point_likert_flowaicom__Flow-Judge-v0.1_transformers_2024-09-20T13-50-24.779.jsonl


In [10]:
from pathlib import Path

output_dir = Path("output")
# Results are written to a subdirectory named after the metric; take the first one
metric_dir = next(output_dir.iterdir())

print(f"Contents of {output_dir}:")
print(list(output_dir.iterdir()))

print(f"\nContents of {metric_dir}:")
print(list(metric_dir.iterdir()))

Contents of output:
[PosixPath('output/response_faithfulness_5-point_likert')]

Contents of output/response_faithfulness_5-point_likert:
[PosixPath('output/response_faithfulness_5-point_likert/results_response_faithfulness_5-point_likert_flowaicom__Flow-Judge-v0.1_transformers_2024-09-20T13-50-24.779.jsonl'), PosixPath('output/response_faithfulness_5-point_likert/metadata_response_faithfulness_5-point_likert_flowaicom__Flow-Judge-v0.1_transformers_2024-09-20T13-50-24.779.jsonl')]


Each evaluation run generates two files:
- `results_....jsonl`: Contains the evaluation results.
- `metadata_....jsonl`: Contains metadata about the evaluation for reproducibility.

These files are saved in the `output` directory, in a subdirectory named after the metric.
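
The saved file names end in `.jsonl`, which suggests the JSON Lines format (one JSON object per line), so they can be loaded back with the standard library. The sketch below picks the most recently modified `results_*.jsonl` file in a directory and parses it line by line. The helper names are hypothetical, and the record fields in the demo (`score`, `feedback`) are assumptions for illustration; inspect a real file to see the actual schema.

```python
import json
import tempfile
from pathlib import Path


def latest_results_file(metric_dir):
    """Return the most recently modified results_*.jsonl file, or None."""
    candidates = list(Path(metric_dir).glob("results_*.jsonl"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def read_jsonl(path):
    """Parse a JSON Lines file into a list of Python objects."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo on a temporary directory so the snippet runs without a prior evaluation
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "results_demo.jsonl"
    demo_file.write_text(
        '{"score": 4, "feedback": "ok"}\n{"score": 2, "feedback": "poor"}\n',
        encoding="utf-8",
    )
    records = read_jsonl(latest_results_file(tmp))

print(len(records), [r["score"] for r in records])
```

For real output, pass the metric subdirectory instead of the temporary one, for example `output/response_faithfulness_5-point_likert`.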