# Quickstart

This tutorial demonstrates how to use the `flow-judge` library to perform language model-based evaluations using Flow-Judge-v0.1 models.

## Running an evaluation

Running an evaluation is as simple as:

In [1]:
from flow_judge.models.model_factory import ModelFactory
from flow_judge.flow_judge import EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# Create a model using ModelFactory
model = ModelFactory.create_model("Flow-Judge-v0.1-AWQ")  # ! Replace with "Flow-Judge-v0.1_HF_no_flsh_attn" if running on non-Ampere GPUs

# Initialize the judge
faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model
)

# Sample to evaluate
user_instructions = """Please read the technical issue that the user is facing and help me create a detailed solution based on the context provided."""
customer_issue = """I'm having trouble when uploading a git lfs tracked file to my repo: (base)  bernardo@bernardo-desktop  ~/repos/lm-evaluation-harness  ↱ Flow-Judge-v0.1_evals  git push                                            
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access."""
context = """Configuring Git Large File Storage
Once Git LFS is installed, you need to associate it with a large file in your repository.

Platform navigation
Mac
Windows
Linux
If there are existing files in your repository that you'd like to use GitHub with, you need to first remove them from the repository and then add them to Git LFS locally. For more information, see "Moving a file in your repository to Git Large File Storage."

If there are referenced Git LFS files that did not upload successfully, you will receive an error message. For more information, see "Resolving Git Large File Storage upload failures."

Open Terminal.

Change your current working directory to an existing repository you'd like to use with Git LFS.

To associate a file type in your repository with Git LFS, enter git lfs track followed by the name of the file extension you want to automatically upload to Git LFS.

For example, to associate a .psd file, enter the following command:

$ git lfs track "*.psd"
> Tracking "*.psd"
Every file type you want to associate with Git LFS will need to be added with git lfs track. This command amends your repository's .gitattributes file and associates large files with Git LFS.

Note: We strongly suggest that you commit your local .gitattributes file into your repository.

Relying on a global .gitattributes file associated with Git LFS may cause conflicts when contributing to other Git projects.
Including the .gitattributes file in the repository allows people creating forks or fresh clones to more easily collaborate using Git LFS.
Including the .gitattributes file in the repository allows Git LFS objects to optionally be included in ZIP file and tarball archives.
Add a file to the repository matching the extension you've associated:

git add path/to/file.psd
Commit the file and push it to GitHub:

git commit -m "add file.psd"
git push
You should see some diagnostic information about your file upload:

> Sending file.psd
> 44.74 MB / 81.04 MB  55.21 % 14s
> 64.74 MB / 81.04 MB  79.21 % 3s"""
response = """# Resolving Git LFS Data Quota Issue



## Problem
The user is encountering an error when trying to push Git LFS tracked files:
```
This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
```

## Solution

1. **Understand the issue**
   - The error message indicates that the repository has exceeded its data quota for Git LFS (Large File Storage).
   - This is likely due to the size or number of large files being tracked by Git LFS.

2. **Check current LFS usage**
   - Run the following command to see the current LFS objects:
     ```
     git lfs ls-files
     ```
   - This will show you which files are being tracked by Git LFS and their sizes.

3. **Review Git LFS pricing and quotas**
   - Visit GitHub's pricing page to understand the current LFS storage and bandwidth limits for your account type.
   - Free accounts typically have lower limits compared to paid plans.

4. **Options to resolve the issue**

   a. **Purchase more data packs**
      - If you need to keep all current LFS objects and continue using LFS extensively:
        1. Go to your GitHub account settings.
        2. Navigate to the Billing section.
        3. Look for options to purchase additional data packs for Git LFS.

   b. **Optimize repository**
      - If purchasing more data is not an option, consider optimizing your repository:
        1. Remove unnecessary large files from LFS tracking.
        2. Use `.gitignore` to exclude large files that don't need version control.
        3. Consider using alternative storage solutions for very large assets.

   c. **Clean up LFS cache**
      - Sometimes, cleaning up the LFS cache can help:
        ```
        git lfs prune
        ```

5. **Reconfigure Git LFS tracking**
   - Review your `.gitattributes` file to ensure only necessary file types are tracked:
     ```
     git lfs track
     ```
   - Modify tracking as needed:
     ```
     git lfs track "*.psd"  # Track PSD files
     git lfs untrack "*.zip"  # Stop tracking ZIP files
     ```

6. **Commit changes and retry push**
   - After making necessary changes:
     ```
     git add .gitattributes
     git commit -m "Update Git LFS tracking"
     git push
     ```

7. **If issues persist**
   - Contact GitHub support for further assistance.
   - They may be able to provide more specific guidance based on your account and repository details.

Remember to regularly monitor your Git LFS usage to avoid hitting quotas in the future. Consider setting up alerts or regularly checking your GitHub account's storage usage statistics."""

# Create an EvalInput
# We want to evaluate the response to the customer issue based on the context and the user instructions
eval_input = EvalInput(
    inputs=[
        {"user_instructions": user_instructions},
        {"customer_issue": customer_issue},
        {"context": context}
    ],
    output=response,
)

# Run the evaluation
result = faithfulness_judge.evaluate(eval_input, save_results=False)

INFO 09-17 13:37:45 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 09-17 13:37:45 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='flowaicom/Flow-Judge-v0.1-AWQ', speculative_config=None, tokenizer='flowaicom/Flow-Judge-v0.1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_mod

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 09-17 13:37:47 model_runner.py:926] Loading model weights took 2.1861 GB
INFO 09-17 13:37:49 gpu_executor.py:122] # GPU blocks: 3084, # CPU blocks: 682


Processed prompts: 100%|██████████| 1/1 [00:06<00:00,  6.29s/it, est. speed input: 341.32 toks/s, output: 46.56 toks/s]


In [2]:
# Display the result
display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))

__Feedback:__
The response provided is mostly consistent with the context given. It addresses the specific issue of a Git LFS data quota being exceeded and offers a detailed solution. The response includes steps such as checking current LFS usage, reviewing Git LFS pricing and quotas, and optimizing the repository. It also suggests purchasing more data packs, which aligns with the context's suggestion to purchase more data packs to restore access.

However, there are a few minor inconsistencies and additions that are not explicitly mentioned in the context:

1. The response suggests running `git lfs ls-files` to check current LFS usage, which is a good practice but not mentioned in the provided context.
2. It includes a suggestion to clean up the LFS cache using `git lfs prune`, which is a helpful tip but not part of the given context.
3. The response advises setting up alerts or regularly checking GitHub account's storage usage statistics, which is a proactive measure but not mentioned in the context.

Despite these minor additions, the vast majority of the content is supported by the context, and the response does not introduce any significant hallucinated or fabricated information. Therefore, the response is mostly consistent with the provided context.

__Score:__
4

## Running batched evaluations

The `FlowJudge` class also supports batch evaluation, which is useful for evaluating many samples at once, as is common in Evaluation-Driven Development.

In [9]:
# Read the sample data
import json
with open("sample_data/csr_assistant.json", "r") as f:
    data = json.load(f)

# Create a list of inputs and outputs
inputs_batch = [
    [
        {"user_instructions": sample["user_instructions"]},
        {"customer_issue": sample["customer_issue"]},
        {"context": sample["context"]}
    ]
    for sample in data
]
outputs_batch = [sample["response"] for sample in data]

# Create a list of EvalInput
eval_inputs_batch = [EvalInput(inputs=inputs, output=output) for inputs, output in zip(inputs_batch, outputs_batch)]
                         
# Run the batch evaluation
results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=False)

Processed prompts: 100%|██████████| 6/6 [00:07<00:00,  1.24s/it, est. speed input: 1040.28 toks/s, output: 190.25 toks/s]
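
The batch cell above indexes every sample by four keys (`user_instructions`, `customer_issue`, `context`, `response`), so a sample missing any of them raises a `KeyError` partway through. A minimal sketch of a pre-flight check follows. The helper name `find_invalid_samples` and the placeholder data are hypothetical and not part of `flow-judge`.

```python
# Keys read by the batch cell above
REQUIRED_KEYS = ("user_instructions", "customer_issue", "context", "response")


def find_invalid_samples(samples):
    """Return (index, missing_keys) pairs for samples lacking a required key."""
    problems = []
    for index, sample in enumerate(samples):
        missing = [key for key in REQUIRED_KEYS if key not in sample]
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical placeholder data standing in for the loaded JSON file
samples = [
    {"user_instructions": "...", "customer_issue": "...", "context": "...", "response": "..."},
    {"user_instructions": "...", "customer_issue": "..."},
]
print(find_invalid_samples(samples))
```

Running the same check on `data` before building `inputs_batch` reports incomplete samples up front instead of failing mid-loop.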


In [10]:
# Visualizing the results
for i, result in enumerate(results):
    display(Markdown(f"__Sample {i+1}:__"))
    display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))
    display(Markdown("---"))

__Sample 1:__

__Feedback:__
The response is mostly consistent with the provided context, but it introduces some minor inconsistencies and fabrications. 

1. The suggestion to use Git LFS is appropriate and consistent with the context.
2. The step to install Git LFS is also correct and aligns with the context.
3. The instruction to set up Git LFS for the user account is accurate.
4. The suggestion to track large files using `git lfs track` is correct, but it assumes the user knows the file extensions, which is not explicitly mentioned in the context.
5. The instruction to add a .gitattributes file is not mentioned in the context, which only briefly mentions that Git LFS replaces large files with text pointers inside Git.
6. The commands to add and commit large files, and to push changes, are generally correct but lack the specificity of using Git LFS for handling large files.

Overall, the response is mostly consistent with the context, but it introduces some minor inconsistencies and fabrications that are not fully supported by the provided context.

__Score:__
4

---

__Sample 2:__

__Feedback:__
The response provided is mostly consistent with the context but contains several significant inaccuracies and misleading instructions that deviate from the context. 

1. The first step correctly instructs to check existing remotes using `git remote -v`, which is consistent with the context.
2. The second step incorrectly suggests using `git remote set-url origin new-url` to keep the URL unchanged, which contradicts the context that implies changing the URL.
3. The third step incorrectly suggests using `git remote add new-remote-name new-url` to remove a remote, which is the opposite of what the context suggests.
4. The fourth step incorrectly combines commands to remove and add a remote, which is not supported by the context.

Additionally, the response includes misleading information such as suggesting that these changes will definitely result in the same error, which is not supported by the context. The response also introduces fabricated elements like "Replace 'new-url' with the exact same URL you're currently using" and "Replace 'new-remote-name' with the name of an existing remote, and 'new-url' with any random string," which are not present in the context.

Overall, while the response does contain some correct information from the context, it is mostly inconsistent and contains significant amounts of fabricated information that deviate from the provided context.

__Score:__
2

---

__Sample 3:__

__Feedback:__
The response provided is mostly consistent with the context given. It accurately describes the `git revert` command and provides a step-by-step guide on how to use it, which aligns well with the context. The response also mentions the importance of having a backup, which is a good practice as suggested in the context.

However, there are a few minor inconsistencies and additions that are not explicitly mentioned in the context:

1. The response suggests using `git log` to find the commit hash, which is a reasonable suggestion but not mentioned in the context.

2. The response includes a command for reverting multiple commits using a range, which is a useful addition but not explicitly stated in the context.

3. The suggestion to create a backup branch before performing any Git operations is good advice but not directly mentioned in the context.

These minor additions and suggestions, while helpful, are not fabricated or hallucinated information but rather practical tips that go beyond the given context. Therefore, the response is mostly consistent with the context but includes some additional helpful information.

Overall, the response is very close to being completely consistent with the provided context, with only minor and inconsequential inconsistencies or fabrications.

__Score:__
4

---

__Sample 4:__

__Feedback:__
The response provided is completely inconsistent with the given context. The context outlines a series of specific steps to remove sensitive data from a Git repository, including using tools like BFG Repo-Cleaner or git filter-branch, force-pushing changes to GitHub, contacting GitHub Support, advising collaborators to rebase, and using git commands to remove old references. However, the response suggests that no action is needed and that Git will automatically handle the removal of sensitive data, which directly contradicts the context provided. Therefore, the response contains significant hallucinated information that is not supported by the context.

Based on the evaluation criteria and scoring rubric, the response fits the description for a score of 1, as it is completely inconsistent with the provided context and contains hallucinated information.

__Score:__
1

---

__Sample 5:__

__Feedback:__
The response is mostly consistent with the provided context, with only minor and inconsequential inconsistencies. The response accurately describes the steps to resolve merge conflicts, including opening the conflicted file, identifying conflict markers, deciding which changes to keep, editing the file, saving, staging, and committing the changes. It also mentions using `git mergetool` as an alternative method, which aligns with the context.

However, there are a few minor issues:
1. The response suggests using `git add <filename>` instead of `git add <filename>`, which is a minor typo but does not significantly impact the overall consistency.
2. The response includes a tip about minimizing merge conflicts in the future, which, while helpful, is not part of the original context.

Overall, the response is faithful to the context and does not contain significant hallucinated or fabricated information. The minor inconsistencies and additional tip do not detract from the overall accuracy and relevance of the response.

__Score:__
4

---

__Sample 6:__

__Feedback:__
The response is highly consistent with the provided context. It accurately describes the process of adding a remote repository and pushing local changes to a remote repository using Git. The steps outlined in the response directly align with the context given, including the use of the 'git remote add' command and the 'git push -u origin main' command. There are no hallucinated or fabricated details in the response; all information is supported by the context provided. The response is clear, concise, and directly applicable to the customer's issue, making it a faithful and accurate representation of the context.

__Score:__
5

---
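
With more than a handful of samples, reading every feedback block becomes impractical, so a short summary of the scores can help decide where to look first. The sketch below uses the six scores printed above. The helper `summarize_scores` and the threshold of 3 are illustrative assumptions, not part of `flow-judge`.

```python
from collections import Counter
from statistics import mean

# Scores copied from the six samples shown above
scores = [4, 2, 4, 1, 4, 5]


def summarize_scores(scores, threshold=3):
    """Summarize scores: mean, distribution, and 1-based sample numbers below `threshold`."""
    return {
        "mean": round(mean(scores), 2),
        "distribution": dict(sorted(Counter(scores).items())),
        "below_threshold": [i + 1 for i, score in enumerate(scores) if score < threshold],
    }


summary = summarize_scores(scores)
print(summary)
```

In the notebook itself, the list could be built with `scores = [result.score for result in results]`, since each result exposes a `score` attribute as used in the display cell.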

### Saving the results

When running batched evaluations, it is usually a good idea to save the results to a file for future reference and reproducibility. Saving is the default behavior of the `evaluate` and `batch_evaluate` methods.

In [11]:
# Run the batch evaluation
results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=True)

Processed prompts: 100%|██████████| 6/6 [00:07<00:00,  1.28s/it, est. speed input: 1007.69 toks/s, output: 199.25 toks/s]
INFO:flow_judge.flow_judge:Saving results to output/


In [12]:
from pathlib import Path

output_dir = Path("output")
# Results are grouped in one subdirectory per metric; take the first one found
metric_dir = next(output_dir.iterdir())

print(f"Contents of {output_dir}:")
print(list(output_dir.iterdir()))

print(f"\nContents of {metric_dir}:")
print(list(metric_dir.iterdir()))

Contents of output:
[PosixPath('output/response_faithfulness_5-point_likert')]

Contents of output/response_faithfulness_5-point_likert:
[PosixPath('output/response_faithfulness_5-point_likert/metadata_response_faithfulness_5-point_likert_flowaicom__Flow-Judge-v0.1-AWQ_vllm_2024-09-17T07-24-25.410.jsonl'), PosixPath('output/response_faithfulness_5-point_likert/results_response_faithfulness_5-point_likert_flowaicom__Flow-Judge-v0.1-AWQ_vllm_2024-09-17T11-41-16.298.jsonl'), PosixPath('output/response_faithfulness_5-point_likert/metadata_response_faithfulness_5-point_likert_flowaicom__Flow-Judge-v0.1-AWQ_vllm_2024-09-17T11-41-16.298.jsonl'), PosixPath('output/response_faithfulness_5-point_likert/results_response_faithfulness_5-point_likert_flowaicom__Flow-Judge-v0.1-AWQ_vllm_2024-09-17T07-24-25.410.jsonl')]


Each evaluation run generates two files:
- `results_....jsonl`: Contains the evaluation results.
- `metadata_....jsonl`: Contains metadata about the evaluation for reproducibility.

These files are saved under the `output` directory, in a subdirectory named after the metric.
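
To read a saved run back, the listing above suggests one approach: each file name ends with a timestamp, and the files use the JSON Lines format (one JSON object per line). The sketch below picks the newest `results_*.jsonl` file by that trailing timestamp and parses it. The helper names are hypothetical, and the fields inside each record are not shown here, so the code makes no assumption about them.

```python
import json
from pathlib import Path


def latest_results_file(metric_dir):
    """Return the results_*.jsonl file with the latest trailing timestamp, or None."""
    candidates = sorted(
        Path(metric_dir).glob("results_*.jsonl"),
        # Assumes the name ends with "_<timestamp>.jsonl", as in the listing above
        key=lambda path: path.stem.rsplit("_", 1)[-1],
    )
    return candidates[-1] if candidates else None


def load_jsonl(path):
    """Parse a JSON Lines file into a list, one object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


metric_dir = Path("output") / "response_faithfulness_5-point_likert"
newest = latest_results_file(metric_dir)
if newest is not None:
    records = load_jsonl(newest)
    print(f"Loaded {len(records)} records from {newest.name}")
```

Sorting on the timestamp string works because it is zero-padded and ordered from year down to milliseconds, so lexicographic order matches chronological order.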