# Lynx



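You need a GPU with at least 32 GB of memory to load the smallest Lynx model (8B, FP32).

You need at least 4 GPUs with 40 GB of memory each to load the larger Lynx model (70B). At that size the weights fit only in half precision: the vLLM log below reports `dtype=torch.float16` and about 32.86 GB of weights per GPU.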
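The memory figures above follow from a weights-only estimate: parameter count times bytes per parameter, divided across GPUs. The sketch below is a rough illustration under stated assumptions: it uses nominal parameter counts (8e9 and 70e9), assumes an even split under tensor parallelism, and ignores the KV cache, activations, and framework overhead. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int, num_gpus: int = 1) -> float:
    """Weights-only memory per GPU in GB (1 GB = 1e9 bytes), assuming an even split."""
    return num_params * bytes_per_param / num_gpus / 1e9


# Nominal parameter counts; the real checkpoints differ slightly.
print(weight_memory_gb(8e9, 4))                # 8B in FP32 on one GPU
print(weight_memory_gb(70e9, 2, num_gpus=4))   # 70B in FP16 across 4 GPUs
print(weight_memory_gb(70e9, 4, num_gpus=4))   # 70B in FP32 across 4 GPUs
```

Under these assumptions, 70B in FP32 would need about 70 GB per GPU across four devices, which is why the 4 x 40 GB setup relies on half precision.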

Install dependencies

In [None]:
%pip install vllm --quiet  # imported in the "Load model" cell
%pip install transformers
%pip install datasets --quiet

Load model

In [None]:
from vllm import LLM, SamplingParams

model_name_70b = 'PatronusAI/Llama-3-Patronus-Lynx-70B-Instruct'

model_70b = LLM(model_name_70b, tensor_parallel_size=4, gpu_memory_utilization=0.98)

INFO 12-16 05:42:19 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
INFO 12-16 05:42:20 config.py:1020] Defaulting to use mp for distributed inference
INFO 12-16 05:42:20 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='PatronusAI/Llama-3-Patronus-Lynx-70B-Instruct', speculative_config=None, tokenizer='PatronusAI/Llama-3-Patronus-Lynx-70B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, c

Loading safetensors checkpoint shards:   0% Completed | 0/30 [00:00<?, ?it/s]


INFO 12-16 05:51:31 model_runner.py:1077] Loading model weights took 32.8599 GB
(VllmWorkerProcess pid=3495) (VllmWorkerProcess pid=3497) INFO 12-16 05:51:31 model_runner.py:1077] Loading model weights took 32.8599 GB
INFO 12-16 05:51:31 model_runner.py:1077] Loading model weights took 32.8599 GB
(VllmWorkerProcess pid=3496) INFO 12-16 05:51:31 model_runner.py:1077] Loading model weights took 32.8599 GB
(VllmWorkerProcess pid=3496) (VllmWorkerProcess pid=3497) (VllmWorkerProcess pid=3495) INFO 12-16 05:51:35 worker.py:232] Memory profiling results: total_gpu_memory=39.38GiB initial_memory_usage=34.19GiB peak_torch_memory=33.58GiB memory_usage_post_profile=35.11GiB non_torch_memory=2.24GiB kv_cache_size=2.78GiB gpu_memory_utilization=0.98
INFO 12-16 05:51:35 worker.py:232] Memory profiling results: total_gpu_memory=39.38GiB initial_memory_usage=34.33GiB peak_torch_memory=33.58GiB memory_usage_post_profile=35.3

### HaluBench

In [None]:
from datasets import load_dataset, Dataset

ds = load_dataset("PatronusAI/HaluBench")
data = ds["test"].to_pandas()

# store sources for subsets
sources = data.source_ds.unique().tolist()
sources.remove('RAGTruth')
sources.remove('halueval')

In [None]:
# Lynx prompt template
def generate_prompt(question, document, answer):
    prompt = f"""
    Given the following QUESTION, DOCUMENT and ANSWER you must analyze the provided answer and determine whether it is faithful to the contents of the DOCUMENT. The ANSWER must not offer new information beyond the context provided in the DOCUMENT. The ANSWER also must not contradict information provided in the DOCUMENT. Output your final verdict by strictly following this format: "PASS" if the answer is faithful to the DOCUMENT and "FAIL" if the answer is not faithful to the DOCUMENT. Show your reasoning.
    --
    QUESTION (THIS DOES NOT COUNT AS BACKGROUND INFORMATION):
    {question}
    --
    DOCUMENT:
    {document}
    --
    ANSWER:
    {answer}
    --
    Your output should be in JSON FORMAT with the keys "REASONING" and "SCORE":
    {{"REASONING": "<your reasoning as bullet points>", "SCORE": "<your final score>"}}
    """
    return prompt

In [None]:
import os, json

def save_to_json(filename, reasonings, scores, source):
    # check if json exists
    if os.path.exists(f'{filename}.json'):
        with open(f'{filename}.json', 'r') as f:
            results_dict = json.load(f)
    else:
        results_dict = {}

    # add the results under source key
    results_dict[source] = [{'reasoning': reasoning, 'score': score} for reasoning, score in zip(reasonings, scores)]

    # write the updated dictionary
    with open(f'{filename}.json', 'w') as f:
        json.dump(results_dict, f, indent=2)
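Once results are saved in this layout (one list of `{'reasoning', 'score'}` entries per source), they can be compared against gold labels. The following is a minimal sketch, not part of the original notebook: `accuracy_from_results` is a hypothetical helper, and it assumes the gold labels are `"PASS"`/`"FAIL"` strings in the same order as the saved entries. Entries whose score is `None` are skipped.

```python
import json


def accuracy_from_results(path, source, gold_labels):
    """Fraction of scored predictions that agree with the gold labels.

    gold_labels: list of "PASS"/"FAIL" strings aligned with the saved
    entries for `source` (an assumption about how labels are stored).
    """
    with open(path) as f:
        entries = json.load(f)[source]

    label_to_num = {"PASS": 1, "FAIL": 0}
    # Keep only entries where a score was parsed.
    pairs = [
        (entry["score"], label_to_num[gold])
        for entry, gold in zip(entries, gold_labels)
        if entry["score"] is not None
    ]
    if not pairs:
        return None
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


# Hypothetical usage, assuming the dataset exposes a `label` column:
# labels = data[data.source_ds == "DROP"]["label"].tolist()
# print(accuracy_from_results("lynx_70b.json", "DROP", labels))
```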

In [None]:
from tqdm import tqdm
import re

# extract reasoning and score from LLM response
def postprocess(results):
    reasonings = []
    scores = []
    none_score = 0

    reasoning_pattern = r'"REASONING":\s*\[(.*?)\]'
    score_pattern = r'"SCORE":\s*(\w+)'

    for response in results:
        text = response.outputs[0].text
        reasoning_match = re.search(reasoning_pattern, text, re.DOTALL)
        reasoning = [r.strip("'") for r in reasoning_match.group(1).split("', '")] if reasoning_match else None
        reasonings.append(reasoning)

        score_match = re.search(score_pattern, text)
        if score_match:
            score = score_match.group(1)
            if score == "PASS":
                score_num = 1
            elif score == "FAIL":
                score_num = 0
            else:
                score_num = None
        else:
            none_score += 1
            score_num = None
        scores.append(score_num)

    print("Responses without score: ", none_score)
    return reasonings, scores

def parse_output(response):
    reasoning_pattern = r'"REASONING":\s*\[(.*?)\]'
    score_pattern = r'"SCORE":\s*(\w+)'

    text = response.outputs[0].text
    reasoning_match = re.search(reasoning_pattern, text, re.DOTALL)
    reasoning = [r.strip("'") for r in reasoning_match.group(1).split("', '")] if reasoning_match else None

    score_match = re.search(score_pattern, text)
    if score_match:
        score = score_match.group(1)
        if score == "PASS":
            score_num = 1
        elif score == "FAIL":
            score_num = 0
        else:
            score_num = None
    else:
        score_num = None

    return reasoning, score_num
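The prompt template asks for a quoted score (`"SCORE": "<your final score>"`), while the `score_pattern` above only matches an unquoted token directly after the colon. If a model follows the template literally, a slightly more tolerant pattern may help. The sketch below is an illustrative variant, not the notebook's parser; `SCORE_RE` and `parse_score` are hypothetical names, and it works on the raw text rather than on a vLLM response object.

```python
import re

# Accept both `"SCORE": PASS` and `"SCORE": "PASS"`; reject tokens like PASSED.
SCORE_RE = re.compile(r'"SCORE":\s*"?(PASS|FAIL)\b')


def parse_score(text):
    """Return 1 for PASS, 0 for FAIL, or None if no verdict is found."""
    match = SCORE_RE.search(text)
    if match is None:
        return None
    return 1 if match.group(1) == "PASS" else 0


print(parse_score('{"REASONING": ["ok"], "SCORE": "PASS"}'))
print(parse_score('{"REASONING": ["bad"], "SCORE": FAIL}'))
print(parse_score("no verdict here"))
```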

In [None]:
def run_metrics(data, sources, model, filename):
    results = {}
    max_retries = 10
    for source in sources:
        # extract subset
        subset = data[data.source_ds == source]

        prompts = [generate_prompt(question, document, answer) for question, document, answer in zip(subset['question'], subset['passage'], subset['answer'])]
        params = SamplingParams(max_tokens=1024)

        responses = model.generate(prompts, params, use_tqdm=True)

        to_retry_inputs = []
        to_retry_indices = []
        for i, resp in enumerate(responses):
            feedback, score = parse_output(resp)
            if feedback is None:
                to_retry_inputs.append(prompts[i])
                to_retry_indices.append(i)

        # Retry logic with progress bar
        retries = 0
        while to_retry_inputs and retries < max_retries:
            retries += 1
            print(f"Retrying failed batches: Attempt {retries}/{max_retries}")
            retry_outputs = model.generate(to_retry_inputs, params, use_tqdm=True)

            new_to_retry_inputs = []
            new_to_retry_indices = []
            for idx, (retry_idx, output) in enumerate(zip(to_retry_indices, retry_outputs)):
                feedback, score = parse_output(output)
                if feedback is None:  # Still failing
                    new_to_retry_inputs.append(to_retry_inputs[idx])
                    new_to_retry_indices.append(to_retry_indices[idx])
                else:
                    responses[retry_idx] = output  # Update with successful retry

            to_retry_inputs = new_to_retry_inputs
            to_retry_indices = new_to_retry_indices

        results[source] = responses
        # postprocess to again extract reasoning and score
        reasonings, scores = postprocess(responses)
        # save results to json
        save_to_json(filename, reasonings, scores, source)

        print("Completed evaluation on {0} dataset. Length of feedback: {1} and scores: {2}".format(source, len(reasonings), len(scores)))
    return results
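The retry loop in `run_metrics` re-runs only the prompts whose output could not be parsed and keeps the rest. The same pattern can be factored into a small standalone helper, sketched here with a stub generator so that it runs without a GPU. `generate_with_retries`, `generate`, and `is_valid` are hypothetical names; in the notebook, `generate` would correspond to `model.generate(..., params)` and `is_valid` to a check on `parse_output`. Retries can only help when generation is stochastic, since a deterministic decoder would repeat the same failing output.

```python
def generate_with_retries(prompts, generate, is_valid, max_retries=10):
    """Generate one output per prompt, re-running only prompts whose output fails is_valid.

    Returns (outputs, pending), where pending lists the indices that are
    still invalid after max_retries attempts.
    """
    outputs = list(generate(prompts))
    pending = [i for i, out in enumerate(outputs) if not is_valid(out)]

    attempts = 0
    while pending and attempts < max_retries:
        attempts += 1
        retried = generate([prompts[i] for i in pending])
        still_pending = []
        for i, out in zip(pending, retried):
            if is_valid(out):
                outputs[i] = out          # keep the first valid retry
            else:
                still_pending.append(i)   # try again on the next attempt
        pending = still_pending

    return outputs, pending


# Stub generator for illustration: fails on "b" during the first call only.
state = {"calls": 0}

def fake_generate(batch):
    state["calls"] += 1
    return [p.upper() if state["calls"] > 1 or p != "b" else "" for p in batch]

outputs, pending = generate_with_retries(["a", "b", "c"], fake_generate, bool)
print(outputs, pending)
```

Returning the still-failing indices makes it easy to report or exclude those items instead of silently keeping an unparsable response.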

In [None]:
filename = 'lynx_70b'

results = run_metrics(data, sources, model_70b, filename)

Processed prompts: 100%|███████████████████████████████████| 1000/1000 [15:24<00:00,  1.08it/s, est. speed input: 498.63 toks/s, output: 517.75 toks/s]


Retrying failed batches: Attempt 1/10


Processed prompts: 100%|█████████████████████████████████████| 379/379 [06:48<00:00,  1.08s/it, est. speed input: 435.76 toks/s, output: 505.78 toks/s]


Retrying failed batches: Attempt 2/10


Processed prompts: 100%|█████████████████████████████████████| 141/141 [02:24<00:00,  1.03s/it, est. speed input: 461.48 toks/s, output: 478.33 toks/s]


Retrying failed batches: Attempt 3/10


Processed prompts: 100%|███████████████████████████████████████| 59/59 [01:27<00:00,  1.48s/it, est. speed input: 323.56 toks/s, output: 362.83 toks/s]


Retrying failed batches: Attempt 4/10


Processed prompts: 100%|███████████████████████████████████████| 23/23 [00:40<00:00,  1.77s/it, est. speed input: 286.65 toks/s, output: 246.70 toks/s]


Retrying failed batches: Attempt 5/10


Processed prompts: 100%|███████████████████████████████████████| 10/10 [00:38<00:00,  3.80s/it, est. speed input: 146.73 toks/s, output: 124.27 toks/s]


Retrying failed batches: Attempt 6/10


Processed prompts: 100%|███████████████████████████████████████████| 2/2 [00:19<00:00,  9.53s/it, est. speed input: 84.02 toks/s, output: 41.62 toks/s]


Retrying failed batches: Attempt 7/10


Processed prompts: 100%|███████████████████████████████████████████| 1/1 [00:06<00:00,  6.78s/it, est. speed input: 87.06 toks/s, output: 28.18 toks/s]


Retrying failed batches: Attempt 8/10


Processed prompts: 100%|██████████████████████████████████████████| 1/1 [00:00<00:00,  5.97it/s, est. speed input: 3536.11 toks/s, output: 5.99 toks/s]


Retrying failed batches: Attempt 9/10


Processed prompts: 100%|███████████████████████████████████████████| 1/1 [00:35<00:00, 35.77s/it, est. speed input: 16.49 toks/s, output: 28.63 toks/s]


Retrying failed batches: Attempt 10/10


Processed prompts: 100%|██████████████████████████████████████████| 1/1 [00:05<00:00,  5.70s/it, est. speed input: 103.56 toks/s, output: 28.08 toks/s]


Responses without score:  2
Completed evaluation on DROP dataset. Length of feedback: 1000 and scores: 1000


Processed prompts: 100%|███████████████████████████████████| 1000/1000 [18:08<00:00,  1.09s/it, est. speed input: 512.22 toks/s, output: 468.86 toks/s]


Retrying failed batches: Attempt 1/10


Processed prompts: 100%|█████████████████████████████████████| 433/433 [07:40<00:00,  1.06s/it, est. speed input: 529.71 toks/s, output: 457.74 toks/s]


Retrying failed batches: Attempt 2/10


Processed prompts: 100%|█████████████████████████████████████| 180/180 [03:58<00:00,  1.33s/it, est. speed input: 429.93 toks/s, output: 417.78 toks/s]


Retrying failed batches: Attempt 3/10


Processed prompts: 100%|███████████████████████████████████████| 85/85 [01:40<00:00,  1.19s/it, est. speed input: 490.82 toks/s, output: 388.08 toks/s]


Retrying failed batches: Attempt 4/10


Processed prompts: 100%|███████████████████████████████████████| 40/40 [00:53<00:00,  1.35s/it, est. speed input: 441.84 toks/s, output: 404.86 toks/s]


Retrying failed batches: Attempt 5/10


Processed prompts: 100%|███████████████████████████████████████| 17/17 [00:41<00:00,  2.42s/it, est. speed input: 253.34 toks/s, output: 264.78 toks/s]


Retrying failed batches: Attempt 6/10


Processed prompts: 100%|█████████████████████████████████████████| 8/8 [00:37<00:00,  4.66s/it, est. speed input: 129.52 toks/s, output: 125.31 toks/s]


Retrying failed batches: Attempt 7/10


Processed prompts: 100%|██████████████████████████████████████████| 3/3 [00:10<00:00,  3.63s/it, est. speed input: 178.54 toks/s, output: 67.48 toks/s]


Retrying failed batches: Attempt 8/10


Processed prompts: 100%|███████████████████████████████████████████| 2/2 [00:22<00:00, 11.48s/it, est. speed input: 55.47 toks/s, output: 36.16 toks/s]


Responses without score:  4
Completed evaluation on pubmedQA dataset. Length of feedback: 1000 and scores: 1000


Processed prompts: 100%|███████████████████████████████████| 1000/1000 [23:17<00:00,  1.40s/it, est. speed input: 670.66 toks/s, output: 342.37 toks/s]


Retrying failed batches: Attempt 1/10


Processed prompts: 100%|█████████████████████████████████████| 499/499 [13:10<00:00,  1.58s/it, est. speed input: 628.33 toks/s, output: 323.14 toks/s]


Retrying failed batches: Attempt 2/10


Processed prompts: 100%|█████████████████████████████████████| 254/254 [07:02<00:00,  1.66s/it, est. speed input: 652.46 toks/s, output: 292.92 toks/s]


Retrying failed batches: Attempt 3/10


Processed prompts: 100%|█████████████████████████████████████| 130/130 [04:06<00:00,  1.90s/it, est. speed input: 595.40 toks/s, output: 285.53 toks/s]


Retrying failed batches: Attempt 4/10


Processed prompts: 100%|███████████████████████████████████████| 72/72 [02:19<00:00,  1.94s/it, est. speed input: 637.38 toks/s, output: 245.44 toks/s]


Retrying failed batches: Attempt 5/10


Processed prompts: 100%|███████████████████████████████████████| 38/38 [01:31<00:00,  2.41s/it, est. speed input: 543.24 toks/s, output: 238.86 toks/s]


Retrying failed batches: Attempt 6/10


Processed prompts: 100%|███████████████████████████████████████| 21/21 [00:44<00:00,  2.10s/it, est. speed input: 662.23 toks/s, output: 189.97 toks/s]


Retrying failed batches: Attempt 7/10


Processed prompts: 100%|███████████████████████████████████████| 11/11 [00:40<00:00,  3.64s/it, est. speed input: 399.57 toks/s, output: 124.21 toks/s]


Retrying failed batches: Attempt 8/10


Processed prompts: 100%|██████████████████████████████████████████| 6/6 [00:37<00:00,  6.25s/it, est. speed input: 211.83 toks/s, output: 64.48 toks/s]


Retrying failed batches: Attempt 9/10


Processed prompts: 100%|██████████████████████████████████████████| 3/3 [00:20<00:00,  6.80s/it, est. speed input: 211.27 toks/s, output: 37.43 toks/s]


Retrying failed batches: Attempt 10/10


Processed prompts: 100%|██████████████████████████████████████████| 3/3 [00:36<00:00, 12.24s/it, est. speed input: 117.47 toks/s, output: 63.61 toks/s]


Responses without score:  5
Completed evaluation on FinanceBench dataset. Length of feedback: 1000 and scores: 1000


Processed prompts: 100%|█████████████████████████████████| 1000/1000 [1:17:36<00:00,  4.66s/it, est. speed input: 864.77 toks/s, output: 105.60 toks/s]


Retrying failed batches: Attempt 1/10


Processed prompts: 100%|█████████████████████████████████████| 657/657 [54:50<00:00,  5.01s/it, est. speed input: 817.46 toks/s, output: 104.42 toks/s]


Retrying failed batches: Attempt 2/10


Processed prompts:  36%|█████████████▎                       | 161/448 [13:53<18:05,  3.78s/it, est. speed input: 808.82 toks/s, output: 100.56 toks/s]

### ELI5

In [None]:
def run_metrics(data, model, filename):
    results = {}
    max_retries = 10
    source = 'eli5'
    prompts = [generate_prompt(question, document, answer) for question, document, answer in zip(data['user_input'], data['reference'], data['response'])]
    params = SamplingParams(max_tokens=1024)

    responses = model.generate(prompts, params, use_tqdm=True)

    to_retry_inputs = []
    to_retry_indices = []
    for i, resp in enumerate(responses):
        feedback, score = parse_output(resp)
        if feedback is None:
            to_retry_inputs.append(prompts[i])
            to_retry_indices.append(i)

    # Retry logic with progress bar
    retries = 0
    while to_retry_inputs and retries < max_retries:
        retries += 1
        print(f"Retrying failed batches: Attempt {retries}/{max_retries}")
        retry_outputs = model.generate(to_retry_inputs, params, use_tqdm=True)

        new_to_retry_inputs = []
        new_to_retry_indices = []
        for idx, (retry_idx, output) in enumerate(zip(to_retry_indices, retry_outputs)):
            feedback, score = parse_output(output)
            if feedback is None:  # Still failing
                new_to_retry_inputs.append(to_retry_inputs[idx])
                new_to_retry_indices.append(to_retry_indices[idx])
            else:
                responses[retry_idx] = output  # Update with successful retry

        to_retry_inputs = new_to_retry_inputs
        to_retry_indices = new_to_retry_indices

    results[source] = responses
    # postprocess to extract reasoning and score
    reasonings, scores = postprocess(responses)
    # save results to json
    save_to_json(filename, reasonings, scores, source)

    print("Completed evaluation on {0} dataset. Length of feedback: {1} and scores: {2}".format(source, len(reasonings), len(scores)))
    return results

In [None]:
from datasets import load_dataset

ds = load_dataset("explodinggradients/ELI5")
data = ds["train"].to_pandas()

filename = 'lynx_70b'

results = run_metrics(data, model_70b, filename)

Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 100/100 [01:38<00:00,  1.02it/s, est. speed input: 413.89 toks/s, output: 544.98 toks/s]


Retrying failed batches: Attempt 1/10


Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 38/38 [00:43<00:00,  1.16s/it, est. speed input: 351.71 toks/s, output: 494.67 toks/s]


Retrying failed batches: Attempt 2/10


Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 14/14 [00:39<00:00,  2.79s/it, est. speed input: 146.95 toks/s, output: 211.57 toks/s]


Retrying failed batches: Attempt 3/10


Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 6/6 [00:36<00:00,  6.16s/it, est. speed input: 69.29 toks/s, output: 123.36 toks/s]


Retrying failed batches: Attempt 4/10


Processed prompts: 100%|████████████████████████████████████████████████████████████████████████| 3/3 [00:27<00:00,  9.31s/it, est. speed input: 45.13 toks/s, output: 39.72 toks/s]

Responses without score:  0
Completed evaluation on eli5 dataset. Length of feedback: 100 and scores: 100



