# Prometheus




You need a GPU with at least 32 GB of memory to load the smallest Prometheus model, the 7B model in FP32.

You need 4 GPUs with at least 32 GB of memory each to load the larger Prometheus model, the 8x7B mixture-of-experts (MoE) model.
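These figures are consistent with a back-of-the-envelope estimate of weight memory: parameter count times bytes per parameter. The sketch below is a rough lower bound only, since it ignores activations and the KV cache. The 46.7e9 parameter count for the 8x7B model is an assumption based on the Mixtral 8x7B architecture, not a number stated in this notebook; the bfloat16 dtype matches the engine log shown further down.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# 7B parameters stored in FP32 (4 bytes each)
print(weight_memory_gb(7e9, 4))  # 28.0

# Assumed ~46.7B parameters in bfloat16 (2 bytes each), sharded over 4 GPUs
total_8x7b = weight_memory_gb(46.7e9, 2)
per_gpu_8x7b = total_8x7b / 4
print(per_gpu_8x7b)
```

Under these assumptions the per-GPU share of the weights stays below 32 GB, leaving the remainder for the KV cache and activations.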

Install dependencies

In [None]:
%pip install datasets
%pip install prometheus-eval vllm triton

Load model

In [None]:
from prometheus_eval.vllm import VLLM
from prometheus_eval import PrometheusEval

model_name_7b = "prometheus-eval/prometheus-7b-v2.0"
# model_7b = VLLM(model=model_name_7b)

model_name_8x7b = "prometheus-eval/prometheus-8x7b-v2.0"
model_8x7b = VLLM(model=model_name_8x7b, tensor_parallel_size=4) # tensor_parallel_size = num_of_gpus

INFO 12-16 04:07:31 config.py:1020] Defaulting to use mp for distributed inference
INFO 12-16 04:07:31 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='prometheus-eval/prometheus-8x7b-v2.0', speculative_config=None, tokenizer='prometheus-eval/prometheus-8x7b-v2.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=prometheus-eval/prometheus-8x7b-v2.0, num_sche

Loading safetensors checkpoint shards:   0% Completed | 0/19 [00:00<?, ?it/s]


### HaluBench

In [None]:
from datasets import load_dataset

ds = load_dataset("PatronusAI/HaluBench")
data = ds["test"].to_pandas()

# collect the source subsets, then drop the two that have no rubric defined below
sources = data.source_ds.unique().tolist()
sources.remove('RAGTruth')
sources.remove('halueval')

In [None]:
import os
import json

def save_to_json(filename, feedbacks, scores, source):
    results_dict = {}

    # Check if the JSON file exists and load its content
    if os.path.exists(f'{filename}.json'):
        try:
            with open(f'{filename}.json', 'r') as f:
                file_content = f.read().strip()
                if file_content:  # Ensure the file is not empty
                    results_dict = json.loads(file_content)
        except (json.JSONDecodeError, IOError) as e:
            print(f"Error loading {filename}.json: {e}. Initializing as empty.")

    # Add the new results under the source key
    results_dict[source] = [{'feedback': feedback, 'score': score} for feedback, score in zip(feedbacks, scores)]

    # Write the updated dictionary back to the file
    try:
        with open(f'{filename}.json', 'w') as f:
            json.dump(results_dict, f, indent=2)
    except IOError as e:
        print(f"Error saving {filename}.json: {e}")
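Because `save_to_json` stores one list of `{'feedback', 'score'}` entries per source, the file can be read back and summarised per source. The helper below is a hypothetical addition (`summarize_results` is not part of the notebook), shown as a minimal sketch that assumes every stored score is numeric.

```python
import json
import os
import tempfile
from statistics import mean


def summarize_results(path):
    """Return {source: (n_items, mean_score)} for a results file."""
    with open(path, "r") as f:
        results = json.load(f)
    return {
        source: (len(items), mean(item["score"] for item in items))
        for source, items in results.items()
    }


# Tiny hand-made file in the same layout, for illustration only
example = {
    "DROP": [
        {"feedback": "supported by the passage", "score": 1.0},
        {"feedback": "contradicts the passage", "score": 0.0},
    ]
}
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example_results.json")
    with open(example_path, "w") as f:
        json.dump(example, f, indent=2)
    summary = summarize_results(example_path)

print(summary)  # {'DROP': (2, 0.5)}
```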

In [None]:
def run_metrics(data, sources, model, prompt_template, rubrics, filename, normalize=False):
    # initialise the Prometheus judge once and reuse it for every subset
    judge = PrometheusEval(model=model, absolute_grade_template=prompt_template)

    for source in sources:
        # extract the subset for this source
        subset = data[data.source_ds == source]
        # re-format rows into parallel lists of instructions and responses
        instructions = []
        responses = []

        for _, row in subset.iterrows():
            # combine passage and question into a single instruction
            passage = "Passage:\n" + row['passage']
            question = "Question:\n" + row['question']
            instructions.append(passage + "\n" + question)

            responses.append(row['answer'])

        # run prometheus model
        feedbacks, scores = judge.absolute_grade(
            instructions=instructions,
            responses=responses,
            rubric=rubrics[source]
        )

        if normalize:
            # map the 1-5 rubric score linearly onto [0, 1]
            scores = [(score - 1) / 4 for score in scores]
        else:
            # binarise: only a perfect score of 5 counts as a pass
            scores = [1 if score == 5 else 0 for score in scores]
        # save results to json
        save_to_json(filename, feedbacks, scores, source)

        print(f"Completed evaluation on {source} dataset. Length of feedback: {len(feedbacks)} and scores: {len(scores)}")
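The two score post-processing modes in `run_metrics` behave quite differently, so it may help to see them in isolation. The function name `postprocess_scores` is hypothetical; the arithmetic is copied from the cell above.

```python
def postprocess_scores(scores, normalize=False):
    """Convert raw 1-5 rubric scores the same way run_metrics does."""
    if normalize:
        # linear rescaling: 1 -> 0.0, 3 -> 0.5, 5 -> 1.0
        return [(score - 1) / 4 for score in scores]
    # binary: only a perfect score counts as a pass
    return [1 if score == 5 else 0 for score in scores]


raw_scores = [1, 3, 4, 5]
print(postprocess_scores(raw_scores, normalize=True))   # [0.0, 0.5, 0.75, 1.0]
print(postprocess_scores(raw_scores, normalize=False))  # [0, 0, 0, 1]
```

Note that the binary mode treats a score of 4 as a failure, which is a strict choice for hallucination detection.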

Original prompt template

In [None]:
prompt_template_orig = """###Task Description:
An instruction (including passage and a question), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "(write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Score Rubrics:
{rubric}

###Feedback: """
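The template has three placeholders (`{instruction}`, `{response}`, `{rubric}`) and asks the model to answer as `feedback [RESULT] score`. The sketch below fills a shortened stand-in template with `str.format` and splits a sample output on `[RESULT]`. It is an illustration only: how `PrometheusEval` fills and parses the template internally is not shown in this notebook, and `parse_output` is a hypothetical helper, not the library's parser.

```python
# Shortened stand-in for the full template; the placeholders are the same.
short_template = """###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Score Rubrics:
{rubric}

###Feedback: """

prompt = short_template.format(
    instruction="Passage:\nThe sky is blue.\nQuestion:\nWhat colour is the sky?",
    response="Blue.",
    rubric="[Is the response supported by the passage?]\nScore 1: ...\nScore 5: ...",
)
print(prompt)


def parse_output(text):
    """Split 'feedback [RESULT] score' into (feedback, int score).

    Raises ValueError if the score part is not an integer.
    """
    feedback, _, score = text.rpartition("[RESULT]")
    return feedback.strip(), int(score.strip())


parsed = parse_output("The answer is stated in the passage. [RESULT] 5")
print(parsed)  # ('The answer is stated in the passage.', 5)
```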

Scoring rubrics for the four HaluBench subsets evaluated here

In [None]:
covidqa_rubric_data = """
[Does the response contain only information that is explicitly present in the provided passage?]
Score 1: The response states information that is completely absent from or contradicts the passage. The answer includes specific numbers, facts, or claims that cannot be found in the passage.
Score 2: The response contains a mix of information from the passage and additional unsupported claims. While some elements are correct, the answer includes details, numbers, or relationships not present in the source passage.
Score 3: The response mostly aligns with the passage but includes minor additions or assumptions. The core information is accurate, but the answer occasionally includes details that aren't explicitly stated in the passage.
Score 4: The response closely follows the passage with rare deviations. Nearly all information can be directly traced to the source text, with only slight imprecisions in specific details.
Score 5: The response contains only information that is explicitly stated in the passage. All specific numbers, relationships, and technical details can be directly verified in the source text.
"""

In [None]:
pubmed_rubric_data = """
[How accurately does the response reflect the evidence presented in the passage, maintaining appropriate scientific precision and avoiding unsupported claims or speculation?]
Score 1: The response makes claims that directly contradict the passage, fabricates unsupported information, misuses technical terminology, introduces speculative mechanisms or implications, makes absolute claims without appropriate uncertainty, or drastically mismatches the passage's level of detail. The response fails to maintain scientific integrity and cannot be considered reliable.
Score 2: The response contains significant misinterpretations of evidence, unsupported extrapolations beyond data scope, imprecise use of technical terminology, inclusion of speculative details, overgeneralization of findings, or substantial deviation from the passage's level of detail. While some aspects may be accurate, the response's reliability is compromised by these issues.
Score 3: The response generally aligns with the main findings but includes minor unsupported details, slight misinterpretations, occasional imprecise terminology, reasonable but unsupported elaborations, missing some limitations, or inconsistent detail level. While generally reliable, the response requires some scrutiny for complete accuracy.
Score 4: The response accurately reflects the evidence with only minor issues such as subtle extrapolations (though reasonable), rare imprecisions in technical terminology, occasional missing caveats, or slight variations in detail level. The response maintains good scientific integrity and can be considered largely reliable.
Score 5: The response perfectly reflects the presented evidence, maintains appropriate scientific uncertainty, uses precise technical terminology, avoids unsupported speculation, properly acknowledges limitations, and matches the passage's level of detail. The response maintains complete scientific integrity and can be fully relied upon as an accurate reflection of the passage.
"""

In [None]:
drop_rubric_data = """
[Does the response contain only information that is explicitly supported by the passage, maintaining accuracy and relevance to the specific question asked?]
Score 1: The response contains information that is completely unsupported by the passage, or contradicts the passage directly. This includes fabricated details, numbers, or claims that cannot be verified from the source passage.
Score 2: The response contains significant inaccuracies or unverified information, though some elements might align with the passage. The answer may include unsupported numerical values or claims that go beyond reasonable inference.
Score 3: The response shows partial alignment with the passage but includes some unverified details or questionable inferences. While core information might be accurate, there are elements that cannot be fully verified from the passage.
Score 4: The response closely aligns with the passage with very minimal unverified information. Any inferences made are reasonable and well-supported by the content, though there might be slight imprecisions in numerical values or specific details.
Score 5: The response contains only information that is explicitly stated in or can be directly verified from the passage. All numerical values, facts, and claims perfectly match the passage, and all inferences are fully supported by the content.
"""

In [None]:
financebench_rubric_data = """
[Does the response provide an answer that is verifiable against the provided passage, using specified formulas when given and adhering to any stated rounding requirements?]
Score 1: The response presents a value or explanation that cannot be verified using the information in the provided passage, showing no clear connection to the source data or specified calculation methods.
Score 2: The response shows some connection to the provided data or specified formulas, but contains significant errors or makes unsupported claims that deviate from what can be verified.
Score 3: The response generally aligns with the provided data and calculation methods but contains minor errors in computation, rounding, or reasoning.
Score 4: The response closely matches what can be verified from the provided data using proper methods, with only minimal deviations in precision or completeness.
Score 5: The response exactly matches what can be verified from the provided data, using specified formulas when given and adhering perfectly to any stated requirements for rounding or explanation.
"""

In [None]:
scoring_rubrics = {
    'covidQA': covidqa_rubric_data,
    'pubmedQA': pubmed_rubric_data,
    'DROP': drop_rubric_data,
    'FinanceBench': financebench_rubric_data
}
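`run_metrics` looks up `rubrics[source]` for each source, so a key that differs in spelling or case from the dataset's `source_ds` values would raise a `KeyError` partway through a long run. A quick check before launching can catch this. The helper name `missing_rubrics` is hypothetical, and the placeholder strings stand in for the real rubrics.

```python
def missing_rubrics(sources, rubrics):
    """Return the sources that have no entry in the rubric dictionary."""
    return [source for source in sources if source not in rubrics]


# Source names as they appear in the evaluation output; rubric bodies are placeholders
example_sources = ['DROP', 'pubmedQA', 'FinanceBench', 'covidQA']
example_rubrics = {
    'covidQA': "...",
    'pubmedQA': "...",
    'DROP': "...",
    'FinanceBench': "...",
}

print(missing_rubrics(example_sources, example_rubrics))         # []
print(missing_rubrics(example_sources + ['halueval'], example_rubrics))  # ['halueval']
```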

In [None]:
filename = 'prometheus_8x7b_orig_prompt_template'
normalize = True

run_metrics(data, sources, model_8x7b, prompt_template_orig, scoring_rubrics, filename, normalize=normalize)

Processed prompts: 100%|███████████████████| 1000/1000 [01:10<00:00, 14.15it/s, est. speed input: 11529.24 toks/s, output: 2099.02 toks/s]


Retrying failed batches: Attempt 1/10


Processed prompts: 100%|████████████████████████████| 4/4 [00:03<00:00,  1.19it/s, est. speed input: 960.75 toks/s, output: 195.25 toks/s]


Processed 1000/1000 instances.


Finalizing: 100%|██████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 17021.79it/s]


Completed evaluation on DROP dataset. Length of feedback: 1000 and scores: 1000


Processed prompts: 100%|███████████████████| 1000/1000 [01:44<00:00,  9.61it/s, est. speed input: 10086.64 toks/s, output: 2144.07 toks/s]


Retrying failed batches: Attempt 1/10


Processed prompts: 100%|█████████████████████████████| 1/1 [00:03<00:00,  3.35s/it, est. speed input: 325.77 toks/s, output: 92.78 toks/s]


Processed 1000/1000 instances.


Finalizing: 100%|██████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 11135.22it/s]


Completed evaluation on pubmedQA dataset. Length of feedback: 1000 and scores: 1000


Processed prompts: 100%|███████████████████| 1000/1000 [02:29<00:00,  6.69it/s, est. speed input: 10967.98 toks/s, output: 1503.26 toks/s]


Retrying failed batches: Attempt 1/10


Processed prompts: 100%|█████████████████████████| 11/11 [00:10<00:00,  1.05it/s, est. speed input: 2362.07 toks/s, output: 303.02 toks/s]


Retrying failed batches: Attempt 2/10


Processed prompts: 100%|█████████████████████████████| 1/1 [00:02<00:00,  2.19s/it, est. speed input: 557.94 toks/s, output: 91.92 toks/s]


Processed 1000/1000 instances.


Finalizing: 100%|██████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 13176.50it/s]


Completed evaluation on FinanceBench dataset. Length of feedback: 1000 and scores: 1000


Processed prompts: 100%|████████████████████| 1000/1000 [05:10<00:00,  3.22it/s, est. speed input: 16102.00 toks/s, output: 454.69 toks/s]


Retrying failed batches: Attempt 1/10


Processed prompts: 100%|████████████████████████| 64/64 [00:23<00:00,  2.78it/s, est. speed input: 15923.47 toks/s, output: 372.20 toks/s]


Retrying failed batches: Attempt 2/10


Processed prompts: 100%|███████████████████████████| 5/5 [00:03<00:00,  1.26it/s, est. speed input: 8072.07 toks/s, output: 148.62 toks/s]


Retrying failed batches: Attempt 3/10


Processed prompts: 100%|████████████████████████████| 1/1 [00:01<00:00,  1.69s/it, est. speed input: 4749.96 toks/s, output: 74.04 toks/s]


Processed 1000/1000 instances.


Finalizing: 100%|██████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 17234.48it/s]

Completed evaluation on covidQA dataset. Length of feedback: 1000 and scores: 1000





### ELI5

In [None]:
from datasets import load_dataset

ds = load_dataset("explodinggradients/ELI5")
data = ds["train"].to_pandas()

# note: this redefines run_metrics for a single dataset (no per-source loop, one rubric)
def run_metrics(data, model, prompt_template, rubrics, filename, normalize=False):
    # re-format
    instructions = []
    responses = []

    for _, row in data.iterrows():
        # use the reference text as the passage and the user input as the question
        passage = "Passage:\n" + row['reference']
        question = "Question:\n" + row['user_input']
        instructions.append(passage + "\n" + question)

        responses.append(row['response'])

    # initialise prometheus judge
    judge = PrometheusEval(model=model, absolute_grade_template=prompt_template)
    # run prometheus model
    feedbacks, scores = judge.absolute_grade(
        instructions=instructions,
        responses=responses,
        rubric=rubrics
    )

    if normalize:
        # map the 1-5 rubric score linearly onto [0, 1]
        scores = [(score - 1) / 4 for score in scores]
    else:
        # binarise: only a perfect score of 5 counts as a pass
        scores = [1 if score == 5 else 0 for score in scores]
    # save results to json under the 'eli5' key
    save_to_json(filename, feedbacks, scores, 'eli5')

    print(f"Completed evaluation on eli5 dataset. Length of feedback: {len(feedbacks)} and scores: {len(scores)}")

eli5_rubric_data = """
[Does the response contain only information that is explicitly supported by the passage, maintaining accuracy and relevance to the specific question asked?]
Score 1: The response contains information that is completely unsupported by the passage, or contradicts the passage directly. This includes fabricated details, numbers, or claims that cannot be verified from the source passage.
Score 2: The response contains significant inaccuracies or unverified information, though some elements might align with the passage. The answer may include unsupported numerical values or claims that go beyond reasonable inference.
Score 3: The response shows partial alignment with the passage but includes some unverified details or questionable inferences. While core information might be accurate, there are elements that cannot be fully verified from the passage.
Score 4: The response closely aligns with the passage with very minimal unverified information. Any inferences made are reasonable and well-supported by the content, though there might be slight imprecisions in numerical values or specific details.
Score 5: The response contains only information that is explicitly stated in or can be directly verified from the passage. All numerical values, facts, and claims perfectly match the passage, and all inferences are fully supported by the content.
"""

filename = 'prometheus_8x7b_orig_prompt_template'
normalize = True

run_metrics(data, model_8x7b, prompt_template_orig, eli5_rubric_data, filename, normalize=normalize)

README.md:   0%|          | 0.00/480 [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/48.6k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/74.9k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/56 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/100 [00:00<?, ? examples/s]

Processed prompts: 100%|████████████████████████████████████████████████████████████████| 100/100 [00:15<00:00,  6.53it/s, est. speed input: 4782.52 toks/s, output: 1417.63 toks/s]


Processed 100/100 instances.


Finalizing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 10805.33it/s]

Completed evaluation on eli5 dataset. Length of feedback: 100 and scores: 100
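The HaluBench and ELI5 versions of `run_metrics` differ mainly in which columns hold the passage, question, and answer. One possible way to avoid two near-identical functions is a small formatter that takes the column names as arguments. This is a sketch under that assumption; `build_inputs` is a hypothetical helper and operates on plain dictionaries so it can run without the datasets.

```python
def build_inputs(rows, passage_key, question_key, answer_key):
    """Build parallel lists of instructions and responses from row dictionaries."""
    instructions, responses = [], []
    for row in rows:
        passage = "Passage:\n" + row[passage_key]
        question = "Question:\n" + row[question_key]
        instructions.append(passage + "\n" + question)
        responses.append(row[answer_key])
    return instructions, responses


# Made-up row using the ELI5 column names from the cell above
eli5_rows = [
    {
        "reference": "Water boils at 100 C at sea level.",
        "user_input": "When does water boil?",
        "response": "At 100 C at sea level.",
    }
]
instructions, responses = build_inputs(eli5_rows, "reference", "user_input", "response")
print(instructions[0])
print(responses[0])
```

With a pandas DataFrame, the rows could be obtained with `data.to_dict("records")`, and the HaluBench call would pass `"passage"`, `"question"`, and `"answer"` instead.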



