This repository has been archived by the owner on Jun 1, 2024. It is now read-only.

crowd-instruction.md

Which AI is Better?

You are an expert in AI evaluation, and your task is to compare the quality of model outputs from two AI systems, System A and System B.

Your goal is to determine which model produces better output for a given prompt.

Steps

Step 1. Carefully read the provided prompt and model outputs.

  • Read the output generated by System A.
  • Read the output generated by System B.

Step 2. Compare the two outputs (System A and System B) and assess their overall quality based on the following factors:

  • Relevance: Which output aligns better with the given prompt and provides a more suitable response?
  • Coherence: Consider the clarity and logical flow of the responses. Which output is more coherent and well-structured?
  • Grammatical Correctness: Identify any grammatical errors or issues in both outputs.
  • Completeness: Evaluate if either output misses important information or context from the prompt.

Step 3. Based on your assessment, select the model that you believe produces the better output for the given prompt.

  • You must choose exactly one option: System A, System B, or Tie. In most cases, one system is better than the other.
  • If two outputs are identical, please select Tie.
  • If neither system performed the task correctly, please also select Tie.
  • Some long outputs might be truncated. That's fine and should not count as an error.
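For anyone processing the collected judgments, the selection rules in Step 3 can be sketched in a few lines of Python. This is only an illustration, not part of the annotation task: the names `Verdict` and `choose_verdict`, and the `a_correct`, `b_correct`, and `preferred` parameters, are hypothetical and stand in for the annotator's own judgment of the factors in Step 2.

```python
from enum import Enum


class Verdict(Enum):
    SYSTEM_A = "System A"
    SYSTEM_B = "System B"
    TIE = "Tie"


def choose_verdict(output_a, output_b, a_correct, b_correct, preferred=None):
    """Apply the two Tie rules from Step 3, then defer to the annotator.

    a_correct / b_correct: the annotator's judgment of whether each system
    performed the task (hypothetical inputs, not defined in the guideline).
    preferred: "A" or "B" after weighing the Step 2 factors, or None.
    """
    # Identical outputs are always a Tie.
    if output_a == output_b:
        return Verdict.TIE
    # If neither system performed the task correctly, it is also a Tie.
    if not a_correct and not b_correct:
        return Verdict.TIE
    # Otherwise the annotator's overall preference decides.
    if preferred == "A":
        return Verdict.SYSTEM_A
    if preferred == "B":
        return Verdict.SYSTEM_B
    return Verdict.TIE
```

For instance, `choose_verdict("22", "As an AI language model, I cannot express this.", False, False)` returns `Verdict.TIE`, because neither output answers the question.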

Thank you for your expertise in this evaluation task! Your contribution is valuable in assessing the quality of AI models and will be released as an open-source dataset.

Examples

Example 1:

Prompt. Extract the name of the vendor from the invoice: PURCHASE #0521 NIKE XXX3846. Reply with only the name.

  • System A. Nike
  • System B. The name of the vendor is NIKE.

The correct option is System A because its output is better aligned with the given prompt, which asks for only the name (relevance). Both systems produced well-structured replies (coherence), both outputs are grammatical (grammatical correctness), and no information is lost (completeness).

Example 2:

Prompt. Help me find out if this customer review is more "positive" or "negative". Q: Review 1. A: positive. Q: Review 2. A:

  • System A. I would classify the first review as negative and the second review as positive.
  • System B. positive

The correct option is System B because its output is better aligned with the given prompt (relevance) and better fits the few-shot prompt structure (coherence). Both outputs are grammatical (grammatical correctness), and no information is lost (completeness).

Example 3:

Prompt. How much is 2+2?

  • System A. 22
  • System B. As an AI language model, I cannot express this.

The correct option is Tie because neither system performed the task correctly. The expected output is 4.