# Evaluation Guideline

## 1. Set Up the Evaluation Framework
- **Objective**: Establish a robust method to assess AI model performance.
- **Key Considerations**:
  - Focus on functional correctness as the ultimate metric for evaluating application performance.
  - Use a combination of reference-based and reference-free metrics depending on available data.

## 2. Core Evaluation Metrics
- **Loss Function**:
  - Report logarithmic loss (cross-entropy) to quantify how well the model fits the training data.
- **Functional Correctness**:
  - Measure how well the AI performs its intended task (e.g., answering questions accurately).
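
As a rough illustration of the loss metric above, the sketch below averages the negative log-probability that a model assigned to the correct label of each example. The function name `log_loss` and the sample probabilities are made up for illustration; they do not come from any particular library or dataset.

```python
import math

def log_loss(true_label_probs):
    """Average negative log-probability assigned to the correct labels."""
    if not true_label_probs:
        raise ValueError("need at least one probability")
    return -sum(math.log(p) for p in true_label_probs) / len(true_label_probs)

# Hypothetical probabilities that a model assigned to the correct label of each example.
confident = [0.9, 0.8, 0.95]
unsure = [0.4, 0.3, 0.5]

print(f"confident model: {log_loss(confident):.3f}")
print(f"unsure model:    {log_loss(unsure):.3f}")
```

A lower value means the model puts more probability mass on the correct answers, which is why the first list should score better than the second.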

## 3. Similarity Measurements (Against Reference Data)
- **Input**: Question, generated response, and reference response(s).
- **Types of Similarity**:
  1. **Lexical Similarity**:
     - Measures how similar the generated response *looks* to the reference.
     - Metrics: BLEU, ROUGE, METEOR++, TER, CIDEr (n-gram-based).
     - **Caveats**: Requires a comprehensive set of reference responses; higher scores don’t always mean better responses.
  2. **Semantic Similarity**:
     - Evaluates closeness in meaning between generated and reference responses.
     - Depends on the quality of embedding algorithms; poor embeddings can yield low scores despite similar meanings.
     - **Drawback**: May require significant compute and time.
- **Avoid**: Exact match, which rarely works for anything beyond short, unambiguous answers.
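
To make the two kinds of similarity concrete, here is a minimal sketch. `unigram_f1` is a deliberately crude stand-in for n-gram metrics such as BLEU or ROUGE (it is not an implementation of either), and the three-dimensional vectors are invented placeholders for real embeddings.

```python
import math
from collections import Counter

def unigram_f1(generated, reference):
    """Token-overlap F1 between two strings (a crude lexical similarity)."""
    gen, ref = generated.lower().split(), reference.lower().split()
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

reference = "the cat sat on the mat"
print(unigram_f1("the cat sat on the mat", reference))    # identical wording
print(unigram_f1("a feline rested on a rug", reference))  # similar meaning, little overlap

# Hypothetical embeddings; real ones would come from an embedding model.
print(cosine_similarity([0.2, 0.7, 0.1], [0.25, 0.65, 0.05]))
```

The second lexical score is low even though the sentences mean nearly the same thing, which is the caveat noted for lexical metrics.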

## 4. AI as a Judge
- **Overview**: AI-judged evaluation is a prevalent method in production settings.
- **Approaches**:

  1. **Quality of Response**:

         prompt = """Given the following question and answer, evaluate how good the answer is
             for the question. Use the score from 1 to 5.
             - 1 means very bad.
             - 5 means very good.
             Question: [QUESTION]
             Answer: [ANSWER]
             Score:"""

  2. **Comparison to Reference**:

         prompt = """Given the following question, reference answer, and generated answer,
             evaluate whether this generated answer is the same as the reference answer.
             Output True or False.
             Question: [QUESTION]
             Reference answer: [REFERENCE ANSWER]
             Generated answer: [GENERATED ANSWER]"""

  3. **Preference Comparison**:

         prompt = """Given the following question and two answers, evaluate which answer is
             better. Output A or B.
             Question: [QUESTION]
             A: [FIRST ANSWER]
             B: [SECOND ANSWER]
             The better answer is:"""
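
One way such templates might be used is sketched below for the quality-of-response prompt. The helper names (`fill_template`, `parse_score`, `score_answer`) are hypothetical, and `stub_judge` stands in for a real model call, which would depend on whichever API you use.

```python
QUALITY_PROMPT = """Given the following question and answer, evaluate how good the answer is
for the question. Use the score from 1 to 5.
- 1 means very bad.
- 5 means very good.
Question: [QUESTION]
Answer: [ANSWER]
Score:"""

def fill_template(template, fields):
    """Replace each [NAME] placeholder with its value."""
    for name, value in fields.items():
        template = template.replace(f"[{name}]", value)
    return template

def parse_score(reply, low=1, high=5):
    """Return the score as an int, or None if the reply is not a valid score."""
    valid = {str(n) for n in range(low, high + 1)}
    text = reply.strip()
    return int(text) if text in valid else None

def score_answer(question, answer, judge):
    """`judge` is any callable that maps a prompt string to the model's reply."""
    prompt = fill_template(QUALITY_PROMPT, {"QUESTION": question, "ANSWER": answer})
    return parse_score(judge(prompt))

def stub_judge(prompt):
    """Placeholder for a real model call (assumption: the judge replies with a bare number)."""
    return " 4"

print(score_answer("What is 2 + 2?", "4", stub_judge))
```

Returning `None` for unparseable replies keeps malformed judge output from silently turning into a score.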

- **Prompting an AI Judge**:
  1. Define the task (e.g., evaluate relevance).
  2. Specify criteria (e.g., “Does the answer address the question per the ground truth?”).
  3. Choose a scoring system:
     - **Classification**: e.g., good/bad, relevant/irrelevant.
     - **Discrete Numerical**: e.g., 1-5 (preferred over continuous).
     - **Continuous Numerical**: e.g., 0-1 (less effective).
  - **Tips**:
    - Language models excel with text/classification over numbers.
    - Discrete scoring (1-5) outperforms wider ranges or continuous scoring.
    - Include examples in prompts (e.g., what a 1, 3, or 5 looks like) for better results.

- **Example Prompt**:

      prompt = """
        Score the relevance of a generated answer to a question based on the ground truth, using 1-5. Focus on whether the generated answer contains sufficient info per the ground truth. If it contradicts the ground truth, score 1-2.
        Example:
        Question: "Is the sky blue?"
        Ground truth: "Yes, the sky is blue."
        Generated: "No, the sky is not blue."
        Score: 1 (Reason: Contradicts ground truth).
      """

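If the judge is asked to reply in the same shape as the example in the prompt (`Score: N (Reason: ...)`), its output could be parsed roughly as follows. This is a sketch under that assumption; `parse_judgement` and the regular expression are illustrative, and real judge output may need more forgiving handling.

```python
import re

# Matches "Score: N" with N in 1-5, optionally followed by "(Reason: ...)".
SCORE_PATTERN = re.compile(r"Score:\s*([1-5])\b(?:\s*\(Reason:\s*(.*?)\))?")

def parse_judgement(reply):
    """Return (score, reason) from a judge reply, or None if no valid score is found."""
    match = SCORE_PATTERN.search(reply)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

print(parse_judgement("Score: 1 (Reason: Contradicts ground truth)."))
print(parse_judgement("Score: 3"))
print(parse_judgement("I would rate this 7 out of 10"))
```

Rejecting out-of-range or missing scores, rather than guessing, makes it easier to spot when the judge drifts from the requested format.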

## 5. Specialized Evaluation Tools
- **Reward Model**:
  - Scores a (prompt, response) pair based on quality.
- **Reference-Based Judge**:
  - Assesses generated responses relative to reference(s).
- **Preference Model**:
  - Predicts user-preferred responses for alignment or ranking.
- **Self-Evaluation/Self-Critique**:
  - Model evaluates its own outputs (experimental approach).
- **Subset Evaluation**:
  - Use a strong model (e.g., GPT-4) to judge a sample (e.g., 1%) of responses generated by a cheaper model.
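
Subset evaluation can be sketched as a reproducible random sample of the cheaper model's responses. The function name `sample_for_judging`, the default fraction, and the fixed seed are assumptions for illustration; sending the sample to the stronger judge is left out.

```python
import random

def sample_for_judging(responses, fraction=0.01, seed=0):
    """Pick a reproducible random subset of responses to send to the stronger judge."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(responses) * fraction)) if responses else 0
    return random.Random(seed).sample(responses, k)

# Placeholder responses standing in for the cheaper model's outputs.
responses = [f"response {i}" for i in range(1000)]
subset = sample_for_judging(responses)
print(len(subset))
```

Fixing the seed means the same subset is judged on every run, so scores stay comparable across experiments.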

## 6. Deprecated/Unsupported Metrics
- **Perplexity** *(Don’t Use)*:
  - Tends to be lower for structured data, higher for large vocabularies, and lower for longer contexts.
  - Typical values can be as low as 3, but it is unreliable for models post-trained with SFT/RLHF.
  - May still be useful for sanity-checking the data or a model that comes straight from pretraining.
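
For reference, perplexity is the exponential of the average negative log-probability per token. The sketch below assumes you already have natural-log token probabilities from a model; the sample values are invented.

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-probability per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical case: the model gives every token a probability of 1/3.
uniform_thirds = [math.log(1 / 3)] * 4
print(perplexity(uniform_thirds))
```

A perplexity of about 3 can be read as the model being, on average, as uncertain as if it were choosing among three equally likely tokens.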

## 7. Finetuning
1. **Your training data should be:**
    - Large (ideally thousands or tens of thousands of examples)
    - High-quality (consistently formatted and cleaned of incomplete or incorrect examples)
    - Representative (training data should be similar to the data upon which you’ll use your model)
    - Sufficiently specified (i.e., containing enough information in the input to generate what you want to see in the output)

2. Prompts for a fine-tuned model do not typically need instructions or examples, as the model can learn the task from the training examples. Including instructions shouldn’t hurt performance, but the extra text tokens will add cost to each API call.

3. Instructions can still be useful when fine-tuning a single model to do multiple tasks. For example, if you train a model to classify multiple features from the same text string (e.g., whether an item is edible or whether it’s handheld), you’ll need some type of instruction to tell the model which feature you want labeled.

4. For classification, end your text prompts with a text sequence to tell the model that the input text is done and the classification should begin. Without such a signal, the model may append additional invented text before appending a class label, resulting in outputs like:
    - burger edible (accurate)
    - burger and fries edible (not quite what was asked for)
    - burger-patterned novelty tie inedible (inaccurate)
    - burger burger burger burger (no label generated)
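
A minimal sketch of the end-of-input signal from item 4: every training prompt, and every prompt sent at inference time, ends with the same separator. The separator string, the `prompt`/`completion` field names, and the helper names are assumptions for illustration, so check them against the fine-tuning format you actually use.

```python
import json

# Assumed end-of-input marker; any string that never occurs in the inputs would do.
SEPARATOR = "\n\n###\n\n"

def make_example(text, label):
    """One training row: input plus separator, and the label with a leading space."""
    return {"prompt": text + SEPARATOR, "completion": " " + label}

def make_inference_prompt(text):
    """Prompts sent to the fine-tuned model end with the same separator."""
    return text + SEPARATOR

rows = [
    make_example("burger", "edible"),
    make_example("burger-patterned novelty tie", "inedible"),
]
jsonl = "\n".join(json.dumps(row) for row in rows)
print(jsonl)
```

Because the model always saw the separator right before a label during training, it has a clear cue to emit the label immediately instead of continuing the input text.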

5. In general, fine-tuning can work with any label, whether the label has semantic meaning (e.g., “ edible”) or not (e.g., “1”). That said, in cases with little training data per label, it’s possible that semantic labels work better, so that the model can leverage its knowledge of the label’s meaning.

    When convenient, we recommend single-token labels. You can check the number of tokens in a string with the OpenAI tokenizer. Single-token labels have a few advantages:
    - Lowest cost
    - Easier to get their probabilities, which are useful for metrics such as confidence scores, precision, and recall
    - No hassle from specifying stop sequences or post-processing completions in order to compare labels of different lengths
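
With single-token labels, the log-probabilities returned for the candidate labels can be turned into confidence scores by renormalising them. The sketch below assumes you have already extracted one log-probability per label from the model's response; the dictionary values are invented and `label_confidences` is a hypothetical helper.

```python
import math

def label_confidences(label_logprobs):
    """Renormalise the log-probabilities of candidate labels so they sum to 1."""
    top = max(label_logprobs.values())
    # Subtracting the maximum before exponentiating avoids numerical underflow.
    weights = {label: math.exp(lp - top) for label, lp in label_logprobs.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}

# Hypothetical log-probabilities for the first completion token.
confidences = label_confidences({" edible": -0.2, " inedible": -1.8})
print(confidences)
```

These confidences can then be thresholded to trade precision against recall.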

6. One useful fact: all numbers <500 are single tokens.
7. If you do use multi-token labels, we recommend that each label begin with a different token. If multiple labels begin with the same token, an unsure model might end up biased toward those labels.
8. To assess the value of getting more data, you can train models on subsets of your current dataset—e.g., 25%, 50%, 100%—and then see how performance scales with dataset size. If you plot accuracy versus number of training examples, the slope at 100% will indicate the improvement you can expect from getting more data. (Note that you cannot infer the value of additional data from the evolution of accuracy during a single training run, as a model half-trained on twice the data is not equivalent to a fully trained model.) 
9. Evaluating your fine-tuned model is crucial to (a) improve your model and (b) tell when it’s good enough to be deployed.

    Many metrics can be used to characterize the performance of a classifier:
    - Accuracy
    - F1
    - Precision / Positive Predictive Value (equal to 1 - False Discovery Rate)
    - Recall / Sensitivity
    - Specificity
    - AUC / AUROC (area under the receiver operating characteristic curve)
    - AUPRC (area under the precision-recall curve)
    - Cross entropy
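
Several of the metrics listed in item 9 follow directly from the confusion-matrix counts. The sketch below covers the binary case with 0/1 labels; `binary_metrics` is an illustrative helper and the label lists are made up.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, specificity and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Report 0.0 instead of failing when a metric is undefined.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical ground truth and predictions for eight examples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```
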
10. A single project might end up trying all models. One illustrative development path might look like this:
    - Test code using the cheapest & fastest model (ada)
    - Run a few early experiments to check whether your dataset works as expected with a middling model (curie)
    - Run a few more experiments with the best model to see how far you can push performance (text-davinci-002)
    - Once you have good results, do a training run with all models to map out the price-performance frontier and select the model that makes the most sense for your use case  (ada, babbage, curie, davinci, text-davinci-002)
11. Another possible development path that uses multiple models could be:
    - Starting with a small dataset, train the best possible model (text-davinci-002)
    - Use this fine-tuned model to generate many more labels and expand your dataset by multiples
    - Use this new dataset to train a cheaper model (ada)
12. Fine-tuning can also be done in stages:
    - Step 1: Fine-tune on cheap, semi-relevant data
        - E.g., unstructured domain text (such as legal or medical text)
        - E.g., similar task data (such as another large classification set)
    - Step 2: Fine-tune on expensive labeled examples
        - E.g., text and classes (if training a classifier)
    - To fine-tune a previously fine-tuned model, pass in the fine-tuned model name when creating a new fine-tuning job (e.g. `-m curie:ft-<org>-<date>`). Other training parameters do not have to be changed; however, if your new training data is much smaller than your previous training data, you may find it useful to reduce `learning_rate_multiplier` by a factor of 2 to 4.
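
The subset experiment described in item 8 can be sketched as follows: train on increasing fractions of the data, record accuracy, and look at the slope between the two largest subsets. The sizes and accuracies here are invented numbers, and `marginal_gain` is a hypothetical helper; the training runs themselves are not shown.

```python
def marginal_gain(sizes, accuracies):
    """Slope of accuracy versus number of examples between the two largest subsets."""
    points = sorted(zip(sizes, accuracies))
    (n_prev, acc_prev), (n_full, acc_full) = points[-2:]
    return (acc_full - acc_prev) / (n_full - n_prev)

# Hypothetical results from three separate training runs (25%, 50%, 100% of the data).
sizes = [250, 500, 1000]
accuracies = [0.80, 0.86, 0.89]

gain = marginal_gain(sizes, accuracies)
print(f"about {gain * 100:.4f} accuracy points per extra example near the full dataset")
```

A slope close to zero suggests that collecting more data of the same kind is unlikely to help much.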


## 8. Common mistakes
1. **Insufficiently specified training data**:
    Training data is more than just a mapping of inputs to correct answers. Crucially, the inputs need to contain the information needed to derive an answer. The problem can be subtle when some of that information is given but some is still missing.
2. **Input data format that doesn’t match the training data format**:
    Make sure that when you use your fine-tuned model, your submitted prompts match the format of your training data.
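
A cheap guard against the second mistake is to check each prompt against the training format before sending it. The sketch below assumes the training prompts all ended with a fixed separator; both the separator string and the function name `matches_training_format` are illustrative.

```python
# Assumed separator that terminated every prompt in the training data.
TRAINING_SEPARATOR = "\n\n###\n\n"

def matches_training_format(prompt, separator=TRAINING_SEPARATOR):
    """True if the prompt ends with exactly one copy of the training separator."""
    return prompt.endswith(separator) and prompt.count(separator) == 1

print(matches_training_format("burger" + TRAINING_SEPARATOR))
print(matches_training_format("burger"))
```
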


## 9. Prompt Engineering
1. **Write Clear and Explicit Instructions**:
    Communicating with AI is the same as communicating with humans: clarity helps. Here are a few tips on how to write clear instructions.
    - Explain, without ambiguity, what you want the model to do. If you want the model to score an essay, explain the scoring system you want to use. Is it from 1 to 5 or 1 to 10? If there’s an essay the model’s uncertain about, do you want it to pick a score to the best of its ability or to output “I don’t know”?
    - As you experiment with a prompt, you might observe undesirable behaviors that require adjustments to the prompt to prevent them.

2. **Ask the model to adopt a persona**:
    A persona can help the model to understand the perspective it’s supposed to use to generate responses.

3. **Provide examples**:
    Examples can reduce ambiguity about how you want the model to respond. If you’re worried about input token length, opt for example formats that use fewer tokens.

4. **Specify the output format**:
    If you want the model to be concise, tell it so. Long outputs are not only costly (model APIs charge per token) but they also increase latency. For tasks expecting structured outputs, such as classification, use markers to mark the end of the prompts to let the model know that the structured outputs should begin.

5. **Provide Sufficient Context**:
    Just as reference texts can help students do better on an exam, sufficient context can help models perform better. If you want the model to answer questions about a paper, including that paper in the context will likely improve the model’s responses.
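
The tips above can be combined into a single prompt builder. This is only a sketch: `build_prompt`, its parameters, and the `Question:`/`Answer:` layout are assumptions chosen for illustration, not a required format.

```python
def build_prompt(instruction, query, persona=None, context=None, examples=(), end_marker="Answer:"):
    """Assemble persona, instruction, context, examples and the query into one prompt."""
    parts = []
    if persona:
        parts.append(f"You are {persona}.")
    parts.append(instruction)
    if context:
        parts.append(f"Context:\n{context}")
    for example_question, example_answer in examples:
        parts.append(f"Question: {example_question}\n{end_marker} {example_answer}")
    # The prompt ends with the marker so the model knows its answer should start here.
    parts.append(f"Question: {query}\n{end_marker}")
    return "\n\n".join(parts)

prompt = build_prompt(
    instruction="Answer in one short sentence, using only the context.",
    query="What metric does the paper report?",
    persona="a careful research assistant",
    context="(paste the paper text here)",
    examples=[("Who wrote the paper?", "The authors listed on the first page.")],
)
print(prompt)
```

Keeping each element optional makes it easy to test how much the persona, the examples, or the context actually change the responses.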

## More References
- **OpenAI Cookbook for fine-tuning**: https://docs.google.com/document/d/1rqj7dkuvl7Byd5KQPUJRxc19BJt8wo0yHNwK84KfU3Q/edit?tab=t.0