<a href="https://colab.research.google.com/github/deltorobarba/machinelearning/blob/main/llm_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **LLM Evaluation**

**Blogposts**

* https://cloud.google.com/blog/products/ai-machine-learning/evaluating-large-language-models-in-business?hl=en

* https://cloud.google.com/blog/products/ai-machine-learning/enhancing-llm-quality-and-interpretability-with-the-vertex-gen-ai-evaluation-service/?hl=en

* https://medium.com/google-cloud/vqa-3-how-to-evaluate-generated-answers-from-rag-at-scale-on-vertex-ai-70bc397cb33d

* Video: [Beyond recall: Evaluating Gemini with Vertex AI Auto SxS](https://www.youtube.com/live/ysvjuAPY8xs)



**Technical Documentation**

* Notebooks: https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/evaluation

* https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview

  * https://cloud.google.com/vertex-ai/generative-ai/docs/models/side-by-side-eval

  * https://cloud.google.com/vertex-ai/generative-ai/docs/models/computation-based-eval-pipeline

**Use the Vertex AI SDK to evaluate a summarization task (with computation-based metrics)**

The code uses metric bundles to evaluate a summarization task and automatically logs the evaluation parameters and metrics to Vertex AI Experiments.
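The evaluation code in this notebook expects an `EVAL_DATASET` that is never defined here. A minimal sketch of what it could look like follows, assuming the dataset columns have to supply the `{instruction}` and `{context}` placeholders used in the prompt template. The sample rows and the helpers `template_fields` and `missing_columns` are hypothetical and not part of the Vertex AI SDK. The dataset is kept as a plain dict of columns so the snippet runs without extra dependencies; in practice a pandas DataFrame is commonly passed instead.

```python
from string import Formatter

# Hypothetical sample rows; real data would come from your own corpus.
EVAL_DATASET = {
    "instruction": [
        "Summarize the article in one sentence.",
        "Summarize the article in one sentence.",
    ],
    "context": [
        "The city council approved a new bike lane network on Tuesday.",
        "Researchers reported a faster method for sorting large files.",
    ],
}


def template_fields(template):
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_columns(template, dataset):
    """Return placeholders in the template that have no matching dataset column."""
    return template_fields(template) - set(dataset)


template = "Instruction: {instruction}. Article: {context}. Summary:"
print(missing_columns(template, EVAL_DATASET))  # prints set()
```

Running a check like this before starting an evaluation run catches a mismatched column name early, before any model calls are made.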

In [None]:
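# Assumes vertexai.init(project=..., location=...) has already been called
# and that EVAL_DATASET is defined; its columns must supply the prompt placeholders.
from vertexai.preview.evaluation import EvalTask
from vertexai.generative_models import GenerativeModel

# Define the evaluation task; each entry in `metrics` is a metric bundle.
summarization_eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[
        "text_generation_quality",
        "text_generation_factuality",
        "text_generation_instruction_following",
        "summarization_pointwise_reference_free",
    ],
    # Parameters and metrics are logged to this Vertex AI Experiments experiment.
    experiment="generative-ai-eval-experiment",
)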

# Prompt templates to evaluate and compare.
prompt_templates = [
    "Instruction: {instruction}. Article: {context}. Summary:",
    # ... add further prompt templates here
]

# Evaluate each prompt template in its own experiment run and collect the results.
eval_results = []
for i, prompt_template in enumerate(prompt_templates):
    eval_result = summarization_eval_task.evaluate(
        model=GenerativeModel("gemini-1.5-pro"),
        prompt_template=prompt_template,
        experiment_run_name=f"eval-run-prompt-{i}",
    )
    # Keep a label, the aggregate metrics, and the per-row metrics table for each run.
    eval_results.append(
        (f"Prompt #{i}", eval_result.summary_metrics, eval_result.metrics_table)
    )