From 90c323f55763692f852bae87130b1431a31a85a8 Mon Sep 17 00:00:00 2001
From: Cleo Schneider
Date: Wed, 1 May 2024 21:51:23 +0000
Subject: [PATCH 1/2] Remove preview language from evals docs and expand supported evaluators

---
 docs/evaluation.md | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/docs/evaluation.md b/docs/evaluation.md
index 5f1b13c0cc..a1c63b1c8a 100644
--- a/docs/evaluation.md
+++ b/docs/evaluation.md
@@ -1,6 +1,4 @@
-# Evaluation (Preview)
-
-Note: **Evaluation in Firebase Genkit is currently in early preview** with a limited set of available evaluation metrics. You can try out the current experience by following the documentation below. If you run into any issues or have suggestions for improvements, please [file an issue](http://github.com/google/genkit/issues). We would love to see your feedback as we refine the evaluation experience!
+# Evaluation
 
 Evaluations are a form of testing which helps you validate your LLM’s responses
 and ensure they meet your quality bar.
of your LLM-powered applications. Genkit tooling helps you automatically extract
 
 For example, if you have a RAG flow, Genkit will extract the set of documents
 that was returned by the retriever so that you can evaluate the
-quality of your retriever while it runs in the context of the flow as shown below with the RAGAS faithfulness and answer relevancy metrics:
+quality of your retriever while it runs in the context of the flow as shown below with the Genkit faithfulness and answer relevancy metrics:
 
 ```js
 import { GenkitMetric, genkitEval } from '@genkit-ai/evaluator';
 
 export default configureGenkit({
   plugins: [
     genkitEval({
       judge: geminiPro,
       metrics: [GenkitMetric.FAITHFULNESS, GenkitMetric.ANSWER_RELEVANCY],
       embedder: textEmbeddingGecko, // GenkitMetric.ANSWER_RELEVANCY requires an embedder
     }),
   ],
   // ...
 });
 ```
 
-We only support a small number of evaluators to help developers get started that are inspired by [RAGAS](https://docs.ragas.io/en/latest/index.html) metrics including: Faithfulness, Answer Relevancy, and Maliciousness.
 
 Start by defining a set of inputs that you want to use as an input dataset called `testQuestions.json`. This input dataset represents the test cases you will use to generate output for evaluation.
 
 ```json
...
genkit eval:flow bobQA --input testQuestions.json --output eval-result.json
 
 Note: Below you can see an example of how an LLM can help you generate the test cases.
 
+## Supported Evaluator Plugins
+
+### Genkit Eval
+
+We have created a small number of native evaluators to help developers get started that are inspired by [RAGAS](https://docs.ragas.io/en/latest/index.html) metrics including:
+
+- Faithfulness
+- Answer Relevancy
+- Maliciousness
+
+### VertexAI Rapid Evaluators
+
+We support a handful of VertexAI Rapid Evaluators via the [VertexAI Plugin](/docs/plugins/vertex-ai#evaluation).
+
+### Langchain Evaluators
+
+Firebase Genkit supports [Langchain Criteria Evaluation](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/) via the Langchain Plugin.
+
 ## Advanced use
 
 `eval:flow` is a convenient way quickly evaluate the flow, but sometimes you

From 32109f7a509b1b400471bb14ebb8726d1fcc3312 Mon Sep 17 00:00:00 2001
From: Cleo Schneider
Date: Thu, 2 May 2024 11:55:46 +0000
Subject: [PATCH 2/2] Incorporating feedback

---
 docs/evaluation.md | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/docs/evaluation.md b/docs/evaluation.md
index a1c63b1c8a..e326ab876e 100644
--- a/docs/evaluation.md
+++ b/docs/evaluation.md
@@ -53,23 +53,22 @@ genkit eval:flow bobQA --input testQuestions.json --output eval-result.json
 
 Note: Below you can see an example of how an LLM can help you generate the test cases.
 
-## Supported Evaluator Plugins
+## Supported evaluators
 
-### Genkit Eval
+### Genkit evaluators
 
-We have created a small number of native evaluators to help developers get started that are inspired by [RAGAS](https://docs.ragas.io/en/latest/index.html) metrics including:
+Genkit includes a small number of native evaluators, inspired by RAGAS, to help you get started:
 
 - Faithfulness
 - Answer Relevancy
 - Maliciousness
 
-### VertexAI Rapid Evaluators
+### Evaluator plugins
 
-We support a handful of VertexAI Rapid Evaluators via the [VertexAI Plugin](/docs/plugins/vertex-ai#evaluation).
+Genkit supports additional evaluators through plugins:
 
-### Langchain Evaluators
-
-Firebase Genkit supports [Langchain Criteria Evaluation](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/) via the Langchain Plugin.
+- VertexAI Rapid Evaluators via the [VertexAI Plugin](plugins/vertex-ai#evaluation).
+- [LangChain Criteria Evaluation](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/) via the [LangChain plugin](plugins/langchain.md).
 
 ## Advanced use
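
The contents of the `testQuestions.json` dataset referenced in the docs are elided in the diff above. As a purely illustrative sketch (not part of either patch), such an input dataset might look like the following, assuming the `bobQA` flow accepts a plain question string; the questions themselves are made up:

```json
[
  "How old is Bob?",
  "Where does Bob live?",
  "What are Bob's hobbies?"
]
```

Running `genkit eval:flow bobQA --input testQuestions.json --output eval-result.json`, as shown in the patched docs, would then execute the flow for each input, score the responses with the configured metrics, and write the results to `eval-result.json`.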