# Evaluation

Evaluations are a form of testing that helps you validate your LLM's responses
and ensure they meet your quality bar.

Firebase Genkit supports third-party evaluation tools through plugins, paired
with powerful observability features that provide insight into the runtime state
of your LLM-powered applications. Genkit tooling helps you automatically extract
data including inputs, outputs, and information from intermediate steps to
evaluate the end-to-end quality of LLM responses as well as understand the
performance of your system's building blocks.

For example, if you have a RAG flow, Genkit will extract the set of documents
that was returned by the retriever so that you can evaluate the quality of your
retriever while it runs in the context of the flow as shown below with the
Genkit faithfulness and answer relevancy metrics:

```ts
import { genkit } from 'genkit';
import { genkitEval, GenkitMetric } from '@genkit-ai/evaluator';
import { vertexAI, textEmbedding004, gemini15Flash } from '@genkit-ai/vertexai';

const ai = genkit({
  plugins: [
    vertexAI(),
    genkitEval({
      judge: gemini15Flash,
      metrics: [GenkitMetric.FAITHFULNESS, GenkitMetric.ANSWER_RELEVANCY],
      embedder: textEmbedding004, // GenkitMetric.ANSWER_RELEVANCY requires an embedder
    }),
  ],
  // ...
});
```

**Note:** The configuration above requires installing the `genkit`,
`@genkit-ai/evaluator`, and `@genkit-ai/vertexai` packages.

```posix-terminal
npm install genkit @genkit-ai/evaluator @genkit-ai/vertexai
```

Start by defining a set of inputs that you want to use as an input dataset
called `testInputs.json`. This input dataset represents the test cases you will
use to generate output for evaluation.

```json
["Cheese", "Broccoli", "Spinach and Kale"]
```

You can then use the `eval:flow` command to evaluate your flow against the test
cases defined in `testInputs.json`:

```posix-terminal
genkit eval:flow menuSuggestionFlow --input testInputs.json
```

To see the evaluation results in the Developer UI, run:

```posix-terminal
genkit start
```

Then navigate to `localhost:4000/evaluate`.

Alternatively, you can provide an output file to inspect the output in a JSON
file.

```posix-terminal
genkit eval:flow menuSuggestionFlow --input testInputs.json --output eval-result.json
```

**Note:** Below you can see an example of how an LLM can help you generate the
test cases.

## Supported evaluators

### Genkit evaluators

Genkit includes a small number of native evaluators, inspired by RAGAS, to help
you get started:

* Faithfulness
* Answer Relevancy
* Maliciousness

### Evaluator plugins

Genkit supports additional evaluators through plugins:

* VertexAI Rapid Evaluators via the
  [VertexAI Plugin](plugins/vertex-ai#evaluation).
* [LangChain Criteria Evaluation](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/)
  via the [LangChain plugin](plugins/langchain.md).

## Advanced use

`eval:flow` is a convenient way to quickly evaluate the flow, but sometimes you
might need more control over evaluation steps. This may occur if you are using a
different framework and already have some output you would like to evaluate. You
can perform all the steps that `eval:flow` performs semi-manually.

You can batch run your Genkit flow and add a unique label to the run which then
will be used to extract an evaluation dataset (a set of inputs, outputs, and
contexts).

Run the flow over your test inputs:

```posix-terminal
genkit flow:batchRun myRagFlow testInputs.json --label customLabel
```

Extract the evaluation data:

```posix-terminal
genkit eval:extractData myRagFlow --label customLabel --output customLabel_dataset.json
```

The exported data will be output as a JSON file with each testCase in the
following format:

```json
[
  {
    "testCaseId": string,
    "input": string,
    "output": string,
    "context": array of strings,
    "traceIds": string,
  },
]
```

The data extractor will automatically locate retrievers and add the produced
docs to the context array. By default, `eval:run` will run against all
configured evaluators, and like `eval:flow`, results for `eval:run` will appear
in the evaluation page of Developer UI, located at `localhost:4000/evaluate`.

### Custom extractors

You can also provide custom extractors to be used in `eval:extractData` and
`eval:flow` commands. Custom extractors allow you to override the default
extraction logic giving you more power in creating datasets and evaluating them.

To configure custom extractors, add a tools config file named
`genkit-tools.conf.js` to your project root if you don't have one already.

```posix-terminal
cd $GENKIT_PROJECT_HOME
touch genkit-tools.conf.js
```

In the tools config file, add the following code:

```js
module.exports = {
  evaluators: [
    {
      actionRef: '/flow/myFlow',
      extractors: {
        context: { outputOf: 'foo-step' },
        output: 'bar-step',
      },
    },
  ],
};
```

In this sample, you configure an extractor for the `myFlow` flow. The config
overrides the extractors for the `context` and `output` fields and uses the
default logic for the `input` field.

The specification of the evaluation extractors is as follows:

* The `evaluators` field accepts an array of EvaluatorConfig objects, which
  are scoped by `actionRef`.
* `extractors` is an object that specifies the extractor overrides. The
  currently supported keys in `extractors` are `[input, output, context]`.
  The acceptable value types are:
  * `string` - this should be a step name, specified as a string. The output
    of this step is extracted for this key.
  * `{ inputOf: string }` or `{ outputOf: string }` - These objects represent
    specific channels (input or output) of a step. For example,
    `{ inputOf: 'foo-step' }` would extract the input of step `foo-step` for
    this key.
  * `(trace) => string;` - For further flexibility, you can provide a function
    that accepts a Genkit trace and returns a `string`, and specify the
    extraction logic inside this function. Refer to
    `genkit/genkit-tools/common/src/types/trace.ts` for the exact TraceData
    schema.

**Note:** The extracted data for all these steps will be a JSON string. The
tooling will parse this JSON string at the time of evaluation automatically. If
providing a function extractor, make sure that the output is a valid JSON
string. For example: `"Hello, world!"` is not valid JSON; `"\"Hello, world!\""`
is valid.
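
For example, a function extractor gives you full control over what gets pulled
from the trace. The sketch below assumes the TraceData shape referenced above
(a `spans` map whose entries carry a `displayName` and a `genkit:output`
attribute); treat the step and attribute names as illustrative:

```js
module.exports = {
  evaluators: [
    {
      actionRef: '/flow/myFlow',
      extractors: {
        // Hypothetical function extractor: find the span for 'summarize-step'
        // and return its output. JSON.stringify ensures the result is a valid
        // JSON string, as the tooling requires.
        output: (trace) => {
          const span = Object.values(trace.spans).find(
            (s) => s.displayName === 'summarize-step'
          );
          return JSON.stringify(span?.attributes['genkit:output'] ?? null);
        },
      },
    },
  ],
};
```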

### Running on existing datasets

To run evaluation on an already extracted dataset:

```posix-terminal
genkit eval:run customLabel_dataset.json
```

To output to a different location, use the `--output` flag.

```posix-terminal
genkit eval:flow menuSuggestionFlow --input testInputs.json --output customLabel_evalresult.json
```

To run on a subset of the configured evaluators, use the `--evaluators` flag and
provide a comma-separated list of evaluators by name:

```posix-terminal
genkit eval:run customLabel_dataset.json --evaluators=genkit/faithfulness,genkit/answer_relevancy
```

### Synthesizing test data using an LLM

Here's an example flow that uses a PDF file to generate possible questions users
might be asking about it.

```ts
import { genkit, run, z } from "genkit";
import { googleAI, gemini15Flash } from "@genkit-ai/googleai";
import { chunk } from "llm-chunk";
import path from "path";

const ai = genkit({ plugins: [googleAI()] });

// Assumes `extractText` (a PDF text-extraction helper) and `chunkingConfig`
// are defined as in the RAG documentation.

export const synthesizeQuestions = ai.defineFlow(
  {
    name: "synthesizeQuestions",
    inputSchema: z.string().describe("PDF file path"),
    outputSchema: z.array(z.string()),
  },
  async (filePath) => {
    filePath = path.resolve(filePath);
    const pdfTxt = await run("extract-text", () => extractText(filePath));

    const chunks = await run("chunk-it", async () =>
      chunk(pdfTxt, chunkingConfig)
    );

    const questions: string[] = [];
    for (let i = 0; i < chunks.length; i++) {
      const qResponse = await ai.generate({
        model: gemini15Flash,
        prompt: {
          text: `Generate one question about the text below: ${chunks[i]}`,
        },
      });
      questions.push(qResponse.text);
    }
    return questions;
  }
);
```
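
Since the flow's output is a JSON array of strings, it can serve directly as an
input dataset for `eval:flow`. A sketch of how you might wire this together,
assuming the flow above (flag support may vary by Genkit CLI version):

```posix-terminal
genkit flow:run synthesizeQuestions '"docs/my_pdf_file.pdf"' --output testInputs.json
```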