diff --git a/ai-engineering/create.mdx b/ai-engineering/create.mdx
index 9f9f23232..5827e0a96 100644
--- a/ai-engineering/create.mdx
+++ b/ai-engineering/create.mdx
@@ -9,7 +9,7 @@ import { definitions } from '/snippets/definitions.mdx'
The **Create** stage is about defining a new AI capability as a structured, version-able asset in your codebase. The goal is to move away from scattered, hard-coded string prompts and toward a more disciplined and organized approach to prompt engineering.
-### Defining a capability as a prompt object
+### Define a capability as a prompt object
In Axiom AI engineering, every capability is represented by a `Prompt` object. This object serves as the single source of truth for the capability’s logic, including its messages, metadata, and the schema for its arguments.
diff --git a/ai-engineering/measure.mdx b/ai-engineering/measure.mdx
index b288e8b0f..64a1872f2 100644
--- a/ai-engineering/measure.mdx
+++ b/ai-engineering/measure.mdx
@@ -1,11 +1,11 @@
---
title: "Measure"
description: "Learn how to measure the quality of your AI capabilities by running evaluations against ground truth data."
-keywords: ["ai engineering", "AI engineering", "measure", "evals", "evaluation", "scoring", "graders"]
+keywords: ["ai engineering", "AI engineering", "measure", "evals", "evaluation", "scoring", "scorers", "graders", "scores"]
---
import { Badge } from "/snippets/badge.jsx"
-import { definitions } from '/snippets/definitions.mdx'
+import { definitions } from "/snippets/definitions.mdx"
The evaluation framework described here is in active development. Axiom is working with design partners to shape what’s built. [Contact Axiom](https://www.axiom.co/contact) to get early access and join a focused group of teams shaping these tools.
@@ -13,73 +13,222 @@ The evaluation framework described here is in active development. Axiom is worki
The **Measure** stage is where you quantify the quality and effectiveness of your AI capability. Instead of relying on anecdotal checks, this stage uses a systematic process called an eval to score your capability’s performance against a known set of correct examples (ground truth). This provides a data-driven benchmark to ensure a capability is ready for production and to track its quality over time.
-## The `Eval` function
+Evaluations (evals) are systematic tests that measure how well your AI features perform. Instead of manually testing AI outputs, evals automatically run your AI code against test datasets and score the results using custom metrics. This lets you catch regressions, compare different approaches, and confidently improve your AI features over time.
-Coming soon The primary tool for the Measure stage is the `Eval` function, which will be available in the `axiom/ai` package. It provides a simple, declarative way to define a test suite for your capability directly in your codebase.
+## Prerequisites
-An `Eval` is structured around a few key parameters:
+Follow the [Quickstart](/ai-engineering/quickstart):
+- To run evals within the context of an existing AI app, complete the instrumentation setup described there.
+- To run evals without an existing AI app, skip the part of the Quickstart about instrumenting your app.
-* `data`: An async function that returns your `collection` of `{ input, expected }` pairs, which serve as your ground truth.
-* `task`: The function that executes your AI capability, taking an `input` and producing an `output`.
-* `scorers`: An array of `grader` functions that score the `output` against the `expected` value.
-* `threshold`: A score between 0 and 1 that determines the pass/fail condition for the evaluation.
+## Define an evaluation
-Here is an example of a complete evaluation suite:
+The `Eval` function provides a simple, declarative way to define a test suite for your capability directly in your codebase.
-```ts /evals/text-match.eval.ts
-import { Levenshtein } from 'autoevals';
-import { Eval } from 'axiom/ai/evals';
+The `Eval` function takes the following key parameters:
-Eval('text-match-eval', {
- // 1. Your ground truth dataset
- data: async () => {
- return [
+- `data`: A function that returns your collection of `{ input, expected }` pairs, which serve as your ground truth. It can be async.
+- `task`: The function that executes your AI capability, taking an `input` and producing an `output`.
+- `scorers`: An array of scorer functions that score the `output` against the `expected` value.
+- `configFlags`: Optional list of flags that the eval depends on, selected with `pickFlags`.
+- `metadata`: Optional metadata for the evaluation, such as a description.
+
+The example below creates an evaluation for a support ticket classification system in the file `/src/evals/ticket-classification.eval.ts`.
+
+```ts /src/evals/ticket-classification.eval.ts expandable
+import { experimental_Eval as Eval, Scorer } from 'axiom/ai/evals';
+import { generateObject } from 'ai';
+import { openai } from '@ai-sdk/openai';
+import { wrapAISDKModel } from 'axiom/ai';
+import { flag, pickFlags } from '../lib/app-scope';
+import { z } from 'zod';
+
+// The function you want to evaluate
+async function classifyTicket({ subject, content }: { subject?: string; content: string }) {
+ const model = flag('ticketClassification.model');
+
+ const result = await generateObject({
+ model: wrapAISDKModel(openai(model)),
+ messages: [
{
- input: 'test',
- expected: 'hi, test!',
+ role: 'system',
+ content: `You are a customer support engineer classifying tickets as: spam, question, feature_request, or bug_report.
+ If spam, return a polite auto-close message. Otherwise, say a team member will respond shortly.`,
},
{
- input: 'foobar',
- expected: 'hello, foobar!',
+ role: 'user',
+ content: subject ? `Subject: ${subject}\n\n${content}` : content,
+ },
+ ],
+ schema: z.object({
+ category: z.enum(['spam', 'question', 'feature_request', 'bug_report']),
+ response: z.string()
+ }),
+ });
+
+ return result.object;
+}
+
+// Custom exact-match scorer that returns score and metadata
+const ExactMatchScorer = Scorer(
+ 'Exact-Match',
+ ({ output, expected }: { output: { response: string }; expected: { response: string } }) => {
+ const normalizedOutput = output.response.trim().toLowerCase();
+ const normalizedExpected = expected.response.trim().toLowerCase();
+
+ return {
+ score: normalizedOutput === normalizedExpected,
+ metadata: {
+ details: 'A scorer that checks for exact match',
+ },
+ };
+  }
+);
+
+// Custom spam classification scorer
+const SpamClassificationScorer = Scorer(
+ "Spam-Classification",
+ ({ output, expected }: {
+ output: { category: string };
+ expected: { category: string };
+ }) => {
+ const isSpam = (item: { category: string }) => item.category === "spam";
+ return isSpam(output) === isSpam(expected) ? 1 : 0;
+ }
+);
+
+// Define the evaluation
+Eval('spam-classification', {
+ // Specify which flags this eval uses
+ configFlags: pickFlags('ticketClassification'),
+
+ // Test data with input/expected pairs
+ data: () => [
+ {
+ input: {
+ subject: "Congratulations! You've Been Selected for an Exclusive Reward",
+ content: 'Claim your $500 gift card now by clicking this link!',
+ },
+ expected: {
+ category: 'spam',
+ response: "We're sorry, but your message has been automatically closed.",
+ },
+ },
+ {
+ input: {
+ subject: 'FREE CA$H',
+ content: 'BUY NOW ON WWW.BEST-DEALS.COM!',
+ },
+ expected: {
+ category: 'spam',
+ response: "We're sorry, but your message has been automatically closed.",
},
- ];
+ },
+ ],
+
+ // The task to run for each test case
+ task: async ({ input }) => {
+ return await classifyTicket(input);
},
-
- // 2. The task that runs your capability
- task: async (input: string) => {
- return `hi, ${input}!`;
+
+ // Scorers to measure performance
+ scorers: [SpamClassificationScorer, ExactMatchScorer],
+
+ // Optional metadata
+ metadata: {
+ description: 'Classify support tickets as spam or not spam',
},
+});
+```
+
+## Set up flags
- // 3. The scorers that grade the output
- scorers: [Levenshtein],
+The eval above reads its model name from a flag. Define the flag schema in the file `/src/lib/app-scope.ts`:
- // 4. The pass/fail threshold for the scores
- threshold: 1,
+```ts /src/lib/app-scope.ts
+import { createAppScope } from 'axiom/ai/evals';
+import { z } from 'zod';
+
+export const flagSchema = z.object({
+ ticketClassification: z.object({
+ model: z.string().default('gpt-4o-mini'),
+ }),
});
+
+const { flag, pickFlags } = createAppScope({ flagSchema });
+
+export { flag, pickFlags };
```
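+
+The eval's task reads this value through `flag()`, as in the `classifyTicket` function above. The same pattern works anywhere in your app. A minimal sketch (adjust the import path to your project layout):
+
+```ts
+import { flag } from './lib/app-scope';
+
+// Resolves to the schema default ('gpt-4o-mini') unless overridden at run time
+const model = flag('ticketClassification.model');
+```
+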
-## Grading with scorers
+## Run evaluations
-Coming soon A grader is a function that scores a capability’s output. Axiom will provide a library of built-in scorers for common tasks (e.g., checking for semantic similarity, factual correctness, or JSON validity). You can also provide your own custom functions to measure domain-specific logic. Each scorer receives the `input`, the generated `output`, and the `expected` value, and must return a score.
+To run your evaluation suites from your terminal, [install the Axiom CLI](/reference/cli) and use the following commands.
-## Running evaluations
+| Description | Command |
+| ----------- | ------- |
+| Run all evals | `axiom eval` |
+| Run specific eval file | `axiom eval src/evals/ticket-classification.eval.ts` |
+| Run evals matching a glob pattern | `axiom eval "**/*spam*.eval.ts"` |
+| Run eval by name | `axiom eval "spam-classification"` |
+| List available evals without running | `axiom eval --list` |
-Coming soon You will run your evaluation suites from your terminal using the `axiom` CLI.
+## Analyze results in the Console
-```bash
-axiom run evals/text-match.eval.ts
-```
+When you run an eval, the Axiom AI SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. Axiom enriches the traces with `eval.*` attributes so that you can analyze results in depth in the Axiom Console.
+
+The results of an eval include the following:
+- Pass/fail status for each test case
+- Scores from each scorer
+- Comparison to baseline (if available)
+- Links to view detailed traces in Axiom
+
+The Console features leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.
+
+## Additional configuration options
+
+### Custom scorers
+
+A scorer is a function that scores a capability’s output. Scorers receive the `input`, the generated `output`, and the `expected` value, and return a score.
+
+The example above uses two custom scorers. Scorers can return metadata alongside the score.
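+
+For example, a scorer can also use the `input`. The following illustrative sketch assumes the scorer callback receives `input` alongside `output` and `expected`, as described above; it penalizes responses that merely echo the ticket content back:
+
+```ts
+import { Scorer } from 'axiom/ai/evals';
+
+// Illustrative scorer: fail responses that just repeat the ticket content
+const NoEchoScorer = Scorer(
+  'No-Echo',
+  ({ input, output }: {
+    input: { content: string };
+    output: { response: string };
+  }) => {
+    const echoed = output.response.toLowerCase().includes(input.content.toLowerCase());
+    return echoed ? 0 : 1;
+  }
+);
+```
+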
-This command will execute the specified test file using `vitest` in the background. Note that `vitest` will be a peer dependency for this functionality.
+You can use the [`autoevals` library](https://github.com/braintrustdata/autoevals) instead of custom scorers. `autoevals` provides prebuilt scorers for common tasks like semantic similarity, factual correctness, and text matching.
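+
+For example, if your `data` and `task` work with plain strings, you can pass a prebuilt scorer such as `Levenshtein` directly. The following is a minimal sketch; it assumes string `input` and `expected` values so the scorer can compare the text directly:
+
+```ts
+import { Levenshtein } from 'autoevals';
+import { experimental_Eval as Eval } from 'axiom/ai/evals';
+
+// A string-to-string eval scored with a prebuilt autoevals scorer
+Eval('greeting-similarity', {
+  data: () => [
+    { input: 'test', expected: 'hi, test!' },
+    { input: 'foobar', expected: 'hello, foobar!' },
+  ],
+  task: async ({ input }) => `hi, ${input}!`,
+  scorers: [Levenshtein],
+});
+```
+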
-## Analyzing results in the console
+### Run experiments
-Coming soon When you run an eval, the Axiom SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. These traces are enriched with `eval.*` attributes, allowing you to deeply analyze results in the Axiom Console.
+Flags let you parameterize your AI behavior (like model choice or prompting strategies) and run experiments with different configurations. They’re type-safe via Zod schemas, and you can override them at runtime.
-The Console will feature leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.
+The example above uses the `ticketClassification.model` flag to choose the language model. Flags have default values that you can override at runtime in one of the following ways:
+
+- Override flags directly when you run the eval:
+
+ ```bash
+ axiom eval --flag.ticketClassification.model=gpt-4o
+ ```
+
+- Alternatively, specify the flag overrides in a JSON file.
+
+ ```json experiment.json
+ {
+ "ticketClassification": {
+ "model": "gpt-4o"
+ }
+ }
+ ```
+
+  Then pass the JSON file to the `--flags-config` option when you run the eval:
+
+ ```bash
+ axiom eval --flags-config=experiment.json
+ ```
## What’s next?
-Once your capability meets your quality benchmarks in the Measure stage, it’s ready to be deployed. The next step is to monitor its performance with real-world traffic.
+A capability is ready to deploy when it meets your quality benchmarks. From there, the next steps can include the following:
+
+- **Baseline comparisons**: Run evals repeatedly to track regressions over time.
+- **Experiment with flags**: Test different models or strategies using flag overrides.
+- **Advanced scorers**: Build custom scorers for domain-specific metrics.
+- **CI/CD integration**: Add `axiom eval` to your CI pipeline to catch regressions, as in the sketch after this list.
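+
+For example, a CI job can run the same commands shown in the Run evaluations section above. This is a sketch; it assumes the Axiom CLI is installed in the CI environment and that `AXIOM_URL`, `AXIOM_TOKEN`, and `AXIOM_DATASET` are configured as CI secrets:
+
+```bash
+# Run the full eval suite
+axiom eval
+
+# Or limit CI to a single suite by name
+axiom eval "spam-classification"
+```
+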
-Learn more about this step of the AI engineering workflow in the [Observe](/ai-engineering/observe) docs.
\ No newline at end of file
+The next step is to monitor your capability’s performance with real-world traffic. To learn more about this step of the AI engineering workflow, see [Observe](/ai-engineering/observe).
diff --git a/ai-engineering/observe/manual-instrumentation.mdx b/ai-engineering/observe/manual-instrumentation.mdx
index 514281586..9e097aeb6 100644
--- a/ai-engineering/observe/manual-instrumentation.mdx
+++ b/ai-engineering/observe/manual-instrumentation.mdx
@@ -188,7 +188,7 @@ Example of a properly structured chat completion trace:
```typescript TypeScript expandable
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';
-const tracer = trace.getTracer('my-ai-app');
+const tracer = trace.getTracer('my-app');
// Create a span for the AI operation
return tracer.startActiveSpan('chat gpt-4', {
@@ -233,7 +233,7 @@ from opentelemetry import trace
from opentelemetry.trace import SpanKind
import json
-tracer = trace.get_tracer("my-ai-app")
+tracer = trace.get_tracer("my-app")
# Create a span for the AI operation
with tracer.start_as_current_span("chat gpt-4", kind=SpanKind.CLIENT) as span:
diff --git a/ai-engineering/quickstart.mdx b/ai-engineering/quickstart.mdx
index 72058199b..857907e87 100644
--- a/ai-engineering/quickstart.mdx
+++ b/ai-engineering/quickstart.mdx
@@ -5,6 +5,7 @@ keywords: ["ai engineering", "getting started", "install", "setup", "configurati
---
import ReplaceDatasetToken from "/snippets/replace-dataset-token.mdx"
+import ReplaceDomain from "/snippets/replace-domain.mdx"
import Prerequisites from "/snippets/standard-prerequisites.mdx"
import AIInstrumentationApproaches from "/snippets/ai-instrumentation-approaches.mdx"
@@ -98,56 +99,106 @@ To send data to Axiom, configure a tracer. For example, use a dedicated instrume
-1. Create instrumentation file:
+1. Create an instrumentation file:
- ```typescript /src/instrumentation.ts
-
- import 'dotenv/config'; // Make sure to load environment variables
+ ```ts /src/instrumentation.ts expandable
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { resourceFromAttributes } from '@opentelemetry/resources';
- import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
- import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-node';
+ import { BatchSpanProcessor, NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
import { trace } from "@opentelemetry/api";
import { initAxiomAI, RedactionPolicy } from 'axiom/ai';
+ import type { AxiomEvalInstrumentationHook } from 'axiom/ai/config';
const tracer = trace.getTracer("my-tracer");
- // Configure the provider to export traces to your Axiom dataset
- const provider = new NodeTracerProvider({
- resource: resourceFromAttributes({
- [ATTR_SERVICE_NAME]: 'my-ai-app', // Replace with your service name
- },
- {
- // Use the latest schema version
- // Info: https://opentelemetry.io/docs/specs/semconv/
- schemaUrl: 'https://opentelemetry.io/schemas/1.37.0',
- }),
- spanProcessor: new SimpleSpanProcessor(
- new OTLPTraceExporter({
- url: `https://api.axiom.co/v1/traces`,
- headers: {
- Authorization: `Bearer ${process.env.AXIOM_TOKEN}`,
- 'X-Axiom-Dataset': process.env.AXIOM_DATASET!,
- },
- })
- ),
- });
-
- // Register the provider
- provider.register();
-
- // Initialize Axiom AI SDK with the configured tracer
- initAxiomAI({ tracer, redactionPolicy: RedactionPolicy.AxiomDefault });
+ let provider: NodeTracerProvider | undefined;
+
+ // Wrap your logic in the AxiomEvalInstrumentationHook function
+ export const setupAppInstrumentation: AxiomEvalInstrumentationHook = async ({
+ dataset,
+ url,
+ token,
+ }) => {
+ if (provider) {
+ return { provider };
+ }
+
+ if (!dataset || !url || !token) {
+ throw new Error('Missing environment variables');
+ }
+
+  // Use the url, token, and dataset parameters passed to the hook instead of reading environment variables directly
+ const exporter = new OTLPTraceExporter({
+ url: `${url}/v1/traces`,
+ headers: {
+ Authorization: `Bearer ${token}`,
+ 'X-Axiom-Dataset': dataset,
+ },
+ })
+
+ // Configure the provider to export traces to your Axiom dataset
+ provider = new NodeTracerProvider({
+ resource: resourceFromAttributes({
+ [ATTR_SERVICE_NAME]: 'my-app', // Replace with your service name
+ },
+ {
+ // Use the latest schema version
+ // Info: https://opentelemetry.io/docs/specs/semconv/
+ schemaUrl: 'https://opentelemetry.io/schemas/1.37.0',
+ }),
+ spanProcessor: new BatchSpanProcessor(exporter),
+ });
+
+ // Register the provider
+ provider.register();
+
+ // Initialize Axiom AI SDK with the configured tracer
+ initAxiomAI({ tracer, redactionPolicy: RedactionPolicy.AxiomDefault });
+
+ return { provider };
+ };
```
For more information on specifying redaction policies, see [Redaction policies](/ai-engineering/redaction-policies).
+### Create Axiom configuration file
+
+At the root of your project, create the Axiom configuration file `/axiom.config.ts`:
+
+```ts /axiom.config.ts
+import { defineConfig } from 'axiom/ai/config';
+import { setupAppInstrumentation } from './src/instrumentation';
+
+export default defineConfig({
+ eval: {
+ url: process.env.AXIOM_URL,
+ token: process.env.AXIOM_TOKEN,
+ dataset: process.env.AXIOM_DATASET,
+
+ // Optional: customize which files to run
+ include: ['**/*.eval.{ts,js}'],
+
+ // Optional: exclude patterns
+ exclude: [],
+
+ // Optional: timeout for eval execution
+ timeoutMs: 60_000,
+
+ // Optional: instrumentation hook for OpenTelemetry
+    // (created in the instrumentation setup step above)
+ instrumentation: ({ url, token, dataset }) =>
+ setupAppInstrumentation({ url, token, dataset }),
+ },
+});
+```
+
## Store environment variables
Store environment variables in an `.env` file in the root of your project:
```bash .env
+AXIOM_URL="AXIOM_DOMAIN"
AXIOM_TOKEN="API_TOKEN"
AXIOM_DATASET="DATASET_NAME"
OPENAI_API_KEY=""
@@ -158,6 +209,7 @@ ANTHROPIC_API_KEY=""
+
Enter the API keys for the LLMs you want to work with.