diff --git a/ai-engineering/concepts.mdx b/ai-engineering/concepts.mdx index cc56f634c..ce4ade9ac 100644 --- a/ai-engineering/concepts.mdx +++ b/ai-engineering/concepts.mdx @@ -1,7 +1,7 @@ --- title: "Concepts" -description: "Learn about the core concepts in AI engineering: Capabilities, Collections, Graders, Evals, and more." -keywords: ["ai engineering", "AI engineering", "concepts", "capability", "grader", "eval"] +description: "Learn about the core concepts in AI engineering: Capabilities, Collections, Evals, Scorers, Annotations, and User Feedback." +keywords: ["ai engineering", "AI engineering", "concepts", "capability", "collection", "eval", "scorer", "annotations", "feedback", "flags"] --- import { definitions } from '/snippets/definitions.mdx' @@ -17,10 +17,10 @@ The concepts in AI engineering are best understood within the context of the dev Development starts by defining a task and prototyping a capability with a prompt to solve it. - The prototype is then tested against a collection of reference examples (so called “ground truth”) to measure its quality and effectiveness using graders. This process is known as an eval. + The prototype is then tested against a collection of reference examples (so called "ground truth") to measure its quality and effectiveness using scorers. This process is known as an eval. - Once a capability meets quality benchmarks, it’s deployed. In production, graders can be applied to live traffic (online evals) to monitor performance and cost in real-time. + Once a capability meets quality benchmarks, it’s deployed. In production, scorers can be applied to live traffic (online evals) to monitor performance and cost in real-time. Insights from production monitoring reveal edge cases and opportunities for improvement. These new examples are used to refine the capability, expand the ground truth collection, and begin the cycle anew. @@ -33,40 +33,53 @@ The concepts in AI engineering are best understood within the context of the dev A generative AI capability is a system that uses large language models to perform a specific task by transforming inputs into desired outputs. -Capabilities exist on a spectrum of complexity. They can be a simple, single-step function (for example, classifying a support ticket’s intent) or evolve into a sophisticated, multi-step agent that uses reasoning and tools to achieve a goal (for example, orchestrating a complete customer support resolution). +Capabilities exist on a spectrum of complexity, ranging from simple to sophisticated architectures: + +- **Single-turn model interactions**: A single prompt and response, such as classifying a support ticket’s intent or summarizing a document. +- **Workflows**: Multi-step processes where each step’s output feeds into the next, such as research → analysis → report generation. +- **Single-agent**: An agent that can reasons and make decisions to accomplish a goal, such as a customer support agent that can search documentation, check order status, and draft responses. +- **Multi-agent**: Multiple specialized agents collaborating to solve complex problems, such as software engineering through architectural planning, coding, testing, and review. ### Collection A collection is a curated set of reference records used for development, testing, and evaluation of a capability. Collections serve as the test cases for prompt engineering. -### Record - -Records are the individual input-output pairs within a collection. 
Each record consists of an input and its corresponding expected output (ground truth). - -### Reference +### Collection record -A reference is a historical example of a task completed successfully, serving as a benchmark for AI performance. References provide the input-output pairs that demonstrate the expected behavior and quality standards. +Collection records are the individual input-output pairs within a collection. Each record consists of an input and its corresponding expected output (ground truth). ### Ground truth Ground truth is the validated, expert-approved correct output for a given input. It represents the gold standard that the AI capability should aspire to match. -### Annotation +### Scorer + +A scorer is a function that evaluates a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score. + +### Evaluation or "eval" -Annotations are expert-provided labels, corrections, or outputs added to records to establish or refine ground truth. +An evaluation, or eval, is the process of testing a capability against a collection of ground truth data using one or more scorers. An eval runs the capability on every record in the collection and reports metrics like accuracy, pass-rate, and cost. Evals are typically run before deployment to benchmark performance. -### Grader +### Flag -A grader is a function that scores a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score or judgment. Graders are the reusable, atomic scoring logic used in all forms of evaluation. +A flag is a configuration parameter that controls how your AI capability behaves. Flags let you parameterize aspects like model choice, tool availability, prompting strategies, or retrieval approaches. By defining flags, you can run experiments to compare different configurations and systematically determine which approach performs best. -### Evaluator (eval) +### Experiment + +An experiment is an evaluation run with a specific set of flag values. By running multiple experiments with different flag configurations, you can compare performance across different models, prompts, or strategies to find the optimal setup for your capability. + +### Online evaluation + +An online evaluation is the process of applying a scorer to a capability’s live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement. + +### Annotation -An evaluator, or eval, is the process of testing a capability against a collection of ground truth data using one or more graders. An eval runs the capability on every record in the collection and reports metrics like accuracy, pass-rate, and cost. Evals are typically run before deployment to benchmark performance. +Annotations are expert-provided observations, labels, or corrections added to production traces or evaluation results. Domain experts review AI capability runs and document what went wrong, what should have happened differently, or categorize failure modes. These annotations help identify patterns in capability failures, validate scorer accuracy, and create new test cases for collections. -### Online eval +### User feedback -An online eval is the process of applying a grader to a capability’s live production traffic. 
This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement. +User feedback is direct signal from end users about AI capability performance, typically collected through ratings (thumbs up/down, stars) or text comments. Feedback events are associated with traces to provide context about both system behavior and user perception. Aggregated feedback reveals quality trends, helps prioritize improvements, and surfaces issues that might not appear in evaluations. ## What’s next? -Now that you understand the core concepts, see them in action in the AI engineering [workflow](/ai-engineering/quickstart). \ No newline at end of file +Now that you understand the core concepts, get started with the [Quickstart](/ai-engineering/quickstart) or dive into [Evaluate](/ai-engineering/evaluate/overview) to learn about systematic testing. \ No newline at end of file diff --git a/ai-engineering/create.mdx b/ai-engineering/create.mdx index 9f9f23232..5e54acfef 100644 --- a/ai-engineering/create.mdx +++ b/ai-engineering/create.mdx @@ -1,133 +1,159 @@ --- title: "Create" -description: "Learn how to create and define AI capabilities using structured prompts and typed arguments with Axiom." -keywords: ["ai engineering", "AI engineering", "create", "prompt", "template", "schema"] +description: "Build AI capabilities using any framework, with best support for TypeScript-based tools." +keywords: ["ai engineering", "create", "prompt", "capability", "vercel ai sdk"] --- -import { Badge } from "/snippets/badge.jsx" import { definitions } from '/snippets/definitions.mdx' -The **Create** stage is about defining a new AI capability as a structured, version-able asset in your codebase. The goal is to move away from scattered, hard-coded string prompts and toward a more disciplined and organized approach to prompt engineering. +Building an AI capability starts with prototyping. You can use whichever framework you prefer. Axiom is focused on helping you evaluate and observe your capabilities rather than prescribing how to build them. + +TypeScript-based frameworks like Vercel’s [AI SDK](https://sdk.vercel.ai) do integrate most seamlessly with Axiom’s tooling today, but that’s likely to evolve over time. + +## Build your capability + +Define your capability using your framework of choice. Here’s an example using Vercel's [AI SDK](https://ai-sdk.dev/), which includes [many examples](https://sdk.vercel.ai/examples) covering different capability design patterns. Popular alternatives like [Mastra](https://mastra.ai) also exist. + +```ts src/lib/capabilities/classify-ticket.ts expandable +import { generateObject } from 'ai'; +import { openai } from '@ai-sdk/openai'; +import { wrapAISDKModel } from 'axiom/ai'; +import { z } from 'zod'; + +export async function classifyTicket(input: { + subject?: string; + content: string +}) { + const result = await generateObject({ + model: wrapAISDKModel(openai('gpt-4o-mini')), + messages: [ + { + role: 'system', + content: 'Classify support tickets as: question, bug_report, or feature_request.', + }, + { + role: 'user', + content: input.subject + ? `Subject: ${input.subject}\n\n${input.content}` + : input.content, + }, + ], + schema: z.object({ + category: z.enum(['question', 'bug_report', 'feature_request']), + confidence: z.number().min(0).max(1), + }), + }); -### Defining a capability as a prompt object + return result.object; +} +``` -In Axiom AI engineering, every capability is represented by a `Prompt` object. 
This object serves as the single source of truth for the capability’s logic, including its messages, metadata, and the schema for its arguments. +The `wrapAISDKModel` function instruments your model calls for Axiom’s observability features. Learn more in the [Observe](/ai-engineering/observe) section. -For now, these `Prompt` objects can be defined and managed as TypeScript files within your own project repository. +## Gather reference examples -A typical `Prompt` object looks like this: +As you prototype, collect examples of inputs and their correct outputs. + +```ts +const referenceExamples = [ + { + input: { + subject: 'How do I reset my password?', + content: 'I forgot my password and need help.' + }, + expected: { category: 'question' }, + }, + { + input: { + subject: 'App crashes on startup', + content: 'The app immediately crashes when I open it.' + }, + expected: { category: 'bug_report' }, + }, +]; +``` -```ts /src/prompts/email-summarizer.prompt.ts +These become your ground truth for evaluation. Learn more in the [Evaluate](/ai-engineering/evaluate/overview) section. + +## Structured prompt management + + +The features below are experimental. Axiom’s current focus is on the evaluation and observability stages of the AI engineering workflow. + + +For teams wanting more structure around prompt definitions, Axiom’s SDK includes experimental utilities for managing prompts as versioned objects. + +### Define prompts as objects + +Represent capabilities as structured `Prompt` objects: + +```ts src/prompts/ticket-classifier.prompt.ts import { experimental_Type, type experimental_Prompt } from 'axiom/ai'; -export const emailSummarizerPrompt = { - name: "Email Summarizer", - slug: "email-summarizer", +export const ticketClassifierPrompt = { + name: "Ticket Classifier", + slug: "ticket-classifier", version: "1.0.0", - model: "gpt-4o", + model: "gpt-4o-mini", messages: [ { role: "system", - content: - `Summarize emails concisely, highlighting action items. - The user is named {{ username }}.`, + content: "Classify support tickets as: {{ categories }}", }, { role: "user", - content: "Please summarize this email: {{ email_content }}", + content: "{{ ticket_content }}", }, ], arguments: { - username: experimental_Type.String(), - email_content: experimental_Type.String(), + categories: experimental_Type.String(), + ticket_content: experimental_Type.String(), }, } satisfies experimental_Prompt; ``` -### Strongly-typed arguments with `Template` - -To ensure that prompts are used correctly, the Axiom’s AI SDK includes a `Template` type system (exported as `Type`) for defining the schema of a prompt’s `arguments`. This provides type safety, autocompletion, and a clear, self-documenting definition of what data the prompt expects. - -The `arguments` object uses `Template` helpers to define the shape of the context: - -```typescript /src/prompts/report-generator.prompt.ts -import { - experimental_Type, - type experimental_Prompt -} from 'axiom/ai'; - -export const reportGeneratorPrompt = { - // ... 
other properties - arguments: { - company: experimental_Type.Object({ - name: experimental_Type.String(), - isActive: experimental_Type.Boolean(), - departments: experimental_Type.Array( - experimental_Type.Object({ - name: experimental_Type.String(), - budget: experimental_Type.Number(), - }) - ), - }), - priority: experimental_Type.Union([ - experimental_Type.Literal("high"), - experimental_Type.Literal("medium"), - experimental_Type.Literal("low"), - ]), - }, -} satisfies experimental_Prompt; +### Type-safe arguments + +The `experimental_Type` system provides type safety for prompt arguments: + +```ts +arguments: { + user: experimental_Type.Object({ + name: experimental_Type.String(), + preferences: experimental_Type.Array(experimental_Type.String()), + }), + priority: experimental_Type.Union([ + experimental_Type.Literal("high"), + experimental_Type.Literal("medium"), + experimental_Type.Literal("low"), + ]), +} ``` -You can even infer the exact TypeScript type for a prompt’s context using the `InferContext` utility. - -### Prototyping and local testing +### Local testing -Before using a prompt in your application, you can test it locally using the `parse` function. This function takes a `Prompt` object and a `context` object, rendering the templated messages to verify the output. This is a quick way to ensure your templating logic is correct. +Test prompts locally before using them: -```typescript +```ts import { experimental_parse } from 'axiom/ai'; -import { - reportGeneratorPrompt -} from './prompts/report-generator.prompt'; - -const context = { - company: { - name: 'Axiom', - isActive: true, - departments: [ - { name: 'Engineering', budget: 500000 }, - { name: 'Marketing', budget: 150000 }, - ], - }, - priority: 'high' as const, -}; - -// Render the prompt with the given context -const parsedPrompt = await experimental_parse( - reportGeneratorPrompt, { context } -); - -console.log(parsedPrompt.messages); -// [ -// { -// role: 'system', -// content: 'Generate a report for Axiom.\nCompany Status: Active...' -// } -// ] -``` -### Managing prompts with Axiom +const parsed = await experimental_parse(ticketClassifierPrompt, { + context: { + categories: 'question, bug_report, feature_request', + ticket_content: 'How do I reset my password?', + }, +}); -To enable more advanced workflows and collaboration, Axiom is building tools to manage your prompt assets centrally. +console.log(parsed.messages); +``` -* Coming soon The `axiom` CLI will allow you to `push`, `pull`, and `list` prompt versions directly from your terminal, synchronizing your local files with the Axiom platform. -* Coming soon The SDK will include methods like `axiom.prompts.create()` and `axiom.prompts.load()` for programmatic access to your managed prompts. This will be the foundation for A/B testing, version comparison, and deploying new prompts without changing your application code. +These utilities help organize prompts in your codebase. Centralized prompt management and versioning features may be added in future releases. -### What’s next? +## What's next? -Now that you’ve created and structured your capability, the next step is to measure its quality against a set of known good examples. +Once you have a working capability and reference examples, systematically evaluate its performance. -Learn more about this step of the AI engineering workflow in the [Measure](/ai-engineering/measure) docs. \ No newline at end of file +To learn how to set up and run evaluations, see [Evaluate](/ai-engineering/evaluate/overview). 
diff --git a/ai-engineering/evaluate/analyze-results.mdx b/ai-engineering/evaluate/analyze-results.mdx new file mode 100644 index 000000000..e636dacca --- /dev/null +++ b/ai-engineering/evaluate/analyze-results.mdx @@ -0,0 +1,110 @@ +--- +title: "Analyze results" +description: "Understand how changes to your AI capabilities impact performance, cost, and quality." +keywords: ["ai engineering", "console", "results", "analysis", "comparison", "baseline"] +--- + +After running an evaluation, the CLI provides a link to view results in the Axiom Console: + +``` +your-eval-name (your-eval.eval.ts) + + • scorer-one 95.00% + • scorer-two 87.50% + • scorer-three 100.00% + +View full report: +https://app.axiom.co/:org-id/ai-engineering/evaluations?runId=:run-id + +Test Files 1 passed (1) +Tests 4 passed (4) +Duration 5.2 s +``` + +The evaluation interface helps you answer three core questions: +1. How well does this configuration perform? +2. How does it compare to previous versions? +3. Which tradeoffs are acceptable? + +## Compare configurations + +To understand the impact of changes, compare evaluation runs to see deltas in accuracy, latency, and cost. + +### Using the Console + +Run your evaluation before and after making changes, then compare both runs in the Axiom Console: + +```bash +# Run baseline +axiom eval your-eval-name + +# Make changes to your capability (update prompt, switch models, etc.) + +# Run again +axiom eval your-eval-name +``` + +The Console shows both runs where you can analyze differences side-by-side. + +### Using the baseline flag + +For direct CLI comparison, specify a baseline evaluation ID: + +```bash +# Run baseline and note the trace ID from the output +axiom eval your-eval-name + +# Make changes, then run with baseline +axiom eval your-eval-name --baseline +``` + +The CLI output will show deltas for each metric. + + +The `--baseline` flag expects a trace ID. After running an evaluation, copy the trace ID from the CLI output or Console URL to use as a baseline for comparison. + + +Example: Switching from `gpt-4o-mini` to `gpt-4o` might show: +- Accuracy: 85% → 95% (+10%) +- Latency: 800 ms → 1.6 s (+100%) +- Cost per run: $0.002 → $0.020 (+900%) + +This data helps you decide whether the quality improvement justifies the cost and latency increase for your use case. + +## Investigate failures + +When test cases fail, click into them to see: +- The exact input that triggered the failure +- What your capability output vs what was expected +- The full trace of LLM calls and tool executions + +Look for patterns: +- Do failures cluster around specific input types? +- Are certain scorers failing consistently? +- Is high token usage correlated with failures? + +Use these insights to add targeted test cases or refine your capability. + +## Experiment with flags + +Flags let you test multiple configurations systematically. Run several experiments: + +```bash +# Compare model and retrieval configurations +axiom eval --flag.model=gpt-4o-mini --flag.retrieval.topK=3 +axiom eval --flag.model=gpt-4o-mini --flag.retrieval.topK=10 +axiom eval --flag.model=gpt-4o --flag.retrieval.topK=3 +axiom eval --flag.model=gpt-4o --flag.retrieval.topK=10 +``` + +Compare all four runs in the Console to find the configuration that best balances quality, cost, and latency for your requirements. + +## Track progress over time + +For teams running evaluations regularly (nightly or in CI), the Console shows whether your capability is improving or regressing across iterations. 
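+
+A minimal sketch of a scheduled run, assuming GitHub Actions and the same `AXIOM_TOKEN` and `AXIOM_DATASET` secrets as the CI example in [Flags and experiments](/ai-engineering/evaluate/flags-experiments); the workflow file name and cron schedule below are illustrative:
+
+```yaml .github/workflows/nightly-eval.yml
+name: Nightly evaluations
+
+on:
+  schedule:
+    - cron: '0 3 * * *' # every night at 03:00 UTC
+
+jobs:
+  eval:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - uses: actions/setup-node@v3
+      - run: npm install
+      # Run the full evaluation suite so results accumulate in the Console
+      - run: npx axiom eval
+        env:
+          AXIOM_TOKEN: ${{ secrets.AXIOM_TOKEN }}
+          AXIOM_DATASET: ${{ secrets.AXIOM_DATASET }}
+```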
+ +Compare your latest run against your initial baseline to verify that accumulated changes are moving in the right direction. + +## What's next? + +To learn how to use flags for experimentation, see [Flags and experiments](/ai-engineering/evaluate/flags-experiments). diff --git a/ai-engineering/evaluate/flags-experiments.mdx b/ai-engineering/evaluate/flags-experiments.mdx new file mode 100644 index 000000000..07e980ebf --- /dev/null +++ b/ai-engineering/evaluate/flags-experiments.mdx @@ -0,0 +1,357 @@ +--- +title: "Flags and experiments" +description: "Use flags to parameterize AI capabilities and run experiments comparing different configurations." +keywords: ["ai engineering", "flags", "experiments", "configuration", "parameters", "testing"] +--- + +import { definitions } from "/snippets/definitions.mdx" + +Flags are configuration parameters that control how your AI capability behaves. By defining flags, you can run experiments that systematically compare different models, prompts, retrieval strategies, or architectural approaches - all without changing your code. + +This is one of Axiom’s key differentiators: type-safe, version-controlled configuration that integrates seamlessly with your evaluation workflow. + +## Why flags matter + +AI capabilities have many tunable parameters: which model to use, which tools to enable, which prompting strategy, how to structure retrieval, and more. Without flags, you’d need to: + +- Hard-code values and manually change them between tests +- Maintain multiple versions of the same code +- Lose track of which configuration produced which results +- Struggle to reproduce experiments + +Flags solve this by: + +- **Parameterizing behavior**: Define what can vary in your capability +- **Enabling experimentation**: Test multiple configurations systematically +- **Tracking results**: Axiom records which flag values produced which scores +- **Automating optimization**: Run experiments in CI/CD to find the best configuration + +## Setting up flags + +Flags are defined using [Zod](https://zod.dev/) schemas in an "app scope" file. This provides type safety and ensures flag values are validated at runtime. + +### Create the app scope + +Create a file to define your flags (typically `src/lib/app-scope.ts`): + +```ts src/lib/app-scope.ts +import { createAppScope } from 'axiom/ai'; +import { z } from 'zod'; + +export const flagSchema = z.object({ + // Flags for ticket classification capability + ticketClassification: z.object({ + model: z.string().default('gpt-4o-mini'), + systemPrompt: z.enum(['concise', 'detailed']).default('concise'), + useStructuredOutput: z.boolean().default(true), + }), + + // Flags for document summarization capability + summarization: z.object({ + model: z.string().default('gpt-4o'), + maxTokens: z.number().default(500), + style: z.enum(['bullet-points', 'paragraph']).default('bullet-points'), + }), +}); + +const { flag, pickFlags } = createAppScope({ flagSchema }); + +export { flag, pickFlags }; +``` + +### Use flags in your capability + +Reference flags in your capability code using the `flag()` function: + +```ts src/lib/capabilities/classify-ticket/prompts.ts +import { generateObject } from 'ai'; +import { openai } from '@ai-sdk/openai'; +import { wrapAISDKModel } from 'axiom/ai'; +import { flag } from '../../app-scope'; +import { z } from 'zod'; + +const systemPrompts = { + concise: 'Classify tickets briefly as: spam, question, feature_request, or bug_report.', + detailed: `You are an expert customer support engineer. 
Carefully analyze each ticket + and classify it as spam, question, feature_request, or bug_report. Consider context and intent.`, +}; + +export async function classifyTicket(input: { subject?: string; content: string }) { + // Get flag values + const model = flag('ticketClassification.model'); + const promptStyle = flag('ticketClassification.systemPrompt'); + const useStructured = flag('ticketClassification.useStructuredOutput'); + + const result = await generateObject({ + model: wrapAISDKModel(openai(model)), + messages: [ + { + role: 'system', + content: systemPrompts[promptStyle], + }, + { + role: 'user', + content: input.subject + ? `Subject: ${input.subject}\n\n${input.content}` + : input.content, + }, + ], + schema: z.object({ + category: z.enum(['spam', 'question', 'feature_request', 'bug_report']), + }), + }); + + return result.object; +} +``` + +### Declare flags in evaluations + +Tell your evaluation which flags it depends on using `pickFlags()`. This provides two key benefits: + +- **Documentation**: Makes flag dependencies explicit and visible +- **Validation**: Warns about undeclared flag usage, catching configuration drift early + +```ts src/lib/capabilities/classify-ticket/evaluations/spam-classification.eval.ts +import { Eval, Scorer } from 'axiom/ai/evals'; +import { pickFlags } from '../../../app-scope'; +import { classifyTicket } from '../prompts'; + +Eval('spam-classification', { + // Declare which flags this eval uses + configFlags: pickFlags('ticketClassification'), + + capability: 'classify-ticket', + data: [/* test cases */], + task: async ({ input }) => await classifyTicket(input), + scorers: [/* scorering functions */], +}); +``` + +## Running experiments + +With flags defined, you can run experiments by overriding flag values at runtime. + +### CLI flag overrides + +Override individual flags directly in the command: + +```bash +# Test with GPT-4o instead of the default +axiom eval --flag.ticketClassification.model=gpt-4o + +# Test with different prompt style +axiom eval --flag.ticketClassification.systemPrompt=detailed + +# Test multiple flags +axiom eval \ + --flag.ticketClassification.model=gpt-4o \ + --flag.ticketClassification.systemPrompt=detailed \ + --flag.ticketClassification.useStructuredOutput=false +``` + +### JSON configuration files + +For complex experiments, define flag overrides in JSON files: + +```json experiments/gpt4-detailed.json +{ + "ticketClassification": { + "model": "gpt-4o", + "systemPrompt": "detailed", + "useStructuredOutput": true + } +} +``` + +```json experiments/gpt4-mini-concise.json +{ + "ticketClassification": { + "model": "gpt-4o-mini", + "systemPrompt": "concise", + "useStructuredOutput": false + } +} +``` + +Run evaluations with these configurations: + +```bash +# Run with first configuration +axiom eval --flags-config=experiments/gpt4-detailed.json + +# Run with second configuration +axiom eval --flags-config=experiments/gpt4-mini-concise.json +``` + + +Store experiment configurations in version control. This makes it easy to reproduce results and track which experiments you've tried. 
+ + +### Comparing experiments + +Run the same evaluation with different flag values to compare approaches: + +```bash +# Baseline: default flags (gpt-4o-mini, concise, structured output) +axiom eval spam-classification + +# Experiment 1: Try GPT-4o +axiom eval spam-classification --flag.ticketClassification.model=gpt-4o + +# Experiment 2: Use detailed prompting +axiom eval spam-classification --flag.ticketClassification.systemPrompt=detailed + +# Experiment 3: Test without structured output +axiom eval spam-classification --flag.ticketClassification.useStructuredOutput=false +``` + +Axiom tracks all these runs in the Console, making it easy to compare scores and identify the best configuration. + +## Best practices + +### Organize flags by capability + +Group related flags together to make them easier to manage: + +```ts +export const flagSchema = z.object({ + // One group per capability + ticketClassification: z.object({ + model: z.string().default('gpt-4o-mini'), + temperature: z.number().default(0.7), + }), + + emailGeneration: z.object({ + model: z.string().default('gpt-4o'), + tone: z.enum(['formal', 'casual']).default('formal'), + }), + + documentRetrieval: z.object({ + topK: z.number().default(5), + similarityThreshold: z.number().default(0.7), + }), +}); +``` + +### Set sensible defaults + +Choose defaults that work well for most cases. Experiments then test variations: + +```ts +ticketClassification: z.object({ + model: z.enum(['gpt-4o', 'gpt-4o-mini', 'gpt-4-turbo']).default('gpt-4o-mini'), + systemPrompt: z.enum(['concise', 'detailed']).default('concise'), + useStructuredOutput: z.boolean().default(true), +}), +``` + + +For evaluations that test your application code, it’s best to use the same defaults as your production configuration. + + +### Use enums for discrete choices + +When flags have a fixed set of valid values, use enums for type safety: + +```ts +// Good: type-safe, prevents invalid values +model: z.enum(['gpt-4o', 'gpt-4o-mini', 'gpt-4-turbo']).default('gpt-4o-mini'), +tone: z.enum(['formal', 'casual', 'friendly']).default('formal'), + +// Avoid: any string is valid, causes runtime errors with AI SDK +model: z.string().default('gpt-4o-mini'), +tone: z.string().default('formal'), +``` + +## Advanced patterns + +### Model comparison matrix + +Test your capability across multiple models systematically: + +```bash +# Create experiment configs for each model +echo '{"ticketClassification":{"model":"gpt-4o-mini"}}' > exp-mini.json +echo '{"ticketClassification":{"model":"gpt-4o"}}' > exp-4o.json +echo '{"ticketClassification":{"model":"gpt-4-turbo"}}' > exp-turbo.json + +# Run all experiments +axiom eval --flags-config=exp-mini.json +axiom eval --flags-config=exp-4o.json +axiom eval --flags-config=exp-turbo.json +``` + +### Prompt strategy testing + +Compare different prompting approaches: + +```ts +export const flagSchema = z.object({ + summarization: z.object({ + strategy: z.enum([ + 'chain-of-thought', + 'few-shot', + 'zero-shot', + 'structured-output', + ]).default('zero-shot'), + }), +}); +``` + +```bash +# Test each strategy +for strategy in chain-of-thought few-shot zero-shot structured-output; do + axiom eval --flag.summarization.strategy=$strategy +done +``` + +### Cost vs quality optimization + +Find the sweet spot between performance and cost: + +```json experiments/cost-quality-matrix.json +[ + { "model": "gpt-4o-mini", "temperature": 0.7 }, + { "model": "gpt-4o-mini", "temperature": 0.3 }, + { "model": "gpt-4o", "temperature": 0.7 }, + { "model": "gpt-4o", 
"temperature": 0.3 } +] +``` + +Run experiments and compare cost (from telemetry) against accuracy scores to find the optimal configuration. + +### CI/CD integration + +Run experiments automatically in your CI pipeline: + +```yaml .github/workflows/eval.yml +name: Run Evaluations + +on: [pull_request] + +jobs: + eval: + runs-on: ubuntu-latest + strategy: + matrix: + model: [gpt-4o-mini, gpt-4o] + steps: + - uses: actions/checkout@v3 + - uses: actions/setup-node@v3 + - run: npm install + - run: | + npx axiom eval \ + --flag.ticketClassification.model=${{ matrix.model }} + env: + AXIOM_TOKEN: ${{ secrets.AXIOM_TOKEN }} + AXIOM_DATASET: ${{ secrets.AXIOM_DATASET }} +``` + +This automatically tests your capability with different configurations on every pull request. + +## What's next? + +- To learn all CLI commands for running evaluations, see [Run evaluations](/ai-engineering/evaluate/run-evaluations). +- To view results in the Console and compare experiments, see [Analyze results](/ai-engineering/evaluate/analyze-results). + diff --git a/ai-engineering/evaluate/overview.mdx b/ai-engineering/evaluate/overview.mdx new file mode 100644 index 000000000..c72df0615 --- /dev/null +++ b/ai-engineering/evaluate/overview.mdx @@ -0,0 +1,51 @@ +--- +title: "Evaluation overview" +description: "Systematically measure and improve your AI capabilities through offline evaluation." +sidebarTitle: Overview +keywords: ["ai engineering", "evaluation", "evals", "testing", "quality"] +--- + +import { definitions } from '/snippets/definitions.mdx' + +Evaluation is the systematic process of measuring how well your AI capability performs against known correct examples. Instead of relying on manual spot-checks or subjective assessments, evaluations provide quantitative, repeatable benchmarks that let you confidently improve your AI systems over time. + +## Why systematic evaluation matters + +AI systems fail in non-deterministic ways. The same prompt can produce different results. Edge cases emerge unpredictably. As capabilities grow from simple single-turn interactions to complex multi-agent systems, manual testing becomes impossible to scale. + +Systematic evaluation solves this by: + +- **Establishing baselines**: Measure current performance before making changes +- **Preventing regressions**: Catch quality degradation before it reaches production +- **Enabling experimentation**: Compare different models, prompts, or architectures +- **Building confidence**: Deploy changes knowing they improve aggregate performance + +## The evaluation workflow + +Axiom's evaluation framework follows a simple pattern: + + + +Build a dataset of test cases with inputs and expected outputs (ground truth). Start small with 10-20 examples and grow over time. + + + +Write functions that compare your capability’s output against the expected result. Use custom logic or prebuilt scorers from libraries like `autoevals`. + + + +Execute your capability against the collection and score the results. Track metrics like accuracy, pass rate, and cost. + + + +Review results in the Axiom Console. Compare against baselines. Identify failures. Make improvements and re-evaluate. + + + +## What’s next? + +- To set up your environment and authenticate, see [Setup and authentication](/ai-engineering/evaluate/setup). +- To learn how to write evaluation functions, see [Write evaluations](/ai-engineering/evaluate/write-evaluations). +- To understand flags and experiments, see [Flags and experiments](/ai-engineering/evaluate/flags-experiments). 
+- To view results in the Console, see [Analyze results](/ai-engineering/evaluate/analyze-results). + diff --git a/ai-engineering/evaluate/run-evaluations.mdx b/ai-engineering/evaluate/run-evaluations.mdx new file mode 100644 index 000000000..bbb3a2b95 --- /dev/null +++ b/ai-engineering/evaluate/run-evaluations.mdx @@ -0,0 +1,94 @@ +--- +title: "Run evaluations" +description: "Learn how to run evaluations using the Axiom CLI and interpret the results." +keywords: ["ai engineering", "cli", "run evals", "commands", "testing"] +--- + +The Axiom AI SDK CLI provides commands for running evaluations locally or in CI/CD pipelines. + +## Run evaluations + +The simplest way to run evaluations is to execute all of them in your project: + +```bash +axiom eval +``` + +You can also target specific evaluations by name, file path, or glob pattern: + +```bash +# By evaluation name +axiom eval spam-classification + +# By file path +axiom eval src/evals/spam-classification.eval.ts + +# By glob pattern +axiom eval "**/*spam*.eval.ts" +``` + +To see which evaluations are available without running them: + +```bash +axiom eval --list +``` + +## Common options + +For quick local testing without sending traces to Axiom, use debug mode: + +```bash +axiom eval --debug +``` + +To compare results against a previous evaluation, view both runs in the Axiom Console where you can analyze differences in scores, latency, and cost. + +## Run experiments with flags + +Flags let you test different configurations without changing code. Override flag values directly in the command: + +```bash +# Single flag +axiom eval --flag.ticketClassification.model=gpt-4o + +# Multiple flags +axiom eval \ + --flag.ticketClassification.model=gpt-4o \ + --flag.ticketClassification.temperature=0.3 +``` + +For complex experiments, load flag overrides from a JSON file: + +```bash +axiom eval --flags-config=experiments/gpt4.json +``` + +## Understanding output + +When you run an evaluation, the CLI shows progress, scores, and a link to view detailed results in the Axiom Console: + +``` +✓ spam-classification (4/4 passed) + ✓ Test case 1: spam detection + ✓ Test case 2: legitimate question + +Scorers: + category-match: 100% (4/4) + high-confidence: 75% (3/4) + +Results: + Total: 4 test cases + Passed: 4 (100%) + Duration: 3.2s + Cost: $0.0024 + +View full report: +https://app.axiom.co/your-org/ai-engineering/evaluations?runId=ABC123 +``` + +Click the link to view results in the Console, compare runs, and analyze performance. + +## What's next? + +To learn how to view and analyze evaluation results, see [Analyze results](/ai-engineering/evaluate/analyze-results). + diff --git a/ai-engineering/evaluate/setup.mdx b/ai-engineering/evaluate/setup.mdx new file mode 100644 index 000000000..a75fb5399 --- /dev/null +++ b/ai-engineering/evaluate/setup.mdx @@ -0,0 +1,201 @@ +--- +title: "Setup and authentication" +description: "Install the Axiom AI SDK and authenticate with the CLI to run evaluations." +keywords: ["ai engineering", "setup", "authentication", "cli", "install"] +sidebarTitle: "Setup and auth" +--- + +import ReplaceEdgeDomain from "/snippets/replace-edge-domain.mdx" +import ReplaceDatasetToken from "/snippets/replace-dataset-token.mdx" + +This guide walks you through installing the Axiom AI SDK and authenticating with the Axiom AI SDK CLI so your evaluation results are tracked and attributed correctly in the Axiom Console. 
+ +## Prerequisites + +- Node.js 18 or later +- An Axiom account with a dataset for storing evaluation traces +- A TypeScript or JavaScript project (evaluations work best with TypeScript frameworks like Vercel AI SDK) + +## Install the Axiom AI SDK + +Install the `axiom` package in your project: + +```bash +npm install axiom +``` + +This package provides: +- The `Eval` function for defining evaluations +- The `Scorer` wrapper for creating custom scorers +- Instrumentation helpers for capturing AI telemetry +- The Axiom AI SDK CLI for running evaluations + +## Authenticate with Axiom AI SDK CLI + +The Axiom AI SDK includes a dedicated CLI for running evaluations. This CLI is separate from Axiom's main data platform CLI and is focused specifically on AI engineering workflows. + + +Authenticating with the CLI ensures that evaluation runs are recorded in Axiom and attributed to your user account. This makes it easy to track who ran which experiments and compare results across your team. + + +### Login with OAuth + +Run the login command to authenticate via OAuth: + +```bash +npx axiom auth login +``` + +This opens your browser and prompts you to authorize the CLI with your Axiom account. Once authorized, the CLI stores your credentials securely on your machine. + +### Check authentication status + +Verify you're logged in and see which organization you're using: + +```bash +npx axiom auth status +``` + +This displays your current authentication state, including your username and active organization. + +### Switch organizations + +If you belong to multiple Axiom organizations, switch between them: + +```bash +npx axiom auth switch +``` + +This presents a list of organizations you can access and lets you select which one to use for evaluations. + +### Logout + +To remove stored credentials: + +```bash +npx axiom auth logout +``` + +## Authenticate with environment variables + +Instead of using OAuth, you can authenticate using environment variables: + +```bash +export AXIOM_TOKEN="API_TOKEN" +export AXIOM_DATASET="DATASET_NAME" +export AXIOM_ORG_ID="ORGANIZATION_ID" +export AXIOM_URL="AXIOM_EDGE_DOMAIN" +``` + + + +Replace `ORGANIZATION_ID` with the organization ID. + + + +## Create the Axiom configuration file + +Create an `axiom.config.ts` file in your project root to configure how evaluations run: + +```ts axiom.config.ts +import { defineConfig } from 'axiom/ai/config'; +import { setupAppInstrumentation } from './src/instrumentation'; + +export default defineConfig({ + eval: { + // Glob patterns for evaluation files + include: ['**/*.eval.{ts,js,mts,mjs,cts,cjs}'], + exclude: ['**/node_modules/**'], + + // Instrumentation hook - called before running evals + instrumentation: ({ url, token, dataset, orgId }) => + setupAppInstrumentation({ url, token, dataset, orgId }), + + // Timeout for individual test cases (milliseconds) + timeoutMs: 60_000, + }, +}); +``` + +### Set up instrumentation + +The `instrumentation` function initializes OpenTelemetry tracing so evaluation runs are captured as traces in Axiom. 
Create a file to set up your tracing provider: + +```ts src/instrumentation.ts +import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'; +import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'; +import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-node'; +import { initAxiomAI } from 'axiom/ai'; +import type { AxiomEvalInstrumentationHook } from 'axiom/ai/config'; + +let provider: NodeTracerProvider | undefined; + +export const setupAppInstrumentation: AxiomEvalInstrumentationHook = async (options) => { + if (provider) { + return { provider }; + } + + const exporter = new OTLPTraceExporter({ + url: `${options.url}/v1/traces`, + headers: { + Authorization: `Bearer ${options.token}`, + 'X-Axiom-Dataset': options.dataset, + ...(options.orgId ? { 'X-AXIOM-ORG-ID': options.orgId } : {}), + }, + }); + + provider = new NodeTracerProvider({ + spanProcessors: [new BatchSpanProcessor(exporter)], + }); + + provider.register(); + + // Initialize Axiom AI instrumentation + initAxiomAI({ + tracer: provider.getTracer('axiom-ai'), + }); + + return { provider }; +}; +``` + + +If you’re already using Axiom for observability in your application, you can reuse your existing tracing setup. The evaluation framework integrates seamlessly with your existing instrumentation. + + +## Recommended folder structure + +Organize your evaluation files for easy discovery and maintenance: + +``` +your-project/ +├── axiom.config.ts +├── src/ +│ ├── lib/ +│ │ ├── app-scope.ts +│ │ └── capabilities/ +│ │ └── capability-name/ +│ │ ├── prompts.ts +│ │ ├── schemas.ts +│ │ └── evaluations/ +│ │ └── eval-name.eval.ts +│ ├── instrumentation.ts +│ └── ... +``` + +Name evaluation files with the `.eval.ts` extension so they’re automatically discovered by the CLI. + +## Verify your setup + +Test that everything is configured correctly: + +```bash +npx axiom eval --list +``` + +This lists all evaluation files found in your project without running them. If you see your evaluation files listed, you're ready to start writing evaluations. + +## What's next? + +Now that you're set up, learn how to write your first evaluation in [Write evaluations](/ai-engineering/evaluate/write-evaluations). + diff --git a/ai-engineering/evaluate/write-evaluations.mdx b/ai-engineering/evaluate/write-evaluations.mdx new file mode 100644 index 000000000..1c0b961c6 --- /dev/null +++ b/ai-engineering/evaluate/write-evaluations.mdx @@ -0,0 +1,316 @@ +--- +title: "Write evaluations" +description: "Learn how to create evaluation functions with collections, tasks, and scorers." +keywords: ["ai engineering", "evals", "evaluation", "scorers", "collections", "ground truth"] +--- + +import { definitions } from "/snippets/definitions.mdx" + +An evaluation is a test suite for your AI capability. It runs your capability against a collection of test cases and scores the results using scorers. This page explains how to write evaluation functions using Axiom's `Eval` API. + +## Anatomy of an evaluation + +The `Eval` function defines a complete test suite for your capability. Here’s the basic structure: + +```ts +import { Eval, Scorer } from 'axiom/ai/evals'; + +Eval('evaluation-name', { + data: [/* test cases */], + task: async ({ input }) => {/* run capability */}, + scorers: [/* scoring functions */], + metadata: {/* optional metadata */}, +}); +``` + +### Key parameters + +- **`data`**: An array of test cases, or a function that returns an array of test cases. 
Each test case has an `input` (what you send to your capability) and an `expected` output (the ground truth). +- **`task`**: An async function that executes your capability for a given input and returns the output. +- **`scorers`**: An array of scorer functions that evaluate the output against the expected result. +- **`metadata`**: Optional metadata like a description or tags. + +## Creating collections + +The `data` parameter defines your collection of test cases. Start with a small set of examples and grow it over time as you discover edge cases. + +### Inline collections + +For small collections, define test cases directly in the evaluation: + +```ts +Eval('classify-sentiment', { + data: [ + { + input: { text: 'I love this product!' }, + expected: { sentiment: 'positive' }, + }, + { + input: { text: 'This is terrible.' }, + expected: { sentiment: 'negative' }, + }, + { + input: { text: 'It works as expected.' }, + expected: { sentiment: 'neutral' }, + }, + ], + // ... rest of eval +}); +``` + +### External collections + +For larger collections, load test cases from external files or databases: + +```ts +import { readFile } from 'fs/promises'; + +Eval('classify-sentiment', { + data: async () => { + const content = await readFile('./test-cases/sentiment.json', 'utf-8'); + return JSON.parse(content); + }, + // ... rest of eval +}); +``` + + +We recommend storing collections in version control alongside your code. This makes it easy to track how your test suite evolves and ensures evaluations are reproducible. + + +## Defining the task + +The `task` function executes your AI capability for each test case. It receives the `input` from the test case and should return the output your capability produces. + +```ts +import { generateText } from 'ai'; +import { openai } from '@ai-sdk/openai'; +import { wrapAISDKModel } from 'axiom/ai'; + +async function classifySentiment(text: string) { + const result = await generateText({ + model: wrapAISDKModel(openai('gpt-4o-mini')), + prompt: `Classify the sentiment of this text as positive, negative, or neutral: "${text}"`, + }); + + return { sentiment: result.text }; +} + +Eval('classify-sentiment', { + data: [/* ... */], + task: async ({ input }) => { + return await classifySentiment(input.text); + }, + scorers: [/* ... */], +}); +``` + + +The task function should generally be the same code you use in your actual capability. This ensures your evaluations accurately reflect real-world behavior. + + +## Creating scorers + +Scorers evaluate your capability's output. They receive the `input`, `output`, and `expected` values, and return a score (a number between 0-1, or boolean). + +### Custom scorers + +Create custom scorers using the `Scorer` wrapper: + +```ts +import { Scorer } from 'axiom/ai/evals'; + +const ExactMatchScorer = Scorer( + 'exact-match', + ({ output, expected }) => { + return output.sentiment === expected.sentiment ? true : false; + } +); +``` + +Scorers can return just a score, or an object with a score and metadata: + +```ts +const DetailedScorer = Scorer( + 'detailed-match', + ({ output, expected }) => { + const match = output.sentiment === expected.sentiment; + return { + score: match ? 
true : false, + metadata: { + outputValue: output.sentiment, + expectedValue: expected.sentiment, + matched: match, + }, + }; + } +); +``` + +### Using autoevals + +The [`autoevals`](https://github.com/braintrustdata/autoevals) library provides prebuilt scorers for common tasks: + +```bash +npm install autoevals +``` + +```ts +import { Scorer } from 'axiom/ai/evals'; +import { Levenshtein, FactualityScorer } from 'autoevals'; + +// Wrap autoevals scorers with Axiom's Scorer +const LevenshteinScorer = Scorer( + 'levenshtein', + ({ output, expected }) => { + return Levenshtein({ output: output.text, expected: expected.text }); + } +); + +const FactualityCheck = Scorer( + 'factuality', + async ({ output, expected }) => { + return await FactualityScorer({ + output: output.text, + expected: expected.text, + }); + } +); +``` + + +Use multiple scorers to evaluate different aspects of your capability. For example, check both exact accuracy and semantic similarity to get a complete picture of performance. + + +## Complete example + +Here's a complete evaluation for a support ticket classification system: + +```ts src/lib/capabilities/classify-ticket/evaluations/spam-classification.eval.ts expandable +import { Eval, Scorer } from 'axiom/ai/evals'; +import { generateObject } from 'ai'; +import { openai } from '@ai-sdk/openai'; +import { wrapAISDKModel } from 'axiom/ai'; +import { z } from 'zod'; + +// The capability function +async function classifyTicket({ + subject, + content +}: { + subject?: string; + content: string +}) { + const result = await generateObject({ + model: wrapAISDKModel(openai('gpt-4o-mini')), + messages: [ + { + role: 'system', + content: `You are a customer support engineer. Classify tickets as: + spam, question, feature_request, or bug_report.`, + }, + { + role: 'user', + content: subject ? `Subject: ${subject}\n\n${content}` : content, + }, + ], + schema: z.object({ + category: z.enum(['spam', 'question', 'feature_request', 'bug_report']), + confidence: z.number().min(0).max(1), + }), + }); + + return result.object; +} + +// Custom scorer for category matching +const CategoryScorer = Scorer( + 'category-match', + ({ output, expected }) => { + return output.category === expected.category ? true : false; + } +); + +// Custom scorer for high-confidence predictions +const ConfidenceScorer = Scorer( + 'high-confidence', + ({ output }) => { + return output.confidence >= 0.8 ? true : false; + } +); + +// Define the evaluation +Eval('spam-classification', { + data: [ + { + input: { + subject: "Congratulations! 
You've Won!", + content: 'Claim your $500 gift card now!', + }, + expected: { + category: 'spam', + }, + }, + { + input: { + subject: 'How do I reset my password?', + content: 'I forgot my password and need help resetting it.', + }, + expected: { + category: 'question', + }, + }, + { + input: { + subject: 'Feature request: Dark mode', + content: 'Would love to see a dark mode option in the app.', + }, + expected: { + category: 'feature_request', + }, + }, + { + input: { + subject: 'App crashes on startup', + content: 'The app crashes immediately when I try to open it.', + }, + expected: { + category: 'bug_report', + }, + }, + ], + + task: async ({ input }) => { + return await classifyTicket(input); + }, + + scorers: [CategoryScorer, ConfidenceScorer], + + metadata: { + description: 'Classify support tickets into categories', + }, +}); +``` + +## File naming conventions + +Name your evaluation files with the `.eval.ts` extension so they're automatically discovered by the Axiom CLI: + +``` +src/ +└── lib/ + └── capabilities/ + └── classify-ticket/ + └── evaluations/ + ├── spam-classification.eval.ts + ├── category-accuracy.eval.ts + └── edge-cases.eval.ts +``` + +The CLI will find all files matching `**/*.eval.{ts,js,mts,mjs,cts,cjs}` based on your `axiom.config.ts` configuration. + +## What's next? + +- To parameterize your capabilities and run experiments, see [Flags and experiments](/ai-engineering/evaluate/flags-experiments). +- To run evaluations using the CLI, see [Run evaluations](/ai-engineering/evaluate/run-evaluations). + diff --git a/ai-engineering/iterate.mdx b/ai-engineering/iterate.mdx index a4acf473e..ff20f94b7 100644 --- a/ai-engineering/iterate.mdx +++ b/ai-engineering/iterate.mdx @@ -1,57 +1,145 @@ --- title: "Iterate" -description: "Learn how to iterate on your AI capabilities by using production data and evaluation scores to drive improvements." -keywords: ["ai engineering", "AI engineering", "iterate", "improvement", "a/b testing", "champion challenger"] +description: "Run a systematic improvement loop to continuously enhance your AI capabilities based on production data and evaluation results." +keywords: ["ai engineering", "iterate", "improvement", "feedback", "annotations"] --- -import { Badge } from "/snippets/badge.jsx" import { definitions } from '/snippets/definitions.mdx' +import { Badge } from "/snippets/badge.jsx" + +The Iterate stage closes the loop in AI engineering. By analyzing production performance, validating changes through evaluation, and deploying improvements with confidence, you create a continuous cycle of data-driven enhancement. + +## The improvement loop + +Successful AI engineering follows a systematic pattern: + +1. **Analyze production** - Identify what needs improvement +2. **Create test cases** - Turn failures into ground truth examples +3. **Experiment with changes** - Test variations using flags +4. **Validate improvements** - Run evaluations to confirm progress +5. **Deploy with confidence** - Ship changes backed by data +6. **Repeat** - New production data feeds the next iteration + +## Identify what to improve + +Start by understanding how your capability performs in production. 
The Axiom Console provides multiple signals to help you prioritize: + +### Production traces + +Review traces in the [Observe](/ai-engineering/observe) section to find: +- Real-world inputs that caused failures or low-quality outputs +- High-cost or high-latency interactions that need optimization +- Unexpected tool calls or reasoning paths +- Edge cases your evaluations didn’t cover + +Filter to AI spans and examine the full interaction path, including model choices, token usage, and intermediate steps. + +### User feedback Coming soon + + +User feedback capture is coming soon. [Contact Axiom](https://www.axiom.co/contact) to join the design partner program. + + +User feedback provides direct signal about which interactions matter most to your customers. Axiom's AI SDK will include lightweight functions to capture both explicit and implicit feedback as timestamped event data: + +- **Explicit feedback** includes direct user signals like thumbs up, thumbs down, and comments on AI-generated outputs. - -The iteration workflow described here is in active development. Axiom is working with design partners to shape what’s built. [Contact Axiom](https://www.axiom.co/contact) to get early access and join a focused group of teams shaping these tools. - +- **Implicit feedback** captures behavioral signals like copying generated text, regenerating responses, or abandoning interactions. -The Iterate stage is where the Axiom AI engineering workflow comes full circle. It’s the process of taking the real-world performance data from the [Observe](/ai-engineering/observe) stage and the quality benchmarks from the [Measure](/ai-engineering/measure) stage, and using them to make concrete improvements to your AI capability. This creates a cycle of continuous, data-driven enhancement. +Because user feedback is stored as timestamped events linked to specific AI runs, you can easily correlate feedback with traces to understand exactly what went wrong and prioritize high-value failures over edge cases that rarely occur. -## Identifying opportunities for improvement +### Domain expert annotations Coming soon -Iteration begins with insight. The telemetry you gather while observing your capability in production is a goldmine for finding areas to improve. By analyzing traces in the Axiom Console, you can: + +Annotation workflows are coming soon. [Contact Axiom](https://www.axiom.co/contact) to join the design partner program. + -* Find real-world user inputs that caused your capability to fail or produce low-quality output. -* Identify high-cost or high-latency interactions that could be optimized. -* Discover common themes in user feedback that point to systemic weaknesses. +Axiom will provide a seamless workflow for domain experts to review production traces and identify patterns in AI capability failures. The Console will surface traces that warrant attention, such as those with negative user feedback or anomalous behavior, and provide an interface for reviewing conversations and annotating failure modes. -These examples can be used to create a new, more robust collection of ground truth data for offline testing. +Annotations can be categorized into failures modes to guide prioritization. 
For example: -## Testing changes against ground truth +* **Critical failures** - Complete breakdowns like API outages, unhandled exceptions, or timeout errors +* **Quality degradation** - Declining accuracy scores, increased hallucinations, or off-topic responses +* **Coverage gaps** - Out-of-distribution inputs the system wasn’t designed to handle, like unexpected languages or domains +* **User dissatisfaction** - Negative feedback on outputs that technically succeeded but didn’t meet user needs -Coming soon Once you’ve created a new version of your `Prompt` object, you need to verify that it’s actually an improvement. The best way to do this is to run an "offline evaluation"—testing your new version against the same ground truth collection you used in the **Measure** stage. +This structured analysis helps teams coordinate improvement efforts, prioritize which failure modes to address first, and track patterns over time. -The Axiom Console will provide views to compare these evaluation runs side-by-side: +## Create test cases from production -* **A/B Comparison Views:** See the outputs of two different prompt versions for the same input, making it easy to spot regressions or improvements. -* **Leaderboards:** Track evaluation scores across all versions of a capability to see a clear history of its quality over time. +Once you've identified high-priority failures, turn them into test cases for your evaluation collections. Organizations typically maintain multiple collections for different scenarios, failure modes, or capability variants: -This ensures you can validate changes with data before they ever reach your users. +```ts +const newTestCases = [ + { + input: { + // Real production input that failed + subject: 'Refund request for order #12345', + content: 'I need a refund because the product arrived damaged.' + }, + expected: { + category: 'refund_request', + priority: 'high' + }, + }, +]; +``` + +## Experiment with changes + +Use [flags](/ai-engineering/evaluate/flags-experiments) to test different approaches without changing your code: + +```bash +# Test with a more capable model +axiom eval ticket-classification --flag.model=gpt-4o + +# Try a different temperature +axiom eval ticket-classification --flag.temperature=0.3 + +# Experiment with prompt variations +axiom eval ticket-classification --flag.promptStrategy=detailed +``` + +Run multiple experiments to understand the tradeoffs between accuracy, cost, and latency. -## Deploying with confidence +## Validate improvements -Coming soon After a new version of your capability has proven its superiority in offline tests, you can deploy it with confidence. The Axiom AI engineering workflow will support a champion/challenger pattern, where you can deploy a new "challenger" version to run in shadow mode against a portion of production traffic. This allows for a final validation on real-world data without impacting the user experience. +Before deploying any change, validate it against your full test collection using baseline comparison: -Once you’re satisfied with the challenger’s performance, you can promote it to become the new "champion" using the SDK’s `deploy` function. +```bash +# Run baseline evaluation +axiom eval ticket-classification +# Note the run ID: run_abc123xyz -```typescript -import { axiom } from './axiom-client'; +# Make your changes (update prompt, adjust config, etc.) 
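+# For example: switch the model flag, adjust the temperature, or reword the system prompt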
-// Promote a new version of a prompt to the production environment -await axiom.prompts.deploy('prompt_123', { - environment: 'production', - version: '1.1.0', -}); +# Run again with baseline comparison +axiom eval ticket-classification --baseline run_abc123xyz ``` -## What’s next? +The Console shows you exactly how your changes impact: +- **Accuracy**: Did scores improve or regress? +- **Cost**: Is it more or less expensive? +- **Latency**: Is it faster or slower? + +Only deploy changes that show clear improvements without unacceptable tradeoffs. + +## Deploy with confidence + +Once your evaluations confirm an improvement, deploy the change to production. Because you've validated against ground truth data, you can ship with confidence that the new version handles both existing cases and the new failures you discovered. + +After deployment, return to the **Observe** stage to monitor performance and identify the next opportunity for improvement. + +## Best practices + +* **Build your collections over time.** Your evaluation collections should grow as you discover new failure modes. Each production issue that makes it through is an opportunity to strengthen your test coverage. +* **Track improvements systematically.** Use baseline comparisons for every change. This creates a clear history of how your capability has improved and prevents regressions. +* **Prioritize high-impact changes.** Focus on failures that affect many users or high-value interactions. Not every edge case deserves immediate attention. +* **Experiment before committing.** Flags let you test multiple approaches quickly. Run several experiments to understand the solution space before making code changes. +* **Close the loop.** The improvement cycle never ends. Each deployment generates new production data that reveals the next set of improvements to make. + +## What's next? -By completing the Iterate stage, you have closed the loop. Your improved capability is now in production, and you can return to the **Observe** stage to monitor its performance and identify the next opportunity for improvement. +To learn more about the evaluation framework that powers this improvement loop, see [Evaluate](/ai-engineering/evaluate/overview). -This cycle of creating, measuring, observing, and iterating is the core of the AI engineering workflow, enabling you to build better AI systems, backed by data. \ No newline at end of file +To understand how to capture rich telemetry from production, see [Observe](/ai-engineering/observe). diff --git a/ai-engineering/measure.mdx b/ai-engineering/measure.mdx index b288e8b0f..12362935e 100644 --- a/ai-engineering/measure.mdx +++ b/ai-engineering/measure.mdx @@ -1,11 +1,11 @@ --- title: "Measure" description: "Learn how to measure the quality of your AI capabilities by running evaluations against ground truth data." -keywords: ["ai engineering", "AI engineering", "measure", "evals", "evaluation", "scoring", "graders"] +keywords: ["ai engineering", "AI engineering", "measure", "evals", "evaluation", "scoring", "scorers", "graders", "scores"] --- import { Badge } from "/snippets/badge.jsx" -import { definitions } from '/snippets/definitions.mdx' +import { definitions } from "/snippets/definitions.mdx" The evaluation framework described here is in active development. Axiom is working with design partners to shape what’s built. [Contact Axiom](https://www.axiom.co/contact) to get early access and join a focused group of teams shaping these tools. 
@@ -13,73 +13,222 @@ The evaluation framework described here is in active development. Axiom is worki
 The **Measure** stage is where you quantify the quality and effectiveness of your AI capability. Instead of relying on anecdotal checks, this stage uses a systematic process called an eval to score your capability’s performance against a known set of correct examples (ground truth). This provides a data-driven benchmark to ensure a capability is ready for production and to track its quality over time.
-## The `Eval` function
+Evaluations (evals) are systematic tests that measure how well your AI features perform. Instead of manually testing AI outputs, evals automatically run your AI code against test datasets and score the results using custom metrics. This lets you catch regressions, compare different approaches, and confidently improve your AI features over time.
-Coming soon The primary tool for the Measure stage is the `Eval` function, which will be available in the `axiom/ai` package. It provides a simple, declarative way to define a test suite for your capability directly in your codebase.
+## Prerequisites
-An `Eval` is structured around a few key parameters:
+Follow the [Quickstart](/ai-engineering/quickstart):
+- To run evals within the context of an existing AI app, complete the instrumentation setup in the [Quickstart](/ai-engineering/quickstart).
+- To run evals without an existing AI app, skip the part of the Quickstart about instrumenting your app.
-* `data`: An async function that returns your `collection` of `{ input, expected }` pairs, which serve as your ground truth.
-* `task`: The function that executes your AI capability, taking an `input` and producing an `output`.
-* `scorers`: An array of `grader` functions that score the `output` against the `expected` value.
-* `threshold`: A score between 0 and 1 that determines the pass/fail condition for the evaluation.
+## Write an evaluation function
-Here is an example of a complete evaluation suite:
+The `Eval` function provides a simple, declarative way to define a test suite for your capability directly in your codebase.
-```ts /evals/text-match.eval.ts
-import { Levenshtein } from 'autoevals';
-import { Eval } from 'axiom/ai/evals';
+The key parameters of the `Eval` function:
-Eval('text-match-eval', {
-  // 1. Your ground truth dataset
-  data: async () => {
-    return [
+- `data`: Your ground truth collection of `{ input, expected }` pairs, passed inline or returned from an async function.
+- `task`: The function that executes your AI capability, taking an `input` and producing an `output`.
+- `scorers`: An array of scorer functions that score the `output` against the `expected` value.
+- `metadata`: Optional metadata for the evaluation, such as a description.
+
+The example below creates an evaluation for a support ticket classification system in the file `/src/evals/ticket-classification.eval.ts`.
+
+```ts /src/evals/ticket-classification.eval.ts expandable
+import { Eval, Scorer } from 'axiom/ai/evals';
+import { generateObject } from 'ai';
+import { openai } from '@ai-sdk/openai';
+import { wrapAISDKModel } from 'axiom/ai';
+import { flag, pickFlags } from '../lib/app-scope';
+import { z } from 'zod';
+
+// The function you want to evaluate
+async function classifyTicket({ subject, content }: { subject?: string; content: string }) {
+  const model = flag('ticketClassification.model');
+
+  const result = await generateObject({
+    model: wrapAISDKModel(openai(model)),
+    messages: [
+      {
-        input: 'test',
-        expected: 'hi, test!',
+        role: 'system',
+        content: `You are a customer support engineer classifying tickets as: spam, question, feature_request, or bug_report.
+ If spam, return a polite auto-close message. Otherwise, say a team member will respond shortly.`,
+      },
+      {
-        input: 'foobar',
-        expected: 'hello, foobar!',
+        role: 'user',
+        content: subject ? `Subject: ${subject}\n\n${content}` : content,
+      },
+    ],
+    schema: z.object({
+      category: z.enum(['spam', 'question', 'feature_request', 'bug_report']),
+      response: z.string()
+    }),
+  });
+
+  return result.object;
+}
+
+// Custom exact-match scorer that returns score and metadata
+const ExactMatchScorer = Scorer(
+  'Exact-Match',
+  ({ output, expected }: { output: { response: string }; expected: { response: string } }) => {
+    const normalizedOutput = output.response.trim().toLowerCase();
+    const normalizedExpected = expected.response.trim().toLowerCase();
+
+    return {
+      score: normalizedOutput === normalizedExpected,
+      metadata: {
+        details: 'A scorer that checks for exact match',
+      },
+    };
+  }
+);
+
+// Custom spam classification scorer
+const SpamClassificationScorer = Scorer(
+  "Spam-Classification",
+  ({ output, expected }: {
+    output: { category: string };
+    expected: { category: string };
+  }) => {
+    const isSpam = (item: { category: string }) => item.category === "spam";
+    return isSpam(output) === isSpam(expected) ? 1 : 0;
+  }
+);
+
+// Define the evaluation
+Eval('spam-classification', {
+  // Specify which flags this eval uses
+  configFlags: pickFlags('ticketClassification'),
+
+  // Test data with input/expected pairs
+  data: [
+    {
+      input: {
+        subject: "Congratulations! You've Been Selected for an Exclusive Reward",
+        content: 'Claim your $500 gift card now by clicking this link!',
+      },
+      expected: {
+        category: 'spam',
+        response: "We're sorry, but your message has been automatically closed.",
+      },
+    },
+    {
+      input: {
+        subject: 'FREE CA$H',
+        content: 'BUY NOW ON WWW.BEST-DEALS.COM!',
+      },
+      expected: {
+        category: 'spam',
+        response: "We're sorry, but your message has been automatically closed.",
-      },
-    ];
+      },
+    },
+  ],
+
+  // The task to run for each test case
+  task: async ({ input }) => {
+    return await classifyTicket(input);
-  },
-
-  // 2. The task that runs your capability
-  task: async (input: string) => {
-    return `hi, ${input}!`;
+  },
+
+  // Scorers to measure performance
+  scorers: [SpamClassificationScorer, ExactMatchScorer],
+
+  // Optional metadata
+  metadata: {
+    description: 'Classify support tickets as spam or not spam',
+  },
+});
+```
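+
+The example above passes `data` inline. As noted in the parameter list, `data` can also be an async function that returns the records. The following is a minimal sketch, assuming the ground truth lives in a hypothetical `collection.json` file next to the eval and reusing the helpers defined above:
+
+```ts
+// Sketch: load { input, expected } records from a local JSON file (file name assumed)
+import { readFile } from 'node:fs/promises';
+
+Eval('spam-classification-from-file', {
+  configFlags: pickFlags('ticketClassification'),
+  // `data` as an async function that returns the collection
+  data: async () => {
+    const raw = await readFile(new URL('./collection.json', import.meta.url), 'utf8');
+    return JSON.parse(raw);
+  },
+  task: async ({ input }) => classifyTicket(input),
+  scorers: [SpamClassificationScorer, ExactMatchScorer],
+});
+```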
+
+## Set up flags
-  // 3. The scorers that grade the output
-  scorers: [Levenshtein],
+Create the file `src/lib/app-scope.ts`:
-  // 4. The pass/fail threshold for the scores
-  threshold: 1,
+```ts /src/lib/app-scope.ts
+import { createAppScope } from 'axiom/ai';
+import { z } from 'zod';
+
+export const flagSchema = z.object({
+  ticketClassification: z.object({
+    model: z.string().default('gpt-4o-mini'),
+  }),
 });
+
+const { flag, pickFlags } = createAppScope({ flagSchema });
+
+export { flag, pickFlags };
 ```
-## Grading with scorers
+## Run evaluations
-Coming soon A grader is a function that scores a capability’s output. Axiom will provide a library of built-in scorers for common tasks (e.g., checking for semantic similarity, factual correctness, or JSON validity). You can also provide your own custom functions to measure domain-specific logic. Each scorer receives the `input`, the generated `output`, and the `expected` value, and must return a score.
+To run your evaluation suites from your terminal, [install the Axiom CLI](/reference/cli) and use the following commands.
-## Running evaluations
+| Description | Command |
+| ----------- | ------- |
+| Run all evals | `axiom eval` |
+| Run specific eval file | `axiom eval src/evals/ticket-classification.eval.ts` |
+| Run evals matching a glob pattern | `axiom eval "**/*spam*.eval.ts"` |
+| Run eval by name | `axiom eval "spam-classification"` |
+| List available evals without running | `axiom eval --list` |
-Coming soon You will run your evaluation suites from your terminal using the `axiom` CLI.
+## Analyze results in the Console
-```bash
-axiom run evals/text-match.eval.ts
-```
+When you run an eval, the Axiom AI SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. Axiom enriches the traces with `eval.*` attributes, allowing you to deeply analyze results in the Axiom Console.
+
+The results of an eval include:
+- Pass/fail status for each test case
+- Scores from each scorer
+- Comparison to baseline (if available)
+- Links to view detailed traces in Axiom
+
+The Console features leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.
+
+## Additional configuration options
+
+### Custom scorers
+
+A scorer is a function that scores a capability’s output. Scorers receive the `input`, the generated `output`, and the `expected` value, and return a score.
+
+The example above uses two custom scorers. Scorers can return metadata alongside the score.
-This command will execute the specified test file using `vitest` in the background. Note that `vitest` will be a peer dependency for this functionality.
+You can use the [`autoevals` library](https://github.com/braintrustdata/autoevals) instead of custom scorers. `autoevals` provides prebuilt scorers for common tasks like semantic similarity, factual correctness, and text matching.
-## Analyzing results in the console
+### Run experiments
-Coming soon When you run an eval, the Axiom SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. These traces are enriched with `eval.*` attributes, allowing you to deeply analyze results in the Axiom Console.
+Flags let you parameterize your AI behavior (like model choice or prompting strategies) and run experiments with different configurations.
They’re type-safe via Zod schemas, and you can override them at runtime. -The Console will feature leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements. +The example above uses the `ticketClassification` flag to test different language models. Flags have a default value that you can override at runtime in one of the following ways: + +- Override flags directly when you run the eval: + + ```bash + axiom eval --flag.ticketClassification.model=gpt-4o + ``` + +- Alternatively, specify the flag overrides in a JSON file. + + ```json experiment.json + { + "ticketClassification": { + "model": "gpt-4o" + } + } + ``` + + And then specify the JSON file as the value of the `flags-config` parameter when you run the eval: + + ```bash + axiom eval --flags-config=experiment.json + ``` ## What’s next? -Once your capability meets your quality benchmarks in the Measure stage, it’s ready to be deployed. The next step is to monitor its performance with real-world traffic. +A capability is ready to be deployed when it meets your quality benchmarks. After deployment, the next steps can be the following: + +- **Baseline comparisons**: Run evals multiple times to track regression over time. +- **Experiment with flags**: Test different models or strategies using flag overrides. +- **Advanced scorers**: Build custom scorers for domain-specific metrics. +- **CI/CD integration**: Add `axiom eval` to your CI pipeline to catch regressions. -Learn more about this step of the AI engineering workflow in the [Observe](/ai-engineering/observe) docs. \ No newline at end of file +The next step is to monitor your capability’s performance with real-world traffic. To learn more about this step of the AI engineering workflow, see [Observe](/ai-engineering/observe). diff --git a/ai-engineering/observe/manual-instrumentation.mdx b/ai-engineering/observe/manual-instrumentation.mdx index 514281586..9e097aeb6 100644 --- a/ai-engineering/observe/manual-instrumentation.mdx +++ b/ai-engineering/observe/manual-instrumentation.mdx @@ -188,7 +188,7 @@ Example of a properly structured chat completion trace: ```typescript TypeScript expandable import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api'; -const tracer = trace.getTracer('my-ai-app'); +const tracer = trace.getTracer('my-app'); // Create a span for the AI operation return tracer.startActiveSpan('chat gpt-4', { @@ -233,7 +233,7 @@ from opentelemetry import trace from opentelemetry.trace import SpanKind import json -tracer = trace.get_tracer("my-ai-app") +tracer = trace.get_tracer("my-app") # Create a span for the AI operation with tracer.start_as_current_span("chat gpt-4", kind=SpanKind.CLIENT) as span: diff --git a/ai-engineering/overview.mdx b/ai-engineering/overview.mdx index 4079c882d..cf6572474 100644 --- a/ai-engineering/overview.mdx +++ b/ai-engineering/overview.mdx @@ -1,6 +1,6 @@ --- title: "AI engineering overview" -description: "Introduction to Axiom’s methodology for designing, evaluating, monitoring, and iterating generative-AI capabilities." +description: "Build increasingly sophisticated AI capabilities with confidence using systematic evaluation and observability." 
sidebarTitle: Overview keywords: ["ai engineering", "AI engineering", "prompt engineering", "generative ai"] tag: "NEW" @@ -8,22 +8,20 @@ tag: "NEW" import { definitions } from '/snippets/definitions.mdx' -Generative AI development is fundamentally different from traditional software engineering. Its outputs are probabilistic, not deterministic. The same input can produce different results. This variability makes it challenging to guarantee quality and predict failure modes without the right infrastructure. +Generative AI development is fundamentally different from traditional software engineering. Its outputs are probabilistic, not deterministic. This variability makes it challenging to guarantee quality and predict failure modes without the right infrastructure. -Axiom’s data intelligence platform is ideally suited to address the unique challenges of AI engineering. Building on the foundational EventDB and Console components, Axiom provides an essential toolkit for the next generation of software builders. - -This section of the documentation introduces the concepts and workflows for building production-ready AI capabilities with confidence. The goal is to help developers move from experimental "vibe coding" to building increasingly sophisticated systems with observable outcomes. +Axiom's AI engineering capabilities build on the foundational EventDB and Console to provide systematic evaluation and observability for AI systems. Whether you're building single-turn model interactions, multi-step workflows, or complex multi-agent systems, Axiom helps you push boundaries and ship with confidence. ### Axiom AI engineering workflow -Axiom provides a structured, iterative workflow—the Axiom AI engineering method—for developing AI capabilities. The workflow is designed to build statistical confidence in systems that aren’t entirely predictable, and is grounded in systematic evaluation and continuous improvement, from initial prototype to production monitoring. +Axiom provides a structured, iterative workflow for developing AI capabilities. The workflow builds statistical confidence in systems that aren’t entirely predictable through systematic evaluation and continuous improvement, from initial prototype to production monitoring. The core stages are: -* **Create**: Define a new AI capability, prototype it with various models, and gather reference examples to establish ground truth. -* **Measure**: Systematically evaluate the capability’s performance against reference data using custom graders to score for accuracy, quality, and cost. -* **Observe**: Cultivate the capability in production by collecting rich telemetry on every LLM call and tool execution. Use online evaluations to monitor for performance degradation and discover edge cases. -* **Iterate**: Use insights from production to refine prompts, augment reference datasets, and improve the capability over time. +* **Create**: Prototype your AI capability using any framework. TypeScript-based frameworks like Vercel AI SDK integrate most seamlessly with Axiom's tooling. As you build, gather reference examples to establish ground truth for evaluation. +* **Evaluate**: Systematically test your capability’s performance against reference data using custom scorers to measure accuracy, quality, and cost. Use Axiom's evaluation framework to run experiments with different configurations and track improvements over time. +* **Observe**: Deploy your capability and collect rich telemetry on every LLM call and tool execution. 
Use online evaluations to monitor for performance degradation and discover edge cases in production. +* **Iterate**: Use insights from production monitoring and evaluation results to refine prompts, augment reference datasets, and improve the capability. Run new evaluations to verify improvements before deploying changes. ### What’s next? diff --git a/ai-engineering/quickstart.mdx b/ai-engineering/quickstart.mdx index 72058199b..0a156502c 100644 --- a/ai-engineering/quickstart.mdx +++ b/ai-engineering/quickstart.mdx @@ -5,13 +5,14 @@ keywords: ["ai engineering", "getting started", "install", "setup", "configurati --- import ReplaceDatasetToken from "/snippets/replace-dataset-token.mdx" +import ReplaceEdgeDomain from "/snippets/replace-edge-domain.mdx" import Prerequisites from "/snippets/standard-prerequisites.mdx" import AIInstrumentationApproaches from "/snippets/ai-instrumentation-approaches.mdx" -Quickly start capturing telemetry data from your generative AI apps. After installation and configuration, follow the Axiom AI engineering workflow to create, measure, observe, and iterate on your capabilities. +Quickly start capturing telemetry data from your generative AI capabilities. After installation and configuration, follow the Axiom AI engineering workflow to create, evaluate, observe, and iterate. -This page explains how to set up instrumentation with Axiom AI SDK. Expand the section below to chooose the right instrumentation approach for your needs. +This page explains how to set up instrumentation with Axiom AI SDK. Expand the section below to choose the right instrumentation approach for your needs. @@ -45,7 +46,7 @@ bun add axiom -The `axiom` package includes the `axiom` command-line interface (CLI) for managing your AI assets, which will be used in later stages of the Axiom AI engineering workflow. +The `axiom` package includes the `axiom` command-line interface (CLI) for running evaluations, which you'll use to systematically test and improve your AI capabilities. ## Configure tracer @@ -98,56 +99,108 @@ To send data to Axiom, configure a tracer. For example, use a dedicated instrume -1. Create instrumentation file: +1. 
Create an instrumentation file: - ```typescript /src/instrumentation.ts - - import 'dotenv/config'; // Make sure to load environment variables + ```ts /src/instrumentation.ts expandable import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'; import { resourceFromAttributes } from '@opentelemetry/resources'; - import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'; - import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-node'; + import { BatchSpanProcessor, NodeTracerProvider } from '@opentelemetry/sdk-trace-node'; import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions'; import { trace } from "@opentelemetry/api"; import { initAxiomAI, RedactionPolicy } from 'axiom/ai'; + import type { AxiomEvalInstrumentationHook } from 'axiom/ai/config'; const tracer = trace.getTracer("my-tracer"); - // Configure the provider to export traces to your Axiom dataset - const provider = new NodeTracerProvider({ - resource: resourceFromAttributes({ - [ATTR_SERVICE_NAME]: 'my-ai-app', // Replace with your service name - }, - { - // Use the latest schema version - // Info: https://opentelemetry.io/docs/specs/semconv/ - schemaUrl: 'https://opentelemetry.io/schemas/1.37.0', - }), - spanProcessor: new SimpleSpanProcessor( - new OTLPTraceExporter({ - url: `https://api.axiom.co/v1/traces`, - headers: { - Authorization: `Bearer ${process.env.AXIOM_TOKEN}`, - 'X-Axiom-Dataset': process.env.AXIOM_DATASET!, - }, - }) - ), - }); - - // Register the provider - provider.register(); - - // Initialize Axiom AI SDK with the configured tracer - initAxiomAI({ tracer, redactionPolicy: RedactionPolicy.AxiomDefault }); + let provider: NodeTracerProvider | undefined; + + // Wrap your logic in the AxiomEvalInstrumentationHook function + export const setupAppInstrumentation: AxiomEvalInstrumentationHook = async ({ + dataset, + url, + token, + }) => { + if (provider) { + return { provider }; + } + + if (!dataset || !url || !token) { + throw new Error('Missing environment variables'); + } + + // Replace the environment variables with the parameters passed to the function + const exporter = new OTLPTraceExporter({ + url: `${url}/v1/traces`, + headers: { + Authorization: `Bearer ${token}`, + 'X-Axiom-Dataset': dataset, + }, + }) + + // Configure the provider to export traces to your Axiom dataset + provider = new NodeTracerProvider({ + resource: resourceFromAttributes({ + [ATTR_SERVICE_NAME]: 'my-app', // Replace with your service name + }, + { + // Use the latest schema version + // Info: https://opentelemetry.io/docs/specs/semconv/ + schemaUrl: 'https://opentelemetry.io/schemas/1.37.0', + }), + spanProcessor: new BatchSpanProcessor(exporter), + }); + + // Register the provider + provider.register(); + + // Initialize Axiom AI SDK with the configured tracer + initAxiomAI({ tracer, redactionPolicy: RedactionPolicy.AxiomDefault }); + + return { provider }; + }; ``` For more information on specifying redaction policies, see [Redaction policies](/ai-engineering/redaction-policies). +### Create Axiom configuration file + +The Axiom configuration file enables the evaluation framework, allowing you to run systematic tests against your AI capabilities and track improvements over time. 
+ +At the root of your project, create the Axiom configuration file `/axiom.config.ts`: + +```ts /axiom.config.ts +import { defineConfig } from 'axiom/ai/config'; +import { setupAppInstrumentation } from './src/instrumentation'; + +export default defineConfig({ + eval: { + url: process.env.AXIOM_URL, + token: process.env.AXIOM_TOKEN, + dataset: process.env.AXIOM_DATASET, + + // Optional: customize which files to run + include: ['**/*.eval.{ts,js}'], + + // Optional: exclude patterns + exclude: [], + + // Optional: timeout for eval execution + timeoutMs: 60_000, + + // Optional: instrumentation hook for OpenTelemetry + // (created this in the "Create instrumentation setup" step) + instrumentation: ({ url, token, dataset }) => + setupAppInstrumentation({ url, token, dataset }), + }, +}); +``` + ## Store environment variables Store environment variables in an `.env` file in the root of your project: ```bash .env +AXIOM_URL="AXIOM_EDGE_DOMAIN" AXIOM_TOKEN="API_TOKEN" AXIOM_DATASET="DATASET_NAME" OPENAI_API_KEY="" @@ -158,11 +211,17 @@ ANTHROPIC_API_KEY="" + Enter the API keys for the LLMs you want to work with. + +To run evaluations, you’ll need to authenticate the Axiom CLI. See [Setup and authentication](/ai-engineering/evaluate/setup) for details on using `axiom auth login`. + + ## What’s next? -- **Explore the AI engineering workflow**: Start building systematic AI capabilities beginning with [Create](/ai-engineering/create). -- **Continue with Axiom AI SDK**: Learn about instrumenting your AI model and tool calls in [Observe](/ai-engineering/observe). +- **Build your first capability**: Start prototyping with [Create](/ai-engineering/create). +- **Set up evaluations**: Learn how to systematically test your capabilities with [Evaluate](/ai-engineering/evaluate/overview). +- **Capture production telemetry**: Instrument your AI calls for observability with [Observe](/ai-engineering/observe). diff --git a/docs.json b/docs.json index eaab37ba2..311565277 100644 --- a/docs.json +++ b/docs.json @@ -201,7 +201,17 @@ "group": "Workflow", "pages": [ "ai-engineering/create", - "ai-engineering/measure", + { + "group": "Evaluate", + "pages": [ + "ai-engineering/evaluate/overview", + "ai-engineering/evaluate/setup", + "ai-engineering/evaluate/write-evaluations", + "ai-engineering/evaluate/flags-experiments", + "ai-engineering/evaluate/run-evaluations", + "ai-engineering/evaluate/analyze-results" + ] + }, { "group": "Observe", "pages": [ diff --git a/snippets/definitions.mdx b/snippets/definitions.mdx index 844927f04..931aa3592 100644 --- a/snippets/definitions.mdx +++ b/snippets/definitions.mdx @@ -3,8 +3,8 @@ export const definitions = { 'Collection': 'A curated set of reference records that are used for the development, testing, and evaluation of a capability.', 'Console': "Axiom’s intuitive web app built for exploration, visualization, and monitoring of your data.", 'Eval': 'The process of testing a capability against a collection of ground truth references using one or more graders.', - 'Grader': 'A function that scores a capability’s output.', 'GroundTruth': 'The validated, expert-approved correct output for a given input.', 'EventDB': "Axiom’s robust, cost-effective, and scalable datastore specifically optimized for timestamped event data.", 'OnlineEval': 'The process of applying a grader to a capability’s live production traffic.', + 'Scorer': 'A function that measures a capability’s output.', } \ No newline at end of file