35 changes: 24 additions & 11 deletions ai-engineering/concepts.mdx
@@ -1,7 +1,7 @@
---
title: "Concepts"
description: "Learn about the core concepts in AI engineering: Capabilities, Collections, Graders, Evals, and more."
keywords: ["ai engineering", "AI engineering", "concepts", "capability", "grader", "eval"]
description: "Learn about the core concepts in AI engineering: Capabilities, Collections, Scorers, Evals, and more."
keywords: ["ai engineering", "AI engineering", "concepts", "capability", "scorer", "eval", "flags"]
---

import { definitions } from '/snippets/definitions.mdx'
@@ -17,10 +17,10 @@ The concepts in AI engineering are best understood within the context of the dev
Development starts by defining a task and prototyping a <Tooltip tip={definitions.Capability}>capability</Tooltip> with a prompt to solve it.
</Step>
<Step title="Evaluate with ground truth">
The prototype is then tested against a <Tooltip tip={definitions.Collection}>collection</Tooltip> of reference examples (so called <Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>) to measure its quality and effectiveness using <Tooltip tip={definitions.Grader}>graders</Tooltip>. This process is known as an <Tooltip tip={definitions.Eval}>eval</Tooltip>.
The prototype is then tested against a <Tooltip tip={definitions.Collection}>collection</Tooltip> of reference examples (so-called "<Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>") to measure its quality and effectiveness using <Tooltip tip={definitions.Scorer}>scorers</Tooltip>. This process is known as an <Tooltip tip={definitions.Eval}>eval</Tooltip>.
</Step>
<Step title="Observe in production">
Once a capability meets quality benchmarks, it’s deployed. In production, graders can be applied to live traffic (<Tooltip tip={definitions.OnlineEval}>online evals</Tooltip>) to monitor performance and cost in real-time.
Once a capability meets quality benchmarks, it’s deployed. In production, scorers can be applied to live traffic (<Tooltip tip={definitions.OnlineEval}>online evals</Tooltip>) to monitor performance and cost in real-time.
</Step>
<Step title="Iterate with new insights">
Insights from production monitoring reveal edge cases and opportunities for improvement. These new examples are used to refine the capability, expand the ground truth collection, and begin the cycle anew.
@@ -33,7 +33,12 @@ The concepts in AI engineering are best understood within the context of the dev

A generative AI capability is a system that uses large language models to perform a specific task by transforming inputs into desired outputs.

Capabilities exist on a spectrum of complexity. They can be a simple, single-step function (for example, classifying a support ticket’s intent) or evolve into a sophisticated, multi-step agent that uses reasoning and tools to achieve a goal (for example, orchestrating a complete customer support resolution).
Capabilities exist on a spectrum of complexity, ranging from simple to sophisticated architectures:

- **Single-turn model interactions**: A single prompt and response, such as classifying a support ticket’s intent or summarizing a document.
- **Workflows**: Multi-step processes where each step’s output feeds into the next, such as research → analysis → report generation (see the sketch after this list).
- **Single-agent**: An agent that can reason and make decisions to accomplish a goal, such as a customer support agent that can search documentation, check order status, and draft responses.
- **Multi-agent**: Multiple specialized agents collaborating to solve complex problems, such as software engineering through architectural planning, coding, testing, and review.
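
The workflow pattern is the easiest to picture in code. The sketch below chains two single-turn calls so the first step’s output feeds the second, using the Vercel AI SDK that the Create guide builds on; the function name and prompt wording are illustrative placeholders, not part of Axiom’s API.

```ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Illustrative two-step workflow: research feeds into report generation.
export async function researchAndReport(topic: string) {
  const model = openai('gpt-4o-mini');

  // Step 1: gather key points about the topic.
  const research = await generateText({
    model,
    prompt: `List the key facts a reader should know about: ${topic}`,
  });

  // Step 2: turn those notes into a short report.
  const report = await generateText({
    model,
    prompt: `Write a three-paragraph report based on these notes:\n${research.text}`,
  });

  return report.text;
}
```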

### Collection

@@ -55,18 +60,26 @@ Ground truth is the validated, expert-approved correct output for a given input.

Annotations are expert-provided labels, corrections, or outputs added to records to establish or refine ground truth.

### Grader
### Scorer

A grader is a function that scores a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score or judgment. Graders are the reusable, atomic scoring logic used in all forms of evaluation.
A scorer is a function that evaluates a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score or judgment. Scorers are the reusable, atomic scoring logic used in all forms of evaluation.
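
In code, a scorer can be as small as a pure function that compares an output against the expected value. The sketch below is generic TypeScript rather than Axiom’s SDK; the result shape is an assumption for illustration.

```ts
type ScorerResult = { score: number; passed: boolean };

// Exact-match scorer: compares a generated category against the ground-truth category.
function categoryMatch(
  output: { category: string },
  expected: { category: string },
): ScorerResult {
  const passed = output.category === expected.category;
  return { score: passed ? 1 : 0, passed };
}

categoryMatch({ category: 'bug_report' }, { category: 'bug_report' });
// => { score: 1, passed: true }
```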

### Flag

A flag is a configuration parameter that controls how your AI capability behaves. Flags let you parameterize aspects like model choice, temperature, prompting strategies, or retrieval approaches. By defining flags, you can run experiments to compare different configurations and systematically determine which approach performs best.
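
As a rough sketch, flags can be modeled as a typed configuration object that your capability reads at runtime; the names and defaults below are illustrative, not an Axiom API.

```ts
// Flags parameterize a capability so different configurations can be compared.
interface CapabilityFlags {
  model: string;                               // e.g. 'gpt-4o-mini' vs. 'gpt-4o'
  temperature: number;                         // sampling temperature
  promptStyle: 'concise' | 'chain-of-thought'; // prompting strategy
}

const defaultFlags: CapabilityFlags = {
  model: 'gpt-4o-mini',
  temperature: 0.2,
  promptStyle: 'concise',
};
```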

### Experiment

An experiment is an evaluation run with a specific set of flag values. By running multiple experiments with different flag configurations, you can compare performance across different models, prompts, or strategies to find the optimal setup for your capability.
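
Continuing the flag sketch above, an experiment is the same eval run once per flag configuration. The `runEval` helper here is hypothetical and stands in for whatever evaluation runner you use; the returned numbers are placeholders.

```ts
type Flags = { model: string; temperature: number; promptStyle: 'concise' | 'chain-of-thought' };

// Hypothetical runner: executes the capability over a collection with the given
// flags, applies scorers, and aggregates the results.
async function runEval(flags: Flags): Promise<{ accuracy: number; costUsd: number }> {
  // ...run every record through the capability and score it here...
  return { accuracy: 0.92, costUsd: 0.14 }; // placeholder values for the sketch
}

const experiments: Flags[] = [
  { model: 'gpt-4o-mini', temperature: 0.2, promptStyle: 'concise' },
  { model: 'gpt-4o', temperature: 0.2, promptStyle: 'chain-of-thought' },
];

for (const flags of experiments) {
  const { accuracy, costUsd } = await runEval(flags);
  console.log(`${flags.model} / ${flags.promptStyle}: accuracy=${accuracy}, cost=$${costUsd}`);
}
```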

### Evaluator (eval)
### Evaluator or "eval"

An evaluator, or eval, is the process of testing a capability against a collection of ground truth data using one or more graders. An eval runs the capability on every record in the collection and reports metrics like accuracy, pass-rate, and cost. Evals are typically run before deployment to benchmark performance.
An evaluator, or eval, is the process of testing a capability against a collection of ground truth data using one or more scorers. An eval runs the capability on every record in the collection and reports metrics like accuracy, pass-rate, and cost. Evals are typically run before deployment to benchmark performance.

### Online eval

An online eval is the process of applying a grader to a capability’s live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement.
An online eval is the process of applying a scorer to a capability’s live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement.
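
Because live traffic rarely comes with ground truth, online evals typically rely on reference-free scorers (heuristics or LLM-as-judge) applied to a sample of requests. The sketch below is generic and illustrative, not Axiom’s API; the sampling rate and metric names are assumptions.

```ts
// Reference-free check suitable for live traffic: track classification confidence.
function confidenceScorer(output: { category: string; confidence: number }) {
  return { score: output.confidence, passed: output.confidence >= 0.7 };
}

const SAMPLE_RATE = 0.05; // score roughly 5% of production calls

function maybeScoreOnline(
  output: { category: string; confidence: number },
  emitMetric: (name: string, value: number) => void,
) {
  if (Math.random() > SAMPLE_RATE) return;
  const { score, passed } = confidenceScorer(output);
  emitMetric('ticket_classifier.online_score', score);
  emitMetric('ticket_classifier.online_pass', passed ? 1 : 0);
}
```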

## What’s next?

Now that you understand the core concepts, see them in action in the AI engineering [workflow](/ai-engineering/quickstart).
Now that you understand the core concepts, get started with the [Quickstart](/ai-engineering/quickstart) or dive into [Evaluate](/ai-engineering/evaluate/overview) to learn about systematic testing.
210 changes: 118 additions & 92 deletions ai-engineering/create.mdx
@@ -1,133 +1,159 @@
---
title: "Create"
description: "Learn how to create and define AI capabilities using structured prompts and typed arguments with Axiom."
keywords: ["ai engineering", "AI engineering", "create", "prompt", "template", "schema"]
description: "Build AI capabilities using any framework, with best support for TypeScript-based tools."
keywords: ["ai engineering", "create", "prompt", "capability", "vercel ai sdk"]
---

import { Badge } from "/snippets/badge.jsx"
import { definitions } from '/snippets/definitions.mdx'

The **Create** stage is about defining a new AI <Tooltip tip={definitions.Capability}>capability</Tooltip> as a structured, version-able asset in your codebase. The goal is to move away from scattered, hard-coded string prompts and toward a more disciplined and organized approach to prompt engineering.
Building an AI <Tooltip tip={definitions.Capability}>capability</Tooltip> starts with prototyping. You can use whichever framework you prefer. Axiom is focused on helping you evaluate and observe your capabilities rather than prescribing how to build them.

Today, TypeScript-based frameworks like Vercel’s [AI SDK](https://sdk.vercel.ai) integrate most seamlessly with Axiom’s tooling, but that’s likely to evolve over time.

## Build your capability

Define your capability using your framework of choice. Here’s an example using Vercel's [AI SDK](https://ai-sdk.dev/), which includes [many examples](https://sdk.vercel.ai/examples) covering different capability design patterns. Popular alternatives like [Mastra](https://mastra.ai) also exist.

```ts src/lib/capabilities/classify-ticket.ts expandable
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { z } from 'zod';

export async function classifyTicket(input: {
  subject?: string;
  content: string
}) {
  const result = await generateObject({
    model: wrapAISDKModel(openai('gpt-4o-mini')),
    messages: [
      {
        role: 'system',
        content: 'Classify support tickets as: question, bug_report, or feature_request.',
      },
      {
        role: 'user',
        content: input.subject
          ? `Subject: ${input.subject}\n\n${input.content}`
          : input.content,
      },
    ],
    schema: z.object({
      category: z.enum(['question', 'bug_report', 'feature_request']),
      confidence: z.number().min(0).max(1),
    }),
  });

  return result.object;
}
```

### Defining a capability as a prompt object

In Axiom AI engineering, every capability is represented by a `Prompt` object. This object serves as the single source of truth for the capability’s logic, including its messages, metadata, and the schema for its arguments.
The `wrapAISDKModel` function instruments your model calls for Axiom’s observability features. Learn more in the [Observe](/ai-engineering/observe) section.

For now, these `Prompt` objects can be defined and managed as TypeScript files within your own project repository.
## Gather reference examples

A typical `Prompt` object looks like this:
As you prototype, collect examples of inputs and their correct outputs.

```ts
const referenceExamples = [
  {
    input: {
      subject: 'How do I reset my password?',
      content: 'I forgot my password and need help.'
    },
    expected: { category: 'question' },
  },
  {
    input: {
      subject: 'App crashes on startup',
      content: 'The app immediately crashes when I open it.'
    },
    expected: { category: 'bug_report' },
  },
];
```

These become your ground truth for evaluation. Learn more in the [Evaluate](/ai-engineering/evaluate/overview) section.
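
As a quick sanity check before setting up a full eval, you can run the capability over these examples and compare its output with the expected category. This is a local sketch that assumes `classifyTicket` and `referenceExamples` from above are in scope; it is not Axiom’s eval runner.

```ts
// Count exact matches on the category field across the reference examples.
let correct = 0;
for (const example of referenceExamples) {
  const result = await classifyTicket(example.input);
  if (result.category === example.expected.category) correct += 1;
}
console.log(`${correct}/${referenceExamples.length} examples classified correctly`);
```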

## Structured prompt management

<Note>
The features below are experimental. Axiom’s current focus is on the evaluation and observability stages of the AI engineering workflow.
</Note>

For teams wanting more structure around prompt definitions, Axiom’s SDK includes experimental utilities for managing prompts as versioned objects.

### Define prompts as objects

Represent capabilities as structured `Prompt` objects:

```ts src/prompts/ticket-classifier.prompt.ts
import {
  experimental_Type,
  type experimental_Prompt
} from 'axiom/ai';

export const emailSummarizerPrompt = {
  name: "Email Summarizer",
  slug: "email-summarizer",
export const ticketClassifierPrompt = {
  name: "Ticket Classifier",
  slug: "ticket-classifier",
  version: "1.0.0",
  model: "gpt-4o",
  model: "gpt-4o-mini",
  messages: [
    {
      role: "system",
      content:
        `Summarize emails concisely, highlighting action items.
        The user is named {{ username }}.`,
      content: "Classify support tickets as: {{ categories }}",
    },
    {
      role: "user",
      content: "Please summarize this email: {{ email_content }}",
      content: "{{ ticket_content }}",
    },
  ],
  arguments: {
    username: experimental_Type.String(),
    email_content: experimental_Type.String(),
    categories: experimental_Type.String(),
    ticket_content: experimental_Type.String(),
  },
} satisfies experimental_Prompt;
```

### Strongly-typed arguments with `Template`

To ensure that prompts are used correctly, Axiom’s AI SDK includes a `Template` type system (exported as `Type`) for defining the schema of a prompt’s `arguments`. This provides type safety, autocompletion, and a clear, self-documenting definition of what data the prompt expects.

The `arguments` object uses `Template` helpers to define the shape of the context:

```typescript /src/prompts/report-generator.prompt.ts
import {
  experimental_Type,
  type experimental_Prompt
} from 'axiom/ai';

export const reportGeneratorPrompt = {
  // ... other properties
  arguments: {
    company: experimental_Type.Object({
      name: experimental_Type.String(),
      isActive: experimental_Type.Boolean(),
      departments: experimental_Type.Array(
        experimental_Type.Object({
          name: experimental_Type.String(),
          budget: experimental_Type.Number(),
        })
      ),
    }),
    priority: experimental_Type.Union([
      experimental_Type.Literal("high"),
      experimental_Type.Literal("medium"),
      experimental_Type.Literal("low"),
    ]),
  },
} satisfies experimental_Prompt;
```

### Type-safe arguments

The `experimental_Type` system provides type safety for prompt arguments:

```ts
arguments: {
  user: experimental_Type.Object({
    name: experimental_Type.String(),
    preferences: experimental_Type.Array(experimental_Type.String()),
  }),
  priority: experimental_Type.Union([
    experimental_Type.Literal("high"),
    experimental_Type.Literal("medium"),
    experimental_Type.Literal("low"),
  ]),
}
```

You can even infer the exact TypeScript type for a prompt’s context using the `InferContext` utility.

### Prototyping and local testing
### Local testing

Before using a prompt in your application, you can test it locally using the `parse` function. This function takes a `Prompt` object and a `context` object, rendering the templated messages to verify the output. This is a quick way to ensure your templating logic is correct.
Test prompts locally before using them:

```ts
import { experimental_parse } from 'axiom/ai';
import {
  reportGeneratorPrompt
} from './prompts/report-generator.prompt';

const context = {
  company: {
    name: 'Axiom',
    isActive: true,
    departments: [
      { name: 'Engineering', budget: 500000 },
      { name: 'Marketing', budget: 150000 },
    ],
  },
  priority: 'high' as const,
};

// Render the prompt with the given context
const parsedPrompt = await experimental_parse(
  reportGeneratorPrompt, { context }
);

console.log(parsedPrompt.messages);
// [
//   {
//     role: 'system',
//     content: 'Generate a report for Axiom.\nCompany Status: Active...'
//   }
// ]

const parsed = await experimental_parse(ticketClassifierPrompt, {
  context: {
    categories: 'question, bug_report, feature_request',
    ticket_content: 'How do I reset my password?',
  },
});

console.log(parsed.messages);
```

### Managing prompts with Axiom

To enable more advanced workflows and collaboration, Axiom is building tools to manage your prompt assets centrally.

* <Badge>Coming soon</Badge> The `axiom` CLI will allow you to `push`, `pull`, and `list` prompt versions directly from your terminal, synchronizing your local files with the Axiom platform.
* <Badge>Coming soon</Badge> The SDK will include methods like `axiom.prompts.create()` and `axiom.prompts.load()` for programmatic access to your managed prompts. This will be the foundation for A/B testing, version comparison, and deploying new prompts without changing your application code.
These utilities help organize prompts in your codebase. Centralized prompt management and versioning features may be added in future releases.

### What’s next?
## What's next?

Now that you’ve created and structured your capability, the next step is to measure its quality against a set of known good examples.
Once you have a working capability and reference examples, systematically evaluate its performance.

Learn more about this step of the AI engineering workflow in the [Measure](/ai-engineering/measure) docs.
To learn how to set up and run evaluations, see [Evaluate](/ai-engineering/evaluate/overview).