4 changes: 3 additions & 1 deletion .github/CODE_OF_CONDUCT.md
@@ -1,3 +1,5 @@
{/* vale off */}

# Contributor covenant code of conduct

## Our pledge
@@ -119,7 +121,7 @@ version 2.0, available at
[https://www.contributor-covenant.org/version/2/0/code_of_conduct.html][v2.0].

Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
[Mozilla’s code of conduct enforcement ladder][Mozilla CoC].

For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available
4 changes: 2 additions & 2 deletions .github/CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contribution guidelines

We encourage you to participate in this documentation project. We appreciate your help in making Axiom as easy to understand and work with as possible.
Axiom encourages you to participate in this documentation project. The community appreciates your help in making Axiom as easy to understand and work with as possible.

To contribute, fork this repo, and then clone it. For more information, see the [GitHub documentation](https://docs.github.com/en/get-started/exploring-projects-on-github/contributing-to-a-project).

@@ -32,7 +32,7 @@ If you want to contribute but don’t know where to start, browse the open issue
- When you review a PR, use GitHub suggestions for changes where discussion is necessary. For major changes or uncontroversial smaller fixes, commit directly to the branch.
- Let the original creator merge the PR. The reviewer only approves or asks for changes.
- In your comments, be kind, considerate, and constructive.
- If a comment does not apply to the review of the PR, post it on the related issue.
- If a comment doesn’t apply to the review of the PR, post it on the related issue.

## Commits

1 change: 1 addition & 0 deletions .prettierignore
@@ -0,0 +1 @@
/docs.json
2 changes: 2 additions & 0 deletions .vale.ini
@@ -26,6 +26,8 @@ Google.Headings = NO
Google.Parens = NO
Google.Colons = NO
Google.Ordinal = NO
Google.Will = NO
Google.EmDash = NO

# Ignore code surrounded by backticks or plus sign, parameters defaults, URLs, and angle brackets.
TokenIgnores = (<\/?[A-Z].+>), (\x60[^\n\x60]+\x60), ([^\n]+=[^\n]*), (\+[^\n]+\+), (http[^\n]+\[)
12 changes: 6 additions & 6 deletions ai-engineering/concepts.mdx
Expand Up @@ -4,7 +4,7 @@ description: "Learn about the core concepts in Rudder: Capabilities, Collections
keywords: ["ai engineering", "rudder", "concepts", "capability", "grader", "eval"]
---

import { definitions } from '/snippets/definitions.mdx';
import { definitions } from '/snippets/definitions.mdx'

This page defines the core terms used in the Rudder workflow. Understanding these concepts is the first step toward building robust and reliable generative AI capabilities.

@@ -20,7 +20,7 @@ The concepts in Rudder are best understood within the context of the development
The prototype is then tested against a <Tooltip tip={definitions.Collection}>collection</Tooltip> of reference examples (so called “<Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>”) to measure its quality and effectiveness using <Tooltip tip={definitions.Grader}>graders</Tooltip>. This process is known as an <Tooltip tip={definitions.Eval}>eval</Tooltip>.
</Step>
<Step title="Observe in production">
Once a capability meets quality benchmarks, it's deployed. In production, graders can be applied to live traffic (<Tooltip tip={definitions.OnlineEval}>online evals</Tooltip>) to monitor performance and cost in real-time.
Once a capability meets quality benchmarks, it’s deployed. In production, graders can be applied to live traffic (<Tooltip tip={definitions.OnlineEval}>online evals</Tooltip>) to monitor performance and cost in real-time.
</Step>
<Step title="Iterate with new insights">
Insights from production monitoring reveal edge cases and opportunities for improvement. These new examples are used to refine the capability, expand the ground truth collection, and begin the cycle anew.
@@ -33,7 +33,7 @@ The concepts in Rudder are best understood within the context of the development

A generative AI capability is a system that uses large language models to perform a specific task by transforming inputs into desired outputs.

Capabilities exist on a spectrum of complexity. They can be a simple, single-step function (for example, classifying a support ticket's intent) or evolve into a sophisticated, multi-step agent that uses reasoning and tools to achieve a goal (for example, orchestrating a complete customer support resolution).
Capabilities exist on a spectrum of complexity. They can be a simple, single-step function (for example, classifying a support ticket’s intent) or evolve into a sophisticated, multi-step agent that uses reasoning and tools to achieve a goal (for example, orchestrating a complete customer support resolution).

### Collection

@@ -57,16 +57,16 @@ Annotations are expert-provided labels, corrections, or outputs added to records

### Grader

A grader is a function that scores a capability's output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score or judgment. Graders are the reusable, atomic scoring logic used in all forms of evaluation.
A grader is a function that scores a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score or judgment. Graders are the reusable, atomic scoring logic used in all forms of evaluation.
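
To make the definition concrete, here is a minimal sketch of what a grader can look like. The exact grader interface in the Axiom SDK may differ; the shape below (a plain function that compares output to ground truth and returns a score) is illustrative only.

```typescript
// Illustrative only — the official grader signature in the Axiom SDK may differ.
type GraderResult = { score: number; reason?: string };

// A simple exact-match grader: compares the generated output against ground truth.
function exactMatchGrader(output: string, expected: string): GraderResult {
  const pass = output.trim() === expected.trim();
  return {
    score: pass ? 1 : 0,
    reason: pass ? 'exact match' : 'output differs from ground truth',
  };
}
```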

### Evaluator (Eval)

An evaluator, or eval, is the process of testing a capability against a collection of ground truth data using one or more graders. An eval runs the capability on every record in the collection and reports metrics like accuracy, pass-rate, and cost. Evals are typically run before deployment to benchmark performance.

### Online Eval

An online eval is the process of applying a grader to a capability's live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement.
An online eval is the process of applying a grader to a capability’s live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement.

### What's next?
### What’s next?

Now that you understand the core concepts, see them in action in the Rudder [workflow](/ai-engineering/quickstart).
14 changes: 7 additions & 7 deletions ai-engineering/create.mdx
@@ -4,14 +4,14 @@ description: "Learn how to create and define AI capabilities using structured pr
keywords: ["ai engineering", "rudder", "create", "prompt", "template", "schema"]
---

import { Badge } from "/snippets/badge.jsx";
import { definitions } from '/snippets/definitions.mdx';
import { Badge } from "/snippets/badge.jsx"
import { definitions } from '/snippets/definitions.mdx'

The **Create** stage is about defining a new AI <Tooltip tip={definitions.Capability}>capability</Tooltip> as a structured, version-able asset in your codebase. The goal is to move away from scattered, hard-coded string prompts and toward a more disciplined and organized approach to prompt engineering.

### Defining a capability as a prompt object

In Rudder, every capability is represented by a `Prompt` object. This object serves as the single source of truth for the capability's logic, including its messages, metadata, and the schema for its arguments.
In Rudder, every capability is represented by a `Prompt` object. This object serves as the single source of truth for the capability’s logic, including its messages, metadata, and the schema for its arguments.

For now, these `Prompt` objects can be defined and managed as TypeScript files within your own project repository.
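
As a rough illustration, such a file might look like the sketch below. The field names and the `Type` helper usage are assumptions inferred from the description above and from the collapsed `emailSummarizerPrompt` example further down; treat the `Prompt` type exported by `@axiomhq/ai` as the source of truth.

```typescript
// Sketch only — field names, template syntax, and helper names are assumptions,
// not the official @axiomhq/ai API; see the full emailSummarizerPrompt example below.
import { type Prompt, Type } from '@axiomhq/ai';

export const ticketClassifierPrompt = {
  name: 'ticket-classifier',
  messages: [
    { role: 'system', content: 'Classify the intent of the support ticket.' },
    // The {{ ticketBody }} placeholder syntax is hypothetical.
    { role: 'user', content: 'Ticket: {{ ticketBody }}' },
  ],
  // Schema describing the arguments the prompt expects when rendered.
  arguments: {
    ticketBody: Type.String(),
  },
} satisfies Prompt;
```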

@@ -47,7 +47,7 @@ export const emailSummarizerPrompt = {

### Strongly-typed arguments with `Template`

To ensure that prompts are used correctly, the `@axiomhq/ai` package includes a `Template` type system (exported as `Type`) for defining the schema of a prompt's `arguments`. This provides type safety, autocompletion, and a clear, self-documenting definition of what data the prompt expects.
To ensure that prompts are used correctly, the `@axiomhq/ai` package includes a `Template` type system (exported as `Type`) for defining the schema of a prompt’s `arguments`. This provides type safety, autocompletion, and a clear, self-documenting definition of what data the prompt expects.

The `arguments` object uses `Template` helpers to define the shape of the context:

@@ -78,7 +78,7 @@ export const reportGeneratorPrompt = {
} satisfies Prompt;
```

You can even infer the exact TypeScript type for a prompt's context using the `InferContext` utility.
You can even infer the exact TypeScript type for a prompt’s context using the `InferContext` utility.
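
A minimal usage sketch follows, assuming `InferContext` accepts the prompt’s type as a generic parameter and that the prompt lives at the hypothetical path shown; the exact signature may differ.

```typescript
// Sketch only — assumes InferContext takes the prompt type as a generic parameter.
import { type InferContext } from '@axiomhq/ai';
// Hypothetical file path for the prompt defined earlier on this page.
import { reportGeneratorPrompt } from './prompts/report-generator';

// The derived type mirrors the prompt's `arguments` schema exactly.
type ReportContext = InferContext<typeof reportGeneratorPrompt>;

// Callers now get autocompletion and compile-time checks on the context they pass.
function buildReportInput(context: ReportContext) {
  return { prompt: reportGeneratorPrompt, context };
}
```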

### Prototyping and local testing

@@ -119,8 +119,8 @@ To enable more advanced workflows and collaboration, Axiom is building tools to
* <Badge>Coming soon</Badge> The `axiom` CLI will allow you to `push`, `pull`, and `list` prompt versions directly from your terminal, synchronizing your local files with the Axiom platform.
* <Badge>Coming soon</Badge> The SDK will include methods like `axiom.prompts.create()` and `axiom.prompts.load()` for programmatic access to your managed prompts. This will be the foundation for A/B testing, version comparison, and deploying new prompts without changing your application code.

### What's next?
### What’s next?

Now that you've created and structured your capability, the next step is to measure its quality against a set of known good examples.
Now that you’ve created and structured your capability, the next step is to measure its quality against a set of known good examples.

Learn more about this step of the Rudder workflow in the [Measure](/ai-engineering/measure) docs.
12 changes: 6 additions & 6 deletions ai-engineering/iterate.mdx
@@ -4,14 +4,14 @@ description: "Learn how to iterate on your AI capabilities by using production d
keywords: ["ai engineering", "rudder", "iterate", "improvement", "a/b testing", "champion challenger"]
---

import { Badge } from "/snippets/badge.jsx";
import { definitions } from '/snippets/definitions.mdx';
import { Badge } from "/snippets/badge.jsx"
import { definitions } from '/snippets/definitions.mdx'

<Warning>
The iteration workflow described here is in active development. Axiom is working with design partners to shape what’s built. [Contact Axiom](https://www.axiom.co/contact) to get early access and join a small group of teams shaping these tools.
</Warning>

The **Iterate** stage is where the Rudder workflow comes full circle. It's the process of taking the real-world performance data from the [Observe](/ai-engineering/observe) stage and the quality benchmarks from the [Measure](/ai-engineering/measure) stage, and using them to make concrete improvements to your AI <Tooltip tip={definitions.Capability}>capability</Tooltip>. This creates a cycle of continuous, data-driven enhancement.
The **Iterate** stage is where the Rudder workflow comes full circle. It’s the process of taking the real-world performance data from the [Observe](/ai-engineering/observe) stage and the quality benchmarks from the [Measure](/ai-engineering/measure) stage, and using them to make concrete improvements to your AI <Tooltip tip={definitions.Capability}>capability</Tooltip>. This creates a cycle of continuous, data-driven enhancement.

## Identifying opportunities for improvement

@@ -25,7 +25,7 @@ These examples can be used to create a new, more robust <Tooltip tip={definition

## Testing changes against ground truth

<Badge>Coming soon</Badge> Once you've created a new version of your `Prompt` object, you need to verify that it's actually an improvement. The best way to do this is to run an "offline evaluation"—testing your new version against the same ground truth collection you used in the **Measure** stage.
<Badge>Coming soon</Badge> Once you’ve created a new version of your `Prompt` object, you need to verify that it’s actually an improvement. The best way to do this is to run an "offline evaluation"—testing your new version against the same ground truth collection you used in the **Measure** stage.

The Axiom Console will provide views to compare these evaluation runs side-by-side:

@@ -38,7 +38,7 @@ This ensures you can validate changes with data before they ever reach your user

<Badge>Coming soon</Badge> After a new version of your capability has proven its superiority in offline tests, you can deploy it with confidence. The Rudder workflow will support a champion/challenger pattern, where you can deploy a new "challenger" version to run in shadow mode against a portion of production traffic. This allows for a final validation on real-world data without impacting the user experience.

Once you're satisfied with the challenger's performance, you can promote it to become the new "champion" using the SDK's `deploy` function.
Once you’re satisfied with the challenger’s performance, you can promote it to become the new "champion" using the SDK’s `deploy` function.

```typescript
import { axiom } from './axiom-client';
@@ -50,7 +50,7 @@ await axiom.prompts.deploy('prompt_123', {
});
```

## What's next?
## What’s next?

By completing the Iterate stage, you have closed the loop. Your improved capability is now in production, and you can return to the **Observe** stage to monitor its performance and identify the next opportunity for improvement.

12 changes: 6 additions & 6 deletions ai-engineering/measure.mdx
@@ -4,14 +4,14 @@ description: "Learn how to measure the quality of your AI capabilities by runnin
keywords: ["ai engineering", "rudder", "measure", "evals", "evaluation", "scoring", "graders"]
---

import { Badge } from "/snippets/badge.jsx";
import { definitions } from '/snippets/definitions.mdx';
import { Badge } from "/snippets/badge.jsx"
import { definitions } from '/snippets/definitions.mdx'

<Warning>
The evaluation framework described here is in active development. Axiom is working with design partners to shape what’s built. [Contact Axiom](https://www.axiom.co/contact) to get early access and join a small group of teams shaping these tools.
</Warning>

The **Measure** stage is where you quantify the quality and effectiveness of your AI <Tooltip tip={definitions.Capability}>capability</Tooltip>. Instead of relying on anecdotal checks, this stage uses a systematic process called an <Tooltip tip={definitions.Eval}>eval</Tooltip> to score your capability's performance against a known set of correct examples (<Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>). This provides a data-driven benchmark to ensure a capability is ready for production and to track its quality over time.
The **Measure** stage is where you quantify the quality and effectiveness of your AI <Tooltip tip={definitions.Capability}>capability</Tooltip>. Instead of relying on anecdotal checks, this stage uses a systematic process called an <Tooltip tip={definitions.Eval}>eval</Tooltip> to score your capability’s performance against a known set of correct examples (<Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>). This provides a data-driven benchmark to ensure a capability is ready for production and to track its quality over time.

## The `Eval` function

@@ -62,7 +62,7 @@ Eval('text-match-eval', {

## Grading with scorers

<Badge>Coming soon</Badge> A <Tooltip tip={definitions.Grader}>grader</Tooltip> is a function that scores a capability's output. Axiom will provide a library of built-in scorers for common tasks (e.g., checking for semantic similarity, factual correctness, or JSON validity). You can also provide your own custom functions to measure domain-specific logic. Each scorer receives the `input`, the generated `output`, and the `expected` value, and must return a score.
<Badge>Coming soon</Badge> A <Tooltip tip={definitions.Grader}>grader</Tooltip> is a function that scores a capability’s output. Axiom will provide a library of built-in scorers for common tasks (e.g., checking for semantic similarity, factual correctness, or JSON validity). You can also provide your own custom functions to measure domain-specific logic. Each scorer receives the `input`, the generated `output`, and the `expected` value, and must return a score.
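
As a rough sketch, a custom scorer might look like the following. The parameter names and return shape are assumptions; only the general contract (receive `input`, `output`, and `expected`, return a score) comes from the description above.

```typescript
// Sketch only — parameter names and return shape are assumptions, not the official scorer API.
interface ScorerArgs {
  input: string;
  output: string;
  expected: string;
}

// A domain-specific scorer: checks that the output mentions every required keyword.
function keywordCoverageScorer({ output, expected }: ScorerArgs): { score: number } {
  const keywords = expected.split(',').map((k) => k.trim().toLowerCase());
  const hits = keywords.filter((k) => output.toLowerCase().includes(k));
  return { score: keywords.length === 0 ? 1 : hits.length / keywords.length };
}
```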

## Running evaluations

@@ -80,8 +80,8 @@ This command will execute the specified test file using `vitest` in the background

The Console will feature leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.

## What's next?
## What’s next?

Once your capability meets your quality benchmarks in the Measure stage, it's ready to be deployed. The next step is to monitor its performance with real-world traffic.
Once your capability meets your quality benchmarks in the Measure stage, it’s ready to be deployed. The next step is to monitor its performance with real-world traffic.

Learn more about this step of the Rudder workflow in the [Observe](/ai-engineering/observe) docs.