fugu

A Drizzle-style, fully-typed schema and query layer for Turbopuffer.

Define a namespace from column builders and get, for free:

  • an inferred document type from your schema (no hand-written interfaces),
  • capability-checked ranking so ann only accepts vector fields and bm25 only accepts BM25 fields,
  • typed filters with field names and value types checked against the document,
  • result rows narrowed to exactly the attributes you request, with $dist present only when the query actually scores.

Install

# Deno
deno add jsr:@alphaxiv/fugu

# pnpm (first-class JSR support)
pnpm add jsr:@alphaxiv/fugu

# npm / yarn / bun (via the JSR bridge)
npx jsr add @alphaxiv/fugu

Quick start

import { bool, datetime, fugu, int, string, uuid, vector } from "@alphaxiv/fugu";

const db = fugu(); // reads env vars by default

// Define a typed namespace.
const papers = db.namespace({
  name: "papers",
  version: 1,
  columns: {
    id: uuid({ version: 7 }),
    embedding: vector(3072),
    body: string({ fullTextSearch: true }).array(),
    title: string(),
    votes: int(),
    published_at: datetime(),
    blocked: bool(),
  },
});

// The document type is inferred from the columns.
type Paper = typeof papers.$inferDocument;
// { id: string; embedding: number[]; body: string[]; title: string;
//   votes: number; published_at: string; blocked: boolean }

// Write.
await papers.upsert([
  {
    id: "0190a1b2-c3d4-7e5f-8a9b-0c1d2e3f4a5b",
    embedding: [/* 3072 floats */],
    body: ["page 1 text", "page 2 text"],
    title: "Attention Is All You Need",
    votes: 9001,
    published_at: "2024-01-18T00:00:00Z",
    blocked: false,
  },
]);

// Query: vector search, filtered, returning only the fields you need.
const queryEmbedding: number[] = [/* 3072 floats, from your embedding model */];
const { ann } = papers.rank;
const { and, eq, gte } = papers.filter;

const results = await papers.query({
  rankBy: ann("embedding", queryEmbedding),
  topK: 20,
  filters: and(eq("blocked", false), gte("votes", 10)),
  includeAttributes: ["title", "votes"],
});

results[0].title; // string  (requested)
results[0].votes; // number  (requested)
results[0].$dist; // number  (ANN scores, so $dist is present and required)
results[0].body; // compile error: not requested

Connecting

fugu(...) takes a Turbopuffer client. Provide it however suits you:

// From explicit options.
const db = fugu({ client: { apiKey, region } });

// From a client you construct (for custom timeouts, retries, base URL, etc.).
import { Turbopuffer } from "@turbopuffer/turbopuffer";
const db = fugu({ client: new Turbopuffer({ apiKey, region, timeout: 30_000 }) });

// Or omit `client` to read TURBOPUFFER_API_KEY and TURBOPUFFER_REGION from the environment.
const db = fugu();

The environment prefix is optional and separate from credentials; see Environments and versioning.

Defining a schema

Columns map one-to-one onto Turbopuffer's attribute types. Each builder takes an optional config object; the only method that changes the value type is .array().

| Builder | Turbopuffer type | Value type | Notes |
| --- | --- | --- | --- |
| string<T>(opts?) | string | T extends string | { fullTextSearch } enables BM25 |
| int(opts?) | int | number | signed 64-bit |
| uint(opts?) | uint | number | unsigned 64-bit |
| float(opts?) | float | number | 64-bit |
| bool(opts?) | bool | boolean | |
| uuid<T>(opts?) | uuid | T extends string | { version: 7 } asserts UUIDv7 on write |
| datetime(opts?) | datetime | string | RFC 3339 / ISO 8601 |
| vector(n, opts?) | [n]f32 | number[] | dense vector, ANN-indexed |
| column.array() | []… | T[] | wraps any scalar into its array form |

Every scalar builder accepts { filterable }. It defaults to true, except on full-text attributes, which are non-filterable by default. Set it to false to skip the filter index on attributes you never filter or sort by.
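
To make the "builders carry the value type" idea concrete, here is a minimal, self-contained sketch of how a column-builder API of this style could infer a document type. Every name in it (Column, column, str, integer, InferDocument) is hypothetical and chosen for illustration; it is not fugu's implementation, only one way such inference might work.

```typescript
// Hypothetical miniature of a column-builder API; not fugu's internals.
interface Column<V> {
  readonly kind: string; // attribute type name, e.g. "string" or "[]string"
  readonly filterable: boolean;
  readonly _value?: V; // phantom field: carries the value type, never set at runtime
  array(): Column<V[]>;
}

function column<V>(kind: string, filterable = true): Column<V> {
  return {
    kind,
    filterable,
    // .array() is the only method that changes the value type (V becomes V[]).
    array: () => column<V[]>(`[]${kind}`, filterable),
  };
}

const str = <T extends string = string>(opts: { filterable?: boolean } = {}) =>
  column<T>("string", opts.filterable ?? true);
const integer = (opts: { filterable?: boolean } = {}) =>
  column<number>("int", opts.filterable ?? true);

// Map a record of columns to the document type it describes.
type InferDocument<C> = {
  [K in keyof C]: C[K] extends Column<infer V> ? V : never;
};

const columns = {
  title: str(),
  tags: str().array(),
  votes: integer({ filterable: false }),
};

// Doc is { title: string; tags: string[]; votes: number }
type Doc = InferDocument<typeof columns>;

const doc: Doc = { title: "Attention Is All You Need", tags: ["nlp"], votes: 9001 };
```

The phantom _value field is a common trick: it exists only so the compiler has somewhere to read V from, and costs nothing at runtime.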

Branded ids

The string and uuid builders take a type parameter, so branded id types flow straight into the inferred document:

import type { PaperId } from "./ids.ts";

const papers = db.namespace({
  name: "papers",
  version: 1,
  columns: {
    id: uuid<PaperId>({ version: 7 }), // document `id` is PaperId, not just string
    // …
  },
});

A namespace must include an id column. It doubles as the document key, so re-upserting the same id overwrites the row. Turbopuffer ids are uint64, UUID, or string, so id can be a uuid(), string(), or uint() column (int() also works).
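
The example above imports PaperId from ./ids.ts without showing it. As a hedged illustration, a branded id module could look roughly like the following. The brand property name, toPaperId, and paperUrl are assumptions made for this sketch; the README does not prescribe how brands are defined.

```typescript
// Hypothetical ids module: the brand property exists only in the type system.
type PaperId = string & { readonly __brand: "PaperId" };

// The single place where a raw string is promoted to a PaperId.
function toPaperId(raw: string): PaperId {
  if (raw.length === 0) throw new Error("PaperId must be non-empty");
  return raw as PaperId;
}

// Functions that require a PaperId reject plain strings at compile time.
function paperUrl(id: PaperId): string {
  return `/papers/${id}`;
}

const paperId = toPaperId("0190a1b2-c3d4-7e5f-8a9b-0c1d2e3f4a5b");
const url = paperUrl(paperId);
// paperUrl("some string"); // would not compile: a plain string is not a PaperId
```

At runtime a PaperId is still an ordinary string, so it serializes and compares as usual; only the compiler distinguishes it.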

Full-text and vectors

// BM25 over an array of page strings. Full-text attributes are non-filterable by default.
body: string({ fullTextSearch: true }).array(),

// Dense vector. Defaults to f32 with an ANN index.
embedding: vector(3072),

// Half precision, or stored-without-an-ANN-index:
embedding16: vector(3072, { encoding: "f16" }),
stored:      vector(256, { ann: false }),

The distance metric is set once per namespace, not per vector, because Turbopuffer applies a single metric to every vector column. It defaults to cosine_distance:

db.namespace({
  name: "papers",
  version: 1,
  distanceMetric: "euclidean_squared", // optional; applies to all vectors, defaults to cosine_distance
  columns: { embedding: vector(3072) /* … */ },
});

Writing

await papers.upsert(document); // a single document
await papers.upsert(documents); // or an array, batched into one write

  • Documents are typed as the inferred document, so the shape, value types, and branded ids are all checked.
  • Re-upserting an existing id overwrites it (use it to reprocess a row or recompute an embedding).
  • The compiled schema is sent on every write, and any uuid({ version: 7 }) columns validate their values before the request.
  • An empty array is a no-op.
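
The README does not show how uuid({ version: 7 }) values are validated before a write. As a sketch under that caveat, a client-side check could test the textual layout from RFC 9562: the version digit must be 7 and the variant digit one of 8, 9, a, or b. The names isUuidV7 and assertUuidV7 are hypothetical.

```typescript
// Sketch only: validates the string layout and the version/variant digits of a UUIDv7.
const UUID_V7 =
  /^[0-9a-f]{8}-[0-9a-f]{4}-7[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

function isUuidV7(value: string): boolean {
  return UUID_V7.test(value);
}

// Throwing variant, suitable for running over a batch before sending it.
function assertUuidV7(value: string): void {
  if (!isUuidV7(value)) throw new TypeError(`not a UUIDv7: ${value}`);
}

assertUuidV7("0190a1b2-c3d4-7e5f-8a9b-0c1d2e3f4a5b"); // passes: version digit is 7
```

Failing fast on the client keeps a malformed id from costing a round trip.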

Patching and deleting

Patch partially updates documents: pass the id plus only the fields you want to change. Other fields are left untouched.

const { eq } = papers.filter;

await papers.patch({ id, votes: 42 }); // one record
await papers.patch([{ id: a, votes: 1 }, { id: b }]); // or an array
await papers.patchWhere(eq("blocked", true), { votes: 0 }); // every matching document

Delete by id or by filter:

const { lt } = papers.filter;

await papers.delete(id); // one id
await papers.delete([a, b, c]); // or many
await papers.deleteWhere(lt("votes", 0)); // every matching document

A patch's id must reference an existing document. Patches accept only the document's own fields (typed, and all optional apart from id), and version-checked columns such as uuid({ version: 7 }) are still validated. Unlike upsert, patches and deletes send no schema. Empty patch([]) and delete([]) calls are no-ops.

Querying

const rows = await papers.query({
  rankBy, // required: how to rank (see below)
  topK, // optional: max rows
  filters, // optional: a filter expression
  includeAttributes, // optional: which attributes to return
});

Result narrowing

Turbopuffer returns only the id by default and lets you opt into attributes. fugu mirrors that in the type, so a row contains exactly what you asked for:

| includeAttributes | Row shape |
| --- | --- |
| omitted / false | { id, $dist? } only |
| ["title", "votes"] | { id, title, votes } (those fields required) plus $dist if scored |
| true | every attribute |

Touching a field you did not request is a compile error, which also catches the case where you use an attribute you forgot to include.
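
The narrowing described above can be modelled with a Pick over the requested keys. The sketch below is a standalone illustration, not fugu's code: Paper, Row, and project are hypothetical names, and project runs locally on a plain object instead of querying Turbopuffer.

```typescript
// Hypothetical document type for the illustration.
interface Paper {
  id: string;
  title: string;
  votes: number;
  body: string[];
}

// A row always has `id`, plus exactly the requested keys.
type Row<T extends { id: unknown }, K extends keyof T> = Pick<T, "id" | K>;

// Local stand-in for attribute selection: copies `id` and the requested keys only.
function project<T extends { id: unknown }, K extends keyof T & string>(
  doc: T,
  keys: readonly K[],
): Row<T, K> {
  const out: Record<string, unknown> = { id: doc.id };
  for (const key of keys) out[key] = doc[key];
  return out as unknown as Row<T, K>;
}

const paper: Paper = {
  id: "p1",
  title: "Attention Is All You Need",
  votes: 9001,
  body: ["page 1 text"],
};

const row = project(paper, ["title", "votes"]);
const title: string = row.title; // requested, so present in the type
// row.body; // would not compile: `body` was not requested
```

Because K is inferred from the array literal, the requested attribute names flow into the row type without any annotation at the call site.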

The $dist score

$dist is present and required for scoring ranks (ann, knn, bm25, and the score combinators) and absent when you order by an attribute (asc / desc), matching Turbopuffer's behavior, so no optional chaining is needed on the common path.

const scored = await papers.query({ rankBy: ann("embedding", q) });
scored[0].$dist; // number

const ordered = await papers.query({ rankBy: papers.rank.desc("votes") });
ordered[0].$dist; // compile error: attribute ordering produces no score
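
One way to get a score field that depends on the kind of rank is a conditional type keyed on the rank descriptor. The following sketch assumes hypothetical ScoringRank and OrderingRank descriptors and a local fakeQuery stand-in; it illustrates the typing pattern only and says nothing about how fugu implements it.

```typescript
// Hypothetical rank descriptors; only the `scores` flag matters for this sketch.
interface ScoringRank { readonly scores: true; readonly op: "ann" | "bm25" }
interface OrderingRank { readonly scores: false; readonly op: "asc" | "desc" }
type Rank = ScoringRank | OrderingRank;

// $dist exists on the row type only when the rank produces a score.
type RankedRow<R extends Rank> = R extends ScoringRank
  ? { id: string; $dist: number }
  : { id: string };

// Local stand-in for a query: invents a distance per row for scoring ranks.
function fakeQuery<R extends Rank>(rank: R, ids: string[]): RankedRow<R>[] {
  const rows = ids.map((id, i) => (rank.scores ? { id, $dist: i * 0.1 } : { id }));
  return rows as unknown as RankedRow<R>[];
}

const annRank: ScoringRank = { scores: true, op: "ann" };
const descRank: OrderingRank = { scores: false, op: "desc" };

const scoredRows = fakeQuery(annRank, ["a", "b"]);
const firstDist: number = scoredRows[0].$dist; // required, no optional chaining

const orderedRows = fakeQuery(descRank, ["a", "b"]);
// orderedRows[0].$dist; // would not compile: ordering produces no score
```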

Filters

namespace.filter is an operator set bound to the document. Destructure what you need:

const { and, or, not, eq, ne, lt, lte, gt, gte, inArray, notInArray } = papers.filter;

const filters = and(
  eq("blocked", false),
  gte("published_at", "2024-01-01T00:00:00Z"),
  or(gt("votes", 100), inArray("title", ["A", "B"])),
);

| Operator | Meaning |
| --- | --- |
| eq(field, value) / ne(field, value) | equals / not equals |
| lt / lte / gt / gte | comparisons |
| inArray(field, values) / notInArray(field, values) | membership |
| and(...filters) / or(...filters) / not(filter) | composition |

Field names and value types are checked against the document, so eq("blocked", 5) or a misspelled field will not compile.

Operators are bound to the namespace (via namespace.filter) rather than standalone imports, because the document type only appears in keyof T positions of a filter, which TypeScript cannot infer back from a filter value. Binding fixes the document so the field checks actually fire.
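
The binding argument is easier to see in code. In this sketch the document type is fixed once by an explicit type argument, after which each operator only needs to infer the field name. bindFilters and the tuple output shape are assumptions made for illustration; fugu's actual filter representation is not shown in this README.

```typescript
// Hypothetical filter representation for the sketch.
type Filter =
  | readonly [field: string, op: "Eq" | "Gte", value: unknown]
  | readonly ["And", readonly Filter[]];

// T is supplied once, here; the operators never have to infer it.
function bindFilters<T>() {
  return {
    eq: <K extends keyof T & string>(field: K, value: T[K]): Filter => [field, "Eq", value],
    gte: <K extends keyof T & string>(field: K, value: T[K]): Filter => [field, "Gte", value],
    and: (...filters: Filter[]): Filter => ["And", filters],
  };
}

interface Paper {
  blocked: boolean;
  votes: number;
  title: string;
}

const { and, eq, gte } = bindFilters<Paper>();

const filter = and(eq("blocked", false), gte("votes", 10));
// eq("blocked", 5);     // would not compile: 5 is not a boolean
// eq("blokced", false); // would not compile: unknown field name
```

A standalone eq(field, value) would have no document type to check the field against, which is the inference gap the paragraph above describes.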

Ranking

namespace.rank is an operator set whose field arguments are constrained by index capability, not just value type:

const { ann, knn, bm25, asc, desc, sum, max, product } = papers.rank;

ann("embedding", queryVector); // ok: embedding is a vector field
bm25("body", "transformers"); // ok: body is a BM25 field
bm25("title", "transformers"); // compile error: title has no BM25 index
ann("body", queryVector); // compile error: body is not a vector field

| Operator | Produces |
| --- | --- |
| ann(field, vector) / knn(field, vector) | approximate / exact vector search (vector fields only) |
| bm25(field, query, params?) | BM25 full-text ranking (BM25 fields only) |
| asc(field) / desc(field) | order by an attribute, no score |
| sum(...terms) / max(...terms) / product(weight, term) | combine BM25 sub-scores |
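
Constraining a field argument by capability, not by value type, can be done with a mapped type that keeps only the keys whose column carries the right marker. The sketch below is self-contained and hypothetical throughout (VectorCol, Bm25Col, PlainCol, KeysWith, bindRank); it shows the general technique, not fugu's types.

```typescript
// Hypothetical capability markers attached to column descriptors.
interface VectorCol { readonly capability: "vector" }
interface Bm25Col { readonly capability: "bm25" }
interface PlainCol { readonly capability: "plain" }

// Keys of C whose column type is assignable to Want.
type KeysWith<C, Want> = {
  [K in keyof C]: C[K] extends Want ? K : never;
}[keyof C];

const columns = {
  embedding: { capability: "vector" } as VectorCol,
  body: { capability: "bm25" } as Bm25Col,
  title: { capability: "plain" } as PlainCol,
};
type Cols = typeof columns;

function bindRank<C>() {
  return {
    // Only keys of vector columns are accepted here.
    ann: (field: KeysWith<C, VectorCol> & string, vector: number[]) =>
      ({ op: "ann" as const, field, dims: vector.length }),
    // Only keys of BM25 columns are accepted here.
    bm25: (field: KeysWith<C, Bm25Col> & string, query: string) =>
      ({ op: "bm25" as const, field, query }),
  };
}

const { ann, bm25 } = bindRank<Cols>();

const byVector = ann("embedding", [0.1, 0.2, 0.3]);
const byText = bm25("body", "transformers");
// bm25("title", "transformers"); // would not compile: title has no BM25 marker
// ann("body", [0.1, 0.2, 0.3]);  // would not compile: body is not a vector column
```

Note that body and title would both hold strings in a document, so a value-type check alone could not tell them apart; the marker on the column is what separates them.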

Hybrid search

Turbopuffer does hybrid search by running vector and BM25 queries and fusing the rankings on the client. Because results are typed, the fusion is a few lines:

const { ann, bm25 } = papers.rank;
const { eq } = papers.filter;

const [vec, text] = await Promise.all([
  papers.query({
    rankBy: ann("embedding", queryVector),
    topK: 100,
    filters: eq("blocked", false),
    includeAttributes: ["title"],
  }),
  papers.query({
    rankBy: bm25("body", queryText),
    topK: 100,
    filters: eq("blocked", false),
    includeAttributes: ["title"],
  }),
]);

const K = 60; // reciprocal-rank-fusion constant
const scores = new Map<string, number>();
for (const list of [vec, text]) {
  // forEach indexes from 0; reciprocal rank fusion conventionally counts ranks from 1.
  list.forEach((row, i) => scores.set(row.id, (scores.get(row.id) ?? 0) + 1 / (K + i + 1)));
}
const ranked = [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
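
The fusion above yields ids only, so the title requested in both queries is dropped. If you want the fused list to keep the rows themselves, a variant along these lines may help. It is a sketch: fuse and its score field are hypothetical names, ranks are counted from 1, and when a row appears in several lists the attributes of its first occurrence are kept.

```typescript
interface Ranked {
  id: string;
}

// Reciprocal rank fusion that returns rows (with a fused score) instead of bare ids.
function fuse<R extends Ranked>(
  lists: ReadonlyArray<ReadonlyArray<R>>,
  k = 60,
): Array<R & { score: number }> {
  const byId = new Map<string, R & { score: number }>();
  for (const list of lists) {
    list.forEach((row, index) => {
      const gain = 1 / (k + index + 1); // index is 0-based, rank is 1-based
      const seen = byId.get(row.id);
      if (seen) seen.score += gain;
      else byId.set(row.id, { ...row, score: gain });
    });
  }
  // Highest fused score first.
  return Array.from(byId.values()).sort((a, b) => b.score - a.score);
}

// Stand-ins for the two result lists returned by the vector and BM25 queries.
const vec = [
  { id: "a", title: "A" },
  { id: "b", title: "B" },
  { id: "c", title: "C" },
];
const text = [
  { id: "b", title: "B" },
  { id: "d", title: "D" },
];

const fused = fuse([vec, text]);
const fusedIds = fused.map((row) => row.id);
```

Here "b" appears in both lists, so it accumulates two contributions and ranks first even though it tops only one of them.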

Environments and versioning

The full namespace key is {environment}-{name}-v{version}. Both the environment and version segments are dropped when omitted, so the key can be anything from papers to prod-papers-v2. Two knobs:

  • environment isolates deployments. Pass "prod" / "staging" / "dev" to fugu(...) so they never share an index. Because Turbopuffer namespaces are addressable by prefix, this also lets you list or tear down everything in one environment at once. When omitted, fugu falls back to the FUGU_ENV environment variable, so you can set the environment per deployment without threading it through code (an explicit environment always wins).
  • version is your migration escape hatch, and it is optional. Turbopuffer attribute types are immutable once a namespace holds data, so a breaking schema change (a changed type or removed attribute) means bumping version to roll onto a fresh, empty namespace. Adding indexing to an existing attribute does not need a bump. Omit version entirely for namespaces you never expect to migrate.

const db = fugu({ client, environment: "staging" });
db.namespace({ name: "papers", version: 2, columns }).key; // "staging-papers-v2"
db.namespace({ name: "papers", columns }).key; // "staging-papers"
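
The key rule and the FUGU_ENV fallback are small enough to restate as code. This is a sketch of the behaviour described above, not fugu's source; namespaceKey and resolveEnvironment are hypothetical helpers, and the environment variables are passed in as a plain record so the example does not depend on a particular runtime.

```typescript
// Builds "{environment}-{name}-v{version}", dropping segments that are omitted.
function namespaceKey(parts: {
  environment?: string;
  name: string;
  version?: number;
}): string {
  const segments: string[] = [];
  if (parts.environment) segments.push(parts.environment);
  segments.push(parts.name);
  if (parts.version !== undefined) segments.push(`v${parts.version}`);
  return segments.join("-");
}

// An explicit environment always wins; otherwise fall back to FUGU_ENV.
function resolveEnvironment(
  explicit: string | undefined,
  env: Record<string, string | undefined>,
): string | undefined {
  return explicit ?? env["FUGU_ENV"];
}

const stagingV2 = namespaceKey({ environment: "staging", name: "papers", version: 2 });
const stagingOnly = namespaceKey({ environment: "staging", name: "papers" });
const bare = namespaceKey({ name: "papers" });
const fromEnv = namespaceKey({
  environment: resolveEnvironment(undefined, { FUGU_ENV: "prod" }),
  name: "papers",
  version: 2,
});
```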
