Skip to content

Dataset pipelines#163

Merged
Luca Forstner (lforst) merged 21 commits into
mainfrom
dataset-pipelines
May 27, 2026
Merged

Dataset pipelines#163
Luca Forstner (lforst) merged 21 commits into
mainfrom
dataset-pipelines

Conversation

@ankrgyl
Copy link
Copy Markdown
Contributor

No description provided.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 30, 2026

Latest downloadable build artifacts for this PR commit b4094803988c:

Available artifact names
  • ``artifacts-build-global
  • ``artifacts-build-local-x86_64-apple-darwin
  • ``artifacts-build-local-x86_64-pc-windows-msvc
  • ``artifacts-build-local-aarch64-pc-windows-msvc
  • ``artifacts-build-local-aarch64-apple-darwin
  • ``artifacts-build-local-x86_64-unknown-linux-musl
  • ``artifacts-build-local-x86_64-unknown-linux-gnu
  • ``artifacts-build-local-aarch64-unknown-linux-gnu
  • ``artifacts-plan-dist-manifest
  • ``cargo-dist-cache

@ankrgyl Ankur Goyal (ankrgyl) changed the title wip: dataset pipelines Dataset pipelines May 3, 2026
Comment thread README.md Outdated
bt datasets pipeline run ./pipeline.ts --limit 100

# Staged execution for inspection or agent editing.
bt datasets pipeline fetch ./pipeline.ts --limit 500
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this called fetch here, but the enum is called pull? Can we make the naming more consistent?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, at some point i renamed from fetch to pull and there were some lingering references.

Comment thread src/datasets/pipeline.rs
Ok((ctx, client, project))
}

fn discovery_filter(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a timestamp filter of some kind? Just to constrain the queries here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a --window argument which defaults to 1d and is always and'd in.

Comment thread src/datasets/pipeline.rs Outdated
/// Maximum number of source refs to discover
#[arg(
long,
alias = "target",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if we didn't have this alias. limit aligns better with the rest of the CLI options, and it might be confusing that this is referring to source refs, and not final row count (while things like --target-dataset refer to the output).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

Comment thread src/datasets/pipeline.rs
);
}

datasets_api::create_dataset_with_metadata(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto creating datasets like this might be surprising behaviour. Means that folks run into issues if they make a spelling mistake or something similar. But I don't feel strongly about this, just figured being explicit about it might be better for the agents.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i generally agree but our SDKs auto create projects and datasets so i think as-is, this is more consistent with our current semantics.

@lforst
Copy link
Copy Markdown
Member

Luca Forstner (lforst) commented May 27, 2026

Did some changes in the JS version we need to check here:

  • bt should probably just read off the __braintrust_dataset_pipelines global instead of getRegisteredDatasetPipelines() which is there or may not be there, and is an array of definitions (this is to reduce API surface in the sdk and has very little drawback here I believe)
  • I feel like we should not expose the origin property to the user. It should just be filled automatically
  • Changed the TS typing so that trace is always available in the transform input.

General Qs:

  • Should we accept an orgName for the target of the definition? It seems implicit from the source
  • Should we expose the id to users from the transform result? Is this something user-facing?
  • In the design doc we were passing the id of the span to the transform funciton. In the JS implementation this doesn't seem to be the case. Could be pretty important to find a span in the trace.

@ankrgyl
Copy link
Copy Markdown
Contributor Author

Good stuff. comments inline

Did some changes in the JS version we need to check here:

  • bt should probably just read off the __braintrust_dataset_pipelines global instead of getRegisteredDatasetPipelines() which is there or may not be there, and is an array of definitions (this is to reduce API surface in the sdk and has very little drawback here I believe)
  • I feel like we should not expose the origin property to the user. It should just be filled automatically
  • Changed the TS typing so that trace is always available in the transform input.

updated this stuff. See braintrustdata/braintrust-sdk-python#470

General Qs:

  • Should we accept an orgName for the target of the definition? It seems implicit from the source

Yep

  • Should we expose the id to users from the transform result? Is this something user-facing?

braintrustdata/braintrust-sdk-javascript#2055

  • In the design doc we were passing the id of the span to the transform funciton. In the JS implementation this doesn't seem to be the case. Could be pretty important to find a span in the trace.

Fixed

@lforst Luca Forstner (lforst) merged commit 773afa3 into main May 27, 2026
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants