feat(bench): add Arrow Flight E2E benchmark + Benchmarks CI workflow #5557

Open
Yicong-Huang wants to merge 7 commits into apache:main from Yicong-Huang:bench/arrow-flight-e2e

Yicong-Huang commented on Jun 8, 2026

Contributor

What changes were proposed in this PR?

A bench-agnostic CI lifecycle that future suites (e.g. JMH for ArrowUtils micros) plug into by appending one line to bin/run-benchmarks.sh, plus the first concrete suite: an end-to-end Arrow Flight + PythonWorkflowWorker micro-bench.

Lifecycle

| Trigger | Mode | PR comment | Publish to gh-pages |
| --- | --- | --- | --- |
| pull_request (label-gated, mirrors amber-integration's set) | pr: 3 configs × 20 batches (~5 min) | yes | no |
| push to main | pr (post-merge fast signal) | no | yes |
| schedule, Sundays 08:00 UTC | full: 36 configs × 200 batches (~50-60 min) | no | yes |
| workflow_dispatch | full | no | no |

PR runs upload the bench as an artifact + render a markdown summary table on the workflow page; the workflow_run-triggered Benchmarks PR Comment listener (separate file because pull_request from forks gets a read-only token and zero secret access) downloads the artifact, sanitizes the CSV, and upserts a single marker-tagged PR comment. Non-blocking — not part of required-checks.yml's aggregator.
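
The upsert step described above can be sketched in Python. This is only an illustration of the idea, not the workflow's actual script: the marker string, the function names, and the shape of the comment dictionaries are all assumptions, and the network calls are left out so the decision logic stands alone. Iterating over every page mirrors the pagination needed to find the marker on PRs with many comments.

```python
from typing import Iterable, Optional

# Hypothetical marker; the PR does not state the actual string it uses.
MARKER = "<!-- benchmarks-pr-comment -->"


def find_marker_comment(pages: Iterable[list[dict]], marker: str = MARKER) -> Optional[int]:
    """Scan every page of comments and return the id of the first one carrying the marker."""
    for page in pages:
        for comment in page:
            if marker in (comment.get("body") or ""):
                return comment["id"]
    return None


def plan_upsert(pages: Iterable[list[dict]], summary: str, marker: str = MARKER) -> dict:
    """Decide whether to create a new comment or update the existing marker comment."""
    body = f"{marker}\n{summary}"
    existing_id = find_marker_comment(pages, marker)
    if existing_id is None:
        return {"action": "create", "body": body}
    return {"action": "update", "comment_id": existing_id, "body": body}
```

Because the marker is embedded in the body on every write, a later run finds the same comment again and updates it in place.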

First benchmark: Arrow Flight E2E (ArrowFlightActorBench)

Spawns a real PythonWorkflowWorker actor (real Pekko mailbox + real texera_run_python_worker.py subprocess + real Arrow Flight gRPC) wired to an identity Python UDF, then times per-batch send→echo round-trip across a sweep of batch_size × schema_width × string_len. Per-config output: throughput (tuples/s, MB/s), latency p50/p95/p99, total ms. Each config writes incrementally so a killed sweep still leaves usable artifacts.
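
As a rough illustration of the per-config metrics, the following Python sketch derives throughput and latency percentiles from a list of per-batch round-trip times. It is not the Scala bench's code. It assumes batches are sent one at a time, so total time is the sum of round-trips, and it uses the nearest-rank percentile definition; the bench may compute either differently. All names are hypothetical.

```python
import math


def percentile(sorted_vals: list[float], p: float) -> float:
    """Nearest-rank percentile of an ascending list (assumed definition)."""
    if not sorted_vals:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def summarize(latencies_us: list[float], batch_size: int, bytes_per_batch: int) -> dict:
    """Summarize one config from its per-batch round-trip latencies in microseconds."""
    ordered = sorted(latencies_us)
    total_us = sum(latencies_us)
    total_s = total_us / 1e6
    n = len(latencies_us)
    return {
        "tuples_per_s": batch_size * n / total_s,
        "mb_per_s": bytes_per_batch * n / total_s / 1e6,
        "lat_p50_us": percentile(ordered, 50),
        "lat_p95_us": percentile(ordered, 95),
        "lat_p99_us": percentile(ordered, 99),
        "total_ms": total_us / 1000,
    }
```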

ASF: benchmark-action/github-action-benchmark is SHA-pinned to 52576c92bccf6ac60c8223ec7eb2565637cae9ba (v1.22.1) per the apache-infrastructure-actions allow-list.

Any related issues, documentation, discussions?

Closes #5556

How was this PR tested?

End-to-end validated on a fork-internal PR — Yicong-Huang/texera#17 ran the full Benchmarks workflow, the workflow_run listener fired, and a marker-tagged comment landed and upserted across two push cycles (rendered example). workflow_run only listens on the default branch, so the loop can't be tested from a non-default branch — that's why the dry-run lived on a fork; after merge, the same flow takes effect on apache/texera:main automatically.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

Adds an end-to-end micro-benchmark of the Arrow Flight data path that
spawns a real PythonWorkflowWorker actor (real Pekko + real Python
subprocess via texera_run_python_worker.py + real Arrow Flight gRPC),
wires an identity Python UDF, and times per-batch send→echo round-trip
across a 36-config sweep (batch_size × schema_width × string_len).

CI integration is bench-agnostic: a single `Benchmarks` workflow calls
`bin/run-benchmarks.sh` as an opaque entry point. Future benches (e.g.
JMH for ArrowUtils micros) plug in by appending one line to that script
and adding a Publish step block. Trigger gate mirrors amber-integration
via PR labels (no file-path filters); failure does not block merges
(workflow stays out of required-checks.yml's aggregator). Results upload
to gh-pages dashboard via SHA-pinned benchmark-action (v1.22.1, on ASF
allow-list); auto-push gated on push-to-main so PRs do not pollute the
baseline.

github-actions bot added the engine, ci (changes related to CI), and dev labels on Jun 8, 2026

codecov-commenter commented on Jun 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.06%. Comparing base (75b4619) to head (f6c4da0).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5557      +/-   ##
============================================
- Coverage     52.16%   52.06%   -0.11%     
+ Complexity     2482     2473       -9     
============================================
  Files          1067     1067              
  Lines         41273    41355      +82     
  Branches       4437     4438       +1     
============================================
+ Hits          21532    21533       +1     
- Misses        18479    18573      +94     
+ Partials       1262     1249      -13     
| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| access-control-service | 64.61% <ø> (ø) | |
| agent-service | 33.76% <ø> (ø) | |
| amber | 52.88% <ø> (-0.35%) ⬇️ | |
| computing-unit-managing-service | 1.65% <ø> (ø) | |
| config-service | 56.06% <ø> (ø) | |
| file-service | 38.32% <ø> (ø) | |
| frontend | 46.40% <ø> (-0.08%) ⬇️ | |
| pyamber | 90.72% <ø> (+0.02%) ⬆️ | |
| python | 90.83% <ø> (ø) | Carriedforward from 266e5e8 |
| workflow-compiling-service | 58.69% <ø> (ø) | |

*This pull request uses carry forward flags.


The bench job's sbt compile transitively reaches `common/auth` which
imports JOOQ-generated classes from `org.apache.texera.dao.jooq.generated.*`.
JOOQ codegen at sbt compile time requires a live Postgres connection to
introspect schema; without it the auth module's `User` / `UserRoleEnum`
symbols fail to resolve and the whole compile aborts. The bench itself
doesn't touch the DB at runtime — this is purely a build dependency.

Mirrors the same `services.postgres` block and `Create Databases` step
that amber-integration in build.yml uses (minus the iceberg / lakefs /
lakekeeper SQL since the bench never reads from those schemas).

Local builds didn't surface this because they had cached JOOQ classes
from prior runs against a developer Postgres.

PR and post-merge runs now use a trimmed 3-config grid (`pr` mode: 3
batch sizes × single schema × single string len × 20 batches) targeting
~5 min in CI. The full 36-config grid (`full` mode: 4 × 3 × 3 × 200
batches) runs weekly on a scheduled trigger and on workflow_dispatch.

Trigger → mode → publish mapping:
  pull_request                        → pr   → no gh-pages push
  push to main                        → pr   → publishes baseline
  schedule (Sundays 08:00 UTC)        → full → publishes baseline
  workflow_dispatch                   → full → no publish

Bench-side: `BENCH_MODE` env var selects between two GridSpec cases.
`BENCH_NUM_BATCHES` still overrides numBatches if set (for local smoke).
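
The mode selection can be pictured with the Python sketch below. The bench itself is Scala, so this is only an analogue: the axis values are invented to match the stated counts (3 × 1 × 1 for `pr`, 4 × 3 × 3 for `full`), and defaulting to `pr` when `BENCH_MODE` is unset is an assumption the commit message does not confirm.

```python
import itertools

# Hypothetical axis values: the source gives only the counts per axis.
GRIDS = {
    "pr": {
        "batch_size": [100, 1000, 10000],
        "schema_width": [10],
        "string_len": [32],
        "num_batches": 20,
    },
    "full": {
        "batch_size": [10, 100, 1000, 10000],
        "schema_width": [1, 10, 50],
        "string_len": [8, 64, 512],
        "num_batches": 200,
    },
}


def build_grid(env: dict) -> list[dict]:
    """Expand the grid chosen by BENCH_MODE; BENCH_NUM_BATCHES overrides the batch count."""
    mode = env.get("BENCH_MODE") or "pr"  # assumed default
    if mode not in GRIDS:
        raise ValueError(f"unknown BENCH_MODE: {mode!r}")
    spec = GRIDS[mode]
    override = env.get("BENCH_NUM_BATCHES")
    num_batches = int(override) if override else spec["num_batches"]
    return [
        {"batch_size": b, "schema_width": w, "string_len": s, "num_batches": num_batches}
        for b, w, s in itertools.product(
            spec["batch_size"], spec["schema_width"], spec["string_len"]
        )
    ]
```

Calling `build_grid(dict(os.environ))` after `import os` would read the real environment; passing a plain dictionary keeps the sketch easy to test.
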

Bench step itself ran fine on the prior CI; the Publish throughput /
latency steps failed with `fatal: couldn't find remote ref gh-pages`
because apache/texera has no gh-pages branch yet. The action attempts
to fetch gh-pages even when auto-push is false (it normally wants the
branch to compare against baseline).

Add `skip-fetch-gh-pages: true` to bypass the fetch — auto-push on
push-to-main / schedule still creates the branch on first write. Once
the dashboard is seeded, flip this to false to re-enable baseline
comparison + alert-threshold logic.

Add `continue-on-error: true` on each publish step as a safety net:
the bench data is already preserved in the uploaded artifact, so any
gh-pages-side surprise (permissions glitch, transient git failure)
shouldn't fail the bench job overall.

Three things:

1. Rename bench job's display name to `Bench` (was lowercase `bench`).

2. Bench job → Python 3.12 (was 3.11). Matches the local dev venv and
   the runtime Python texera_run_python_worker.py spawns; consistency
   removes a class of "works locally, drifts in CI" surprises.

3. PR-side comment with bench results. The Benchmarks workflow runs on
   `pull_request` events from forks, where GitHub forces GITHUB_TOKEN
   to read-only and refuses to inject any secret (AUTO_MERGE_TOKEN
   included — that restriction applies to ALL secrets, not just
   GITHUB_TOKEN). The fix is the ASF-approved `workflow_run` pattern:
   a separate workflow file that triggers when Benchmarks completes,
   runs in the base repo's trusted context, and has `pull-requests:
   write`.

   Bench-side: write the PR number to bench-results/pr-number.txt
   (workflow_run.pull_requests is empty for fork PRs, so we ferry the
   number via artifact); render a markdown summary table to the
   $GITHUB_STEP_SUMMARY for one-click visibility on the workflow page.

   Comment-side (benchmarks-pr-comment.yml): download the artifact,
   read + strict-validate (`^[0-9]+$`) the PR number, sanitize the CSV
   (cap at 32 KB, neutralize any triple-backtick sequence so a
   malicious fork can't escape the code fence and inject arbitrary
   markdown), then upsert a marker-tagged comment so subsequent runs
   update in place rather than spam.
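
The validation and sanitization steps in item 3 might look like the Python sketch below. It is a hedged illustration, not the workflow's script: the function names are hypothetical, stripping surrounding whitespace before matching is an assumption, and replacing each backtick triple with three apostrophes is just one possible way to neutralize a fence. The fence string is built with `"`" * 3` so the example does not contain a literal fence.

```python
import re

MAX_CSV_BYTES = 32 * 1024  # 32 KB cap from the commit message
FENCE = "`" * 3            # triple backtick, built indirectly


def parse_pr_number(raw: str) -> int:
    """Accept only ASCII digits; anything else is rejected before any API call."""
    text = raw.strip()  # assumption: tolerate the trailing newline of a text file
    if not re.fullmatch(r"[0-9]+", text):
        raise ValueError("PR number must match ^[0-9]+$")
    return int(text)


def sanitize_csv(raw: bytes) -> str:
    """Cap the size, decode leniently, and break every triple-backtick run."""
    clipped = raw[:MAX_CSV_BYTES]
    text = clipped.decode("utf-8", errors="replace")
    # Left-to-right replacement leaves at most two adjacent backticks behind.
    return text.replace(FENCE, "'''")
```

`re.fullmatch` is used rather than `re.match` with `$` because in Python `$` also matches just before a trailing newline.
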

The previous comment dumped raw CSV inside a `csv` code block, which
forced reviewers to mentally column-align 13 fields per row. Render
the actionable subset as a right-aligned markdown table (batch /
schema_w / str_len / n_batches / tuples-s / MB-s / p50 ms / p99 ms /
total ms), convert lat_*_us to milliseconds, drop redundant fields
(config_idx, total_tuples, total_bytes, lat_p95_us), and tuck the raw
sanitized CSV into a collapsed `<details>` for verifiability.

Per-cell sanitizer escapes pipes and strips newlines to defeat table
injection from the still-untrusted fork-PR-controlled CSV. Falls back
to the raw-CSV view if header parsing fails.
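
A minimal Python sketch of this rendering step follows. Only `lat_p50_us`, `lat_p99_us`, and the dropped fields are named in the commit message; the other CSV field names here are guesses, as are the function names. Returning `None` on a header mismatch stands in for the fallback to the raw-CSV view. The sketch also escapes backslashes before pipes, an extra precaution not mentioned in the source, so that a cell ending in a backslash cannot cancel the pipe escape.

```python
import csv
import io

# (csv field, table header, is a microsecond latency). Field names are assumed.
COLUMNS = [
    ("batch_size", "batch", False),
    ("schema_width", "schema_w", False),
    ("string_len", "str_len", False),
    ("num_batches", "n_batches", False),
    ("tuples_per_s", "tuples-s", False),
    ("mb_per_s", "MB-s", False),
    ("lat_p50_us", "p50 ms", True),
    ("lat_p99_us", "p99 ms", True),
    ("total_ms", "total ms", False),
]


def clean_cell(value: str) -> str:
    """Strip newlines and escape backslashes and pipes so a cell cannot add columns."""
    value = value.replace("\r", "").replace("\n", "")
    return value.replace("\\", "\\\\").replace("|", "\\|").strip()


def render_table(csv_text: str):
    """Return a right-aligned markdown table, or None if expected headers are missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    if any(name not in fields for name, _, _ in COLUMNS):
        return None  # caller falls back to the raw-CSV view
    lines = [
        "| " + " | ".join(header for _, header, _ in COLUMNS) + " |",
        "| " + " | ".join("---:" for _ in COLUMNS) + " |",
    ]
    for row in reader:
        cells = []
        for name, _, is_us in COLUMNS:
            raw = row.get(name) or ""
            if is_us:
                try:
                    raw = f"{float(raw) / 1000:.3f}"  # microseconds to milliseconds
                except ValueError:
                    pass  # keep the untrusted value; it is still sanitized below
            cells.append(clean_cell(raw))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```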

Copilot AI left a comment

Contributor

Pull request overview

Adds a bench-agnostic CI workflow for performance benchmarks and introduces the first suite: an end-to-end Arrow Flight ↔ PythonWorkflowWorker benchmark that emits CSV + GitHub Action Benchmark JSON outputs and can publish historical results to gh-pages and post summarized PR comments via a workflow_run listener.

Changes:

  • Add bin/run-benchmarks.sh as a single entry point for running all benchmark suites and writing artifacts to bench-results/.
  • Introduce ArrowFlightActorBench (Scala) to measure per-batch round-trip latency percentiles and throughput across a config grid, emitting CSV + JSON for github-action-benchmark.
  • Add two GitHub Actions workflows: one to run/publish benchmarks (benchmarks.yml) and one to upsert a marker-tagged PR comment with sanitized results (benchmarks-pr-comment.yml).

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| bin/run-benchmarks.sh | Adds a unified benchmark runner invoked by CI and standardizes artifact output under bench-results/. |
| amber/src/test/scala/.../ArrowFlightActorBench.scala | Implements the Arrow Flight E2E benchmark and writes CSV/JSON outputs for CI consumption and dashboard publishing. |
| .github/workflows/benchmarks.yml | Adds the main benchmark CI workflow with label-gated PR runs and publish-to-gh-pages behavior for main/schedule. |
| .github/workflows/benchmarks-pr-comment.yml | Adds a workflow_run listener that downloads benchmark artifacts and upserts a sanitized PR comment. |

Comment thread .github/workflows/benchmarks.yml
Comment thread .github/workflows/benchmarks-pr-comment.yml Outdated
- benchmarks.yml: TRIGGER_LABELS used `python`; required-checks.yml's
  LABEL_STACKS keys this stack as `pyamber` per labeler.yml, so PRs
  labeled `pyamber` were silently skipping the bench. Swap to the
  correct key.
- benchmarks-pr-comment.yml: switch listComments to paginate so the
  marker comment can still be located on PRs with >100 comments;
  prevents duplicate-comment-spam on long-running PRs.
- ArrowFlightActorBench.awaitOneDataFrameEcho: each receiveOne now uses
  the remaining time to the absolute deadline rather than the full
  timeout per loop iteration. A flood of ACK/ECM messages can no
  longer extend the overall wait beyond the caller's deadline.
- ArrowFlightActorBench scaladoc: replace stale `benchmark-results.csv`
  with the actual `bench-results/arrow-flight-e2e.{csv,*.json}` paths.
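
The deadline fix to `awaitOneDataFrameEcho` can be illustrated with a Python analogue. This is not the Scala code: a `queue.Queue` stands in for the test probe, and `await_data_frame` and `is_data_frame` are hypothetical names. The point is that each wait uses only the time left until one absolute deadline, so a stream of control messages cannot stretch the total wait.

```python
import queue
import time


def await_data_frame(inbox: queue.Queue, timeout_s: float, is_data_frame):
    """Return the first data frame, waiting at most timeout_s in total."""
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no data frame before the deadline")
        try:
            # Wait only for the time left, not for the full timeout again.
            msg = inbox.get(timeout=remaining)
        except queue.Empty:
            raise TimeoutError("no data frame before the deadline") from None
        if is_data_frame(msg):
            return msg
        # Otherwise it was a control message (e.g. an ACK); keep waiting.
```

Had each iteration passed `timeout_s` to `inbox.get` instead, every control message would have restarted the clock.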

aglinxinyuan left a comment

Contributor

I think this change needs more discussion since it is fundamentally different from our regular test cases.

We can either:

  1. Have a discussion of design for how benchmark results should be shown to developers.
  2. Turn this into a regular test by having the benchmark enforce a threshold and return a pass/fail result, so it can guard against regressions like other tests.

Comment thread .github/workflows/benchmarks.yml
```yaml
# - The PR number is validated against ^[0-9]+$ before being used in
# any API call, blocking ref injection.

name: Benchmarks PR Comment
```

Contributor

How should we interpret this number? We should have a baseline for comparison; otherwise, it's difficult to tell whether the result is good or bad.

Contributor Author

We first need to merge this PR so that the main branch has a baseline number. According to GitHub Action Benchmark, results will then be posted to gh-pages for continuous result rendering: https://github.com/benchmark-action/github-action-benchmark#charts-on-github-pages

and an alert comment will be added on the PR for performance regressions: https://github.com/benchmark-action/github-action-benchmark#alert-comment-on-commit-page

But both can only be checked after this PR is merged. I am testing on my fork, and we can follow up to fix issues. See the PR description.

```scala
 * subprocess may end up trying to launch from that path; move it aside for
 * the run, or fix `amberHomePath` upstream.
 */
object ArrowFlightActorBench {
```

Contributor

Should we keep the benchmark class in the test folder? Unlike the other tests, it doesn't produce a pass/fail result. Since it's intended for performance measurement, it may be better to place it in a separate benchmark folder.

Contributor Author

I can move it to a bench folder.

@Yicong-Huang

Contributor Author

It is not a test case; it is a benchmark. I hope to use it for informational purposes only, without blocking PRs from merging. What do you think?

@aglinxinyuan

Contributor

> It is not a test case; it is a benchmark. I hope to use it for informational purposes only, without blocking PRs from merging. What do you think?

Sure. Whether to run it in CI or not is a minor issue. Please address the other comments as well.

Labels

ci (changes related to CI), dev, engine

Development

Successfully merging this pull request may close these issues.

Add end-to-end Arrow Flight benchmark with CI dashboard integration

4 participants