feat(bench): add Arrow Flight E2E benchmark + Benchmarks CI workflow by Yicong-Huang · Pull Request #5557 · apache/texera

Yicong-Huang · 2026-06-08T03:08:38Z

What changes were proposed in this PR?

A bench-agnostic CI lifecycle that future suites (e.g. JMH for ArrowUtils micros) plug into by appending one line to bin/run-benchmarks.sh, plus the first concrete suite: an end-to-end Arrow Flight + PythonWorkflowWorker micro-bench.

Lifecycle

| Trigger | Mode | PR comment | Publish to gh-pages |
| --- | --- | --- | --- |
| `pull_request` (label-gated, mirrors `amber-integration`'s set) | `pr`: 3 configs × 20 batches (~5 min) | ✓ | - |
| `push` to `main` | `pr` (post-merge fast signal) | - | ✓ |
| `schedule` Sundays 08:00 UTC | `full`: 36 configs × 200 batches (~50-60 min) | - | ✓ |
| `workflow_dispatch` | `full` | - | - |

PR runs upload the bench as an artifact + render a markdown summary table on the workflow page; the workflow_run-triggered Benchmarks PR Comment listener (separate file because pull_request from forks gets a read-only token and zero secret access) downloads the artifact, sanitizes the CSV, and upserts a single marker-tagged PR comment. Non-blocking — not part of required-checks.yml's aggregator.
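The upsert step described above can be sketched as a pure decision function. This is only an illustration under stated assumptions: the marker string, the `plan_upsert` and `find_marker_comment` names, and the dictionary shapes are hypothetical, and the real listener talks to the GitHub API rather than to in-memory lists.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical marker; the PR does not state the actual string it uses.
MARKER = "<!-- benchmarks-pr-comment -->"


def find_marker_comment(pages: Iterable[List[Dict]],
                        marker: str = MARKER) -> Optional[int]:
    """Scan every page of comments and return the id of the tagged one."""
    for page in pages:
        for comment in page:
            if marker in comment.get("body", ""):
                return comment["id"]
    return None


def plan_upsert(pages: Iterable[List[Dict]], body: str,
                marker: str = MARKER) -> Dict:
    """Decide whether to create a new comment or update the existing one."""
    tagged = marker + "\n" + body
    existing = find_marker_comment(pages, marker)
    if existing is None:
        return {"action": "create", "body": tagged}
    return {"action": "update", "comment_id": existing, "body": tagged}
```

Because the marker is embedded in the comment body itself, a later run can locate the earlier comment without storing any state between workflow runs.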

First benchmark: Arrow Flight E2E (ArrowFlightActorBench)

Spawns a real PythonWorkflowWorker actor (real Pekko mailbox + real texera_run_python_worker.py subprocess + real Arrow Flight gRPC) wired to an identity Python UDF, then times per-batch send→echo round-trip across a sweep of batch_size × schema_width × string_len. Per-config output: throughput (tuples/s, MB/s), latency p50/p95/p99, total ms. Each config writes incrementally so a killed sweep still leaves usable artifacts.
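As a rough illustration of how per-batch round-trip times reduce to the per-config metrics listed above, here is a minimal sketch. It is not the bench's Scala code: it assumes batches are sent one at a time (so total time is the sum of the round trips), uses the nearest-rank percentile definition, and counts 1 MB as 10^6 bytes. The function and key names are hypothetical.

```python
import math
from typing import Dict, List


def percentile(sorted_us: List[float], q: float) -> float:
    """Nearest-rank percentile of an ascending list; q is in (0, 100]."""
    if not sorted_us:
        raise ValueError("need at least one sample")
    rank = max(1, math.ceil(q * len(sorted_us) / 100.0))
    return sorted_us[rank - 1]


def summarize(latencies_us: List[float], tuples_per_batch: int,
              bytes_per_batch: int) -> Dict[str, float]:
    """Reduce per-batch round-trip times (microseconds) to one metrics row."""
    ordered = sorted(latencies_us)
    n = len(latencies_us)
    # Assumption: one batch in flight at a time, so wall time is the sum.
    total_s = sum(latencies_us) / 1_000_000.0
    return {
        "tuples_per_s": n * tuples_per_batch / total_s,
        "mb_per_s": n * bytes_per_batch / total_s / 1_000_000.0,
        "lat_p50_us": percentile(ordered, 50),
        "lat_p95_us": percentile(ordered, 95),
        "lat_p99_us": percentile(ordered, 99),
        "total_ms": total_s * 1000.0,
    }
```

With only 20 batches per config in `pr` mode, p99 under nearest-rank is simply the slowest batch, which is one reason the tail percentiles from short runs are best read as a coarse signal.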

ASF: benchmark-action/github-action-benchmark is SHA-pinned to 52576c92bccf6ac60c8223ec7eb2565637cae9ba (v1.22.1) per the apache-infrastructure-actions allow-list.

Any related issues, documentation, discussions?

Closes #5556

How was this PR tested?

End-to-end validated on a fork-internal PR — Yicong-Huang/texera#17 ran the full Benchmarks workflow, the workflow_run listener fired, and a marker-tagged comment landed and upserted across two push cycles (rendered example). workflow_run only listens on the default branch, so the loop can't be tested from a non-default branch — that's why the dry-run lived on a fork; after merge, the same flow takes effect on apache/texera:main automatically.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

Adds an end-to-end micro-benchmark of the Arrow Flight data path that spawns a real PythonWorkflowWorker actor (real Pekko + real Python subprocess via texera_run_python_worker.py + real Arrow Flight gRPC), wires an identity Python UDF, and times per-batch send→echo round-trip across a 36-config sweep (batch_size × schema_width × string_len). CI integration is bench-agnostic: a single `Benchmarks` workflow calls `bin/run-benchmarks.sh` as an opaque entry point. Future benches (e.g. JMH for ArrowUtils micros) plug in by appending one line to that script and adding a Publish step block. Trigger gate mirrors amber-integration via PR labels (no file-path filters); failure does not block merges (workflow stays out of required-checks.yml's aggregator). Results upload to gh-pages dashboard via SHA-pinned benchmark-action (v1.22.1, on ASF allow-list); auto-push gated on push-to-main so PRs do not pollute the baseline.

codecov-commenter · 2026-06-08T03:10:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.06%. Comparing base (75b4619) to head (f6c4da0).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5557      +/-   ##
============================================
- Coverage     52.16%   52.06%   -0.11%     
+ Complexity     2482     2473       -9     
============================================
  Files          1067     1067              
  Lines         41273    41355      +82     
  Branches       4437     4438       +1     
============================================
+ Hits          21532    21533       +1     
- Misses        18479    18573      +94     
+ Partials       1262     1249      -13

| Flag | Coverage Δ | | *Carryforward flag |
| --- | --- | --- | --- |
| access-control-service | `64.61% <ø> (ø)` | | |
| agent-service | `33.76% <ø> (ø)` | | |
| amber | `52.88% <ø> (-0.35%)` | ⬇️ | |
| computing-unit-managing-service | `1.65% <ø> (ø)` | | |
| config-service | `56.06% <ø> (ø)` | | |
| file-service | `38.32% <ø> (ø)` | | |
| frontend | `46.40% <ø> (-0.08%)` | ⬇️ | |
| pyamber | `90.72% <ø> (+0.02%)` | ⬆️ | |
| python | `90.83% <ø> (ø)` | | Carriedforward from 266e5e8 |
| workflow-compiling-service | `58.69% <ø> (ø)` | | |

*This pull request uses carry forward flags.

The bench job's sbt compile transitively reaches `common/auth` which imports JOOQ-generated classes from `org.apache.texera.dao.jooq.generated.*`. JOOQ codegen at sbt compile time requires a live Postgres connection to introspect schema; without it the auth module's `User` / `UserRoleEnum` symbols fail to resolve and the whole compile aborts. The bench itself doesn't touch the DB at runtime — this is purely a build dependency. Mirrors the same `services.postgres` block and `Create Databases` step that amber-integration in build.yml uses (minus the iceberg / lakefs / lakekeeper SQL since the bench never reads from those schemas). Local builds didn't surface this because they had cached JOOQ classes from prior runs against a developer Postgres.

PR and post-merge runs now use a trimmed 3-config grid (`pr` mode: 3 batch sizes × single schema × single string len × 20 batches) targeting ~5 min in CI. The full 36-config grid (`full` mode: 4 × 3 × 3 × 200 batches) runs weekly on a scheduled trigger and on workflow_dispatch.

Trigger → mode → publish mapping:

- pull_request → pr → no gh-pages push
- push to main → pr → publishes baseline
- schedule (Sundays 08:00 UTC) → full → publishes baseline
- workflow_dispatch → full → no publish

Bench-side: `BENCH_MODE` env var selects between two GridSpec cases. `BENCH_NUM_BATCHES` still overrides numBatches if set (for local smoke).
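The mode selection and override can be pictured with the sketch below. It is a Python stand-in for the Scala `GridSpec` logic, not a transcription of it: the concrete axis values are placeholders chosen only to match the stated counts (3 × 1 × 1 and 4 × 3 × 3), and defaulting to `pr` when `BENCH_MODE` is unset is an assumption.

```python
import itertools
from typing import Dict, List, Mapping, NamedTuple, Tuple


class GridSpec(NamedTuple):
    batch_sizes: List[int]
    schema_widths: List[int]
    string_lens: List[int]
    num_batches: int


# Placeholder axis values. Only the axis counts and the batch counts
# (20 and 200) come from the commit message.
GRIDS: Dict[str, GridSpec] = {
    "pr": GridSpec([100, 1000, 10000], [10], [32], 20),
    "full": GridSpec([10, 100, 1000, 10000], [1, 10, 50], [8, 32, 256], 200),
}


def resolve_grid(env: Mapping[str, str]) -> GridSpec:
    """Pick the grid for BENCH_MODE, then apply BENCH_NUM_BATCHES if set."""
    mode = env.get("BENCH_MODE", "pr")  # default is an assumption
    if mode not in GRIDS:
        raise ValueError(f"unknown BENCH_MODE: {mode!r}")
    spec = GRIDS[mode]
    override = env.get("BENCH_NUM_BATCHES")
    if override:
        spec = spec._replace(num_batches=int(override))
    return spec


def configs(spec: GridSpec) -> List[Tuple[int, int, int]]:
    """Expand the grid into (batch_size, schema_width, string_len) triples."""
    return list(itertools.product(
        spec.batch_sizes, spec.schema_widths, spec.string_lens))
```

Passing the environment in as a mapping (for example `resolve_grid(os.environ)`) keeps the selection logic easy to exercise locally without exporting variables.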

Bench step itself ran fine on the prior CI; the Publish throughput / latency steps failed with `fatal: couldn't find remote ref gh-pages` because apache/texera has no gh-pages branch yet. The action attempts to fetch gh-pages even when auto-push is false (it normally wants the branch to compare against baseline).

Add `skip-fetch-gh-pages: true` to bypass the fetch; auto-push on push-to-main / schedule still creates the branch on first write. Once the dashboard is seeded, flip this to false to re-enable baseline comparison + alert-threshold logic.

Add `continue-on-error: true` on each publish step as a safety net: the bench data is already preserved in the uploaded artifact, so any gh-pages-side surprise (permissions glitch, transient git failure) shouldn't fail the bench job overall.

Three things:

1. Rename the bench job's display name to `Bench` (was lowercase `bench`).
2. Bench job → Python 3.12 (was 3.11). Matches the local dev venv and the runtime Python that texera_run_python_worker.py spawns; consistency removes a class of "works locally, drifts in CI" surprises.
3. PR-side comment with bench results. The Benchmarks workflow runs on `pull_request` events from forks, where GitHub forces GITHUB_TOKEN to read-only and refuses to inject any secret (AUTO_MERGE_TOKEN included; the restriction applies to ALL secrets, not just GITHUB_TOKEN). The fix is the ASF-approved `workflow_run` pattern: a separate workflow file that triggers when Benchmarks completes, runs in the base repo's trusted context, and has `pull-requests: write`.
   - Bench-side: write the PR number to bench-results/pr-number.txt (workflow_run.pull_requests is empty for fork PRs, so we ferry the number via artifact); render a markdown summary table to $GITHUB_STEP_SUMMARY for one-click visibility on the workflow page.
   - Comment-side (benchmarks-pr-comment.yml): download the artifact, read + strict-validate (`^[0-9]+$`) the PR number, sanitize the CSV (cap at 32 KB, neutralize any triple-backtick sequence so a malicious fork can't escape the code fence and inject arbitrary markdown), then upsert a marker-tagged comment so subsequent runs update in place rather than spam.
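The comment-side validation and sanitization can be sketched as follows. This is a Python illustration of the idea, not the workflow's actual script: the function names are hypothetical, stripping surrounding whitespace before matching is an assumption, and replacing each backtick triple with three apostrophes is just one possible way to neutralize a fence.

```python
import re

MAX_CSV_BYTES = 32 * 1024  # the 32 KB cap from the commit message
FENCE = "`" * 3            # a markdown code-fence delimiter


def parse_pr_number(raw: str) -> int:
    """Accept only a bare run of ASCII digits, as ^[0-9]+$ would."""
    text = raw.strip()
    # fullmatch avoids the Python quirk where $ also matches before a
    # trailing newline.
    if not re.fullmatch(r"[0-9]+", text):
        raise ValueError("PR number must be digits only")
    return int(text)


def sanitize_csv(raw: bytes) -> str:
    """Cap the size, decode leniently, and break any code-fence sequence."""
    clipped = raw[:MAX_CSV_BYTES]
    # errors="replace" also covers a multi-byte character cut by the cap.
    text = clipped.decode("utf-8", errors="replace")
    # Each triple becomes apostrophes, so at most two backticks stay adjacent.
    return text.replace(FENCE, "'''")
```

Validating the PR number before any API call matters because the file travels inside an artifact produced by untrusted fork code, so its contents cannot be assumed to be a number at all.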

The previous comment dumped raw CSV inside a `csv` code block, which forced reviewers to mentally column-align 13 fields per row. Render the actionable subset as a right-aligned markdown table (batch / schema_w / str_len / n_batches / tuples-s / MB-s / p50 ms / p99 ms / total ms), convert lat_*_us to milliseconds, drop redundant fields (config_idx, total_tuples, total_bytes, lat_p95_us), and tuck the raw sanitized CSV into a collapsed `<details>` for verifiability. Per-cell sanitizer escapes pipes and strips newlines to defeat table injection from the still-untrusted fork-PR-controlled CSV. Falls back to the raw-CSV view if header parsing fails.
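A minimal sketch of that rendering step is shown below, written in Python rather than the workflow's own scripting. The CSV column names are assumptions (the source only names `config_idx`, `total_tuples`, `total_bytes`, `lat_p95_us`, and the `lat_*_us` pattern), and the input is assumed to have already passed the CSV sanitizer.

```python
import csv
import io

FENCE = "`" * 3

# (assumed csv column, table header, value is in microseconds)
COLUMNS = [
    ("batch_size", "batch", False),
    ("schema_width", "schema_w", False),
    ("string_len", "str_len", False),
    ("num_batches", "n_batches", False),
    ("tuples_per_s", "tuples-s", False),
    ("mb_per_s", "MB-s", False),
    ("lat_p50_us", "p50 ms", True),
    ("lat_p99_us", "p99 ms", True),
    ("total_ms", "total ms", False),
]


def cell(value: str) -> str:
    """Make one untrusted value safe inside a markdown table cell."""
    flat = " ".join(value.splitlines()).strip()
    # Escape backslashes first so an input backslash cannot cancel the
    # escape that is added in front of a pipe.
    return flat.replace("\\", "\\\\").replace("|", "\\|")


def to_ms(value: str) -> str:
    """Convert microseconds to milliseconds; keep bad input as escaped text."""
    try:
        return f"{float(value) / 1000.0:.3f}"
    except ValueError:
        return cell(value)


def render(csv_text: str) -> str:
    """Render the actionable columns as a right-aligned markdown table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames or []
    if any(col not in names for col, _, _ in COLUMNS):
        # Header parsing failed: fall back to the raw CSV view.
        return FENCE + "csv\n" + csv_text.rstrip("\n") + "\n" + FENCE
    lines = [
        "| " + " | ".join(head for _, head, _ in COLUMNS) + " |",
        "|" + "|".join(" ---: " for _ in COLUMNS) + "|",
    ]
    for row in reader:
        cells = [
            to_ms(row[col] or "") if is_us else cell(row[col] or "")
            for col, _, is_us in COLUMNS
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Escaping per cell rather than on the whole CSV keeps the structural pipes that the renderer emits intact while still treating every value from the fork as hostile.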

Copilot

Pull request overview

Adds a bench-agnostic CI workflow for performance benchmarks and introduces the first suite: an end-to-end Arrow Flight ↔ PythonWorkflowWorker benchmark that emits CSV + GitHub Action Benchmark JSON outputs and can publish historical results to gh-pages and post summarized PR comments via a workflow_run listener.

Changes:

- Add bin/run-benchmarks.sh as a single entry point for running all benchmark suites and writing artifacts to bench-results/.
- Introduce ArrowFlightActorBench (Scala) to measure per-batch round-trip latency percentiles and throughput across a config grid, emitting CSV + JSON for github-action-benchmark.
- Add two GitHub Actions workflows: one to run/publish benchmarks (benchmarks.yml) and one to upsert a marker-tagged PR comment with sanitized results (benchmarks-pr-comment.yml).

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `bin/run-benchmarks.sh` | Adds a unified benchmark runner invoked by CI and standardizes artifact output under `bench-results/`. |
| `amber/src/test/scala/.../ArrowFlightActorBench.scala` | Implements the Arrow Flight E2E benchmark and writes CSV/JSON outputs for CI consumption and dashboard publishing. |
| `.github/workflows/benchmarks.yml` | Adds the main benchmark CI workflow with label-gated PR runs and publish-to-`gh-pages` behavior for main/schedule. |
| `.github/workflows/benchmarks-pr-comment.yml` | Adds a `workflow_run` listener that downloads benchmark artifacts and upserts a sanitized PR comment. |

- benchmarks.yml: TRIGGER_LABELS used `python`; required-checks.yml's LABEL_STACKS keys this stack as `pyamber` per labeler.yml, so PRs labeled `pyamber` were silently skipping the bench. Swap to the correct key.
- benchmarks-pr-comment.yml: switch listComments to paginate so the marker comment can still be located on PRs with >100 comments; prevents duplicate-comment-spam on long-running PRs.
- ArrowFlightActorBench.awaitOneDataFrameEcho: each receiveOne now uses the remaining time to the absolute deadline rather than the full timeout per loop iteration. A flood of ACK/ECM messages can no longer extend the overall wait beyond the caller's deadline.
- ArrowFlightActorBench scaladoc: replace stale `benchmark-results.csv` with the actual `bench-results/arrow-flight-e2e.{csv,*.json}` paths.
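The deadline fix in `awaitOneDataFrameEcho` follows a common pattern, sketched here in Python with a plain queue standing in for the actor test probe. The function name `await_matching` and the predicate argument are hypothetical; the point is only that each wait uses the time left until one absolute deadline.

```python
import queue
import time
from typing import Any, Callable


def await_matching(inbox: "queue.Queue[Any]",
                   is_target: Callable[[Any], bool],
                   timeout_s: float) -> Any:
    """Return the first matching message, never waiting past the deadline."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Recompute the budget on every iteration instead of reusing
        # timeout_s, so skipped messages cannot extend the total wait.
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no matching message before the deadline")
        try:
            message = inbox.get(timeout=remaining)
        except queue.Empty:
            raise TimeoutError("no matching message before the deadline") from None
        if is_target(message):
            return message
```

With the per-iteration full timeout that this replaces, N non-matching messages could stretch the wait to roughly N times the intended limit.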

aglinxinyuan

I think this change needs more discussion since it is fundamentally different from our regular test cases.

We can either:

- Discuss the design for how benchmark results should be shown to developers.
- Turn this into a regular test by having the benchmark enforce a threshold and return a pass/fail result, so it can guard against regressions like other tests.

aglinxinyuan · 2026-06-08T05:30:25Z

+#   - The PR number is validated against ^[0-9]+$ before being used in
+#     any API call, blocking ref injection.
+
+name: Benchmarks PR Comment


How should we interpret this number? We should have a baseline for comparison; otherwise, it's difficult to tell whether the result is good or bad.

We need to merge this PR first so that the main branch has a baseline number. According to GitHub Action Benchmark, results will be posted to gh-pages for continuous rendering: https://github.com/benchmark-action/github-action-benchmark#charts-on-github-pages

It will also add an alert comment on the PR for performance regressions: https://github.com/benchmark-action/github-action-benchmark#alert-comment-on-commit-page

But both need to be merged before they can be checked. I am testing on my fork, and we can follow up to fix issues. See the PR description.

aglinxinyuan · 2026-06-08T05:32:37Z

+  * subprocess may end up trying to launch from that path; move it aside for
+  * the run, or fix `amberHomePath` upstream.
+  */
+object ArrowFlightActorBench {


Should we keep the benchmark class in the test folder? Unlike the other tests, it doesn't produce a pass/fail result. Since it's intended for performance measurement, it may be better to place it in a separate benchmark folder.

I can move it to a bench folder.

Yicong-Huang · 2026-06-08T05:42:50Z

It is not a test case. This is a bench case. I hope to use it for informative purpose only, and do not block PR from merging. What do you think?

aglinxinyuan · 2026-06-08T05:46:17Z

> It is not a test case. This is a bench case. I hope to use it for informative purpose only, and do not block PR from merging. What do you think?

Sure. Whether to run it in CI or not is a minor issue. Please address the other comments as well.

github-actions Bot assigned Yicong-Huang Jun 8, 2026

github-actions Bot added the `engine`, `ci` (changes related to CI), and `dev` labels Jun 8, 2026

Yicong-Huang added 5 commits June 7, 2026 20:13

Yicong-Huang requested review from aglinxinyuan and bobbai00 June 8, 2026 05:10

aglinxinyuan requested a review from Copilot June 8, 2026 05:17

Copilot started reviewing on behalf of aglinxinyuan June 8, 2026 05:17

Copilot AI reviewed Jun 8, 2026

aglinxinyuan requested changes Jun 8, 2026

aglinxinyuan approved these changes Jun 8, 2026

Yicong-Huang wants to merge 7 commits into apache:main from Yicong-Huang:bench/arrow-flight-e2e

4 participants
