[MINOR][CI] Raise federated request timeout for multitenant Spark tests by Baunsgaard · Pull Request #2505 · apache/systemds

Baunsgaard · 2026-06-23T23:16:55Z

The `**.functions.federated.monitoring.**,**.functions.federated.multitenant.**` CI shard has been intermittently failing and, when reruns pile up, timing out at the 30-minute job cap (e.g. run 28041416088).

Root cause: the multi-tenant test config pins `sysds.federated.timeout` to 16s. This value bounds both federated instruction execution (`FederationMap`/`FederatedData`) and end-of-run stats collection (`FederatedStatistics.collectFedStats`). For the Spark-backed (`*SP`) variants of the reuse tests, Spark context creation alone is ~14s, so under shared CI load a single federated request regularly exceeds 16s and throws `TimeoutException`:

SPARK°rightIndex°… DMLRuntimeException -- java.util.concurrent.TimeoutException
Exception … thrown while getting the federated stats of the federated response
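
The failure mode above can be sketched in isolation. This is not the SystemDS implementation. It is a minimal, self-contained illustration of a bounded wait on an asynchronous response, with times scaled down from seconds to tens of milliseconds. The class name, the method `completesWithin`, and the 100 ms "work" figure are assumptions made for the sketch. Only the 14 : 16 : 60 ratio of startup cost to timeouts comes from the description.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedRequestSketch {

    // Hypothetical stand-in for one federated request: a fixed startup cost
    // (e.g. Spark context creation) followed by the actual work.
    static boolean completesWithin(long startupMs, long workMs, long timeoutMs) {
        CompletableFuture<String> response = CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(startupMs + workMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "ok";
        });
        try {
            // The caller waits at most timeoutMs for the response.
            response.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The request was still running when the bound expired.
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Scaled 1 s -> 10 ms: startup 140 ms, assumed work 100 ms.
        System.out.println("16s-like bound ok: " + completesWithin(140, 100, 160));
        System.out.println("60s-like bound ok: " + completesWithin(140, 100, 600));
    }
}
```

With the tighter bound, the startup cost alone consumes almost the whole budget, so any additional latency tips the request over the limit. The wider bound absorbs it.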

The spurious timeouts make `FederatedReuseReadTest.testModifiedValLineageSP` and `FederatedSerializationReuseTest.testRowSumsSP` fail; surefire then reruns each failing test, and the accumulated rerun time repeatedly pushes the shard past the 30-minute cap, cancelling the whole job.

This bumps the timeout to 60s. That is still a hard bound on a genuinely runaway request, so the suite cannot hang silently (the reason the value was lowered from 128 to 16 in 8f5a42c0), but it leaves enough headroom for the Spark variants to pass on the first attempt, which also removes the expensive reruns and brings the shard comfortably back under the time cap.
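
As a rough illustration of the headroom argument, the arithmetic can be written out using only the figures quoted above (~14s Spark context creation, 16s old timeout, 60s new timeout). The class and method names are hypothetical, and treating startup as exactly 14s is a simplifying assumption.

```java
public class TimeoutHeadroom {

    // Seconds left for the actual federated work once startup has been paid.
    static int headroomSeconds(int timeoutSeconds, int startupSeconds) {
        return timeoutSeconds - startupSeconds;
    }

    public static void main(String[] args) {
        int sparkStartup = 14; // approximate figure from the description
        System.out.println(headroomSeconds(16, sparkStartup)); // prints 2
        System.out.println(headroomSeconds(60, sparkStartup)); // prints 46
    }
}
```

Under these assumptions the old setting leaves about 2s for instruction execution or stats collection, while the new one leaves about 46s.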

Evidence (recurring flake, last ~2 weeks)

Same two tests, same TimeoutException signature, across independent runs:

| Run | Date | Failing test(s) |
| --- | --- | --- |
| 27910160773 | Jun 21 | `testModifiedValLineageSP`, `testRowSumsSP` |
| 27645002589 | Jun 16 | `testModifiedValLineageSP` |
| 27542041413 | Jun 15 | `testModifiedValLineageSP`, `testPlusScalarCP`, `testRowSumsSP` |
| 27363185073 | Jun 11 | `testModifiedValLineageSP`, `testRow…` |

Set sysds.federated.timeout in the multi-tenant test config from 16s to 60s.

The 16s bound was too aggressive for the Spark-backed (SP) variants of the federated multitenant reuse tests: Spark context creation alone takes ~14s, so under shared CI load a single federated request (both the rightIndex/rblk instruction execution and the end-of-run stats collection) routinely exceeded 16s and threw TimeoutException. These spurious timeouts caused FederatedReuseReadTest.testModifiedValLineageSP and FederatedSerializationReuseTest.testRowSumsSP to fail intermittently; surefire then reran each failing test, and the accumulated rerun time repeatedly pushed the monitoring/multitenant CI job past its 30-minute cap, cancelling the whole shard.

60s keeps a hard bound on a genuinely runaway request (so the suite still cannot hang silently) while giving the Spark variants enough headroom to complete on the first attempt, which also removes the expensive reruns.


codecov · 2026-06-24T00:06:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.51%. Comparing base (3871809) to head (f02420a).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2505      +/-   ##
============================================
+ Coverage     71.45%   71.51%   +0.06%     
- Complexity    48864    48947      +83     
============================================
  Files          1573     1573              
  Lines        189239   189334      +95     
  Branches      37128    37149      +21     
============================================
+ Hits         135215   135406     +191     
+ Misses        43576    43456     -120     
- Partials      10448    10472      +24


github-project-automation Bot added this to SystemDS PR Queue Jun 23, 2026

github-project-automation Bot moved this to In Progress in SystemDS PR Queue Jun 23, 2026

Shorten federated timeout comment in multitenant test config

f02420a

Replace the verbose explanatory comment with the original one-line description; the rationale lives in the commit history.

Baunsgaard merged commit be823ef into apache:main Jun 23, 2026
44 checks passed

github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Jun 23, 2026

Baunsgaard merged 2 commits into apache:main from Baunsgaard:fix/fed-multitenant-timeout
