[MINOR][CI] Raise federated request timeout for multitenant Spark tests #2505

Merged: Baunsgaard merged 2 commits into apache:main from Baunsgaard:fix/fed-multitenant-timeout on Jun 23, 2026

@Baunsgaard (Contributor)

The `**.functions.federated.monitoring.**,**.functions.federated.multitenant.**` CI shard has been failing intermittently and, when reruns pile up, timing out at the 30-minute job cap (e.g. run 28041416088).

Root cause: the multi-tenant test config pins `sysds.federated.timeout` to 16s. This value bounds both federated instruction execution (`FederationMap`/`FederatedData`) and end-of-run stats collection (`FederatedStatistics.collectFedStats`). For the Spark-backed (`*SP`) variants of the reuse tests, Spark context creation alone takes ~14s, so under shared CI load a single federated request regularly exceeds 16s and throws `TimeoutException`:

  • SPARK°rightIndex°… DMLRuntimeException -- java.util.concurrent.TimeoutException
  • Exception … thrown while getting the federated stats of the federated response
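
The failure mode described above can be reproduced in miniature. The sketch below is not SystemDS code: `FederatedTimeoutSketch` and `completesWithin` are hypothetical names, and a sleeping task stands in for a federated request whose cost is dominated by Spark context creation. Durations are scaled down (10 ms per second) so it runs quickly, and the 18 s request time is an assumed figure for "~14s startup plus CI load".

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of a bounded wait on a slow request; not SystemDS code.
class FederatedTimeoutSketch {

    // Submits a simulated request that takes workMillis and waits at most timeoutMillis.
    // Returns true if the response arrived in time, false if the wait timed out.
    static boolean completesWithin(long workMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> pending = pool.submit(() -> {
                Thread.sleep(workMillis); // stands in for Spark startup plus the instruction
                return "response";
            });
            pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the bound was hit before the request finished
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow(); // interrupt the simulated request if it is still running
        }
    }

    public static void main(String[] args) {
        long requestMillis = 180; // assumed: 18 s scaled down to 180 ms
        System.out.println("16s bound ok: " + completesWithin(requestMillis, 160));
        System.out.println("60s bound ok: " + completesWithin(requestMillis, 600));
    }
}
```

The point of the sketch is only that a bound sitting a couple of seconds above the fixed startup cost leaves almost no room for scheduling noise, whereas a larger bound still terminates a request that never returns.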

The spurious timeouts make FederatedReuseReadTest.testModifiedValLineageSP and FederatedSerializationReuseTest.testRowSumsSP fail; surefire then reruns each failing test, and the accumulated rerun time repeatedly pushes the shard past the 30-minute cap, cancelling the whole job.
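
The way reruns eat into the job cap can be written as simple arithmetic. This is a rough sketch, assuming every rerun of a flaky test also fails; `RerunBudgetSketch`, the rerun count, the per-attempt duration and the baseline shard time are placeholders chosen for illustration, not measurements from these runs. Only the 30-minute cap comes from the description.

```java
// Hypothetical back-of-the-envelope model of rerun cost; not part of the SystemDS build.
class RerunBudgetSketch {

    static final long JOB_CAP_SECONDS = 30 * 60; // the 30-minute job cap

    // Worst case: each failing test runs once, then `reruns` more times, all at full cost.
    static long flakyTestSeconds(int failingTests, int reruns, long secondsPerAttempt) {
        return (long) failingTests * (1 + reruns) * secondsPerAttempt;
    }

    // True if the shard's normal run time plus the flaky-test overhead exceeds the cap.
    static boolean exceedsCap(long baselineSeconds, long flakySeconds) {
        return baselineSeconds + flakySeconds > JOB_CAP_SECONDS;
    }

    public static void main(String[] args) {
        // Placeholder inputs: 2 flaky tests, 2 reruns each, 90 s per attempt, 25 min baseline.
        long overhead = flakyTestSeconds(2, 2, 90);
        System.out.println("rerun overhead (s): " + overhead);
        System.out.println("exceeds cap: " + exceedsCap(25 * 60, overhead));
    }
}
```

Under this model the overhead grows linearly with both the number of flaky tests and the rerun count, which is why removing the first-attempt failures helps more than shaving individual test time.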

This bumps the timeout to 60s. That is still a hard bound on a genuinely runaway request, so the suite cannot hang silently (the reason the value was lowered from 128 to 16 in 8f5a42c0), but it leaves enough headroom for the Spark variants to pass on the first attempt. Passing on the first attempt also removes the expensive reruns and brings the shard comfortably back under the time cap.
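
For readers unfamiliar with the setting, the following sketch shows what reading such a value might look like. The XML layout (a `<root>` element with one child per setting) is an assumption, since the config file itself is not shown on this page; only the property name `sysds.federated.timeout` and the value 60 come from the description. `TimeoutConfigSketch` and `readTimeoutSeconds` are hypothetical helpers, not SystemDS APIs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical reader for a timeout setting; the config layout is assumed.
class TimeoutConfigSketch {

    // Assumed shape of the multi-tenant test config after this change.
    static final String CONFIG_XML =
        "<root>\n" +
        "  <sysds.federated.timeout>60</sysds.federated.timeout>\n" +
        "</root>\n";

    // Returns the configured timeout in seconds, or fallbackSeconds if the element is absent.
    static int readTimeoutSeconds(String xml, int fallbackSeconds) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("sysds.federated.timeout");
            if (nodes.getLength() == 0) {
                return fallbackSeconds;
            }
            return Integer.parseInt(nodes.item(0).getTextContent().trim());
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read timeout from config", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timeout (s): " + readTimeoutSeconds(CONFIG_XML, -1));
    }
}
```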

Evidence (recurring flake, last ~2 weeks)

Same two tests, same TimeoutException signature, across independent runs:

| Run | Date | Failing test(s) |
| --- | --- | --- |
| 27910160773 | Jun 21 | testModifiedValLineageSP, testRowSumsSP |
| 27645002589 | Jun 16 | testModifiedValLineageSP |
| 27542041413 | Jun 15 | testModifiedValLineageSP, testPlusScalarCP, testRowSumsSP |
| 27363185073 | Jun 11 | testModifiedValLineageSP, testRow… |

Set sysds.federated.timeout in the multi-tenant test config from 16s to
60s. The 16s bound was too aggressive for the Spark-backed (SP) variants
of the federated multitenant reuse tests: Spark context creation alone
takes ~14s, so under shared CI load a single federated request (both the
rightIndex/rblk instruction execution and the end-of-run stats
collection) routinely exceeded 16s and threw TimeoutException.

These spurious timeouts caused FederatedReuseReadTest.testModifiedValLineageSP
and FederatedSerializationReuseTest.testRowSumsSP to fail intermittently;
surefire then reran each failing test, and the accumulated rerun time
repeatedly pushed the monitoring/multitenant CI job past its 30-minute
cap, cancelling the whole shard.

60s keeps a hard bound on a genuinely runaway request (so the suite still
cannot hang silently) while giving the Spark variants enough headroom to
complete on the first attempt, which also removes the expensive reruns.

Replace the verbose explanatory comment with the original one-line
description; the rationale lives in the commit history.

@Baunsgaard merged commit be823ef into apache:main on Jun 23, 2026. 44 checks passed.

@github-project-automation (bot) moved this from In Progress to Done in SystemDS PR Queue on Jun 23, 2026.
@codecov (bot) commented on Jun 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.51%. Comparing base (3871809) to head (f02420a).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2505      +/-   ##
============================================
+ Coverage     71.45%   71.51%   +0.06%     
- Complexity    48864    48947      +83     
============================================
  Files          1573     1573              
  Lines        189239   189334      +95     
  Branches      37128    37149      +21     
============================================
+ Hits         135215   135406     +191     
+ Misses        43576    43456     -120     
- Partials      10448    10472      +24     
