Skip to content

VSB-TUO/Cherry-pick flaky IT fixes from dtq-dev (#1321, #1344)#1360

Merged
milanmajchrak merged 2 commits into
customer/vsb-tuofrom
fix/cherry-pick-flaky-it-fixes-vsb-tuo
Jul 3, 2026
Merged

VSB-TUO/Cherry-pick flaky IT fixes from dtq-dev (#1321, #1344)#1360
milanmajchrak merged 2 commits into
customer/vsb-tuofrom
fix/cherry-pick-flaky-it-fixes-vsb-tuo

Conversation

@Kasinhou

@Kasinhou Kasinhou commented Jul 3, 2026

Copy link
Copy Markdown

Summary

Fixes the VersioningWithRelationshipsIT failures reported in dataquest-dev/dspace-customers#772 (the failures that surfaced once #1359 unblocked the suite) by cherry-picking the flaky-IT fixes that already landed on dtq-dev:

Both cherry-picks are -x annotated. dtq-dev has been 20/20 green on scheduled CI since these landed; customer/vsb-tuo was missing them.

Root cause of the #772 failures

VersioningWithRelationshipsIT.test_placeRecalculationNoUseForPlace → destroy:201 » RuntimeException: java.util.ConcurrentModificationException
VersioningWithRelationshipsIT.test_createNewVersionOfItemOnRightSideOfRelationships:872 NullPointer

The CME comes from a cross-thread race fixed by #1321: AbstractHibernateDAO$1.finalize() (GC Finalizer thread) closed the Hibernate stream returned by iterate(), mutating the owning Session's non-thread-safe JDBC ResourceRegistry concurrently with the test thread's commit during @After builder cleanup (releaseResources → HashMap.forEach → CME) — the exact stack in the #772 report. Which IT gets hit depends on GC timing, which is why the victim moved around (this time VersioningWithRelationshipsIT). The NPE at line 872 (an ItemBuilder chain) is collateral damage in the same class once cleanup state is corrupted. #1321 removes the finalizer and closes the stream on the owning thread; the test file itself is identical on both branches, and no vsb-tuo-specific code is involved — this is missing-fix drift, not a customization issue.

What was deliberately NOT picked / preserved

  • test: de-flake ItemHandleCheckerIT (mock the live handle resolver) #1346 (de-flake ItemHandleCheckerIT) skipped: ItemHandleChecker and its IT don't exist on customer/vsb-tuo (dtq-dev-only feature), so there is nothing to de-flake here.
  • vsb-tuo ORCID customization preserved: this branch intentionally disables live ORCID lookups (CachingOrcidRestConnector.getLabel() returns null) and @Ignores the 3 unit tests exercising it. The cherry-pick keeps both; only the httpGet-mockability change and log cleanup came through. Verified: CachingOrcidRestConnectorTest = 8 run / 3 skipped / 0 failures.

⚠️ Flag for follow-up (pre-existing on dtq-dev too)

The new guard test AbstractHibernateDAOIteratorIT from #1321 is silently excluded from every CI run by the parent pom's failsafe/surefire <exclude>**/Abstract*</exclude> rule (its name starts with "Abstract"). It passes when run explicitly (-Dit.test=AbstractHibernateDAOIteratorIT, 1/1), but as-is it guards nothing in CI — on dtq-dev as well. Suggest a follow-up PR on dtq-dev renaming it (e.g. HibernateDAOIteratorIT); I kept the cherry-pick verbatim here to avoid diverging from dtq-dev.

Test evidence

Check Result
VersioningWithRelationshipsIT (local, targeted) 9/9 pass (was 2 errors in CI)
VersioningWithRelationshipsIT (local, inside full suite) 9/9 pass
EmbargoImportIT (regression check for #1359) 5/5 pass
CachingOrcidRestConnectorTest 8 run, 3 skipped (customization), 0 failures
dspace-server-webapp test sources (4 ITs touched) compile clean
Full dspace-api IT suite (local, Windows) 458 run; only failures are the known Windows-local ItemImportCLIIT temp-file locks (FileSystemException: process cannot access the file), pre-existing and identical to the pre-cherry-pick baseline run; passes on Linux CI
dtq-dev scheduled CI with these fixes 20/20 success

Closes dataquest-dev/dspace-customers#772 (vsb-tuo BE part).

cc @Kasinhou

🤖 Generated with Claude Code

milanmajchrak and others added 2 commits July 3, 2026 12:55
* Fixed integration tests because they use to fail sometimes

* test: stabilize flaky CI tests (Hibernate cleanup retry, Shibboleth auth sequence reset, ORCID assertion hardening)

* test: fix flaky ITs at the source (live ORCID, Shibboleth config-reload) + Hibernate CME diagnostics

ORCID CachingOrcidRestConnectorTest no longer hits the live ORCID sandbox:
search/getLabel/search_fail mock the HTTP layer (httpGet made protected) with a
canned expanded-search response, so they are deterministic instead of asserting
against fluctuating sandbox data.

Shibboleth WWW-Authenticate flakiness: add a test-only config-definition.xml with
config-reload=false. Runtime setProperty(...AuthenticationMethod...) overrides were
silently discarded whenever the auto-reload listener rebuilt the combined config
(restoring clarin-dspace.cfg's [Password, ClarinShib] default), intermittently
leaking 'password realm' into the header. Verified: with auto-reload off the override
survives; the explicit reloadConfig() reset in @after still works.

Hibernate ConcurrentModificationException in @after cleanup: the per-session JDBC
ResourceRegistry is not thread-safe, so the CME means two threads touch one Session.
Capture a full thread dump on CME (target/cme-dumps/) to identify the colliding
thread in CI; keep a resilient retry so an already-passed test isn't failed by this
teardown race. (Context.finalize() ruled out: sessions are thread-local.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: revert IT-env config-reload=false override

Disabling config auto-reload globally in the test environment broke
AuthorizeConfigIT.testReloadConfiguration, which deliberately verifies that
AuthorizeConfiguration picks up live changes written to local.cfg via the
auto-reload mechanism. Auto-reload is a tested feature here, so it must not be
disabled to work around the Shibboleth WWW-Authenticate flakiness.

The Shibboleth flakiness (runtime setProperty override discarded when the combined
config is rebuilt) needs a reload-safe fix in the auth test instead; tracked
separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: make Shibboleth auth-sequence override reload-safe (fix WWW-Authenticate flakiness)

The flaky 'password realm' leak in AuthenticationRestControllerIT had this root cause:
configurationService.setProperty(plugin.sequence...AuthenticationMethod, ...) only
updates the in-memory view of the combined configuration. That view is discarded
whenever it is rebuilt, and the auto-reload listener rebuilds it as soon as any
reloadable cfg file's mtime changes mid-run (e.g. another test writing local.cfg).
When that rebuild lands between the override and the request, clarin-dspace.cfg's
default [PasswordAuthentication, ClarinShibAuthentication] returns and 'password realm'
leaks into the header. The previous clear-then-set helper did not help (it is
equivalent to a plain setProperty).

Fix: set the sequence via a JVM system property (highest-precedence override layer,
re-read on every rebuild) + reloadConfig(), and clear it in @after. This survives
auto-reload without disabling it (so AuthorizeConfigIT, which verifies auto-reload,
still passes).

Verified in the real /api/authn/status endpoint: an explicit reloadConfig() after a
setProperty override reproduces the leak, while the system-property approach keeps the
header Shibboleth-only across rebuilds. Full AuthenticationRestControllerIT (43 tests)
passes, and running it alongside ClarinAuthenticationRestControllerIT /
AnonymousAdditionalAuthorizationFilterIT confirms the property does not leak across classes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: add Hibernate concurrency monitor + CI upload to pinpoint @after CME

The intermittent ConcurrentModificationException in @after cleanup is a genuine
cross-thread data race on Hibernate's per-session, non-thread-safe JDBC
ResourceRegistry (xref): a second thread mutates the test thread's session while
it commits/rolls back. Verified against hibernate-core-5.6.15 sources that the
releaseResources forEach lambda never touches xref, so single-thread re-entrancy
is impossible (this disproves the earlier HHH-15116 single-thread theory). The
window is microseconds, so it does not reproduce locally even with deliberate
cross-thread session sharing; it only surfaces under CI load.

A live thread dump of a running IT JVM shows NO legitimate background thread ever
touches Hibernate (all are Solr/HTTP/Jetty/JVM). So the culprit is a transient
thread, and any non-test thread caught inside Hibernate JDBC/session code is by
definition the offender.

- HibernateConcurrencyMonitor: JVM-wide background sampler that records (de-duped)
  any non-test thread found inside org.hibernate.{resource.jdbc,engine.jdbc,
  internal.SessionImpl}; flushed to target/cme-dumps/ on CME and at JVM shutdown.
  Pure observer, never changes test behaviour.
- AbstractIntegrationTestWithDatabase: start the monitor and mark the JUnit thread
  in setUp; flush it alongside the existing thread dump on a captured CME.
- build.yml: always-upload **/target/cme-dumps/** (not gated on failure) so a
  successful cleanup retry no longer hides the diagnostic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix: don't close iterate() Hibernate stream from a finalize() (root cause of flaky CME)

Root cause of the intermittent ConcurrentModificationException in @after integration-test
cleanup, identified via the HibernateConcurrencyMonitor CI dumps: the GC Finalizer thread,
running org.dspace.core.AbstractHibernateDAO$1.finalize(), closed the Hibernate Stream
returned by AbstractHibernateDAO.iterate(). Closing a stream closes its ScrollableResults,
which mutates the owning Session's per-session, non-thread-safe JDBC ResourceRegistry (xref)
- but on the Finalizer thread, concurrently with the thread that owns the session. When that
collided with the owning thread's commit/rollback (releaseResources -> xref.forEach), it threw
ConcurrentModificationException. The CI dumps showed this exact finalizer stack as the only
non-test thread inside Hibernate in dspace-api, and present in dspace-server-webapp too.

This was confirmed genuine cross-thread access (not the previously assumed single-thread/HHH
bug): verified against hibernate-core-5.6.15 sources that the releaseResources forEach lambda
never touches xref, so single-thread re-entrancy is impossible.

Fix: close the backing stream on the owning thread when iteration is exhausted, and remove the
finalize() override. An iterator abandoned before exhaustion is released safely when its
Context/Session is closed (releaseResources then runs on the owning thread).

Adds AbstractHibernateDAOIteratorIT to guard against reintroducing a stream-closing finalizer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix: remove broken Context.finalize() that leaked finalizer-thread sessions

Context.finalize() ran on the GC Finalizer thread and called
dbConnection.isTransActionAlive()/abort(), which resolve sessionFactory.getCurrentSession()
to a brand-new session bound to the Finalizer thread - never the (now-unreachable) thread
that opened the Context. So it could not roll back the Context's transaction anyway; it only
opened and leaked a throwaway Hibernate session on the Finalizer thread, and threw
IllegalStateException once the SessionFactory was closed (seen in the CI thread dumps used to
diagnose the flaky integration-test ConcurrentModificationException).

Abandoned Contexts are cleaned up safely when their owning thread's session ends; callers
already close Contexts via complete()/abort()/try-with-resources (Context is AutoCloseable).

Removes the now-redundant ContextTest.testFinalize (close()/abort() are covered by
testClose/testAbort/testAbort2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: remove flaky-CME diagnostic scaffolding and teardown retry (root cause fixed)

The intermittent @after ConcurrentModificationException is now fixed at its source
(AbstractHibernateDAO.iterate no longer closes its Hibernate stream from a finalizer; broken
Context.finalize() removed). The temporary diagnostics that pinpointed it are no longer needed:

- Restore AbstractIntegrationTestWithDatabase.destroy() to its plain form (drop the 3x cleanup
  retry and the per-CME thread dump) and remove the HibernateConcurrencyMonitor wiring.
- Delete HibernateConcurrencyMonitor.
- Revert the build.yml always-upload of target/cme-dumps.

CI keeps -Dfailsafe.rerunFailingTestsCount=2 as the generic flaky-test safety net.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* revert: keep Context.finalize() (out of scope, not the CME cause)

Reverts the Context.finalize() removal (and the ContextTest.testFinalize deletion). The flaky
@after ConcurrentModificationException is fully fixed by the AbstractHibernateDAO.iterate()
change alone; Context.finalize() runs on a single GC Finalizer thread against its own
finalizer-thread session and provably cannot cause that cross-thread xref race. Removing a
finalizer from this core, widely-used class is a riskier change that does not belong in a
flaky-test fix, so leave Context untouched. The (pre-existing, harmless) finalizer-thread
session it opens can be addressed separately if desired.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: address review comments on flaky-test fix

- AbstractHibernateDAOIteratorIT: add Javadoc to the test method and walk the iterator's full
  class hierarchy (up to Object) when asserting no finalize() override, so a finalizer
  reintroduced on a superclass/helper is also caught (per CodeRabbit review).
- AuthenticationRestControllerIT: wrap an over-length (122 char) Javadoc line.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 8f4b80d)
* test: de-flake ORCID cache tests and ZIP-download IT

Two independent flaky tests keep turning the dtq-dev pipeline red after #1321:

1. CachingOrcidRestConnectorTest.testCachable / testCacheableWithError still
   hit the live ORCID sandbox (#1321 only mocked getLabel/search/search_fail).
   They use the real Spring @Cacheable bean (to exercise the CGLIB caching
   proxy), so they could not be spied. Point the bean's apiURL at a local
   MockWebServer serving the canned orcid-expanded-search.xml instead -- keeps
   the real caching proxy and HTTP transport under test, removes the network
   dependency. No production change.

2. MetadataBitstreamControllerIT.downloadAllZip compared the response to a
   locally-built ZIP byte-for-byte; a ZIP entry's DOS timestamp (2s resolution)
   defaults to "now" on both sides and differs across a 2s boundary. Assert the
   unzipped entry name + content instead of raw bytes.

Test-only changes. Verified locally (ORCID class 8/8 green, repeated offline
runs; webapp test module compiles; checkstyle clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: assert exactly one ZIP entry in downloadAllZip

Address CodeRabbit: a Map keyed by entry name could mask duplicate ZIP entries
(same filename overwrites). Track the entry count separately and assert it is 1,
so an unexpected extra entry fails the test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test: de-flake SSR authorization and Solr TopCountries ITs

Two more intermittent failures on dtq-dev, both addressed at the root cause
(test-only changes):

1. AuthorizationRestRepositoryIT.findByObjectSSRTest flaked with 400 instead of
   200. The test sets dspace.server.ssr.url (used by Utils.getBaseObjectRestFromUri
   to resolve the request URI) and the AlwaysThrowExceptionFeature.turnoff flag via
   configurationService.setProperty(...). Such in-memory overrides are silently
   dropped when the combined config is rebuilt by the auto-reload listener (fires on
   any reloadable cfg file mtime change mid-run). When ssr.url is dropped the URI no
   longer resolves -> 400; if turnoff were also dropped the /search/object path would
   let alwaysexception throw -> 500. Fix: set both via JVM system properties (+
   reloadConfig) so they sit in the highest-precedence override layer and survive
   auto-reload, cleared in @after. Same pattern as #1321's Shibboleth fix
   (AuthenticationRestControllerIT#setAuthenticationMethodSequence). Applied to all
   three SSR tests.

2. StatisticsRestRepositoryIT.topCountriesReport_Community_Visited flaked with an
   empty report (points: []). postView() commits with waitSearcher=false, so the
   just-posted view events can be invisible to the immediately-following report query
   (and a dropped solr-statistics.autoCommit override would skip the commit entirely).
   Fix: after posting the view events, force solrLoggerService.commit() (waitSearcher
   =true) so the events are flushed and visible before the report is queried.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit f6f1356)
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 51eea2e3-d34d-4804-ac31-f97bccd928d2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@Kasinhou Kasinhou self-assigned this Jul 3, 2026
@Kasinhou Kasinhou requested review from Copilot and milanmajchrak July 3, 2026 12:17
@milanmajchrak milanmajchrak merged commit ec516e9 into customer/vsb-tuo Jul 3, 2026
14 checks passed

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR cherry-picks test-stability fixes to address flaky integration test failures (notably VersioningWithRelationshipsIT) by removing a cross-thread Hibernate stream-close race, and by making several tests deterministic (no network/time/config-reload races).

Changes:

  • Fix AbstractHibernateDAO#iterate() to close the underlying Hibernate stream on the owning thread when iteration is exhausted (removing the finalize()-based close).
  • De-flake multiple tests by eliminating external/time-sensitive dependencies: mocked ORCID responses, robust ZIP assertions, forced Solr commit, and config overrides via JVM system properties to survive auto-reload.
  • Add a regression IT to prevent reintroducing a finalizer-based stream close (note: currently named in a way that appears excluded from CI).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
dspace-server-webapp/src/test/java/org/dspace/app/rest/StatisticsRestRepositoryIT.java Forces a Solr commit (waitSearcher=true) to prevent empty report results due to visibility timing.
dspace-server-webapp/src/test/java/org/dspace/app/rest/MetadataBitstreamControllerIT.java Replaces byte-for-byte ZIP comparison with assertions on ZIP entry name/content to avoid timestamp flakiness.
dspace-server-webapp/src/test/java/org/dspace/app/rest/AuthorizationRestRepositoryIT.java Switches SSR-related overrides to JVM system properties + cleanup to survive config auto-reload mid-test.
dspace-server-webapp/src/test/java/org/dspace/app/rest/AuthenticationRestControllerIT.java Moves auth plugin sequence overrides to JVM system properties + cleanup to prevent intermittent config-reload leaks.
dspace-api/src/test/resources/org/dspace/external/orcid-expanded-search.xml Adds a canned ORCID expanded-search fixture to avoid live sandbox dependency.
dspace-api/src/test/java/org/dspace/external/CachingOrcidRestConnectorTest.java Uses canned responses / MockWebServer to make ORCID tests deterministic and validate caching without live network.
dspace-api/src/test/java/org/dspace/core/AbstractHibernateDAOIteratorIT.java Adds regression coverage for “no finalizer on iterate() iterator” (but name suggests it may be excluded from CI).
dspace-api/src/main/java/org/dspace/external/CachingOrcidRestConnector.java Makes httpGet protected to allow tests to stub transport.
dspace-api/src/main/java/org/dspace/core/AbstractHibernateDAO.java Removes finalize()-based stream closing and closes the stream when iteration is exhausted.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 890 to +892
String siteUri = "http://ssr.example.com/api/core/sites/" + siteRest.getId();

// disarm the alwaysThrowExceptionFeature
configurationService.setProperty("org.dspace.app.rest.authorization.AlwaysThrowExceptionFeature.turnoff", true);
configurationService.setProperty("dspace.server.ssr.url", "http://ssr.example.com/api");
// Disarm the alwaysThrowExceptionFeature and define the SSR base URL via system properties so the
//'joh' is alphabetic, so the connector turns it into an edismax wildcard query ("joh || joh*") that matches
//many authors; the canned response carries num-found=1725.
assertEquals("Unexpected num-found for the canned ORCID response", 1725L, (long) search.numFound());
assertEquals("Connor, John", search.results().get(0).label());
* stream on the owning thread once the iteration is exhausted. This test guards against reintroducing any
* stream-closing finalizer on the returned iterator.</p>
*/
public class AbstractHibernateDAOIteratorIT extends AbstractIntegrationTestWithDatabase {
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants