VSB-TUO/Cherry-pick flaky IT fixes from dtq-dev (#1321, #1344)#1360
Conversation
* Fixed integration tests because they use to fail sometimes * test: stabilize flaky CI tests (Hibernate cleanup retry, Shibboleth auth sequence reset, ORCID assertion hardening) * test: fix flaky ITs at the source (live ORCID, Shibboleth config-reload) + Hibernate CME diagnostics ORCID CachingOrcidRestConnectorTest no longer hits the live ORCID sandbox: search/getLabel/search_fail mock the HTTP layer (httpGet made protected) with a canned expanded-search response, so they are deterministic instead of asserting against fluctuating sandbox data. Shibboleth WWW-Authenticate flakiness: add a test-only config-definition.xml with config-reload=false. Runtime setProperty(...AuthenticationMethod...) overrides were silently discarded whenever the auto-reload listener rebuilt the combined config (restoring clarin-dspace.cfg's [Password, ClarinShib] default), intermittently leaking 'password realm' into the header. Verified: with auto-reload off the override survives; the explicit reloadConfig() reset in @after still works. Hibernate ConcurrentModificationException in @after cleanup: the per-session JDBC ResourceRegistry is not thread-safe, so the CME means two threads touch one Session. Capture a full thread dump on CME (target/cme-dumps/) to identify the colliding thread in CI; keep a resilient retry so an already-passed test isn't failed by this teardown race. (Context.finalize() ruled out: sessions are thread-local.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test: revert IT-env config-reload=false override Disabling config auto-reload globally in the test environment broke AuthorizeConfigIT.testReloadConfiguration, which deliberately verifies that AuthorizeConfiguration picks up live changes written to local.cfg via the auto-reload mechanism. Auto-reload is a tested feature here, so it must not be disabled to work around the Shibboleth WWW-Authenticate flakiness. The Shibboleth flakiness (runtime setProperty override discarded when the combined config is rebuilt) needs a reload-safe fix in the auth test instead; tracked separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test: make Shibboleth auth-sequence override reload-safe (fix WWW-Authenticate flakiness) The flaky 'password realm' leak in AuthenticationRestControllerIT had this root cause: configurationService.setProperty(plugin.sequence...AuthenticationMethod, ...) only updates the in-memory view of the combined configuration. That view is discarded whenever it is rebuilt, and the auto-reload listener rebuilds it as soon as any reloadable cfg file's mtime changes mid-run (e.g. another test writing local.cfg). When that rebuild lands between the override and the request, clarin-dspace.cfg's default [PasswordAuthentication, ClarinShibAuthentication] returns and 'password realm' leaks into the header. The previous clear-then-set helper did not help (it is equivalent to a plain setProperty). Fix: set the sequence via a JVM system property (highest-precedence override layer, re-read on every rebuild) + reloadConfig(), and clear it in @after. This survives auto-reload without disabling it (so AuthorizeConfigIT, which verifies auto-reload, still passes). Verified in the real /api/authn/status endpoint: an explicit reloadConfig() after a setProperty override reproduces the leak, while the system-property approach keeps the header Shibboleth-only across rebuilds. Full AuthenticationRestControllerIT (43 tests) passes, and running it alongside ClarinAuthenticationRestControllerIT / AnonymousAdditionalAuthorizationFilterIT confirms the property does not leak across classes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test: add Hibernate concurrency monitor + CI upload to pinpoint @after CME The intermittent ConcurrentModificationException in @after cleanup is a genuine cross-thread data race on Hibernate's per-session, non-thread-safe JDBC ResourceRegistry (xref): a second thread mutates the test thread's session while it commits/rolls back. Verified against hibernate-core-5.6.15 sources that the releaseResources forEach lambda never touches xref, so single-thread re-entrancy is impossible (this disproves the earlier HHH-15116 single-thread theory). The window is microseconds, so it does not reproduce locally even with deliberate cross-thread session sharing; it only surfaces under CI load. A live thread dump of a running IT JVM shows NO legitimate background thread ever touches Hibernate (all are Solr/HTTP/Jetty/JVM). So the culprit is a transient thread, and any non-test thread caught inside Hibernate JDBC/session code is by definition the offender. - HibernateConcurrencyMonitor: JVM-wide background sampler that records (de-duped) any non-test thread found inside org.hibernate.{resource.jdbc,engine.jdbc, internal.SessionImpl}; flushed to target/cme-dumps/ on CME and at JVM shutdown. Pure observer, never changes test behaviour. - AbstractIntegrationTestWithDatabase: start the monitor and mark the JUnit thread in setUp; flush it alongside the existing thread dump on a captured CME. - build.yml: always-upload **/target/cme-dumps/** (not gated on failure) so a successful cleanup retry no longer hides the diagnostic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix: don't close iterate() Hibernate stream from a finalize() (root cause of flaky CME) Root cause of the intermittent ConcurrentModificationException in @after integration-test cleanup, identified via the HibernateConcurrencyMonitor CI dumps: the GC Finalizer thread, running org.dspace.core.AbstractHibernateDAO$1.finalize(), closed the Hibernate Stream returned by AbstractHibernateDAO.iterate(). Closing a stream closes its ScrollableResults, which mutates the owning Session's per-session, non-thread-safe JDBC ResourceRegistry (xref) - but on the Finalizer thread, concurrently with the thread that owns the session. When that collided with the owning thread's commit/rollback (releaseResources -> xref.forEach), it threw ConcurrentModificationException. The CI dumps showed this exact finalizer stack as the only non-test thread inside Hibernate in dspace-api, and present in dspace-server-webapp too. This was confirmed genuine cross-thread access (not the previously assumed single-thread/HHH bug): verified against hibernate-core-5.6.15 sources that the releaseResources forEach lambda never touches xref, so single-thread re-entrancy is impossible. Fix: close the backing stream on the owning thread when iteration is exhausted, and remove the finalize() override. An iterator abandoned before exhaustion is released safely when its Context/Session is closed (releaseResources then runs on the owning thread). Adds AbstractHibernateDAOIteratorIT to guard against reintroducing a stream-closing finalizer. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix: remove broken Context.finalize() that leaked finalizer-thread sessions Context.finalize() ran on the GC Finalizer thread and called dbConnection.isTransActionAlive()/abort(), which resolve sessionFactory.getCurrentSession() to a brand-new session bound to the Finalizer thread - never the (now-unreachable) thread that opened the Context. So it could not roll back the Context's transaction anyway; it only opened and leaked a throwaway Hibernate session on the Finalizer thread, and threw IllegalStateException once the SessionFactory was closed (seen in the CI thread dumps used to diagnose the flaky integration-test ConcurrentModificationException). Abandoned Contexts are cleaned up safely when their owning thread's session ends; callers already close Contexts via complete()/abort()/try-with-resources (Context is AutoCloseable). Removes the now-redundant ContextTest.testFinalize (close()/abort() are covered by testClose/testAbort/testAbort2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test: remove flaky-CME diagnostic scaffolding and teardown retry (root cause fixed) The intermittent @after ConcurrentModificationException is now fixed at its source (AbstractHibernateDAO.iterate no longer closes its Hibernate stream from a finalizer; broken Context.finalize() removed). The temporary diagnostics that pinpointed it are no longer needed: - Restore AbstractIntegrationTestWithDatabase.destroy() to its plain form (drop the 3x cleanup retry and the per-CME thread dump) and remove the HibernateConcurrencyMonitor wiring. - Delete HibernateConcurrencyMonitor. - Revert the build.yml always-upload of target/cme-dumps. CI keeps -Dfailsafe.rerunFailingTestsCount=2 as the generic flaky-test safety net. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * revert: keep Context.finalize() (out of scope, not the CME cause) Reverts the Context.finalize() removal (and the ContextTest.testFinalize deletion). The flaky @after ConcurrentModificationException is fully fixed by the AbstractHibernateDAO.iterate() change alone; Context.finalize() runs on a single GC Finalizer thread against its own finalizer-thread session and provably cannot cause that cross-thread xref race. Removing a finalizer from this core, widely-used class is a riskier change that does not belong in a flaky-test fix, so leave Context untouched. The (pre-existing, harmless) finalizer-thread session it opens can be addressed separately if desired. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test: address review comments on flaky-test fix - AbstractHibernateDAOIteratorIT: add Javadoc to the test method and walk the iterator's full class hierarchy (up to Object) when asserting no finalize() override, so a finalizer reintroduced on a superclass/helper is also caught (per CodeRabbit review). - AuthenticationRestControllerIT: wrap an over-length (122 char) Javadoc line. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 8f4b80d)
* test: de-flake ORCID cache tests and ZIP-download IT Two independent flaky tests keep turning the dtq-dev pipeline red after #1321: 1. CachingOrcidRestConnectorTest.testCachable / testCacheableWithError still hit the live ORCID sandbox (#1321 only mocked getLabel/search/search_fail). They use the real Spring @Cacheable bean (to exercise the CGLIB caching proxy), so they could not be spied. Point the bean's apiURL at a local MockWebServer serving the canned orcid-expanded-search.xml instead -- keeps the real caching proxy and HTTP transport under test, removes the network dependency. No production change. 2. MetadataBitstreamControllerIT.downloadAllZip compared the response to a locally-built ZIP byte-for-byte; a ZIP entry's DOS timestamp (2s resolution) defaults to "now" on both sides and differs across a 2s boundary. Assert the unzipped entry name + content instead of raw bytes. Test-only changes. Verified locally (ORCID class 8/8 green, repeated offline runs; webapp test module compiles; checkstyle clean). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test: assert exactly one ZIP entry in downloadAllZip Address CodeRabbit: a Map keyed by entry name could mask duplicate ZIP entries (same filename overwrites). Track the entry count separately and assert it is 1, so an unexpected extra entry fails the test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test: de-flake SSR authorization and Solr TopCountries ITs Two more intermittent failures on dtq-dev, both addressed at the root cause (test-only changes): 1. AuthorizationRestRepositoryIT.findByObjectSSRTest flaked with 400 instead of 200. The test sets dspace.server.ssr.url (used by Utils.getBaseObjectRestFromUri to resolve the request URI) and the AlwaysThrowExceptionFeature.turnoff flag via configurationService.setProperty(...). Such in-memory overrides are silently dropped when the combined config is rebuilt by the auto-reload listener (fires on any reloadable cfg file mtime change mid-run). When ssr.url is dropped the URI no longer resolves -> 400; if turnoff were also dropped the /search/object path would let alwaysexception throw -> 500. Fix: set both via JVM system properties (+ reloadConfig) so they sit in the highest-precedence override layer and survive auto-reload, cleared in @after. Same pattern as #1321's Shibboleth fix (AuthenticationRestControllerIT#setAuthenticationMethodSequence). Applied to all three SSR tests. 2. StatisticsRestRepositoryIT.topCountriesReport_Community_Visited flaked with an empty report (points: []). postView() commits with waitSearcher=false, so the just-posted view events can be invisible to the immediately-following report query (and a dropped solr-statistics.autoCommit override would skip the commit entirely). Fix: after posting the view events, force solrLoggerService.commit() (waitSearcher =true) so the events are flushed and visible before the report is queried. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit f6f1356)
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR cherry-picks test-stability fixes to address flaky integration test failures (notably VersioningWithRelationshipsIT) by removing a cross-thread Hibernate stream-close race, and by making several tests deterministic (no network/time/config-reload races).
Changes:
- Fix
AbstractHibernateDAO#iterate()to close the underlying Hibernate stream on the owning thread when iteration is exhausted (removing thefinalize()-based close). - De-flake multiple tests by eliminating external/time-sensitive dependencies: mocked ORCID responses, robust ZIP assertions, forced Solr commit, and config overrides via JVM system properties to survive auto-reload.
- Add a regression IT to prevent reintroducing a finalizer-based stream close (note: currently named in a way that appears excluded from CI).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| dspace-server-webapp/src/test/java/org/dspace/app/rest/StatisticsRestRepositoryIT.java | Forces a Solr commit (waitSearcher=true) to prevent empty report results due to visibility timing. |
| dspace-server-webapp/src/test/java/org/dspace/app/rest/MetadataBitstreamControllerIT.java | Replaces byte-for-byte ZIP comparison with assertions on ZIP entry name/content to avoid timestamp flakiness. |
| dspace-server-webapp/src/test/java/org/dspace/app/rest/AuthorizationRestRepositoryIT.java | Switches SSR-related overrides to JVM system properties + cleanup to survive config auto-reload mid-test. |
| dspace-server-webapp/src/test/java/org/dspace/app/rest/AuthenticationRestControllerIT.java | Moves auth plugin sequence overrides to JVM system properties + cleanup to prevent intermittent config-reload leaks. |
| dspace-api/src/test/resources/org/dspace/external/orcid-expanded-search.xml | Adds a canned ORCID expanded-search fixture to avoid live sandbox dependency. |
| dspace-api/src/test/java/org/dspace/external/CachingOrcidRestConnectorTest.java | Uses canned responses / MockWebServer to make ORCID tests deterministic and validate caching without live network. |
| dspace-api/src/test/java/org/dspace/core/AbstractHibernateDAOIteratorIT.java | Adds regression coverage for “no finalizer on iterate() iterator” (but name suggests it may be excluded from CI). |
| dspace-api/src/main/java/org/dspace/external/CachingOrcidRestConnector.java | Makes httpGet protected to allow tests to stub transport. |
| dspace-api/src/main/java/org/dspace/core/AbstractHibernateDAO.java | Removes finalize()-based stream closing and closes the stream when iteration is exhausted. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| String siteUri = "http://ssr.example.com/api/core/sites/" + siteRest.getId(); | ||
|
|
||
| // disarm the alwaysThrowExceptionFeature | ||
| configurationService.setProperty("org.dspace.app.rest.authorization.AlwaysThrowExceptionFeature.turnoff", true); | ||
| configurationService.setProperty("dspace.server.ssr.url", "http://ssr.example.com/api"); | ||
| // Disarm the alwaysThrowExceptionFeature and define the SSR base URL via system properties so the |
| //'joh' is alphabetic, so the connector turns it into an edismax wildcard query ("joh || joh*") that matches | ||
| //many authors; the canned response carries num-found=1725. | ||
| assertEquals("Unexpected num-found for the canned ORCID response", 1725L, (long) search.numFound()); | ||
| assertEquals("Connor, John", search.results().get(0).label()); |
| * stream on the owning thread once the iteration is exhausted. This test guards against reintroducing any | ||
| * stream-closing finalizer on the returned iterator.</p> | ||
| */ | ||
| public class AbstractHibernateDAOIteratorIT extends AbstractIntegrationTestWithDatabase { |
Summary
Fixes the
VersioningWithRelationshipsITfailures reported in dataquest-dev/dspace-customers#772 (the failures that surfaced once #1359 unblocked the suite) by cherry-picking the flaky-IT fixes that already landed ondtq-dev:8f4b80d514— Fix flaky tests in IT pipeline (Fix flaky tests in IT pipeline #1321) — contains the root-cause fixf6f13561cb— test: de-flake ORCID cache tests and ZIP-download IT (test: de-flake ORCID cache tests and ZIP-download IT #1344)Both cherry-picks are
-xannotated.dtq-devhas been 20/20 green on scheduled CI since these landed;customer/vsb-tuowas missing them.Root cause of the #772 failures
The CME comes from a cross-thread race fixed by #1321:
AbstractHibernateDAO$1.finalize()(GC Finalizer thread) closed the Hibernate stream returned byiterate(), mutating the owning Session's non-thread-safe JDBCResourceRegistryconcurrently with the test thread's commit during@Afterbuilder cleanup (releaseResources → HashMap.forEach → CME) — the exact stack in the #772 report. Which IT gets hit depends on GC timing, which is why the victim moved around (this timeVersioningWithRelationshipsIT). The NPE at line 872 (anItemBuilderchain) is collateral damage in the same class once cleanup state is corrupted. #1321 removes the finalizer and closes the stream on the owning thread; the test file itself is identical on both branches, and no vsb-tuo-specific code is involved — this is missing-fix drift, not a customization issue.What was deliberately NOT picked / preserved
ItemHandleCheckerIT) skipped:ItemHandleCheckerand its IT don't exist oncustomer/vsb-tuo(dtq-dev-only feature), so there is nothing to de-flake here.CachingOrcidRestConnector.getLabel()returns null) and@Ignores the 3 unit tests exercising it. The cherry-pick keeps both; only thehttpGet-mockability change and log cleanup came through. Verified:CachingOrcidRestConnectorTest= 8 run / 3 skipped / 0 failures.The new guard test
AbstractHibernateDAOIteratorITfrom #1321 is silently excluded from every CI run by the parent pom's failsafe/surefire<exclude>**/Abstract*</exclude>rule (its name starts with "Abstract"). It passes when run explicitly (-Dit.test=AbstractHibernateDAOIteratorIT, 1/1), but as-is it guards nothing in CI — on dtq-dev as well. Suggest a follow-up PR ondtq-devrenaming it (e.g.HibernateDAOIteratorIT); I kept the cherry-pick verbatim here to avoid diverging from dtq-dev.Test evidence
VersioningWithRelationshipsIT(local, targeted)VersioningWithRelationshipsIT(local, inside full suite)EmbargoImportIT(regression check for #1359)CachingOrcidRestConnectorTestdspace-server-webapptest sources (4 ITs touched)dspace-apiIT suite (local, Windows)ItemImportCLIITtemp-file locks (FileSystemException: process cannot access the file), pre-existing and identical to the pre-cherry-pick baseline run; passes on Linux CIdtq-devscheduled CI with these fixesCloses dataquest-dev/dspace-customers#772 (vsb-tuo BE part).
cc @Kasinhou
🤖 Generated with Claude Code