Skip to content

fix(opensearch): configurable shadow write log level + phase-aware index tests (#35302)#35389

Merged
fabrizzio-dotCMS merged 8 commits intomainfrom
issue-35302-shadow-write-log-level
Apr 21, 2026
Merged

fix(opensearch): configurable shadow write log level + phase-aware index tests (#35302)#35389
fabrizzio-dotCMS merged 8 commits intomainfrom
issue-35302-shadow-write-log-level

Conversation

@fabrizzio-dotCMS
Copy link
Copy Markdown
Member

@fabrizzio-dotCMS fabrizzio-dotCMS commented Apr 20, 2026

Summary

  • Introduces DOTCMS_SHADOW_WRITE_LOG_LEVEL (default WARN) so operators can tune shadow OS write failure verbosity at runtime without code changes
  • Adds PhaseRouterTest covering all four migration phases with the mismatched-index-name (T0/T1) scenario
  • Adds ContentletIndexAPIImplPhaseTest — fully isolated unit tests for deactivateIndex, createContentIndex, closeIndex, and activateIndex using a full-dependency constructor (no APILocator)
  • Adds ContentletIndexAPIImplMigrationIT — integration tests requiring two real clusters, covering ghost-index and invalid-name edge cases across phases; registered in OpenSearchUpgradeSuite
  • Guards addCustomMapping behind if (result) in createContentIndex — prevents applying mappings to an index the cluster rejected, keeping the soft-failure contract consistent

Test plan

  • ContentletIndexAPIImplPhaseTest./mvnw test -pl :dotcms-core -Dtest=ContentletIndexAPIImplPhaseTest
  • PhaseRouterTest./mvnw test -pl :dotcms-core -Dtest=PhaseRouterTest
  • ContentletIndexAPIImplMigrationIT./mvnw verify -pl :dotcms-integration -Dcoreit.test.skip=false -Dopensearch.upgrade.test=true -Dit.test=ContentletIndexAPIImplMigrationIT
  • No regressions in existing OpenSearchUpgradeSuite tests

Closes #35302

🤖 Generated with Claude Code

This PR fixes: #35302

fabrizzio-dotCMS and others added 4 commits April 17, 2026 11:38
…e-and-forget tests (#35302)

Introduces DOTCMS_SHADOW_WRITE_LOG_LEVEL (default WARN) so operators can raise
shadow OS write failures to ERROR during QA or lower them to DEBUG during
steady-state migration, without code changes. Adds PhaseRouterTest covering
all four migration phases with the mismatched-index-name scenario.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…entletIndexAPIImpl

Adds ContentletIndexAPIImplPhaseTest (unit, full mock injection) and
ContentletIndexAPIImplMigrationIT (integration, requires two real clusters)
covering deactivateIndex, createContentIndex, closeIndex, and activateIndex
across all four migration phases. Includes a full-dependency constructor on
ContentletIndexAPIImpl for isolated unit testing without APILocator.
Registers ContentletIndexAPIImplMigrationIT in OpenSearchUpgradeSuite.

Refs: #35302

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ContentIndex

addCustomMapping was called unconditionally after a failed index creation, which
could apply mappings to a non-existent index and surface an unrelated exception
instead of the clean false-return the soft-failure contract promises.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 20, 2026

Claude finished @fabrizzio-dotCMS's task in 3m 16s —— View job


PR Review

  • Read changed files
  • Analyze production code changes
  • Analyze test code
  • Post review findings

Issues Found

1. Shadow failure logging not applied in createContentIndex — inconsistent with the PR's core contract (moderate)

ContentletIndexAPIImpl.createContentIndex(String, int) was refactored to separate primary from shadow results, but the catch block still logs all failures at ERROR regardless of provider role:

// ContentletIndexAPIImpl.java ~line 601
} catch (Exception e) {
    Logger.error(this.getClass(), "Error while creating content index " + physicalName, e);
    if (ops == primary) {
        primaryResult = false;
    }
    // shadow failures are fire-and-forget in dual-write phases  ← comment contradicts the code
}

Every other write path in this PR (PhaseRouter.writeChecked, writeBoolean, writeReturningChecked, CompositeBulkProcessor.close, BulkProcessorListener.afterBulk) was switched to logShadowWriteFailure(). This one was missed. During Phase 1/2, a shadow OS creation failure will fire at ERROR even when DOTCMS_SHADOW_WRITE_LOG_LEVEL=DEBUG. Fix this →

2. Same pattern missed in activateIndex OS mirror block (minor, pre-existing but adjacent to the PR's change):

// ContentletIndexAPIImpl.java ~line 2464
} catch (Exception e) {
    Logger.warn(this, "Could not mirror index activation to OS store for index: " + indexName, e);
}

This is a shadow write failure and should be logShadowWriteFailure() for the same reason. Technically pre-existing, but this PR is the right time to fix it since it's the canonical shadow-write path.

3. PhaseRouter.writeBoolean reads phase config twice — TOCTOU window (minor):

final List<T> providers = writeProviders();   // calls MigrationPhase.current() x2
...
final T primary = readProvider();             // calls MigrationPhase.current() again

If writeProviders() returns [ES, OS] (dual-write) but the flag is updated before readProvider() returns osImpl (Phase 2 → 3), primary ends up as osImpl but the loop iterates over both. Harmless in practice because phase changes don't race with request threads, but the pattern is a latent inconsistency. writeChecked and writeReturningChecked have the same structure.

4. PhaseRouterTest has no coverage of read() / readChecked() Phase 2 ES fallback (test gap):

The test file covers all four write paths (write, writeBoolean, writeChecked) but there are zero tests for the Phase 2 read fallback:

  • OS throws → should fall back to ES and return ES result
  • Both fail → exception from ES should propagate
  • Phase 3: no fallback, OS failure propagates

These are the most complex paths in PhaseRouter and the ones most likely to regress silently.

5. writeBoolean default primaryResult = false is safe but needs a note (nit):

The old default was true, the new one is false. The comment says "assume failure until primary confirms success" — this is correct reasoning, but if primary throws, the exception is re-thrown anyway so the default value is never observed by callers. The change is harmless but the comment is misleading about why the default matters.


Design concern worth calling out

The test test_activateIndex_phase2_esName_mirroredToOsStore (and the corresponding IT test) explicitly documents that passing "working_T0" in Phase 2 writes "working_T0" to the OS pointer store — even though OS physically holds "working_T1". The comment in the test says this is "expected behavior during the catch-up window."

This means the OS pointer store can get a name that does not exist in the OS cluster, and any subsequent deactivateIndex("working_T0") in Phase 2 will silently clear the OS working slot (confirmed by test_deactivateIndex_phase2_clearsBothStoresByIndexType). The behavior is documented, but the documentation is only in test comments — not in activateIndex's javadoc, where a future maintainer is most likely to look. Consider adding a warning there.


Otherwise the core logic is sound: the logShadowWriteFailure extraction is clean, the if (primaryResult) guard on addCustomMapping is a correct fix, the full-dependency constructor for unit testing is the right approach, and the test scenarios comprehensively cover the four-phase routing table for write paths. The integration test's assumeFalse(esSameAsOs()) guards are correct for the two-cluster scenarios.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 20, 2026

PR Review in Progress

  • Read changed files
  • Analyze production code changes
  • Analyze test code
  • Post review findings

View job run

fabrizzio-dotCMS and others added 4 commits April 20, 2026 21:40
- Logger: add info(Class, String, Throwable) overload so INFO-level log
  calls can forward the throwable just like DEBUG/WARN/ERROR do
- IndexConfigHelper: pass throwable `t` in the INFO branch of
  logShadowWriteFailure so operators don't silently lose stack traces
- ContentletIndexAPIImpl: guard addCustomMapping behind if(contentIndex)
  in the private createContentIndex(String,int,IndexTag) overload to
  match the public overload's behaviour; annotate queueApi with @nullable
  to document the testing-constructor invariant
- ContentletIndexAPIImplMigrationIT: trim both operands in esSameAsOs()
  to avoid false negatives from trailing whitespace; add .onFailure()
  warnings to tearDown Try.run calls so DB-restore failures are visible

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…aration

- ContentletIndexAPIImpl.createContentIndex(String,int): replace the
  shared `result &=` accumulator with per-provider tracking so a shadow
  write failure in dual-write phases does not prevent addCustomMapping
  from running against the successfully-created primary index
- PhaseRouter.writeBoolean: initialize primaryResult to false (safe
  default) instead of true — makes the silent invariant explicit:
  "assume failure until the primary provider confirms success"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@hassandotcms hassandotcms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@fabrizzio-dotCMS fabrizzio-dotCMS added this pull request to the merge queue Apr 21, 2026
Merged via the queue into main with commit c0df1a6 Apr 21, 2026
49 checks passed
@fabrizzio-dotCMS fabrizzio-dotCMS deleted the issue-35302-shadow-write-log-level branch April 21, 2026 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[BUG] OS exceptions propagate to client in Phase 1 dual-write (should be fire-and-forget)

2 participants