
[BUG] OS exceptions propagate to client in Phase 1 dual-write (should be fire-and-forget) #35302

@fabrizzio-dotCMS

Summary

During Phase 1 (dual-write, ES reads), the closeIndex, openIndex, and delete operations routed through router.writeChecked()/writeBoolean() propagate the OpenSearch (OS) index_not_found_exception to the REST client as an HTTP 500, even though the underlying Elasticsearch (ES) operation succeeded.

Per OPENSEARCH_MIGRATION.md:

OS write failures MUST be fire-and-forget in Phases 1 and 2.

The BulkProcessor content-write path correctly implements this via BulkProcessorListener.forShadowProvider(). The index lifecycle operations in PhaseRouter do not.

Affected endpoints

  • PUT /api/v1/esindex/{name}?action=close
  • PUT /api/v1/esindex/{name}?action=open
  • DELETE /api/v1/esindex/{name}

Root cause

When FEATURE_FLAG_OPEN_SEARCH_PHASE=1 and the ES index (working_T0) and the OS index (working_T1) have different timestamps (the migration catch-up scenario), the ES name working_T0 is passed to the fan-out operation. No index with that name exists in OS, so OSIndexAPIImpl.closeIndex()/openIndex()/delete() throws OpenSearchException: [index_not_found_exception]. PhaseRouter.writeChecked() and writeBoolean() then propagate the exception to the caller instead of swallowing it.

Stack trace observed:

OpenSearchException: [index_not_found_exception] no such index [cluster_dotcms-os-migration.working_20260414024418]
  at OSIndexAPIImpl.closeIndex()
  at IndexAPIImpl (via PhaseRouter.writeChecked())
  at ESIndexResource.doAction() → HTTP 500

The ES state did change (the index was closed or deleted), but the client still receives a 500 error.

Fix

Add fire-and-forget wrapping to PhaseRouter.writeChecked() and writeBoolean() for Phase 1/2 shadow writes, following the same semantics as BulkProcessorListener.forShadowProvider():

  • Catch OS exceptions in dual-write phases
  • Log at WARN level (not ERROR)
  • Return the ES result to the caller
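
A minimal sketch of the intended semantics, not the actual PhaseRouter code. The class name ShadowWriteSketch, the dualWrite method, and the use of java.util.logging are assumptions made for illustration; the real change would live in writeChecked()/writeBoolean(), use the dotCMS logger, and apply only when the phase flag is 1 or 2.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Hypothetical stand-in for the dual-write path; names and signatures
 * here are illustrative and do not mirror the real PhaseRouter API.
 */
final class ShadowWriteSketch {

    // Stand-in logger; dotCMS has its own Logger utility.
    private static final Logger LOG =
            Logger.getLogger(ShadowWriteSketch.class.getName());

    private ShadowWriteSketch() { }

    /**
     * Runs the primary (ES) write, then the shadow (OS) write.
     * Assumes the caller has already checked that the phase is 1 or 2.
     *
     * @param esWrite primary write; any failure propagates to the caller
     * @param osWrite shadow write; any failure is logged and swallowed
     * @return the result of the primary (ES) write
     */
    static <T> T dualWrite(final Callable<T> esWrite,
                           final Callable<T> osWrite) throws Exception {
        // ES is still the source of truth in the dual-write phases,
        // so its exceptions must reach the caller unchanged.
        final T esResult = esWrite.call();

        try {
            // Fire-and-forget: the OS result is intentionally discarded.
            osWrite.call();
        } catch (final Exception e) {
            // WARN, not ERROR: a shadow failure is expected during catch-up.
            LOG.log(Level.WARNING,
                    "Shadow (OS) write failed and was ignored: " + e.getMessage(), e);
        }
        return esResult;
    }
}
```

With this shape, a closeIndex call whose OS side throws index_not_found_exception would still return the ES result, so ESIndexResource would answer with the ES outcome rather than HTTP 500.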

Discovered via

QA test run against dotcms/dotcms:latest with single-node-os-migration compose stack. Confirmed in Phase 1 with ES index working_20260414024418 (no matching OS index created by latest image).
