Summary
During Phase 1 (dual-write, ES reads), the closeIndex, openIndex, and delete operations routed through router.writeChecked()/writeBoolean() propagate the OpenSearch (OS) index_not_found_exception to the REST client as an HTTP 500, even though the underlying ES operation succeeded.
Per OPENSEARCH_MIGRATION.md:
OS write failures MUST be fire-and-forget in Phases 1 and 2.
The BulkProcessor content-write path correctly implements this via BulkProcessorListener.forShadowProvider(). The index lifecycle operations in PhaseRouter do not.
Affected endpoints
PUT /api/v1/esindex/{name}?action=close
PUT /api/v1/esindex/{name}?action=open
DELETE /api/v1/esindex/{name}
Root cause
When FEATURE_FLAG_OPEN_SEARCH_PHASE=1 and the ES index (working_T0) and OS index (working_T1) have different timestamps (migration catchup scenario), a fan-out operation passes the ES name working_T0 to OpenSearch, where no index by that name exists. OSIndexAPIImpl.closeIndex()/openIndex()/delete() then throws OpenSearchException: [index_not_found_exception]. PhaseRouter.writeChecked() and writeBoolean() propagate this exception to the caller instead of swallowing it.
Stack trace observed:
OpenSearchException: [index_not_found_exception] no such index [cluster_dotcms-os-migration.working_20260414024418]
at OSIndexAPIImpl.closeIndex()
at IndexAPIImpl (via PhaseRouter.writeChecked())
at ESIndexResource.doAction() → HTTP 500
The ES state did change (index was closed/deleted) but the client receives a 500 error.
Fix
Add fire-and-forget wrapping to PhaseRouter.writeChecked() and writeBoolean() for Phase 1/2 shadow writes, following the same semantics as BulkProcessorListener.forShadowProvider():
- Catch OS exceptions in dual-write phases
- Log at WARN level (not ERROR)
- Return the ES result to the caller
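The three points above can be sketched as follows. This is a minimal illustration, not the dotCMS implementation: ShadowWriteSketch, writeWithShadow, and the warnLog callback are hypothetical names, the real PhaseRouter signatures are not shown in this issue, and the sketch models only the Phase 1/2 dual-write path. It also assumes the OS failure surfaces as an unchecked exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical stand-in for the dual-write path of PhaseRouter; not the dotCMS class. */
final class ShadowWriteSketch {

    private ShadowWriteSketch() { }

    /**
     * Runs the primary (ES) write, then the shadow (OS) write.
     * A primary failure propagates; a shadow failure is only reported via warnLog.
     */
    static <T> T writeWithShadow(Supplier<T> esWrite,
                                 Supplier<T> osWrite,
                                 Consumer<String> warnLog) {
        // The ES write is authoritative in Phases 1 and 2: let its exceptions escape.
        T esResult = esWrite.get();
        try {
            // Shadow write: its result is discarded on purpose.
            osWrite.get();
        } catch (RuntimeException e) {
            // Fire-and-forget: report at WARN level and keep going.
            warnLog.accept("Shadow OS write failed (ignored): " + e.getMessage());
        }
        return esResult;
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        Boolean closed = ShadowWriteSketch.<Boolean>writeWithShadow(
                () -> Boolean.TRUE, // ES closes working_T0 successfully
                () -> { throw new IllegalStateException("[index_not_found_exception] no such index"); },
                warnings::add);
        System.out.println("ES result: " + closed);
        System.out.println("Warnings: " + warnings);
    }
}
```

In a real change the warnLog callback would presumably be replaced by the project's logger at WARN level, and the catch clause narrowed or widened to match the exception types the OS client actually throws.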
Discovered via
Found during a QA test run against dotcms/dotcms:latest with the single-node-os-migration compose stack. Confirmed in Phase 1 with ES index working_20260414024418 (the latest image created no matching OS index).
Related