Ingest: Add post-indexing content date resolution#3112

Merged
reakaleek merged 2 commits into main from trusting-ceder on Apr 15, 2026

@reakaleek
Member

@reakaleek reakaleek commented Apr 15, 2026

What

Add a post-indexing _update_by_query step that resolves content_last_updated on all documents after indexing completes, compensating for Elasticsearch bulk update actions skipping ingest pipelines.

Why

HashedBulkUpdate uses scripted upserts (bulk update actions), which skip default_pipeline and final_pipeline. This means the enrichment pipeline that stamps content_last_updated never fires during normal indexing, leaving the field unset.
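
To make the gap concrete, here is a minimal sketch of the two kinds of bulk action, written in Python only for illustration (the exporter itself is C#). The index name, document shape, and Painless script are hypothetical and are not taken from HashedBulkUpdate; the only point is that the action line reads `update` rather than `index`, and ingest pipelines run only for `index` and `create` actions.

```python
import json

# Bulk action types that Elasticsearch passes through ingest pipelines.
PIPELINE_ACTIONS = {"index", "create"}


def bulk_index_lines(index, doc_id, doc):
    """NDJSON lines for a plain index action (pipelines apply)."""
    action = {"index": {"_index": index, "_id": doc_id}}
    return [json.dumps(action), json.dumps(doc)]


def bulk_update_lines(index, doc_id, doc, content_hash):
    """NDJSON lines for a scripted-upsert update action (pipelines are skipped)."""
    action = {"update": {"_index": index, "_id": doc_id}}
    payload = {
        "scripted_upsert": True,
        "script": {
            "lang": "painless",
            # Hypothetical script: rewrite the document only when its hash changed.
            "source": (
                "if (ctx._source.hash != params.hash) "
                "{ ctx._source.putAll(params.doc) } else { ctx.op = 'none' }"
            ),
            "params": {"hash": content_hash, "doc": doc},
        },
        "upsert": {},
    }
    return [json.dumps(action), json.dumps(payload)]


def runs_ingest_pipelines(action_line):
    """True when the bulk action type is one that ingest pipelines apply to."""
    action_type = next(iter(json.loads(action_line)))
    return action_type in PIPELINE_ACTIONS
```

With this sketch, `runs_ingest_pipelines` is true for the first line of `bulk_index_lines(...)` and false for the first line of `bulk_update_lines(...)`, which is why content_last_updated stays unset.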

How

  • Add ResolveContentDatesAsync to ContentDateEnrichment — runs _update_by_query with the enrichment pipeline
  • Capture read aliases in StartAsync (write targets are removed after CompleteAsync)
  • Call ResolveContentDatesAsync on both lexical and semantic indices in StopAsync before syncing the lookup
  • Switch SyncLookupIndexAsync to use the read alias instead of the write target
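
As a rough illustration of the first step, the request that ResolveContentDatesAsync issues can be pictured as below. This is a sketch in Python, not the C# from the PR: the alias and pipeline names are placeholders, and `conflicts=proceed` is an assumed option that the PR may or may not set. The `pipeline` query parameter itself is part of the `_update_by_query` API.

```python
from urllib.parse import urlencode


def update_by_query_request(read_alias, pipeline):
    """Build the method and path for a pipeline-driven _update_by_query call."""
    query = urlencode({
        "pipeline": pipeline,    # ingest pipeline to run over every matched document
        "conflicts": "proceed",  # assumption: keep going on version conflicts
    })
    return "POST", f"/{read_alias}/_update_by_query?{query}"


# Hypothetical alias and pipeline names, for illustration only.
method, path = update_by_query_request("docs-lexical", "content-date-enrichment")
print(method, path)
```

The call targets the read alias because, as noted above, the write target no longer exists once CompleteAsync has run.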

Test plan

  • Integration tests against a real Elasticsearch 8.18.0 container (Testcontainers)
  • Covers: cold start, date preservation on unchanged content, date advancement on changed content, and the bulk-update pipeline gap
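
The first three scenarios can be summarized as a small decision rule. The sketch below is an assumption about the intended behavior, inferred from the scenario names only; it is not the enrichment pipeline's actual logic, and the function name and hash-based comparison are hypothetical.

```python
from datetime import datetime, timezone


def resolve_content_last_updated(content_hash, previous, now):
    """Pick the date to stamp on a document.

    previous is None on a cold start, otherwise a (hash, date) pair
    recorded for the same document on an earlier run.
    """
    if previous is None:
        return now            # cold start: no earlier record, use the current time
    previous_hash, previous_date = previous
    if previous_hash == content_hash:
        return previous_date  # unchanged content: preserve the earlier date
    return now                # changed content: advance the date


earlier = datetime(2026, 4, 1, tzinfo=timezone.utc)
now = datetime(2026, 4, 15, tzinfo=timezone.utc)
print(resolve_content_last_updated("abc", ("abc", earlier), now))
```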

🤖 Generated with Claude Code

reakaleek and others added 2 commits April 15, 2026 14:31
HashedBulkUpdate uses bulk update actions (scripted upserts) which skip
Elasticsearch ingest pipelines, so content_last_updated was never set
during normal indexing. This adds a ResolveContentDatesAsync step that
runs _update_by_query with the enrichment pipeline after indexing
completes, and switches StopAsync to use read aliases instead of the
write target (which is removed after CompleteAsync).

Includes integration tests against a real Elasticsearch container
validating cold-start, date preservation, change detection, and the
bulk-update pipeline gap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@reakaleek reakaleek requested a review from a team as a code owner April 15, 2026 12:36
@reakaleek reakaleek requested a review from technige April 15, 2026 12:36
@reakaleek reakaleek added the fix label Apr 15, 2026
@reakaleek reakaleek changed the title from "Search: Add post-indexing content date resolution" to "Ingest: Add post-indexing content date resolution" Apr 15, 2026
@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 54cd2315-1b53-44e2-aa6a-5647992b97ee

📥 Commits

Reviewing files that changed from the base of the PR and between 9acc73b and aa3522e.

📒 Files selected for processing (6)
  • Directory.Packages.props
  • docs-builder.slnx
  • src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cs
  • src/Elastic.Markdown/Exporters/Elasticsearch/ElasticsearchMarkdownExporter.cs
  • tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/ContentDateEnrichmentTests.cs
  • tests-integration/Elastic.ContentDateEnrichment.IntegrationTests/Elastic.ContentDateEnrichment.IntegrationTests.csproj

📝 Walkthrough

This change adds post-indexing content date enrichment functionality to the Elasticsearch exporter. A new ResolveContentDatesAsync method is introduced to apply the content date enrichment ingest pipeline across documents via _update_by_query. The exporter now captures read aliases during startup and invokes content date resolution for both lexical and semantic aliases during shutdown, followed by lookup index synchronization. Integration tests validate the enrichment workflow across multiple indexing scenarios using a containerized Elasticsearch instance.

Sequence Diagram

sequenceDiagram
    participant Exporter as ElasticsearchMarkdownExporter
    participant Enrichment as ContentDateEnrichment
    participant ES as Elasticsearch Cluster

    Exporter->>Exporter: StartAsync: Store read aliases<br/>(_lexicalReadAlias, _semanticReadAlias)

    Note over Exporter,ES: Indexing occurs...

    Exporter->>Exporter: StopAsync begins

    Exporter->>Enrichment: ResolveContentDatesAsync<br/>(lexical alias)
    Enrichment->>ES: _update_by_query on lexical alias<br/>with enrichment pipeline
    ES-->>Enrichment: Apply pipeline, resolve dates

    Exporter->>Enrichment: ResolveContentDatesAsync<br/>(semantic alias)
    Enrichment->>ES: _update_by_query on semantic alias<br/>with enrichment pipeline
    ES-->>Enrichment: Apply pipeline, resolve dates

    Exporter->>Exporter: SyncLookupIndexAsync<br/>using lexical read alias
    Exporter->>ES: Sync lookup index state
    ES-->>Exporter: Lookup index updated

    Exporter->>Exporter: StopAsync completes

Suggested labels

enhancement, dependencies

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding a post-indexing step to resolve content dates, which is the core objective of this PR. |
| Description check | ✅ Passed | The description provides detailed context on what, why, and how the change addresses bulk update actions skipping ingest pipelines, directly relating to the changeset. |

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms


@reakaleek reakaleek merged commit 2a86232 into main Apr 15, 2026
31 of 35 checks passed
@reakaleek reakaleek deleted the trusting-ceder branch April 15, 2026 13:38
cotti pushed a commit that referenced this pull request Apr 15, 2026
* Search: Add post-indexing content date resolution via update_by_query

HashedBulkUpdate uses bulk update actions (scripted upserts) which skip
Elasticsearch ingest pipelines, so content_last_updated was never set
during normal indexing. This adds a ResolveContentDatesAsync step that
runs _update_by_query with the enrichment pipeline after indexing
completes, and switches StopAsync to use read aliases instead of the
write target (which is removed after CompleteAsync).

Includes integration tests against a real Elasticsearch container
validating cold-start, date preservation, change detection, and the
bulk-update pipeline gap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Search: Fix lint warnings in content date enrichment tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>