Ensure Elasticsearch documents have an _id and track content hash for partial updates #2012

Mpdreamz · 2025-10-07T10:05:53Z

This uses each document's URL as its _id.

This lets us fetch a document directly by URL with a single GET request (the URL is percent-encoded in the request path):

GET /semantic-docs-dev/_doc/%2Fdocs%2Freference%2Fintegrations%2Fsonicwall_firewall
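
A minimal sketch of how such a request path could be built from a URL. The helper name `doc_path` is hypothetical and not part of this PR; the point is only that every `/` in the URL has to be percent-encoded so the whole URL stays a single `_id` segment.

```python
from urllib.parse import quote


def doc_path(index: str, url: str) -> str:
    """Build the _doc request path for a document whose _id is its URL."""
    # safe="" makes quote() encode "/" as %2F as well; by default "/" is left as-is.
    return f"/{index}/_doc/{quote(url, safe='')}"


print(doc_path("semantic-docs-dev", "/docs/reference/integrations/sonicwall_firewall"))
```

Running this prints the same path as the GET request above.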

We also store a hash of each document's contents.

This lets us update a document conditionally, only when its hash has changed:

POST /my-index/_update/1
{
  "scripted_upsert": true,
  "script": {
    "source": """
      if (ctx.op != 'create') {
        if (ctx._source.hash == params.hash ) {
            ctx.op = "noop"
        }
        else {
            ctx._source = params.doc
        }
      }
    """,
    "params": {
      "hash": "SOME-HASH",
      "doc": {
        "hash": "SOME-HASH",
        "semantic": "TEXT"
      }
    }
  }
}
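
A rough sketch of how a client might assemble that request body. Several things here are assumptions rather than what the PR does: SHA-256 as the hash function, the helper names `content_hash` and `update_body`, and the extra `upsert` document. As far as I understand the update API, it needs an `upsert` document in order to create an `_id` that doesn't exist yet.

```python
import hashlib
import json

# Same logic as the Painless script in the request above, condensed to one line.
SCRIPT = (
    "if (ctx.op != 'create') {"
    " if (ctx._source.hash == params.hash) { ctx.op = 'noop' }"
    " else { ctx._source = params.doc } }"
)


def content_hash(text: str) -> str:
    # Assumption: SHA-256 hex digest; the PR does not say which hash is used.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def update_body(text: str) -> dict:
    """Build an _update body that replaces the stored doc only if the hash differs."""
    doc = {"hash": content_hash(text), "semantic": text}
    return {
        "scripted_upsert": True,
        "script": {"source": SCRIPT, "params": {"hash": doc["hash"], "doc": doc}},
        # Assumption: supply the full document as the upsert so a missing _id is created.
        "upsert": doc,
    }


print(json.dumps(update_body("TEXT"), indent=2))
```

Passing the same hash as both `params.hash` and `doc.hash` keeps the comparison and the stored value in sync, so an unchanged document becomes a noop on the next run.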

However, because we use semantic_text fields, the equivalent scripted update is not allowed in _bulk operations:

POST /semantic-docs-dev/_bulk
{"update":{"_id":"1"}}
{ "scripted_upsert": true, "script": { "source": "if (ctx.op != 'create') { if (ctx._source.hash == params.hash ) { ctx.op = 'noop' } else { ctx._source = params.doc } }", "params": { "hash": "SOME-HASH", "doc": { "hash": "SOME-HASH-2", "semantic": "DIFFERENT TEXT" } } } }
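
For reference, a small sketch of how the two-line bulk body above is serialized. It only illustrates the NDJSON framing (one action line, one body line, each newline-terminated) and does nothing to get around the semantic_text restriction. The function name `bulk_update_lines` is made up for illustration.

```python
import json


def bulk_update_lines(doc_id: str, body: dict) -> str:
    """Serialize one bulk `update` action and its body as NDJSON."""
    action = {"update": {"_id": doc_id}}
    # json.dumps never emits raw newlines by default, so each object stays on one line.
    # The bulk API requires the final line to end with a newline as well.
    return json.dumps(action) + "\n" + json.dumps(body) + "\n"


payload = bulk_update_lines("1", {"scripted_upsert": True, "script": {"source": "ctx.op = 'noop'"}})
print(payload, end="")
```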

See https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text#semantic-text-updates for the rules that currently prevent it. Opened elastic/elasticsearch#136074 to discuss this with the Elasticsearch team.

The only option I see for now is to first index into an index without semantic fields, then use a scroll search to find the documents that were updated, and feed those to bulk index operations against the index with semantic fields.
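
A very rough sketch of the second half of that idea, written as plain functions with no Elasticsearch client. Everything here is hypothetical: `staged` stands in for documents read back from the non-semantic index (for example via scroll), `live_hashes` for the hashes currently in the semantic index, and comparing hashes is just one possible way to decide what counts as an update.

```python
import json
from typing import Dict, Iterable, Iterator


def changed_docs(staged: Iterable[dict], live_hashes: Dict[str, str]) -> Iterator[dict]:
    """Yield staged docs that are new, or whose hash differs from the semantic index."""
    for doc in staged:
        if live_hashes.get(doc["url"]) != doc["hash"]:
            yield doc


def bulk_index_payload(docs: Iterable[dict]) -> str:
    """Serialize docs as bulk `index` actions, using the URL as _id."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_id": doc["url"]}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline; an empty batch yields an empty string.
    return "\n".join(lines) + "\n" if lines else ""


staged = [
    {"url": "/docs/a", "hash": "h1", "semantic": "unchanged"},
    {"url": "/docs/b", "hash": "h2-new", "semantic": "changed"},
]
live_hashes = {"/docs/a": "h1", "/docs/b": "h2-old"}
print(bulk_index_payload(changed_docs(staged, live_hashes)), end="")
```

Only `/docs/b` ends up in the payload here, since plain `index` actions (unlike scripted updates) always overwrite and so should only be sent for documents that actually changed.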

Side note

The mapping for url now uses the path hierarchy tokenizer, so it's easier for us to constrain searches to specific locations, e.g.:

GET /semantic-docs-dev/_search
{
    "query": {
        "term": {
          "url.prefix": {
            "value": "/docs/reference/integrations"
          }
        }
    }
}
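
To illustrate why the term query above matches, here is a simplified emulation of the tokens a path hierarchy tokenizer emits for a URL path. It is an approximation written for this note, not the Lucene implementation, and it ignores tokenizer options such as `skip` or `reverse`.

```python
from typing import List


def path_hierarchy_tokens(path: str, delimiter: str = "/") -> List[str]:
    """Return every ancestor prefix of `path`, shortest first."""
    tokens: List[str] = []
    current = ""
    for i, part in enumerate(path.split(delimiter)):
        if i == 0:
            # A leading delimiter yields an empty first part, which emits no token.
            current = part
            if part:
                tokens.append(current)
        else:
            current = current + delimiter + part
            tokens.append(current)
    return tokens


tokens = path_hierarchy_tokens("/docs/reference/integrations/sonicwall_firewall")
print(tokens)
print("/docs/reference/integrations" in tokens)
```

Because each ancestor path is indexed as its own term, an exact `term` match on `/docs/reference/integrations` finds every document below that location.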

Commit fb9b927: Ensure Elasticsearch documents have an _id and track content hash for partial updates

Mpdreamz requested a review from a team as a code owner October 7, 2025 10:05

Mpdreamz requested a review from cotti October 7, 2025 10:05

Mpdreamz added the feature label Oct 7, 2025

Mpdreamz self-assigned this Oct 7, 2025

reakaleek approved these changes Oct 7, 2025

Mpdreamz merged commit 6688f0b into main Oct 7, 2025
23 checks passed

Mpdreamz deleted the feature/es-document-id-and-hash branch October 7, 2025 13:12
