@Mpdreamz Mpdreamz commented Oct 7, 2025

This PR uses the page URL as the document _id.

That lets us fetch a document with a direct GET request based on its URL:

GET /semantic-docs-dev/_doc/%2Fdocs%2Freference%2Fintegrations%2Fsonicwall_firewall
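Because the _id contains slashes, it has to be percent-encoded before it is placed in the request path. A minimal Python sketch of that encoding step follows; the helper name `doc_path` is hypothetical and not part of this PR.

```python
from urllib.parse import quote


def doc_path(index: str, url: str) -> str:
    """Build the request path for a direct GET of the document whose _id is `url`."""
    # safe="" forces "/" to be encoded as %2F, so the whole URL stays one path segment.
    return f"/{index}/_doc/{quote(url, safe='')}"


print(doc_path("semantic-docs-dev", "/docs/reference/integrations/sonicwall_firewall"))
# prints /semantic-docs-dev/_doc/%2Fdocs%2Freference%2Fintegrations%2Fsonicwall_firewall
```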

We also store a hash of the contents.

This lets us update a document conditionally, only when its hash has changed:

POST /my-index/_update/1
{
  "scripted_upsert": true,
  "script": {
    "source": """
      if (ctx.op != 'create') {
        if (ctx._source.hash == params.hash) {
          ctx.op = 'noop'
        } else {
          ctx._source = params.doc
        }
      }
    """,
    "params": {
      "hash": "SOME-HASH",
      "doc": {
        "hash": "SOME-HASH",
        "semantic": "TEXT"
      }
    }
  }
}
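For illustration, here is a rough Python sketch of how a client might assemble that request body. It mirrors the request shown above and nothing more. The use of SHA-256 is an assumption (the PR does not say which hash is stored), and `content_hash` and `conditional_update_body` are hypothetical helper names.

```python
import hashlib
import json

# Painless source mirroring the script in the request above.
SCRIPT = (
    "if (ctx.op != 'create') {"
    " if (ctx._source.hash == params.hash) { ctx.op = 'noop' }"
    " else { ctx._source = params.doc } }"
)


def content_hash(text: str) -> str:
    # Assumption: SHA-256 over the UTF-8 bytes; the actual hash function may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def conditional_update_body(text: str) -> dict:
    """Build an _update body that becomes a noop when the stored hash matches."""
    digest = content_hash(text)
    doc = {"hash": digest, "semantic": text}
    return {
        "scripted_upsert": True,
        "script": {"source": SCRIPT, "params": {"hash": digest, "doc": doc}},
    }


print(json.dumps(conditional_update_body("TEXT"), indent=2))
```

The same digest is passed both as params.hash and inside params.doc, so the comparison in the script and the stored value stay in sync.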

However, because we use semantic fields, the equivalent scripted update is not allowed in _bulk operations:

POST /semantic-docs-dev/_bulk
{"update":{"_id":"1"}}
{ "scripted_upsert": true, "script": { "source": "if (ctx.op != 'create') { if (ctx._source.hash == params.hash ) { ctx.op = 'noop' } else { ctx._source = params.doc } }", "params": { "hash": "SOME-HASH", "doc": { "hash": "SOME-HASH-2", "semantic": "DIFFERENT TEXT" } } } }

See https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text#semantic-text-updates for the rules that currently prevent it. I opened elastic/elasticsearch#136074 to discuss this with the Elasticsearch team.

The only option I see for now is to first index into an index without semantic fields, then find the updated documents there with a scroll search and feed them as bulk index operations into the index with semantic fields.
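A rough sketch of the second half of that workaround, in Python and without any client library. It assumes each staged document carries `url` and `hash` fields and that the hashes currently stored in the semantic index are available as a dict; both function names are hypothetical, and how the changed documents are actually detected is not specified in this PR.

```python
import json
from typing import Dict, Iterable, Iterator


def changed_docs(staged: Iterable[dict], indexed_hashes: Dict[str, str]) -> Iterator[dict]:
    """Yield staged documents that are new or whose hash differs from the indexed one."""
    for doc in staged:
        if indexed_hashes.get(doc["url"]) != doc["hash"]:
            yield doc


def bulk_index_payload(index: str, docs: Iterable[dict]) -> str:
    """Serialize docs as NDJSON for _bulk, using plain index actions (no scripts)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["url"]}}))
        lines.append(json.dumps(doc))
    # The _bulk API expects a trailing newline after the last line.
    return ("\n".join(lines) + "\n") if lines else ""


staged = [
    {"url": "/docs/a", "hash": "h1", "semantic": "unchanged"},
    {"url": "/docs/b", "hash": "h2-new", "semantic": "changed"},
]
indexed = {"/docs/a": "h1", "/docs/b": "h2-old"}
print(bulk_index_payload("semantic-docs-dev", changed_docs(staged, indexed)))
```

Only /docs/b ends up in the payload here, so unchanged documents are never re-sent to the index with semantic fields.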

Side note

The mapping for url is updated to use the path hierarchy tokenizer, so it is easier for us to constrain searches to specific locations, e.g.:

GET /semantic-docs-dev/_search
{
    "query": {
        "term": {
          "url.prefix": {
            "value": "/docs/reference/integrations"
          }
        }
    }
}
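To see why a term query on url.prefix matches everything under a location, it helps to look at the tokens a path hierarchy tokenizer emits. The Python function below is a simplified emulation with the default "/" delimiter, not the Elasticsearch implementation, and it assumes url.prefix is the subfield analyzed with that tokenizer.

```python
from typing import List


def path_hierarchy_tokens(path: str, delimiter: str = "/") -> List[str]:
    """Return every ancestor prefix of `path`, shortest first, ending with `path` itself."""
    tokens = []
    # Skip a leading delimiter so the first token is "/docs" rather than "".
    start = len(delimiter) if path.startswith(delimiter) else 0
    pos = path.find(delimiter, start)
    while pos != -1:
        tokens.append(path[:pos])
        pos = path.find(delimiter, pos + 1)
    if path and path != delimiter:
        tokens.append(path)
    return tokens


for token in path_hierarchy_tokens("/docs/reference/integrations/sonicwall_firewall"):
    print(token)
```

Since "/docs/reference/integrations" is one of the indexed tokens for that page, an exact term match on it finds the page without needing a prefix or wildcard query.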

@Mpdreamz Mpdreamz requested a review from a team as a code owner October 7, 2025 10:05
@Mpdreamz Mpdreamz requested a review from cotti October 7, 2025 10:05
@Mpdreamz Mpdreamz self-assigned this Oct 7, 2025
@Mpdreamz Mpdreamz merged commit 6688f0b into main Oct 7, 2025
23 checks passed
@Mpdreamz Mpdreamz deleted the feature/es-document-id-and-hash branch October 7, 2025 13:12