ES|QL Update CHUNK to support chunking_settings as optional argument #138123

kderusso · 2025-11-14T21:46:39Z

Updates CHUNK to support chunking_settings as an optional argument. Removes num_chunks in favor of MV_SLICE.

Usage:

CHUNK(content, {"strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0 })
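Since num_chunks is removed in favor of MV_SLICE, limiting the number of chunks becomes a slice over the multivalued result. The following Java sketch only illustrates that idea; `FirstChunksSketch` and `mvSlice` are hypothetical names, not code from this PR, and the sketch assumes zero-based indices with an inclusive end and does not model negative indices.

```java
import java.util.List;

final class FirstChunksSketch {
    /**
     * Hypothetical stand-in for slicing a multivalued result.
     * Assumes zero-based indices and an inclusive end; negative indices are rejected.
     */
    static <T> List<T> mvSlice(List<T> values, int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("unsupported range: " + start + ".." + end);
        }
        // Clamp both bounds to the list size so an oversized end returns the tail.
        int from = Math.min(start, values.size());
        int to = end >= values.size() - 1 ? values.size() : end + 1;
        return List.copyOf(values.subList(from, to));
    }
}
```

Under these assumptions, `mvSlice(chunks, 0, 2)` keeps the first three chunks, which is what a num_chunks value of 3 used to express.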

github-actions · 2025-11-14T21:48:33Z

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.


When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

Check out the cumulative docs guidelines
Reach out in the #docs Slack channel

carlosdelest

So far LGTM - I think the options validation is correct (missing some tests) but the direction is good!

As a drive-by comment - I find it a bit confusing to have to add nested options when we have only one other param (num_chunks):

CHUNK(content, {"num_chunks": 3, "chunking_settings": { "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0 } })

Wouldn't it be better to avoid having to specify chunking_settings as an additional option, and just flatten the chunking settings into the overall options?

CHUNK(content, {"num_chunks": 3, "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0 })

We can build the chunking settings the same way if we remove num_chunks from the options map and then build from what remains 🤷

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java

...src/test/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/ChunkTests.java

kderusso · 2025-11-18T13:12:16Z

Wouldn't it be better to avoid having to specify chunking_settings as an additional option, and just flatten the chunking settings into the overall options?

@carlosdelest So I get that, but here's the rub - various chunking settings strategies have different options. For example, sentence-based chunking requires sentence_overlap but word-based chunking requires overlap. And none would have no overlap. So if we were to flatten out these options, we'd have to take every possible permutation into account and commit to updating it going forward. The nested options felt better to me, because we can defer to the chunking settings builder to validate the options. This is the strategy we went with for semantic_text fields and it works well, because when new chunking settings are added they seamlessly work.

I realize the nested syntax is confusing but I'm just not sure if there's a better way to do that.

We can optimize for semantic_text fields and have reasonable default chunking strategies, and we've discussed whether we want to use an inference ID to pull chunking settings (this has some complications because of jarhell issues with inference and esql). But this felt like the first step, because it offers that flexibility. WDYT?
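To make the argument for nesting concrete, here is a minimal Java sketch of per-strategy validation living in one place, so that CHUNK can pass the nested map through untouched. `ChunkingOverlapSketch` is a hypothetical name and the table covers only the overlap options named in this thread; the real chunking settings builder is the authority on which options each strategy accepts.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class ChunkingOverlapSketch {
    // Overlap option per strategy, taken only from the examples in this thread (illustrative).
    private static final Map<String, Optional<String>> OVERLAP_KEY = Map.of(
        "sentence", Optional.of("sentence_overlap"),
        "word", Optional.of("overlap"),
        "none", Optional.<String>empty()
    );
    private static final Set<String> ALL_OVERLAP_KEYS = Set.of("sentence_overlap", "overlap");

    /** Rejects an overlap option that does not belong to the chosen strategy. */
    static void validate(Map<String, Object> chunkingSettings) {
        Object strategy = chunkingSettings.get("strategy");
        if (!(strategy instanceof String) || !OVERLAP_KEY.containsKey(strategy)) {
            throw new IllegalArgumentException("unknown strategy: " + strategy);
        }
        Optional<String> expected = OVERLAP_KEY.get(strategy);
        for (String key : ALL_OVERLAP_KEYS) {
            boolean belongs = expected.isPresent() && expected.get().equals(key);
            if (chunkingSettings.containsKey(key) && !belongs) {
                throw new IllegalArgumentException(
                    "[" + key + "] is not valid for strategy [" + strategy + "]");
            }
        }
    }
}
```

With the options flattened, a table like this would have to be duplicated or consulted inside the function itself; with nesting, adding a new strategy only touches the builder.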

carlosdelest · 2025-11-18T13:46:06Z

various chunking settings strategies have different options. For example, sentence-based chunking requires sentence_overlap but word-based chunking requires overlap. And none would have no overlap. So if we were to flatten out these options, we'd have to take every possible permutation into account and commit to updating it going forward.

@kderusso I'm not sure that I understand. Can't we take this option map:

{"num_chunks": 3, "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0}

and just send it to the ChunkingSettingsBuilder, removing the num_chunks from it?
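The flattened approach proposed here could look roughly like the Java sketch below: pull num_chunks out of the flat map and hand everything else to the settings builder. `FlatChunkOptionsSketch` and `Split` are hypothetical names used for illustration only, and the call into the actual builder is left out because its signature is not shown in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FlatChunkOptionsSketch {
    /** Hypothetical result: the chunk limit (null if absent) plus the options left for the settings builder. */
    static final class Split {
        final Integer numChunks;
        final Map<String, Object> chunkingSettings;

        Split(Integer numChunks, Map<String, Object> chunkingSettings) {
            this.numChunks = numChunks;
            this.chunkingSettings = chunkingSettings;
        }
    }

    static Split split(Map<String, Object> options) {
        // Copy first so the caller's map is left untouched.
        Map<String, Object> remaining = new LinkedHashMap<>(options);
        Object raw = remaining.remove("num_chunks");
        if (raw != null && !(raw instanceof Integer)) {
            throw new IllegalArgumentException("num_chunks must be an integer, got: " + raw);
        }
        // Everything that remains would be passed to the chunking settings builder as-is.
        return new Split((Integer) raw, remaining);
    }
}
```

The trade-off is that every non-chunking key the function ever gains has to be stripped out the same way before the remainder reaches the builder.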

kderusso · 2025-11-18T14:09:04Z

@kderusso I'm not sure that I understand. Can't we take this option map:

{"num_chunks": 3, "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0}

and just send it to the ChunkingSettingsBuilder, removing the num_chunks from it?

I suppose that's an option, to send in all other options as a big bag'o'options, but I'm not thrilled with that for future API extensibility - if we ever add additional options like super_excellent_widget then it really complicates everything? Maybe I'm overthinking it, it's a good discussion - would like to hear from @ioanatia on this too.

github-actions · 2025-11-19T21:07:30Z

🔍 Preview links for changed docs

docs/reference/query-languages/esql/kibana/docs/functions/chunk.md

kderusso · 2025-11-19T21:09:24Z

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java


-    public static final int DEFAULT_NUM_CHUNKS = Integer.MAX_VALUE;
-    public static final int DEFAULT_CHUNK_SIZE = 300;
+    static final int DEFAULT_CHUNK_SIZE = 300;


Should we change this to 5500, which per #132169 should be Jina's rerank window size?

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java

elasticsearchmachine · 2025-11-19T21:24:56Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elasticsearchmachine · 2025-11-19T21:24:57Z

Hi @kderusso, I've created a changelog YAML for you.

x-pack/plugin/esql/qa/testFixtures/src/main/resources/chunk.csv-spec

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java

carlosdelest

Nice work!

Let's address Ioana's concerns - the rest LGTM

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java

ioanatia · 2025-11-24T11:56:15Z

...d/org/elasticsearch/xpack/esql/expression/function/scalar/string/ChunkBytesRefEvaluator.java

              result.appendNull();
              continue position;
        }
-        switch (numChunksBlock.getValueCount(p)) {


❤️ I really like how much you simplified things here ❤️

kderusso added 7 commits November 14, 2025 08:54

Stash claude changes

02231c3

Update

5cc0a5d

test

deaddf8

fix

fd8ef3d

iter

f1e2cf6

iter

67f78f3

tests

abbc7b3

elasticsearchmachine added the v9.3.0 label Nov 14, 2025

[CI] Auto commit changes from spotless

180d086

kderusso requested review from carlosdelest and ioanatia November 17, 2025 15:08

carlosdelest reviewed Nov 18, 2025

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java

...src/test/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/ChunkTests.java

kderusso added 2 commits November 19, 2025 14:32

Merge branch 'main' into kderusso/esql-chunk-chunking-settings

4f20501

Remove num chunks, change options to chunking settings map

cbf9632

kderusso commented Nov 19, 2025

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java

elasticsearchmachine and others added 2 commits November 19, 2025 21:13

[CI] Auto commit changes from spotless

c332086

Verifier tests

7b153b6

kderusso marked this pull request as ready for review November 19, 2025 21:23

Merge branch 'main' into kderusso/esql-chunk-chunking-settings

66b293f

elasticsearchmachine added the needs:triage label Nov 19, 2025

kderusso added the >enhancement and :Search Relevance/ES|QL labels and removed the needs:triage label Nov 19, 2025

elasticsearchmachine added the Team:Search Relevance label Nov 19, 2025

Update docs/changelog/138123.yaml

38cc69b

kderusso requested a review from a team November 19, 2025 21:25

[CI] Auto commit changes from spotless

bcb7225

ioanatia reviewed Nov 20, 2025

carlosdelest approved these changes Nov 20, 2025

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java

kderusso and others added 10 commits November 20, 2025 15:27

PR Feedback

fbf8ffd

[CI] Auto commit changes from spotless

06fedb7

Checkpoint - mid refactoring to support chunking setting explicit options, tests don't work

60fd328

[CI] Auto commit changes from spotless

a715083

fix separators issue

6959361

Docs

2451079

cleanup

7bba214

Cleanup

6d96ce0

[CI] Auto commit changes from spotless

cb9a190

Merge branch 'main' into kderusso/esql-chunk-chunking-settings

e77541d

kderusso requested a review from ioanatia November 21, 2025 21:28

Merge branch 'main' into kderusso/esql-chunk-chunking-settings

1bb2f08

ioanatia approved these changes Nov 24, 2025

kderusso merged commit 3aba4cd into elastic:main Nov 24, 2025
34 checks passed
