-
Notifications
You must be signed in to change notification settings - Fork 25.6k
ES|QL Update CHUNK to support chunking_settings as optional argument #138123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ES|QL Update CHUNK to support chunking_settings as optional argument #138123
Conversation
ℹ️ Important: Docs version tagging👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version. We use applies_to tags to mark version-specific features and changes. Expand for a quick overviewWhen to use applies_to tags:✅ At the page level to indicate which products/deployments the content applies to (mandatory) What NOT to do:❌ Don't remove or replace information that applies to an older version 🤔 Need help?
|
carlosdelest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far LGTM - I think the options validation is correct (missing some tests) but the direction is good!
As a drive-by comment - I find a bit confusing having to add nested options when we have a single other param (num_chunks):
CHUNK(content, {"num_chunks": 3, "chunking_settings": { "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0 } })
}
Wouldn't it be better to avoid having to specify chunking_settings as an additional option, and just flatten the chunking settings into the overall options?
CHUNK(content, {"num_chunks": 3, "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0 })
}
We can do the same chunking building if we remove the num_chunks from the options map and then try to build from there 🤷
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Outdated
Show resolved
Hide resolved
...src/test/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/ChunkTests.java
Show resolved
Hide resolved
@carlosdelest So I get that, but here's the rub - various chunking settings strategies have different options. For example, I realize the nested syntax is confusing but I'm just not sure if there's a better way to do that. We can optimize for |
@kderusso I'm not sure that I understand. Can't we take this option map:
and just send it to the |
I suppose that's an option, to send in all other options as a big bag'o'options, but I'm not thrilled with that for future API extensibility - if we ever add additional options like |
🔍 Preview links for changed docs |
|
|
||
| public static final int DEFAULT_NUM_CHUNKS = Integer.MAX_VALUE; | ||
| public static final int DEFAULT_CHUNK_SIZE = 300; | ||
| static final int DEFAULT_CHUNK_SIZE = 300; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we change this to 5500, which per #132169 should be Jina's rerank window size?
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Show resolved
Hide resolved
|
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
|
Hi @kderusso, I've created a changelog YAML for you. |
x-pack/plugin/esql/qa/testFixtures/src/main/resources/chunk.csv-spec
Outdated
Show resolved
Hide resolved
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Show resolved
Hide resolved
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Outdated
Show resolved
Hide resolved
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Show resolved
Hide resolved
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Show resolved
Hide resolved
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Outdated
Show resolved
Hide resolved
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Outdated
Show resolved
Hide resolved
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Outdated
Show resolved
Hide resolved
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Outdated
Show resolved
Hide resolved
carlosdelest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
Let's address Ioana's concerns - the rest LGTM
...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Chunk.java
Show resolved
Hide resolved
…ions, tests don't work
| result.appendNull(); | ||
| continue position; | ||
| } | ||
| switch (numChunksBlock.getValueCount(p)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ I really like how much you simplified things here ❤️
…-json * upstream/main: (247 commits) Mute org.elasticsearch.xpack.inference.integration.SemanticTextIndexOptionsIT testValidateIndexOptionsWithBasicLicense elastic#138513 Mute org.elasticsearch.xpack.esql.heap_attack.HeapAttackLookupJoinIT testLookupExplosionBigString elastic#138510 This shouldn't be zero (elastic#138501) sum of empty histogram is now null (elastic#138378) Test ES|QL bfloat16 support (elastic#138499) Fix exception handling in S3 `compareAndExchangeRegister` (elastic#138488) Mute org.elasticsearch.xpack.exponentialhistogram.ExponentialHistogramFieldMapperTests testFormattedDocValues elastic#138504 Mute org.elasticsearch.ingest.geoip.IngestGeoIpClientYamlTestSuiteIT test {yaml=ingest_geoip/60_ip_location_databases/Test adding, getting, and removing ip location databases} elastic#138502 ESQL: Refactor HeapAttackIT (elastic#138432) [Inference API] Add ElasticInferenceServiceDenseTextEmbeddingsServiceSettings to InferenceNamedWriteablesProvider (elastic#138484) Store split indices (elastic#138396) ES|QL Update CHUNK to support chunking_settings as optional argument (elastic#138123) Extract common blob-update logic in `S3HttpHandler` (elastic#138490) Cleanup esql request building api (elastic#138398) Round sum and avg in exponential_histogram CSV tests (elastic#138472) ESQL: load exponential_histogram total count as double instead of long (elastic#138417) [SIMD] Use fixed width native types for better Java interoperability (elastic#138429) Do not use Min or Max as Top's surrogate when there is an outputField (elastic#138380) ES|QL: Fix generative tests (elastic#138478) Mute org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT testCreatesEisChatCompletion_DoesNotRemoveEndpointWhenNoLongerAuthorized elastic#138480 ...
…lastic#138123) * Stash claude changes * Update * test * fix * iter * iter * tests * [CI] Auto commit changes from spotless * Remove num chunks, change options to chunking settings map * [CI] Auto commit changes from spotless * Verifier tests * Update docs/changelog/138123.yaml * [CI] Auto commit changes from spotless * PR Feedback * [CI] Auto commit changes from spotless * Checkpoint - mid refactoring to support chunking setting explicit options, tests don't work * [CI] Auto commit changes from spotless * fix separators issue * Docs * cleanup * Cleanup * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
…lastic#138123) * Stash claude changes * Update * test * fix * iter * iter * tests * [CI] Auto commit changes from spotless * Remove num chunks, change options to chunking settings map * [CI] Auto commit changes from spotless * Verifier tests * Update docs/changelog/138123.yaml * [CI] Auto commit changes from spotless * PR Feedback * [CI] Auto commit changes from spotless * Checkpoint - mid refactoring to support chunking setting explicit options, tests don't work * [CI] Auto commit changes from spotless * fix separators issue * Docs * cleanup * Cleanup * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Updates
CHUNKto supportchunking_settingsas an optional argument. Removesnum_chunks, in favor ofMV_SLICE.Usage: