[ML] Apply windowing and chunking to long documents #104363
Conversation
Pinging @elastic/ml-core (Team:ML)
Hi @davidkyle, I've created a changelog YAML for you.
Overall changes LGTM. 👍 How will chunking options work by default? Will models be deployed with default chunking options?
Left a few questions and comments
Review threads on:
- ...plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/action/ChunkedInferenceAction.java
- ...main/java/org/elasticsearch/xpack/core/ml/inference/results/ChunkedTextExpansionResults.java
- ...main/java/org/elasticsearch/xpack/core/ml/inference/results/ChunkedTextEmbeddingResults.java
- ...ugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportChunkedInferenceAction.java
- ...plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/TextExpansionProcessor.java
Currently I'm not sure. There are two options to consider. One is to take the defaults from the model config, but the model config doesn't have a default value of `span`. The other is that the default span could be a function of the max sequence length.
# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java
Looks good, just a few questions around switching to `readOptional*()`, and I think we need a few `return` statements after `onFailure()` calls.
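To illustrate that last point (a hedged sketch; `Request`, `Response`, and `runInference()` are hypothetical stand-ins, not names from this diff): an `ActionListener` should be completed exactly once, so a failure branch that calls `onFailure()` without returning falls through and completes the listener a second time.

```java
import org.elasticsearch.action.ActionListener;

// Hypothetical handler showing why a return is needed after onFailure().
void infer(Request request, ActionListener<Response> listener) {
    if (request.inputs().isEmpty()) {
        listener.onFailure(new IllegalArgumentException("inputs must not be empty"));
        return; // without this, execution falls through and the listener is completed twice
    }
    listener.onResponse(runInference(request));
}
```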
Review threads on:
- .../core/src/main/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/Tokenization.java (two threads)
- ...ugin/src/main/java/org/elasticsearch/xpack/inference/mock/TestInferenceServiceExtension.java
- ...erence/src/main/java/org/elasticsearch/xpack/inference/InferenceNamedWriteablesProvider.java
- ...rence/src/main/java/org/elasticsearch/xpack/inference/services/elser/ElserMlNodeService.java (two threads)
- ...ml/src/main/java/org/elasticsearch/xpack/ml/inference/deployment/InferencePyTorchAction.java
# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java
The changes in elastic#105183 clashed with elastic#104363
Adds a `chunkedInfer()` method to the `InferenceService` interface which automatically splits long text before sending the inputs to the model. Chunking is done via a sliding window of length `window_size` with an overlap of `span`.

By default the window size is equal to the model's max sequence length, and the span is 50% of that (after accounting for special tokens). One reason to choose a smaller window size is that processing time grows quadratically with the number of input tokens; reducing the window size loses some context (fewer tokens per input) but may be the fastest strategy for ingesting long text.

This change only applies to the ELSER model and Text Embedding models deployed locally in the cluster.
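For illustration, here is a minimal sketch of the sliding-window splitting described above; it is an assumption-laden outline, not the PR's implementation, and `windowSize`/`span` simply mirror the `window_size` and `span` options:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of sliding-window chunking (illustrative, not the PR's code).
// Each chunk is up to windowSize tokens and overlaps the previous chunk by
// span tokens, so consecutive windows advance by windowSize - span.
class ChunkingSketch {
    static List<List<String>> chunk(List<String> tokens, int windowSize, int span) {
        if (span >= windowSize) {
            throw new IllegalArgumentException("span must be smaller than window_size");
        }
        List<List<String>> chunks = new ArrayList<>();
        int stride = windowSize - span;
        for (int start = 0; start < tokens.size(); start += stride) {
            int end = Math.min(start + windowSize, tokens.size());
            chunks.add(tokens.subList(start, end));
            if (end == tokens.size()) {
                break; // the last window reached the end of the input
            }
        }
        return chunks;
    }
}
```

With the defaults above (window equal to the max sequence length, span at 50% of it), the stride is half a window, so each token lands in at most two chunks.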
Response Structure
(Field names subject to change)