[ML] Apply windowing and chunking to long documents #104363

Merged
merged 24 commits into from Feb 1, 2024

davidkyle
Member

@davidkyle davidkyle commented Jan 15, 2024

Adds a chunkedInfer() method to the InferenceService interface which automatically splits long text before sending the inputs to the model. Chunking is done via a sliding window of length window_size with an overlap of span.

By default, the window size equals the model's max sequence length and span is 50% of that (after accounting for special tokens). One reason to choose a smaller window size is that processing time grows faster than linearly with the number of input tokens (self-attention cost is quadratic in sequence length). Reducing the window size loses some context (fewer tokens per input) but may be the fastest strategy for ingesting long text.
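
The sliding window described above can be sketched roughly as follows. This is a minimal Python illustration, not the Java code in this PR: `sliding_windows` is a hypothetical helper, tokens are plain strings rather than model token IDs, and `window_size` is assumed to already exclude any special tokens.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[str], window_size: int, span: int) -> List[List[str]]:
    """Split tokens into windows of at most window_size tokens.

    Consecutive windows share `span` tokens (the overlap), so each new
    window starts window_size - span tokens after the previous one.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= span < window_size:
        raise ValueError("span must satisfy 0 <= span < window_size")
    if len(tokens) == 0:
        return []

    stride = window_size - span
    windows: List[List[str]] = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break  # this window reached the end of the input
        start += stride
    return windows
```

With ten tokens, `window_size=4` and `span=2`, the windows start at offsets 0, 2, 4 and 6, and each shares two tokens with its neighbour.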

This change only applies to the ELSER model and Text Embedding models deployed locally in the cluster.

Response Structure

(Field names subject to change)

{
  "sparse_embedding_chunk": [
    {
      "text": "first text chunk",
      "inference": { sparse embedding tokens... }
    },
    {
      "text": "second text chunk",
      "inference": { sparse embedding tokens... }
    },
    ...
  ]
}
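
As a rough illustration of the shape above, the following Python sketch pairs each chunk with an inference result. The field names are taken from the draft structure and may change; `build_chunked_response` and `fake_sparse_embed` are hypothetical names, and the stand-in embedder does not resemble real ELSER output.

```python
from typing import Callable, Dict, List


def build_chunked_response(
    chunks: List[str],
    embed: Callable[[str], Dict[str, float]],
) -> dict:
    """Pair each text chunk with its inference result, in input order."""
    return {
        "sparse_embedding_chunk": [
            {"text": chunk, "inference": embed(chunk)} for chunk in chunks
        ]
    }


def fake_sparse_embed(text: str) -> Dict[str, float]:
    """Stand-in for a model call: gives every distinct word a weight of 1.0."""
    return {word: 1.0 for word in text.split()}


response = build_chunked_response(
    ["first text chunk", "second text chunk"], fake_sparse_embed
)
```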

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jan 15, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @davidkyle, I've created a changelog YAML for you.

@davidkyle davidkyle added the Team:Search Meta label for search team label Jan 15, 2024
@elasticsearchmachine elasticsearchmachine removed the Team:Search Meta label for search team label Jan 15, 2024
@carlosdelest
Member

Overall changes LGTM. 👍

How will chunking options work by default? Will models be deployed with default chunking options?

Contributor

@jonathan-buttner jonathan-buttner left a comment

Left a few questions and comments

@davidkyle
Member Author

How will chunking options work by default? Will models be deployed with default chunking options?

Currently I'm not sure.

There are 2 options to consider: span and windowSize. Every model has a max sequence length which maps to windowSize if windowSize is not set.

But the model config doesn't have a default value for span. This is a problem because the span parameter is optional (Integer rather than int), but in practice it must be set as there is no default.

The default span could be a function of max sequence length.
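
One way to picture deriving the defaults is the Python sketch below. It is only an assumption-laden illustration: `resolve_chunking_options` is a hypothetical helper, the count of two special tokens is assumed, and rejecting a window larger than the max sequence length is a choice made here rather than behaviour confirmed by the PR. The 50% ratio follows the PR description.

```python
from typing import Optional, Tuple


def resolve_chunking_options(
    max_sequence_length: int,
    window_size: Optional[int] = None,
    span: Optional[int] = None,
    num_special_tokens: int = 2,  # assumption: e.g. one start and one end token
) -> Tuple[int, int]:
    """Fill in unset options and return (window_size, span)."""
    if window_size is None:
        window_size = max_sequence_length  # fall back to the model's limit
    if not 0 < window_size <= max_sequence_length:
        raise ValueError("window_size must be in (0, max_sequence_length]")

    usable = window_size - num_special_tokens  # tokens left for actual text
    if usable <= 0:
        raise ValueError("window_size leaves no room for text tokens")

    if span is None:
        span = usable // 2  # default overlap: half of the usable window
    if not 0 <= span < usable:
        raise ValueError("span must be smaller than the usable window")
    return window_size, span
```

Under these assumptions a model with a max sequence length of 512 resolves to a window of 512 and a span of 255.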

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
Contributor

@jonathan-buttner jonathan-buttner left a comment

Looks good, just a few questions around switching to readOptional*() and I think we need a few return statements after onFailure() calls.
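
The point about returning after a failure callback can be illustrated with a small Python sketch. It is not the PR's Java code: `RecordingListener` and `chunk_text_sketch` are hypothetical stand-ins for a listener with response and failure callbacks, and chunking here is by characters for brevity.

```python
from typing import List


class RecordingListener:
    """Hypothetical listener that records every callback it receives."""

    def __init__(self) -> None:
        self.responses: List[List[str]] = []
        self.failures: List[Exception] = []

    def on_response(self, chunks: List[str]) -> None:
        self.responses.append(chunks)

    def on_failure(self, error: Exception) -> None:
        self.failures.append(error)


def chunk_text_sketch(text: str, window_size: int, listener: RecordingListener) -> None:
    """Report either a failure or a response to the listener, never both."""
    if window_size <= 0:
        listener.on_failure(ValueError("window_size must be positive"))
        return  # without this return, execution would continue into the code below
    chunks = [text[i:i + window_size] for i in range(0, len(text), window_size)]
    listener.on_response(chunks)
```

The early `return` keeps the failure path from also running the success path, so the listener is notified exactly once.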

@davidkyle davidkyle merged commit 6439614 into elastic:main Feb 1, 2024
15 checks passed
jedrazb pushed a commit to jedrazb/elasticsearch that referenced this pull request Feb 2, 2024
elasticsearchmachine pushed a commit that referenced this pull request Feb 6, 2024
The changes in #105183 clashed with #104363
felixbarny pushed a commit to felixbarny/elasticsearch that referenced this pull request Feb 8, 2024
@davidkyle davidkyle deleted the chunking branch April 2, 2024 15:34
Labels
:ml Machine learning Team:ML Meta label for the ML team v8.13.0

5 participants