Skip the TokenStream overhead when indexing simple keywords. #12139
Indexing simple keywords through a `TokenStream` abstraction introduces a bit of overhead due to attribute management. It is not much, but indexing a keyword boils down to adding to a hash map and appending to a postings list, which is also quite cheap, so even a small overhead can significantly impact indexing speed.

This change works by making `IndexingChain` check `binaryValue()` when a field is indexed but `tokenStream()` returns `null`. `KeywordField` then only has to return `null` from `tokenStream()` to take advantage of the optimization. I hesitated to do the same with `StringField` because it might break users who pull a `TokenStream` themselves.
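To make the dispatch described above concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: `KeywordFastPathSketch`, `SketchField`, and the `keyword`/`text` helpers are hypothetical names invented for illustration, and the real `IndexingChain` is considerably more involved.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Toy model of the null-check dispatch; none of these types are Lucene classes. */
final class KeywordFastPathSketch {

  /** Hypothetical stand-in for an indexable field. */
  interface SketchField {
    String name();

    /** Tokens to index, or null when the field is a single keyword. */
    Iterator<String> tokenStream();

    /** Raw keyword bytes, or null when the field is tokenized. */
    byte[] binaryValue();
  }

  /** "field:term" -> doc IDs, standing in for the terms hash and postings lists. */
  private final Map<String, List<Integer>> postings = new HashMap<>();

  void index(int docId, SketchField field) {
    Iterator<String> tokens = field.tokenStream();
    if (tokens != null) {
      // General path: consume every token produced by the stream.
      while (tokens.hasNext()) {
        addPosting(field.name(), tokens.next(), docId);
      }
      return;
    }
    byte[] value = field.binaryValue();
    if (value == null) {
      throw new IllegalArgumentException("indexed field has neither tokens nor a binary value");
    }
    // Fast path: the whole value is one term, so no token stream is needed.
    addPosting(field.name(), new String(value, StandardCharsets.UTF_8), docId);
  }

  private void addPosting(String fieldName, String term, int docId) {
    postings.computeIfAbsent(fieldName + ":" + term, k -> new ArrayList<>()).add(docId);
  }

  List<Integer> docsFor(String fieldName, String term) {
    return postings.getOrDefault(fieldName + ":" + term, List.of());
  }

  /** Keyword field: opts into the fast path by returning null from tokenStream(). */
  static SketchField keyword(String name, String value) {
    return new SketchField() {
      public String name() { return name; }
      public Iterator<String> tokenStream() { return null; }
      public byte[] binaryValue() { return value.getBytes(StandardCharsets.UTF_8); }
    };
  }

  /** Tokenized field: splits on whitespace as a stand-in for analysis. */
  static SketchField text(String name, String value) {
    return new SketchField() {
      public String name() { return name; }
      public Iterator<String> tokenStream() { return Arrays.asList(value.split("\\s+")).iterator(); }
      public byte[] binaryValue() { return null; }
    };
  }

  public static void main(String[] args) {
    KeywordFastPathSketch chain = new KeywordFastPathSketch();
    chain.index(0, keyword("id", "doc-0"));
    chain.index(0, text("body", "hello world"));
    System.out.println(chain.docsFor("id", "doc-0"));
    System.out.println(chain.docsFor("body", "world"));
  }
}
```

The sketch also shows why the approach amounts to type guessing: the chain infers the field's kind from which accessor happens to return `null`.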
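I ran this made-up benchmark to assess the benefits of the change. It is not representative of a real-world scenario since it disables merging (to reduce noise), but it indexes a combination of terms plus doc values and includes flush times, so it measures more than just keyword indexing.

    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.KeywordField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.index.NoMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    public static void main(String[] args) throws IOException {
      Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
      for (int iter = 0; iter < 100; ++iter) {
        // No analyzer is needed for keyword fields; merging is disabled to reduce noise.
        IndexWriterConfig cfg = new IndexWriterConfig(null)
            .setOpenMode(OpenMode.CREATE)
            .setMergePolicy(NoMergePolicy.INSTANCE)
            .setMaxBufferedDocs(200_000)
            .setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
        long start = System.nanoTime();
        try (IndexWriter w = new IndexWriter(dir, cfg)) {
          // Reuse a single document and its fields across all iterations.
          Document doc = new Document();
          KeywordField field1 = new KeywordField("field1", new BytesRef(1), Field.Store.NO);
          doc.add(field1);
          KeywordField field2 = new KeywordField("field2", new BytesRef(1), Field.Store.NO);
          doc.add(field2);
          KeywordField field3 = new KeywordField("field3", new BytesRef(1), Field.Store.NO);
          doc.add(field3);
          for (int i = 0; i < 10_000_000; ++i) {
            // Mutate the backing byte of each field in place before indexing.
            field1.binaryValue().bytes[0] = (byte) i;
            field2.binaryValue().bytes[0] = (byte) (3 * i);
            field3.binaryValue().bytes[0] = (byte) (5 * i);
            w.addDocument(doc);
          }
        }
        long end = System.nanoTime();
        // Elapsed nanoseconds divided by the 10M documents indexed (includes flush time).
        System.out.println((end - start) / 10_000_000 + " ns per doc");
      }
    }

Before the change, indexing takes 5.3us per document. After the change it takes 4.3us.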
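As a rough reading of those two figures (and only those two; nothing here is an additional measurement), the relative change works out as follows:

$$
\frac{5.3 - 4.3}{5.3} \approx 0.19
\qquad\text{and}\qquad
\frac{5.3}{4.3} - 1 \approx 0.23
$$

That is about 19% less time per document, or equivalently about 23% more documents per unit of time, for this particular benchmark, whose timings also include doc values and flushing.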
Maybe we can somehow deprecate using a `TokenStream` there in 9.x? Pulling a `TokenStream` is very expert, and it doesn't seem like `StringField` needs to support that.
High level, I don't think it's a big problem, but we are adding some type-guessing, with a lot of runtime checks, versus the user somehow having some type safety via the `.document` package. This is similar to what got fixed recently in stored fields (#12116). Just worth thinking about: is there any way this can be more type-safe to the user in the API?
You make a good point about adding more type guessing. Ideally I imagine that we could have an |
This sounds better to me than type-guessing and null values. Then my complaint goes away. |
also maybe the Reader/String -> Analyzer -> TokenStream path could be fixed to use this? That's also confusing and this PR makes it even more so. |
I removed type guessing by adding a new |
I'm lost, I see type guessing and an `InvertableType` class that does nothing. Maybe you forgot to `git add` or something?
Yes! Sorry about that. |
its better, i'm only sad about a naming issue:
|
This is consistent with `StoredValue.Type.BINARY` and `IndexableField#binaryValue()`.
Fair point, I renamed |
Yes, better, thanks! The only good thing about "Term" was that it captured the singleton nature. I'd just suggest a small improvement to the javadocs for BINARY to mention that it's "a single value" or similar? We don't want someone to pass a large UTF-8 encoded document in this way :)
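The direction the review converged on, an explicit type instead of `null` checks, can be sketched as below. This is an illustration only: `SketchInvertableType`, `TypedField`, and the helper methods are hypothetical names, and the enum's second constant is an assumption, since the thread only names `InvertableType` and `BINARY`.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Objects;

/** Toy model of explicit-type dispatch; none of these types are Lucene classes. */
final class InvertableTypeSketch {

  /** Hypothetical enum modeled on the discussion. */
  enum SketchInvertableType {
    /** A single value indexed as exactly one term; not meant for large documents. */
    BINARY,
    /** A value that is analyzed into a stream of tokens (assumed second constant). */
    TOKEN_STREAM
  }

  /** Hypothetical field that declares how it wants to be inverted. */
  interface TypedField {
    SketchInvertableType invertableType();

    byte[] binaryValue();

    List<String> tokens();
  }

  /** Returns the terms to index, chosen by the declared type rather than by null checks. */
  static List<String> termsOf(TypedField field) {
    switch (field.invertableType()) {
      case BINARY: {
        byte[] value = Objects.requireNonNull(field.binaryValue(), "BINARY field needs a binary value");
        return List.of(new String(value, StandardCharsets.UTF_8));
      }
      case TOKEN_STREAM:
        return Objects.requireNonNull(field.tokens(), "TOKEN_STREAM field needs tokens");
      default:
        throw new AssertionError("unknown type: " + field.invertableType());
    }
  }

  static TypedField keyword(String value) {
    return new TypedField() {
      public SketchInvertableType invertableType() { return SketchInvertableType.BINARY; }
      public byte[] binaryValue() { return value.getBytes(StandardCharsets.UTF_8); }
      public List<String> tokens() { return null; }
    };
  }

  static TypedField text(String... tokens) {
    return new TypedField() {
      public SketchInvertableType invertableType() { return SketchInvertableType.TOKEN_STREAM; }
      public byte[] binaryValue() { return null; }
      public List<String> tokens() { return List.of(tokens); }
    };
  }

  public static void main(String[] args) {
    System.out.println(termsOf(keyword("doc-0")));
    System.out.println(termsOf(text("hello", "world")));
  }
}
```

With the type declared up front, a field that claims `BINARY` but supplies no value fails loudly instead of silently falling through to another path.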