Skip the TokenStream overhead when indexing simple keywords. #12139

jpountz · 2023-02-09T17:41:31Z

Indexing simple keywords through a TokenStream abstraction introduces a bit of overhead due to attribute management. Not much, but indexing keywords boils down to adding to a hash map and appending to a postings list, which is quite cheap too so even some low overhead can significantly impact indexing speed.

The way that this change works is by making IndexingChain check binaryValue() when a field is indexed but tokenStream() returns null. Then KeywordField only has to return null in tokenStream() to take advantage of this optimization. I hesitated doing the same with StringField but wondered if this could be breaking to some users who might pull a TokenStream themselves.

Indexing simple keywords through a `TokenStream` abstraction introduces a bit of overhead due to attribute management. Not much, but indexing keywords boils down to adding to a hash map and appending to a postings list, which is quite cheap too so even some low overhead can significantly impact indexing speed. The way that this change works is by making `IndexingChain` check `binaryValue()` when a field is indexed but `tokenStream()` returns `null`. Then `KeywordField` only has to return `null` in `tokenStream()` to take advantage of this optimization. I hesitated doing the same with `StringField` but wondered if this could be breaking to some users who might pull a `TokenStream` themselves.

jpountz · 2023-02-09T17:44:38Z

I ran this made-up benchmark to try to assess the benefits of the change. It's not representative of a real-world scenario since it disables merging (to reduce noise), but it still indexes a combination of terms plus doc values and includes flush times so it includes more than just keyword indexing.

	public static void main(String[] args) throws IOException {
	  Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
	  for (int iter = 0; iter < 100; ++iter) {
	    IndexWriterConfig cfg = new IndexWriterConfig(null)
	        .setOpenMode(OpenMode.CREATE)
	        .setMergePolicy(NoMergePolicy.INSTANCE)
	        .setMaxBufferedDocs(200_000)
	        .setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
	    long start = System.nanoTime();
	    try (IndexWriter w = new IndexWriter(dir, cfg)) {
	      Document doc = new Document();
	      KeywordField field1 = new KeywordField("field1", new BytesRef(1), Field.Store.NO);
	      doc.add(field1);
	      KeywordField field2 = new KeywordField("field2", new BytesRef(1), Field.Store.NO);
	      doc.add(field2);
	      KeywordField field3 = new KeywordField("field3", new BytesRef(1), Field.Store.NO);
	      doc.add(field3);
	      for (int i = 0; i < 10_000_000; ++i) {
	        field1.binaryValue().bytes[0] = (byte) i;
	        field2.binaryValue().bytes[0] = (byte) (3 * i);
	        field3.binaryValue().bytes[0] = (byte) (5 * i);
	        w.addDocument(doc);
	      }
	    }
	    long end = System.nanoTime();
	    System.out.println((end - start) / 1_000_000 + " ns per doc");
	  }
	}

Before the change, indexing takes 5.3us per document. After the change it takes 4.3us.

rmuir · 2023-02-09T23:21:12Z

I hesitated doing the same with StringField but wondered if this could be breaking to some users who might pull a TokenStream themselves.

Maybe we can somehow deprecate using a tokenstream there in 9.x, pulling a tokenstream is very expert and doesn't seem like StringField needs to support that?

rmuir · 2023-02-09T23:27:22Z

high level, i dont think its a big problem, but we are adding some type-guessing, with a lot of runtime checks, versus the user somehow having some type safety via the .document package. Similar to what got fixed recently in stored fields (#12116).

Just worth thinking about, is there anyway this can be more type-safe to the user in the API.

jpountz · 2023-02-10T14:32:53Z

You make a good point about adding more type guessing. Ideally IndexableField would have its APIs designed around how IndexingChain consumes it.

I imagine that we could have an IndexableField#invertableValue() method or something like that that would return a similar object as StoredValue that could hold either a BytesRef or a TokenStream.

rmuir · 2023-02-12T13:08:11Z

I imagine that we could have an IndexableField#invertableValue() method or something like that that would return a similar object as StoredValue that could hold either a BytesRef or a TokenStream.

This sounds better to me than type-guessing and null values. Then my complaint goes away.

rmuir · 2023-02-12T13:09:46Z

also maybe the Reader/String -> Analyzer -> TokenStream path could be fixed to use this? That's also confusing and this PR makes it even more so.

jpountz · 2023-02-16T13:49:23Z

I removed type guessing by adding a new IndexableField#invertableType that can be either TERM or TOKEN_STREAM. The type guessing is now contained in Field.java. Initially I wanted to contain everything through something that would like more like a value type, like StoredValue but fields must be able to customize the way that they produce their token stream and I didn't like requiring IndexableField implementations to provide both an implementation for the IndexableField and for this abstraction that produces terms or token streams. I'm curious if you have thoughts on ways to make the API better @rmuir.

rmuir · 2023-02-16T13:58:10Z

I'm lost, i see type guessing and an InvertableType class that does nothing. Maybe you forgot to 'git add' or something?

jpountz · 2023-02-16T14:06:02Z

Yes! Sorry about that.

rmuir · 2023-02-16T14:33:24Z

its better, i'm only sad about a naming issue:

InvertableType: OK
InvertableType.TERM: Terrible, it isn't a Term at all, its a BytesRef.
InvertableType.TOKEN_STREAM: OK

This is consistent with `StoredValue.Type.BINARY` and `IndexableField#binaryValue()`.

jpountz · 2023-02-16T14:36:29Z

Fair point, I renamed TERM to BINARY, which is consistent with StoredValue and the fact that the API on IndexableField is called #binaryValue()?

rmuir · 2023-02-16T14:42:42Z

yes, better thanks! The only thing good about the "Term" was that it did capture the singleton nature. I'd just suggest a small improvement to the javadocs for BINARY to mention that its "a single value" or similar?

We don't want someone to pass a large UTF-8 encoded document in this way :)

Indexing simple keywords through a `TokenStream` abstraction introduces a bit of overhead due to attribute management. Not much, but indexing keywords boils down to adding to a hash map and appending to a postings list, which is quite cheap too so even some low overhead can significantly impact indexing speed.

jpountz added 2 commits February 10, 2023 15:18

CHANGES

3ee21c9

Deprecate pulling a TokenStream on a StringField.

e848790

jpountz added 2 commits February 16, 2023 13:30

Merge branch 'main' into faster_binary_field_indexing

fac55a9

Remove type guessing.

feba503

actual changes!

1cd0534

Rename TERM to BINARY.

0874cd8

This is consistent with `StoredValue.Type.BINARY` and `IndexableField#binaryValue()`.

jpountz added 2 commits February 16, 2023 15:46

Improve javadocs.

98d8a4b

Merge branch 'main' into faster_binary_field_indexing

7f9f083

jpountz added this to the 9.6.0 milestone Feb 21, 2023

jpountz merged commit cce33b0 into apache:main Feb 21, 2023

jpountz deleted the faster_binary_field_indexing branch February 21, 2023 13:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Skip the TokenStream overhead when indexing simple keywords. #12139

Skip the TokenStream overhead when indexing simple keywords. #12139

jpountz commented Feb 9, 2023

jpountz commented Feb 9, 2023

rmuir commented Feb 9, 2023

rmuir commented Feb 9, 2023

jpountz commented Feb 10, 2023

rmuir commented Feb 12, 2023

rmuir commented Feb 12, 2023

jpountz commented Feb 16, 2023

rmuir commented Feb 16, 2023

jpountz commented Feb 16, 2023

rmuir commented Feb 16, 2023

jpountz commented Feb 16, 2023

rmuir commented Feb 16, 2023

Skip the TokenStream overhead when indexing simple keywords. #12139

Skip the TokenStream overhead when indexing simple keywords. #12139

Conversation

jpountz commented Feb 9, 2023

jpountz commented Feb 9, 2023

rmuir commented Feb 9, 2023

rmuir commented Feb 9, 2023

jpountz commented Feb 10, 2023

rmuir commented Feb 12, 2023

rmuir commented Feb 12, 2023

jpountz commented Feb 16, 2023

rmuir commented Feb 16, 2023

jpountz commented Feb 16, 2023

rmuir commented Feb 16, 2023

jpountz commented Feb 16, 2023

rmuir commented Feb 16, 2023