Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip the TokenStream overhead when indexing simple keywords. #12139

Merged
merged 9 commits into from
Feb 21, 2023

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Feb 9, 2023

Indexing simple keywords through a TokenStream abstraction introduces a bit of overhead due to attribute management. Not much, but indexing keywords boils down to adding to a hash map and appending to a postings list, which is quite cheap too so even some low overhead can significantly impact indexing speed.

The way that this change works is by making IndexingChain check binaryValue() when a field is indexed but tokenStream() returns null. Then KeywordField only has to return null in tokenStream() to take advantage of this optimization. I hesitated doing the same with StringField but wondered if this could be breaking to some users who might pull a TokenStream themselves.

Indexing simple keywords through a `TokenStream` abstraction introduces a bit
of overhead due to attribute management. Not much, but indexing keywords boils
down to adding to a hash map and appending to a postings list, which is quite
cheap too so even some low overhead can significantly impact indexing speed.

The way that this change works is by making `IndexingChain` check
`binaryValue()` when a field is indexed but `tokenStream()` returns `null`.
Then `KeywordField` only has to return `null` in `tokenStream()` to take
advantage of this optimization. I hesitated doing the same with `StringField`
but wondered if this could be breaking to some users who might pull a
`TokenStream` themselves.
@jpountz
Copy link
Contributor Author

jpountz commented Feb 9, 2023

I ran this made-up benchmark to try to assess the benefits of the change. It's not representative of a real-world scenario since it disables merging (to reduce noise), but it still indexes a combination of terms plus doc values and includes flush times so it includes more than just keyword indexing.

	public static void main(String[] args) throws IOException {
	  Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
	  for (int iter = 0; iter < 100; ++iter) {
	    IndexWriterConfig cfg = new IndexWriterConfig(null)
	        .setOpenMode(OpenMode.CREATE)
	        .setMergePolicy(NoMergePolicy.INSTANCE)
	        .setMaxBufferedDocs(200_000)
	        .setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
	    long start = System.nanoTime();
	    try (IndexWriter w = new IndexWriter(dir, cfg)) {
	      Document doc = new Document();
	      KeywordField field1 = new KeywordField("field1", new BytesRef(1), Field.Store.NO);
	      doc.add(field1);
	      KeywordField field2 = new KeywordField("field2", new BytesRef(1), Field.Store.NO);
	      doc.add(field2);
	      KeywordField field3 = new KeywordField("field3", new BytesRef(1), Field.Store.NO);
	      doc.add(field3);
	      for (int i = 0; i < 10_000_000; ++i) {
	        field1.binaryValue().bytes[0] = (byte) i;
	        field2.binaryValue().bytes[0] = (byte) (3 * i);
	        field3.binaryValue().bytes[0] = (byte) (5 * i);
	        w.addDocument(doc);
	      }
	    }
	    long end = System.nanoTime();
	    System.out.println((end - start) / 1_000_000 + " ns per doc");
	  }
	}

Before the change, indexing takes 5.3us per document. After the change it takes 4.3us.

@rmuir
Copy link
Member

rmuir commented Feb 9, 2023

I hesitated doing the same with StringField but wondered if this could be breaking to some users who might pull a TokenStream themselves.

Maybe we can somehow deprecate using a tokenstream there in 9.x, pulling a tokenstream is very expert and doesn't seem like StringField needs to support that?

@rmuir
Copy link
Member

rmuir commented Feb 9, 2023

high level, i dont think its a big problem, but we are adding some type-guessing, with a lot of runtime checks, versus the user somehow having some type safety via the .document package. Similar to what got fixed recently in stored fields (#12116).

Just worth thinking about, is there anyway this can be more type-safe to the user in the API.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 10, 2023

You make a good point about adding more type guessing. Ideally IndexableField would have its APIs designed around how IndexingChain consumes it.

I imagine that we could have an IndexableField#invertableValue() method or something like that that would return a similar object as StoredValue that could hold either a BytesRef or a TokenStream.

@rmuir
Copy link
Member

rmuir commented Feb 12, 2023

I imagine that we could have an IndexableField#invertableValue() method or something like that that would return a similar object as StoredValue that could hold either a BytesRef or a TokenStream.

This sounds better to me than type-guessing and null values. Then my complaint goes away.

@rmuir
Copy link
Member

rmuir commented Feb 12, 2023

also maybe the Reader/String -> Analyzer -> TokenStream path could be fixed to use this? That's also confusing and this PR makes it even more so.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 16, 2023

I removed type guessing by adding a new IndexableField#invertableType that can be either TERM or TOKEN_STREAM. The type guessing is now contained in Field.java. Initially I wanted to contain everything through something that would like more like a value type, like StoredValue but fields must be able to customize the way that they produce their token stream and I didn't like requiring IndexableField implementations to provide both an implementation for the IndexableField and for this abstraction that produces terms or token streams. I'm curious if you have thoughts on ways to make the API better @rmuir.

@rmuir
Copy link
Member

rmuir commented Feb 16, 2023

I'm lost, i see type guessing and an InvertableType class that does nothing. Maybe you forgot to 'git add' or something?

@jpountz
Copy link
Contributor Author

jpountz commented Feb 16, 2023

Yes! Sorry about that.

@rmuir
Copy link
Member

rmuir commented Feb 16, 2023

its better, i'm only sad about a naming issue:

  • InvertableType: OK
  • InvertableType.TERM: Terrible, it isn't a Term at all, its a BytesRef.
  • InvertableType.TOKEN_STREAM: OK

This is consistent with `StoredValue.Type.BINARY` and `IndexableField#binaryValue()`.
@jpountz
Copy link
Contributor Author

jpountz commented Feb 16, 2023

Fair point, I renamed TERM to BINARY, which is consistent with StoredValue and the fact that the API on IndexableField is called #binaryValue()?

@rmuir
Copy link
Member

rmuir commented Feb 16, 2023

yes, better thanks! The only thing good about the "Term" was that it did capture the singleton nature. I'd just suggest a small improvement to the javadocs for BINARY to mention that its "a single value" or similar?

We don't want someone to pass a large UTF-8 encoded document in this way :)

@jpountz jpountz added this to the 9.6.0 milestone Feb 21, 2023
@jpountz jpountz merged commit cce33b0 into apache:main Feb 21, 2023
@jpountz jpountz deleted the faster_binary_field_indexing branch February 21, 2023 13:00
jpountz added a commit that referenced this pull request Feb 21, 2023
Indexing simple keywords through a `TokenStream` abstraction introduces a bit
of overhead due to attribute management. Not much, but indexing keywords boils
down to adding to a hash map and appending to a postings list, which is quite
cheap too so even some low overhead can significantly impact indexing speed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants