New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow FST builder to use different writer (#12543) #12624
Conversation
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java
Outdated
Show resolved
Hide resolved
882f5a5
to
90534ee
Compare
e93ff38
to
9ef101f
Compare
bc14a46
to
a619079
Compare
714b2db
to
f807a32
Compare
@mikemccand I put out another revision. Basically the idea is to write everything to a DataOutput (BytesStore is also a DataOutput). To support write-then-read-immediately use case that we are able to do today, if the DataOutput implements the FSTReader then we will use it to read (which only BytesStore has currently). Otherwise we are following a general use case, which is to use the FST public constructor to read from previously compiled one. Let me know what you think. I think a benefit is that it's also easy to drop support for BytesStore or write-then-read-immediately use case if needed. |
63bb2fd
to
5b05fe6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dungba88 -- I left a bunch of head scratching comments :) I feel like this is moving in the wrong-ish direction (adding yet more IO abstractions to FST) when we should be able to eliminate most of FST's bizarre IO classes?
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/fst/Freezeable.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java
Outdated
Show resolved
Hide resolved
final FSTReader fstReader; | ||
|
||
// buffer to store the scratch bytes before writing to the fstWriter | ||
final BytesStore bytes = new BytesStore(DEFAULT_SCRATCH_PAGE_BITS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to scratchBytes
, and maybe say in the comment // buffer to store bytes for the one node we are currently writing
?
Also, instead of BytesStore
, could we use Lucene's existing oal.store
ByteBuffersDataOutput.newResettableInstance()
? Let's try to get away from FST implementing so many of its own storage classes/interfaces...
lucene/core/src/java/org/apache/lucene/util/fst/FSTDataOutputWriter.java
Outdated
Show resolved
Hide resolved
@@ -337,11 +349,23 @@ public long size() { | |||
return getPosition(); | |||
} | |||
|
|||
/** Similar to {@link #truncate(long)} with newLen=0 but keep the first block to reduce GC. */ | |||
public void reset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can switch to Lucene's existing ByteBuffersDataOutput
for our scratch single-node bytes then we can remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The public API of ByteBuffersDataOutput
class seems to only allow appending only, but for scratch area we still need to some backward writing right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we turn it into a byte[]
(.toArrayCopy()
) and do our own reversal? These byte[]
are small, just a single node right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do you think this change would be simpler if we go and do #12355 first? I suppose that won't make FST building any simpler since we'd likely write the whole thing in reverse (like we do now), but then make 2nd pass to re-reverse the bytes so everything becomes forwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it seems the main complexity is due to the fact that we need to override the previously written bytes, sometimes when we want to remove the labels, sometimes when we need to update the header after writing everything. So I guess making it forward would not eliminate those backward writing operation.
I simplified the writing operation a bit in the direct addressing mode. Seems like we can convert it to append-only mode. But the binary search one seems to be difficult, as when we fix the arc we need to do in backward (I assume to avoid overriding?). Anyhow, the remaining backward operations are:
- writeBytes: As mentioned, this is used in the binary search mode. If we first copy the whole buffer to a new one, we can start appending back to back. But this isn't as efficient as the current in-place copy? The whole method seems to be expanding each arc from variable arc length to fixed arc lengths. Anyhow there's only a single place this method is used, and the method is pretty much 1-liner, so I embed it to FSTCompiler.
- reverse: This is only done at the end before writing to the DataOutput and the NodeHash.ByteBlockPool
@@ -26,7 +26,8 @@ | |||
// TODO: merge with PagedBytes, except PagedBytes doesn't | |||
// let you read while writing which FST needs | |||
|
|||
class BytesStore extends DataOutput implements FSTReader { | |||
// TODO: Separate the scratch writer and reader functionality | |||
class BytesStore extends DataOutput implements FSTReader, Freezeable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we remove this class entirely? And callers that want to write FST and immediately use it in RAM should just use ByteBuffersDataOutput
for their scratch area?
Do we really need to publish a Freezable
interface? Can't it just be a private boolean frozen
in this class? Is anything needing freezing besides this BytesStore
?
And then FSTCompiler
should only get a DataOutput
? Hmm this may require implementing a ByteBuffersRandomAccessInput
or so, so the reverse (positional) byte reads work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we remove this class entirely? And callers that want to write FST and immediately use it in RAM should just use ByteBuffersDataOutput for their scratch area?
This BytesStore class currently serves as 3 purposes:
- Acts as a scratch area writer. Their operations (copying random bytes from one place to other, truncating, skipping the bytes) seems to be complicated to model with other DataOutput
- Acts as a FST writer. Any other DataOutput can be used.
- Acts as a FST reader. This is tricky as DataOutput does not support read operation, and a separate DataInput needs to be provided.
To replace the third one, I think we should keep both the second and third as a single class, because they have to match (a ByteBufferDataOutput needs to go with the ByteBufferRandomAccessInput). However the third purpose is only useful for write-then-read-immediately, which I assume is not the most common use case. Here we can either use the existing BytesStore, or create something based on ByteBufferDataOutput as you suggested. Both are easily replaceable in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to publish a Freezable interface
Now thinking again, I think even the ByteBuffersDataOutput would need freezing as well, otherwise its last block will always be a full block, and might be a waste of RAM. Freezable provides a chance to optimize those datastructure as they are not to be modified again.
Anyhow I removed the interface and the method for now. It's easy to add back.
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java
Outdated
Show resolved
Hide resolved
a1f2ee2
to
cea2933
Compare
Thanks @dungba88 -- I will catch up with the latest iterations soon. I tested just how much slower the 9.x:
This PR:
More than two orders-of-magnitude (base 10) slower! +1 to |
I wonder: are there other places in Lucene that might fall prey to this performance trap (calling |
I re-ran the Test2BFST with the new change, it looks much better 2.2B nodes
3GB
|
I assume this is before the last iteration that does the freeze, is that right? What do you think about the last results? |
Sorry yes -- I'll try to come back to this soon, and re-test. Thanks @dungba88 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting close! Thank you for persisting @dungba88 and sorry for the slow turnaround here.
* | ||
* @param metaOut the DataOutput to write the metadata to | ||
* @param out the DataOutput to write the FST bytes to | ||
*/ | ||
public void save(DataOutput metaOut, DataOutput out) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One need not call this if they passed their own IndexOutput
when creating the compiler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. It's also not supported (throwing exception). I'll add a comment.
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java
Outdated
Show resolved
Hide resolved
private long numBytesWritten; | ||
|
||
/** | ||
* Get an on-heap DataOutput that allows the FST to be read immediately after writing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add , and also optionally saved to an external DataOutput
?
I.e. with these changes we support these possible FST workflows:
- Build FST and use it immediately entirely in RAM and then discard it
- Build FST and use it immediately entirely in RAM and also save it to disk (any other
DataOutput
), and load it later and use it - Build FST but stream it immediately to disk (except the
FSTMetaData
, saved at the end) and you cannot use it when done unless you go and open your ownDataInput
on the backing store and make a new FST from that
Could we include this enumeration in `FSTCompiler''s class javadocs?
Also: could you update Test2BFSTs
to also test the 3rd bullet above? Right now it only tests the first 2 bullets. Actually, maybe create a new test, Test2BFSTsToDisk
or so? That way we can limit heap size of that new test and confirm RAM is truly capped. That last bullet fully caps the RAM usage during compilation, yay!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new Test2BFSTOffHeap and running it:
100000: 424 RAM bytes used; 39257811 FST bytes; 19189176 nodes; took 23 seconds
200000: 424 RAM bytes used; 78522623 FST bytes; 38378071 nodes; took 49 seconds
300000: 424 RAM bytes used; 117788163 FST bytes; 57567190 nodes; took 80 seconds
400000: 424 RAM bytes used; 157053095 FST bytes; 76756389 nodes; took 107 seconds
500000: 424 RAM bytes used; 196318494 FST bytes; 95945639 nodes; took 133 seconds
600000: 424 RAM bytes used; 235583412 FST bytes; 115134691 nodes; took 170 seconds
700000: 480 RAM bytes used; 274866378 FST bytes; 134324199 nodes; took 198 seconds
800000: 480 RAM bytes used; 314246540 FST bytes; 153513668 nodes; took 222 seconds
900000: 480 RAM bytes used; 353626848 FST bytes; 172703151 nodes; took 245 seconds
1000000: 480 RAM bytes used; 393006717 FST bytes; 191892620 nodes; took 277 seconds
1100000: 480 RAM bytes used; 432387052 FST bytes; 211082115 nodes; took 311 seconds
1200000: 480 RAM bytes used; 471766692 FST bytes; 230271461 nodes; took 334 seconds
1300000: 480 RAM bytes used; 511147081 FST bytes; 249461034 nodes; took 357 seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay, very exciting we can now build massive FSTs with "trivial" amounts of heap!! Thank you Tantivy FST implementation :)
* How many bits wide to make each byte[] block in the BytesStore; if you know the FST will be | ||
* large then make this larger. For example 15 bits = 32768 byte pages. | ||
* Set the {@link DataOutput} which is used for low-level writing of FST. If you want the FST to | ||
* be immediately readable, you need to use a DataOutput that also implements {@link FSTReader}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does FSTReader
need to be public and known to users? Could we make it package private and change this to say "you need to use FSTCompiler#getOnHeapReaderWriter
"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think when we change the compile() to only return metadata, users need to create the FST with the on-heap FSTReader, and thus it needs to be public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java
Show resolved
Hide resolved
public void freeze() { | ||
// these operations are costly, so we want to compute it once and cache | ||
List<ByteBuffer> byteBuffers = dataOutput.toWriteableBufferList(); | ||
if (byteBuffers.size() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm can we just call .toDataInput()
? I think it's already doing this "is it a single buffer" opto? Or was there even poor performance in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it didn't handle the single buffer use case, and ByteBuffersDataInput would fall into the same performance regression problem as BytesStore with multiple blocks (which due to the fact that the method is size is larger and won't be inlined by JVM).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it has anything to do with method sizes. I think it's got more to do with ByteBuffersDataInput wrapping input buffers with asReadOnlyBuffer:
public ByteBuffersDataInput(List<ByteBuffer> buffers) {
ensureAssumptions(buffers);
this.blocks = buffers.toArray(ByteBuffer[]::new);
for (int i = 0; i < blocks.length; ++i) {
blocks[i] = blocks[i].asReadOnlyBuffer().order(ByteOrder.LITTLE_ENDIAN);
}
It also eagerly preallocates arrays for float and long buffers - something that was added later (I wasn't aware), which may also contribute to performance if you call this method millions of times. But this is only my gut feeling, it'd have to be verified with a profile run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the missing context, I was referring to a regression I found at #10520 instead.
Maybe it was different for ByteBuffersDataInput, or there would be no regression if the block size is just 1. I'll think we need to run some test to verify. Let me know if there's a process for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even for ByteBuffersDataInput in multi-block cases, there's a slight regression over the BytesStore implementation (note that we already called freeze()
). The diff is https://github.com/dungba88/lucene/pull/13/files.
ReverseRandomBytesReader with ByteBuffersDataInput
1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585]
1> 0...: took 0 seconds
1> 1000000...: took 27 seconds
1> 2000000...: took 54 seconds
1> 3000000...: took 82 seconds
1> 4000000...: took 109 seconds
1> 5000000...: took 137 seconds
1> 6000000...: took 165 seconds
1> 7000000...: took 192 seconds
1> 8000000...: took 219 seconds
1> 9000000...: took 247 seconds
1> 10000000...: took 275 seconds
1> 11000000...: took 300 seconds
BytesStore-like BytesReader
1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585]
1> 0...: took 0 seconds
1> 1000000...: took 22 seconds
1> 2000000...: took 44 seconds
1> 3000000...: took 66 seconds
1> 4000000...: took 89 seconds
1> 5000000...: took 111 seconds
1> 6000000...: took 133 seconds
1> 7000000...: took 155 seconds
1> 8000000...: took 178 seconds
1> 9000000...: took 200 seconds
1> 10000000...: took 222 seconds
1> 11000000...: took 245 seconds
As this PR is already rather long, I can make the first implementation simple by solely using ByteBuffersDataInput, then open an issue to benchmark different alternatives thoroughly. WDYT?
For this PR, I'll use the more simplified version by solely using ByteBuffersDataInput. I opened another one for the alternative way: #12879. I did see improvement as shown in the log above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dweiss . I think that makes sense. I left this PR to use the simple implementation (i.e just delegate everything to ByteBuffersDataInput). I'll mark the other one as draft, as I still want to play around with the benchmark just for curiosity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did see improvement as shown in the log above.
I compared 9.7.x and main + this PR (an earlier rev than the current one, the rev where we re-added .freeze()
), running Test2BFSTs
. This PR is a bit faster at building (643 seconds vs 677 for the NO_OUTPUTS
case), but slower at get
(89 seconds vs 60 seconds), but there were other changes since 9.7.x so this is not a pure test. Yet perhaps this asReadOnly
wrapping (vs FST's dedicated byte[]
-backed BytesStore
) is indeed part of that slowdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to sticking to the "clean but slower" impl for this PR, and seeing if we can / it's worth clawing back the performance cost/overhead of ByteBuffer
s in a separate follow-on PR. Maybe @uschindler has some ideas on recovering performance here too ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please stick with ByteBuffer. I have a similar suggestion with the stupid groupVInts. Don't splatter our code with highly specialized implementations everywhere. Create a default ByteBuffer impl. There's no overhead as the JVM has optimized ByteBuffer access to make those invisible.
What you see in benchmarks like this is the overhead of C2 jumping in. To make a "correct" benchmark use JMH. In addition: Don't trust profilers, they are useless in those areas as they measure bullshit.
So:
- create a JMH benchmark
- use ByteBuffer only and use ByteBuffer.wrap() to temporarily create wrappers around other implementations like byte[] or MemorySegment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P.S.: I have not looked into this PR, this is just my general suggestion regarding ByteBuffer and escape analysis, see also #12841.
lucene/core/src/test/org/apache/lucene/util/fst/TestFSTDataOutputWriter.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
protected FSTCompiler.Builder<T> getFSTBuilder() { | ||
return new FSTCompiler.Builder<>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we randomize whether the "on disk" vs "on heap" DataOutput
is used, so that TestFSTs
randomly picks? We should randomize all three posibilities:
- Create & use on-heap FST
- Create on heap, save to disk, load FST from disk
- Stream to disk, then load FST from disk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this test along with the 2B nodes off-heap
lucene/core/src/test/org/apache/lucene/util/fst/Test2BFSTOffHeap.java
Outdated
Show resolved
Hide resolved
Since we are struggling to best measure FST performance impact of these changes, I opened a spinoff issue to create a dedicated FST microbenchmark. Having such a compass that's quick to check would really help us make more informed decisions on complex "code complexity vs performance" types of FST changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super close! I left minor comments.
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java
Outdated
Show resolved
Hide resolved
lucene/test-framework/src/java/org/apache/lucene/tests/util/fst/FSTTester.java
Outdated
Show resolved
Hide resolved
lucene/test-framework/src/java/org/apache/lucene/tests/util/fst/FSTTester.java
Show resolved
Hide resolved
Thanks everyone! I addressed comments, putting a simpler implementation. +1 to the FST micro benchmarking |
@@ -419,7 +418,8 @@ public FST(FSTMetadata<T> metadata, DataInput in, FSTStore fstStore) throws IOEx | |||
|
|||
/** Create the FST with a metadata object and a FSTReader. */ | |||
FST(FSTMetadata<T> metadata, FSTReader fstReader) { | |||
this.metadata = metadata; | |||
assert fstReader != null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm also this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fstReader is not directly passed by users. If the users call the constructor with FSTStore, that cannot be null and it will throw an NPE due to the call fstStore.init()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, super!
Thanks for persisting @dungba88 -- this was a crazy long and tricky exercise. I'm so excited Lucene can finally build arbitrarily large FSTs with bounded heap usage. I'll merge this soon and backport, then let's watch builds for any exciting FST related failures... |
* Move size() to FSTStore * Remove size() completely * Allow FST builder to use different DataOutput * access BytesStore byte[] directly for copying * Rename BytesStore * Change class to final * Reorder methods * Remove unused methods * Rename truncate to setPosition() and remove skipBytes() * Simplify the writing operations * Update comment * remove unused parameter * Simplify BytesStore operation * tidy code * Rename copyBytes to writeTo * Simplify BytesStore operations * Embed writeBytes() to FSTCompiler * Fix the write bytes method * Remove the default block bits constant * add assertion * Rename method parameter names * Move reverse to FSTCompiler * Revert setPosition call * Address comments * Return immediately when writing 0 bytes * Add comment & * Rename variables * Fix the compile error * Remove isReadable() * Remove isReadable() * Optimize ReadWriteDataOutput * tidy code * Freeze the DataOutput once finished() * Refactor * freeze the DataOutput before use * Address comments and add off-heap FST tests * Remove the hardcoded random * Ignore the Test2BFSTOffHeap test * Simplify ReadWriteDataOutput * Update Javadoc * Address comments
Description
Split the scratch byte operations inside
BytesStore
out of FSTWriter and allow the DataOutput to be configurable fromFSTCompiler.Builder
. Those operations are for building the frontier node. Only the final bytes will be written to the DataOutput. The Builder will still maintainbytesPageBits(int)
for backward-compatibility, in which it will use aByteBuffersDataOutput
.Also extract the saving of metadata as a standalone method, as there can be use case where some already saved the FST to a DataOutput when building and only need to save the metadata here.
Issue: #12543
The FST produced by this DataOutput-backed FST is not readable directly, and users need to construct a corresponding DataInput and pass it to the FST public constructor. In short the FSTCompiler only compile the FST but not read it.
The write-then-read-immediately use case is still supported with ByteBuffersDataOutput.