New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9409: Truncation can also cause IndexOutOfBoundsException. #1593
base: master
Are you sure you want to change the base?
Conversation
This changes terms and points to check the length of the index/data files before creating slices in these files. A side-effect of this is that we can no longer verify checksums of the meta file before checking the length of other files, but this shouldn't be a problem. On the other hand it helps make sure that we would return a clear exception in case of truncation instead of a confusing OutOfBoundsException that isn't clear whether it's due to index corruption or a bug in Lucene.
} catch (Throwable t) { | ||
priorE = t; | ||
} finally { | ||
CodecUtil.checkFooter(metaIn, priorE); | ||
} | ||
} | ||
// At this point, checksums of the meta file have been validated so we |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm are we losing this safety?
Oh, actually, maybe not, because in the finally
clause above, where we check meta's footer, if the checksum is bad we will throw an exception, adding it as suppressed exception if the indexLength
or dataLength
was wrong. So I think we do not lose any safety with this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't lose safety, but in case of a corrupt meta file, it might be slightly more confusing in the sense that the suppressed exception will complain about a truncated index/data file
Lucene86PointsFormat.META_EXTENSION); | ||
metaOut = writeState.directory.createOutput(metaFileName, writeState.context); | ||
CodecUtil.writeIndexHeader(metaOut, | ||
tempMetaOut = writeState.directory.createTempOutput( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we switching to a temp file and copying to the real file after closing? Maybe add a comment explaining?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because we need to write file lengths of the index/data files before any offsets/lengths of slices into these files. But since these index/data files have not been written yet, we don't know the length yet. So I wrote into a temp file, and only then write the final metadata file that includes first the lengths of the index/data files and then metadata about the KD trees that includes offsets into these index/data files. I'll add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an alternative, I could buffer the metadata in memory like we do for terms. It will require changing some APIs to replace IndexOutput with DataOutputs but other than that it shouldn't be too hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks for the explanation @jpountz. I am OK with using temp files for this ...
Sorry, I'm against this change. The test is broken. It looks like we are willing to make bad tradeoffs in order to deliver CorruptIndexException and only CorruptIndexException if anything goes wrong. Fix the test instead!
This is seriously the wrong tradeoff: let's fix the test instead. If we unexpectedly hit EOF, EOFException is the correct exception. If an index is out of bounds, IndexOutOfBoundsException is the correct exception. |
I like the CorruptIndexException because it tells me that the problem is that the file got altered after being written, while I would otherwise wonder if there is a bug in Lucene. As an alternative, would it work better for you if we called |
I repurposed this PR to instead make the test expect out-of-bounds exceptions. Does it look better to you @rmuir @uschindler ? |
I am fine to fix the test. Sure you have to first figure out why the index is out of bounds, and the exact exception may be misleading, but that's actually what's happening here. If you want other exceptions, another fix would be to enforce the IO layer to have a meaningful exception and implement it for all directory implementations. |
Expect
IndexOutOfBoundsException
when opening indices with truncated files.