LUCENE-9409: Truncation can also cause IndexOutOfBoundsException. #1593

Open
wants to merge 8 commits into
base: master

jpountz
Contributor

@jpountz jpountz commented Jun 18, 2020

Expect IndexOutOfBoundsException when opening indices with truncated files.

This changes terms and points to check the length of the index/data
files before creating slices in these files. A side-effect of this is
that we can no longer verify checksums of the meta file before checking
the length of other files, but this shouldn't be a problem. On the other
hand it helps make sure that we would return a clear exception in case
of truncation instead of a confusing IndexOutOfBoundsException that
leaves it unclear whether the cause is index corruption or a bug in Lucene.
} catch (Throwable t) {
priorE = t;
} finally {
CodecUtil.checkFooter(metaIn, priorE);
}
}
// At this point, checksums of the meta file have been validated so we
Member

Hmm, are we losing this safety?

Oh, actually, maybe not, because in the finally clause above, where we check meta's footer, if the checksum is bad we will throw an exception, adding it as a suppressed exception if the indexLength or dataLength was wrong. So I think we do not lose any safety with this change.
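
The point of the comment above is that neither failure gets lost: a length mismatch caught in the try block and a bad checksum found in the finally block both end up in the thrown exception. A minimal plain-Java sketch of that pattern follows. `FooterCheckSketch`, `checkFooter(boolean, Throwable)` and `open` are hypothetical stand-ins, not Lucene's `CodecUtil` API, and the choice of which failure is primary and which is suppressed is an assumption of the sketch.

```java
import java.io.IOException;

class FooterCheckSketch {
  // Hypothetical stand-in for a footer check; NOT Lucene's CodecUtil.checkFooter.
  // Assumption: an earlier failure is rethrown, with the checksum failure
  // attached to it as a suppressed exception.
  static void checkFooter(boolean checksumOk, Throwable priorE) throws IOException {
    IOException checksumFailure =
        checksumOk ? null : new IOException("checksum failed (meta file)");
    if (priorE == null) {
      if (checksumFailure != null) {
        throw checksumFailure;
      }
      return;
    }
    if (checksumFailure != null) {
      priorE.addSuppressed(checksumFailure);
    }
    if (priorE instanceof IOException) throw (IOException) priorE;
    if (priorE instanceof RuntimeException) throw (RuntimeException) priorE;
    if (priorE instanceof Error) throw (Error) priorE;
    throw new IOException(priorE);
  }

  // Mirrors the try / catch / finally shape of the diff context above.
  static void open(boolean metaChecksumOk, long expectedLength, long actualLength)
      throws IOException {
    Throwable priorE = null;
    try {
      if (actualLength != expectedLength) {
        throw new IOException("truncated data file: expected "
            + expectedLength + " bytes, got " + actualLength);
      }
    } catch (Throwable t) {
      priorE = t;
    } finally {
      checkFooter(metaChecksumOk, priorE);
    }
  }

  public static void main(String[] args) {
    try {
      open(false, 100, 60); // corrupt meta file AND wrong data length
    } catch (IOException e) {
      System.out.println("thrown: " + e.getMessage());
      for (Throwable s : e.getSuppressed()) {
        System.out.println("suppressed: " + s.getMessage());
      }
    }
  }
}
```

With a corrupt meta file the recorded length may itself be garbage, which is why the reported truncation can be misleading in that case.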

Contributor Author

We don't lose safety, but in case of a corrupt meta file, it might be slightly more confusing in the sense that the suppressed exception will complain about a truncated index/data file.

Lucene86PointsFormat.META_EXTENSION);
metaOut = writeState.directory.createOutput(metaFileName, writeState.context);
CodecUtil.writeIndexHeader(metaOut,
tempMetaOut = writeState.directory.createTempOutput(
Member

Why are we switching to a temp file and copying to the real file after closing? Maybe add a comment explaining?

Contributor Author

This is because we need to write the file lengths of the index/data files before any offsets/lengths of slices into these files. But since these index/data files have not been written yet, we don't know their lengths yet. So I write the metadata into a temp file first, and only then write the final metadata file, which contains first the lengths of the index/data files and then the metadata about the KD trees, including offsets into these index/data files. I'll add a comment.
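
The ordering described above can be sketched in plain Java. This is an illustration under stated assumptions, not the code in this PR: `MetaLayoutSketch`, the file names `sketch.dat` / `sketch.meta`, and the use of a single data file instead of separate index and data files are all hypothetical, and `java.nio.file` stands in for Lucene's `Directory` API.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

class MetaLayoutSketch {
  // Writes the data file and, alongside it, per-tree metadata into a temp file.
  // Only once the data length is known is the final meta file written:
  // length first, then the buffered per-tree metadata.
  static void write(Path dir, byte[][] trees) throws IOException {
    Path data = dir.resolve("sketch.dat");
    Path meta = dir.resolve("sketch.meta");
    Path tempMeta = Files.createTempFile(dir, "sketch", ".tmp");
    try (DataOutputStream dataOut = new DataOutputStream(Files.newOutputStream(data));
         DataOutputStream tempOut = new DataOutputStream(Files.newOutputStream(tempMeta))) {
      for (byte[] tree : trees) {
        tempOut.writeLong(dataOut.size()); // offset of this slice in the data file
        tempOut.writeLong(tree.length);    // length of this slice
        dataOut.write(tree, 0, tree.length);
      }
    }
    try (DataOutputStream metaOut = new DataOutputStream(Files.newOutputStream(meta))) {
      metaOut.writeLong(Files.size(data));         // data file length comes first
      metaOut.write(Files.readAllBytes(tempMeta)); // then offsets/lengths of slices
    } finally {
      Files.delete(tempMeta);
    }
  }

  // Checks the recorded length before trusting any slice offsets.
  // Returns the number of slices described by the meta file.
  static int open(Path dir) throws IOException {
    long actual = Files.size(dir.resolve("sketch.dat"));
    ByteBuffer meta = ByteBuffer.wrap(Files.readAllBytes(dir.resolve("sketch.meta")));
    long expected = meta.getLong();
    if (actual != expected) {
      throw new IOException("data file length is " + actual + ", expected " + expected);
    }
    int slices = 0;
    while (meta.hasRemaining()) {
      long offset = meta.getLong();
      long length = meta.getLong();
      if (offset + length > actual) {
        throw new IOException("slice out of bounds: " + offset + "+" + length);
      }
      slices++;
    }
    return slices;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("meta-layout-sketch");
    write(dir, new byte[][] {{1, 2, 3}, {4, 5}});
    System.out.println("slices: " + open(dir));
    Files.write(dir.resolve("sketch.dat"), new byte[] {1, 2}); // simulate truncation
    try {
      open(dir);
    } catch (IOException e) {
      System.out.println("detected: " + e.getMessage());
    }
  }
}
```

Because the length sits at the start of the meta file, the reader can reject a truncated data file before it computes any slice.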

Contributor Author

As an alternative, I could buffer the metadata in memory like we do for terms. It will require changing some APIs to replace IndexOutput with DataOutputs but other than that it shouldn't be too hard.

Member

OK thanks for the explanation @jpountz. I am OK with using temp files for this ...

@rmuir
Member

rmuir commented Jun 21, 2020

Sorry, I'm against this change. The test is broken. It looks like we are willing to make bad tradeoffs in order to deliver CorruptIndexException and only CorruptIndexException if anything goes wrong. Fix the test instead!

> A side-effect of this is that we can no longer verify checksums of the meta file before checking the length of other files

This is seriously the wrong tradeoff: let's fix the test instead. If we unexpectedly hit EOF, EOFException is the correct exception. If an index is out of bounds, IndexOutOfBoundsException is the correct exception.

@jpountz
Contributor Author

jpountz commented Jun 22, 2020

I like the CorruptIndexException because it tells me that the problem is that the file got altered after being written, while I would otherwise wonder if there is a bug in Lucene. As an alternative, would it work better for you if we called retrieveChecksum(IndexInput) before the try block, and then again with the length (retrieveChecksum(IndexInput, long)) after the try block once the checksum of the meta file has been validated?

@jpountz
Contributor Author

jpountz commented Aug 11, 2020

I repurposed this PR to instead make the test expect out-of-bounds exceptions. Does it look better to you @rmuir @uschindler ?

@uschindler
Contributor

I am fine with fixing the test. Sure, you first have to figure out why the index is out of bounds, and the exact exception may be misleading, but that's actually what's happening here. If you want other exceptions, another fix would be to make the IO layer throw a meaningful exception and implement that for all directory implementations.
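
To illustrate why the exact exception depends on the IO layer, here is a small plain-Java sketch; it does not use Lucene's `Directory` implementations. Reading past the end of the same truncated bytes fails differently through a stream and through an absolute `ByteBuffer` access. `TruncationExceptionsSketch` and `isExpectedForTruncation` are hypothetical names, and the set of accepted exception types is an assumption for illustration, not the list used by the actual test.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

class TruncationExceptionsSketch {
  // Stream-style access: running out of bytes surfaces as EOFException.
  static long readViaStream(byte[] file) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
    return in.readLong();
  }

  // Random-access style: an absolute read beyond the limit surfaces as
  // IndexOutOfBoundsException.
  static long readViaBuffer(byte[] file, int offset) {
    return ByteBuffer.wrap(file).getLong(offset);
  }

  // Hypothetical test helper: both exception types are accepted as
  // legitimate symptoms of truncation.
  static boolean isExpectedForTruncation(Throwable t) {
    return t instanceof EOFException || t instanceof IndexOutOfBoundsException;
  }

  public static void main(String[] args) {
    byte[] truncated = new byte[4]; // a long needs 8 bytes
    try {
      readViaStream(truncated);
    } catch (Exception e) {
      System.out.println(e.getClass().getSimpleName()
          + " expected=" + isExpectedForTruncation(e));
    }
    try {
      readViaBuffer(truncated, 0);
    } catch (Exception e) {
      System.out.println(e.getClass().getSimpleName()
          + " expected=" + isExpectedForTruncation(e));
    }
  }
}
```

A test that accepts both types stays valid whichever access path a given implementation uses. The cost noted in the thread remains: the exception alone does not say whether the cause is corruption or a bug.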

@jpountz jpountz changed the title from "LUCENE-9409: Check file lengths before creating slices." to "LUCENE-9409: Truncation can also cause IndexOutOfBoundsException." on Nov 16, 2020
4 participants