-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve recovery / snapshot restoring file identity handling #7351
Conversation
The diffing logic here etc looks great to me. |
if (file.metadata.hash() != null && file.metadata().hash().length > 0) { | ||
BytesRef hash = file.metadata.hash(); | ||
// TODO - not sure if this is the right way to do this - can't we pass the bytes ref directly? | ||
builder.field(Fields.META_HASH, Base64.encodeBytes(hash.bytes, hash.offset, hash.length)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use field(String name, byte[] value, int offset, int length) or field(String name, BytesReference value) here and then use XContentParser#binaryValue to read this value back?
I left a couple of minor comments. Otherwise, looks good to me. |
@imotov I pushed a new commit including a test for the |
5d48020
to
60b280c
Compare
LGTM |
I think we have a small regression here for snapshot and restore since we don't have the hash for the segments in the already existing snapshot. I think we can read the hashes for those where we calculated them from the snapshot on the fly if necessary. I will open a followup for this as I already discussed this with @imotov |
This commit changes the way how files are selected for retransmission on recovery / restore. Today this happens on a per-file basis where the rather weak checksum and the file length in bytes is compared to check if a file is identical. This is prone to fail in the case of a checksum collision which can happen under certain circumstances. The changes in this commit move the identity comparsion to a per-commit / per-segment level where files are only treated as identical iff all the other files in the commit / segment are the same. This "all or nothing" strategy is reducing the chance for a collision dramatically since we also use a strong hash to identify commits / segments based on the content of the ".si" / "segments.N" file. Closes elastic#7351
This commit changes the way how files are selected for retransmission on recovery / restore. Today this happens on a per-file basis where the rather weak checksum and the file length in bytes is compared to check if a file is identical. This is prone to fail in the case of a checksum collision which can happen under certain circumstances. The changes in this commit move the identity comparsion to a per-commit / per-segment level where files are only treated as identical iff all the other files in the commit / segment are the same. This "all or nothing" strategy is reducing the chance for a collision dramatically since we also use a strong hash to identify commits / segments based on the content of the ".si" / "segments.N" file. Closes #7351
Due to additional safety added in elastic#7351 we compute now a strong hash for .si and segments_N files which are compared during snapshot / restore. Old snapshots don't have this hash which can cause unnecessary copying of large amount of data. This commit adds the ability to fetch this hash from the blob store if needed. Closes elastic#7434
Due to additional safety added in #7351 we compute now a strong hash for .si and segments_N files which are compared during snapshot / restore. Old snapshots don't have this hash which can cause unnecessary copying of large amount of data. This commit adds the ability to fetch this hash from the blob store if needed. Closes #7434
This commit changes the way how files are selected for retransmission on recovery / restore. Today this happens on a per-file basis where the rather weak checksum and the file length in bytes is compared to check if a file is identical. This is prone to fail in the case of a checksum collision which can happen under certain circumstances. The changes in this commit move the identity comparsion to a per-commit / per-segment level where files are only treated as identical iff all the other files in the commit / segment are the same. This "all or nothing" strategy is reducing the chance for a collision dramatically since we also use a strong hash to identify commits / segments based on the content of the ".si" / "segments.N" file. Closes #7351
Due to additional safety added in #7351 we compute now a strong hash for .si and segments_N files which are compared during snapshot / restore. Old snapshots don't have this hash which can cause unnecessary copying of large amount of data. This commit adds the ability to fetch this hash from the blob store if needed. Closes #7434
This commit changes the way how files are selected for retransmission on recovery / restore. Today this happens on a per-file basis where the rather weak checksum and the file length in bytes is compared to check if a file is identical. This is prone to fail in the case of a checksum collision which can happen under certain circumstances. The changes in this commit move the identity comparsion to a per-commit / per-segment level where files are only treated as identical iff all the other files in the commit / segment are the same. This "all or nothing" strategy is reducing the chance for a collision dramatically since we also use a strong hash to identify commits / segments based on the content of the ".si" / "segments.N" file. Closes elastic#7351
Due to additional safety added in elastic#7351 we compute now a strong hash for .si and segments_N files which are compared during snapshot / restore. Old snapshots don't have this hash which can cause unnecessary copying of large amount of data. This commit adds the ability to fetch this hash from the blob store if needed. Closes elastic#7434
This commit changes the way how files are selected for retransmission on recovery / restore. Today this happens on a per-file basis where the rather weak checksum and the file length in bytes is compared to check if a file is identical. This is prone to fail in the case of a checksum collision which can happen under certain circumstances. The changes in this commit move the identity comparsion to a per-commit / per-segment level where files are only treated as identical iff all the other files in the commit / segment are the same. This "all or nothing" strategy is reducing the chance for a collision dramatically since we also use a strong hash to identify commits / segments based on the content of the ".si" / "segments.N" file. Closes elastic#7351
Due to additional safety added in elastic#7351 we compute now a strong hash for .si and segments_N files which are compared during snapshot / restore. Old snapshots don't have this hash which can cause unnecessary copying of large amount of data. This commit adds the ability to fetch this hash from the blob store if needed. Closes elastic#7434
This commit changes the way how files are selected for retransmission
on recovery / restore. Today this happens on a per-file basis where the
rather weak checksum and the file length in bytes is compared to check if
a file is identical. This is prone to fail in the case of a checksum collision
which can happen under certain circumstances.
The changes in this commit move the identity comparison to a per-commit / per-segment
level where files are only treated as identical iff all the other files in the
commit / segment are the same. This
all or nothing
strategy is reducing the chance fora collision dramatically since we also use a strong hash to identify commits / segments
based on the content of the
.si
/segments.N
file.