Checksum functionality is missing #3011
Comments
VictoriaMetrics relies on persistent storage for data integrity and durability. That's why it is recommended to store VictoriaMetrics data on Google Compute persistent disks, which provide reasonable integrity and durability guarantees. It would be possible to add checksums to the compressed zstd data VictoriaMetrics stores on disk. If the data stored on disk is corrupted, there is a high chance the incorrect data will be noticed during the decompression, decoding and sanity-checking steps. In that case VictoriaMetrics logs an error message about the discovered corruption and exits (aka crashes), since it cannot auto-heal arbitrary data corruption on disk. There is some chance that a small on-disk corruption will remain unnoticed during decompression, decoding and sanity checking; checksums would reduce that chance, but this looks more like a theoretical case than a practical one. Data can be corrupted not only on persistent storage, but basically at any point in the hardware: CPU, RAM, the network card, the network media, the data transfer path from persistent storage to RAM, etc. I'm not sure persistent storage has a higher chance of data corruption than other hardware. So why should we pay special attention to data corruption on persistent disks only?
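For illustration only, the zstd format already supports an optional per-frame content checksum. The following Go sketch shows the general mechanism using the klauspost/compress library; it is not a description of how VictoriaMetrics actually configures its compressor:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/compress/zstd"
)

func main() {
	var buf bytes.Buffer

	// WithEncoderCRC(true) appends a content checksum to every zstd frame,
	// so the decoder can detect corrupted payloads on read.
	enc, err := zstd.NewWriter(&buf, zstd.WithEncoderCRC(true))
	if err != nil {
		log.Fatal(err)
	}
	if _, err := enc.Write([]byte("time series block payload")); err != nil {
		log.Fatal(err)
	}
	if err := enc.Close(); err != nil {
		log.Fatal(err)
	}

	// Simulate a single-bit flip somewhere in the stored frame.
	data := buf.Bytes()
	data[len(data)/2] ^= 0x01

	// The decoder notices the corruption instead of silently returning bad data.
	dec, err := zstd.NewReader(nil)
	if err != nil {
		log.Fatal(err)
	}
	defer dec.Close()
	if _, err := dec.DecodeAll(data, nil); err != nil {
		fmt.Println("corruption detected:", err)
	}
}
```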
I agree that data corruption can be caused by any piece of hardware, and hardware error detection or correction is not enough; that is why a software checksum matters. Some low-level software checksums already exist, such as the TCP checksum, which narrows the problem down a little. In my experience, the disk error rate is much higher than that of other components. From a time point of view, data may sit on a disk for months or even years, and the longer it sits, the higher the probability of corruption.
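As a minimal sketch of what a software-level checksum adds on top of hardware and transport checks, the following Go example (illustrative only, not VictoriaMetrics code) stores a CRC32-Castagnoli next to a data block on write and verifies it on read:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32-C table; it is hardware-accelerated on most CPUs.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// seal prepends a 4-byte CRC32-C to the block before it is written to disk.
func seal(block []byte) []byte {
	out := make([]byte, 4+len(block))
	binary.LittleEndian.PutUint32(out, crc32.Checksum(block, castagnoli))
	copy(out[4:], block)
	return out
}

// open verifies the stored CRC32-C after the block is read back.
func open(stored []byte) ([]byte, error) {
	if len(stored) < 4 {
		return nil, errors.New("block too short")
	}
	want := binary.LittleEndian.Uint32(stored)
	block := stored[4:]
	if crc32.Checksum(block, castagnoli) != want {
		return nil, errors.New("checksum mismatch: on-disk block is corrupted")
	}
	return block, nil
}

func main() {
	stored := seal([]byte("samples for metric x"))
	stored[10] ^= 0xFF // simulate silent on-disk corruption
	if _, err := open(stored); err != nil {
		fmt.Println(err)
	}
}
```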
An important note regarding this feature request: VictoriaMetrics can already detect some corrupted data during decompression, decoding and sanity checking of the data read from disk, but it cannot fix it. If we add checksums to the data stored on disk, VictoriaMetrics will be able to detect more cases of on-disk corruption, but it still won't be able to fix the corrupted data. The only thing it can do is report the corruption in the log and terminate.
…ed by blockHeader.{Min,Max}Timestamp when unpacking the block. This should reduce the chances of unnoticed on-disk data corruption. Updates #2998. Updates #3011. This change modifies the format of the data exported via /api/v1/export/native: the data now contains the MaxTimestamp and PrecisionBits fields from blockHeader. This is OK, since the native export format is undocumented.
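The referenced change validates that every timestamp in an unpacked block falls inside the range declared by the block header. Below is a hedged sketch of that kind of sanity check; the struct and function names are illustrative, not the actual VictoriaMetrics internals:

```go
package main

import "fmt"

// blockHeader mirrors the idea of a header carrying the declared timestamp
// range for a block; the real struct lives in the VictoriaMetrics internals.
type blockHeader struct {
	MinTimestamp int64 // milliseconds
	MaxTimestamp int64 // milliseconds
}

// validateTimestamps returns an error if any unpacked timestamp falls outside
// the [MinTimestamp, MaxTimestamp] range declared by the block header.
// Such a mismatch indicates on-disk data corruption.
func validateTimestamps(bh blockHeader, timestamps []int64) error {
	for _, ts := range timestamps {
		if ts < bh.MinTimestamp || ts > bh.MaxTimestamp {
			return fmt.Errorf("timestamp %d is outside the range [%d..%d] declared by the block header; the block is likely corrupted",
				ts, bh.MinTimestamp, bh.MaxTimestamp)
		}
	}
	return nil
}

func main() {
	bh := blockHeader{MinTimestamp: 1_660_000_000_000, MaxTimestamp: 1_660_000_600_000}
	corrupted := []int64{1_660_000_000_000, 9_999_999_999_999}
	if err := validateTimestamps(bh, corrupted); err != nil {
		fmt.Println(err)
	}
}
```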
FYI, starting from v1.82.0 VictoriaMetrics validates the correctness of timestamps stored on disk while performing background merges. This should help detect corruption of on-disk timestamps. When VictoriaMetrics detects corrupted timestamps, it logs an error message and then exits.
I think this enhancement request is solved according to this comment. Closing now. Feel free to reopen it if you encounter a new issue regarding the same topic.
@jiekun I don't think this issue is resolved by validating timestamps alone, since VictoriaMetrics stores more than just timestamps on disk. I am not currently using VictoriaMetrics in a production system, and this isn't a problem I am actually encountering; it's more a report of a potential risk, but one that could indeed happen.
Thank you for pointing that out, really appreciated! <3 I believe valyala has considered more complex checksum schemes and decided to go with the current approach, based on the K.I.S.S. principle. @valyala Please reopen it if you think we need to go further on this.
I like the K.I.S.S. principle, but in my opinion it does not completely address the actual issue here.
Describe the bug
VictoriaMetrics may return corrupted data to the user because it does not persist any checksum/CRC information and does not use the checksum feature of the zstd compression format.
To Reproduce
Data corruption may happen in memory, and silent data corruption on a disk is not that rare.
Expected behavior
IMHO, data integrity is very important; corrupted data is worse than no data.
Maybe we can use an attached storage service with a checksum feature, such as a Kubernetes PVC, but I think this function is fundamental to VictoriaMetrics and should be implemented inside VictoriaMetrics itself.