Checksum functionality is missing #3011

Closed
damnever opened this issue Aug 23, 2022 · 8 comments
Labels: enhancement (New feature or request), question (The question issue)

Comments

@damnever

Describe the bug

VictoriaMetrics may return corrupted data to the user since VictoriaMetrics does not persist any checksum/CRC information, and VictoriaMetrics does not use the checksum feature of the zstd compression method.

To Reproduce

Data corruption may happen in memory, and silent data corruption on disk is not that rare.

Expected behavior

IMHO, data integrity is very important: corrupted data is worse than no data.

Maybe we can rely on attached storage services that provide checksumming, such as a Kubernetes PVC, but I think this functionality is fundamental to VictoriaMetrics and should be implemented inside VictoriaMetrics itself.

@YuriKravetc YuriKravetc added the bug Something isn't working label Aug 23, 2022
@valyala valyala added question The question issue enhancement New feature or request and removed bug Something isn't working labels Aug 23, 2022
@valyala
Collaborator

valyala commented Aug 23, 2022

VictoriaMetrics relies on persistent storage for data integrity and durability. That's why it is recommended to store VictoriaMetrics data on Google Compute Engine persistent disks, which provide reasonable integrity and durability guarantees.

While it is possible to add checksums to the compressed zstd data stored by VictoriaMetrics on disk (see content_checksum at https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md ), I'm unsure this will solve any practical issues.

If the data stored on disk is corrupted, then there is a high chance that the incorrect data will be noticed during the decompression, decoding and sanity-checking steps. In that case VictoriaMetrics puts an error message about the discovered incorrect data into its logs and exits (aka crashes), since it cannot auto-heal arbitrary data corruption on disk. There is some chance that a small on-disk data corruption will remain unnoticed during decompression, decoding and sanity checking, and checksums would reduce that chance. But this looks like a more theoretical than practical case. According to data corruption theory, data can be corrupted not only on persistent storage but basically anywhere in the hardware: CPU, RAM, network card, network media, the data transfer path from persistent storage to RAM, etc. I'm unsure that persistent storage has a higher chance of data corruption compared to those other places. So it is unclear why we should pay special attention to data corruption on persistent disks only.

@damnever
Author

I agree, data corruption can be caused by any piece of hardware, and hardware error detection or correction is not enough; that is why software checksums matter.

Some low-level software checksums already exist, such as TCP's, which narrows the problem down a little. In my experience, disk errors are much more common than the others. From a time point of view, we may store data on a disk for months or even years, and time increases the probability of data corruption.
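The software-checksum approach being argued for here can be sketched with Go's standard `hash/crc32`: seal each block with a CRC-32C trailer before it is written, and verify the trailer when the block is read back. This is a hypothetical illustration of the general technique, not how VictoriaMetrics stores blocks; the function names are made up.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// CRC-32C (Castagnoli) is hardware-accelerated on most modern CPUs.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// sealBlock appends a little-endian CRC-32C over the payload, as a
// storage engine might do before writing a block to disk.
func sealBlock(payload []byte) []byte {
	sum := crc32.Checksum(payload, castagnoli)
	return binary.LittleEndian.AppendUint32(payload, sum)
}

// openBlock verifies and strips the trailing checksum.
func openBlock(block []byte) ([]byte, error) {
	if len(block) < 4 {
		return nil, errors.New("block too short")
	}
	payload, trailer := block[:len(block)-4], block[len(block)-4:]
	want := binary.LittleEndian.Uint32(trailer)
	if crc32.Checksum(payload, castagnoli) != want {
		return nil, errors.New("block checksum mismatch")
	}
	return payload, nil
}

func main() {
	block := sealBlock([]byte("samples..."))
	block[0] ^= 0x01 // a single bit flipped months later
	if _, err := openBlock(block); err != nil {
		fmt.Println("detected:", err)
	}
}
```

A per-block trailer like this catches bit rot regardless of which layer (disk, bus, RAM on the read path) introduced it, at the cost of one extra hash pass per read and write.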

@valyala
Collaborator

valyala commented Aug 26, 2022

Important note regarding this feature request:

VictoriaMetrics can already detect some corrupted data during decompression, decoding and sanity checking of the data read from disk, but it cannot fix the corrupted data. If we add checksums for the data stored on disk, then VictoriaMetrics will be able to detect more cases of corrupted on-disk data, but it still won't be able to fix them. The only thing it can do is report the corrupted data in the log and terminate.

valyala added a commit that referenced this issue Sep 6, 2022
…ed by blockHeader.{Min,Max}Timestamp when unpacking the block

This should reduce chances of unnoticed on-disk data corruption.

Updates #2998
Updates #3011

This change modifies the format for data exported via /api/v1/export/native -
now this data contains MaxTimestamp and PrecisionBits fields from blockHeader.

This is OK, since the native export format is undocumented.
@valyala
Collaborator

valyala commented Oct 7, 2022

FYI, starting from v1.82.0 VictoriaMetrics validates the correctness of timestamps stored on disk while performing background merges. This helps detect corruption of on-disk timestamps. VictoriaMetrics logs an error message when it detects corrupted timestamps and then exits.
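The kind of sanity check described above can be sketched as follows. This is a simplified, hypothetical model, not the actual VictoriaMetrics code: the `blockHeader` struct and `validateTimestamps` function are assumptions for illustration. The invariant checked is that every decoded timestamp is non-decreasing and falls inside the range declared by the block header.

```go
package main

import "fmt"

// blockHeader is a simplified, hypothetical stand-in for the per-block
// metadata a storage engine keeps; field names are assumptions.
type blockHeader struct {
	MinTimestamp int64 // milliseconds
	MaxTimestamp int64 // milliseconds
}

// validateTimestamps checks that decoded timestamps are non-decreasing
// and lie within the range declared by the block header. A violation
// indicates on-disk corruption (or a decoder bug).
func validateTimestamps(h blockHeader, timestamps []int64) error {
	prev := h.MinTimestamp
	for i, ts := range timestamps {
		if ts < h.MinTimestamp || ts > h.MaxTimestamp {
			return fmt.Errorf("timestamp %d at index %d is outside [%d, %d]",
				ts, i, h.MinTimestamp, h.MaxTimestamp)
		}
		if ts < prev {
			return fmt.Errorf("timestamp %d at index %d is out of order", ts, i)
		}
		prev = ts
	}
	return nil
}

func main() {
	h := blockHeader{MinTimestamp: 1000, MaxTimestamp: 2000}
	fmt.Println(validateTimestamps(h, []int64{1000, 1500, 2000})) // <nil>
	fmt.Println(validateTimestamps(h, []int64{1000, 2500}))       // non-nil: out of range
}
```

Because the check is cheap relative to decompression, running it during background merges (which already read every block) adds detection coverage without a separate scrubbing pass.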

valyala added a commit that referenced this issue Oct 20, 2022
…ng, which needs validation

This reduces CPU usage when there is no sense in validating timestamps.

This is a follow-up for 5fa9525

Updates #2998
Updates #3011
@jiekun
Contributor

jiekun commented Jul 22, 2024

I think this enhancement request is resolved, according to this comment. Closing now. Feel free to reopen it if you encounter a new issue on the same topic.

@jiekun jiekun closed this as completed Jul 22, 2024
@damnever
Author

@jiekun I don't think this issue is resolved by validating timestamps alone, as VictoriaMetrics stores more than just timestamps on disk. I am not currently using VictoriaMetrics in a production system, and this isn't a problem I'm actually encountering; it's more of a potential-risk report, but it could indeed happen.

@jiekun
Contributor

jiekun commented Jul 22, 2024

Thank you for pointing that out, really appreciated! <3

I believe valyala has considered more complex checksumming and decided to go with the current approach, based on the K.I.S.S. principle. @valyala Please reopen it if you think we need to go further on this.

@damnever
Author

I like the K.I.S.S. principle, but it does not completely address the actual issue.
