block_cache checksum frequently triggered #226

Closed
wangyuyue opened this issue Apr 18, 2023 · 2 comments

wangyuyue commented Apr 18, 2023

Describe the bug
I ran the ssd_perf/graph_cache_leader workload with 64 Navy writer threads and 64 Navy reader threads. After about 50 minutes, the terminal output was flooded with "Item header checksum mismatch" errors.

To Reproduce
The configuration file I use:

{
    "cache_config": {
        "cacheSizeMB": 71680,
        "navyReaderThreads": 64,
        "navyWriterThreads": 64,
        "nvmCachePaths": [
            "/dev/nvme0n1"
        ],
        "nvmCacheSizeMB": 1075200,
        "writeAmpDeviceList": [
            "/dev/nvme0n1"
        ],
        "navyBigHashBucketSize": 4096,
        "navyBigHashSizePct": 65,
        "navyBlockSize": 4096,
        "navyParcelMemoryMB": 6048,
        "enableChainedItem": true,
        "htBucketPower": 26,
        "moveOnSlabRelease": false,
        "numPools": 2,
        "poolRebalanceIntervalSec": 5,
        "poolSizes": [
            0.2,
            0.8
        ]
    },
    "test_config": {
        "opDelayNs": 0,
        "opDelayBatch": 3,
        "generator": "online",
        "enableLookaside": true,
        "keyPoolDistribution": [
            0.5228545418419219,
            0.477145458158078
        ],
        "numKeys": 4100000000,
        "numOps": 200000000,
        "numThreads": 16,
        "opPoolDistribution": [
            0.571,
            0.429
        ],
        "poolDistributions": [
            {
                "addChainedRatio": 0.0,
                "delRatio": 0.0,
                "getRatio": 0.7684563460126871,
                "keySizeRange": [
                    8,
                    16
                ],
                "keySizeRangeProbability": [
                    1.0
                ],
                "loneGetRatio": 0.2315436539873129,
                "popDistFile": "fbobj_pop.json",
                "setRatio": 0.0,
                "valSizeDistFile": "fbobj_sizes.json"
            },
            {
                "addChainedRatio": 0.0,
                "delRatio": 0.0,
                "getRatio": 0.8841112719979642,
                "keySizeRange": [
                    8,
                    16
                ],
                "keySizeRangeProbability": [
                    1.0
                ],
                "loneGetRatio": 0.1158887280020357,
                "popDistFile": "assoc_pop.json",
                "setRatio": 0.0,
                "valSizeDistFile": "assoc_sizes.json"
            }
        ]
    }
}
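
Before suspecting the device, it can help to rule out a malformed workload description. The following is a minimal sketch, not part of cachebench: `ratio_problems` is a hypothetical helper, and the assumption that the distributions and per-pool operation ratios should each sum to 1 is mine, not something stated in this issue.

```python
import math

def ratio_problems(config, tol=1e-6):
    """Return human-readable problems found in a cachebench-style config dict.

    Assumption (not confirmed by the issue): distributions and per-pool
    operation ratios are expected to sum to 1.
    """
    problems = []
    cache, test = config["cache_config"], config["test_config"]

    for name in ("keyPoolDistribution", "opPoolDistribution"):
        total = sum(test[name])
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append(f"{name} sums to {total}")

    if not math.isclose(sum(cache["poolSizes"]), 1.0, abs_tol=tol):
        problems.append("poolSizes does not sum to 1")

    if len(test["poolDistributions"]) != cache["numPools"]:
        problems.append("poolDistributions length differs from numPools")

    for i, pool in enumerate(test["poolDistributions"]):
        # Sum every key ending in "Ratio" (getRatio, setRatio, loneGetRatio, ...).
        total = sum(v for k, v in pool.items() if k.endswith("Ratio"))
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append(f"pool {i} operation ratios sum to {total}")
    return problems

# Trimmed copy of the configuration posted above.
config = {
    "cache_config": {"numPools": 2, "poolSizes": [0.2, 0.8]},
    "test_config": {
        "keyPoolDistribution": [0.5228545418419219, 0.477145458158078],
        "opPoolDistribution": [0.571, 0.429],
        "poolDistributions": [
            {"addChainedRatio": 0.0, "delRatio": 0.0, "setRatio": 0.0,
             "getRatio": 0.7684563460126871, "loneGetRatio": 0.2315436539873129},
            {"addChainedRatio": 0.0, "delRatio": 0.0, "setRatio": 0.0,
             "getRatio": 0.8841112719979642, "loneGetRatio": 0.1158887280020357},
        ],
    },
}

print(ratio_problems(config) or "no ratio problems found")
```

Under these assumptions the posted configuration passes, which points away from the workload description as the source of the errors.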

Expected behavior
The run should print statistics every minute.

Screenshots

E0417 20:53:13.873994 878573 BlockCache.cpp:402] Item header checksum mismatch. Region 5229 is likely corrupted. Expected:1667855729, Actual: 564302642. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.049113 878575 BlockCache.cpp:402] Item header checksum mismatch. Region 5225 is likely corrupted. Expected:1634476133, Actual: 1712872625. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.211392 878577 BlockCache.cpp:402] Item header checksum mismatch. Region 5228 is likely corrupted. Expected:1868832889, Actual: 3472898925. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.413154 878579 BlockCache.cpp:402] Item header checksum mismatch. Region 5221 is likely corrupted. Expected:1948283493, Actual: 367335369. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.565924 878581 BlockCache.cpp:402] Item header checksum mismatch. Region 5230 is likely corrupted. Expected:1919033451, Actual: 3110670073. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.705444 878583 BlockCache.cpp:402] Item header checksum mismatch. Region 5219 is likely corrupted. Expected:1634476133, Actual: 1712872625. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.863216 878585 BlockCache.cpp:402] Item header checksum mismatch. Region 5226 is likely corrupted. Expected:1667855729, Actual: 564302642. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.022443 878587 BlockCache.cpp:402] Item header checksum mismatch. Region 5223 is likely corrupted. Expected:539912047, Actual: 1512533508. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.154551 878589 BlockCache.cpp:402] Item header checksum mismatch. Region 5245 is likely corrupted. Expected:1864397680, Actual: 29886577. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.297171 878591 BlockCache.cpp:402] Item header checksum mismatch. Region 5222 is likely corrupted. Expected:1713401463, Actual: 3011049547. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.436860 878593 BlockCache.cpp:402] Item header checksum mismatch. Region 5224 is likely corrupted. Expected:543516756, Actual: 4032079696. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.586964 878595 BlockCache.cpp:402] Item header checksum mismatch. Region 5243 is likely corrupted. Expected:544763750, Actual: 614490409. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.722491 878597 BlockCache.cpp:402] Item header checksum mismatch. Region 5239 is likely corrupted. Expected:1735353376, Actual: 2068717658. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.886340 878599 BlockCache.cpp:402] Item header checksum mismatch. Region 5232 is likely corrupted. Expected:1864397680, Actual: 29886577. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:16.040464 878601 BlockCache.cpp:402] Item header checksum mismatch. Region 5236 is likely corrupted. Expected:2053205024, Actual: 884959818. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:16.173985 878603 BlockCache.cpp:402] Item header checksum mismatch. Region 5231 is likely corrupted. Expected:543516756, Actual: 4032079696. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:16.327759 878605 BlockCache.cpp:402] Item header checksum mismatch. Region 5227 is likely corrupted. Expected:1814062440, Actual: 1958191057. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:16.474437 878607 BlockCache.cpp:402] Item header checksum mismatch. Region 5234 is likely corrupted. Expected:544763750, Actual: 614490409. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
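
A log like this is easier to reason about once it is grouped. The sketch below is only an illustration: `summarize` and `as_ascii` are hypothetical helpers, the regular expression is written against the lines shown here, and rendering the values as little-endian bytes assumes an x86 host.

```python
import re
import struct
from collections import defaultdict

# Matches the region number and both checksum values in the lines above.
LINE_RE = re.compile(
    r"Region (\d+) is likely corrupted\. Expected:\s*(\d+), Actual:\s*(\d+)"
)

def as_ascii(value):
    """Render a 32-bit value as its 4 little-endian bytes ('?' if unprintable)."""
    raw = struct.pack("<I", value)
    return "".join(chr(b) if 32 <= b < 127 else "?" for b in raw)

def summarize(log_text):
    """Map each (expected, actual) pair to the regions that reported it."""
    groups = defaultdict(list)
    for region, expected, actual in LINE_RE.findall(log_text):
        groups[(int(expected), int(actual))].append(int(region))
    return dict(groups)

# Three lines copied from the log above (message tail shortened).
sample = """\
E0417 20:53:13.873994 878573 BlockCache.cpp:402] Item header checksum mismatch. Region 5229 is likely corrupted. Expected:1667855729, Actual: 564302642. Aborting reclaim.
E0417 20:53:14.863216 878585 BlockCache.cpp:402] Item header checksum mismatch. Region 5226 is likely corrupted. Expected:1667855729, Actual: 564302642. Aborting reclaim.
E0417 20:53:15.022443 878587 BlockCache.cpp:402] Item header checksum mismatch. Region 5223 is likely corrupted. Expected:539912047, Actual: 1512533508. Aborting reclaim.
"""

for (expected, actual), regions in summarize(sample).items():
    print(f"expected={expected} ({as_ascii(expected)!r}) actual={actual} regions={regions}")
```

On this excerpt, different regions report identical Expected/Actual pairs, and the Expected values decode to printable text ("quic", "og. "). That pattern would be consistent with the regions holding data written by something other than the running cache rather than random bit flips, but it is an observation, not a confirmed diagnosis.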

Server:

  • Hardware: Intel Ice Lake, 72 cores, 256 GB RAM, 1.6 TB NVMe
  • OS: Ubuntu 20.04
  • CacheLib Version: 4dc071d

Question:
Can anyone help me find the cause of the crash? Is it due to the large number of Navy threads in use, or is it potentially a CacheLib bug?

jaesoo-fb (Contributor) commented

Hi @wangyuyue

You are getting the checksum error from data read from the SSD, and it is highly likely caused by a device error. Can you retry on another SSD, or just DRAM for debugging?

wangyuyue (Author) commented Apr 18, 2023

> You are getting the checksum error from data read from the SSD, and it is highly likely caused by a device error. Can you retry on another SSD, or just DRAM for debugging?

Hi, I found that the error may have been caused by a previous cachebench process that was not fully cleaned up. We can close the issue now.
Thanks.
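
For anyone hitting the same symptom, one way to check for a leftover process before starting a new run is to look for open file descriptors on the device. This is a rough, Linux-only sketch: `pids_with_open` is a hypothetical helper, it relies on `/proc`, and it only sees processes whose descriptors the current user may inspect (run it as root to see all of them).

```python
import os

def pids_with_open(path):
    """Return sorted PIDs holding an open file descriptor on `path` (Linux /proc)."""
    if not os.path.isdir("/proc"):
        return []  # not a Linux-style /proc; nothing to inspect
    target = os.path.realpath(path)
    holders = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.append(int(pid))
                    break
            except OSError:
                continue  # descriptor closed while we were scanning
    return sorted(holders)

if __name__ == "__main__":
    # Device path taken from the configuration in this issue.
    print(pids_with_open("/dev/nvme0n1"))
```

An empty list from a privileged run suggests no other process has the device open; any PID listed is worth inspecting before launching cachebench against that device again.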
