latency spike and wave pattern in disk usage - v19.2.4 #45557

Closed
ghost opened this issue Mar 1, 2020 · 4 comments
Labels
T-storage Storage Team

Comments

@ghost

ghost commented Mar 1, 2020

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:
Latency spikes on a production system.

To Reproduce

What did you do? Describe in your own words.
Added a whole lot of BLOBs (~1 TB; ~50 million rows).
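For illustration only, a hypothetical sketch of what this kind of bulk BLOB load might look like; the table name, column name, row size, and batch size are made up, not taken from the real workload:

```sql
-- Hypothetical sketch; schema and sizes are illustrative only.
CREATE TABLE IF NOT EXISTS blobs (
    id   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    data BYTES NOT NULL   -- roughly 20 KB per row * ~50M rows ≈ 1 TB
);

-- One batch of an (assumed) client-side insert loop:
INSERT INTO blobs (data)
SELECT repeat('x', 20000)::BYTES
FROM generate_series(1, 1000);
```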

Additional data / screenshots
[Screenshot: 1-week disk usage graph]

Environment:

single node
started with v19.2.2; updated to v19.2.4 after noticing the problem
Linux 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Jira issue: CRDB-8005

@petermattis
Collaborator

The wave pattern in disk usage is really strange. Even with RocksDB compactions we never see fluctuations that dramatic. Can you share the RocksDB SSTables and RocksDB Flushes/Compactions graphs (on the Storage dashboard)?

What did you do? Describe in your own words.
Added a whole lot of BLOBs (~1 TB; ~50 million rows).

It would be helpful for you to precisely describe what your workload is doing, the types of queries, the size of the cluster, including the number of nodes and the types of machines. Even better if you can share source code from the workload program.

@petermattis petermattis added this to Incoming in Storage via automation Mar 2, 2020
@ghost
Author

ghost commented Mar 2, 2020

[Screenshot: Storage dashboard graphs]

There are two drops; those were caused by me restarting the node.
The reason it's now less violent is (I guess) because I set gc.ttlseconds = 1000000.
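For reference, gc.ttlseconds is set through a zone configuration; the change was presumably something along these lines (the exact zone isn't stated above, so the targets below are examples):

```sql
-- Raise the GC TTL to 1000000 s (~11.5 days) from the 19.2 default of 25 h,
-- deferring garbage collection of old MVCC versions.
-- Cluster-wide:
ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 1000000;

-- Or scoped to a single (example) table:
ALTER TABLE blobs CONFIGURE ZONE USING gc.ttlseconds = 1000000;
```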

It would be helpful for you to precisely describe what your workload is doing, the types of queries, the size of the cluster, including the number of nodes and the types of machines. Even better if you can share source code from the workload program.

Single node; 48 cores; 512 GB RAM; 6 NVMe disks in RAID 10.
As said, this is a production system, not a stress-test system.
All types of queries, averaging 2-3k queries/s.

@ghost
Author

ghost commented Mar 7, 2020

Yesterday I ran into that problem again (latency spikes) and talked with @dt about how to change the soft and hard limits for pending compaction bytes. Now that I think of it, about a week before the problems started I dropped a JSONB column from a table; that column was maybe 400-500 GB.
So maybe this is related to #24029 and #26693.
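For context, the drop itself would have been a plain schema change like the hypothetical statement below; the point is that the old column data is not removed right away, it only becomes garbage-collectable once gc.ttlseconds expires, and clearing it out then drives heavy RocksDB compaction activity:

```sql
-- Hypothetical names, not from this issue. The ALTER returns quickly, but the
-- ~400-500 GB of old JSONB values stay on disk as MVCC garbage until the
-- gc.ttlseconds window passes, and removing them keeps compactions busy.
ALTER TABLE events DROP COLUMN payload;
```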

cockroach-rocksdb.db.root.2020-03-07T07_37_58Z.043406.log

@petermattis petermattis moved this from Incoming to To Do (future milestone) in Storage Mar 23, 2020
@petermattis petermattis moved this from To Do (future milestone) to To Do (investigations) in Storage Apr 30, 2020
@jlinder jlinder added the T-storage Storage Team label Jun 16, 2021
@mwang1026

19.2 has been EOL for a while. If you repro this in a newer version, feel free to reopen.

Storage automation moved this from To Do (investigations) to Done Mar 2, 2022