Our test setup consists of 3 machines, each with 1 NVMe drive and 12 HDDs.
The NVMe is split into 12 partitions, each serving as the caching layer for its associated HDD.
The machines are part of a Ceph cluster; each NVMe partition is used independently.
Non-default sysctl variables on all devices are:
`skip_seq_thresh_kb = 256`
`reclaim_policy = 1`
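For reference, this is how we apply those two values; a minimal sketch, assuming flashcache's usual sysctl layout under `/proc/sys/dev/flashcache/<cachedev>/`, where `cachedev` is a placeholder for the real cache device name:

```c
#include <stdio.h>

/* Hedged sketch: writes the two non-default values listed above.
 * "cachedev" is a placeholder; flashcache names its sysctl directory
 * after the disk+cache device pair (assumption based on its docs). */
static int write_sysctl(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (f == NULL) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", value);
	return fclose(f);
}

int main(void)
{
	write_sysctl("/proc/sys/dev/flashcache/cachedev/skip_seq_thresh_kb", "256");
	write_sysctl("/proc/sys/dev/flashcache/cachedev/reclaim_policy", "1");
	return 0;
}
```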
We've observed periodic stalls on the flashcache devices during tests (basically no IO at all), lasting from a few seconds to a few minutes. Most of the test IO ops were crossing the sequential threshold. During such stall periods:
- the `nr_queued` value on all flashcache devices is large (>10k)
- only one flashcache device at a time is able to lower its `nr_queued`; its underlying HDD shows 100% util in iotop, while all other HDDs show marginal values or no IO at all
- if `fallow_delay` is set to 0, perf top shows `flashcache_deq_pending` at the top; the hot spot is inside its internal for loop (near `if (node->index == index) {`, see the sketch after this list)
- if `fallow_delay` is set to 900, perf top shows either `flashcache_clean_set` or `_raw_spin_lock_irq` (called from `flashcache_clean_set`); the hot spot is in one of `flashcache_clean_set`'s for loops (near `if (!(cacheblk->cache_state & DIRTY_FALLOW_2))`)
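Since `fallow_delay = 0` pins perf top inside `flashcache_deq_pending`'s loop, here is a minimal sketch of the kind of linear scan that would produce that profile. This is paraphrased from my reading, not copied from the flashcache source; the struct layout and function name are illustrative:

```c
#include <stddef.h>

/* Illustrative only: a pending-job dequeue that walks the whole list
 * looking for the job tied to one cache block index. With nr_queued
 * in the tens of thousands, each call costs O(nr_queued); doing that
 * under a spinlock would pin one CPU, matching the perf top output. */
struct pending_job {
	int index;                  /* cache block the job waits on */
	struct pending_job *next;
};

static struct pending_job *
deq_pending_sketch(struct pending_job **head, int index)
{
	struct pending_job **prev = head, *node;

	for (node = *head; node != NULL; node = node->next) {
		if (node->index == index) {   /* the reported hot spot */
			*prev = node->next;   /* unlink and return the match */
			return node;
		}
		prev = &node->next;
	}
	return NULL;
}
```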
From what I read in the code, I understand that flashcache uses global kernel thread pools for its jobs. Is it possible that all cleaning jobs are executed on one core only? That would explain why only one HDD can clean itself at a time.
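To make the question concrete, the model I am worried about would look roughly like the kernel-style sketch below. This is an assumption about the threading layout, not the actual flashcache code; `clean_wq`, `clean_fn`, and `struct clean_work` are invented names:

```c
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

/* Assumed model: one single-threaded workqueue shared by ALL cache
 * devices. Every queued cleaning job then runs on one kernel thread
 * (one core), one after another, so only one backing HDD is written
 * at a time no matter how many devices have dirty blocks. */
static struct workqueue_struct *clean_wq;

struct clean_work {
	struct work_struct work;
	void *cache_dev;            /* hypothetical per-device context */
};

static void clean_fn(struct work_struct *w)
{
	struct clean_work *cw = container_of(w, struct clean_work, work);

	/* write back dirty blocks for cw->cache_dev; while this runs,
	 * every other device's cleaning job just sits in the queue */
	(void)cw;
}

static int sketch_setup(void)
{
	/* a single worker thread => cleaning serialized across devices */
	clean_wq = create_singlethread_workqueue("clean_sketch");
	return clean_wq ? 0 : -ENOMEM;
}
```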
When `skip_seq_thresh_kb = 0` and `fallow_delay = 0`, the cluster can handle its load without issues. When the dirty percentage is near 99%, queues start filling up, but they are cleaned up simultaneously and multiple HDDs are at 100% util.
Sysctl dump of a sample drive (configuration causing issues):