HBASE-30102 Add metric to account for region data classified as cold by the Time Based Priority logic (#8128)
Conversation
…by the Time Based Priority logic Change-Id: I5601a37300a3f5d10fe4886ba988f2d25e66b546
taklwu
left a comment
When Time Based Priority is disabled, what would the % Cold Data be? Is it always shown?
getAllCacheKeysForFile(hFileInfo.getHFileContext().getHFileName(), 0, Long.MAX_VALUE);
int evictedBlocks = evictBlockSet(keySet);
if (evictedBlocks > 0) {
  LOG.info("Evicted {} blocks for file {} as it is now considered cold by DataTieringManager",
nit: should we have it at debug level? I'm wondering if we'd see a lot of these messages.
Maybe, yeah. Although it would be triggered only when time based priority is enabled on an individual store, and once per affected file, it could still flood the logs. Let me switch it to DEBUG.
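The concern above is the standard trade-off between per-item DEBUG messages and a summary at INFO. As a self-contained illustration (HBase itself uses SLF4J; this sketch uses java.util.logging, where FINE plays the role of DEBUG, and all names here are hypothetical), a per-file message below the configured level never reaches the handlers, while a single summary at INFO does:

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class DebugLogLevelDemo {
    public static void main(String[] args) {
        Logger log = Logger.getLogger("DataTieringManagerDemo");
        log.setUseParentHandlers(false);

        // Count how many records actually pass the logger's level filter.
        final int[] emitted = {0};
        Handler counting = new Handler() {
            @Override public void publish(LogRecord r) { emitted[0]++; }
            @Override public void flush() {}
            @Override public void close() {}
        };
        log.addHandler(counting);
        log.setLevel(Level.INFO); // typical production level

        // One message per affected file, as in the eviction path above;
        // at FINE (the DEBUG analogue) these are all filtered out.
        for (int file = 0; file < 1000; file++) {
            log.fine("Evicted blocks for file " + file);
        }
        // A single summary at INFO survives the filter.
        log.info("Eviction pass complete");

        System.out.println(emitted[0]);
    }
}
```

With the logger at INFO, only the summary record is emitted, so operators keep a signal without the thousand per-file lines; flipping the level to FINE during an investigation restores the detail.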
if (key.getCfName() != null) {
  builder.setFamilyName(key.getCfName());
}
if (key.getRegionName() != null) {
  builder.setRegionName(key.getRegionName());
}
question: why weren't the cf and region name filled in before? Is it because the cold data logic needs them for log messages or other computation?
Yes. This is required not only by the new "coldDataRatio" metric we are adding, but also by the existing "regionCachedRatio", which is critical for the CacheAwareLoadBalancer. Without this change, we cannot calculate these metrics when recovering the persistent cache. IMO, it's a bug in the current CacheAwareLoadBalancer implementation.
Change-Id: I392517f882e7c5a8c6063b16f525f6467956a3bb
taklwu
left a comment
LGTM.
Something is wrong with the GitHub Actions run; can you try re-triggering it?
It would show as 0%. With "% Cold Data" and "% Cached", operators can infer whether there's a problem with the region cache: since the two are mutually exclusive, if both are low it means region caching ran into problems, most likely insufficient cache capacity.
for (HStoreFile file : newFiles) {
  // call isHotData to account for the new file size in regionColdDataSize, if the new file is
  // considered cold data as per data-tiering logic.
  isHotData(file.getFileInfo().getHFileInfo(), file.getFileInfo().getConf());
Can this cause a deadlock? This part is inside the regionColdDataSize.computeIfPresent block, and isHotData also runs regionColdDataSize.compute on the same ConcurrentMap.
You mean, if another thread is calling isHotData? I don't think it would; whichever thread reaches the regionColdDataSize atomic methods first would own the lock and block the other, right?
I checked this part with Claude and it gave this answer:
No — this is a single-thread deadlock. That's what makes it especially insidious.
One thread does this:
- Enters computeIfPresent(regionName, lambda) — acquires the internal bin lock
- Inside the lambda, calls isHotData(...)
- isHotData calls compute(regionName, lambda2) — tries to acquire the same bin lock
- The lock is non-reentrant, so the same thread blocks waiting on itself
It's not a classic two-thread deadlock — it's a single thread trying to re-acquire a lock it already holds, on a lock that doesn't support reentrancy. The thread hangs forever.
This will happen every time a compaction produces a new cold file in a region that already has cold data tracked. That's a normal steady-state scenario, not a rare race condition.
I don't think this is true. I've added a UT that simulates a compaction resulting in a new cold file (the scenario Claude mentioned) and it doesn't deadlock.
Thanks for taking a look and adding a unit test for it!
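For the record, the unit-test result is consistent with how ConcurrentHashMap guards its bins: the per-bin lock is a plain synchronized monitor, which is reentrant for the thread that already holds it, so a nested update on the same present key does not self-deadlock. A minimal stand-in (all names hypothetical, not the HBase code) that mirrors the computeIfPresent-calling-compute shape:

```java
import java.util.concurrent.ConcurrentHashMap;

public class NestedComputeDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Long> regionColdDataSize = new ConcurrentHashMap<>();
        regionColdDataSize.put("region-1", 100L);

        // Outer update, analogous to the computeIfPresent block in the PR.
        regionColdDataSize.computeIfPresent("region-1", (region, size) -> {
            // Nested update on the SAME key from the SAME thread, analogous
            // to isHotData() running compute() while the outer lambda executes.
            regionColdDataSize.compute("region-1", (r, s) -> s + 10L);
            // 'size' was read before the nested call, so returning size + 1
            // overwrites whatever the nested compute stored.
            return size + 1L;
        });

        // The program terminates: the bin lock is reentrant, no self-deadlock.
        System.out.println(regionColdDataSize.get("region-1"));
    }
}
```

Note the remaining caveat: while there is no hang, the outer lambda's return value silently overwrites the nested update (the map ends at 101, not 111), which is why the ConcurrentHashMap javadoc still warns that a mapping function must not update other mappings of the same map.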
// to evict it.
Set<BlockCacheKey> keySet =
  getAllCacheKeysForFile(hFileInfo.getHFileContext().getHFileName(), 0, Long.MAX_VALUE);
int evictedBlocks = evictBlockSet(keySet);
The method name shouldCacheFile suggests this is just a check, but it actually evicts the blocks.
Yeah, not ideal, but I still think it should be BucketCache's responsibility to evict blocks from files that became classified as cold after a time based priority configuration change. Moving this to DataTieringManager would couple it more tightly with BucketCache.
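The collect-then-evict shape in the hunk above (gather every cached key belonging to a file, then remove them and count how many were actually present) can be sketched with plain collections. This is an illustrative stand-in, not the BucketCache code: the string keys, the `fileName + "_" + offset` naming, and both method names here mimic getAllCacheKeysForFile/evictBlockSet but are hypothetical simplifications.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ColdFileEvictionSketch {
    // Hypothetical stand-in for a block cache's backing map,
    // keyed by "fileName_blockOffset" -> block size.
    static final Map<String, Long> cache = new ConcurrentHashMap<>();

    // Collect every cached key belonging to the given file, mirroring
    // getAllCacheKeysForFile(fileName, 0, Long.MAX_VALUE).
    static Set<String> allCacheKeysForFile(String fileName) {
        Set<String> keys = new HashSet<>();
        for (String key : cache.keySet()) {
            if (key.startsWith(fileName + "_")) {
                keys.add(key);
            }
        }
        return keys;
    }

    // Evict the collected keys and report how many were removed,
    // mirroring evictBlockSet(keySet).
    static int evictBlockSet(Set<String> keys) {
        int evicted = 0;
        for (String key : keys) {
            if (cache.remove(key) != null) {
                evicted++;
            }
        }
        return evicted;
    }

    public static void main(String[] args) {
        cache.put("coldfile_0", 64L);
        cache.put("coldfile_65536", 64L);
        cache.put("hotfile_0", 64L);
        int evicted = evictBlockSet(allCacheKeysForFile("coldfile"));
        // Both cold-file blocks evicted; the hot file's block remains.
        System.out.println(evicted + " " + cache.size());
    }
}
```

Returning the evicted count (rather than void) is what lets the caller gate its log message on `evictedBlocks > 0`, as in the diff above.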
Sure. I've run the failing tests locally and they all passed. I'm trying another round here.
Co-authored-by: Peter Somogyi <psomogyi@cloudera.com>
Change-Id: I2194da9f2d1e596ae76a0fa244a521a698ff13f9
…by the Time Based Priority logic (#8128) Signed-off-by: Peter Somogyi <psomogyi@apache.com> Signed-off-by: Tak Lon (Stephen) Wu <taklwu@apache.org>