
HDDS-14841. Run createFakeDirIfShould outside the lock to prevent OM stuck #9932

Closed
ivandika3 wants to merge 1 commit into apache:master from ivandika3:HDDS-14841

Conversation

@ivandika3
Contributor

@ivandika3 ivandika3 commented Mar 16, 2026

What changes were proposed in this pull request?

We have encountered incidents caused by createFakeDirIfShould in getFileStatus: createFakeDirIfShould creates a RocksDB iterator, which can take a long time when the keyTable has a lot of tombstones. This causes the OM to get stuck, since writes on the same bucket are blocked, which in turn holds up all pending write transactions in the OM Ratis applier.

Let's move createFakeDirIfShould outside of the lock to prevent this. There is some tradeoff in terms of consistency, but since createFakeDirIfShould should not be the normal case, we can live with this.

This should only be relevant to LEGACY buckets, since FSO buckets do not have this iterator logic and OBS buckets will not be accessed by an FS client (getFileStatus is an FS op).
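The proposed change can be sketched as a before/after locking pattern. This is a minimal illustration, not the actual Ozone code: the class, the stubbed lookupKeyTable/createFakeDirIfShould bodies, and the plain ReentrantReadWriteLock standing in for the OM's bucket-lock manager are all hypothetical simplifications.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the change: only the fast point lookup stays
// under the bucket lock; the iterator-backed fake-dir check moves
// outside it, so a slow scan no longer blocks writers on the same bucket.
class FakeDirSketch {
  private final ReentrantReadWriteLock bucketLock = new ReentrantReadWriteLock();

  // Before: the slow iterator scan runs while the bucket lock is held.
  String getFileStatusBefore(String key) {
    bucketLock.readLock().lock();
    try {
      String status = lookupKeyTable(key);    // fast point lookup
      if (status == null) {
        status = createFakeDirIfShould(key);  // slow scan, under the lock
      }
      return status;
    } finally {
      bucketLock.readLock().unlock();
    }
  }

  // After: the scan runs once the lock is released, trading a little
  // consistency (the key table may change in between) for OM availability.
  String getFileStatusAfter(String key) {
    String status;
    bucketLock.readLock().lock();
    try {
      status = lookupKeyTable(key);
    } finally {
      bucketLock.readLock().unlock();
    }
    return status != null ? status : createFakeDirIfShould(key);
  }

  // Stubs standing in for the real table lookup and fake-dir logic.
  String lookupKeyTable(String key) { return null; }
  String createFakeDirIfShould(String key) { return "FAKE_DIR"; }
}
```

Both variants return the same result for a single caller; the difference only shows under concurrency, where the "after" form releases the lock before the expensive work.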

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14841

How was this patch tested?

Existing CI. Clean CI: https://github.com/ivandika3/ozone/actions/runs/23134601164

@ivandika3 ivandika3 marked this pull request as ready for review March 16, 2026 13:40
@ivandika3 ivandika3 self-assigned this Mar 16, 2026
@chungen0126
Contributor

Thanks @ivandika3 for working on this. However, I’m not entirely convinced that this is the right approach:

  • Performance Impact: I’m concerned we are sacrificing consistency for a minimal benefit. Is there any production data showing that this change actually optimizes the system?
  • Consistency Trade-off: Moving createFakeDirIfShould out of the lock might compromise the atomicity of the operation. If we proceed with this, we should at least consider making it a configurable option so users can choose between performance and strict consistency.

In my opinion, the risk of breaking consistency might outweigh the benefits here. Would love to hear your thoughts on this.

@ivandika3
Contributor Author

ivandika3 commented Mar 17, 2026

Thanks @chungen0126 for the review.

Performance Impact: I’m concerned we are sacrificing consistency for a minimal benefit. Is there any production data showing that this change actually optimizes the system?

We have had two performance issues and large incidents because of this behavior:

  • We have a large bucket containing a lot of keys (hundreds of millions), and we scheduled a lifecycle configuration to delete most of them; this generated millions of tombstones that were not compacted by RocksDB in a timely manner.
  • Issue 1: The OM trash emptier runs periodically and calls getFileStatus on every bucket's .Trash/ directory. It blocked for around 5 minutes while processing this large bucket, and during this time the whole OM was stuck due to the lock contention mentioned in the description.
  • Issue 2: We recently had a user trying to access .Trash/ before the trash directory had been created (since no user of the bucket had deleted anything via Hadoop trash). This caused the bucket lock to be held for around a few hours, and RocksDB iterator metrics showed hundreds of millions of keys being skipped. We had to restart the OMs manually to restore service.

I understand that we have a periodic compaction that compacts the large keyTable, but it runs only every few hours, and by then a large number of tombstones may already have accumulated.
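As a toy illustration of why the iterator ends up skipping so many entries: in an LSM store, a delete writes a tombstone marker rather than removing the entry, so a seek to the next live key must step over every uncompacted tombstone in between. The sketch below models this with a plain TreeMap; it is not RocksDB, just the cost pattern.

```java
import java.util.TreeMap;

// Toy model of LSM tombstone behavior: deletes leave markers that a
// scan must step over until compaction physically removes them.
class TombstoneSketch {
  static final Object TOMBSTONE = new Object();
  private final TreeMap<String, Object> table = new TreeMap<>();

  void put(String key, Object value) { table.put(key, value); }

  // A "delete" is a logical marker, not a physical removal.
  void delete(String key) { table.put(key, TOMBSTONE); }

  // Count entries the iterator touches before reaching the first live
  // key at or after 'from' -- this grows with the tombstone count.
  long entriesScannedToFirstLiveKey(String from) {
    long scanned = 0;
    for (Object value : table.tailMap(from).values()) {
      scanned++;
      if (value != TOMBSTONE) {
        break;
      }
    }
    return scanned;
  }
}
```

With millions of tombstones between the seek position and the first live key, a single getFileStatus-style probe pays for all of them, which matches the "hundreds of millions of keys skipped" iterator metric described above.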

Consistency Trade-off: Moving createFakeDirIfShould out of the lock might compromise the atomicity of the operation. If we proceed with this, we should at least consider making it a configurable option so users can choose between performance and strict consistency.

I understand the concern, and since the fake dir logic is based on a limitation of the OBS/LEGACY flat namespace, we have to live with it. Properly, a RocksDB iterator should be used for range queries (listKeys, listStatus), not for a point query (getFileStatus). Even for listKeys and listStatus, the iterator is not protected by BUCKET_LOCK (see HDDS-13596).

Anyway, this is legacy behavior that should not be used for new buckets. The long-term plan is to migrate to OBS and FSO buckets to prevent these issues.

Please let me know if you have any suggestions for this. I'm OK if the community decides not to go ahead with it.

@chungen0126
Contributor

Thanks for the explanation.

We have a large bucket which contains a lot of keys (hundreds of millions) and we schedule lifecycle configuration to delete most of the keys, this generates millions of tombstones that were not compacted by RocksDB in a timely manner

Regarding this point, since the object deletion triggered by the lifecycle is a periodic and internal Ozone operation, I have an idea: Is it possible to automatically trigger a compaction if we detect that the number of deleted objects exceeds a certain threshold?

I understand the concern and since the fake dir logic is based on the limitation of the OBS/LEGACY flat namespace, we had to contend with it.

Another concern of mine is that this change might increase system instability. It could potentially lead to flaky tests, which would make the project much harder to maintain in the long run.

@ivandika3
Contributor Author

Regarding this point, since the object deletion triggered by the lifecycle is a periodic and internal Ozone operation, I have an idea: Is it possible to automatically trigger a compaction if we detect that the number of deleted objects exceeds a certain threshold?

Yes, this is planned for the future, but RocksDB compaction takes time to compact out the tombstones, and while the compaction runs the tombstones are still present.
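The threshold idea suggested above could look roughly like the sketch below. Everything here is hypothetical: the CompactionTrigger class and its recordDelete hook are not Ozone APIs, and the Runnable stands in for whatever actually fires the compaction (in RocksDB terms, a compactRange call on the keyTable's column family).

```java
// Hypothetical sketch of a deletion-count trigger for compaction:
// once the accumulated delete count crosses a threshold, fire the
// supplied compaction hook and reset the counter.
class CompactionTrigger {
  private final long threshold;
  private final Runnable compact;  // e.g. a wrapper around compactRange
  private long deletesSinceCompaction = 0;

  CompactionTrigger(long threshold, Runnable compact) {
    this.threshold = threshold;
    this.compact = compact;
  }

  // Called once per deleted object; returns true when a compaction
  // was triggered by this delete.
  synchronized boolean recordDelete() {
    if (++deletesSinceCompaction >= threshold) {
      deletesSinceCompaction = 0;
      compact.run();
      return true;
    }
    return false;
  }
}
```

Note that, per the reply above, this would only shrink the window: tombstones written after the trigger fires, or still being compacted, can still slow an iterator in the meantime.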

Another concern of mine is that this change might increase system instability.

Which instability are you referring to? In our production experience, the current behavior is itself a source of system instability, since it is not acceptable for a single read to block the whole OM.

It could potentially lead to flaky tests, which would make the project much harder to maintain in the long run.

Fair point, but the existing fake dir test in TestKeyManagerImpl does not seem to exercise concurrent access anyway.

@ivandika3
Contributor Author

Let's close this; we can reopen it if the community needs it.

@ivandika3 ivandika3 closed this Mar 18, 2026