
HDDS-14841. Run createFakeDirIfShould outside the lock to prevent OM stuck #9932

Closed
ivandika3 wants to merge 1 commit into apache:master from ivandika3:HDDS-14841

Conversation

@ivandika3
Contributor

@ivandika3 ivandika3 commented Mar 16, 2026

What changes were proposed in this pull request?

We have encountered incidents caused by createFakeDirIfShould in getFileStatus: createFakeDirIfShould creates a RocksDB iterator, which can take a long time when the keyTable has a lot of tombstones. This causes the OM to get stuck, since writes on the same bucket are blocked, which in turn holds up all pending write transactions in the OM Ratis applier.

Let's move createFakeDirIfShould outside of the lock to prevent this. There is some tradeoff in terms of consistency, but since createFakeDirIfShould should not be the normal case, we can live with this.

This should only be relevant to LEGACY buckets, since FSO buckets do not have this iterator logic and OBS buckets will not be accessed by an FS client (getFileStatus is an FS op).
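The proposed change can be sketched as a before/after locking pattern. This is a minimal illustration, not the actual Ozone code: the class, the stubbed lookupKeyTable/createFakeDirIfShould bodies, and the plain ReentrantReadWriteLock standing in for the OM's bucket-lock manager are all hypothetical simplifications.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the change: only the fast point lookup stays
// under the bucket lock; the iterator-backed fake-dir check moves
// outside it, so a slow scan no longer blocks writers on the same bucket.
class FakeDirSketch {
  private final ReentrantReadWriteLock bucketLock = new ReentrantReadWriteLock();

  // Before: the slow iterator scan runs while the bucket lock is held.
  String getFileStatusBefore(String key) {
    bucketLock.readLock().lock();
    try {
      String status = lookupKeyTable(key);    // fast point lookup
      if (status == null) {
        status = createFakeDirIfShould(key);  // slow scan, under the lock
      }
      return status;
    } finally {
      bucketLock.readLock().unlock();
    }
  }

  // After: the scan runs once the lock is released, trading a little
  // consistency (the key table may change in between) for OM availability.
  String getFileStatusAfter(String key) {
    String status;
    bucketLock.readLock().lock();
    try {
      status = lookupKeyTable(key);
    } finally {
      bucketLock.readLock().unlock();
    }
    return status != null ? status : createFakeDirIfShould(key);
  }

  // Stubs standing in for the real table lookup and fake-dir logic.
  String lookupKeyTable(String key) { return null; }
  String createFakeDirIfShould(String key) { return "FAKE_DIR"; }
}
```

Both variants return the same result for a single caller; the difference only shows under concurrency, where the "after" form releases the lock before the expensive work.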

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14841

How was this patch tested?

Existing CI. Clean CI: https://github.com/ivandika3/ozone/actions/runs/23134601164

@ivandika3 ivandika3 marked this pull request as ready for review March 16, 2026 13:40
@ivandika3 ivandika3 self-assigned this Mar 16, 2026
@chungen0126
Contributor

Thanks @ivandika3 for working on this. However, I’m not entirely convinced that this is the right approach:

  • Performance Impact: I’m concerned we are sacrificing consistency for a minimal benefit. Is there any production data showing that this change actually optimizes the system?
  • Consistency Trade-off: Moving createFakeDirIfShould out of the lock might compromise the atomicity of the operation. If we proceed with this, we should at least consider making it a configurable option so users can choose between performance and strict consistency.

In my opinion, the risk of breaking consistency might outweigh the benefits here. Would love to hear your thoughts on this.

@ivandika3
Contributor Author

ivandika3 commented Mar 17, 2026

Thanks @chungen0126 for the review.

Performance Impact: I’m concerned we are sacrificing consistency for a minimal benefit. Is there any production data showing that this change actually optimizes the system?

We have had two performance issues and large incidents because of this behavior:

  • We have a large bucket containing a lot of keys (hundreds of millions), and we scheduled a lifecycle configuration to delete most of them; this generated millions of tombstones that were not compacted by RocksDB in a timely manner.
  • Issue 1: The OM trash emptier runs periodically and calls getFileStatus on every bucket's .Trash/ directory. It blocked for around 5 minutes while processing this large bucket, and during this time the whole OM was stuck due to the lock contention mentioned in the description.
  • Issue 2: We recently had a user trying to access .Trash/ before the trash directory had been created (since no user of the bucket had deleted anything via Hadoop trash). This caused the bucket lock to be held for around a few hours, and RocksDB iterator metrics showed hundreds of millions of keys being skipped. We had to restart the OMs manually to restore service.

I understand that we have a periodic compaction that compacts the large keyTable, but it runs only every few hours, and by then a large number of tombstones may already have accumulated.
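As a toy illustration of why the iterator ends up skipping so many entries: in an LSM store, a delete writes a tombstone marker rather than removing the entry, so a seek to the next live key must step over every uncompacted tombstone in between. The sketch below models this with a plain TreeMap; it is not RocksDB, just the cost pattern.

```java
import java.util.TreeMap;

// Toy model of LSM tombstone behavior: deletes leave markers that a
// scan must step over until compaction physically removes them.
class TombstoneSketch {
  static final Object TOMBSTONE = new Object();
  private final TreeMap<String, Object> table = new TreeMap<>();

  void put(String key, Object value) { table.put(key, value); }

  // A "delete" is a logical marker, not a physical removal.
  void delete(String key) { table.put(key, TOMBSTONE); }

  // Count entries the iterator touches before reaching the first live
  // key at or after 'from' -- this grows with the tombstone count.
  long entriesScannedToFirstLiveKey(String from) {
    long scanned = 0;
    for (Object value : table.tailMap(from).values()) {
      scanned++;
      if (value != TOMBSTONE) {
        break;
      }
    }
    return scanned;
  }
}
```

With millions of tombstones between the seek position and the first live key, a single getFileStatus-style probe pays for all of them, which matches the "hundreds of millions of keys skipped" iterator metric described above.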

Consistency Trade-off: Moving createFakeDirIfShould out of the lock might compromise the atomicity of the operation. If we proceed with this, we should at least consider making it a configurable option so users can choose between performance and strict consistency.

I understand the concern, and since the fake dir logic is based on a limitation of the OBS/LEGACY flat namespace, we have to live with it. Properly, a RocksDB iterator should be used for range queries (listKeys, listStatus), not for a point query (getFileStatus). Even for listKeys and listStatus, the iterator is not protected by BUCKET_LOCK (see HDDS-13596).

Anyway, this is legacy behavior that should not be used for new buckets. The long-term plan is to migrate to OBS and FSO buckets to prevent these issues.

Please let me know if you have any suggestions for this. I'm OK if the community decides not to go ahead with it.

@chungen0126
Contributor

Thanks for the explanation.

We have a large bucket which contains a lot of keys (hundreds of millions) and we schedule lifecycle configuration to delete most of the keys, this generates millions of tombstones that were not compacted by RocksDB in a timely manner

Regarding this point, since the object deletion triggered by the lifecycle is a periodic and internal Ozone operation, I have an idea: Is it possible to automatically trigger a compaction if we detect that the number of deleted objects exceeds a certain threshold?

I understand the concern and since the fake dir logic is based on the limitation of the OBS/LEGACY flat namespace, we had to contend with it.

Another concern of mine is that this change might increase system instability. It could potentially lead to flaky tests, which would make the project much harder to maintain in the long run.

@ivandika3
Contributor Author

Regarding this point, since the object deletion triggered by the lifecycle is a periodic and internal Ozone operation, I have an idea: Is it possible to automatically trigger a compaction if we detect that the number of deleted objects exceeds a certain threshold?

Yes, this is planned for the future, but RocksDB compaction takes time to compact out the tombstones, and while the compaction runs the tombstones are still present.
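The threshold idea suggested above could look roughly like the sketch below. Everything here is hypothetical: the CompactionTrigger class and its recordDelete hook are not Ozone APIs, and the Runnable stands in for whatever actually fires the compaction (in RocksDB terms, a compactRange call on the keyTable's column family).

```java
// Hypothetical sketch of a deletion-count trigger for compaction:
// once the accumulated delete count crosses a threshold, fire the
// supplied compaction hook and reset the counter.
class CompactionTrigger {
  private final long threshold;
  private final Runnable compact;  // e.g. a wrapper around compactRange
  private long deletesSinceCompaction = 0;

  CompactionTrigger(long threshold, Runnable compact) {
    this.threshold = threshold;
    this.compact = compact;
  }

  // Called once per deleted object; returns true when a compaction
  // was triggered by this delete.
  synchronized boolean recordDelete() {
    if (++deletesSinceCompaction >= threshold) {
      deletesSinceCompaction = 0;
      compact.run();
      return true;
    }
    return false;
  }
}
```

Note that, per the reply above, this would only shrink the window: tombstones written after the trigger fires, or still being compacted, can still slow an iterator in the meantime.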

Another concern of mine is that this change might increase system instability.

Which instability are you referring to? In our production experience, the current behavior is itself a source of system instability, since it is not acceptable for a single read to block the whole OM.

It could potentially lead to flaky tests, which would make the project much harder to maintain in the long run.

Fair point, but the existing fake dir test in TestKeyManagerImpl does not seem to exercise concurrent access anyway.

@ivandika3
Contributor Author

Let's close this; we can reopen it if the community needs it.

@ivandika3 ivandika3 closed this Mar 18, 2026