HDDS-14841. Run createFakeDirIfShould outside the lock to prevent OM stuck #9932
ivandika3 wants to merge 1 commit into apache:master from
Conversation
Thanks @ivandika3 for working on this. However, I’m not entirely convinced that this is the right approach:
In my opinion, the risk of breaking consistency might outweigh the benefits here. Would love to hear your thoughts on this.
Thanks @chungen0126 for the review.
We have had two performance issues and major incidents because of this behavior.
I understand that we have a periodic compaction that will compact the large keyTable, but this is run every few hours and by that time there might already be a large number of tombstones.
I understand the concern, but since the fake dir logic stems from a limitation of the OBS/LEGACY flat namespace, we have to contend with it. By right, a RocksDB iterator should be used for range queries (listKeys, listStatus), not for point queries (getFileStatus). Even for listKeys and listStatus, the iterator is not protected by BUCKET_LOCK (see HDDS-13596). In any case, this is legacy behavior that should not be used for new buckets; the long-term plan is to migrate to OBS and FSO buckets to prevent these issues. Please let me know if you have any suggestions. I'm OK if the community decides not to go ahead with it.
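To illustrate the point above about point vs. range queries, here is a toy sketch (plain Java, not Ozone or RocksDB code): a point lookup is cheap regardless of how many deleted entries exist, while a prefix scan has to step over every one of them. Deleted keys are simulated here as null values, loosely mirroring how a RocksDB iterator walks over tombstones before reaching the next live key; all class and method names are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of a sorted key table with simulated tombstones (null values).
public class PointVsRangeQuery {
    static TreeMap<String, String> table = new TreeMap<>();

    // Point query (like getFileStatus): one lookup, tombstones irrelevant.
    static String pointLookup(String key) {
        return table.get(key);
    }

    // Range query (like listKeys/listStatus): must walk every entry under
    // the prefix, including simulated tombstones. Returns how many deleted
    // entries the scan had to skip, i.e. the wasted work.
    static int prefixScan(String prefix) {
        int skipped = 0;
        for (Map.Entry<String, String> e : table.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            if (e.getValue() == null) { // simulated tombstone
                skipped++;
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        table.put("bucket1/dir/file", "live");
        for (int i = 0; i < 1000; i++) {
            table.put("bucket1/key" + i, null); // 1000 simulated tombstones
        }
        System.out.println(pointLookup("bucket1/dir/file")); // prints "live"
        System.out.println(prefixScan("bucket1/"));          // prints 1000
    }
}
```

The scan cost grows with the number of tombstones even though only one live key exists, which is the pathology described in this thread.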
Thanks for the explanation.
Regarding this point, since the object deletion triggered by the lifecycle is a periodic and internal Ozone operation, I have an idea: Is it possible to automatically trigger a compaction if we detect that the number of deleted objects exceeds a certain threshold?
Another concern of mine is that this change might increase system instability. It could potentially lead to flaky tests, which would make the project much harder to maintain in the long run.
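The threshold-triggered compaction idea suggested above could be sketched roughly like this (hypothetical, not an actual Ozone API; the threshold constant, method names, and the compaction hook are all illustrative assumptions):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: count deletions from the internal lifecycle path
// and trigger a compaction once a threshold is crossed, instead of waiting
// for the periodic compaction service.
public class TombstoneTriggeredCompaction {
    static final long TOMBSTONE_THRESHOLD = 100_000; // illustrative value
    private final AtomicLong deletesSinceCompaction = new AtomicLong();
    private long compactionsTriggered = 0;

    // Would be called from the periodic, internal object-deletion path.
    void onKeyDeleted() {
        if (deletesSinceCompaction.incrementAndGet() >= TOMBSTONE_THRESHOLD) {
            deletesSinceCompaction.set(0);
            compactKeyTable();
        }
    }

    void compactKeyTable() {
        // A real implementation would invoke a RocksDB range compaction on
        // the key table here; this stub only records the trigger.
        compactionsTriggered++;
    }

    long compactionsTriggered() {
        return compactionsTriggered;
    }
}
```

Even with such a trigger, the compaction itself takes time, so tombstones can still accumulate between the trigger and completion, as noted in the reply below.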
Yes, this is planned in the future, but RocksDB compaction takes time to compact out the tombstones and during the compaction there might still be tombstones.
Which instability are you referring to? In our production experience, this behavior is itself a source of system instability, since it is not acceptable for a single read to block the whole OM.
Fair, but currently the fake dir test in
Let's close this; we can reopen it if the community needs it.
What changes were proposed in this pull request?
We have encountered incidents caused by createFakeDirIfShould in getFileStatus: createFakeDirIfShould creates a RocksDB iterator, which can take a long time when the keyTable contains many tombstones. This causes the OM to get stuck, since writes on the same bucket are held up, which in turn holds up all pending write transactions in the OM Ratis applier.
Let's move createFakeDirIfShould outside of the lock to prevent this. There is a tradeoff in terms of consistency, but since createFakeDirIfShould is not the normal case, we can live with it.
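A minimal sketch of the proposed restructuring (not the actual KeyManagerImpl code; a plain ReentrantReadWriteLock stands in for the OM bucket lock, and slowIteratorScan is a placeholder for createFakeDirIfShould): the potentially slow scan runs before the lock is taken, so a read that iterates over many tombstones no longer blocks writers on the same bucket.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative before/after of moving slow iterator work out of the lock.
public class LockScopeSketch {
    private final ReentrantReadWriteLock bucketLock = new ReentrantReadWriteLock();

    // Before: the slow scan runs under the bucket lock, so a single slow
    // read holds up every writer on the bucket.
    String getFileStatusBefore(String key) {
        bucketLock.readLock().lock();
        try {
            slowIteratorScan(key); // can take a long time with many tombstones
            return lookup(key);
        } finally {
            bucketLock.readLock().unlock();
        }
    }

    // After: the scan happens outside the lock; only the cheap point
    // lookup is protected. The scan result may be slightly stale, which is
    // the consistency tradeoff described in this PR.
    String getFileStatusAfter(String key) {
        slowIteratorScan(key); // outside the lock
        bucketLock.readLock().lock();
        try {
            return lookup(key);
        } finally {
            bucketLock.readLock().unlock();
        }
    }

    void slowIteratorScan(String key) {
        // Placeholder for createFakeDirIfShould's iterator walk.
    }

    String lookup(String key) {
        return "status:" + key; // placeholder for the point query
    }
}
```

Both variants return the same result; only the window during which the lock is held changes.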
This should only be relevant to LEGACY buckets, since FSO buckets do not have this iterator logic and OBS buckets are not accessed by FS clients (getFileStatus is an FS operation).
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14841
How was this patch tested?
Existing CI. Clean CI: https://github.com/ivandika3/ozone/actions/runs/23134601164