New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache tiering: fixing the slow request, waiting for rw locks issue #2374
Conversation
At the least, you need to clear the set on on_change. You also have to make sure to remove the object from the set if the corresponding request is discarded due to the client disconnecting. However, I'm not really convinced that that is the problem. Really, the issue is that we can evict an object which has a queued request pending on it. |
Maybe we can look at whether there are pending requests on the obc in the agent_work() checks? |
You are right. Checking the queued requests would be a better idea. But where do we hold the pending request? Are they ObjectContext::RWState::waiters and ObjectContext::RWState::count? |
If checking ObjectContext::RWState::waiters and ObjectContext::RWState::count for the pending requests on this object, there is still a window which the problem can happen. That is after the promotion replication and requeuing the client request, and before dequeuing the client request. Should we loop the OSD::op_wq to check the pending requests on an object? Or adding something in the ObjectContext to remember the pending requests? @athanatos @liewegas |
Hmm, that's true that there is still that window. Is it necessary that this is completely air-tight, though? As long as we avoid evicting a newly-promoted object before the request is processed we will win. I'm afraid that a complicated mechanism to cover this could introduce more complexity than we need. |
Tried to use ObjectContext::RWState::count to check the pending request. In my testing, it hit the slow request just once. I checked the log, it exactly falls into the window we talked above. So with this solution, it's possible that we still hit this issue, but much less than before. Should we go ahead with this solution? |
Updated with the following changes:
|
m->get_object_locator().nspace); | ||
if (promote_in_flight.count(oid)) { | ||
promote_in_flight.erase(oid); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the .count(oid) guard around the no-op actually slows things down since we do the lookup twice (once for count(), once for erase()). just .erase(oid) is sufficient.
This looks okay to me, modulo the useless count() calls. @athanatos ? |
object Fix issue 9285. Signed-off-by: Zhiqiang Wang <wonzhq@hotmail.com>
Updated according to Sage's comments. |
I think the best way to fix this is simpler: we should simply add the object to the hitset very early in the do_op call, before we put it on any wait lists or attempt to promote. The cache agent won't want to evict such an object anyway. |
In the current code we already add the object to the hit set before putting it to any wait lists or attempt to promote. But this doesn't prevent the object from being evicted. Also, if the promotion takes longer time, since we have a fixed number of hit sets, the hit set which this object is in might be removed before the promotion is done. @athanatos |
No, we add the object to the hitset only after we have passed most of the wait lists. Admittedly, we do add it to the hitset before starting promotion. If you want to pursue the approach in this patch, you have to make sure that if the op causing the promotion is thrown out, we clear the promotion entry. I think that's going to be quite annoying, and will only help in the case where the cache is thrashing so badly that it won't perform well anyway (the case where we are trying to evict an object accessed so recently that the promotion hasn't even completed). I'd rather a less intrusive solution if possible. |
Sorry, @athanatos , what are the wait lists you are talking about in the do_op call? Do you mean 'waiting_for_active', 'waiting_for_unreadable_object', 'waiting_for_degraded_object', 'waiting_for_blocked_object' etc. In the normal case, the object is not in any of these lists (It is not in the 'waiting_for_blocked_object' list after promotion). If moving adding the object to the hitset before passing these lists, I don't understand how it could prevent the object from evicting. |
Let's reevaluate this after the current promotion improvements are in place |
Fix issue 9285.
Signed-off-by: Zhiqiang Wang wonzhq@hotmail.com