
Cache tiering: fixing the slow request, waiting for rw locks issue #2374

Closed · wants to merge 1 commit

Conversation

@wonzhq (Contributor) commented Sep 2, 2014

Fix issue 9285.

Signed-off-by: Zhiqiang Wang <wonzhq@hotmail.com>

@athanatos (Contributor)

At the least, you need to clear the set in on_change(). You also have to make sure to remove the object from the set if the corresponding request is discarded due to the client disconnecting.

However, I'm not really convinced that that is the problem. Really, the issue is that we can evict an object which has a queued request pending on it.

@liewegas (Member) commented Sep 2, 2014

Maybe we can look at whether there are pending requests on the obc in the agent_work() checks?
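
A minimal sketch of what such a check might look like, assuming the agent consults ObjectContext::RWState (the field names follow the members discussed later in this thread; this is illustrative, not the actual patch):

```cpp
// Hypothetical guard in the tiering agent's eviction path.
// RWState::count tracks current lock holders and RWState::waiters
// holds queued ops; if either is non-empty, a client request is
// still interested in this object, so skip evicting it for now.
bool agent_ok_to_evict(ObjectContextRef obc)
{
  if (obc->rwstate.count > 0 || !obc->rwstate.waiters.empty())
    return false;  // pending request on this object; don't evict
  // ... fall through to the normal eviction checks ...
  return true;
}
```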

@wonzhq (Contributor, author) commented Sep 3, 2014

You are right. Checking the queued requests would be a better idea. But where do we hold the pending requests? Are they in ObjectContext::RWState::waiters and ObjectContext::RWState::count?

@wonzhq (Contributor, author) commented Sep 3, 2014

If we check ObjectContext::RWState::waiters and ObjectContext::RWState::count for the pending requests on this object, there is still a window in which the problem can happen: after the promotion replication finishes and the client request is requeued, but before the client request is dequeued. Should we loop over OSD::op_wq to check the pending requests on an object? Or add something to the ObjectContext to remember the pending requests? @athanatos @liewegas

@liewegas (Member) commented Sep 8, 2014

Hmm, it's true that there is still that window. Is it necessary that this is completely air-tight, though? As long as we avoid evicting a newly promoted object before the request is processed, we will win. I'm afraid that a complicated mechanism to cover this could introduce more complexity than we need.

@wonzhq (Contributor, author) commented Sep 9, 2014

I tried using ObjectContext::RWState::count to check for pending requests. In my testing, it hit the slow request just once, and the log shows it falls exactly into the window we talked about above. So with this solution it's still possible to hit the issue, but much less often than before. Should we go ahead with it?

@wonzhq (Contributor, author) commented Sep 16, 2014

Updated with the following changes:

  1. Made it a map that records the time at which promotion starts. If an object stays in this map for more than 60s during eviction, we evict it anyway (see the sketch after this list).
  2. Addressed Sam's comments: clear the map in on_change(), and remove the object from the map if the corresponding request is discarded.

Please take a look, thanks. @athanatos @liewegas
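
Roughly, the structure described in point 1 could look like the following sketch (the name promote_in_flight and the 60-second cutoff come from the description above; everything else is an illustrative assumption, not the exact patch):

```cpp
// Sketch: remember when each promotion started; the agent skips
// eviction while a promotion is in flight, but only for up to 60s.
std::map<hobject_t, utime_t> promote_in_flight;

bool agent_skip_evict(const hobject_t &oid)
{
  std::map<hobject_t, utime_t>::iterator p = promote_in_flight.find(oid);
  if (p == promote_in_flight.end())
    return false;                          // no promotion pending
  if (double(ceph_clock_now(NULL) - p->second) > 60.0) {
    promote_in_flight.erase(p);            // stuck entry; evict anyway
    return false;
  }
  return true;                             // promotion in flight; skip
}
```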

The hunk under review:

```cpp
               m->get_object_locator().nspace);
  if (promote_in_flight.count(oid)) {
    promote_in_flight.erase(oid);
  }
```
Inline review comment from a Member:
the .count(oid) guard actually slows things down, since we do the lookup twice (once for count(), once for erase()); just .erase(oid) is sufficient.
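
In other words (a generic std::map illustration, not the patch itself), erase(key) is already safe when the key is absent, so the guard only duplicates the lookup:

```cpp
// Before: two lookups on the same key.
if (promote_in_flight.count(oid)) {
  promote_in_flight.erase(oid);
}

// After: one lookup. std::map::erase(key) removes the element if
// present, returns the number of elements erased (0 or 1 for a map),
// and is a no-op otherwise.
promote_in_flight.erase(oid);
```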

@liewegas (Member)

This looks okay to me, modulo the useless count() calls. @athanatos?

Commit: Fix issue 9285.
Signed-off-by: Zhiqiang Wang <wonzhq@hotmail.com>
@wonzhq (Contributor, author) commented Sep 22, 2014

Updated according to Sage's comments.

@athanatos (Contributor)

I think the best way to fix this is simpler: add the object to the hit set very early in the do_op call, before we put it on any wait lists or attempt to promote. The cache agent won't want to evict such an object anyway.
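
A sketch of that idea, assuming the hit_set member and HitSet::insert() used elsewhere in the tiering code (the placement shown is only indicative; the real do_op is much larger):

```cpp
void ReplicatedPG::do_op(OpRequestRef op)
{
  // ... decode the message and build the hobject_t `oid` as usual ...

  // Record the access at the very top, before any wait lists or
  // promotion, so the agent's recency check already sees the object
  // as hot and declines to evict it while the request is queued.
  if (hit_set)
    hit_set->insert(oid);

  // ... existing wait-list checks, promotion, and op execution ...
}
```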

@wonzhq (Contributor, author) commented Sep 24, 2014

In the current code we already add the object to the hit set before putting it on any wait lists or attempting to promote, but this doesn't prevent the object from being evicted. Also, if the promotion takes a long time, then since we have a fixed number of hit sets, the hit set this object is in might be removed before the promotion is done. @athanatos

@athanatos (Contributor)

No, we add the object to the hit set only after we have passed most of the wait lists. Admittedly, we do add it to the hit set before starting promotion. If you want to pursue the approach in this patch, you have to make sure that if the op causing the promotion is thrown out, we clear the promotion entry. I think that's going to be quite annoying, and it will only help in the case where the cache is thrashing so badly that it won't perform well anyway (the case where we are trying to evict an object accessed so recently that the promotion hasn't even completed). I'd rather have a less intrusive solution if possible.

@ghost added the bug-fix label Oct 4, 2014
@wonzhq (Contributor, author) commented Oct 9, 2014

Sorry @athanatos, what are the wait lists you are talking about in the do_op call? Do you mean 'waiting_for_active', 'waiting_for_unreadable_object', 'waiting_for_degraded_object', 'waiting_for_blocked_object', etc.? In the normal case, the object is not on any of these lists (it is not on the 'waiting_for_blocked_object' list after promotion). If we move adding the object to the hit set to before passing these lists, I don't understand how that could prevent the object from being evicted.
As for the approach in this patch, I've added code to clear the map in on_change(), and to remove the object from the map if the corresponding request is discarded. Also, a timestamp is added per Sage's suggestion, so that the object won't stay in the 'promotion_in_flight' map for longer than 60s.

@liewegas (Member) commented Dec 6, 2014

Let's reevaluate this after the current promotion improvements are in place.

@liewegas closed this Dec 6, 2014