Cache tiering: fixing the slow request, waiting for rw locks issue #2374

wonzhq · 2014-09-02T06:52:26Z

Fix issue 9285.

Signed-off-by: Zhiqiang Wang wonzhq@hotmail.com

athanatos · 2014-09-02T21:41:26Z

At the least, you need to clear the set on on_change. You also have to make sure to remove the object from the set if the corresponding request is discarded due to the client disconnecting.

However, I'm not really convinced that that is the problem. Really, the issue is that we can evict an object which has a queued request pending on it.

liewegas · 2014-09-02T23:29:47Z

Maybe we can look at whether there are pending requests on the obc in the agent_work() checks?

wonzhq · 2014-09-03T02:04:28Z

You are right. Checking the queued requests would be a better idea. But where do we hold the pending request? Are they ObjectContext::RWState::waiters and ObjectContext::RWState::count?

wonzhq · 2014-09-03T03:56:39Z

If checking ObjectContext::RWState::waiters and ObjectContext::RWState::count for the pending requests on this object, there is still a window which the problem can happen. That is after the promotion replication and requeuing the client request, and before dequeuing the client request. Should we loop the OSD::op_wq to check the pending requests on an object? Or adding something in the ObjectContext to remember the pending requests? @athanatos @liewegas

liewegas · 2014-09-08T21:39:50Z

Hmm, that's true that there is still that window. Is it necessary that this is completely air-tight, though? As long as we avoid evicting a newly-promoted object before the request is processed we will win. I'm afraid that a complicated mechanism to cover this could introduce more complexity than we need.

wonzhq · 2014-09-09T07:07:32Z

Tried to use ObjectContext::RWState::count to check the pending request. In my testing, it hit the slow request just once. I checked the log, it exactly falls into the window we talked above. So with this solution, it's possible that we still hit this issue, but much less than before. Should we go ahead with this solution?

wonzhq · 2014-09-16T08:41:30Z

Updated with the following changes:

Make it a map to record the time when it starts promotion. If an object stays in this map for more than 60s during eviction, we evict it anyway.
Addressed Sam's comments to clear the map on on_change, and remove the object from the map if the corresponding request is discarded.
Pls take a look, thx. @athanatos @liewegas

liewegas · 2014-09-21T23:12:08Z

src/osd/ReplicatedPG.cc

+                    m->get_object_locator().nspace);
+      if (promote_in_flight.count(oid)) {
+	promote_in_flight.erase(oid);
+      }


the .count(oid) guard around the no-op actually slows things down since we do the lookup twice (once for count(), once for erase()). just .erase(oid) is sufficient.

liewegas · 2014-09-21T23:13:03Z

This looks okay to me, modulo the useless count() calls. @athanatos ?

object Fix issue 9285. Signed-off-by: Zhiqiang Wang <wonzhq@hotmail.com>

wonzhq · 2014-09-22T02:41:36Z

Updated according to Sage's comments.

athanatos · 2014-09-23T21:04:19Z

I think the best way to fix this is simpler: we should simply add the object to the hitset very early in the do_op call, before we put it on any wait lists or attempt to promote. The cache agent won't want to evict such an object anyway.

wonzhq · 2014-09-24T08:14:52Z

In the current code we already add the object to the hit set before putting it to any wait lists or attempt to promote. But this doesn't prevent the object from being evicted. Also, if the promotion takes longer time, since we have a fixed number of hit sets, the hit set which this object is in might be removed before the promotion is done. @athanatos

athanatos · 2014-09-29T21:31:56Z

No, we add the object to the hitset only after we have passed most of the wait lists. Admittedly, we do add it to the hitset before starting promotion. If you want to pursue the approach in this patch, you have to make sure that if the op causing the promotion is thrown out, we clear the promotion entry. I think that's going to be quite annoying, and will only help in the case where the cache is thrashing so badly that it won't perform well anyway (the case where we are trying to evict an object accessed so recently that the promotion hasn't even completed). I'd rather a less intrusive solution if possible.

wonzhq · 2014-10-09T06:19:17Z

Sorry, @athanatos , what are the wait lists you are talking about in the do_op call? Do you mean 'waiting_for_active', 'waiting_for_unreadable_object', 'waiting_for_degraded_object', 'waiting_for_blocked_object' etc. In the normal case, the object is not in any of these lists (It is not in the 'waiting_for_blocked_object' list after promotion). If moving adding the object to the hitset before passing these lists, I don't understand how it could prevent the object from evicting.
As for the approach in this patch, I've added the code to clear the map on on_change, and remove the object from the map if the corresponding request is discarded. Also, a timestamp is added per Sage's suggestion, so that the object won't stay in the 'promotion_in_flight' map for longer than 60s.

liewegas · 2014-12-06T01:29:07Z

Let's reevaluate this after the current promotion improvements are in place

wonzhq force-pushed the wip-9285 branch from 6a580d5 to e5e4373 Compare September 16, 2014 08:35

liewegas reviewed Sep 21, 2014
View reviewed changes

Cache tiering: fix the slow request issue because of wrongly evicting

9389e4d

object Fix issue 9285. Signed-off-by: Zhiqiang Wang <wonzhq@hotmail.com>

wonzhq force-pushed the wip-9285 branch from e5e4373 to 9389e4d Compare September 22, 2014 02:38

ghost added the bug-fix label Oct 4, 2014

liewegas force-pushed the master branch from 232990c to 4e4702d Compare October 21, 2014 16:57

ghost added the core label Nov 5, 2014

liewegas closed this Dec 6, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache tiering: fixing the slow request, waiting for rw locks issue #2374

Cache tiering: fixing the slow request, waiting for rw locks issue #2374

wonzhq commented Sep 2, 2014

athanatos commented Sep 2, 2014

liewegas commented Sep 2, 2014

wonzhq commented Sep 3, 2014

wonzhq commented Sep 3, 2014

liewegas commented Sep 8, 2014

wonzhq commented Sep 9, 2014

wonzhq commented Sep 16, 2014

liewegas Sep 21, 2014

liewegas commented Sep 21, 2014

wonzhq commented Sep 22, 2014

athanatos commented Sep 23, 2014

wonzhq commented Sep 24, 2014

athanatos commented Sep 29, 2014

wonzhq commented Oct 9, 2014

liewegas commented Dec 6, 2014

Cache tiering: fixing the slow request, waiting for rw locks issue #2374

Cache tiering: fixing the slow request, waiting for rw locks issue #2374

Conversation

wonzhq commented Sep 2, 2014

athanatos commented Sep 2, 2014

liewegas commented Sep 2, 2014

wonzhq commented Sep 3, 2014

wonzhq commented Sep 3, 2014

liewegas commented Sep 8, 2014

wonzhq commented Sep 9, 2014

wonzhq commented Sep 16, 2014

liewegas Sep 21, 2014

Choose a reason for hiding this comment

liewegas commented Sep 21, 2014

wonzhq commented Sep 22, 2014

athanatos commented Sep 23, 2014

wonzhq commented Sep 24, 2014

athanatos commented Sep 29, 2014

wonzhq commented Oct 9, 2014

liewegas commented Dec 6, 2014