Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

common/TrackedOp: fix inaccurate counting for slow requests #7690

Merged
merged 2 commits into from Mar 2, 2016

Conversation

Projects
None yet
3 participants
@xiexingguo
Copy link
Member

commented Feb 18, 2016

Also fix implicitly result code cast.

@tchaikov

This comment has been minimized.

Copy link
Contributor

commented Feb 18, 2016

@xiexingguo could you elaborated a little bit why the total # of slow request is not accurate? iiuc, we failed to collect the last op in the warning message. but it would better if you can put it in your commit message to help reviewers and other developers to understand your change.

@xiexingguo

This comment has been minimized.

Copy link
Member Author

commented Feb 18, 2016

@tchaikov
In the original design there are two counters in charge of collecting potentially problematic requests, namely 'slow' and 'warned'. Counter 'slow' is responsible for capturing all the requests which have already hit the "complain" limit and shall be marked as "slow" while counter 'warned' is responsible for countering those requests which have already hit the "warning interval" and thus shall be logged and then outputted.

The problem here is if 'warned' counter hits the log_threshold, we will quit the entire for loop but there may be residual shard_queues which may still containing slow requests. As a result, the 'slow' counter does not reflect the real number of total slow requests in all the shard_queues under this case. And no slow requests will be tracked especially when 'log_threshold' is set to zero.

The solution for the above problem is to keep counting 'slow' requests until we have finished traversing all the shard_queues, no matter whether we have gathered enough requests for log or not, and if so, we simply stop counter 'warned' and skip over logging process.

@xiexingguo xiexingguo force-pushed the xiexingguo:xxg-wip-14804 branch from 041ad71 to 7953169 Feb 18, 2016

@xiexingguo

This comment has been minimized.

Copy link
Member Author

commented Feb 18, 2016

@tchaikov Updated and thank you for your attention. Ping me if there is anything to be further clarified.

@xiexingguo xiexingguo force-pushed the xiexingguo:xxg-wip-14804 branch from 7953169 to cd80a66 Feb 19, 2016

xiexingguo added some commits Feb 18, 2016

common/TrackedOp: fix inaccurate counting for total slow requests
In the original design there are two counters in charge of collecting
potentially problematic requests, namely 'slow' and 'warned'.
Counter 'slow' is responsible for capturing all the requests which have
already hit the "complain" limit and shall be marked as "slow" while
counter 'warned' is responsible for countering those requests which
have already hit the "warning interval" and thus shall be logged and
then outputted.

The problem here is if 'warned' counter hits the log_threshold,
we will quit the entire for loop but there may be residual shard_queues
which may still containing slow requests. As a result, the 'slow' counter
does not reflect the real number of total slow requests in all the
shard_queues under this case. And no slow requests will be tracked
especially when 'log_threshold' is set to zero(Do we do this intentional?
Or else we shall never allow 'log_threshold' to be zero).

The solution for the above problem is to keep counting 'slow' requests
until we have finished traversing all the shard_queues, no matter
whether we have gathered enough requests for logging or not, and if so,
we simply stop counter 'warned' and skip over logging process.

Fixes: #14804
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
common/TrackedOp: avoid implicitly cast for return value
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

@xiexingguo xiexingguo force-pushed the xiexingguo:xxg-wip-14804 branch from cd80a66 to 52c04b8 Feb 20, 2016

@tchaikov tchaikov added the needs-qa label Feb 29, 2016

@tchaikov

This comment has been minimized.

Copy link
Contributor

commented Feb 29, 2016

lgtm. @xiexingguo thanks for the excellent explanation! it really helps.

@xiexingguo

This comment has been minimized.

Copy link
Member Author

commented Feb 29, 2016

My pleasure. And thank you for your careful review. I've learned so much.

liewegas added a commit that referenced this pull request Mar 2, 2016

Merge pull request #7690 from xiexingguo/xxg-wip-14804
common/TrackedOp: fix inaccurate counting for slow requests

Reviewed-by: Kefu Chai <kchai@redhat.com>

@liewegas liewegas merged commit c4f68ad into ceph:master Mar 2, 2016

1 of 2 checks passed

default Build finished. No test results found.
Details
Signed-off-by all commits in this PR are signed 1 tests run, 0 skipped, 0 failed.
Details

@xiexingguo xiexingguo deleted the xiexingguo:xxg-wip-14804 branch Mar 2, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.