New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rgw: add check for index entry's existing when adding bucket stats during bucket reshard. #29062
Conversation
|
@zhangsw I think it would be helpful for the commit msg/description to provide a bit more of an explanation, if possible? |
|
@mattbenjamin I've updated the description |
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
rgw.none statistics have been a common source of confusion amongst users, it seems. So I appreciate this PR to try to address the issue. So recently I looked to see what causes an rgw.none stat to appear. Here's what I came to understand:
So we have two options. One is not to even collect these stats, which your current code change does. The other is to keep collecting them but not reporting them. So I thought I'd throw this question out to @zhangsw, @mattbenjamin, @cbodley, and @theanalyst to see what you all think. Thanks everyone! |
|
@vumrao I would like to invite you to comment on this issue as well. See my previous comment. |
|
FWIW after thinking about it, I lean towards the solution in this PR -- not collecting the stats. |
Thanks, @ivancich. I really like the idea of not reporting it as you mentioned and we have seen downstream, it causes a lot of confusion. I think if we won't report it then it will take care of keeping the object count reporting consistent? Like we saw in downstream when
|
|
@vumrao So you're speculating that the rgw.none is hiding some real objects. And it is interesting that 24,999,855 + 148 = 25000003, just over the 25,000,000 expected. But given my earlier analysis, a bucket-index entry that counts as rgw.none happens when an operation is cancelled. So I think the more likely explanation is that 145 objects did not actually get fully uploaded. This PR cannot address that issue. Can you verify that the bucket actually has 25,000,000 GETable objects? |
|
We just had someone report on ceph-users that they have an rgw.none num_objects stat that appears to be 2^64 - 13, and they believe it triggered resharding to 65521 shards. I'm not certain, though, how they got -13 via a reshard operation. I'm guessing this happened due to bucket index manipulations that don't account for all ops. So this is a buggy area and I'm thinking we should removal all rgw.none calculations, during resharding and during bucket-index updates. |
Thanks, @ivancich. Sorry setup was removed but it looks like you are right these objects never made to the cluster. I will discuss with team and see if they will see this issue again in the small count will ask them to check the count from the application side(like s3cmd or swift) and see if we have correct count. If application is reporting the same what is listed in |
|
@ivancich thanks for the detailed analysis, I also agree that not accounting the stats in the none category (as proposed here) makes sense |
i think this flag is what reshard is failing to consider. rgw_bucket_complete_op() in cls_rgw.cc is only accounting for entries with exists=true, so reshard should ignore entries with exists=false to remain consistent and generate the same stats |
|
@ivancich There are some things to consider before removing this stat. In my experience, this is a frequent occurrence in our environment and can lead to serious performance issues when listing objects, so this information may be necessary to proactively handle the situation. |
|
@zhangsw Given @cbodley's insight and @IlsooByun's good argument for keeping the functionality, would you like to update this PR or submit a new one that keeps rgw.none but checks the exists flag? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comment, but I think it'd be better to go in the direction @cbodley is suggesting. Please let us know your intentions -- whether to modify this PR or create a new one. Thanks!
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
…ring resharding. Signed-off-by: zhang Shaowen <zhangshaowen@cmss.chinamobile.com>
3d4dc57
to
ce42734
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
|
@ivancich what's the way forward for this change? |
I think QA and then very likely merge, @mattbenjamin. |
If we reshard versioning bucket, bucket stats will contain extra information
rgw.none.Before reshard:
After reshard:
Versioning bucket has idx like below and it's counted in stats while resharding bucket.
Fixes: https://tracker.ceph.com/issues/45970