New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd/PGLog: skip ERROR entires in _merge_object_divergent_entries #16675

Merged
merged 1 commit into from Aug 4, 2017

Conversation

Projects
None yet
6 participants
@Jeegn-Chen
Contributor

Jeegn-Chen commented Jul 29, 2017

During consistency check, do not take the version of ERROR entries
as the valid prior version.

Fixes: http://tracker.ceph.com/issues/20843
Signed-off-by: Jeegn Chen jeegnchen@gmail.com

@Jeegn-Chen

This comment has been minimized.

Contributor

Jeegn-Chen commented Jul 29, 2017

@liewegas I think this change might be an enhancement of your commit 7879efd. Could you take a look?

@liewegas

This comment has been minimized.

Member

liewegas commented Aug 1, 2017

Thanks for tracking this down! Can you add a unit test case to TestPGLog.cc illustrate the problem and the fix? Moving the 'last =' line looks right, but I'm not sure I have the right scenario in mind.

@Jeegn-Chen

This comment has been minimized.

Contributor

Jeegn-Chen commented Aug 1, 2017

Thanks for the review, Sage.
I will add the unit test.

@Jeegn-Chen

This comment has been minimized.

Contributor

Jeegn-Chen commented Aug 1, 2017

PS:
Here is the entry sequence I've got with ceph-objectstore-tool. The issue happens on the MODIFY entry 8336'960, which had 2 ERROR entries 8336‘958 and 8336'959 appeared between itself and its prior 8336'957.
In original logic, "last" would point to 959 while 960 expected 957 as its prior_version and thus core dump came out.
This rare sequence came out after an accident network outage, which made OSDs suicide for osd_max_markdown_count.

            {
                "op": "modify  ",
                "object": "5:93e5b521:::notify.7:head",
                "version": "8336'957",
                "prior_version": "8321'952",
                "reqid": "unknown.0.0:0",
                "extra_reqids": [],
                "mtime": "2017-07-26 16:55:30.056210",
                "return_code": 0,
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": false,
                        "rollback_info_completed": false,
                        "ops": []
                    }
                }
            },
            {
                "op": "error   ",
                "object": "5:93e5b521:::notify.7:head",
                "version": "8336'958",
                "prior_version": "0'0",
                "reqid": "client.904458.0:2501388",
                "extra_reqids": [],
                "mtime": "0.000000",
                "return_code": -107,
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": true,
                        "rollback_info_completed": false,
                        "ops": []
                    }
                }
            },
            {
                "op": "error   ",
                "object": "5:93e5b521:::notify.7:head",
                "version": "8336'959",
                "prior_version": "0'0",
                "reqid": "client.904458.0:2501403",
                "extra_reqids": [],
                "mtime": "0.000000",
                "return_code": -107,
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": true,
                        "rollback_info_completed": false,
                        "ops": []
                    }
                }
            },
            {
                "op": "modify  ",
                "object": "5:93e5b521:::notify.7:head",
                "version": "8336'960",<<<<<<<Core Dump on 960
                "prior_version": "8336'957", <<< prior_version is 957, while the previous one is 569 ERROR
                "reqid": "unknown.0.0:0",
                "extra_reqids": [],
                "mtime": "2017-07-26 16:55:32.585029",
                "return_code": 0,
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": false,
                        "rollback_info_completed": false,
                        "ops": []
                    }
                }
            }
@Jeegn-Chen

This comment has been minimized.

Contributor

Jeegn-Chen commented Aug 2, 2017

@liewegas The unit test has been added.

@jdurgin jdurgin self-assigned this Aug 2, 2017

@Jeegn-Chen

This comment has been minimized.

Contributor

Jeegn-Chen commented Aug 3, 2017

The previous failure in make check seems unrelated. Trigger the check again.

osd/PGLog: skip ERROR entires in _merge_object_divergent_entries
During consistency check, do not take the version of ERROR entries
as the valid prior version of the following non-error entry.

Fixes: http://tracker.ceph.com/issues/20843
Signed-off-by: Jeegn Chen <jeegnchen@gmail.com>
@Liuchang0812

This comment has been minimized.

Contributor

Liuchang0812 commented Aug 4, 2017

jenkins test this please

@liewegas liewegas requested a review from jdurgin Aug 4, 2017

@jdurgin

jdurgin approved these changes Aug 4, 2017

nice work!

@jdurgin jdurgin merged commit a09e29e into ceph:master Aug 4, 2017

4 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

zealoussnow added a commit to zealoussnow/ceph that referenced this pull request Oct 23, 2017

OSD: fix bug 20843, see ceph#16675
Signed-off-by: Leo Zhang <nguzcf@gmail.com>

zealoussnow added a commit to zealoussnow/ceph that referenced this pull request Oct 24, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment