
jewel: osd: rados ls on pool with no access returns no error #16473

Merged
merged 5 commits into ceph:jewel from smithfarm:wip-20723-jewel on Sep 18, 2017

Conversation

@smithfarm
Contributor

smithfarm commented Jul 21, 2017

@smithfarm smithfarm self-assigned this Jul 21, 2017

@smithfarm smithfarm added this to the jewel milestone Jul 21, 2017

Contributor

smithfarm commented Jul 21, 2017

@badone Does this backport look sane?

@smithfarm smithfarm changed the title from jewel: rados ls on pool with no access returns no error to jewel: osd: rados ls on pool with no access returns no error Jul 21, 2017

@badone

badone approved these changes Aug 4, 2017

@smithfarm sorry for the delay, looks good to me.

@smithfarm smithfarm added the needs-qa label Aug 4, 2017

osd: Reverse order of op_has_sufficient_caps and do_pg_op
Fixes: http://tracker.ceph.com/issues/19790

Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
(cherry picked from commit a921882)

Conflicts:
        qa/suites/rados/singleton-nomsgr/all/pool-access.yaml - drop mgr.x
            role, which is not needed in jewel and might even cause the test to
            fail
        src/osd/PrimaryLogPG.cc - this file doesn't exist in jewel, so apply
            the change manually to ReplicatedPG.cc (i.e., by moving the
            op->includes_pg_op() conditional below the
            !op_has_sufficient_caps(op) conditional)

Contributor

smithfarm commented Sep 7, 2017

Because I overlooked the mgr.x role, it stayed in, and that seems to have caused the new test to fail: http://pulpito.front.sepia.ceph.com/smithfarm-2017-09-07_07:08:05-rados-wip-jewel-backports-distro-basic-smithi/1604961/

Dropped the mgr.x and repushed.

Contributor

smithfarm commented Sep 8, 2017

@badone @tchaikov Can you look at this one? It seems to be causing this reproducible test failure in the rados suite: http://qa-proxy.ceph.com/teuthology/smithfarm-2017-09-07_18:58:43-rados-wip-jewel-backports-distro-basic-smithi/1607448/teuthology.log

The "AttributeError: managers" initially had me confused, but now I see it has nothing to do with the ceph-mgr daemon. Somehow the ctx dictionary is not getting populated with a managers item, but I don't see how this is (not) happening.

I traced the code line in question to 96e7724 but this has been in the jewel codestream for a long time.

@tchaikov tchaikov self-requested a review Sep 9, 2017

Contributor

badone commented Sep 10, 2017

@smithfarm is it http://tracker.ceph.com/issues/20043 ? Did we somehow end up on ext4?

- [mon.a, osd.0, osd.1, client.0]
tasks:
- install:
- ceph:

@tchaikov

tchaikov Sep 11, 2017

Contributor

should be

- ceph:
    fs: xfs

otherwise we will be using whatever filesystem / happens to be on, instead of creating a new filesystem of the type specified here on the available block devices on the test node.

@badone

badone Sep 11, 2017

Contributor

Actually, it appears this has historically been worked around with

 - ceph:
     conf:
       global:
         osd max object name len: 460
         osd max object namespace len: 64

per #15207 and the other tests in qa/suites/rados/singleton-nomsgr/all/

Contributor

tchaikov commented Sep 11, 2017

I agree with @badone: the root cause is that ext4 is being used as the underlying local filesystem for the filestore. That's why the OSDs failed to boot, and once the resulting exception was raised, the statements in the finally block were executed before the ctx.managers dict was set.

So we should either fix the test so that xfs is specified, or set ctx.managers before doing anything that could throw inside the with block in task() in ceph.py.
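To make the ordering problem concrete, here is a minimal, self-contained sketch (not the real qa/tasks/ceph.py; the class and function bodies are simplified stand-ins) of why constructing the manager before healthy() keeps the real failure visible:

    # Toy illustration of the ctx.managers ordering issue; all bodies are
    # stand-ins, only the control flow mirrors the bug described above.
    class CephManager(object):
        def osd_scrub_pgs(self):
            print("scrubbing PGs during teardown")

    class Ctx(object):
        """Stand-in for the teuthology context object."""
        pass

    def healthy(ctx):
        # Stand-in for the health check: with the filestore on ext4 the OSDs
        # never booted, so this step raised before the manager was constructed.
        raise RuntimeError("OSDs failed to boot")

    def task_before_fix(ctx):
        try:
            healthy(ctx)                               # raises ...
            ctx.managers = {'ceph': CephManager()}     # ... so this never runs
        finally:
            # Teardown needs ctx.managers; it was never set, so this line dies
            # with "AttributeError: managers" and hides the original exception.
            ctx.managers['ceph'].osd_scrub_pgs()

    def task_after_fix(ctx):
        ctx.managers = {'ceph': CephManager()}         # construct manager first
        try:
            healthy(ctx)
        finally:
            # Teardown still runs, and the real failure propagates as-is.
            ctx.managers['ceph'].osd_scrub_pgs()

Calling task_before_fix(Ctx()) reproduces the misleading AttributeError, while task_after_fix(Ctx()) still scrubs during teardown and lets the underlying RuntimeError surface, which is what constructing CephManager earlier achieves.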

Contributor

tchaikov commented Sep 11, 2017

for the ceph.py fix, see #17625.

tchaikov and others added some commits Sep 11, 2017

tasks/ceph: construct CephManager earlier
Previously, if errors occurred during healthy(), then
the finally block would invoke osd_scrub_pgs, which relies
on CephManager being constructed, and it would die, hiding
the original exception.

Signed-off-by: John Spray <john.spray@redhat.com>
(cherry picked from commit c444db1)

Conflicts:
	qa/tasks/ceph.py: the `tasks` directory was moved into
		`qa` after the cherry-picked change was merged.
		so apply the change manually to the ceph.py
		under `qa` directory.
(cherry picked from commit bc71dab)
qa/suites/rados/singleton-nomsgr/pool-access: behave on ext4
We may land on an ext4 root partition.

Fixes: http://tracker.ceph.com/issues/20043
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from 657453d)

Conflicts:
    applied the changes to pool-access.yaml instead of health-warnings.yaml
    to address a specific test failure in the jewel branch

Contributor

smithfarm commented Sep 11, 2017

@tchaikov @badone Fixes applied; thanks!

liewegas and others added some commits May 24, 2017

qa/suites/rados/singleton-nomsgr: fix syntax
This parsed out as

  tasks:
  - install: null
  - ceph:
      conf:
        osd: osd max object name len = 400 osd max object namespace len = 64
  - workunit:
      clients:
        all:
        - rados/test_health_warnings.sh

which is clearly not correct.

Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from 85e2f3f)

Conflicts:
    applied changes to pool-access.yaml instead of health-warnings.yaml
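To see concretely why the original spelling in the commit message above mis-parses, the snippet below feeds both forms to a YAML parser (PyYAML is assumed). The corrected form with ':' is the obvious fix, shown here for contrast rather than copied verbatim from the commit:

    # With '=' the two settings fold into a single plain scalar under 'osd';
    # with ':' they become two separate config keys.
    import textwrap
    import yaml

    broken = textwrap.dedent("""\
        conf:
          osd:
            osd max object name len = 400
            osd max object namespace len = 64
        """)

    fixed = textwrap.dedent("""\
        conf:
          osd:
            osd max object name len: 400
            osd max object namespace len: 64
        """)

    print(yaml.safe_load(broken)['conf']['osd'])
    # -> 'osd max object name len = 400 osd max object namespace len = 64'

    print(yaml.safe_load(fixed)['conf']['osd'])
    # -> {'osd max object name len': 400, 'osd max object namespace len': 64}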
tests: use XFS explicitly in singleton-nomsgr/pool-access.yaml
Signed-off-by: Kefu Chai <kchai@redhat.com>
Signed-off-by: Nathan Cutler <ncutler@suse.com>

Contributor

smithfarm commented Sep 12, 2017

jenkins test docs

Contributor

smithfarm commented Sep 12, 2017

Test passes - will do a final rados run before merging.

Contributor

smithfarm commented Sep 18, 2017

@jdurgin @badone @tchaikov This PR was included in two rados runs: http://tracker.ceph.com/issues/20613#note-62

The only thing that worries me about these runs is rados/singleton-nomsgr/{all/11429.yaml rados.yaml} - I was not seeing that one fail before.

Contributor

smithfarm commented Sep 18, 2017

Also, @jdurgin, do we need to hold up 10.2.10 for this?

Contributor

tchaikov commented Sep 18, 2017

@smithfarm the failed job rados/singleton-nomsgr/{all/11429.yaml rados.yaml} (the one with a different failure message from last time) is not related to this change. What happened was:

  1. osd.2 sent boot to mon.a
  2. mon.a prepared an osdmap proposal p1 for it
  3. osd.1 sent boot to mon.a
  4. mon.a postponed the proposal because p1 was not accepted yet
  5. proposal p1 was accepted by mon.b and mon.c
  6. the client sent "osd deep-scrub osd.1" to mon.a
  7. mon.a returned EAGAIN, because osd.1 was still marked down at that moment

We restarted all OSDs before wait_for_clean() and then instructed osd.0 and osd.1 to perform a deep scrub, but wait_for_clean() only checks the health status of the PGs, not the OSDs, so we need to fix the test instead.
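A rough sketch of the kind of test fix being suggested here: after restarting the OSDs, poll the cluster until every OSD reports up before issuing the deep-scrub, instead of relying on wait_for_clean() alone. This is only an illustration of the idea, not the actual ceph/ceph-qa-suite#1179 patch, and the JSON field names expected from `ceph osd stat -f json` are assumptions that may differ between releases:

    # Sketch: block until all OSDs are up, so "ceph osd deep-scrub osd.N"
    # cannot come back with EAGAIN for a still-booting OSD.
    import json
    import subprocess
    import time

    def wait_until_all_osds_up(timeout=120, interval=3):
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(['ceph', 'osd', 'stat', '-f', 'json'])
            stat = json.loads(out.decode('utf-8'))
            # Depending on the release the counters may sit under 'osdmap'.
            osdmap = stat.get('osdmap', stat)
            num_osds = osdmap.get('num_osds')
            num_up = osdmap.get('num_up_osds')
            if num_osds is not None and num_osds == num_up:
                return
            time.sleep(interval)
        raise RuntimeError('timed out waiting for all OSDs to come up')

    # In the test: restart the OSDs, call wait_until_all_osds_up(), and only
    # then run "ceph osd deep-scrub osd.0" and "ceph osd deep-scrub osd.1".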

Contributor

tchaikov commented Sep 18, 2017

@smithfarm a good reason to backport this PR ceph/ceph-qa-suite#1179 😬

Contributor

smithfarm commented Sep 18, 2017

@tchaikov Thanks for the analysis. The same test failed in a different way on the previous run: http://pulpito.ceph.com/smithfarm-2017-09-12_17:11:54-rados-wip-jewel-backports-distro-basic-smithi/1623590/

It's odd. This is the first time this test has failed during 10.2.10 preparation, and it failed twice in a row.

Contributor

smithfarm commented Sep 18, 2017

> @smithfarm a good reason to backport this PR ceph/ceph-qa-suite#1179 😬

I'm ready to do that as soon as @jdurgin approves.

Contributor

tchaikov commented Sep 18, 2017

@smithfarm the test that failed in a different way can be fixed by ceph/ceph-qa-suite#1165

@jdurgin jdurgin merged commit d8aa3c5 into ceph:jewel Sep 18, 2017

4 checks passed

Docs: build check - OK - docs built
Signed-off-by - all commits in this PR are signed
Unmodified Submodules - submodules for project are unmodified
make check - make check succeeded

@smithfarm smithfarm deleted the smithfarm:wip-20723-jewel branch Sep 18, 2017
