
suites: qa tasks with crush rules #53308

Merged
yuriw merged 4 commits into ceph:main from NitzanMordhai:wip-nitzan-qa-tasks-with-crush-rules on Mar 20, 2024

Conversation

@NitzanMordhai (Contributor) commented Sep 6, 2023:

To handle EC profiles with crush rules of 2+2 and 8+6 on a small number of hosts:

  • Add tasks that create EC pools with customized crush rules (a minimal sketch of the targeted layouts follows this list).
  • Add 2 new suites that exercise the 8+6 and 2+2 EC crush rules on a 4-node cluster.
  • Add an option to the thrasher to also thrash hosts, i.e. mark all OSDs on one host as out, simulating a real host maintenance process.
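For orientation, here is a minimal sketch of the EC layouts these suites target, assuming plain ceph CLI calls rather than this PR's actual teuthology yaml plumbing; the profile names are illustrative, and the customized crush rules themselves are discussed further down in the review:

    import subprocess

    # Illustrative k+m layouts from the description above; the profile names and
    # the direct CLI invocation are assumptions, not this PR's actual task code.
    PROFILES = {
        "ec-2-2": ["k=2", "m=2", "crush-failure-domain=host"],
        "ec-8-6": ["k=8", "m=6", "crush-failure-domain=osd"],
    }

    def create_profiles():
        for name, opts in PROFILES.items():
            # "ceph osd erasure-code-profile set" defines an EC profile; the
            # customized crush rules added by these tasks replace the default
            # rule such a profile would otherwise generate.
            subprocess.check_call(
                ["ceph", "osd", "erasure-code-profile", "set", name, *opts])

    if __name__ == "__main__":
        create_profiles()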


'set_choose_tries 100',
'take default class hdd',
'choose indep 4 type host',
'chooseleaf indep 8 type osd',
Contributor (commenting on the rule steps above):

Use choose rather than chooseleaf here due to https://tracker.ceph.com/issues/62213

Contributor:

I'm not sure why this one needs 4 hosts, 8 osds per host. 2+2 should probably be a normal host chooseleaf rule, right?

Contributor Author:

Isn't it supposed to be 4 hosts, 1 OSD each, so it will simulate a full host shutdown?

Contributor:

Right, so step chooseleaf indep 4 type host

Contributor @athanatos commented Sep 7, 2023:

Ah,

        'choose indep 4 type host',
        'choose indep 1 type osd',

is almost correct, but you should use step chooseleaf indep 4 type host instead. Using two choose steps instead of a single chooseleaf step actually has different behavior once OSDs get marked out.
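For reference, in the same quoted-step format as the excerpt above, the single step suggested here in place of the two choose steps would read:

        'chooseleaf indep 4 type host',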

Contributor Author:

Done

qa/tasks/util/rados.py: review thread marked outdated and resolved.
@NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch from c90f6e5 to f78ba83 on September 7, 2023 05:52
@NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch 9 times, most recently from 364859e to 31ba9f3, on September 11, 2023 11:52
@NitzanMordhai (Contributor Author):

@athanatos @neha-ojha I added some code to thrash hosts, so the thrasher will be able to thrash all the OSDs under one host.
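A minimal, self-contained sketch of the host-thrash idea, assuming only a host-to-OSD mapping and caller-supplied callables for marking OSDs out and in; the actual Thrasher integration in this PR is different:

    import random
    import time
    from typing import Callable, Dict, List

    def thrash_one_host(
        osds_by_host: Dict[str, List[int]],
        mark_out: Callable[[int], None],
        mark_in: Callable[[int], None],
        downtime: float = 30.0,
    ) -> str:
        """Pick a random host, mark every OSD on it out to simulate host
        maintenance, then bring those OSDs back in after `downtime` seconds."""
        host = random.choice(list(osds_by_host))
        for osd in osds_by_host[host]:
            mark_out(osd)      # e.g. a wrapper around "ceph osd out <id>"
        time.sleep(downtime)   # let recovery react to the whole host being out
        for osd in osds_by_host[host]:
            mark_in(osd)       # e.g. a wrapper around "ceph osd in <id>"
        return host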

@NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch 2 times, most recently from 62ac56d to 65c864f, on September 12, 2023 06:51
@ljflores (Contributor) commented Oct 2, 2023:

Hey @NitzanMordhai, QA caught this failure:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 906, in gevent._gevent_cgreenlet.Greenlet.run
  File "/home/teuthworker/src/github.com_ceph_ceph-c_1121b624dd34d5cca4b579b47cfabf66fe39eae8/qa/tasks/rados.py", line 260, in thread
    erasure_code_crush_rule_name=crush_name,
UnboundLocalError: local variable 'crush_name' referenced before assignment
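For context, this UnboundLocalError means crush_name is only assigned on some code paths before being used. A minimal sketch of the usual shape of the fix, with hypothetical names and config key rather than the actual qa/tasks/rados.py code:

    from typing import Optional

    def pick_crush_rule_name(config: dict) -> Optional[str]:
        # Bind the name up front so it exists even when the suite yaml does not
        # configure a custom crush rule (the config key here is an assumption).
        crush_name = None
        if config.get("erasure_code_crush_rule_name"):
            crush_name = config["erasure_code_crush_rule_name"]
        return crush_name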

You can see more examples on this link:
https://pulpito.ceph.com/yuriw-2023-09-27_20:55:59-rados-wip-yuri5-testing-2023-09-27-0959-distro-default-smithi/

You can re-add the "needs-qa" label when it's ready for a retest!

@ljflores (Contributor):
jenkins test make check

@ljflores (Contributor):

@NitzanMordhai there are a few suspect jobs in the latest teuthology run as analyzed by @ronen-fr:

  1. https://pulpito.ceph.com/yuriw-2023-11-26_21:30:23-rados-wip-yuri7-testing-2023-11-17-0819-distro-default-smithi/7467376/
    Description: rados/thrash-erasure-code/{ceph clusters/{fixed-4 openstack} fast/fast mon_election/classic msgr-failures/osd-delay objectstore/bluestore-comp-zstd rados recovery-overrides/{more-async-recovery} supported-random-distro$/{centos_latest} thrashers/minsize_recovery thrashosds-health workloads/ec-rados-plugin=clay-k=4-m=2}
2023-11-26T21:59:03.401 INFO:tasks.ceph.osd.6.smithi120.stderr:2023-11-26T21:59:03.399+0000 7fadb4fb9640 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 6  (PID: 85946) UID: 0
2023-11-26T21:59:03.417 INFO:tasks.rados.rados.0.smithi183.stdout:1517:  writing smithi18389814-19 from 4489216 to 5046272 tid 1
2023-11-26T21:59:03.419 INFO:tasks.rados.rados.0.smithi183.stdout:1517:  writing smithi18389814-19 from 5046272 to 5537792 tid 2
2023-11-26T21:59:03.420 INFO:tasks.rados.rados.0.smithi183.stdout:1517:  writing smithi18389814-19 from 5537792 to 5668864 tid 3
2023-11-26T21:59:03.420 INFO:tasks.rados.rados.0.smithi183.stdout:1518: snap_create
2023-11-26T21:59:03.421 INFO:tasks.rados.rados.0.smithi183.stdout:update_object_version oid 5 v 159 (ObjNum 49 snap 0 seq_num 1953066355) dirty exists
2023-11-26T21:59:03.421 INFO:tasks.rados.rados.0.smithi183.stdout:1510:  expect (ObjNum 459 snap 123 seq_num 459)
2023-11-26T21:59:03.425 INFO:teuthology.orchestra.run.smithi026.stdout:ERROR: (22) Invalid argument
2023-11-26T21:59:03.425 INFO:teuthology.orchestra.run.smithi026.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2023-11-26T21:59:03.435 DEBUG:teuthology.orchestra.run.smithi026:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops
2023-11-26T21:59:03.496 INFO:teuthology.orchestra.run.smithi026.stdout:osd.0: {}
2023-11-26T21:59:03.496 INFO:teuthology.orchestra.run.smithi026.stderr:osd.0: osd_enable_op_tracker = ''
2023-11-26T21:59:03.501 ERROR:teuthology.orchestra.daemon.state:Failed to send signal 1: None
Traceback (most recent call last):
  2. https://pulpito.ceph.com/yuriw-2023-11-20_15:34:30-rados-wip-yuri7-testing-2023-11-17-0819-distro-default-smithi/
    I thought this one was an infra failure, but it looks more like a problem with how the yaml file is structured.
2023-11-21T11:02:13.323 INFO:teuthology.task.internal:Checking for old test directory...
2023-11-21T11:02:13.324 DEBUG:teuthology.orchestra.run.smithi046:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.326 DEBUG:teuthology.orchestra.run.smithi110:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.329 DEBUG:teuthology.orchestra.run.smithi132:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.331 DEBUG:teuthology.orchestra.run.smithi138:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.334 DEBUG:teuthology.orchestra.run.smithi142:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.336 DEBUG:teuthology.orchestra.run.smithi161:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.339 DEBUG:teuthology.orchestra.run.smithi179:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.341 DEBUG:teuthology.orchestra.run.smithi190:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.346 INFO:teuthology.run_tasks:Running task internal.check_ceph_data...
2023-11-21T11:02:13.352 INFO:teuthology.task.internal:Checking for non-empty /var/lib/ceph...
2023-11-21T11:02:13.353 DEBUG:teuthology.orchestra.run.smithi046:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.375 DEBUG:teuthology.orchestra.run.smithi110:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.377 DEBUG:teuthology.orchestra.run.smithi132:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.379 DEBUG:teuthology.orchestra.run.smithi138:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.381 DEBUG:teuthology.orchestra.run.smithi142:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.384 DEBUG:teuthology.orchestra.run.smithi161:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.388 DEBUG:teuthology.orchestra.run.smithi179:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.390 DEBUG:teuthology.orchestra.run.smithi190:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.400 INFO:teuthology.run_tasks:Running task internal.vm_setup...
2023-11-21T11:02:13.587 INFO:teuthology.run_tasks:Running task kernel...
2023-11-21T11:02:13.601 INFO:teuthology.task.kernel:normalize config orig: {'kdb': True, 'sha1': 'distro'}
2023-11-21T11:02:13.601 INFO:teuthology.task.kernel:config {'mon.a': {'kdb': True, 'sha1': 'distro'}, 'mon.b': {'kdb': True, 'sha1': 'distro'}, 'mon.c': {'kdb': True, 'sha1': 'distro'}, 'mgr.x': {'kdb': True, 'sha1': 'distro'}}, timeout 300
2023-11-21T11:02:13.601 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/run_tasks.py", line 109, in run_tasks
    manager.__enter__()
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/task/kernel.py", line 1237, in task
    p.spawn(process_role, ctx, config, timeout, role, role_config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/task/kernel.py", line 1250, in process_role
    (role_remote,) = ctx.cluster.only(role).remotes.keys()
ValueError: too many values to unpack (expected 1)
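For context, the single-element unpacking in the kernel task assumes each role maps to exactly one remote; with hypothetical data, the failure mode looks like this (a sketch, not teuthology code):

    # If the yaml/cluster layout makes a role match more than one remote,
    # single-element tuple unpacking raises exactly this ValueError.
    remotes_for_role = {"smithi046": 1, "smithi110": 2}  # hypothetical: two matches

    try:
        (role_remote,) = remotes_for_role.keys()
    except ValueError as err:
        print(f"unpacking failed: {err}")  # too many values to unpack (expected 1)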

And two more with slightly different Tracebacks but similar problems:

  1. https://pulpito.ceph.com/yuriw-2023-11-20_15:34:30-rados-wip-yuri7-testing-2023-11-17-0819-distro-default-smithi/7463289/
  2. https://pulpito.ceph.com/yuriw-2023-11-20_15:34:30-rados-wip-yuri7-testing-2023-11-17-0819-distro-default-smithi/7463384/

These failures don't happen deterministically, so it might be worth running these "workloads/ec-rados-plugin=xxx" tests multiple times to ensure they're passing.

@NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch 3 times, most recently from 4f274e1 to d536f03, on February 5, 2024 13:57
@NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch 3 times, most recently from 646c2cd to 2262a0e, on February 15, 2024 07:51
@NitzanMordhai (Contributor Author):

I made some more changes to the thrasher; we had some issues with 4-host thrashing.


This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Adding a new yaml entry to handle creation of a crush profile before
creating a new pool; it will be skipped if no crush profile name
was set.

Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
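A minimal sketch of the "skipped if no crush profile name was set" behavior described in the yaml-entry commit above, using a hypothetical helper and a caller-supplied creation function rather than the actual task code:

    from typing import Callable, List, Optional

    def maybe_create_crush_rule(
        create_rule: Callable[[str, List[str]], None],
        crush_rule_name: Optional[str],
        rule_steps: List[str],
    ) -> Optional[str]:
        """Create the customized crush rule only when the suite yaml names one;
        otherwise return None and let pool creation proceed with defaults."""
        if not crush_rule_name:
            return None  # no crush profile configured for this suite; skip
        create_rule(crush_rule_name, rule_steps)
        return crush_rule_name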
Remove extra unneeded checks and settings for filestore in the suites
setup.

Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
To simulate real-world maintenance, we usually shut down hosts and not just OSDs.
This commit adds a host-thrasher option to the Thrasher: when thrash_hosts
is True, we won't thrash OSDs one by one; instead we choose an entire host
and thrash all the OSDs under that host.

Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch from 2262a0e to 190a761 Compare February 28, 2024 07:26
@yuriw merged commit 98a7421 into ceph:main on Mar 20, 2024
10 of 11 checks passed
@ljflores (Contributor):

Looks like this one was merged a little early before a second round of QA could be reviewed.

@NitzanMordhai can you take a look at https://tracker.ceph.com/issues/65517?
