
suites: qa tasks with crush rules #53308

Merged
yuriw merged 4 commits into ceph:main from NitzanMordhai:wip-nitzan-qa-tasks-with-crush-rules on Mar 20, 2024

Conversation

@NitzanMordhai (Contributor) commented Sep 6, 2023:

To handle EC profiles with crush rules of 2+2 and 8+6 on a small number of hosts:

  • Add tasks that create EC pools with customized crush rules (a minimal sketch of the targeted layouts follows this list).
  • Add 2 new suites that exercise the 8+6 and 2+2 EC crush rules on a 4-node cluster.
  • Add an option to the thrasher to also thrash hosts, i.e. mark all OSDs on one host as out, simulating a real host maintenance process.
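For orientation, here is a minimal sketch of the EC layouts these suites target, assuming plain ceph CLI calls rather than this PR's actual teuthology yaml plumbing; the profile names are illustrative, and the customized crush rules themselves are discussed further down in the review:

    import subprocess

    # Illustrative k+m layouts from the description above; the profile names and
    # the direct CLI invocation are assumptions, not this PR's actual task code.
    PROFILES = {
        "ec-2-2": ["k=2", "m=2", "crush-failure-domain=host"],
        "ec-8-6": ["k=8", "m=6", "crush-failure-domain=osd"],
    }

    def create_profiles():
        for name, opts in PROFILES.items():
            # "ceph osd erasure-code-profile set" defines an EC profile; the
            # customized crush rules added by these tasks replace the default
            # rule such a profile would otherwise generate.
            subprocess.check_call(
                ["ceph", "osd", "erasure-code-profile", "set", name, *opts])

    if __name__ == "__main__":
        create_profiles()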


'set_choose_tries 100',
'take default class hdd',
'choose indep 4 type host',
'chooseleaf indep 8 type osd',
Contributor (commenting on the rule steps above):

Use choose rather than chooseleaf here due to https://tracker.ceph.com/issues/62213

Contributor:

I'm not sure why this one needs 4 hosts, 8 osds per host. 2+2 should probably be a normal host chooseleaf rule, right?

Contributor Author:

Isn't it supposed to be 4 hosts, 1 OSD each, so it will simulate a full host shutdown?

Contributor:

Right, so step chooseleaf indep 4 type host

Contributor @athanatos commented Sep 7, 2023:

Ah,

        'choose indep 4 type host',
        'choose indep 1 type osd',

is almost correct, but you should use step chooseleaf indep 4 type host instead. Using two choose steps instead of a single chooseleaf step actually has different behavior once OSDs get marked out.
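For reference, in the same quoted-step format as the excerpt above, the single step suggested here in place of the two choose steps would read:

        'chooseleaf indep 4 type host',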

Contributor Author:

Done

qa/tasks/util/rados.py: review thread marked outdated and resolved.
@NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch from c90f6e5 to f78ba83 on September 7, 2023 05:52
@NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch 9 times, most recently from 364859e to 31ba9f3, on September 11, 2023 11:52
@NitzanMordhai (Contributor Author):

@athanatos @neha-ojha I added some code to thrash hosts, so the thrasher will be able to thrash all the OSDs under one host.
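A minimal, self-contained sketch of the host-thrash idea, assuming only a host-to-OSD mapping and caller-supplied callables for marking OSDs out and in; the actual Thrasher integration in this PR is different:

    import random
    import time
    from typing import Callable, Dict, List

    def thrash_one_host(
        osds_by_host: Dict[str, List[int]],
        mark_out: Callable[[int], None],
        mark_in: Callable[[int], None],
        downtime: float = 30.0,
    ) -> str:
        """Pick a random host, mark every OSD on it out to simulate host
        maintenance, then bring those OSDs back in after `downtime` seconds."""
        host = random.choice(list(osds_by_host))
        for osd in osds_by_host[host]:
            mark_out(osd)      # e.g. a wrapper around "ceph osd out <id>"
        time.sleep(downtime)   # let recovery react to the whole host being out
        for osd in osds_by_host[host]:
            mark_in(osd)       # e.g. a wrapper around "ceph osd in <id>"
        return host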

@NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch 2 times, most recently from 62ac56d to 65c864f, on September 12, 2023 06:51
@ljflores (Contributor) commented Oct 2, 2023:

Hey @NitzanMordhai, QA caught this failure:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 906, in gevent._gevent_cgreenlet.Greenlet.run
  File "/home/teuthworker/src/github.com_ceph_ceph-c_1121b624dd34d5cca4b579b47cfabf66fe39eae8/qa/tasks/rados.py", line 260, in thread
    erasure_code_crush_rule_name=crush_name,
UnboundLocalError: local variable 'crush_name' referenced before assignment
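For context, this UnboundLocalError means crush_name is only assigned on some code paths before being used. A minimal sketch of the usual shape of the fix, with hypothetical names and config key rather than the actual qa/tasks/rados.py code:

    from typing import Optional

    def pick_crush_rule_name(config: dict) -> Optional[str]:
        # Bind the name up front so it exists even when the suite yaml does not
        # configure a custom crush rule (the config key here is an assumption).
        crush_name = None
        if config.get("erasure_code_crush_rule_name"):
            crush_name = config["erasure_code_crush_rule_name"]
        return crush_name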

You can see more examples on this link:
https://pulpito.ceph.com/yuriw-2023-09-27_20:55:59-rados-wip-yuri5-testing-2023-09-27-0959-distro-default-smithi/

You can re-add the "needs-qa" label when it's ready for a retest!

@ljflores (Contributor):
jenkins test make check

@ljflores (Contributor):

@NitzanMordhai there are a few suspect jobs in the latest teuthology run as analyzed by @ronen-fr:

  1. https://pulpito.ceph.com/yuriw-2023-11-26_21:30:23-rados-wip-yuri7-testing-2023-11-17-0819-distro-default-smithi/7467376/
    Description: rados/thrash-erasure-code/{ceph clusters/{fixed-4 openstack} fast/fast mon_election/classic msgr-failures/osd-delay objectstore/bluestore-comp-zstd rados recovery-overrides/{more-async-recovery} supported-random-distro$/{centos_latest} thrashers/minsize_recovery thrashosds-health workloads/ec-rados-plugin=clay-k=4-m=2}
2023-11-26T21:59:03.401 INFO:tasks.ceph.osd.6.smithi120.stderr:2023-11-26T21:59:03.399+0000 7fadb4fb9640 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 6  (PID: 85946) UID: 0
2023-11-26T21:59:03.417 INFO:tasks.rados.rados.0.smithi183.stdout:1517:  writing smithi18389814-19 from 4489216 to 5046272 tid 1
2023-11-26T21:59:03.419 INFO:tasks.rados.rados.0.smithi183.stdout:1517:  writing smithi18389814-19 from 5046272 to 5537792 tid 2
2023-11-26T21:59:03.420 INFO:tasks.rados.rados.0.smithi183.stdout:1517:  writing smithi18389814-19 from 5537792 to 5668864 tid 3
2023-11-26T21:59:03.420 INFO:tasks.rados.rados.0.smithi183.stdout:1518: snap_create
2023-11-26T21:59:03.421 INFO:tasks.rados.rados.0.smithi183.stdout:update_object_version oid 5 v 159 (ObjNum 49 snap 0 seq_num 1953066355) dirty exists
2023-11-26T21:59:03.421 INFO:tasks.rados.rados.0.smithi183.stdout:1510:  expect (ObjNum 459 snap 123 seq_num 459)
2023-11-26T21:59:03.425 INFO:teuthology.orchestra.run.smithi026.stdout:ERROR: (22) Invalid argument
2023-11-26T21:59:03.425 INFO:teuthology.orchestra.run.smithi026.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2023-11-26T21:59:03.435 DEBUG:teuthology.orchestra.run.smithi026:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops
2023-11-26T21:59:03.496 INFO:teuthology.orchestra.run.smithi026.stdout:osd.0: {}
2023-11-26T21:59:03.496 INFO:teuthology.orchestra.run.smithi026.stderr:osd.0: osd_enable_op_tracker = ''
2023-11-26T21:59:03.501 ERROR:teuthology.orchestra.daemon.state:Failed to send signal 1: None
Traceback (most recent call last):
  2. https://pulpito.ceph.com/yuriw-2023-11-20_15:34:30-rados-wip-yuri7-testing-2023-11-17-0819-distro-default-smithi/
    I thought this one was an infra failure, but it looks more like a problem with how the yaml file is structured.
2023-11-21T11:02:13.323 INFO:teuthology.task.internal:Checking for old test directory...
2023-11-21T11:02:13.324 DEBUG:teuthology.orchestra.run.smithi046:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.326 DEBUG:teuthology.orchestra.run.smithi110:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.329 DEBUG:teuthology.orchestra.run.smithi132:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.331 DEBUG:teuthology.orchestra.run.smithi138:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.334 DEBUG:teuthology.orchestra.run.smithi142:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.336 DEBUG:teuthology.orchestra.run.smithi161:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.339 DEBUG:teuthology.orchestra.run.smithi179:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.341 DEBUG:teuthology.orchestra.run.smithi190:> test '!' -e /home/ubuntu/cephtest
2023-11-21T11:02:13.346 INFO:teuthology.run_tasks:Running task internal.check_ceph_data...
2023-11-21T11:02:13.352 INFO:teuthology.task.internal:Checking for non-empty /var/lib/ceph...
2023-11-21T11:02:13.353 DEBUG:teuthology.orchestra.run.smithi046:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.375 DEBUG:teuthology.orchestra.run.smithi110:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.377 DEBUG:teuthology.orchestra.run.smithi132:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.379 DEBUG:teuthology.orchestra.run.smithi138:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.381 DEBUG:teuthology.orchestra.run.smithi142:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.384 DEBUG:teuthology.orchestra.run.smithi161:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.388 DEBUG:teuthology.orchestra.run.smithi179:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.390 DEBUG:teuthology.orchestra.run.smithi190:> test -z $(ls -A /var/lib/ceph)
2023-11-21T11:02:13.400 INFO:teuthology.run_tasks:Running task internal.vm_setup...
2023-11-21T11:02:13.587 INFO:teuthology.run_tasks:Running task kernel...
2023-11-21T11:02:13.601 INFO:teuthology.task.kernel:normalize config orig: {'kdb': True, 'sha1': 'distro'}
2023-11-21T11:02:13.601 INFO:teuthology.task.kernel:config {'mon.a': {'kdb': True, 'sha1': 'distro'}, 'mon.b': {'kdb': True, 'sha1': 'distro'}, 'mon.c': {'kdb': True, 'sha1': 'distro'}, 'mgr.x': {'kdb': True, 'sha1': 'distro'}}, timeout 300
2023-11-21T11:02:13.601 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/run_tasks.py", line 109, in run_tasks
    manager.__enter__()
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/task/kernel.py", line 1237, in task
    p.spawn(process_role, ctx, config, timeout, role, role_config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_2442239d2653456406c25ae0c71b689c8f2657b6/teuthology/task/kernel.py", line 1250, in process_role
    (role_remote,) = ctx.cluster.only(role).remotes.keys()
ValueError: too many values to unpack (expected 1)
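For context, the single-element unpacking in the kernel task assumes each role maps to exactly one remote; with hypothetical data, the failure mode looks like this (a sketch, not teuthology code):

    # If the yaml/cluster layout makes a role match more than one remote,
    # single-element tuple unpacking raises exactly this ValueError.
    remotes_for_role = {"smithi046": 1, "smithi110": 2}  # hypothetical: two matches

    try:
        (role_remote,) = remotes_for_role.keys()
    except ValueError as err:
        print(f"unpacking failed: {err}")  # too many values to unpack (expected 1)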

And two more with slightly different Tracebacks but similar problems:

  1. https://pulpito.ceph.com/yuriw-2023-11-20_15:34:30-rados-wip-yuri7-testing-2023-11-17-0819-distro-default-smithi/7463289/
  2. https://pulpito.ceph.com/yuriw-2023-11-20_15:34:30-rados-wip-yuri7-testing-2023-11-17-0819-distro-default-smithi/7463384/

These failures don't happen deterministically, so it might be worth running these "workloads/ec-rados-plugin=xxx" tests multiple times to ensure they're passing.

@NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch 3 times, most recently from 4f274e1 to d536f03, on February 5, 2024 13:57
@NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch 3 times, most recently from 646c2cd to 2262a0e, on February 15, 2024 07:51
@NitzanMordhai (Contributor Author):

I made some more changes to the thrasher; we had some issues with 4-host thrashing.


This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Adding a new yaml entry to handle creation of a crush profile before
creating a new pool; it will be skipped if no crush profile name
was set.

Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
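A minimal sketch of the "skipped if no crush profile name was set" behavior described in the yaml-entry commit above, using a hypothetical helper and a caller-supplied creation function rather than the actual task code:

    from typing import Callable, List, Optional

    def maybe_create_crush_rule(
        create_rule: Callable[[str, List[str]], None],
        crush_rule_name: Optional[str],
        rule_steps: List[str],
    ) -> Optional[str]:
        """Create the customized crush rule only when the suite yaml names one;
        otherwise return None and let pool creation proceed with defaults."""
        if not crush_rule_name:
            return None  # no crush profile configured for this suite; skip
        create_rule(crush_rule_name, rule_steps)
        return crush_rule_name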
Remove extra unneeded checks and settings for filestore in the suites
setup.

Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
To simulate real-world maintenance, we usually shut down hosts and not just OSDs.
This commit adds a host-thrasher option to the Thrasher: when thrash_hosts
is True, we won't thrash OSDs one by one; instead we choose an entire host
and thrash all the OSDs under that host.

Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-qa-tasks-with-crush-rules branch from 2262a0e to 190a761 Compare February 28, 2024 07:26
@yuriw merged commit 98a7421 into ceph:main on Mar 20, 2024
10 of 11 checks passed
@ljflores (Contributor):

Looks like this one was merged a little early before a second round of QA could be reviewed.

@NitzanMordhai can you take a look at https://tracker.ceph.com/issues/65517?
