misc: temporary fix for "No space left on device" errors #1335

Merged
merged 1 commit into from Oct 17, 2019

Conversation

@smithfarm smithfarm (Contributor) commented Oct 17, 2019

41a13ec fixed a longstanding bug that the lab
was relying on. Before the bug was fixed, the get_wwn_id_map function was doing:

    try:
        r = remote.run(
            args=[
                'ls',
                '-l',
                '/dev/disk/by-id/wwn-*',
            ],
            stdout=StringIO(),
        )
        stdout = r.stdout.getvalue()
    except Exception:
        log.info('Failed to get wwn devices! Using /dev/sd* devices...')
        return dict((d, d) for d in devs)

The bug was that "remote.run" was putting single quotes around the string
"/dev/disk/by-id/wwn-*" because it wasn't enclosed in Raw(...). The single
quotes prevented the remote shell from expanding the glob, so the command
failed, triggering the except clause, and that was happening 100% of the time.
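
For reference, here is a minimal sketch of what the corrected call looks like once the glob is marked as raw, assuming the usual teuthology.orchestra.run.Raw helper (illustrative only, not the exact diff of 41a13ec):

    from io import StringIO  # StringIO.StringIO on Python 2
    from teuthology.orchestra.run import Raw

    # Raw() tells teuthology not to shell-quote the argument, so the remote
    # shell expands the glob instead of passing the literal string to ls.
    r = remote.run(
        args=[
            'ls',
            '-l',
            Raw('/dev/disk/by-id/wwn-*'),
        ],
        stdout=StringIO(),
    )
    stdout = r.stdout.getvalue()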

The fix in 41a13ec caused the command to start
succeeding, which caused execution to continue. As a result, MON stores and
OSDs started getting created on the wrong devices, and tests that were
previously succeeding started to fail due to "No space left on device".

In short, the wwn devices on today's smithis are not big enough for
/var/lib/ceph.

This commit "fixes the fix" by dropping the dead code and always returning the
value that qa/tasks/ceph.py has come to expect.

Fixes: https://tracker.ceph.com/issues/42313
Signed-off-by: Nathan Cutler <ncutler@suse.com>
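
In effect, after this commit the function reduces to something like the following sketch (illustrative; the deprecation warning mirrors the one visible in the logs below):

    def get_wwn_id_map(remote, devs):
        # Sketch of the restored behavior: skip the wwn lookup entirely and
        # return an identity mapping, which is what qa/tasks/ceph.py expects.
        log.warning('get_wwn_id_map is buggy and deprecated, and will be '
                    'removed shortly')
        return dict((d, d) for d in devs)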

@kshtsk kshtsk (Contributor) commented Oct 17, 2019

The fix in 41a13ec caused the command to start
succeeding, which caused execution to continue. As a result, MON stores and
OSDs started getting created on the wrong devices, and tests that were
previously succeeding started to fail due to "No space left on device".

This is not a precise description. In fact, since the scratch contents are used as the devs parameter, get_wwn_id_map just returns an empty dictionary:

2019-10-17T11:34:47.802 INFO:tasks.ceph:found devs: ['/dev/vg_nvme/lv_4', '/dev/vg_nvme/lv_3', '/dev/vg_nvme/lv_2', '/dev/vg_nvme/lv_1']
2019-10-17T11:34:47.803 INFO:teuthology.orchestra.run.smithi110:Running:
2019-10-17T11:34:47.803 INFO:teuthology.orchestra.run.smithi110:> ls -l /dev/disk/by-id/wwn-*
2019-10-17T11:34:47.875 INFO:teuthology.orchestra.run.smithi110.stdout:lrwxrwxrwx 1 root root  9 Oct 17 11:28 /dev/disk/by-id/wwn-0x5000c50091dfc3c1 -> ../../sda
2019-10-17T11:34:47.875 INFO:teuthology.orchestra.run.smithi110.stdout:lrwxrwxrwx 1 root root 10 Oct 17 11:28 /dev/disk/by-id/wwn-0x5000c50091dfc3c1-part1 -> ../../sda1
2019-10-17T11:34:47.875 INFO:tasks.ceph:dev map: {}
2019-10-17T11:34:47.875 INFO:tasks.ceph:Generating config...
.
.
.
2019-10-17T11:34:49.857 INFO:tasks.ceph:Running mkfs on osd nodes...
2019-10-17T11:34:49.857 INFO:tasks.ceph:ctx.disk_config.remote_to_roles_to_dev: {Remote(name='ubuntu@smithi110.front.sepia.ceph.com'): {}}

Here are the logs if we revert this behavior with this patch:

2019-10-17T11:41:15.420 INFO:tasks.ceph:found devs: ['/dev/vg_nvme/lv_4', '/dev/vg_nvme/lv_3', '/dev/vg_nvme/lv_2', '/dev/vg_nvme/lv_1']
2019-10-17T11:41:15.420 WARNING:teuthology.misc:The get_wwn_id_map is buggy and deprecated, and will be removed shortly. The qa/tasks/ceph.py should use another method to work correctly.
2019-10-17T11:41:15.420 INFO:tasks.ceph:dev map: {'cluster2.osd.0': '/dev/vg_nvme/lv_4', 'cluster2.osd.1': '/dev/vg_nvme/lv_1', 'cluster2.osd.2': '/dev/vg_nvme/lv_2'}
2019-10-17T11:41:15.420 INFO:tasks.ceph:Generating config...
.
.
.
2019-10-17T11:41:17.393 INFO:tasks.ceph:Running mkfs on osd nodes...
2019-10-17T11:41:17.393 INFO:tasks.ceph:ctx.disk_config.remote_to_roles_to_dev: {Remote(name='ubuntu@smithi101.front.sepia.ceph.com'): {'cluster1.osd.2': '/dev/vg_nvme/lv_2', 'cluster1.osd.1': '/dev/vg_nvme/lv_1', 'cluster1.osd.0': '/dev/vg_nvme/lv_4'}, Remote(name='ubuntu@smithi081.front.sepia.ceph.com'): {'cluster2.osd.0': '/dev/vg_nvme/lv_4', 'cluster2.osd.1': '/dev/vg_nvme/lv_1', 'cluster2.osd.2': '/dev/vg_nvme/lv_2'}}
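
For illustration only, a rough and hypothetical sketch (simplified names, not the real helper) of why the old code path produces an empty map when devs are LVM logical volumes:

    def map_devs_to_wwn(ls_output, devs):
        # Hypothetical simplification of the old wwn mapping logic.
        sd_to_wwn = {}
        for line in ls_output.splitlines():
            if ' -> ' not in line:
                continue
            # e.g. "... /dev/disk/by-id/wwn-0x5000c50091dfc3c1 -> ../../sda"
            left, target = line.rsplit(' -> ', 1)
            sd_dev = '/dev/' + target.rsplit('/', 1)[-1]
            sd_to_wwn[sd_dev] = left.split()[-1]
        # devs like '/dev/vg_nvme/lv_4' never appear among the /dev/sd*
        # symlink targets, so nothing matches and the result is {}.
        return dict((d, sd_to_wwn[d]) for d in devs if d in sd_to_wwn)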

@kshtsk kshtsk (Contributor) commented Oct 17, 2019

retest this please

@kshtsk kshtsk (Contributor) commented Oct 17, 2019

please remove this line:

Fixes: https://tracker.ceph.com/issues/42313

because this ticket is not related to this issue.

@dillaman dillaman left a comment

👍

@dillaman

please remove this line:

Fixes: https://tracker.ceph.com/issues/42313

because this ticket is not related to this issue.

Disagree -- this fixes the issue described in that ticket. Please stop fighting this so we can get teuthology working again.

@kshtsk kshtsk (Contributor) commented Oct 17, 2019

please remove this line:

Fixes: https://tracker.ceph.com/issues/42313

because this ticket is not related to this issue.

Disagree -- this fixes the issue described in that ticket. Please stop fighting this so we can get teuthology working again.

It is a different problem. Please take a look at the log in the description; it is using a version of teuthology from before this change.

@dillaman

please remove this line:

Fixes: https://tracker.ceph.com/issues/42313

because this ticket is not related to this issue.

Disagree -- this fixes the issue described in that ticket. Please stop fighting this so we can get teuthology working again.

It is a different problem. Please take a look at the log in the description; it is using a version of teuthology from before this change.

Again -- stop fighting this. I updated the description of the ticket.

@smithfarm smithfarm (Contributor, Author)

@kshtsk kshtsk (Contributor) commented Oct 17, 2019

@dillaman thank you, you're my hero.
