
qa: use hard_reset to reboot kclient #28825

Merged: 2 commits merged into ceph:master on Jul 29, 2019

Conversation

@batrick (Member) commented Jul 1, 2019

power_off may allow the mounts to gracefully unmount. We don't want this if the
kclient is stuck or we desire the client to "disappear" and come back.

Fixes: http://tracker.ceph.com/issues/37681
Signed-off-by: Patrick Donnelly pdonnell@redhat.com

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug
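For context, a rough sketch of the kind of change this PR describes: replacing the graceful power_off with an IPMI hard reset in the kclient kill path. The file location and surrounding names are assumptions based on the snippet quoted later in this thread, not the actual diff:

    # Assumed to live in qa/tasks/cephfs/kernel_mount.py; illustrative only.
    def kill(self):
        # power_off() lets the kernel client unmount cleanly and release its
        # caps; hard_reset() makes the client simply vanish and come back.
        con = orchestra_remote.getRemoteConsole(self.client_remote.hostname,
                                                self.ipmi_user,
                                                self.ipmi_password,
                                                self.ipmi_domain)
        con.hard_reset()  # previously: con.power_off()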

@djgalloway

The teuthology hard_reset function does wait for the machine to come back up before returning. https://github.com/ceph/teuthology/blob/master/teuthology/orchestra/console.py#L215

Is that okay for your purposes?
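That wait essentially watches the node's serial console until the login prompt reappears (see the pexpect lines in the log further down in this thread). A simplified illustration of the idea, not teuthology's actual code:

    import pexpect

    def wait_for_login(shortname, console_cmd, timeout=300):
        # Attach to the serial console and block until the login prompt shows
        # up, i.e. the machine has finished booting.
        child = pexpect.spawn(console_cmd)
        child.expect('{s} login:'.format(s=shortname), timeout=timeout)
        child.close()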

@batrick (Member, Author) commented Jul 1, 2019

Hmm, I think we should probably defer the wait if possible. I'll fix that.

@batrick (Member, Author) commented Jul 1, 2019

ceph/teuthology#1296
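Assuming that teuthology change makes the login wait optional (the flag name wait_for_login is an assumption here), the kclient kill path could then defer the wait along these lines:

    # Hedged sketch: trigger the reset but don't block on the node coming back.
    # The wait_for_login keyword is assumed from ceph/teuthology#1296.
    con.hard_reset(wait_for_login=False)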

@batrick (Member, Author) commented Jul 1, 2019

@djgalloway please have another look

@batrick (Member, Author) commented Jul 12, 2019

Seeing an unexpected error where the old mount point is busy: /ceph/teuthology-archive/pdonnell-2019-07-11_22:56:09-kcephfs-wip-pdonnell-testing-20190711.203149-distro-basic-smithi/4112066/teuthology.log

My instinct is that the reset somehow didn't actually happen, so I'm adding a call to uptime to see whether that's the case.
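A hedged sketch of that sanity check, using the remote run call visible in the log pasted later in this thread (exactly where it goes in the test code is an assumption):

    # Illustrative only: a small uptime right after the reset confirms the
    # node really rebooted; a large value means it never went down.
    self.client_remote.run(args=['uptime'], timeout=10)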

@batrick (Member, Author) commented Jul 12, 2019

@djgalloway do you see what happened?

@batrick (Member, Author) commented Jul 15, 2019

2019-07-15T18:37:11.431 INFO:tasks.cephfs.test_journal_repair:Killing mount, it's blocked on the MDS we killed
2019-07-15T18:37:11.485 INFO:teuthology.orchestra.console:Performing hard reset of smithi079
2019-07-15T18:37:11.485 DEBUG:teuthology.orchestra.console:pexpect command: ipmitool -H smithi079.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power reset
2019-07-15T18:37:11.527 INFO:teuthology.orchestra.console:Hard reset for smithi079 completed
2019-07-15T18:37:11.708 DEBUG:teuthology.orchestra.console:Waiting for login prompt on smithi079
2019-07-15T18:37:11.708 DEBUG:teuthology.orchestra.console:pexpect command: console -M conserver.front.sepia.ceph.com -p 3109 -f smithi079
2019-07-15T18:37:11.786 DEBUG:teuthology.orchestra.console:expect: smithi079 login
2019-07-15T18:37:11.999 DEBUG:teuthology.orchestra.console:expect before: ^M
[Enter `^Ec?' for help]^M
^M^M
Employee SKU^M
Kernel 3.10.0-957.21.3.el7.x86_64 on an x86_64^M
^M

2019-07-15T18:37:11.999 DEBUG:teuthology.orchestra.console:expect after: smithi079 login:
2019-07-15T18:37:12.153 INFO:teuthology.misc:Re-opening connections...
2019-07-15T18:37:12.153 INFO:teuthology.misc:trying to connect to ubuntu@smithi079.front.sepia.ceph.com
2019-07-15T18:37:12.154 INFO:teuthology.orchestra.run.smithi079.stdout:file_925
2019-07-15T18:37:12.156 INFO:teuthology.orchestra.remote:Trying to reconnect to host
2019-07-15T18:37:12.157 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi079.front.sepia.ceph.com', 'timeout': 60}
2019-07-15T18:37:12.246 INFO:teuthology.orchestra.run.smithi079:Running:
2019-07-15T18:37:12.246 INFO:teuthology.orchestra.run.smithi079:> true
2019-07-15T18:37:12.513 DEBUG:teuthology.misc:waited 0.359727859497
2019-07-15T18:37:13.514 INFO:teuthology.orchestra.run:Running command with timeout 10
2019-07-15T18:37:13.514 INFO:teuthology.orchestra.run.smithi079:Running:
2019-07-15T18:37:13.514 INFO:teuthology.orchestra.run.smithi079:> uptime
2019-07-15T18:37:13.595 INFO:teuthology.orchestra.run.smithi079.stdout: 18:37:13 up 25 min,  0 users,  load average: 0.83, 0.60, 0.41
2019-07-15T18:37:13.595 INFO:teuthology.orchestra.run:Running command with timeout 300
2019-07-15T18:37:13.595 INFO:teuthology.orchestra.run.smithi079:Running:
2019-07-15T18:37:13.595 INFO:teuthology.orchestra.run.smithi079:> rmdir -- /home/ubuntu/cephtest/mnt.0
2019-07-15T18:37:13.637 DEBUG:teuthology.orchestra.run:got remote process result: 1
2019-07-15T18:37:13.664 INFO:teuthology.orchestra.run.smithi079.stderr:rmdir: failed to remove ‘/home/ubuntu/cephtest/mnt.0’: Device or resource busy
2019-07-15T18:37:13.674 INFO:tasks.cephfs_test_runner:test_reset (tasks.cephfs.test_journal_repair.TestJournalRepair) ... ERROR

From: /ceph/teuthology-archive/pdonnell-2019-07-15_17:05:25-kcephfs-master-distro-basic-smithi/4121449/teuthology.log

hard reset doesn't appear to work...

@djgalloway

The job didn't wait for the machine to die.

2019-07-15T18:37:11.527 INFO:teuthology.orchestra.console:Hard reset for smithi079 completed
...
2019-07-15T18:37:12.513 DEBUG:teuthology.misc:waited 0.359727859497
...
2019-07-15T18:37:13.595 INFO:teuthology.orchestra.run.smithi079.stdout: 18:37:13 up 25 min,  0 users,  load average: 0.83, 0.60, 0.41

But you can see a connection failure a couple minutes later in the job:

2019-07-15T18:38:05.799 INFO:teuthology.orchestra.run.smithi079:Running:
2019-07-15T18:38:05.799 INFO:teuthology.orchestra.run.smithi079:> sudo rm -rf -- /etc/ceph/ceph.conf /etc/ceph/ceph.keyring /home/ubuntu/cephtest/ceph.data /home/ubuntu/cephtest/ceph.monmap /home/ubuntu/cephtest/../*.pid
2019-07-15T18:39:27.929 ERROR:paramiko.transport:Socket exception: Connection reset by peer (104)

Which makes me think the machine was on its way to booting back up when teuthology tried to clean up some artifacts there.

I'm not sure where you'd need to add a wait but I'd give it maybe 30 seconds before checking for a console login prompt (which is the indicator to teuthology that the machine is back up).
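A hedged sketch of that suggestion, reusing the check_status() call quoted below (whether the pause belongs in kill() itself or in the later cleanup path is left open):

    import time

    # Illustrative only: give the node a moment to actually go down, then
    # check the console for a login prompt (waiting up to 60 seconds).
    time.sleep(30)
    con.check_status(timeout=60)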

@batrick (Member, Author) commented Jul 26, 2019

/ceph/teuthology-archive/pdonnell-2019-07-26_06:38:30-kcephfs-wip-pdonnell-testing-20190726.021409-distro-basic-smithi/4152120/teuthology.log

/ceph/teuthology-archive/pdonnell-2019-07-26_06:38:30-kcephfs-wip-pdonnell-testing-20190726.021409-distro-basic-smithi/4151970/teuthology.log

Puzzling failure there; it looks like the machine just never came back up.

@batrick (Member, Author) commented Jul 26, 2019

another: /ceph/teuthology-archive/pdonnell-2019-07-26_06:38:30-kcephfs-wip-pdonnell-testing-20190726.021409-distro-basic-smithi/4151951/teuthology.log

@djgalloway

Looking at the teuthology.log, it appears a keystroke got sent that disrupted the automatic GRUB menu countdown. The system got reset and started to boot from the HDD but sat at the GRUB menu.

Maybe try scrapping these lines?

con = orchestra_remote.getRemoteConsole(self.client_remote.hostname,
                                        self.ipmi_user,
                                        self.ipmi_password,
                                        self.ipmi_domain)
con.check_status(timeout=60)

The two commits in this pull request:

qa: use hard_reset to reboot kclient

    power_off may allow the mounts to gracefully unmount. We don't want this
    if the kclient is stuck or we desire the client to "disappear" and come
    back.

    Fixes: http://tracker.ceph.com/issues/37681
    Depends-on: ceph/teuthology#1296
    Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: wait for kernel client death

    After sending the reboot command, we need to wait briefly for it to be
    rebooted so that the kernel client doesn't voluntarily give up its Fb
    cap.

    Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

@batrick (Member, Author) commented Jul 26, 2019

I moved that to an except block for debugging. Really appreciate your help, @djgalloway!
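A rough sketch of what moving that check into an except block might look like (purely illustrative; names such as self.mountpoint are assumptions and the real diff may differ):

    try:
        # The cleanup that was failing above: removing the old mountpoint.
        self.client_remote.run(args=['rmdir', '--', self.mountpoint],
                               timeout=300)
    except Exception:
        # Only probe the console when cleanup fails, to help debug whether
        # the node ever actually reset.
        con = orchestra_remote.getRemoteConsole(self.client_remote.hostname,
                                                self.ipmi_user,
                                                self.ipmi_password,
                                                self.ipmi_domain)
        con.check_status(timeout=60)
        raise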

@batrick merged commit 6b83f43 into ceph:master on Jul 29, 2019
batrick added a commit that referenced this pull request Jul 29, 2019
* refs/pull/28825/head:
	qa: wait for kernel client death
	qa: use hard_reset to reboot kclient

Reviewed-by: David Galloway <dgallowa@redhat.com>
@batrick deleted the i37681 branch on July 16, 2020 at 02:35