
qa: use hard_reset to reboot kclient #28825

Merged: 2 commits merged into ceph:master on Jul 29, 2019

Conversation

@batrick (Member) commented Jul 1, 2019

power_off may allow the mounts to gracefully unmount. We don't want this if the
kclient is stuck or we desire the client to "disappear" and come back.

Fixes: http://tracker.ceph.com/issues/37681
Signed-off-by: Patrick Donnelly pdonnell@redhat.com

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug
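For context, a rough sketch of the kind of change this PR describes: replacing the graceful power_off with an IPMI hard reset in the kclient kill path. The file location and surrounding names are assumptions based on the snippet quoted later in this thread, not the actual diff:

    # Assumed to live in qa/tasks/cephfs/kernel_mount.py; illustrative only.
    def kill(self):
        # power_off() lets the kernel client unmount cleanly and release its
        # caps; hard_reset() makes the client simply vanish and come back.
        con = orchestra_remote.getRemoteConsole(self.client_remote.hostname,
                                                self.ipmi_user,
                                                self.ipmi_password,
                                                self.ipmi_domain)
        con.hard_reset()  # previously: con.power_off()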

@djgalloway

The teuthology hard_reset function does wait for the machine to come back up before returning. https://github.com/ceph/teuthology/blob/master/teuthology/orchestra/console.py#L215

Is that okay for your purposes?
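That wait essentially watches the node's serial console until the login prompt reappears (see the pexpect lines in the log further down in this thread). A simplified illustration of the idea, not teuthology's actual code:

    import pexpect

    def wait_for_login(shortname, console_cmd, timeout=300):
        # Attach to the serial console and block until the login prompt shows
        # up, i.e. the machine has finished booting.
        child = pexpect.spawn(console_cmd)
        child.expect('{s} login:'.format(s=shortname), timeout=timeout)
        child.close()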

@batrick (Member, Author) commented Jul 1, 2019

Hmm, I think we should probably defer the wait if possible. I'll fix that.

@batrick (Member, Author) commented Jul 1, 2019

ceph/teuthology#1296
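Assuming that teuthology change makes the login wait optional (the flag name wait_for_login is an assumption here), the kclient kill path could then defer the wait along these lines:

    # Hedged sketch: trigger the reset but don't block on the node coming back.
    # The wait_for_login keyword is assumed from ceph/teuthology#1296.
    con.hard_reset(wait_for_login=False)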

@batrick (Member, Author) commented Jul 1, 2019

@djgalloway please have another look

@batrick (Member, Author) commented Jul 12, 2019

Seeing an unexpected error where the old mount point is busy: /ceph/teuthology-archive/pdonnell-2019-07-11_22:56:09-kcephfs-wip-pdonnell-testing-20190711.203149-distro-basic-smithi/4112066/teuthology.log

My instinct is that the reset somehow didn't actually happen, so I'm adding a call to uptime to see whether that's the case.
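A hedged sketch of that sanity check, using the remote run call visible in the log pasted later in this thread (exactly where it goes in the test code is an assumption):

    # Illustrative only: a small uptime right after the reset confirms the
    # node really rebooted; a large value means it never went down.
    self.client_remote.run(args=['uptime'], timeout=10)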

@batrick (Member, Author) commented Jul 12, 2019

@djgalloway do you see what happened?

@batrick (Member, Author) commented Jul 15, 2019

2019-07-15T18:37:11.431 INFO:tasks.cephfs.test_journal_repair:Killing mount, it's blocked on the MDS we killed
2019-07-15T18:37:11.485 INFO:teuthology.orchestra.console:Performing hard reset of smithi079
2019-07-15T18:37:11.485 DEBUG:teuthology.orchestra.console:pexpect command: ipmitool -H smithi079.ipmi.sepia.ceph.com -I lanplus -U inktank -P ApGNXcA7 power reset
2019-07-15T18:37:11.527 INFO:teuthology.orchestra.console:Hard reset for smithi079 completed
2019-07-15T18:37:11.708 DEBUG:teuthology.orchestra.console:Waiting for login prompt on smithi079
2019-07-15T18:37:11.708 DEBUG:teuthology.orchestra.console:pexpect command: console -M conserver.front.sepia.ceph.com -p 3109 -f smithi079
2019-07-15T18:37:11.786 DEBUG:teuthology.orchestra.console:expect: smithi079 login
2019-07-15T18:37:11.999 DEBUG:teuthology.orchestra.console:expect before: ^M
[Enter `^Ec?' for help]^M
^M^M
Employee SKU^M
Kernel 3.10.0-957.21.3.el7.x86_64 on an x86_64^M
^M

2019-07-15T18:37:11.999 DEBUG:teuthology.orchestra.console:expect after: smithi079 login:
2019-07-15T18:37:12.153 INFO:teuthology.misc:Re-opening connections...
2019-07-15T18:37:12.153 INFO:teuthology.misc:trying to connect to ubuntu@smithi079.front.sepia.ceph.com
2019-07-15T18:37:12.154 INFO:teuthology.orchestra.run.smithi079.stdout:file_925
2019-07-15T18:37:12.156 INFO:teuthology.orchestra.remote:Trying to reconnect to host
2019-07-15T18:37:12.157 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi079.front.sepia.ceph.com', 'timeout': 60}
2019-07-15T18:37:12.246 INFO:teuthology.orchestra.run.smithi079:Running:
2019-07-15T18:37:12.246 INFO:teuthology.orchestra.run.smithi079:> true
2019-07-15T18:37:12.513 DEBUG:teuthology.misc:waited 0.359727859497
2019-07-15T18:37:13.514 INFO:teuthology.orchestra.run:Running command with timeout 10
2019-07-15T18:37:13.514 INFO:teuthology.orchestra.run.smithi079:Running:
2019-07-15T18:37:13.514 INFO:teuthology.orchestra.run.smithi079:> uptime
2019-07-15T18:37:13.595 INFO:teuthology.orchestra.run.smithi079.stdout: 18:37:13 up 25 min,  0 users,  load average: 0.83, 0.60, 0.41
2019-07-15T18:37:13.595 INFO:teuthology.orchestra.run:Running command with timeout 300
2019-07-15T18:37:13.595 INFO:teuthology.orchestra.run.smithi079:Running:
2019-07-15T18:37:13.595 INFO:teuthology.orchestra.run.smithi079:> rmdir -- /home/ubuntu/cephtest/mnt.0
2019-07-15T18:37:13.637 DEBUG:teuthology.orchestra.run:got remote process result: 1
2019-07-15T18:37:13.664 INFO:teuthology.orchestra.run.smithi079.stderr:rmdir: failed to remove ‘/home/ubuntu/cephtest/mnt.0’: Device or resource busy
2019-07-15T18:37:13.674 INFO:tasks.cephfs_test_runner:test_reset (tasks.cephfs.test_journal_repair.TestJournalRepair) ... ERROR

From: /ceph/teuthology-archive/pdonnell-2019-07-15_17:05:25-kcephfs-master-distro-basic-smithi/4121449/teuthology.log

hard reset doesn't appear to work...

@djgalloway

The job didn't wait for the machine to die.

2019-07-15T18:37:11.527 INFO:teuthology.orchestra.console:Hard reset for smithi079 completed
...
2019-07-15T18:37:12.513 DEBUG:teuthology.misc:waited 0.359727859497
...
2019-07-15T18:37:13.595 INFO:teuthology.orchestra.run.smithi079.stdout: 18:37:13 up 25 min,  0 users,  load average: 0.83, 0.60, 0.41

But you can see a connection failure a couple minutes later in the job:

2019-07-15T18:38:05.799 INFO:teuthology.orchestra.run.smithi079:Running:
2019-07-15T18:38:05.799 INFO:teuthology.orchestra.run.smithi079:> sudo rm -rf -- /etc/ceph/ceph.conf /etc/ceph/ceph.keyring /home/ubuntu/cephtest/ceph.data /home/ubuntu/cephtest/ceph.monmap /home/ubuntu/cephtest/../*.pid
2019-07-15T18:39:27.929 ERROR:paramiko.transport:Socket exception: Connection reset by peer (104)

Which makes me think the machine was on its way to booting back up when teuthology tried to clean up some artifacts there.

I'm not sure where you'd need to add a wait but I'd give it maybe 30 seconds before checking for a console login prompt (which is the indicator to teuthology that the machine is back up).
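A hedged sketch of that suggestion, reusing the check_status() call quoted below (whether the pause belongs in kill() itself or in the later cleanup path is left open):

    import time

    # Illustrative only: give the node a moment to actually go down, then
    # check the console for a login prompt (waiting up to 60 seconds).
    time.sleep(30)
    con.check_status(timeout=60)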

@batrick (Member, Author) commented Jul 26, 2019

/ceph/teuthology-archive/pdonnell-2019-07-26_06:38:30-kcephfs-wip-pdonnell-testing-20190726.021409-distro-basic-smithi/4152120/teuthology.log

/ceph/teuthology-archive/pdonnell-2019-07-26_06:38:30-kcephfs-wip-pdonnell-testing-20190726.021409-distro-basic-smithi/4151970/teuthology.log

Puzzling failure there; it looks like the machine just never came back up.

@batrick (Member, Author) commented Jul 26, 2019

another: /ceph/teuthology-archive/pdonnell-2019-07-26_06:38:30-kcephfs-wip-pdonnell-testing-20190726.021409-distro-basic-smithi/4151951/teuthology.log

@djgalloway

Looking at the teuthology.log, it appears a keystroke got sent that disrupted the automatic GRUB menu countdown. The system got reset and started to boot from the HDD but sat at the GRUB menu.

Maybe try scrapping these lines?

con = orchestra_remote.getRemoteConsole(self.client_remote.hostname,
                                        self.ipmi_user,
                                        self.ipmi_password,
                                        self.ipmi_domain)
con.check_status(timeout=60)

The two commits in this pull request:

qa: use hard_reset to reboot kclient

    power_off may allow the mounts to gracefully unmount. We don't want this
    if the kclient is stuck or we desire the client to "disappear" and come
    back.

    Fixes: http://tracker.ceph.com/issues/37681
    Depends-on: ceph/teuthology#1296
    Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: wait for kernel client death

    After sending the reboot command, we need to wait briefly for it to be
    rebooted so that the kernel client doesn't voluntarily give up its Fb
    cap.

    Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

@batrick (Member, Author) commented Jul 26, 2019

I moved that to an except block for debugging. Really appreciate your help, @djgalloway!
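A rough sketch of what moving that check into an except block might look like (purely illustrative; names such as self.mountpoint are assumptions and the real diff may differ):

    try:
        # The cleanup that was failing above: removing the old mountpoint.
        self.client_remote.run(args=['rmdir', '--', self.mountpoint],
                               timeout=300)
    except Exception:
        # Only probe the console when cleanup fails, to help debug whether
        # the node ever actually reset.
        con = orchestra_remote.getRemoteConsole(self.client_remote.hostname,
                                                self.ipmi_user,
                                                self.ipmi_password,
                                                self.ipmi_domain)
        con.check_status(timeout=60)
        raise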

@batrick merged commit 6b83f43 into ceph:master on Jul 29, 2019
batrick added a commit that referenced this pull request Jul 29, 2019
* refs/pull/28825/head:
	qa: wait for kernel client death
	qa: use hard_reset to reboot kclient

Reviewed-by: David Galloway <dgallowa@redhat.com>
@batrick deleted the i37681 branch on July 16, 2020 at 02:35