[Installer] Unreachable app server during install produces mixed/broken install state (1.2.0-rc1, 1.2.0-rc2) #5047

Closed
rocodes opened this issue Nov 29, 2019 · 5 comments

@rocodes
Contributor

rocodes commented Nov 29, 2019

Description

During install, I have twice (rc1, rc2) experienced an issue in the tor-hidden-services task where the app server becomes unreachable.

fatal: [app]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"10.20.2.2\". Make sure this host can be reached over ssh", "unreachable": true}

The subsequent task (Refresh ansible local facts) also fails. This produces a mixed state in which the installer continues for mon but eventually fails at the validation stage:

[validate: Confirm that a valid set of SSH auth files is present]
...
"msg": "One of the SSH `.auth_private` files is missing. Please add the missing file under ~/Persistent/securedrop/install_files/ansible_base/ and retry the install command."

This instruction is impossible to follow because the app-ssh.auth_private file was never generated.
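
For anyone triaging a similar failure, here is a minimal sketch of the kind of check behind that validation message. It assumes the two file names mentioned in this report (app-ssh.auth_private and mon-ssh.auth_private); the function name `missing_auth_files` is illustrative and not taken from the SecureDrop playbooks, and the real validation step may check a different set of files.

```python
from pathlib import Path

# File names taken from this report (assumption: the real validation
# role may look for a different or larger set of files).
EXPECTED_AUTH_FILES = ("app-ssh.auth_private", "mon-ssh.auth_private")


def missing_auth_files(ansible_base, expected=EXPECTED_AUTH_FILES):
    """Return the expected auth files that are absent from ansible_base."""
    base = Path(ansible_base).expanduser()
    return [name for name in expected if not (base / name).is_file()]
```

Calling `missing_auth_files("~/Persistent/securedrop/install_files/ansible_base")` on the admin workstation would list which files the installer failed to write, which in the state described here should be only the app one.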

Steps to Reproduce

Needs reproducing. NUC7i7DNHE, Tails 4, setup: clean install with v3 onion only.

Expected Behavior

Installation completes.

Actual Behavior

The install is left in a mixed state: SSH access to app is still possible, but SSH access to mon is not. There are no .ths files in install_files/ansible_base/, but there is a tor_v3_keys.json file and a mon-ssh.auth_private file, and the mon iptables rules are in place, locking out regular SSH.

It seems like there isn't really a way to "recover" from this state except to wipe the servers and start again.

Comments

Hopefully I have explained this properly.

@zenmonkeykstop
Contributor

Haven't been able to reproduce this yet. When tor is restarted, Ansible waits for 30 seconds and then polls every second for 300 seconds, so it's hard to see how SSH wouldn't be up again in that interval.
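
To make that timing concrete, here is a rough Python sketch of the wait-then-poll pattern described above (an initial 30-second delay, then one check per second for up to 300 seconds). It is not the Ansible implementation: `wait_for_port`, `port_open`, and their parameters are hypothetical names, and the check only tests whether a TCP port accepts connections.

```python
import socket
import time


def port_open(host, port, connect_timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, delay=30, timeout=300, interval=1,
                  check=port_open, sleep=time.sleep, clock=time.monotonic):
    """Sleep for `delay`, then poll every `interval` seconds for `timeout` seconds.

    Returns True as soon as the check passes, False once the timeout elapses.
    check, sleep and clock are injectable so the loop can be exercised offline.
    """
    sleep(delay)           # initial grace period while the service restarts
    start = clock()        # the timeout window begins after the delay
    while True:
        if check(host, port):
            return True
        if clock() - start >= timeout:
            return False
        sleep(interval)
```

With these defaults, `wait_for_port("10.20.2.2", 22)` would give the host from the error above roughly five and a half minutes to come back before giving up.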

@kushaldas
Contributor

kushaldas commented Dec 2, 2019

I also installed a few times, and could not reproduce (on NUC5 though).

@zenmonkeykstop
Contributor

Given that everything else was set up, it should be possible to recover from this without a reinstall: console in, copy the service info from the app server, manually create the app-*.auth_private files, and then rerun the installer. That said, starting from scratch is probably wise.

@rocodes rocodes mentioned this issue Dec 2, 2019
24 tasks
@eloquence eloquence added this to Current Sprint - 11/20 -12/4 in SecureDrop Team Board Dec 2, 2019
@eloquence eloquence moved this from Current Sprint - 11/20 -12/4 to Near Term - SecureDrop Core in SecureDrop Team Board Dec 3, 2019
@eloquence
Member

Proposing further investigation during QA cycle for 1.3.0.

@eloquence eloquence added this to the 1.3.0 milestone Dec 19, 2019
@eloquence
Member

We haven't seen a repro of this during recent QA cycles, so closing for now.

SecureDrop Team Board automation moved this from Near Term - SecureDrop Core to Done Nov 13, 2020
@eloquence eloquence removed this from Done in SecureDrop Team Board Nov 16, 2020