Transient network issues in guest VMs during install #514

Closed
eloquence opened this issue Mar 27, 2020 · 14 comments


@eloquence
Member

eloquence commented Mar 27, 2020

Both @emkll and I have observed transient network issues in guest VMs during prod installs. These cause required repository operations to fail, which in turn fails the whole install.

See here for report from 0.2.3-rpm QA. Reboot did not resolve:

After reboot seeing same "temporary failure resolving" but for deb.debian.org in install-python-apt-for-repo-config :/. Network is up and that host is reachable from work VM just fine.

During my install I saw it both for deb.qubes-os.org and for deb.debian.org. Default Qubes config, i.e. no package updates over Tor for regular VMs.

In spite of those failures, required packages appeared to be correctly installed. Only restarting all VMs once more and re-running resolved the issue.

These issues are intermittent and we've not seen them for all installs.

@eloquence eloquence added the bug label Mar 27, 2020
@eloquence eloquence added this to QA Period - 3/17-3/31 (Kanban Mode) in SecureDrop Team Board Mar 27, 2020
@eloquence eloquence moved this from Sprint #48 - 4/2-4/15 to Nominated for next sprint in SecureDrop Team Board Apr 2, 2020
@eloquence eloquence moved this from Nominated for next sprint to Near Term - SD Workstation in SecureDrop Team Board Apr 2, 2020
@conorsch
Contributor

I do believe this is still a problem, at least on latest master. Steps to reproduce:

make clone
make clean
make all

observe sd-log provisioning failing:

sd-log:
  ----------
            ID: dsa-4371-update
      Function: cmd.script
        Result: True
       Comment: Nothing to do, apt already fixed.
       Started: 14:56:58.602699
      Duration: 179.668 ms
       Changes:   
  ----------
            ID: update
      Function: pkg.uptodate
        Result: False
       Comment: Problem encountered upgrading packages. Additional info follows:
                
                result:
                    ----------
                    pid:
                        1150
                    retcode:
                        100
                    stderr:
                        E: Failed to fetch https://apt-test.freedom.press/pool/main/s/securedrop-log/securedrop-log_0.1.1-dev-20200415-060704+buster_all.deb  Temporary failure resolving 'apt-test.freedom.press'
                        E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
                    stdout:
                        Reading package lists...
                        Building dependency tree...
                        Reading state information...
                        Calculating upgrade...
                        The following packages will be upgraded:
                          securedrop-log
                        1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
                        Need to get 4445 kB of archives.
                        After this operation, 0 B of additional disk space will be used.
                        Err:1 https://apt-test.freedom.press buster/main amd64 securedrop-log all 0.1.1-dev-20200415-060704+buster
                          Temporary failure resolving 'apt-test.freedom.press'
       Started: 14:57:00.752293
      Duration: 3008.196 ms
       Changes:   

Looking at the code, it's clear why this is failing: the apt update commands should be run only against templates:

include:
- fpf-apt-test-repo
{% if "template" in grains['id'] or grains['id'] in ["securedrop-workstation-buster", "whonix-gw-15"] %}
# Install securedrop-log package in TemplateVMs only

but instead it's outside the template block. Can someone else reproduce to confirm? Will submit PR with patch shortly.
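
To see which VM names the quoted Jinja condition would match, the test can be mirrored outside Salt. This is only a sketch: it assumes `grains['id']` is simply the VM name, and the function name `targets_template` is hypothetical, not something from the repo.

```python
# Template names hard-coded in the Jinja condition quoted above.
EXPLICIT_TEMPLATES = ("securedrop-workstation-buster", "whonix-gw-15")


def targets_template(minion_id: str) -> bool:
    """Mirror the Jinja test: "template" in the id, or id in the explicit list."""
    return "template" in minion_id or minion_id in EXPLICIT_TEMPLATES
```

Under that assumption, an AppVM such as `sd-log` does not match, so any apt state placed outside the block would still run against it.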

@conorsch
Contributor

conorsch commented Apr 15, 2020

Can reproduce reliably (3/3 attempts) with make clean && make all. However, I failed to reproduce by running securedrop-admin --apply against a pre-existing "staging" config just now. This appears to be due to lack of available updates, since I'd just run the custom GUI updater.

I'd expect the resolution failure during apt calls to appear in fresh prod installs, although I've not tested that locally yet.

@eloquence
Member Author

I did not see this issue on a clean make clone && make clean && make dev run (on a system needing plenty of updates); I did however encounter our old friend #378. Re-running.

@eloquence
Member Author

eloquence commented Apr 16, 2020

Second run completed without error; however, note that I didn't do any downgrades prior to the re-run. Happy to re-test with targeted downgrades if that could make a difference.

@conorsch
Contributor

Tried again, this time purging the template RPM from dom0 (which make clean alone does not do), and was able to complete a full run, including make test passing, with no errors. Since so far the problem has only been observed on my machine, it looks like the template I was using had bad state.

If no one else can reproduce, priority should be low, although I'd still advocate for review of #535 as a cleanup task.

@conorsch
Contributor

It occurs to me that transient network errors on the apt-test server would be a sufficient explanation for the variable behavior we're seeing. In fact, @zenmonkeykstop reported trouble pulling from apt-test around the same time window I was observing the failures described above. Only the apt-test repo showed problems for me, none of the other upstream repos.
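
One way to tell a flaky resolver from a hard failure would be to retry the lookup a few times from inside the affected VM. A rough sketch, not part of the provisioning code: the function name `resolves`, the retry count, the delay, and port 443 are all assumptions chosen for illustration.

```python
import socket
import time
from typing import Callable


def resolves(host: str,
             attempts: int = 3,
             delay: float = 2.0,
             lookup: Callable[..., object] = socket.getaddrinfo) -> bool:
    """Return True if `host` resolves within `attempts` tries.

    `lookup` is injectable so the retry logic can be exercised without
    touching the network; by default it is socket.getaddrinfo.
    """
    for attempt in range(1, attempts + 1):
        try:
            lookup(host, 443)
            return True
        except socket.gaierror:
            # Name resolution failed; wait before the next try, if any remain.
            if attempt < attempts:
                time.sleep(delay)
    return False
```

Calling `resolves("apt-test.freedom.press")` and `resolves("deb.debian.org")` side by side would show whether only one repo host is affected. A result that flips between runs points at a transient problem rather than a VM with no network access at all.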

@emkll
Contributor

emkll commented Apr 17, 2020

Also could not reproduce the error while following steps described in #514 (comment)

@rocodes
Contributor

rocodes commented Apr 17, 2020

Ok, so I uninstalled my prod installation with securedrop-admin --uninstall, then set up a dev env by cloning the repo to an AppVM, restoring my secrets and doing the pass-io dance, and running make clean / make clone / make all (I know the clean/clone was unnecessary but I was trying to stick to these exact steps).

I have a different failure:

ID: install securedrop-log-package
Function: pkg.installed
Result: False
...
E: Package 'securedrop-log' has no installation candidate.

Perhaps I did something wrong though.

@eloquence
Member Author

I'm currently doing a securedrop-admin --apply run on a prod system to apply a configuration change and am getting the "Temporary failure resolving 'deb.debian.org'" issue reliably and repeatedly. It's not clear to me from the logs which VM is at issue, but I'll try applying all available updates and re-running.

Here's the log:
https://gist.github.com/eloquence/84634ecef304d259b303dd8d5d3c3d3a

There's no issue with the network, as far as I can tell, and manual updates appear to work fine.
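
To narrow down which VM is affected, the log could be scanned for resolution failures grouped by VM. A sketch only: it assumes the log looks like the Salt output quoted earlier in this thread (an unindented `vm-name:` header followed by indented state results), which may not hold for every log. The function name `failures_by_vm` is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Assumed format: a VM section starts with an unindented "name:" line.
VM_HEADER = re.compile(r"^(\S+):\s*$")
RESOLVE_FAILURE = re.compile(r"Temporary failure resolving '([^']+)'")


def failures_by_vm(log_text: str) -> Dict[str, Set[str]]:
    """Map each VM header to the hostnames it failed to resolve."""
    failures: Dict[str, Set[str]] = defaultdict(set)
    current_vm = "(unknown)"
    for line in log_text.splitlines():
        header = VM_HEADER.match(line)
        if header:
            current_vm = header.group(1)
            continue
        # Indented keys such as "pid:" never match VM_HEADER, so they
        # fall through to the failure search below.
        for host in RESOLVE_FAILURE.findall(line):
            failures[current_vm].add(host)
    return dict(failures)
```

Failures that appear before any header are filed under `(unknown)`, so nothing is silently dropped if the format assumption is wrong.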

@eloquence
Member Author

According to the management VM logs, the error most recently happened with sd-gpg and previously also with sd-log. Question: Why is it attempting to do apt updates for those AppVMs? Shouldn't the updates be restricted to TemplateVMs?

@conorsch
Contributor

The changes in #535 look like they'd resolve that case, although I don't have a good explanation for why the DNS resolution problem is sporadic: if the wrong VMs (i.e. AppVMs) are targeted for apt updates, then the failure should always occur.

@eloquence
Member Author

eloquence commented May 12, 2020

Ah, thanks for the reminder about #535. Since I'm currently running prod I don't have that fix in yet. I think it may have been "sporadic" in the sense of only occurring when updates are available. I just applied all available updates and it successfully ran without errors.

Will try to re-test for this case once I'm on staging.

@eloquence
Member Author

I've run a few installs since #535 landed and have not seen this issue since then. Tagging "needs repro" for now; we can close if we don't see evidence of it during the next QA cycle.

@eloquence
Member Author

This appears to be resolved; feel free to reopen if you see it again.

SecureDrop Team Board automation moved this from Near Term - SD Workstation to Done Feb 8, 2021

4 participants