Migrate SecureDrop servers to Ubuntu 16.04 (Xenial) #3204

Closed
25 tasks done
eloquence opened this issue Mar 29, 2018 · 10 comments
Labels: epic (Meta issue tracking child issues), ops/deployment

Comments

@eloquence
Member

eloquence commented Mar 29, 2018

Ubuntu Trusty (14.04) reaches EOL in April 2019. We should upgrade SecureDrop servers to Ubuntu Xenial (16.04) before then. This will also unblock several currently blocked issues.

(Please keep discussion about moving to 18.04 out of scope of this issue. We will consider the best path to 18.04, but will not immediately go from 14.04 to 18.04.)

Initially this epic captures only preliminary work; we will update it as we discover more work. The preliminary work must affect only the development environment and must not have production consequences.

Tasks:

In current sprint

Stretch goals

@conorsch
Contributor

Config tests are passing against Xenial. I'll do a few repeated runs on both Trusty and Xenial to make sure there aren't any test flakes. On the subject of flakes, over in #3206 (comment) I noted:

OSSEC (the no-messages-containing "ERROR: Incorrectly formated message from" check is failing; that's bad, and could take a while to debug).

Pleased to report that was a false alarm! The corrupted messages were caused by my running a separate app-staging VM concurrently, for use in developing the SecureDrop Workstation. Since the app-staging IPv4 addresses are hard-coded in staging, the two machines were fighting, and OSSEC was correctly reporting bad messages due to inconsistent key exchange. Powering off the secondary app-staging machine and rebuilding resolved all OSSEC-related errors in the config tests.
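For anyone chasing similar OSSEC noise in staging, the quickest confirmation is to grep the OSSEC logs on the mon VM and list the registered agents. A rough sketch, assuming the stock OSSEC paths (which is what the mon server uses, as far as I know):

```bash
# On mon-staging: look for the malformed-message errors OSSEC raises when
# two hosts end up sharing an agent identity (e.g. duplicate app-staging VMs).
sudo grep "Incorrectly formated message" /var/ossec/logs/ossec.log

# List registered agents to spot duplicates or stale entries.
sudo /var/ossec/bin/agent_control -l
```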

@heartsucker
Contributor

Hey, so this came up when chatting with @redshiftzero recently, and I figured I'd leave my thoughts here.

We had a timeboxed upgrade attempt from Trusty to Xenial, but I think we should not do this on production boxes. I have a few times upgraded laptops from one Debian version to the immediate next using apt dist-upgrade and fiddling with the sources.list{,.d} entries. This has worked insofar as I had a functional operating system and most things "just worked" after the upgrade and the next reboot.
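For concreteness, the in-place upgrade pattern described here is roughly the following. This is a generic sketch of the manual sources.list approach, not a supported SecureDrop procedure; the release names are purely illustrative:

```bash
# Generic Debian/Ubuntu in-place release upgrade (illustrative only):
# point apt at the next release, then dist-upgrade and reboot.
sudo sed -i 's/trusty/xenial/g' /etc/apt/sources.list
sudo sed -i 's/trusty/xenial/g' /etc/apt/sources.list.d/*.list   # if any extra entries exist
sudo apt-get update
sudo apt-get dist-upgrade
sudo reboot
```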

Where things didn't work was in cases where a piece of software was included as the default in version N-1 but not in version N. I mean this specifically in the case of the Wheezy to Jessie upgrade, when the OS switched from sysvinit to systemd. That was fine because there was compatibility with all the packages, but it made debugging a lot harder because searches like "wheezy upstart tor" or whatever just had crap results.

It's also possible to do a bunch of manual fiddling to get the system into the next state by installing, removing, updating alternatives, and updating the default configs, but this is a lot of work. I hesitate to suggest we do this because one problem we have with SD is that right now it's treated like a package/service when in reality it's more of an appliance. Which is to say, we don't have full control over, or even introspection into, the base OS, so if things get a little out of whack, we just don't know.

I say we should tell admins that they need to do a full reinstall, and give them one (two?) releases of SD in which to do that, during which we support both Trusty and Xenial. We already have the backup / restore script that does the magic of replacing the files, so this is not so burdensome.
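For reference, the reinstall-plus-restore path would look roughly like this from the Admin Workstation. This is a sketch assuming the existing securedrop-admin backup and restore actions; the exact invocation and backup file name may differ:

```bash
# On the Admin Workstation, before wiping the servers:
cd ~/Persistent/securedrop
./securedrop-admin backup                      # writes sd-backup-<timestamp>.tar.gz locally

# After reinstalling the servers on Xenial and re-running the install:
./securedrop-admin restore sd-backup-<timestamp>.tar.gz   # <timestamp> is a placeholder
```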

My biggest concern isn't that we'd botch the Trusty to Xenial upgrade; in fact I bet we could totally nail it. The concern is that it's complex and that knowledge will be lost quickly (people forget, and likely only 2-3 people on the team will really fully understand it), and then anyone making future changes will have to remember that systems could be fresh Xenial installs or upgraded Xenials when considering all ops / app related problems.

Another advantage of just mandating reinstalls is that it's less engineering effort on our side, and this deadline feels very close. Given that constraint, it seems like the safer choice even if we disregard the long-term complexities I mentioned above.

Also, I'm acknowledging that I haven't been paying super close attention to this ticket/epic and we may be too far down the line to change this now. Or we may have addressed these concerns already, and what I'm saying here is out of date.

@kushaldas
Contributor

I say we should tell admins that they need to do a full reinstall, and give them one (two?) releases of SD in which to do that, during which we support both Trusty and Xenial. We already have the backup / restore script that does the magic of replacing the files, so this is not so burdensome.

If we have to ask for a reinstall, this is a good time to think about other options on the server side too. For example, the read-only file system of Atomic (the PoC I did).

@redshiftzero
Contributor

Another advantage of just mandating reinstalls is that it's less engineering effort on our side, and this deadline feels very close.

This might reduce engineering effort on our side, but it shifts that effort onto administrators and onto FPF through support (i.e. some core engineering staff would need to travel to assist with reinstalls). Many administrators already did fresh reinstalls late last year, which slowed development for about a month while core team members traveled to install SecureDrops at major news organizations. In my opinion, we should not require reinstalls unless there is an unavoidable technical reason (we haven't come across one yet), as I don't want us to largely halt development in order to travel around and do them. Unfortunately, if we don't assist people (ignoring the fact that we have support contracts with a bunch of news organizations) and we don't provide an upgrade path, a significant fraction of instances could end up on an EOL OS or stop using SecureDrop altogether, either of which is a bad outcome for sources.

If we have to ask for a reinstall, this is a good time to think about other options on the server side too. For example, the read-only file system of Atomic (the PoC I did).

We need to complete the Xenial transition in the next few months. In SecureDrop's current state, I don't see how we could be ready to move to a read only file system before Trusty EOLs. What do you think?

@zenmonkeykstop
Contributor

zenmonkeykstop commented Nov 29, 2018

A reinstall and restore is definitely the easier target for us to hit, but I'm loath to recommend it, as I feel it puts a lot of responsibility on admins, especially those who didn't install the system in the first place (because they inherited it or because FPF did it for them). There are a lot of steps where things could go wrong, and it would be hard for us to know how they went about doing the upgrade if things did go wrong.

@heartsucker, given that we're talking about systems where we have a pretty good idea of their current state, how much of the upgrade complexity you mention could be identified ahead of time and reproduced in an Ansible playbook or similar?

@kushaldas
Contributor

We need to complete the Xenial transition in the next few months. In SecureDrop's current state, I don't see how we could be ready to move to a read only file system before Trusty EOLs. What do you think?

I am not saying we should do this before moving to Xenial, but we should keep our options open for the future. And a reinstall can be part of that story (in the long term).

@zenmonkeykstop
Contributor

zenmonkeykstop commented Nov 29, 2018

Also, I feel that no matter what, this won't be an unattended update. The workflow could go something like this:

  1. we push out an updated securedrop-admin with an os-upgrade task that invokes Ansible to do the upgrade
  2. when they're ready, admins:
    1. back up their instance
    2. run securedrop-admin os-upgrade

(Actually, the os-upgrade task should probably just do the backup automatically, or prompt for it.)
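A minimal sketch of what that could look like from the Admin Workstation; the os-upgrade action is hypothetical and does not exist yet:

```bash
# Hypothetical admin flow; `os-upgrade` is a placeholder action.
cd ~/Persistent/securedrop
./securedrop-admin backup        # take a backup first (or have os-upgrade prompt for one)
./securedrop-admin os-upgrade    # would invoke an Ansible playbook to perform the upgrade
```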

(One thing that would be cool to have from the support perspective is a way to verify the OS in use by a given instance, so we could have a Nagios-style check for it like the one for the SD version. That could have security implications, though the OSes an SD instance is likely to be running are hardly a secret.)
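A hypothetical sketch of such a check, emitting a Nagios-style status line for the running release (no such check exists today):

```bash
#!/bin/bash
# Hypothetical check: report the running Ubuntu release in Nagios plugin style.
release="$(lsb_release -rs 2>/dev/null || echo unknown)"
if [ "$release" = "16.04" ]; then
    echo "OK - Ubuntu ${release}"
    exit 0
else
    echo "WARNING - Ubuntu ${release} (expected 16.04)"
    exit 1
fi
```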

@zenmonkeykstop
Contributor

@ninavizz
Member

ninavizz commented Feb 6, 2019

Tagging myself to get this on my radar.

@redshiftzero
Contributor

I removed #3208 from this epic because it needs further discussion and does not need to be coupled to this issue. Closing this epic as the other work has been completed.

redshiftzero unpinned this issue Apr 16, 2019