[JENKINS-49406] Evergreen snapshotting data safety system#1
[JENKINS-49406] Evergreen snapshotting data safety system#1
Conversation
jep/302/README.adoc
Outdated
|
|
||
| We plan to configure Jenkins to use non default values to put the workspace and builds not under the same location as job configuration itself. | ||
|
|
||
| _Workspace Root Directory_:: `${JENKINS_HOME}/workspace/${ITEM_FULLNAME}` |
There was a problem hiding this comment.
Why do you even have a workspace - executors on the master is evil and should be killed at the earliest convienience?
There was a problem hiding this comment.
JENKINS-49861 FTR. I plan to have it be possible to leave an executor on the master for use by privileged jobs (JENKINS-24513). At any rate, it is worth at least reminding readers that this setting should generally have no effect.
jep/302/README.adoc
Outdated
|
|
||
| _Workspace Root Directory_:: `${JENKINS_HOME}/workspace/${ITEM_FULLNAME}` | ||
| _Build Record Root Directory_:: | ||
| `${JENKINS_HOME}/var/${ITEM_FULL_NAME}/builds` |
There was a problem hiding this comment.
var looks a little weird. why was this chosen over something more obvious like builds
var implies you can stuff more in here - but there is not enough in the path to disambiguate from builds.
jep/302/README.adoc
Outdated
| === Why Git | ||
|
|
||
| Using filesystem-level tools offering a snapshotting feature, like LVM, ZFS or btrfs to give a few examples, was considered. | ||
| But this was evacuated because _Essentials_ vision is about providing an link:https://github.com/jenkinsci/jep/tree/71d9391744c8cc7d6595805f7fdd327eedf6811a/jep/300#automatically-updated-distribution["_easier to use_ and _easier to manage_ Jenkins environment"]. |
jep/302/README.adoc
Outdated
|
|
||
| === Man In The Middle | ||
|
|
||
| The main issue here is that an attacker could for instance instruct the evergreen client to ignore everything (by putting `*` in `.gitignore`), hence make it impossible to rollback. |
There was a problem hiding this comment.
don;t use .gitignore then. use -f to add?
move builds outside JENKINS_HOME (have a directly at the mount that is /builds /whatever /config and make "config JENKINS_HOME where the usual stuff is stored).
There was a problem hiding this comment.
don;t use .gitignore then. use -f to add?
Not convinced this fondamentally changes the issue. The goal is still to be able to upgrade the evergreen-client too. So, be it by hacking the .gitignore content, or by hacking what things are passed to -f I guess the issue is to really find robust ways to not accept instructions from some wrong instructors.
move builds outside JENKINS_HOME (have a directly at the mount that is /builds /whatever /config and make "config JENKINS_HOME where the usual stuff is stored).
Good point. I gave it some thoughts, but kept it under JENKINS_HOME to respect/be consistent with JEP-301:
There was a problem hiding this comment.
just because that is written does not mean it is correct - infact that section of JEP is bad bad bad...
why?
--pluginRoot --webroot. JENKINS_HOME is possibly mounted on a resilient storage (and is usually slower or more costly) there is no need to unpack the Jenkins plugins or war into jenkins-home...
Also logs... Hmmm if they are in jenkins home and you rollback then you loose some logs..
that;s not a good thing is it?
and well if you take it at face value...
"No other data should be written to the filesystem outside of JENKINS_HOME."
build artifacts are data... yet with evergreen you want to move these to S3 or somewhere not the filesystem. that JEP would preclude that work. //cc @rtyler
now if jep-301 wanted to say thal all data be contained under a single directory then fine (but JENKINS_HOME does not need to be at the root of that directory) IMO and if you are making a more complex system because of that descision then go revisit it...
jep/302/README.adoc
Outdated
| @@ -0,0 +1,260 @@ | |||
| = JEP-302: Evergreen snapshotting data safety system | |||
There was a problem hiding this comment.
You do not pick the number when submitting a JEP. See the instructions.
There was a problem hiding this comment.
Fixed thanks. Didn't rename the folder though to avoid GitHub folding all comments below despite me not having addressed them yet. Will do when submitting upstream.
There was a problem hiding this comment.
I think GitHub is smart enough to handle rename detection, but I can understand not wanting to test that now!
jep/302/README.adoc
Outdated
| 7. [[healthcheck]] Check it is responding in a healthy manner. | ||
| 8. If not: | ||
| .. Take a snapshot footnote:[this way, if new files were created, we don't just delete them in an unrecoverable way when going back to the previous snapshot]. | ||
| .. Rollback to the previous data snapshot and Essentials BOM version. |
jep/302/README.adoc
Outdated
| 8. If not: | ||
| .. Take a snapshot footnote:[this way, if new files were created, we don't just delete them in an unrecoverable way when going back to the previous snapshot]. | ||
| .. Rollback to the previous data snapshot and Essentials BOM version. | ||
| To do so, we will generally create a new commit (i.e. avoid `git reset --hard HEAD~`), to keep a durable track of where we went through, accessible through `git log`. |
There was a problem hiding this comment.
Yes, basically.
UPDATE: fixed
| To do so, we will generally create a new commit (i.e. avoid `git reset --hard HEAD~`), to keep a durable track of where we went through, accessible through `git log`. | ||
| .. Start (previous) Jenkins version | ||
| // what if starting the previous version doesn't work either? | ||
| 9. Report the outcome to the Essentials backend. |
There was a problem hiding this comment.
Well, I hope this is clear, but here I meant that we want to report the outcome, whatever it is, to the backend. We also very much need the result if everything goes well, to be able to push it to more instances.
jep/302/README.adoc
Outdated
|
|
||
| * Update `.gitignore` content with current Essentials release. | ||
| * `git add .` | ||
| * `git add -u ` |
jep/302/README.adoc
Outdated
| === Secrets | ||
|
|
||
| Versioning secrets is not an issue, as the snapshotting system is going to be local to the running instance. | ||
| As the Git repository data will never be pushed _outside_, no data leak is expected from this side. |
There was a problem hiding this comment.
I do not think this is acceptable. Someone will use the Git repository externally regardless of what you advertise.
Fortunately there is a simple solution: add
secrets/master.key
to .gitignore. It is created on first startup and never changed thereafter, so it would never be affected by an upgrade or downgrade. And without it, all other secrets remain (theoretically) secret.
There was a problem hiding this comment.
Well, I do not feel strongly to add it or ignore it. But well, if someone is able to access the FS and so what you are saying, in any case they have the power to read that key.
I understand what you are saying as "someone could have the bad idea to push that repo elsewhere, when it's not supposed to be, not groking the fact they're going to leak central secrets doing so, and so we'll be seen as responsible for it.".
So yes, let's be conservative and not encourage people to shoot themselves in the foot.
jep/302/README.adoc
Outdated
|
|
||
| * Apply the upgrade or downgrade, then check the instance is _running fine_: | ||
| ** No administrative monitor triggered | ||
| ** No new ERROR logs showing up |
There was a problem hiding this comment.
You mean, Level.ERROR? Hardly useful: few plugins log at this level to begin with. Almost everything of interest would be a WARNING.
There was a problem hiding this comment.
Yes and no. But your point makes me realize I really need to externalize this concept. We need a standalone design for determining Jenkins health I think. It's a critical condition/piece for many components: for testing, for evergreen-client to decide to rollback or not, etc. And it is not (necessarily) related to the data snapshotting system.
So will refactor to make this generic and point to something to be done in this regard.
Also commented in https://issues.jenkins-ci.org/browse/JENKINS-49406?focusedCommentId=332233&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-332233 about that
jep/302/README.adoc
Outdated
| ** No administrative monitor triggered | ||
| ** No new ERROR logs showing up | ||
|
|
||
| We will need to develop adhoc testing tools to be able to automatically assess the health of a Jenkins Essentials instance after an upgrade or a downgrade. |
jep/302/README.adoc
Outdated
|
|
||
| Before delivering updates on real connected instances, we must test at least the following scenarios. | ||
|
|
||
| * Apply the upgrade or downgrade, then check the instance is _running fine_: |
There was a problem hiding this comment.
I think you are missing something critical here: you need to be able to ensure that nothing interesting happening in the system between the time Jenkins completes initialization after the upgrade, and the time the health check completes. If builds run, that is probably fine (as I argue above w.r.t. build record location), though a QueueTaskDispatcher or just setQuietDown could block builds temporarily after startup. But what about users editing job or system configuration? You would need to put some kind of UI block in place, I think.
The broader question which I guess you are punting on is how useful a simplistic health check like this really is. Just starting without warnings is pretty useless if every build you try to run bombs out with some fatal low-level error. But then if you let some trial builds run, you are widening the window after which a revert could be problematic.
jep/302/README.adoc
Outdated
| Detailing this here is out of scope for this proposal. | ||
| It is however highly desirable that we centralize this logic and use it both during automated tests, and in production for the evergreen-client to automatically analyze if a product instance is healthy or is not (and decide to rollback or not, for the current matter here). | ||
|
|
||
| We will leverage the link:https://github.com/cloudbees/operations-center-acceptance-test/[Jenkins Acceptance Test Harness project] for this purpose. |
There was a problem hiding this comment.
I think you meant to link to something else.
I was not supposed to choose the number myself, obviously... Thanks Jesse for the reminder. Not renaming the folder yet, on purpose, to avoid folding all other review comments although I didn't address them yet.
Jesse is basically saying that despite the system/git repo is not designed to be pushed outside, (some) people will do so...
Was trying to use the title of a code block, starting with a dot. But here asciidoctor was confused by the .. at ..gitignore. So using the clearer form title=blah to make it work (and be less magic in the same go, which is good).
The intent here was to have a single root for builds and workspace for a given item.
|
Reminder for me, and for anyone interested, things I want to address:
If there is anything else important I missed, please don't hesitate to tell me so. |
Proposal to address Jesse's comment: > The broader question which I guess you are punting > on is how useful a simplistic health check like this > really is. Just starting without warnings is pretty > useless if every build you try to run bombs out with > some fatal low-level error. But then if you let some > trial builds run, you are widening the window after > which a revert could be problematic."
as it should have been since the beginning
jep/302/README.adoc
Outdated
| The intent is to have everything not ignored in the index. | ||
| In other words, `git status` just after a SNAPSHOT should be in the `nothing to commit, working tree clean` state. | ||
|
|
||
| We will *not* snapshot data that is too big or does not make sense to store:. |
| │ ├── nodes | ||
| │ ├── plugins | ||
| │ ├── secrets | ||
| │ ├── updates |
jep/302/README.adoc
Outdated
|
|
||
| This way, everything related to jobs, but is less important than its configuration itself is separated under a different `${JENKINS_HOME}/var/` root. | ||
|
|
||
| Though we will for now back up these directories too, having this distinction baked in from the ground up could prove useful in the future if we need to make the snapshotting and rollback processes smarter and more robust. |
There was a problem hiding this comment.
Wait, I thought we were only backing up /var/jenkins/home?
There was a problem hiding this comment.
yes, that's it. Reading, maybe I forgot to adjust a sentence.
| ====== about `/var/jenkins/home/plugins` | ||
|
|
||
| This directory contains the hpi/jpi files before extraction. | ||
| Ideally, we should be moving this elsewhere under `/var/jenkins/var/plugins`, but it's not doable yet currently (`--pluginsroot` only configures a different location for exploded plugins). |
There was a problem hiding this comment.
Careful here. Currently $JENKINS_HOME/plugins/shortName.jpi is the definitive record that a plugin was installed (there is also a shortName.jpi.disabled file if disabled), and in what version. If you do not back that up, then you are no longer versioning the Jenkins configuration.
Now I understand that for plugins in Essentials, this is exactly right—the Essentials manifest is in control of what plugins are there. (Again, excepting *.jpi.disabled files, so I think what you meant to exclude is more narrowly plugins/*.jpi.) But what about “inessential” plugins? There will be no trace in the Git history of their presence or version.
Admittedly a rollback is not going to need to revert these, either, since the Essentials upgrade would not have touched them. Just seems a little risky. People may be expecting that a Git revision (plus master.key) corresponds to a fully reconstructible Jenkins instance.
There was a problem hiding this comment.
Maybe the .gitignore can simply list those plugins actually in Essentials?
plugins/structs.jpi
# …
There was a problem hiding this comment.
People may be expecting that a Git revision (plus master.key) corresponds to a fully reconstructible Jenkins instance.
Well, then they will be deceived... It's not a backup system as we already discussed.
To me that this part is going to be out of the scope of the snapshotting system, exactly for the reason you gave: I consider that the whole plugin list is defined explicitly by the BOM, for Essentials plugin.
I would say that for others, well it's also out of scope, and it will be the evergreen client responsibility to store the versions of all plugins present there, to be able to restore the previous version if we rollback.
jep/302/README.adoc
Outdated
| === Checking Jenkins health | ||
|
|
||
| From the perspective of this proposal, this is out of scope. | ||
| But the outter _controller_ of the upgrade, the evergreen client, will need a way to decide if a rollback must be triggered or not. |
jep/302/README.adoc
Outdated
| If a version is proved to be problematic after a few days, the data snapshotting system will **not** be used. | ||
|
|
||
| This would be quite impractical because the instance probably generated actual work items during this timeframe. | ||
| So rolling back that much later would be subject to losing data. |
There was a problem hiding this comment.
also: rolling that back (separable verbs, sigh)
| Versioning secrets should not be an issue per se, as the data snapshotting system is designed to be local to the running instance. | ||
| In other words, the Git repository data will never be pushed _outside_ by the _Essentials_ code, so no data leak is normally expected from this side. | ||
|
|
||
| But as users may have the unfortunate idea to push that repository elsewhere, not being aware they are leaking all secrets, we will conservatively add `secrets/master.key` to the `.gitignore` file. |
There was a problem hiding this comment.
Would be a good place to briefly explain what this file is and why excluding it from the Git repository prevents any straightforward extraction of secrets.
There was a problem hiding this comment.
I would like to avoid (badly) repeating something that probably ought to be defined in the official docs. But can add a link.
Something like I guess:
"
Secrets are encrypted using themaster.keyfiles, so we really want to avoid attackers to be able to access bothcredentials.xmlandmaster.keyat the same time
?
|
Now submitted for official review at jenkinsci#67 |
* LOCKED => refer now to jenkinsci#67 for the official submission**
JENKINS-49406
PR to allow a first pass of reviews as I still consider this content an alpha version.
Will obviously file a PR against the final target at https://github.com/jenkinsci/jep once I feel this is ready for actual review and acceptance there.
I am well aware this might scatter feedback in two different places, but I hope to use this PR to iron out obvious issues, and officially submit upstream only once I can say there's already an agreement the proposal is decent and has been reviewed not only by myself.
Also because as per the process, I consider we are at the "Initial discussion" stage, and hence I am not supposed to submit yet.
So I'm just using a GitHub PR just for its review facilities. I could also have used Google docs or something similar for this stage of the process.
So. Feedback welcome :)
References