Skip to content

[JENKINS-49406] Evergreen snapshotting data safety system#1

Closed
batmat wants to merge 28 commits intomasterfrom
JENKINS-49406
Closed

[JENKINS-49406] Evergreen snapshotting data safety system#1
batmat wants to merge 28 commits intomasterfrom
JENKINS-49406

Conversation

@batmat
Copy link
Owner

@batmat batmat commented Mar 13, 2018

* LOCKED => refer now to jenkinsci#67 for the official submission**

JENKINS-49406

PR to allow a first pass of reviews as I still consider this content an alpha version.

Will obviously file a PR against the final target at https://github.com/jenkinsci/jep once I feel this is ready for actual review and acceptance there.

I am well aware this might scatter feedback in two different places, but I hope to use this PR to iron out obvious issues, and officially submit upstream only once I can say there's already an agreement the proposal is decent and has been reviewed not only by myself.

Also because as per the process, I consider we are at the "Initial discussion" stage, and hence I am not supposed to submit yet.
So I'm just using a GitHub PR just for its review facilities. I could also have used Google docs or something similar for this stage of the process.

So. Feedback welcome :)

References

@batmat batmat changed the title [JENKINS-49406] WIP Evergreen snapshotting data safety system WIP [JENKINS-49406] Evergreen snapshotting data safety system Mar 14, 2018
@batmat batmat changed the title WIP [JENKINS-49406] Evergreen snapshotting data safety system [JENKINS-49406] Evergreen snapshotting data safety system Mar 14, 2018

We plan to configure Jenkins to use non default values to put the workspace and builds not under the same location as job configuration itself.

_Workspace Root Directory_:: `${JENKINS_HOME}/workspace/${ITEM_FULLNAME}`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you even have a workspace - executors on the master is evil and should be killed at the earliest convienience?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JENKINS-49861 FTR. I plan to have it be possible to leave an executor on the master for use by privileged jobs (JENKINS-24513). At any rate, it is worth at least reminding readers that this setting should generally have no effect.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was a mistake @jtnord @jglick, I screwed up a copy paste when writing it.

My intent was to put both workspace and builds under the same global root, and under the same item each time.

Can you have a second look on that specific part?


_Workspace Root Directory_:: `${JENKINS_HOME}/workspace/${ITEM_FULLNAME}`
_Build Record Root Directory_::
`${JENKINS_HOME}/var/${ITEM_FULL_NAME}/builds`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var looks a little weird. why was this chosen over something more obvious like builds
var implies you can stuff more in here - but there is not enough in the path to disambiguate from builds.

=== Why Git

Using filesystem-level tools offering a snapshotting feature, like LVM, ZFS or btrfs to give a few examples, was considered.
But this was evacuated because _Essentials_ vision is about providing an link:https://github.com/jenkinsci/jep/tree/71d9391744c8cc7d6595805f7fdd327eedf6811a/jep/300#automatically-updated-distribution["_easier to use_ and _easier to manage_ Jenkins environment"].
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

evacuated? discounted?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


=== Man In The Middle

The main issue here is that an attacker could for instance instruct the evergreen client to ignore everything (by putting `*` in `.gitignore`), hence make it impossible to rollback.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don;t use .gitignore then. use -f to add?
move builds outside JENKINS_HOME (have a directly at the mount that is /builds /whatever /config and make "config JENKINS_HOME where the usual stuff is stored).

Copy link
Owner Author

@batmat batmat Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don;t use .gitignore then. use -f to add?

Not convinced this fondamentally changes the issue. The goal is still to be able to upgrade the evergreen-client too. So, be it by hacking the .gitignore content, or by hacking what things are passed to -f I guess the issue is to really find robust ways to not accept instructions from some wrong instructors.

move builds outside JENKINS_HOME (have a directly at the mount that is /builds /whatever /config and make "config JENKINS_HOME where the usual stuff is stored).

Good point. I gave it some thoughts, but kept it under JENKINS_HOME to respect/be consistent with JEP-301:

it is expected that all mutable state and data will be written and stored in JENKINS_HOME, including: Data generated by the Jenkins process. ...

Copy link

@jtnord jtnord Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just because that is written does not mean it is correct - infact that section of JEP is bad bad bad...

why?
--pluginRoot --webroot. JENKINS_HOME is possibly mounted on a resilient storage (and is usually slower or more costly) there is no need to unpack the Jenkins plugins or war into jenkins-home...

Also logs... Hmmm if they are in jenkins home and you rollback then you loose some logs..
that;s not a good thing is it?

and well if you take it at face value...

"No other data should be written to the filesystem outside of JENKINS_HOME."

build artifacts are data... yet with evergreen you want to move these to S3 or somewhere not the filesystem. that JEP would preclude that work. //cc @rtyler

now if jep-301 wanted to say thal all data be contained under a single directory then fine (but JENKINS_HOME does not need to be at the root of that directory) IMO and if you are making a more complex system because of that descision then go revisit it...

@@ -0,0 +1,260 @@
= JEP-302: Evergreen snapshotting data safety system
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do not pick the number when submitting a JEP. See the instructions.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed thanks. Didn't rename the folder though to avoid GitHub folding all comments below despite me not having addressed them yet. Will do when submitting upstream.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think GitHub is smart enough to handle rename detection, but I can understand not wanting to test that now!

7. [[healthcheck]] Check it is responding in a healthy manner.
8. If not:
.. Take a snapshot footnote:[this way, if new files were created, we don't just delete them in an unrecoverable way when going back to the previous snapshot].
.. Rollback to the previous data snapshot and Essentials BOM version.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roll back

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

8. If not:
.. Take a snapshot footnote:[this way, if new files were created, we don't just delete them in an unrecoverable way when going back to the previous snapshot].
.. Rollback to the previous data snapshot and Essentials BOM version.
To do so, we will generally create a new commit (i.e. avoid `git reset --hard HEAD~`), to keep a durable track of where we went through, accessible through `git log`.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, git revert?

Copy link
Owner Author

@batmat batmat Mar 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, basically.

UPDATE: fixed

To do so, we will generally create a new commit (i.e. avoid `git reset --hard HEAD~`), to keep a durable track of where we went through, accessible through `git log`.
.. Start (previous) Jenkins version
// what if starting the previous version doesn't work either?
9. Report the outcome to the Essentials backend.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e., let the heads start rolling

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I hope this is clear, but here I meant that we want to report the outcome, whatever it is, to the backend. We also very much need the result if everything goes well, to be able to push it to more instances.


* Update `.gitignore` content with current Essentials release.
* `git add .`
* `git add -u `
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray space breaks formatting

=== Secrets

Versioning secrets is not an issue, as the snapshotting system is going to be local to the running instance.
As the Git repository data will never be pushed _outside_, no data leak is expected from this side.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this is acceptable. Someone will use the Git repository externally regardless of what you advertise.

Fortunately there is a simple solution: add

secrets/master.key

to .gitignore. It is created on first startup and never changed thereafter, so it would never be affected by an upgrade or downgrade. And without it, all other secrets remain (theoretically) secret.

Reference

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I do not feel strongly to add it or ignore it. But well, if someone is able to access the FS and so what you are saying, in any case they have the power to read that key.

I understand what you are saying as "someone could have the bad idea to push that repo elsewhere, when it's not supposed to be, not groking the fact they're going to leak central secrets doing so, and so we'll be seen as responsible for it.".

So yes, let's be conservative and not encourage people to shoot themselves in the foot.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR, fixed.


* Apply the upgrade or downgrade, then check the instance is _running fine_:
** No administrative monitor triggered
** No new ERROR logs showing up
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean, Level.ERROR? Hardly useful: few plugins log at this level to begin with. Almost everything of interest would be a WARNING.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and no. But your point makes me realize I really need to externalize this concept. We need a standalone design for determining Jenkins health I think. It's a critical condition/piece for many components: for testing, for evergreen-client to decide to rollback or not, etc. And it is not (necessarily) related to the data snapshotting system.

So will refactor to make this generic and point to something to be done in this regard.

Also commented in https://issues.jenkins-ci.org/browse/JENKINS-49406?focusedCommentId=332233&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-332233 about that

** No administrative monitor triggered
** No new ERROR logs showing up

We will need to develop adhoc testing tools to be able to automatically assess the health of a Jenkins Essentials instance after an upgrade or a downgrade.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ad-hoc


Before delivering updates on real connected instances, we must test at least the following scenarios.

* Apply the upgrade or downgrade, then check the instance is _running fine_:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are missing something critical here: you need to be able to ensure that nothing interesting happening in the system between the time Jenkins completes initialization after the upgrade, and the time the health check completes. If builds run, that is probably fine (as I argue above w.r.t. build record location), though a QueueTaskDispatcher or just setQuietDown could block builds temporarily after startup. But what about users editing job or system configuration? You would need to put some kind of UI block in place, I think.

The broader question which I guess you are punting on is how useful a simplistic health check like this really is. Just starting without warnings is pretty useless if every build you try to run bombs out with some fatal low-level error. But then if you let some trial builds run, you are widening the window after which a revert could be problematic.

Detailing this here is out of scope for this proposal.
It is however highly desirable that we centralize this logic and use it both during automated tests, and in production for the evergreen-client to automatically analyze if a product instance is healthy or is not (and decide to rollback or not, for the current matter here).

We will leverage the link:https://github.com/cloudbees/operations-center-acceptance-test/[Jenkins Acceptance Test Harness project] for this purpose.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant to link to something else.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oooops fixed

I was not supposed to choose the number myself, obviously...

Thanks Jesse for the reminder.

Not renaming the folder yet, on purpose, to avoid folding all other
review comments although I didn't address them yet.
batmat added 15 commits March 15, 2018 11:18
Jesse is basically saying that despite the system/git repo is
not designed to be pushed outside, (some) people will do so...
Was trying to use the title of a code block, starting with a dot.
But here asciidoctor was confused by the .. at ..gitignore.

So using the clearer form title=blah to make it work
(and be less magic in the same go, which is good).
The intent here was to have a single root for builds and
workspace for a given item.
@batmat
Copy link
Owner Author

batmat commented Mar 15, 2018

Reminder for me, and for anyone interested, things I want to address:

  1. About the point above by @jglick and @jtnord on moving $JENKINS_HOME/var out of $JENKINS_HOME, I opened a dedicated thread for feedback on the dev list. I am leaning towards I'm going to adjust the proposal with this. It makes sense to me.
  2. Another important point above by @jglick which I'm summarizing as "how do we check Jenkins is actually healthy before opening the flood gates to users: letting jobs execute, people modify things through the UI, etc.". This point is probably the most critical and needs addressing.

If there is anything else important I missed, please don't hesitate to tell me so.

Proposal to address Jesse's comment:

> The broader question which I guess you are punting
> on is how useful a simplistic health check like this
> really is. Just starting without warnings is pretty
> useless if every build you try to run bombs out with
> some fatal low-level error. But then if you let some
> trial builds run, you are widening the window after
> which a revert could be problematic."
as it should have been since the beginning
The intent is to have everything not ignored in the index.
In other words, `git status` just after a SNAPSHOT should be in the `nothing to commit, working tree clean` state.

We will *not* snapshot data that is too big or does not make sense to store:.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray :

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

│   ├── nodes
│   ├── plugins
│   ├── secrets
│   ├── updates
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be ignored I think

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to .gitignore.


This way, everything related to jobs, but is less important than its configuration itself is separated under a different `${JENKINS_HOME}/var/` root.

Though we will for now back up these directories too, having this distinction baked in from the ground up could prove useful in the future if we need to make the snapshotting and rollback processes smarter and more robust.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, I thought we were only backing up /var/jenkins/home?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's it. Reading, maybe I forgot to adjust a sentence.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, removed that part.

====== about `/var/jenkins/home/plugins`

This directory contains the hpi/jpi files before extraction.
Ideally, we should be moving this elsewhere under `/var/jenkins/var/plugins`, but it's not doable yet currently (`--pluginsroot` only configures a different location for exploded plugins).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Careful here. Currently $JENKINS_HOME/plugins/shortName.jpi is the definitive record that a plugin was installed (there is also a shortName.jpi.disabled file if disabled), and in what version. If you do not back that up, then you are no longer versioning the Jenkins configuration.

Now I understand that for plugins in Essentials, this is exactly right—the Essentials manifest is in control of what plugins are there. (Again, excepting *.jpi.disabled files, so I think what you meant to exclude is more narrowly plugins/*.jpi.) But what about “inessential” plugins? There will be no trace in the Git history of their presence or version.

Admittedly a rollback is not going to need to revert these, either, since the Essentials upgrade would not have touched them. Just seems a little risky. People may be expecting that a Git revision (plus master.key) corresponds to a fully reconstructible Jenkins instance.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the .gitignore can simply list those plugins actually in Essentials?

plugins/structs.jpi
# …

Copy link
Owner Author

@batmat batmat Mar 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People may be expecting that a Git revision (plus master.key) corresponds to a fully reconstructible Jenkins instance.

Well, then they will be deceived... It's not a backup system as we already discussed.

To me that this part is going to be out of the scope of the snapshotting system, exactly for the reason you gave: I consider that the whole plugin list is defined explicitly by the BOM, for Essentials plugin.

I would say that for others, well it's also out of scope, and it will be the evergreen client responsibility to store the versions of all plugins present there, to be able to restore the previous version if we rollback.

=== Checking Jenkins health

From the perspective of this proposal, this is out of scope.
But the outter _controller_ of the upgrade, the evergreen client, will need a way to decide if a rollback must be triggered or not.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

If a version is proved to be problematic after a few days, the data snapshotting system will **not** be used.

This would be quite impractical because the instance probably generated actual work items during this timeframe.
So rolling back that much later would be subject to losing data.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammar: would risk data loss

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also: rolling that back (separable verbs, sigh)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done thanks

Versioning secrets should not be an issue per se, as the data snapshotting system is designed to be local to the running instance.
In other words, the Git repository data will never be pushed _outside_ by the _Essentials_ code, so no data leak is normally expected from this side.

But as users may have the unfortunate idea to push that repository elsewhere, not being aware they are leaking all secrets, we will conservatively add `secrets/master.key` to the `.gitignore` file.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be a good place to briefly explain what this file is and why excluding it from the Git repository prevents any straightforward extraction of secrets.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid (badly) repeating something that probably ought to be defined in the official docs. But can add a link.

Something like I guess:

"Secrets are encrypted using the master.key files, so we really want to avoid attackers to be able to access both credentials.xml and master.key at the same time

?

@batmat
Copy link
Owner Author

batmat commented Mar 21, 2018

Now submitted for official review at jenkinsci#67

Repository owner locked as resolved and limited conversation to collaborators Mar 21, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants