
Standardize and document a labelling scheme for Jenkins nodes #93

Open
smlambert opened this issue Dec 12, 2017 · 17 comments
@smlambert
Contributor

As we add more machines, and write more pipeline scripts for various builds and tests in Jenkins, it is useful to settle on a labelling scheme that will allow us flexibility and improved machine management (even taking advantage of some of the Jenkins APIs for automated labelling).

Benefits to having and documenting a 'scheme':

  • as new machines and machine capabilities are added, it is clear how to add and organize new labels and sub-categories
  • avoids label duplication
  • prevents jobs from running on machines where we do not want them to run
  • adds flexibility in pipeline scripts
  • speeds up failure triage, as it becomes easier to see that a test fails on machines with a particular labelled attribute but passes on other machines (with a different label set); for example, a pipeline script might run tests on sw.os.windows (not caring about the version), but it would be interesting to note if failures appear only on sw.os.windows.10 machines
  • allows a certain level of machine sanity checking, especially if we automate the labelling via the Jenkins APIs, so that we can compare the expected machine config (via ansible) with the actual machine config (via calls to the Jenkins API, e.g. http://javadoc.jenkins-ci.org/hudson/model/Node.html#getAssignedLabels--)
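The sanity-check idea in the last bullet (comparing the labels we expect a node to have, e.g. derived from its ansible inventory, with the labels it actually reports) could look something like the minimal sketch below. This is illustrative Python, not anything from the thread; the helper name is hypothetical, and fetching the actual labels from the Jenkins API is out of scope here.

```python
def label_drift(expected, actual):
    """Return (missing, unexpected) label sets for a node.

    `expected` comes from the machine's intended config (e.g. ansible);
    `actual` would come from Node.getAssignedLabels() via the Jenkins API.
    """
    expected, actual = set(expected), set(actual)
    return expected - actual, actual - expected

missing, unexpected = label_drift(
    ["sw.os.ubuntu.16", "hw.arch.x86", "ci.role.test"],
    ["sw.os.ubuntu.16", "hw.arch.x86", "ci.role.perf"],
)
# missing == {"ci.role.test"}, unexpected == {"ci.role.perf"}
```

A non-empty result on either side would flag the node for re-running ansible or correcting its Jenkins configuration.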

I suggest the schema 'tree' below, categorizing labels under three top-level roots, hw, sw, and ci (a continuous-integration catch-all for all groupings that are neither hw nor sw), each with its own sub-categories.

hw.platform.zlinux
xlinux
plinux
windows
aix
zos
osx

hw.arch.s390
ppc
x86

hw.endian.le

hw.bits.64bit
32bit

hw.physical.cpu.xx
disk.xx
memory.xx

sw.os.rhel.6
sw.os.rhel.7
sw.os.ubuntu.14
sw.os.ubuntu.16
sw.os.sles.11
sw.os.sles.12
sw.os.aix.6
sw.os.aix.7
sw.os.osx.10
sw.os.windows.8
sw.os.windows.10
sw.os.zos.1_13 (where dots in version numbers are represented by _ )
sw.os.zos.2_1

sw.tool.gcc.xx (where xx represents the version number)
sw.tool.docker.xx
sw.tool.hypervisor.kvm, etc

ci.role.perf
ci.role.compile
ci.role.test
ci.role.test.jck

We could just start with the labels that are of direct use to build/test scripts and add as we see the need.

Do people have strong thoughts on the matter? In general, labels are of no consequence to people unless they are writing scripts and/or adding builds to Jenkins, so if you are actively working on builds, please share your thoughts. Thanks.

@smlambert
Contributor Author

ci.sponsor.ibm
ci.sponsor.ljc
ci.sponsor.joyent
etc

@karianna karianna added this to Backlog in infrastructure Dec 12, 2017
@tellison
Contributor

Thanks for bringing order to the chaos @smlambert ! I'm +1 on the idea of structured labels.
A few minor comments:

  • Although the labels 'look' hierarchical, they are not interpreted as such; people will have to be aware that, for example, specifying sw.tool.docker won't match a machine labelled only as sw.tool.docker.1_7_1. We may need multiple labels on a node where such details are available.

  • Not sure that we need the physical segment, or if we want it let's use it consistently. So (hw.physical.cpu.4 and hw.physical.endian.be) or (hw.cpu.4 and hw.endian.be) but not a mixture. I think the hw implies physical so I would just drop it.

  • What is the size of the disk you are referencing? Presumably the actual size of the workspace disk; not the free space, or minimum size required etc. Likewise for memory. Not too sure how the scripts will use this physical size info as it won't represent available storage.

  • Is the platform segment designed as a shorthand for other tag groups? i.e. why have hw.platform.zos and sw.os.zos? If the scripts care about CPU architecture then they would specify e.g. hw.arch.s390x and if they care about the OS they would specify e.g. sw.os.osx.10

  • really pedantic now ;-) hw.bits.64bit -> hw.bits.64.

So your list, updated with those comments for your consideration, becomes:

hw.arch.s390x
ppc64le
x86

hw.endian.le
be

hw.bits.64
32

hw.cpu.xx
hw.disk.xx (workspace disk size in Gb)
hw.memory.xx (size in Gb)

sw.os.rhel.6
sw.os.rhel.7
sw.os.ubuntu.14
sw.os.ubuntu.16
sw.os.sles.11
sw.os.sles.12
sw.os.aix.6
sw.os.aix.7
sw.os.osx.10
sw.os.windows.8
sw.os.windows.10
sw.os.zos.1_13 (where dots in version numbers are represented by _ )
sw.os.zos.2_1

sw.tool.gcc.xx (where xx represents the version number)
sw.tool.docker.xx
sw.tool.hypervisor.kvm, etc

ci.role.perf
ci.role.compile
ci.role.test
ci.role.test.jck
ci.sponsor.ibm
ci.sponsor.ljc
ci.sponsor.joyent
etc

@smlambert
Contributor Author

smlambert commented Dec 13, 2017

I like your suggestions @tellison 👍

You are correct that a script looking for a label named "sw.tool.docker" would not find a machine labelled only as sw.tool.docker.1_7_1 (and bear in mind, we may decide in this discussion that we do not care about version numbers at all for some tools/categories). But when I am writing a script for a job that doesn't care about the version and just wants any machine that has docker installed, it would look for label.contains("sw.tool.docker") or label.startsWith("sw.tool.docker"), rather than label.equals("sw.tool.docker")...
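As an illustration of that matching rule, here is a minimal sketch (Python rather than pipeline Groovy, and the helper name is my own invention, not anything defined in this thread). A prefix check with a trailing dot sits between the over-strict equals and the over-loose contains:

```python
def has_label_prefix(node_labels, prefix):
    """True if the node has the exact label, or any label nested under it."""
    return any(l == prefix or l.startswith(prefix + ".")
               for l in node_labels)

has_label_prefix(["sw.tool.docker.1_7_1"], "sw.tool.docker")  # True
has_label_prefix(["sw.tool.docker.1_7_1"], "sw.tool.gcc")     # False
```

Matching on `prefix + "."` rather than the bare prefix avoids accidentally matching a hypothetical sibling label such as sw.tool.dockerfoo, which a plain startsWith or contains would pick up.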

I have also discussed with others the case where I am looking for a machine with memory, or a version, greater than a certain value; and well, yes, since labels are returned as strings, I have to parse out the memory or version portion and convert it to an int to compare, but it's scriptable...
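That parsing step could be sketched as below (illustrative Python, assuming the thread's convention that _ stands in for dots in version numbers; the helper name is hypothetical):

```python
def label_version(node_labels, prefix):
    """Extract the numeric suffix of a label under `prefix` as an int tuple.

    Works for single numbers (hw.memory.16) and for versions where '_'
    stands in for dots (sw.os.zos.2_1).
    """
    for label in node_labels:
        if label.startswith(prefix + "."):
            suffix = label[len(prefix) + 1:]
            return tuple(int(part) for part in suffix.split("_"))
    return None

label_version(["sw.os.zos.2_1"], "sw.os.zos")  # (2, 1)
label_version(["hw.memory.16"], "hw.memory")   # (16,)
```

Returning a tuple means versions compare correctly out of the box: (2, 1) > (1, 13), whereas a naive string comparison would rank "1_13" above "2_1" in some schemes.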

For the hw labels like disk, memory, etc., there are some tests designed to run on machines with a certain number of cores, or with a certain amount of memory; but agreed that if we decide we want these labels, it should be clear from the name what is meant by them, or it is best not to have them at all.

As for turning chaos into order, I am mostly hoping for something more intuitive and useful than having to refer to a map to know which machines I can use, a map that starts with the sentence, "if you are reading this, then our ... labels and ... nodes are a mystery to you..." eek!
https://cwiki.apache.org/confluence/display/INFRA/Jenkins+node+labels

@sxa
Member

sxa commented Dec 13, 2017

I'd personally rather ditch the hierarchy, but maybe that's just me. plinux le ubuntu16.04 would seem like reasonable tags on their own. @smlambert are you planning to do granular searches on the tags, to have a reason for the hierarchy? I can sort of see how it might make sense for hardware values such as RAM/disk/CPU if we were going to base decisions on those. I'm not massively bothered by it, but figured I'd share my view :-)

@smlambert
Contributor Author

Thanks @sxa555, I definitely have a use case for querying all tools on a given machine (return all sw.tool.* labels per machine), where I may not know the full set of tools in the list of labels (especially if some have been newly added via ansible and, eventually, automatically added to the Jenkins node via its API), so I cannot ask for each one specifically.

Though to your point, we wouldn't need the deep hierarchy there; we could just have tools.docker, etc. But this hierarchy also serves as guidance for where to add new labels, is generally benign/inoffensive for anyone not writing scripts that use them, and possibly addresses some future uses of these categories.

"Print me today's map of the Jenkins machine farm" - what hardware is represented in it? what software? etc.

@gdams
Member

gdams commented Dec 14, 2017

perhaps https://ci-jck.adoptopenjdk.net/ would be a good test bed to setup a labelling scheme that we could then migrate to our main jenkins?

@gdams
Member

gdams commented Mar 5, 2018

So I'm kind of with @sxa555 on this... all of this sw and ci stuff seems to be adding unnecessary "fluff" to the schema. I do agree that we need to standardise the labelling schema, as it is currently messy, and I have been working on this in ansible by auto-creating machines in Jenkins.

My proposed schema is as follows but I am happy to modify/add labels if people are unhappy:

linux ubuntu1604 x64 cloudcone x86_64 test ubuntu ubuntu16

@smlambert
Contributor Author

schema - "a representation of a plan in the form of an outline or model", "organizes categories of information and the relationships among them"

The point of the 'fluff' is to organize labels into categories AND to make it clear how to add more labels as we grow and end up needing them. It enforces naming conventions that will help avoid label-name conflicts later (when teams/companies want to merge their new features into this build farm).

I have worked on several projects where a schema was not put in place, and they ended poorly, with a flat list of less-than-meaningful words (in a variety of styles depending on who added them, becoming more irregular as time marches on, since there is no obvious guidance or pattern to follow).

I am not sure I understand the resistance to fluff. The fluff does not really cost anyone anything: on the label-producer side, the labelling is automated (so no one has to type in long names), and on the label-consumer side, pipeline scripts are set up to use the labels, and clarity in naming makes it easier to understand why a particular script is looking for a particular label.

I am happy to keep the list of labels to a minimum (only adding labels that will be 'consumed' by some script or job), but I think the labels should exemplify the 'plan'.

For the flat list linux ubuntu1604 x64 cloudcone x86_64 test ubuntu ubuntu16 there is near-duplication: x64 / x86_64, and ubuntu / ubuntu16 / ubuntu1604. What scripts or jobs are using those labels presently?

@gdams
Member

gdams commented Mar 19, 2018

the build scripts currently all use this schema

@jdekonin
Contributor

I like the updated list based on @tellison's comments, with one addition: technically the arch ppcle does not exist. Arch and endian... no?

I am not fond of the idea of labels that mean the same thing. What is the difference between x64 and x86_64? I'm not saying your use of them is invalid, I'm just interested in why one or the other.

Is cloudcone a sponsor or a service?

@gdams
Member

gdams commented Mar 25, 2018

okay so are we happy with a schema based on #93 (comment)?

@smlambert
Contributor Author

Ok, let me clean it up and document where I think we are at (in a README or wiki on this repo). I will not remove any current labelling at present; we can overlay the new schema, switch testing over, then the build scripts, then remove the old labels. I don't mind going around and doing this clean-up over the next week or two.

I also amend my initial statement about adding labels that are not used: I think we should add only those labels we actively need to differentiate machines, adding new ones only as needed.

@smlambert
Contributor Author

Working on the doc here (will replace jpg with better image shortly):
https://github.com/smlambert/openjdk-infrastructure/blob/labels/docs/jenkinslabels.md

Note that I still need to go and look at all of the 'consumers' of the existing labels, to ensure we start with the minimal set based on usage. This implies that we can, and possibly should, fix scripts that could be more logically correct (which I will do as I find them).

I believe some of the label needs relate to restrictions around where you can build, and subsequently run, some of the Linux builds (due to the 'compile on the lowest available version of gcc' story). It is unclear whether the Linux flavour labels are used elsewhere either.

jdekonin pushed a commit to jdekonin/adoptium-infrastructure that referenced this issue May 8, 2018
added csz25088.canlab.ibm.com ansible_host=9.23.30.88
adoptium#93
@karianna
Contributor

karianna commented Jun 5, 2018

Hi all - did we come to a consensus here?

@smlambert
Contributor Author

On the test side, we have.

We have in the sense that I have labelled all of the test machines with the labels I proposed and have been using those labels for a while. This allows us to use the test CI scripts at Adopt and a few other Jenkins servers / open projects that follow the same labelling schema.

Because I was not sure of consensus, I have not:

  • removed the old labels
  • added new labels on build machines (just test machines)
  • updated build scripts to use new schema

@AdamBrousseau
Contributor

@smlambert can you open a PR for a doc that has the schema that was decided upon? That will be easier to reference than trying to figure out which comment in this issue is the most correct version. I could do it if you wish, but I know you were working on a doc already (don't want to step on any toes).

@smlambert
Contributor Author

Sure, will do @AdamBrousseau, thanks for the nudge!

AdamBrousseau added a commit to AdamBrousseau/openj9 that referenced this issue Jul 17, 2018
- Change aix,ppcle,390
- Remove ubuntu version
- Update to hierarchical labels based on standardized
  schema defined in adoptium/infrastructure#93
- Also remove nestmates spec which was added by default (eclipse-openj9#2270)

Issue eclipse-openj9#1562
[skip ci]

Signed-off-by: Adam Brousseau <adam.brousseau88@gmail.com>
AdamBrousseau added a commit to AdamBrousseau/aqa-tests that referenced this issue Jul 19, 2018
- Conform to label convention outlined in
  adoptium/infrastructure#93

Signed-off-by: Adam Brousseau <adam.brousseau@ca.ibm.com>
smlambert pushed a commit to adoptium/aqa-tests that referenced this issue Jul 19, 2018
- Conform to label convention outlined in
  adoptium/infrastructure#93

Signed-off-by: Adam Brousseau <adam.brousseau@ca.ibm.com>
@sxa sxa modified the milestones: Icebox / On Hold, Backlog Mar 7, 2019
7 participants