
[DISCUSSION] GCHP needs a Continuous Integration (with a build matrix) #43

Closed
JiaweiZhuang opened this issue Sep 7, 2019 · 30 comments

@JiaweiZhuang

Problem

The difficulty of building GCHP (despite large improvements over early versions) is preventing user adoption and consuming a lot of engineering time (e.g. on debugging makefiles). In particular, it is time-consuming to diagnose compiler- and MPI-specific problems, as there are so many combinations of them.

So far this problem has been treated passively -- we stick to the very few combinations we know work (notably ifort + OpenMPI 3). Other combinations (gfortran, other MPIs) are handled case by case, typically when a user hits a bug on a specific system.

Suggestions

The sustainable way (which will save a lot of engineering time in the long term) is to deal with this problem actively -- we should explore all common combinations using the build matrix feature offered by most Continuous Integration (CI) services.

The components of the build matrix include:

  • compiler (and compiler version, e.g. gfortran vs. ifort)
  • MPI implementation and version
  • NetCDF version
  • base OS

By having a continuous build at every commit / every minor release, we will be able to:

  • know which combination works and which doesn't
  • avoid breaking the combinations that already work
  • try to make more combinations work correctly

This also helps users find the "shortest path" to solve their specific error. An example question is "my build is failing on Ubuntu + gfortran + mpich; which component should I change to fix the problem?" By looking at the matrix, you can see that (for example) changing the MPI can lead to a successful build.

Where to start

A simple CI setup (on Travis) for GC-Classic is geoschem/geos-chem#11. However, the memory & compute limits on Travis probably won't allow building GCHP. Other potentially better options are:

  • Azure pipelines (free)
  • GitHub actions (free)
  • AWS CodeBuild and CodePipeline (costs money, but allows more compute. It could potentially grab input data from S3 to actually run the model -- GCHP has several run-time bugs that cannot be detected at compile time.)

Tutorial-like pages:

Existing models for reference:

  • CLIMA is the only Earth science model I am aware of that has continuous integration (on Azure pipelines).
  • Trilinos uses a trilinos-autotester bot to run tests on PRs. I guess it runs on an on-premises cluster.
@JiaweiZhuang

Better wait until the CMake update? @LiamBindle probably knows better.

@LiamBindle commented Sep 9, 2019

The build matrix would be especially useful. That would help us identify versions of our dependencies and combinations of dependencies (if any) that are broken. That would also help simplify the "GCHP Quick Start Guide", which is a bit daunting right now.

It would probably be best to wait until the CMake update IMO, for the following reasons: (1) gchp_ctm is dropping support for GNU Makefiles, so we would have to set up the CI again in a few months, (2) ESMF becomes an external library, which significantly reduces compile time and the build's memory requirements, and (3) setup with CMake should be faster because there is less reliance on environment variables.

I'll look into getting this going for gchp_ctm.

@yantosca commented Sep 9, 2019

Maybe as a first step we could try to set up a build matrix for GC-Classic, just to get going. That would use everything except the MPI. Then we could translate that over to GCHP once the CMake transition there is complete. Just a thought...

@LiamBindle commented Sep 9, 2019

That's a good idea.

If no one else plans to take this on (or needs it urgently), I could start looking at it in my downtime. It sounds like a fun side project. I'm pretty busy right now though, so I'm not sure when I'll be able to get around to it.

Are you looking at picking this up @JiaweiZhuang?

@yantosca commented Sep 9, 2019

@LiamBindle, I might be able to help you out with this as well, as time allows. If we make a container, we should store it on the GCST dockerhub site so that we can all have access to it.

I've recently been trying to fix a lot of the GCPy issues and make the updates to the benchmark output that were requested by the GCSC. But if you need a hand, I could probably find some time.

@LiamBindle

@yantosca, sounds good. I'll keep you posted on any progress I make then.

@JiaweiZhuang commented Sep 9, 2019

Maybe as a first step we could try to set up a build matrix for GC-Classic

It will be a good first exercise! I guess testing different GCC versions would be useful. But the biggest use case for this is indeed making the GCHP build more robust; building GC-Classic is easy.

@JiaweiZhuang

I could start looking at it in my downtime.

That would be wonderful, @LiamBindle! I would recommend trying Azure Pipelines first, as it seems quite popular right now. I am also going to use it for my package.

JiaweiZhuang pinned this issue Sep 11, 2019
@LiamBindle commented Sep 12, 2019

Hi everyone,

Disclaimer: This is just me spilling my thoughts/findings from last night

I started looking into this last night and I think it looks pretty straightforward. I thought I'd put my findings here so that others with more experience and knowledge of Docker and CI (@JiaweiZhuang, @yantosca, and others) could potentially recognize any rabbit holes/antipatterns in my plan.


Azure pipelines

I found it really easy to set up a simple Azure pipeline with a matrix. Thanks for the suggestion @JiaweiZhuang! The documentation is great, and their YouTube tutorials made setting up an account and project really easy (e.g. this one, this one, and this one).

I set up a basic azure-pipeline.yml file that runs a few commands, including gfortran --version, with a build matrix for gcc 4, 5, 6, 7, and 8. The build commands at the end don't actually work yet because my master branch doesn't have CMake support, but it's the general idea.

trigger:
- master

pool:
  vmImage: 'ubuntu-latest'

strategy:
  matrix:
    gcc4:
      containerImage: gcc:4
    gcc5:
      containerImage: gcc:5
    gcc6:
      containerImage: gcc:6
    gcc7:
      containerImage: gcc:7
    gcc8:
      containerImage: gcc:8

container: $[ variables['containerImage'] ]

steps:
- script: |
    gfortran --version
    mkdir build
    cd build
    cmake -DRUNDIR=IGNORE -DRUNDIR_SIM=standard $(Build.Repository.LocalPath)
    make -j
    make install
  displayName: 'Building GEOS-Chem'

Here is the pipeline project: https://dev.azure.com/lrbindle/geos-chem/_build?definitionId=1.

Essentially, my plan is to just replace the containerImages in the matrix with "geos-chem-build-matrix" images.


GEOS-Chem-Build-Matrix Images

I put together a simple dockerfile that builds an image with GEOS-Chem Classic's dependencies. You can find it here.

Essentially, I was thinking that I'd set up an Azure pipeline with a matrix of gcc, netcdf (pre 4.2), netcdf-c (post 4.2), and netcdf-fortran (post 4.2) versions that builds images and pushes them to Docker Hub. Then, in geos-chem's azure-pipeline.yml, we pull those images.
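For illustration, a job in that image-building pipeline could look roughly like the snippet below. This is only a sketch: the service connection name (dockerHubConnection), the BASE_IMAGE build argument, the repository name, and the tags are assumptions, not the actual setup.

strategy:
  matrix:
    gcc8:
      baseImage: gcc:8
      imageTag: gcc8-netcdf-latest
    gcc9:
      baseImage: gcc:9
      imageTag: gcc9-netcdf-latest

steps:
- task: Docker@2
  displayName: 'Log in to Docker Hub'
  inputs:
    command: login
    containerRegistry: dockerHubConnection   # a Docker Hub service connection (assumed)
- script: |
    # BASE_IMAGE is assumed to be an ARG in the Dockerfile
    docker build --build-arg BASE_IMAGE=$(baseImage) \
      -t liambindle/geos-chem-build-matrix:$(imageTag) .
    docker push liambindle/geos-chem-build-matrix:$(imageTag)
  displayName: 'Build and push the build-matrix image'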


Further steps

First, I'm going to get the "geos-chem-build-matrix" pipeline going. I thought I would start with building images with all gcc major versions, and the latest HDF5, NetCDF-C, and NetCDF-Fortran libraries. I'm planning to push these to DockerHub/liambindle/gcc-netcdf-c-netcdf-fortran with tags like "5-4.7.1-4.4.5" for the gcc, netcdf-c, and netcdf-fortran versions.

The second step will be creating a branch on GEOS-Chem like feature/AzurePipeline with an azure-pipeline.yml file similar to the one I posted above.


Next week I'm going to be focusing on prepping for our visit to Goddard, so after today, it will probably be about 2-3 weeks before I can pick this up again. I thought I should get this down to make picking it back up in a few weeks easier.

@yantosca

This is great, Liam! I think using Azure pipelines is a good move.

@JiaweiZhuang commented Sep 12, 2019

@LiamBindle That's really fast work!

their youtube tutorials made setting up an account and project really easy.

Oh, I didn't notice this before. They have some great stuff.

I put together a simple dockerfile that builds an image with GEOS-Chem Classic's dependencies.

Great, just quick comments: LiamBindle/geos-chem-build-matrix#1 LiamBindle/geos-chem-build-matrix#2

Essentially, my plan is to just replace the containerImages in the matrix with "geos-chem-build-matrix" images.

I think we should reuse Docker images more cleverly. The build matrix can become very large. We might end up having something like:

  • 2 base OS
  • 3 gcc versions
  • 8 MPI variants and versions

That's 2x3x8=48 parallel builds. Alternatively, we can pick some representative combinations instead of trying all possible cases, but 20+ builds would still be normal.

Some points to consider:

  • We shouldn't rebuild the MPI & NetCDF libraries every time we build GEOS-Chem. Do you know whether Azure Pipelines caches images? Or can it pull pre-built images from DockerHub / other container registries so you don't need to rebuild them every time?
  • When testing different versions of MPI, we can use a single installation of the NetCDF libraries, to save time & space. (I don't think GCHP is using the MPI features in NetCDF?)
  • To avoid maintaining too many images, a single Docker image might contain several compilers and libraries, managed by Spack environments. On the other hand, if you put all the libraries into a single image, it would be very large and take forever to build (the build can't be parallelized). A reasonable compromise is a single compiler version per image (the GCC images you chose are a good starting point), with NetCDF + the 8 MPI variants built inside each image; see the sketch below. Or maybe there are more clever ways.
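A minimal sketch of that single-compiler-per-image idea, assuming Spack's standard setup-env.sh workflow (the container tag and package specs are illustrative, and in practice this would run once while building the base image rather than in every CI job):

container: gcc:9

steps:
- script: |
    git clone --depth 1 https://github.com/spack/spack.git
    . spack/share/spack/setup-env.sh
    spack compiler find                  # register the image's gcc
    spack install netcdf-fortran         # pulls in netcdf-c and hdf5
    spack install openmpi@3.1.4
    spack install mpich
  displayName: 'Install NetCDF and two MPI variants with Spack'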

@LiamBindle commented Sep 12, 2019

Thanks for the feedback!

@JiaweiZhuang, thanks, I thought you would probably have some good insight because I think you have a lot more experience with containers than I do. I think maybe what I'm thinking wasn't quite clear, though.

Or can it pull pre-built images from DockerHub / other container registries so you don't need to rebuild it every time?

This is what I am thinking. Essentially, LiamBindle/geos-chem-build-matrix is a project that builds and pushes such images to Docker Hub. We can then pull these images in geoschem/geos-chem for our build matrix pipeline. Because there are many possible combinations, as you mentioned, I'm working on a pipeline that uses a matrix to build all these prebuilt images and push them to Docker Hub.

Once those are working, we can set up an Azure Pipeline for geoschem/geos-chem that pulls these prebuilt images with a matrix. This should be easy once there are images we can just pull from Docker Hub.

I think that when I get the prebuilt image pipeline working, it would be good to make the images more complex (e.g. separate images for each gcc, and then install multiple netcdf versions in each image using Spack).

What do you think?

@JiaweiZhuang commented Sep 12, 2019

builds and pushes images to DockerHub.

Oh, I thought you would be building the images on the fly before building GEOS-Chem (I didn't see DockerHub-related commands in your repo). But now I see your plan.

From the "Docker task" section of the Azure docs, I see that:

Use this task in a build or release pipeline to build and push Docker images to any container registry using Docker registry service connection.

The Docker Registry service connection can be either Azure Container Registry or Docker Hub.

If I understand correctly, this "Docker task" is mainly for publishing/deploying/pushing Docker images (say, to get around the resource limit of DockerHub's own build).

Then, the next stage is using Container jobs to pull the pre-built images (containing MPI libraries, etc.) in order to build GCHP.

@JiaweiZhuang commented Sep 12, 2019

Basically, there are two independent steps:

1. Build the base images containing NetCDF and MPI libraries.

This can be done on DockerHub itself, or on Azure pipelines, or on many other platforms (even on local machines). This is mostly what your current repo (https://github.com/LiamBindle/geos-chem-build-matrix) is doing. This step only needs to be done once, and rarely needs a rebuild.

One major question for this step is how to manage all variants of images and minimize redundant installation of libraries. A useful reference for handling image dependencies is Jupyter's docker stack: https://github.com/jupyter/docker-stacks (it has a deep dependency chain)

This step needs its own repo. Maybe it will eventually be merged into https://github.com/geoschem/geos-chem-docker.

2. Actually run CI inside the images built in step 1

This is where the "build matrix" is defined. Although you can still have a "build matrix" in step 1, it won't be as big as the matrix here (a single image might contain multiple libraries).

Azure pipelines will be the primary choice, as it provides more resources than Travis & Docker Hub.

This step needs to add CI config files to GEOS-Chem's code repo, so the build can be triggered at every commit. GCHP's code repo should keep a tag/hash pointing to a GC-Classic commit.

@JiaweiZhuang commented Sep 12, 2019

In terms of the resource limit on Azure Pipelines, I saw in the Parallel jobs documentation that:

Public project: 10 free Microsoft-hosted parallel jobs that can run for up to 360 minutes (6 hours) each time, with no overall time limit per month.

So the matrix can't be too big if we stick to the free plan :) But 6 hours is very long, compared to 50 minutes on Travis and 2 hours on Docker Hub. A single job can probably finish 3~4 GCHP builds (another reason why merging multiple libraries into one image is useful).

As a starting point, let's just build 8~10 environments independently, choosing from:

  • Compiler: gcc 7.x, 8.x, 9.x
  • MPI: OpenMPI 3.x, Intel MPI, MVAPICH, MPICH

@JiaweiZhuang commented Sep 12, 2019

@LiamBindle brought up some good ideas in a conversation:

  1. We should also test NetCDF 4.1.x, which is the last version containing both the C and Fortran parts in a single library.
  2. For GC-Classic, we should also test CMake in addition to Make. This helps detect issues like [BUG/ISSUE] GEOS-Chem 12.6.0 with CMake encounters seg fault when compiling ocean_mercury_mod.F geos-chem#64
  3. Instead of a full "build matrix", an easier & cheaper setup is "perturbing a single component" at a time; otherwise the total number of builds grows too quickly. For GC-Classic, a "standard setup" could be GCC 8.x + NetCDF 4.6 + CMake + Debian. Then, explore each component (see the sketch after this list) by:
  • Changing GCC to 4.x, 5.x, 6.x, 7.x, 8.x, 9.x
  • Changing NetCDF to 4.1.x
  • Changing CMake to Make
  • Changing base OS to CentOS
    (that's exactly 10 builds in total)
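For illustration, that "perturb one component at a time" setup could be expressed with a matrix roughly like the one below (the container image names are hypothetical placeholders for prebuilt build-matrix images, and only 5 of the 10 builds are shown):

strategy:
  matrix:
    standard:        # GCC 8.x + NetCDF 4.6 + CMake + Debian (the baseline)
      containerImage: buildmatrix:gcc8-netcdf4.6-debian
      buildSystem: cmake
    gcc9:            # perturb the compiler
      containerImage: buildmatrix:gcc9-netcdf4.6-debian
      buildSystem: cmake
    netcdf41:        # perturb NetCDF
      containerImage: buildmatrix:gcc8-netcdf4.1-debian
      buildSystem: cmake
    gnumake:         # perturb the build system
      containerImage: buildmatrix:gcc8-netcdf4.6-debian
      buildSystem: make
    centos:          # perturb the base OS
      containerImage: buildmatrix:gcc8-netcdf4.6-centos
      buildSystem: cmake

container: $[ variables['containerImage'] ]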

For GCHP, we can reduce the number of compiler variants (older gcc won't work anyway; probably just test 8.x and 9.x), and add a new "MPI" dimension.

@LiamBindle

Hi everyone,

Just following up on where I got earlier today.

Pipeline for generating build matrix images

I got an initial version of a pipeline for building the build matrix images done, and it can be found here: LiamBindle/geos-chem-build-matrix. Because it was easiest, this initial version only builds the following images: the latest netcdf-c and netcdf-fortran with gcc4, gcc5, gcc6, gcc7, gcc8, and gcc9, as well as netcdf4.1 with gcc7. These should be a good start, and we can build on them to cover multiple versions of glibc and different base OSes as next steps.

[Screenshot from 2019-09-12 22-09-42]

A build matrix pipeline for GEOS-Chem Classic

As a test, I set up a build matrix pipeline on the feature/AzurePipeline branch of LiamBindle/geos-chem. The pipeline can be found here: https://dev.azure.com/lrbindle/geos-chem/_build/results?buildId=35. Right now, build tests are running on all 7 images (i.e. all major GCCs, and GCC 7 with the old NetCDF) for Standard, TOMAS, TransportTracers, Hg, and complexSOA_SVPOA with APM. This is what it looks like:

[Screenshot from 2019-09-12 22-23-43]

I wasn't able to do this on a feature branch of the official GEOS-Chem repo because I don't have the proper permissions. To set this up for GEOS-Chem, we just need to create a GEOS-Chem organization on Azure DevOps, create a GEOS-Chem project, create a new pipeline, and then replace the default azure-pipeline.yml file with azure-pipeline.yml.

I'm going to be focusing on prepping for our Goddard visit for the next week, but I'll be happy to help set this up once I'm back.

@JiaweiZhuang commented Sep 13, 2019

@LiamBindle Thanks for the update!!

Quick question: I cannot access the link https://dev.azure.com/lrbindle/geos-chem/_build/results?buildId=35. Got "401 - Uh oh, you do not have access." after logging into my Azure account. Would you be able to make the link public (without requiring log-in)?

@LiamBindle

Whoops, sorry about that. Done!

lizziel unpinned this issue Sep 17, 2019
@JiaweiZhuang commented Sep 25, 2019

Just noticed LLNL's Building, Linking and Testing (BLT) framework (https://github.com/LLNL/blt). Not sure if that's useful for our task, but @LiamBindle might find it interesting:

BLT is a streamlined CMake-based foundation for Building, Linking and Testing large-scale high performance computing (HPC) applications.

@LiamBindle commented Sep 27, 2019

@JiaweiZhuang Thanks for mentioning it.

I've come across BLT before, but personally, I don't think a CMake macro library like BLT would benefit GEOS-Chem/GCHP. The new MAPL already depends on ecbuild. It may be that larger projects find these libraries convenient, but I think that for GEOS-Chem, vanilla CMake is clearer and easier to maintain, for now at least.

@JiaweiZhuang commented Oct 18, 2019

Should we move the discussion to #1?

Update: Will use #1 to track new GCHP development. General discussions will still be posted here.

@JiaweiZhuang commented Oct 23, 2019

@LiamBindle Just noticed a paper on CI/CD for HPC using Jenkins and Singularity containers: Continuous Integration and Delivery for HPC: Using Singularity and Jenkins

Following a similar idea, it is entirely possible to develop a build-run-plot CI pipeline that automates the entire GEOS-Chem benchmarking workflow, with additional benefits like performance monitoring (see the pydata benchmarks for example). For the "delivery" side, the same pipeline can be further extended to build containers/AMIs/Conda packages/Spack packages for GC-Classic/GCHP, for general science users.
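A rough skeleton of such a build-run-plot pipeline, expressed as Azure Pipelines stages (the stage names and script bodies are placeholders, not a working configuration):

stages:
- stage: Build
  jobs:
  - job: build_gchp
    steps:
    - script: echo "configure and compile GCHP here"
- stage: Run
  dependsOn: Build
  jobs:
  - job: run_benchmark
    steps:
    - script: echo "run a short benchmark simulation here"
- stage: Plot
  dependsOn: Run
  jobs:
  - job: make_plots
    steps:
    - script: echo "make benchmark plots (e.g. with GCPy) and publish them as artifacts"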

CI/CD is rarely done for huge HPC codebases, but I think you have the technical capability to make it work, and the same workflow could then be adopted by other models. It could become a serious research project and lead to a GMD-style publication, if you'd like to continue working on build systems.

References:

The actual framework (AWS vs. Azure vs. open-source) probably doesn't matter that much, because the high-level logic of those tools is very similar, and they all face the same challenges, such as how to efficiently pull GEOS-Chem's large input data into a container environment.

@LiamBindle commented Oct 24, 2019

That's very cool. Thanks for the info!

I think run/plot pipelines would be pretty doable with a self-hosted agent or two. Self-hosting agents might also make ifort licensing easier. Right now though, I don't think I have the capacity to take something like this on, but I think this would be a really cool project for someone to pick up!

I'm going to submit a CI PR for GEOS-Chem Classic this afternoon. Looking forward to hearing everyone's feedback.

@JiaweiZhuang commented Oct 24, 2019

I think run/plot pipelines would be pretty doable with a self-hosted agent or two. Self-hosting agents might also make ifort licensing easier.

Indeed. That CI/CD paper also runs on-prem, which is why they use Singularity instead of Docker, due to permission issues on local HPC systems. Running the pipeline on Harvard Odyssey/Cannon would simplify data movement, but that also has downsides such as limited compute resources, frequent system maintenance and updates, etc.

but I think this would be a really cool project for someone to pick up!

Yeah I just put the ideas here. Nothing urgent at all!

@JiaweiZhuang commented Oct 28, 2019

In today's telecon, @sdeastham suggested using fake/null variables in ExtData to minimize the input data size, so we can actually run the model (not just build it) as part of the CI pipeline. This would catch many of GCHP's run-time errors that cannot be detected at build time.

A long time ago I tried putting /dev/null in an ExtData.rc entry to skip reading a file. Will this set the variable to 0 globally? How do we set a global constant other than zero? @lizziel will look it up.

As long as we can shrink the input data to less than 1 GB, we can package those data into a container image, without having to write scripts to pull input data on the fly.

Unlike a real benchmark, the simulation here doesn't have to be scientifically meaningful, as the goal is just to catch bugs. Running a real benchmark on CI would be a good next step, but let's just get a synthetic simulation running for now.
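A hedged sketch of what that run step could look like once a stripped-down run directory (with /dev/null entries in ExtData.rc) is baked into the container image; the run-directory path, executable name, and MPI flags below are assumptions, not the actual setup:

steps:
- script: |
    cd /opt/gchp-rundir                 # hypothetical run directory inside the image
    # 6 ranks = one per cubed-sphere face; --oversubscribe (an OpenMPI flag) may be
    # needed on hosted agents with fewer than 6 cores
    mpirun --oversubscribe -np 6 ./gchp | tee gchp.log
  displayName: 'Run a short synthetic GCHP simulation'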

@lizziel commented Oct 28, 2019

From the MAPL manual (2014 draft):

FileTemplate: The full path to the file. The actual filename can be the real file name or a grads style template. In addition you can simply set the import to a constant by specifying the entry as /dev/null:realconstant. If no constant is specified after /dev/null with the colon the import is set to zero.

@JiaweiZhuang

Thanks @lizziel! So /dev/null:realconstant would be like /dev/null:3.0, /dev/null:5.0, etc.?

Is there a similar feature in HEMCO for GC-classic?

@lizziel commented Oct 7, 2020

Continuous integration (build only) is available with GCHP 13.0.0, so I will close this issue. However, some of the discussion here is relevant for continuing to develop the CI capability to include running the model. I am therefore porting this issue to the GCHPctm repository, which replaced this GCHP repository with the release of 13.0.0.

lizziel transferred this issue from geoschem/gchp_legacy Oct 7, 2020
lizziel closed this as completed Oct 7, 2020
@rscohn2 commented Oct 23, 2020

Here are some sample configurations for using Intel compilers in public CI systems: https://github.com/oneapi-src/oneapi-ci
