Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Linux compatibility without breaking ARM macOS compatibility #369

Merged

Conversation

timtroendle
Copy link
Member

@timtroendle timtroendle commented May 1, 2024

Recent changes in #357 broke ARM compatibility as libnetcdf 4.7 is not available for macOS ARM. The first available version if 4.8.0 and this is also the only version that is compatible with the other dependencies.

@sjpfenninger could you please check whether this still works on linux?

Checklist

Any checks which are not relevant to the PR can be pre-checked by the PR creator. All others should be checked by the reviewer. You can add extra checklist items here if required by the PR.

  • CHANGELOG updated
  • Minimal workflow tests pass
  • Tests added to cover contribution
  • Documentation updated
  • Configuration schema updated

@timtroendle
Copy link
Member Author

@irm-codebase, could you test this on Linux? If not, can you suggest someone on Linux that could review and test?

@timtroendle
Copy link
Member Author

@sjpfenninger I am currently working on this again. I am introducing a CI test that does these checks so we don't have to do them manually.

@sjpfenninger
Copy link
Member

Great. Also running this on linux right now, it seems to be fine so far. All the envs built.

Copy link
Member

@sjpfenninger sjpfenninger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works in local testing

.github/workflows/environments.yaml Outdated Show resolved Hide resolved
.github/workflows/environments.yaml Outdated Show resolved Hide resolved
@timtroendle timtroendle force-pushed the fix-arm-compatibility branch 3 times, most recently from 5fd398e to 703bc0d Compare May 7, 2024 08:46
@timtroendle
Copy link
Member Author

In the process, I've decided to drop support for Intel macOS. Maintaining compat of ARM macOS and Linux is hard enough and Intel macOS will only become less important in the future.

@timtroendle
Copy link
Member Author

@brynpickering, the new tests took 4 minutes to run. Do we really want that?

Possibly, I could split the job into several jobs so they run in parallel.

Copy link
Member

@sjpfenninger sjpfenninger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait... I spoke too soon..., this fails again in the same old way on linux:

Job 101: Extract national potentials and cost from raw biofuel data.

[...]

from ._netCDF4 import *
ImportError: f6f640ac8f16d5d71ef1d7e200f26108_/lib/python3.9/site-packages/netCDF4/../../../libnetcdf.so.19: undefined symbol: H5Pset_fapl_ros3

@sjpfenninger
Copy link
Member

I think it's good to test that the environments are installable, but this example shows that it's not enough - the failure here is some arcane library import issue that only occurs when the library is actually imported. I wonder if at least for cases like this, an additional test that simply imports all libraries, would be useful.

@timtroendle
Copy link
Member Author

Maybe. I wonder whether what we are seeing here is a bug. This should not happen, right? So I wonder if it's worth to run complex tests for such a case.

Much worse is the fact that this is breaking linux compat. That means we either break linux or mac. A real fix would require updating many more dependencies.

@brynpickering
Copy link
Member

brynpickering commented May 7, 2024

@brynpickering, the new tests took 4 minutes to run. Do we really want that?

Base conda env creation is probably the reason. You could cache it which would save time.

EDIT: actually that doesn't seem to be the reason. environment.yaml is slow to build...

@timtroendle
Copy link
Member Author

Well they are all kinda slow and because they run in sequence in just adds up. Interestingly, the ARM build was a lot faster.

@brynpickering
Copy link
Member

You could add another matrix level which is the environment file name, then they would run in parallel jobs. It would just be a longer list to look through in the checks.

@timtroendle
Copy link
Member Author

That's exactly what I am currently preparing :D

@timtroendle
Copy link
Member Author

Ok this is much faster. It clutters the list of checks slightly, but it's in fact cleaner too.

I will now try to find a solution to the dependency problem.

h5py was introduced in 7b7bc8b. This was a mistake, as h5py is not and was never a dependency in any env.
@timtroendle
Copy link
Member Author

timtroendle commented May 7, 2024

This is a mess.

What I found out so far:

  • Pinning h5py was a mistake. It was never a dependency and has been introduced as a dependency.
  • libnetcdf 4.8 seems the main problem for Linux compat. Previous versions of Euro-Calliope always used <4.8 because of some third packages (maybe numpy, scipy, or gdal) that required 4.7 instead of 4.8.
  • libnetcdf <4.8 isn't available for ARM macOS
  • libnetcdf 4.8.1 doesn't fix the problem of Linux compat.
  • libnetcdf 4.9.2 fixes the problem of Linux compat. It requires many updates in other dependencies.
  • calliope 0.6.10 requires libnetcdf < 4.9

Seems like a deadlock to me.

@timtroendle
Copy link
Member Author

We could provide different versions of libnetcdf for different platforms: 4.8 for Mac, 4.7 for Linux. Snakemake is able to do that, but it would be in a hacky way.
https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#freezing-environments-to-exactly-pinned-packages

@timtroendle timtroendle marked this pull request as draft May 7, 2024 16:35
@timtroendle
Copy link
Member Author

@brynpickering, @sjpfenninger, doesn't this mean that current Calliope is broken on Linux? Or am I missing something here?

@irm-codebase
Copy link
Contributor

@brynpickering, @sjpfenninger, doesn't this mean that current Calliope is broken on Linux? Or am I missing something here?

If by that you mean v0.7, no, it works quite well (I've tested the latest dev version on my machine).

@timtroendle
Copy link
Member Author

Thanks, good to know. In fact, I thought of 0.6.10. Even here, I am pretty sure it's not broken, but then some of my bullet points above must be wrong:

  • libnetcdf <4.8 isn't available for ARM macOS
  • libnetcdf 4.8.1 doesn't fix the problem of Linux compat.
  • libnetcdf 4.9.2 fixes the problem of Linux compat. It requires many updates in other dependencies.
  • calliope 0.6.10 requires libnetcdf < 4.9

@timtroendle
Copy link
Member Author

For reference, the above bullet points were incomplete. Obviously, the problem is not in a single library but in the wrong combination of libraries. The following list works on both linux and ARM mac and should solve this issue here:

    - netCDF4=1.6.2
    - hdf5=1.12.2
    - libnetcdf=4.8.1

On Linux, there seems to be a problem with compatibilities of netcdf libraries that is not considered by the conda packages. The libraries seem to be libnetcdf and hdf5. Here, I am pinning these libraries to a combination that works with Linux, ARM Mac, and all other dependencies we have to introduce the smallest change possible.
@timtroendle
Copy link
Member Author

This (^) seems to fix the Linux compat issue but it requires massive updates in all envs that include GDAL. This is because libgdal depends on hdf5, too, and requires updates. That then requires updates in GDAL, which requires updates in fiona, rasterio, rasterstats, which then require updates in numpy, scipy, and even Python.

It would make sense to fiddle out the minimal change that works, but I won't have time before next week to work on that.

@timtroendle timtroendle changed the title Fix macOS ARM compatibility by upgrading to libnetcdf 4.8.0 Fix Linux compatibility without breaking ARM macOS compatibility May 8, 2024
@brynpickering
Copy link
Member

A nightmare... In the non-calliope envs, you could do libnetcdf>4.8 to avoid the need to update the GDAL-related things?

@brynpickering
Copy link
Member

Necessary updates don't seem to have been quite so drastic as you suggested @timtroendle. I think this is now ready for review!

@brynpickering brynpickering marked this pull request as ready for review May 10, 2024 16:34
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@timtroendle
Copy link
Member Author

Necessary updates don't seem to have been quite so drastic as you suggested @timtroendle. I think this is now ready for review!

Awesome, thanks for the work! I just love my drama. I'll run the minimal workflow on both systems to verify nothing obvious is broken.

@timtroendle
Copy link
Member Author

The minimal workflow runs on Mac (after taking care of several regressions #376 #375 #374).

I cannot test the minimal workflow on Linux as LSF has been switched off at ETH (and the cluster interface is broken, see #375).

Copy link
Member

@brynpickering brynpickering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion for an update to the workflow file, otherwise good to go.

Co-authored-by: Bryn Pickering <17178478+brynpickering@users.noreply.github.com>
@timtroendle timtroendle merged commit d62b8ce into calliope-project:develop May 14, 2024
24 checks passed
@timtroendle timtroendle deleted the fix-arm-compatibility branch May 14, 2024 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants