Updating to Keras 3.0 and migrating to PyTorch #418

IgorTatarnikov · 2024-05-10T16:18:12Z

Description

What is this PR

Bug fix
Addition of a new feature
Other

Why is this PR needed?
TensorFlow has become increasingly difficult to support (e.g. lack of GPU support on native Windows). Switching to Keras 3.0 allows us to use torch as the backend instead of TensorFlow. This will make future maintenance easier and allow us to support Python 3.11+.

What does this PR do?
Upgrades to keras 3.0.
Sets torch as the default backend.
Removes any functions related to TensorFlow (error suppression, etc.).

References

Closes #279
brainglobe/brainglobe.github.io#177
brainglobe/brainglobe.github.io#183
#266

How has this PR been tested?

All tests pass on CI, basic workflows have been tested by manual inspection.

Is this a breaking change?

No.

Does this PR require an update to the documentation?

Yes, see brainglobe/brainglobe.github.io#183

Checklist:

The code has been tested locally
The documentation has been updated to reflect any changes
The code has been formatted with pre-commit

* remove pytest-lazy-fixture as dev dependency and skip test (with WG temp fix)
* change tensorflow dependency for cellfinder
* replace keras imports from tensorflow to just keras imports
* add keras import and reorder
* add keras and TF 2.16 to pyproject.toml
* comment out TF version check for now
* change checkpoint filename for compliance with keras 3. remove use_multiprocessing=False from fit() as it is no longer an input. test_train() passing
* add multiprocessing parameters to cube generator constructor and remove from fit() signature (keras3 change)
* apply temp garbage collector fix
* skip troublesome test
* skip running tests on CI on windows
* remove commented out TF check
* clean commented out code. Explicitly pass use_multiprocessing=False (as before)
* remove str conversion before model.save
* raise test_detection error for sonarcloud happy
* skip running tests on windows on CI
* remove filename comment and small edits

* replace tensorflow Tensor with keras tensor
* add case for TF prep in prep_model_weights
* add different backends to pyproject.toml
* add backend configuration to cellfinder init file. tests passing with jax locally
* define extra dependencies for cellfinder with different backends. run tox with TF backend
* run tox using TF and JAX backend
* install TF in brainmapper environment before running tests in CI
* add backends check to cellfinder init file
* clean up comments
* fix tf-nightly import check
* specify TF backend in include guard check
* clarify comment
* remove 'backend' from dependencies specifications
* Apply suggestions from code review

Co-authored-by: Igor Tatarnikov <61896994+IgorTatarnikov@users.noreply.github.com>

---------

Co-authored-by: Igor Tatarnikov <61896994+IgorTatarnikov@users.noreply.github.com>

* replace tensorflow Tensor with keras tensor
* add case for TF prep in prep_model_weights
* add different backends to pyproject.toml
* add backend configuration to cellfinder init file. tests passing with jax locally
* define extra dependencies for cellfinder with different backends. run tox with TF backend
* run tox using TF and JAX backend
* install TF in brainmapper environment before running tests in CI
* add backends check to cellfinder init file
* clean up comments
* fix tf-nightly import check
* specify TF backend in include guard check
* clarify comment
* remove 'backend' from dependencies specifications
* Apply suggestions from code review

Co-authored-by: Igor Tatarnikov <61896994+IgorTatarnikov@users.noreply.github.com>

* PyTorch runs utilizing multiple cores
* PyTorch fix with default models
* Tests run on every push for now
* Run test on torch backend only
* Fixed guard test to set torch as KERAS_BACKEND
* KERAS_BACKEND env variable set directly in test_include_guard.yaml
* Run test on python 3.11
* Remove tf-nightly from __init__ version check
* Added 3.11 to legacy tox config
* Changed legacy tox config for real this time
* Don't set the wrong max_processing value
* Torch is now set as the default backend
* Tests only run with torch, updated comments
* Unpinned torch version
* Add codecov token (#403)
* add codecov token
* generate xml coverage report
* add timeout to testing jobs
* Allow turning off classification or detection in GUI (#402)
* Allow turning off classification or detection in GUI.
* Fix test.
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Refactor to fix code analysis errors.
* Ensure array is always 2d.
* Apply suggestions from code review

Co-authored-by: Igor Tatarnikov <61896994+IgorTatarnikov@users.noreply.github.com>

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Igor Tatarnikov <61896994+IgorTatarnikov@users.noreply.github.com>

* Support single z-stack tif file for input (#397)
* Support single z-stack tif file for input.
* Fix commit hook.
* Apply review suggestions.
* Remove modular asv benchmarks (#406)
* remove modular asv benchmarks
* recover old structure
* remove asv-specific lines from gitignore and manifest
* prune benchmarks
* Adapt CI so it covers both new and old Macs, and installs required additional dependencies on M1 (#408)
* naive attempt at adapting to silicon mac CI
* run include guard test on Silicon CI
* double-check hdf5 is needed
* Optimize cell detection (#398) (#407)
* Replace coord map values with numba list/tuple for optim.
* Switch to fortran layout for faster update of last dim.
* Cache kernel.
* jit ball filter.
* Put z as first axis to speed z rolling (row-major memory).
* Unroll recursion (no perf impact either way).
* Parallelize cell cluster splitting.
* Parallelize walking for full images.
* Cleanup docs and pep8 etc.
* Add pre-commit fixes.
* Fix parallel always being selected and numba function 1st class warning.
* Run hook.
* Older python needs Union instead of |.
* Accept review suggestion.
* Address review changes.
* num_threads must be an int.

---------

Co-authored-by: Matt Einhorn <matt@einhorn.dev>

* [pre-commit.ci] pre-commit autoupdate (#412)

updates:
- [github.com/pre-commit/pre-commit-hooks: v4.5.0 → v4.6.0](pre-commit/pre-commit-hooks@v4.5.0...v4.6.0)
- [github.com/astral-sh/ruff-pre-commit: v0.3.5 → v0.4.3](astral-sh/ruff-pre-commit@v0.3.5...v0.4.3)
- [github.com/psf/black: 24.3.0 → 24.4.2](psf/black@24.3.0...24.4.2)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: sfmig <33267254+sfmig@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Simplify model download (#414)
* Simplify model download
* Update model cache
* Remove jax and tf tests
* Standardise the data types for inputs to all be float32
* Force torch to use CPU on arm based macOS during tests
* Added PYTORCH_MPS_HIGH_WATERMARK_RATION env variable
* Set env variables in test setup
* Try to set the default device to cpu in the test itself
* Add device call to Conv3D to force cpu
* Revert changes, request one cpu left free
* Revers the numb cores, don't use arm based mac runner
* Merged main, removed torch flags on cellfinder install for guards and brainmapper
* Lowercase Torch
* Change cache directory

---------

Co-authored-by: sfmig <33267254+sfmig@users.noreply.github.com>
Co-authored-by: Kimberly Meechan <24316371+K-Meech@users.noreply.github.com>
Co-authored-by: Matt Einhorn <matt@einhorn.dev>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Alessandro Felder <alessandrofelder@users.noreply.github.com>
Co-authored-by: Adam Tyson <code@adamltyson.com>

IgorTatarnikov · 2024-05-10T16:25:40Z

To do:
Documentation:

Update documentation on the brainglobe website, tracked in brainglobe/brainglobe.github.io#183 ([Feature] Update cellfinder/brainmapper documentation to account for keras 3.0, torch migration)
Prep blog post? Not sure if this needs to be done in sync with the merge into main or with the release of v1.3.0; tracked in brainglobe/brainglobe.github.io#177 (cellfinder v1.3.0 blog post)

Sanity checks for a "regular" cellfinder workflow (should be done via the napari GUI and via the brainmapper CLI where possible):

Detect using the default cellfinder model
Detect using a custom model
Curate a set of cells and use it to retrain the default model
Detect using the new updated model

IgorTatarnikov · 2024-05-10T16:47:40Z

Tests are currently not running on macos-latest since I was having issues getting torch to behave on CI, see here. I couldn't find an elegant way of forcing torch not to use the mps device on CI specifically while still allowing it to be used normally.

Not sure how to proceed there. Tests pass locally when run on my personal machine (M2 MacBook Pro running macOS 14.4.1).
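A minimal sketch of one way to restrict a CPU override to CI, assuming the decision is keyed on the `GITHUB_ACTIONS` variable (which GitHub Actions sets to `true` on its runners) and on the platform being an Apple-silicon Mac. The function name `select_test_device` is hypothetical and is not taken from the cellfinder code base.

```python
import os
import platform


def select_test_device(environ=None, system=None, machine=None):
    """Return "cpu" on an Apple-silicon GitHub Actions runner, else None.

    Hypothetical helper: None means "leave torch's default device alone",
    so mps can still be used on a developer's own machine.
    """
    environ = os.environ if environ is None else environ
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine

    on_ci = environ.get("GITHUB_ACTIONS") == "true"
    on_arm_mac = system == "Darwin" and machine == "arm64"
    return "cpu" if (on_ci and on_arm_mac) else None
```

In a test fixture, a non-None result could then be passed to something like `torch.set_default_device`; whether that alone is enough to keep Keras off the mps device is not established in this thread.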

adamltyson · 2024-05-13T12:22:05Z

> Tests are currently not running on macos-latest since I was having issues getting torch to behave on CI, see here. I couldn't find an elegant way of forcing torch not to use the mps device on CI specifically while still allowing it to be used normally.
> Not sure how to proceed there. Tests pass locally when run on my personal machine (M2 MacBook Pro running macOS 14.4.1).

Naively, it seems like it should be possible (we're not the only people using torch & GH actions!). Ofc you've looked into it, so it's not simple.

I would suggest:

1. Give it a bit more of a go to see if you can fix it (there's a large online torch community)
2. If 1. fails, write up an issue with what you've tried and don't let this PR get derailed by a relatively small problem

adamltyson · 2024-05-23T14:32:59Z

tests/core/test_integration/test_train.py

@@ -35,5 +35,5 @@ def test_train(tmpdir):
    sys.argv = train_args
    train_run()

-    model_file = os.path.join(tmpdir, "model.h5")
+    model_file = os.path.join(tmpdir, "model.keras")


Is this the new default extension? This should be added to brainglobe/brainglobe.github.io#189 (I think there are a couple of places in the docs that reference the .h5 files directly).

IgorTatarnikov · 2024-05-28

If we're saving the whole model then we have to use the .keras extension. For weights, the files must now end with .weights.h5. I'll double check for references to .h5 in our documentation.
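The naming rule described above can be sketched as a small path helper. This is only an illustration: `keras3_save_path` is a hypothetical name, and the helper encodes nothing beyond the two suffixes mentioned in the comment (`.keras` for whole models, `.weights.h5` for weights).

```python
from pathlib import Path


def keras3_save_path(directory, stem, weights_only=False):
    """Build an output path using the suffixes mentioned in the review thread.

    Hypothetical helper: whole models get ".keras",
    weights-only files get ".weights.h5".
    """
    suffix = ".weights.h5" if weights_only else ".keras"
    return Path(directory) / f"{stem}{suffix}"
```

The resulting path could then be handed to `model.save(...)` or `model.save_weights(...)` respectively.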

adamltyson · 2024-05-23T14:50:37Z

@IgorTatarnikov I compared this PR with version 1.2.0 and the classification looks considerably worse. This was using the napari plugin, all parameters default, using the pre-trained model.

The only other difference was the Python version (the old cellfinder env uses 3.10 and the torch env uses 3.12), but that shouldn't affect this.

Do you know what's going on?

[Screenshots: classification results with TensorFlow and with Torch]

adamltyson · 2024-05-23T14:54:10Z

Also, if I try to run pytest locally, I get a huge stacktrace that leads to a keras error: ModuleNotFoundError: No module named 'tensorflow'

adamltyson · 2024-05-24T08:15:35Z

The classification problem seems to be resolved on that machine by deleting and re-downloading the model, not sure what's going on.

> Also, if I try to run pytest locally, I get a huge stacktrace that leads to a keras error: ModuleNotFoundError: No module named 'tensorflow'

This issue remains: the backend needs to be set explicitly by editing the ~/.keras/keras.json file (I needed to do this on two machines). Can we override this?
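For context, Keras 3 picks its backend from the `KERAS_BACKEND` environment variable if it is set, otherwise from the `backend` key in `keras.json`, and otherwise falls back to TensorFlow, which is why a missing configuration ends in the `tensorflow` import error above. The sketch below mimics that lookup order; it is not Keras's own code, `resolve_keras_backend` is a hypothetical name, and it ignores the `KERAS_HOME` override for simplicity.

```python
import json
import os
from pathlib import Path


def resolve_keras_backend(environ=None, config_path=None):
    """Mimic the backend lookup order: env variable, then keras.json, then default."""
    environ = os.environ if environ is None else environ
    if config_path is None:
        config_path = Path.home() / ".keras" / "keras.json"
    config_path = Path(config_path)

    backend = "tensorflow"  # fallback when nothing is configured
    if config_path.exists():
        try:
            backend = json.loads(config_path.read_text()).get("backend", backend)
        except (OSError, ValueError):
            pass  # unreadable or malformed config: keep the fallback
    # A non-empty environment variable takes precedence over the file.
    return environ.get("KERAS_BACKEND") or backend
```

Under this order, setting `KERAS_BACKEND=torch` in the environment is enough to override a `keras.json` that still points at TensorFlow.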

adamltyson · 2024-05-24T08:26:00Z

On my mac (m1), I get this error:

E NotImplementedError: Exception encountered when calling MaxPooling3D.call().
E
E The operator 'aten::max_pool3d_with_indices' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

I can fix it for detection by setting PYTORCH_ENABLE_MPS_FALLBACK=1, however this isn't ideal because a) others will just see a big red error, and b) presumably we want to use MPS?
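A minimal sketch of how the fallback could be switched on without asking users to export the variable themselves, assuming it is applied before torch is first imported and only on Apple-silicon Macs. `enable_mps_fallback` is a hypothetical helper, not part of cellfinder, and it leaves any value the user has already set untouched.

```python
import os
import platform


def enable_mps_fallback(environ=None, system=None, machine=None):
    """Default PYTORCH_ENABLE_MPS_FALLBACK to "1" on Apple-silicon Macs.

    Returns the effective value of the variable (None if it is unset).
    """
    environ = os.environ if environ is None else environ
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine

    if system == "Darwin" and machine == "arm64":
        # setdefault keeps an explicit user choice such as "0".
        environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
    return environ.get("PYTORCH_ENABLE_MPS_FALLBACK")
```

This only hides the error; ops that are missing on MPS would still run on the CPU, with the slowdown the warning describes.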

Also, even then, upon training, I get this error:

TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

adamltyson · 2024-05-24T14:26:25Z

After some more testing, everything seems to work fine on my mac, the only issue is automatically setting the backend to torch.

adamltyson

This is great, thanks @IgorTatarnikov @sfmig!

I've tested it on macOS, Ubuntu & Windows now. It works fine across all OSs, but it does seem to be very slow on Windows. I think this is out of scope for this review, so I've raised #426 so we can look into it later on.

I think the only remaining thing is setting the backend automatically. If this can be sorted, we can create an rc and get some more people to test upgrading.

IgorTatarnikov · 2024-05-28T12:57:23Z

I'm not sure why the automatic backend setting isn't working as it stands. I'll remove the conditional and always set the KERAS_BACKEND variable to be torch when the package is imported. This removes some flexibility but we weren't officially planning to support custom backends so if someone needs it they may have to edit the package code locally, or fork it.
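The difference between the conditional and the unconditional approach can be sketched as follows. This is an illustration under assumptions, not the code in `cellfinder/__init__.py`: `set_keras_backend` and its `force` flag are hypothetical, and either variant only has an effect if it runs before the first `import keras`, because Keras reads the variable at import time.

```python
import os


def set_keras_backend(environ=None, backend="torch", force=True):
    """Set KERAS_BACKEND and return the value that ends up in the environment.

    force=True  always overwrites the variable (the approach proposed above).
    force=False only fills it in when the user has not chosen a backend.
    """
    environ = os.environ if environ is None else environ
    if force:
        environ["KERAS_BACKEND"] = backend
    else:
        environ.setdefault("KERAS_BACKEND", backend)
    return environ["KERAS_BACKEND"]
```

With `force=True` a user who exported `KERAS_BACKEND=jax` still ends up on torch, which is the loss of flexibility mentioned above.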

adamltyson · 2024-05-28T13:09:29Z

Is it worth getting rid of the other code in the init file now? There's not a lot of point setting the backend to torch, then checking whether it's TF or JAX.

adamltyson · 2024-05-28T14:40:08Z

I think this is ready to merge then @IgorTatarnikov unless there's anything else you want to do?

IgorTatarnikov · 2024-05-28T14:46:52Z

I think it's ready! Once this is merged we can merge this PR in brainglobe-workflows to fix the brainmapper tests

sfmig and others added 11 commits February 7, 2024 13:50

remove pytest-lazy-fixture as dev dependency and skip test (with WG temp fix)

3046294

Test Keras is present (#374)

99cbda0

* check if Keras present
* change TF to Keras in CI
* remove comment
* change dependencies in pyproject.toml for Keras 3.0

Replace TF references in comments and warning messages (#378)

5ad5c1b

* change some old references to TF for the import check
* change TF cached model to Keras

Run cellfinder with JAX in Windows tests in CI (#382)

ca80c6d

* use jax backend in brainmapper tests in CI
* skip TF backend on windows
* fix pip install cellfinder for brainmapper CI tests
* add keras env variable for brainmapper CLI tests
* fix prep_model_weights

Merge branch 'main' into cellfinder-to-keras-3

32a0a56

Set pooling padding to valid by default on all MaxPooling3D layers

0150d07

Removed tf error suppression and other tf related functions

7731b3c

Merge branch 'main' into cellfinder-to-keras-3

13770d3

# Conflicts:
# .github/workflows/test_and_deploy.yml
# .github/workflows/test_include_guard.yaml
# cellfinder/core/main.py
# cellfinder/core/tools/prep.py
# cellfinder/core/train/train_yml.py
# tests/core/conftest.py

IgorTatarnikov added 6 commits May 17, 2024 11:54

Force torch to use cpu device when CELLFINDER_TEST_DEVICE env variable set to cpu

b4799ac

Added env variable to test step

4c07fc6

Use the GITHUB_ACTIONS environmental variable instead

1027b59

Added docstring for fixture setting device to cpu on arm based mac

a891005

Revert changes to no_free_cpus being fixture, and default param

1a56198

Fixed typo in test_and_deploy.yml

cdbfb42

alessandrofelder mentioned this pull request May 20, 2024

TensorFlow to PyTorch migration brainglobe/brainglobe.github.io#189

Open

5 tasks

IgorTatarnikov and others added 3 commits May 23, 2024 11:52

Set multiprocessing to false for the data generators

9ff10d9

Merge branch 'main' into cellfinder-to-keras-3

0339a8f

Update all cache steps to match

2ce45fa

IgorTatarnikov marked this pull request as ready for review May 23, 2024 13:33

IgorTatarnikov requested a review from a team May 23, 2024 13:33

adamltyson requested review from adamltyson and removed request for a team May 23, 2024 13:33

IgorTatarnikov mentioned this pull request May 23, 2024

Remove calls to TensorFlow related cellfinder functions brainglobe/brainglobe-workflows#112

Merged

5 tasks

adamltyson reviewed May 23, 2024

cellfinder/__init__.py (outdated, resolved)

Remove reference to TF

e98e0af

adamltyson mentioned this pull request May 23, 2024

Move imports out of functions #423

Open

adamltyson reviewed May 23, 2024

IgorTatarnikov added 2 commits May 23, 2024 16:15

Make sure tests can run locally when GITHUB_ACTIONS env variable is missing2

d4b7ecd

Removed warning when backend is not configured

68ed410

Set the label tensor to be float32 to ensure compatibility with mps

bebfce9

adamltyson mentioned this pull request May 28, 2024

Check performance on Windows #426

Open

adamltyson approved these changes May 28, 2024

Always set KERAS_BACKEND to torch on init

7ad06c6

Remove code in __init__ checking for if backend is installed

666749f

IgorTatarnikov merged commit cbdecaf into main May 28, 2024
16 of 18 checks passed

adamltyson mentioned this pull request May 28, 2024

drop Python 3.8, support Python 3.11 brainglobe/brainglobe-workflows#41

Open
