For results, see the SuperNEMO private documentation. If you have any questions, feel free to contact me at adam.mendl@cvut.cz or amendl@hotmail.com.
The information in this section mostly applies to the CCLyon (IN2P3) computing cluster.
Almost everything runs on top of python3. On CCLyon, use the Python that is sourced together with ROOT via:

- `ccenv root 6.22.06` - loads python 3.8.6 (does not work now)
- since July 12 2023: `module add Analysis/root/6.22.06-fix01` - loads python 3.9.1 (currently, this is the way to go)
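A quick way to confirm which interpreter and ROOT build actually get picked up is to check from Python itself (a minimal sketch; PyROOT comes from the sourced module):

```python
# Minimal sanity check of the sourced environment (run after loading the module above)
import sys
import ROOT  # PyROOT is provided by the sourced ROOT build

print(sys.version)              # expect 3.9.x with Analysis/root/6.22.06-fix01
print(ROOT.gROOT.GetVersion())  # expect something like "6.22/06"
```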
This software should be installed in a Python or Anaconda environment (a Python virtual environment is preferred since it can access both the sourced ROOT package and all GPU-related software directly; however, it is still possible to make it work with Anaconda). Required packages:
- `root` - ROOT does not need to be explicitly installed in the Python or Anaconda environment; any sourced ROOT on CCLyon should work - minimum tested version 6.22.06 (since July 12 2023, 6.22.06-fix01 on CCLyon). PyROOT is required.
- `cudatoolkit`, `cudnn` - should already be installed on CCLyon
- `tensorflow` - ideally 2.13; older versions produce a random bug related to an invalid version of the openSSL build on CCLyon. However, there seems to be a bug in 2.13 regarding training on multiple GPUs
- `keras` - should be part of `tensorflow`
- `keras-tuner` - hyperparameter tuning
- `numpy`
- `matplotlib`, `seaborn` - plotting
- `scikit-learn` - some helper functions
- `pydot`, `graphviz` - drawing models
- `argparse`
Optional:

- `tensorrt` - really useful; makes inference and training of models faster on GPU. It can also be installed in two steps: first install `nvidia-pyindex` and then `nvidia-tensorrt`. To use it successfully within scripts, you should import it before `tensorflow` (see the sketch after this list)!
- `tensorboard` - should be part of `tensorflow`
- `tensorboard_plugin_profile` - profiling
- `nvidia-pyindex`, `nvidia-tensorrt` - for TensorRT support
- `nvidia-smi` - for checking usage and available memory on the NVIDIA V100 GPUs (on CCLyon)
- `ray` - for massively parallelized hyperparameter optimization. Install it with `python -m pip install -U "ray[data,train,tune,serve]"`.
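As noted for `tensorrt` above, the import order matters; a minimal sketch of a script header that respects it (nothing here is specific to this project):

```python
# TensorRT must be imported *before* TensorFlow so that its shared libraries are found
import tensorrt  # noqa: F401
import tensorflow as tf

# With everything set up correctly, TensorFlow should list the GPUs allocated to the job
print(tf.config.list_physical_devices("GPU"))
```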
An example is at example_exec.sh. Run it with `sbatch --mem=... -n 1 -t ... --gres=gpu:v100:N example_exec.sh` if you have access to a GPU, where `N` is the number of GPUs you want to use (currently CCLyon does not allow me to use more than three of them). Otherwise, leave out the `--gres` option.
Scripts can use two strategies. To use only one GPU, pass the option `--OneDeviceStrategy "/gpu:0"`. If you want to use more GPUs, use, for example, `--MirroredStrategy "/gpu:0" "/gpu:1" "/gpu:2"`. For some reason, I was never able to use more than 3 GPUs on CCLyon.
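For reference, a minimal sketch of how these options presumably translate into `tf.distribute` strategies inside the scripts (the helper below is illustrative and not the scripts' actual API):

```python
import tensorflow as tf

def build_strategy(one_device=None, mirrored=None):
    """Illustrative helper: choose a distribution strategy from CLI-style options."""
    if one_device is not None:           # e.g. "/gpu:0"
        return tf.distribute.OneDeviceStrategy(one_device)
    if mirrored:                          # e.g. ["/gpu:0", "/gpu:1", "/gpu:2"]
        return tf.distribute.MirroredStrategy(mirrored)
    return tf.distribute.get_strategy()   # fallback: default (no distribution)

strategy = build_strategy(mirrored=["/gpu:0", "/gpu:1", "/gpu:2"])
with strategy.scope():
    # the model must be created inside the strategy scope
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
```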
If you start a job from a bash instance with some packages, modules, or a virtual environment loaded, you should unload/deactivate them first (use `module purge --force`). The best way is to start from a fresh bash instance (in many places I use environment variables to pass arguments to scripts, so starting fresh is really important!).
- source `root` (and `python`) - currently, use `module add Analysis/root/6.22.06-fix01`
- create a Python virtual environment (if not done yet)
- install packages (if not done yet)
- load the Python virtual environment (or add `#! <path-to-your-python-bin-in-environment>` as the first line of your script; see the sketch after this list)
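A minimal sketch of the shebang variant (the path is a placeholder for the interpreter inside your own environment):

```python
#!/path/to/your/venv/bin/python3
# With the line above, the script can be executed directly without activating the environment first.
import tensorflow as tf
print(tf.__version__)
```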
We test models on real data and compare them with TKEvent. Unfortunately, it is not possible to open `root` files produced by the TKEvent library from Python with that same sourced library, since it might be built with a different version of python and libstdc++. Fortunately, a workaround exists. We need to download and build two versions of TKEvent. The first version is built in the manner described in the TKEvent README.md. The second library should be built (we ignore the `red_to_tk` target) with the following steps:

- `module add ROOT`, where `ROOT` is the version of the `root` library used by `tensorflow` (currently `module add Analysis/root/6.22.06-fix01`)
- run `TKEvent/TKEvent/install.sh` to build the library

Now we can use `red_to_tk` from the first library to obtain a root file with `TKEvent` objects and open this root file with the second version of the `TKEvent` library (see the sketch below).
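A minimal sketch of that last step using PyROOT; the library path, file name and tree name are placeholders, not the actual TKEvent layout:

```python
import ROOT

# Load the second TKEvent build (compiled against the same ROOT as used with tensorflow)
ROOT.gSystem.Load("/path/to/second/TKEvent/TKEvent/lib/libTKEvent.so")  # placeholder path

# Open the file produced by red_to_tk (from the first build) and loop over its entries
f = ROOT.TFile.Open("run_XXX_tk.root")  # placeholder file name
tree = f.Get("Event")                   # placeholder tree name
for entry in tree:
    pass  # access the stored TKEvent object of each entry here
```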
If the collaboration wants to use Keras models inside its software, the best way is probably to use cppflow. It is a single-header C++ library for accessing the TensorFlow C API. This means that we will not have to build TensorFlow from source, and we should not be restricted by root/python/gcc/libstdc++ versions or calling conventions.
To avoid being restricted by the organization of our folders, we use the following script to load, register, and return local modules.
```python
def import_arbitrary_module(module_name, path):
    """Load a Python module from an arbitrary file path, register it in sys.modules and return it."""
    import importlib.util
    import sys

    spec = importlib.util.spec_from_file_location(module_name, path)
    imported_module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = imported_module
    spec.loader.exec_module(imported_module)
    return imported_module
```
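For example (module name and path are purely illustrative):

```python
# Load a local helper module from an arbitrary location and use it like a regular import
my_tools = import_arbitrary_module("my_tools", "/path/to/my_tools.py")
```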
Known issues:

- `sbatch` and `tensorflow` sometimes fail to initialize libraries (mainly when sourcing python from the virtual environment or root) - start the script again, ideally from a new bash instance without any modules or virtual environment loaded.
- `tensorflow` sometimes runs out of memory - don't use checkpoints for `tensorboard`. Another cause of this problem might be training several models in one process, which we can solve with `keras.backend.clear_session()`. If this error occurs after several hours of program execution, check out the function `tf.config.experimental.set_memory_growth` (see the sketch after this list).
- TensorFlow 2.13 distributed training fails - tensorflow/tensorflow#61314
- Sometimes there are strange errors regarding the ranks of tensors while using the custom training loop in `gan.py` - this looks like a really old, still unresolved bug inside the core tensorflow library. However, the workaround is to pass only one channel at a time into the CNN architecture and concatenate them with `tf.keras.layers.Concatenate`.
- `numpy` occasionally stops working (maybe connected to this issue), failing to import (`numpy.core._multiarray_umath` or other parts of the `numpy` library). It is accompanied by the message:
```
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
```
The simplest solution is to create a new environment and install all the libraries inside it.
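A minimal sketch of the two memory-related workarounds mentioned in the list above:

```python
import tensorflow as tf
from tensorflow import keras

# Allocate GPU memory on demand instead of reserving it all at start-up
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# When training several models in the same process, reset the Keras state between models
keras.backend.clear_session()
```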
Since I use absolute paths in many places, you have to update the paths after downloading this project. The rule of thumb is that changes are required for:

- scripts that load data into `tf.data.Dataset` pipelines (search for `TFile` and update the paths to your simulations/measured data),
- all *.sh files. The paths that require changing are: a) root, python, g++ and other libraries; b) python environments; and c) scripts.
Please note that this description is rather short. For a more detailed overview, see the SuperNEMO documentation. For my implementation of CapsNET, see this repository.