This repo contains an implementation of om2vec as described in the paper 2410.13148. It is designed for use on Prometheus neutrino telescope events. om2vec uses deep learning to encode optical module (OM) timing information into a latent vector representation. Training scripts and pre-trained checkpoints are provided in this repo. Additionally, we provide a script that allows you to convert your Prometheus dataset files into latents, given a model checkpoint.
om2vec/
├── configs/
│ └── example.cfg
├── convert_to_latents.py
├── dataloaders/
│ ├── data_utils.py
│ └── prometheus.py
├── LICENSE
├── README.md
├── requirements.txt
├── run.py
├── scripts/
│ └── preprocess_prometheus.py
├── utils.py
└── vae.py
- PyTorch
- PyTorch Lightning
- NumPy
- Awkward Arrays
- yaml
You can install the requirements by running pip install -r requirements.txt.
Download pre-trained checkpoints here: Google Drive, for latent_dim 32, 64, 128 models respectively.
Additionally, Prometheus datasets are being uploaded to the Harvard dataverse.
You can convert your existing Prometheus Parquet data files into om2vec latents by using the script convert_to_latents.py. The script will take in Prometheus files and output files in the same format, with an added column latents.
Provide the path to the directory containing the Prometheus datasets, the path to the output directory, and the path to the model checkpoint (you can train your own or use the provided pre-trained checkpoints above), like so
python convert_to_latents.py --input_dir /PATH_TO_PROMETHEUS_DATASETS/ --output_dir /PATH_TO_OUTPUT_DIR/ --ckpt /PATH_TO_CHECKPOINT/
Additional optional arguments to the script are detailed below:
chunk_size: number of events per Parquet file in the output (default: 5000)max_time: maximum time (in ns) when binning the hits from Prometheus (default: 6400)num_bins: number of bins to use when binning the hits from Prometheus (default: 6400)
IMPORTANT NOTE: max_time and num_bins should match with what the model checkpoint was trained with (pre-trained checkpoints are trained with the default values of 6400)
The output Parquet files generated by the script contains an extra column called latents which contains the latent representations from om2vec.
[In]: import awkward as ak
...: data = ak.from_parquet('./example_output.parquet')
...: data.fields
[Out]: ['mc_truth', 'photons', 'latents']
[In]: len(data)
[Out]: 5000
[In]: # list of OMs contained in the first event
...: data[0].latents
[Out]: <Array [[49, 1, -0.373, ..., -1.66, 0.116], ...] type='8 * var * float64'>
[In]: # first OM in the first event
...: data[0].latents[0]
[Out]: <Array [49, 1, -0.373, 1.1, ..., 0.928, -1.66, 0.116] type='66 * float64'>
Each latent representation of an OM contains [string_id, sensor_id, latents...]. In the above example we use the 64-dim om2vec latent so each contains 66 values in total.
Use run.py and a configuration file to run training and/or inference.
python run.py -c ./configs/example.cfg
Before training, you will need to pre-process events from raw Prometheus output. See scripts/preprocess_prometheus.py. This script will generate Parquet files of individual binned OM timing distributions from Prometheus events. You need to provide an input_dir containing the Prometheus parquet files and an output_dir.
python scripts/preprocess_prometheus.py --input_dir /PATH_TO_PROMETHEUS_DATASETS/ --output_dir /PATH_TO_OUTPUT_DIR/
You can also provide the following optional arguments:
chunk_size: Number of individual binned OM timing distributions per file (Default: 250000)max_time: maximum time (in ns) when binning the hits from Prometheus (default: 6400)num_bins: number of bins to use when binning the hits from Prometheus (default: 6400)
We provide an example configuration file in configs/example.cfg. See details on the options below:
accelerator: type of PyTorch Lightning accelerator to use (e.g. 'gpu')num_devices: number of devices to deploy model ontraining: boolean for training mode. UseFalseto run in test/inference mode.dataloader: type of dataloader to use. Implemented dataloaders: 'prometheus'checkpoint: path to checkpoint file if loading model from checkpointresume_training: Set toTrueif loading from a checkpoint and resuming trainingproject_name: wandb project nameproject_save_dir: wandb project save directorymodel_options: arguments to be passed into the om2vec network initializationdata_options.train_data_files: list of paths to directories containing the training data (pre-processed)data_options.train_data_file_ranges: list of ranges ([x, y]) indicating which/how many files in each directory to use. must be same length astrain_data_filesdata_options.valid_data_files: list of paths to directories containing the validation data (pre-processed)data_options.valid_data_file_ranges: list of ranges ([x, y]) indicating which/how many files in each directory to use. must be same length asvalid_data_filesdata_options.file_chunk_size: number of timing distributions per file. set to same aschunk_sizeduring pre-processingtraining_options: various training hyperparameters