Commit 20e7e7c: update readme to lhco text
ewencedr committed Oct 9, 2023 (1 parent: 3ff80dc)
Showing 1 changed file: README.md, with 27 additions and 97 deletions.
<div align="center">

# LHCO EPiC Flow Matching

[![python](https://img.shields.io/badge/-Python_3.10-blue?logo=python&logoColor=white)](https://www.python.org/)
[![pytorch](https://img.shields.io/badge/PyTorch_2.0+-ee4c2c?logo=pytorch&logoColor=white)](https://pytorch.org/get-started/locally/)
[![lightning](https://img.shields.io/badge/-Lightning_2.0+-792ee5?logo=pytorchlightning&logoColor=white)](https://pytorchlightning.ai/)
[![hydra](https://img.shields.io/badge/Config-Hydra_1.3-89b8cd)](https://hydra.cc/)
[![black](https://img.shields.io/badge/Code%20Style-Black-black.svg?labelColor=gray)](https://black.readthedocs.io/en/stable/)
[![isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/) <br>
</div>

## Description

This repository contains the code for the Flow Matching generative models from [XXX](https://arxiv.org/abs/1111.11111). For the preparation of the data, see this [repository](https://github.com/ewencedr/FastJet-LHCO); the second generative model and the classifier can be found [here](https://github.com/ViniciusMikuni/LHCO_diffusion).
We used the [LHC Olympics](https://lhco2020.github.io/homepage/) R&D anomaly detection dataset.

> Physics beyond the Standard Model that is resonant in one or more dimensions has been a longstanding focus of countless searches at colliders and beyond. Recently, many new strategies for resonant anomaly detection have been developed, where sideband information can be used in conjunction with modern machine learning, in order to generate synthetic datasets representing the Standard Model background. Until now, this approach was only able to accommodate a relatively small number of dimensions, limiting the breadth of the search sensitivity. Using recent innovations in point cloud generative models, we show that this strategy can also be applied to the full phase space, using all relevant particles for the anomaly detection. As a proof of principle, we show that the signal from the R&D dataset from the LHC Olympics is findable with this method, opening up the door to future studies that explore the interplay between depth and breadth in the representation of the data for anomaly detection.

For the generation of dijet events, we use a chain of multiple generative models: a particle feature model generates the jet constituents from jet features that are in turn generated by a jet feature model. The jet feature model generates the jet features for both jets in the event and is conditioned on the dijet mass of the jet pair. This conditioning allows us to train on the sideband region in dijet mass and sample in the signal region, where a signal is expected; the sketch below illustrates the chain.
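
To make the chain concrete, here is a minimal sketch of the two-stage sampling pipeline. The names (`jet_feature_model`, `particle_model`, `sample_fn`) are hypothetical stand-ins for illustration, not this repository's API; `sample_fn` is a generic flow-matching sampler, sketched at the end of this README.

```python
def generate_dijet_events(jet_feature_model, particle_model, mjj, sample_fn):
    # Stage 1: the jet feature model, conditioned on the dijet mass mjj,
    # generates the kinematic features of both jets. Training on the mjj
    # sidebands lets it interpolate into the signal region at sampling time.
    jet_features = sample_fn(jet_feature_model, cond=mjj)

    # Stage 2: the particle feature model generates each jet's constituents
    # (the point cloud), conditioned on the generated jet features.
    constituents = sample_fn(particle_model, cond=jet_features)
    return jet_features, constituents
```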

The particle feature model is the EPiC Flow Matching model introduced [here](https://arxiv.org/abs/2310.00049).
EPiC Flow Matching is a [Continuous Normalising Flow](https://arxiv.org/abs/1806.07366) that is trained with a simulation-free approach called [Flow Matching](https://arxiv.org/abs/2210.02747). The model uses [DeepSet](https://arxiv.org/abs/1703.06114)-based [EPiC layers](https://arxiv.org/abs/2301.08128) for the architecture, which allow for good scalability to large set sizes.
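
Schematically, Flow Matching regresses a time-dependent velocity field onto a path between noise and data. The sketch below uses the common straight-line (optimal transport) interpolation and a placeholder `model`, so read it as an illustration of the objective rather than this repository's exact loss:

```python
import torch

def flow_matching_loss(model, x1, cond):
    # x1:   batch of real point clouds, shape (batch, n_points, n_features)
    # cond: conditioning features, e.g. the jet kinematics
    x0 = torch.randn_like(x1)                            # noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # time in [0, 1]
    xt = (1 - t) * x0 + t * x1                           # straight-line path
    v_target = x1 - x0                                   # its constant velocity
    v_pred = model(xt, t, cond)                          # predicted velocity
    return torch.mean((v_pred - v_target) ** 2)
```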

The jet feature model is also trained with Flow Matching, but since it doesn't model point clouds, a simple fully connected architecture is used.
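
As an illustration of how small this part can be, a fully connected velocity field for the per-event jet features might look like the following; the feature count and layer sizes are made up for the sketch and do not come from this repository's configs:

```python
import torch
import torch.nn as nn

class JetFeatureVectorField(nn.Module):
    # Illustrative fully connected velocity field v(x, t, mjj) for the jet
    # features of both jets; no permutation symmetry needs to be respected.
    def __init__(self, n_features: int = 8, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features + 2, hidden),  # jet features + time + mjj
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, n_features),
        )

    def forward(self, x, t, mjj):
        # x: (batch, n_features), t: (batch, 1), mjj: (batch, 1)
        return self.net(torch.cat([x, t, mjj], dim=-1))
```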

This repository is built on [pytorch lightning](https://www.pytorchlightning.ai/index.html), uses [hydra](https://hydra.cc/docs/intro/) for model configurations, and supports logging with [comet](https://www.comet.com/site/) and [wandb](https://wandb.ai/site). For a deeper explanation of how to use this repository, please have a look at the [template](https://github.com/ashleve/lightning-hydra-template) directly.

Set paths and API keys in a `.env` file:

```bash
LOG_DIR="/folder/folder/"
COMET_API_TOKEN="XXXXXXXXXX"
```

Before training, the [data](https://lhco2020.github.io/homepage/) needs to be downloaded. Because the raw data simply consists of all jet constituents of a dijet event, it has to be clustered and prepared first, which can be done with [this](https://github.com/ewencedr/FastJet-LHCO) code.
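
For orientation, the R&D dataset is distributed as an HDF5 table with up to 700 constituents per event, stored as flat (pT, eta, phi) triplets plus a truth label. A minimal loading sketch, assuming the file name and column layout documented on the LHC Olympics page, is:

```python
import pandas as pd

# One row per event: 700 x (pT, eta, phi) columns, then a signal label.
events = pd.read_hdf("events_anomalydetection.h5").to_numpy()
constituents = events[:, :-1].reshape(len(events), 700, 3)  # zero-padded particles
labels = events[:, -1]                                      # 1 = signal, 0 = background
```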

Then the jet feature model can be trained with

```bash
python src/train.py experiment=lhco/lhco_jet_features
```

and the particle feature model with

```bash
python src/train.py experiment=lhco/lhco_both_jets
```
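
Since all configuration is handled by hydra, any config value can also be overridden on the command line in the usual lightning-hydra-template way; the parameters below are just examples:

```bash
python src/train.py experiment=lhco/lhco_jet_features trainer.max_epochs=20 data.batch_size=64
```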

After training both models, one can generate the events with the `lhco_full_eval` notebook. The classifier training and evaluation plots were done with [this](https://github.com/ViniciusMikuni/LHCO_diffusion) code.
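
To make the sampling step explicit: generating from a trained Flow Matching model means integrating the learned velocity field from noise at t = 0 to data at t = 1. A generic Euler-method sketch of the `sample_fn` used in the pipeline sketch above (step count, shapes, and call signature are illustrative) could look like this:

```python
import torch

@torch.no_grad()
def sample_fn(model, cond, shape=(1024, 2, 4), n_steps: int = 100):
    # Integrate dx/dt = v(x, t, cond) from t = 0 (noise) to t = 1 (data);
    # shape is whatever the model in question generates.
    x = torch.randn(shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0], 1), i * dt)
        x = x + model(x, t, cond) * dt  # explicit Euler step
    return x
```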
