Wrapper and implementation for comparing models for multi-omics data integration. Check out our paper "An in-depth comparison of linear and non-linear joint embedding methods for bulk and single-cell multi-omics" here: https://doi.org/10.1101/2023.04.10.535672
MOST RECENT UPDATES ARE FOUND IN THE generic-impl BRANCH!
The recommended installation uses a new Anaconda environment. To ease the process, this project includes an environment file (environment.yml), which can be imported into Anaconda following this short tutorial.
Then, to run the project, call the run.py wrapper with the desired config file:
    python run.py -c configs/geme.yaml
This default call to run.py will run all implemented models. To run specific models, use one or more of the -poe, -moe, -mofa, -mvib and -cgae flags.
To see additional arguments, use argparse's built-in -h option:
    python run.py -h
- The experiment name is set in the config file, or on the command line using -e experiment_name
- All results and model outputs will be stored in the project's results directory
- All metrics are written to a TensorBoard file, also found in the results folder
- All informational print statements are saved to a log.txt file, using a logger found in the util folder.
- For every model, the embedding Z is saved to the results folder, together with a UMAP representation.
- Imputation is done right after model training. The imputation code is therefore found in the main file of each model.
- All options for data/model specific features can be set in a config file found in the configs folder.
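As an illustration of the command-line interface described above, the flags could be wired up with argparse roughly as follows. This is a minimal sketch, not the actual contents of run.py: only the short flags come from this README, while the long option names (--config, --experiment), the help strings, and the helper names build_parser and selected_models are assumptions.

```python
import argparse

# Model flags named in this README.
MODEL_FLAGS = ["poe", "moe", "mofa", "mvib", "cgae"]


def build_parser():
    """Hypothetical parser mirroring the flags described above."""
    parser = argparse.ArgumentParser(description="Run multi-omics integration models")
    parser.add_argument("-c", "--config", required=True,
                        help="path to a YAML config file")
    parser.add_argument("-e", "--experiment", default=None,
                        help="experiment name (assumed to override the config file)")
    for name in MODEL_FLAGS:
        # Single-dash flags such as -poe are stored on args as args.poe.
        parser.add_argument(f"-{name}", action="store_true",
                            help=f"include the {name} model in this run")
    return parser


def selected_models(args):
    """Return the chosen models, or all of them when no model flag is given."""
    chosen = [m for m in MODEL_FLAGS if getattr(args, m)]
    return chosen or list(MODEL_FLAGS)
```

Under these assumptions, parsing `-c configs/geme.yaml -poe -cgae` would select only the PoE and CGAE models, while omitting all model flags selects every model.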
Task 1 Imputation:
- MOFA+, PoE, CGAE, Baseline have this up and running
- MoE is under construction to use other logic (see below)
- MVIB is not suitable for imputation.
Task 2 Survival Time Prediction:
- A file exists in the data_preprocessing folder that creates splits of the data to use in this task.
- Implementation will be based on the Momix implementation, using the R survival package.
- Currently under construction
- There are also files left from cancer stage classification in the CGAE and MVAE folders.
The implementation of MOFA+ is the one provided by the MOFA+ project.
The code allows the model to be trained in pure Python. The trained model is saved to a .hdf5 file in the appropriate directory.
The model's W and Z matrices (see documentation) then have to be fetched using R. The Python library rpy2 takes care of that, but in case of issues there is a MOFA_downstream.R file attached.
Some notes:
- The model is trained on the training set + validation set
- Multiplying new data (Y) from a test set by the Moore-Penrose inverse (pseudoinverse) of the W matrix gives a corresponding Z for that test set.
- We can do this for both omics and then impute back from their respective Z's to the other omic's Y matrix.
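The projection and cross-omics imputation described in the notes above can be sketched with NumPy. This is a minimal illustration on synthetic, noise-free data, assuming the factor model Y ≈ Z W^T with samples in rows. The function names project_to_latent and impute_other_omic and all matrix sizes are hypothetical and not taken from this repository.

```python
import numpy as np


def project_to_latent(Y, W):
    """Estimate Z (samples x factors) from Y (samples x features), assuming Y ~ Z @ W.T."""
    # pinv(W) has shape (factors x features); its transpose maps features to factors.
    return Y @ np.linalg.pinv(W).T


def impute_other_omic(Y_src, W_src, W_tgt):
    """Impute the target omic from the source omic through the shared latent space."""
    Z = project_to_latent(Y_src, W_src)
    return Z @ W_tgt.T


# Synthetic example: 5 test samples, 3 factors, omics with 20 and 12 features.
rng = np.random.default_rng(0)
Z_true = rng.normal(size=(5, 3))
W1 = rng.normal(size=(20, 3))
W2 = rng.normal(size=(12, 3))
Y1_test = Z_true @ W1.T
Y2_test = Z_true @ W2.T

Z_hat = project_to_latent(Y1_test, W1)       # latent factors recovered from omic 1
Y2_hat = impute_other_omic(Y1_test, W1, W2)  # omic 2 imputed from omic 1
```

On real data the recovery is only approximate, since the measurements contain noise that the factor model does not explain.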
The MVAE code was originally adapted from the Product-of-Experts MVAE as developed by Wu and Goodman.
- Their implementation has remained mostly intact, for example their Product-of-Experts function in model.py and their test/training methods.
- The VAE architecture was changed to a more standard vanilla VAE architecture, based on the Pytorch-VAE.
- The loss function in train.py was rewritten to also work more like the Pytorch-VAE. Their loss function uses binary cross entropy.
- This library has also been extended to use a Mixture-of-Experts approach, using all the same code except for the actual combining of Gaussians. BEWARE: this code was written according to what I thought was correct, but is not fully backed by a specific paper.
- To use Mixture-of-Experts, see next section.
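For reference, a product of Gaussian experts combines the per-modality posteriors by precision weighting. The sketch below shows that rule in plain Python for a single latent dimension, with an optional standard-normal prior expert as in Wu and Goodman's formulation. It is not the code in model.py, and the function name product_of_experts is hypothetical.

```python
def product_of_experts(means, variances, include_prior=True):
    """Combine 1-D Gaussian experts N(mean_i, var_i) into a single Gaussian.

    The product of Gaussians is again Gaussian: its precision is the sum of
    the expert precisions and its mean is the precision-weighted mean.
    """
    means = list(means)
    variances = list(variances)
    if include_prior:
        # Standard-normal prior expert N(0, 1).
        means.append(0.0)
        variances.append(1.0)
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    joint_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    joint_var = 1.0 / total_precision
    return joint_mean, joint_var
```

With a diagonal covariance, the same rule applies independently to every latent dimension. A confident expert (small variance) pulls the joint mean towards its own mean, and the joint variance is never larger than that of any single expert.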
Instead of writing an in-house implementation that would be open to scrutiny, this approach will be adapted from MMVAE. Some work has been done on reusing their logic in the MVAE model.py file.
It is not yet finalized.
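For contrast with the product rule, a mixture of experts with uniform weights, as in MMVAE, treats the joint posterior as the average of the unimodal posteriors. The sketch below illustrates this in plain Python for 1-D Gaussians. The function names are hypothetical, the simple "pick an expert, then sample" scheme is an assumption made for clarity, and none of this is the code in model.py.

```python
import math
import random


def mixture_density(z, means, variances):
    """Density at z of a uniformly weighted mixture of 1-D Gaussians."""
    total = 0.0
    for m, v in zip(means, variances):
        total += math.exp(-((z - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
    return total / len(means)


def sample_mixture(means, variances, rng):
    """Draw one sample: pick an expert uniformly, then sample from its Gaussian."""
    idx = rng.randrange(len(means))
    return rng.gauss(means[idx], math.sqrt(variances[idx]))
```

Unlike the product, the mixture keeps every expert's mode: if two modalities disagree, samples come from either expert rather than from a compromise between them.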
Based on the following paper.
Some to-do's are listed in the Drive doc, concerning implementation of the MultiOmicsVAE in the nets.py file.
The CGAE model is inspired by this paper.
.
├── LICENSE
├── README.md
├── .gitignore
├── configs
│   ├── geme.yaml
│   ├── gegcn.yaml
│   ├── gcnme.yaml
│   ├── brca2_gegcn.yaml
│   └── etc.
├── data/
├── environment.yml
├── results/
├──
├── run.py
└── src
    ├── baseline/
    ├── CGAE/
    ├── data_preprocessing/
    ├── MOFA2/
    ├── MVAE/
    ├── MVIB/
    ├── util/
    ├── nets.py
    └── survival.py
- Stavros Makrodimitris S.Makrodimitris@tudelft.nl
- Tamim Abdelaal T.R.M.Abdelaal-1@tudelft.nl
- Bram Pronk I.B.Pronk@student.tudelft.nl
- Marcel Reinders M.J.T.reinders@tudelft.nl
- Argelaguet, R. and Velten, B. and Arnol, D. and Dietrich, S. and Zenz, T. and Marioni, J. C. and Buettner, F. and Huber, W. and Stegle, O., Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets, Mol Syst Biol, 14.6, 2018.
- Argelaguet, R. and Arnol, D. and Bredikhin, D. and Deloro, Y. and Velten, B. and Marioni, J. C. and Stegle, O., MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data, Genome Biology, 21.1, pp. 111, 2020.
- Section will be expanded on in the future.
