This is code for da pengster (bio ML stuffs)
The background for this project is that we are given two datasets:
- coupled single-cell measurements of the form (DNA, RNA), where DNA is a ~220k-dimensional vector of ATAC-seq chromatin accessibility measurements over ~220k genomic regions (peaks) and RNA is a ~24k-dimensional vector of measured gene expression
- coupled single-cell measurements of the form (RNA, protein), where RNA is defined as before and protein is a 140-dimensional vector of surface-level protein measurements for 140 different proteins
- jepa.py contains a Joint Embedding (Predictive) Architecture framework that wraps several individual models (joint embeddings work like CLIP or this, but what I am doing has more moving parts)
- models.py contains the actual models implemented (currently Enformer and a better version of dilated nets)
- trainer.py contains a big boy training framework, complete with lr scheduling, patience algorithms, etc.
- datasets.py implements custom PyTorch Dataset objects to allow for efficient dataloading of the massive .h5 files we use
- env_init.sh sets up the python virtual environment to make things nice
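
Since jepa.py is described by analogy with CLIP, here is a rough pure-Python sketch of a CLIP-style symmetric contrastive loss on paired embeddings (think one embedding per cell from each modality). This is only an illustration of the general idea, not what jepa.py actually does, which has more moving parts. The function names and the temperature value are assumptions made up for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def clip_style_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss; emb_a[i] and emb_b[i] come from the same cell.

    Hypothetical helper for illustration, not part of jepa.py.
    """
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    n = len(a)
    # logits[i][j] is the cosine similarity of a[i] and b[j], scaled by temperature
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # a -> b direction: each row should pick out its own column
    loss_ab = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # b -> a direction: each column should pick out its own row
    loss_ba = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_ab + loss_ba)
```

Correctly paired embeddings should give a much lower loss than mismatched ones, e.g. `clip_style_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])` versus the same call with the second list swapped.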
The rest of the code is exploratory and not finalized (well, nothing is finalized, but you know what I mean)
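
For a rough idea of what a patience algorithm like the ones mentioned for trainer.py does, here is a minimal sketch of patience-based early stopping. The class name `EarlyStopper`, the helper `epochs_run`, and the parameters `patience` and `min_delta` are all hypothetical and are not the actual trainer.py API.

```python
class EarlyStopper:
    """Hypothetical patience tracker; not the actual trainer.py implementation."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # smallest decrease that counts as improvement
        self.best = float("inf")    # best validation loss seen so far
        self.bad_epochs = 0         # consecutive epochs without improvement

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(losses, patience=3):
    """Feed validation losses in order and return how many epochs ran."""
    stopper = EarlyStopper(patience=patience)
    for epoch, loss in enumerate(losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(losses)
```

In a real training loop, `step` would be called once per epoch with the validation loss, and the loop would break when it returns True.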