hehe

This is code for da pengster (bio ML stuffs)

The background for this project is that we are given two datasets:

- Coupled single-cell measurements of the form (DNA, RNA), where DNA is a ~220k-dimensional vector of ATAC-seq chromatin accessibility measurements (one per genomic region, ~220k regions in total) and RNA is a ~24k-dimensional vector of measured gene expression.
- Coupled single-cell measurements of the form (RNA, protein), where RNA is defined as before and protein is a 140-dimensional vector of surface protein measurements, one per protein.
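
To make the shapes concrete, here is a minimal sketch of how the two paired datasets relate. The class name `PairedDataset`, the variable names, and the rounded sizes for DNA and RNA are illustrative assumptions, not taken from the repo; the real sizes depend on the actual .h5 files.

```python
from dataclasses import dataclass
from typing import Tuple

# Approximate per-cell feature counts from the description above.
# The DNA and RNA sizes are rounded assumptions; only 140 is stated exactly.
DIMS = {"DNA": 220_000, "RNA": 24_000, "protein": 140}


@dataclass(frozen=True)
class PairedDataset:
    """One coupled dataset: every cell has a source vector and a target vector."""
    source: str
    target: str

    @property
    def shapes(self) -> Tuple[int, int]:
        # (source dimension, target dimension) for a single cell
        return DIMS[self.source], DIMS[self.target]


dna_rna = PairedDataset("DNA", "RNA")          # (DNA, RNA) pairs
rna_protein = PairedDataset("RNA", "protein")  # (RNA, protein) pairs

# RNA is the only modality that appears in both datasets.
shared = {dna_rna.source, dna_rna.target} & {rna_protein.source, rna_protein.target}
print(dna_rna.shapes, rna_protein.shapes, shared)
```

RNA being the one shared modality is what links the two datasets to each other.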

- jepa.py contains a Joint Embedding (Predictive) Architecture framework that wraps several individual models (joint embeddings work like CLIP or this, but what I am doing has more moving parts)
- models.py contains the actual models implemented (currently Enformer and a better version of dilated nets)
- trainer.py contains a big boy training framework, complete with lr scheduling, patience algorithms, etc.
- datasets.py implements custom PyTorch Dataset objects to allow for efficient dataloading of the massive .h5 files we use
- env_init.sh sets up the Python virtual environment to make things nice
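
The "patience algorithms" mentioned for trainer.py are not spelled out here. One common patience-based scheme is early stopping: halt training once the validation loss has failed to improve for a fixed number of consecutive epochs. The sketch below assumes that interpretation; the class name `PatienceStopper`, its parameters, and the loss values are hypothetical and not taken from trainer.py.

```python
class PatienceStopper:
    """Signals a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience      # epochs to tolerate without improvement
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best = float("inf")      # best validation loss seen so far
        self.bad_epochs = 0           # consecutive epochs without improvement

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, purely for illustration.
losses = [1.00, 0.80, 0.79, 0.81, 0.80, 0.82, 0.70]
stopper = PatienceStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

The same counter could trigger a learning-rate reduction instead of a stop, which is how schedulers such as PyTorch's `ReduceLROnPlateau` use their `patience` argument.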

The rest of the code is exploratory and not finalized (well, nothing is finalized, but you know what I mean)

edogariu/hehe