Repository for the deep learning models I used in my 2019 summer vacation research at UNSW
e18MouseData.py provides a Dataset class, E18MouseData, which can be used to create a PyTorch-friendly Dataset from GSE93421_brain_aggregate_matrix.hdf5.
This code is intended to be used with GSE93421_brain_aggregate_matrix.hdf5 (ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE93nnn/GSE93421/suppl/GSE93421_brain_aggregate_matrix.hdf5). Further information is available here; however, I have been unable to find a thorough description detailing how this dataset is organized.
The following sections represent my best guess at the dataset's structure. The hdf5 file contains 7 1D lists under the head node 'mm10'.
barcodes (n ~= 1.3 million)
Barcode identifier for each sequenced cell
data (n ~= 2.6 billion)
Count data. Each entry corresponds to a reading for a specific gene and cell. See below for details ...
genes and gene_names (n = 27998)
indices (n ~= 2.6 billion)
This list has a 1-1 correspondence with data. Each entry represents an index in gene_names (0 <= v < 27998). It indicates which gene the corresponding entry in data refers to.
indptr (n ~= 1.3 million)
This list has a 1-1 correspondence with barcodes. Each entry is a pointer to an index in data (monotonically increasing with 0 <= v <~ 2.6 billion). The entries in data between two consecutive values of indptr are count data for the same cell, with the corresponding gene given by the entry at the same position in indices.
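The layout described above matches the usual compressed sparse row (CSR) convention, so a single cell's counts can be recovered by slicing data and indices between two consecutive indptr values. The following is a minimal sketch on toy lists, not the code in e18MouseData.py. The helper names cell_counts and densify are hypothetical, and it assumes indptr carries one trailing entry equal to len(data) so that the last cell also has an end pointer.

```python
def cell_counts(data, indices, indptr, cell):
    """Return {gene_index: count} for one cell from CSR-style lists.

    Assumes indptr[cell + 1] exists, i.e. indptr has one more entry
    than there are cells (an assumption, not confirmed by the dataset docs).
    """
    start, stop = indptr[cell], indptr[cell + 1]
    return {indices[k]: data[k] for k in range(start, stop)}


def densify(data, indices, indptr, cells, n_genes):
    """Expand the selected cells into dense rows of length n_genes."""
    rows = []
    for cell in cells:
        row = [0] * n_genes  # genes with no reading stay at zero
        for gene, count in cell_counts(data, indices, indptr, cell).items():
            row[gene] = count
        rows.append(row)
    return rows


# Toy stand-in for the real lists: 3 cells, 4 genes, 5 non-zero readings.
data = [5, 1, 2, 7, 3]      # counts
indices = [0, 3, 1, 0, 2]   # gene index of each count
indptr = [0, 2, 3, 5]       # where each cell's readings start in data

first_cell = cell_counts(data, indices, indptr, 0)
dense = densify(data, indices, indptr, range(3), n_genes=4)
print(first_cell)
print(dense)
```

The same slicing should carry over to numpy arrays read with h5py. Passing only a subset of cell numbers to densify is one way to expand a fraction of the dataset rather than all of it.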
Notes on Computational Resources
This dataset is very large, especially in its full dense (non-sparse) representation (~36 billion datapoints, i.e. ~1.3 million cells x 27998 genes). This code will require approximately 170GB of RAM to load the full dataset (I provide the option to load only a fraction of it). It takes about 15 minutes to load, even using 20 processes in parallel on a dual-socket Intel E5-2699 (2.2GHz).
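The ~36 billion figure follows from multiplying the approximate cell count by the gene count. The sketch below reproduces that arithmetic and gives a rough lower bound on the memory needed for a given fraction of cells. The function names are hypothetical, the cell count is the rounded 1.3 million quoted above, and 4 bytes per value is an assumption (the dtype used by the loader is not stated here), so this does not attempt to reproduce the 170GB figure.

```python
N_CELLS = 1_300_000        # approximate, from the barcodes list
N_GENES = 27_998           # from genes / gene_names
N_NONZERO = 2_600_000_000  # approximate length of data


def dense_entries(n_cells, n_genes):
    """Number of values in a fully expanded cells x genes matrix."""
    return n_cells * n_genes


def dense_gigabytes(n_cells, n_genes, bytes_per_value=4):
    """Raw size in GB, assuming bytes_per_value per entry (an assumption)."""
    return dense_entries(n_cells, n_genes) * bytes_per_value / 1e9


def cells_for_fraction(n_cells, fraction):
    """How many cells a given fraction of the dataset corresponds to."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return max(1, int(n_cells * fraction))


total = dense_entries(N_CELLS, N_GENES)
print(f"dense entries: {total:,}")
print(f"non-zero share: {N_NONZERO / total:.1%}")
print(f"raw size at 4 bytes/value: {dense_gigabytes(N_CELLS, N_GENES):.0f} GB")

quarter = cells_for_fraction(N_CELLS, 0.25)
print(f"a quarter of the cells: {quarter:,} cells, "
      f"{dense_gigabytes(quarter, N_GENES):.0f} GB")
```

Real memory use will be higher than this raw estimate because of copies, shared-memory buffers and interpreter overhead.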
- My fork of pt-sdae (including a branch compatible with PyTorch 0.35 for CUDA 7.5)
- h5py (for loading in data)
- sharedmem (to allow large shared-memory numpy arrays between processes)
- sklearn (for t-SNE)
- MulticoreTSNE (for multicore-compatible t-SNE)
- umap-learn (for UMAP)