Use np.int8 for indicator matrix in generateMatrix (8x memory reduction, closes #28) #29
Open
chlee-tabin wants to merge 1 commit into
The cell x peak matrix in generateMatrix() only ever stores 0/1 binary indicators (set via matrix[oi, celliddict[cellid]] = 1 in the inner loop), but np.zeros defaults to float64 -- 8 bytes per cell unnecessarily. Switching to dtype=np.int8 cuts peak memory 8x with zero algorithmic change; downstream np.sum() operations work unchanged (they return int64 by default).
Measured impact on duck (Anas platyrhynchos) snATAC multiome:

| Library    | n_cells | float64      | int8     |
|------------|---------|--------------|----------|
| ARC08      | 29,523  | OOM at >240G | 34G peak |
| PILOTARC4A | 34,622  | OOM at >246G | 20G peak |

Without this, AMULET is unrunnable on 10x multiome libraries >=25K cells on standard 256G cluster nodes. Doublet rates with the int8 patch match what AMULET produces on the smaller libs where float64 fits (3.42-7.76% range, q<0.01, n=12 ARC libs). Sci-ATAC-seq3 and similar low-coverage assays don't hit this because per-cell fragment counts are ~5-10x lower.
Long-term, scipy.sparse.csr_matrix would be even better (the indicator is >99% zeros for typical scATAC data), but the int8 change is a strictly-necessary first step that unblocks modern high-coverage 10x users without any other code change.
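The longer-term sparse option mentioned above could look roughly like the sketch below. It is not taken from AMULET: the hit arrays, the toy dimensions, and all variable names are hypothetical, and it assumes the (row, column) hits can be collected before the matrix is built rather than written one at a time.

```python
import numpy as np
from scipy import sparse

# Hypothetical (overlap index, cell index) hits; the pair (3, 1) is repeated
# on purpose, as the same cell can hit the same overlap more than once.
rows = np.array([0, 0, 2, 3, 3, 3])
cols = np.array([1, 4, 2, 0, 1, 1])
n_overlaps, n_cells = 4, 5  # toy dimensions, not real library sizes

# Deduplicate the pairs first: building a CSR matrix from coordinates sums
# repeated entries, which would break the 0/1 indicator semantics.
pairs = np.unique(np.column_stack([rows, cols]), axis=0)
data = np.ones(len(pairs), dtype=np.int8)
indicator = sparse.csr_matrix(
    (data, (pairs[:, 0], pairs[:, 1])), shape=(n_overlaps, n_cells)
)

# Row sums, comparable to np.sum(matrix, axis=1) on the dense matrix.
row_sums = np.asarray(indicator.sum(axis=1)).ravel()

# Dense int8 reference built the same way the dense code writes entries.
dense = np.zeros((n_overlaps, n_cells), dtype=np.int8)
dense[rows, cols] = 1
```

Only the nonzero entries are stored, so memory scales with the number of hits instead of with n_overlaps × n_cells.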
Closes #28.
Summary
One-line fix: generateMatrix allocates the cell × peak indicator matrix as np.zeros((n_peaks, n_cells)), which defaults to float64 (8 bytes per matrix element), but the matrix only ever stores binary 0/1 values. Changing the dtype to np.int8 cuts the matrix's memory footprint 8× with no algorithmic change.
The matrix is only ever written to via matrix[oi, celliddict[cellid]] = 1. All downstream operations (np.sum(matrix, axis=1) in inferRepeats, etc.) work unchanged on int8: NumPy accumulates small-integer sums in the platform default integer (int64 on 64-bit Linux), so the sums cannot overflow int8.
Why this matters
Reproduction context (duck Anas platyrhynchos snATAC multiome, 12 cellranger-arc 2.1.0 libraries, 4 K – 35 K cells/lib, median ~25,000 unique HQ fragments/cell):
(SLURM MaxRSS from sacct; HMS-O2 short partition, 257 GB node cap.)
Without this patch AMULET is unrunnable on these libraries on standard cluster hardware. This affects 10x Chromium scATAC / multiome users with 25 K+ cells/library; sci-ATAC-seq3 and similar low-coverage assays (~4 K reads/cell, e.g. Qiu et al. 2026) don't trigger it because per-cell fragment counts are 5–10× lower.
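To make the scaling concrete, the dense allocation size can be estimated without allocating anything. This is a back-of-the-envelope sketch: the helper name is hypothetical, and the row count is an illustrative assumption, since the PR does not report how many rows the matrix has for these libraries. Only the ARC08 cell count comes from the measurements in this PR.

```python
import numpy as np

def dense_matrix_gib(n_rows, n_cells, dtype):
    """Size in GiB of a dense (n_rows, n_cells) array of the given dtype."""
    return n_rows * n_cells * np.dtype(dtype).itemsize / 2**30

n_cells = 29_523      # ARC08 cell count reported in this PR
n_rows = 1_000_000    # ASSUMPTION: illustrative row count, not a measured value

f64_gib = dense_matrix_gib(n_rows, n_cells, np.float64)  # np.zeros default dtype
int8_gib = dense_matrix_gib(n_rows, n_cells, np.int8)

print(f"float64: {f64_gib:.1f} GiB, int8: {int8_gib:.1f} GiB, "
      f"ratio: {f64_gib / int8_gib:.0f}x")
```

The ratio depends only on the element size, so it holds for any matrix shape; the absolute numbers depend on the assumed row count.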
Verification
Doublet rates produced from int8 vs float64 matrices are identical, confirmed on the 4 small libraries (ARC0S-duck, PILOTARC3, PILOTARC4B, ARC04) where float64 happens to fit in memory. Final 12-library doublet-rate range: 3.42 – 7.76 % at FDR q<0.01.
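The dtype-independence of the row sums can also be checked in isolation. The sketch below uses a random toy matrix, not AMULET data, and all names and sizes are hypothetical; it only illustrates why np.sum over an int8 indicator matrix matches the float64 result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cells = 200, 300  # toy sizes, far smaller than a real library

# Random (row, column) hits, written the same way into both matrices.
hit_rows = rng.integers(0, n_rows, size=500)
hit_cols = rng.integers(0, n_cells, size=500)

m_float64 = np.zeros((n_rows, n_cells))               # default dtype
m_int8 = np.zeros((n_rows, n_cells), dtype=np.int8)   # patched dtype
m_float64[hit_rows, hit_cols] = 1
m_int8[hit_rows, hit_cols] = 1

sums_float64 = np.sum(m_float64, axis=1)
sums_int8 = np.sum(m_int8, axis=1)
same = np.array_equal(sums_float64, sums_int8)

# A row with more than 127 ones: the sum is accumulated in the platform
# default integer, so it does not wrap around at the int8 limit.
full_row_sum = np.ones(n_cells, dtype=np.int8).sum()
```

This only covers the summation step; the doublet-rate comparison on real libraries is the check reported above.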
Follow-ups
Long-term, scipy.sparse.csr_matrix would be even better: the indicator matrix is >99% zeros for typical scATAC data, so sparse would drop memory another order of magnitude. That's a bigger refactor though. This int8 change is a strictly necessary first step that unblocks AMULET on modern high-coverage 10x data with a one-line patch.
Note on coexistence with #27
This PR is orthogonal to #27 (which fixes np.object → object for NumPy ≥1.24). Both fixes are needed for AMULET to run cleanly on a modern conda env; they touch disjoint lines and can be merged in either order.