Memory usage problems #69

Closed
pschloss opened this issue Jul 14, 2023 · 4 comments


@pschloss

I'm trying to use gemelli to generate a distance matrix on a large, sparse ASV table that has 97,791 ASVs and 285 samples. I have a couple of larger datasets that I'd also like to analyze to get the rclr-based Aitchison distances. Unfortunately, this dataset segfaults when I give it 98 GB of RAM, and the two other datasets throw an error indicating that numpy is unable to allocate 309 and 450 GB of space for arrays on the order of 200k by 200k. I readily admit that I'm not 100% sure I'm using gemelli correctly, and I'm wondering whether there's a way around the memory limitations to generate the distance matrix. Here's the script I'm running, along with a compressed version of the data for the 97,791 ASV by 285 sample dataset. Any help you can offer would be great.

#!/usr/bin/env python3

import numpy as np
import pandas as pd
from biom import Table
from gemelli.rpca import rpca

np.seterr(divide='ignore')  # silence log-of-zero warnings from the rclr step

# Read the ASV table, with ASV IDs in the first column
# (rows = ASVs, columns = samples).
df = pd.read_csv("data.gemelli.tsv", sep="\t", index_col=0)

# Convert to a BIOM table: Table(data, observation_ids, sample_ids).
biom = Table(df.to_numpy(), df.index, df.columns)

# rpca() applies the rclr transformation internally and returns the
# ordination and the Aitchison distance matrix.
ordination, distance_matrix = rpca(biom)

distance_matrix.to_data_frame().to_csv("data.gemelli.dist", sep="\t")

data.gemelli.tsv.gz

@cameronmartino
Collaborator

Hi Pat,

Thanks for reporting this!

This surprised me, since I have run large datasets like the AGP and EMP through RPCA on my laptop without an issue. It turns out that the matrix provided is very high rank (almost full rank), higher than I have observed before (e.g., higher than the 88 soils dataset; see here). We noted in the original paper (see here): "A low-rank constraint can possibly cause misleading results in the case of high-rank data sets. High-rank data sets may occur in microbiome datasets as a gradient between samples and features."

The memory requirements are blowing up because the low-rank assumptions of RPCA, which are what allow for efficient computation, are being violated. Basically, because each sample is so different from every other sample, you would need the number of PCs to equal the number of samples to reconstruct the matrix well from the singular value decomposition. For that reason, I would strongly suggest not using RCLR/RPCA with this data.
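
For a quick sense of this, you can eyeball the rank structure with plain numpy by checking how fast the singular values decay (a rough sketch, not gemelli's internal check; X here is a random stand-in for an rclr-transformed table):

import numpy as np

# X stands in for the rclr-transformed table (random here, hence
# effectively full rank; rows = samples, columns = features).
rng = np.random.default_rng(0)
X = rng.standard_normal((285, 1000))

# Singular values in decreasing order.
s = np.linalg.svd(X, compute_uv=False)

# How many components are needed to explain ~90% of the variance?
# If the answer approaches the number of samples, the low-rank
# assumption behind RPCA is being violated.
explained = np.cumsum(s**2) / np.sum(s**2)
k90 = int(np.searchsorted(explained, 0.90)) + 1
print(f"components for 90% of variance: {k90} of {len(s)}")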

Normally, our low-rank assumption test would issue a warning for this, but for some reason it is not working on this matrix. I added issue #70 to dig into this ASAP and make sure the next version returns an informative error and prevents this memory blow-up!

Thanks again for reporting this,

Cameron

@pschloss
Author

pschloss commented Aug 3, 2023

My pleasure! Is it safe to assume that if the data goes through successfully without blowing up the RAM, it is okay to use RCLR/RPCA with these data?

@cameronmartino
Collaborator

Good question. For now, no, at least while I work out where the bug is in the rank-based screen that should have caught this edge case and raised an input error.

An easy way to check in the interim is to use another method without stringent low-rank assumptions (e.g., a heatmap of a beta-diversity distance matrix). If the data shows reasonable sample clusters (fewer clusters than the number of samples), it should be okay; see the sketch below. However, if it shows a gradient of samples, as mentioned above and as seen in a single-infant short time series or in the 88 soils example, I would not proceed. The study design itself should also be a helpful screen: if you have 50 samples from armpits and 50 from fecal material, sequenced and processed in a conventional way (i.e., no commands applied outside of well-tested pipelines), it would be safer to assume the data is low rank based on the prior literature.
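
As a rough sketch of that kind of check (Bray-Curtis and average linkage are just example choices here, not a prescription):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical samples-by-features count matrix (rows = samples).
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(100, 500)).astype(float)

# Beta-diversity distances between samples (Bray-Curtis here).
dist = pdist(counts, metric="braycurtis")

# Average-linkage clustering, cut at an arbitrary height. Far fewer
# clusters than samples suggests usable cluster structure; roughly one
# cluster per sample suggests a gradient.
clusters = fcluster(linkage(dist, method="average"), t=0.5, criterion="distance")
print(f"{np.unique(clusters).size} clusters across {counts.shape[0]} samples")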

@cameronmartino
Collaborator

Hi @pschloss,

Addressing this issue: the function auto-rpca has now been deprecated. There is no way I know of (or could find) to determine the underlying low-rank structure explicitly ahead of time in this type of data. Moreover, the underlying full-rank data seen here (which is causing your memory issues) is not something I have encountered in real data; I checked hundreds of datasets (all of the data in Qiita) and ran the large ones (e.g., EMP/AGP) in a matter of minutes with low memory on my laptop. Since this does not seem to be a common issue, and there is no obvious way to issue a warning or ValueError, I am closing this issue. Feel free to reopen it if you have other ideas.

To help with future issues along these lines, I added a cross-validation method, among other QC methods; you can find them in the README in the version released today.
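
For intuition, that kind of cross-validation boils down to holding out entries, reconstructing the matrix at each candidate rank, and scoring the held-out error (a generic numpy illustration, not gemelli's implementation):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 400))  # stand-in for an rclr-transformed table

# Hold out 10% of the entries, fit a truncated SVD on the rest (naively
# zero-filling the held-out cells for brevity), and score reconstruction
# error on the held-out cells at each candidate rank. For genuinely
# low-rank data, the error bottoms out near the true rank.
mask = rng.random(X.shape) < 0.10
U, s, Vt = np.linalg.svd(np.where(mask, 0.0, X), full_matrices=False)
for k in (2, 5, 10, 50):
    Xk = (U[:, :k] * s[:k]) @ Vt[:k]
    rmse = np.sqrt(np.mean((X[mask] - Xk[mask]) ** 2))
    print(f"rank {k}: held-out RMSE = {rmse:.3f}")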

Thanks again,

Cameron
