Memory usage problems #69

Closed
pschloss opened this issue Jul 14, 2023 · 4 comments


@pschloss

I'm trying to use gemelli to generate a distance matrix on a large, sparse ASV table that has 97,791 ASVs and 285 samples. I have a couple of larger datasets that I'd also like to analyze to get the rclr-based Aitchison distances. Unfortunately, this dataset segfaults when I give it 98 GB of RAM, and the two other datasets throw an error indicating that numpy is unable to allocate 309 and 450 GB of space for arrays on the order of 200k by 200k. I readily admit that I'm not 100% sure I'm using gemelli correctly, and I'm wondering whether there's a way around the memory limitations to generate the distance matrix. Here's the script I'm running, along with a compressed version of the data for the 97,791 ASV by 285 sample dataset. Any help you can offer would be great.

#!/usr/bin/env python3

import numpy as np
import pandas as pd
from biom import Table
from gemelli.rpca import rpca

np.seterr(divide='ignore')  # silence log-of-zero warnings from the rclr step

# Read the ASV table, with ASV IDs in the first column
# (rows = ASVs, columns = samples).
df = pd.read_csv("data.gemelli.tsv", sep="\t", index_col=0)

# Convert to a BIOM table: Table(data, observation_ids, sample_ids).
biom = Table(df.to_numpy(), df.index, df.columns)

# rpca() applies the rclr transformation internally and returns the
# ordination and the Aitchison distance matrix.
ordination, distance_matrix = rpca(biom)

distance_matrix.to_data_frame().to_csv("data.gemelli.dist", sep="\t")

data.gemelli.tsv.gz

@cameronmartino
Collaborator

Hi Pat,

Thanks for reporting this!

This surprised me, since I have run large datasets like the AGP and EMP through RPCA on my laptop without an issue. It turns out that the matrix provided is very high rank (almost full rank), higher than I have observed before (e.g., higher than the 88 soils dataset; see here). We noted in the original paper (see here): "A low-rank constraint can possibly cause misleading results in the case of high-rank data sets. High-rank data sets may occur in microbiome datasets as a gradient between samples and features."

The memory requirements are blowing up because the low-rank assumptions of RPCA, which are what allow for efficient computation, are being violated. Basically, because each sample is so different from every other sample, you would need the number of PCs to equal the number of samples to reconstruct the matrix well from the singular value decomposition. For that reason, I would strongly suggest not using RCLR/RPCA with this data.
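
For a quick sense of this, you can eyeball the rank structure with plain numpy by checking how fast the singular values decay (a rough sketch, not gemelli's internal check; X here is a random stand-in for an rclr-transformed table):

import numpy as np

# X stands in for the rclr-transformed table (random here, hence
# effectively full rank; rows = samples, columns = features).
rng = np.random.default_rng(0)
X = rng.standard_normal((285, 1000))

# Singular values in decreasing order.
s = np.linalg.svd(X, compute_uv=False)

# How many components are needed to explain ~90% of the variance?
# If the answer approaches the number of samples, the low-rank
# assumption behind RPCA is being violated.
explained = np.cumsum(s**2) / np.sum(s**2)
k90 = int(np.searchsorted(explained, 0.90)) + 1
print(f"components for 90% of variance: {k90} of {len(s)}")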

Normally, our low-rank assumption test would issue a warning for this, but for some reason it is not working on this matrix. I added issue #70 to dig into this ASAP and make sure the next version returns an informative error and prevents this memory blow-up!

Thanks again for reporting this,

Cameron

@pschloss
Author

pschloss commented Aug 3, 2023

My pleasure! Is it safe to assume that if the data goes through successfully without blowing up the RAM, it is okay to use RCLR/RPCA with these data?

@cameronmartino
Collaborator

Good question. For now, no, at least while I work out where the bug is in the rank-based screen that should have caught this edge case and raised an input error.

An easy way to check in the interim is to use another method without stringent low-rank assumptions (e.g., a heatmap of a beta-diversity distance matrix). If the data shows reasonable sample clusters (fewer clusters than the number of samples), it should be okay; see the sketch below. However, if it shows a gradient of samples, as mentioned above and as seen in a single-infant short time series or in the 88 soils example, I would not proceed. The study design itself should also be a helpful screen: if you have 50 samples from armpits and 50 from fecal material, sequenced and processed in a conventional way (i.e., no commands applied outside of well-tested pipelines), it would be safer to assume the data is low rank based on the prior literature.
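
As a rough sketch of that kind of check (Bray-Curtis and average linkage are just example choices here, not a prescription):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical samples-by-features count matrix (rows = samples).
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(100, 500)).astype(float)

# Beta-diversity distances between samples (Bray-Curtis here).
dist = pdist(counts, metric="braycurtis")

# Average-linkage clustering, cut at an arbitrary height. Far fewer
# clusters than samples suggests usable cluster structure; roughly one
# cluster per sample suggests a gradient.
clusters = fcluster(linkage(dist, method="average"), t=0.5, criterion="distance")
print(f"{np.unique(clusters).size} clusters across {counts.shape[0]} samples")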

@cameronmartino
Collaborator

Hi @pschloss,

Addressing this issue: the function auto-rpca has now been deprecated. There is no way I know of (or could find) to determine the underlying low-rank structure explicitly ahead of time in this type of data. Moreover, the underlying full-rank data seen here (which is causing your memory issues) is not something I have encountered in real data; I checked hundreds of datasets (all of the data in Qiita) and ran the large ones (e.g., EMP/AGP) in a matter of minutes with low memory on my laptop. Since this does not seem to be a common issue, and there is no obvious way to issue a warning or ValueError, I am closing this issue. Feel free to reopen it if you have other ideas.

To help with future issues along these lines, I added a cross-validation method, among other QC methods; you can find them in the README in the version released today.
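
For intuition, that kind of cross-validation boils down to holding out entries, reconstructing the matrix at each candidate rank, and scoring the held-out error (a generic numpy illustration, not gemelli's implementation):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 400))  # stand-in for an rclr-transformed table

# Hold out 10% of the entries, fit a truncated SVD on the rest (naively
# zero-filling the held-out cells for brevity), and score reconstruction
# error on the held-out cells at each candidate rank. For genuinely
# low-rank data, the error bottoms out near the true rank.
mask = rng.random(X.shape) < 0.10
U, s, Vt = np.linalg.svd(np.where(mask, 0.0, X), full_matrices=False)
for k in (2, 5, 10, 50):
    Xk = (U[:, :k] * s[:k]) @ Vt[:k]
    rmse = np.sqrt(np.mean((X[mask] - Xk[mask]) ** 2))
    print(f"rank {k}: held-out RMSE = {rmse:.3f}")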

Thanks again,

Cameron
