
Table.collapse performance issue #887

Closed
qiyunzhu opened this issue Dec 23, 2022 · 4 comments · Fixed by #888

@qiyunzhu
Contributor

I think that the Table.collapse method may not have the best performance in its current form. Woltka calls this method to collapse BIOM tables. I found that it is quite slow when working with large BIOM tables, especially those of functional units (ORFs, KOs, UniRefs, etc.), which are common in shotgun metagenomics. For example, collapsing 2,065,892 ORFs by 25 samples (a very small test dataset) took 14m03.86s and 7.45 GB RAM. Real-world datasets would likely be even slower.

I did a simple benchmark using the Moving Pictures dataset, which has 545 ASVs mapped to 197 genera, from 34 samples. Using biom-format's collapse, it took 217 ms ± 838 µs.
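For reference, the setup behind the snippets below looks roughly like this (a minimal sketch; the file names are placeholders, and the exact code is in the attached notebook):

import biom

# Placeholder paths: a BIOM table of 545 ASVs x 34 samples and a two-column
# ASV-to-genus mapping file.
table = biom.load_table('moving_pictures.biom')
map_ = {}
with open('asv_to_genus.tsv') as f:
    for line in f:
        asv, genus = line.rstrip('\n').split('\t')
        map_[asv] = genus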

res = table.collapse(lambda id_, _: map_[id_], norm=False, axis='observation')

I did some research and found that there may be more efficient solutions, at least in certain scenarios. The most straightforward one may be Pandas' groupby, which is well optimized and generic. In my benchmark it took 1.37 ms ± 23.8 µs.

res = table.to_dataframe(dense=True).groupby(map_).agg(sum)

I found several articles discussing the underlying mechanisms and alternative implementations of groupby: here, here, and here. Drawing on these articles, I also came up with a pure NumPy solution, which isn't as fast as groupby but comes close. It took 2.38 ms ± 2.7 µs.

import numpy as np

# Map each observation to its group index, then accumulate rows per group.
keys = [map_[x] for x in table.ids('observation')]
uniq, group = np.unique(keys, return_inverse=True)
res = np.zeros([uniq.shape[0], table.shape[1]])
np.add.at(res, group, table.matrix_data.toarray())

Both the Pandas and the NumPy solutions suffer from the heavy cost of converting the sparse matrix in a BIOM table into a dense matrix. However, according to these articles, there appears to be a solution that works directly on SciPy's sparse matrices. I don't fully understand it, but I suspect the job could be done inside BIOM without densifying.
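For illustration, something along these lines might work as a sparse-native group-sum (a rough, untested sketch, assuming map_ covers every observation ID); it relies on scipy.sparse summing duplicate coordinates when a COO matrix is converted to CSR:

import numpy as np
from scipy import sparse

# Sparse group-sum over observations (rows) without densifying.
X = table.matrix_data.tocoo()  # (n_obs, n_samples)
keys = [map_[x] for x in table.ids('observation')]
uniq, group = np.unique(keys, return_inverse=True)
# Re-map each nonzero's row index to its group index; duplicate (row, col)
# entries are summed when the COO matrix is converted to CSR.
res = sparse.coo_matrix(
    (X.data, (group[X.row], X.col)),
    shape=(uniq.shape[0], X.shape[1]),
).tocsr()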

Attached is my code for benchmarking. @wasade

collapse.ipynb.zip

@mortonjt
Contributor

I can't comment on the general purpose of the collapse function, since it seems to come with quite a few bells and whistles.
But if you just need a groupby function, you can replace it with a sparse matrix multiplication, which should scale well with sparse data. It doesn't look like that is going on in the collapse code, but I would be curious whether it could improve the runtime.
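Roughly what I have in mind (an untested sketch, reusing the table and map_ variables from the benchmark above): build a sparse group-indicator matrix and multiply it against the table's sparse matrix, so everything stays sparse:

import numpy as np
from scipy import sparse

keys = [map_[x] for x in table.ids('observation')]
uniq, group = np.unique(keys, return_inverse=True)
n_obs = len(keys)
# Sparse (n_groups x n_obs) indicator matrix: G[g, i] = 1 if observation i
# belongs to group g.
G = sparse.csr_matrix(
    (np.ones(n_obs), (group, np.arange(n_obs))),
    shape=(uniq.shape[0], n_obs),
)
# The sparse product performs the group-sum without densifying.
res = G @ table.matrix_data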

@wasade
Member

wasade commented Dec 23, 2022 via email

@mortonjt
Contributor

mortonjt commented Dec 23, 2022 via email

@wasade
Member

wasade commented Dec 30, 2022

This issue conflates the one-to-many and one-to-one collapse modes. The former was the root of the slowdown seen with the woltka collapse; it was a bug and is now fixed in #888.

For one-to-one, while it is true that a faster implementation could be written, the data presented here (as well as out-of-band) do not indicate that the present one-to-one performance is a meaningful bottleneck. Until data show that one-to-one is a bottleneck, we will not be pursuing further optimization beyond what was already done in #866 and #761.
