
Table.collapse performance issue #887

Closed
qiyunzhu opened this issue Dec 23, 2022 · 4 comments · Fixed by #888

@qiyunzhu
Contributor

I think that the Table.collapse method may not have the best performance in its current form. Woltka calls this method to collapse BIOM tables. I found that it is quite slow when working with large BIOM tables, especially those of functional units (ORFs, KOs, UniRefs, etc.), which are common in shotgun metagenomics. For example, collapsing 2,065,892 ORFs by 25 samples (a very small test dataset) took 14m03.86s and 7.45 GB RAM. Real-world datasets would likely be even slower.

I did a simple benchmark using the Moving Pictures dataset, which has 545 ASVs mapped to 197 genera, from 34 samples. Using biom-format's collapse, it took 217 ms ± 838 µs.
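For reference, the setup behind the snippets below looks roughly like this (a minimal sketch; the file names are placeholders, and the exact code is in the attached notebook):

import biom

# Placeholder paths: a BIOM table of 545 ASVs x 34 samples and a two-column
# ASV-to-genus mapping file.
table = biom.load_table('moving_pictures.biom')
map_ = {}
with open('asv_to_genus.tsv') as f:
    for line in f:
        asv, genus = line.rstrip('\n').split('\t')
        map_[asv] = genus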

res = table.collapse(lambda id_, _: map_[id_], norm=False, axis='observation')

I did some research and found that there may be more efficient solutions, at least in certain scenarios. The most straightforward one may be Pandas' groupby, which is well optimized and generic. In my benchmark it took 1.37 ms ± 23.8 µs.

res = table.to_dataframe(dense=True).groupby(map_).agg(sum)

I found several articles discussing the underlying mechanisms and alternative implementations of groupby: here, here, and here. Drawing on these articles, I also came up with a pure NumPy solution, which isn't as fast as groupby but comes close. It took 2.38 ms ± 2.7 µs.

import numpy as np

# Map each observation to its group index, then accumulate rows per group.
keys = [map_[x] for x in table.ids('observation')]
uniq, group = np.unique(keys, return_inverse=True)
res = np.zeros([uniq.shape[0], table.shape[1]])
np.add.at(res, group, table.matrix_data.toarray())

Both the Pandas and the NumPy solutions suffer from the heavy cost of converting the sparse matrix in a BIOM table into a dense matrix. However, according to these articles, there appears to be a solution that works directly on SciPy's sparse matrices. I don't fully understand it, but I suspect the job could be done inside BIOM without densifying.
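For illustration, something along these lines might work as a sparse-native group-sum (a rough, untested sketch, assuming map_ covers every observation ID); it relies on scipy.sparse summing duplicate coordinates when a COO matrix is converted to CSR:

import numpy as np
from scipy import sparse

# Sparse group-sum over observations (rows) without densifying.
X = table.matrix_data.tocoo()  # (n_obs, n_samples)
keys = [map_[x] for x in table.ids('observation')]
uniq, group = np.unique(keys, return_inverse=True)
# Re-map each nonzero's row index to its group index; duplicate (row, col)
# entries are summed when the COO matrix is converted to CSR.
res = sparse.coo_matrix(
    (X.data, (group[X.row], X.col)),
    shape=(uniq.shape[0], X.shape[1]),
).tocsr()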

Attached is my code for benchmarking. @wasade

collapse.ipynb.zip

@mortonjt
Contributor

I can't comment on the general purpose of the collapse function, since it seems to come with quite a few bells and whistles.
But if you just need a groupby function, you can replace it with a sparse matrix multiplication, which should scale well with sparse data. It doesn't look like that is going on in the collapse code, but I would be curious whether it could improve the runtime.
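Roughly what I have in mind (an untested sketch, reusing the table and map_ variables from the benchmark above): build a sparse group-indicator matrix and multiply it against the table's sparse matrix, so everything stays sparse:

import numpy as np
from scipy import sparse

keys = [map_[x] for x in table.ids('observation')]
uniq, group = np.unique(keys, return_inverse=True)
n_obs = len(keys)
# Sparse (n_groups x n_obs) indicator matrix: G[g, i] = 1 if observation i
# belongs to group g.
G = sparse.csr_matrix(
    (np.ones(n_obs), (group, np.arange(n_obs))),
    shape=(uniq.shape[0], n_obs),
)
# The sparse product performs the group-sum without densifying.
res = G @ table.matrix_data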

@wasade
Member

wasade commented Dec 23, 2022 via email

@mortonjt
Contributor

mortonjt commented Dec 23, 2022 via email

@wasade
Member

wasade commented Dec 30, 2022

This issue conflates the one-to-many and one-to-one collapse modes. The former was the root of the slowdown seen with the woltka collapse; it was a bug and is now fixed in #888.

For one-to-one, while it is true that a faster implementation could be written, the data presented here (as well as out-of-band) do not indicate that the present one-to-one performance is a meaningful bottleneck. Until data show that one-to-one is a bottleneck, we will not be pursuing further optimization beyond what was already done in #866 and #761.
