heap overflow with very long sequences #6

Closed
bertsky opened this issue Oct 22, 2018 · 6 comments

bertsky commented Oct 22, 2018

I get extreme memory consumption when trying to align very long sequences (strings of about 10k characters) with Python 3.6. Calling SequenceMatcher.get_opcodes() never terminates; it allocates more and more memory (up to 28 GB resident) until interrupted. This happens even with identical a and b, as long as the sequences are long enough.

Minimal example:

from edit_distance.code import SequenceMatcher
a = u"x" * 10000 # dummy text
b = u"x" * 10000 # same
matcher = SequenceMatcher(a, b)
matcher.get_opcodes()

belambert (Owner) commented

Hi @bertsky - the problem is that to get the opcodes, the library creates an O(n*m) matrix to keep track of partial alignments. If you just need the distance, it doesn't need that matrix and will only use O(min(n, m)) memory. So running matcher.distance() should be much faster and less memory-intensive than running get_opcodes(). I'm not sure if there is a general-purpose way to get the alignment with less than O(n*m) memory.
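
For reference, a minimal sketch of the distance-only call on the inputs from the example above (assuming the same constructor usage as in that example):

from edit_distance.code import SequenceMatcher

a = u"x" * 10000
b = u"x" * 10000
matcher = SequenceMatcher(a, b)
print(matcher.distance())  # distance only; avoids building the O(n*m) alignment matrix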


bertsky commented Oct 22, 2018

Hi @belambert,

I do need the alignments themselves, not just the distances.

Of course, the general complexity of this is O(n*m). But that bound should not be tight for the benign case of sequences that are very similar, wouldn't you agree?

And even if this were implemented pessimistically, the above would indicate quite a large constant factor. With n=m=10,000, storing a 4-byte (uint32) entry for each cell would require 4e8 bytes (0.4 GB). But I already get over 4 GB with n=m=4000.

(In contrast, difflib takes 11 MB on n=m=10,000.)

belambert (Owner) commented

Thanks @bertsky for bringing this up. I've taken a look at the code, and the memory use could be reduced. In particular, it's creating m*n opcode lists like ['insert', 0, 1, 2], while it would be sufficient to save just the operation and create the opcode lists after the shortest path has been found, i.e. these:
https://github.com/belambert/edit-distance/blob/master/edit_distance/code.py#L320

That could conceivably reduce the memory use by 5x or so. I'm not sure when I'd have a chance to do that, though. Maybe that's something you want to take on?
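
To illustrate that idea, here is a rough sketch with a plain Levenshtein DP (not the library's actual code): each cell stores only a single-character operation, and the difflib-style opcode tuples are built only while backtracking the optimal path.

def align(a, b):
    n, m = len(a), len(b)
    # prev/cur hold one DP row of costs; bp is the (n+1) x (m+1) back-pointer
    # matrix of single characters: '=' equal, 'r' replace, 'd' delete, 'i' insert.
    prev = list(range(m + 1))
    bp = [[None] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        bp[0][j] = 'i'
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        bp[i][0] = 'd'
        for j in range(1, m + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            dele = prev[j] + 1
            ins = cur[j - 1] + 1
            best = min(sub, dele, ins)
            cur[j] = best
            if best == sub:
                bp[i][j] = '=' if a[i - 1] == b[j - 1] else 'r'
            elif best == dele:
                bp[i][j] = 'd'
            else:
                bp[i][j] = 'i'
        prev = cur
    # Only now materialize difflib-style (tag, i1, i2, j1, j2) opcodes by walking
    # the back pointers (per-character; merging adjacent tags into ranges is
    # omitted for brevity).
    names = {'=': 'equal', 'r': 'replace', 'd': 'delete', 'i': 'insert'}
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        op = bp[i][j]
        if op in ('=', 'r'):
            i, j = i - 1, j - 1
            ops.append((names[op], i, i + 1, j, j + 1))
        elif op == 'd':
            i -= 1
            ops.append(('delete', i, i + 1, j, j))
        else:
            j -= 1
            ops.append(('insert', i, i, j, j + 1))
    ops.reverse()
    return prev[m], ops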

I'm not familiar with a less than quadratic algorithm, so I'd love to learn about one if that's possible.


bertsky commented Oct 29, 2018

Yes, that would improve the constant factor of the memory requirement. I guess the string opcodes themselves are not an issue, because they should be interned by the compiler.

But beyond that factor, space complexity itself should be O(min(s, m, n)), where s is the length of the alignment path in question, and time complexity should be O(s * min(m, n)). So memory consumption only needs to be linear, not quadratic, and very similar sequences should require very little time and space.

See the section on time and space complexity in edlib's README. It refers to the algorithms by Myers (1999) and Ukkonen (1985).
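
To make the complexity claim concrete for the distance part, here is a rough sketch of Ukkonen-style band doubling. This is an illustration only, not edlib's or this library's implementation, and it does not produce the alignment (that needs a linear-space divide-and-conquer refinement on top). Time is O(d * min(n, m)) and extra space O(min(n, m)), where d is the edit distance, so near-identical sequences are cheap.

def banded_distance(a, b, k):
    # Levenshtein distance restricted to the diagonal band |i - j| <= k.
    # Returns k + 1 if the true distance exceeds k.
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return k + 1
    INF = k + 1
    prev = [j if j <= k else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1, INF)
        prev = cur
    return prev[m]

def distance(a, b):
    # Keep the shorter sequence as b so each DP row has O(min(n, m)) cells,
    # then double the band width until the computed distance fits inside it.
    if len(b) > len(a):
        a, b = b, a
    k = 1
    while True:
        d = banded_distance(a, b, k)
        if d <= k:
            return d
        k *= 2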

Unfortunately, I do not have time to assist with a PR right now. (I still think keeping a difflib-like API is a highly valuable characteristic. And edlib does not even work on non-ASCII strings.)

belambert (Owner) commented

I made some changes on a branch efficient-bps which reduce the memory consumption a lot. The example you gave now completes using about 800 MB of memory on my machine. I'd like to do a bit more testing before merging this, but maybe you want to try out the branch?

It's still quadratic in space, with almost all of the memory use being a single n*m back-pointer array (of pointers), so it won't work so well if you need to bump the sizes of the sequences up another order of magnitude.

belambert (Owner) commented

@bertsky I had a train ride today and took the opportunity to test the optimization I wrote a while back. I merged it and deployed it to PyPI. I'm able to run the example you gave with <1 GB of memory now. I'm going to close this issue since I'm unlikely to implement anything sub-quadratic. Let me know if this works for you now.
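
For anyone who wants to reproduce the measurement locally, here is a quick sketch using the standard library's tracemalloc (it only tracks allocations made through Python's allocator, so treat the number as a rough indicator):

import tracemalloc
from edit_distance.code import SequenceMatcher

a = u"x" * 10000
b = u"x" * 10000

tracemalloc.start()
matcher = SequenceMatcher(a, b)
opcodes = matcher.get_opcodes()
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print("peak traced memory: %.1f MB" % (peak / 1e6))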
