heap overflow with very long sequences #6
Comments
Hi @bertsky - the problem is that to get the opcodes, the library creates an […]
Hi @belambert, I do need the alignments themselves, not just the distances. Of course, the general complexity of this is […]. And even if this were implemented pessimistically, the above would indicate quite a large linear factor. With […] (In contrast, difflib takes 11 MB on […].)
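For anyone who wants to reproduce a difflib baseline like the one mentioned above, here is a minimal sketch using the standard library's `tracemalloc`. The input actually used in this thread is not shown, so the 10k-character string below is an assumption, and `peak_memory_of_opcodes` is a hypothetical helper name. The peak it reports covers Python-level allocations only, so it will not match resident-memory figures exactly.

```python
import difflib
import tracemalloc


def peak_memory_of_opcodes(a, b):
    """Return (opcodes, peak_bytes) for difflib.SequenceMatcher on a and b."""
    tracemalloc.start()
    try:
        # Default settings (isjunk=None, autojunk heuristic enabled).
        matcher = difflib.SequenceMatcher(None, a, b)
        opcodes = matcher.get_opcodes()
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return opcodes, peak


# Assumed input: two identical strings of 10,000 characters.
text = "abcdefghij" * 1000
opcodes, peak = peak_memory_of_opcodes(text, text)
print(len(opcodes), "opcode(s), peak", round(peak / 2**20, 2), "MiB")
```

Swapping in another matcher with the same `get_opcodes()` interface should give a like-for-like comparison, as long as both runs use the same inputs.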
Thanks @bertsky for bringing this up. I've taken a look at the code, and the memory use could be reduced. In particular, it's creating m*n opcode lists […]. That could conceivably reduce the memory use by 5x or so. I'm not sure when I'd have a chance to do that though. Maybe that's something you want to take on? I'm not familiar with a less-than-quadratic algorithm, so I'd love to learn about one if that's possible.
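One way to read the idea above, sketched under assumptions: rather than keeping a list of opcodes in each of the m*n cells, keep a single backpointer byte per cell and rebuild the opcodes once at the end by tracing back from the last cell. This is not the library's actual code. The function name `edit_opcodes`, the per-element (ungrouped) opcodes, and the difflib-like tuple shape `(tag, i1, i2, j1, j2)` are all choices made here for illustration. Space is still quadratic; only the constant factor shrinks.

```python
EQUAL, REPLACE, INSERT, DELETE = 0, 1, 2, 3
NAMES = ("equal", "replace", "insert", "delete")


def edit_opcodes(a, b):
    """Return (distance, opcodes) using one backpointer byte per DP cell."""
    m, n = len(a), len(b)
    width = n + 1
    # Still (m+1)*(n+1) cells, but each cell is a single byte.
    back = bytearray((m + 1) * width)
    for j in range(1, width):
        back[j] = INSERT  # first row: only insertions are possible
    prev = list(range(width))  # distances are kept for one row at a time
    for i in range(1, m + 1):
        back[i * width] = DELETE  # first column: only deletions
        cur = [i] + [0] * n
        for j in range(1, width):
            if a[i - 1] == b[j - 1]:
                best, op = prev[j - 1], EQUAL
            else:
                best, op = prev[j - 1] + 1, REPLACE
            if prev[j] + 1 < best:
                best, op = prev[j] + 1, DELETE
            if cur[j - 1] + 1 < best:
                best, op = cur[j - 1] + 1, INSERT
            cur[j] = best
            back[i * width + j] = op
        prev = cur
    # Trace back from the bottom-right corner to recover the alignment.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        op = back[i * width + j]
        if op == INSERT:
            ops.append(("insert", i, i, j - 1, j))
            j -= 1
        elif op == DELETE:
            ops.append(("delete", i - 1, i, j, j))
            i -= 1
        else:
            ops.append((NAMES[op], i - 1, i, j - 1, j))
            i -= 1
            j -= 1
    ops.reverse()
    return prev[n], ops
```

For two 10k-element sequences the backpointer table in this sketch is about 100 MB, plus two rows of distances. A real implementation would probably also merge adjacent opcodes with the same tag.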
Yes, that would improve on the linear factor of the memory requirement. I guess the string opcodes themselves are not an issue, because they should be interned by the compiler. But above that factor, space complexity itself should be […]. See the section on time and space complexity in edlib's README. It refers to the algorithms by Myers 1999 and Ukkonen 1985. Unfortunately, I do not have time to assist with a PR right now. (I still think keeping a difflib-like API is a highly valuable characteristic. And edlib does not even work on non-ASCII strings.)
I made some changes on a branch […]. It's still quadratic in space, with almost all memory use being a single […]
@bertsky I had a train ride today and took the opportunity to test the optimization I wrote a while back. Merged it and deployed to PyPI. I'm able to run the example you give with <1 GB of memory now. I'm going to close this issue since I'm unlikely to implement anything that's sub-quadratic. Let me know if this works for you now.
I get extreme memory consumption when trying to align very long sequences with Python 3.6 (strings of about 10k characters). Calling `SequenceMatcher.get_opcodes()` never terminates, allocating more and more memory (up to 28 GB resident) until interrupted. This happens even with identical `a` and `b`, as long as the sequence is long enough.
Minimal example: […]