Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory issue with 0.4.0+ #77

Closed
ov7a opened this issue Aug 15, 2018 · 7 comments
Closed

Memory issue with 0.4.0+ #77

ov7a opened this issue Aug 15, 2018 · 7 comments

Comments

@ov7a
Copy link

ov7a commented Aug 15, 2018

It seems that In newer versions (0.4.0+) JsonDiff consumes significantly more memory. Most probably this is related to #60 and getLCS method.

For version 0.4.4:

An exception or error caused a run to abort: Java heap space 
java.lang.OutOfMemoryError: Java heap space
	at com.flipkart.zjsonpatch.InternalUtils.longestCommonSubsequence(InternalUtils.java:31)
	at com.flipkart.zjsonpatch.JsonDiff.getLCS(JsonDiff.java:446)
	at com.flipkart.zjsonpatch.JsonDiff.compareArray(JsonDiff.java:337)
	at com.flipkart.zjsonpatch.JsonDiff.generateDiffs(JsonDiff.java:324)
	at com.flipkart.zjsonpatch.JsonDiff.compareObjects(JsonDiff.java:425)
	at com.flipkart.zjsonpatch.JsonDiff.generateDiffs(JsonDiff.java:327)
	at com.flipkart.zjsonpatch.JsonDiff.asJson(JsonDiff.java:47)
	at com.flipkart.zjsonpatch.JsonDiff.asJson(JsonDiff.java:38)

I'm comparing two quite large jsons, around 700 Kb each, with around 300Mb of memory.
It is caused by comparing two arrays with around 40 000 of strings each.

Version 0.3.10 works just fine.

@LouizFC
Copy link
Collaborator

LouizFC commented Aug 15, 2018

If I am correct 0.4.0 is the update that removed Apache Commons.

The way @vishwakarma implemented the current algorithm uses an array for collecting the diffs and then produces the result.

If I am not wrong, the LCS algorithm from Apache uses a kind of cursor called "Snake" that accumulates the result just by transversing the two lists without creating extra objects, just adding a counter.

I will take a look in the 0.3.10 and the current version using a profiler. Do you have some dummy data so I can replicate it?

@ov7a
Copy link
Author

ov7a commented Aug 15, 2018

I've made a MWE in gist.

@vishwakarma
Copy link
Member

vishwakarma commented Aug 16, 2018

@ov7a Agreed, it can because of the nature of LCS algorithm ( internally used in case of arrays ) due to its runtime & spacetime complexity. I will work on it with your example as benchmark to improve it.
@LouizFC , kindly let me know if you need any assistance here on this, meanwhile i will be also working on it and will get in touch with for the alternates.
@ov7a Thank you for bringing this issue.

@LouizFC
Copy link
Collaborator

LouizFC commented Aug 16, 2018

@vishwakarma I did not have the time for profiling it yet, but I checked the Apache implementation (more importantly, this) and it uses two arrays as "buffers" instead of a matrix (our implementation). Maybe thats the cause? I will investigate it further.

I had just a quick look, so this is more a guess than anything else. I will try to take a serious look at this on the weekend.

@LouizFC
Copy link
Collaborator

LouizFC commented Aug 20, 2018

The problem is more severe than I though.

But first, a brief disclaimer: I rarely use profilers. In this case I used VirtualVM. So if the results are a bit off I beg your pardon.

0.3.10

  • No VM args:
    0.3.10
    We have very linear heap allocation and little spike present on the creation of the final patch object.

0.4.4

  • No Vm args: OutOfMemoryError
  • -Xms2048m -Xmx2048m: OutOfMemoryError
  • -Xms4096m -Xmx4096m:
    0.4.4
    The program doesn't even run without VM args. The heap allocation is also linear, without any spikes, but the heap memory used is almost 20x more than the 0.3.10 version.
  • Replaced current LCS implementantion with 0.3.10 implementation:
    Apache 0.4.4
    The memory allocations revert back to 0.3.10.

@vishwakarma What are your thoughts. Should we try to reintroduce the Apache Commons dependency or solve it with our own code?

In my opinion, we should reintroduce Apache Collections for now. I was digging into Apache Collections and the way they produce their EditScript is really fascinating, we could learn a lot from there and apply the same techniques in the rest of our library in the future.

Edit: Fixed Markdown for images.

Edit2: I tested some more and the 10 MB difference from 0.3.10 and 0.4.4 (with Apache) is due to the better string concatenation and the precompiled Regexes that were introduced

@vishwakarma
Copy link
Member

vishwakarma commented Aug 20, 2018 via email

@vishwakarma
Copy link
Member

Fixed by PR #78

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants