Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak #97

Closed
jwr opened this issue Oct 7, 2015 · 23 comments
Closed

Memory leak #97

jwr opened this issue Oct 7, 2015 · 23 comments
Labels

Comments

@jwr
Copy link
Contributor

jwr commented Oct 7, 2015

I started working with larger data sets and I'm running out of memory. From a quick look with the yourkit profiler it seems that the heap gets filled up with clojure.core.async$mult$reify__6387 objects, stored in a clojure.lang.PersistentHashMap$INode[32].

A way to trigger this is simply to try to perform lots of operations:

(doseq [i (range 5000000)]
    (-> (r/db db) (r/table "docs") (r/insert [{:id i}]) (r/run *conn*)))

If I let the above snippet run for several minutes, memory consumption will start rising after a while, eventually exhausting the 4GB heap.

I've attached a snapshot of the memory allocation by class, hopefully this will help.
2015-10-07 at 22 47

@danielcompton
Copy link
Collaborator

Does it GC at all? Which version are you running?

@jwr
Copy link
Contributor Author

jwr commented Oct 8, 2015

Yes. And in the beginning things look just fine, it is only after a while that the heap starts growing. I am running 0.10.1.

I can offer a wild guess that it isn't a GC problem at all, but rather one of timing: there are leftovers of core.async structures because there is no time to process them, and things start accumulating. I suspect that if I inserted a pause every one million transactions, things might clear up (I'll actually try that).

This is a critical bug, it prevents me from using clj-rethinkdb at all.

@jwr
Copy link
Contributor Author

jwr commented Oct 8, 2015

Turns out my wild guess was wrong: just running that loop in 100k increments, with pauses inbetween runs, is enough to eventually exhaust the heap. Some core.async structures related to pub/sub are being retained forever.

Apparently people didn't notice it because nobody makes a lot of transactions. But every clj-rethinkdb app out there right now leaks memory.

@danielcompton
Copy link
Collaborator

It'd be great to know if this was a core.async issue or something that we've done wrong. My (clearly naive) assumption was that the sub channels would be GC'd when there were no references to them, but that doesn't seem to be the case.

I don't have a lot of time to look at this this week, are you able to dig into this a little bit? @danielytics may have some thoughts as he worked on this?

@jwr
Copy link
Contributor Author

jwr commented Oct 8, 2015

I'm sorry, there is no way I'll be able to spend time on this in the near future :-( I just have to use something else for now, I can't put the project on hold.

@danielytics
Copy link
Contributor

@danielcompton do you know if send-continue-query can return a response type of 1? If so, then token won't get removed from the waiting list of tokens.

I don't know if this can happen with insert since insert shouldn't return a streaming response, so I don't think this can be the cause of the memory leak, but maybe worth a look anyway...

The per-query subs are being unsubbed, so that shouldn't be a problem either and as far as I'm aware the channels get GC'd when they are no longer references (so when send-query* returns) - core.async/close! doesn't affect GC as far as I'm aware.

Id like to see what happens if the queries are run at a slower rate (but for a longer period of time) to see if its a timing issue (I don't think it is, because the queries are run synchronously). I'd also like to run the core.async logic without rethinkdb to see if its a core.async bug or a clj-rethinkdb bug. I likely won't have time until Sunday at the earliest to try this myself.

@danielcompton
Copy link
Collaborator

@jwr can you give me more details on your testing setup, including JVM opts, where RethinkDB is running, a graph of memory usage over time, how long the tests were running for, any logs created, what exact error you got, e.t.c.? I ran your test myself with a 512 MB (and 4GB) heap and while I could see the memory use going up, the JVM was able to successfully GC it away.

screenshot of visualvm 9-10-15 4 21 07 pm

Over the longer term, GC's started to get more frequent, which needs to be looked into further. Still, I wasn't seeing the heap exhaustion that you described.

screenshot of visualvm 9-10-15 4 31 19 pm

@danielcompton
Copy link
Collaborator

This looks very suspect: ASYNC-90

@danielcompton
Copy link
Collaborator

Running this over a much longer period of time, I can see the inevitable:

screenshot of tweetbot 9-10-15 5 37 01 pm

@jwr
Copy link
Contributor Author

jwr commented Oct 9, 2015

@danielcompton I don't think the exact JVM opts are important — as you later noticed, you'll exhaust the heap eventually anyway. I normally run with default JVM settings on Mac OS X, which is equivalent to -Xmx4096M, I think. RethinkDB runs locally.

@danielytics
Copy link
Contributor

Ouch... I believe ASYNC-90 is the cause of this. core.async creates a new mult for every topic and as ASYNC-90 says, when there are no taps left, the mult is still kept around. That means that once you sub to a topic, a mult is created that is never ever released.

We use the token as the topic, meaning that there is a new mult created for every single query and that never gets freed.

ASYNC-90 is a rather nasty bug and it seems to have been known for about a year. I think core.async really needs more love and attention :(

I'll reimplement the logic without pub/sub, but I won't have time until at earliest Sunday, if not later.

@danielytics
Copy link
Contributor

I'm looking into a fix for this now.

It looks like I should be able to use a mult directly and tap it for each query (and untap on response). This way there is one mult per connection, instead of one per query as there is with pub/sub.

@jwr
Copy link
Contributor Author

jwr commented Oct 10, 2015

@danielytics it might be worth commenting on the ASYNC-90 issue in JIRA so that everyone knows that people do run into this problem. I voted the issue up, but I don't want to comment, because I don't fully understand the problem.

danielcompton added a commit that referenced this issue Oct 15, 2015
Relates to #97
danielcompton added a commit that referenced this issue Oct 15, 2015
There is a memory leak (http://dev.clojure.org/jira/browse/ASYNC-90)
using pub/sub. However as all of our pubs are one off, we can use
unsub-all instead of unsub which will reclaim the memory for that pub.

Relates to #97
@danielcompton
Copy link
Collaborator

This is a beautiful sight. Before and after fixing the issue (around 8:30):

screenshot of yourkit java profiler 2015 build 15074 16-10-15 9 14 37 am

screenshot of yourkit java profiler 2015 build 15074 16-10-15 9 20 12 am

a274a04

@danielcompton
Copy link
Collaborator

@jwr I'm pretty sure this fixes the memory leak, it did in my testing on a 64 MB VM. Can you install master and test it against your use case please?

@bhurlow
Copy link

bhurlow commented Oct 15, 2015

bravo!

@jwr
Copy link
Contributor Author

jwr commented Oct 17, 2015

I'm happy to report that the leak is now gone! Thanks everyone!

(Incidentally — how do you "manually" use an unreleased dependency in leiningen? I worked around this by placing my test code within a checked-out copy of clj-rethinkdb, but that wouldn't always work)

@apa512
Copy link
Owner

apa512 commented Oct 17, 2015

lein install should work if I'm understanding correctly what you're trying to do, i.e. clone a repo, make some changes and have those picked up by another project.

@danielcompton
Copy link
Collaborator

Yep, when you install, it will install a JAR to your local Maven repo. You should see the version come up when you install it. I'll publish a snapshot soon though.

@jwr
Copy link
Contributor Author

jwr commented Oct 19, 2015

Thanks for the tips!

As for the bug fix, I would humbly suggest that it merits an immediate release of a new version.

@danielytics
Copy link
Contributor

👍 agree with @jwr

Great fix, @danielcompton! Nice and simple. I tried to refactor the code to just use a mult directly and not use a pub at all, but, while the change was quite small, it caused strange errors and I didn't have time to figure out why. So I'm happy to go with your version.

@apa512
Copy link
Owner

apa512 commented Oct 19, 2015

Deployed as [com.apa512/rethinkdb "0.11.0"].

@bhurlow
Copy link

bhurlow commented Oct 19, 2015

🍦

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants