Memory leak #97

jwr · 2015-10-07T20:48:55Z

I started working with larger data sets and I'm running out of memory. From a quick look with the YourKit profiler, it seems that the heap fills up with clojure.core.async$mult$reify__6387 objects, stored in a clojure.lang.PersistentHashMap$INode[32].

One way to trigger this is simply to perform lots of operations:

;; assumes (require '[rethinkdb.query :as r]), with db and *conn* bound elsewhere
(doseq [i (range 5000000)]
  (-> (r/db db) (r/table "docs") (r/insert [{:id i}]) (r/run *conn*)))

If I let the above snippet run for several minutes, memory consumption will start rising after a while, eventually exhausting the 4GB heap.

I've attached a snapshot of the memory allocation by class; hopefully this will help.


danielcompton · 2015-10-07T21:05:06Z

Does it GC at all? Which version are you running?

jwr · 2015-10-08T06:20:26Z

Yes. And in the beginning things look just fine, it is only after a while that the heap starts growing. I am running 0.10.1.

I can offer a wild guess that it isn't a GC problem at all, but rather one of timing: there are leftovers of core.async structures because there is no time to process them, and things start accumulating. I suspect that if I inserted a pause every one million transactions, things might clear up (I'll actually try that).

This is a critical bug, it prevents me from using clj-rethinkdb at all.

jwr · 2015-10-08T07:36:40Z

Turns out my wild guess was wrong: just running that loop in 100k increments, with pauses in between runs, is enough to eventually exhaust the heap. Some core.async structures related to pub/sub are being retained forever.

Apparently people didn't notice it because nobody makes a lot of transactions. But every clj-rethinkdb app out there right now leaks memory.

danielcompton · 2015-10-08T09:39:47Z

It'd be great to know if this was a core.async issue or something that we've done wrong. My (clearly naive) assumption was that the sub channels would be GC'd when there were no references to them, but that doesn't seem to be the case.

I don't have much time to look at this this week; are you able to dig into it a little? @danielytics may have some thoughts, as he worked on this code.

jwr · 2015-10-08T09:40:52Z

I'm sorry, there is no way I'll be able to spend time on this in the near future :-( I just have to use something else for now, I can't put the project on hold.

danielytics · 2015-10-08T10:15:49Z

@danielcompton do you know if send-continue-query can return a response type of 1? If so, then token won't get removed from the waiting list of tokens.

I don't know if this can happen with insert since insert shouldn't return a streaming response, so I don't think this can be the cause of the memory leak, but maybe worth a look anyway...

The per-query subs are being unsubbed, so that shouldn't be a problem either, and as far as I'm aware the channels get GC'd when they are no longer referenced (so when send-query* returns); core.async/close! doesn't affect GC as far as I'm aware.

I'd like to see what happens if the queries are run at a slower rate (but for a longer period of time), to see if it's a timing issue (I don't think it is, because the queries are run synchronously). I'd also like to run the core.async logic without RethinkDB to see if it's a core.async bug or a clj-rethinkdb bug. I likely won't have time to try this myself until Sunday at the earliest.

danielcompton · 2015-10-09T03:42:50Z

@jwr can you give me more details on your testing setup, including JVM opts, where RethinkDB is running, a graph of memory usage over time, how long the tests ran for, any logs created, the exact error you got, etc.? I ran your test myself with a 512 MB (and 4 GB) heap, and while I could see the memory use going up, the JVM was able to successfully GC it away.

Over the longer term, GCs started to become more frequent, which needs to be looked into further. Still, I wasn't seeing the heap exhaustion you described.

danielcompton · 2015-10-09T04:11:08Z

This looks very suspect: ASYNC-90

danielcompton · 2015-10-09T04:37:49Z

Running this over a much longer period of time, I can see the inevitable:

jwr · 2015-10-09T07:46:18Z

@danielcompton I don't think the exact JVM opts are important — as you later noticed, you'll exhaust the heap eventually anyway. I normally run with default JVM settings on Mac OS X, which is equivalent to -Xmx4096M, I think. RethinkDB runs locally.

danielytics · 2015-10-09T10:09:07Z

Ouch... I believe ASYNC-90 is the cause of this. core.async creates a new mult for every topic, and, as ASYNC-90 says, the mult is still kept around when there are no taps left. That means that once you sub to a topic, a mult is created that is never released.

We use the token as the topic, meaning that there is a new mult created for every single query and that never gets freed.

ASYNC-90 is a rather nasty bug and it seems to have been known for about a year. I think core.async really needs more love and attention :(

I'll reimplement the logic without pub/sub, but I won't have time until Sunday at the earliest, if not later.

danielytics · 2015-10-10T15:40:43Z

I'm looking into a fix for this now.

It looks like I should be able to use a mult directly and tap it for each query (and untap on response). This way there is one mult per connection, instead of one per query as there is with pub/sub.
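A minimal sketch of the one-mult-per-connection idea, assuming core.async is on the classpath and that every response is a map carrying its query's :token (an assumption, not clj-rethinkdb's actual response format). `run-query` and `send!` are hypothetical names, and this is not the refactor that was actually attempted in the thread.

```clojure
(ns example.mult-per-connection
  ;; Assumes org.clojure/core.async is on the classpath.
  (:require [clojure.core.async :as async]))

(defn run-query
  "Hypothetical helper: taps the connection-wide mult `m`, issues the query
  via `send!`, then reads responses until one carries `token`. Untaps on the
  way out, so nothing per-query outlives the call."
  [m send! token]
  (let [ch (async/chan 16)]
    ;; Tap before sending: a mult drops items that arrive while it has no taps.
    (async/tap m ch)
    (try
      (send! token)
      (loop []
        (let [resp (async/<!! ch)]
          (cond
            (nil? resp) nil               ; source closed: give up
            (= token (:token resp)) resp  ; our response
            :else (recur))))              ; another query's response: skip it
      (finally
        (async/untap m ch)
        (async/close! ch)))))
```

Because every tap sees every response, each waiting query has to filter by token, and a mult only distributes the next item once all taps have accepted the current one, so a tap that stops reading stalls every other query on the connection. That is the trade-off for avoiding per-topic mults.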

jwr · 2015-10-10T20:05:11Z

@danielytics it might be worth commenting on the ASYNC-90 issue in JIRA so that everyone knows that people do run into this problem. I voted the issue up, but I don't want to comment, because I don't fully understand the problem.

Relates to #97

There is a memory leak (http://dev.clojure.org/jira/browse/ASYNC-90) when using pub/sub. However, as all of our pubs are one-off, we can use unsub-all instead of unsub, which will reclaim the memory for that pub. Relates to #97
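A minimal sketch of the difference between unsub and unsub-all, assuming core.async is on the classpath and a pub keyed on a hypothetical :token field in each response. `run-query` and `send!` are illustrative names; the thread does not show exactly which call the commit made (core.async's `unsub-all` accepts either just the pub or the pub plus a topic).

```clojure
(ns example.pub-unsub-all
  ;; Assumes org.clojure/core.async is on the classpath.
  (:require [clojure.core.async :as async]))

(defn run-query
  "Hypothetical helper: subscribes a fresh channel to `token` on pub `p`,
  issues the query via `send!`, waits for the matching response, then
  removes the token's topic from the pub."
  [p send! token]
  (let [ch (async/chan 1)]
    ;; Subscribe before sending: a pub drops items with no matching subscription.
    (async/sub p token ch)
    (send! token)
    (let [resp (async/<!! ch)]
      ;; (async/unsub p token ch) would only untap `ch`; the mult that `sub`
      ;; created for `token` would stay in the pub's topic map (ASYNC-90).
      ;; unsub-all with a topic removes that topic's mult from the pub.
      (async/unsub-all p token)
      resp)))
```

Removing the whole topic also removes any other subscriber to that token, which is safe only because each token has exactly one, one-off subscriber, the same condition the commit message relies on.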

danielcompton · 2015-10-15T20:56:18Z

This is a beautiful sight. Before and after fixing the issue (around 8:30):

a274a04

danielcompton · 2015-10-15T20:57:40Z

@jwr I'm pretty sure this fixes the memory leak, it did in my testing on a 64 MB VM. Can you install master and test it against your use case please?

bhurlow · 2015-10-15T21:03:04Z

bravo!

jwr · 2015-10-17T11:52:30Z

I'm happy to report that the leak is now gone! Thanks everyone!

(Incidentally, how do you "manually" use an unreleased dependency in Leiningen? I worked around this by placing my test code within a checked-out copy of clj-rethinkdb, but that wouldn't always work.)

apa512 · 2015-10-17T19:57:08Z

lein install should work if I'm understanding correctly what you're trying to do, i.e. clone a repo, make some changes and have those picked up by another project.

danielcompton · 2015-10-17T20:29:09Z

Yep, when you install, it will install a JAR to your local Maven repo. You should see the version come up when you install it. I'll publish a snapshot soon though.
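For reference, the workflow described above might look like this in the consuming project. The project name and both version strings are placeholders (assumptions); the rethinkdb version must match whatever the clj-rethinkdb checkout's own project.clj declares, since that is the version `lein install` puts into the local Maven repository (~/.m2).

```clojure
;; project.clj of the consuming project (hypothetical name and versions).
;; Step 1: in the clj-rethinkdb checkout, run `lein install`.
;; Step 2: depend on the version that was just installed.
(defproject my-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.7.0"]
                 ;; placeholder: must match the version in the checkout's project.clj
                 [com.apa512/rethinkdb "0.11.0-SNAPSHOT"]])
```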

jwr · 2015-10-19T10:22:57Z

Thanks for the tips!

As for the bug fix, I would humbly suggest that it merits an immediate release of a new version.

danielytics · 2015-10-19T10:56:12Z

👍 agree with @jwr

Great fix, @danielcompton! Nice and simple. I tried to refactor the code to just use a mult directly and not use a pub at all, but, while the change was quite small, it caused strange errors and I didn't have time to figure out why. So I'm happy to go with your version.

apa512 · 2015-10-19T12:38:38Z

Deployed as [com.apa512/rethinkdb "0.11.0"].

bhurlow · 2015-10-19T15:25:02Z

🍦

danielcompton added the bug label Oct 7, 2015

danielytics mentioned this issue Oct 8, 2015

Not derefing conn atom in send-query #98

Closed

danielcompton added a commit that referenced this issue Oct 15, 2015

Use unsub-all

1398a80

Relates to #97

danielcompton mentioned this issue Nov 15, 2015

Speed up rethinkdb driver issue #107 #108

Closed

danielcompton closed this as completed Nov 22, 2015
