add pool customization options #602
Conversation
Thanks for submitting this. I think I would prefer to fix Dirigiste, if possible. PRs can be submitted at https://github.com/clj-commons/dirigiste

At first glance, why isn't this being garbage-collected? Are you collecting stats yourself somehow? 35k seems like an awful lot of stats objects.
@KingMob for some reason stats are not disposed. My fix was to implement Pool without stats at all. Unfortunately I do not have much time, but I'll try to check what the reason could be.
@Yneth Can you share a minimal test case reproducing it?
Are you sure the
@Yneth Even if you don't have a test case, did you save the profiling session? It would be nice to take a look at it.
@KingMob no, but I am sure that it causes problems. My fix was to create a custom pool implementation without metrics (the ones using Reservoir). It is a bit hard to create a minimal test, as I need to do some investigation beforehand and I am short on time. My case looks the following way:

Also, I see an uncaught error warning from manifold even though I am using m/catch everywhere; just saying, in case it could also be a root cause.
Well, I would still like to fix Dirigiste, but I'm becoming more open to the idea of this PR, and allowing people to customize their pool.
@KingMob I guess giving the possibility to specify the whole pool would be nice, but it would require making
I'm generally OK with making it public, but leaving it out of the generated documentation via

Also, if you can generate a memory profile, we'd still really appreciate it. It's harder to debug this without one.
@Yneth Any updates? After doing some work with Dirigiste recently, I want your PR. 😄
will send an update later today
I'm not sold on the idea of this PR. To me, we'll have a hard time documenting `create-pool-fn` and `create-pool-ctrl-fn`, as it's not that obvious. I would rather focus on fixing Dirigiste, to ensure we are never leaking memory and that it's a stable building block.

A test case, thread dump, heap dump or anything that could help us reproduce this issue would be awesome.
Any update from your side? What we actually need from you to determine if the issue is related to

I wasn't able to reproduce on my side, but the expected pattern is this: I have three
From the screenshot you provided, we can only determine that a lot of longs have been allocated on the heap, but not
Ok.. ok...

```clojure
(require '[aleph.http :as http])
(require '[manifold.deferred :as d])
(require '[clj-commons.byte-streams :as bs])

(comment
  (http/start-server (constantly {:body "hello!"}) {:port 8081})

  (defn get [nb]
    (apply d/zip
           (mapv (fn [_]
                   (d/chain (http/get (format "http://%s.localhost:8081/" (java.util.UUID/randomUUID)))
                            :body
                            bs/to-string))
                 (range nb))))

  (d/loop [d (get 10)]
    (d/chain' d
              (fn [_]
                (prn "Done!")
                (Thread/sleep 300)
                (d/recur (get 10000))))))
```

Each different domain creates a new Reservoir. They are supposed to be cleaned up by the following code:

```java
// clear out any unused queues
_lock.lock();
for (Map.Entry<K,Stats> entry : _stats.entrySet()) {
    K key = entry.getKey();
    if (entry.getValue().getUtilization(1) == 0
        && _queues.get(key).objects.get() == 0) {
        _queues.remove(key).shutdown();

        // clean up stats so they don't remain in memory forever
        _queueLatencies.remove(key);
        _taskLatencies.remove(key);
        _queueLengths.remove(key);
        _utilizations.remove(key);
        _taskArrivalRates.remove(key);
        _taskCompletionRates.remove(key);
        _taskRejectionRates.remove(key);
    }
}
_lock.unlock();
```

https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L331-L332

It does its job when the conditions are met.

https://github.com/clj-commons/aleph/blob/master/src/aleph/http.clj#L321-L375

I would be interested to know if you have the same pattern @Yneth.

Just a hunch: not a bug in Dirigiste, but most likely in the Aleph client.
@arnaudgeiser Your test case, which crawls a lot of domains, reminds me a lot of:

As for the objects not being cleaned up:
OK, I misread a couple things. Your code is recurring with another 10000 domains. In my tests, this works out to ~770k per domain. Still highish, but there's 5 major maps of domain->reservoir, and each reservoir has an array of 4096 longs/doubles, so we have 10010 x 5 x 4096 x 8 bytes = ~1.6 GB. 770k per domain is still high, though; I know Dirigiste is tunable, we could probably make it consume less.

I also took a closer look at the utilizations, which is where it gets interesting. With one conn per domain, it's continually computing the utilization level as 2 (if I'm computing Double/longBitsToDouble correctly). In the code, it seems that

It feels like there's a bug here. Regardless of whether the conn is released or disposed, you'd think an unused connection shouldn't count towards utilization. I'm not sure if it's an edge artifact of having exactly 1 item per domain, or if this has always been a problem, and that's just where we see it first, when testing zillions of domains.
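The back-of-the-envelope estimate above can be checked directly (a sketch using the constants from the comment; the class name is mine):

```java
public class ReservoirFootprint {
    public static void main(String[] args) {
        long domains = 10_010;       // domains hit by the recurring crawl loop
        long statsMaps = 5;          // maps of domain -> reservoir, per the estimate above
        long reservoirSize = 4_096;  // samples per reservoir
        long bytesPerSample = 8;     // one long/double per sample
        long totalBytes = domains * statsMaps * reservoirSize * bytesPerSample;
        System.out.println(totalBytes);                   // 1640038400
        System.out.printf("%.2f GB%n", totalBytes / 1e9); // 1.64 GB
    }
}
```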
Just forced it to make 3 conns per domain, and all I got was endless utilization values of 4. I need to set this aside for now. Even if it's a leak, it mostly affects people talking to many different domains, so I'm not sure how high priority it is.
Do not get me wrong, everything is fine with my test case. This pattern, which is not something we should expect from a production

But again, without a heap dump here, we are stuck making hypotheses.
Ah yes, interesting lead here.
I will spend some time writing unit tests for Dirigiste because some parts of the code really seem to be off. A few examples:

```java
public void sample(double n) {
    int cnt = _count.incrementAndGet();
    if (cnt <= RESERVOIR_SIZE) {
        _values.set(cnt-1, Double.doubleToLongBits(n));
    } else {
        int idx = ThreadLocalRandom.current().nextInt(cnt);
        if (idx < RESERVOIR_SIZE) {
            _values.set(idx, Double.doubleToLongBits(n));
        }
    }
}
```

```clojure
(def atomic-integer (java.util.concurrent.atomic.AtomicInteger. Integer/MAX_VALUE))
(.incrementAndGet atomic-integer) ;; -2147483648
(.incrementAndGet atomic-integer) ;; -2147483647

(def atomic-long-array (java.util.concurrent.atomic.AtomicIntegerArray. 4096))
(.set atomic-long-array -2147483648 0.5) ;; => ArrayIndexOutOfBoundsException
```

https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Stats.java#L61-L71

The last

https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Stats.java#L166-L168
Yeah, there's some weird stuff going on, with few tests, and a lot of that is why I'm not a fan of Dirigiste. Based on a sampling period of 10 ms, we should expect overflow and crash after ~8 months left untouched. However, the stats are meant to be replaced periodically. For both Executor and Pool, in
Actually, it's weirder than that. It picks a random number from 0 to cnt-1, and only if that's < 4096 does it store the sample at all. This has the effect of updating the reservoir less and less frequently as time goes on, until it's reset. The longer the control period, the less likely that later samples will have an effect. I'm not sure what the rationale is over using a ring buffer, and the commit messages are no help.

That being said, using the value in the last position when looking at the percentiles is fine. Inside the reservoir's

The problem is that the utilization reservoir contains the same value everywhere, so it doesn't matter what percentile we look at. (The behavior above, where later samples in the control period are less likely to be recorded, is also a problem, but if the stats are properly reset, it should still work once we have a fallow period.)

So either a) the utilization is never being calculated correctly in our scenario (all scenarios?), or b) it's calculated correctly, but the stats aren't being reset, and thus by the time utilization drops to zero, the odds of it being recorded are low, and the odds of all non-zero samples being displaced are vanishingly low. I think it's more likely to be a; if b were true, I should be able to find other values somewhere, but I can't find any.

It's really annoying that .toArray and .toMap aren't simple transformation fns. Sorry you lost time on that.
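The decay described above falls straight out of the acceptance probability: once cnt samples have arrived, a new sample is stored only if nextInt(cnt) < RESERVOIR_SIZE, i.e. with probability 4096/cnt. A tiny sketch of that probability (my own illustration, not Dirigiste code):

```java
public class SamplingDecay {
    static final int RESERVOIR_SIZE = 4096;

    // Probability that the cnt-th sample is written into the reservoir at all.
    static double acceptanceProbability(int cnt) {
        return cnt <= RESERVOIR_SIZE ? 1.0 : (double) RESERVOIR_SIZE / cnt;
    }

    public static void main(String[] args) {
        // With a 10 ms sample period, an hour of uptime is ~360,000 samples.
        System.out.println(acceptanceProbability(4_096));   // 1.0
        System.out.println(acceptanceProbability(40_960));  // 0.1
        System.out.println(acceptanceProbability(360_000)); // ~0.011
    }
}
```

So unless the stats object is replaced, a zero-utilization sample arriving late in a long control period is very unlikely to ever land in the reservoir.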
Thanks for the explanations, I missed so many things here.
Woow... at least that explains the "high" rate allocation I was seeing.
Right
Understood, I missed the fact that
Nevermind, got it.
Let's figure this out then!
Keep in mind, this is still low priority to my way of thinking. The Dirigiste pooling problem only applies to outgoing client connections, where talking to zillions of different servers is unlikely. The server uses Netty to pool incoming connections/channels, so it should be unaffected by Dirigiste's pool stats issue. The most likely scenario is a few outside servers are talked to, and a few MBs are wasted, but that's it. If you're burning to fix it, I won't stop you tho 😄
Yes, I agree. By adding tests, I expect to see patterns where it could go wrong.

Going back to this PR, I'm still not sold on it. I know you want to sunset Dirigiste, but I would prefer stabilizing it. If we have no news from @Yneth in the following days, I will take over that PR.
I agree that other memory leaks of Dirigiste should be fixed. I'm just afraid that the pool leak is a red herring for their other issues. And while I'd like to sunset it, that's more of a "next version" goal. Replacing it in current Aleph makes me think of these lines from the movie State and Main:
What kind of escape hatch are you thinking of, if you don't want to use this PR? |
I'm also afraid of this!
We'll use this PR. But I am resigned, not convinced!
@Yneth Are you still interested in this?
Gentlemen, excuse the late response. Regarding the PR, I do not like it. Almost everything I've been thinking of is not very user friendly. My use case involves querying up to 1000 hosts that change on a daily basis. I do not promise, but I will try to provide a heap dump later today.
The pool's factory method depends on local variables of the API method, as well as some private functions. My temporary solution was to copy-paste some parts and roll out my own method, creating my own Dirigiste pool without stats, and it works fine. The PR itself is not high priority for me.
@Yneth Don't restrict yourself on where or how to fix this. We're open to the right solution, and if you've followed the conversation so far, the most likely correct place is where Dirigiste computes pool utilization and/or cleans up old pool objects.

Also, based on your initial screenshot, which shows the leak is from the Stats reservoirs, I'm pretty sure Arnaud and I are looking at the same leak, so I don't think we need a dump anymore. (Of course, if that's not your leak, please let us know.)
In any case, the heap dump can help us diagnose whether we are making the correct assumptions.
@Yneth: Are you hitting the same ports on those hosts? https://github.com/clj-commons/aleph/blob/master/src/aleph/http/client.clj#L97
I spent hours adding (JUnit) tests to Dirigiste and I start to have a better understanding. So..

**Utilization**

Theoretically, this calculation is supposed to be at least partially correct:

Considering no objects are present on the pool at the beginning:

[1]: https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L269

**Reservoir size**

I reached the conclusion that using a fixed value of

```java
static final int RESERVOIR_SIZE = 4096;
```

We should use

By default
PR: https://github.com/arnaudgeiser/dirigiste/pull/1/files

**Use a Ring Buffer instead**

While it can be seen as an optimization, replacing the stats every second is not ideal.

**Cautious**

I will invalidate the above information as I progress.
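For the ring-buffer idea, a minimal sketch of what a fixed-size circular sample store could look like (entirely my own illustration, ignoring the atomicity and visibility concerns a real implementation would need; Dirigiste's actual Reservoir works differently):

```java
import java.util.concurrent.atomic.AtomicLong;

public class RingReservoir {
    private final double[] values;
    private final AtomicLong counter = new AtomicLong();

    public RingReservoir(int size) {
        values = new double[size];
    }

    // Every sample is recorded: old entries are overwritten in FIFO order,
    // so recent behavior is never starved out the way random replacement
    // starves it once the sample count grows large.
    public void sample(double v) {
        int idx = (int) (counter.getAndIncrement() % values.length);
        values[idx] = v;
    }

    // Most recently recorded sample, or 0.0 if nothing was sampled yet.
    public double latest() {
        long c = counter.get();
        return c == 0 ? 0.0 : values[(int) ((c - 1) % values.length)];
    }
}
```

Using a long counter also sidesteps the int overflow discussed earlier, and the modulo keeps every index in range.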
Long story short: it's definitely

The fix might be as simple as this:

```java
double getUtilization(int available, int queueLength, int objects) {
    if (objects == 0 && queueLength == 0) {
        return 0;
    }
    return 1.0 - ((available - queueLength) / Math.max(1.0, objects));
}
```

However, we are now subject to race conditions (youpi...). So close and so far at the same time.

For the good news, I've been able to cover around

[1]: https://github.com/clj-commons/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L409
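A minimal before/after sketch of that guard (the helper names mirror the snippet above but the class is my illustration, not the actual Dirigiste patch):

```java
public class UtilizationFix {
    // Unguarded formula: an empty pool (objects == 0, queueLength == 0)
    // still reports utilization 1.0, so it never looks idle and the
    // cleanup condition getUtilization(1) == 0 is never met.
    static double utilizationOld(int available, int queueLength, int objects) {
        return 1.0 - ((available - queueLength) / Math.max(1.0, objects));
    }

    // Guarded version from the comment above: an empty pool reports 0.
    static double utilizationFixed(int available, int queueLength, int objects) {
        if (objects == 0 && queueLength == 0) {
            return 0;
        }
        return 1.0 - ((available - queueLength) / Math.max(1.0, objects));
    }

    public static void main(String[] args) {
        System.out.println(utilizationOld(0, 0, 0));   // 1.0 -- empty pool looks busy
        System.out.println(utilizationFixed(0, 0, 0)); // 0.0
        System.out.println(utilizationFixed(0, 0, 1)); // 1.0 -- one object, none available
    }
}
```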
I finally reached the bottom of this. Here is the related PR on Dirigiste: clj-commons/dirigiste#32
The PR is ready for review. Even if we fix the Dirigiste bugs, it would make sense to be able to implement your own Pool and Controller. Sold!
Merged. |
It might not be the best way to add such customisation, as it clashes with other fields (total-connections, target-utilization, max-queue-size, stats-callback), but it should work.
The rationale of this change is to add the possibility to pick your own pool implementation. In my case the Dirigiste pool does not work due to a memory leak in