
High contention on InternalEngine.innerIndex() #18053

Closed
polyfractal opened this issue Apr 28, 2016 · 6 comments · Fixed by #18060
Labels
:Core/Infra/Core Core issues without another label

Comments

@polyfractal
Contributor

I was running some informal benchmarking on my home cluster for fun and found my max throughput. Interestingly, the cluster didn't appear saturated on any particular resource and I was unable to coax the indexing rate higher no matter what I did.

I reran the test while recording telemetry via Flight Recorder (using -XX:+DebugNonSafepoints as recommended by @danielmitterdorfer and @rmuir). I didn't notice anything too outrageous, but there was very high contention on InternalEngine.innerIndex():

[Flight Recorder screenshot: lock contention with the default lock count]

On a 6min test, it racked up 3242 contention events and ~24min of CPU time. The innerIndex() method synchronizes on a pool of locks, which is sized at numProcessors * 10.
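For context, that pooled locking is essentially lock striping keyed by the hash of the document uid. A minimal sketch of the pattern (illustrative names and structure only, not the actual InternalEngine code):

```java
// Illustrative hash-striped lock pool; simplified, not copied from InternalEngine.
class StripedUidLocks {
    private final Object[] locks;

    StripedUidLocks(int numProcessors) {
        // The sizing described above: number of processors times 10.
        locks = new Object[numProcessors * 10];
        for (int i = 0; i < locks.length; i++) {
            locks[i] = new Object();
        }
    }

    // Two different uids that hash to the same slot share a lock,
    // which is the source of the contention measured here.
    Object lockFor(Object uid) {
        return locks[Math.floorMod(uid.hashCode(), locks.length)];
    }
}

// Indexing code would then do roughly: synchronized (stripedLocks.lockFor(uid)) { ... }
```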

My machine has 32 cores (16 physical + hyperthreading), so the pool has 320 locks. @nik9000 and @jpountz suggested it may be an issue with collisions in that pool. I wasn't sure how to verify the math, but I found an article that gives a formula for the probability of generating k unique draws out of N possible values.

Our case is a bit more complex -- the pool count (N) scales with the core count (k) -- so I put together a spreadsheet to calculate everything¹. Assuming I didn't do anything silly with the math, the collision rate is grim. For my machine, 32 simultaneous requests have a ~79% chance of at least one collision in a pool of 320 locks.

[Chart: probability of at least one collision (y-axis) vs. number of simultaneous requests on an equal number of cores (x-axis)]
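For reference, a minimal sketch of the birthday-problem calculation behind these numbers (not code from the spreadsheet, just an illustration of the same formula):

```java
// Birthday-problem estimate of lock collisions.
class LockCollisionMath {
    // Probability that at least two of k concurrent requests pick the same
    // lock out of a pool of n locks.
    static double collisionProbability(int k, int n) {
        double noCollision = 1.0;
        for (int i = 1; i < k; i++) {
            noCollision *= 1.0 - (double) i / n;
        }
        return 1.0 - noCollision;
    }

    public static void main(String[] args) {
        // ~0.80 for 32 requests against 320 locks (the ~79% figure above).
        System.out.println(collisionProbability(32, 320));
    }
}
```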

Bumping the pool count

As a quick test, I bumped the lock multiplier from * 10 to * 1000, meaning my machines each had 32,000 available locks, which gives a theoretical ~1.5% collision rate. Rerunning the test consistently gave a 20k docs/s speedup and dropped contention down to 297 events and ~3min of CPU time.
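Using the same sketch as above, collisionProbability(32, 32000) comes out to roughly 0.015, which lines up with the ~1.5% figure.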

[Flight Recorder screenshot: lock contention with the increased lock count]

Next?

All of this touches delicate code that I don't understand, and my informal test may be biased by something I'm not accounting for. So I'm not really sure whether there is a next step, or what it would be. But I thought these charts were interesting enough that more knowledgeable folks may want to have a look.

Footnotes

  1. Spreadsheet calculating collision probability
  2. To verify my math, I pinned the pool count to 320 and calculated the same metric for 1–64 requests. The curve is a sigmoid like the one in the original article, so I think my math is correct.
@polyfractal added the :Core/Infra/Core label on Apr 28, 2016
@jasontedor
Member

> so I think my math is correct

The math is correct. It's just the birthday problem, which is exactly the formula that I see in your spreadsheet.

@jasontedor
Member

An observation here is that the number of locks should scale like the square of the number of cores for a targeted probability of collision, but we are only scaling linearly. That is, if we want to target a probability p for 2 * c threads running on c cores (indexing and bulk thread pools are limited to the number of cores), we would want approximately -2 * c^2 / log(1 - p) locks in the lock pool (natural log).

I'm not saying that that is the approach that we should take, just saying that this is why this breaks down so hard as core counts increase.
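As a rough sketch of that scaling (assuming the log above is the natural logarithm, and picking an arbitrary 5% target for illustration, not a number from this discussion):

```java
// Approximate pool sizing from the birthday-problem approximation
// P(collision) ~= 1 - exp(-(2c)^2 / (2N)), i.e. N ~= -2 * c^2 / ln(1 - p).
class LockPoolSizing {
    static long poolSizeFor(int cores, double targetCollisionProbability) {
        // Quadratic in the core count, unlike the current linear sizing.
        return Math.round(-2.0 * cores * cores / Math.log(1.0 - targetCollisionProbability));
    }

    public static void main(String[] args) {
        // ~39,900 locks for 32 cores at p = 0.05, versus 32 * 10 = 320
        // with the current numProcessors * 10 sizing.
        System.out.println(poolSizeFor(32, 0.05));
    }
}
```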

@mikemccand
Contributor

mikemccand commented Apr 28, 2016

An increase of 20 K docs/sec sounds substantial! What was the indexing rate before you increased the lock count?

It's frustrating that ES even needs these locks ... it's because if ES sends the same uid to Lucene concurrently, i.e. with different documents, there's no way for ES to know which document "won".

We have talked about having Lucene return a sequence number so that callers could figure it out ... and @s1monw may have a patch somewhere that started on this. I wonder if that would be enough for ES to remove the locks? E.g. we could use this sequence number to correctly update the version map ...

@polyfractal
Contributor Author

Total throughput went from ~270,000 to ~290,000 docs/s.

Note that this was on a 4-node cluster; I'm unsure whether a single node would see similar, smaller, or larger gains. I did check, though, and the contention issue exists on a single node too (which makes sense, given what @jasontedor said about linear lock pool scaling).

There was some variance between runs, but the difference seemed to hold up. I have half a mind to rerun the tests a few dozen times and do a t-test just to make sure :)

@otisg

otisg commented Apr 29, 2016

Nice find! For what it's worth, I was running a profiler on ES today and saw this same "innerIndex" stuff as one of the top hotspots (ES 5.0 alpha2, built from source). This was on a small 2-datanode ES cluster indexing ~20K docs/sec.

@jpountz
Contributor

jpountz commented Apr 29, 2016

Maybe a short-term fix for this would be to use something like KeyedLock. This would still prevent two index/delete operations on the same id from occurring concurrently, but two different ids would never share the same lock.
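For context, the general idea behind a keyed lock is one lock per in-flight key rather than a fixed hash-indexed pool, so distinct ids never contend. A minimal sketch of the concept (not the actual KeyedLock implementation):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative per-key locking: each distinct key gets its own lock.
// A production version would also remove entries once no thread holds them
// (e.g. via reference counting), which this sketch omits for brevity.
class SimpleKeyedLock<K> {
    private final ConcurrentHashMap<K, ReentrantLock> locks = new ConcurrentHashMap<>();

    void runLocked(K key, Runnable action) {
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            action.run();
        } finally {
            lock.unlock();
        }
    }
}
```

The trade-off versus the fixed pool is a per-operation map lookup and lock allocation, but false sharing of a lock between unrelated ids goes away entirely.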
