
Added module to enable CPU affinity #1641

Merged: 14 commits, merged Nov 13, 2018
Conversation


@merlimat commented Sep 4, 2018

Motivation

This is part of a set of changes aimed at reducing latency in BK at the expense of other aspects (e.g., max throughput). While not intended to be used as default settings, these options are good to have whenever latency becomes critical.

Pinning a thread to a particular CPU ensures that no other process will execute on that CPU, eliminating the scheduler-induced context switches that cause latency jitter.

A given thread that wants to be pinned to a CPU just needs to call:

```java
CpuAffinity.acquireCore();
```

It's called acquireCore() because it will also disable hyper-threading on the pinned CPU.
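For illustration, a minimal usage sketch (the thread name and loop body here are hypothetical, not part of this PR):

```java
// Hypothetical usage: pin a latency-critical consumer thread to its own core.
Thread ioThread = new Thread(() -> {
    // Must be called from the thread to be pinned: acquires a free isolated
    // core for the current thread and disables its hyper-threading sibling.
    CpuAffinity.acquireCore();

    while (!Thread.currentThread().isInterrupted()) {
        // ... latency-critical work, e.g. busy-polling a request queue ...
    }
}, "pinned-io-thread");
ioThread.start();
```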

Subsequent PRs will use this module to have the option to pin critical threads to available CPUs.

Changes

  • Added a JNI module that calls sched_setaffinity() to pin a thread to a particular CPU
  • Automatically discover the available isolated CPUs
  • Acquire file-based locks so that multiple processes on the same machine can acquire CPUs independently (a sketch of these last two steps follows this list)
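The last two bullets can be sketched roughly as follows. This is an illustrative reimplementation, not the PR's actual code: the sysfs path is the standard kernel location for the isolcpus= list, but the lock-file directory and all names here are assumptions.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class IsolatedCpus {

    // Standard sysfs file listing CPUs isolated with the isolcpus= boot
    // parameter, e.g. "2-5" or "1,3,5". Whether the PR reads exactly this
    // file is an assumption.
    private static final Path ISOLATED = Paths.get("/sys/devices/system/cpu/isolated");

    // Hypothetical lock directory; one lock file per CPU.
    private static final String LOCK_DIR = "/tmp";

    /** Parse the kernel's isolated-CPU list into individual CPU ids. */
    static List<Integer> discover() throws IOException {
        String spec = new String(Files.readAllBytes(ISOLATED)).trim();
        List<Integer> cpus = new ArrayList<>();
        if (spec.isEmpty()) {
            return cpus;
        }
        for (String range : spec.split(",")) {
            String[] bounds = range.split("-");
            int start = Integer.parseInt(bounds[0]);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1]) : start;
            for (int cpu = start; cpu <= end; cpu++) {
                cpus.add(cpu);
            }
        }
        return cpus;
    }

    /**
     * Try to claim a CPU via an OS-level file lock, so that independent
     * processes on the same machine never pin themselves to the same core.
     * Returns null if another process already owns this CPU.
     */
    static FileLock tryAcquire(int cpu) throws IOException {
        FileChannel channel = FileChannel.open(
                Paths.get(LOCK_DIR, "cpu-" + cpu + ".lock"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = channel.tryLock();
        if (lock == null) {
            channel.close(); // held by another process
        }
        return lock;
    }

    // The actual pinning goes through JNI; a plausible binding would be a
    // native method backed by sched_setaffinity(2) for the calling thread:
    // private static native int pinCurrentThread(int cpu);
}
```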

@merlimat merlimat added this to the 4.8.0 milestone Sep 4, 2018
@merlimat merlimat self-assigned this Sep 4, 2018
@eolivelli eolivelli modified the milestones: 4.8.0, 4.9.0 Sep 6, 2018
@merlimat

Ping. Please take a look again.

@ivankelly left a comment

The patch itself looks good. A few nits.

However, I think we need an overarching document for these "performance" improvements, with benchmark results, detailed instructions to show how to reproduce the scenario in which performance changes, and the overall goals and approach for the perf improvements.

The current patches feel very piecemeal and hand-wavy, and this becomes a problem later when someone wants to change something and the change is objected to on performance grounds without any solid facts on why it can't change.

Inline comment on cpu-affinity/pom.xml:
```xml
</profile>

<profile>
<id>Linux</id>
```
Contributor:

Why is the profile capitalized, while the other isn't?

Contributor Author:

This pom was copied from the existing circe-checksum one. The profile name is not that important, since the profile is activated automatically based on the current OS.

@merlimat commented Nov 8, 2018

@ivankelly @eolivelli @sijie Please take another look.

> However, I think we need an overarching document for these "performance" improvements, with benchmark results, detailed instructions to show how to reproduce the scenario in which performance changes, and the overall goals and approach for the perf improvements.

This is just the underlying implementation. I will work on the docs and instructions when the whole busy-wait change set is ready.

@ivankelly

> > However, I think we need an overarching document for these "performance" improvements, with benchmark results, detailed instructions to show how to reproduce the scenario in which performance changes, and the overall goals and approach for the perf improvements.
>
> This is just the underlying implementation. I will work on the docs and instructions when the whole busy-wait change set is ready.

Do you have some preliminary numbers then to show why this change is valuable?

@merlimat commented Nov 9, 2018

> Do you have some preliminary numbers then to show why this change is valuable?

@ivankelly

These are the numbers from my local tests. It's not using the BK client lib directly, but rather going through ManagedLedger (from Pulsar), which adds one more thread handoff in order to serialize callbacks to a single thread hashed on the topic name.

Conditions:

  • Client and server on the same machine, connected over TCP
  • Tested on a bare-metal node
  • Client sends 1K rps (1KB payloads) and waits for the responses
  • Multiple requests outstanding, responses in order
  • Latency measured in milliseconds
| Latency (ms)         | 50pct | 95pct | 99pct | 99.9pct | 99.99pct | 99.999pct | max   |
|----------------------|-------|-------|-------|---------|----------|-----------|-------|
| Regular queue        | 0.078 | 0.092 | 0.110 | 0.152   | 0.232    | 0.430     | 0.538 |
| Spin-Wait Journal    | 0.065 | 0.073 | 0.087 | 0.132   | 0.428    | 0.483     | 0.819 |
| Spin-Wait Workers    | 0.049 | 0.059 | 0.066 | 0.095   | 0.449    | 0.465     | 0.466 |
| Spin-Wait IO Threads | 0.036 | 0.047 | 0.052 | 0.075   | 0.436    | 0.445     | 0.452 |
| Disable HyperThread  | 0.029 | 0.038 | 0.041 | 0.054   | 0.204    | 0.388     | 0.396 |
| CPU-Affinity         | 0.029 | 0.037 | 0.040 | 0.050   | 0.066    | 0.090     | 0.104 |

I think the remaining bulk of the median latency is related to SO_BUSY_POLL (netty/netty#8268) being ineffective on the loopback interface.

Charts: (latency percentile charts; image not reproduced here)

@merlimat commented Nov 9, 2018

run pr validation

@ivankelly

@merlimat thanks. Do you have any numbers for just this change? Like without the spin-wait stuff and HT disablement?

@merlimat

@ivankelly All the changes in the table above are incremental, so the last two rows show the effect of CPU isolation:

| Latency (ms)        | 50pct | 95pct | 99pct | 99.9pct | 99.99pct | 99.999pct | max   |
|---------------------|-------|-------|-------|---------|----------|-----------|-------|
| Disable HyperThread | 0.029 | 0.038 | 0.041 | 0.054   | 0.204    | 0.388     | 0.396 |
| CPU-Affinity        | 0.029 | 0.037 | 0.040 | 0.050   | 0.066    | 0.090     | 0.104 |

I don't think it makes sense to test CPU isolation without the other changes, because the baseline noise would mask its effects.

To summarize: CPU affinity is used to shield a particular thread from interference by other tasks (OS threads and other processes), in order to avoid context switches and reduce long-tail latency.

Without busy-polling, the bulk of the context switches will still be there (e.g., sleeping while waiting for items in a queue), so CPU isolation on its own won't provide much benefit.
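To make the interaction concrete, here is a generic spin-wait consumer sketch (not code from this change set): a blocking queue take() parks the thread and forces a context switch on wake-up, while busy-polling keeps the pinned core hot at the cost of burning it.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SpinConsumer implements Runnable {
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private volatile boolean running = true;

    @Override
    public void run() {
        // With CPU affinity, this loop owns an isolated core: spinning on an
        // empty queue disturbs nobody and avoids the sleep/wake context
        // switches that a blocking queue would incur.
        while (running) {
            Runnable task = tasks.poll();
            if (task != null) {
                task.run();
            } else {
                Thread.onSpinWait(); // JDK 9+ CPU-friendly spin hint
            }
        }
    }

    public void submit(Runnable task) {
        tasks.offer(task);
    }

    public void stop() {
        running = false;
    }
}
```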
