
reverseproxy: Improve hostByHashing distribution #5229

Merged
merged 1 commit into caddyserver:master on Dec 5, 2022

Conversation

@skeetmtp (Contributor) commented Dec 2, 2022

Small change, but a long story. It's not easy to explain; sorry if it's not clear, and feel free to ask questions.

I recently upgraded Caddy from 2.4.x to 2.6.2; unfortunately, the load balancing became quite unbalanced, as you can see in this graph:

[image: graph of per-upstream request distribution after the upgrade]

I wrote some test code to expose this behaviour:

	// This runs as a test inside Caddy's reverseproxy package
	// (imports: fmt, math/rand, net/http, testing); the function
	// name here is illustrative.
	func TestHeaderHashDistribution(t *testing.T) {
		policy := new(HeaderHashSelection)
		policy.Field = "CF-Connecting-IP"

		pool := UpstreamPool{
			{Host: new(Host), Dial: "foobar:4001"},
			{Host: new(Host), Dial: "foobar:4002"},
			{Host: new(Host), Dial: "foobar:4003"},
			{Host: new(Host), Dial: "foobar:4004"},
			{Host: new(Host), Dial: "foobar:4005"},
			{Host: new(Host), Dial: "foobar:4006"},
			{Host: new(Host), Dial: "foobar:4007"},
			{Host: new(Host), Dial: "foobar:4008"},
		}

		// Tally how many of 100 random client IPs land on each upstream.
		distribution := make([]int, len(pool))

		req, _ := http.NewRequest("GET", "/", nil)

		rand.Seed(0) // fixed seed so the run is reproducible
		for i := 0; i < 100; i++ {
			req.Header.Set("CF-Connecting-IP", fmt.Sprintf("%d.%d.%d.%d",
				rand.Intn(255), rand.Intn(255), rand.Intn(255), rand.Intn(255)))
			h := policy.Select(pool, req, nil)
			for idx, up := range pool {
				if up == h {
					distribution[idx]++
				}
			}
		}

		for idx, dis := range distribution {
			t.Logf("distribution %s=%d", pool[idx].String(), dis)
		}
	}

Here is the output:

distribution foobar:4001=14
distribution foobar:4002=7
distribution foobar:4003=9
distribution foobar:4004=8
distribution foobar:4005=7
distribution foobar:4006=3
distribution foobar:4007=6
distribution foobar:4008=46

There is clearly something wrong: 46% of the requests are going to the same upstream.

I wrote another test to pinpoint the issue:

	// Standalone program pinpointing the diffusion problem.
	package main

	import (
		"fmt"
		"hash/fnv"
	)

	func main() {
		// Test 1: concatenate IP + upstream (the order hostByHashing used).
		for i := 1; i < 9; i++ {
			s := fmt.Sprintf("192.168.200.44foobar:400%c", '0'+i)
			h := fnv.New32a()
			_, _ = h.Write([]byte(s))
			fmt.Printf("%s %08x\n", s, h.Sum32())
		}

		// Test 2: concatenate upstream + IP (the order after this commit).
		for i := 1; i < 9; i++ {
			s := fmt.Sprintf("foobar:400%c192.168.200.44", '0'+i)
			h := fnv.New32a()
			_, _ = h.Write([]byte(s))
			fmt.Printf("%s %08x\n", s, h.Sum32())
		}
	}

Output:

192.168.200.44foobar:4001 bc724e3c
192.168.200.44foobar:4002 bf7252f5
192.168.200.44foobar:4003 be725162
192.168.200.44foobar:4004 b9724983
192.168.200.44foobar:4005 b87247f0
192.168.200.44foobar:4006 bb724ca9
192.168.200.44foobar:4007 ba724b16
192.168.200.44foobar:4008 c5725c67

foobar:4001192.168.200.44 f26b0b92
foobar:4002192.168.200.44 6ae61c9f
foobar:4003192.168.200.44 bf67d194
foobar:4004192.168.200.44 5e8f33d9
foobar:4005192.168.200.44 a61bfa36
foobar:4006192.168.200.44 91a23fb3
foobar:4007192.168.200.44 75d0cec8
foobar:4008192.168.200.44 4d2967bd

As you can see, when the field and upstream are concatenated in the same order as in the hostByHashing() function, many bits are identical across all of the hashes. But if you swap the concatenation order (as done in this commit), the hashes are well diffused.

My point is that the last bytes fed to FNV-1a tend to affect only a few bits of the resulting hash. The idea is to swap the concatenation order of the key and upstream strings, so that the bytes that differ between upstreams come first and have more impact on the hash's diffusion.
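
To make that concrete, here is a minimal, self-contained sketch of rendezvous (highest-random-weight) hashing with the swapped concatenation order. The names mirror Caddy's hostByHashing, but this is a simplification, not the literal patch: the real function also skips unavailable upstreams, and the Upstream type below is a stand-in.

	package main

	import (
		"fmt"
		"hash/fnv"
	)

	// Upstream is a minimal stand-in for Caddy's reverseproxy upstream type.
	type Upstream struct{ Dial string }

	func (u *Upstream) String() string { return u.Dial }

	// hash computes a fast, non-cryptographic FNV-1a hash of s.
	func hash(s string) uint32 {
		h := fnv.New32a()
		_, _ = h.Write([]byte(s))
		return h.Sum32()
	}

	// hostByHashing selects an upstream by highest-random-weight
	// ("rendezvous") hashing. Hashing upstream+key instead of key+upstream
	// puts the bytes that differ between upstreams early in the input, so
	// every subsequent FNV-1a multiply diffuses them across the hash.
	func hostByHashing(pool []*Upstream, key string) *Upstream {
		var highest uint32
		var selected *Upstream
		for _, up := range pool {
			h := hash(up.String() + key) // was: hash(key + up.String())
			if h > highest {
				highest = h
				selected = up
			}
		}
		return selected
	}

	func main() {
		pool := []*Upstream{{Dial: "foobar:4001"}, {Dial: "foobar:4002"}}
		fmt.Println(hostByHashing(pool, "192.168.200.44").Dial)
	}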

I think there is little to no performance risk in making this change. Another solution would be to change the hash function, but that could have more impact.

Some references to support my point:
https://softwareengineering.stackexchange.com/questions/273809/should-changes-to-fnv-1as-input-exhibit-the-avalanche-effect
HAProxy applies an avalanche algorithm to the result of its hashing function: https://github.com/haproxy/haproxy/blob/master/doc/internals/hashing.txt
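
For illustration, an avalanche step of that kind is just a short mixing function applied to the 32-bit result. The sketch below is not HAProxy's exact function; it uses MurmurHash3's well-known fmix32 finalizer as an example of the technique:

	package main

	import "fmt"

	// avalanche applies MurmurHash3's fmix32 finalizer, which flips about
	// half of the output bits for every input bit that changes.
	func avalanche(h uint32) uint32 {
		h ^= h >> 16
		h *= 0x85ebca6b
		h ^= h >> 13
		h *= 0xc2b2ae35
		h ^= h >> 16
		return h
	}

	func main() {
		// Two of the clustered FNV-1a values from the output above
		// diffuse to very different results.
		fmt.Printf("%08x %08x\n", avalanche(0xbc724e3c), avalanche(0xbf7252f5))
	}

Applying such a finalizer to the FNV-1a result would be the alternative to reordering the concatenation.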

The commit message:

* If the upstreams all use the same host but different ports, e.g.:
foobar:4001
foobar:4002
foobar:4003
...
then, because FNV-1a does not have a good enough avalanche effect, the hostByHashing result is not well balanced across the upstreams.

Since the last bytes of the FNV input tend to affect only a few bits, the idea is to swap the concatenation order of the key and upstream strings, so that the upstream's last byte has more impact on the hash's diffusion.
@mholt (Member) left a comment


Oh wow! Thanks for digging into this. I had no idea about that property of this hash function. (But I have always been skeptical of it.)

I've recently discovered Blake3: https://github.com/BLAKE3-team/BLAKE3 -- and I'm using it for one of my projects (but I have not analyzed it myself yet).

I can merge this PR for starters, but I'd be interested in your thoughts on using Blake3 for this (maybe a short output length) instead of FNV.

Thank you again, I really appreciate this contribution!
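
For context, a short-output BLAKE3 hash in Go could look like the sketch below. It uses lukechampine.com/blake3, one available Go implementation; the dependency choice and the hash32 helper are illustrative assumptions, not anything adopted in this thread:

	package main

	import (
		"encoding/binary"
		"fmt"

		"lukechampine.com/blake3"
	)

	// hash32 truncates a 256-bit BLAKE3 digest to 32 bits; since BLAKE3 is
	// a cryptographic hash, any prefix of its output is uniformly
	// distributed, so short outputs still diffuse well.
	func hash32(s string) uint32 {
		sum := blake3.Sum256([]byte(s))
		return binary.BigEndian.Uint32(sum[:4])
	}

	func main() {
		fmt.Printf("%08x\n", hash32("foobar:4001192.168.200.44"))
	}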

@mholt mholt added the bug 🐞 and optimization 📉 labels Dec 2, 2022
@mholt mholt merged commit d4a7d89 into caddyserver:master Dec 5, 2022
@mholt mholt added this to the v2.6.3 milestone Dec 5, 2022
@tirz commented Feb 9, 2023

> I've recently discovered Blake3: https://github.com/BLAKE3-team/BLAKE3 -- and I'm using it for one of my projects (but I have not analyzed it myself yet).

Blake3 is a cryptographic hash; maybe something like aHash would be a better fit? https://github.com/tkaitchuck/ahash

@mholt (Member) commented Feb 9, 2023

@tirz (That's written in Rust, not Go -- I'd have to find a Go implementation.) Sure, but it's still way faster than SHA-1 for instance. If it's fast, does it matter that it's cryptographic?

@tirz commented Feb 9, 2023

Well, no, but a cryptographic hash is usually slower than a hash without that constraint. I guess we just need a good cascading effect for our use case.

It's just that I have already tried aHash and Blake3 as hashing algorithms for a HashMap, and aHash was significantly faster.

But since I do not know the Go ecosystem, I do not have a library to recommend.
And by the way, the Rust std::HashMap uses SipHash-1-3 (which is a cryptographic hash), so I take back my comment.
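
As an aside, keyed SipHash is also available in Go. Here is a minimal sketch using github.com/dchest/siphash, which implements SipHash-2-4 rather than the SipHash-1-3 variant Rust's HashMap uses; the library choice is an assumption:

	package main

	import (
		"fmt"

		"github.com/dchest/siphash"
	)

	func main() {
		// A fixed 128-bit key, split into two uint64 halves; a real use
		// would pick a random key at startup.
		const k0, k1 = 0x0706050403020100, 0x0f0e0d0c0b0a0908
		sum := siphash.Hash(k0, k1, []byte("foobar:4001192.168.200.44"))
		fmt.Printf("%016x\n", sum)
	}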

@mholt (Member) commented Feb 9, 2023

No worries. We appreciate your participation!
