Improvement on cache invalidation #462

Licenser · 2020-01-11T17:55:31Z

I've recently been fiddling with a multi-threaded application using crossbeam channels. On Threadripper I noticed that performance would degrade rapidly when the sender and receiver were on different CCXs (in other words, when the cache wasn't shared between sender and receiver).

With a bit of digging I found that the array implementation of channels suffers from the buffer not being cache aligned.

I wrapped the buffer in a CachePadded and performance improved significantly in my tests, by over 2x in some cases. That said, the tests obviously capture only a tiny part of the picture, and using a single 64-bit value in them is definitely the extreme case for triggering this effect. Still, it looks like a nice improvement.
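To illustrate what padding does to the buffer layout, here is a minimal sketch using only the standard library. `Padded` is a hypothetical stand-in for a cache-padding wrapper such as CachePadded, and the 128-byte alignment is an assumption taken from the x86-64 figure mentioned in this thread; it is not the code from this PR.

```rust
use std::mem::align_of;

/// Hypothetical stand-in for a cache-padding wrapper.
/// The 128-byte alignment is an assumption for x86-64.
#[allow(dead_code)]
#[repr(align(128))]
struct Padded<T>(T);

/// Distance in bytes between the first two elements of a slice.
fn slot_stride<T>(slots: &[T]) -> usize {
    let first = &slots[0] as *const T as usize;
    let second = &slots[1] as *const T as usize;
    second - first
}

fn main() {
    // Unpadded 64-bit values sit next to each other, so several of them
    // share one cache line.
    let plain: Vec<u64> = vec![0; 4];
    // Padded values each start on their own 128-byte boundary.
    let padded: Vec<Padded<u64>> = (0..4u64).map(Padded).collect();

    println!("plain stride:     {} bytes", slot_stride(&plain[..]));
    println!("padded stride:    {} bytes", slot_stride(&padded[..]));
    println!("padded alignment: {} bytes", align_of::<Padded<u64>>());
}
```

With neighbouring elements on separate lines, a write to one element by the sender should no longer invalidate the line holding the element the receiver is reading, which is the effect the benchmarks are probing.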

I will keep this as a draft for now: while the benchmarks look nice, the real-world impact I measured is not as big as I hoped ™️, so I think I have a bit more digging to do.

this (PR branch)

running 24 tests
test bounded_0::create    ... bench:          45 ns/iter (+/- 0)
test bounded_0::mpmc      ... bench:  48,288,434 ns/iter (+/- 7,449,535)
test bounded_0::mpsc      ... bench:  79,192,749 ns/iter (+/- 2,066,971)
test bounded_0::spmc      ... bench:  86,323,323 ns/iter (+/- 2,397,011)
test bounded_0::spsc      ... bench:  29,002,652 ns/iter (+/- 1,547,061)
test bounded_1::create    ... bench:         195 ns/iter (+/- 3)
test bounded_1::mpmc      ... bench:  20,014,231 ns/iter (+/- 436,181)
test bounded_1::mpsc      ... bench: 109,318,843 ns/iter (+/- 4,693,295)
test bounded_1::oneshot   ... bench:         180 ns/iter (+/- 1)
test bounded_1::spmc      ... bench:  97,771,406 ns/iter (+/- 3,625,496)
test bounded_1::spsc      ... bench:  18,986,039 ns/iter (+/- 238,972)
test bounded_n::mpmc      ... bench:   5,376,086 ns/iter (+/- 422,042)
test bounded_n::mpsc      ... bench:  11,749,680 ns/iter (+/- 560,767)
test bounded_n::par_inout ... bench:  13,453,292 ns/iter (+/- 966,845)
test bounded_n::spmc      ... bench:  89,016,467 ns/iter (+/- 3,262,106)
test bounded_n::spsc      ... bench:   4,137,098 ns/iter (+/- 375,743)
test unbounded::create    ... bench:         109 ns/iter (+/- 1)
test unbounded::inout     ... bench:          39 ns/iter (+/- 0)
test unbounded::mpmc      ... bench:   3,024,718 ns/iter (+/- 179,688)
test unbounded::mpsc      ... bench:   5,306,185 ns/iter (+/- 362,481)
test unbounded::oneshot   ... bench:         175 ns/iter (+/- 2)
test unbounded::par_inout ... bench:  10,732,447 ns/iter (+/- 542,191)
test unbounded::spmc      ... bench:  92,086,599 ns/iter (+/- 1,790,785)
test unbounded::spsc      ... bench:   1,303,073 ns/iter (+/- 16,593)

master

running 24 tests
test bounded_0::create    ... bench:          45 ns/iter (+/- 0)
test bounded_0::mpmc      ... bench:  47,513,539 ns/iter (+/- 7,685,319)
test bounded_0::mpsc      ... bench:  79,297,255 ns/iter (+/- 1,721,529)
test bounded_0::spmc      ... bench:  86,583,535 ns/iter (+/- 2,025,047)
test bounded_0::spsc      ... bench:  29,433,918 ns/iter (+/- 3,792,133)
test bounded_1::create    ... bench:         120 ns/iter (+/- 5)
test bounded_1::mpmc      ... bench:  19,896,780 ns/iter (+/- 523,015)
test bounded_1::mpsc      ... bench: 106,761,448 ns/iter (+/- 4,330,258)
test bounded_1::oneshot   ... bench:         138 ns/iter (+/- 3)
test bounded_1::spmc      ... bench: 100,886,592 ns/iter (+/- 2,866,250)
test bounded_1::spsc      ... bench:  28,713,988 ns/iter (+/- 1,218,632)
test bounded_n::mpmc      ... bench:   6,456,962 ns/iter (+/- 516,168)
test bounded_n::mpsc      ... bench:  13,604,237 ns/iter (+/- 338,683)
test bounded_n::par_inout ... bench:  12,855,325 ns/iter (+/- 1,735,288)
test bounded_n::spmc      ... bench:  97,568,112 ns/iter (+/- 3,793,568)
test bounded_n::spsc      ... bench:   2,035,692 ns/iter (+/- 753,005)
test unbounded::create    ... bench:         112 ns/iter (+/- 2)
test unbounded::inout     ... bench:          39 ns/iter (+/- 0)
test unbounded::mpmc      ... bench:   3,014,406 ns/iter (+/- 308,277)
test unbounded::mpsc      ... bench:   5,213,754 ns/iter (+/- 159,838)
test unbounded::oneshot   ... bench:         165 ns/iter (+/- 1)
test unbounded::par_inout ... bench:  10,640,906 ns/iter (+/- 743,346)
test unbounded::spmc      ... bench:  91,300,215 ns/iter (+/- 2,178,182)
test unbounded::spsc      ... bench:   1,480,523 ns/iter (+/- 47,803)

cynecx · 2020-01-11T19:23:31Z

Note that this will significantly increase the memory usage of channels, which is not really desirable (with this change, each slot's size will be rounded up to a multiple of 128 bytes, at least on x86-64).
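The size effect can be sketched with a simplified slot. The `Slot` and `PaddedSlot` types below are hypothetical illustrations (a stamp plus a message), not the actual definitions from the crate, and the 128-byte alignment is the x86-64 figure assumed above.

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicUsize;

/// Simplified, hypothetical slot: a stamp plus the message itself.
#[allow(dead_code)]
struct Slot<T> {
    stamp: AtomicUsize,
    msg: T,
}

/// The same slot forced onto a 128-byte boundary (assumed x86-64 padding).
/// Its size is rounded up to a multiple of its alignment.
#[allow(dead_code)]
#[repr(align(128))]
struct PaddedSlot<T>(Slot<T>);

/// Bytes needed for a buffer holding `capacity` slots of type `S`.
fn buffer_bytes<S>(capacity: usize) -> usize {
    capacity * size_of::<S>()
}

fn main() {
    let capacity = 1024;
    println!("plain slot:    {} bytes", size_of::<Slot<u64>>());
    println!("padded slot:   {} bytes", size_of::<PaddedSlot<u64>>());
    println!("plain buffer:  {} bytes", buffer_bytes::<Slot<u64>>(capacity));
    println!("padded buffer: {} bytes", buffer_bytes::<PaddedSlot<u64>>(capacity));
}
```

For small messages the padded buffer is several times larger than the unpadded one, while for messages that already fill most of a 128-byte block the overhead is proportionally smaller.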

Licenser · 2020-01-11T20:32:38Z

That's a good point. Especially for small values the memory growth would be considerable; OTOH, those are also the cases where the performance difference is most significant.

I'm not sure what the right trade-off is; perhaps it would be better suited as its own flavor.

Align elements in the array channel to the cache

27acf3b

Licenser marked this pull request as ready for review January 18, 2020 02:20

Licenser mentioned this pull request Jan 22, 2020

[discossuion] Contemplating task schedulers on NUMA systems async-rs/async-std#686

Open

jeehoonkang added the crossbeam-channel label May 20, 2020

taiki-e mentioned this pull request May 2, 2024

[wip] Cumulative micro opts (1-2% perf. gain) #1092

Draft

Licenser closed this Aug 24, 2024
