Too-high CPU load & event loop stall on massive socket close #894
I have just profiled what's happening at the "stall moment". You can open perf-kernel.svg in any browser to look at the performance graph. Too many objects released at the same moment block the event loop.
Tools used for perf monitoring:
ouch, thanks @allright, we'll look into that
One more possible design is to provide a FAST custom allocator/deallocator (like in C++'s std) for promises, one which has preallocated memory and doesn't actually call malloc/free every time an object is deallocated, or calls them once for a big group of objects. Another possible design is an object reuse pool. Really, it can preallocate all the needed objects at app start and deallocate them only on app stop, or manage this automatically. A real server application is usually tuned in place for the maximum possible connections/speed, so we don't need real alloc/dealloc during the app's life (only on start/stop). @weissi What do you think?
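A minimal sketch of the reuse-pool idea described here (all names are illustrative, not NIO API; this version is not thread-safe, so it would have to be one pool per event loop):

```swift
// Illustrative object reuse pool: preallocate at app start, hand objects
// out instead of allocating, and take them back instead of freeing.
final class ReusePool<T> {
    private var free: [T]
    private let make: () -> T

    init(capacity: Int, make: @escaping () -> T) {
        self.make = make
        // Preallocate everything up front, at app start.
        self.free = (0..<capacity).map { _ in make() }
    }

    func take() -> T {
        // Fall back to a real allocation only if the pool runs dry.
        return self.free.popLast() ?? self.make()
    }

    func give(_ object: T) {
        self.free.append(object)
    }
}
```

As discussed further down in the thread, a pool like this avoids malloc/free traffic but not the ARC retain/release traffic itself.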
@allright Swift unfortunately doesn’t let you choose the allocator. It will always use
@weissi Yes, I think it is reference counting.
but the reference counting operations are inserted automatically by the Swift compiler. They happen whenever something is used. Let's say you write this:
then the Swift compiler might emit code like this:
Certain reference counts can be optimised away, but generally Swift is very noisy with ref-counting operations, and we can't remove them with object pools.
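The two code snippets referenced above did not survive the formatting. As a hedged reconstruction, the example probably looked something like the following (`swift_retain`/`swift_release` are the runtime functions ARC calls; the exact placement depends on calling conventions and optimisation level):

```swift
class Connection {
    func close() {}
}

func handle(_ c: Connection) { c.close() }

let conn = Connection()
handle(conn)

// Conceptually, the compiler lowers this to something like:
//
//   swift_retain(conn)    // keep `conn` alive across the call
//   handle(conn)
//   swift_release(conn)   // balance the retain
//   swift_release(conn)   // end of `conn`'s lifetime -> may call
//                         // swift_release_dealloc -> free()
```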
Yes, not all of them. But, for example, channel handlers may be allocated/deallocated using a factory:
// use channelHandler
So, use this approach for each object, like Promise etc.
@allright sure, you could even implement this today for
Yes, the number of operations is the same, but the moment at which they happen is not. Really, how does one use the swift-nio framework on big production servers with millions of connections? One event loop per 6000 handlers?
That's not totally accurate. If you take a handler out of a pipeline, reference counts will change whether that handler is re-used or not. Sure, if they are re-used, then you don't need to deallocate, which would cause even more reference count decreases.
Totally agreed. I'm just saying that caching your handlers (which you can do today, you don't need anything from NIO) won't remove all reference count traffic when tearing down the pipeline.
I see. Let's try to fix what we can and test!)
Also, judging by the performance graph, I don't think the reference count changes take a lot of time. So let's optimise alloc/dealloc speed with reuse pools or some other way.
What you could do is store a thread local:

```swift
let threadLocalMyHandlers = NIOThreadLocal<CircularBuffer<MyHandler>>(value: .init(capacity: 32))

extension EventLoop {
    func makeMyHandler() -> MyHandler {
        if threadLocalMyHandlers.value.count > 0 {
            return threadLocalMyHandlers.value.removeFirst()
        } else {
            return MyHandler()
        }
    }
}
```

and in `handlerRemoved`:

```swift
func handlerRemoved(context: ChannelHandlerContext) {
    self.resetMyState()
    threadLocalMyHandlers.value.append(self)
}
```

(code not tested or compiled, just as an idea)
agreed
good idea) will test later)
The reason
hm ...
I have just tested this, but it is not enough (a lot of Promises cause retain/release), so these promises must be reused too. But I figured out that the stalls happen while handlerRemoved is being called massively. So I think the best solution would be to automatically spread the invokeHandlerRemoved() ... calls out over time.
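The idea of spreading the teardown out over time could be sketched roughly like this (`teardown(_:)` and the batch size are hypothetical names, not NIO API; only `EventLoop.execute` is real):

```swift
import NIO

// Sketch: instead of tearing down all N handlers in one event-loop tick,
// process a batch per tick and re-schedule the remainder, so other
// connections on the same loop keep making progress.
func teardownInBatches(_ handlers: [ChannelHandler],
                       on eventLoop: EventLoop,
                       batchSize: Int = 256) {
    var remaining = handlers[...]
    func step() {
        let batch = remaining.prefix(batchSize)
        remaining = remaining.dropFirst(batchSize)
        for handler in batch {
            teardown(handler)        // hypothetical per-handler cleanup
        }
        if !remaining.isEmpty {
            eventLoop.execute(step)  // yield; continue on the next tick
        }
    }
    eventLoop.execute(step)
}
```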
"limit the number of outstanding tasks that execute in any event loop tick"
In the real world, a server does not need dynamic memory allocation/deallocation during processing. EchoHandler() -> BackPressureHandler() -> IdleStateHandler() -> ... some other low-level handlers, like TCP etc. EchoHandler: 100000. It completely solves our problem: no massive allocations/deallocations during processing. Possible steps to implement:
P.S.
I have hit an issue using Vapor, which is based on SwiftNIO (vapor/vapor#1963).
@AnyCPU your issue isn't related to this.
@weissi is it related to SwiftNIO?
I don't think so, but we'd need more information to be 100% sure. Let's discuss this on the Vapor issue tracker.
I think it is related to the SwiftNIO architecture. Too many Future/Promise allocs/deallocs per connection.
Your graph above shows that most of the overhead is in
I recommend a workaround: create more threads (approximately no more than 5000 connections per thread).
I don't think that the atomic increment/decrement of the retain count takes a lot of time. Could you test this hypothesis?
It's an atomic increment and decrement. Just check your own profile, the
@weissi, do you think that this function is too slow?
Look at this comment: it says the problem is in alloc/dealloc.
Well, 'slow' is relative, but in your profile up top that function took about 30% of the time.
Yes, but that very much depends on the state of the CPU caches, what's on which cacheline, etc.
// 53% _swift_release_dealloc
Really, it means that 83% of the time is taken by malloc/free.
Also, this code is not a problem if the retain/release in the middle takes only 10% of the time.
But we must avoid alloc/free for Futures/Promises in the pipeline.
I have run the wrk tool using two profiles:
The issue occurs with the first option on the very first run. I hope it will somehow help.
I mean that you can increase the number of threads in the EventLoopGroup. Also, do not use the same machine for running wrk: it consumes the CPU and influences your swift-nio server's behaviour, so the tests are invalid.
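As a sketch of that suggestion (the thread count is chosen at group creation time; the parameter was named `numThreads:` on NIO 1 and `numberOfThreads:` on NIO 2):

```swift
import NIO

// Spread connections over more event loop threads so that a massive close
// on one loop stalls fewer connections. One thread per core is a common
// starting point; ~5000 connections per thread was suggested above.
let group = MultiThreadedEventLoopGroup(numberOfThreads: System.coreCount)
defer { try! group.syncShutdownGracefully() }
```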
The only allocations that happen per packet/event are the buffers for the bytes. There are no futures/promises allocated. If you set a custom
Yes. No problem here.
We can't really do anything about the retaining, unfortunately; that's due to ARC. And with jemalloc I wouldn't expect the deallocations to lead to massive issues in a real-world application. Closing a (
Yes, closing a socket is expensive. But the graph shows a retain/release problem, not the socket close. So for SwiftNIO the current normal connection open/close rate is < 1000 connections per second per thread. The real limit for a thread is 5000 RPS (because a massive open/close can suspend that thread for about 5..10 seconds to close 5000 connections, and affect the other connection handlers processed by the same thread). Yes, these benchmarks are good, but I think it can be better, even in Swift. And the next steps may be:
I hope to dig into this issue during the next several months. Right now I have no time for it :(
Again, we can't do much against
Except that ARC at the moment inserts a lot more retains/releases than you'd typically see in C++. If you find time, what would be really interesting to see is:
Thanks @weissi.
I found a very interesting architecture for async networking, fibers. So, I have tested this server in comparison with the swift-nio HTTPServer using 1 core. Fibers -> simplify code
@allright I know, fibers (commonly known as green threads) are great. However, they can't be implemented in Swift. You'd need
If/when Swift gets async/await, we will be able to write code similar to the fibers version that is more optimised too. So everything should look nicer and many things will be faster.
Is it not safe?
It is not.
Can you point to a concrete unsafe place in this code?
Incidentally, regarding the performance numbers, I should note that SwiftNIO's default HTTP pipeline configuration is "safe but slow". This is because we want to accommodate users who want to run without an nginx or similar reverse proxy in front of their server. You can remove some channel handlers from the default configuration to potentially see a substantial performance boost on load tests.
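For example, with the NIO 2 style API (the parameter set differs slightly on NIO 1), a sketch of a leaner pipeline might look like this; `MyHTTPHandler` is a hypothetical application handler and `group` an existing `EventLoopGroup`:

```swift
import NIO
import NIOHTTP1

// Skip the pipelining-assistance helper that the default HTTP server
// configuration installs for safety; keep only what the app needs.
let bootstrap = ServerBootstrap(group: group)
    .childChannelInitializer { channel in
        channel.pipeline.configureHTTPServerPipeline(
            withPipeliningAssistance: false
        ).flatMap {
            channel.pipeline.addHandler(MyHTTPHandler())
        }
    }
```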
Basically all of this.
Ok, will see.
Specifically, that assembly code is replicating many of the features of
My aim is to have a replacement for C++'s boost::asio in Swift, comparable in performance.
Is there any new progress?
Currently there has not been meaningful progress here. We have consistently pushed down the memory overhead and pushed up performance, but it remains the case that deallocations happen on the event loop.
Swift 5.5 has released async/await
Expected behavior
no stalls on socket close
Actual behavior
stalls for 10-30 seconds, up to disconnect by timeout
Steps to reproduce
video: https://yadi.sk/i/ZmAu8La5zLWfSg (to view in the best quality, you can download the file)
sources: https://github.com/allright/swift-nio-load-testing/tree/master/swift-nio-echo-server
commit: allright/swift-nio-load-testing@a461c72
VPS: 1 CPU 512 RAM ubuntu 16.0.4
tcpkali -c 20000 --connect-rate=3000 --duration=10000s --latency-connect -r 1 -m 1 echo-server.url:8888
root@us-san-gate0:~/swift-nio-load-testing/swift-nio-echo-server# cat Package.resolved

```json
{
  "object": {
    "pins": [
      {
        "package": "swift-nio",
        "repositoryURL": "https://github.com/apple/swift-nio.git",
        "state": {
          "branch": "nio-1.13",
          "revision": "29a9f2aca71c8afb07e291336f1789337ce235dd",
          "version": null
        }
      },
      {
        "package": "swift-nio-zlib-support",
        "repositoryURL": "https://github.com/apple/swift-nio-zlib-support.git",
        "state": {
          "branch": null,
          "revision": "37760e9a52030bb9011972c5213c3350fa9d41fd",
          "version": "1.0.0"
        }
      }
    ]
  },
  "version": 1
}
```
Swift version 4.2.3 (swift-4.2.3-RELEASE)
Target: x86_64-unknown-linux-gnu
Linux us-san-gate0 4.14.91.mptcp #12 SMP Wed Jan 2 17:51:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
P.S.
The same echo server implemented with C++ ASIO does not have this problem. I can supply the source code (C++) & video if needed.