Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

100% CPU usage [after sending message] on Swift ARM #1337

Closed
Bouke opened this issue Jan 14, 2020 · 25 comments
Closed

100% CPU usage [after sending message] on Swift ARM #1337

Bouke opened this issue Jan 14, 2020 · 25 comments

Comments

@Bouke
Copy link
Contributor

Bouke commented Jan 14, 2020

Expected behavior

When there's not much going on, little CPU usage. Increase in CPU while there's activity.

Actual behavior

Continuously high CPU usage, even when there's no network traffic.

Steps to reproduce

I'm not sure what exactly triggers the problem in my application, but I've observed the same problem with the chat client. So including those steps here:

  1. Install SwiftARM on a Raspberry Pi (https://swift-arm.com/install-swift/)
  2. Run both chat server and client (swift run NIOChat(Server|Client))
  3. Notice 100% cpu usage on NIOChatClient

SwiftNIO version/commit hash

Tested against version 2.12.0 and master (f44fe0b).

Swift & OS version (output of swift --version && uname -a)

$ swift --version
Swift version 5.1.3 (swift-5.1.3-RELEASE)
Target: armv6-unknown-linux-gnueabihf

$ uname -a
Linux pi3.local 4.19.66-v7+ #1253 SMP Thu Aug 15 11:49:46 BST 2019 armv7l GNU/Linux

Flame graph

Some symbols are missing due to perf version mismatch (I think). Both the chat client and my application ("my-home") are included. Attached are the zoomable flame graphs: flame-graphs.zip.

image

@Lukasa
Copy link
Contributor

Lukasa commented Jan 14, 2020

The common cause for this is that we’re spinning in epoll_wait, which implies there’s some I/O event we haven’t processed. This is usually EPOLL_RDHUP. I will do my best to see if I can reproduce this tomorrow, though the fact that this affects a raspberry pi makes life moderately awkward: I’ll see if we can reproduce it in another system.

@weissi
Copy link
Member

weissi commented Jan 14, 2020

@Bouke Thanks for the detailed report! As @Lukasa points out, this points to us not properly processing some event so epoll_wait constantly returns again, telling us about the same event again.

If you can, would you mind to attach a few lines of the output of

strace -ff -e trace=epoll_ctl,epoll_wait -p PID_OF_PROCESS

whilst in the 100% CPU spin loop? Also, is this 32 or 64 bit?

@Lukasa
Copy link
Contributor

Lukasa commented Jan 14, 2020

Based on the target triple it’s 32-bit: at least the OS is.

@Bouke
Copy link
Contributor Author

Bouke commented Jan 14, 2020

strace output for NIOChatClient:

strace: Process 11180 attached with 2 threads
[pid 11225] epoll_wait(5, [{EPOLLIN, {u32=7, u64=7}}], 64, -1) = 1
[pid 11225] epoll_wait(5, [{EPOLLIN, {u32=7, u64=7}}], 64, -1) = 1
[pid 11225] epoll_wait(5, [{EPOLLIN, {u32=7, u64=7}}], 64, -1) = 1
(...)
[pid 11225] epoll_wait(5, [{EPOLLIN, {u32=7, u64=7}}], 64, -1) = 1
[pid 11225] epoll_wait(5, [{EPOLLIN, {u32=7, u64=7}}], 64, -1) = 1
[pid 11225] epoll_wait(5, [{EPOLLIN, {u32=7, u64=7}}], 64, -1) = 1
strace: Process 11180 detached
strace: Process 11225 detached

strace output for my own application:

strace: Process 11289 attached with 8 threads
[pid 11302] epoll_wait(16,  <unfinished ...>
[pid 11301] epoll_wait(13, [{EPOLLIN, {u32=15, u64=15}}], 64, -1) = 1
[pid 11300] epoll_wait(10,  <unfinished ...>
[pid 11299] epoll_wait(7,  <unfinished ...>
[pid 11300] <... epoll_wait resumed> [{EPOLLIN, {u32=12, u64=12}}], 64, -1) = 1
[pid 11301] epoll_wait(13, [{EPOLLIN, {u32=15, u64=15}}], 64, -1) = 1
[pid 11300] epoll_wait(10, [{EPOLLIN, {u32=12, u64=12}}], 64, -1) = 1
[pid 11301] epoll_wait(13,  <unfinished ...>
[pid 11295] epoll_wait(3,  <unfinished ...>
[pid 11301] <... epoll_wait resumed> [{EPOLLIN, {u32=15, u64=15}}], 64, -1) = 1
[pid 11300] epoll_wait(10,  <unfinished ...>
[pid 11301] epoll_wait(13,  <unfinished ...>
[pid 11300] <... epoll_wait resumed> [{EPOLLIN, {u32=12, u64=12}}], 64, -1) = 1
[pid 11301] <... epoll_wait resumed> [{EPOLLIN, {u32=15, u64=15}}], 64, -1) = 1
(...)
[pid 11300] epoll_wait(10, 
strace: Process 11289 detached
strace: Process 11295 detached
strace: Process 11299 detached
strace: Process 11300 detached
 <detached ...>
strace: Process 11301 detached
strace: Process 11302 detached
strace: Process 11303 detached
strace: Process 11304 detached

(Note that this is not an official Swift build, but provided by the Swift-ARM community. Official builds don't target ARM, nor 32 bit. See also: https://github.com/uraimo/buildSwiftOnARM and https://swift-arm.com)

@weissi
Copy link
Member

weissi commented Jan 14, 2020

@Bouke thanks for the fast response. Interestingly, this is the EPOLLIN event which means that epoll tells us about data that's available to be read. Could you give us another few lines with the following command?

strace -ff -e trace=epoll_ctl,epoll_wait,read -p PID_OF_PROCESS

(the only difference to the previous one is the ,read)

@Bouke
Copy link
Contributor Author

Bouke commented Jan 14, 2020

Sure, that would be:

[pid 11841] epoll_wait(5, [{EPOLLIN, {u32=7, u64=7}}], 64, -1) = 1
[pid 11841] read(7, 0x7493f550, 4)      = -1 EINVAL (Invalid argument)

Thank you as well.

@Lukasa
Copy link
Contributor

Lukasa commented Jan 14, 2020

einval is a baffling response, why isn’t that causing us to error out?

@weissi
Copy link
Member

weissi commented Jan 14, 2020

@Bouke wow, that's read returning EINVAL which as of the docs is

EINVAL
fd is attached to an object which is unsuitable for reading; or the file was opened with the O_DIRECT flag, and either the address specified in buf, the value specified in count, or the current file offset is not suitably aligned.
EINVAL
fd was created via a call to timerfd_create(2) and the wrong size buffer was given to read(); see timerfd_create(2) for further information.

@weissi
Copy link
Member

weissi commented Jan 14, 2020

could we have

lsof -p PID

alongside the same strace command? Need to see what fd the one is where read is returning EINVAL.

@Bouke
Copy link
Contributor Author

Bouke commented Jan 14, 2020

I think this is what you're asking for?

[pid 12153] read(7, 0x74928550, 4)      = -1 EINVAL (Invalid argument)
COMMAND     PID USER   FD      TYPE DEVICE SIZE/OFF   NODE NAME
NIOChatCl 12133   pi    7u  a_inode   0,11        0   5804 [timerfd]

@weissi
Copy link
Member

weissi commented Jan 14, 2020

why isn’t that causing us to error out?

it's this guy: https://github.com/apple/swift-nio/blob/master/Sources/NIO/Selector.swift#L552

@weissi
Copy link
Member

weissi commented Jan 14, 2020

thank you @Bouke ! this should allow us to debug it, clearly we're doing the timerfd wrong :)

@weissi
Copy link
Member

weissi commented Jan 14, 2020

oooh I think timerfd needs to be 64bit always

@weissi
Copy link
Member

weissi commented Jan 14, 2020

yes, this line

_ = Glibc.read(self.timerFD, &val, MemoryLayout<UInt>.size)

must be

_ = Glibc.read(self.timerFD, &val, MemoryLayout<UInt64>.size)

@weissi
Copy link
Member

weissi commented Jan 14, 2020

@Bouke if you want, you could make this fix yourself, I think you literally just need to add 64 to this line: https://github.com/apple/swift-nio/blob/master/Sources/NIO/Selector.swift#L552

@weissi
Copy link
Member

weissi commented Jan 14, 2020

docs are here:

The file descriptor returned by timerfd_create() supports the following operations:
read(2)
If the timer has already expired one or more times since its settings were last modified using timerfd_settime(), or since the last successful read(2), then the buffer given to read(2) returns an unsigned 8-byte integer (uint64_t) containing the number of expirations that have occurred. (The returned value is in host byte order, i.e., the native byte order for integers on the host machine.)
If no timer expirations have occurred at the time of the read(2), then the call either blocks until the next timer expiration, or fails with the error EAGAIN if the file descriptor has been made nonblocking (via the use of the fcntl(2) F_SETFL operation to set the O_NONBLOCK flag).
A read(2) will fail with the error EINVAL if the size of the supplied buffer is less than 8 bytes.

https://linux.die.net/man/2/timerfd_create

@Bouke
Copy link
Contributor Author

Bouke commented Jan 14, 2020

Hmm yeah that makes sense! (Although I haven't the faintest clue what this does)

       read(2)
              If the timer has already expired one or more times since its settings were  last  modified
              using  timerfd_settime(),  or  since the last successful read(2), then the buffer given to
              read(2) returns an unsigned 8-byte integer (uint64_t) containing the number of expirations
              that  have  occurred.   (The returned value is in host byte order—that is, the native byte
              order for integers on the host machine.)

              If no timer expirations have occurred at the time of the read(2),  then  the  call  either
              blocks  until  the  next  timer  expiration,  or  fails  with the error EAGAIN if the file
              descriptor has been made nonblocking (via the use of the fcntl(2) F_SETFL operation to set
              the O_NONBLOCK flag).

              A  read(2) will fail with the error EINVAL if the size of the supplied buffer is less than
              8 bytes.

@Bouke
Copy link
Contributor Author

Bouke commented Jan 14, 2020

That fixed the issue for the chat client. I'm now going to test against my own application.

@weissi
Copy link
Member

weissi commented Jan 14, 2020

@Bouke by reading 64 bits from the "timerfd" you just tell the kernel "ok, thanks, I have processed the timer you woke me up for". But we didn't read enough and therefore the kernel thought we hadn't processed the timer and told us about the same timer again, straight away, in a loop...

@Bouke
Copy link
Contributor Author

Bouke commented Jan 14, 2020

After compiling for what feels like an eternity (the raspberry's not that fast), I can confirm that this is also the fix for my application. Thanks so much everyone!

@weissi
Copy link
Member

weissi commented Jan 15, 2020

fixed by #1340 , thanks @Bouke for report and fix! CC @helje5 who might be interested too

@normanmaurer
Copy link
Member

@weissi shouldn't we use edge triggered for the timerfd and so it should not "matter" ?

@helje5
Copy link
Contributor

helje5 commented Jan 15, 2020

I should have written a test! 🤦‍♀

@helje5
Copy link
Contributor

helje5 commented Jan 15, 2020

@Bouke You don't have to compile on the Raspi, you can use X-compilers, e.g. https://github.com/CSCIX65G/SwiftCrossCompilers

@weissi
Copy link
Member

weissi commented Jan 15, 2020

@normanmaurer because of deregistrations, we sometimes go back to the kernel to have all events redelivered so we can’t use edge triggered.

Also, reading off that the timer fired is still very important even in edge-triggered as we would otherwise only ever get one event (because we never acknowledge it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants