GUACAMOLE-608: Fix connection write thread may encounter dead loop#179
GUACAMOLE-608: Fix connection write thread may encounter dead loop#179changkun wants to merge 2 commits intoapache:masterfrom
Conversation
necouchman
left a comment
There was a problem hiding this comment.
I'm good with it. @mike-jumper any concerns with this change? Based on the documentation of the write call as referenced by @changkun, this seems to be the correct way to go...
src/guacd/connection.c
Outdated
|
|
||
| int written = write(fd, buffer, length); | ||
| if (written < 0) | ||
| if (written < 0 || errno > 0) |
There was a problem hiding this comment.
The contents of errno are not defined for a successful call to write(). I don't believe the standard allows us to rely on checking solely errno to determine whether write() failed.
There was a problem hiding this comment.
As far as I know from a higher level language, Go wraps write call and return error if errno is not zero.
There was a problem hiding this comment.
Unless the standards behind errno or write() define what errno will be after a successful call to write(), it is non-portable to make assumptions about that behavior.
There was a problem hiding this comment.
How about if (written < 0 || (written == 0 && errno > 0))?
This condition completely satisfies the man page description.
There was a problem hiding this comment.
Yep. That looks good to me.
|
@changkun, can you point to any documentation specifically covering the circumstances under which I'm hesitant to simply handle all 0 returns as equivalent to errors, as the normal behavior for such a case is to retry the |
|
@mike-jumper There was a link in this pull request to the documentation, which says this:
https://linux.die.net/man/2/write This does say that this should only be the case where fd is a regular file, so, presumably, this should not happen when this is a socket?? |
|
Yeah, that's were I'm not sure. The 0 return seems clearly defined for writing to files and for non-blocking sockets, but it's unclear to me whether there are non-error conditions for a blocking socket which would result in a If it is true that a 0 return should really not happen for a blocking socket, or if 0 is only really handled as end-of-stream for a blocking socket, then handling 0 as if it were an error makes sense. For non-blocking sockets, a 0 return is not necessarily fatal, and likely indicates that the send attempt should be retried (send buffer is full, etc.). If that is the case here, then the proper solution would be to |
|
@mike-jumper So far, I investigated the phenomenon described in the PR description, the
Do you have any other guess for it? |
|
It definitely should be fixed, but whether the fix is to interpret 0 as an error condition (these changes) or to interpret 0 as a need to retry (
Does this mean you are encountering this in practice? |
|
@mike-jumper Yes, very few connection processes satisfy the three conditions are observed in practice. |
|
What I mean is, if you are encountering this in practice, then that might be an opportunity to add some temporary additional logging and see what conditions surround a |
|
So @mike-jumper you good with this PR, now, or did you still want @changkun to chase down specific situations where this is occurring before we push forward with a merge? |
|
The latter. We need more information before I am comfortable with these changes as they stand. The circumstances that If If It comes down to more than just determining specific situations where this occurs in practice. That may help find a starting point, but we will ultimately need standards documentation to support either solution. |
mike-jumper
left a comment
There was a problem hiding this comment.
As mentioned above, we need to know that write() returning 0 for a blocking socket is an error condition before we can safely handle it as a fatal error.
|
For example, what if |
|
Yeah, @mike-jumper, I completely understand your concerns, to be honest, I am also not confident regarding the change. Speaking of the virgin idea of the PR, it intends to solve the issue of high CPU usage rate. let's take a look at those high CPU rate processes. $ top
top - 17:32:27 up 35 days, 7:48, 0 users, load average: 4.43, 4.02, 4.07
Tasks: 20 total, 1 running, 19 sleeping, 0 stopped, 0 zombie
%Cpu(s): 39.8 us, 1.4 sy, 0.0 ni, 57.0 id, 1.6 wa, 0.0 hi, 0.1 si, 0.1 st
KiB Mem: 8175596 total, 7683964 used, 491632 free, 385312 buffers
KiB Swap: 0 total, 0 used, 0 free. 4100680 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26329 root 20 0 2250720 24896 3040 S 98.2 0.3 14352:42 guacd
57599 root 20 0 2249092 25236 3164 S 98.2 0.3 224:35.78 guacd
50821 root 20 0 2250724 27004 3168 S 91.7 0.3 2992:43 guacd
1 root 20 0 21752 856 576 S 0.0 0.0 0:00.02 init.sh
12 root 20 0 5712804 21156 712 S 0.0 0.3 50:55.79 guacd
13 root 20 0 46440 1160 768 S 0.0 0.0 0:00.00 su
14 jetty 20 0 4002728 321704 2024 S 0.0 3.9 28:59.74 java
58607 root 20 0 5707804 31532 3012 S 0.0 0.4 0:01.36 guacd
58648 root 20 0 5697740 35216 4152 S 0.0 0.4 0:01.46 guacd
58852 root 20 0 5748784 31620 3012 S 0.0 0.4 0:00.73 guacd
58876 root 20 0 5747880 36248 4156 S 0.0 0.4 0:00.60 guacd
$ top -H -p 50821
top - 17:31:06 up 35 days, 7:47, 0 users, load average: 3.75, 3.87, 4.02
Threads: 3 total, 1 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 85.8 us, 0.7 sy, 0.0 ni, 13.0 id, 0.0 wa, 0.0 hi, 0.2 si, 0.2 st
KiB Mem: 8175596 total, 7947360 used, 228236 free, 400124 buffers
KiB Swap: 0 total, 0 used, 0 free. 4268008 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
50825 root 20 0 2250724 27004 3168 R 98.9 0.3 2991:25 guacd
50821 root 20 0 2250724 27004 3168 S 0.0 0.3 0:00.01 guacd
50827 root 20 0 2250724 27004 3168 S 0.0 0.3 0:00.87 guacd
$ top -H -p 26329
top - 17:31:53 up 35 days, 7:47, 0 users, load average: 3.95, 3.90, 4.03
Threads: 3 total, 1 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 92.3 us, 0.9 sy, 0.0 ni, 6.6 id, 0.0 wa, 0.0 hi, 0.1 si, 0.2 st
KiB Mem: 8175596 total, 8049068 used, 126528 free, 396788 buffers
KiB Swap: 0 total, 0 used, 0 free. 4276156 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26333 root 20 0 2250720 24896 3040 R 96.2 0.3 14352:05 guacd
26329 root 20 0 2250720 24896 3040 S 0.0 0.3 0:00.00 guacd
26335 root 20 0 2250720 24896 3040 S 0.0 0.3 0:04.45 guacdthread creation:
Surprisingly for a connectionless thread can still reading a non zero buffer from fd. |
|
Therefore my hypothesis is Dead loop backtrace: |
|
close due to #181 |
A backtrace is only going to be useful with corresponding symbols. These addresses are going to be specific to your build of whatever-is-in-memory-at-that-location.
@changkun, #181 is an entirely different change covering a different loop. Can you clarify why you believe these changes are no longer needed in light of #181? |
|
Hi @mike-jumper, those backtraces with corresponding symbols are presented in #181 . I generally believe #181 fixes the infinite loop problem. In speaking of this PR changes, there was a "wrong" deployment (master branch with this PR changes, instead of using 0.9.14 releasing) a week ago and there was an automatically disconnection observed in our practices and CPU spinning remains. Despite the deployment was not controlled by one condition, the remaining CPU spinning seems still proves this PR doesn't help. Back to the write call, I reread the doc provided in above. It seems we all misread the documentation before.
This explains no matter how write call wraps syscall, -1 is returned if there was an error.
and apparently /* Transfer data from file descriptor to socket */
while ((length = guac_socket_read(params->socket, buffer, sizeof(buffer))) > 0) {
if (__write_all(params->fd, buffer, length) < 0)
break;
}see guacamole-server/src/guacd/connection.c Line 121 in 332e187 |
|
FYI: the write call is called correctly in guacd. However check errno in this PR is still a right way. I quote an answer from Dietrich on StackOverflow below, and TLDR: errno is not set if write return non-negative, but it can be negative if it was not reset by the caller.
|
guacd involves a
__write_allfunction to write instruction as much as possible. However, system callwriteis possible return 0 and also set errno, which is not verified in the function.A possible case is
writekeeps return 0 (nothing writes to buffer). Therefore the daemon process encounters a dead loop. Furthermore, it leads CPU rate up to 99%.The above phenomenon is observed from server:
This patch perhaps needs more discussion on when and why this is happening, as well as whether should we record the failure to log, for instance: