Switch socket timeout logic to use `poll()` instead of `select()` #168

Closed
alanxz opened this Issue Feb 12, 2014 · 11 comments

Projects

None yet

4 participants

@alanxz
Owner
alanxz commented Feb 12, 2014

Use poll() instead of select() when doing timeout operations on sockets in rabbitmq-c.

@abogosyan

Why don't you use setsockopt with SO_SNDTIMEO / SO_RCVTIMEO for timeouts?

@alanxz
Owner
alanxz commented Feb 14, 2014

That was actually what was used for the first (unreleased) iteration of the timeout code. Unfortunately it was not well supported across different platforms, especially Win32 where the docs suggest that if a timeout occurs on a socket, the socket should be considered in an undefined state and should be closed.

@cloudwu
cloudwu commented Feb 15, 2014

Our system core dump the day before yesterday because of select() :( (The sockfd > 1024)

@lrascao
lrascao commented Feb 16, 2014

there's a define to increase the ammount of descriptors in the select mask, FD_SETSIZE if i'm not mistaken

@cloudwu
cloudwu commented Feb 17, 2014

redis fixed this in 2012

antirez/redis#267

在 2014年2月17日,上午1:58,Luis Rascão notifications@github.com 写道:

there's a define to increase the ammount of descriptors in the select mask, FD_SETSIZE if i'm not mistaken


Reply to this email directly or view it on GitHub.

@alanxz
Owner
alanxz commented Feb 18, 2014

@lrascao that workaround is very dependent on which platform you're compiling for. In many cases it will cause crashes or other undefined behaviors.

@cloudwu your use-case is the motivation behind this patch.

@cloudwu
cloudwu commented Feb 18, 2014

We use rabbitmq-c in our online game server, which hold about 8K TCP connections for players in one process . rabbitmq-c is for the platform message queue.

The rabbitmq-c works fine usually because the platform message queue established at the beginning , the sockfd is very small at that time.

But the connection for platform message queue was broken last week, and the server try to reestablish it immediately. And then, it crashed .

We check the core dump file and found amqp_open_socket_noblock() couldn't return because the return address on stack was 0 , and portnumber_string[] was zero ,too.

So I guess the reason is socketfd was too large at that time, and select() wipe the stack.

@alanxz
Owner
alanxz commented Feb 18, 2014

It wasn't select() that wiped the stack in your case, but the FD_SET() call that did it. On Linux struct fd_set is implemented as a FD_SETSIZE sized bitmask, with each position in the bitmask representing an fd number. FD_SET() doesn't do any bounds checking so when its called with an fd greater than FD_SETSIZE it will write to memory beyond the struct. In rabbitmq-c the struct fd_set is allocated on the stack, so when this write occurs it screws up the stack, and thus the behavior you saw.

@cloudwu
cloudwu commented Feb 19, 2014

@alanxz I checked the core dump file, all the portnumber_string[] were set to zero after select. But FD_SET() only call twice before select (set 2 bits to 1) , so I guess select() clear them.

It seems that select() doesn't know the size of fd_set on Linux, it use nfds (the first parameter of select) as the size of fd_set , it may clear the bits in them.

@abogosyan

Hey, Alan. We have faced an interesting problem in our application.
We are using your library in this way:

  • There is a working thread to work with librabbitmq, it polls tasks from some kind of "task queue", based on boost::asio.
  • Each task represents an rabbitmq-c API call to publish something or make RPC call to Rabbit, i.e. "amqp_queue_declare", "amqp_queue_bind", "amqp_new_connection".
  • After polling the tasks, the working thread tries to consume something with timeout using "amqp_consume_message" just like in your example.
  • So we have a thread which publish something, or doing some "send-like" work and then trying to consume something in cycle. All calls to rabbitmq-c are made from this thread.

We are using very small timeout for consuming: timeval timeout = { 0, 1000 }, so we don't spend too much time for consuming and don't affecting our publish rate, but still have some throttling.

We found that sometimes, in very rare situation, amqp_consume_message can "hang" until something arrived in the socket.

We've made some investigation and we suspect that there is a problem with select in the code.
Under Linux select() can change the timeout in unpredictable way, so when you adjust timeout in EINTR handling code, you can get very big timeout and hang until some data arrived in socket.

quote:
"Also note that on error the contents of the timeval struct are undefined when select() returns. Since you can't guarantee not to get an error (EINTR is most common) you'll need a mechanism to track time remaining (if you care) anyway, so using the contents of the timeval struct after return is a waste of time in most situations even without portability concerns."

We gonna check this using some test with consuming thread being bombed with signals to force EINTR. We also think that poll() could fix this.

Thanks for your library and sorry for my weak english =)

@alanxz alanxz added a commit that referenced this issue Mar 14, 2014
@alanxz Use poll(2) for timeouts on socket
Use poll(2) instead of select(2) to do timeout operations on sockets. This helps
with the situation where the fd is larger than FD_MAXSIZE.

Fixes #168
f726930
@alanxz alanxz added a commit that referenced this issue Mar 30, 2014
@alanxz Use poll(2) for timeouts on socket
Use poll(2) instead of select(2) to do timeout operations on sockets. This helps
with the situation where the fd is larger than FD_MAXSIZE.

Fixes #168
2300531
@alanxz alanxz added a commit that referenced this issue Apr 15, 2014
@alanxz Use poll(2) for timeouts on socket
Use poll(2) instead of select(2) to do timeout operations on sockets. This helps
with the situation where the fd is larger than FD_MAXSIZE.

Fixes #168
a78aa8a
@alanxz alanxz closed this in #176 Apr 15, 2014
@alanxz
Owner
alanxz commented Apr 15, 2014

I've merged in an implementation that uses poll() to do the timeouts. Please test it carefully, as I haven't really stressed the code on many different systems yet.

@sid1029 sid1029 added a commit to sid1029/rabbitmq-c that referenced this issue Jul 15, 2014
@alanxz @sid1029 + sid1029 Use poll(2) for timeouts on socket
Use poll(2) instead of select(2) to do timeout operations on sockets. This helps
with the situation where the fd is larger than FD_MAXSIZE.

Fixes #168
fec09b8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment