Switch socket timeout logic to use `poll()` instead of `select()` #168

alanxz · 2014-02-12T20:29:52Z

Use poll() instead of select() when doing timeout operations on sockets in rabbitmq-c.

The text was updated successfully, but these errors were encountered:

abogosyan · 2014-02-14T17:17:13Z

Why don't you use setsockopt with SO_SNDTIMEO / SO_RCVTIMEO for timeouts?

alanxz · 2014-02-14T17:34:06Z

That was actually what was used for the first (unreleased) iteration of the timeout code. Unfortunately it was not well supported across different platforms, especially Win32 where the docs suggest that if a timeout occurs on a socket, the socket should be considered in an undefined state and should be closed.

cloudwu · 2014-02-15T07:03:28Z

Our system core dump the day before yesterday because of select() :( (The sockfd > 1024)

lrascao · 2014-02-16T17:58:16Z

there's a define to increase the ammount of descriptors in the select mask, FD_SETSIZE if i'm not mistaken

cloudwu · 2014-02-17T01:46:03Z

redis fixed this in 2012

redis/redis#267

在 2014年2月17日，上午1:58，Luis Rascão notifications@github.com 写道：

there's a define to increase the ammount of descriptors in the select mask, FD_SETSIZE if i'm not mistaken

—
Reply to this email directly or view it on GitHub.

alanxz · 2014-02-18T03:34:04Z

@lrascao that workaround is very dependent on which platform you're compiling for. In many cases it will cause crashes or other undefined behaviors.

@cloudwu your use-case is the motivation behind this patch.

cloudwu · 2014-02-18T05:25:05Z

We use rabbitmq-c in our online game server, which hold about 8K TCP connections for players in one process . rabbitmq-c is for the platform message queue.

The rabbitmq-c works fine usually because the platform message queue established at the beginning , the sockfd is very small at that time.

But the connection for platform message queue was broken last week, and the server try to reestablish it immediately. And then, it crashed .

We check the core dump file and found amqp_open_socket_noblock() couldn't return because the return address on stack was 0 , and portnumber_string[] was zero ,too.

So I guess the reason is socketfd was too large at that time, and select() wipe the stack.

alanxz · 2014-02-18T18:29:38Z

It wasn't select() that wiped the stack in your case, but the FD_SET() call that did it. On Linux struct fd_set is implemented as a FD_SETSIZE sized bitmask, with each position in the bitmask representing an fd number. FD_SET() doesn't do any bounds checking so when its called with an fd greater than FD_SETSIZE it will write to memory beyond the struct. In rabbitmq-c the struct fd_set is allocated on the stack, so when this write occurs it screws up the stack, and thus the behavior you saw.

cloudwu · 2014-02-19T00:05:10Z

@alanxz I checked the core dump file, all the portnumber_string[] were set to zero after select. But FD_SET() only call twice before select (set 2 bits to 1) , so I guess select() clear them.

It seems that select() doesn't know the size of fd_set on Linux, it use nfds (the first parameter of select) as the size of fd_set , it may clear the bits in them.

abogosyan · 2014-02-20T08:59:48Z

Hey, Alan. We have faced an interesting problem in our application.
We are using your library in this way:

There is a working thread to work with librabbitmq, it polls tasks from some kind of "task queue", based on boost::asio.
Each task represents an rabbitmq-c API call to publish something or make RPC call to Rabbit, i.e. "amqp_queue_declare", "amqp_queue_bind", "amqp_new_connection".
After polling the tasks, the working thread tries to consume something with timeout using "amqp_consume_message" just like in your example.
So we have a thread which publish something, or doing some "send-like" work and then trying to consume something in cycle. All calls to rabbitmq-c are made from this thread.

We are using very small timeout for consuming: timeval timeout = { 0, 1000 }, so we don't spend too much time for consuming and don't affecting our publish rate, but still have some throttling.

We found that sometimes, in very rare situation, amqp_consume_message can "hang" until something arrived in the socket.

We've made some investigation and we suspect that there is a problem with select in the code.
Under Linux select() can change the timeout in unpredictable way, so when you adjust timeout in EINTR handling code, you can get very big timeout and hang until some data arrived in socket.

quote:
"Also note that on error the contents of the timeval struct are undefined when select() returns. Since you can't guarantee not to get an error (EINTR is most common) you'll need a mechanism to track time remaining (if you care) anyway, so using the contents of the timeval struct after return is a waste of time in most situations even without portability concerns."

We gonna check this using some test with consuming thread being bombed with signals to force EINTR. We also think that poll() could fix this.

Thanks for your library and sorry for my weak english =)

Use poll(2) instead of select(2) to do timeout operations on sockets. This helps with the situation where the fd is larger than FD_MAXSIZE. Fixes #168

alanxz · 2014-04-15T04:53:02Z

I've merged in an implementation that uses poll() to do the timeouts. Please test it carefully, as I haven't really stressed the code on many different systems yet.

Use poll(2) instead of select(2) to do timeout operations on sockets. This helps with the situation where the fd is larger than FD_MAXSIZE. Fixes alanxz#168

alanxz added a commit that referenced this issue Mar 14, 2014

Use poll(2) for timeouts on socket

f726930

Use poll(2) instead of select(2) to do timeout operations on sockets. This helps with the situation where the fd is larger than FD_MAXSIZE. Fixes #168

alanxz mentioned this issue Mar 14, 2014

Use poll(2) for timeouts on socket #176

Merged

alanxz added a commit that referenced this issue Mar 30, 2014

Use poll(2) for timeouts on socket

2300531

Use poll(2) instead of select(2) to do timeout operations on sockets. This helps with the situation where the fd is larger than FD_MAXSIZE. Fixes #168

alanxz added a commit that referenced this issue Apr 15, 2014

Use poll(2) for timeouts on socket

a78aa8a

Use poll(2) instead of select(2) to do timeout operations on sockets. This helps with the situation where the fd is larger than FD_MAXSIZE. Fixes #168

alanxz closed this as completed in #176 Apr 15, 2014

sid1029 pushed a commit to sid1029/rabbitmq-c that referenced this issue Jul 15, 2014

Use poll(2) for timeouts on socket

fec09b8

Use poll(2) instead of select(2) to do timeout operations on sockets. This helps with the situation where the fd is larger than FD_MAXSIZE. Fixes alanxz#168

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch socket timeout logic to use `poll()` instead of `select()` #168

Switch socket timeout logic to use `poll()` instead of `select()` #168

alanxz commented Feb 12, 2014

abogosyan commented Feb 14, 2014

alanxz commented Feb 14, 2014

cloudwu commented Feb 15, 2014

lrascao commented Feb 16, 2014

cloudwu commented Feb 17, 2014

alanxz commented Feb 18, 2014

cloudwu commented Feb 18, 2014

alanxz commented Feb 18, 2014

cloudwu commented Feb 19, 2014

abogosyan commented Feb 20, 2014

alanxz commented Apr 15, 2014

Switch socket timeout logic to use poll() instead of select() #168

Switch socket timeout logic to use poll() instead of select() #168

Comments

alanxz commented Feb 12, 2014

abogosyan commented Feb 14, 2014

alanxz commented Feb 14, 2014

cloudwu commented Feb 15, 2014

lrascao commented Feb 16, 2014

cloudwu commented Feb 17, 2014

alanxz commented Feb 18, 2014

cloudwu commented Feb 18, 2014

alanxz commented Feb 18, 2014

cloudwu commented Feb 19, 2014

abogosyan commented Feb 20, 2014

alanxz commented Apr 15, 2014

Switch socket timeout logic to use `poll()` instead of `select()` #168

Switch socket timeout logic to use `poll()` instead of `select()` #168