Race in vnode worker pool #298
When a riak_core_vnode_worker finishes work, it sends checkin messages to both poolboy and riak_core_vnode_worker_pool. The latter maintains a queue of work to be handled when there's room in the pool. As soon as RCVWP gets the checkin message, it asks poolboy whether a worker is available (expecting that the worker that just checked in will now be available).
The problem is that poolboy may receive RCVWP's message before receiving the worker's checkin message. If this happens, it tells RCVWP that the pool is full. RCVWP then remains stuck in the 'queueing' state until it receives another checkin message from a worker. Since another checkin may never arrive, the pool may freeze permanently.
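The ordering hazard can be sketched with a tiny simulation (this is illustrative Python, not the actual Erlang code; `run`, the message names, and the counter are all hypothetical stand-ins for poolboy's state machine). The worker's checkin and RCVWP's availability query come from *different* senders, so poolboy may see them in either order:

```python
def run(poolboy_sees):
    """Toy model of poolboy's mailbox. Capacity grows on 'checkin';
    a 'checkout?' query is answered from whatever the capacity is
    at the moment the query is processed."""
    free_workers = 0          # the one worker is still checked out
    reply_to_rcvwp = None
    for msg in poolboy_sees:
        if msg == "checkin":
            free_workers += 1
        elif msg == "checkout?":
            # poolboy answers immediately based on current state
            reply_to_rcvwp = "worker" if free_workers > 0 else "full"
            if free_workers > 0:
                free_workers -= 1
    return reply_to_rcvwp

# Happy ordering: the checkin lands first, so RCVWP gets a worker.
print(run(["checkin", "checkout?"]))   # -> worker

# Racy ordering: the query overtakes the checkin. RCVWP is told the
# pool is full and parks in 'queueing'; if no further checkin ever
# arrives, it stays there forever.
print(run(["checkout?", "checkin"]))   # -> full
```

The key point is that Erlang only guarantees message ordering between a single sender/receiver pair, so nothing forces the first interleaving.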
The test defined by worker_pool_pulse.erl on the bwf-pool-race branch of riak_core demonstrates this race. Under PULSE execution, the test will fail with deadlock.
In order to run the test, you will need the pulse_otp beams from https://github.com/Quviq/pulse_otp on your code path. You will also need to compile poolboy with PULSE annotations - the bwf-pool-race branch of basho/poolboy provides this.
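For reference, the setup might look roughly like this (a sketch only: the exact build commands and the use of ERL_LIBS are assumptions - check each repo's README for the real steps):

```shell
# Fetch pulse_otp and put its beams on the code path
# (build step and ERL_LIBS usage are assumptions).
git clone https://github.com/Quviq/pulse_otp
export ERL_LIBS=$PWD/pulse_otp:$ERL_LIBS

# Build poolboy with PULSE annotations from the branch named above.
git clone https://github.com/basho/poolboy
cd poolboy && git checkout bwf-pool-race && cd ..

# Check out the riak_core branch carrying worker_pool_pulse.erl.
git clone https://github.com/basho/riak_core
cd riak_core && git checkout bwf-pool-race
```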
Since I think PULSE's graphical output is quite cool, I'll include it here:
The problem is illustrated by the last four steps of "spawn_opt1", the poolboy process (in cyan), and the last six steps of "spawn_opt", the riak_core_vnode_worker_pool process (in blue).
Now fixing begins…
A short update: the simple fix of making riak_core_vnode_worker_pool solely responsible for calling poolboy:checkin works for the happy path, but uncovers the exact same race when the worker dies. Both poolboy and RCVWP are monitoring the worker, and the 'DOWN' message and subsequent checkout request might arrive at the poolboy FSM in any order. Further test expansion and fixing underway…
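Extending the toy simulation above shows why the simple fix helps the happy path but not the death path (again illustrative Python with hypothetical names, not the real code). When RCVWP is the sole sender of both the checkin and the checkout request, Erlang's point-to-point ordering guarantee pins the interleaving; a monitor 'DOWN' signal, however, comes from the runtime - a different sender - so it still races with RCVWP's request:

```python
from itertools import permutations

def poolboy(mailbox):
    """Toy model: capacity is restored by 'checkin' or by reaping a
    dead worker on 'DOWN'; 'checkout?' is answered from current capacity."""
    capacity, reply = 0, None
    for msg in mailbox:
        if msg in ("checkin", "DOWN"):
            capacity += 1
        elif msg == "checkout?":
            reply = "worker" if capacity else "full"
    return reply

# Simple fix, happy path: RCVWP sends both messages itself, so only
# one delivery order is possible and the race disappears.
assert poolboy(["checkin", "checkout?"]) == "worker"

# Worker death: 'DOWN' comes from the runtime, not from RCVWP, so
# both interleavings with RCVWP's 'checkout?' remain reachable.
results = {poolboy(list(p)) for p in permutations(["DOWN", "checkout?"])}
# results contains both answers - the deadlock-prone 'full' is still possible.
```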