Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed race condition between write operation and send timeout #1108

Merged
merged 1 commit into from
Jan 25, 2018

Conversation

merlimat
Copy link
Contributor

Motivation

There is a possible race condition between a send operation getting written on a socket channel and getting timed-out at the same time. The problem is reproducible by setting very low timeout (eg: 10ms) at sustained publish rate.

The reason of the race condition is that we pass the instance of the OpSendMsg to the io-thread to get the ByteBuf to write into the socket but the timeout is touching it as well. We should instead take a reference on the ByteBuf itself for which we have already a reference increased for the write operation.

This is the seen exception:

2018-01-24 14:24:08,738 - WARN  - [pulsar-client-io-2-2:Slf4JLogger@146] - A task raised an exception. Task: org.apache.pulsar.client.impl.ProducerImpl$WriteInEventLoopCallback@4b2480e1
java.lang.NullPointerException: msg
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:785)
	at org.apache.pulsar.client.impl.ProducerImpl$WriteInEventLoopCallback.run(ProducerImpl.java:462)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)

@merlimat merlimat added the type/bug The PR fixed a bug or issue reported a bug label Jan 24, 2018
@merlimat merlimat added this to the 1.22.0-incubating milestone Jan 24, 2018
@merlimat merlimat self-assigned this Jan 24, 2018
}

try {
cnx.ctx().writeAndFlush(op.cmd, cnx.ctx().voidPromise());
cnx.ctx().writeAndFlush(cmd, cnx.ctx().voidPromise());
} finally {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need
ReferenceCountUtil.safeRelease(cmd);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buffer gets release when the write operation completes

Copy link
Contributor

@rdhabalia rdhabalia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice catch.

@merlimat merlimat merged commit e1ad269 into apache:master Jan 25, 2018
@merlimat merlimat deleted the fix-timeout-race-condition branch January 26, 2018 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants