-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-44241][Core] Mistakenly set io.connectionTimeout/connectionCreationTimeout to zero or negative will cause incessant executor cons/destructions #41785
Conversation
…ationTimeout to zero or negative will cause incessant executor cons/destructions
...etwork-common/src/test/java/org/apache/spark/network/client/TransportClientFactorySuite.java
Show resolved
Hide resolved
cc @cloud-fan @dongjoon-hyun thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pinging me, @yaooqinn .
@@ -276,10 +277,19 @@ public void initChannel(SocketChannel ch) { | |||
// Connect to the remote server | |||
long preConnect = System.nanoTime(); | |||
ChannelFuture cf = bootstrap.connect(address); | |||
if (!cf.await(conf.connectionCreationTimeoutMs())) { | |||
|
|||
if (connCreateTimeout <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although negative cases are prevented already via connectionCreationTimeoutMs
changes, this condition looks safer.
@@ -237,4 +235,31 @@ public void fastFailConnectionInTimeWindow() { | |||
Assert.assertThrows("fail this connection directly", IOException.class, | |||
() -> factory.createClient(TestUtils.getLocalHost(), unreachablePort, true)); | |||
} | |||
|
|||
@Test | |||
public void unlimitedConnectionAndCreationTimeouts() throws IOException, InterruptedException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this.
cc @mridulm |
thanks @dongjoon-hyun for the check |
Instead of trying to work around the issue - why not simply fail when invalid config ( |
Hi @mridulm, These network settings prefixed by different modules are not registered as TypedConfigBuilder for Usually, negative values for timeouts in spark are valid, e.g. |
To clarify, I am not saying fail on all <= 0 values - but fail where relevant, instead of trying to workaround it. If negative value or zero makes sense for a config - that is perfectly fine. I used checkValue as an example to show the behavior, not impl :-) |
Thanks @mridulm
I'm +1 w/ it. As checking these networking settings is not very convenient on the driver side before scheduler starts, if you don't have strong option for this, I'd like to merge this. |
For |
…ationTimeout to zero or negative will cause incessant executor cons/destructions ### What changes were proposed in this pull request? This PR makes zero when io.connectionTimeout/connectionCreationTimeout is negative. Zero here means - connectionCreationTimeout = 0,an unlimited CONNNETION_TIMEOUT for connection establishment - connectionTimeout=0, `IdleStateHandler` for triggering `IdleStateEvent` is disabled. ### Why are the changes needed? 1. This PR fixes a bug when connectionCreationTimeout is 0, which means unlimited to netty, but ChannelFuture.await(0) fails directly and inappropriately. 2. This PR fixes a bug when connectionCreationTimeout is less than 0, which causes meaningless transport client reconnections and endless executor reconstructions ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new unit tests Closes #41785 from yaooqinn/SPARK-44241. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 38645fa) Signed-off-by: Kent Yao <yao@apache.org>
thanks @dongjoon-hyun @mridulm @Hisoka-X for the check, merged to master/3.4/3.3 |
…ationTimeout to zero or negative will cause incessant executor cons/destructions ### What changes were proposed in this pull request? This PR makes zero when io.connectionTimeout/connectionCreationTimeout is negative. Zero here means - connectionCreationTimeout = 0,an unlimited CONNNETION_TIMEOUT for connection establishment - connectionTimeout=0, `IdleStateHandler` for triggering `IdleStateEvent` is disabled. ### Why are the changes needed? 1. This PR fixes a bug when connectionCreationTimeout is 0, which means unlimited to netty, but ChannelFuture.await(0) fails directly and inappropriately. 2. This PR fixes a bug when connectionCreationTimeout is less than 0, which causes meaningless transport client reconnections and endless executor reconstructions ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new unit tests Closes #41785 from yaooqinn/SPARK-44241. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 38645fa) Signed-off-by: Kent Yao <yao@apache.org>
@@ -245,12 +245,13 @@ TransportClient createClient(InetSocketAddress address) | |||
logger.debug("Creating new connection to {}", address); | |||
|
|||
Bootstrap bootstrap = new Bootstrap(); | |||
int connCreateTimeout = conf.connectionCreationTimeoutMs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have simply thrown an exception when connCreateTimeout <= 0
- and we can avoid the rest of the PR.
Will that not work ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
w/o this PR, here we simply throw an exception, it doesn't work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will throw the exception and user will fix invalid config first time they run: exact same behavior as with checkValue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not the same; checkvalue fails app in driver ahead and tells users what is invalid. He we fail during executor allocating. Users may not encounter exceptions, or exceptions and config are suppressed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This failure is in createClient
(when introduced) - and will fail all invocations for that module - which will in turn fail the task/stage/job. This is similar in behavior to a configuration referenced in executor which fails in checkValue
.
The failure reason will inform the user about the invalid configuration.
Do you have a scenario where an exception thrown in this codepath does not result in failure ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that createClient
failures of executor startup will not result in task fail. The executor will fail after max io retries, and then the app attempt will fail appMaster after the max number of executor failures is reached, and there are still many app attempts waiting for us to retry. It could be hours before you notice the app fail without the driver log telling what mistake we've made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yaooqinn we have discussed this quite a bit already and now going in circles :-)
Essentially my point is this - instead of rejecting invalid configurations, we are trying to workaround invalid user configs. Once released, this will be a pattern we have to then continue supporting - I am not seeing a good reason why we have to introduce this new behavior.
I would recommend simply rejecting the invalid config with an IllegalArgumentException
(we do this in a bunch of other places as well for invalid user config), instead of introducing behavior which tries to workaround it [1]: as long as the user gets to see the exception, they can fix the issue and mitigate it.
Thoughts ?
+CC @dongjoon-hyun as well, since you reviewed the PR.
[1] If there is utility in introducing this, and not as a workaround for user config, that would be interesting to know more about - it is not coming out from the discussion here or in the jira.
if (!cf.await(conf.connectionCreationTimeoutMs())) { | ||
|
||
if (connCreateTimeout <= 0) { | ||
cf.awaitUninterruptibly(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a plain await()
instead of awaitUninterruptibly()
, since the latter might lead to uninterruptible canceled tasks. See #16866 for a past fix for a similar issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. Could you make a PR to fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that we need a new JIRA to fix that instead of a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed https://issues.apache.org/jira/browse/SPARK-44920 and opened #42619 to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @JoshRosen. The fix looks good to me.
While considering the usage of await
and awaitUninterruptibly
, I consulted the Netty documentation (https://netty.io/4.0/api/io/netty/channel/ChannelFuture.html) which recommended the use of awaitUninterruptibly
in all its GOOD examples. So...
…TransportClientFactory.createClient() ### What changes were proposed in this pull request? #41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call. Note that the other pre-existing branches in this method were already using `await()`. ### Why are the changes needed? Uninterruptible waiting can cause problems when cancelling tasks. For details, see #16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42619 from JoshRosen/remove-awaitUninterruptibly. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org>
…TransportClientFactory.createClient() ### What changes were proposed in this pull request? #41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call. Note that the other pre-existing branches in this method were already using `await()`. ### Why are the changes needed? Uninterruptible waiting can cause problems when cancelling tasks. For details, see #16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42619 from JoshRosen/remove-awaitUninterruptibly. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 2137606) Signed-off-by: Kent Yao <yao@apache.org>
…TransportClientFactory.createClient() ### What changes were proposed in this pull request? #41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call. Note that the other pre-existing branches in this method were already using `await()`. ### Why are the changes needed? Uninterruptible waiting can cause problems when cancelling tasks. For details, see #16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42619 from JoshRosen/remove-awaitUninterruptibly. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 2137606) Signed-off-by: Kent Yao <yao@apache.org>
…TransportClientFactory.createClient() ### What changes were proposed in this pull request? #41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call. Note that the other pre-existing branches in this method were already using `await()`. ### Why are the changes needed? Uninterruptible waiting can cause problems when cancelling tasks. For details, see #16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42619 from JoshRosen/remove-awaitUninterruptibly. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 2137606) Signed-off-by: Kent Yao <yao@apache.org>
…ationTimeout to zero or negative will cause incessant executor cons/destructions ### What changes were proposed in this pull request? This PR makes zero when io.connectionTimeout/connectionCreationTimeout is negative. Zero here means - connectionCreationTimeout = 0,an unlimited CONNNETION_TIMEOUT for connection establishment - connectionTimeout=0, `IdleStateHandler` for triggering `IdleStateEvent` is disabled. ### Why are the changes needed? 1. This PR fixes a bug when connectionCreationTimeout is 0, which means unlimited to netty, but ChannelFuture.await(0) fails directly and inappropriately. 2. This PR fixes a bug when connectionCreationTimeout is less than 0, which causes meaningless transport client reconnections and endless executor reconstructions ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new unit tests Closes apache#41785 from yaooqinn/SPARK-44241. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 38645fa) Signed-off-by: Kent Yao <yao@apache.org>
…TransportClientFactory.createClient() ### What changes were proposed in this pull request? apache#41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call. Note that the other pre-existing branches in this method were already using `await()`. ### Why are the changes needed? Uninterruptible waiting can cause problems when cancelling tasks. For details, see apache#16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#42619 from JoshRosen/remove-awaitUninterruptibly. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 2137606) Signed-off-by: Kent Yao <yao@apache.org>
…tive celeborn.<module>.io.connectTimeout/connectionTimeout ### What changes were proposed in this pull request? `TransportClientFactory` should regard as zero for negative `celeborn.<module>.io.connectTimeout` and `celeborn.<module>.io.connectionTimeout`. ### Why are the changes needed? When `celeborn.<module>.io.connectionTimeout` is 0 that means unlimited to netty, `ChannelFuture.await(0)` fails directly and inappropriately. Meanwhile, whhen `celeborn.<module>.io.connectionTimeout` is less than 0 that causes meaningless transport client reconnections and endless reconstructions. Backport: - apache/spark#41785 - apache/spark#42619 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `TransportClientFactorySuiteJ#unlimitedConnectAndConnectionTimeouts` Closes #2467 from SteNicholas/CELEBORN-1392. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
…tive celeborn.<module>.io.connectTimeout/connectionTimeout ### What changes were proposed in this pull request? `TransportClientFactory` should regard as zero for negative `celeborn.<module>.io.connectTimeout` and `celeborn.<module>.io.connectionTimeout`. ### Why are the changes needed? When `celeborn.<module>.io.connectionTimeout` is 0 that means unlimited to netty, `ChannelFuture.await(0)` fails directly and inappropriately. Meanwhile, whhen `celeborn.<module>.io.connectionTimeout` is less than 0 that causes meaningless transport client reconnections and endless reconstructions. Backport: - apache/spark#41785 - apache/spark#42619 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `TransportClientFactorySuiteJ#unlimitedConnectAndConnectionTimeouts` Closes #2467 from SteNicholas/CELEBORN-1392. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
What changes were proposed in this pull request?
This PR makes zero when io.connectionTimeout/connectionCreationTimeout is negative. Zero here means
IdleStateHandler
for triggeringIdleStateEvent
is disabled.Why are the changes needed?
Does this PR introduce any user-facing change?
no
How was this patch tested?
new unit tests