Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-44241][Core] Mistakenly set io.connectionTimeout/connectionCreationTimeout to zero or negative will cause incessant executor cons/destructions #41785

Closed
wants to merge 1 commit into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Jun 29, 2023

What changes were proposed in this pull request?

This PR makes zero when io.connectionTimeout/connectionCreationTimeout is negative. Zero here means

  • connectionCreationTimeout = 0,an unlimited CONNNETION_TIMEOUT for connection establishment
  • connectionTimeout=0, IdleStateHandler for triggering IdleStateEvent is disabled.

Why are the changes needed?

  1. This PR fixes a bug when connectionCreationTimeout is 0, which means unlimited to netty, but ChannelFuture.await(0) fails directly and inappropriately.
  2. This PR fixes a bug when connectionCreationTimeout is less than 0, which causes meaningless transport client reconnections and endless executor reconstructions

Does this PR introduce any user-facing change?

no

How was this patch tested?

new unit tests

…ationTimeout to zero or negative will cause incessant executor cons/destructions
@github-actions github-actions bot added the CORE label Jun 29, 2023
@yaooqinn
Copy link
Member Author

cc @cloud-fan @dongjoon-hyun thanks

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pinging me, @yaooqinn .

@@ -276,10 +277,19 @@ public void initChannel(SocketChannel ch) {
// Connect to the remote server
long preConnect = System.nanoTime();
ChannelFuture cf = bootstrap.connect(address);
if (!cf.await(conf.connectionCreationTimeoutMs())) {

if (connCreateTimeout <= 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although negative cases are prevented already via connectionCreationTimeoutMs changes, this condition looks safer.

@@ -237,4 +235,31 @@ public void fastFailConnectionInTimeWindow() {
Assert.assertThrows("fail this connection directly", IOException.class,
() -> factory.createClient(TestUtils.getLocalHost(), unreachablePort, true));
}

@Test
public void unlimitedConnectionAndCreationTimeouts() throws IOException, InterruptedException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this.

@dongjoon-hyun
Copy link
Member

cc @mridulm

@yaooqinn
Copy link
Member Author

thanks @dongjoon-hyun for the check

@mridulm
Copy link
Contributor

mridulm commented Jun 30, 2023

Instead of trying to work around the issue - why not simply fail when invalid config (<= 0 in this case) is specified ?
Within core, for example, invalid config values result in checkValue throwing IllegalArgumentException and so app failure.

@yaooqinn
Copy link
Member Author

Hi @mridulm,

These network settings prefixed by different modules are not registered as TypedConfigBuilder for checkValue.

Usually, negative values for timeouts in spark are valid, e.g. spark.sql.streaming.stopTimeout(0 or negative values wait indefinitely) and here also at least 90% valid for the underlying network protocol.

@mridulm
Copy link
Contributor

mridulm commented Jun 30, 2023

To clarify, I am not saying fail on all <= 0 values - but fail where relevant, instead of trying to workaround it.

If negative value or zero makes sense for a config - that is perfectly fine.

I used checkValue as an example to show the behavior, not impl :-)

@yaooqinn
Copy link
Member Author

Thanks @mridulm

If negative value or zero makes sense for a config - that is perfectly fine.

I'm +1 w/ it.

As checking these networking settings is not very convenient on the driver side before scheduler starts, if you don't have strong option for this, I'd like to merge this.

@yaooqinn
Copy link
Member Author

Just FYI, https://github.com/netty/netty/blob/4.1/handler/src/main/java/io/netty/handler/timeout/IdleStateHandler.java#L198

For io.connectionTimeouts, this is also adjusted by netty. I guess it forgots to do this for io.connectionCreationTimeout, a.k.a CONNNETION_TIMEOUT

@yaooqinn yaooqinn closed this in 38645fa Jun 30, 2023
yaooqinn added a commit that referenced this pull request Jun 30, 2023
…ationTimeout to zero or negative will cause incessant executor cons/destructions

### What changes were proposed in this pull request?

This PR makes zero when io.connectionTimeout/connectionCreationTimeout is negative. Zero here means
- connectionCreationTimeout = 0,an unlimited CONNNETION_TIMEOUT for connection establishment
- connectionTimeout=0, `IdleStateHandler` for triggering `IdleStateEvent` is disabled.

### Why are the changes needed?

1. This PR fixes a bug when connectionCreationTimeout is 0, which means unlimited to netty, but ChannelFuture.await(0) fails directly and inappropriately.
2. This PR fixes a bug when connectionCreationTimeout is less than 0, which causes meaningless transport client reconnections and endless executor reconstructions

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new unit tests

Closes #41785 from yaooqinn/SPARK-44241.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 38645fa)
Signed-off-by: Kent Yao <yao@apache.org>
@yaooqinn
Copy link
Member Author

thanks @dongjoon-hyun @mridulm @Hisoka-X for the check, merged to master/3.4/3.3

yaooqinn added a commit that referenced this pull request Jun 30, 2023
…ationTimeout to zero or negative will cause incessant executor cons/destructions

### What changes were proposed in this pull request?

This PR makes zero when io.connectionTimeout/connectionCreationTimeout is negative. Zero here means
- connectionCreationTimeout = 0,an unlimited CONNNETION_TIMEOUT for connection establishment
- connectionTimeout=0, `IdleStateHandler` for triggering `IdleStateEvent` is disabled.

### Why are the changes needed?

1. This PR fixes a bug when connectionCreationTimeout is 0, which means unlimited to netty, but ChannelFuture.await(0) fails directly and inappropriately.
2. This PR fixes a bug when connectionCreationTimeout is less than 0, which causes meaningless transport client reconnections and endless executor reconstructions

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new unit tests

Closes #41785 from yaooqinn/SPARK-44241.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 38645fa)
Signed-off-by: Kent Yao <yao@apache.org>
@@ -245,12 +245,13 @@ TransportClient createClient(InetSocketAddress address)
logger.debug("Creating new connection to {}", address);

Bootstrap bootstrap = new Bootstrap();
int connCreateTimeout = conf.connectionCreationTimeoutMs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have simply thrown an exception when connCreateTimeout <= 0 - and we can avoid the rest of the PR.
Will that not work ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

w/o this PR, here we simply throw an exception, it doesn't work

Copy link
Contributor

@mridulm mridulm Jul 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will throw the exception and user will fix invalid config first time they run: exact same behavior as with checkValue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the same; checkvalue fails app in driver ahead and tells users what is invalid. He we fail during executor allocating. Users may not encounter exceptions, or exceptions and config are suppressed

Copy link
Contributor

@mridulm mridulm Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This failure is in createClient (when introduced) - and will fail all invocations for that module - which will in turn fail the task/stage/job. This is similar in behavior to a configuration referenced in executor which fails in checkValue.
The failure reason will inform the user about the invalid configuration.

Do you have a scenario where an exception thrown in this codepath does not result in failure ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that createClient failures of executor startup will not result in task fail. The executor will fail after max io retries, and then the app attempt will fail appMaster after the max number of executor failures is reached, and there are still many app attempts waiting for us to retry. It could be hours before you notice the app fail without the driver log telling what mistake we've made.

Copy link
Contributor

@mridulm mridulm Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yaooqinn we have discussed this quite a bit already and now going in circles :-)
Essentially my point is this - instead of rejecting invalid configurations, we are trying to workaround invalid user configs. Once released, this will be a pattern we have to then continue supporting - I am not seeing a good reason why we have to introduce this new behavior.

I would recommend simply rejecting the invalid config with an IllegalArgumentException (we do this in a bunch of other places as well for invalid user config), instead of introducing behavior which tries to workaround it [1]: as long as the user gets to see the exception, they can fix the issue and mitigate it.

Thoughts ?

+CC @dongjoon-hyun as well, since you reviewed the PR.

[1] If there is utility in introducing this, and not as a workaround for user config, that would be interesting to know more about - it is not coming out from the discussion here or in the jira.

@yaooqinn yaooqinn deleted the SPARK-44241 branch July 5, 2023 01:49
if (!cf.await(conf.connectionCreationTimeoutMs())) {

if (connCreateTimeout <= 0) {
cf.awaitUninterruptibly();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a plain await() instead of awaitUninterruptibly(), since the latter might lead to uninterruptible canceled tasks. See #16866 for a past fix for a similar issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Could you make a PR to fix it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we need a new JIRA to fix that instead of a followup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @JoshRosen. The fix looks good to me.

While considering the usage of await and awaitUninterruptibly, I consulted the Netty documentation (https://netty.io/4.0/api/io/netty/channel/ChannelFuture.html) which recommended the use of awaitUninterruptibly in all its GOOD examples. So...

yaooqinn pushed a commit that referenced this pull request Aug 23, 2023
…TransportClientFactory.createClient()

### What changes were proposed in this pull request?

#41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call.

Note that the other pre-existing branches in this method were already using `await()`.

### Why are the changes needed?

Uninterruptible waiting can cause problems when cancelling tasks. For details, see #16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42619 from JoshRosen/remove-awaitUninterruptibly.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
yaooqinn pushed a commit that referenced this pull request Aug 23, 2023
…TransportClientFactory.createClient()

### What changes were proposed in this pull request?

#41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call.

Note that the other pre-existing branches in this method were already using `await()`.

### Why are the changes needed?

Uninterruptible waiting can cause problems when cancelling tasks. For details, see #16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42619 from JoshRosen/remove-awaitUninterruptibly.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 2137606)
Signed-off-by: Kent Yao <yao@apache.org>
yaooqinn pushed a commit that referenced this pull request Aug 23, 2023
…TransportClientFactory.createClient()

### What changes were proposed in this pull request?

#41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call.

Note that the other pre-existing branches in this method were already using `await()`.

### Why are the changes needed?

Uninterruptible waiting can cause problems when cancelling tasks. For details, see #16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42619 from JoshRosen/remove-awaitUninterruptibly.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 2137606)
Signed-off-by: Kent Yao <yao@apache.org>
yaooqinn pushed a commit that referenced this pull request Aug 23, 2023
…TransportClientFactory.createClient()

### What changes were proposed in this pull request?

#41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call.

Note that the other pre-existing branches in this method were already using `await()`.

### Why are the changes needed?

Uninterruptible waiting can cause problems when cancelling tasks. For details, see #16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42619 from JoshRosen/remove-awaitUninterruptibly.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 2137606)
Signed-off-by: Kent Yao <yao@apache.org>
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
…ationTimeout to zero or negative will cause incessant executor cons/destructions

### What changes were proposed in this pull request?

This PR makes zero when io.connectionTimeout/connectionCreationTimeout is negative. Zero here means
- connectionCreationTimeout = 0,an unlimited CONNNETION_TIMEOUT for connection establishment
- connectionTimeout=0, `IdleStateHandler` for triggering `IdleStateEvent` is disabled.

### Why are the changes needed?

1. This PR fixes a bug when connectionCreationTimeout is 0, which means unlimited to netty, but ChannelFuture.await(0) fails directly and inappropriately.
2. This PR fixes a bug when connectionCreationTimeout is less than 0, which causes meaningless transport client reconnections and endless executor reconstructions

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new unit tests

Closes apache#41785 from yaooqinn/SPARK-44241.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 38645fa)
Signed-off-by: Kent Yao <yao@apache.org>
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
…TransportClientFactory.createClient()

### What changes were proposed in this pull request?

apache#41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call.

Note that the other pre-existing branches in this method were already using `await()`.

### Why are the changes needed?

Uninterruptible waiting can cause problems when cancelling tasks. For details, see apache#16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#42619 from JoshRosen/remove-awaitUninterruptibly.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 2137606)
Signed-off-by: Kent Yao <yao@apache.org>
FMX pushed a commit to apache/celeborn that referenced this pull request Apr 19, 2024
…tive celeborn.<module>.io.connectTimeout/connectionTimeout

### What changes were proposed in this pull request?

`TransportClientFactory` should regard as zero for negative `celeborn.<module>.io.connectTimeout` and `celeborn.<module>.io.connectionTimeout`.

### Why are the changes needed?

When `celeborn.<module>.io.connectionTimeout` is 0 that means unlimited to netty, `ChannelFuture.await(0)` fails directly and inappropriately. Meanwhile, whhen `celeborn.<module>.io.connectionTimeout` is less than 0 that causes meaningless transport client reconnections and endless reconstructions.

Backport:

- apache/spark#41785
- apache/spark#42619

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`TransportClientFactorySuiteJ#unlimitedConnectAndConnectionTimeouts`

Closes #2467 from SteNicholas/CELEBORN-1392.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
gotikkoxq added a commit to gotikkoxq/celeborn that referenced this pull request Aug 26, 2024
…tive celeborn.<module>.io.connectTimeout/connectionTimeout

### What changes were proposed in this pull request?

`TransportClientFactory` should regard as zero for negative `celeborn.<module>.io.connectTimeout` and `celeborn.<module>.io.connectionTimeout`.

### Why are the changes needed?

When `celeborn.<module>.io.connectionTimeout` is 0 that means unlimited to netty, `ChannelFuture.await(0)` fails directly and inappropriately. Meanwhile, whhen `celeborn.<module>.io.connectionTimeout` is less than 0 that causes meaningless transport client reconnections and endless reconstructions.

Backport:

- apache/spark#41785
- apache/spark#42619

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`TransportClientFactorySuiteJ#unlimitedConnectAndConnectionTimeouts`

Closes #2467 from SteNicholas/CELEBORN-1392.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants