Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-24352][core][tests] De-flake StandaloneDynamicAllocationSuite blacklist test #25318

Closed
wants to merge 1 commit into from

Conversation

vanzin
Copy link
Contributor

@vanzin vanzin commented Jul 31, 2019

The issue is that the test tried to stop an existing scheduler and replace it with
a new one set up for the test. That can cause issues because both were sharing the
same RpcEnv underneath, and unregistering RpcEndpoints is actually asynchronous
(see comment in Dispatcher.unregisterRpcEndpoint). So that could lead to races where
the new scheduler tried to register before the old one was fully unregistered.

The updated test avoids the issue by using a separate RpcEnv / scheduler instance
altogether, and also avoids a misleading NPE in the test logs.

…blacklist test.

The issue is that the test tried to stop an existing scheduler and replace it with
a new one set up for the test. That can cause issues because both were sharing the
same RpcEnv underneath, and unregistering RpcEndpoints is actually asynchronous
(see comment in Dispatcher.unregisterRpcEndpoint). So that could lead to races where
the new scheduler tried to register before the old one was fully registered.

The updated test avoids the issue by using a separate RpcEnv / scheduler instance
altogether, and also avoids a misleading NPE in the test logs.
@SparkQA
Copy link

SparkQA commented Jul 31, 2019

Test build #108492 has finished for PR 25318 at commit d04eac1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.
The newly create RpcEnv looks safe. Also, since the blacklist related logic is inside CoarseGrainedSchedulerBackend instead StandaloneSchedulerBackend, the patch looks good.

Merged to master/2.4/2.3. Thank you, @vanzin .

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-24352][core][tests] De-flake StandaloneDynamicAllocationSuite blacklist test. [SPARK-24352][core][tests] De-flake StandaloneDynamicAllocationSuite blacklist test Aug 1, 2019
dongjoon-hyun pushed a commit that referenced this pull request Aug 1, 2019
…blacklist test

The issue is that the test tried to stop an existing scheduler and replace it with
a new one set up for the test. That can cause issues because both were sharing the
same RpcEnv underneath, and unregistering RpcEndpoints is actually asynchronous
(see comment in Dispatcher.unregisterRpcEndpoint). So that could lead to races where
the new scheduler tried to register before the old one was fully unregistered.

The updated test avoids the issue by using a separate RpcEnv / scheduler instance
altogether, and also avoids a misleading NPE in the test logs.

Closes #25318 from vanzin/SPARK-24352.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit b3ffd8b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Aug 1, 2019
…blacklist test

The issue is that the test tried to stop an existing scheduler and replace it with
a new one set up for the test. That can cause issues because both were sharing the
same RpcEnv underneath, and unregistering RpcEndpoints is actually asynchronous
(see comment in Dispatcher.unregisterRpcEndpoint). So that could lead to races where
the new scheduler tried to register before the old one was fully unregistered.

The updated test avoids the issue by using a separate RpcEnv / scheduler instance
altogether, and also avoids a misleading NPE in the test logs.

Closes #25318 from vanzin/SPARK-24352.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit b3ffd8b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@vanzin vanzin deleted the SPARK-24352 branch August 5, 2019 18:01
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
…blacklist test

The issue is that the test tried to stop an existing scheduler and replace it with
a new one set up for the test. That can cause issues because both were sharing the
same RpcEnv underneath, and unregistering RpcEndpoints is actually asynchronous
(see comment in Dispatcher.unregisterRpcEndpoint). So that could lead to races where
the new scheduler tried to register before the old one was fully unregistered.

The updated test avoids the issue by using a separate RpcEnv / scheduler instance
altogether, and also avoids a misleading NPE in the test logs.

Closes apache#25318 from vanzin/SPARK-24352.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit b3ffd8b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
…blacklist test

The issue is that the test tried to stop an existing scheduler and replace it with
a new one set up for the test. That can cause issues because both were sharing the
same RpcEnv underneath, and unregistering RpcEndpoints is actually asynchronous
(see comment in Dispatcher.unregisterRpcEndpoint). So that could lead to races where
the new scheduler tried to register before the old one was fully unregistered.

The updated test avoids the issue by using a separate RpcEnv / scheduler instance
altogether, and also avoids a misleading NPE in the test logs.

Closes apache#25318 from vanzin/SPARK-24352.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit b3ffd8b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
XUJiahua pushed a commit to XUJiahua/spark that referenced this pull request Apr 9, 2020
…blacklist test

The issue is that the test tried to stop an existing scheduler and replace it with
a new one set up for the test. That can cause issues because both were sharing the
same RpcEnv underneath, and unregistering RpcEndpoints is actually asynchronous
(see comment in Dispatcher.unregisterRpcEndpoint). So that could lead to races where
the new scheduler tried to register before the old one was fully unregistered.

The updated test avoids the issue by using a separate RpcEnv / scheduler instance
altogether, and also avoids a misleading NPE in the test logs.

Closes apache#25318 from vanzin/SPARK-24352.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit b3ffd8b)

Cloudera ID: CDH-79079

Change-Id: Ia67a22046c2e4e392d083f3723cd18433f407f91
(cherry picked from commit 3261a02f43c0013c981743ae999d97141538e39d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants