[SPARK-32037][CORE] Rename blacklisting feature #29906
Conversation
Test build #129254 has finished for PR 29906 at commit
Thanks a lot for putting up this PR @tgravescs ! I am sure it was not a fun effort :)
I only took a quick look for now and will happily help go more in depth when I have a bit more time. Besides some inline comments I left, some overall comments:
- I think the naming conventions are great.
- I see a few references to `blocked`/`blocklist` hanging around in the new code; was this intentional? I think they should be `excluded`/`excludelist` based on the proposed terminology.
- It looks like we've also renamed Spark's references to the YARN-level blacklisting feature (e.g. YARN-4576) in `YarnAllocator` and friends. I'm not sure this is appropriate. YARN still calls it blacklisting, and it may be confusing for us to refer to it by another name. That feature also behaves differently from Spark's, so I'm not sure excludeOnFailure is the right name regardless. Was this an intentional change, or did it just accidentally get swept up?
Outdated review threads (resolved):
- core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (two threads)
- core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala
- core/src/main/scala/org/apache/spark/status/AppStatusSource.scala
- sql/hive-thriftserver/v1.2/src/main/java/org/apache/hive/service/auth/HiveAuthFactory.java
- resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala
Thanks, I started out with blocklist and must have missed converting a few; I'll fix those up. On the YARN allocator side I obviously left the actual calls into the client, but tried to rename the variables and such to something else. There is a Hadoop JIRA to remove blacklisting from there as well, but I have no idea when it will be implemented, so I was just trying to rename where it made sense.
Test build #129279 has finished for PR 29906 at commit
Looks like some new files were added; I'll upmerge to latest.
Thoughts, @srowen, @dongjoon-hyun?
- appStatusSource.foreach(_.BLACKLISTED_EXECUTORS.inc())
+ appStatusSource.foreach(_.EXCLUDED_EXECUTORS.inc())
I wonder whether this could lead to the metric being overcounted, since we always post two blacklist events?
Thanks for catching this one, I missed it. I had checked these and most are fine because they use a set and just re-set a status that is already set, but I somehow missed this one. I'll fix it.
I updated this, but I actually found a pre-existing bug where we weren't incrementing this when we excluded a node, which implicitly excludes its executors.
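The double-counting concern and the node-exclusion fix above can be sketched as follows. This is a hypothetical illustration, not Spark's actual `AppStatusSource` code; the class and method names are invented for the example:

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.mutable

// Hypothetical sketch: count each executor at most once, even if it is
// excluded both directly and implicitly via its node being excluded.
class ExcludedExecutorCounter {
  private val counter = new AtomicLong(0)
  private val seen = mutable.Set.empty[String]

  // Returns true only on the first exclusion of a given executor id,
  // so a double-posted event does not inflate the metric.
  def markExcluded(execId: String): Boolean = synchronized {
    if (seen.add(execId)) { counter.incrementAndGet(); true } else false
  }

  // Excluding a node implicitly excludes every executor running on it.
  def markNodeExcluded(execsOnNode: Seq[String]): Unit =
    execsOnNode.foreach(markExcluded)

  def value: Long = counter.get()
}
```

Guarding with a set like this is why most of the other call sites were fine: re-marking an already-excluded executor is a no-op.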
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130210 has finished for PR 29906 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130220 has finished for PR 29906 at commit
Thanks for working on this, Tom! Looks good to me.
@Ngone51 any other comments?
…e found

### What changes were proposed in this pull request?
This PR proposes to skip test reporting ("Report test results") if no JUnit XML files are found. Currently, we're running and skipping the tests dynamically. For example:
- if there are only changes in SparkR at the underlying commit, it runs only the SparkR tests, skips the other tests, and generates JUnit XML files for SparkR test cases only.
- if there are only changes in `docs` at the underlying commit, the build skips all tests except linters and does not generate any JUnit XML files.

When the test reporting ("Report test results") job is triggered after the main build ("Build and test") is finished and no JUnit XML files are found, it reports the case as a failure. See https://github.com/apache/spark/runs/1196184007 as an example. This PR works around it by simply skipping the test report when no JUnit XML files are found. Please see apache#29906 (comment) for more details.

### Why are the changes needed?
To avoid false alarms for test results.

### Does this PR introduce _any_ user-facing change?
No, dev-only.

### How was this patch tested?
Manually tested in my fork.

Positive case:
https://github.com/HyukjinKwon/spark/runs/1208624679?check_suite_focus=true
https://github.com/HyukjinKwon/spark/actions/runs/288996327

Negative case:
https://github.com/HyukjinKwon/spark/runs/1208229838?check_suite_focus=true
https://github.com/HyukjinKwon/spark/actions/runs/289000058

Closes apache#29946 from HyukjinKwon/test-junit-files.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit a0aa8f3)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
LGTM, except for some minor comments. Thank you for the great work!
    "Please use spark.excludeOnFailure.stage.maxFailedExecutorsPerNode"),
  DeprecatedConfig("spark.blacklist.timeout", "3.1.0",
    "Please use spark.excludeOnFailure.timeout"),
  DeprecatedConfig("spark.scheduler.executorTaskBlacklistTime", "3.1.0",
This one is duplicated? (see L605)
yes
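The deprecation table in the diff above can be sketched like this. This is an illustrative model only; `DeprecatedConfig` here is a stand-in and not Spark's actual internal config machinery:

```scala
// Hypothetical sketch of deprecated-key redirection: old blacklist keys
// resolve to their excludeOnFailure replacements, with a warning.
final case class DeprecatedConfig(key: String, version: String, newKey: String)

object ConfigDeprecation {
  private val deprecated: Map[String, DeprecatedConfig] = Seq(
    DeprecatedConfig("spark.blacklist.timeout", "3.1.0",
      "spark.excludeOnFailure.timeout"),
    DeprecatedConfig("spark.blacklist.stage.maxFailedExecutorsPerNode", "3.1.0",
      "spark.excludeOnFailure.stage.maxFailedExecutorsPerNode")
  ).map(c => c.key -> c).toMap

  // A deprecated key maps to its new name; unknown keys pass through.
  def resolve(key: String): String =
    deprecated.get(key).map { c =>
      Console.err.println(
        s"Config ${c.key} is deprecated since ${c.version}; use ${c.newKey}")
      c.newKey
    }.getOrElse(key)
}
```

This is also why deduplicating the table matters: a duplicated entry would just shadow or repeat the same redirection.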
@@ -25,7 +25,7 @@ import scala.collection.Map
 import com.fasterxml.jackson.annotation.JsonTypeInfo

 import org.apache.spark.TaskEndReason
-import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.annotation.{DeveloperApi, Since}
Unused `Since`?
yes
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #130370 has finished for PR 29906 at commit
test this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130472 has finished for PR 29906 at commit
Thank you, @tgravescs and all!
Thanks @tgravescs for dealing with this.
Huge thanks for pushing this through, @tgravescs! It was no small effort!
What changes were proposed in this pull request?
This PR renames the blacklisting feature. I ended up using "excludeOnFailure" or "excluded" in most cases, but there is a mix. I renamed BlacklistTracker to HealthTracker, but for TaskSetBlacklist, HealthTracker didn't make sense to me since it's not the health of the task set itself but rather tracking the things it has excluded, so I renamed it TaskSetExcludeList. Everywhere else I tried to use the context, and in most cases "excluded" made sense. It made more sense to me than "blocked", since you are basically excluding those executors and nodes from having tasks scheduled on them; they can be unexcluded later after timeouts and such. For the configs I changed the names to use excludeOnFailure, which I thought explained the behavior.
I unfortunately couldn't get rid of some of the old names because they are part of the event listener and history files. To keep backwards compatibility I kept the events and some of the parsing so that the history server can still properly read older history files. It is not forward compatible, though: a new application writes the "Excluded" events, so an older history server won't properly display them as blacklisted.
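The backward-compatibility idea above can be sketched as a name-normalization step. This is a simplified, hypothetical model (the real parsing lives in Spark's JSON event protocol and is more involved), with event names patterned on the listener events mentioned in this PR:

```scala
// Hypothetical sketch: when replaying a history file, map both the legacy
// "Blacklisted" event names and the new "Excluded" names onto the new
// names, so old logs still load. Forward compatibility is the gap: an old
// reader has no mapping for the new names and drops them.
object EventCompat {
  private val legacyToNew = Map(
    "SparkListenerExecutorBlacklisted"   -> "SparkListenerExecutorExcluded",
    "SparkListenerExecutorUnblacklisted" -> "SparkListenerExecutorUnexcluded",
    "SparkListenerNodeBlacklisted"       -> "SparkListenerNodeExcluded",
    "SparkListenerNodeUnblacklisted"     -> "SparkListenerNodeUnexcluded"
  )

  // Normalize an event-type string read from a history file.
  def normalize(eventType: String): String =
    legacyToNew.getOrElse(eventType, eventType)
}
```

A new-style reader with this mapping handles both old and new logs; an old-style reader, which only knows the left-hand names, cannot do the reverse.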
A few of the files below are showing up as deleted and recreated even though I did a git mv on them. I'm not sure why.
Why are the changes needed?
Get rid of problematic language.
Does this PR introduce any user-facing change?
Config names change; the old configs still work but are deprecated.
How was this patch tested?
Updated tests; also manually tested the UI changes, and manually tested the history server reading older versions of history files and vice versa.