Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32037][CORE] Rename blacklisting feature #29906

Closed
wants to merge 41 commits into from

Conversation

tgravescs
Copy link
Contributor

What changes were proposed in this pull request?

this PR renames the blacklisting feature. I ended up using "excludeOnFailure" or "excluded" in most cases but there is a mix. I renamed the BlacklistTracker to HealthTracker, but for the TaskSetBlacklist HealthTracker didn't make sense to me since its not the health of the taskset itself but rather tracking the things its excluded on so I renamed it to be TaskSetExcludeList. Everything else I tried to use the context and in most cases excluded made sense. It made more sense to me then blocked since you are basically excluding those executors and nodes from scheduling tasks on them. Then can be unexcluded later after timeouts and such. The configs I changed the name to use excludeOnFailure which I thought explained it.

I unfortunately couldn't get rid of some of them because its part of the event listener and history files. To keep backwards compatibility I kept the events and some of the parsing so that the history server would still properly read older history files. It is not forward compatible though - meaning a new application write the "Excluded" events so the older history server won't properly read display them as being blacklisted.

A few of the files below are showing up as deleted and recreated even though I did a git mv on them. I'm not sure why.

Why are the changes needed?

get rid of problematic language

Does this PR introduce any user-facing change?

Config name changes but the old configs still work but are deprecated.

How was this patch tested?

updated tests and also manually tested the UI changes and manually tested the history server reading older versions of history files and vice versa.

@SparkQA
Copy link

SparkQA commented Sep 29, 2020

Test build #129254 has finished for PR 29906 at commit 6657785.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@xkrogen xkrogen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for putting up this PR @tgravescs ! I am sure it was not a fun effort :)

I only took a quick look for now and will happily help go more in depth when I have a bit more time. Besides some inline comments I left, some overall comments:

  • I think the naming conventions are great.
  • I see a few references to blocked/blocklist hanging around in the new code, was this intentional? I think they should be excluded/excludelist based on the proposed terminology.
  • It looks like we've also renamed Spark's references to the YARN-level blacklisting feature (e.g. YARN-4576) in YarnAllocator and friends. I'm not sure this is appropriate. YARN still calls it blacklisting and it may be confusing for us to refer to it by another name. That feature also behaves differently from Spark's so I'm not sure if excludeOnFailure is the right name regardless. Was this an intentional change or did it just accidentally get swept up?

@tgravescs
Copy link
Contributor Author

thanks I started out with blocklist and must have missed converting a few, I'll fix those up. The yarn allocator side I obviously left the actual calls into the client but tried to rename the variables and such to something else. There is a hadoop jira to remove from there as well but I have no idea when it will be implemented so I was just trying to remove where it made sense.

@tgravescs
Copy link
Contributor Author

thanks @xkrogen I made another pass through and think I updated all the reference to blocklist. Another set of eyes would be great as it is a lot of code.

cc @holdenk @srowen

@SparkQA
Copy link

SparkQA commented Sep 30, 2020

Test build #129279 has finished for PR 29906 at commit 76714e7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor Author

looks like some new files added, I'll upmerge to latest

@mridulm
Copy link
Contributor

mridulm commented Oct 20, 2020

SparkFirehoseListener has been public right ? We just missed annotating it appropriately ?
Given this, I am comfortable fixing this oversight.

Thoughts @srowen, @dongjoon-hyun ?

appStatusSource.foreach(_.BLACKLISTED_EXECUTORS.inc())
appStatusSource.foreach(_.EXCLUDED_EXECUTORS.inc())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder this could lead to the metrics overcounted since we always post two blacklist events?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for catching this one, I missed it. I had checked this and most are fine due to using set and just setting status which would already be set, but I somehow missed this one. I'll fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this but I actually found a pre-existing bug where we weren't incrementing this when we excluded a node - which implicitly excludes the executors

@SparkQA
Copy link

SparkQA commented Oct 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34810/

@SparkQA
Copy link

SparkQA commented Oct 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34810/

@SparkQA
Copy link

SparkQA commented Oct 23, 2020

Test build #130210 has finished for PR 29906 at commit 32ab73d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Oct 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34820/

@SparkQA
Copy link

SparkQA commented Oct 24, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34820/

@SparkQA
Copy link

SparkQA commented Oct 24, 2020

Test build #130220 has finished for PR 29906 at commit 32ab73d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Copy link
Contributor

mridulm commented Oct 24, 2020

Thanks for working on this Tom ! Looks good to me.

@tgravescs
Copy link
Contributor Author

@Ngone51 any other comments?

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…e found

### What changes were proposed in this pull request?

This PR proposes to skip test reporting ("Report test results") if there are no JUnit XML files are found.

Currently, we're running and skipping the tests dynamically. For example,
- if there are only changes in SparkR at the underlying commit, it only runs the SparkR tests, and skip the other tests and generate JUnit XML files for SparkR test cases.
- if there are only changes in `docs` at the underlying commit, the build skips all tests except linters and do not generate any JUnit XML files.

When test reporting ("Report test results") job is triggered after the main build ("Build and test
") is finished, and there are no JUnit XML files found, it reports the case as a failure. See https://github.com/apache/spark/runs/1196184007 as an example.

This PR works around it by simply skipping the testing report when there are no JUnit XML files are found.
Please see apache#29906 (comment) for more details.

### Why are the changes needed?

To avoid false alarm for test results.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in my fork.

Positive case:

https://github.com/HyukjinKwon/spark/runs/1208624679?check_suite_focus=true
https://github.com/HyukjinKwon/spark/actions/runs/288996327

Negative case:

https://github.com/HyukjinKwon/spark/runs/1208229838?check_suite_focus=true
https://github.com/HyukjinKwon/spark/actions/runs/289000058

Closes apache#29946 from HyukjinKwon/test-junit-files.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit a0aa8f3)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Copy link
Member

@Ngone51 Ngone51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, except for some minor comments. Thank you for the great work!

"Please use spark.excludeOnFailure.stage.maxFailedExecutorsPerNode"),
DeprecatedConfig("spark.blacklist.timeout", "3.1.0",
"Please use spark.excludeOnFailure.timeout"),
DeprecatedConfig("spark.scheduler.executorTaskBlacklistTime", "3.1.0",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is duplicated? (see L605)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -25,7 +25,7 @@ import scala.collection.Map
import com.fasterxml.jackson.annotation.JsonTypeInfo

import org.apache.spark.TaskEndReason
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.annotation.{DeveloperApi, Since}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused Since?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@SparkQA
Copy link

SparkQA commented Oct 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34974/

@SparkQA
Copy link

SparkQA commented Oct 28, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34974/

@SparkQA
Copy link

SparkQA commented Oct 28, 2020

Test build #130370 has finished for PR 29906 at commit b38dd66.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor Author

test this please

@SparkQA
Copy link

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35076/

@SparkQA
Copy link

SparkQA commented Oct 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35076/

@SparkQA
Copy link

SparkQA commented Oct 30, 2020

Test build #130472 has finished for PR 29906 at commit b38dd66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor Author

based on the last lgtm with just nits and I fixed those, I'm going to go ahead and merge this so things don't get out of date, if there was anything else let me know and I'll followup. thanks @mridulm and @Ngone51

@asfgit asfgit closed this in 72ad9dc Oct 30, 2020
@dongjoon-hyun
Copy link
Member

Thank you, @tgravescs and all!

@HyukjinKwon
Copy link
Member

Thanks @tgravescs for dealing with this.

@xkrogen
Copy link
Contributor

xkrogen commented Nov 2, 2020

Huge thanks for pushing this through @tgravescs ! It was no small effort!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants