[SPARK-32037][CORE] Rename blacklisting feature #29906
Conversation
Test build #129254 has finished for PR 29906 at commit
Thanks a lot for putting up this PR @tgravescs ! I am sure it was not a fun effort :)
I only took a quick look for now and will happily help go more in depth when I have a bit more time. Besides some inline comments I left, some overall comments:
- I think the naming conventions are great.
- I see a few references to `blocked`/`blocklist` hanging around in the new code; was this intentional? I think they should be `excluded`/`excludelist` based on the proposed terminology.
- It looks like we've also renamed Spark's references to the YARN-level blacklisting feature (e.g. YARN-4576) in `YarnAllocator` and friends. I'm not sure this is appropriate. YARN still calls it blacklisting, and it may be confusing for us to refer to it by another name. That feature also behaves differently from Spark's, so I'm not sure excludeOnFailure is the right name regardless. Was this an intentional change, or did it just accidentally get swept up?
Outdated review threads (resolved):
- core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (two threads)
- core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala
- core/src/main/scala/org/apache/spark/status/AppStatusSource.scala
- sql/hive-thriftserver/v1.2/src/main/java/org/apache/hive/service/auth/HiveAuthFactory.java
- resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala
Thanks, I started out with blocklist and must have missed converting a few; I'll fix those up. On the YARN allocator side I obviously left the actual calls into the client, but tried to rename the variables and such to something else. There is a Hadoop JIRA to remove blacklisting from there as well, but I have no idea when it will be implemented, so I was just trying to rename where it made sense.
Test build #129279 has finished for PR 29906 at commit
Looks like some new files were added; I'll upmerge to latest.
Thoughts, @srowen, @dongjoon-hyun?
- appStatusSource.foreach(_.BLACKLISTED_EXECUTORS.inc())
+ appStatusSource.foreach(_.EXCLUDED_EXECUTORS.inc())
I wonder whether this could lead to the metric being overcounted, since we always post two blacklist events?
Thanks for catching this one, I missed it. I had checked these and most are fine because they use a set and just re-set a status that is already set, but I somehow missed this one. I'll fix it.
I updated this, but I actually found a pre-existing bug where we weren't incrementing this when we excluded a node, which implicitly excludes its executors.
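The double-counting concern and the node-exclusion fix above can be sketched as follows. This is a hypothetical illustration, not Spark's actual `AppStatusSource` code; the class and method names are invented for the example:

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.mutable

// Hypothetical sketch: count each executor at most once, even if it is
// excluded both directly and implicitly via its node being excluded.
class ExcludedExecutorCounter {
  private val counter = new AtomicLong(0)
  private val seen = mutable.Set.empty[String]

  // Returns true only on the first exclusion of a given executor id,
  // so a double-posted event does not inflate the metric.
  def markExcluded(execId: String): Boolean = synchronized {
    if (seen.add(execId)) { counter.incrementAndGet(); true } else false
  }

  // Excluding a node implicitly excludes every executor running on it.
  def markNodeExcluded(execsOnNode: Seq[String]): Unit =
    execsOnNode.foreach(markExcluded)

  def value: Long = counter.get()
}
```

Guarding with a set like this is why most of the other call sites were fine: re-marking an already-excluded executor is a no-op.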
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130210 has finished for PR 29906 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130220 has finished for PR 29906 at commit
Thanks for working on this, Tom! Looks good to me.
@Ngone51 any other comments?
…e found

### What changes were proposed in this pull request?
This PR proposes to skip test reporting ("Report test results") if no JUnit XML files are found. Currently, we're running and skipping the tests dynamically. For example:
- if there are only changes in SparkR at the underlying commit, it runs only the SparkR tests, skips the other tests, and generates JUnit XML files for SparkR test cases only.
- if there are only changes in `docs` at the underlying commit, the build skips all tests except linters and does not generate any JUnit XML files.

When the test reporting ("Report test results") job is triggered after the main build ("Build and test") is finished and no JUnit XML files are found, it reports the case as a failure. See https://github.com/apache/spark/runs/1196184007 as an example. This PR works around it by simply skipping the test report when no JUnit XML files are found. Please see apache#29906 (comment) for more details.

### Why are the changes needed?
To avoid false alarms for test results.

### Does this PR introduce _any_ user-facing change?
No, dev-only.

### How was this patch tested?
Manually tested in my fork.

Positive case:
https://github.com/HyukjinKwon/spark/runs/1208624679?check_suite_focus=true
https://github.com/HyukjinKwon/spark/actions/runs/288996327

Negative case:
https://github.com/HyukjinKwon/spark/runs/1208229838?check_suite_focus=true
https://github.com/HyukjinKwon/spark/actions/runs/289000058

Closes apache#29946 from HyukjinKwon/test-junit-files.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit a0aa8f3)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
LGTM, except for some minor comments. Thank you for the great work!
    "Please use spark.excludeOnFailure.stage.maxFailedExecutorsPerNode"),
  DeprecatedConfig("spark.blacklist.timeout", "3.1.0",
    "Please use spark.excludeOnFailure.timeout"),
  DeprecatedConfig("spark.scheduler.executorTaskBlacklistTime", "3.1.0",
This one is duplicated? (see L605)
yes
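The deprecation table in the diff above can be sketched like this. This is an illustrative model only; `DeprecatedConfig` here is a stand-in and not Spark's actual internal config machinery:

```scala
// Hypothetical sketch of deprecated-key redirection: old blacklist keys
// resolve to their excludeOnFailure replacements, with a warning.
final case class DeprecatedConfig(key: String, version: String, newKey: String)

object ConfigDeprecation {
  private val deprecated: Map[String, DeprecatedConfig] = Seq(
    DeprecatedConfig("spark.blacklist.timeout", "3.1.0",
      "spark.excludeOnFailure.timeout"),
    DeprecatedConfig("spark.blacklist.stage.maxFailedExecutorsPerNode", "3.1.0",
      "spark.excludeOnFailure.stage.maxFailedExecutorsPerNode")
  ).map(c => c.key -> c).toMap

  // A deprecated key maps to its new name; unknown keys pass through.
  def resolve(key: String): String =
    deprecated.get(key).map { c =>
      Console.err.println(
        s"Config ${c.key} is deprecated since ${c.version}; use ${c.newKey}")
      c.newKey
    }.getOrElse(key)
}
```

This is also why deduplicating the table matters: a duplicated entry would just shadow or repeat the same redirection.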
@@ -25,7 +25,7 @@ import scala.collection.Map
 import com.fasterxml.jackson.annotation.JsonTypeInfo

 import org.apache.spark.TaskEndReason
-import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.annotation.{DeveloperApi, Since}
Unused `Since`?
yes
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #130370 has finished for PR 29906 at commit
test this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130472 has finished for PR 29906 at commit
Thank you, @tgravescs and all!
Thanks @tgravescs for dealing with this.
Huge thanks for pushing this through, @tgravescs! It was no small effort!
What changes were proposed in this pull request?
This PR renames the blacklisting feature. I ended up using "excludeOnFailure" or "excluded" in most cases, but there is a mix. I renamed BlacklistTracker to HealthTracker, but for TaskSetBlacklist, HealthTracker didn't make sense to me since it's not the health of the task set itself but rather tracking the things it has excluded, so I renamed it TaskSetExcludeList. Everywhere else I tried to use the context, and in most cases "excluded" made sense. It made more sense to me than "blocked", since you are basically excluding those executors and nodes from having tasks scheduled on them; they can be unexcluded later after timeouts and such. For the configs I changed the names to use excludeOnFailure, which I thought explained the behavior.
I unfortunately couldn't get rid of some of the old names because they are part of the event listener and history files. To keep backwards compatibility I kept the events and some of the parsing so that the history server can still properly read older history files. It is not forward compatible, though: a new application writes the "Excluded" events, so an older history server won't properly display them as blacklisted.
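The backward-compatibility idea above can be sketched as a name-normalization step. This is a simplified, hypothetical model (the real parsing lives in Spark's JSON event protocol and is more involved), with event names patterned on the listener events mentioned in this PR:

```scala
// Hypothetical sketch: when replaying a history file, map both the legacy
// "Blacklisted" event names and the new "Excluded" names onto the new
// names, so old logs still load. Forward compatibility is the gap: an old
// reader has no mapping for the new names and drops them.
object EventCompat {
  private val legacyToNew = Map(
    "SparkListenerExecutorBlacklisted"   -> "SparkListenerExecutorExcluded",
    "SparkListenerExecutorUnblacklisted" -> "SparkListenerExecutorUnexcluded",
    "SparkListenerNodeBlacklisted"       -> "SparkListenerNodeExcluded",
    "SparkListenerNodeUnblacklisted"     -> "SparkListenerNodeUnexcluded"
  )

  // Normalize an event-type string read from a history file.
  def normalize(eventType: String): String =
    legacyToNew.getOrElse(eventType, eventType)
}
```

A new-style reader with this mapping handles both old and new logs; an old-style reader, which only knows the left-hand names, cannot do the reverse.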
A few of the files below are showing up as deleted and recreated even though I did a git mv on them. I'm not sure why.
Why are the changes needed?
Get rid of problematic language.
Does this PR introduce any user-facing change?
Config names change; the old configs still work but are deprecated.
How was this patch tested?
Updated tests; also manually tested the UI changes, and manually tested the history server reading older versions of history files and vice versa.