Support HDFS rack locality #350
Conversation
FYI, the unit test failures seem genuine. I'm looking into it.
Thanks @kimoonkim! This looks awesome. Looking into this in detail shortly.
Nice work @kimoonkim!
isConfigured = checkConfigured(hadoopConfiguration)
// RackResolver logs an INFO message whenever it resolves a rack, which is way too often.
if (Logger.getLogger(classOf[RackResolver]).getLevel == null) {
  Logger.getLogger(classOf[RackResolver]).setLevel(Level.WARN)
haha yes I've seen this too -- I think Hadoop 2.8.0 lowered this log level. What version are you testing against?
I was using 2.7. Good to know 2.8 fixed this (YARN-3350).
- private[spark] class KubernetesTaskSchedulerImpl(sc: SparkContext) extends TaskSchedulerImpl(sc) {
+ private[spark] class KubernetesTaskSchedulerImpl(
+     sc: SparkContext,
+     rackResolverUtil: RackResolverUtil = new RackResolverUtil,
Can you put sc as a parameter into the RackResolverUtil constructor? I'm hoping to be able to get rid of the rackResolverUtil.init method, since that class currently has a two-step initialization (instantiate the class, then call .init).
Fixed. Thanks for the suggestion. I wanted to do this earlier but didn't try hard enough :-)
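For illustration, the two initialization styles under discussion can be sketched as follows. SparkContext here is a hypothetical stand-in (a plain class holding a config map), not Spark's real class; net.topology.script.file.name is Hadoop's standard topology-script property.

```scala
// Hypothetical stand-in for Spark's SparkContext, for illustration only.
class SparkContext(val hadoopConfiguration: Map[String, String])

// Before: two-step initialization -- callers must remember to call init().
class RackResolverUtilTwoStep {
  var isConfigured: Boolean = false
  def init(sc: SparkContext): Unit = {
    isConfigured = sc.hadoopConfiguration.contains("net.topology.script.file.name")
  }
}

// After: sc is a constructor parameter, so the object is fully
// initialized on creation and the separate init() step disappears.
class RackResolverUtil(sc: SparkContext) {
  val isConfigured: Boolean =
    sc.hadoopConfiguration.contains("net.topology.script.file.name")
}
```

Constructor injection also makes it impossible to observe the utility in a half-initialized state, which is the reviewer's underlying concern.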
private def getRackForDatanodeOrExecutor(hostPort: String): Option[String] = {
  val host = Utils.parseHostPort(hostPort)._1
  val backend = this.backend.asInstanceOf[KubernetesClusterSchedulerBackend]
Do this cast once in the constructor (outside this method)? If the cast is cheap under the JVM JIT, then maybe there's no need to avoid doing it frequently.
Fixed. FYI it appears this has to be done in the separate init method.
val host = Utils.parseHostPort(hostPort)._1
val backend = this.backend.asInstanceOf[KubernetesClusterSchedulerBackend]
val executorPod = backend.getExecutorPodByIP(host)
if (executorPod.isEmpty) {
This if/else could be turned into a match, which might be cleaner in Scala.
Fixed.
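The if/else-to-match rewrite being discussed might look like this minimal sketch (the function and its return strings are invented for the example, not code from the PR):

```scala
// Sketch: the Option from the pod lookup handled with a match
// instead of testing executorPod.isEmpty with an if/else.
def rackSource(executorPod: Option[String]): String =
  executorPod match {
    case Some(pod) => s"executor pod $pod"  // resolve via the pod's cluster node
    case None      => "datanode host"       // resolve the host directly
  }
```

The match makes both branches bind their data explicitly, so there is no separate .get call that could throw on an empty Option.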
override def createTaskSetManager(taskSet: TaskSet, maxTaskFailures: Int): TaskSetManager = {
  new KubernetesTaskSetManager(this, taskSet, maxTaskFailures)
}

override def getRackForHost(hostPort: String): Option[String] = {
  if (!rackResolverUtil.isConfigured) {
Nice work on keeping this speedy for non-HDFS users.
import org.apache.spark.scheduler.{FakeTask, FakeTaskScheduler, HostTaskLocation, TaskLocation}

class KubernetesTaskSetManagerSuite extends SparkFunSuite {

- val sc = new SparkContext("local", "test")
+ val sc = new SparkContext(master = "local", appName = "test",
+   new SparkConf().set("spark.driver.allowMultipleContexts", "true"))
Why is this change needed? Are we parallelizing tests and running multiple drivers in the same JVM at once?
In this project, we have KubernetesTaskSetManagerSuite and KubernetesTaskSchedulerImplSuite. The two suites run inside a single JVM. We still run one suite at a time sequentially, but we create two SparkContext instances. Without this option, the second suite fails because only one SparkContext instance is allowed at a time. I was hoping to prevent that.
But I think there is a better option, which is just to call SparkContext.clearActiveContext(). Switched to that.
LGTM! Thanks @kimoonkim, that looks good, and appears to handle both HDFS nodes and executor pods well. Question: for the executor pods, how do you see the script to resolve rack info working? Could it have access to more than just the pod IP to find what rack it belongs to?
@foxish Good question. My understanding is that executors do not call the topology plugin. Only the driver consults the topology plugin, to decide which executor, hopefully a rack-local one, should receive a new task. When an executor reads an HDFS block for the new task, it simply uses Hadoop library code that sends an RPC request to the namenode. The namenode considers the list of datanodes that have copies of the block and asks the topology plugin which datanodes are better. When the namenode returns the list of datanodes, it sorts them in locality order (node-local, then rack-local, then remote). FYI, our HDFS-on-Kubernetes setup does not support configuring the topology plugin in the namenode helm chart yet, but I intend to add that soon. The pod IP question on the namenode side is not as important as we originally thought, because most k8s network plugins do NAT and the namenode sees the k8s cluster node IPs. (The pod IP issue on the namenode side only manifests with kubenet on GKE.) For details, please see kubernetes-HDFS/topology/README.md.
Sorry this took a while to review. I have some suggestions.
private def getRackForDatanodeOrExecutor(hostPort: String): Option[String] = {
  val host = Utils.parseHostPort(hostPort)._1
  val executorPod = kubernetesSchedulerBackend.getExecutorPodByIP(host)
  executorPod.isEmpty match {
Use Option.map.getOrElse instead of matching on isEmpty.
Fixed. Thanks for the suggestion. Found and fixed a minor bug thanks to that!
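A minimal sketch of the Option.map(...).getOrElse(...) pattern in this context; both lookup functions here are hypothetical stand-ins for the backend pod lookup and the rack resolver, not the PR's actual signatures:

```scala
// Sketch: map/getOrElse replaces a match on isEmpty. If the host is a
// known executor pod IP, resolve the rack of the pod's cluster node;
// otherwise treat the host as a datanode and resolve it directly.
def rackFor(host: String,
            executorNode: String => Option[String],
            resolve: String => Option[String]): Option[String] =
  executorNode(host)
    .map(node => resolve(node))  // Some(node): resolve the pod's node
    .getOrElse(resolve(host))    // None: resolve the datanode host
```

One subtlety this style surfaces is the nesting of Options: map produces an Option[Option[String]] that getOrElse flattens, which is the kind of thing that can hide a minor bug in an if/else version.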
class KubernetesTaskSchedulerImplSuite extends SparkFunSuite {

  SparkContext.clearActiveContext()
Should this be done in a before block?
It could be put in a before block. But then I'd need to put val sc = new SparkContext(...) inside the block as well, because SparkContext.clearActiveContext() should be called before a new SparkContext is created. Please let me know what you think.
class KubernetesTaskSchedulerImplSuite extends SparkFunSuite {

  SparkContext.clearActiveContext()
  val sc = new SparkContext("local", "test")
Can we just use a mock SparkContext here?
Yeah, I thought about the possibility and tried a bit. But then I realized SparkContext is not friendly to mock. It just has too many methods returning too many objects. I'd have to mock many of those that the class under test happens to interact with, and if any of that interaction changes, the test will break.
I think the test maintenance cost becomes too high. I found other Spark core unit tests avoid mocks, probably for the same reason.
}

test("Gets racks for datanodes") {
  val rackResolverUtil = mock(classOf[RackResolverUtil])
It seems unusual to me to be mocking the RackResolverUtil. It seems to be part of the core KubernetesTaskSchedulerImpl because the RackResolverUtil is a nested private class.
If we indeed want to test these separately then the architecture should reflect as much:
- Extract RackResolverUtil to another class,
- Put a trait on top of RackResolverUtil,
- Inject an instance into the KubernetesTaskSchedulerImpl when we create it,
- Write a separate test for RackResolverUtil.
If we don't want to test these separately then we should create a real RackResolverUtil and test the top-level methods accordingly.
Good point. It started as a pure wrapper of RackResolver, but then I ended up adding some business logic. I like the suggestion; I'll have to think about it a little bit.
Fixed. New code looks better. Thanks.
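The extract-and-inject refactoring suggested above can be sketched like this; everything except the RackResolverUtil name is an illustrative stand-in, with a plain map in place of Hadoop's RackResolver:

```scala
// Steps 1-2: extract the utility behind a trait.
trait RackResolverUtil {
  def isConfigured: Boolean
  def resolveRack(host: String): Option[String]
}

// A real implementation would wrap Hadoop's RackResolver; here a
// topology map stands in so the sketch is self-contained.
class HadoopRackResolverUtil(topology: Map[String, String]) extends RackResolverUtil {
  def isConfigured: Boolean = topology.nonEmpty
  def resolveRack(host: String): Option[String] = topology.get(host)
}

// Step 3: inject the instance at construction time, so tests can pass
// a fake implementation of the trait instead of mocking a nested class.
class Scheduler(rackResolver: RackResolverUtil) {
  def getRackForHost(host: String): Option[String] =
    if (!rackResolver.isConfigured) None else rackResolver.resolveRack(host)
}
```

Step 4 then becomes natural: RackResolverUtil gets its own unit test against the trait's contract, and the scheduler test exercises only scheduling logic.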
  .thenReturn(None)
when(rackResolverUtil.resolveRack(sc.hadoopConfiguration, "192.168.1.5"))
  .thenReturn(None)
val inetAddressUtil = mock(classOf[InetAddressUtil])
Any reason this can't be a real InetAddressUtil?
If it can't be, then move InetAddressUtil to its own file and place a trait over it. The real InetAddressUtil could probably be an object that extends the trait in this case.
I wanted to mock it so unit tests don't call real DNS and potentially get influenced by the responses. The trait approach sounds good; I'll try it in the next patch.
Fixed.
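A sketch of the trait-plus-object shape for the DNS helper; the method name getFullHostName and the fake implementation are assumptions for illustration, not necessarily the PR's exact code:

```scala
import java.net.InetAddress

// Trait so tests can substitute a fake resolver and avoid real DNS lookups.
trait InetAddressUtil {
  def getFullHostName(ipAddress: String): String
}

// Production implementation as an object extending the trait,
// delegating to the JDK's resolver.
object RealInetAddressUtil extends InetAddressUtil {
  def getFullHostName(ipAddress: String): String =
    InetAddress.getByName(ipAddress).getCanonicalHostName
}

// Test double: canned answers, no network traffic, falls back to the
// input when no entry exists.
class FakeInetAddressUtil(entries: Map[String, String]) extends InetAddressUtil {
  def getFullHostName(ipAddress: String): String =
    entries.getOrElse(ipAddress, ipAddress)
}
```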
@@ -26,6 +26,7 @@ import org.apache.spark.scheduler.{FakeTask, FakeTaskScheduler, HostTaskLocation

class KubernetesTaskSetManagerSuite extends SparkFunSuite {

+ SparkContext.clearActiveContext()
Again, should this be done in a before block? Also, can we use a mock SparkContext?
Please see the answer above.
@mccheah Thanks for the review. Addressed comments. PTAL.
I was trying to retarget this to branch-2.2-kubernetes using the GitHub UI. Then I realized that there would be too many diffs in this PR. Maybe I should retarget to branch-2.2-kubernetes and merge changes from branch-2.2-kubernetes? @foxish Any better suggestion?
Force-pushed from d9de98e to 2e49e48.
Ok. Retargeted to branch-2.2-kubernetes after rebasing my branch.
Merged with
@mccheah Can you please take a look at this PR? Perhaps this is ready to merge after another look.
* Support HDFS rack locality
* Fix unit tests
* Address review comments
* Address some review comments
* Use traits for InetAddress and RackResolver util classes
* Disables expensive DNS lookup by default
Closes #349 and #206.
@ash211 @foxish
Supports HDFS rack locality by implementing getRackForHost in KubernetesTaskSchedulerImpl. Added unit tests.
Also did manual testing using a dummy topology script that always returns a dummy rack name, "/rack0". The driver log shows a small number of RACK_LOCAL tasks, which used to be ANY tasks. (The majority of tasks are still NODE_LOCAL tasks.) The job was HdfsTest. spark-defaults.conf specified the dummy topology script, print_rack.sh.
I added the script to the driver docker image manually.
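The contents of print_rack.sh were not captured above. A script matching the description (it always answers "/rack0") might look like the sketch below: Hadoop's script-based topology mapping passes node IPs or hostnames as arguments and expects one rack path per argument on stdout.

```shell
#!/bin/sh
# print_rack.sh: map every node the driver or namenode asks about to /rack0.
# Hadoop invokes this with one or more IPs/hostnames and reads one rack
# path per argument from stdout.
for node in "$@"; do
  echo "/rack0"
done
```

In spark-defaults.conf this could be wired up by pointing Hadoop's topology property at the script, e.g. spark.hadoop.net.topology.script.file.name /opt/print_rack.sh (the /opt path is an assumption for illustration).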