
Conversation

@obermeier
Member

External applications like Apache Cassandra are able to deal with IPv6 addresses, and libraries like spark-cassandra-connector combine Apache Cassandra with Apache Spark.
This combination is very useful IMHO.

One problem is that org.apache.spark.util.Utils.parseHostPort(hostPort: String) takes the last colon to separate the port from the host part. This conflicts with literal IPv6 addresses.

I think we can treat hostPort as a literal IPv6 address if it contains two or more colons. If IPv6 addresses are enclosed in square brackets, a port definition is still possible.
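
A minimal sketch of the proposed behavior (an illustration under these assumptions, not the PR's exact diff):

```scala
object HostPort {
  // Inputs with two or more colons are treated as IPv6 literals; only a
  // bracketed literal such as "[::1]:123" carries a port.
  def parseHostPort(hostPort: String): (String, Int) = {
    if (hostPort.count(_ == ':') >= 2) {
      if (hostPort.startsWith("[") && hostPort.contains("]:")) {
        val idx = hostPort.lastIndexOf("]:")
        (hostPort.substring(0, idx + 1), hostPort.substring(idx + 2).toInt)
      } else {
        (hostPort, 0) // bare IPv6 literal, e.g. "::1" -- no port
      }
    } else {
      // Hostname or IPv4: the last colon, if any, separates the port.
      val idx = hostPort.lastIndexOf(':')
      if (idx == -1) (hostPort, 0)
      else (hostPort.substring(0, idx).trim, hostPort.substring(idx + 1).toInt)
    }
  }
}
```

For example, HostPort.parseHostPort("[fe80::1]:7077") yields ("[fe80::1]", 7077), while HostPort.parseHostPort("fe80::1") yields ("fe80::1", 0).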

@obermeier obermeier changed the title [SPARK-22180][CORE] Allow IPv6 [SPARK-22180][CORE] Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort Oct 1, 2017
Member

Nit: "ore" should be "or".
You might note here that you're checking that you don't have a bracketed IPv6 address like [::1]:123 -- the square brackets are key.

Member

Turn the rule back on after the block?

Member

Hex digits can be uppercase, right?
Shouldn't the pattern be more like [0-9a-f]*(:[0-9a-f]*)+ -- match a number, then one or more colon-number repetitions, rather than number-colon-number sequences? It might end up being equivalent, because the match allows 0 or more digits.

This accepts some strings that it shouldn't, like "::::", but the purpose isn't to catch every possible case, I guess. Such a string would fail name resolution anyway.

I thought Inet6Address would just provide parsing for this, but I guess not.
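
For illustration, this is how an anchored, case-insensitive variant of the suggested pattern behaves on a few inputs (a sketch; the actual pattern in the diff may differ):

```scala
// A (possibly empty) hex group, then one or more ":group" repetitions.
val ipv6ish = "(?i)[0-9a-f]*(:[0-9a-f]*)+"

Seq("::1", "fe80::1", "FE80::1", "::::", "abc:123", "example.com:8080")
  .foreach(s => println(f"$s%-17s matches = ${s.matches(ipv6ish)}"))
// "::::" matches although it is not a valid address: as noted, the pattern
// is a heuristic, and name resolution would reject such strings later.
// "abc:123" also matches (a-c are hex digits), which is exactly why the
// follow-up comment insists on requiring at least two colons.
```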

Member Author

Yes, a real parser would be much better! I hope the downstream methods validate the input, like the name resolver does...

At this point I was mostly concerned with separating the port. I think it is important to check that two colons exist; otherwise this expression accepts host:port strings like abc:123.
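
A tiny sketch of that guard (looksLikeIpv6Literal is a hypothetical helper name, not taken from the diff):

```scala
def looksLikeIpv6Literal(s: String): Boolean =
  s.count(_ == ':') >= 2 && s.matches("(?i)[0-9a-f]*(:[0-9a-f]*)+")

assert(!looksLikeIpv6Literal("abc:123")) // one colon: host:port, not IPv6
assert(looksLikeIpv6Literal("fe80::1"))  // two or more colons: IPv6 literal
```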

Member

Final nit: use braces for both branches of the if-else.

@jiangxb1987
Contributor

@obermeier Could you please rebase this with the latest master? Thanks!

@obermeier obermeier force-pushed the issue/SPARK-22180 branch 2 times, most recently from cf1d920 to 8026b0f Compare November 6, 2017 23:05
@obermeier
Member Author

Done

@obermeier obermeier force-pushed the issue/SPARK-22180 branch 2 times, most recently from 418927f to a6894d5 Compare November 13, 2017 00:58
@jiangxb1987
Contributor

Are you planning to fully address the IPv6 issues? If not, why adapt this single function separately?

@obermeier
Member Author

obermeier commented Nov 13, 2017

I chose this function because I got exceptions like the one below [1] when I used IPv6 hosts.
In this example, org.apache.spark.util.Utils$.parseHostPort decided to use f904 as the port, but it was actually the last 16-bit group of an IPv6 address (a short sketch of this failure mode follows the trace).

I do not have an overview of the Spark code, so I am currently not able to provide a general IPv6 solution.

I think this code snippet improves the situation and enables us to use IPv6 hosts in some cases.

[1]

java.lang.NumberFormatException: For input string: "f904"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:580)
	at java.lang.Integer.parseInt(Integer.java:615)
	at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
	at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
	at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:935)
	at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:36)
	at org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:206)
	at org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:187)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:187)
	at org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:166)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
	at org.apache.spark.scheduler.TaskSetManager.<init>(TaskSetManager.scala:165)
	at org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:205)
	at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:169)
	at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1058)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:933)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:873)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1626)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
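
To make the trace concrete, here is a sketch of the last-colon splitting described above (simplified, not the verbatim Utils.scala source):

```scala
val hostPort = "2001:db8::f904"            // bare IPv6 literal, no port
val idx      = hostPort.lastIndexOf(':')
val host     = hostPort.substring(0, idx)  // "2001:db8:" -- a mangled host
val port     = hostPort.substring(idx + 1) // "f904" -- the last 16-bit group
port.toInt // throws java.lang.NumberFormatException: For input string: "f904"
```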

@jiangxb1987
Contributor

Sounds good.

@jiangxb1987 left a comment
Contributor

The change LGTM, cc @cloud-fan

Contributor

nit: we should add more test cases to cover the invalid cases.

Member Author

@obermeier commented Nov 14, 2017

What is the preferred way to handle this kind of parse error in Spark?
Is changing the signature of this method to something like Try[...] or Option[...] an option?
Error log messages?
Unchecked exceptions?
...
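
For example, an Option-returning variant would look roughly like this (hypothetical name and signature, only to make the question concrete):

```scala
import scala.util.Try

def parseHostPortSafe(hostPort: String): Option[(String, Int)] =
  Try {
    val idx = hostPort.lastIndexOf(':')
    if (idx == -1) (hostPort, 0)
    else (hostPort.substring(0, idx), hostPort.substring(idx + 1).toInt)
  }.toOption

parseHostPortSafe("host:8080") // Some(("host", 8080))
parseHostPortSafe("host:oops") // None, instead of a NumberFormatException
```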

Contributor

Is this comment still valid?

Member Author

@obermeier commented Nov 14, 2017

I think not

Member Author

I removed this comment

….parseHostPort

## What changes were proposed in this pull request?

Treat ```hostPort``` as a literal IPv6 address if it contains two or more colons. If IPv6 addresses are enclosed in square brackets, a port definition is still possible.

## How was this patch tested?

Added a new test case to UtilsSuite.

Remove comment
@vanzin
Contributor

vanzin commented Dec 15, 2017

ok to test

@SparkQA

SparkQA commented Dec 15, 2017

Test build #84982 has finished for PR 19408 at commit 1400299.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Remove whitespace at end of line.
@SparkQA

SparkQA commented Dec 18, 2017

Test build #85062 has finished for PR 19408 at commit 68c3221.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Jan 26, 2018

To rephrase @jiangxb1987's question: supporting IPv6 is a much larger effort, and one Spark currently does not undertake. We should be addressing that as the problem to solve, and fix this as part of IPv6 support. Eliminating individual exceptions could simply put the Spark platform into inconsistent states, without any telemetry in the logs on why (because we removed/'fixed' the expected exceptions).

@jiangxb1987
Contributor

Fully agree with @mridulm's big-picture suggestion, and I also think supporting IPv6 should be designed as an integral feature, instead of just putting together some PRs.

@obermeier
Member Author

I totally agree with you.
What do you think about just adding a log message when the given string is obviously not a valid host name? The NumberFormatException thrown much later, far away from the parsing component, was a little bit confusing (a sketch follows the trace):

(the same NumberFormatException stack trace as quoted earlier in this conversation)
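
A sketch of what such an early warning could look like (hypothetical message text, using plain SLF4J rather than Spark's internal Logging trait):

```scala
import org.slf4j.LoggerFactory

object HostPortParser {
  private val log = LoggerFactory.getLogger(getClass)

  def parse(hostPort: String): (String, Int) = {
    val idx = hostPort.lastIndexOf(':')
    if (idx == -1) return (hostPort, 0)
    val portStr = hostPort.substring(idx + 1)
    if (portStr.isEmpty || !portStr.forall(_.isDigit)) {
      // Flag the bad input at the parse site instead of letting a bare
      // NumberFormatException surface far from the cause.
      log.warn(s"'$hostPort' does not look like a valid host:port pair " +
        "(an unbracketed IPv6 literal?); ignoring the port part")
      return (hostPort, 0)
    }
    (hostPort.substring(0, idx), portStr.toInt)
  }
}
```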

@obermeier
Member Author

This issue seems to be fixed in Spark 2.3.2

@obermeier obermeier closed this Oct 13, 2018
@obermeier
Member Author

If Spark runs on a YARN cluster, this issue still exists.

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 15, 2020
@github-actions github-actions bot closed this Jan 16, 2020
@obermeier obermeier deleted the issue/SPARK-22180 branch January 24, 2020 10:10