[SPARK-33084][CORE][SQL] Add jar support ivy path #29966

Closed
wants to merge 57 commits into from

Conversation

AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented Oct 7, 2020

What changes were proposed in this pull request?

Support `ADD JAR` with an Ivy path

Why are the changes needed?

Since spark-submit already supports Ivy coordinates, `ADD JAR` can now support them as well.

Does this PR introduce any user-facing change?

Users can add a jar with SQL like:

add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=true
add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=false

or with the core API:

sparkContext.addJar("ivy://group:artifact:version?exclude=xxx,xxx&transitive=true")
sparkContext.addJar("ivy://group:artifact:version?exclude=xxx,xxx&transitive=false")
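For illustration, a minimal Python sketch of how such an `ivy://` URI can be decomposed into a coordinate, an exclude list, and the `transitive` flag. The function name `parse_ivy_uri` and the tuple return shape are hypothetical, not Spark's actual API; the default of `transitive=true` when unspecified follows the Hive-compatible behavior that the follow-up SPARK-34506 settled on.

```python
from urllib.parse import urlparse, parse_qs

def parse_ivy_uri(uri):
    # Hypothetical sketch, not Spark's implementation: split an ivy:// URI
    # into (coordinate, excludes, transitive).
    parsed = urlparse(uri)
    coordinate = parsed.netloc  # "group:artifact:version"
    params = parse_qs(parsed.query)
    excludes = params["exclude"][0].split(",") if "exclude" in params else []
    # Default transitive to true when unspecified; compare case-insensitively.
    raw = params.get("transitive", ["true"])[0]
    transitive = raw.lower() == "true"
    return coordinate, excludes, transitive
```

For example, `parse_ivy_uri("ivy://org.apache.hive:hive-contrib:2.3.7?exclude=a,b&transitive=false")` yields the coordinate `org.apache.hive:hive-contrib:2.3.7`, the exclude list `["a", "b"]`, and `transitive=False`.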

Doc Update snapshot

[documentation screenshot]

How was this patch tested?

Added UT

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-33084][CORE][SQL]Add jar support ivy path [WIP][SPARK-33084][CORE][SQL]Add jar support ivy path Oct 7, 2020
@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34113/

@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34113/

@SparkQA

SparkQA commented Oct 7, 2020

Test build #129507 has finished for PR 29966 at commit 51daf9a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34124/

@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34124/

@SparkQA

SparkQA commented Oct 7, 2020

Test build #129520 has finished for PR 29966 at commit d6e8caf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34157/

@SparkQA

SparkQA commented Oct 8, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34157/

@SparkQA

SparkQA commented Oct 8, 2020

Test build #129551 has finished for PR 29966 at commit 169e1f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu AngersZhuuuu changed the title [WIP][SPARK-33084][CORE][SQL]Add jar support ivy path [SPARK-33084][CORE][SQL]Add jar support ivy path Oct 9, 2020
@AngersZhuuuu
Contributor Author

cc @dongjoon-hyun As mentioned in https://issues.apache.org/jira/browse/SPARK-29288, this PR adds Ivy path support similar to https://issues.apache.org/jira/browse/HIVE-9664

@dongjoon-hyun
Member

Got it. Thank you, @AngersZhuuuu

@github-actions github-actions bot added the CORE label Nov 23, 2020
@SparkQA

SparkQA commented Nov 23, 2020

Test build #131522 has finished for PR 29966 at commit b3e3211.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36125/

@SparkQA

SparkQA commented Nov 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36127/

@SparkQA

SparkQA commented Nov 23, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36125/

@SparkQA

SparkQA commented Nov 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36127/

@SparkQA

SparkQA commented Nov 23, 2020

Test build #131524 has finished for PR 29966 at commit 9161340.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37926/

@SparkQA

SparkQA commented Dec 24, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37926/

@SparkQA

SparkQA commented Dec 24, 2020

Test build #133335 has finished for PR 29966 at commit 4c44dae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM for Apache Spark 3.2.0. Thank you, @AngersZhuuuu and all.

Feel free to merge, @maropu .

@dongjoon-hyun
Member

BTW, Merry Christmas and Happy New Year, all!

@maropu maropu closed this in 10b6466 Dec 25, 2020
@maropu
Member

maropu commented Dec 25, 2020

Thanks, @AngersZhuuuu @dongjoon-hyun! Merged to master. Merry Christmas to you, too!

@maropu
Member

maropu commented Dec 25, 2020

FYI: @gatorsmile @cloud-fan

@AngersZhuuuu
Contributor Author

Merry Christmas! Thanks all for your patient reviews.

HyukjinKwon pushed a commit that referenced this pull request Dec 28, 2020
…() 's return parameter

### What changes were proposed in this pull request?
Per the discussion in #29966 (comment),
we'd better change `SparkSubmitUtils.resolveMavenCoordinates()`'s return type to `Seq[String]`.

### Why are the changes needed?
Code refactoring.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #30922 from AngersZhuuuu/SPARK-33908.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@koertkuipers
Contributor

Please note this breaks Scala 2.13.
Where can I find the source code for SPARK-33084.jar?

[info]   Cause: java.lang.NoSuchMethodError: org.apache.spark.sql.types.StructType$.apply(Lscala/collection/Seq;)Lorg/apache/spark/sql/types/StructType;
[info]   at org.apache.spark.examples.sql.Spark33084.inputSchema(Spark33084.scala:9)
[info]   at org.apache.spark.sql.execution.aggregate.ScalaUDAF.<init>(udaf.scala:347)

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Dec 31, 2020

Please note this breaks Scala 2.13.
Where can I find the source code for SPARK-33084.jar?

[info]   Cause: java.lang.NoSuchMethodError: org.apache.spark.sql.types.StructType$.apply(Lscala/collection/Seq;)Lorg/apache/spark/sql/types/StructType;
[info]   at org.apache.spark.examples.sql.Spark33084.inputSchema(Spark33084.scala:9)
[info]   at org.apache.spark.sql.execution.aggregate.ScalaUDAF.<init>(udaf.scala:347)

I defined it myself; I will fix this issue. Can you show how to reproduce this error? The source code is in
#29966 (comment)

@AngersZhuuuu
Contributor Author

Please note this breaks Scala 2.13.
Where can I find the source code for SPARK-33084.jar?

[info]   Cause: java.lang.NoSuchMethodError: org.apache.spark.sql.types.StructType$.apply(Lscala/collection/Seq;)Lorg/apache/spark/sql/types/StructType;
[info]   at org.apache.spark.examples.sql.Spark33084.inputSchema(Spark33084.scala:9)
[info]   at org.apache.spark.sql.execution.aggregate.ScalaUDAF.<init>(udaf.scala:347)

Can you show how you reproduced this error?

@koertkuipers
Contributor

```
$ dev/change-scala-version.sh 2.13
$ SBT_MAVEN_PROFILES="-Pscala-2.13" sbt
sbt:spark-parent> project sql
sbt:spark-sql> testOnly org.apache.spark.sql.SQLQuerySuite
```

dongjoon-hyun pushed a commit that referenced this pull request Dec 31, 2020
### What changes were proposed in this pull request?
Fix UT according to  #29966 (comment)

Change StructType construct from
```
def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
```
to
```
  def inputSchema: StructType = new StructType().add("inputColumn", LongType)
```
The whole udf class is :

```
package org.apache.spark.examples.sql

import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

class Spark33084 extends UserDefinedAggregateFunction {
  // Data types of input arguments of this aggregate function
  def inputSchema: StructType = new StructType().add("inputColumn", LongType)

  // Data types of values in the aggregation buffer
  def bufferSchema: StructType =
    new StructType().add("sum", LongType).add("count", LongType)
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
```

### Why are the changes needed?
Fix UT for scala 2.13

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #30980 from AngersZhuuuu/spark-33084-followup.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

import org.apache.spark.SparkFunSuite

class DependencyUtilsSuite extends SparkFunSuite {
Member

Rename the file to DependencyUtilsSuite.scala?


val e3 = intercept[IllegalArgumentException] {
DependencyUtils.resolveMavenDependencies(
URI.create("ivy://org.apache.hive:hive-contrib:2.3.7?foo="))
Member

Do we need to keep 2.3.7 consistent with the built-in Hive version?

Contributor Author

Do we need to keep 2.3.7 consistent with the built-in Hive version?

Emmm, is there any concern about this?

Contributor Author

Do we need to keep 2.3.7 consistent with the built-in Hive version?

How about #31118?

HyukjinKwon pushed a commit that referenced this pull request Jan 11, 2021
### What changes were proposed in this pull request?
According to #29966 (comment), the suite file used a wrong name; this PR fixes that.
It also changes the test to use a fake Ivy link.

### Why are the changes needed?
To follow the file naming rule.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #31118 from AngersZhuuuu/SPARK-33084-FOLLOW-UP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
a0x8o added a commit to a0x8o/spark that referenced this pull request Jan 11, 2021
### What changes were proposed in this pull request?
According to apache/spark#29966 (comment), the suite file used a wrong name; this PR fixes that.
It also changes the test to use a fake Ivy link.

### Why are the changes needed?
To follow the file naming rule.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #31118 from AngersZhuuuu/SPARK-33084-FOLLOW-UP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
maropu pushed a commit that referenced this pull request Mar 1, 2021
… with Hive transitive behavior

### What changes were proposed in this pull request?
SPARK-33084 added the ability to use Ivy coordinates with `SparkContext.addJar`. PR #29966 claims to mimic Hive behavior, although I found a few cases where it doesn't:

1) The default value of the `transitive` parameter is false, both when the parameter is not specified in the coordinate and when its value is invalid. The Hive behavior is that transitive is [true if not specified](https://github.com/apache/hive/blob/cb2ac3dcc6af276c6f64ee00f034f082fe75222b/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java#L169) in the coordinate and [false for invalid values](https://github.com/apache/hive/blob/cb2ac3dcc6af276c6f64ee00f034f082fe75222b/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java#L124). Also, regardless of Hive, a default of true for the `transitive` parameter matches [Ivy's own defaults](https://ant.apache.org/ivy/history/2.5.0/ivyfile/dependency.html#_attributes).

2) The value of the `transitive` parameter is treated case-sensitively, [based on the understanding](#29966 (comment)) that Hive behavior is case-sensitive. However, this is not correct: Hive [treats the parameter value case-insensitively](https://github.com/apache/hive/blob/cb2ac3dcc6af276c6f64ee00f034f082fe75222b/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java#L122).

I propose that we be compatible with Hive for these behaviors

### Why are the changes needed?
To make `ADD JAR` with ivy coordinates compatible with Hive's transitive behavior

### Does this PR introduce _any_ user-facing change?

The user-facing changes here are within master as the feature introduced in SPARK-33084 has not been released yet
1. Previously, an Ivy coordinate without the `transitive` parameter did not resolve transitive dependencies; now it does.
2. Previously, a `transitive` parameter value was treated case-sensitively, e.g. `transitive=TRUE` was treated as false because it did not exactly match `true`. Now it is treated case-insensitively.

### How was this patch tested?

Modified existing unit tests to cover the new behavior.
Added a new unit test to cover usage of `exclude` with an unspecified `transitive`.

Closes #31623 from shardulm94/spark-34506.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
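The Hive-compatible rules described in the commit message above (default true when unspecified, case-insensitive comparison, any other value treated as false) can be sketched as follows. `resolve_transitive` is an illustrative name, not the patch's actual code:

```python
def resolve_transitive(params):
    # Hive semantics per the commit message above:
    #  - parameter absent                           -> True
    #  - value matches "true" (case-insensitively)  -> True
    #  - any other (invalid) value                  -> False
    value = params.get("transitive")
    if value is None:
        return True
    return value.strip().lower() == "true"
```

Under these rules, `{"transitive": "TRUE"}` resolves to true, while an invalid value such as `{"transitive": "bogus"}` resolves to false.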
HyukjinKwon pushed a commit that referenced this pull request Mar 3, 2021
… SQLQuerySuite to avoid clearing ivy.home

### What changes were proposed in this pull request?
Add the `ResetSystemProperties` trait to `SQLQuerySuite` so that system property changes made by any of the tests will not affect other suites/tests. Specifically, the system property changes made by `SPARK-33084: Add jar support Ivy URI in SQL -- jar contains udf class` are targeted here (which sets and then clears `ivy.home`).

### Why are the changes needed?
PR #29966 added a new test case that adjusts the `ivy.home` system property to force Ivy to resolve an artifact from a custom location. At the end of the test, the value is cleared. Clearing the value meant that, if a custom value of `ivy.home` was configured externally, it would not apply for tests run after this test case.

### Does this PR introduce _any_ user-facing change?
No, this is only in tests.

### How was this patch tested?
Existing unit tests continue to pass, whether or not `spark.jars.ivySettings` is configured (which adjusts the behavior of Ivy w.r.t. handling of `ivy.home` and `ivy.default.ivy.user.dir` properties).

Closes #31694 from xkrogen/xkrogen-SPARK-33084-ivyhome-sysprop-followon.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
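The save-and-restore idea behind `ResetSystemProperties` can be sketched in Python, using environment variables as a stand-in for JVM system properties. The `preserve_env` helper is illustrative only, not part of Spark:

```python
import os
from contextlib import contextmanager

@contextmanager
def preserve_env(*keys):
    # Snapshot the named variables before the block runs and restore them
    # afterwards (including restoring absence), instead of unconditionally
    # clearing them the way the original test cleared ivy.home.
    saved = {k: os.environ.get(k) for k in keys}
    try:
        yield
    finally:
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v
```

Any externally configured value survives the test this way, which is the property the `ivy.home` fix above restores.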
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Sep 29, 2021
Support `ADD JAR` with an Ivy path.

Since spark-submit already supports Ivy coordinates, `ADD JAR` can now support them as well.

Users can add a jar with SQL like
```
add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=true
add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=false
```

or with the core API
```
sparkContext.addJar("ivy://group:artifact:version?exclude=xxx,xxx&transitive=true")
sparkContext.addJar("ivy://group:artifact:version?exclude=xxx,xxx&transitive=false")
```

![image](https://user-images.githubusercontent.com/46485123/101227738-de451200-36d3-11eb-813d-78a8b879da4f.png)

Added UT

Closes apache#29966 from AngersZhuuuu/support-add-jar-ivy.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Sep 29, 2021
… with Hive transitive behavior

### What changes were proposed in this pull request?
SPARK-33084 added the ability to use Ivy coordinates with `SparkContext.addJar`. PR apache#29966 claims to mimic Hive behavior, although I found a few cases where it doesn't:

1) The default value of the `transitive` parameter is false, both when the parameter is not specified in the coordinate and when its value is invalid. The Hive behavior is that transitive is [true if not specified](https://github.com/apache/hive/blob/cb2ac3dcc6af276c6f64ee00f034f082fe75222b/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java#L169) in the coordinate and [false for invalid values](https://github.com/apache/hive/blob/cb2ac3dcc6af276c6f64ee00f034f082fe75222b/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java#L124). Also, regardless of Hive, a default of true for the `transitive` parameter matches [Ivy's own defaults](https://ant.apache.org/ivy/history/2.5.0/ivyfile/dependency.html#_attributes).

2) The value of the `transitive` parameter is treated case-sensitively, [based on the understanding](apache#29966 (comment)) that Hive behavior is case-sensitive. However, this is not correct: Hive [treats the parameter value case-insensitively](https://github.com/apache/hive/blob/cb2ac3dcc6af276c6f64ee00f034f082fe75222b/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java#L122).

I propose that we be compatible with Hive for these behaviors

### Why are the changes needed?
To make `ADD JAR` with ivy coordinates compatible with Hive's transitive behavior

### Does this PR introduce _any_ user-facing change?

The user-facing changes here are within master as the feature introduced in SPARK-33084 has not been released yet
1. Previously, an Ivy coordinate without the `transitive` parameter did not resolve transitive dependencies; now it does.
2. Previously, a `transitive` parameter value was treated case-sensitively, e.g. `transitive=TRUE` was treated as false because it did not exactly match `true`. Now it is treated case-insensitively.

### How was this patch tested?

Modified existing unit tests to cover the new behavior.
Added a new unit test to cover usage of `exclude` with an unspecified `transitive`.

Closes apache#31623 from shardulm94/spark-34506.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>