Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add kyuubi-spark-extensions module #631

Closed
wants to merge 9 commits into from
Closed

Add kyuubi-spark-extensions module #631

wants to merge 9 commits into from

Conversation

ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented May 12, 2021

Why are the changes needed?

The added sql module structure looks like:

kyuubi
  |
  - dev
      |
      - kyuubi-extension-spark_3.1

This PR mainly add 3 feature:

  • merging small files automatically (include dynamic partition insertion case)
  • insert shuffle node before Join to make AQE OptimizeSkewedJoin work
  • stage level config isolation in AQE

Note that, the sql rule depend on the Apache Spark interface so we need make the sql module verion independence. Currently, this PR only supports the Spark 3.1.1.

Due to the version issue, we need to check and deploy this extension manually currently.

How was this patch tested?

Add new test.

@ulysses-you
Copy link
Contributor Author

cc @yaooqinn @pan3793 @turboFei

pom.xml Outdated Show resolved Hide resolved
</parent>
<modelVersion>4.0.0</modelVersion>

<artifactId>kyuubi-sql-spark_3.1</artifactId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the folder named ...-3_1 but artifactId is ..._3.1, it's by design?

@codecov-commenter
Copy link

codecov-commenter commented May 12, 2021

Codecov Report

Merging #631 (7740ca6) into master (d2be0f3) will not change coverage.
The diff coverage is n/a.

❗ Current head 7740ca6 differs from pull request most recent head 2cf12f1. Consider uploading reports for the commit 2cf12f1 to get more accurate results
Impacted file tree graph

@@           Coverage Diff           @@
##           master     #631   +/-   ##
=======================================
  Coverage   79.91%   79.91%           
=======================================
  Files         119      119           
  Lines        4636     4636           
  Branches      560      560           
=======================================
  Hits         3705     3705           
  Misses        620      620           
  Partials      311      311           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2be0f3...2cf12f1. Read the comment docs.

@yaooqinn
Copy link
Member

cc @cloud-fan (if interested)

@yaooqinn yaooqinn changed the title Add kyuubi sql module Add kyuubi-spark-extensions module May 12, 2021
@yaooqinn
Copy link
Member

yaooqinn commented May 12, 2021

Can we add some real hammers in doc for showing performance improvement with these extensions?

</parent>
<modelVersion>4.0.0</modelVersion>

<artifactId>kyuubi-extension-spark_3.1</artifactId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<name>Kyuubi Project Dev Spark Extensions</name>
<packaging>jar</packaging>


<properties>
<spark.version>3.1.1</spark.version>
<scala.binary.version>2.12</scala.binary.version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unecessary

val INSERT_REPARTITION_BEFORE_WRITE =
buildConf("spark.sql.optimizer.insertRepartitionBeforeWrite.enabled")
.doc("Add repartition node at the top of plan. A approach of merging small files.")
.version("0.0.1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use 1.2.0

Comment on lines 106 to 107
<outputDirectory>target/scala-2.12/classes</outputDirectory>
<testOutputDirectory>target/scala-2.12/test-classes</testOutputDirectory>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

${scala.binary.version}

<scope>test</scope>
</dependency>

<dependency>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary


<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary

Copy link
Member

@yaooqinn yaooqinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yaooqinn yaooqinn closed this in 43f40dc May 13, 2021
yaooqinn pushed a commit that referenced this pull request May 13, 2021
<!--
Thanks for sending a pull request!

Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
  2. If the PR is related to an issue in https://github.com/NetEase/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->

### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
  1. If you add a feature, you can talk about the use case of it.
  2. If you fix a bug, you can clarify why it is a bug.
-->
The added sql module structure looks like:
```
kyuubi
  |
  - dev
      |
      - kyuubi-extension-spark_3.1
```

This PR mainly add 3 feature:
* merging small files automatically (include dynamic partition insertion case)
* insert shuffle node before Join to make AQE `OptimizeSkewedJoin` work
* stage level config isolation in AQE

Note that,  the sql rule  depend on the Apache Spark interface so we need make the sql module verion independence. Currently, this PR only supports the Spark 3.1.1.

Due to the version issue, we need to check and deploy this extension manually currently.

### _How was this patch tested?_
Add new test.

Closes #631 from ulysses-you/add-sql-module.

Closes #631

2cf12f1 [ulysses-you] version
cfbf72c [ulysses-you] address comment
7740ca6 [ulysses-you] module name
0f723eb [ulysses-you] workflow
45c23d8 [ulysses-you] line
80378f5 [ulysses-you] assembly
95528aa [ulysses-you] move module
5fe5d87 [ulysses-you] license
6578440 [ulysses-you] init work

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 43f40dc)
Signed-off-by: Kent Yao <yao@apache.org>
@yaooqinn
Copy link
Member

thanks, merged to master for v1.3.0 and branch 1.2 for v1.2.0

@pan3793
Copy link
Member

pan3793 commented May 13, 2021

module version of master is not correct

@yaooqinn
Copy link
Member

nice catch can you send a followup?

@pan3793
Copy link
Member

pan3793 commented May 13, 2021

sure

yaooqinn pushed a commit that referenced this pull request May 13, 2021
…module version

<!--
Thanks for sending a pull request!

Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
  2. If the PR is related to an issue in https://github.com/NetEase/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->

### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
  1. If you add a feature, you can talk about the use case of it.
  2. If you fix a bug, you can clarify why it is a bug.
-->

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/tools/testing.html#running-tests) locally before make a pull request

Closes #633 from pan3793/minor.

Closes #633

108aca0 [Cheng Pan] [BUILD] Fix kyuubi-extension-spark_3.1 module version

Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
@ulysses-you ulysses-you added the kind:feature Feature request label May 14, 2021
@ulysses-you ulysses-you added this to the v1.2.0 milestone May 14, 2021
@ulysses-you ulysses-you self-assigned this May 14, 2021
yaooqinn pushed a commit that referenced this pull request May 25, 2021
…-spark_3.1

<!--
Thanks for sending a pull request!

Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
  2. If the PR is related to an issue in https://github.com/NetEase/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->

### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
  1. If you add a feature, you can talk about the use case of it.
  2. If you fix a bug, you can clarify why it is a bug.
-->
Fix `-Pkyuubi-sql-spark_3.1` introduced by KYUUBI #631 in GH action

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/tools/testing.html#running-tests) locally before make a pull request

Closes #645 from pan3793/build.

Closes #645

d20c6d8 [Cheng Pan] suppress CI log for kyuubi-extension-spark_3.1
1499d5b [Cheng Pan] [BUILD] Fix CI profile kyuubi-extension-spark_3.1

Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
yaooqinn pushed a commit that referenced this pull request May 25, 2021
…-spark_3.1

<!--
Thanks for sending a pull request!

Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
  2. If the PR is related to an issue in https://github.com/NetEase/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->

### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
  1. If you add a feature, you can talk about the use case of it.
  2. If you fix a bug, you can clarify why it is a bug.
-->
Fix `-Pkyuubi-sql-spark_3.1` introduced by KYUUBI #631 in GH action

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/tools/testing.html#running-tests) locally before make a pull request

Closes #645 from pan3793/build.

Closes #645

d20c6d8 [Cheng Pan] suppress CI log for kyuubi-extension-spark_3.1
1499d5b [Cheng Pan] [BUILD] Fix CI profile kyuubi-extension-spark_3.1

Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 10d72e0)
Signed-off-by: Kent Yao <yao@apache.org>
@yangrong688
Copy link
Contributor

Hi @ulysses-you, I wonder is this feature product ready? I cannot find any documentation about how to use it in Kyuubi website.

@ulysses-you ulysses-you deleted the add-sql-module branch June 17, 2021 02:10
@ulysses-you
Copy link
Contributor Author

@yangrong688 yes, it's ready.

But now this only available with Spark branch-3.1 (i.e 3.1.1 and 3.1.2). Since Kyuubi 1.2.0 has not released yet, you can download the last rc version in https://github.com/NetEase/kyuubi/releases.

About the document, yeah it's better to improve the documents about this feature. You can try this

  1. move the jar which is in extension into $SPARK_HOME/jars
  2. add a config into spark-defaults.conf: spark.sql.extensions=org.apache.kyuubi.sql.KyuubiSparkSQLExtension

@yangrong688
Copy link
Contributor

@yangrong688 yes, it's ready.

But now this only available with Spark branch-3.1 (i.e 3.1.1 and 3.1.2). Since Kyuubi 1.2.0 has not released yet, you can download the last rc version in https://github.com/NetEase/kyuubi/releases.

About the document, yeah it's better to improve the documents about this feature. You can try this

1. move the jar which is in `extension` into `$SPARK_HOME/jars`

2. add a config into `spark-defaults.conf`: `spark.sql.extensions=org.apache.kyuubi.sql.KyuubiSparkSQLExtension`

OK, thanks. I will have a try.

@yaooqinn
Copy link
Member

please help with the doc @ulysses-you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants