
[Improvement-11857][Spark] Remove the spark version of spark task #11860

Merged — 2 commits merged into apache:dev on Sep 21, 2022

Conversation

@rickchengx (Contributor) commented Sep 8, 2022

Purpose of the pull request

close: #11857

Currently, the Spark version field of the Spark task is misleading:

[Screenshot 2022-09-08 15 24 34]

This is not the set of Spark versions that DS supports; e.g., DS can also run Spark 3.x.x tasks.

The Spark version selected by the user only determines which environment variables are used and the final command executed, as below:

[screenshot]
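For context, here is a minimal sketch of the old behavior; the paths and comments are illustrative, based on the default layout of dolphinscheduler_env.sh, not taken from this PR:

```bash
# Old behavior (sketch): two fixed environment variables, switched by the
# UI's Spark version field. Paths are illustrative.
export SPARK_HOME1=/opt/soft/spark1   # used when "SPARK1" is selected
export SPARK_HOME2=/opt/soft/spark2   # used when "SPARK2" is selected

# The worker then generated one of:
#   ${SPARK_HOME1}/bin/spark-submit ...
#   ${SPARK_HOME2}/bin/spark-submit ...
```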

There is also a bug where spark-sql can only be executed with SPARK2; see #11721.

So why not just remove the Spark version field, as other tasks do, and use ${SPARK_HOME}/bin/spark-submit and ${SPARK_HOME}/bin/spark-sql?

Brief change log

Use ${SPARK_HOME}/bin/spark-submit and ${SPARK_HOME}/bin/spark-sql to build the task command.
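As a rough illustration (the task arguments and jar path below are hypothetical examples, not from this PR), the generated command now always resolves a single SPARK_HOME from the task's environment, with no version field involved:

```bash
# New behavior (sketch): one SPARK_HOME, resolved from the environment
# configured for the task.
${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    ${SPARK_HOME}/examples/jars/spark-examples_2.12-3.1.2.jar 100

# Likewise for SQL tasks:
${SPARK_HOME}/bin/spark-sql -f /tmp/example.sql
```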

Verify this pull request

manually tested

[screenshot]

[Screenshot 2022-09-08 16 51 41]

@SbloodyS SbloodyS added the improvement make more easy to user or prompt friendly label Sep 8, 2022
@SbloodyS SbloodyS added this to the 3.1.0 milestone Sep 8, 2022
@rickchengx (Contributor, Author):

The related docs will be updated after this PR is reviewed.

@SbloodyS (Member) commented Sep 8, 2022

> The related docs will be updated after this PR is reviewed.

Please add the docs in this PR. We suggest that the documentation changes involved be submitted together with the code.

@SbloodyS SbloodyS added the miss:docs missing documents in PR label Sep 8, 2022
@rickchengx (Contributor, Author):

> Please add the docs in this PR. We suggest that the documentation changes involved be submitted together with the code.

Sure, but there are many documents that need to be modified; maybe we can first review whether this PR's approach is appropriate.

@SbloodyS (Member) commented Sep 8, 2022

> Sure, but there are many documents that need to be modified; maybe we can first review whether this PR's approach is appropriate.

That sounds good to me.

@github-actions github-actions bot added backend CI&CD UI ui and front end related labels Sep 8, 2022
@fuchanghai (Member):

If there is only one version of Spark on a machine, there is no problem keeping only one SPARK_HOME. Problems can arise if a machine has two versions. Of course, I don't know Spark well; I am just analyzing this problem from a business perspective.

@fuchanghai (Member):

@SbloodyS maybe this PR conflicts with #11720. Based on the actual application scenario, it seems the current PR should be supported; if DS wants to support two different versions, I support the other one.

@rickchengx (Contributor, Author) commented Sep 9, 2022

Hi @fuchanghai, thanks for the reply. I agree with your point.

1. The current Spark version field on the UI is misleading. Users may worry that DS does not support Spark 3.x.x. I have seen people ask whether DS supports Spark 3, even though the DS documentation describes this (users will not find that description easily), as below:

[Screenshot 2022-09-09 09 55 13]

2. Even if DS needs to support multiple Spark installations, I don't think setting SPARK_HOME1 and SPARK_HOME2 is an elegant way. What if the user needs to support 3 or more?

3. Using a single SPARK_HOME is an easy and efficient way, without the misunderstandings and potential problems above.

@SbloodyS (Member) commented Sep 9, 2022

> @SbloodyS maybe this PR conflicts with #11720. Based on the actual application scenario, it seems the current PR should be supported; if DS wants to support two different versions, I support the other one.

I think it's better to use the current environment management to manage different versions of Spark. That is the point of environment management.

@rickchengx (Contributor, Author):

> I think it's better to use the current environment management to manage different versions of Spark. That is the point of environment management.

@SbloodyS thanks for your suggestion. I totally agree.

So DS uses ${SPARK_HOME}/bin/spark-submit to build the command, and the user can set the SPARK_HOME they want in the environment.

[Screenshot 2022-09-09 10 13 33]
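For example, two environments could be defined under Environment Management; the environment names and install paths here are hypothetical:

```bash
# Hypothetical environment "spark-2.4" (script body as configured in the UI):
export SPARK_HOME=/opt/soft/spark-2.4.8
export PATH=$SPARK_HOME/bin:$PATH

# Hypothetical environment "spark-3.2":
export SPARK_HOME=/opt/soft/spark-3.2.2
export PATH=$SPARK_HOME/bin:$PATH
```

A task that references the "spark-3.2" environment would then run ${SPARK_HOME}/bin/spark-submit from the Spark 3 installation, with no version field on the task form.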

@SbloodyS (Member) commented Sep 9, 2022

> So DS uses ${SPARK_HOME}/bin/spark-submit to build the command, and the user can set the SPARK_HOME they want in the environment.

Yes. That's what I mean.

@rickchengx (Contributor, Author) commented Sep 13, 2022

> Yes. That's what I mean.

Hi @SbloodyS, @fuchanghai,

Since #11721 is merged, I've rebased my PR and also modified the related docs.

So now this PR uses ${SPARK_HOME}/bin/spark-submit or ${SPARK_HOME}/bin/spark-sql to build the command, and the user can set the SPARK_HOME they want in the environment, as below.

[Screenshot 2022-09-13 14 40 34]

So the misleading Spark version field is removed, and different tasks can still use different Spark versions through the environment.

@codecov-commenter commented Sep 13, 2022

Codecov Report

Merging #11860 (ad683c3) into dev (ad683c3) will not change coverage.
The diff coverage is n/a.

❗ The current head ad683c3 differs from the pull request's most recent head 64f3d2f. Consider uploading reports for commit 64f3d2f to get more accurate results.

@@            Coverage Diff            @@
##                dev   #11860   +/-   ##
=========================================
  Coverage     38.68%   38.68%           
  Complexity     4006     4006           
=========================================
  Files          1002     1002           
  Lines         37213    37213           
  Branches       4249     4249           
=========================================
  Hits          14394    14394           
  Misses        21186    21186           
  Partials       1633     1633           


@zhongjiajie (Member) left a comment:

Comment on lines 20 to 29
| Parameter | Description |
|---|---|
| Node Name | Set the name of the task. Node names within a workflow definition are unique. |
| Run flag | Indicates whether the node can be scheduled normally. If it is not necessary to execute, you can turn on the prohibiting execution switch. |
| Description | Describes the function of this node. |
| Task priority | When the number of worker threads is insufficient, they are executed in order from high to low according to the priority, and they are executed according to the first-in, first-out principle when the priority is the same. |
| Worker group | The task is assigned to the machines in the worker group for execution. If Default is selected, a worker machine will be randomly selected for execution. |
| Task group name | The group in Resources, if not configured, it will not be used. |
| Environment Name | Configure the environment in which to run the script. |
| Number of failed retries | The number of times the task is resubmitted after failure. It supports drop-down and manual filling. |
| Failure Retry Interval | The time interval for resubmitting the task if the task fails. It supports drop-down and manual filling. |
| Timeout alarm | Check Timeout Alarm and Timeout Failure. When the task exceeds the "timeout duration", an alarm email will be sent and the task execution will fail. |
@zhongjiajie (Member):

It seems we have already changed the general parameters into separate parameter files; can you rebase to the latest and take a look?

@rickchengx (Contributor, Author):

Sure.

@rickchengx (Contributor, Author):

> Please rebase to the latest code; our docs have made some changes. Can you also add some description in both https://github.com/apache/dolphinscheduler/blob/4dca488cd50b4392d222167c01ae2a79fd295e77/dolphinscheduler-python/pydolphinscheduler/UPDATING.md and https://github.com/apache/dolphinscheduler/blob/3aa9f2ea25ca42112141aad85140a72b0963e2c3/docs/docs/en/guide/upgrade/incompatible.md, because this is an incompatible change for users?

Hi @zhongjiajie, thanks for your comment. I've rebased my PR and added descriptions in UPDATING.md and incompatible.md.

@sonarcloud bot commented Sep 19, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 0 Code Smells
Coverage: 44.8%, Duplication: 0.0%

@zhongjiajie (Member) left a comment:

LGTM, thanks

@zhongjiajie zhongjiajie merged commit 08a4c79 into apache:dev Sep 21, 2022
@zhongjiajie (Member):

@caishunfeng should we release this PR in version 3.1.0?

@caishunfeng caishunfeng modified the milestones: 3.1.0, 3.2.0 Sep 21, 2022
xdu-chenrj pushed a commit to xdu-chenrj/dolphinscheduler that referenced this pull request Oct 13, 2022
Labels
backend, CI&CD, document, improvement (make more easy to user or prompt friendly), incompatible, miss:docs (missing documents in PR), Python, UI (ui and front end related)
Development

Successfully merging this pull request may close these issues.

[Improvement][Spark] Remove the spark version of spark task
6 participants