
[TASK][EASY] Create PySpark batch jobs tests for RESTful API #5380

Closed
2 of 3 tasks
pan3793 opened this issue Oct 8, 2023 · 8 comments
Comments

@pan3793
Member

pan3793 commented Oct 8, 2023

Code of Conduct

Search before creating

  • I have searched in the task list and found no similar tasks.

Mentor

  • I have sufficient knowledge and experience of this task, and I volunteer to be the mentor of this task to guide contributors to complete the task.

Skill requirements

  • Knowledge about Spark submit
  • Basic Knowledge of Scala testing and PySpark programming

Background and Goals

The RESTful API supports creating batch jobs by posting /batches requests, with or without resource file uploading. Currently, it is mainly used for submitting Spark jars. As PySpark jobs become a popular approach for data exploration and processing, we need to add tests for creating PySpark jobs.

Implementation steps

Add tests for creating PySpark jobs via Rest API

Additional context

Original reporter is @bowenliang123

@weixi62961
Contributor

@pan3793 Could you please assign it to me? I expect to complete this task by October 27th.

@bowenliang123
Contributor

Thanks. Assigned to @weixi62961. Looking forward to your PRs.

@weixi62961
Contributor

weixi62961 commented Oct 20, 2023

@bowenliang123 cc @pan3793
I'd like to share the current progress. If you have any questions, please let me know.

Before writing test cases, I first used curl to test some PySpark jobs on Kyuubi 1.7.1, mainly on my local machine. All of these curl tests passed.

  1. PySpark job file is hosted in HDFS
# hdfs pi.py
curl -H "Content-Type: application/json" \
-X POST \
-d '{"batchType": "PYSPARK", "resource":"hdfs:/tmp/upload/pyspark_submit_sample/pi.py",  "name": "PySpark PI", "conf": {"spark.master": "local"}, "args": [10]}' \
http://localhost:10099/api/v1/batches
  2. PySpark job file is uploaded via resource upload
# upload resource: pi.py
curl --location --request POST 'http://localhost:10099/api/v1/batches' \
--form 'batchRequest="{\"batchType\":\"PYSPARK\",\"name\":\"PySpark Pi\",\"args\":[10]}";type=application/json' \
--form 'resourceFile=@"/localpath/to/file/pi.py"'
  3. PySpark job file depends on other modules
# module dependency
curl -H "Content-Type: application/json" \
-X POST \
-d '{"batchType":"PYSPARK","resource":"hdfs:/tmp/upload/pyspark_submit_sample/test_module_dependency.py","name":"PySpark Module Dependency","conf":{"spark.master":"local","spark.submit.pyFiles":"hdfs:/tmp/upload/pyspark_submit_sample/my_module.zip"}}' \
http://localhost:10099/api/v1/batches
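For reference, the request bodies used in these curl commands can also be built programmatically; a minimal Python sketch (the `pyspark_batch_request` helper is hypothetical, not part of Kyuubi):

```python
import json

def pyspark_batch_request(resource, name, args=None, conf=None):
    """Build a PySpark batch request body for POST /api/v1/batches."""
    body = {"batchType": "PYSPARK", "resource": resource, "name": name}
    if args is not None:
        body["args"] = args
    if conf is not None:
        body["conf"] = conf
    return body

# Case 1: resource hosted on HDFS, local master.
case1 = pyspark_batch_request(
    "hdfs:/tmp/upload/pyspark_submit_sample/pi.py",
    "PySpark PI",
    args=[10],
    conf={"spark.master": "local"},
)

# Case 3: resource plus an extra Python module dependency.
case3 = pyspark_batch_request(
    "hdfs:/tmp/upload/pyspark_submit_sample/test_module_dependency.py",
    "PySpark Module Dependency",
    conf={
        "spark.master": "local",
        "spark.submit.pyFiles": "hdfs:/tmp/upload/pyspark_submit_sample/my_module.zip",
    },
)

# json.dumps(case1) yields the -d payload passed to curl above.
```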

Referring to the two existing test cases for Spark jars, I'll write two test cases for PySpark: "POST without uploading" and "POST with uploading".
Is that OK?

In addition, due to the complexity of PySpark's package management, that topic is out of scope here. For more information, refer to the Spark official site: Python Package Management

BTW, I have created a small project with PySpark submit samples
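For reference, a Pi job in the style of the Spark examples looks roughly like this (a hedged sketch; the actual pi.py in the sample project may differ, and a real job script would invoke main() at the top level):

```python
import random
import sys

def inside(_):
    """Return 1 if a random point in the unit square falls inside the quarter circle."""
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1 else 0

def main():
    """Entry point when submitted as a batch job; requires a PySpark environment."""
    from pyspark.sql import SparkSession
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions
    spark = SparkSession.builder.appName("PySpark Pi").getOrCreate()
    count = spark.sparkContext.parallelize(range(n), partitions).map(inside).sum()
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()
```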

@bowenliang123
Contributor

bowenliang123 commented Oct 20, 2023

Thanks for updating the detailed progress. Small PRs for subtasks are welcome.
Please raise the first PR for case 2, which sounds like the easiest case for minimal testing. (It also lets you utilize the automatic CI test runs on GitHub.)
POST with uploading + a test .py file in the resource folder + local standalone Spark master.

WDYT?

@weixi62961
Contributor

WDYT?

OK, Thanks

@bowenliang123
Contributor

@weixi62961 May I get in contact with you? You can send your private WeChat account to my email address liang.bowen.123@qq.com

@weixi62961
Contributor

@weixi62961 May I get in contact with you? You can send your private WeChat account to my email address liang.bowen.123@qq.com

Already sent

bowenliang123 added a commit that referenced this issue Oct 25, 2023
### _Why are the changes needed?_

To close #5380.

As PySpark jobs become a popular approach for data exploration and processing, we need to add tests for creating PySpark jobs.

Following the existing Spark jar unit tests, two PySpark job unit tests were added; both are simple Pi computation jobs from the Spark examples.

#### case1, "pyspark submit - basic batch rest client with existing resource file"
It's almost the same as the Spark jar job test case, except for the following two points:
1. the param `batchType` should be set to `PYSPARK`, not `SPARK`; please refer to #3836 for details.
2. for a PySpark job, the param `className` is unused and should be set to null.

#### case2, "pyspark submit - basic batch rest client with uploading resource file"

These two test cases verify that simple PySpark jobs can be submitted normally.

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [x] [Run test](https://kyuubi.readthedocs.io/en/master/contributing/code/testing.html#running-tests) locally before making a pull request

### _Was this patch authored or co-authored using generative AI tooling?_

No

Closes #5498 from weixi62961/unittest-batchapi-pyspark-simple.

Closes #5380

b693efc [Bowen Liang] simplify sparkBatchTestResource
72a92b5 [Bowen Liang] Update kyuubi-server/src/test/scala/org/apache/kyuubi/server/rest/client/PySparkBatchRestApiSuite.scala
b2035a3 [weixi] remove no necessary wrapper object "PySparkJobPI"
27d12e8 [weixi] rename from BatchRestApiPySparkSuite to PySparkBatchRestApiSuite
e680e60 [weixi] Create a dedicated batch API suite for PySpark jobs.
dc8b6bf [weixi] add 2 test cases for pyspark batch job submit.

Lead-authored-by: weixi <weixi62961@outlook.com>
Co-authored-by: Bowen Liang <liangbowen@gf.com.cn>
Co-authored-by: Bowen Liang <bowenliang@apache.org>
Signed-off-by: liangbowen <liangbowen@gf.com.cn>
(cherry picked from commit 5cff4fb)
Signed-off-by: liangbowen <liangbowen@gf.com.cn>
davidyuan1223 pushed a commit to davidyuan1223/kyuubi that referenced this issue Oct 26, 2023
@weixi62961
Contributor

weixi62961 commented Oct 26, 2023

Summary

Let me summarize the status of PySpark batch jobs. In my opinion, for PySpark jobs it is preferable to submit via POST /batches with pre-existing resources instead of uploading resources.

Problem

Unlike a Spark jar job, whose dependencies can be bundled into a single fat jar, a PySpark job has to ship its Python dependencies separately (e.g. via --py-files), which makes resource uploading awkward:

# Spark jar job
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
fat-jar-with-dependencies.jar

# PySpark job
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--py-files dependency.zip \
main.py
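The spark-submit flags above correspond to standard Spark conf keys in the REST request body; a minimal Python sketch of that mapping (the helper is hypothetical, for illustration only):

```python
# Mapping from spark-submit CLI flags to the standard Spark conf keys
# used in the POST /batches request body.
FLAG_TO_CONF = {
    "--master": "spark.master",
    "--deploy-mode": "spark.submit.deployMode",
    "--py-files": "spark.submit.pyFiles",
}

def to_batch_conf(flags):
    """Translate a dict of spark-submit flags into a REST 'conf' map."""
    return {FLAG_TO_CONF[k]: v for k, v in flags.items() if k in FLAG_TO_CONF}

conf = to_batch_conf({
    "--master": "yarn",
    "--deploy-mode": "cluster",
    "--py-files": "dependency.zip",
})
```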

Workaround

  • For PySpark jobs, it is recommended to submit jobs by posting to /batches with pre-existing resources instead of uploading resources.
  • First upload the resource file and dependency zip packages to a location accessible to all nodes, such as HDFS or NFS. Then submit the job through the POST /batches API. Here is an example using HDFS.
  • JSON parameters (pretty-printed)
{
	"batchType": "PYSPARK",
	"resource": "hdfs:/tmp/upload/main.py",
	"conf": {
		"spark.master": "yarn",
		"spark.submit.deployMode": "cluster",
		"spark.submit.pyFiles": "hdfs:/tmp/upload/dependency.zip"
	}
}
  • curl command (verified working)
curl -H "Content-Type: application/json" \
-X POST \
-d '{"batchType":"PYSPARK","resource":"hdfs:/tmp/upload/main.py","conf":{"spark.master":"yarn","spark.submit.deployMode":"cluster","spark.submit.pyFiles":"hdfs:/tmp/upload/dependency.zip"}}' \
http://localhost:10099/api/v1/batches
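The same request can also be issued from Python's standard library. A minimal sketch using urllib (the request object is only constructed here, not sent; the endpoint and paths are taken from the example above):

```python
import json
import urllib.request

payload = {
    "batchType": "PYSPARK",
    "resource": "hdfs:/tmp/upload/main.py",
    "conf": {
        "spark.master": "yarn",
        "spark.submit.deployMode": "cluster",
        "spark.submit.pyFiles": "hdfs:/tmp/upload/dependency.zip",
    },
}

# Build the POST request; urllib.request.urlopen(req) would actually submit it.
req = urllib.request.Request(
    "http://localhost:10099/api/v1/batches",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
```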

WDYT? @bowenliang123 cc @pan3793

For more PySpark job submit cases, please refer to my small project with PySpark submit samples

pan3793 added a commit that referenced this issue Nov 8, 2023
### _Why are the changes needed?_

Currently, in the spark-engine module, some session-level configurations are ignored due to the complexity of getting session-level configurations in the Kyuubi Spark engine, as discussed in #5410 (comment). If we want unit tests to use the withSessionConf method, we need to make the code read the configuration from the right session.

This PR is unfinished; it needs to wait for PR #5410 to land so the new change can be used in the unit tests.

closes #5438
### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [x] [Run test](https://kyuubi.readthedocs.io/en/master/contributing/code/testing.html#running-tests) locally before making a pull request

### _Was this patch authored or co-authored using generative AI tooling?_

No

Closes #5487 from davidyuan1223/5438_add_common_method_to_support_session_config.

Closes #5438

e1ded36 [davidyuan] add more optional session level to get conf
84c4568 [davidyuan] add more optional session level to get conf
4d70902 [davidyuan] add more optional session level to get conf
96d7cde [davidyuan] Revert "add more optional session level to get conf"
940f8f8 [davidyuan] add more optional session level to get conf
15641e8 [davidyuan] add more optional session level to get conf
d838931 [davidyuan] Merge branch '5438_add_common_method_to_support_session_config' of https://github.com/davidyuan1223/kyuubi into 5438_add_common_method_to_support_session_config
2de96b5 [davidyuan] add common method to get session level config
3ec73ad [liangbowen] [KYUUBI #5522] [BATCH] Ignore main class for PySpark batch job submission
d8b808d [Cheng Pan] [KYUUBI #5523] [DOC] Update the Kyuubi supported components version
c7d15ae [Cheng Pan] [KYUUBI #5483] Release Spark TPC-H/DS Connectors with Scala 2.13
4a1db42 [zwangsheng] [KYUUBI #5513][BATCH] Always redirect delete batch request to Kyuubi instance that owns batch session
b06e044 [labbomb] [KYUUBI #5517] [UI] Initial implement the SQL Lab page
88bb6b4 [liangbowen] [KYUUBI #5486] Bump Kafka client version from 3.4.0 to 3.5.1
538a648 [davidyuan] [KYUUBI #4186] Spark showProgress with JobInfo
682e5b5 [Xianxun Ye] [KYUUBI #5405] [FLINK] Support Flink 1.18
c71528e [Cheng Pan] [KYUUBI #5484] Remove legacy Web UI
ee52b2a [Angerszhuuuu] [KYUUBI #5446][AUTHZ] Support Create/Drop/Show/Reresh index command for Hudi
6a5bb10 [weixi] [KYUUBI #5380][UT] Create PySpark batch jobs tests for RESTful API
86f692d [Kent Yao] [KYUUBI #5512] [AuthZ] Remove the non-existent query specs in Deletes and Updates
dfdd7a3 [fwang12] [KYUUBI #5499][KYUUBI #2503] Catch any exception when closing idle session
b7b3544 [伟程] [KYUUBI #5212] Fix configuration errors causing by helm charts of prometheus services
d123a5a [liupeiyue] [KYUUBI #5282] Support configure Trino session conf in `kyuubi-default.conf`
0750437 [yangming] [KYUUBI #5294] [DOC] Update supported dialects for JDBC engine
9c75d82 [zwangsheng] [KYUUBI #5435][INFRA][TEST] Improve Kyuubi On Kubernetes IT
1dc264a [Angerszhuuuu] [KYUUBI #5479][AUTHZ] Support Hudi CallProcedureHoodieCommand for stored procedures
bc3fcbb [Angerszhuuuu] [KYUUBI #5472] Permanent View should pass column when child plan no output
a67b824 [Fantasy-Jay] [KYUUBI #5382][JDBC] Duplication cleanup improvement in JdbcDialect and schema helpers
c039e1b [Kent Yao] [KYUUBI #5497] [AuthZ] Simplify debug message for missing field/method in ReflectUtils
0c8be79 [Angerszhuuuu] [KYUUBI #5475][FOLLOWUP] Authz check permanent view's subquery should check view's correct privilege
1293cf2 [Kent Yao] [KYUUBI #5500] Add Kyuubi Code Program to Doc
e2754fe [Angerszhuuuu] [KYUUBI #5492][AUTHZ] saveAsTable create DataSource table miss db info
0c53d00 [Angerszhuuuu] [KYUUBI #5447][FOLLOWUP] Remove unrelated debug prints in TableIdentifierTableExtractor
119c393 [Angerszhuuuu] [KYUUBI #5447][AUTHZ] Support Hudi DeleteHoodieTableCommand/UpdateHoodieTableCommand/MergeIntoHoodieTableCommand
3af5ed1 [yikaifei] [KYUUBI #5427] [AUTHZ] Shade spark authz plugin
503c3f7 [davidyuan] Merge remote-tracking branch 'origin/5438_add_common_method_to_support_session_config' into 5438_add_common_method_to_support_session_config
7a67ace [davidyuan] add common method to get session level config
3f42317 [davidyuan] add common method to get session level config
bb5d5ce [davidyuan] add common method to get session level config
623200f [davidyuan] Merge remote-tracking branch 'origin/5438_add_common_method_to_support_session_config' into 5438_add_common_method_to_support_session_config
8011959 [davidyuan] add common method to get session level config
605ef16 [davidyuan] Merge remote-tracking branch 'origin/5438_add_common_method_to_support_session_config' into 5438_add_common_method_to_support_session_config
bb63ed8 [davidyuan] add common method to get session level config
d9cf248 [davidyuan] add common method to get session level config
c8647ef [davidyuan] add common method to get session level config
618c0f6 [david yuan] Merge branch 'apache:master' into 5438_add_common_method_to_support_session_config
c1024bd [david yuan] Merge branch 'apache:master' into 5438_add_common_method_to_support_session_config
32028f9 [davidyuan] add common method to get session level config
03e2887 [davidyuan] add common method to get session level config

Lead-authored-by: David Yuan <yuanfuyuan@mafengwo.com>
Co-authored-by: davidyuan <yuanfuyuan@mafengwo.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: liangbowen <liangbowen@gf.com.cn>
Co-authored-by: david yuan <51512358+davidyuan1223@users.noreply.github.com>
Co-authored-by: zwangsheng <binjieyang@apache.org>
Co-authored-by: yangming <261635393@qq.com>
Co-authored-by: 伟程 <cheng1483x@gmail.com>
Co-authored-by: weixi <weixi62961@outlook.com>
Co-authored-by: fwang12 <fwang12@ebay.com>
Co-authored-by: Xianxun Ye <yesorno828423@gmail.com>
Co-authored-by: liupeiyue <liupeiyue@yy.com>
Co-authored-by: Fantasy-Jay <13631435453@163.com>
Co-authored-by: yikaifei <yikaifei@apache.org>
Co-authored-by: labbomb <739955946@qq.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
pan3793 added a commit that referenced this issue Nov 8, 2023