[SPARK-42215][CONNECT] Simplify Scala Client IT tests #40274

Closed · 1 commit

Conversation


@zhenlineo zhenlineo commented Mar 4, 2023

What changes were proposed in this pull request?

Use the new spark-connect script so that the Scala client tests no longer depend directly on any other modules.
The dependency is still there, but it is hidden behind the spark-connect script. When the script is called to start the server, it runs build/sbt package to ensure all the jars are built and can be found at the correct paths.

After the change, we can use the following commands to run the Scala client tests:

build/mvn clean
build/mvn compile -pl connector/connect/client/jvm
build/mvn test -pl connector/connect/client/jvm
build/sbt clean
build/sbt "testOnly org.apache.spark.sql.ClientE2ETestSuite" 
build/sbt clean
build/sbt "connect-client-jvm/test"

Scala 2.13

build/mvn clean
build/mvn -Pscala-2.13 compile -pl connector/connect/client/jvm -am -DskipTests
build/mvn -Pscala-2.13 test -pl connector/connect/client/jvm

// These commands failed with errors finding the catalyst ArrowWriter. The error seems unrelated to this change.
build/sbt clean
build/sbt "testOnly org.apache.spark.sql.ClientE2ETestSuite" -Pscala-2.13
build/sbt clean
build/sbt "connect-client-jvm/test" -Pscala-2.13

After the change, running the E2ESuite takes ~3 min for a clean build, then ~1 min for subsequent runs. The test is slower only because we moved the build time from many commands into this single command. There is no limit on adding more tests, as the delay comes only from the shared server start time; once the server is started, the tests run fast.

Why are the changes needed?

This gives Maven users a single command, mvn clean install, to run the tests and build.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Build, Manual tests.

@zhenlineo

@hvanhovell
cc @LuciferYang

@zhenlineo

The full error (even with the clean master branch):

build/mvn clean
build/mvn -Pscala-2.13 compile -pl connector/connect/client/jvm -am -DskipTests
build/mvn -Pscala-2.13 test -pl connector/connect/client/jvm -> errored here

[ERROR] [Error] /Users/zhen.li/code/spark-sbt/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/util/ConvertToArrow.scala:28: object arrow is not a member of package org.apache.spark.sql.execution
[ERROR] [Error] /Users/zhen.li/code/spark-sbt/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/util/ConvertToArrow.scala:46: not found: type ArrowWriter
[ERROR] [Error] /Users/zhen.li/code/spark-sbt/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/util/ConvertToArrow.scala:46: not found: value ArrowWriter

@zhenlineo zhenlineo changed the title [SPARK-42215][CONNECT] Single command to run Scala Client IT tests [SPARK-42215][CONNECT] Simplify Scala Client IT tests Mar 4, 2023
@@ -76,7 +76,8 @@ class ClientE2ETestSuite extends RemoteSparkSession {
     assert(result(2) == 2)
   }

-  test("simple udf") {
+  // Ignore this test until the udf is fully implemented.
+  ignore("simple udf") {
Would you mind ignoring this one in an independent PR first?


I created SPARK-42665 yesterday because this test still fails in 3.4.0 RC2 and was reported on the dev mailing list.


cc @hvanhovell should we ignore this test first? The Maven test must fail because the function is org.apache.spark.sql.ClientE2ETestSuite$$Lambda$XXX, but org.apache.spark.sql.ClientE2ETestSuite does not exist on the server side when testing with Maven.
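To illustrate the failure mode, here is a hedged, Spark-free Scala sketch (all names are hypothetical, not from the PR): a Scala function value compiles to a synthetic class whose name embeds its enclosing object, so a server deserializing such a UDF needs that enclosing class on its classpath.

```scala
// Spark-free sketch: a Scala closure compiles to a synthetic class whose
// runtime name references the enclosing object (something like
// "LambdaNameDemo$$...Lambda..."). A server deserializing the UDF must have
// the enclosing class on its classpath, which is why the Maven E2E run fails
// when the test suite class exists only on the client side.
object LambdaNameDemo {
  val addOne: Int => Int = _ + 1

  // Runtime class of the closure, not a plain source-level class.
  def lambdaClassName: String = addOne.getClass.getName
}

object Main extends App {
  // Exact suffix varies by Scala/JVM version, but the enclosing object's
  // name is embedded in it.
  println(LambdaNameDemo.lambdaClassName)
}
```

This is only meant to show why the server reports an unknown class named after the client-side suite; the real serialization path in Spark Connect is more involved.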

@LuciferYang

Thanks for your work @zhenlineo
If you don't mind, please give me more time to think about this PR :)

@LuciferYang

In the PR description, should build/mvn compile -pl connector/connect/client/jvm be build/mvn compile -pl connector/connect/client/jvm -am?

@LuciferYang

On the whole, it looks good to me. There is only one question: Spark still uses Maven for version release and deployment, but after this PR the E2E test uses the sbt-assembled server jar instead of the Maven-shaded server jar for testing, which may weaken the Maven test coverage. We may need other ways to ensure the correctness of the Maven-shaded server jar.

In the future, we may use sbt to completely replace Maven (though that should not happen in Spark 3.4.0), including version release, deployment, and other tooling; at that point this will no longer be a problem.

@LuciferYang

There is another problem that needs to be confirmed, which may not be related to the current PR: if other suites inherit RemoteSparkSession, they will share the same connect server, right? (SparkConnectServerUtils is an object, so SparkConnect will only be submitted once.)
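As a minimal, Spark-free sketch of that sharing behavior (names are hypothetical stand-ins, not the real SparkConnectServerUtils): a Scala `object` is initialized at most once per JVM, so every suite that goes through it pays the server startup cost only once.

```scala
// Spark-free sketch: a Scala `object` is created at most once per JVM, so
// all test suites going through it share a single server process.
object SharedConnectServer {
  private var startCount = 0

  // Idempotent start: only the first caller pays the startup cost. The real
  // utility would spark-submit the connect server here instead of counting.
  def start(): Unit = synchronized {
    if (startCount == 0) startCount += 1
  }

  def timesStarted: Int = startCount
}

object SuiteDemo extends App {
  SharedConnectServer.start() // "suite A" asks for the server
  SharedConnectServer.start() // "suite B" asks again; nothing new starts
  println(SharedConnectServer.timesStarted) // prints 1
}
```

This is consistent with the observation above that only the shared server start contributes to the test delay: adding more suites reuses the already-running server.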

@zhenlineo

@LuciferYang Thanks for your review. This PR was trying to simplify the test-running steps, but as you said, it makes the Maven commands call sbt implicitly. I will split the changes into smaller PRs so that this PR only deals with the IT command change. Then we can vote on whether we like this change or not :)

@zhenlineo

#40304 #40303

@zhenlineo zhenlineo force-pushed the one-cmd-for-it branch 2 times, most recently from ba43059 to d1738ac Compare March 8, 2023 21:36
@LuciferYang

It seems the SimpleSparkConnectService startup failed; the error message is:

Error: Missing application resource.

Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
...

@zhenlineo

> seems SimpleSparkConnectService startup failed, the error message is "Error: Missing application resource." […]

Yeah, this was caused by the bug we had in the scripts.

@zhenlineo

@hvanhovell Do you want to keep this, or shall we skip it? It helps a bit for those who don't know to run build/sbt -Pconnect -Phive package before running the IT tests.

@zhenlineo zhenlineo closed this Apr 7, 2023