
[SPARK-42457][CONNECT] Adding SparkSession#read #40025

Closed
wants to merge 6 commits

Conversation

@zhenlineo (Contributor) commented Feb 15, 2023

What changes were proposed in this pull request?

Add the SparkSession read API to read data into Spark via the Scala client:

```
DataFrameReader.format(...).option("key", "value").schema(...).load()
```
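For example, a minimal usage sketch (the format, schema, options, and path below are illustrative, not taken verbatim from this PR):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical example: read a CSV file through the Connect Scala client.
val df = spark.read
  .format("csv")
  .option("header", "true") // illustrative option
  .schema(StructType(
    StructField("name", StringType) ::
      StructField("age", IntegerType) :: Nil))
  .load("/tmp/people.csv") // illustrative path
```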

The following methods are skipped by the Scala Client on purpose:

```
[info]   deprecated method json(org.apache.spark.api.java.JavaRDD)org.apache.spark.sql.Dataset in class org.apache.spark.sql.DataFrameReader does not have a correspondent in client version
[info]   deprecated method json(org.apache.spark.rdd.RDD)org.apache.spark.sql.Dataset in class org.apache.spark.sql.DataFrameReader does not have a correspondent in client version
[info]   method json(org.apache.spark.sql.Dataset)org.apache.spark.sql.Dataset in class org.apache.spark.sql.DataFrameReader does not have a correspondent in client version
[info]   method csv(org.apache.spark.sql.Dataset)org.apache.spark.sql.Dataset in class org.apache.spark.sql.DataFrameReader does not have a correspondent in client version
```

Why are the changes needed?

To read data from CSV and other formats.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

E2E and golden tests.

```scala
// Copy the reader's options and paths into the DataSource proto builder.
extraOptions.foreach { case (k, v) => dataSourceBuilder.putOptions(k, v) }
paths.foreach(path => dataSourceBuilder.addPaths(path))
```
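For context, a rough sketch of how these lines might sit inside the client-side `DataFrameReader.load`; the surrounding method and builder accessors are assumptions based on the Spark Connect proto, not a verbatim quote of this diff:

```scala
// Sketch only: build a Read.DataSource relation on the client and leave all
// path/option reconciliation to the server-side planner.
def load(paths: String*): DataFrame = {
  sparkSession.newDataFrame { builder => // assumed session helper
    val dataSourceBuilder = builder.getReadBuilder.getDataSourceBuilder
    dataSourceBuilder.setFormat(source) // format set earlier via format(...)
    userSpecifiedSchema.foreach(s => dataSourceBuilder.setSchema(s.toDDL))
    extraOptions.foreach { case (k, v) => dataSourceBuilder.putOptions(k, v) }
    paths.foreach(path => dataSourceBuilder.addPaths(path))
  }
}
```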
Contributor

You can also set path/paths in the options. How does this work? Does the server reconcile all these path options?

zhenlineo (Contributor Author)

@hvanhovell I do not think we can do anything special here. The server-side planner will go through the original SQL code path and merge the paths based on the config settings, e.g.:

```scala
def load(paths: String*): DataFrame = {
  ....
  val legacyPathOptionBehavior = sparkSession.sessionState.conf.legacyPathOptionBehavior
  if (!legacyPathOptionBehavior &&
      (extraOptions.contains("path") || extraOptions.contains("paths")) && paths.nonEmpty) {
    throw QueryCompilationErrors.pathOptionNotSetCorrectlyWhenReadingError()
  }

  DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).flatMap { provider =>
    DataSourceV2Utils.loadV2Source(sparkSession, provider, userSpecifiedSchema, extraOptions,
      source, paths: _*)
  }.getOrElse(loadV1Source(paths: _*))
}
```

If we merged them on the client, we would have to assume a default for the config value. So it is best to leave the merging of paths to the server-side SQL API.
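To illustrate with a hypothetical call (not from this PR): with the default, non-legacy behavior, setting a `path` option and also passing a path to `load()` makes the server raise the error from the snippet above.

```scala
// Hypothetical illustration of the server-side check quoted above: with
// spark.sql.legacy.pathOptionBehavior.enabled=false (the default), mixing a
// "path" option with an explicit load() argument fails on the server.
spark.read
  .format("csv")
  .option("path", "/data/a.csv") // conflicts with the argument below
  .load("/data/b.csv")           // server throws the path-option error
```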

Contributor

Alright, that works for me. Can you make sure we test this?

@hvanhovell (Contributor) left a comment

Looks good overall. Please provide clarity on path handling.

@hvanhovell (Contributor)

@zhenlineo can you make the PR description a bit more descriptive?

@hvanhovell (Contributor)

Oh and can you add the ticket to the title?

@amaliujia (Contributor)

So is it ok to not have e2e tests for the read API (and similarly for the write side)?

```scala
.schema(StructType(StructField("name", StringType) :: StructField("age", IntegerType) :: Nil))
.option("op1", "op1")
.options(Map("op2" -> "op2"))
.load(testDataPath.resolve("people.txt").toString)
```
Contributor

Why do we need the physical data files checked in? If we only compare the plans, don't we only need a fake data path?

@zhenlineo (Contributor Author) commented Feb 15, 2023

The server-side test ProtoToParsedPlanTestSuite runs this test with the data files. However, it does not verify the correctness of the data that is read (e.g. whether the content of the CSV is loaded correctly).

@amaliujia (Contributor) commented Feb 16, 2023

I might be wrong, but I am thinking that if you provide a schema in the proto, the server side might not need to load the data: it only reads the data directly when it needs to infer the schema, i.e. when no schema is set. I am not sure whether there are other cases in which the server side still has to read the data.
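For instance, a hypothetical read where the schema is set explicitly, so it travels with the proto and the server could in principle skip the inference scan:

```scala
// Hypothetical: a user-specified schema (carried in the proto, e.g. as DDL)
// should make schema inference, and hence an extra scan of the files, unnecessary.
val df = spark.read
  .schema("name STRING, age INT") // explicit schema, no inference needed
  .json("/tmp/people.json")       // illustrative path
```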

zhenlineo (Contributor Author)

Umm, I tried with a JSON file, but providing the schema does not seem to be enough. :(

Contributor

ah sure :)

@zhenlineo changed the title from [CONNECT] Adding SparkSession#read to [SPARK-42457][CONNECT] Adding SparkSession#read on Feb 15, 2023
@zhenlineo zhenlineo marked this pull request as ready for review February 15, 2023 22:31
@amaliujia (Contributor)

LGTM

Looks like you only need the following command to fix the style:

```
./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect/common -pl connector/connect/server -pl connector/connect/client/jvm
```

@hvanhovell (Contributor) left a comment

LGTM

hvanhovell pushed a commit that referenced this pull request Feb 16, 2023
Closes #40025 from zhenlineo/session-read.

Authored-by: Zhen Li <zhenlineo@users.noreply.github.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 8d863e3)
Signed-off-by: Herman van Hovell <herman@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023