[SPARK-51799] Support user-specified schema in DataFrameReader
#58
Conversation
Could you review this PR, @yaooqinn?
```swift
#expect(try await spark.read.schema("age SHORT").json(path).dtypes.count == 1)
#expect(try await spark.read.schema("age SHORT").json(path).dtypes[0] == ("age", "smallint"))
#expect(try await spark.read.schema("age SHORT, name STRING").json(path).dtypes[0] == ("age", "smallint"))
#expect(try await spark.read.schema("age SHORT, name STRING").json(path).dtypes[1] == ("name", "string"))
```
Can we also add a test with a comment and a NOT NULL constraint? A sketch of such a test follows (hypothetical: it reuses the `path` fixture from the tests above and assumes the DDL parser accepts `COMMENT` and `NOT NULL` clauses):
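```swift
// Hypothetical sketch: a DDL schema string carrying both a NOT NULL constraint
// and a COMMENT clause. dtypes only exposes the (name, type) pair, so this
// checks that the extra clauses at least parse and the type is preserved.
#expect(try await spark.read.schema("age SHORT NOT NULL COMMENT 'age in years'").json(path).dtypes[0] == ("age", "smallint"))
```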
I thought it was supported, but according to Apache Spark 4.0.0 RC4, it seems there are limitations.
spark-shell
```
$ bin/spark-shell
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0
      /_/

Using Scala version 2.13.16 (OpenJDK 64-Bit Server VM, Java 17.0.14)
Type in expressions to have them evaluated.
Type :help for more information.
25/04/15 12:32:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1744687967546).
Spark session available as 'spark'.

scala> spark.read.schema("name STRING NOT NULL").json("examples/src/main/resources/people.json").printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- name: string (nullable = true)
```
spark-connect-shell
```
$ bin/spark-connect-shell --remote sc://localhost:15002
25/04/15 12:28:48 INFO DefaultAllocationManagerOption: allocation manager type not specified, using netty as the default type
25/04/15 12:28:48 INFO CheckAllocator: Using DefaultAllocationManager at memory/netty/DefaultAllocationManagerFactory.class
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0
      /_/

Type in expressions to have them evaluated.
Spark connect server version 4.0.0.
Spark session available as 'spark'.

scala> spark.read.schema("name STRING").json("../examples/src/main/resources/people.json").printSchema
root
 |-- name: string (nullable = true)

scala> spark.read.schema("name STRING NOT NULL").json("../examples/src/main/resources/people.json").printSchema
root
 |-- name: string (nullable = true)

scala> spark.read.schema("name STRING NOT NULL").json("../examples/src/main/resources/people.json").show()
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
```
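So in both shells the NOT NULL hint parses but is not enforced for JSON: the resolved schema reports nullable = true. The Swift client talks to the same Connect server, so a test here could only pin down the (name, type) pair; a minimal sketch, assuming the same `people.json` fixture at a hypothetical `path`:

```swift
// Hypothetical sketch: NOT NULL is accepted in the DDL string, but the JSON
// source relaxes nullability server-side, so only the (name, type) pair is
// observable through dtypes.
#expect(try await spark.read.schema("name STRING NOT NULL").json(path).dtypes[0] == ("name", "string"))
```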
Let me dig into that part more, @yaooqinn.
Thank you, @dongjoon-hyun.
Thank you, @yaooqinn!
Merged to main.
What changes were proposed in this pull request?
This PR aims to support a user-specified schema in DataFrameReader, as sketched below.
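A minimal usage sketch of the new API (the JSON path is hypothetical; the expected types follow the tests above):

```swift
// Hypothetical usage sketch of the user-specified DDL schema on DataFrameReader.
#expect(try await spark.read.schema("age SHORT, name STRING").json("/tmp/people.json").dtypes[0] == ("age", "smallint"))
```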
Why are the changes needed?
For feature parity.
Does this PR introduce any user-facing change?
No. This is a new addition.
How was this patch tested?
Pass the CIs.
Was this patch authored or co-authored using generative AI tooling?
No.