[SPARK-51799] Support user-specified schema in DataFrameReader
#58
Conversation
Could you review this PR, @yaooqinn?
```swift
#expect(try await spark.read.schema("age SHORT").json(path).dtypes.count == 1)
#expect(try await spark.read.schema("age SHORT").json(path).dtypes[0] == ("age", "smallint"))
#expect(try await spark.read.schema("age SHORT, name STRING").json(path).dtypes[0] == ("age", "smallint"))
#expect(try await spark.read.schema("age SHORT, name STRING").json(path).dtypes[1] == ("name", "string"))
```
Can we also add a test with a comment and a NOT NULL constraint? A sketch of such a test follows (hypothetical: it reuses the `path` fixture from the tests above and assumes the DDL parser accepts `COMMENT` and `NOT NULL` clauses):
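```swift
// Hypothetical sketch: a DDL schema string carrying both a NOT NULL constraint
// and a COMMENT clause. dtypes only exposes the (name, type) pair, so this
// checks that the extra clauses at least parse and the type is preserved.
#expect(try await spark.read.schema("age SHORT NOT NULL COMMENT 'age in years'").json(path).dtypes[0] == ("age", "smallint"))
```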
I thought it was supported, but according to Apache Spark 4.0.0 RC4, it seems there are limitations.
spark-shell
```
$ bin/spark-shell
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0
      /_/

Using Scala version 2.13.16 (OpenJDK 64-Bit Server VM, Java 17.0.14)
Type in expressions to have them evaluated.
Type :help for more information.
25/04/15 12:32:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1744687967546).
Spark session available as 'spark'.

scala> spark.read.schema("name STRING NOT NULL").json("examples/src/main/resources/people.json").printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- name: string (nullable = true)
```
spark-connect-shell
```
$ bin/spark-connect-shell --remote sc://localhost:15002
25/04/15 12:28:48 INFO DefaultAllocationManagerOption: allocation manager type not specified, using netty as the default type
25/04/15 12:28:48 INFO CheckAllocator: Using DefaultAllocationManager at memory/netty/DefaultAllocationManagerFactory.class
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0
      /_/

Type in expressions to have them evaluated.
Spark connect server version 4.0.0.
Spark session available as 'spark'.

scala> spark.read.schema("name STRING").json("../examples/src/main/resources/people.json").printSchema
root
 |-- name: string (nullable = true)

scala> spark.read.schema("name STRING NOT NULL").json("../examples/src/main/resources/people.json").printSchema
root
 |-- name: string (nullable = true)

scala> spark.read.schema("name STRING NOT NULL").json("../examples/src/main/resources/people.json").show()
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
```
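So in both shells the NOT NULL hint parses but is not enforced for JSON: the resolved schema reports nullable = true. The Swift client talks to the same Connect server, so a test here could only pin down the (name, type) pair; a minimal sketch, assuming the same `people.json` fixture at a hypothetical `path`:

```swift
// Hypothetical sketch: NOT NULL is accepted in the DDL string, but the JSON
// source relaxes nullability server-side, so only the (name, type) pair is
// observable through dtypes.
#expect(try await spark.read.schema("name STRING NOT NULL").json(path).dtypes[0] == ("name", "string"))
```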
Let me dig into that part more, @yaooqinn.
Thank you, @dongjoon-hyun.
Thank you, @yaooqinn!
Merged to main.
What changes were proposed in this pull request?
This PR aims to support a user-specified schema in DataFrameReader, as sketched below.
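A minimal usage sketch of the new API (the JSON path is hypothetical; the expected types follow the tests above):

```swift
// Hypothetical usage sketch of the user-specified DDL schema on DataFrameReader.
#expect(try await spark.read.schema("age SHORT, name STRING").json("/tmp/people.json").dtypes[0] == ("age", "smallint"))
```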
Why are the changes needed?
For feature parity.
Does this PR introduce any user-facing change?
No. This is a new addition.
How was this patch tested?
Pass the CIs.
Was this patch authored or co-authored using generative AI tooling?
No.