Conversation

@vladanvasi-db (Contributor) commented Dec 11, 2024

What changes were proposed in this pull request?

In this PR, I propose extending the current Dataset API with another toJSON method, which takes a jsonOptions map parameter.
The full list of JSON options is documented on the Data Sources page: https://spark.apache.org/docs/3.5.1/sql-data-sources-json.html, so the new method can be used with any of the options specified in the docs.

Why are the changes needed?

These changes are needed because in some cases users need to specify options to control more robustly how JSON strings are formed from a Dataset.
One example is when the output string contains a timestamp:
Currently, the timestamp returned by the toJSON method has millisecond precision, but users sometimes need finer precision, such as microseconds. With the appropriate option, users will be able to format the timestamp correctly.

In this case, the user can call the new method with:
dataSet.toJSON(Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS")) and get the desired precision.
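For illustration, here is a minimal end-to-end sketch of the proposed usage. It is hypothetical, since the overload only exists on this PR's branch, and the sample data and session setup are my own:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample data with a microsecond-precision timestamp.
val df = Seq(java.sql.Timestamp.valueOf("2024-12-11 10:15:30.123456")).toDF("ts")

// Proposed overload: the map takes the documented JSON data source options.
val json = df.toJSON(Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS"))

// Expected output (assuming the session time zone matches the JVM default):
// {"ts":"11.12.2024 10:15:30.123456"}
json.show(truncate = false)
```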

Does this PR introduce any user-facing change?

Yes. The change extends the Dataset API and allows users to call the toJSON method with an options argument, specifying custom options for formatting the returned JSON.

How was this patch tested?

This patch was tested by adding a test in JsonSuite. Many existing tests cover the toJSON method without arguments; a test was added for the toJSON method with an options argument, verifying that dates and timestamps are formatted according to the format specified in the jsonOptions argument.
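A sketch of what such a test could look like (hypothetical; the actual test added by this PR lives in JsonSuite, and the data and assertion here are illustrative):

```scala
// Assumes a ScalaTest-style suite with a SparkSession and spark.implicits._ in scope.
test("toJSON with timestampFormat option") {
  val df = Seq(java.sql.Timestamp.valueOf("2024-12-11 10:15:30.123456")).toDF("ts")
  val json = df.toJSON(Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS")).head()
  // The non-empty options should be propagated down to the JSON writer.
  assert(json == """{"ts":"11.12.2024 10:15:30.123456"}""")
}
```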

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label Dec 11, 2024
@MaxGekk (Member) left a comment:

> users need better precision like micro/nano

Nanosecond precision is not supported at all.

@vladanvasi-db Could you write a test to check that non-empty options are propagated properly?

@cloud-fan (Contributor)

Please update the PR description as it's a user-facing change. Let's also add tests for it.

```scala
}

/** @inheritdoc */
override def toJSON(jsonOptions: Map[String, String]): Dataset[String] = {
```
@HyukjinKwon (Member) commented on the diff:

and... actually, we can easily work around this via:

```scala
df.select(to_json(struct(col("*")), options)).as(StringEncoder)
```
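Spelled out, the workaround could look like this (a sketch; df stands for any DataFrame, and the chosen option is just an example, since to_json accepts the same JSON data source options):

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{col, struct, to_json}

// Pack all columns into a struct, render it as JSON with explicit options,
// and read the result back as a Dataset[String].
val json = df
  .select(to_json(struct(col("*")), Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS")))
  .as(Encoders.STRING)
```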

Contributor:

ah, we probably shouldn't have added Dataset.toJSON in the first place...

@vladanvasi-db (Contributor, Author):

Yes, we can indeed bypass it properly using this workaround. However, do you still think it is worth extending the toJSON API? In my opinion, many users will not figure out this workaround when trying to specify options for toJSON on a Dataset.

@vladanvasi-db (Contributor, Author)

From my understanding, toJSON is a proper API, and it was added in Spark 2.0.0.
It is documented in the official Spark Dataset documentation. On the internet, I see many code examples that use the toJSON API to construct series of JSON strings from a Dataset/DataFrame, and I have not seen a single place that says it is not a proper API or that it is being deprecated.
The workaround that @HyukjinKwon proposed works in my case, so I want to hear your thoughts on whether it is worth extending the Dataset API with a toJSON(jsonOptions: Map[String, String]) method. These JSON options are also well documented, and I think it would be better for users to have a toJSON(options) method for this purpose instead of relying on the workaround proposed above.
Please share your thoughts, @cloud-fan @MaxGekk @HyukjinKwon, because this is critical for timestamp precision in JSON strings converted from a Dataset.

@MaxGekk (Member) commented Dec 14, 2024

> Currently, the timestamp returned by toJSON method will have millisecond precision, but sometimes, users need better precision like micro.

I would consider changing the default timestamp pattern and outputting timestamps with microsecond precision by default in the built-in text data sources. I believe it is worth doing in the 4.0 release.
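For reference (my own illustration, not part of this PR): until such a default changes, microsecond precision can already be requested explicitly when writing JSON, since timestampFormat is a documented JSON data source option:

```scala
// Write JSON with microsecond-precision timestamps by setting the
// documented timestampFormat option explicitly (output path illustrative).
df.write
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX")
  .json("/tmp/json-out")
```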

@HyukjinKwon (Member)

My 2 cents, @vladanvasi-db: we didn't have expressions like to_json at the time we added Dataset.toJSON. Given that many people use it, I wouldn't deprecate it for now, but I would also avoid extending this feature, for the sake of consistency. Another example would be Dataset.toCsv, which I don't think we should add as an API.

@HyukjinKwon changed the title from "[SPARK-50548] Extended DataSet API to support specifying JSON options in toJSON method" to "[SPARK-50548][SQL] Extended DataSet API to support specifying JSON options in toJSON method" on Dec 16, 2024
@vladanvasi-db (Contributor, Author)

Thank you for the comments. I will close this PR for now, as changing the default timestamp pattern seems like the better option.
