Conversation

@vladanvasi-db (Contributor) commented Dec 11, 2024

What changes were proposed in this pull request?

In this PR, I propose extending the current Dataset API with another toJSON method, which takes a jsonOptions map parameter.
The full list of JSON options is documented on the Data Sources page: https://spark.apache.org/docs/3.5.1/sql-data-sources-json.html, so the new method can be used with any of the options specified in the docs.

Why are the changes needed?

These changes are needed because in some cases users need to specify options to control more robustly how JSON strings are formed from a Dataset.
One example is when the output string contains a timestamp:
Currently, the timestamp returned by the toJSON method has millisecond precision, but users sometimes need finer precision, such as microseconds. With the appropriate option, users will be able to format the timestamp correctly.

In this case, the user can call the new method with:
dataSet.toJSON(Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS")) and get the desired precision.
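For illustration, here is a minimal end-to-end sketch of the proposed usage. It is hypothetical, since the overload only exists on this PR's branch, and the sample data and session setup are my own:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample data with a microsecond-precision timestamp.
val df = Seq(java.sql.Timestamp.valueOf("2024-12-11 10:15:30.123456")).toDF("ts")

// Proposed overload: the map takes the documented JSON data source options.
val json = df.toJSON(Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS"))

// Expected output (assuming the session time zone matches the JVM default):
// {"ts":"11.12.2024 10:15:30.123456"}
json.show(truncate = false)
```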

Does this PR introduce any user-facing change?

Yes. The change extends the Dataset API and allows users to call the toJSON method with an options argument, specifying custom options for formatting the returned JSON.

How was this patch tested?

This patch was tested by adding a test in JsonSuite. Many existing tests cover the toJSON method without arguments; a test was added for the toJSON method with an options argument, verifying that dates and timestamps are formatted according to the format specified in the jsonOptions argument.
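A sketch of what such a test could look like (hypothetical; the actual test added by this PR lives in JsonSuite, and the data and assertion here are illustrative):

```scala
// Assumes a ScalaTest-style suite with a SparkSession and spark.implicits._ in scope.
test("toJSON with timestampFormat option") {
  val df = Seq(java.sql.Timestamp.valueOf("2024-12-11 10:15:30.123456")).toDF("ts")
  val json = df.toJSON(Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS")).head()
  // The non-empty options should be propagated down to the JSON writer.
  assert(json == """{"ts":"11.12.2024 10:15:30.123456"}""")
}
```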

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label Dec 11, 2024
@MaxGekk (Member) left a comment:

> users need better precision like micro/nano

Nanosecond precision is not supported at all.

@vladanvasi-db Could you write a test to check that non-empty options are propagated properly?

@cloud-fan (Contributor)

Please update the PR description as it's a user-facing change. Let's also add tests for it.

```scala
}

/** @inheritdoc */
override def toJSON(jsonOptions: Map[String, String]): Dataset[String] = {
```
@HyukjinKwon (Member) commented on the diff:

and... actually, we can easily work around this via:

```scala
df.select(to_json(struct(col("*")), options)).as(StringEncoder)
```
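Spelled out, the workaround could look like this (a sketch; df stands for any DataFrame, and the chosen option is just an example, since to_json accepts the same JSON data source options):

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{col, struct, to_json}

// Pack all columns into a struct, render it as JSON with explicit options,
// and read the result back as a Dataset[String].
val json = df
  .select(to_json(struct(col("*")), Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS")))
  .as(Encoders.STRING)
```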

Contributor:

ah, we probably shouldn't have added Dataset.toJSON in the first place...

@vladanvasi-db (Contributor, Author):

Yes, we can indeed bypass it properly using this workaround. However, do you still think it is worth extending the toJSON API? In my opinion, many users will not figure out this workaround when trying to specify options for toJSON on a Dataset.

@vladanvasi-db (Contributor, Author)

From my understanding, toJSON is a proper API, and it was added in Spark 2.0.0.
It is documented in the official Spark Dataset documentation. On the internet, I see many code examples that use the toJSON API to construct series of JSON strings from a Dataset/DataFrame, and I have not seen a single place that says it is not a proper API or that it is being deprecated.
The workaround that @HyukjinKwon proposed works in my case, so I want to hear your thoughts on whether it is worth extending the Dataset API with a toJSON(jsonOptions: Map[String, String]) method. These JSON options are also well documented, and I think it would be better for users to have a toJSON(options) method for this purpose instead of relying on the workaround proposed above.
Please share your thoughts, @cloud-fan @MaxGekk @HyukjinKwon, because this is critical for timestamp precision in JSON strings converted from a Dataset.

@MaxGekk (Member) commented Dec 14, 2024

> Currently, the timestamp returned by toJSON method will have millisecond precision, but sometimes, users need better precision like micro.

I would consider changing the default timestamp pattern and outputting timestamps with microsecond precision by default in the built-in text data sources. I believe it is worth doing in the 4.0 release.
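For reference (my own illustration, not part of this PR): until such a default changes, microsecond precision can already be requested explicitly when writing JSON, since timestampFormat is a documented JSON data source option:

```scala
// Write JSON with microsecond-precision timestamps by setting the
// documented timestampFormat option explicitly (output path illustrative).
df.write
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX")
  .json("/tmp/json-out")
```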

@HyukjinKwon (Member)

My 2 cents, @vladanvasi-db: we didn't have expressions like to_json at the time we added Dataset.toJSON. Given that many people use it, I wouldn't deprecate it for now, but I would also avoid extending this feature, for the sake of consistency. Another example would be Dataset.toCsv, which I don't think we should add as an API.

@HyukjinKwon changed the title from "[SPARK-50548] Extended DataSet API to support specifying JSON options in toJSON method" to "[SPARK-50548][SQL] Extended DataSet API to support specifying JSON options in toJSON method" on Dec 16, 2024
@vladanvasi-db (Contributor, Author)

Thank you for the comments. I will close this PR for now, as changing the default timestamp pattern seems like the better option.
