[SPARK-50548][SQL] Extended DataSet API to support specifying JSON options in toJSON method #49148
Conversation
MaxGekk left a comment:
> users need better precision like micro/nano

Nanoseconds precision is not supported at all.

@vladanvasi-db Could you write a test to check that non-empty options are propagated properly?
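A minimal sketch of the kind of test being requested, assuming the proposed `toJSON(Map)` overload and the usual `spark.implicits._` available in `JsonSuite`; the sample value and the expected string are illustrative:

```scala
test("toJSON should propagate non-empty JSON options") {
  // Hypothetical: the toJSON(Map) overload is the one proposed in this PR,
  // not an existing Spark API; the expected string is illustrative.
  val df = Seq(java.sql.Timestamp.valueOf("2024-01-02 03:04:05.123456")).toDF("ts")
  val json = df.toJSON(Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS")).head()
  assert(json === """{"ts":"02.01.2024 03:04:05.123456"}""")
}
```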
Please update the PR description as it's a user-facing change. Let's also add tests for it.
```scala
  }

  /** @inheritdoc */
  override def toJSON(jsonOptions: Map[String, String]): Dataset[String] = {
```
and... actually we can easily work around via:

```scala
df.select(to_json(struct(col("*")), options)).as(StringEncoder)
```
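A self-contained sketch of this workaround, assuming a local SparkSession and using the public `Encoders.STRING` in place of the internal `StringEncoder`; the sample data and format are illustrative:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder().master("local[*]").appName("toJsonOptions").getOrCreate()
import spark.implicits._

val df = Seq(("a", java.sql.Timestamp.valueOf("2024-01-02 03:04:05.123456"))).toDF("id", "ts")

// Pass the JSON options to to_json instead of relying on Dataset.toJSON.
val options = Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS")
val jsonDs = df.select(to_json(struct(col("*")), options)).as(Encoders.STRING)

jsonDs.show(truncate = false)
```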
ah, we probably shouldn't have added Dataset.toJSON in the first place...
Yes, we can actually bypass it properly using this workaround. However, do you still think it is worth extending the toJSON API? In my opinion, a lot of users will not figure out this workaround when trying to specify options for toJSON on a DataSet.
From my understanding, …
I would consider changing the default timestamp pattern and outputting timestamps with microsecond precision by default in the built-in text datasources. I believe it is worth doing in the 4.0 release.
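For context, microsecond output is already achievable per write in the built-in JSON datasource by setting the option explicitly; a sketch reusing the `df` from the workaround above, with an illustrative output path:

```scala
// Opt into microsecond precision for a single write (path is illustrative).
df.write
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSSSS")
  .json("/tmp/events-with-micros")
```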
my 2 cents, @vladanvasi-db we didn't have expressions like …
Thank you for the comments. I will close this PR for now, as changing the default timestamp pattern seems like the better option.
What changes were proposed in this pull request?
In this PR, I propose extending the current `DataSet` API by adding another `toJSON` method that takes a `jsonOptions` map parameter. The full list of `jsonOptions` is documented on the JSON data sources page: https://spark.apache.org/docs/3.5.1/sql-data-sources-json.html, so the newly added method can be used with all of the options specified in the docs.

Why are the changes needed?
These changes are needed because, in some cases, users need to specify options to control more robustly how JSON strings are formed from a Dataset.

One example is when the output string contains a timestamp: currently, the timestamp returned by the `toJSON` method has millisecond precision, but sometimes users need better precision, such as microseconds. With the appropriate option, users can format the timestamp correctly. In this case, the user can call the new method with:
```scala
dataSet.toJSON(Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS"))
```

and get the desired precision.
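For illustration, an end-to-end sketch of the proposed call; since the overload was never merged, the call itself is hypothetical and the output line is an assumption:

```scala
import spark.implicits._  // assumes an active SparkSession named spark

val dataSet = Seq(java.sql.Timestamp.valueOf("2025-06-01 12:00:00.123456")).toDF("ts")

// Hypothetical overload proposed in this PR (not part of released Spark):
dataSet.toJSON(Map("timestampFormat" -> "dd.MM.yyyy HH:mm:ss.SSSSSS")).show(truncate = false)
// Illustrative output row: {"ts":"01.06.2025 12:00:00.123456"}
```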
Does this PR introduce any user-facing change?

Yes. The change extends the `DataSet` API and allows users to call the `toJSON` method with an `options` argument, specifying custom options for formatting the returned JSON.

How was this patch tested?
This patch was tested by adding a test in `JsonSuite`. There are already many tests covering the `toJSON` method without arguments; a new test was added for the `toJSON` method with an `options` argument, verifying that dates and timestamps are formatted according to the format specified in the `jsonOptions` argument.

Was this patch authored or co-authored using generative AI tooling?
No.