Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special case Duration display in datafusion-cli and sqllogictest #7070

Closed
Tracked by #3148
alamb opened this issue Jul 24, 2023 · 6 comments
Closed
Tracked by #3148

Special case Duration display in datafusion-cli and sqllogictest #7070

alamb opened this issue Jul 24, 2023 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jul 24, 2023

Is your feature request related to a problem or challenge?

In #6832 and #7068 , @tustvold proposes to use Duration consistently for timestamp arithmetic which makes the code simpler and consistent across DataFusion

However, one implication of this change, at the time of this writigng, is shown #6832 (comment)

Specifically, in DataFusion 28.0.0 the output of timestamp - timestamp is a interval and is displayed like this:

select (now() - '2023-01-01'::timestamp);
+-----------------------------------------------------------+
| now() - Utf8("2023-01-01")                                |
+-----------------------------------------------------------+
| 0 years 0 mons 204 days 16 hours 16 mins 8.121077000 secs |
+-----------------------------------------------------------+
1 row in set. Query took 0.004 seconds.

However, without any additional changes after #6832 it is displayed in ISO 8601 duration format

select (now() - '2023-01-01'::timestamp);
+----------------------------+
| now() - Utf8("2023-01-01") |
+----------------------------+
| P198DT72932.972880S        |
+----------------------------+
1 row in set. Query took 0.002 seconds.

I think this is problematic because the output:

  1. Looks wildly different than 28.0.0
  2. Looks wildly different than how intervals are expressed in queries (e.g. select now() + interval '1 year';)
  3. Will likely cause significant confusion as the ISO duration format, (e.g. P198DT72932.972880S), is not widely used for displaying intervals to humans
  4. The casting logic to strings will be different (see below)

Casting logic difference will means that the following query in 28.0.0

select (now() - '2023-01-01'::timestamp)::varchar || ' ago';
+----------------------------------------------------------------+
| now() - Utf8("2023-01-01") || Utf8(" ago")                     |
+----------------------------------------------------------------+
| 0 years 0 mons 204 days 16 hours 34 mins 15.407810000 secs ago |
+----------------------------------------------------------------+

Would produce this in 29.0.0:

+----------------------------------------------------------------+
| now() - Utf8("2023-01-01") || Utf8(" ago")                     |
+----------------------------------------------------------------+
| P198DT72932.972880S ago |
+----------------------------------------------------------------+

Describe the solution you'd like

I propose adding specific code for formatting DataFusion output, used in datafusion-cli, tests, and sqllogictests that formats Durations consistently with how Intervals are displayed in 28.0.0.

This change will unblock the arrow upgrade in #6832 as well as give us a single location to control formatting in DataFusion making it easier to improve the formatting of intervals / durations (or other types) in the future if we desire.

Anyone using DataFusion programmatically can either choose to use this new formatting routine, or can format Durations / Intervals to their specific need using arrow-rs or custom code (as we do in IOx, for example (source))

While this solution still suffers from the Duration --> String casting problem, I think it is ok because I don't think it is very common and there is a workaround (to cast the duration to an interval):

❯ select (now() - '2023-01-01'::timestamp)::interval::varchar || ' ago';
+----------------------------------------------------------------+
| now() - Utf8("2023-01-01") || Utf8(" ago")                     |
+----------------------------------------------------------------+
| 0 years 0 mons 204 days 16 hours 34 mins 15.407810000 secs ago |
+----------------------------------------------------------------+

For those people who want even more control of duration printing, it would probably make sense to add a specific to_char or similar function: https://www.postgresql.org/docs/current/functions-formatting.html

Describe alternatives you've considered

Change the parsing / formattting of Durations in arrow-rs

The idea is to make PRs like apache/arrow-rs#4557 that would change the formatting in arrow-rs

Some challenges with this approach:

  1. It is not clear if there is an "ideal" canonical format is for durations
  2. It would be a backwards compatible change that would likely impact other projects than DataFusion
  3. Existing data that was serialized with Durations (e.g. JSON or CSV) may be impacted

Special case String casting / parsing

One way to avoid issues here would be to just special case the casting and parsing (Duration <==> String) in datafusion and instead of calling the arrow cast function we could call datafusion_cast which would have special handling for some type and then fall back to the arrow cast kernel

The challenges with this approach are:

  1. It would be brittle (it would be hard to ensure that all new code called datafusion_cast rather than cast
  2. It is not clear that it would be entirely consistent (e.g. if there is code in arrow itself that calls cast)

Additional context

No response

@alamb
Copy link
Contributor Author

alamb commented Jul 24, 2023

@waitingkuo / @liukun4515 / @ozankabak I wonder if you have any thoughts about this proposal (related to reducing the impact of #7068)

@ozankabak
Copy link
Contributor

I will discuss with my team and share our thoughts tomorrow.

@berkaysynnada
Copy link
Contributor

I think the way you suggested seems to be the most rational option. Since we do not use year and month units for durations, we can maybe ignore those parts while formatting.

@alamb
Copy link
Contributor Author

alamb commented Jul 28, 2023

I believe @tustvold plans to handle this next week

@tustvold
Copy link
Contributor

Thinking about this a bit more, I think it would be cleanest if we added support for this to FormatOptions, this can then be plumbed through CastOptions.

apache/arrow-rs#4581 adds the necessary upstream changes

@tustvold
Copy link
Contributor

tustvold commented Aug 8, 2023

Closed by #6832

@tustvold tustvold closed this as completed Aug 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants