-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix conversion Date32 / DateTime64 / Date to narrow types #40217
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
884de09
to
953f4ed
Compare
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pong 😄
I didn't try it out but it would be interesting to know if the other methods mentioned here work correctly at the type boundaries, e.g. what happens when addYears()
(or date_add
etc.) produces an out-of-bounds value, what happens if yesterday()
is called on the lowest possible value etc.
cefa6bc
to
d789dab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My brain has melted after review but the change is almost ready ^^
Co-authored-by: Robert Schulze <robert@clickhouse.com>
Co-authored-by: Robert Schulze <robert@clickhouse.com>
Co-authored-by: Robert Schulze <robert@clickhouse.com>
Co-authored-by: Robert Schulze <robert@clickhouse.com>
This comment was marked as outdated.
This comment was marked as outdated.
Thanks for the explanation. The semantics seem quite dangerous to me, and they are also not documented. Perhaps integer-to-date conversion should not be allowed in the first place. But that's a different topic. I asked about the non-distributed / distributed mismatch in our internal chat and hope to come back with some findings. |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
@alexey-milovidov I need your advice because the situation is really interesting. I was working on #38435 and my branch https://github.com/arenadata/ClickHouse/tree/ADQM-528-B In my branch I fixed the functions: toStartOfYear, toStartOfISOYear, toStartOfQuarter, toStartOfMonth, toStartOfWeek, toMonday, toLastDayOfMonth Today before doing my PR I tried to merge with upstream master and I had a merge conflict because of this PR #40217 After a code review I have noticed many strange new things. For example, consider only function toLastDayOfMonth (most of all other functions are also affected); this PR #40217
In my branch:
In master at the beginning of August:
So please advice what should I do now? I seems to make it working with full range 1900-01-01 to 2299-12-31 I need to completely discard all the changes made by this PR #40217 to all mentioned functions. Right now what we have:
before this PR #40217 (master from the beginning of August 2022); it returns: in this PR #40217 it returns: I my branch it returns the correct result: The joke is that this PR #40217 is the real improvement: 1970 is closer to 1950 than 2129. But it's still not correct. |
@rvasin Allow me to answer. This PR (#40217) is in some sense a stopgap solution. As you correction mention at the end of your last comment, it produces incorrect results when the input is of extended date/time type and outside the standard value range. The improvement is that prior to this PR, the result was in principle completely random, whereas the calculation of the incorrect result is now at least deterministic. The exact behavior is documented here. More specifically, most date/time functions now implement "cropping" semantics which makes sense iff one assumes they always produce output in a standard precision type. An implementation which produces the correct result for all input type + value combinations requires type promotion (and yes, such an implementation would replace the modifications in this PR). The reason this has not been done yet were compat concerns.
But it would be interesting to also hear the opinion of @amosbird and @alexey-milovidov . PS: @rvasin Feel free to set me as reviewer to your PR once it is ready, I am by now quite deep in the entire date/time madness 😄 |
Hello @rschu1ze , thanks for the reply. |
@rvasin I would say the approach in your comment #38435 (comment) is the most intuitive one, i.e. for Date and DateTime inputs date/time functions produce a Date (*) and for Date32 and DateTime64 inputs they produce a Date32. I also agree with #38435 (comment) that ideally all date/time functions should accept extended date/time types as input unless we want the user to insert explicit casts. About compatibility issues: I guess that the folks I mentioned in #40217 (comment) recall the concerns better than me (I joined the team just a few months ago). Let's hear their feedback. (*) Just to nitpick, |
@rschu1ze OK. Robert, it would it great if you could be a reviewer for my PR. I will probably write more details on Monday. Here I will mention some of my ideas about the current situation.
I don't understand what are you talking about [2149-06-01, 2149-06-06] range. In my branch all mentioned functions work fine in [1900-01-01, 2299-12-31] range. It's implemented in such way that if argument is Date32 or DateTime64 the result is Date32 (wide range) otherwise the result is Date type (narrow range). So it's backward incompatible solution. Therefore as I discussed with @ilejn we decided to implemented the compatibility option enable_date32_results which currently has FALSE value by default. So my PR would not break compatibility. And if someone wants to to use wide ranges he may set value to TRUE. Lets say it's phase A. Phase B. correct functions like date_diff to accept Date32 and DateTime64 as arguments. This what is causing the main backward compatibility concerts. If we have things like Phase C. The idea that with some CH version (for example in October or November) we may set this option TRUE by default (there is special compatibility section for that in CH). Phase D. Somewhere in the future to make is function obsolete. Because behavior will be as TRUE What comes to my mind: Maybe narrowing implemented in current PR makes sense. We may keep it (while I have ideas how to refactor it using std::clamp function etc). So my idea is:
And maybe somewhere on phase D we may remove all narrowing code. Maybe it's a good strategy. What do you think? I think there is should be some strategy of evolution of date/time functions with final goal: all date/time functions must accept Date32/DateTime64 in addition to Date/DateTime and return also narrow range ro wide range. So finally full range 1900-01-01 - 2299-12-31 must be supported by all CH date/time functions. What bothers me in current PR #40217 there is performance degradation because of clamping/cropping. Another thing: I also think the approach to use functions like toStartOfDay64 is not a good idea. Because
|
So your approach faces the same problem and it cannot solve it. Since (as far as I understand) you try to avoid truncation, my recommendation was to throw an exception for
Sounds reasonable.
(I suppose mixed arguments, i.e. one standard-precision argument and another extended-precision argument, will also be made possible? It involves promotion of the standard-precision type to the extended type. But that's more of a convenience thing and not necessary if too much work.) In "Phase B", the behavior will be different only if 1. config "enable_date32_results" was enabled and 2. the inputs are extended types. So instead of calling the behavioral change "backwards incompatible", I would rather say that the date/time functions (like
If I read you right, config option "enable_date32_results" will control for all date/time functions (not only
Sounds good overall. ("Narrowing" is "turned on" (= the default behavior) right now already).
I concur with removing the cfg parameter at some point in future. Yes, the narrowing code can also go at that point. There will still be edge cases like
Agree.
Are you saying that this PR #40217 causes performance degradation? Could you point me to the performance report which shows that?
Let's address regressions when they occur. Actually, since the calculation with standard-precision types should remain unchanged, regressions should (in theory) only occur with extended types.
Agree. |
@rschu1ze Right now I am working on integration of the code from this PR into my branch. So I will talk about compatibility aspects later (which are also important). I am planning to make my PR at the middle of the next week. Now I want to answer on the following:
As an example of performance degradation as you see there were added many ifs in toDateTime() function:
And on the tests we see:
+1.502x - it's 50% degradation. I have to note that all performance tests of this PR are "green". Maybe date time functions are not covered well by tests. As I explained earlier:
+1.217x So in final commit we removed the clamping code: I must admit that such performance degradations must be considered separately and more carefully. I will probably write several queries specific to affected functions like toLastDayOfMonth, toStartOf and I will compare the performance of different branches. |
@rschu1ze Robert, I created PR #41214 please set yourself as reviewer for the PR. |
Agree that it is a good idea to write performance tests for all date/time functions (or to check at least the test coverage for them). With your new PR #41214, I would not worry too much about the performance degradations because at some future point, narrowing will completely go away and with it all the if statements. And for the remaining few cases, |
@zvonand I also don't understand what was the motivation for this change. |
The original motivation was #39249. I don't mind if we keep the original behavior which is to produce random results for arguments values outside the converted-to type's range. |
Fixes #39249, also impacts #25284. The upper or lower value is considered when out of normal range.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
More natural conversion of Date32, DateTime64, Date to narrower types: upper or lower normal value is considered when out of normal range