Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix conversion Date32 / DateTime64 / Date to narrow types #40217

Merged
merged 29 commits into from
Aug 28, 2022

Conversation

zvonand
Copy link
Contributor

@zvonand zvonand commented Aug 14, 2022

Fixes #39249, also impacts #25284. The upper or lower value is considered when out of normal range.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

More natural conversion of Date32, DateTime64, Date to narrower types: upper or lower normal value is considered when out of normal range

@robot-ch-test-poll robot-ch-test-poll added the pr-improvement Pull request with some product improvements label Aug 14, 2022
@Avogar Avogar added the can be tested Allows running workflows for external contributors label Aug 15, 2022
@den-crane

This comment was marked as outdated.

@nikitamikhaylov

This comment was marked as outdated.

@alexey-milovidov

This comment was marked as outdated.

@rschu1ze rschu1ze self-assigned this Aug 16, 2022
@zvonand zvonand marked this pull request as ready for review August 17, 2022 15:59
@zvonand zvonand changed the title Fix narrowing Date32 / DateTime64 to normal range Fix conversion Date32 / DateTime64 / Date to narrow types Aug 18, 2022
@zvonand

This comment was marked as outdated.

Copy link
Member

@rschu1ze rschu1ze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pong 😄

I didn't try it out but it would be interesting to know if the other methods mentioned here work correctly at the type boundaries, e.g. what happens when addYears() (or date_add etc.) produces an out-of-bounds value, what happens if yesterday() is called on the lowest possible value etc.

docs/ru/sql-reference/data-types/date.md Show resolved Hide resolved
src/Functions/DateTimeTransforms.h Outdated Show resolved Hide resolved
tests/queries/0_stateless/02403_date_time_narrowing.sql Outdated Show resolved Hide resolved
tests/queries/0_stateless/02403_date_time_narrowing.sql Outdated Show resolved Hide resolved
src/Functions/FunctionsConversion.h Outdated Show resolved Hide resolved
src/Functions/FunctionsConversion.h Outdated Show resolved Hide resolved
src/Functions/DateTimeTransforms.h Outdated Show resolved Hide resolved
src/Functions/DateTimeTransforms.h Show resolved Hide resolved
src/Functions/DateTimeTransforms.h Outdated Show resolved Hide resolved
@zvonand zvonand requested a review from rschu1ze August 23, 2022 15:34
Copy link
Member

@rschu1ze rschu1ze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My brain has melted after review but the change is almost ready ^^

docs/en/sql-reference/functions/date-time-functions.md Outdated Show resolved Hide resolved
docs/en/sql-reference/functions/date-time-functions.md Outdated Show resolved Hide resolved
docs/en/sql-reference/functions/date-time-functions.md Outdated Show resolved Hide resolved
docs/en/sql-reference/functions/date-time-functions.md Outdated Show resolved Hide resolved
tests/queries/0_stateless/02403_date_time_narrowing.sql Outdated Show resolved Hide resolved
src/Functions/CustomWeekTransforms.h Outdated Show resolved Hide resolved
src/Functions/DateTimeTransforms.h Outdated Show resolved Hide resolved
src/Functions/FunctionsConversion.h Outdated Show resolved Hide resolved
src/Functions/FunctionsConversion.h Outdated Show resolved Hide resolved
zvonand and others added 4 commits August 23, 2022 23:40
Co-authored-by: Robert Schulze <robert@clickhouse.com>
Co-authored-by: Robert Schulze <robert@clickhouse.com>
Co-authored-by: Robert Schulze <robert@clickhouse.com>
Co-authored-by: Robert Schulze <robert@clickhouse.com>
@zvonand

This comment was marked as outdated.

@rschu1ze
Copy link
Member

Thanks for the explanation. The semantics seem quite dangerous to me, and they are also not documented. Perhaps integer-to-date conversion should not be allowed in the first place. But that's a different topic.

I asked about the non-distributed / distributed mismatch in our internal chat and hope to come back with some findings.

@rschu1ze

This comment was marked as outdated.

@zvonand

This comment was marked as outdated.

@rschu1ze rschu1ze merged commit df934d8 into ClickHouse:master Aug 28, 2022
@zvonand zvonand deleted the zvonand-minmax branch August 28, 2022 15:16
@rvasin
Copy link
Contributor

rvasin commented Sep 2, 2022

@alexey-milovidov I need your advice because the situation is really interesting.

I was working on #38435 and my branch https://github.com/arenadata/ClickHouse/tree/ADQM-528-B
is ready for PR (I need only to update docs).

In my branch I fixed the functions: toStartOfYear, toStartOfISOYear, toStartOfQuarter, toStartOfMonth, toStartOfWeek, toMonday, toLastDayOfMonth
The are all working fine with range 1900-01-01 - 2299-12-31. Functional tests are ready and they work.
I also implemented a compatibility option: enable_date32_results

Today before doing my PR I tried to merge with upstream master and I had a merge conflict because of this PR #40217

After a code review I have noticed many strange new things. For example, consider only function toLastDayOfMonth (most of all other functions are also affected);

this PR #40217

struct ToLastDayOfMonthImpl
{
    static constexpr auto name = "toLastDayOfMonth";

    static inline UInt16 execute(Int64 t, const DateLUTImpl & time_zone)
    {
        if (t < 0)
            return 0;

        /// 0xFFF9 is Int value for 2149-05-31 -- the last day where we can actually find LastDayOfMonth. This will also be the return value.
        return time_zone.toLastDayNumOfMonth(ExtendedDayNum(std::min(Int32(time_zone.toDayNum(t)), Int32(0xFFF9))));
    }
    static inline UInt16 execute(UInt32 t, const DateLUTImpl & time_zone)
    {
        return time_zone.toLastDayNumOfMonth(time_zone.toDayNum(t));
    }
    static inline UInt16 execute(Int32 d, const DateLUTImpl & time_zone)
    {
        if (d < 0)
            return 0;

        /// 0xFFF9 is Int value for 2149-05-31 -- the last day where we can actually find LastDayOfMonth. This will also be the return value.
        return time_zone.toLastDayNumOfMonth(ExtendedDayNum(std::min(d, Int32(0xFFF9))));
    }
    static inline UInt16 execute(UInt16 d, const DateLUTImpl & time_zone)
    {
        /// 0xFFF9 is Int value for 2149-05-31 -- the last day where we can actually find LastDayOfMonth. This will also be the return value.
        return time_zone.toLastDayNumOfMonth(DayNum(std::min(d, UInt16(0xFFF9))));
    }

    using FactorTransform = ZeroTransform;
};

In my branch:

struct ToLastDayOfMonthImpl
{
    static constexpr auto name = "toLastDayOfMonth";

    static inline Int64 execute(Int64 t, const DateLUTImpl & time_zone)
    {
        return time_zone.toLastDayNumOfMonth(time_zone.toDayNum(t));
    }
    static inline UInt16 execute(UInt32 t, const DateLUTImpl & time_zone)
    {
        return time_zone.toLastDayNumOfMonth(time_zone.toDayNum(t));
    }
    static inline Int32 execute(Int32 d, const DateLUTImpl & time_zone)
    {
        return time_zone.toLastDayNumOfMonth(ExtendedDayNum(d));
    }
    static inline UInt16 execute(UInt16 d, const DateLUTImpl & time_zone)
    {
        return time_zone.toLastDayNumOfMonth(DayNum(d));
    }

    using FactorTransform = ZeroTransform;
};

In master at the beginning of August:

struct ToLastDayOfMonthImpl
{
    static constexpr auto name = "toLastDayOfMonth";

    static inline UInt16 execute(Int64 t, const DateLUTImpl & time_zone)
    {
        return time_zone.toLastDayNumOfMonth(time_zone.toDayNum(t));
    }
    static inline UInt16 execute(UInt32 t, const DateLUTImpl & time_zone)
    {
        return time_zone.toLastDayNumOfMonth(time_zone.toDayNum(t));
    }
    static inline UInt16 execute(Int32 d, const DateLUTImpl & time_zone)
    {
        return time_zone.toLastDayNumOfMonth(ExtendedDayNum(d));
    }
    static inline UInt16 execute(UInt16 d, const DateLUTImpl & time_zone)
    {
        return time_zone.toLastDayNumOfMonth(DayNum(d));
    }

    using FactorTransform = ZeroTransform;
};

So please advice what should I do now? I seems to make it working with full range 1900-01-01 to 2299-12-31 I need to completely discard all the changes made by this PR #40217 to all mentioned functions.

Right now what we have:

SELECT toLastDayOfMonth(toDateTime64('1950-02-02 10:20:30', 3));

before this PR #40217 (master from the beginning of August 2022); it returns:
2129-08-04

in this PR #40217 it returns:
1970-01-01

I my branch it returns the correct result:
1950-02-28

The joke is that this PR #40217 is the real improvement: 1970 is closer to 1950 than 2129. But it's still not correct.

@rschu1ze
Copy link
Member

rschu1ze commented Sep 3, 2022

@rvasin Allow me to answer. This PR (#40217) is in some sense a stopgap solution. As you correction mention at the end of your last comment, it produces incorrect results when the input is of extended date/time type and outside the standard value range. The improvement is that prior to this PR, the result was in principle completely random, whereas the calculation of the incorrect result is now at least deterministic. The exact behavior is documented here. More specifically, most date/time functions now implement "cropping" semantics which makes sense iff one assumes they always produce output in a standard precision type.

An implementation which produces the correct result for all input type + value combinations requires type promotion (and yes, such an implementation would replace the modifications in this PR). The reason this has not been done yet were compat concerns.
My personal opinion is that of (1) in #25284 (comment). I think that the reasons for not changing the date/time functions in a backwards-incompatible way are legit. On the other hand, I also think they are not substantiated enough as

  1. only date/time values in the 1900-1969 and 2150-2299 time ranges would be affected and I doubt that users have a lot of such data. The situation would be different for an OLTP database where there is a high chance of storing birthdays etc. But to support such use cases regardless, we should have a server setting which re-enables the legacy cropping behavior.
  2. Adding new "extended" versions of date time functions (e.g. toLastDayOfMonth64) + doing AST transformations sounds hard to implement and difficult to comprehend for users.

But it would be interesting to also hear the opinion of @amosbird and @alexey-milovidov .

PS: @rvasin Feel free to set me as reviewer to your PR once it is ready, I am by now quite deep in the entire date/time madness 😄

@ilejn
Copy link
Contributor

ilejn commented Sep 3, 2022

Hello @rschu1ze , thanks for the reply.
Could you shed some light on backward compatibility issues here, please?
Asking to be sure that we are not missing something important.

@rschu1ze
Copy link
Member

rschu1ze commented Sep 4, 2022

@rvasin I would say the approach in your comment #38435 (comment) is the most intuitive one, i.e. for Date and DateTime inputs date/time functions produce a Date (*) and for Date32 and DateTime64 inputs they produce a Date32. I also agree with #38435 (comment) that ideally all date/time functions should accept extended date/time types as input unless we want the user to insert explicit casts.

About compatibility issues: I guess that the folks I mentioned in #40217 (comment) recall the concerns better than me (I joined the team just a few months ago). Let's hear their feedback.
In my understanding, the potential backwards incompatibility for partition clauses involving date/time functions is not about the type (because the integers underlying the extended-precision types equal the ones for standard-precision types for the range where the types overlap, as far as I know) but about the result value. In that sense, a backwards compatibility setting would not affect the type but whether the correct value or a truncated value is computed.

(*) Just to nitpick, toLastDayOfMonth() will never able to handle Date inputs in the range [2149-06-01, 2149-06-06] correctly, even with your new approach. Maybe your approach should throw an exception for these values instead of doing truncation as done in this PR.

@rvasin
Copy link
Contributor

rvasin commented Sep 4, 2022

@rschu1ze OK. Robert, it would it great if you could be a reviewer for my PR.

I will probably write more details on Monday. Here I will mention some of my ideas about the current situation.

(*) Just to nitpick, toLastDayOfMonth() will never able to handle Date inputs in the range [2149-06-01, 2149-06-06] correctly, even with your new approach. Maybe your approach should throw an exception for these values instead of doing truncation as done in this PR.

I don't understand what are you talking about [2149-06-01, 2149-06-06] range. In my branch all mentioned functions work fine in [1900-01-01, 2299-12-31] range.

It's implemented in such way that if argument is Date32 or DateTime64 the result is Date32 (wide range) otherwise the result is Date type (narrow range).

So it's backward incompatible solution. Therefore as I discussed with @ilejn we decided to implemented the compatibility option enable_date32_results which currently has FALSE value by default. So my PR would not break compatibility. And if someone wants to to use wide ranges he may set value to TRUE. Lets say it's phase A.

Phase B. correct functions like date_diff to accept Date32 and DateTime64 as arguments. This what is causing the main backward compatibility concerts. If we have things like
date_diff(toStartOfMonth(),toLastDayOfMonth())
and if toStartOfMonth or toLastDayOfMonth return Date32 this will break date_diff().
But if we correct all date/time functions like date_diff to make it working with Date32 or DateTime64 arguments. Then because there are no backward compatibility concerts on this phase. So we may move to Phase C:

Phase C. The idea that with some CH version (for example in October or November) we may set this option TRUE by default (there is special compatibility section for that in CH).

Phase D. Somewhere in the future to make is function obsolete. Because behavior will be as TRUE

What comes to my mind:

Maybe narrowing implemented in current PR makes sense. We may keep it (while I have ideas how to refactor it using std::clamp function etc).

So my idea is:
In my PR I may keep all narrowing code (don't remove it). So it will keep CH backward compatible.
And put my code with enable_date32_results=false
So my PR would be a combination of

  1. turned on narrowing and
  2. turned off Date32 results by default.
    What do you think?

And maybe somewhere on phase D we may remove all narrowing code.

Maybe it's a good strategy. What do you think? I think there is should be some strategy of evolution of date/time functions with final goal: all date/time functions must accept Date32/DateTime64 in addition to Date/DateTime and return also narrow range ro wide range. So finally full range 1900-01-01 - 2299-12-31 must be supported by all CH date/time functions.

What bothers me in current PR #40217 there is performance degradation because of clamping/cropping.
Please read my comment from by previous PR. There is similar clamping/cropping. I tried many ways to find a better code. But there is still the performance degradation.
Finally @alexey-milovidov said to remove the clamping code from PR.
But as I said maybe on Phase D we may remove all narrowing code so the performance will be the same as in code at the beginning of August.

Another thing: I also think the approach to use functions like toStartOfDay64 is not a good idea. Because

  1. We need to keep CH more ANSI SQL compatible. Other databases like MySQL/PostgreSQL don't have functions like toStartOfDay64.
  2. CH already has functions like date_add and date_sub (and it does not have functions like date_add 64) which return Date, Date32, DateTime or DateTime64 types depending on arguments. So it's llogical when for example toStartOfMonth returns Date or Date32 (as currently implemented in my branch).

@rschu1ze
Copy link
Member

rschu1ze commented Sep 5, 2022

(*) Just to nitpick, toLastDayOfMonth() will never able to handle Date inputs in the range [2149-06-01, 2149-06-06] correctly, even with your new approach. Maybe your approach should throw an exception for these values instead of doing truncation as done in this PR.

I don't understand what are you talking about [2149-06-01, 2149-06-06] range. In my branch all mentioned functions work fine in [1900-01-01, 2299-12-31] range.

toLastDayOfMonth() of values [2149-06-01, 2149-06-06] produces 2049-06-30. If the input is a Date, we'll produce a Date as well. But 2049-06-30 cannot be represented as a Date. This PR takes the "truncation" route, meaning that the result is 2149-05-31 (and wrong - yet this kind of behavior is documented).

It's implemented in such way that if argument is Date32 or DateTime64 the result is Date32 (wide range) otherwise the result is Date type (narrow range).

So your approach faces the same problem and it cannot solve it. Since (as far as I understand) you try to avoid truncation, my recommendation was to throw an exception for Date (!) inputs [2149-06-01, 2149-06-06]. Date32 inputs are okay.

So it's backward incompatible solution. Therefore as I discussed with @ilejn we decided to implemented the compatibility option enable_date32_results which currently has FALSE value by default. So my PR would not break compatibility. And if someone wants to to use wide ranges he may set value to TRUE. Lets say it's phase A.

Sounds reasonable.

Phase B. correct functions like date_diff to accept Date32 and DateTime64 as arguments. This what is causing the main backward compatibility concerts. If we have things like
date_diff(toStartOfMonth(),toLastDayOfMonth())
and if toStartOfMonth or toLastDayOfMonth return Date32 this will break date_diff().
But if we correct all date/time functions like date_diff to make it working with Date32 or DateTime64 arguments. Then because there are no backward compatibility concerts on this phase. So we may move to Phase C:

(I suppose mixed arguments, i.e. one standard-precision argument and another extended-precision argument, will also be made possible? It involves promotion of the standard-precision type to the extended type. But that's more of a convenience thing and not necessary if too much work.)

In "Phase B", the behavior will be different only if 1. config "enable_date32_results" was enabled and 2. the inputs are extended types. So instead of calling the behavioral change "backwards incompatible", I would rather say that the date/time functions (like date_diff) will gain new capabilities. If the inputs are standard-precision types, neither the result type nor the result value is different and nothing "breaks". Or do I miss something?

Phase C. The idea that with some CH version (for example in October or November) we may set this option TRUE by default (there is special compatibility section for that in CH).

Phase D. Somewhere in the future to make is function obsolete. Because behavior will be as TRUE

If I read you right, config option "enable_date32_results" will control for all date/time functions (not only toLastDayOfMonth) whether extended-precision inputs are accepted. Since users can nest date/time SQL functions in arbitrary combinations, the entire compatibility extension topic gets an "all-or-nothing" taste. If only some functions accept/produce extended types and others don't, we are in for a bad surprise (= bug reports).

So my idea is:
In my PR I may keep all narrowing code (don't remove it). So it will keep CH backward compatible.
And put my code with enable_date32_results=false
So my PR would be a combination of

turned on narrowing and
turned off Date32 results by default.
What do you think?

Sounds good overall. ("Narrowing" is "turned on" (= the default behavior) right now already).

And maybe somewhere on phase D we may remove all narrowing code.

I concur with removing the cfg parameter at some point in future. Yes, the narrowing code can also go at that point. There will still be edge cases like toLastDayOfMonth(Date([2149-06-01, 2149-06-06]) mentioned above which need special treatment.

Maybe it's a good strategy. What do you think? I think there is should be some strategy of evolution of date/time functions with final goal: all date/time functions must accept Date32/DateTime64 in addition to Date/DateTime and return also narrow range ro wide range. So finally full range 1900-01-01 - 2299-12-31 must be supported by all CH date/time functions.

Agree.

What bothers me in current PR #40217 there is performance degradation because of clamping/cropping.

Are you saying that this PR #40217 causes performance degradation? Could you point me to the performance report which shows that?

Please read #39425 (comment) from by previous PR. There is similar clamping/cropping. I tried many ways to find a better code.

Let's address regressions when they occur. Actually, since the calculation with standard-precision types should remain unchanged, regressions should (in theory) only occur with extended types.

Another thing: I also think the approach to use functions like toStartOfDay64 is not a good idea.

Agree.

@rvasin
Copy link
Contributor

rvasin commented Sep 8, 2022

@rschu1ze Right now I am working on integration of the code from this PR into my branch. So I will talk about compatibility aspects later (which are also important). I am planning to make my PR at the middle of the next week.

Now I want to answer on the following:

Are you saying that this PR #40217 causes performance degradation? Could you point me to the performance report which shows that?

As an example of performance degradation as you see there were added many ifs in toDateTime() function:
https://github.com/ClickHouse/ClickHouse/pull/40217/files#diff-2859e9a878020ac61a5586d7413871dfc37b790071c191bc1fe3f091ee56370eR324-R338

static inline UInt32 execute(Int64 d, const DateLUTImpl & time_zone)
{
    if (d < 0)
        return 0;

    auto date_time = time_zone.toDate(d);
    return date_time <= 0xffffffff ? date_time : 0xffffffff;
}

static inline UInt32 execute(const DecimalUtils::DecimalComponents<DateTime64> & t, const DateLUTImpl & /*time_zone*/)
{
    return dt64;
    if (t.whole < 0 || (t.whole >= 0 && t.fractional < 0))
        return 0;

    return std::min<Int64>(t.whole, Int64(0xFFFFFFFF));
}

And on the tests we see:
https://s3.amazonaws.com/clickhouse-test-reports/40217/b9b8f7a05d29104af1dbeb8216c3d99da2cb140c/performance_comparison_[2/4]/report.html

SELECT count() FROM numbers(10000000) WHERE NOT ignore(toDateTime('2017-01-01 00:00:00') + number % 100000000 + rand() % 100000 AS t, formatDateTime(t, '%Y-%m-%d %H:%M:%S'))

+1.502x - it's 50% degradation.

I have to note that all performance tests of this PR are "green". Maybe date time functions are not covered well by tests.

As I explained earlier:
In my previous PR I had the similar situation:
My performance tests were also "green".
But after adding clamping/cropping code to AddDaysImpl and AddWeeksImpl we have noticed the degradation:
https://s3.amazonaws.com/clickhouse-test-reports/39425/96598e35743aa83e3ac31cc4776190543671f1d1/performance_comparison_[4/4]/report.html

SELECT count() FROM numbers(50000000) WHERE NOT ignore(toDate('2017-01-01') + number % 1000 + rand() % 10 AS t, toRelativeDayNum(t))

+1.217x

So in final commit we removed the clamping code:
537ba61

I must admit that such performance degradations must be considered separately and more carefully. I will probably write several queries specific to affected functions like toLastDayOfMonth, toStartOf and I will compare the performance of different branches.
Right now I want to concentrate on integrating on current PR (which is in master now) with my branch as explained in my previous messages.

@rvasin
Copy link
Contributor

rvasin commented Sep 12, 2022

PS: @rvasin Feel free to set me as reviewer to your PR once it is ready, I am by now quite deep in the entire date/time madness smile

@rschu1ze Robert, I created PR #41214 please set yourself as reviewer for the PR.
We will discuss compatibility questions etc in PR #41214 and probably make some more improvements after a talk there.

@rschu1ze
Copy link
Member

rschu1ze commented Sep 14, 2022

Agree that it is a good idea to write performance tests for all date/time functions (or to check at least the test coverage for them). With your new PR #41214, I would not worry too much about the performance degradations because at some future point, narrowing will completely go away and with it all the if statements. And for the remaining few cases, likely()/unlikely() statements can help.

@alexey-milovidov
Copy link
Member

@zvonand I also don't understand what was the motivation for this change.
It is unclear whether we need narrowing.

alexey-milovidov added a commit that referenced this pull request Oct 21, 2022
@rschu1ze
Copy link
Member

The original motivation was #39249.

I don't mind if we keep the original behavior which is to produce random results for arguments values outside the converted-to type's range.
E.g. SELECT toDate(toDate32('1930-01-01')) -- before the revert: 1970-01-01, after the revert: 2109-06-07
It should just be documented (--> here)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
can be tested Allows running workflows for external contributors pr-improvement Pull request with some product improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Casting of the date/timestamps to smaller types should return min/max instead of overflow
9 participants