Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13561: [C++] Implement week kernel that accepts WeekOptions #11026

Closed
wants to merge 11 commits into from

Conversation

rok
Copy link
Member

@rok rok commented Aug 28, 2021

This is to resolve ARROW-13561.

@github-actions
Copy link

@rok
Copy link
Member Author

rok commented Aug 28, 2021

Shall we also change DayOfWeekOptions to WeekOptions? It would be a better fit overall if this PR is merged.

@rok rok force-pushed the ARROW-13561 branch 2 times, most recently from f8d1e4c to 6997246 Compare August 29, 2021 11:38
@rok rok force-pushed the ARROW-13561 branch 4 times, most recently from 33b27e7 to 4a1b8a7 Compare August 30, 2021 10:59
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So essentially, ISOWeek starts from 1, Week starts from 0, and neither use DayOfWeekOptions.week_start? In that case I would just make them two optionless kernels.

@rok
Copy link
Member Author

rok commented Aug 30, 2021

So essentially, ISOWeek starts from 1, Week starts from 0, and neither use DayOfWeekOptions.week_start? In that case I would just make them two optionless kernels.

To get parity with MySQL DayOfWeekOptions.week_start should be settable either 1 or 7.

@lidavidm
Copy link
Member

So essentially, ISOWeek starts from 1, Week starts from 0, and neither use DayOfWeekOptions.week_start? In that case I would just make them two optionless kernels.

To get parity with MySQL DayOfWeekOptions.week_start should be settable either 1 or 7.

Ah, ok. But unless I'm mistaken, week_start is not actually used by the week kernel.

@rok
Copy link
Member Author

rok commented Aug 30, 2021

Ah, ok. But unless I'm mistaken, week_start is not actually used by the week kernel.

I'm just trying to understand if it does :)

@rok
Copy link
Member Author

rok commented Aug 30, 2021

This evaluates to true in MySQL on sqlfiddle:

SELECT WEEK('2008-1-6', 1) != WEEK('2008-1-6', 4)
Mode First day of week Range Week 1 is the first week
0 Sunday 0-53 with a Sunday in this year
1 Monday 0-53 with 4 or more days this year
2 Sunday 1-53 with a Sunday in this year
3 Monday 1-53 with 4 or more days this year
4 Sunday 0-53 with 4 or more days this year
5 Monday 0-53 with a Monday in this year
6 Sunday 1-53 with 4 or more days this year
7 Monday 1-53 with a Monday in this year

@rok rok force-pushed the ARROW-13561 branch 2 times, most recently from 18a2b49 to 69eed89 Compare August 30, 2021 22:09
@@ -337,6 +339,41 @@ struct ISOWeek {
Localizer localizer_;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given ISOWeek is a special case of Week now, maybe we should just have one kernel (and perhaps some conveniences, e.g. DayOfWeekOptions::IsoWeek())?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, that's a nice idea.
I'm still trying to exactly match WEEK from MySQL and once I do I'll look into this. Sorry for the WIP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, no worries. I can hold off looking at this until you're ready.

Copy link
Member Author

@rok rok Aug 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early feedback is great though! :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping on this? Since it seems the two kernels are very similar.

@rok
Copy link
Member Author

rok commented Sep 1, 2021

@ianmcook This now covers MySQL WEEK modes 1, 3, 4 and 6 (week 1 is the first week with 4 or more days this year).
If we want to cover modes 0, 2, 5, and 7 (week 1 is the first week with a Monday/Sunday in this year) we could add another parameter e.g.: DayOfWeekOptions.week_start_in_year. Do you think we should?

Another thing to note here is that DayOfWeekOptions.one_based_numbering doesn't exactly mean we count from 0 or 1 but rather that a date from iso week 52 of previous year that is in this year will be counted as 0 instead of 52. It's a MySQL convention.

@rok
Copy link
Member Author

rok commented Sep 1, 2021

we could to add another parameter e.g.: DayOfWeekOptions.week_start_in_year.

Perhaps DayOfWeekOptions.count_full_weeks is more idiomatic.

@lidavidm
Copy link
Member

lidavidm commented Sep 1, 2021

If the option doesn't apply to DayOfWeek, maybe it should be a separate WeekOptions in that case.

@rok
Copy link
Member Author

rok commented Sep 1, 2021

So then we now have:
DayOfWeekOptions with week_start and one_based_numbering

We would need to add three binary parameters to match the 8 modes of MySQL:

  • WeekOptions with week_start (first day of week),count_from_zero (range), count_full_weeks (week 1 is ..)
Mode First day of week Range Week 1 is the first week
0 Sunday 0-53 with a Sunday in this year
1 Monday 0-53 with 4 or more days this year
2 Sunday 1-53 with a Sunday in this year
3 Monday 1-53 with 4 or more days this year
4 Sunday 0-53 with 4 or more days this year
5 Monday 0-53 with a Monday in this year
6 Sunday 1-53 with 4 or more days this year
7 Monday 1-53 with a Monday in this year

@rok
Copy link
Member Author

rok commented Sep 1, 2021

I tried adding WeekOptions with the following parameters:

  /// What day does the week start with (Monday=true, Sunday=false)
  bool week_starts_monday;
  /// Days in current year that fall into last years ISO week return week 0 if true
  bool count_from_zero;
  /// Is the first week fully in the the year or only its 4 or more days
  bool first_week_in_year;

I've also added tests with all the MySQL modes. Docs and the logic need some more work, but first: do we go this way or stick to iso week only?

@rok rok force-pushed the ARROW-13561 branch 2 times, most recently from ef87424 to 062e5f2 Compare September 2, 2021 10:02
@rok
Copy link
Member Author

rok commented Sep 23, 2021

@lidavidm @pitrou I've removed the ISOWeek kernel and added iso_week and us_week functions that use week with specific options.

@rok
Copy link
Member Author

rok commented Sep 23, 2021

R CI is reporting:

── Error (test-dplyr-lubridate.R:80:3): extract isoweek from timestamp ─────────
Error: Invalid: Attempted to initialize KernelState from null FunctionOptions
Backtrace:
     █
  1. ├─arrow:::expect_dplyr_equal(...) test-dplyr-lubridate.R:80:2
  2. │ ├─testthat::expect_warning(...) helper-expectation.R:101:4
  3. │ │ └─testthat:::quasi_capture(enquo(object), label, capture_warnings)
  4. │ │   ├─testthat:::.capture(...)
  5. │ │   │ └─base::withCallingHandlers(...)
  6. │ │   └─rlang::eval_bare(quo_get_expr(.quo), quo_get_env(.quo))
  7. │ └─rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = record_batch(tbl))))
  8. ├─input %>% mutate(x = isoweek(datetime)) %>% collect()
  9. ├─dplyr::collect(.)
 10. └─arrow:::collect.arrow_dplyr_query(.)
 11.   └─arrow:::do_exec_plan(x)
 12.     └─plan$Build(.data)
 13.       └─node$Project(projection)
 14.         ├─self$preserve_sort(ExecNode_Project(self, cols, names(cols)))
 15.         └─arrow:::ExecNode_Project(self, cols, names(cols))
── Error (test-dplyr-lubridate.R:205:3): extract isoweek from date ─────────────
Error: Invalid: Attempted to initialize KernelState from null FunctionOptions

I suppose we then need ISOWeek and USWeek declared with options in the API.

@lidavidm
Copy link
Member

Yeah, we'd need a case here:

std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(

@rok
Copy link
Member Author

rok commented Sep 23, 2021

Yeah, we'd need a case here:

std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(

@lidavidm oh, didn't realize it had to be there.
Maybe would be good to add R tests as well then?

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits on style/wording.

If you're up to add R tests, I think that would be appreciated, but otherwise we can put that in a follow-up JIRA (since there might be other work to bind the new kernels to dplyr features or whatnot).

r/src/compute.cpp Outdated Show resolved Hide resolved
docs/source/cpp/compute.rst Outdated Show resolved Hide resolved
@@ -1135,10 +1210,24 @@ void RegisterScalarTemporal(FunctionRegistry* registry) {
"iso_year", {WithDates, WithTimestamps}, int64(), &iso_year_doc);
DCHECK_OK(registry->AddFunction(std::move(iso_year)));

auto iso_week = MakeTemporal<ISOWeek, TemporalComponentExtract, Int64Type>(
"iso_week", {WithDates, WithTimestamps}, int64(), &iso_week_doc);
static const auto default_iso_week_options = WeekOptions(true, false, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, can we put comments for the parameter literals? Or actually, it might be nice to define WeekOptions::IsoDefaults and WeekOptions::UsDefaults which could then be used here and in the R bindings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added WeekOptions::ISODefaults and WeekOptions::USDefaults.

/// \note API not yet finalized
ARROW_EXPORT Result<Datum> ISOWeek(const Datum& values, ExecContext* ctx = NULLPTR);
ARROW_EXPORT Result<Datum> Week(const Datum& values, WeekOptions options = WeekOptions(),
ExecContext* ctx = NULLPTR);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add a helper for USWeek?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to be missing its counterpart in api_scalar.cc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

cpp/src/arrow/compute/kernels/scalar_temporal.cc Outdated Show resolved Hide resolved
@rok rok force-pushed the ARROW-13561 branch 3 times, most recently from 1d4a49c to 4f68c1c Compare September 23, 2021 14:49
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM.

@rok
Copy link
Member Author

rok commented Sep 23, 2021

Thanks for the review @lidavidm ! :)

@rok rok requested a review from pitrou September 23, 2021 16:04
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased, will merge

@pitrou pitrou closed this in af9a50a Sep 27, 2021
@rok
Copy link
Member Author

rok commented Sep 27, 2021

ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
This is to resolve [ARROW-13561](https://issues.apache.org/jira/browse/ARROW-13561`).

Closes apache#11026 from rok/ARROW-13561

Lead-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants