Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13628: [Format][C++][Java] Add MONTH_DAY_NANO interval type #10177

Closed
wants to merge 30 commits into from

Conversation

emkornfield
Copy link
Contributor

@emkornfield emkornfield commented Apr 27, 2021

Trying to formalize mailing list discussion

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good start. Can you nix the submodule update?

format/Schema.fbs Outdated Show resolved Hide resolved
format/Schema.fbs Outdated Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @emkornfield

I did some more digging around and one thing I noticed is that the INTERVAL_TYPE defined in parquet reference uses a similar construction with three fields

It uses 4 bytes to represent month, 4 bytes to represent days, and 4 bytes to represent milliseconds rather than 8 bytes to represent nanoseconds.

I know I was lobbying for nanosecond precision intervals, but given postgres and parquet both use millisecond precision, perhaps it would be best if Arrow followed the same model?

@alamb
Copy link
Contributor

alamb commented Apr 28, 2021

@ovr and @nevi-me I wonder if you have any thoughts / opinions

@emkornfield
Copy link
Contributor Author

I know I was lobbying for nanosecond precision intervals, but given postgres and parquet both use millisecond precision, perhaps it would be best if Arrow followed the same model?

I thought postgres used Microseconds not Milliseconds. This is why the postgres struct there was 8 bytes for seconds/subseconds. I actually liked the argument for having the finest grained available data available of other Arrow temporal types. I'm not sure any alignment is going to be perfect. Using this type in language specific concepts is already going to require a special type (e.g. there is nothing that seems to include all the fields together in the survey).

@alamb
Copy link
Contributor

alamb commented Apr 28, 2021

I actually liked the argument for having the finest grained available data available of other Arrow temporal types. I'm not sure any alignment is going to be perfect.

Makes sense to me

@westonpace
Copy link
Member

I believe this would also imply a minor bump to the Arrow columnar format version? (1.0.0 -> 1.1.0)? From a quick glance it seems forwards compatibility should be ok (I think you'll get Status::NotImplemented("Unrecognized interval type."))

@emkornfield
Copy link
Contributor Author

I believe this would also imply a minor bump to the Arrow columnar format version? (1.0.0 -> 1.1.0)? From a quick glance it seems forwards compatibility should be ok (I think you'll get Status::NotImplemented("Unrecognized interval type."))

Yes I added a comment above with version history. We should discuss on the ML, but I think it makes sense to release as soon as changes are accepted for the format.

jorgecarleitao
jorgecarleitao previously approved these changes Apr 29, 2021
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks a lot, @emkornfield !

format/Schema.fbs Outdated Show resolved Hide resolved
format/Schema.fbs Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

The Parquet reference includes this paragraph:

Each component in this representation is independent of the others. For example, there is no requirement that a large number of days should be expressed as a mix of months and days because there is not a constant conversion from days to months.

Should something similar be added to clarify that eg the nanoseconds field is not limited to 1 day?

@emkornfield
Copy link
Contributor Author

@jorisvandenbossche good point. Added some text, let me know if it makes sense.

@jorisvandenbossche
Copy link
Member

Looks good!

format/Schema.fbs Outdated Show resolved Hide resolved
format/Schema.fbs Outdated Show resolved Hide resolved
format/Schema.fbs Outdated Show resolved Hide resolved
format/Schema.fbs Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Jun 1, 2021

Was this voted on already? I forget.

@emkornfield
Copy link
Contributor Author

emkornfield commented Jun 1, 2021 via email

@pitrou
Copy link
Member

pitrou commented Jul 27, 2021

Are you planning to write the C++ and Java implementations @emkornfield ?

@emkornfield
Copy link
Contributor Author

Yes still planning on it. Hopefully will have some time this week.

@emkornfield
Copy link
Contributor Author

@liyafan82 @pitrou thank you for the reviews, I think I addressed the relevant comments.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one more comment. Thank you for the updates!


void rand_day_millis(int64_t N, std::vector<DayTimeIntervalType::DayMilliseconds>* out) {
const int random_seed = 0;
std::default_random_engine gen(random_seed);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're using pcg32_fast in RandomArrayGenerator, use it here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

PeriodDuration pd1 = new PeriodDuration(Period.of(1, 2, 3), Duration.ofNanos(123));
PeriodDuration pdEq1 = new PeriodDuration(Period.of(1, 2, 3), Duration.ofNanos(123));
PeriodDuration pd2 = new PeriodDuration(Period.of(1, 2, 3), Duration.ofNanos(12));
PeriodDuration pd3 = new PeriodDuration(Period.of(1, 2, 0), Duration.ofNanos(123));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we need some cases for which the month/day/nano are negative?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@liyafan82
Copy link
Contributor

@liyafan82 @pitrou thank you for the reviews, I think I addressed the relevant comments.

Thanks for your update.
Had another pass today. Mostly looks good to me, with one additional minor comment.

@pitrou
Copy link
Member

pitrou commented Aug 24, 2021

Are the Rust integration failures expected? the CI builds are green on git master.

@emkornfield
Copy link
Contributor Author

Are the Rust integration failures expected? the CI builds are green on git master.

@jorgecarleitao investigated and did not think it was related to this change but rather a latent bug.

@emkornfield
Copy link
Contributor Author

@pitrou @liyafan82 any more feedback? Otherwise I'd like to get this merged so I can start on the follow-up items from the vote (I'll merge end of week unless there are more comments or we have more concerns about the integration tests)

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you

@liyafan82
Copy link
Contributor

The Java changes LGTM. Good job! @emkornfield

@emkornfield
Copy link
Contributor Author

Thanks for the reviews. merging.

zeroshade pushed a commit to zeroshade/arrow that referenced this pull request Sep 12, 2021
Trying to formalize [mailing list discussion](https://lists.apache.org/thread.html/rd919c4ed8ad2f2827a2d4f665d8da99e545ba92ef992b2e557831751%40%3Cdev.arrow.apache.org%3E)

Closes apache#10177 from emkornfield/interval

Lead-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: emkornfield <micahk@google.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
asfgit pushed a commit that referenced this pull request Oct 13, 2021
#10177 Added the interval type Month, Day, Nano to the flatbuffer and C++/Java, this updates the flatbuffer generated files and adds support for the type to Go.

Closes #11310 from zeroshade/arrow-13804-month-day-nano

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants