Add MONTH_DAY_NANO interval type, impl ArrowNativeType for i128 #779

Merged
merged 20 commits into from
Dec 20, 2021

Contributor

@b41sh b41sh commented Sep 15, 2021

Which issue does this PR close?

Closes #732

Rationale for this change

What changes are included in this PR?

Add MONTH_DAY_NANO interval type

Are there any user-facing changes?

@github-actions github-actions bot added the arrow (Changes to the arrow crate) and parquet (Changes to the parquet crate) labels Sep 15, 2021
@codecov-commenter

codecov-commenter commented Sep 15, 2021

Codecov Report

Merging #779 (841321a) into master (72c9d1e) will decrease coverage by 0.07%.
The diff coverage is 46.23%.

@@            Coverage Diff             @@
##           master     #779      +/-   ##
==========================================
- Coverage   82.31%   82.23%   -0.08%     
==========================================
  Files         168      168              
  Lines       49031    49116      +85     
==========================================
+ Hits        40360    40392      +32     
- Misses       8671     8724      +53     
Impacted Files Coverage Δ
arrow/src/array/builder.rs 86.54% <0.00%> (-0.11%) ⬇️
arrow/src/array/data.rs 80.85% <0.00%> (-0.32%) ⬇️
arrow/src/array/equal/mod.rs 93.13% <0.00%> (-0.17%) ⬇️
arrow/src/array/ord.rs 67.15% <0.00%> (-0.50%) ⬇️
arrow/src/array/transform/mod.rs 84.86% <0.00%> (-0.38%) ⬇️
arrow/src/compute/kernels/sort.rs 94.81% <0.00%> (-0.14%) ⬇️
arrow/src/datatypes/types.rs 88.88% <ø> (ø)
arrow/src/ipc/gen/Schema.rs 41.60% <ø> (ø)
arrow/src/ipc/reader.rs 85.74% <ø> (ø)
arrow/src/util/display.rs 19.81% <0.00%> (-0.19%) ⬇️
... and 16 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@alamb alamb left a comment

Thank you @b41sh !

Thank you for the contribution @b41sh !

I took a brief look at this code and it looks quite good. @jorgecarleitao / @nevi-me what do you think?

cc @ovr

Ok(builder.finish())
}
}

Contributor

I wonder if it would be possible to implement some tests for this logic (showing, for example, that round-tripping from IntervalMonthDayNanoArray to FixedLenByteArray and back produces an equivalent array)?

Contributor Author

ok, I will add some tests for it.

Contributor Author

IntervalMonthDayNanoArray can't be converted to the Parquet format, because the Parquet Interval type is a fixed 12 bytes while IntervalMonthDayNano is 16 bytes. I have deleted the Parquet-related code.
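
To make the size mismatch concrete, here is a minimal sketch. Parquet's INTERVAL logical type is a 12-byte FIXED_LEN_BYTE_ARRAY holding three little-endian unsigned 32-bit integers (months, days, milliseconds), while MONTH_DAY_NANO needs 4 + 4 + 8 bytes. The constant and function names below are hypothetical and are not part of the arrow or parquet crates.

```rust
// Hypothetical names for illustration; not arrow-rs or parquet API.

/// Parquet INTERVAL: three little-endian u32 values (months, days, millis).
const PARQUET_INTERVAL_BYTES: usize = 3 * std::mem::size_of::<u32>();

/// Arrow MONTH_DAY_NANO: i32 months, i32 days, i64 nanoseconds.
const MONTH_DAY_NANO_BYTES: usize =
    2 * std::mem::size_of::<i32>() + std::mem::size_of::<i64>();

/// Try to encode a month/day/nano interval in the 12-byte Parquet layout.
/// Returns None when the value cannot be represented: a negative field,
/// sub-millisecond precision, or a millisecond count that overflows u32.
fn to_parquet_interval(months: i32, days: i32, nanos: i64) -> Option<[u8; 12]> {
    if months < 0 || days < 0 || nanos < 0 || nanos % 1_000_000 != 0 {
        return None;
    }
    let millis = nanos / 1_000_000;
    if millis > u32::MAX as i64 {
        return None;
    }
    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&(months as u32).to_le_bytes());
    out[4..8].copy_from_slice(&(days as u32).to_le_bytes());
    out[8..12].copy_from_slice(&(millis as u32).to_le_bytes());
    Some(out)
}
```

Under these assumptions, `to_parquet_interval(1, 2, 3_000_000)` yields an encoding, while `to_parquet_interval(1, 2, 3)` returns `None` because 3 nanoseconds is not a whole number of milliseconds. A general conversion would therefore be lossy.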

@b41sh b41sh marked this pull request as ready for review September 18, 2021 13:49
Contributor

@alamb alamb left a comment

Thank you @b41sh and sorry for the delay in review.

I went through this PR carefully, and I think it adds the basic plumbing for MONTH_DAY_NANO -- while there is more to be done, I think it is a good step forward.

Note that since this adds new values to the IntervalUnit enums it is not "backwards compatible" in the semver sense and thus will have to wait for the next major release (arrow-6.0)
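
As a rough illustration of why a new enum variant is a breaking change, consider downstream code that matches on the enum without a wildcard arm. The enum below is a local stand-in defined only for this sketch (it is not imported from arrow), and `byte_width` is a hypothetical helper.

```rust
// Local stand-in for arrow's IntervalUnit after this PR, so the sketch
// compiles on its own. Not the real arrow definition.
#[derive(Debug, Clone, Copy, PartialEq)]
enum IntervalUnit {
    YearMonth,
    DayTime,
    MonthDayNano, // the variant added by this PR
}

/// Hypothetical downstream helper. Before the PR it would have listed only
/// the first two arms; with no wildcard arm, such a match stops compiling
/// (non-exhaustive patterns) as soon as `MonthDayNano` is added.
fn byte_width(unit: IntervalUnit) -> usize {
    match unit {
        IntervalUnit::YearMonth => 4,     // backed by i32
        IntervalUnit::DayTime => 8,       // backed by i64
        IntervalUnit::MonthDayNano => 16, // backed by i128
    }
}
```

Because existing exhaustive matches in user code would fail to build after upgrading, the change has to ship in a major release.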

@@ -38,7 +38,7 @@ path = "src/lib.rs"
[dependencies]
serde = { version = "1.0", features = ["rc"] }
serde_derive = "1.0"
- serde_json = { version = "1.0", features = ["preserve_order"] }
+ serde_json = { version = "1.0", features = ["preserve_order", "arbitrary_precision"] }
Contributor

There is often much concern about adding new dependencies to arrow - however, this feature does not seem to add any new dependencies: https://github.com/serde-rs/json/blob/master/Cargo.toml#L74

Contributor Author

We need this feature because we need to deserialize i128 numbers
https://github.com/serde-rs/json/blob/master/src/number.rs#L534

@@ -624,6 +625,22 @@ mod tests {
assert!(arr.is_null(1));
assert_eq!(-5, arr.value(2));
assert_eq!(-5, arr.values()[2]);

// a month_day_nano interval contains months, days and nanoseconds,
// but we do not yet have accessors for the values
Contributor

Is it worth adding some ticket explaining what type of accessors would be valuable? Namely month, day and nanos?

Contributor Author

I added some TODO comments, PTAL

arrow/src/datatypes/datatype.rs (outdated, resolved)
arrow/src/datatypes/native.rs (outdated, resolved)
} else {
let value: u128 = array.value($row) as u128;

let months_part: i32 =
Contributor

it would be nice to break this logic out into accessors, as you say, rather than have the field extraction in the stringification. But that could be done as a follow on PR I think
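
A minimal sketch of what such accessors might look like. The function names are hypothetical, and the bit layout (months in the top 32 bits, days in the next 32, nanoseconds in the low 64) is an assumption made for this example rather than something confirmed in this thread.

```rust
// Hypothetical free functions; the packed layout is an assumption:
// bits 96..128 = months, bits 64..96 = days, bits 0..64 = nanoseconds.

/// Pack the three fields into a single i128 under the assumed layout.
fn pack(months: i32, days: i32, nanos: i64) -> i128 {
    let m = (months as u32 as u128) << 96;
    let d = (days as u32 as u128) << 64;
    let n = nanos as u64 as u128;
    (m | d | n) as i128
}

/// Months field: shift the top 32 bits down, then truncate to i32.
fn months(value: i128) -> i32 {
    (value >> 96) as i32
}

/// Days field: shift bits 64..96 down, then truncate to i32.
fn days(value: i128) -> i32 {
    (value >> 64) as i32
}

/// Nanoseconds field: truncating to i64 keeps the low 64 bits.
fn nanos(value: i128) -> i64 {
    value as i64
}
```

With helpers like these, the stringification code could call `months(v)`, `days(v)` and `nanos(v)` instead of repeating the masks and shifts inline. The truncating casts also preserve negative field values.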

parquet/src/arrow/arrow_writer.rs (outdated, resolved)
@alamb
Contributor

alamb commented Sep 22, 2021

I think it looks good to go. @jorgecarleitao / @nevi-me / @houqp do you have any interest in reviewing this PR?

Member

@jorgecarleitao jorgecarleitao left a comment

Great PR, @b41sh , thanks a lot for that!

Two suggestions:

  • AFAIK we must activate the IPC integration test before merging to guarantee compatibility with the ecosystem, and the test should pass. This is done by removing this line.
  • There is an alternative approach to this in which we use [i32, i32, i64] instead of i128 as the backing container. It makes the API simpler.

AFAIK the first item is a "must" within the arrow project.
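
For the second suggestion, a struct-backed container might look roughly like the sketch below. The type and method names are hypothetical and are not the arrow2 or arrow-rs definitions; the point is only that named fields remove the need for bit manipulation.

```rust
// Hypothetical type sketching the "[i32, i32, i64]" alternative.
// #[repr(C)] keeps the fields in declaration order: 4 + 4 + 8 bytes.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct MonthsDaysNanos {
    months: i32,
    days: i32,
    nanos: i64,
}

impl MonthsDaysNanos {
    fn new(months: i32, days: i32, nanos: i64) -> Self {
        Self { months, days, nanos }
    }

    /// Field-wise addition; returns None if any field overflows.
    fn checked_add(self, other: Self) -> Option<Self> {
        Some(Self {
            months: self.months.checked_add(other.months)?,
            days: self.days.checked_add(other.days)?,
            nanos: self.nanos.checked_add(other.nanos)?,
        })
    }
}
```

The struct occupies 16 bytes, the same as an i128, but each field is read directly and arithmetic never carries from one field into another.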

houqp previously requested changes Sep 23, 2021
Member

@houqp houqp left a comment

OK, reverting my approval based on @jorgecarleitao 's comment :) At the very minimum, we should enable the integration test.

@jorgecarleitao
Member

A trick I use in arrow2 for this is to have a git patch and apply it to arrow during execution here. An alternative is to create a PR in apache/arrow and point to that PR on our CI here. Once tests pass here, merge this PR without that change here and afterwards the PR in apache/arrow.

@b41sh
Contributor Author

b41sh commented Sep 27, 2021

> A trick I use in arrow2 for this is to have a git patch and apply it to arrow during execution here. An alternative is to create a PR in apache/arrow and point to that PR on our CI here. Once tests pass here, merge this PR without that change here and afterwards the PR in apache/arrow.

! [remote rejected] interval-MonthDayNano -> interval-MonthDayNano (refusing to allow a Personal Access Token to create or update workflow .github/workflows/integration.yml without workflow scope)

I don't have permission to modify the files in the workflow directory, can you help me start the integration test? @houqp @jorgecarleitao

@ovr
Contributor

ovr commented Dec 2, 2021

Any news on it?

@alamb
Contributor

alamb commented Dec 3, 2021

I don't have any news

@ovr
Contributor

ovr commented Dec 3, 2021

> I don't have permission to modify the files in the workflow directory, can you help me start the integration test? @houqp @jorgecarleitao

Just a friendly ping. Can anyone help @b41sh to finish this PR?

@jorgecarleitao
Member

uhm, I can't understand the error message: you are pushing to your own fork of the repo, so you have full control over its CI (it is under your name, right?). Could it be that you are pushing via https and using a personal token that does not have the workflow scope active (on github)?

@b41sh
Contributor Author

b41sh commented Dec 4, 2021

Sorry for the long delay. The problem was caused by my GitHub SSH settings, and I have fixed it.

@alamb alamb changed the title Add MONTH_DAY_NANO interval type Add MONTH_DAY_NANO interval type, impl ArrowNativeType for i128 Dec 6, 2021
Contributor

@alamb alamb left a comment

I verified that the integration test is running and using Rust. Nice work @b41sh !

https://github.com/apache/arrow-rs/runs/4430618418?check_suite_focus=true

2021-12-06T13:25:33.8748227Z ##########################################################
2021-12-06T13:25:33.8749321Z IPC: C++ producing, Rust consuming
2021-12-06T13:25:33.8749914Z ##########################################################
...
2021-12-06T13:25:34.1225215Z ==========================================================
2021-12-06T13:25:34.1226543Z Testing file /tmp/arrow-integration-u5ctqfpn/generated_interval_mdn.json
2021-12-06T13:25:34.1227227Z ==========================================================
2021-12-06T13:25:34.1227755Z -- Creating binary inputs
2021-12-06T13:25:34.1228284Z -- Validating file
2021-12-06T13:25:34.1228787Z -- Validating stream
2021-12-06T13:25:34.1231126Z ==========================================================
...

I also re-reviewed the code. Thank you for pushing it through

I think we can make improvements as follow-ons (e.g. adding the accessors for the month, day, and nano fields)

.skip_category('C#')
- .skip_category('JS')
- .skip_category('Rust'),
+ .skip_category('JS'),
Contributor

🎉

so is the intent that after we merge this PR into arrow-rs we would then go upstream and apply this patch in the arrow repo?

Contributor Author

yes, once this PR is merged, we can merge #11238 in the arrow repo.

integration-testing/src/lib.rs (outdated, resolved)

// a month_day_nano interval contains months, days and nanoseconds,
// but we do not yet have accessors for the values.
// TODO: implement month, day, and nanos access method for month_day_nano.
Contributor

👍 When this PR is merged, I will try and file a ticket for adding these accessors -- I think it would be a fairly good "first PR" type change for new contributors

@alamb
Contributor

alamb commented Dec 7, 2021

@houqp and @jorgecarleitao are you interested in reviewing / re-reviewing this PR?

@liukun4515
Contributor

It's great to implement i128 as ArrowNativeType.
@b41sh

Member

@jorgecarleitao jorgecarleitao left a comment

Thanks a lot for the PR and perseverance, looks great!

I left a couple of comments where I think there are semantic issues, but if this is blocking someone, go ahead.

@@ -243,6 +243,11 @@ pub fn sort_to_indices(
DataType::Interval(IntervalUnit::DayTime) => {
sort_primitive::<IntervalDayTimeType, _>(values, v, n, cmp, &options, limit)
}
DataType::Interval(IntervalUnit::MonthDayNano) => {
sort_primitive::<IntervalMonthDayNanoType, _>(
Member

The i128 order relationship does not hold for months,days,nanos. AFAIK month,days,nanos do not have a partial order relationship.

Contributor Author

Can I write a function like sort_month_day_nanos for it?

Member

I think that in general time intervals do not have a natural order without an associated datetime pinning them to a specific timeline. The conversion months,days,nanos -> seconds is lossy because one day is not always 24 hours (some days are 23 and others 25).
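
Two small illustrations of the ordering problem, as a sketch only. The packed layout (months in the top 32 bits, days in the next 32, nanoseconds in the low 64) is an assumption, and all function names are hypothetical.

```rust
// Hypothetical helpers for illustration only. The packed layout is an
// assumption: bits 96..128 = months, 64..96 = days, 0..64 = nanoseconds.

/// Pack the three fields into one i128 under the assumed layout.
fn pack(months: i32, days: i32, nanos: i64) -> i128 {
    let m = (months as u32 as u128) << 96;
    let d = (days as u32 as u128) << 64;
    let n = nanos as u64 as u128;
    (m | d | n) as i128
}

/// Number of days in a Gregorian calendar month (month is 1..=12).
fn days_in_month(year: i32, month: u32) -> u32 {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 => {
            let leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
            if leap { 29 } else { 28 }
        }
        _ => panic!("month must be in 1..=12"),
    }
}

/// True when "1 month", starting on the first day of the given month,
/// spans more days than a fixed "30 days" interval.
fn one_month_exceeds_30_days(year: i32, month: u32) -> bool {
    days_in_month(year, month) > 30
}
```

Under the assumed layout, `pack(0, -1, 0) > pack(0, 1, 0)` holds, so sorting by the raw i128 would place "minus one day" after "plus one day". Separately, `one_month_exceeds_30_days(2021, 1)` is true while `one_month_exceeds_30_days(2021, 2)` is false, so whether "1 month" is longer than "30 days" depends on the anchor date.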

@@ -348,6 +348,7 @@ make_numeric_type!(Time64MicrosecondType, i64, i64x8, m64x8);
make_numeric_type!(Time64NanosecondType, i64, i64x8, m64x8);
make_numeric_type!(IntervalYearMonthType, i32, i32x16, m32x16);
make_numeric_type!(IntervalDayTimeType, i64, i64x8, m64x8);
make_numeric_type!(IntervalMonthDayNanoType, i128, i128x4, m128x4);
Member

Semantically, the numerics of an i128 are not the same as the numerics of (months,days,nanos) since i128 + i128 != (months, days,nanos) + (months, days,nanos).

The consequence of defining this type numerically here is that arithmetic kernels will accept this type, but they will yield a semantically incorrect result (e.g. i128 + i128 to sum two intervals of 1 month each).
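
One concrete way packed arithmetic can go wrong, as a sketch under an assumed layout (months in the top 32 bits, days in the next 32, nanoseconds in the low 64). The helper names are hypothetical. Adding the raw i128 values lets a carry out of the nanoseconds field spill into the days field.

```rust
// Hypothetical helpers; the packed layout is an assumption:
// bits 96..128 = months, 64..96 = days, 0..64 = nanoseconds.

/// Pack the three fields into one i128 under the assumed layout.
fn pack(months: i32, days: i32, nanos: i64) -> i128 {
    let m = (months as u32 as u128) << 96;
    let d = (days as u32 as u128) << 64;
    let n = nanos as u64 as u128;
    (m | d | n) as i128
}

/// Split a packed value back into (months, days, nanos).
fn unpack(value: i128) -> (i32, i32, i64) {
    ((value >> 96) as i32, (value >> 64) as i32, value as i64)
}

/// Add two intervals by adding their raw i128 representations,
/// which is what a generic numeric kernel would do.
fn add_packed(a: i128, b: i128) -> (i32, i32, i64) {
    unpack(a.wrapping_add(b))
}

/// Add two intervals field by field (wrapping on overflow for brevity).
fn add_fieldwise(a: i128, b: i128) -> (i32, i32, i64) {
    let (am, ad, an) = unpack(a);
    let (bm, bd, bn) = unpack(b);
    (am.wrapping_add(bm), ad.wrapping_add(bd), an.wrapping_add(bn))
}
```

For example, adding "minus one nanosecond" to "plus one nanosecond" field by field gives `(0, 0, 0)`, but `add_packed(pack(0, 0, -1), pack(0, 0, 1))` gives `(0, 1, 0)`: a spurious extra day.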

Contributor Author

Do you mean we can't use IntervalMonthDayNanoType as a numeric type? But IntervalDayTimeType is also a numeric type here, the sum of two IntervalDayTimeType will also produce an incorrect result.

Member

Yeap, that seems a bug to me.

@alamb alamb dismissed houqp’s stale review December 8, 2021 19:26

Integration test has been completed

@alamb
Contributor

alamb commented Dec 8, 2021

For this PR, I propose:

  1. We file follow on tickets for the bugs @jorgecarleitao identified above (I can do so)
  2. We merge this PR as is and work on those tickets as we have time.

Thoughts?

@b41sh
Contributor Author

b41sh commented Dec 9, 2021

> For this PR, I propose:
>
>   1. We file follow on tickets for the bugs @jorgecarleitao identified above (I can do so)
>   2. We merge this PR as is and work on those tickets as we have time.
>
> Thoughts?

Can I delete the integration-testing/unskip.patch file? It was only meant to be temporary, and now that the test has passed we can use the arrow repo test directly later.

@b41sh
Contributor Author

b41sh commented Dec 9, 2021

thread 'rustc' panicked at 'assertion failed: sentinel == STR_SENTINEL', /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/compiler/rustc_serialize/src/opaque.rs:669:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

error: internal compiler error: unexpected panic

note: the compiler unexpectedly panicked. this is a bug.

note: we would appreciate a bug report: https://github.com/rust-lang/rust/issues/new?labels=C-bug%2C+I-ICE%2C+T-compiler&template=ice.md

note: rustc 1.59.0-nightly (e6b883c74 2021-12-08) running on x86_64-unknown-linux-gnu

note: compiler flags: -C embed-bitcode=no -C debuginfo=2 -C debuginfo=1 --crate-type lib

note: some of the compiler flags provided by cargo are hidden

query stack during panic:
end of query stack
error: could not compile `flatbuffers`
warning: build failed, waiting for other jobs to finish...
error: build failed
Error: Process completed with exit code 101.

It looks like a Rust compiler bug: rust-lang/rust#91663

@alamb
Contributor

alamb commented Dec 10, 2021

Filed #1022 to track CI failure in "nightly" builds

@alamb
Contributor

alamb commented Dec 10, 2021

The rust compiler thing is fixed in #1023 -- I'll try and merge to this PR

@alamb alamb merged commit 8c17cff into apache:master Dec 20, 2021
@alamb
Contributor

alamb commented Dec 20, 2021

Thanks again @b41sh -- sorry for the delay in merging

@alamb
Contributor

alamb commented Dec 20, 2021

Follow on tickets: #1065 and #1066 and #1067

pitrou pushed a commit to apache/arrow that referenced this pull request Jan 4, 2022
…test for rust

arrow-rs has added `MONTH_DAY_NANO` interval type in PR [#779](apache/arrow-rs#779), we need to enable integration tests for it.

Closes #11238 from b41sh/rust-month_day_nano_interval

Lead-authored-by: b41sh <baishen2009@gmail.com>
Co-authored-by: baishen <baishen2009@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Labels: api-change (Changes to the arrow API), arrow (Changes to the arrow crate), parquet (Changes to the parquet crate)
Linked issue: Add Support for Interval Type Month, Day, Nanos
7 participants