-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10924: [C++] Validate temporal data in ValidateArrayFull #12014
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There are quite a few failing tests. We'll have to go through each of them individually to decide what to do about them.
You will want to lint and format the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates.
You'll want to lint and format again (see the CI failure: https://github.com/apache/arrow/runs/4612596806?check_suite_focus=true)
Looking at test failures:
- Random data generators need to be updated (also see these failures and these failures and there's more further down in the log)
- bridge_test.cc needs updating
- This constant also needs updating
@JabariBooker Thanks for working on this! I left some minor comments. |
@edponce There are no limits on |
@pitrou So there is a limit for |
Date32Type's unit is days and so there's nothing to check, but Date64Type's unit is milliseconds, hence it doesn't have a limit per se, but it must be a multiple of 86400000 (=1 day in milliseconds). |
FYI, that test failure in
The test uses one set of data to go through several data types, so you could just update the hardcoded data there too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM. I left some nits (we can check DataType::id() instead of the string name). And thanks for going through all the function tests; we should perhaps consider removing Date64 from the list of types to check in order to just split off those tests separately given the restrictions on value. (EDIT: not in this PR of course.)
SCOPED_TRACE(ty->ToString()); | ||
auto in_schema = schema({field("argument0", ty), field("key", int64())}); | ||
auto table = | ||
TableFromJSON(in_schema, (ty->name() == "date64") ? date64_table : default_table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can check ty->id() == Type::DATE64
[{"min": 3, "max": 5}, 4], | ||
[{"min": 1, "max": 4}, null] | ||
])"), | ||
(ty->name() == "date64") ? date64_expected : default_expected), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ditto here
// Sliced | ||
CheckUnique(ArrayFromJSON(type, "[1, 2, null, 3, 2, null]")->Slice(1, 4), | ||
ArrayFromJSON(type, "[2, null, 3]")); | ||
if (type->name() == "date64") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here and so on (sorry)
cpp/src/arrow/testing/random.cc
Outdated
} | ||
|
||
case Type::type::TIME32: { | ||
TimeUnit::type unit = std::dynamic_pointer_cast<Time32Type>(field.type())->unit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we have internal::checked_[pointer_]cast
which will use dynamic or static cast depending on debug/release build
@@ -182,6 +188,9 @@ static std::shared_ptr<NumericArray<ArrowType>> GenerateNumericArray(int64_t siz | |||
|
|||
buffers[1] = *AllocateBuffer(sizeof(CType) * size); | |||
options.GenerateData(buffers[1]->mutable_data(), size); | |||
if (std::is_same<ArrowType, Date64Type>::value) { | |||
GenerateFullDayMillisNoNan(buffers[1]->mutable_data(), size); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be an if-else. buffers[1]->mutable_data()
is modified twice if Date64Type
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a more suitable place for this type-specific generation of numeric arrays, is in GenerateTypeDataNoNan()
from GenerateOptions
. Make GenerateFullDayMillisNoNan
as a private member to GenerateOptions
and perform the if-else
check in GenerateTypedDataNoNan()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intentional. Since we need to generate multiples of 86400000 for Date64Type
, random numbers are generated in GenerateData
and ultimately multiplied by 86400000 in GenerateFullDayMillisNoNan
. The behavior of GenerateFullDayMillisNoNan
is separated from GenerateOptions
since this class doesn't know the arrow type, only the underlying C type. However, behavior of GenerateFullDayMillisNoNan
could be moved into the above if
statement if needed.
@JabariBooker Integration data is generated in Python by Archery here: https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L196 You can find more info about the JSON integration format here, and how to run the integration tests locally, here: https://arrow.apache.org/docs/format/Integration.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, just a minor nit on the datagen (which we probably won't run into, but…)
return super().generate_range(size, lower, upper, name) | ||
|
||
full_day_millis = 1000 * 60 * 60 * 24 | ||
lower //= full_day_millis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python appears to round towards -Inf so a lower negative bound that is not exactly even will get rounded to an out-of-bound value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Benchmark runs are scheduled for baseline = 9b53235 and contender = c715beb. c715beb is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Validate that temporal data in arrays meet the restrictions of their data type specifications. Closes apache#12014 from JabariBooker/ARROW-10924 Lead-authored-by: JabariBooker <o.jabari.booker@gmail.com> Co-authored-by: Jabari Booker <bookeroronde@pop-os.localdomain> Signed-off-by: David Li <li.davidm96@gmail.com>
Validate that temporal data in arrays meet the restrictions of their data type specifications.