Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10924: [C++] Validate temporal data in ValidateArrayFull #12014

Closed
wants to merge 21 commits into from

Conversation

JabariBooker
Copy link
Contributor

Validate that temporal data in arrays meet the restrictions of their data type specifications.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

There are quite a few failing tests. We'll have to go through each of them individually to decide what to do about them.

You will want to lint and format the code.

cpp/src/arrow/array/validate.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/validate.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/validate.cc Outdated Show resolved Hide resolved
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates.

You'll want to lint and format again (see the CI failure: https://github.com/apache/arrow/runs/4612596806?check_suite_focus=true)

Looking at test failures:

cpp/src/arrow/array/validate.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/validate.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/validate.cc Show resolved Hide resolved
cpp/src/arrow/array/validate.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/validate.cc Outdated Show resolved Hide resolved
@edponce
Copy link
Contributor

edponce commented Dec 29, 2021

@JabariBooker Thanks for working on this! I left some minor comments.
Also, we are missing validation for the following temporal types: Date32Type and TimestampType.

@pitrou
Copy link
Member

pitrou commented Dec 30, 2021

@edponce There are no limits on Date32Type and TimestampType AFAICT.

@edponce
Copy link
Contributor

edponce commented Jan 6, 2022

@pitrou So there is a limit for Date64Type but not for Date32Type? I now understand that TimestampType does not have limits.

@lidavidm
Copy link
Member

lidavidm commented Jan 6, 2022

Date32Type's unit is days and so there's nothing to check, but Date64Type's unit is milliseconds, hence it doesn't have a limit per se, but it must be a multiple of 86400000 (=1 day in milliseconds).

@lidavidm
Copy link
Member

FYI, that test failure in GroupBy.MinMaxTypes is here:

TEST(GroupBy, MinMaxTypes) {

The test uses one set of data to go through several data types, so you could just update the hardcoded data there too.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM. I left some nits (we can check DataType::id() instead of the string name). And thanks for going through all the function tests; we should perhaps consider removing Date64 from the list of types to check in order to just split off those tests separately given the restrictions on value. (EDIT: not in this PR of course.)

SCOPED_TRACE(ty->ToString());
auto in_schema = schema({field("argument0", ty), field("key", int64())});
auto table =
TableFromJSON(in_schema, (ty->name() == "date64") ? date64_table : default_table);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can check ty->id() == Type::DATE64

[{"min": 3, "max": 5}, 4],
[{"min": 1, "max": 4}, null]
])"),
(ty->name() == "date64") ? date64_expected : default_expected),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ditto here

// Sliced
CheckUnique(ArrayFromJSON(type, "[1, 2, null, 3, 2, null]")->Slice(1, 4),
ArrayFromJSON(type, "[2, null, 3]"));
if (type->name() == "date64") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here and so on (sorry)

}

case Type::type::TIME32: {
TimeUnit::type unit = std::dynamic_pointer_cast<Time32Type>(field.type())->unit();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we have internal::checked_[pointer_]cast which will use dynamic or static cast depending on debug/release build

@JabariBooker
Copy link
Contributor Author

@lidavidm @edponce @pitrou There are still failing integration tests (all are also failing due to invalid values for date64). The data involved in these tests seem to be generated by some mechanism I'm unaware of.

@@ -182,6 +188,9 @@ static std::shared_ptr<NumericArray<ArrowType>> GenerateNumericArray(int64_t siz

buffers[1] = *AllocateBuffer(sizeof(CType) * size);
options.GenerateData(buffers[1]->mutable_data(), size);
if (std::is_same<ArrowType, Date64Type>::value) {
GenerateFullDayMillisNoNan(buffers[1]->mutable_data(), size);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an if-else. buffers[1]->mutable_data() is modified twice if Date64Type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more suitable place for this type-specific generation of numeric arrays, is in GenerateTypeDataNoNan() from GenerateOptions. Make GenerateFullDayMillisNoNan as a private member to GenerateOptions and perform the if-else check in GenerateTypedDataNoNan().

Copy link
Contributor Author

@JabariBooker JabariBooker Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional. Since we need to generate multiples of 86400000 for Date64Type, random numbers are generated in GenerateData and ultimately multiplied by 86400000 in GenerateFullDayMillisNoNan. The behavior of GenerateFullDayMillisNoNan is separated from GenerateOptions since this class doesn't know the arrow type, only the underlying C type. However, behavior of GenerateFullDayMillisNoNan could be moved into the above if statement if needed.

cpp/src/arrow/testing/random.cc Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Jan 31, 2022

@JabariBooker Integration data is generated in Python by Archery here: https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L196

You can find more info about the JSON integration format here, and how to run the integration tests locally, here: https://arrow.apache.org/docs/format/Integration.html

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just a minor nit on the datagen (which we probably won't run into, but…)

return super().generate_range(size, lower, upper, name)

full_day_millis = 1000 * 60 * 60 * 24
lower //= full_day_millis
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python appears to round towards -Inf so a lower negative bound that is not exactly even will get rounded to an out-of-bound value

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think we should be good here. This turned out to be quite a bit more involved than it looked at first.

@pitrou @edponce any other comments?

dev/archery/archery/integration/tester_cpp.py Show resolved Hide resolved
@lidavidm lidavidm closed this in c715beb Feb 2, 2022
@ursabot
Copy link

ursabot commented Feb 2, 2022

Benchmark runs are scheduled for baseline = 9b53235 and contender = c715beb. c715beb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.13% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Feb 4, 2022
Validate that temporal data in arrays meet the restrictions of their data type specifications.

Closes apache#12014 from JabariBooker/ARROW-10924

Lead-authored-by: JabariBooker <o.jabari.booker@gmail.com>
Co-authored-by: Jabari Booker <bookeroronde@pop-os.localdomain>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants