Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17289: [C++] Add type category membership checks #13783

Merged
merged 12 commits into from
Aug 17, 2022

Conversation

rtpsw
Copy link
Contributor

@rtpsw rtpsw commented Aug 2, 2022

@github-actions
Copy link

github-actions bot commented Aug 2, 2022

@rtpsw
Copy link
Contributor Author

rtpsw commented Aug 2, 2022

cc @icexelloss, @westonpace

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm extremely lukewarm about this. In any case, the implementation is needlessly inefficient.

@@ -2494,6 +2519,61 @@ const std::vector<std::shared_ptr<DataType>>& PrimitiveTypes() {
return g_primitive_types;
}

bool IsBaseBinaryType(std::shared_ptr<DataType> type) {
std::call_once(static_data_initialized, InitStaticData);
return g_base_binary_types_set.find(type) != g_base_binary_types_set.end();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really an inefficient way to do it instead of simply switching based on the id().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@@ -2151,4 +2151,32 @@ const std::vector<std::shared_ptr<DataType>>& IntervalTypes();
ARROW_EXPORT
const std::vector<std::shared_ptr<DataType>>& PrimitiveTypes();

ARROW_EXPORT
bool IsSignedIntType(std::shared_ptr<DataType>);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards -1 on adding these APIs. There are similar APIs, with different names and inlined, that take a type id. It's not significantly to write IsSignedIntType(type) rather than is_signed_integer(type->id()).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, the signature is inefficient as this is always copying a shared_ptr (implying an atomic reference increment). Instead this could just take a const DataType&.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out the set of *Type() functions does not correspond to the set of is_* ones. This can be really confusing to the user; for example, one could easily incorrectly assume that PrimtiveTypes() corresponds to is_primitive. So, I think some clean up is warranted around this code, while considering backwards compatibility.

I'll soon add a commit here that shows the is_* functions I needed to add to obtain correspondence. However, I don't like the mess this makes of the function names, so I'd appreciate a suggestion here.

@pitrou
Copy link
Member

pitrou commented Aug 3, 2022

It sounds like it would be better to document the existing predicates, rather than add new ones with different names.

@rtpsw
Copy link
Contributor Author

rtpsw commented Aug 3, 2022

It sounds like it would be better to document the existing predicates, rather than add new ones with different names.

The existing set of predicates is both confusing and incomplete, so I'd argue for both documenting and adding predicates. I'm not sure what to do with the names, considering backwards compatibility, though. Suppose we agree that a predicate corresponding to PrimitiveTypes should be added, since is_primitive is not the one, what should be the name of this predicate? There should be some naming convention here; perhaps *Types() should have a corresponding check_*(type) predicate.

@rtpsw
Copy link
Contributor Author

rtpsw commented Aug 4, 2022

I added docs that clarify the meaning of and correspondence between *Types() and Is*Type() functions.

@pitrou
Copy link
Member

pitrou commented Aug 4, 2022

The existing set of predicates is both confusing and incomplete, so I'd argue for both documenting and adding predicates

Hmm... can you point out the causes for confusion? Perhaps there's a way to improve that.

Same for incompleteness: is the set of Type::type predicates incomplete? if so, we should add the missing and desired predicates.

Suppose we agree that a predicate corresponding to PrimitiveTypes should be added, since is_primitive is not the one

Why is is_primitive not the one? It seems like it is returning the desired information, is it not? (is it missing some types perhaps?)

@rtpsw
Copy link
Contributor Author

rtpsw commented Aug 4, 2022

Hmm... can you point out the causes for confusion? Perhaps there's a way to improve that.

Same for incompleteness: is the set of Type::type predicates incomplete? if so, we should add the missing and desired predicates.

Sure. Check out this commit:

  • is_numeric is missing and I added it.
  • is_primitive exists but does not correspond to PrimitiveTypes(), which is confusing. I added is_primitive_like which does, though I'm not sold on the name.
  • I added the missing is_binary, is_string, is_temporal, and is_interval.
  • There are other is_*() primitives that do not correspond to any *Types() function - is_base_binary_like is one example.

So, part of the confusion is the unexpected non-correspondence of is_primitive and part is due to missing and extra predicates with respect to the set of *Types() functions. In this PR, I extended the set of is_*() predicates to cover the correspondences and added Is*Type() predicates that correspond to *Types() functions as expected, while maintaining backwards-compatibility, and I added docs to clarify the meaning and correspondences.

Why is is_primitive not the one? It seems like it is returning the desired information, is it not? (is it missing some types perhaps?)

As compared to PrimitiveTypes() (and is_primitive_like() that I added), is_primitive is missing NA, BINARY, STRING, LARGE_BINARY and LARGE_STRING while it adds DURATION, INTERVAL_MONTHS, INTERVAL_MONTH_DAY_NANO, and INTERVAL_DAY_TIME.

@pitrou
Copy link
Member

pitrou commented Aug 4, 2022

Ok, so there are two different areas here:

  1. complete and/or fix the set of predicates operating on a Type::type parameter
  2. add a separate set of predicates operating on a const DataType& parameter

I think the former (case number 1 above) is the most interesting to solve, because it directly affects readability.

Would you like to defer number 2 to a separate JIRA (and potentially PR, if people agree it's a worthwhile thing to do)?

@pitrou
Copy link
Member

pitrou commented Aug 4, 2022

Also, for the record, functions such as PrimitiveTypes() were added for testing originally, and they were exposed publicly because there didn't seem to be a strong reason against it (the main use case is to be able to write testing loops or parameterized tests e.g. for (const auto& type : PrimitiveTypes()) ...).

@pitrou
Copy link
Member

pitrou commented Aug 4, 2022

Now to answer the problems with the current predicates:

  • +1 to adding is_numeric, is_binary, is_string, is_temporal and is_interval
  • for is_primitive: is "primitive" any type that's not nested nested/compound; whether or Type::NA should be included is a bit contentious, but otherwise all temporals, numerics (including decimals) and binary-likes should probably be included
  • I think is_primitive_like should probably be avoided; granted, we have other is_*_like functions, but they turn out to be confusing, so we should probably not reuse that pattern
  • therefore I would be in favor of changing is_primitive as outlined above (meaning a slight compatibility break unfortunately)

Also note that the is_* predicates can handle parameterized types while the *Types() functions by essence cannot really handle them fully if there is a large or infinite number of parametrizations possible.

Does that make sense?

Also @lidavidm what do you think?

@lidavidm
Copy link
Member

lidavidm commented Aug 4, 2022

Note #13753 removes the *Types() anyways.

I'm -1 on the PR as is, but I think we should add the missing is_*(Type::type id) functions, and I think it's fine to also add overloads that take const DataType&. (This is probably overkill, but those who rely on IDE autocompletion may also appreciate DataType::is_integer etc.)

"primitive" is hard to define and it depends on what the user is after. I might favor a set of more specific predicates like the ones we have for template metaprogramming, e.g. has_c_type (which I think is roughly what the proposed is_primitive is: NA and nested types don't have a C type, but binary-likes do).

@rtpsw
Copy link
Contributor Author

rtpsw commented Aug 4, 2022

OK, so I'll avoid changes around the "primitive" functions and fix this PR to only:

  • add the missing is_numeric, is_binary, is_string, is_temporal and is_interval predicates
  • add overloads (same names) for all is_* predicates that accept const DataType
  • test these predicates

@rtpsw
Copy link
Contributor Author

rtpsw commented Aug 4, 2022

Anyone knows what the errors in the Dev jobs mean?

@rtpsw rtpsw requested a review from pitrou August 4, 2022 18:52
@lidavidm
Copy link
Member

lidavidm commented Aug 4, 2022

I think you can ignore those, some files need to be bumped post-release

///
/// \param[in] type_id the type-id to check
/// \return whether type-id is a primitive-like type one
static inline bool is_primitive_like(Type::type type_id) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would is_not_nested or something be clearer here? "primitive-like" is a little confusing (though, I suppose that then raises the question of why fixed-width binary and decimal are also missing here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to remove this predicate and I'll do so soon. It was only added to correspond to some *Types() function, all of which are going to be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: the existing (pre-PR) code has an is_nested predicate, though it is not the complement of the function discussed here, e.g., due to timestamp and interval types.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! I posted a couple suggestions but this looks good overall.

cpp/src/arrow/type_traits.h Show resolved Hide resolved
cpp/src/arrow/type_traits.h Show resolved Hide resolved
cpp/src/arrow/type_traits.h Show resolved Hide resolved
cpp/src/arrow/type_traits.h Show resolved Hide resolved
cpp/src/arrow/type_traits.h Outdated Show resolved Hide resolved
@rtpsw rtpsw requested a review from pitrou August 10, 2022 06:44
@rtpsw
Copy link
Contributor Author

rtpsw commented Aug 15, 2022

@pitrou, is this good to go?

rtpsw and others added 3 commits August 17, 2022 15:44
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@pitrou
Copy link
Member

pitrou commented Aug 17, 2022

Sorry for the delay @rtpsw . I rebased from master and did a couple of very minor changes.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@pitrou
Copy link
Member

pitrou commented Aug 17, 2022

Will wait for CI now...

@pitrou pitrou merged commit f0688d0 into apache:master Aug 17, 2022
@ursabot
Copy link

ursabot commented Aug 17, 2022

Benchmark runs are scheduled for baseline = cef6894 and contender = f0688d0. f0688d0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️25.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.46% ⬆️0.44%] test-mac-arm
[Failed] ursa-i9-9960x
[Failed] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] f0688d01 ec2-t3-xlarge-us-east-2
[Failed] f0688d01 test-mac-arm
[Failed] f0688d01 ursa-i9-9960x
[Finished] f0688d01 ursa-thinkcentre-m75q
[Finished] cef68940 ec2-t3-xlarge-us-east-2
[Finished] cef68940 test-mac-arm
[Failed] cef68940 ursa-i9-9960x
[Failed] cef68940 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
See https://issues.apache.org/jira/browse/ARROW-17289

Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com>
Co-authored-by: rtpsw <rtpsw@hotmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants