More types in HivePartitionFunction #1466

Closed
wants to merge 14 commits into from

@usurai commented Apr 22, 2022

Add support for more types in HivePartitionFunction:

  • TINYINT
  • SMALLINT
  • INTEGER
  • REAL
  • DOUBLE
  • VARBINARY
  • TIMESTAMP
  • DATE

Fixes #327

@facebook-github-bot added the CLA Signed label Apr 22, 2022
@mbasmanova

@usurai Somehow I missed this. Will take a look soon. In the meantime, would you rebase and update to clear "This branch has conflicts that must be resolved" message?

@mbasmanova left a comment

@usurai Nice change. Some comments and a question below.

velox/connectors/hive/tests/HivePartitionFunctionTest.cpp
}

TEST_F(HivePartitionFunctionTest, Timestamp) {
// TODO Fix flatVectorNullable to set Timestamp.

Contributor

Would you like to take care of this TODO in a separate PR?

Contributor Author

Sure, will do later.

Contributor Author

Hi @mbasmanova, it seems #778 has already fixed flatVectorNullable. I changed the varchar and timestamp tests to use the initializer_list form of the API, and it works. I created a PR for this. Thanks.

static_assert(sizeof(float) == sizeof(uint32_t));
auto f = [](float value) {
uint32_t ret;
memcpy(&ret, &value, sizeof ret);

Contributor

Is it necessary to copy data? Can you use reinterpret_cast instead?

Contributor Author

I did some searching before making this change. It seems that although casting a float reference / pointer to uint32_t with reinterpret_cast works in practice, reading through the result is undefined behavior.
We might be able to use std::bit_cast once Velox supports C++20.

Reference:

Contributor

Would you explain a bit more? memcpy is just copying byte-by-byte, it is not type aware, right? Hence, the result would be the same as reinterpret_cast, no?

Contributor Author

@usurai May 12, 2022

Would you explain a bit more?

float f = 42.0;
uint32_t u = *reinterpret_cast<uint32_t*>(&f);

In practice, this works under some conditions [1]. Under the hood, reinterpret_cast<uint32_t*>(&f) returns a pointer of type uint32_t* that actually points to an object of type float; this is called type punning. As both types are the same size, we get exactly the same result as we would with a bit-wise copy.
The issue is that the C++ standard treats accessing an object of one type through a pointer to another, unrelated type as undefined behavior, because it violates the strict aliasing rule. The rule exists because, if type punning were allowed for all types (not just uint32_t and float), some important compiler optimizations would be prohibited. (I'm not an expert in this, so I just accept it as given.)

memcpy is just copying byte-by-byte, it is not type aware, right? Hence, the result would be the same as reinterpret_cast, no?

Yes, they both produce the same result. The difference is that with memcpy we copy the actual data byte-by-byte into an object of type uint32_t, so when we read it, we are reading an object of type uint32_t. This is legal.

Anyway, I'm totally open to using reinterpret_cast instead of memcpy as long as Velox's compile flags don't trigger the strict-aliasing warning / error, since memcpy does introduce one more copy, which has an impact on performance.

[1] To see when it does not work, compile

#include <iostream>

int main()
{
    float f = 42.1;
    uint32_t u = *reinterpret_cast<uint32_t*>(&f);
    std::cout << u << std::endl;
    return 0;
}

with

g++ -O2 -Wall -Werror

or more specifically

g++ -O2 -Werror -Wstrict-aliasing

you will see the error

error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
    6 |     uint32_t u = *reinterpret_cast<uint32_t*>(&f);
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
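
For comparison, the memcpy approach discussed above can be sketched as a small standalone helper. This is only an illustration, not code from the PR: the names floatBits and checkFloatBits are hypothetical, and the expected values assume IEEE-754 single-precision floats.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the PR): returns the bit pattern of a float
// by copying its bytes into a uint32_t, so no type-punned pointer is read.
inline uint32_t floatBits(float value) {
  static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// Small usage check; the constants assume IEEE-754 single precision.
inline void checkFloatBits() {
  assert(floatBits(0.0f) == 0u);
  assert(floatBits(1.0f) == 0x3F800000u);
  assert(floatBits(42.0f) == 0x42280000u);
}
```

Unlike the reinterpret_cast version, this should compile cleanly with -O2 -Werror -Wstrict-aliasing, and optimizing compilers typically lower a fixed-size memcpy like this to a plain register move.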

Contributor

@usurai Thank you for explaining. @laithsakka Laith, would you take a look? Is there a way to avoid a copy here?

Contributor

@usurai Data copy is expensive, especially one value at a time. I wonder if we could use values.valueAt<uint32_t>() instead.

Contributor Author

@usurai Data copy is expensive, especially one value at a time. I wonder if we could use values.valueAt<uint32_t>() instead.

Oh, this is optimal. Pushed a new change accordingly. Thanks!

@usurai commented May 11, 2022

@usurai Somehow I missed this. Will take a look soon. In the meantime, would you rebase and update to clear "This branch has conflicts that must be resolved" message?

Hi @mbasmanova, the merge conflict has been resolved. Thanks.

@mbasmanova left a comment

Looks good modulo a couple of minor comments.

velox/connectors/hive/HivePartitionFunction.cpp
  const DecodedVector& values,
  vector_size_t size,
  bool mix,
- std::vector<uint32_t>& hashes) {
+ std::vector<uint32_t>& hashes,
+ std::function<uint32_t(const T&)> const& f) {

Contributor

naming: f -> hashOne or something similar
order of parameters: input, followed by in/out, followed by output

It might be more efficient to templatize on the function type. It would be nice to add a benchmark in a follow-up PR.

template <typename T, typename THash>
void abstractHashTyped(
    const DecodedVector& values,
    vector_size_t size,
    bool mix,
    THash hashOne,
    std::vector<uint32_t>& hashes)
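
To make the suggestion concrete, here is a minimal standalone sketch of a helper templated on the callable type. It is not the PR's implementation: a plain std::vector<T> stands in for DecodedVector, the names hashTypedSketch and checkHashTypedSketch are hypothetical, and the multiply-by-31 combine step is an assumption about how per-column hashes are mixed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the helper under review. THash is deduced from
// the callable, so the compiler can inline hashOne instead of dispatching
// through std::function.
template <typename T, typename THash>
void hashTypedSketch(
    const std::vector<T>& values, // input (replaces DecodedVector here)
    bool mix,                     // combine with existing hashes if true
    THash hashOne,                // per-value hash callable
    std::vector<uint32_t>& hashes) { // in/out
  hashes.resize(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    const uint32_t hash = hashOne(values[i]);
    // Assumed combine rule: multiply the running hash by 31, then add.
    hashes[i] = mix ? hashes[i] * 31 + hash : hash;
  }
}

// Usage: hash a first column, then mix in a second one.
inline void checkHashTypedSketch() {
  auto identity = [](int32_t v) { return static_cast<uint32_t>(v); };
  std::vector<uint32_t> hashes;
  hashTypedSketch(std::vector<int32_t>{1, 2, 3}, false, identity, hashes);
  hashTypedSketch(std::vector<int32_t>{10, 20, 30}, true, identity, hashes);
  assert(hashes == (std::vector<uint32_t>{41, 82, 123}));
}
```

Whether this is measurably faster than the std::function version depends on the compiler and workload, which is what the proposed benchmark would show.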

Contributor Author

naming: f -> hashOne or something similar
order of parameters: input, followed by in/out, followed by output

Done renaming function and changing the order.

It might be more efficient to templatize on the function type. It would be nice to add a benchmark in a follow-up PR.

Will do. Should I open an issue to track the progress of benchmark?

Contributor

Should I open an issue to track the progress of benchmark?

Sure.

Contributor Author

I've created a PR adding the benchmark, thanks.

@mbasmanova left a comment

@usurai Thank you for the contribution.

@facebook-github-bot

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@usurai requested a review from laithsakka May 12, 2022 16:17
Arty-Maly pushed a commit to Arty-Maly/velox that referenced this pull request May 13, 2022
Summary:
Add support for more types in `HivePartitionFunction`:
- `TINYINT`
- `SMALLINT`
- `INTEGER`
- `REAL`
- `DOUBLE`
- `VARBINARY`
- `TIMESTAMP`
- `DATE`

Fixes facebookincubator#327

Pull Request resolved: facebookincubator#1466

Reviewed By: Yuhta

Differential Revision: D36345326

Pulled By: mbasmanova

fbshipit-source-id: a32b76e8608cac3a8044ef0d468a2c479f62037d
facebook-github-bot pushed a commit that referenced this pull request May 16, 2022
Summary:
- Use initialize_list form of `flatVectorNullable` to test `varchar` and `timestamp` and remove the TODO (see #778).
- Add `assertPartitionsWithConstChannel` for newly added types.

This is a follow-up for #1466

Pull Request resolved: #1625

Reviewed By: kagamiori

Differential Revision: D36412294

Pulled By: mbasmanova

fbshipit-source-id: 608dda88833df6c16fb53a9009bf2c6269f64c32
zhejiangxiaomai pushed a commit to zhejiangxiaomai/velox that referenced this pull request Jun 21, 2022
zhejiangxiaomai pushed a commit to zhejiangxiaomai/velox that referenced this pull request Jun 21, 2022
shiyu-bytedance pushed a commit to shiyu-bytedance/velox-1 that referenced this pull request Aug 18, 2022
shiyu-bytedance pushed a commit to shiyu-bytedance/velox-1 that referenced this pull request Aug 18, 2022
@usurai deleted the hive_pf_types branch May 31, 2023 09:37