-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More types in HivePartitionFunction #1466
Conversation
@usurai Somehow I missed this. Will take a look soon. In the meantime, would you rebase and update to clear "This branch has conflicts that must be resolved" message? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@usurai Nice change. Some comments and a question below.
} | ||
|
||
TEST_F(HivePartitionFunctionTest, Timestamp) { | ||
// TODO Fix flatVectorNullable to set Timestamp. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like to take care of this TODO in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mbasmanova, seems #778 has already done fixing flatVectorNullable
, I change the test of varchar
and timestamp
to use the initalizer_list form of the API and it works. I created a PR on this. Thanks.
static_assert(sizeof(float) == sizeof(uint32_t)); | ||
auto f = [](float value) { | ||
uint32_t ret; | ||
memcpy(&ret, &value, sizeof ret); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to copy data? Can you use reinterpret_cast instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact I did some searching before doing this change. Seems although reinterpret_cast
the reference / pointer of float
to uint32_t
works, reading the result of it is undefined behavior.
We might be able to use std::bit_cast
when velox supports using C++20.
Reference:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you explain a bit more? memcpy is just copying byte-by-byte, it is not type aware, right? Hence, the result would be the same as reinterpret_cast, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you explain a bit more?
float f = 42.0;
uint32_t u = *reinterpret_cast<uint32_t*>(&f);
In practice, this works in some condition[1]. Underlying, reinterpret_cast<uint32_t*>(&f)
returns a pointer of type uint32_t
that actually points to an object of type float
, this is called type punning. As both types are of the same size, we get the exact same result we as do a bit-wise copy.
The issue here is that C++ standard defines "accessing a variable of a given type through a pointer to another type" as undefined behavior, which breaks strict aliasing rule. That is, if type punning is allowed for all the types(not limited in uint32_t
and float
), some vital optimize will be prohibited. (I'm not expert in this, so I just accept it as a truth)
memcpy is just copying byte-by-byte, it is not type aware, right? Hence, the result would be the same as reinterpret_cast, no?
Yes, they both emit the same result. But the difference is by memcpy
, we copy the actually data byte-by-byte to an object of type uint32_t
, then when reading it, we are reading an object of type uint32_t
. This is legal.
Anyway, I'm totally open to using reinterpret_cast
instead of memcpy
as long velox's compile flags don't trigger the warning / error of strict-aliasing, and memcpy
indeed introduces one more copy which has impact to performance.
[1] To see when it's not working, please compile
#include <iostream>
int main()
{
float f = 42.1;
uint32_t u = *reinterpret_cast<uint32_t*>(&f);
std::cout << u << std::endl;
return 0;
}
with
g++ -O2 -Wall -Werror
or more specifically
g++ -O2 -Werror -Wstrict-aliasing
you will see the error
error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
6 | uint32_t u = *reinterpret_cast<uint32_t*>(&f);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@usurai Thank you for explaining. @laithsakka Laith, would you take a look? Is there a way to avoid a copy here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@usurai Data copy is expensive, especially one value at a time. I wonder if we could use values.valueAt<uint32_t>()
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@usurai Data copy is expensive, especially one value at a time. I wonder if we could use
values.valueAt<uint32_t>()
instead.
Oh this is optimal. Pushed new change according to this. Thanks!
Hi @mbasmanova, the merge conflict has been resolved. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good % a couple of minor comments.
const DecodedVector& values, | ||
vector_size_t size, | ||
bool mix, | ||
std::vector<uint32_t>& hashes) { | ||
std::vector<uint32_t>& hashes, | ||
std::function<uint32_t(const T&)> const& f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: f -> hashOne or something similar
order of parameters: input, followed by in/out, followed by output
It might be more efficient to templatize on the function type. It would be nice to add a benchmark in a follow-up PR.
template <typename T, typename THash>
void abstractHashTyped(
const DecodedVector& values,
vector_size_t size,
bool mix,
THash hashOne,
std::vector<uint32_t>& hashes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: f -> hashOne or something similar order of parameters: input, followed by in/out, followed by output
Done renaming function and changing the order.
It might be more efficient to templatize on the function type. It would be nice to add a benchmark in a follow-up PR.
Will do. Should I open an issue to track the progress of benchmark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I open an issue to track the progress of benchmark?
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created a PR adding the benchmark, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@usurai Thank you for the contribution.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Add support for more types in `HivePartitionFunction`: - `TINYINT` - `SMALLINT` - `INTEGER` - `REAL` - `DOUBLE` - `VARBINARY` - `TIMESTAMP` - `DATE` Fixes facebookincubator#327 Pull Request resolved: facebookincubator#1466 Reviewed By: Yuhta Differential Revision: D36345326 Pulled By: mbasmanova fbshipit-source-id: a32b76e8608cac3a8044ef0d468a2c479f62037d
Summary: - Use initialize_list form of `flatVectorNullable` to test `varchar` and `timestamp` and remove the TODO (see #778). - Add `assertPartitionsWithConstChannel` for newly added types. This is a follow-up for #1466 Pull Request resolved: #1625 Reviewed By: kagamiori Differential Revision: D36412294 Pulled By: mbasmanova fbshipit-source-id: 608dda88833df6c16fb53a9009bf2c6269f64c32
Summary: Add support for more types in `HivePartitionFunction`: - `TINYINT` - `SMALLINT` - `INTEGER` - `REAL` - `DOUBLE` - `VARBINARY` - `TIMESTAMP` - `DATE` Fixes facebookincubator#327 Pull Request resolved: facebookincubator#1466 Reviewed By: Yuhta Differential Revision: D36345326 Pulled By: mbasmanova fbshipit-source-id: a32b76e8608cac3a8044ef0d468a2c479f62037d
Summary: - Use initialize_list form of `flatVectorNullable` to test `varchar` and `timestamp` and remove the TODO (see facebookincubator#778). - Add `assertPartitionsWithConstChannel` for newly added types. This is a follow-up for facebookincubator#1466 Pull Request resolved: facebookincubator#1625 Reviewed By: kagamiori Differential Revision: D36412294 Pulled By: mbasmanova fbshipit-source-id: 608dda88833df6c16fb53a9009bf2c6269f64c32
Summary: Add support for more types in `HivePartitionFunction`: - `TINYINT` - `SMALLINT` - `INTEGER` - `REAL` - `DOUBLE` - `VARBINARY` - `TIMESTAMP` - `DATE` Fixes facebookincubator#327 Pull Request resolved: facebookincubator#1466 Reviewed By: Yuhta Differential Revision: D36345326 Pulled By: mbasmanova fbshipit-source-id: a32b76e8608cac3a8044ef0d468a2c479f62037d
Summary: - Use initialize_list form of `flatVectorNullable` to test `varchar` and `timestamp` and remove the TODO (see facebookincubator#778). - Add `assertPartitionsWithConstChannel` for newly added types. This is a follow-up for facebookincubator#1466 Pull Request resolved: facebookincubator#1625 Reviewed By: kagamiori Differential Revision: D36412294 Pulled By: mbasmanova fbshipit-source-id: 608dda88833df6c16fb53a9009bf2c6269f64c32
Add support for more types in
HivePartitionFunction
:TINYINT
SMALLINT
INTEGER
REAL
DOUBLE
VARBINARY
TIMESTAMP
DATE
Fixes #327