New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-805: Read Int96 into Arrow Timestamp(ns) #204
Conversation
Also verified locally that this correctly reads the timestamps produced by Spark 1.6.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpplint issues
std::vector<std::shared_ptr<schema::Node>> fields({schema::PrimitiveNode::Make("int96", Repetition::REQUIRED, ParquetType::INT96)}); | ||
std::shared_ptr<schema::GroupNode> schema = std::static_pointer_cast<GroupNode>(schema::GroupNode::Make("schema", Repetition::REQUIRED, fields)); | ||
|
||
// We cannot write this column with Arrow, so we have to use the plain parquet-cpp API to write an Int96 file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like we need to implement a fixed-width bytes type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I omitted this for now as INT96 is at least declared deprecated. But at least for IMPALA this still is currently only natively supported timestamp.
@@ -44,6 +45,15 @@ using ParquetReader = parquet::ParquetFileReader; | |||
namespace parquet { | |||
namespace arrow { | |||
|
|||
constexpr int64_t kJulianToUnixEpochDays = 2440588l; | |||
constexpr int64_t kNanosecondsInADay = 86400l * 1000l * 1000l * 1000l; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use L
-- I squinted at this one for a while!
constexpr int64_t kJulianToUnixEpochDays = 2440588l; | ||
constexpr int64_t kNanosecondsInADay = 86400l * 1000l * 1000l * 1000l; | ||
|
||
int64_t impala_timestamp_to_nanoseconds(const Int96& impala_timestamp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
declare static inline
int64_t impala_timestamp_to_nanoseconds(const Int96& impala_timestamp) { | ||
int64_t days_since_epoch = impala_timestamp.value[2] - kJulianToUnixEpochDays; | ||
int64_t nanoseconds = *(reinterpret_cast<const int64_t*>(&(impala_timestamp.value))); | ||
return days_since_epoch * kNanosecondsInADay + nanoseconds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this sufficient? I believe you that it is, but I had seen the Impala timestamp code (https://github.com/apache/incubator-impala/blob/master/be/src/runtime/timestamp-value.h) and had though there might be more work to be done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although there is a lot of code around on various sources on how to convert Julian dates to datetime objects, this is not so complicated as we only have to deal with integer Julian dates. That's why I also verified with Spark INT96 output locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
No description provided.