Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-805: Read Int96 into Arrow Timestamp(ns) #204

Closed
wants to merge 3 commits into from

Conversation

xhochy
Copy link
Member

@xhochy xhochy commented Dec 19, 2016

No description provided.

@xhochy
Copy link
Member Author

xhochy commented Dec 19, 2016

Also verified locally that this correctly reads the timestamps produced by Spark 1.6.1

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpplint issues

std::vector<std::shared_ptr<schema::Node>> fields({schema::PrimitiveNode::Make("int96", Repetition::REQUIRED, ParquetType::INT96)});
std::shared_ptr<schema::GroupNode> schema = std::static_pointer_cast<GroupNode>(schema::GroupNode::Make("schema", Repetition::REQUIRED, fields));

// We cannot write this column with Arrow, so we have to use the plain parquet-cpp API to write an Int96 file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like we need to implement a fixed-width bytes type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I omitted this for now as INT96 is at least declared deprecated. But at least for IMPALA this still is currently only natively supported timestamp.

@@ -44,6 +45,15 @@ using ParquetReader = parquet::ParquetFileReader;
namespace parquet {
namespace arrow {

constexpr int64_t kJulianToUnixEpochDays = 2440588l;
constexpr int64_t kNanosecondsInADay = 86400l * 1000l * 1000l * 1000l;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use L -- I squinted at this one for a while!

constexpr int64_t kJulianToUnixEpochDays = 2440588l;
constexpr int64_t kNanosecondsInADay = 86400l * 1000l * 1000l * 1000l;

int64_t impala_timestamp_to_nanoseconds(const Int96& impala_timestamp) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

declare static inline

int64_t impala_timestamp_to_nanoseconds(const Int96& impala_timestamp) {
int64_t days_since_epoch = impala_timestamp.value[2] - kJulianToUnixEpochDays;
int64_t nanoseconds = *(reinterpret_cast<const int64_t*>(&(impala_timestamp.value)));
return days_since_epoch * kNanosecondsInADay + nanoseconds;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sufficient? I believe you that it is, but I had seen the Impala timestamp code (https://github.com/apache/incubator-impala/blob/master/be/src/runtime/timestamp-value.h) and had though there might be more work to be done

Copy link
Member Author

@xhochy xhochy Dec 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although there is a lot of code around on various sources on how to convert Julian dates to datetime objects, this is not so complicated as we only have to deal with integer Julian dates. That's why I also verified with Spark INT96 output locally.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@asfgit asfgit closed this in 16466b1 Dec 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants