[SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt #2509
Comments
@satishkotha Could you take a look at this one?
Hi, if you set the support_timestamp property mentioned here, Hudi will convert the field to the timestamp type in Hive. Note that you need to verify compatibility of this with the Hive/Presto/Athena versions you are using. We made some changes to interpret the field correctly as a timestamp; refer to this change in Presto for an example. We made similar changes in our internal Hive deployment.
Some more background: Hudi uses the parquet-avro module, which converts timestamps to INT64 with the logical type TIMESTAMP_MICROS. Hive and other query engines expect timestamps to be in INT96 format, but INT96 is no longer supported. The recommended path forward is to deprecate INT96 and change query engines to work with the INT64 type; https://issues.apache.org/jira/browse/PARQUET-1883 has additional details.
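The decoding that those engine-side changes implement is mechanically simple: an INT64 TIMESTAMP_MICROS value is just the number of microseconds since the Unix epoch. A minimal pure-Python sketch of that interpretation (the helper name is mine, not from any Hudi or Presto API):

```python
from datetime import datetime, timezone

def from_timestamp_micros(micros: int) -> datetime:
    """Interpret an INT64 TIMESTAMP_MICROS value (microseconds since
    the Unix epoch) as a timezone-aware datetime, which is what the
    referenced Presto/Hive changes effectively teach the reader to do."""
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)

# A value that an engine without TIMESTAMP_MICROS support shows as a bigint:
raw = 1612182600000000
print(from_timestamp_micros(raw))  # 2021-02-01 12:30:00+00:00
```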
Great to know. I will test this feature with Athena and Redshift Spectrum; if someone has already run this test, please let me know.
@satishkotha I added that parameter to my example. Now, after writing data into S3, when I run
which is good. However, when I run
@zuyanton yes, as I mentioned earlier, some changes are needed in query engines. Refer to this change in Presto for an example. See this ticket for how this is fixed upstream in Hive. You will likely need to port this change to your Hive deployment to make this work (or you could upgrade your Hive version to 4).
@satishkotha could you help me explain to AWS support which fixes should be applied to Athena? @umehrot2 do you know if anything should be changed on EMR? Thank you.
@rubenssoto AFAIK, Athena is built on top of Presto, so you could ask them to apply the above Presto change. You can say it is needed to interpret Parquet INT64 timestamps correctly.
Thanks for jumping in @satishkotha |
Going back to @zuyanton's point, that is still from Spark. Are you suggesting that Spark's Hive version also needs to pick up the change? (That sounds painful.)
Hi, @vinothchandar @satishkotha @zuyanton Is there any workaround? I opened an AWS ticket, but it will probably take a while because of the difference in Presto versions. I have some tables in regular Parquet with timestamp fields, and they work; what is the difference compared to Hudi? Thank you.
@vinothchandar @satishkotha @zuyanton I think the only workaround here is to convert the timestamp column to a string. Do you have better ideas? Thank you.
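The string workaround amounts to serializing the timestamp to text before the Hudi write, so Athena sees a plain string column instead of an INT64. A pure-Python sketch of the conversion (in Spark you would instead cast the column to a string type before writing; the function name here is hypothetical):

```python
from datetime import datetime, timezone

def timestamp_to_string(dt: datetime) -> str:
    # ISO-style text reads back as a plain string in any engine.
    return dt.strftime("%Y-%m-%d %H:%M:%S")

ts = datetime(2021, 2, 1, 12, 30, 0, tzinfo=timezone.utc)
print(timestamp_to_string(ts))  # 2021-02-01 12:30:00
```

The trade-off is losing native timestamp semantics: this format still sorts correctly as text, but date arithmetic in queries then needs an explicit parse back to a timestamp.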
@rubenssoto : Here is a link to suggestions from Athena support on timestamp conversion. |
@nsivabalan it worked, but I don't think a view is a good solution, because we will have a maintenance problem. It is not Hudi's fault, so we need to wait for Athena, but I don't think it will be solved soon. Is there anything we can do on the Hudi side? My timestamp is not timestamp-micros; it is timestamp-milliseconds.
I think if you query using the Spark datasource APIs, queries will be able to read the timestamp field correctly. When querying through Athena, I don't think there is another workaround, unfortunately.
Hello guys, Athena's behavior changed. This is great news, but the BETWEEN operator doesn't work. For example, this query works:
and this query doesn't work:
@rubenssoto : just in case you haven't seen it, #2544 talks about timestamps and Hive.
Just reading through this again. We definitely need to understand whether this is an issue even when using Spark as the only engine (i.e., no registration to HMS), and whether parquet-avro is the problem child.
AWS Glue 3
I had this issue and was able to handle it by setting this value when I insert data. But I am not sure whether there is any downside to setting this value to true.
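For reference, the support_timestamp property mentioned earlier in the thread is passed alongside the other Hudi write options. A sketch of what such an options map might look like; the table name, key fields, and other values here are illustrative placeholders, not taken from this issue:

```python
# Hudi write options as they might be passed to a Spark DataFrame writer.
# All values except the support_timestamp flag are illustrative placeholders.
hudi_options = {
    "hoodie.table.name": "example_table",                  # placeholder
    "hoodie.datasource.write.recordkey.field": "id",       # placeholder
    "hoodie.datasource.write.precombine.field": "ts",      # placeholder
    "hoodie.datasource.hive_sync.enable": "true",
    # Sync the field to Hive as TIMESTAMP instead of BIGINT:
    "hoodie.datasource.hive_sync.support_timestamp": "true",
}

# In Spark this would be used roughly as:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(path)
print(sorted(hudi_options))
```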
@Gatsby-Lee : hoodie.datasource.hive_sync.support_timestamp is the right way to go. @rubenssoto : is everything resolved on your end, or are you still having issues? Let us know; if things are resolved, feel free to close out the issue.
Thank you for your comment. BTW, AWS Athena fails to read the MoR real-time table (the read-optimized table is OK). Is there any input you want me to provide to the AWS Athena team?
@umehrot2 @zhedoubushishi : Do you folks have any pointers on this. |
@nsivabalan |
When I try to read using spark-sql, I get the error below, which is the same one mentioned by @zuyanton.
I think this is related to https://issues.apache.org/jira/browse/HUDI-83, and we have a patch. Can you please try it out with #3391?
@zuyanton Did you get a chance to try out the suggested patch? |
Describe the problem you faced
It looks like org.apache.spark.sql.types.TimestampType gets converted to bigint when saved to a Hudi table.
To Reproduce
1. Create a dataframe with a TimestampType column.
2. Preview the dataframe.
3. Save the dataframe to a Hudi table.
4. View the Hudi table. Result: the timestamp column is bigint.
5. View the schema. Result:
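The symptom in the steps above can be mimicked without a Spark cluster: per the parquet-avro behavior described in this thread, each TimestampType value is stored as an INT64 count of microseconds since the epoch, which is exactly why the column reads back as bigint. A toy pure-Python simulation (the function and field names are hypothetical):

```python
from datetime import datetime, timezone

def simulate_parquet_avro_write(rows):
    """Mimic parquet-avro's TIMESTAMP_MICROS handling: datetime values
    become INT64 microseconds since the Unix epoch."""
    return [
        {k: int(v.timestamp() * 1_000_000) if isinstance(v, datetime) else v
         for k, v in row.items()}
        for row in rows
    ]

source = [{"id": 1, "ts": datetime(2021, 2, 1, 12, 30, tzinfo=timezone.utc)}]
stored = simulate_parquet_avro_write(source)
print(stored[0]["ts"])   # 1612182600000000 -- the "bigint" the table shows
```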
Environment Description
Hudi version : 0.7.0
Spark version :
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
full code snippet