[SUPPORT] hoodie commit time format change #6907

Closed
sknukala opened this issue Oct 10, 2022 · 13 comments
Labels
priority:minor everything else; usability gaps; questions; feature reqs writer-core Issues relating to core transactions/write actions

Comments

@sknukala

Describe the problem you faced

We are trying to migrate a Hudi table from 0.8 to 0.12 and noticed that the _hoodie_commit_time format has changed to include milliseconds.

Below is sample data from each version:
hudi 0.8: 20220920044733
hudi 0.12: 20220923141615400

Is there a property to configure the timestamp format? We need this to ensure backward compatibility and to reduce changes to the data while migrating.

To Reproduce

Steps to reproduce the behavior:

  1. Create a hudi table with version 0.8
  2. Write data
  3. Upgrade table to 0.12
  4. Write data

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version : 0.12

  • Spark version : 3.1

  • EMR version : EMR 6.3

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

@KnightChess
Contributor

Use hudi-cli or, in 0.12, the Spark procedure to upgrade the table.
With hudi-cli:
[image]

With spark-sql:
call upgrade_table(table => 'xxx', to_version => 'FIVE');
[image]

@sknukala
Author

@KnightChess My issue is not with the table upgrade but with the timestamp format. Is there a property to configure the _hoodie_commit_time timestamp format?

@KnightChess
Contributor

@yihua
Contributor

yihua commented Oct 13, 2022

@sknukala as @KnightChess mentioned, you don't have to worry about the timestamp format: the new millisecond instant time is designed to be backward compatible, so no special handling is needed.

Do you have any specific reason for enforcing timestamp format?

@yihua yihua added priority:minor everything else; usability gaps; questions; feature reqs writer-core Issues relating to core transactions/write actions labels Oct 13, 2022
@sknukala
Author

@yihua A lot of downstream consumers use this column to pull data incrementally, and a change in format impacts all of them. If the format could be controlled, the migration would be easy.

Also, since we upgraded the table, old data is in the legacy format while new loads have milliseconds, leading to inconsistency.

@yihua
Contributor

yihua commented Oct 14, 2022

@sknukala if you are using the Hudi incremental query, the instant timestamp format (second vs. millisecond granularity) should not matter. Internally, Hudi treats the instant time as a String and filters records with a predicate that compares instant times as Strings, so millisecond-level instant times remain backward compatible with second-level instant times. Could you clarify how the incremental pull is impacted?
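
To illustrate the string-comparison point above, here is a minimal Python sketch. It is not Hudi code: the helper name `committed_after` is hypothetical, and the two extra second-level values are made up for the example. It only shows that plain lexicographic comparison keeps 14-character and 17-character instant times in chronological order.

```python
# Instant times copied from the issue description.
LEGACY = "20220920044733"      # Hudi 0.8, 14 characters (second granularity)
MILLIS = "20220923141615400"   # Hudi 0.12, 17 characters (millisecond granularity)


def committed_after(commit_time: str, checkpoint: str) -> bool:
    """Plain string comparison, like a `_hoodie_commit_time > checkpoint` filter.

    Hypothetical helper for illustration only; not part of any Hudi API.
    """
    return commit_time > checkpoint


# The shared prefix is compared digit by digit, and a shorter string sorts
# before any longer string that starts with it, so mixed formats stay ordered.
commits = [MILLIS, "20220923141616", LEGACY, "20220923141615"]
ordered = sorted(commits)

print(ordered)
print(committed_after(MILLIS, LEGACY))
```

A checkpoint saved in the old format (for example `20220923141615`) therefore still selects the newer millisecond-level commits that follow it.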

@nsivabalan
Contributor

@sknukala : let us know if you see any inconsistencies, and maybe provide a reproducible script if feasible.

@sknukala
Author

@yihua @nsivabalan : As you pointed out, Hudi handles the change because it uses string comparison. However, the change affects places where the _hoodie_commit_time column is cast to a timestamp, e.g. scripts that load incremental data into a database using _hoodie_commit_time.

@nsivabalan
Contributor

Hmm, I see. Sorry, I can't think of an easier route here.
We can add a config to use the older format if need be. But if a table already has a mix of both, I'm not sure we can do anything about it other than fixing it on the consumer end.

Also, we usually don't backport fixes, i.e. even if we solve the issue, we can't port it back to 0.10.1 and other versions > 0.10.1.
Sorry about that.

@xushiyan @yihua : can you folks think of any other approach here?
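
For a table that already holds a mix of both formats, one consumer-side option is to normalize the commit-time strings to a single width before using them. The sketch below is an assumption-laden illustration, not a Hudi feature: the function names are hypothetical, and it assumes the two layouts are yyyyMMddHHmmss (14 digits) and yyyyMMddHHmmssSSS (17 digits), as in the sample values in this issue.

```python
SECONDS_LEN = 14  # assumed legacy layout: yyyyMMddHHmmss
MILLIS_LEN = 17   # assumed newer layout: yyyyMMddHHmmssSSS


def _validate(commit_time: str) -> None:
    """Reject anything that is not a 14- or 17-digit string."""
    if not commit_time.isdigit() or len(commit_time) not in (SECONDS_LEN, MILLIS_LEN):
        raise ValueError(f"unexpected commit time format: {commit_time!r}")


def to_seconds_format(commit_time: str) -> str:
    """Drop the millisecond suffix, if present (hypothetical helper)."""
    _validate(commit_time)
    return commit_time[:SECONDS_LEN]


def to_millis_format(commit_time: str) -> str:
    """Pad a second-level value with "000" milliseconds (hypothetical helper)."""
    _validate(commit_time)
    return commit_time.ljust(MILLIS_LEN, "0")


mixed = ["20220920044733", "20220923141615400"]
as_seconds = [to_seconds_format(c) for c in mixed]
as_millis = [to_millis_format(c) for c in mixed]
print(as_seconds)
print(as_millis)
```

Note that truncating to seconds can make two distinct commits within the same second look identical, so padding to milliseconds is the lossless direction.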

@sknukala
Author

@nsivabalan : Adding a config to the current Hudi version 0.12 and future versions would help. Please let me know.

@nsivabalan
Contributor

@sknukala : is it not possible to fix the consumers to detect whether it's second or millisecond granularity before casting? Otherwise you can never upgrade Hudi to millisecond granularity and are essentially stuck.
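
As a rough sketch of what detecting the granularity before casting could look like on the consumer side, the Python function below picks the parsing path from the string length. The function name `parse_commit_time` is hypothetical, and the 14-digit and 17-digit layouts are assumed from the sample values in this issue.

```python
from datetime import datetime, timedelta


def parse_commit_time(commit_time: str) -> datetime:
    """Cast a commit-time string to a datetime, whichever granularity it has.

    Hypothetical consumer-side helper; assumes yyyyMMddHHmmss (14 digits)
    or yyyyMMddHHmmssSSS (17 digits).
    """
    if not commit_time.isdigit() or len(commit_time) not in (14, 17):
        raise ValueError(f"unexpected commit time format: {commit_time!r}")

    # The first 14 digits are the same in both layouts.
    parsed = datetime.strptime(commit_time[:14], "%Y%m%d%H%M%S")

    # Any remaining 3 digits are milliseconds.
    if len(commit_time) == 17:
        parsed += timedelta(milliseconds=int(commit_time[14:]))
    return parsed


old_ts = parse_commit_time("20220920044733")
new_ts = parse_commit_time("20220923141615400")
print(old_ts.isoformat())
print(new_ts.isoformat())
```

With a helper like this in the loading scripts, old and new rows in the same table can be cast without knowing which Hudi version wrote them.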

@sknukala
Author

@nsivabalan : Second-level granularity works fine for us. A config would let users control the timestamp format, or, if we do need to adopt milliseconds, let us plan that activity.

Currently, the millisecond default is blocking our migration to Hudi 0.12 (as we would need to update all consumers), and we are losing out on the performance improvements implemented in recent Hudi versions. With the config, the upgrade would be seamless.

@nsivabalan
Contributor

I discussed this with a few other Hudi experts. We feel this has to be addressed at the app layer, where commit times are cast to timestamps. We don't have plans to support second-level granularity. Sorry about that.
Let us know if we can help in any other way.

4 participants