Skip to content

Comments

[HUDI-3402] Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default#4749

Merged
nsivabalan merged 1 commit intoapache:masterfrom
YannByron:master_bulkinsert_timestamp_type
Feb 11, 2022
Merged

[HUDI-3402] Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default#4749
nsivabalan merged 1 commit intoapache:masterfrom
YannByron:master_bulkinsert_timestamp_type

Conversation

@YannByron
Copy link
Contributor

Hoodie converts Timestamp to TIMESTAMP_MICROS format when upsert and other operations, except bulk_insert.

And bulk_insert enables hoodie.datasource.write.row.writer.enable, and use HoodieRowParquetWriteSupport to write datas.

For the issue #4552 , that will cause problems by default. So i suggest to modify the hoodie.parquet.outputtimestamptype default value to TIMESTAMP_MICROS so that it will be convenience to users.

@hudi-bot
Copy link
Collaborator

hudi-bot commented Feb 4, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YannByron : thanks for the fix. But lets be cautious of any any breaking change. If you think about a user who has been using hudi for the past 2 to 3 releases (even just upsert), when they upgrade to 0.11, wouldn't the updates be inadvertantly treated as inserts? I assume timestamp based partition path will be impacted with this.
Can you help clarify please.

@YannByron
Copy link
Contributor Author

YannByron commented Feb 7, 2022

@nsivabalan
For now this config hoodie.parquet.outputtimestamptype is only used in 'bulk_insert' mode, not upsert. and it won't affect timestamp based partition path.
This config will decide that Long values representing the timestamp type will be persisted in parquet file using the format of TIMESTAMP_MILLIS or TIMESTAMP_MICROS when bulk_insert. Like 2021-10-26 03:44:03, converting it to Long type is 1635234243 in TIMESTAMP_MILLIS format, but 1635234243000 in TIMESTAMP_MICROS format.
BTW, upsert uses the TIMESTAMP_MICROS format.

@codope
Copy link
Member

codope commented Feb 7, 2022

If set false, Long values representing the timestamp type will be persisted in parquet file using the format of TIMESTAMP_MILLIS when bulk_insert. And set true, using the format of TIMESTAMP_MICROS. BTW, upsert uses the TIMESTAMP_MICROS format.

Which config is being set to true/false in the above statement?

@YannByron
Copy link
Contributor Author

@codope
sorry for the confused description. I edited again.

@codope codope self-assigned this Feb 7, 2022
@codope
Copy link
Member

codope commented Feb 7, 2022

sorry for the confused description. I edited again.

Thanks for the clarification.
I still have this question. Suppose user did bulk insert in a table created with version 0.10.0 and now upgrades and runs another bulk insert with 0.11.0 where default is now TIMESTAMP_MICROS. Is the type change from TIMESTAMP_MILLIS to TIMESTAMP_MICROS backward compatible? For your example, in the newer version, 1635234243 would be some date in the year 1970 while 1635234243000 would be much more recent. Wouldn't this affect the queries?

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Feb 8, 2022
@nsivabalan
Copy link
Contributor

can we file a jira please as we try to get consensus

@nsivabalan
Copy link
Contributor

thanks Sagar for bringing up a good point. Just playing devil's advocate. Spark in general has much incompatability issues even after being very mature. So, for the benefit we get (same behavior across bulk insert and other operations), I am thinking if we can go ahead with this change. Again, my assumption is that w/ non row writer operations, we are already honoring micro secs with timestamp type.

Even today, users have some inconsistencies here which they are living with.
especially given this is not affecting partition path flow. just some random timestamp type fields. @YannByron : correct me if I am mistaken in any of the above statements.

@YannByron YannByron changed the title Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default [HUDI-3402] Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default Feb 9, 2022
@YannByron
Copy link
Contributor Author

can we file a jira please as we try to get consensus

https://issues.apache.org/jira/browse/HUDI-3402

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YannByron @nsivabalan I've verified the patch. It does not affect the partition path. Should be good to land.

@nsivabalan
Copy link
Contributor

cool, thanks a lot man!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:critical Production degraded; pipelines stalled

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants