[SUPPORT] Duplicates on MOR table #8703

Closed
eyjian opened this issue May 14, 2023 · 7 comments
Assignees: danny0405
Labels: data-duplication, spark (Issues related to spark)

Comments

@eyjian

eyjian commented May 14, 2023

Hudi: 0.12.1
Flink: 0.15
Spark: 3.1

Duplicates span multiple files within the same partition path:

Duplicate row 1:
  ds: 20230509
  ts: 2023/5/10 20:11
  transcode: 1486092202305095800051572
  actid: 300108
  level: 3
  _hoodie_file_name: 00000007-6d54-4918-9abb-6f8d48843bba_4-10-0_20230514101617270.parquet
  _hoodie_partition_path: 20230509
  _hoodie_record_key: transcode:1486092202305095800051572,actid:300148,level:3
  _hoodie_commit_time: 20230510200812817

Duplicate row 2:
  ds: 20230509
  ts: 2023/5/14 10:11
  transcode: 1486092202305095800051572
  actid: 300108
  level: 3
  _hoodie_file_name: 00000001-7f3b-4b8c-9b46-dff2c4987f72_4-10-0_20230514101617270.parquet
  _hoodie_partition_path: 20230509
  _hoodie_record_key: transcode:1486092202305095800051572,actid:300148,level:3
  _hoodie_commit_time: 20230514100809825
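For reference, a query along these lines (a sketch that uses only the Hudi metadata columns and the table name from the DDL below) can list record keys that appear in more than one data file within a partition:

-- Sketch: find record keys that land in more than one file within a partition,
-- using only the _hoodie_* metadata columns every Hudi table already carries.
SELECT
  _hoodie_partition_path,
  _hoodie_record_key,
  COUNT(DISTINCT _hoodie_file_name) AS file_count,
  COUNT(*)                          AS row_count
FROM t_test_01
GROUP BY _hoodie_partition_path, _hoodie_record_key
HAVING COUNT(DISTINCT _hoodie_file_name) > 1;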
CREATE TABLE `t_test_01` (
  `_hoodie_commit_time` STRING,
  `_hoodie_commit_seqno` STRING,
  `_hoodie_record_key` STRING,
  `_hoodie_partition_path` STRING,
  `_hoodie_file_name` STRING,
  `ds` BIGINT,
  `dt` STRING,
  `ts` STRING,
  `transcode` STRING,
  `actid` BIGINT,
  `level` BIGINT
)
USING hudi
OPTIONS (
  `hoodie.query.as.ro.table` 'false'
)
PARTITIONED BY (ds)
LOCATION 'hdfs://user/root/warehouse/test_db.db/t_test_01'
TBLPROPERTIES (
  'type' = 'mor',
  'primaryKey' = 'transcode,actid,level',
  'preCombineField' = 'ts',
  'connector' = 'hudi',
  'hoodie.bucket.index.num.buckets' = '19',
  'hoodie.index.type' = 'BUCKET',
  'hoodie.datasource.write.recordkey.field' = 'transcode,actid,level',
  'hoodie.table.name' = 't_test',
  'hoodie.table.precombine.field' = 'ts'
);
INSERT INTO t_test_01
SELECT
ds,
dt,
DATE_FORMAT(CURRENT_TIMESTAMP,'yyyy-MM-dd HH:mm:ss') AS ts,
transcode,
actid,
level
FROM t_test_02;

The records deduplicate command cannot be used to fix this because of missing permissions.

@danny0405 added the spark and data-duplication labels May 15, 2023
@danny0405 self-assigned this May 15, 2023
@eyjian
Author

eyjian commented May 16, 2023

Root cause: the table was dropped and then created again, but the table path was not deleted, so files from both table instances exist at the same time.

@danny0405
Contributor

Is the table a managed table or external table then?

@eyjian
Author

eyjian commented May 17, 2023

Is the table a managed table or external table then?

It is a Hudi table created by Flink; it is not an external table.
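
For reference, one way to confirm the table type from Spark SQL (assuming the table is visible in the same catalog) is DESCRIBE TABLE EXTENDED, whose output includes a Type row showing MANAGED or EXTERNAL:

-- Check how the catalog registered the table: the Type row shows MANAGED or EXTERNAL.
DESCRIBE TABLE EXTENDED t_test_01;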

@danny0405
Contributor

How do you drop the table? If it is a managed table and you use the Flink Hudi Hive catalog, the table path would be deleted.

@eyjian
Author

eyjian commented May 18, 2023

Drop the Hudi table using Spark SQL.
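
Presumably a plain drop along these lines (the exact statement is not shown in the thread; this is an assumed sketch), which leaves the table directory on HDFS in place:

-- Assumed form of the drop; without PURGE the table path on HDFS is not removed.
DROP TABLE IF EXISTS t_test_01;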

@danny0405
Contributor

danny0405 commented May 19, 2023

There is a param named purge:

case class DropHoodieTableCommand(
    tableIdentifier: TableIdentifier,
    ifExists: Boolean,
    isView: Boolean,
    purge: Boolean) extends HoodieLeafRunnableCommand {

If it is true, the directory on fs would be deleted recursively.
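
In Spark SQL this flag presumably corresponds to the PURGE keyword on DROP TABLE, so a drop that also removes the table directory would look like the following sketch (only run it if losing the data under the table path is intended):

-- PURGE asks for the table directory to be removed along with the catalog entry.
DROP TABLE IF EXISTS t_test_01 PURGE;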

@eyjian
Author

eyjian commented May 23, 2023

There is a param named purge:

case class DropHoodieTableCommand(
    tableIdentifier: TableIdentifier,
    ifExists: Boolean,
    isView: Boolean,
    purge: Boolean) extends HoodieLeafRunnableCommand {

If it is true, the directory on fs would be deleted recursively.

Thanks, it's good.

@eyjian closed this as completed May 23, 2023