Describe the problem you faced
When we ingest the same records into a MOR table, the used disk space grows with each run, even though compaction and cleaning are enabled.
The same data written as plain parquet is 8.3M. The Hudi table size after each run:
- 9.4M
- 51M
- 83M
- 125M
- 157M
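To see where the growth accumulates, a quick inspection of the table directory helps. This is a generic sketch, not Hudi tooling; /tmp/test_hudi_mor is the base path from the reproduce step below:

```shell
# Sketch for inspecting where the disk usage accumulates.
# /tmp/test_hudi_mor is the base path used in the reproduce step.
TABLE_PATH=/tmp/test_hudi_mor
mkdir -p "$TABLE_PATH"   # no-op when the table already exists

# Total footprint of the table.
du -sh "$TABLE_PATH"

# How much of it is timeline/metadata under .hoodie vs. data files.
du -sh "$TABLE_PATH/.hoodie" 2>/dev/null

# Number of base-file (parquet) versions currently retained.
find "$TABLE_PATH" -name '*.parquet' | wc -l
```

Comparing the .hoodie size with the parquet-file count per run shows whether the growth comes from retained file versions or from the commit timeline.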
To Reproduce
Write the same DataFrame multiple times:
df
.coalesce(1)
.write
.format("org.apache.hudi")
.option("hoodie.insert.shuffle.parallelism", "2")
.option("hoodie.upsert.shuffle.parallelism", "2")
.option("hoodie.cleaner.commits.retained", "3")
.option("hoodie.cleaner.fileversions.retained", "2")
.option("hoodie.compact.inline", "true")
.option("hoodie.compact.inline.max.delta.commits", "2")
.option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "some_unique_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date")
.option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, classOf[ComplexKeyGenerator].getName)
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "version")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save("/tmp/test_hudi_mor")
Expected behavior
The used disk space should stop growing.
Environment Description
- Hudi version : 0.5.2
- Spark version : 2.4.4
- Hive version :
- Hadoop version : 2.7
- Storage (HDFS/S3/GCS..) : local
- Running on Docker? (yes/no) : no
Additional context
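One possible factor (an assumption on my part, not a confirmed diagnosis): the cleaner follows a single policy selected by hoodie.cleaner.policy, so hoodie.cleaner.commits.retained and hoodie.cleaner.fileversions.retained are not both in effect at once, and cleaned commits only stop occupying the timeline once archival kicks in. A sketch of write options that make this explicit (config keys per the Hudi docs; exact defaults may differ by version):

```scala
// Sketch, not a confirmed fix: pin the cleaner policy so only the matching
// "retained" option applies, and tighten the archival window that keeps old
// commit files on the timeline.
.option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS") // pairs with commits.retained
.option("hoodie.cleaner.commits.retained", "3")
.option("hoodie.keep.min.commits", "4")                 // must be > commits.retained
.option("hoodie.keep.max.commits", "5")
```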
Stacktrace