[SUPPORT] MOR upsert table grows in size when ingesting same records #1625

@rolandjohann

Describe the problem you faced

When we ingest the same records into a MOR table, the disk space used grows with every run, even though compaction and cleaning are enabled.

Plain Parquet output for the test data is 8.3M. Hudi table size after each run:

  • 9.4M
  • 51M
  • 83M
  • 125M
  • 157M
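
To make the growth per run explicit, here is a small sketch that only does arithmetic on the sizes listed above. The object and value names are made up for illustration; nothing in it touches Hudi.

```scala
object GrowthPerRun {
  // Table sizes in MB after each run, copied from the list above.
  val sizesMb: Seq[Double] = Seq(9.4, 51.0, 83.0, 125.0, 157.0)

  // Difference between each run and the one before it.
  val deltasMb: Seq[Double] =
    sizesMb.zip(sizesMb.tail).map { case (prev, next) => next - prev }

  def main(args: Array[String]): Unit =
    deltasMb.foreach(d => println(f"+$d%.1fM"))
}
```

Each rerun adds roughly 32M to 42M, several times the 8.3M of the plain Parquet output.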

To Reproduce
Write the same DataFrame multiple times:

// Imports assumed for Hudi 0.5.2; ComplexKeyGenerator lives in a different package in later releases.
import org.apache.hudi.{ComplexKeyGenerator, DataSourceWriteOptions}
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

df
  .coalesce(1)
  .write
  .format("org.apache.hudi")
  .option("hoodie.insert.shuffle.parallelism", "2")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  // Cleaner settings
  .option("hoodie.cleaner.commits.retained", "3")
  .option("hoodie.cleaner.fileversions.retained", "2")
  // Inline compaction after every 2 delta commits
  .option("hoodie.compact.inline", "true")
  .option("hoodie.compact.inline.max.delta.commits", "2")
  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
  // Upsert into a merge-on-read table
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
  // Record key, partition path and precombine field
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "some_unique_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date")
  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, classOf[ComplexKeyGenerator].getName)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "version")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .mode(SaveMode.Append)
  .save("/tmp/test_hudi_mor")
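
To see where the extra space goes after each run, the table directory can be summed up by file kind. The following is a minimal sketch for local storage only (it uses `java.nio`, not the Hadoop `FileSystem` API). The `TableSize` object and its helpers are hypothetical names, and the classification rule is an assumption: base files are taken to end in `.parquet` and delta log files to contain `.log.` in their name.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

object TableSize {
  // Assumed naming: base files end in ".parquet", log files contain ".log.".
  def kindOf(p: Path): String = {
    val name = p.getFileName.toString
    if (name.endsWith(".parquet")) "base"
    else if (name.contains(".log.")) "log"
    else "other"
  }

  // All regular files below root, collected eagerly so the stream can be closed.
  private def regularFiles(root: Path): List[Path] = {
    val stream = Files.walk(root)
    try stream.iterator().asScala.filter(p => Files.isRegularFile(p)).toList
    finally stream.close()
  }

  // Total size in bytes of all regular files below root.
  def dirSizeBytes(root: Path): Long =
    regularFiles(root).map(p => Files.size(p)).sum

  // Size in bytes per file kind ("base", "log", "other").
  def sizeByKind(root: Path): Map[String, Long] =
    regularFiles(root)
      .groupBy(kindOf)
      .map { case (kind, files) => kind -> files.map(f => Files.size(f)).sum }

  def main(args: Array[String]): Unit = {
    val root = Paths.get("/tmp/test_hudi_mor")
    println(s"total bytes: ${dirSizeBytes(root)}")
    sizeByKind(root).foreach { case (kind, bytes) => println(s"$kind: $bytes") }
  }
}
```

Running this after every write should show whether the growth sits in base files, log files, or the remaining files such as table metadata.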

Expected behavior
The used disk space should stop growing.

Environment Description

  • Hudi version: 0.5.2
  • Spark version: 2.4.4
  • Hive version:
  • Hadoop version: 2.7
  • Storage (HDFS/S3/GCS..): local
  • Running on Docker? (yes/no): no
