
[SUPPORT] Compaction & Clustering are not working #10183

@Cpandey43

Description


Describe the problem you faced

Issue:1
I configured async compaction, async clustering, and async cleaning in the job, but none of them behaves according to the configured settings.

  • async compaction : not working at all

  • async clustering : not working at all

  • async cleaning : cleaning was executed after every commit

Issue:2
The configured table type is MOR with the upsert operation, but Hudi is not creating .log files in the partitions and is generating parquet files of at most 18MB. The application created two partitions, and both contain files of at most 18MB. Attaching the content of both partitions below for your reference.
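For context on the missing .log files, one possible explanation (hedged, not confirmed from the attachments): under MERGE_ON_READ with a BLOOM index, brand-new record keys are written as parquet base files, and .log files only appear once a later upsert updates keys that already live in a file group. A minimal sketch to check this behavior (table name, path, and schema below are hypothetical, not taken from the original job):

```scala
// Sketch only: assumes a Spark session with the Hudi bundle on the classpath.
// Upserting the SAME keys twice should produce a .log file on the second write;
// if every batch carries only new keys, MOR will keep emitting parquet base files.
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.lit
import spark.implicits._

val basePath = "s3a://bucket/hudi/mor_check" // hypothetical path

def writeBatch(df: DataFrame): Unit =
  df.write.format("hudi")
    .option("hoodie.table.name", "mor_check")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "recordkey")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "date")
    .mode(SaveMode.Append)
    .save(basePath)

val batch = Seq(("k1", 1L, "2023-11-01")).toDF("recordkey", "ts", "date")
writeBatch(batch)                           // first write: parquet base file (insert)
writeBatch(batch.withColumn("ts", lit(2L))) // same key again: a .log file is expected
```

If .log files still do not appear after re-upserting identical keys, the incoming batches in the real job likely contain only new keys.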

Environment Description

  • Hudi version : 0.13.1

  • Spark version : 3.2.4

  • Storage (HDFS/S3/GCS..) : MinIO as an S3-compatible store

  • Running on k8s : Using spark operator

Hudi conf used in spark application

    df.write
    .format("org.apache.hudi")
    // Write Config
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.recordkey.field", "recordkey")
    .option("hoodie.datasource.write.partitionpath.field", "date")
    .option("hoodie.datasource.write.table.name", "spark_streaming")
    .option("hoodie.table.name", "spark_streaming")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.merge.small.file.group.candidates.limit", "1")
    // Hive Sync
    .option("hoodie.datasource.hive_sync.mode", "hms")
    .option("hoodie.datasource.hive_sync.metastore.uris", "thrift://hive-metastore:9090")
    .option("hoodie.datasource.hive_sync.database", "hudi_test")
    .option("hoodie.datasource.hive_sync.table", "test_table")
    .option("hoodie.datasource.hive_sync.partition_fields", "date")
    .option("hoodie.datasource.hive_sync.enable", "true")
    // Compaction
    .option("hoodie.compact.inline", "false")
    .option("hoodie.compact.inline.max.delta.commits", "6")
    .option("hoodie.datasource.compaction.async.enable", "true")
    .option("hoodie.parquet.small.file.limit", "104857600")
    // Clustering
    .option("hoodie.clustering.async.enabled", "true")
    .option("hoodie.clustering.async.max.commits", "1")
    .option("hoodie.clustering.execution.strategy.class", "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy")
    // Cleaning
    .option("hoodie.clean.async", "true")
    .option("hoodie.cleaner.commits.retained", "3")
    // Archive
    .option("hoodie.archive.async", "true")
    // Index
    .option("hoodie.index.type", "BLOOM")
    // Payload
    .option("hoodie.payload.event.time.field", "ts")
    .option("hoodie.payload.ordering.field", "ts")
    // KeyGenerator
    .option("hoodie.datasource.write.hive_style_partitioning", "true")
    .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
    // Marker
    .option("hoodie.rollback.using.markers", "true")
    // Multi-writes
    .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
    .option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
    .option("hoodie.write.lock.zookeeper.url", "zookeeper")
    .option("hoodie.write.lock.zookeeper.port", "2181")
    .option("hoodie.write.lock.zookeeper.base_path", "/hoodie")
    .option("hoodie.cleaner.policy.failed.writes", "LAZY")
    .option("hoodie.write.lock.zookeeper.lock_key", "lock_v1")
    .option("hoodie.write.lock.zookeeper.connection_timeout_ms", "30000")
    .option("hoodie.write.lock.wait_time_ms", "600000")
    // Multi-Modal Index 
    .option("hoodie.metadata.index.bloom.filter.enable", "true")
    .option("hoodie.metadata.index.column.stats.enable", "true")
    // MetaData 
    .option("hoodie.metadata.enable", "true")
    // Data Skipping
    .option("hoodie.enable.data.skipping", "true")
    // Storage
    .option("hoodie.parquet.max.file.size", "125829120")
    .option("hoodie.logfile.max.size", "1073741824")
    .option("hoodie.logfile.to.parquet.compression.ratio", "0.35")
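A possible reason the async services never run (hedged): `hoodie.datasource.compaction.async.enable` and `hoodie.clustering.async.enabled` only take effect inside a long-running writer such as a Spark Structured Streaming query; a plain batch `df.write` job exits before any async table-service thread can execute. For batch pipelines, compaction and clustering can instead be scheduled and executed out-of-band with the utilities shipped in `hudi-utilities-bundle`. A sketch follows; the jar name, base path, and exact flag spellings are assumptions for this environment, so check each tool's `--help` for your Hudi version:

```sh
# Out-of-band compaction (class exists in hudi-utilities-bundle for 0.13.x)
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  hudi-utilities-bundle_2.12-0.13.1.jar \
  --base-path s3a://bucket/path/spark_streaming \
  --table-name spark_streaming \
  --mode scheduleAndExecute

# Out-of-band clustering
spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  hudi-utilities-bundle_2.12-0.13.1.jar \
  --base-path s3a://bucket/path/spark_streaming \
  --table-name spark_streaming \
  --mode scheduleAndExecute
```

Both jobs need the same lock-provider settings as the writer when optimistic concurrency control is enabled, since they act as a second writer on the table.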

hoodie.properties file content

#Updated at 2023-11-21T20:26:57.745347Z
#Tue Nov 21 20:26:57 UTC 2023
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.table.type=MERGE_ON_READ
hoodie.table.metadata.partitions=bloom_filters,column_stats,files
hoodie.table.precombine.field=ts
hoodie.table.partition.fields=date
hoodie.archivelog.folder=archived
hoodie.table.cdc.enabled=false
hoodie.timeline.layout.version=1
hoodie.table.checksum=3323849328
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.timeline.timezone=LOCAL
hoodie.table.name=spark_streaming
hoodie.table.recordkey.fields=recordkey
hoodie.compaction.record.merger.strategy=eeb8d96f-b1e4-49fd-bbf8-28ac514178e5
hoodie.datasource.write.hive_style_partitioning=true
hoodie.partition.metafile.use.base.format=false
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.populate.meta.fields=true
hoodie.table.base.file.format=PARQUET
hoodie.database.name=
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.version=5

The .hoodie folder content is captured in the attached file hoodie.txt
hoodie.txt

A recursive listing of all folders under .hoodie is captured in the attached file hoodie-recursive.txt
hoodie-recursive.txt
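To confirm whether compaction or clustering instants were ever scheduled on the timeline, the table can also be inspected with the Hudi CLI. A sketch of an interactive session (the base path is a placeholder, not the real table location):

```sh
# Inside the Hudi CLI shell (hudi-cli)
connect --path s3a://<bucket>/<table-base-path>
commits show            # regular delta commits on the timeline
compactions show all    # any scheduled or completed compaction plans
cleans show             # clean actions (reported above to run after every commit)
```

An empty result from `compactions show all` would indicate that compaction was never even scheduled, rather than scheduled but failing to execute.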

The application processed two days of data, 1st & 2nd November. Attaching the content of both partitions below:
date=2023-11-01.txt
date=2023-11-02.txt
