
[SUPPORT] Compaction & Clustering are not working #10183

@Cpandey43

Description


Describe the problem you faced

Issue:1
I configured async compaction, async clustering, and async cleaning in the job, but none of them behaves according to the configured settings.

  • async compaction : not working at all

  • async clustering : not working at all

  • async cleaning : cleaning was executed after every commit

Issue:2
The configured table type is MOR with the upsert operation, but Hudi is not creating .log files in the partitions and is generating parquet files of at most 18MB. The application created two partitions, and both contain files of at most 18MB. Attaching the content of both partitions below for your reference.
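For context on the missing .log files, one possible explanation (hedged, not confirmed from the attachments): under MERGE_ON_READ with a BLOOM index, brand-new record keys are written as parquet base files, and .log files only appear once a later upsert updates keys that already live in a file group. A minimal sketch to check this behavior (table name, path, and schema below are hypothetical, not taken from the original job):

```scala
// Sketch only: assumes a Spark session with the Hudi bundle on the classpath.
// Upserting the SAME keys twice should produce a .log file on the second write;
// if every batch carries only new keys, MOR will keep emitting parquet base files.
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.lit
import spark.implicits._

val basePath = "s3a://bucket/hudi/mor_check" // hypothetical path

def writeBatch(df: DataFrame): Unit =
  df.write.format("hudi")
    .option("hoodie.table.name", "mor_check")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "recordkey")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "date")
    .mode(SaveMode.Append)
    .save(basePath)

val batch = Seq(("k1", 1L, "2023-11-01")).toDF("recordkey", "ts", "date")
writeBatch(batch)                           // first write: parquet base file (insert)
writeBatch(batch.withColumn("ts", lit(2L))) // same key again: a .log file is expected
```

If .log files still do not appear after re-upserting identical keys, the incoming batches in the real job likely contain only new keys.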

Environment Description

  • Hudi version : 0.13.1

  • Spark version : 3.2.4

  • Storage (HDFS/S3/GCS..) : MinIO as an S3-compatible store

  • Running on k8s : Using spark operator

Hudi conf used in spark application

    df.write
    .format("org.apache.hudi")
    // Write Config
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.recordkey.field", "recordkey")
    .option("hoodie.datasource.write.partitionpath.field", "date")
    .option("hoodie.datasource.write.table.name", "spark_streaming")
    .option("hoodie.table.name", "spark_streaming")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.merge.small.file.group.candidates.limit", "1")
    // Hive Sync
    .option("hoodie.datasource.hive_sync.mode", "hms")
    .option("hoodie.datasource.hive_sync.metastore.uris", "thrift://hive-metastore:9090")
    .option("hoodie.datasource.hive_sync.database", "hudi_test")
    .option("hoodie.datasource.hive_sync.table", "test_table")
    .option("hoodie.datasource.hive_sync.partition_fields", "date")
    .option("hoodie.datasource.hive_sync.enable", "true")
    // Compaction
    .option("hoodie.compact.inline", "false")
    .option("hoodie.compact.inline.max.delta.commits", "6")
    .option("hoodie.datasource.compaction.async.enable", "true")
    .option("hoodie.parquet.small.file.limit", "104857600")
    // Clustering
    .option("hoodie.clustering.async.enabled", "true")
    .option("hoodie.clustering.async.max.commits", "1")
    .option("hoodie.clustering.execution.strategy.class", "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy")
    // Cleaning
    .option("hoodie.clean.async", "true")
    .option("hoodie.cleaner.commits.retained", "3")
    // Archive
    .option("hoodie.archive.async", "true")
    // Index
    .option("hoodie.index.type", "BLOOM")
    // Payload
    .option("hoodie.payload.event.time.field", "ts")
    .option("hoodie.payload.ordering.field", "ts")
    // KeyGenerator
    .option("hoodie.datasource.write.hive_style_partitioning", "true")
    .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
    // Marker
    .option("hoodie.rollback.using.markers", "true")
    // Multi-writes
    .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
    .option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
    .option("hoodie.write.lock.zookeeper.url", "zookeeper")
    .option("hoodie.write.lock.zookeeper.port", "2181")
    .option("hoodie.write.lock.zookeeper.base_path", "/hoodie")
    .option("hoodie.cleaner.policy.failed.writes", "LAZY")
    .option("hoodie.write.lock.zookeeper.lock_key", "lock_v1")
    .option("hoodie.write.lock.zookeeper.connection_timeout_ms", "30000")
    .option("hoodie.write.lock.wait_time_ms", "600000")
    // Multi-Modal Index 
    .option("hoodie.metadata.index.bloom.filter.enable", "true")
    .option("hoodie.metadata.index.column.stats.enable", "true")
    // MetaData 
    .option("hoodie.metadata.enable", "true")
    // Data Skipping
    .option("hoodie.enable.data.skipping", "true")
    // Storage
    .option("hoodie.parquet.max.file.size", "125829120")
    .option("hoodie.logfile.max.size", "1073741824")
    .option("hoodie.logfile.to.parquet.compression.ratio", "0.35")
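A possible reason the async services never run (hedged): `hoodie.datasource.compaction.async.enable` and `hoodie.clustering.async.enabled` only take effect inside a long-running writer such as a Spark Structured Streaming query; a plain batch `df.write` job exits before any async table-service thread can execute. For batch pipelines, compaction and clustering can instead be scheduled and executed out-of-band with the utilities shipped in `hudi-utilities-bundle`. A sketch follows; the jar name, base path, and exact flag spellings are assumptions for this environment, so check each tool's `--help` for your Hudi version:

```sh
# Out-of-band compaction (class exists in hudi-utilities-bundle for 0.13.x)
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  hudi-utilities-bundle_2.12-0.13.1.jar \
  --base-path s3a://bucket/path/spark_streaming \
  --table-name spark_streaming \
  --mode scheduleAndExecute

# Out-of-band clustering
spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  hudi-utilities-bundle_2.12-0.13.1.jar \
  --base-path s3a://bucket/path/spark_streaming \
  --table-name spark_streaming \
  --mode scheduleAndExecute
```

Both jobs need the same lock-provider settings as the writer when optimistic concurrency control is enabled, since they act as a second writer on the table.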

hoodie.properties file content

#Updated at 2023-11-21T20:26:57.745347Z
#Tue Nov 21 20:26:57 UTC 2023
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.table.type=MERGE_ON_READ
hoodie.table.metadata.partitions=bloom_filters,column_stats,files
hoodie.table.precombine.field=ts
hoodie.table.partition.fields=date
hoodie.archivelog.folder=archived
hoodie.table.cdc.enabled=false
hoodie.timeline.layout.version=1
hoodie.table.checksum=3323849328
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.timeline.timezone=LOCAL
hoodie.table.name=spark_streaming
hoodie.table.recordkey.fields=recordkey
hoodie.compaction.record.merger.strategy=eeb8d96f-b1e4-49fd-bbf8-28ac514178e5
hoodie.datasource.write.hive_style_partitioning=true
hoodie.partition.metafile.use.base.format=false
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.populate.meta.fields=true
hoodie.table.base.file.format=PARQUET
hoodie.database.name=
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.version=5

The .hoodie folder content is captured in the attached file hoodie.txt
hoodie.txt

A recursive listing of all folders under .hoodie is captured in the attached file hoodie-recursive.txt
hoodie-recursive.txt
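To confirm whether compaction or clustering instants were ever scheduled on the timeline, the table can also be inspected with the Hudi CLI. A sketch of an interactive session (the base path is a placeholder, not the real table location):

```sh
# Inside the Hudi CLI shell (hudi-cli)
connect --path s3a://<bucket>/<table-base-path>
commits show            # regular delta commits on the timeline
compactions show all    # any scheduled or completed compaction plans
cleans show             # clean actions (reported above to run after every commit)
```

An empty result from `compactions show all` would indicate that compaction was never even scheduled, rather than scheduled but failing to execute.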

The application processed two days of data, 1st & 2nd November. Attaching the content of both partitions below:
date=2023-11-01.txt
date=2023-11-02.txt
