[SUPPORT] Parquet file size is small after running deltastreamer in BULK_INSERT which results in large number of files under same partitioning #8017

@ROOBALJINDAL

Description

Problem

Parquet files are being generated at a small size (roughly 1 MB) after running DeltaStreamer in BULK_INSERT mode, which results in a large number of files under the same partition. We want larger files for optimal processing.

We have tried various configurations as well.

The following configuration worked when we ran DeltaStreamer in UPSERT mode: it started generating relatively larger parquet files of around 5-6 MB. We don't know why only 5-6 MB; shouldn't it have sized the files according to the values given below? In BULK_INSERT mode, however, it doesn't work at all.

hoodie.parquet.max.file.size=2147483648
hoodie.parquet.small.file.limit=1073741824

I also tried passing the following configs.

hoodie.parquet.max.file.size=2147483648
hoodie.parquet.small.file.limit=1073741824
hoodie.copyonwrite.record.size.estimate=150
hoodie.memory.merge.max.size=2004857600000
hoodie.insert.shuffle.parallelism=2000
hoodie.upsert.shuffle.parallelism=2000
hoodie.copyonwrite.insert.split.size=1000000
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=4
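
As I understand it (this is an assumption based on Hudi 0.11.x behavior, not something confirmed in this thread), BULK_INSERT skips small-file handling entirely, so hoodie.parquet.small.file.limit has no effect there and the output file count is driven mainly by the write parallelism. A minimal sketch of the configs that would matter in that case:

```properties
# Sketch, assuming Hudi 0.11.x: BULK_INSERT does not bin-pack into existing
# small files, so file size is roughly (input size / number of write tasks).
hoodie.bulkinsert.shuffle.parallelism=20   # fewer tasks -> fewer, larger files
hoodie.bulkinsert.sort.mode=GLOBAL_SORT    # default; NONE avoids the sort shuffle
```

With parallelism 2000 (as above) and a modest input volume, ~1 MB files would be the expected outcome, so lowering the bulk-insert parallelism is the first thing to try.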

Expected behavior
File sizing should work for the bulk insert operation as well. And for upserts, it should honor the configured values: we always get 5-6 MB parquet files regardless of the values given.

Environment Description

  • AWS EMR : emr-6.8.0

  • Hudi version : 0.11.1

  • Spark version : 3.3.0

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : EMR cluster

Additional context

I tried clustering to merge the smaller files created by bulk insert.

Command:

spark-submit \
--master local[4] \
--class org.apache.hudi.utilities.HoodieClusteringJob \
s3://<my-bucket>/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar \
--base-path s3://<my-bucket>/<my-table-name> \
--instant-time 20210122190240 \
--table-name <my-table-name> \
--props s3://<my-bucket>/clusteringjob.properties \
--spark-memory 1g
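
For reference, a hypothetical clusteringjob.properties sketch (the keys exist in Hudi 0.11, but the values here are illustrative assumptions, not taken from this issue):

```properties
# Clustering plan strategy: which files qualify and how big the outputs should be.
hoodie.clustering.plan.strategy.small.file.limit=314572800        # files under ~300 MB are clustering candidates
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824  # target ~1 GB output files
```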

Stacktrace

23/02/09 10:43:55 INFO Javalin: Stopping Javalin ...
23/02/09 10:43:55 INFO Javalin: Javalin has stopped
23/02/09 10:43:55 ERROR UtilHelpers: Cluster failed
org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3://<my-bucket>/<db-name>/<my-table-name>/.hoodie/20210122190240.replacecommit.requested
        at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:763) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
        at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:264) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
        at org.apache.hudi.common.util.ClusteringUtils.getRequestedReplaceMetadata(ClusteringUtils.java:90) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
        at org.apache.hudi.common.util.ClusteringUtils.getClusteringPlan(ClusteringUtils.java:106) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
        at org.apache.hudi.table.action.cluster.SparkExecuteClusteringCommitActionExecutor.<init>(SparkExecuteClusteringCommitActionExecutor.java:45) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
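
A likely reading of this error (an assumption on my part, not confirmed in the thread): the `.replacecommit.requested` file does not exist because no clustering plan was ever scheduled at instant `20210122190240`, so executing against that instant fails. A sketch of scheduling and executing in one run, assuming the HoodieClusteringJob `--mode` flag available in Hudi 0.11.x:

```sh
# Sketch: scheduling creates the <instant>.replacecommit.requested file on the
# timeline; "scheduleAndExecute" then runs the plan without needing to pass
# an instant time by hand.
spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  s3://<my-bucket>/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar \
  --mode scheduleAndExecute \
  --base-path s3://<my-bucket>/<my-table-name> \
  --table-name <my-table-name> \
  --props s3://<my-bucket>/clusteringjob.properties \
  --spark-memory 1g
```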

Query: How does this work, and what is the ideal approach to handle it? And what am I missing in clustering, since it isn't working for me?
