
[SUPPORT] How can we avoid small file creations for spark streaming #7910

@JnaneshwarikTR

Hi,

  • Hudi version: 0.11.1

  • Spark version: 3.2.1

  • Hive version: NA

  • Hadoop version: NA

  • Storage (HDFS/S3/GCS..): S3

  • Running on Docker? (yes/no): no

We have a Spark streaming application running with a batch interval of 5 minutes. We added the configs below to avoid creating small files.

HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key() -> String.valueOf(104857600)
HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key() -> String.valueOf(125829120)
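For readers who want the two values above in more familiar units, here is a minimal sketch in plain Scala with no Hudi dependency. The string keys are, to my understanding, what the two `.key()` calls resolve to in Hudi 0.11.x; treat them as an assumption and check them against your Hudi version. The object name `SmallFileSettings` is made up for this illustration.

```scala
// Plain Scala sketch: the same two settings, written with string keys.
// Assumption: these keys match what the .key() calls above return.
object SmallFileSettings extends App {
  val MiB: Long = 1024L * 1024L

  val settings: Map[String, String] = Map(
    // Files below this size are treated as "small" (100 MiB = 104857600 bytes)
    "hoodie.parquet.small.file.limit" -> String.valueOf(100 * MiB),
    // Target upper bound for a Parquet base file (120 MiB = 125829120 bytes)
    "hoodie.parquet.max.file.size" -> String.valueOf(120 * MiB)
  )

  // Print each setting in bytes and in MiB
  settings.foreach { case (key, value) =>
    println(s"$key = $value bytes (${value.toLong / MiB} MiB)")
  }
}
```

So the small file limit is 100 MiB and the max file size is 120 MiB.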

However, when I run my application, I see that the Parquet files are created with sizes smaller than the configured small file limit.

Here is the complete Hudi config we are using in the application.

HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key() -> String.valueOf(104857600),
HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key() -> String.valueOf(125829120),
HoodieCompactionConfig.INLINE_COMPACT_TRIGGER_STRATEGY.key() -> CompactionTriggerStrategy.TIME_ELAPSED.name,
HoodieCompactionConfig.INLINE_COMPACT_TIME_DELTA_SECONDS.key() -> String.valueOf(60 * 60),
HoodieCompactionConfig.CLEANER_POLICY.key() -> HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name(),
HoodieCompactionConfig.CLEANER_COMMITS_RETAINED.key() -> "936",
HoodieCompactionConfig.MIN_COMMITS_TO_KEEP.key() -> "937",
HoodieCompactionConfig.MAX_COMMITS_TO_KEEP.key() -> "960",
HoodieCompactionConfig.ASYNC_CLEAN.key() -> "false",
HoodieCompactionConfig.INLINE_COMPACT.key() -> "true",
HoodieMetricsConfig.TURN_METRICS_ON.key() -> "true",
HoodieMetricsConfig.METRICS_REPORTER_TYPE_VALUE.key() -> MetricsReporterType.DATADOG.name(),
HoodieMetricsDatadogConfig.API_SITE_VALUE.key() -> "US",
HoodieMetricsDatadogConfig.METRIC_PREFIX_VALUE.key() -> "tacticalnovusingest.hudi",
HoodieMetricsDatadogConfig.API_KEY_SUPPLIER.key() -> "com.tr.indigo.tacticalnovusingest.utils.DatadogKeySupplier",
HoodieMetadataConfig.ENABLE.key() -> "false",
HoodieWriteConfig.ROLLBACK_USING_MARKERS_ENABLE.key() -> "false",
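As a rough aid for reading the retention and compaction values in this config, the sketch below does the arithmetic in plain Scala. It assumes exactly one commit per 5-minute batch, which the post does not state. The two `require` checks reflect my understanding of the ordering Hudi expects between these settings (min commits to keep above cleaner commits retained, max above min); that is an assumption, not a quote from the Hudi source. The object name `RetentionArithmetic` is made up for this illustration.

```scala
// Plain Scala sketch, no Hudi dependency: what the numbers in the config imply.
object RetentionArithmetic extends App {
  val batchIntervalMinutes   = 5        // from the post
  val cleanerCommitsRetained = 936      // CLEANER_COMMITS_RETAINED
  val minCommitsToKeep       = 937      // MIN_COMMITS_TO_KEEP
  val maxCommitsToKeep       = 960      // MAX_COMMITS_TO_KEEP
  val compactionDeltaSeconds = 60 * 60  // INLINE_COMPACT_TIME_DELTA_SECONDS

  // Assumption: one commit per batch
  val retainedHours        = cleanerCommitsRetained * batchIntervalMinutes / 60.0
  val batchesPerCompaction = compactionDeltaSeconds / (batchIntervalMinutes * 60)

  // Assumed ordering constraints between the three commit counts
  require(minCommitsToKeep > cleanerCommitsRetained, "min commits to keep must exceed cleaner commits retained")
  require(maxCommitsToKeep > minCommitsToKeep, "max commits to keep must exceed min commits to keep")

  println(s"Cleaner retains about $retainedHours hours of commits")
  println(s"Inline compaction is due roughly every $batchesPerCompaction batches")
}
```

Under that one-commit-per-batch assumption, the cleaner keeps about 78 hours of commits and time-based inline compaction comes due about once every 12 batches.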

The Parquet files that were created are shown below.

image

How can we avoid creating small files?

@koochiswathiTR is my teammate, in case you need more info.

We appreciate all the help you provide.

Thanks, JK
