Force Rewrite Avro Files During Optimization #4057

@lintingbin

Description

Currently, the optimizer may skip rewriting Avro format files even when optimization is triggered. This can lead to suboptimal table performance when Avro files exist in the table.

Problem

Avro files have different characteristics compared to columnar formats (like Parquet or ORC):

  • Write Performance: Avro format provides significantly better write performance and higher throughput compared to columnar formats, making it ideal for high-speed data ingestion scenarios
  • Read Performance: Avro's row-based layout is less efficient for analytical queries than columnar formats
  • Format Consistency: Mixed file formats in a table can complicate maintenance and optimization

Proposed Solution

Add a configurable option, self-optimizing.rewrite-all-avro, that forces all Avro files to be rewritten during optimization. This enables a write-optimized ingestion strategy:

  1. Use Avro format for fast, high-throughput data ingestion
  2. Automatically convert Avro files to Parquet/ORC during optimization for better read performance
  3. Maintain optimal performance for both write and read workloads
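
As a rough illustration of how the option might be resolved from table properties, here is a minimal Java sketch. The class RewriteAllAvroOption and the method isEffective are hypothetical names invented for this example, and the default value of false is an assumption, since the proposal does not state one. The write.format.default key is Iceberg's standard table property for the default write format. The check that ignores the option for Avro-default tables follows the validation rule listed under Implementation Details.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not the project's actual code.
final class RewriteAllAvroOption {
    // Property name taken from the proposal.
    static final String REWRITE_ALL_AVRO = "self-optimizing.rewrite-all-avro";
    // Iceberg table property that sets the default write format.
    static final String DEFAULT_FILE_FORMAT = "write.format.default";

    private RewriteAllAvroOption() {}

    // Returns true only when the option is requested and can have an effect.
    static boolean isEffective(Map<String, String> tableProperties) {
        // Assumption: the option is disabled unless explicitly set to "true".
        boolean requested = Boolean.parseBoolean(
            tableProperties.getOrDefault(REWRITE_ALL_AVRO, "false"));
        // Iceberg falls back to Parquet when no default format is configured.
        String defaultFormat = tableProperties
            .getOrDefault(DEFAULT_FILE_FORMAT, "parquet")
            .toLowerCase(Locale.ROOT);
        // Rewriting Avro into Avro gains nothing, so the option is ignored.
        return requested && !"avro".equals(defaultFormat);
    }
}
```

With this sketch, a table that sets the option to true and writes Parquet by default would have its Avro files force-rewritten, while the same setting on an Avro-default table would be a no-op.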

Implementation Details

The changes include:

  1. Add a needRewriteAvroFile flag in CommonPartitionEvaluator to track if any Avro files exist
  2. Check each file's format using the ContentFiles.isAvroFile() method
  3. Always mark Avro files for rewriting when the feature is enabled
  4. Update partition evaluation so that the presence of Avro files is, by itself, enough to make optimization necessary
  5. Add the configuration property with proper validation (the option is ignored when the table's default format is Avro)
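
The evaluation logic in the steps above could look roughly like the following Java sketch. It is a simplified stand-in, not the actual CommonPartitionEvaluator: the class PartitionEvaluatorSketch, the nested Format and DataFile types, the local isAvroFile helper (standing in for ContentFiles.isAvroFile()), and the baseline rule of "at least two small files" are all assumptions made for illustration. Only the needRewriteAvroFile flag name comes from the proposal.

```java
// Hypothetical, simplified stand-in for the evaluator described above.
final class PartitionEvaluatorSketch {
    enum Format { AVRO, PARQUET, ORC }

    // Minimal stand-in for a data file: only its format and size.
    static final class DataFile {
        final Format format;
        final long sizeBytes;

        DataFile(Format format, long sizeBytes) {
            this.format = format;
            this.sizeBytes = sizeBytes;
        }
    }

    private final boolean rewriteAllAvro;
    private final long smallFileThresholdBytes;
    // Flag name from the proposal: set once any Avro file is seen.
    private boolean needRewriteAvroFile = false;
    private int smallFileCount = 0;

    PartitionEvaluatorSketch(boolean rewriteAllAvro, long smallFileThresholdBytes) {
        this.rewriteAllAvro = rewriteAllAvro;
        this.smallFileThresholdBytes = smallFileThresholdBytes;
    }

    // Stand-in for the ContentFiles.isAvroFile() check named in the proposal.
    static boolean isAvroFile(DataFile file) {
        return file.format == Format.AVRO;
    }

    void addFile(DataFile file) {
        if (rewriteAllAvro && isAvroFile(file)) {
            // With the feature enabled, every Avro file is marked for rewriting.
            needRewriteAvroFile = true;
        } else if (file.sizeBytes < smallFileThresholdBytes) {
            // Assumed baseline rule: only small files are rewrite candidates.
            smallFileCount++;
        }
    }

    // Optimization is needed if any Avro file was marked, or the baseline rule fires.
    boolean isNecessary() {
        return needRewriteAvroFile || smallFileCount >= 2;
    }
}
```

In this sketch a single large Avro file makes the partition eligible for optimization when the option is on, whereas with the option off the same file is skipped because it is neither small nor part of a group worth merging.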

Benefits

  • High-throughput ingestion: Leverage Avro's superior write performance for data ingestion
  • Optimal read performance: Ensure all data is eventually converted to columnar format for efficient queries
  • Best of both worlds: Maximize both write and read performance through format conversion
  • Flexible configuration: Enable/disable based on workload characteristics
  • Table health: Maintains consistency in file formats across the table
