[SPARK-51446][SQL] Improve the codecNameMap for the compression codec #50221

Closed
beliefer wants to merge 1 commit into apache:master from beliefer:SPARK-51446

@beliefer beliefer commented Mar 9, 2025

What changes were proposed in this pull request?

This PR proposes to improve the codecNameMap for the [hadoop|avro|orc|parquet] compression codecs.

Why are the changes needed?

Currently, the [hadoop|avro|orc|parquet] compression codecs use a general-purpose java.util.Map to store the mapping from each compression codec to its short name.
EnumMap has the following advantages when the keys are enum constants.

  • High performance: EnumMap stores its values in an array indexed by the enum constant's ordinal, so lookups run in constant time, O(1), without hashing or collision handling.

  • Low memory usage: EnumMap needs only one array slot per enum constant and no hash table structure, resulting in a smaller memory footprint.

  • Orderliness: EnumMap iterates over its entries in the order in which the enum constants are declared, making it suitable for scenarios that require ordered traversal.
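
The ordering property can be illustrated with a minimal sketch. The `DemoCodec` enum and the `EnumMapOrderDemo` class are hypothetical names made up for this example and are not part of Spark; the entries are deliberately inserted in reverse order to show that iteration still follows the enum declaration order.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class EnumMapOrderDemo {
    // Hypothetical enum for illustration only; not Spark's actual codec list.
    enum DemoCodec { NONE, GZIP, SNAPPY, ZSTD }

    // Builds a codec -> short name map, inserting keys in reverse declaration order.
    static Map<DemoCodec, String> buildNameMap() {
        Map<DemoCodec, String> names = new EnumMap<>(DemoCodec.class);
        DemoCodec[] values = DemoCodec.values();
        for (int i = values.length - 1; i >= 0; i--) {
            names.put(values[i], values[i].name().toLowerCase(Locale.ROOT));
        }
        return names;
    }

    // Returns the short names in the map's iteration order.
    static List<String> namesInIterationOrder() {
        return new ArrayList<>(buildNameMap().values());
    }

    public static void main(String[] args) {
        // Iteration follows declaration order, not insertion order.
        System.out.println(namesInIterationOrder()); // prints [none, gzip, snappy, zstd]
    }
}
```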

Does this PR introduce any user-facing change?

'No'.

How was this patch tested?

GA (GitHub Actions).

Was this patch authored or co-authored using generative AI tooling?

No

@beliefer beliefer changed the title [WIP][SPARK-51446][SQL] Improve the codecNameMap for the compression codec [SPARK-51446][SQL] Improve the codecNameMap for the compression codec Mar 10, 2025

Contributor

private static final EnumMap<HadoopCompressionCodec, String> codecNameMap = new EnumMap<>(
    Arrays.stream(HadoopCompressionCodec.values())
        .collect(Collectors.toMap(
            Function.identity(),
            codec -> codec.name().toLowerCase(Locale.ROOT)
        ))
);

Does it look better this way?

Contributor Author

@beliefer beliefer Mar 10, 2025

That introduces an extra conversion from java.util.Map to EnumMap. How about this?

  private static final EnumMap<HadoopCompressionCodec, String> codecNameMap =
    Arrays.stream(HadoopCompressionCodec.values()).collect(
      Collectors.toMap(
        codec -> codec,
        codec -> codec.name().toLowerCase(Locale.ROOT),
        (oldValue, newValue) -> oldValue,
        () -> new EnumMap<>(HadoopCompressionCodec.class)));
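
For readers who want to try this pattern in isolation, here is a self-contained sketch of the four-argument `Collectors.toMap` with an `EnumMap` supplier, which collects directly into the `EnumMap` with no intermediate map. `DemoCodec` and `CodecNameMapDemo` are hypothetical stand-ins, not Spark classes. Since enum constants are unique, the merge function is never expected to be called here; it is required only because this `toMap` overload takes it.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.Locale;
import java.util.stream.Collectors;

class CodecNameMapDemo {
    // Hypothetical stand-in for HadoopCompressionCodec; illustration only.
    enum DemoCodec { NONE, BZIP2, GZIP, LZ4 }

    // Collects straight into an EnumMap via the map-supplier argument.
    static final EnumMap<DemoCodec, String> CODEC_NAME_MAP =
        Arrays.stream(DemoCodec.values()).collect(
            Collectors.toMap(
                codec -> codec,                                 // key: the enum constant
                codec -> codec.name().toLowerCase(Locale.ROOT), // value: lower-case short name
                (oldValue, newValue) -> oldValue,               // merge function; keys are unique
                () -> new EnumMap<>(DemoCodec.class)));         // target map implementation

    // Looks up the short name for a codec.
    static String shortName(DemoCodec codec) {
        return CODEC_NAME_MAP.get(codec);
    }

    public static void main(String[] args) {
        System.out.println(shortName(DemoCodec.GZIP)); // prints gzip
    }
}
```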

Contributor

fine to me

@beliefer
Contributor Author

Merged into master.
@LuciferYang Thank you!
