Conversation

@stevenzwu
Contributor

This is useful when we want to use a smaller row group size. In the past, we also found that tuning these configurations to smaller values is important to avoid OOM problems when there are large records (like MBs). E.g., we have set the min/max check count defaults to 10/100 for the Flink streaming ingestion path, and the performance impact was insignificant. For streams with large records, we have even set the min/max to much smaller values like 1/5.
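For context, a minimal sketch (not part of this PR's diff) of how these check counts could be tuned on a table with large records, assuming Iceberg's Table/UpdateProperties API and the property names added in this PR; the 5/50 values are purely illustrative:

```java
import org.apache.iceberg.Table;

class RowGroupCheckTuning {
  // Illustrative sketch only: lower the check counts so the writer measures
  // the buffered row group size more often when individual records are MB-sized.
  static void tuneForLargeRecords(Table table) {
    table.updateProperties()
        .set("write.parquet.row-group-check-min-record-count", "5")
        .set("write.parquet.row-group-check-max-record-count", "50")
        .commit();
  }
}
```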

class ParquetWriter<T> implements FileAppender<T>, Closeable {

-  private static DynConstructors.Ctor<PageWriteStore> pageStoreCtorParquet = DynConstructors
+  private static final DynConstructors.Ctor<PageWriteStore> pageStoreCtorParquet = DynConstructors
Contributor Author

In this PR, I also fixed these compiler warnings and the RuntimeIOException deprecation (replacing it with UncheckedIOException). Please let me know if it is preferred to split those out into a separate PR.

Contributor

Thanks for working on that! I think we are accumulating more and more of these warnings, so it's better to fix them as much as possible along the way.

@stevenzwu
Contributor Author

@rdblue can you help take a look?

Contributor

@jackye1995 left a comment

Overall looks good to me.

The only concern I have is that this seems to be a configuration specific to the use case on the engine side. For example, in streaming we might want to turn this down to a smaller number, but for normal Spark ingestion the defaults might work better.

But as of today we don't have a way to add runtime session options to the appender; it just takes table properties. I don't have enough historical context to say whether that's intentional, maybe @rdblue can comment more about this. But overall I think it is good that at least we have a config rather than hard-coded values.


"write.parquet.row-group-check-max-record-count";
public static final String DELETE_PARQUET_ROW_GROUP_CHECK_MAX_RECORD_COUNT =
"write.delete.parquet.row-group-check-max-record-count";
public static final String PARQUET_ROW_GROUP_CHECK_MAX_RECORD_COUNT_DEFAULT = "10000";
Contributor

This can be an integer.
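A sketch of what that suggestion would look like (assuming the default value stays 10000; only the type changes from String to int):

```java
// Suggested shape: declare the default as an int instead of a String.
public static final int PARQUET_ROW_GROUP_CHECK_MAX_RECORD_COUNT_DEFAULT = 10000;
```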

"write.parquet.row-group-check-min-record-count";
public static final String DELETE_PARQUET_ROW_GROUP_CHECK_MIN_RECORD_COUNT =
"write.delete.parquet.row-group-check-min-record-count";
public static final String PARQUET_ROW_GROUP_CHECK_MIN_RECORD_COUNT_DEFAULT = "100";
Contributor

This can be an integer.

String compressionLevel = config.getOrDefault(PARQUET_COMPRESSION_LEVEL, PARQUET_COMPRESSION_LEVEL_DEFAULT);

return new Context(rowGroupSize, pageSize, dictionaryPageSize, codec, compressionLevel);
int rowGroupCheckMinRecordCount = Integer.parseInt(config.getOrDefault(
Contributor

can use PropertyUtil.propertyAsInt
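A sketch of the suggested form, assuming the default constants are also changed to int (PropertyUtil.propertyAsInt takes a Map of string properties, a key, and an int default):

```java
int rowGroupCheckMinRecordCount = PropertyUtil.propertyAsInt(config,
    PARQUET_ROW_GROUP_CHECK_MIN_RECORD_COUNT, PARQUET_ROW_GROUP_CHECK_MIN_RECORD_COUNT_DEFAULT);
int rowGroupCheckMaxRecordCount = PropertyUtil.propertyAsInt(config,
    PARQUET_ROW_GROUP_CHECK_MAX_RECORD_COUNT, PARQUET_ROW_GROUP_CHECK_MAX_RECORD_COUNT_DEFAULT);
```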

Contributor Author

Originally, I was trying to stay with the existing code/style in this class. In the new commit, I have updated Integer.parseInt to PropertyUtil.propertyAsInt for all integer configs (old and new).

PARQUET_ROW_GROUP_CHECK_MIN_RECORD_COUNT, PARQUET_ROW_GROUP_CHECK_MIN_RECORD_COUNT_DEFAULT));
int rowGroupCheckMaxRecordCount = Integer.parseInt(config.getOrDefault(
PARQUET_ROW_GROUP_CHECK_MAX_RECORD_COUNT, PARQUET_ROW_GROUP_CHECK_MAX_RECORD_COUNT_DEFAULT));

Contributor

I think we need to add some basic validations, such as that the config values are positive and that max is greater than min.
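Something along these lines (a rough sketch of the suggested checks, not necessarily the exact code that was merged):

```java
Preconditions.checkArgument(rowGroupCheckMinRecordCount > 0,
    "Row group check minimal record count must be > 0");
Preconditions.checkArgument(rowGroupCheckMaxRecordCount > 0,
    "Row group check maximum record count must be > 0");
Preconditions.checkArgument(rowGroupCheckMaxRecordCount >= rowGroupCheckMinRecordCount,
    "Row group check maximum record count must be >= minimal record count");
```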

Contributor Author

Thanks for pointing this out. Added validations.

github-actions bot added the spark label Sep 30, 2021
@stevenzwu
Contributor Author

stevenzwu commented Sep 30, 2021

@jackye1995 Please take another look.

Regarding your comment that the new configs are more specific to the use case on the engine side, I think this is not engine specific. Sure, it probably matters a little more for streaming ingestion (Flink or Spark streaming), but it can matter for batch writes too.

E.g., we may want a smaller row group size (like 16 MB) to be able to split files into more splits for higher parallelism. If the average row size is big (like MBs), then we need to tune these configs down to get more accurate control over the target row group size (and memory consumption), regardless of whether the write is streaming or batch.
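As a rough illustration with assumed numbers (not from this PR): with ~2 MB records and the default min check count of 100, roughly 200 MB can be buffered before the writer first checks the row group size, far past a 16 MB target; with min/max check counts of 5/10, the first check happens after about 10 MB.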

stevenzwu force-pushed the parquetWriter branch 2 times, most recently from e02dac1 to 817f426 on November 9, 2021 01:41
stevenzwu closed this Nov 9, 2021
stevenzwu reopened this Nov 9, 2021
@stevenzwu
Contributor Author

@rdblue can you help take a look?

int rowGroupSize = PropertyUtil.propertyAsInt(config,
PARQUET_ROW_GROUP_SIZE_BYTES, PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT);
Preconditions.checkArgument(rowGroupSize > 0,
"Row group size must be > 0");
Contributor

Does this need to be on a separate line?

Contributor Author

fixed

// Even though row group size is 16 bytes, we still have to write 101 records
// as default PARQUET_ROW_GROUP_CHECK_MIN_RECORD_COUNT is 100.
File parquetFile = generateFileWithTwoRowGroups(null, 101, props)
.first();
Contributor

Does this need to be on a separate line?

Contributor Author

fixed

}

-  private Pair<File, Long> generateFileWithTwoRowGroups(Function<MessageType, ParquetValueWriter<?>> createWriterFunc)
+  private Pair<File, Long> generateFileWithTwoRowGroups(
Contributor

The modifications to these tests don't seem to fit. This pulls the writer configuration out of this method, but it still uses the name generateFileWithTwoRowGroups even though there's no longer a guarantee that two row groups will actually be written. I think it would be better to build the properties in this method and pass in the settings:

private ... generateFile(Function<...> createWriterFunc, Long rowGroupSizeBytes, Integer minCheckRecordCount, Integer maxCheckRecordCount)

That way, all this function does is create the file. I think it should also set its own desired record count based on the min row group count so you don't have to pass in 1 more than the row group size in records.
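A hypothetical sketch of how the helper could build its own properties from the passed-in settings (names assumed, not the final code in the PR; it assumes java.util imports and that the TableProperties constants are statically imported as in the existing test):

```java
private static Map<String, String> writerProps(
    Long rowGroupSizeBytes, Integer minCheckRecordCount, Integer maxCheckRecordCount) {
  Map<String, String> props = new HashMap<>();
  if (rowGroupSizeBytes != null) {
    props.put(PARQUET_ROW_GROUP_SIZE_BYTES, rowGroupSizeBytes.toString());
  }
  if (minCheckRecordCount != null) {
    props.put(PARQUET_ROW_GROUP_CHECK_MIN_RECORD_COUNT, minCheckRecordCount.toString());
  }
  if (maxCheckRecordCount != null) {
    props.put(PARQUET_ROW_GROUP_CHECK_MAX_RECORD_COUNT, maxCheckRecordCount.toString());
  }
  return props;
}
```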

Contributor Author

Updated the method to build the props inside this method.

desiredRecordCount is already part of the method signature, so there is no more hard-coding of "1 more than the row group size in records".

Contributor

@rdblue left a comment

The main changes look good, but I think this could make the test modifications more clear.

Contributor

@kbendick left a comment

LGTM

rdblue merged commit 9cc2ac8 into apache:master Feb 18, 2022