HIVE-27309: Large number of partitions and small files causes OOM in … #4645
Conversation
Resolved review threads:
- iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/InputFormatConfig.java
- iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java
```java
}

if (conf.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + conf.get(InputFormatConfig.TABLE_IDENTIFIER)) == null) {
  conf.set(InputFormatConfig.SERIALIZED_TABLE_PREFIX + conf.get(InputFormatConfig.TABLE_IDENTIFIER),
```
IIUC, every split has a conf, and each conf would contain a serialized Iceberg table. I want to know whether the serialized Iceberg table makes the conf bigger, and whether it could also cause an AM OOM in the case of many splits.
Did we run a benchmark test to validate this change?
- The conf contained the serialized Iceberg table even before this change. All q-tests except one were passing after removing the serialized table from the splits and without adding it to the conf, because it was already there. So these changes do not increase memory usage anywhere; they strictly decrease it, because the serialized table is removed from the splits.
- The one q-test that was failing was show_partitions_test.q. It was failing because it does a select from table.partitions.
- There were 2 unit test classes (not q-tests) that were failing because the serialized table wasn't in the conf, but that was something present only in unit tests.

For these 2 edge cases, I made the change to conditionally add the serialized table into the conf in IcebergInputFormat.getSplits(), only when it is missing from the conf.
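The "add only if missing" behavior described above can be sketched in isolation. This is a minimal illustration: a plain `Properties` object stands in for the Hadoop `Configuration`, and the key names and helper are hypothetical, merely mirroring the `InputFormatConfig` pattern:

```java
import java.util.Properties;

public class SerializedTableConfSketch {
    // Hypothetical keys mirroring the InputFormatConfig constants
    static final String SERIALIZED_TABLE_PREFIX = "iceberg.mr.serialized.table.";

    // Add the serialized table to the conf only when absent, so repeated
    // getSplits() calls never overwrite or re-serialize an existing entry.
    static void ensureSerializedTable(Properties conf, String tableName, String serialized) {
        String key = SERIALIZED_TABLE_PREFIX + tableName;
        if (conf.getProperty(key) == null) {
            conf.setProperty(key, serialized);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        ensureSerializedTable(conf, "db.tbl", "base64payload");
        // A second call must not overwrite the existing entry
        ensureSerializedTable(conf, "db.tbl", "otherPayload");
        System.out.println(conf.getProperty(SERIALIZED_TABLE_PREFIX + "db.tbl"));
    }
}
```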
I almost see what you mean. thx
```java
.orElseGet(() -> Catalogs.loadTable(conf));

if (conf.get(InputFormatConfig.TABLE_IDENTIFIER) == null) {
  conf.set(InputFormatConfig.TABLE_IDENTIFIER, table.name());
```
I think we do not need this param. We can just use table.name().
We can skip the if check as well.
yes, this is also what i mean ;)
```diff
  this.conf = newContext.getConfiguration();
- this.table = ((IcebergSplit) split).table();
+ this.table = SerializationUtil.deserializeFromBase64(
+     conf.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + conf.get(InputFormatConfig.TABLE_IDENTIFIER)));
```
Suggested change:
```diff
- conf.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + conf.get(InputFormatConfig.TABLE_IDENTIFIER)));
+ conf.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + table.name()));
```
deniskuzZ left a comment:
LGTM +1
```java
.ofNullable(HiveIcebergStorageHandler.table(conf, conf.get(InputFormatConfig.TABLE_IDENTIFIER)))
.orElseGet(() -> Catalogs.loadTable(conf));

conf.set(InputFormatConfig.TABLE_IDENTIFIER, table.name());
```
Could we do the conf setup when loading the table?

```java
Table table = Optional
    .ofNullable(HiveIcebergStorageHandler.table(conf, conf.get(InputFormatConfig.TABLE_IDENTIFIER)))
    .orElseGet(() -> {
      Table tbl = Catalogs.loadTable(conf);
      conf.set(InputFormatConfig.TABLE_IDENTIFIER, tbl.name());
      conf.set(InputFormatConfig.SERIALIZED_TABLE_PREFIX + tbl.name(), SerializationUtil.serializeToBase64(tbl));
      return tbl;
    });
```
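The load-then-cache pattern used in the suggestion above can be illustrated on its own. In this sketch a plain `HashMap` stands in for the Hadoop `Configuration`, and all names are illustrative, not the Iceberg API: the point is that `Optional.ofNullable(...).orElseGet(...)` performs the expensive load only on a cache miss and populates the cache in the same step:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class LoadOrCacheSketch {
    static final Map<String, String> cache = new HashMap<>();
    static int loadCount = 0;

    // Stand-in for the expensive Catalogs.loadTable(conf) call
    static String loadTable() {
        loadCount++;
        return "tableObject";
    }

    static String getTable(String name) {
        return Optional.ofNullable(cache.get(name))
            .orElseGet(() -> {
                String tbl = loadTable();
                cache.put(name, tbl); // populate the cache on first load
                return tbl;
            });
    }

    public static void main(String[] args) {
        getTable("db.tbl");
        getTable("db.tbl"); // second call hits the cache, no reload
        System.out.println(loadCount);
    }
}
```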
Kudos, SonarCloud Quality Gate passed!
…query coordinator (Dmitriy Fingerman, reviewed by Butao Zhang, Denys Kuzmenko) Closes apache#4645









What changes were proposed in this pull request?
Removed the serialized table from the Iceberg split and instead used the serialized table already present in the config.
Why are the changes needed?
"org.apache.iceberg.SerializableTable" is getting serialized in every split takes a hit with large number of small splits. E.g in the case provided in the ticket, i had 100,000+ splits which were grouped to 41 splits.
However, during serialization "Table" information is serialized and each entity is around 7 KB. With 100,000 entries it will be easily 700 MB where as AM size itself is 2 GB.
When large number of nested partitions (with small files) are read, AM bails out with OOM.
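The figures above can be checked with a back-of-the-envelope calculation (split count and per-entry size are taken from the ticket as stated; the exact per-entry size is approximate):

```java
public class SplitMemoryEstimate {
    public static void main(String[] args) {
        long splits = 100_000;          // number of splits from the ticket
        long bytesPerTable = 7 * 1024;  // ~7 KB of serialized Table per split
        // Total bytes of duplicated Table payload, converted to whole MB
        long totalMB = splits * bytesPerTable / (1024 * 1024);
        System.out.println(totalMB + " MB");
    }
}
```

This prints roughly 683 MB, consistent with the ~700 MB cited against a 2 GB AM heap.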
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
Passing pre-commit testing.