Skip to content

Conversation

@difin
Copy link
Contributor

@difin difin commented Aug 30, 2023

…query coordinator.

What changes were proposed in this pull request?

Removed serialized table from Iceberg split and instead using the serialized table already existing in the config.

Why are the changes needed?

"org.apache.iceberg.SerializableTable" is getting serialized in every split takes a hit with large number of small splits. E.g in the case provided in the ticket, i had 100,000+ splits which were grouped to 41 splits.

However, during serialization "Table" information is serialized and each entity is around 7 KB. With 100,000 entries it will be easily 700 MB where as AM size itself is 2 GB.
When large number of nested partitions (with small files) are read, AM bails out with OOM.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Passing pre-commit testing.

}

if (conf.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + conf.get(InputFormatConfig.TABLE_IDENTIFIER)) == null) {
conf.set(InputFormatConfig.SERIALIZED_TABLE_PREFIX + conf.get(InputFormatConfig.TABLE_IDENTIFIER),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, every split has a conf and each conf would contain a srialized iceberg table. I want to know if the srialized iceberg table would make conf bigger? and also would cause AM oom in case of many splits?
Did we have a benchmark test to verify the validation of this change?

Copy link
Contributor Author

@difin difin Sep 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Conf contained serialized Iceberg table even before these change. All q-tests besides one were passing after removing the serialized table from splits and without adding serialized table to conf because it was already there. So these changes actually do not increase memory usage anywhere, but strictly decrease, because serialized table is removed from the splits.

  • One q-test that was failing was show_partitions_test.q. It was failing because it is doing select from table.partitions.

  • There were 2 unit test classes (not q-tests) that were failing because serialized table wasn't in conf, but it is something that was present only in unit tests.

For these 2 edge cases, I made the change to conditionally add serialized table into conf in IcebergInputFormat.getSplits(), only in case it is missing in conf.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost see what you mean. thx

}

if (conf.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + conf.get(InputFormatConfig.TABLE_IDENTIFIER)) == null) {
conf.set(InputFormatConfig.SERIALIZED_TABLE_PREFIX + conf.get(InputFormatConfig.TABLE_IDENTIFIER),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost see what you mean. thx

.orElseGet(() -> Catalogs.loadTable(conf));

if (conf.get(InputFormatConfig.TABLE_IDENTIFIER) == null) {
conf.set(InputFormatConfig.TABLE_IDENTIFIER, table.name());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do not need this param. We can just use table.name(),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also we can skip the if check as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is also what i mean ;)

this.conf = newContext.getConfiguration();
this.table = ((IcebergSplit) split).table();
this.table = SerializationUtil.deserializeFromBase64(
conf.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + conf.get(InputFormatConfig.TABLE_IDENTIFIER)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
conf.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + conf.get(InputFormatConfig.TABLE_IDENTIFIER)));
conf.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + table.name()));

Copy link
Member

@deniskuzZ deniskuzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

.ofNullable(HiveIcebergStorageHandler.table(conf, conf.get(InputFormatConfig.TABLE_IDENTIFIER)))
.orElseGet(() -> Catalogs.loadTable(conf));

conf.set(InputFormatConfig.TABLE_IDENTIFIER, table.name());
Copy link
Member

@deniskuzZ deniskuzZ Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we do the conf setup when loading the table:

Table table = Optional
  .ofNullable(HiveIcebergStorageHandler.table(conf, conf.get(InputFormatConfig.TABLE_IDENTIFIER)))
  .orElseGet(() -> {
        Table tbl = Catalogs.loadTable(conf);
        conf.set(InputFormatConfig.TABLE_IDENTIFIER, tbl.name());
        conf.set(InputFormatConfig.SERIALIZED_TABLE_PREFIX + tbl.name(), SerializationUtil.serializeToBase64(tbl));
        return tbl;
  });

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

warning The version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.
Read more here

@deniskuzZ deniskuzZ merged commit 88bc826 into apache:master Sep 13, 2023
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
…query coordinator (Dmitriy Fingerman, reviewed by Butao Zhang, Denys Kuzmenko)

Closes apache#4645
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants