Propagate iceberg.hadoop.conf. prefixed properties to HadoopFileIO's Configuration #16017

@lrsb

Feature Request / Improvement

Description

HadoopFileIO is configurable via two paths:

  1. FileIO.initialize(Map<String, String>) — the catalog pushes catalog properties in.
  2. HadoopConfigurable#setConf(Configuration) — a caller-supplied Hadoop Configuration.

Today, catalog properties are never propagated into the Hadoop Configuration that HadoopFileIO uses for its FileSystem lookups. A user who wants to set Hadoop-level options (fs.s3a.endpoint, fs.s3a.path.style.access, fs.defaultFS, fs.s3a.access.key, fs.azure.account.key.*, etc.) on a per-catalog basis has only these options:

  • Mutate the shared Configuration object out-of-band before creating the catalog. This leaks settings across catalogs that share the same Configuration, and is impossible in environments like Flink session clusters where the same Configuration is reused for many catalogs.
  • Carry a separate hadoop-site.xml per deployment. This does not compose with REST-catalog deployments, where the catalog is discovered dynamically.
  • Implement a custom FileIO. Overkill for a configuration-only concern.
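
The leak described in the first option can be illustrated with a minimal, dependency-free sketch. It uses java.util.Properties as a stand-in for a Hadoop Configuration, which is an assumption made only to keep the example runnable on its own; the SharedConfLeak and Catalog names are hypothetical and are not Iceberg classes.

```java
import java.util.Properties;

public class SharedConfLeak {
  // Hypothetical stand-in for a catalog that keeps a reference to the caller's
  // configuration object (Properties plays the role of a Hadoop Configuration).
  static final class Catalog {
    private final Properties conf;

    Catalog(Properties conf) {
      this.conf = conf;
    }

    String endpoint() {
      return conf.getProperty("fs.s3a.endpoint");
    }
  }

  // Returns the endpoint seen by catalog A and by catalog B, in that order.
  static String[] demo() {
    Properties shared = new Properties();
    Catalog a = new Catalog(shared);
    Catalog b = new Catalog(shared);

    // Intended only for catalog A, but it mutates the object B also reads from.
    shared.setProperty("fs.s3a.endpoint", "https://s3.eu-west-1.example.com");

    return new String[] {a.endpoint(), b.endpoint()};
  }

  public static void main(String[] args) {
    String[] seen = demo();
    System.out.println("A sees " + seen[0] + ", B sees " + seen[1]);
  }
}
```

Both catalogs report the same endpoint, even though the setting was meant for only one of them.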

Proposed solution

Treat catalog properties whose key starts with iceberg.hadoop.conf. as Hadoop Configuration overrides: strip the prefix and set(...) them on the Configuration that HadoopFileIO uses. Apply the override on both initialize(...) and setConf(...) so the feature works whether the catalog driver or a Hadoop-aware caller is in charge of the configuration.

To avoid mutating caller-owned objects, copy the base Configuration before overlaying:

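// Catalog properties with this prefix are treated as Hadoop Configuration overrides.
private static final String HADOOP_CONF_PREFIX = "iceberg.hadoop.conf.";

@Override
public void initialize(Map<String, String> props) {
  this.properties = SerializableMap.copyOf(props);
  // Re-apply the overrides on top of whatever Configuration is currently in use.
  rebuildConfWithProperties(getConf());
}

@Override
public void setConf(Configuration conf) {
  // Overlay any already-known catalog properties onto the caller-supplied Configuration.
  rebuildConfWithProperties(conf);
}

private void rebuildConfWithProperties(Configuration baseConf) {
  // Copy first so the caller-owned Configuration is never mutated.
  Configuration conf = new Configuration(baseConf);
  // propertiesWithPrefix returns the matching entries with the prefix stripped.
  PropertyUtil.propertiesWithPrefix(properties, HADOOP_CONF_PREFIX).forEach(conf::set);
  this.hadoopConf = new SerializableConfiguration(conf);
}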
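
To make the copy-then-overlay behaviour easy to try without Hadoop or Iceberg on the classpath, here is a minimal sketch of the same idea. It is not the proposed implementation: plain java.util.Map instances stand in for Configuration and SerializableMap, and the PrefixOverlaySketch class is hypothetical. It shows that the overlay is applied in either call order and that the caller's base map is left untouched.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixOverlaySketch {
  static final String HADOOP_CONF_PREFIX = "iceberg.hadoop.conf.";

  // Stand-ins (assumptions): plain maps replace Iceberg's SerializableMap
  // and Hadoop's Configuration.
  private Map<String, String> properties = new HashMap<>();
  private Map<String, String> effectiveConf = new HashMap<>();

  public void initialize(Map<String, String> props) {
    this.properties = new HashMap<>(props);
    rebuild(effectiveConf);
  }

  public void setConf(Map<String, String> conf) {
    rebuild(conf);
  }

  public Map<String, String> getConf() {
    return effectiveConf;
  }

  private void rebuild(Map<String, String> base) {
    // Copy first so the caller-owned map is never mutated.
    Map<String, String> copy = new HashMap<>(base);
    // Strip the prefix from matching keys and overlay them on the copy.
    properties.forEach((key, value) -> {
      if (key.startsWith(HADOOP_CONF_PREFIX)) {
        copy.put(key.substring(HADOOP_CONF_PREFIX.length()), value);
      }
    });
    this.effectiveConf = copy;
  }

  public static void main(String[] args) {
    Map<String, String> base = new HashMap<>();
    base.put("fs.defaultFS", "hdfs://nn:8020");

    Map<String, String> props = new HashMap<>();
    props.put("iceberg.hadoop.conf.fs.s3a.endpoint", "https://s3.eu-west-1.example.com");
    props.put("warehouse", "s3a://bucket/wh"); // no prefix, so not propagated

    PrefixOverlaySketch io = new PrefixOverlaySketch();
    io.setConf(base);
    io.initialize(props);

    System.out.println("effective: " + io.getConf());
    System.out.println("base untouched: " + !base.containsKey("fs.s3a.endpoint"));
  }
}
```

In this sketch, each rebuild starts from the previously overlaid map, so calling initialize a second time with different properties keeps the earlier overrides unless the new properties replace them. Whether that accumulation is desirable in the real HadoopFileIO is a design question this sketch does not settle.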

Example usage

# REST catalog config, or any Iceberg catalog config
iceberg.hadoop.conf.fs.s3a.endpoint            = https://s3.eu-west-1.example.com
iceberg.hadoop.conf.fs.s3a.path.style.access   = true
iceberg.hadoop.conf.fs.s3a.connection.maximum  = 200

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time

Labels: improvement (PR that improves existing functionality)
