Feature Request / Improvement
Description
HadoopFileIO is configurable via two paths:
- FileIO.initialize(Map<String, String>): the catalog pushes catalog properties in.
- HadoopConfigurable#setConf(Configuration): the caller supplies a Hadoop Configuration.
Catalog properties are currently never propagated into the Hadoop Configuration that HadoopFileIO uses for its FileSystem lookups. A user who wants to set Hadoop-level settings (fs.s3a.endpoint, fs.s3a.path.style.access, fs.defaultFS, fs.s3a.access.key, fs.azure.account.key.*, etc.) on a per-catalog basis has these options today:
- Mutate the shared Configuration object out-of-band before creating the catalog. This leaks settings across catalogs that share the same Configuration, and is impossible in environments like Flink session clusters where the same Configuration is reused for many catalogs.
- Carry a separate hadoop-site.xml per deployment. This doesn't compose with REST-catalog deployments where the catalog is discovered dynamically.
- Implement a custom FileIO. This is overkill for a configuration-only concern.
Proposed solution
Treat catalog properties whose key starts with iceberg.hadoop.conf. as Hadoop Configuration overrides: strip the prefix and set(...) them on the Configuration that HadoopFileIO uses. Apply the override on both initialize(...) and setConf(...) so the feature works whether the catalog driver or a Hadoop-aware caller is in charge of the configuration.
To avoid mutating caller-owned objects, copy the base Configuration before overlaying:
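```java
// Catalog properties with this prefix are forwarded to the Hadoop Configuration.
private static final String HADOOP_CONF_PREFIX = "iceberg.hadoop.conf.";

@Override
public void initialize(Map<String, String> props) {
  this.properties = SerializableMap.copyOf(props);
  // Re-apply the overrides on top of whatever Configuration is currently in use.
  rebuildConfWithProperties(getConf());
}

@Override
public void setConf(Configuration conf) {
  rebuildConfWithProperties(conf);
}

private void rebuildConfWithProperties(Configuration baseConf) {
  // Copy so that the caller-owned Configuration is never mutated.
  Configuration conf = new Configuration(baseConf);
  // propertiesWithPrefix returns the matching entries with the prefix stripped.
  PropertyUtil.propertiesWithPrefix(properties, HADOOP_CONF_PREFIX).forEach(conf::set);
  this.hadoopConf = new SerializableConfiguration(conf);
}
```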
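To sanity-check the intended semantics without pulling in Hadoop or Iceberg, here is a minimal, dependency-free sketch of the same idea. It is an illustration only, not the proposed patch: the class name OverlaySketch is hypothetical, and a plain Map<String, String> stands in for Hadoop's Configuration. It shows the three properties the proposal relies on: the prefix is stripped, the caller's base configuration is copied rather than mutated, and the result is the same whichever of initialize(...) or setConf(...) runs first.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed HadoopFileIO behavior.
// A Map<String, String> plays the role of Hadoop's Configuration.
class OverlaySketch {
  static final String HADOOP_CONF_PREFIX = "iceberg.hadoop.conf.";

  private Map<String, String> properties = new HashMap<>();
  private Map<String, String> effectiveConf = new HashMap<>();

  // Mirrors initialize(...): remember the catalog properties, then re-apply the overlay.
  void initialize(Map<String, String> props) {
    this.properties = new HashMap<>(props);
    rebuild(effectiveConf);
  }

  // Mirrors setConf(...): use the caller's configuration as the new base.
  void setConf(Map<String, String> baseConf) {
    rebuild(baseConf);
  }

  Map<String, String> getConf() {
    return effectiveConf;
  }

  private void rebuild(Map<String, String> baseConf) {
    // Copy first so the caller-owned base is never mutated.
    Map<String, String> conf = new HashMap<>(baseConf);
    for (Map.Entry<String, String> e : properties.entrySet()) {
      if (e.getKey().startsWith(HADOOP_CONF_PREFIX)) {
        // Strip the prefix; what remains is the Hadoop-level key.
        conf.put(e.getKey().substring(HADOOP_CONF_PREFIX.length()), e.getValue());
      }
    }
    this.effectiveConf = conf;
  }

  public static void main(String[] args) {
    Map<String, String> base = new HashMap<>();
    base.put("fs.s3a.endpoint", "https://base.example.com");

    Map<String, String> props = new HashMap<>();
    props.put("iceberg.hadoop.conf.fs.s3a.endpoint", "https://s3.eu-west-1.example.com");
    props.put("warehouse", "s3a://bucket/wh"); // no prefix, so not forwarded

    OverlaySketch io = new OverlaySketch();
    io.setConf(base);
    io.initialize(props);

    System.out.println("effective: " + io.getConf().get("fs.s3a.endpoint"));
    System.out.println("base:      " + base.get("fs.s3a.endpoint"));
  }
}
```

One thing the sketch makes visible: because initialize(...) overlays on top of the configuration already in use, an override applied by an earlier initialize(...) call stays in place unless a later call sets the same key. The snippet above appears to behave the same way, which may or may not be desirable.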
Example usage
```
# REST catalog config, or any Iceberg catalog config
iceberg.hadoop.conf.fs.s3a.endpoint = https://s3.eu-west-1.example.com
iceberg.hadoop.conf.fs.s3a.path.style.access = true
iceberg.hadoop.conf.fs.s3a.connection.maximum = 200
```
Query engine
None
Willingness to contribute