[SPARK-27846][CORE] Eagerly compute Configuration.properties in sc.hadoopConfiguration

## What changes were proposed in this pull request?

Hadoop `Configuration` has an internal `properties` map which is lazily initialized. Initialization of this field, done in the private `Configuration.getProps()` method, is rather expensive because it ends up parsing XML configuration files. When cloning a `Configuration`, this `properties` field is cloned if it has been initialized.

In some cases it's possible that `sc.hadoopConfiguration` never ends up computing this `properties` field, leading to performance problems when this configuration is cloned in `SessionState.newHadoopConf()` because each cloned `Configuration` needs to re-parse configuration XML files from disk.

To avoid this problem, we can call `Configuration.size()` to trigger a call to `getProps()`, ensuring that this expensive computation is cached and re-used when cloning configurations.

I discovered this problem while performance profiling the Spark ThriftServer while running a SQL fuzzing workload.

## How was this patch tested?

Examined YourKit profiles before and after my change.

Closes #24714 from JoshRosen/fuzzing-perf-improvements.

Authored-by: Josh Rosen <rosenville@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
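
For illustration only, the lazy-initialization and cloning behavior described in this commit message can be modeled with a small stand-in class. This is a sketch under stated assumptions, not Hadoop's or Spark's actual code: `LazyConf`, `parseCount`, and `parsesForClones` are hypothetical names, and incrementing a counter stands in for parsing XML configuration files from disk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy stand-in for Hadoop's Configuration; not the real class.
class LazyConf {
    // Counts simulated "XML parses" across all instances.
    static int parseCount = 0;

    // Lazily initialized, like the internal `properties` field.
    private Map<String, String> props;

    LazyConf() {}

    // Copy constructor: copies the cached map only if it was already built.
    LazyConf(LazyConf other) {
        if (other.props != null) {
            this.props = new HashMap<>(other.props);
        }
    }

    // Builds the map on first use; the counter stands in for the expensive parse.
    private Map<String, String> getProps() {
        if (props == null) {
            parseCount++;
            props = new HashMap<>();
            props.put("fs.defaultFS", "file:///");
        }
        return props;
    }

    // Forces initialization as a side effect of reporting the entry count.
    int size() {
        return getProps().size();
    }

    String get(String key) {
        return getProps().get(key);
    }

    // Clones `base` n times, reads one key from each clone,
    // and returns how many simulated parses that caused.
    static int parsesForClones(LazyConf base, int n) {
        int before = parseCount;
        for (int i = 0; i < n; i++) {
            new LazyConf(base).get("fs.defaultFS");
        }
        return parseCount - before;
    }
}
```

In this model, cloning an uninitialized base five times and reading from each clone costs five simulated parses, while calling `size()` on the base first means the clones copy the cached map and cost none. That mirrors the reasoning given in the commit message for calling `Configuration.size()` once up front.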