Feature Request / Improvement
The class HadoopFileIO uses the Hadoop 2.x APIs for opening and creating files. By moving to the newer createFile/openFile builder APIs and passing in information already known about each file, it should be possible to:
- Eliminate HEAD requests when opening files (s3a, abfs).
- Switch the cloud connector to an optimal read policy for the known file type (random, sequential, ...) (s3a, gcs).
- Have the s3a connector skip the checks that creating a file won't overwrite a directory. This is consistent with S3OutputFile and relies on Iceberg creating unique filenames, or at least not creating files above other objects.
There are some other minor tunings.
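As a rough illustration of the "passing in information" part, here is a minimal sketch of how options for an openFile() call could be derived from a file's location and known length. The class OpenFileOptionsSketch, its methods, and the extension-to-policy mapping are hypothetical and not part of Iceberg or Hadoop. Only the two option keys and the policy names (random, sequential, adaptive) are the standard ones from the Hadoop openFile() API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Iceberg or Hadoop.
final class OpenFileOptionsSketch {
  // Standard option keys of the Hadoop openFile() builder API.
  static final String READ_POLICY = "fs.option.openfile.read.policy";
  static final String LENGTH = "fs.option.openfile.length";

  private OpenFileOptionsSketch() {}

  // Assumed mapping: columnar formats are read with seeks, row formats front to back.
  static String policyFor(String location) {
    String name = location.toLowerCase(Locale.ROOT);
    if (name.endsWith(".parquet") || name.endsWith(".orc")) {
      return "random";
    }
    if (name.endsWith(".avro") || name.endsWith(".json")) {
      return "sequential";
    }
    // Unknown type: let the connector adapt to the observed access pattern.
    return "adaptive";
  }

  // Builds the option map; a negative length means "unknown" and is left out.
  static Map<String, String> optionsFor(String location, long knownLength) {
    Map<String, String> opts = new LinkedHashMap<>();
    opts.put(READ_POLICY, policyFor(location));
    if (knownLength >= 0) {
      // Supplying the length is the hint that can let a connector skip its HEAD probe.
      opts.put(LENGTH, Long.toString(knownLength));
    }
    return opts;
  }

  public static void main(String[] args) {
    System.out.println(optionsFor("s3a://bucket/data/00000-0.parquet", 4096L));
  }
}
```

Each entry would then be handed to the builder, along the lines of fs.openFile(path).opt(key, value), before calling build(). Options set with opt() are hints that a filesystem is free to ignore, which is what should keep this safe against hdfs and the local fs.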
Constraints
- All of this shall use APIs available in Hadoop 3.3.5, so there's no need for reflection when running with Spark 3.4+.
- There shall not be any adverse consequences when running against HDFS or the local filesystem.
Query engine
None
Willingness to contribute