File I/O Submodule for TableOperations #12

mccheah · 2018-11-27T01:25:35Z

In Netflix/iceberg#107 it was discussed that InputFile and OutputFile instances should be pluggable. We discussed the fact that provision of InputFile and OutputFile instances should be handled by the TableOperations API. However, the Spark data source in particular only uses HadoopInputFile#fromPath for reading and HadoopOutputFile#fromPath for writing. Using TableOperations#newInputFile and TableOperations#newOutputFile, would also be difficult because calling these methods on the executors would require TableOperations instances to be Serializable.

We propose having the TableOperations API provide a FileIO module that handles the narrow role of reading, creating / writing, and deleting files. We propose the following:

interface FileIO extends Serializable {
  InputFile newInputFile(String path);
  OutputFile newOutputFile(String path);
  void deleteFile(String path);
}

Then the following method would be added to TableOperations, and we would remove TableOperations#newInputFile and TableOperations#newMetadataFile.

interface TableOperations {
  FileIO fileIo();
  String resolveNewMetadataPath(String metadataFilename);
}

The need for resolveNewMetadataPath is because the new FileIO abstraction considers all locations as full paths, but the old method TableOperations#newMetadataFile assumes the argument is a file name, not a full path. Therefore now callers that used to call TableOperations#newMetadataFile should first retrieve the full path and then pass that along to FileIO#newOutputFile. For convenience we could add a helper default method like so:

interface TableOperations {
  FileIO fileIo();
  String resolveNewMetadataPath(String metadataFilename);
  default OutputFile newMetadataFile(String fileName) {
    return fileIo().newOutputFile(resolveMetadataPath(fileName));
  }
}

The text was updated successfully, but these errors were encountered:

apache#12)

…and batch reader

This was referenced Nov 27, 2018

Allow overriding provision of FileSystem instances to HadoopTableOperations #13

Closed

Make HadoopTableOperations public with protected constructor. #10

Closed

Pluggable file I/O submodule in TableOperations #14

Merged

Custom metadata in data files #16

Closed

mccheah mentioned this issue Dec 5, 2018

DataFile External Identifier Field #23

Closed

rdblue mentioned this issue Dec 7, 2018

Custom InputFile / OutputFile providers for Spark Netflix/iceberg#107

Closed

rdblue closed this as completed in #14 Dec 11, 2018

rdsr added a commit to rdsr/incubator-iceberg that referenced this issue Mar 13, 2020

Add a iceberg-runtime shaded module (apache#12)

c7e72ef

guilload added a commit to guilload/iceberg that referenced this issue Jul 9, 2020

Refactor IcebergObjectInspector and implement custom object inspectors (

f7c5c39

apache#12)

HotSushi pushed a commit to HotSushi/iceberg that referenced this issue Jul 31, 2020

Shading: Add a iceberg-runtime shaded module (apache#12)

7b0087a

bkahloon pushed a commit to bkahloon/iceberg that referenced this issue Feb 27, 2021

Shading: Add a iceberg-runtime shaded module (apache#12)

f17188b

pavibhai added a commit to pavibhai/iceberg that referenced this issue Mar 24, 2023

Fixed apache#12 - Support selected vector with ORC reader on the row …

3a7db82

…and batch reader

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

File I/O Submodule for TableOperations #12

File I/O Submodule for TableOperations #12

mccheah commented Nov 27, 2018 •

edited

Loading

File I/O Submodule for TableOperations #12

File I/O Submodule for TableOperations #12

Comments

mccheah commented Nov 27, 2018 • edited Loading

mccheah commented Nov 27, 2018 •

edited

Loading