Python: Add logic for a custom FileIO#5588
Conversation
python/pyiceberg/io/__init__.py
Outdated
|
|
||
| # Mappings from the Java FileIO impl to a Python one. The list is ordered by preference. | ||
| # If a implementation isn't installed, it will fall back to the next one. | ||
| JAVA_FILE_IO_MAPPINGS: Dict[str, List[str]] = { |
There was a problem hiding this comment.
I'm not sure that a mapping from Java implementation to Python implementation makes sense.
The Java class is needed because this is dynamically loaded in Java. There is a similar need here, but there's not a connection between the preferred Java implementation and one in Python. For example, both S3 and GCS have direct implementations (S3FileIO, GCSFileIO) and can be used through HadoopFileIO. If you prefer HadoopFileIO in Java, that doesn't necessarily mean that you prefer ArrowFileIO vs a future FSSpecFileIO in Python. This choice is probably independent.
I think a better option is to ignore the io-impl property from Java and look at the catalog's warehouse location or a table's location. The scheme from either location should tell us what the backing store should be. Then we can use a list of implementations from that scheme.
So I think this should probably be:
FILE_IO_MAPPINGS: Dict[str, List[str]] = {
"s3": [ARROW_FILE_IO],
"gcs": [ARROW_FILE_IO],
"file": [ARROW_FILE_IO],
"hdfs": [ARROW_FILE_IO],
...
}There was a problem hiding this comment.
Thanks for the suggestion, I like it a lot!
python/pyiceberg/io/__init__.py
Outdated
| "org.apache.iceberg.hadoop.HadoopFileIO": [ARROW_FILE_IO], | ||
| "org.apache.iceberg.aliyun.oss.OSSFileIO": [ARROW_FILE_IO], | ||
| "org.apache.iceberg.io.ResolvingFileIO": [ARROW_FILE_IO], | ||
| "org.apache.iceberg.aws.s3.S3FileIO": [ARROW_FILE_IO], |
There was a problem hiding this comment.
I'm not sure whether it is worth it, but would it make sense to take an approach similar to catalog loading?
The map from scheme to implementation could use methods that load a known library using a function like in the other PR. Then the _import_file_io with a raw implementation class would only need to be used for overrides.
There was a problem hiding this comment.
The load_ methods were created for loading the catalog in a lazy manner. We could create a method like pyiceberg.io.pyarrow:load_file_io, but I don't see what benefit that brings, at the cost of additional complexity.
Something like:
# Default to PyArrow
from pyiceberg.io.pyarrow import load_pyarrow()
return load_pyarrow()
But then we still need to have a fallback for FileIOs that are outside of the scope of the pyiceberg package. I like the current implementation because it very specifically imports the FileIO that we're looking for, and we don't import the class on the forehand, removing the possibility of importing optional dependencies.
python/pyiceberg/io/__init__.py
Outdated
| IO_IMPL = "io-impl" | ||
|
|
||
|
|
||
| def load_file_io(properties: Properties) -> FileIO: |
There was a problem hiding this comment.
If we were to use location to determine this automatically, then we'd probably want to pass it here as optional.
There was a problem hiding this comment.
In this case, I would keep it like it is and just check if the keys are in the properties. The main reason is that we can't add py-io-impl as an argument to the function because it contains slashes. This way we would have two different ways of retrieving data from the properties.
There was a problem hiding this comment.
I'm not following what you mean. Why would it matter if a string argument contained slashes?
The reason why I would add location as an argument is because it isn't part of properties. It is tracked separately as warehouse or table location.
There was a problem hiding this comment.
Ah got it. I thought you wanted to expand the properties. It would then look like this:
def load_file_io(py-io-impl: Optional[str], **properties: str) -> FileIO:The dashes are not allowed.
You're right, I was confused with the namespaces where we annotate the properties with the location and comment. Updated the code!
| } | ||
|
|
||
|
|
||
| def _import_file_io(io_impl: str, properties: Properties) -> Optional[FileIO]: |
|
Thanks, @Fokko! |
No description provided.