Python: Add logic for a custom FileIO by Fokko · Pull Request #5588 · apache/iceberg

Fokko · 2022-08-19T16:15:20Z

No description provided.

rdblue · 2022-08-19T22:17:00Z

python/pyiceberg/io/__init__.py

+
+# Mappings from the Java FileIO impl to a Python one. The list is ordered by preference.
+# If a implementation isn't installed, it will fall back to the next one.
+JAVA_FILE_IO_MAPPINGS: Dict[str, List[str]] = {


I'm not sure that a mapping from Java implementation to Python implementation makes sense.

The Java class is needed because this is dynamically loaded in Java. There is a similar need here, but there's not a connection between the preferred Java implementation and one in Python. For example, both S3 and GCS have direct implementations (S3FileIO, GCSFileIO) and can be used through HadoopFileIO. If you prefer HadoopFileIO in Java, that doesn't necessarily mean that you prefer ArrowFileIO vs a future FSSpecFileIO in Python. This choice is probably independent.

I think a better option is to ignore the io-impl property from Java and look at the catalog's warehouse location or a table's location. The scheme from either location should tell us what the backing store should be. Then we can use a list of implementations from that scheme.

So I think this should probably be:

FILE_IO_MAPPINGS: Dict[str, List[str]] = {
    "s3": [ARROW_FILE_IO],
    "gcs": [ARROW_FILE_IO],
    "file": [ARROW_FILE_IO],
    "hdfs": [ARROW_FILE_IO],
    ...
}

Thanks for the suggestion, I like it a lot!
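For readers following the thread, here is a minimal sketch of the scheme-based lookup suggested above. It is not the code that was merged. The value of `ARROW_FILE_IO`, the name `candidates_for_location`, and the choice to treat a bare path as `file` are all assumptions made for illustration.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Assumed value: a dotted path to a PyArrow-backed FileIO class.
ARROW_FILE_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"

# Scheme -> candidate implementations, ordered by preference.
SCHEME_TO_FILE_IO: Dict[str, List[str]] = {
    "s3": [ARROW_FILE_IO],
    "gcs": [ARROW_FILE_IO],
    "file": [ARROW_FILE_IO],
    "hdfs": [ARROW_FILE_IO],
}


def candidates_for_location(location: str) -> List[str]:
    """Return candidate FileIO paths for a warehouse or table location (hypothetical helper)."""
    # A location without a scheme is treated as a local path (an assumption).
    scheme = urlparse(location).scheme or "file"
    # Unknown schemes yield no candidates; the caller decides how to fall back.
    return SCHEME_TO_FILE_IO.get(scheme, [])


print(candidates_for_location("s3://bucket/warehouse"))
```

The point of keying on the scheme is that the Python side never needs to know which Java class was configured; only the location matters.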

rdblue · 2022-08-19T22:38:31Z

python/pyiceberg/io/__init__.py

+    "org.apache.iceberg.hadoop.HadoopFileIO": [ARROW_FILE_IO],
+    "org.apache.iceberg.aliyun.oss.OSSFileIO": [ARROW_FILE_IO],
+    "org.apache.iceberg.io.ResolvingFileIO": [ARROW_FILE_IO],
+    "org.apache.iceberg.aws.s3.S3FileIO": [ARROW_FILE_IO],


I'm not sure whether it is worth it, but would it make sense to take an approach similar to catalog loading?

The map from scheme to implementation could use methods that load a known library using a function like in the other PR. Then the _import_file_io with a raw implementation class would only need to be used for overrides.

The load_ methods were created for loading the catalog in a lazy manner. We could create a method like pyiceberg.io.pyarrow:load_file_io, but I don't see what benefit that brings, at the cost of additional complexity.

Something like:

# Default to PyArrow
from pyiceberg.io.pyarrow import load_pyarrow
return load_pyarrow()

But then we still need a fallback for FileIOs that live outside the pyiceberg package. I like the current implementation because it imports exactly the FileIO we're looking for, and because we don't import the class beforehand, we avoid pulling in optional dependencies that may not be installed.
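A rough sketch of the "import exactly the class we ask for" approach described above, using only the standard library. The helper names `import_class` and `first_available` are hypothetical and are not the functions in this PR; the sketch only shows how a missing optional dependency can be skipped in favor of the next candidate.

```python
import importlib
from typing import List, Optional


def import_class(path: str) -> Optional[type]:
    """Import a class from a dotted path, returning None if it is unavailable."""
    module_name, _, class_name = path.rpartition(".")
    if not module_name:
        return None  # not a dotted "module.Class" path
    try:
        # The module is only imported here, on demand, so optional
        # dependencies are never touched unless they are requested.
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, class_name, None)


def first_available(paths: List[str]) -> Optional[type]:
    """Return the first importable class from a list ordered by preference."""
    for path in paths:
        cls = import_class(path)
        if cls is not None:
            return cls
    return None


# The first entry does not exist, so the second one is picked.
print(first_available(["no_such_pkg_xyz.io.MissingFileIO", "collections.OrderedDict"]))
```

Because the lookup takes an arbitrary dotted path, the same mechanism would also cover implementations that are not shipped with the package.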

rdblue · 2022-08-19T22:39:18Z

python/pyiceberg/io/__init__.py

+IO_IMPL = "io-impl"
+
+
+def load_file_io(properties: Properties) -> FileIO:


If we were to use location to determine this automatically, then we'd probably want to pass it here as optional.

In this case, I would keep it like it is and just check if the keys are in the properties. The main reason is that we can't add py-io-impl as an argument to the function because it contains slashes. This way we would have two different ways of retrieving data from the properties.

I'm not following what you mean. Why would it matter if a string argument contained slashes?

The reason why I would add location as an argument is because it isn't part of properties. It is tracked separately as warehouse or table location.

Ah got it. I thought you wanted to expand the properties. It would then look like this:

def load_file_io(py-io-impl: Optional[str], **properties: str) -> FileIO:

The dashes are not allowed.

You're right, I was confused with the namespaces where we annotate the properties with the location and comment. Updated the code!
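To make the point about dashes concrete, and to show one way an optional location could sit next to the properties, here is a small sketch. `describe_selection` is a hypothetical helper that only reports which path would be taken; the precedence (explicit `py-io-impl` override first, then the location scheme) is an assumption based on this thread, not a statement of what the merged code does.

```python
from typing import Dict, Optional
from urllib.parse import urlparse

PY_IO_IMPL = "py-io-impl"  # property key mentioned in the discussion

# A key with dashes is not a valid Python identifier, so it cannot be a
# keyword parameter; it has to be looked up in the properties dict.
print(PY_IO_IMPL.isidentifier())


def describe_selection(properties: Dict[str, str], location: Optional[str] = None) -> str:
    """Report how a FileIO would be chosen (illustrative only)."""
    if PY_IO_IMPL in properties:
        # An explicit override wins over anything inferred.
        return f"override:{properties[PY_IO_IMPL]}"
    if location is not None:
        scheme = urlparse(location).scheme
        if scheme:
            return f"scheme:{scheme}"
    return "default"


print(describe_selection({}, "s3://bucket/table"))
```

Keeping the location as a separate optional argument reflects the comment above that it is tracked as the warehouse or table location rather than as a property.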

rdblue · 2022-08-19T22:39:32Z

python/pyiceberg/io/__init__.py

+}
+
+
+def _import_file_io(io_impl: str, properties: Properties) -> Optional[FileIO]:


This looks good to me.


rdblue · 2022-08-21T20:21:25Z

Thanks, @Fokko!

Python: Add logic for a custom FileIO

87251e1

Fokko marked this pull request as ready for review August 19, 2022 16:15

github-actions bot added the python label Aug 19, 2022

rdblue reviewed Aug 19, 2022


Fokko added 3 commits August 21, 2022 21:10

Select the FileIO based on the schema

c89b72a

Update with the correct location

d0a6184

Merge branch 'master' of https://github.com/apache/iceberg into fd-implement-file-io

82bfd31

rdblue approved these changes Aug 21, 2022


rdblue merged commit 9d26979 into apache:master Aug 21, 2022

Fokko deleted the fd-implement-file-io branch August 21, 2022 20:27

rdblue merged 4 commits into apache:master from Fokko:fd-implement-file-io

2 participants
