Skip to content

Python: Add logic for a custom FileIO#5588

Merged
rdblue merged 4 commits intoapache:masterfrom
Fokko:fd-implement-file-io
Aug 21, 2022
Merged

Python: Add logic for a custom FileIO#5588
rdblue merged 4 commits intoapache:masterfrom
Fokko:fd-implement-file-io

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented Aug 19, 2022

No description provided.

@Fokko Fokko marked this pull request as ready for review August 19, 2022 16:15

# Mappings from the Java FileIO impl to a Python one. The list is ordered by preference.
# If a implementation isn't installed, it will fall back to the next one.
JAVA_FILE_IO_MAPPINGS: Dict[str, List[str]] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that a mapping from Java implementation to Python implementation makes sense.

The Java class is needed because this is dynamically loaded in Java. There is a similar need here, but there's not a connection between the preferred Java implementation and one in Python. For example, both S3 and GCS have direct implementations (S3FileIO, GCSFileIO) and can be used through HadoopFileIO. If you prefer HadoopFileIO in Java, that doesn't necessarily mean that you prefer ArrowFileIO vs a future FSSpecFileIO in Python. This choice is probably independent.

I think a better option is to ignore the io-impl property from Java and look at the catalog's warehouse location or a table's location. The scheme from either location should tell us what the backing store should be. Then we can use a list of implementations from that scheme.

So I think this should probably be:

  FILE_IO_MAPPINGS: Dict[str, List[str]] = {
    "s3": [ARROW_FILE_IO],
    "gcs": [ARROW_FILE_IO],
    "file": [ARROW_FILE_IO],
    "hdfs": [ARROW_FILE_IO],
    ...
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, I like it a lot!

"org.apache.iceberg.hadoop.HadoopFileIO": [ARROW_FILE_IO],
"org.apache.iceberg.aliyun.oss.OSSFileIO": [ARROW_FILE_IO],
"org.apache.iceberg.io.ResolvingFileIO": [ARROW_FILE_IO],
"org.apache.iceberg.aws.s3.S3FileIO": [ARROW_FILE_IO],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether it is worth it, but would it make sense to take an approach similar to catalog loading?

The map from scheme to implementation could use methods that load a known library using a function like in the other PR. Then the _import_file_io with a raw implementation class would only need to be used for overrides.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The load_ methods were created for loading the catalog in a lazy manner. We could create a method like pyiceberg.io.pyarrow:load_file_io, but I don't see what benefit that brings, at the cost of additional complexity.

Something like:

# Default to PyArrow
from pyiceberg.io.pyarrow import load_pyarrow()
return load_pyarrow()

But then we still need to have a fallback for FileIOs that are outside of the scope of the pyiceberg package. I like the current implementation because it very specifically imports the FileIO that we're looking for, and we don't import the class on the forehand, removing the possibility of importing optional dependencies.

IO_IMPL = "io-impl"


def load_file_io(properties: Properties) -> FileIO:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were to use location to determine this automatically, then we'd probably want to pass it here as optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I would keep it like it is and just check if the keys are in the properties. The main reason is that we can't add py-io-impl as an argument to the function because it contains slashes. This way we would have two different ways of retrieving data from the properties.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following what you mean. Why would it matter if a string argument contained slashes?

The reason why I would add location as an argument is because it isn't part of properties. It is tracked separately as warehouse or table location.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got it. I thought you wanted to expand the properties. It would then look like this:

def load_file_io(py-io-impl: Optional[str], **properties: str) -> FileIO:

The dashes are not allowed.

You're right, I was confused with the namespaces where we annotate the properties with the location and comment. Updated the code!

}


def _import_file_io(io_impl: str, properties: Properties) -> Optional[FileIO]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

@rdblue rdblue merged commit 9d26979 into apache:master Aug 21, 2022
@rdblue
Copy link
Contributor

rdblue commented Aug 21, 2022

Thanks, @Fokko!

@Fokko Fokko deleted the fd-implement-file-io branch August 21, 2022 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants