Skip to content

Python: Require __enter__ and __exit__ on {Input,Output}Stream#5436

Merged
rdblue merged 1 commit intoapache:masterfrom
Fokko:fd-refactor-pyarrow-file
Aug 7, 2022
Merged

Python: Require __enter__ and __exit__ on {Input,Output}Stream#5436
rdblue merged 1 commit intoapache:masterfrom
Fokko:fd-refactor-pyarrow-file

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented Aug 4, 2022

This way we can nicely use context managers:

f = PyArrowInputFile(..).open()
f.read()
f.close()

turns into:

with PyArrowInputFile('s3://').open() as f:
    f.read()

This way we don't forget to close any files. This is very easy to do and was actually happening when reading the metadata.

@github-actions github-actions bot added the python label Aug 4, 2022
@rdblue
Copy link
Contributor

rdblue commented Aug 4, 2022

I don't think that we want to do this. We discussed this when introducing the FIleIO API, but __enter__ and __exit__ don't fit well. Looks like you hit some of the problems:

  • This introduces FileAlreadyOpenError because InputFile can be used for only one stream at a time
  • The overwrite option needs to be added to OutputFile even though that's an option for the writer to decide, not FileIO

I'm all for adding __enter__ and __exit__ support though. What about adding it for InputStream and OutputStream? We could have __exit__ call close. Then usage would look like this:

file = io.newInputFile("s3://bucket/path.parquet")
with file.create(overwrite=True) as f:
    f.write(...)

with file.toInputFile().open() as f:
    f.read(...)

I also think that the close method should still exist. We just want to make __exit__ call that method.

@Fokko
Copy link
Contributor Author

Fokko commented Aug 4, 2022

@rdblue I also thought about that, but the {Input,Output}Stream is part of PyArrow, so that's not something that we control. We could add it to the protocol, but then we require the libraries to implement it (I think most of them do actually, pyarrow does). I can keep the original open and close in place, and add the__enter__ and __exit__ for convenience (but then we still need to split).

Update: S3FS also has the __enter__ and __exit__, so I think we can add them to the protocol 👍🏻

@rdblue
Copy link
Contributor

rdblue commented Aug 4, 2022

I don't think that we should add __enter__ and __exit__ to InputFile or OutputFile because of the inability to use them as factories as intended.

Can we get around the implementations not supporting __enter__ and __exit__ by adding them using object.__setattr__ when we create them?

@Fokko
Copy link
Contributor Author

Fokko commented Aug 4, 2022

Let me take a stab at adding them to the streams, I think that will work quite well 👍🏻

This way we always properly close the streams
@Fokko Fokko force-pushed the fd-refactor-pyarrow-file branch from 782ae22 to 320d97c Compare August 4, 2022 18:15
TableMetadata: A table metadata instance

"""
return FromByteStream.table_metadata(byte_stream=input_file.open(), encoding=encoding)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one caught my eye this morning; it wasn't properly closed

@Fokko
Copy link
Contributor Author

Fokko commented Aug 4, 2022

@rdblue Looks much cleaner, I'm confident that this also will work with future FileIO implementations.

@Fokko Fokko changed the title Python: Split PyArrowFile into PyArrowInputFile and PyArrowOutputFile Python: Require __enter__ and __exit__ on {Input,Output}Stream Aug 5, 2022
"""
return FromByteStream.table_metadata(byte_stream=input_file.open(), encoding=encoding)
with input_file.open() as input_stream:
return FromByteStream.table_metadata(byte_stream=input_stream, encoding=encoding)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we support compressed streams? I think we should make sure we can read and write gzipped metadata files.

@rdblue
Copy link
Contributor

rdblue commented Aug 7, 2022

Looks like PyArrow already implements this: https://github.com/apache/arrow/blob/31494860c23ff7fe38a748a09c58822378605477/python/pyarrow/io.pxi#L121-L122

I'll merge. Thanks, @Fokko!

@rdblue rdblue merged commit 64a21b5 into apache:master Aug 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants