[Python] Make Buffered* IO classes available to Python, incorporate into input_stream, output_stream factory functions #19478

Closed
asfimport opened this issue Aug 28, 2018 · 7 comments

Comments

@asfimport

Reporter: Wes McKinney / @wesm
Assignee: Krisztian Szucs / @kszucs

PRs and other links:

Note: This issue was originally created as ARROW-3126. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
What would that do under the hood exactly? Any benchmarks to watch for?

@asfimport

Uwe Korn / @xhochy:
As far as I understand the title, it would do the same thing that https://docs.python.org/3/library/io.html#io.BufferedReader does internally. Simply using that Python class in pyarrow has already brought us great improvements when reading Parquet files from Azure.
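To illustrate the effect described above, here is a minimal standard-library sketch. It does not use pyarrow or Azure; `CountingRaw` is a hypothetical stand-in for a raw stream where every read is expensive, and the buffer size of 256 bytes is an arbitrary choice for the demonstration.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that counts how often it is actually read."""

    def __init__(self, data):
        super().__init__()
        self._source = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Each call stands in for one expensive request to remote storage.
        self.calls += 1
        return self._source.readinto(b)


data = bytes(range(256)) * 4  # 1024 bytes of sample data

raw = CountingRaw(data)
reader = io.BufferedReader(raw, buffer_size=256)

# Issue many small reads, as a file-format parser might.
chunks = []
while True:
    chunk = reader.read(16)
    if not chunk:
        break
    chunks.append(chunk)

small_reads = len(chunks)  # number of reads the caller issued
raw_calls = raw.calls      # number of reads that reached the raw stream
print(small_reads, raw_calls)
```

The caller issues many small reads, but only a handful reach the raw stream, because each refill pulls a whole buffer's worth of data at once.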

@asfimport

Antoine Pitrou / @pitrou:
Ok. The term "read ahead" is a bit misleading, because it implies that I/O is hidden in the background, which is not how a buffering layer works: the buffer is filled synchronously when it is empty; it is not fed by a separate thread.
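The distinction can be made concrete with a small sketch. `SyncBufferedReader` below is a hypothetical toy class, not the design proposed for Arrow and not how `io.BufferedReader` is implemented in detail; it only shows that the refill happens inside the caller's own `read` call, with no background thread involved.

```python
import io


class SyncBufferedReader:
    """Toy buffering layer: refills synchronously when the buffer is empty."""

    def __init__(self, raw_read, buffer_size=4096):
        self._raw_read = raw_read        # callable: raw_read(nbytes) -> bytes
        self._buffer_size = buffer_size
        self._buf = b""
        self._pos = 0

    def read(self, n):
        out = bytearray()
        while len(out) < n:
            if self._pos == len(self._buf):
                # Buffer exhausted: the caller blocks here until the raw
                # read returns. Nothing was fetched ahead of time.
                self._buf = self._raw_read(self._buffer_size)
                self._pos = 0
                if not self._buf:
                    break  # end of stream
            take = min(n - len(out), len(self._buf) - self._pos)
            out += self._buf[self._pos:self._pos + take]
            self._pos += take
        return bytes(out)


source = io.BytesIO(b"abcdefghij")
reader = SyncBufferedReader(source.read, buffer_size=4)

first = reader.read(3)     # triggers one refill of 4 bytes
second = reader.read(5)    # drains the buffer, then refills once more
rest = reader.read(100)    # reads until end of stream
```

A true read-ahead design would instead start fetching the next block before the caller asks for it, typically from another thread or with asynchronous I/O.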

Can we reuse io.BufferedReader for this, or is the intention to have a similar primitive written in C++? Also, does it return an InputStream or a full-blown RandomAccessFile? (The latter is quite a bit more difficult to get right and optimize.)

@asfimport

Wes McKinney / @wesm:
open_stream only uses InputStream (see https://github.com/apache/arrow/blob/master/python/pyarrow/ipc.pxi#L247), so we should implement a buffering InputStream in C++.

@asfimport

Wes McKinney / @wesm:
@kszucs would you be interested in working on this? My thinking is to add a buffer_size argument to both pyarrow.input_stream and output_stream. After a raw reader or writer is created, if this argument is set, it will be wrapped in either a BufferedInputStream or a BufferedOutputStream as appropriate.
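The wrapping logic proposed here could look roughly like the sketch below. It is only an illustration built on the Python standard library: `open_input` and `open_output` are hypothetical names, and `io.BufferedReader` / `io.BufferedWriter` stand in for the C++ BufferedInputStream / BufferedOutputStream classes. It makes no claim about the actual pyarrow implementation.

```python
import io


def open_input(raw, buffer_size=None):
    """Return raw as-is, or wrapped in a buffered reader if buffer_size is set."""
    if buffer_size is not None and buffer_size > 0:
        return io.BufferedReader(raw, buffer_size=buffer_size)
    return raw


def open_output(raw, buffer_size=None):
    """Return raw as-is, or wrapped in a buffered writer if buffer_size is set."""
    if buffer_size is not None and buffer_size > 0:
        return io.BufferedWriter(raw, buffer_size=buffer_size)
    return raw


# Without buffer_size, the raw stream is handed back untouched.
plain_source = io.BytesIO(b"hello world")
unbuffered = open_input(plain_source)

# With buffer_size, reads go through the buffering layer.
buffered_in = open_input(io.BytesIO(b"hello world"), buffer_size=64)
head = buffered_in.read(5)

# Buffered writes stay in memory until the buffer fills or is flushed.
sink = io.BytesIO()
buffered_out = open_output(sink, buffer_size=64)
buffered_out.write(b"abc")
before_flush = sink.getvalue()
buffered_out.flush()
after_flush = sink.getvalue()
```

Keeping the argument optional means existing callers see no change in behavior, and buffering is applied only when a caller explicitly asks for it.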

@asfimport

Wes McKinney / @wesm:
If someone else could pick this up, I would be appreciative. If I finish the other work assigned to me this week and this is still not done, I will pick it up.

@asfimport

Wes McKinney / @wesm:
Issue resolved by pull request 3252
#3252

@asfimport asfimport added this to the 0.12.0 milestone Jan 11, 2023