ARROW-16272: [Python] Fix NativeFile.read1() #13264
Conversation
@ursabot please benchmark lang=Python
Benchmark runs are scheduled for baseline = e19acbe and contender = 94cee03. Results will be available as each benchmark for each run completes.
LGTM, though I left a couple questions
python/pyarrow/io.pxi
Outdated
    # amount of bytes, such as with io.TextIOWrapper).
    nbytes = self._default_chunk_size
else:
    nbytes = min(nbytes, self._default_chunk_size)
Why are we limiting the read size to the chunk size when an explicit size is passed?
Hmm, you're right that it may not be necessary; Python's IO stack happily lets you read1() large sizes.
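For reference, a quick stdlib check bears this out (the buffer and request sizes here are illustrative, not from the PR): `read1()` on a `BufferedReader` accepts a size far larger than the internal buffer, so capping an explicit `nbytes` is not something the IO protocol requires.

```python
import io

# Illustrative check: read1() accepts an explicit size well beyond the
# BufferedReader's internal buffer (8 KiB here).
raw = io.BytesIO(b"x" * (1 << 20))              # 1 MiB of data
buffered = io.BufferedReader(raw, buffer_size=8192)

data = buffered.read1(1 << 20)                  # ask for up to 1 MiB
assert 0 < len(data) <= (1 << 20)               # at most one raw read's worth
```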
python/pyarrow/tests/test_fs.py
Outdated
df = pd.read_csv(f, nrows=2)
assert list(df["vendor_id"]) == ["VTS", "DDS"]
# Some readahead occurred, but not up to the end of file (which is ~2 GB)
assert f.tell() <= 256 * 1024
To me it seems S3 is unnecessary here? Or at least the 'real' S3 is unnecessary here?
Ah, you're right, it works as well with a local file.
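The behaviour the test checks can indeed be reproduced without S3, even with a plain in-memory stream (sizes here are illustrative): because `io.TextIOWrapper` pulls data through `read1()`, reading a couple of lines only consumes a bounded prefix of the underlying stream.

```python
import io

payload = b"a,b\n" * 500_000          # ~2 MB of CSV-like lines
raw = io.BytesIO(payload)
text = io.TextIOWrapper(io.BufferedReader(raw), encoding="utf-8")

text.readline()
text.readline()

# TextIOWrapper fetches data via read1(), so only a small prefix of the
# raw stream has been consumed rather than the whole payload.
assert 0 < raw.tell() < len(payload)
```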
`read1(nbytes=None)` should not read the entire input stream but instead return a reasonable number of bytes, suitable for building up an internal buffer. Should fix the performance issue when using `TextIOWrapper` or `pandas.read_csv` on an S3 input file.
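The fixed semantics can be sketched in plain Python. This is a minimal illustration, not the actual PyArrow implementation: `ChunkedReader` and the 256 KiB default are assumed names/values for the sketch.

```python
import io

DEFAULT_CHUNK_SIZE = 256 * 1024  # illustrative default, not PyArrow's actual value

class ChunkedReader(io.BufferedIOBase):
    """Sketch of the read1() semantics described above: with no size
    given, return a bounded chunk instead of reading the whole stream."""

    def __init__(self, source):
        self._source = source

    def readable(self):
        return True

    def read1(self, nbytes=None):
        if nbytes is None or nbytes < 0:
            # Callers such as io.TextIOWrapper pass no size; return a
            # reasonable chunk rather than reading to end of stream.
            nbytes = DEFAULT_CHUNK_SIZE
        return self._source.read(nbytes)

src = io.BytesIO(b"x" * (1024 * 1024))       # 1 MiB source
chunk = ChunkedReader(src).read1()           # no size requested
assert len(chunk) == DEFAULT_CHUNK_SIZE      # bounded, not the full 1 MiB
```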
Benchmark runs are scheduled for baseline = b851392 and contender = 3149486. 3149486 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.