ARROW-16272: [Python] Fix NativeFile.read1() #13264

pitrou · 2022-05-30T13:53:15Z

read1() should not read the entire input stream; instead, it should return a reasonable number of bytes, suitable for building up an internal buffer.

This should fix the performance issue when using TextIOWrapper or pandas.read_csv on an S3 input file.
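To make the intended behavior concrete, here is a minimal sketch of a file-like object whose `read1()` returns one bounded chunk when no size is given. It is not pyarrow's `NativeFile`: the class name `ChunkedReader` and the constant `DEFAULT_CHUNK_SIZE` (including its value) are assumptions made up for illustration.

```python
import io

DEFAULT_CHUNK_SIZE = 64 * 1024  # hypothetical value, for illustration only


class ChunkedReader:
    """Toy stand-in for a file object; not pyarrow's NativeFile."""

    def __init__(self, data: bytes):
        self._raw = io.BytesIO(data)

    def read(self, nbytes=None):
        # read() with no size keeps its usual meaning: read to end of stream.
        return self._raw.read(nbytes)

    def read1(self, nbytes=None):
        # With no size, return one bounded chunk rather than the whole stream.
        if nbytes is None:
            nbytes = DEFAULT_CHUNK_SIZE
        return self._raw.read(nbytes)


payload = b"x" * (10 * DEFAULT_CHUNK_SIZE)
reader = ChunkedReader(payload)
chunk = reader.read1()  # one bounded chunk
rest = reader.read()    # the remainder of the stream
```

A caller that builds its own buffer on top of `read1()` then pulls data in chunk-sized steps instead of downloading the whole object up front.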

github-actions · 2022-05-30T13:53:35Z

https://issues.apache.org/jira/browse/ARROW-16272

pitrou · 2022-05-30T13:53:39Z

@ursabot please benchmark lang=Python

ursabot · 2022-05-30T13:53:44Z

Benchmark runs are scheduled for baseline = e19acbe and contender = 94cee03. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 94cee03d ec2-t3-xlarge-us-east-2
[Finished] 94cee03d test-mac-arm
[Finished] 94cee03d ursa-i9-9960x
[Finished] e19acbe9 ec2-t3-xlarge-us-east-2
[Finished] e19acbe9 test-mac-arm
[Finished] e19acbe9 ursa-i9-9960x
[Finished] e19acbe9 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

lidavidm

LGTM, though I left a couple questions

lidavidm · 2022-05-31T12:18:02Z

python/pyarrow/io.pxi

+            # amount of bytes, such as with io.TextIOWrapper).
+            nbytes = self._default_chunk_size
+        else:
+            nbytes = min(nbytes, self._default_chunk_size)


Why are we limiting the read size to the chunk size when an explicit size is passed?

pitrou · 2022-05-31

Hmm, you're right that it may not be necessary: Python's IO stack happily lets you read1() large sizes.
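The two options discussed in this thread can be sketched as plain functions. This is only an illustration: the function names and `DEFAULT_CHUNK_SIZE` are hypothetical, and the sketch assumes the branch hidden above the quoted diff tests for `nbytes is None`.

```python
DEFAULT_CHUNK_SIZE = 64 * 1024  # hypothetical value, for illustration only


def capped_size(nbytes, chunk_size=DEFAULT_CHUNK_SIZE):
    # Behavior in the quoted diff: explicit sizes are clamped to the chunk size.
    if nbytes is None:
        return chunk_size
    return min(nbytes, chunk_size)


def uncapped_size(nbytes, chunk_size=DEFAULT_CHUNK_SIZE):
    # Alternative raised in review: only the "no size" case uses the default,
    # and an explicit size is passed through unchanged.
    if nbytes is None:
        return chunk_size
    return nbytes
```

With the uncapped variant, a caller that asks for 1 MB may get up to 1 MB back from a single call; with the capped variant it would have to loop.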

lidavidm · 2022-05-31T12:20:21Z

python/pyarrow/tests/test_fs.py

+    df = pd.read_csv(f, nrows=2)
+    assert list(df["vendor_id"]) == ["VTS", "DDS"]
+    # Some readahead occurred, but not up to the end of file (which is ~2 GB)
+    assert f.tell() <= 256 * 1024


To me it seems S3 is unnecessary here? Or at least the 'real' S3 is unnecessary here?

pitrou · 2022-05-31

Ah, you're right, it works as well with a local file.
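A local-file check in the same spirit might look like the sketch below. To stay self-contained it swaps in the standard library for both sides: `csv.DictReader` over `io.TextIOWrapper` stands in for `pandas.read_csv`, and a regular binary file stands in for a pyarrow file. The file contents and the `fare` column are invented for the example.

```python
import csv
import io
import os
import tempfile

# Build a local CSV much larger than the readahead bound checked below.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"vendor_id,fare\n")
    tmp.write(b"VTS,1.0\nDDS,2.0\n")
    tmp.write(b"VTS,3.0\n" * 500_000)
    path = tmp.name

try:
    with open(path, "rb") as f:
        text = io.TextIOWrapper(f, encoding="utf-8", newline="")
        rows = csv.DictReader(text)
        # Read only the first two data rows, as the test above does.
        vendors = [next(rows)["vendor_id"] for _ in range(2)]
        position = f.tell()
        file_size = os.path.getsize(path)
        text.detach()  # hand f back so the with-block closes it exactly once
finally:
    os.remove(path)
```

The property worth asserting is the same as in the quoted test: `position` stays within a small readahead bound and well short of `file_size`.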

`read1(nbytes=None)` should not read the entire input stream; instead, it should return a reasonable number of bytes, suitable for building up an internal buffer. This should fix the performance issue when using `TextIOWrapper` or `pandas.read_csv` on an S3 input file.

ursabot · 2022-06-01T02:01:35Z

Benchmark runs are scheduled for baseline = b851392 and contender = 3149486. 3149486 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.09% ⬆️0.43%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.34% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 31494860 ec2-t3-xlarge-us-east-2
[Finished] 31494860 test-mac-arm
[Failed] 31494860 ursa-i9-9960x
[Finished] 31494860 ursa-thinkcentre-m75q
[Finished] b8513920 ec2-t3-xlarge-us-east-2
[Finished] b8513920 test-mac-arm
[Finished] b8513920 ursa-i9-9960x
[Finished] b8513920 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

pitrou requested a review from lidavidm May 30, 2022 13:53

github-actions bot added the Component: Python label May 30, 2022

lidavidm approved these changes May 31, 2022


pitrou added 2 commits May 31, 2022 14:33

Add an integration test with Pandas

e453ccb

pitrou force-pushed the ARROW-16272-native-file-read1 branch from 9482aca to 0a608d6 Compare May 31, 2022 12:46

Apply review comments

143a05a

pitrou force-pushed the ARROW-16272-native-file-read1 branch from 0a608d6 to 143a05a Compare May 31, 2022 12:46

lidavidm approved these changes May 31, 2022


pitrou closed this in 3149486 May 31, 2022

pitrou deleted the ARROW-16272-native-file-read1 branch May 31, 2022 13:30
