ARROW-16272: [Python] Fix NativeFile.read1() #13264

Closed
wants to merge 3 commits into from

pitrou
Member

@pitrou pitrou commented May 30, 2022

read1() should not read the entire input stream but should instead return a reasonable number of bytes, suitable for building up an internal buffer.

This should fix the performance issue when using TextIOWrapper or pandas.read_csv on an S3 input file.
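
A minimal sketch of the behavior described above, using only the standard library. `ChunkedReader` is a hypothetical in-memory class written for this illustration; it is not pyarrow's `NativeFile`, and the chunk size of 4096 as well as the handling of negative sizes are assumptions made for the example.

```python
import io


class ChunkedReader(io.BufferedIOBase):
    """Hypothetical in-memory stream used only for illustration."""

    def __init__(self, data, default_chunk_size=64 * 1024):
        super().__init__()
        self._data = data
        self._pos = 0
        self._default_chunk_size = default_chunk_size

    def readable(self):
        return True

    def tell(self):
        return self._pos

    def read(self, nbytes=None):
        # read() without a size keeps its usual meaning: read until EOF.
        if nbytes is None or nbytes < 0:
            nbytes = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + nbytes]
        self._pos += len(chunk)
        return chunk

    def read1(self, nbytes=None):
        # read1() without a size returns one bounded chunk instead of
        # draining the stream (negative sizes are treated the same way
        # here, which is an assumption of this sketch).
        if nbytes is None or nbytes < 0:
            nbytes = self._default_chunk_size
        return self.read(nbytes)


data = b"vendor_id,fare\n" + b"VTS,1.0\n" * 100_000

# A bare read1() call returns a single chunk, not the whole payload.
f = ChunkedReader(data, default_chunk_size=4096)
first_chunk = f.read1()

# A text wrapper can pull the first line without exhausting the stream.
g = ChunkedReader(data, default_chunk_size=4096)
first_line = io.TextIOWrapper(g, encoding="utf-8").readline()
consumed = g.tell()
```

With a stream like this, reading one line only advances the underlying position by a small readahead, which is the property the PR aims for on remote files.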

@pitrou pitrou requested a review from lidavidm May 30, 2022 13:53
@pitrou
Member Author

pitrou commented May 30, 2022

@ursabot please benchmark lang=Python

@ursabot

ursabot commented May 30, 2022

Benchmark runs are scheduled for baseline = e19acbe and contender = 94cee03. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 94cee03d ec2-t3-xlarge-us-east-2
[Finished] 94cee03d test-mac-arm
[Finished] 94cee03d ursa-i9-9960x
[Finished] e19acbe9 ec2-t3-xlarge-us-east-2
[Finished] e19acbe9 test-mac-arm
[Finished] e19acbe9 ursa-i9-9960x
[Finished] e19acbe9 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Member

@lidavidm lidavidm left a comment

LGTM, though I left a couple questions

    # amount of bytes, such as with io.TextIOWrapper).
    nbytes = self._default_chunk_size
else:
    nbytes = min(nbytes, self._default_chunk_size)
Member

Why are we limiting the read size to the chunk size when an explicit size is passed?

Member Author

Hmm, you're right that it may not be necessary: Python's IO stack happily lets you read1() large sizes.
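
To make the two options in this thread concrete, here is a small sketch. `resolve_read1_size` and its `cap_explicit` flag are hypothetical names introduced for illustration, and the 64 KiB chunk size is an assumed value; none of them come from pyarrow. The last lines check the standard-library behavior mentioned in the reply, using `io.BytesIO`.

```python
import io


def resolve_read1_size(nbytes, default_chunk_size, cap_explicit):
    """Pick how many bytes a read1() call should request.

    Hypothetical helper for illustration; it is not part of pyarrow.
    """
    if nbytes is None:
        # No size given: fall back to one bufferable chunk.
        return default_chunk_size
    if cap_explicit:
        # Variant from the diff under review: clamp explicit sizes too.
        return min(nbytes, default_chunk_size)
    # Variant raised in the review: trust the caller's explicit size.
    return nbytes


CHUNK = 64 * 1024  # assumed default chunk size, chosen for the example

capped = resolve_read1_size(1_000_000, CHUNK, cap_explicit=True)
uncapped = resolve_read1_size(1_000_000, CHUNK, cap_explicit=False)
default = resolve_read1_size(None, CHUNK, cap_explicit=False)

# The standard library honors a large explicit read1() size on BytesIO.
stream = io.BytesIO(bytes(2_000_000))
stdlib_chunk = stream.read1(1_000_000)
```

Capping explicit sizes would force callers that ask for a large read to loop, while only the `None` case needs a bounded default.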

df = pd.read_csv(f, nrows=2)
assert list(df["vendor_id"]) == ["VTS", "DDS"]
# Some readahead occurred, but not up to the end of file (which is ~2 GB)
assert f.tell() <= 256 * 1024
Member

To me it seems S3 is unnecessary here? Or at least the 'real' S3 is unnecessary here?

Member Author

Ah, you're right, it works as well with a local file.
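
As a rough standard-library stand-in for such a local-file check (this is not the test from the PR, and it does not use pyarrow or pandas), the sketch below writes a temporary CSV, reads the header plus two rows through `io.TextIOWrapper`, and records how far the binary file position advanced. The file name, the `fare` column, and the `first_rows_and_offset` helper are assumptions made for the example; the `vendor_id` values and the 256 KiB bound mirror the snippet above.

```python
import csv
import io
import os
import tempfile


def first_rows_and_offset(path, nrows):
    """Read the header and `nrows` rows, then report the binary offset."""
    with open(path, "rb") as f:
        text = io.TextIOWrapper(f, encoding="utf-8", newline="")
        reader = csv.reader(text)
        header = next(reader)
        rows = [next(reader) for _ in range(nrows)]
        # Position of the buffered binary file: bytes handed to the wrapper.
        offset = f.tell()
    return header, rows, offset


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trips.csv")
    with open(path, "w", newline="") as out:
        out.write("vendor_id,fare\n")
        out.write("VTS,1.0\nDDS,2.0\n")
        out.write("VTS,3.0\n" * 200_000)
    size = os.path.getsize(path)
    header, rows, offset = first_rows_and_offset(path, 2)
```

Some readahead happens, but the offset stays far below the file size, which is the same property the test asserts with `f.tell()`.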

`read1(nbytes=None)` should not read the entire input stream but should instead return a reasonable number of bytes, suitable for building up an internal buffer.

This should fix the performance issue when using `TextIOWrapper` or `pandas.read_csv` on an S3 input file.
@pitrou pitrou force-pushed the ARROW-16272-native-file-read1 branch from 9482aca to 0a608d6 Compare May 31, 2022 12:46
@pitrou pitrou force-pushed the ARROW-16272-native-file-read1 branch from 0a608d6 to 143a05a Compare May 31, 2022 12:46
@pitrou pitrou closed this in 3149486 May 31, 2022
@pitrou pitrou deleted the ARROW-16272-native-file-read1 branch May 31, 2022 13:30
@ursabot

ursabot commented Jun 1, 2022

Benchmark runs are scheduled for baseline = b851392 and contender = 3149486. 3149486 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.09% ⬆️0.43%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.34% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 31494860 ec2-t3-xlarge-us-east-2
[Finished] 31494860 test-mac-arm
[Failed] 31494860 ursa-i9-9960x
[Finished] 31494860 ursa-thinkcentre-m75q
[Finished] b8513920 ec2-t3-xlarge-us-east-2
[Finished] b8513920 test-mac-arm
[Finished] b8513920 ursa-i9-9960x
[Finished] b8513920 ursa-thinkcentre-m75q

3 participants