[BEAM-6027] Fix slow downloads when reading from GCS #8553
Changes from all commits: 97c755b, 055ebe4, d921782, c0f20e2, 9e13b5e, 2927358, b81c431
```diff
@@ -80,16 +80,21 @@ def finish(self):
 class DownloaderStream(io.RawIOBase):
   """Provides a stream interface for Downloader objects."""

-  def __init__(self, downloader, mode='rb'):
+  def __init__(self,
+               downloader,
+               read_buffer_size=io.DEFAULT_BUFFER_SIZE,
+               mode='rb'):
     """Initializes the stream.

     Args:
       downloader: (Downloader) Filesystem dependent implementation.
+      read_buffer_size: (int) Buffer size to use during read operations.
       mode: (string) Python mode attribute for this stream.
     """
     self._downloader = downloader
     self.mode = mode
     self._position = 0
+    self._reader_buffer_size = read_buffer_size

   def readinto(self, b):
     """Read up to len(b) bytes into b.

@@ -157,6 +162,16 @@ def seekable(self):
   def readable(self):
     return True

+  def readall(self):
+    """Read until EOF, using multiple read() calls."""
+    res = []
+    while True:
+      data = self.read(self._reader_buffer_size)
+      if not data:
+        break
+      res.append(data)
+    return b''.join(res)


 class UploaderStream(io.RawIOBase):
   """Provides a stream interface for Uploader objects."""
```

Review thread on the added `readall()` method:

> **Contributor:** Where is this function used? Probably remove it if unused.

> **Contributor:** Ah, actually it seems you are overriding the function here: https://docs.python.org/3/library/io.html#io.IOBase

> **Contributor:** Sorry, I still have a question. Does Beam call the `readall()` function anywhere? I couldn't find a usage. Beam's textio, for example, invokes `read()`, not `readall()`. If it does, I'm not sure what will prevent us from reading a huge amount of data into memory and running into OOMs.

> **Member:** I only found this usage in beam/sdks/python/apache_beam/io/fileio.py, lines 150 to 154 (at 1382505).

> **Contributor:** That makes sense. I think `ReadableFile` is intended for small files. But we should probably add a `readall()` method there as well and update `read()` to take a buffer (not in this PR). cc: @pabloem
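The `readall()` override above just loops over bounded `read()` calls and joins the chunks. A minimal standalone sketch of the same pattern, where `ChunkedStream` is a hypothetical in-memory stand-in for Beam's `DownloaderStream` (not the actual Beam class):

```python
import io


class ChunkedStream(io.RawIOBase):
    """Sketch of a raw stream whose readall() reads in fixed-size chunks,
    mirroring the pattern this PR adds to DownloaderStream."""

    def __init__(self, data, read_buffer_size=io.DEFAULT_BUFFER_SIZE):
        self._data = data
        self._position = 0
        self._reader_buffer_size = read_buffer_size

    def readable(self):
        return True

    def readinto(self, b):
        # Copy the next slice of the backing data into the caller's buffer.
        chunk = self._data[self._position:self._position + len(b)]
        b[:len(chunk)] = chunk
        self._position += len(chunk)
        return len(chunk)

    def readall(self):
        # Read until EOF using bounded read() calls, accumulating chunks
        # and joining once at the end.
        res = []
        while True:
            data = self.read(self._reader_buffer_size)
            if not data:
                break
            res.append(data)
        return b''.join(res)


stream = ChunkedStream(b'x' * 100000, read_buffer_size=8192)
print(len(stream.readall()))  # 100000
```

Because `io.RawIOBase.read(size)` is implemented on top of `readinto()`, only `readinto()` and `readable()` need to be supplied for the loop to work.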