[BEAM-6027] Fix slow downloads when reading from GCS #8553

fabito · 2019-05-10T18:15:20Z

Overrides io.RawIOBase.readall in filesystemio.DownloaderStream as proposed in BEAM-6027.
It improves download time by roughly 40x.

Signed-off-by: fabito <fuechi@ciandt.com>
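
To illustrate why overriding readall can matter, here is a minimal, self-contained sketch. It does not use Beam: CountingStream and LargeChunkStream are hypothetical stand-ins, and it assumes that each readinto() call on the real stream costs one remote request. The inherited io.RawIOBase.readall() reads in small fixed-size chunks, so a large object needs many calls, while an override that asks for bigger chunks needs far fewer.

```python
import io


class CountingStream(io.RawIOBase):
    """Toy raw stream over in-memory bytes that counts readinto() calls."""

    def __init__(self, payload):
        super().__init__()
        self._payload = payload
        self._pos = 0
        self.readinto_calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Stand-in for one remote range request per call (an assumption).
        self.readinto_calls += 1
        data = self._payload[self._pos:self._pos + len(b)]
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)


class LargeChunkStream(CountingStream):
    """Same stream, but readall() asks for much larger chunks."""

    def __init__(self, payload, chunk_size):
        super().__init__(payload)
        self._chunk_size = chunk_size

    def readall(self):
        res = []
        while True:
            data = self.read(self._chunk_size)
            if not data:
                break
            res.append(data)
        return b''.join(res)


payload = b'x' * (16 * io.DEFAULT_BUFFER_SIZE)
default_stream = CountingStream(payload)
chunked_stream = LargeChunkStream(payload, chunk_size=len(payload))

default_data = default_stream.read()  # falls back to RawIOBase.readall()
chunked_data = chunked_stream.read()  # uses the overridden readall()

print(default_stream.readinto_calls, chunked_stream.readinto_calls)
```

Both streams return the same bytes; only the number of underlying readinto() calls differs.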

udim

Thank you for your contribution!
Very cool

sdks/python/apache_beam/io/filesystemio.py

udim · 2019-05-21T20:13:23Z

cc: @chamikaramj
Can you run some benchmarks and include your results here?

fabito · 2019-05-23T14:15:53Z

Hi @udim ,

Using this snippet:

import tempfile
import timeit

from apache_beam.io.filesystems import FileSystems
from apache_beam.io.gcp import gcsio
from apache_beam.io.filesystemio import DownloaderStream


# https://issues.apache.org/jira/browse/BEAM-6027
def downloader_stream_readall(self):
    res = []
    while True:
        data = self.read(gcsio.DEFAULT_READ_BUFFER_SIZE)
        if not data:
            break
        res.append(data)
    return b''.join(res)


original_read_all = DownloaderStream.readall


if __name__ == '__main__':
    test_file = 'gs://cloud-samples-tests/vision/saigon.mp4'
    num_executions = 1

    def test_original():
        DownloaderStream.readall = original_read_all
        with FileSystems.open(test_file) as audio_file:
            with tempfile.NamedTemporaryFile(mode='w+b') as temp:
                temp.write(audio_file.read())

    def test_refactored():
        DownloaderStream.readall = downloader_stream_readall
        with FileSystems.open(test_file) as audio_file:
            with tempfile.NamedTemporaryFile(mode='w+b') as temp:
                temp.write(audio_file.read())

    print(timeit.timeit("test_original()", setup="from __main__ import test_original", number=num_executions))
    print(timeit.timeit("test_refactored()", setup="from __main__ import test_refactored", number=num_executions))

I got the following output (seconds; original readall first, then the refactored one):

120.99772120200214
1.0684915780002484

Hope that helps
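
For reference, the ratio between the two timings reported above can be worked out as follows. This is only arithmetic on the posted numbers; since each variant was timed once (num_executions = 1), it is best read as an order-of-magnitude indication rather than a precise figure.

```python
# Timings copied from the benchmark output above, in seconds.
original_seconds = 120.99772120200214
refactored_seconds = 1.0684915780002484

# How many times faster the refactored readall was in this single run.
speedup = original_seconds / refactored_seconds
print(f"speedup: {speedup:.1f}x")
```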

chamikaramj · 2019-05-23T15:53:35Z

Thanks for the update. This looks great.

It seems the file you used for your microbenchmark is about 4 MB, which fits within the first chunk at the new buffer size. Can you try running a Beam pipeline with a larger input (say 10 GB) on Dataflow to confirm that there is no regression at large scale?
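
A rough sketch of the chunk arithmetic behind this point. The 16 MiB buffer size is an assumption made for illustration (the thread only refers to gcsio.DEFAULT_READ_BUFFER_SIZE without stating its value), and chunks_needed is a hypothetical helper.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

# Assumed buffer size for illustration; not confirmed by the thread.
ASSUMED_READ_BUFFER_SIZE = 16 * MIB


def chunks_needed(file_size, buffer_size=ASSUMED_READ_BUFFER_SIZE):
    """Number of non-empty read(buffer_size) calls needed to consume a file."""
    return -(-file_size // buffer_size)  # ceiling division


small_file_chunks = chunks_needed(4 * MIB)
large_file_chunks = chunks_needed(10 * GIB)
print(small_file_chunks, large_file_chunks)
```

Under that assumption a 4 MB file is consumed in a single read, so the microbenchmark never exercises the multi-chunk path that a 10 GB input would.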

fabito · 2019-05-23T16:00:26Z

Yes, I can. Any advice on what this pipeline should look like and how we can measure the read performance?
Maybe something like this:

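    import apache_beam as beam

    # pipeline_options is assumed to be configured for the Dataflow runner.
    with beam.Pipeline(options=pipeline_options) as pipeline:
        _ = (
            pipeline
            # ReadFromText takes a file pattern directly; ReadAllFromText
            # expects a PCollection of file patterns as its input instead.
            | 'Read 10Gb file' >> beam.io.ReadFromText('gs://bucket/10Gb.txt')
            | 'Write file' >> beam.io.WriteToText('gs://bucket/10Gb_copy*.txt')
        )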

chamikaramj · 2019-05-23T16:20:05Z

Yeah, that pipeline looks good. End-to-end execution time and Total vCPU time, as shown in the Dataflow console, should be good metrics to compare.

fabito · 2019-05-23T17:49:37Z

I ran the same pipeline twice, first with the original code and then with the new implementation of readall, using a text file of ~8.81 GB. Performance did not appear to be affected. See the console snapshots from both executions below:

Before the change: [Dataflow console screenshot]

After the change: [Dataflow console screenshot]

chamikaramj

Thanks. LGTM.

Thanks for fixing this.

Added one comment. Also please fixup your commits into a single commit for merging.

chamikaramj · 2019-05-23T20:05:43Z

sdks/python/apache_beam/io/filesystemio.py

+  def readall(self):
+    """Read until EOF, using multiple read() call."""
+    res = []
+    while True:


Where is this function used?

Probably remove it if unused.

Ah actually seems like you are overriding the function here: https://docs.python.org/3/library/io.html#io.IOBase

Sorry, still have a question.

Does Beam call the readall() function anywhere? I couldn't find a usage. Beam textio, for example, invokes read(), not readall().
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py#L272

If it does, I'm not sure what will prevent us from reading a huge amount of data into memory and running into OOMs.

I only found this usage in ReadableFile (relatively new) where we don't specify the size:

beam/sdks/python/apache_beam/io/fileio.py

Lines 150 to 154 in 1382505

    def open(self, mime_type='text/plain'):
      return filesystems.FileSystems.open(self.metadata.path)

    def read(self):
      return self.open().read()

That makes sense. I think ReadableFile is intended for small files. But probably we should add a readall() method there as well and update read() to take a buffer (not in this PR).

cc: @pabloem
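
One way to picture the "read() takes a buffer" idea is a bounded-chunk loop, sketched below. This is not Beam code: iter_chunks is a hypothetical helper, and io.BytesIO stands in for a file-like object such as the one returned by FileSystems.open. It shows how a caller could keep at most one chunk in memory instead of the whole file.

```python
import io


def iter_chunks(stream, chunk_size=1024 * 1024):
    """Yield successive chunks of at most chunk_size bytes until EOF."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# In-memory stand-in for a large remote file (3 MiB plus 5 bytes).
source = io.BytesIO(b'x' * (3 * 1024 * 1024 + 5))
chunk_sizes = [len(chunk) for chunk in iter_chunks(source)]
print(chunk_sizes)
```

The trade-off is that callers must process data incrementally; a plain read() with no size remains convenient for small files but loads everything at once.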

chamikaramj · 2019-05-23T23:38:05Z

Thanks. I'll squash and merge.

udim · 2019-05-23T23:38:25Z

Adding a note so we don't forget to "run python postcommit" before merging.

override readall to solve BEAM-6027 https://issues.apache.org/jira/br…

97c755b

…owse/BEAM-6027 Signed-off-by: fabito <fuechi@ciandt.com>

fabito marked this pull request as ready for review May 10, 2019 18:15

fabito added 2 commits May 11, 2019 10:09

add new contructor param for the read_buffer_size

055ebe4

Signed-off-by: fabito <fuechi@ciandt.com>

💦 using correct attribute

d921782

robertwb requested a review from udim May 21, 2019 12:58

udim requested changes May 21, 2019

fabito added 4 commits May 22, 2019 22:43

fix lint error

c0f20e2

Signed-off-by: fabito <fuechi@ciandt.com>

fix lint error

9e13b5e

Signed-off-by: fabito <fuechi@ciandt.com>

fix lint error

2927358

Signed-off-by: fabito <fuechi@ciandt.com>

fix lint error

b81c431

Signed-off-by: fabito <fuechi@ciandt.com>

chamikaramj approved these changes May 23, 2019

udim approved these changes May 23, 2019

chamikaramj merged commit 4cf2830 into apache:master May 23, 2019

fabito deleted the override-readall-for-faster-downloads branch June 22, 2019 21:42

Abacn mentioned this pull request Jun 21, 2022

Slow DownloaderStream when reading from GCS #19238

Closed
