
[BEAM-3099] Split out BufferedReader and BufferedWriter from gcsio. #4471

Merged: chamikaramj merged 1 commit into apache:master from udim:filesystem-io, Feb 1, 2018

Conversation


@udim udim commented Jan 24, 2018

Most of the code in filesystemio.py is copied verbatim from gcsio.py.
The Downloader and Uploader classes are new.


udim commented Jan 24, 2018

R: @charlesccychen

@charlesccychen charlesccychen (Contributor) left a comment:

Thanks Udi!

self.child_conn = child_conn
self.conn = parent_conn
# TODO: document, rename method?, rename child_conn maybe
def start(self, child_conn):
charlesccychen (Contributor):

See comment in filesystemio.py about potentially passing a stream (PipeStream) object here instead.

udim (Member, Author):

I've changed Uploader's interface so that pipe usage is now an implementation detail of GcsUploader.
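For context, an abstract Uploader along these lines keeps the pipe machinery out of the interface; the `put`/`finish` names and the toy in-memory implementation below are illustrative assumptions, not the exact code from this PR:

```python
import abc
import io

class Uploader(abc.ABC):
    """Abstract upload interface. How bytes reach the backend (e.g. via a
    multiprocessing pipe, as in GcsUploader) is an implementation detail."""

    @abc.abstractmethod
    def put(self, data):
        """Write a chunk of bytes to the destination."""

    @abc.abstractmethod
    def finish(self):
        """Flush any buffered data and finalize the upload."""

class InMemoryUploader(Uploader):
    """Toy implementation that 'uploads' into a local buffer."""

    def __init__(self):
        self._buffer = io.BytesIO()
        self.finished = False

    def put(self, data):
        self._buffer.write(data)

    def finish(self):
        self.finished = True

    def value(self):
        return self._buffer.getvalue()
```

Callers only ever see put() and finish(); a GCS implementation can feed a PipeStream internally without that leaking into the interface.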


@abc.abstractmethod
def start(self, download_stream, buffer_size):
"""Initialize downloader.
charlesccychen (Contributor):

Can you detail that this needs to be called before get_range calls?

See also the comments below regarding whether download_stream should be completely managed by the Downloader, as opposed to the current design, where it is owned by the BufferedReader. In that case, we may not need this start method, since download_stream would then be an internal detail handled in the particular constructor.

udim (Member, Author):

Thanks. I've removed the start() method entirely.

Args:
download_stream: (cStringIO.StringIO) A buffer where downloaded data is
streamed to.
buffer_size: Maximum range size for get_range calls.
charlesccychen (Contributor):

It looks like the buffer_size argument here is mostly an internal implementation detail of the specific downloader so maybe we don't need this argument? It may be helpful for us to expose the maximum chunk size allowable for a single call to get_range, which we use to upper-bound the buffer_size parameter of BufferedReader.

udim (Member, Author):

Added max_range_size property.

udim (Member, Author):

Actually, this was removed. You can pass buffer_size= to io.BufferedReader/Writer.
See examples in gcsio.py and the unit tests.
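To illustrate passing buffer_size= to io.BufferedReader: the raw stream below is a stand-in for the PR's reader (not its actual code), instrumented to count how many reads reach the underlying source.

```python
import io

class CountingRaw(io.RawIOBase):
    """Raw stream that records how many read calls reach it, to show the
    effect of io.BufferedReader's buffer_size (illustrative only)."""

    def __init__(self, data):
        self._data = io.BytesIO(data)
        self.read_calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        self.read_calls += 1
        chunk = self._data.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)

raw = CountingRaw(b'x' * 1024)
reader = io.BufferedReader(raw, buffer_size=512)
first = reader.read(10)    # one buffered fill of up to 512 bytes
second = reader.read(100)  # served from the buffer, no extra raw read
```

This is why buffer_size can live at the io.BufferedReader layer rather than in the Downloader interface itself.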

# TODO: Consider using cStringIO instead of buffers and data_lists when reading
# and writing.
class BufferedWriter(object):
"""A class for writing files from stateless services.
charlesccychen (Contributor):

"writing files to"?

(actually, what does "stateless" here mean?)

udim (Member, Author):

Stateless refers to the file API: there is no long-lived file handle, the filesystem server doesn't store our current file position, etc.

return self.position

def seek(self, offset, whence=os.SEEK_SET):
# The apitools.base.py.transfer.Upload class insists on seeking to the end
charlesccychen (Contributor):

Now that this is factored out, we can note in this comment that we do this for the sake of the GCS implementation.

udim (Member, Author):

Done.

self.child_conn = child_conn
self.conn = parent_conn

self.uploader.start(child_conn)
charlesccychen (Contributor):

Can we wrap this in a PipeStream here (and modify the Uploader API accordingly)?

udim (Member, Author):

Uploader API has been changed so PipeStream is now an implementation detail of GcsUploader.

return next(self)

def next(self):
"""Read one line delimited by '\\n' from the file.
charlesccychen (Contributor):

Remove extra newline.

udim (Member, Author):

done

return self

def __next__(self):
"""Read one line delimited by '\\n' from the file.
charlesccychen (Contributor):

Remove extra newline.

udim (Member, Author):

done

self.downloader.get_range(start, end)
value = self.download_stream.getvalue()
# Clear the cStringIO object after we've read its contents.
self.download_stream.truncate(0)
charlesccychen (Contributor) commented Jan 24, 2018:

It looks like the reason we have a stream at all is to satisfy the particular API of the apitools client we use for GCS. Would a cleaner API be to have the downloader directly return the bytes from get_range()? (so that the download_stream would be managed by the downloader?)

See above for how we could get rid of Downloader.start(download_stream, ...) with such an approach.

udim (Member, Author):

get_range now returns a string.
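The resulting shape of the interface can be sketched like this, with the implementation owning any internal stream and get_range() handing bytes straight back; the half-open range and the in-memory implementation are illustrative assumptions, not the PR's exact code:

```python
import abc

class Downloader(abc.ABC):
    """Abstract download interface: any buffering stream is internal to the
    implementation, and get_range returns the bytes directly."""

    @property
    @abc.abstractmethod
    def size(self):
        """Total size of the remote file, in bytes."""

    @abc.abstractmethod
    def get_range(self, start, end):
        """Return the bytes in the half-open range [start, end)."""

class InMemoryDownloader(Downloader):
    """Toy implementation backed by a bytes object."""

    def __init__(self, data):
        self._data = data

    @property
    def size(self):
        return len(self._data)

    def get_range(self, start, end):
        return self._data[start:end]
```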


udim commented Jan 25, 2018

PTAL. Thanks

@charlesccychen charlesccychen (Contributor) left a comment:

Thanks!

def get_range(self, start, end):
self.download_stream.truncate(0)
self._downloader.GetRange(start, end)
charlesccychen (Contributor):

If we make the end index non-inclusive (see comment on filesystemio.py), we need to use end - 1 here. The reason the apitools library uses inclusive indices is because the HTTP range header uses inclusive indices (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests), but that doesn't seem like a great reason for our interface to be inclusive too.

return 0

start = self._position
end = min(self._position + len(b) - 1, self._downloader.size - 1)
charlesccychen (Contributor):

See comment above about possibly making this range end non-inclusive.

return self._size

def get_range(self, start, end):
charlesccychen (Contributor):

See comment in filesystemio.py about making the range half-open.

self._get_request.generation = metadata.generation

# Initialize read buffer state.
self.download_stream = cStringIO.StringIO()
charlesccychen (Contributor):

Can we add an underscore / make this private too?


@abc.abstractproperty
def last_error(self):
"""Last error encountered for this instance."""
charlesccychen (Contributor):

Can you describe usage of this property? When is it set and when would it be useful to query?

udim (Member, Author):

Do you mean expand the docstring, or as a reply?

charlesccychen (Contributor):

In the docstring I mean, for future filesystem implementors.

return self._read_inner(size=size, readline=False)
return self._client.objects.Get(get_request)

def readline(self, size=-1):
charlesccychen (Contributor):

There may be a subtle behavior change now that this custom readline() code is removed. The standard library code (_pyio.py) may do some \r\n -> \n translation depending on the read mode and platform, which we don't want. Can you verify that this won't be the case, even on Windows?

CC: @chamikaramj

udim (Member, Author):

TextIOWrapper is not used, thus no translation is performed and readline() uses b"\n" as a line terminator.
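This is easy to check: wrapping a binary stream in io.BufferedReader (with no TextIOWrapper anywhere in the stack) splits lines on b'\n' only and performs no \r\n translation:

```python
import io

raw = io.BytesIO(b'first\r\nsecond\nthird')
reader = io.BufferedReader(raw)

lines = [reader.readline(), reader.readline(), reader.readline()]
# Binary-mode readline() terminates on b'\n' and keeps the b'\r' intact:
# [b'first\r\n', b'second\n', b'third']
```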

charlesccychen (Contributor):

Thanks!

def finish(self):
self._conn.close()
# TODO(udim): Add timeout=DEFAULT_HTTP_TIMEOUT_SECONDS * 2 and check
# isAlive.
charlesccychen (Contributor):

Is the TODO content just an optimization, or is this a correctness issue?

udim (Member, Author):

Correctness. The way it's called, join() may block forever. It currently works, but there might be flows in which the thread never exits. Adding a timeout is important, but it might introduce a bug, so I added this TODO instead.

charlesccychen (Contributor):

Are you worried about the network hanging? That may already be mitigated by using timeout=DEFAULT_HTTP_TIMEOUT_SECONDS in the httplib2.Http client.

udim (Member, Author):

It's just incorrect to wait forever without a timeout or some mechanism to interrupt the wait.
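A sketch of what the TODO describes: join with a timeout, then check is_alive() instead of blocking forever. The timeout constant and the error handling here are assumptions (the TODO itself suggests DEFAULT_HTTP_TIMEOUT_SECONDS * 2):

```python
import threading
import time

# Hypothetical value; the PR's TODO proposes DEFAULT_HTTP_TIMEOUT_SECONDS * 2.
JOIN_TIMEOUT_SECONDS = 2

def finish_upload_thread(thread):
    # Wait a bounded amount of time for the thread to exit, then verify
    # it actually did, instead of an unbounded join().
    thread.join(JOIN_TIMEOUT_SECONDS)
    if thread.is_alive():
        raise IOError('upload thread still alive after %d seconds'
                      % JOIN_TIMEOUT_SECONDS)

worker = threading.Thread(target=lambda: time.sleep(0.01))
worker.start()
finish_upload_thread(worker)  # returns promptly once the worker exits
```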


@abc.abstractmethod
def get_range(self, start, end):
"""Retrieve a given byte range from this download, inclusive.
charlesccychen (Contributor):

Can we make the range half-open, i.e. [start, end)? We used inclusive indices because we used the apitools library for GCS, and the reason the apitools library uses inclusive indices is because the HTTP range header uses inclusive indices (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests), but that doesn't seem like a great reason for our interface to be inclusive too.
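The translation between the two conventions is a one-line adjustment at the HTTP boundary; a hypothetical helper (not from the PR) makes the mapping concrete:

```python
def http_range_header(start, end):
    """Map a half-open [start, end) byte range onto the inclusive
    bytes=first-last form that the HTTP Range header expects."""
    if end <= start:
        raise ValueError('empty range: [%d, %d)' % (start, end))
    return 'bytes=%d-%d' % (start, end - 1)

# http_range_header(0, 100) -> 'bytes=0-99'
```

Keeping the public interface half-open matches Python slicing, and only the HTTP layer pays the end - 1 cost.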

return self.position

def seek(self, offset, whence=os.SEEK_SET):
# The gcsio.Uploader class insists on seeking to the end of a stream to
charlesccychen (Contributor):

"The apitools library used by the gcsio.Uploader class"

return self.position

def seek(self, offset, whence=os.SEEK_SET):
# The apitools.base.py.transfer.Upload class insists on seeking to the end
charlesccychen (Contributor):

udim wrote:
Done.

Done.

Args:
download_stream: (cStringIO.StringIO) A buffer where downloaded data is
streamed to.
buffer_size: Maximum range size for get_range calls.
charlesccychen (Contributor):

udim wrote:
Actually, this was removed. You can pass buffer_size= to io.BufferedReader/Writer.
See examples in gcsio.py and the unit tests.

Done.

self.child_conn = child_conn
self.conn = parent_conn
# TODO: document, rename method?, rename child_conn maybe
def start(self, child_conn):
charlesccychen (Contributor):

udim wrote:
I've changed Uploader's interface so that pipe usage is now an implementation detail of GcsUploader.

Done.

return self

def __next__(self):
"""Read one line delimited by '\\n' from the file.
charlesccychen (Contributor):

udim wrote:
done

Done.

self.downloader.get_range(start, end)
value = self.download_stream.getvalue()
# Clear the cStringIO object after we've read its contents.
self.download_stream.truncate(0)
charlesccychen (Contributor):

udim wrote:
get_range now returns a string.

Done.

self.child_conn = child_conn
self.conn = parent_conn

self.uploader.start(child_conn)
charlesccychen (Contributor):

udim wrote:
Uploader API has been changed so PipeStream is now an implementation detail of GcsUploader.

Done.

return next(self)

def next(self):
"""Read one line delimited by '\\n' from the file.
charlesccychen (Contributor):

udim wrote:
done

Done.

# TODO: Consider using cStringIO instead of buffers and data_lists when reading
# and writing.
class BufferedWriter(object):
"""A class for writing files from stateless services.
charlesccychen (Contributor):

udim wrote:
Stateless refers to the file API: there is no long-lived file handle, the filesystem server doesn't store our current file position, etc.

Done.


@abc.abstractmethod
def start(self, download_stream, buffer_size):
"""Initialize downloader.
charlesccychen (Contributor):

udim wrote:
Thanks. I've removed the start() method entirely.

Done.

self._get_request.generation = metadata.generation

# Initialize read buffer state.
self.download_stream = cStringIO.StringIO()
udim (Member, Author):

charlesccychen wrote:
Can we add an underscore / make this private too?

Done.


@abc.abstractproperty
def last_error(self):
"""Last error encountered for this instance."""
udim (Member, Author):

charlesccychen wrote:
In the docstring I mean, for future filesystem implementors.

After further consideration, I think this property is safe to remove.

return 0

start = self._position
end = min(self._position + len(b) - 1, self._downloader.size - 1)
udim (Member, Author):

charlesccychen wrote:
See comment above about possibly making this range end non-inclusive.

Done.


@abc.abstractmethod
def get_range(self, start, end):
"""Retrieve a given byte range from this download, inclusive.
udim (Member, Author):

charlesccychen wrote:
Can we make the range half-open, i.e. [start, end)? We used inclusive indices because we used the apitools library for GCS, and the reason the apitools library uses inclusive indices is because the HTTP range header uses inclusive indices (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests), but that doesn't seem like a great reason for our interface to be inclusive too.

Done.

def get_range(self, start, end):
self.download_stream.truncate(0)
self._downloader.GetRange(start, end)
udim (Member, Author):

charlesccychen wrote:
If we make the end index non-inclusive (see comment on filesystemio.py), we need to use end - 1 here. The reason the apitools library uses inclusive indices is because the HTTP range header uses inclusive indices (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests), but that doesn't seem like a great reason for our interface to be inclusive too.

Done.

return self._size

def get_range(self, start, end):
udim (Member, Author):

charlesccychen wrote:
See comment in filesystemio.py about making the range half-open.

Done.

return self.position

def seek(self, offset, whence=os.SEEK_SET):
# The gcsio.Uploader class insists on seeking to the end of a stream to
udim (Member, Author):

charlesccychen wrote:
"The apitools library used by the gcsio.Uploader class"

Done.

@charlesccychen charlesccychen (Contributor) left a comment:

Thanks! This LGTM.

return self._read_inner(size=size, readline=False)
return self._client.objects.Get(get_request)

def readline(self, size=-1):
charlesccychen (Contributor):

charlesccychen wrote:
Thanks!

Done.

def finish(self):
self._conn.close()
# TODO(udim): Add timeout=DEFAULT_HTTP_TIMEOUT_SECONDS * 2 and check
# isAlive.
charlesccychen (Contributor):

udim wrote:
It's just incorrect to wait forever without a timeout or some mechanism to interrupt the wait.

Acknowledged.

@charlesccychen charlesccychen (Contributor):

run python postcommit

New module filesystemio introduces Uploader and Downloader interfaces,
plus respective UploaderStream and DownloaderStream adapters that may
be wrapped by io.BufferedWriter and io.BufferedReader.

udim commented Feb 1, 2018

Rebased commits into one.
@chamikaramj could you please merge?

@chamikaramj chamikaramj merged commit e34fee1 into apache:master Feb 1, 2018
@udim udim deleted the filesystem-io branch February 1, 2018 22:52