Content bytes option #56

kd2718 · 2015-07-09T08:15:46Z

Added the option to pass a byte range when requesting the contents of a file. The Box API already supports this, but the boxsdk for Python did not expose it.

Example:

bytes = [0,60]
client.file(file_id=id).content(bytes=bytes)

CLA signed

boxcla · 2015-07-09T08:15:55Z

Hi @kd2718, thanks for the pull request. Before we can merge it, we need you to sign our Contributor License Agreement. You can do so electronically here: http://opensource.box.com/cla

Once you have signed, just add a comment to this pull request saying, "CLA signed". Thanks!

landscape-bot · 2015-07-09T08:17:39Z

Repository health decreased by 0.46% when pulling 0a32f72 on kd2718:master into e4388ac on box:master.

13 new problems were found (including 1 error and 4 code smells).
No problems were fixed.

Jeff-Meadows · 2015-07-09T16:27:53Z

Hi, thanks for the PR. Before we can merge it, the unit tests will need to be updated, and the option should probably also be added to the download method.

Are you interested in doing that? If not, I can pick up where you've left off - just let me know.

kd2718 · 2015-07-09T17:44:39Z

@Jeff-Meadows I'll see what I can finish over the next few days. Thanks!

landscape-bot · 2015-07-10T07:53:33Z

Repository health decreased by 0.39% when pulling da6f9d8 on kd2718:master into e4388ac on box:master.

13 new problems were found (including 0 errors and 5 code smells).
No problems were fixed.

jmoldow · 2015-07-10T07:59:34Z

boxsdk/object/file.py

        """
        Get the content of a file on Box.

+        :peram bytes:


Typo: param, not peram.

Also, bytes is a built-in type in Python 3. Even though this is valid to do in Python, it still might be better to avoid that name. Something like byte_range or byte_range_set is more descriptive anyway.

kd2718 · 2015-07-14T06:20:28Z

I've applied @jmoldow's suggestions.

I am not familiar with building unit tests. I could make a few guesses, but I felt it would be better if you all implemented them correctly. Once they are in, I can look over them for an idea for next time.

Thanks,
Kory

Jeff-Meadows · 2015-07-14T16:14:47Z

Hi @kd2718 - thanks for the updates. I can merge this to a branch and work on some unit tests for it. In the meantime, could you make sure you've signed the CLA, and once you have, leave a comment here saying "CLA signed"?

jmoldow · 2015-07-14T16:38:42Z

I was originally going to suggest that the function should take a start and end position, something like

def content(self, first_byte_position=None, last_byte_position=None)

I find this to be more Pythonic, as it is similar to functions like range. I think we should avoid passing lists instead of using distinct parameters.

But then I realized that there are a few different scenarios that the RFC calls for:

"{first}-{last}"
"{first}-"
"-{suffix-length}" (not supported by Box right now)
multiple comma-separated byte ranges (not supported by Box right now)
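
To make the four forms above concrete, here is a minimal sketch that classifies a byte-range-set string. The helper name classify_byte_range_set and its labels are hypothetical and not part of boxsdk; the sketch only illustrates the syntax, not what Box accepts.

```python
import re

# Hypothetical helper for illustration only; not part of boxsdk.
_FIRST_LAST = re.compile(r'(\d+)-(\d+)')
_FIRST_ONLY = re.compile(r'(\d+)-')
_SUFFIX = re.compile(r'-(\d+)')


def classify_byte_range_set(value):
    """Return a label naming which byte-range form `value` uses."""
    if ',' in value:
        # Multiple ranges: every comma-separated part must itself be valid.
        parts = [part.strip() for part in value.split(',')]
        if all(classify_byte_range_set(part) != 'invalid' for part in parts):
            return 'multiple'
        return 'invalid'
    match = _FIRST_LAST.fullmatch(value)
    if match:
        first, last = int(match.group(1)), int(match.group(2))
        # The RFC requires last >= first for the spec to be valid.
        return 'first-last' if first <= last else 'invalid'
    if _FIRST_ONLY.fullmatch(value):
        return 'first-only'
    if _SUFFIX.fullmatch(value):
        return 'suffix'
    return 'invalid'
```

Per the list above, only the 'first-last' and 'first-only' forms would be sent to Box today; the other two labels exist so the sketch covers the full RFC grammar.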

We could handle all of this inside the function, but I think it would be hard to keep future-proof (in case the other forms become allowed) and, if more forms do become available, hard to keep from confusing users of the SDK. The docstring would need to explain the semantics of first_byte_position, last_byte_position, and suffix_length, along with the three valid combinations of parameters. And the function would need to figure out what to do depending on which parameters are passed.

Instead, I think we should move all of this into a new class with a smart constructor. My proposal is something like this:

import abc

class ByteRange(abc.ABC):  # abc.ABC needs Python 3.4+; on Python 2, set __metaclass__ = abc.ABCMeta instead.
  """Represents a byte range for a byte range retrieval request, from section 14.35 of RFC 2616: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35."""

  def range_header_value(self):
    # Value for the HTTP Range header, e.g. 'bytes=2-4'.
    return 'bytes={}'.format(self.byte_range_set())

  @abc.abstractmethod
  def byte_range_set(self):
    """Return the byte-range-set portion of the header value, e.g. '2-4'."""

class ByteRangeSpec(ByteRange):
  """Represents a single, continuous byte range, where first_byte_position is the (0-indexed) position of the first byte of the entity that should be included, and last_byte_position is the position of the last byte that should be included (None is taken to mean EOF). E.g., ByteRangeSpec(2, 4) represents skipping bytes 0 and 1, including bytes 2, 3, and 4, and skipping all bytes beyond that. And ByteRangeSpec(2) includes all bytes except for 0 and 1.
  """
  def __init__(self, first_byte_position, last_byte_position=None):
    self.first_byte_position = first_byte_position
    self.last_byte_position = last_byte_position

  def byte_range_set(self):
    # Compare against None explicitly: a last_byte_position of 0 is valid and must not be dropped.
    last = '' if self.last_byte_position is None else self.last_byte_position
    return '{}-{}'.format(self.first_byte_position, last)

# In the hypothetical future.
#class SuffixByteRangeSpec(ByteRange):
#  def __init__(self, suffix_length):
#    self.suffix_length = suffix_length
#  def byte_range_set(self):
#    return '-{}'.format(self.suffix_length)

We can use these classes' docstrings to explain all the semantics and behaviors, rather than cluttering the content and download_to functions. For example, we'll want to explain:

0 <= start <= end
It is a closed interval. So [0, 5] will give you bytes 0, 1, 2, 3, 4, and 5 (6 bytes total). This differs from typical Python functions such as range, which use half-open intervals. I suppose we could make this behave more like range, but I don't think it's worth the potential confusion - mimicking the RFC is probably best.
The end of the range can be omitted (e.g. "bytes=50-"), in which case the response contains everything from byte position 50 (0-indexed, so the 51st byte) through the end of the file.
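
The difference between the RFC's closed interval and Python's half-open slices is easy to get wrong, so here is a small sketch of the conversion. The function name closed_range_to_slice is hypothetical and only illustrates the semantics listed above; it is not part of the proposed classes.

```python
def closed_range_to_slice(first, last=None):
    """Convert an RFC-style closed byte range into a Python slice.

    `first` and `last` are 0-indexed and both inclusive; `last=None`
    means "through the end of the data".
    """
    if first < 0:
        raise ValueError('first must be >= 0')
    if last is None:
        return slice(first, None)
    if last < first:
        raise ValueError('last must be >= first')
    # Python slices exclude the stop index, so add 1 to include `last`.
    return slice(first, last + 1)


data = bytes(range(100))
first_six = data[closed_range_to_slice(0, 5)]   # bytes 0 through 5, 6 bytes total
tail = data[closed_range_to_slice(50)]          # byte position 50 through the end
```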

This way, we also don't have to duplicate logic in the content and download_to functions.

I guess one downside is that right now, only the ByteRangeSpec form is supported, so this seems a little heavy-handed, especially if none of the other forms ever get implemented. I still like it though.

@Jeff-Meadows @kd2718 what do you think?

kd2718 · 2015-07-17T06:10:15Z

@jmoldow and @Jeff-Meadows
Sorry for the delay in my response. I had thought about making the start and stop bytes their own variables. Making them an object would help control what we get from the user.

This plays into an idea I had. If the file being downloaded is large, it may be more beneficial to download it in chunks, rather than all at once. What if the download_to function returned an object that could be called repeatedly to get specified chunks from that file? However, at this point, we would have to make a new API call for each chunk that is downloaded.

If I am not making myself clear, think of this as similar to fid.readlines(). But we would call file.get_bytes() and the object would keep track of where we are in the file.

Let me know if this sounds OK.
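
The stateful reader described above could look roughly like the sketch below. Everything here is hypothetical: the class name ChunkReader, the fetch_range callable (assumed to return the bytes in the closed interval [first, last], fewer at end of file, and b'' for a range past the end), and the chunk_size parameter are illustrative and not part of boxsdk.

```python
class ChunkReader(object):
    """Hypothetical sketch: reads a remote file one byte range at a time."""

    def __init__(self, fetch_range, chunk_size):
        if chunk_size <= 0:
            raise ValueError('chunk_size must be positive')
        # fetch_range(first, last) is assumed to make one API call and
        # return the bytes in the closed interval [first, last].
        self._fetch_range = fetch_range
        self._chunk_size = chunk_size
        self._position = 0
        self._exhausted = False

    def get_bytes(self):
        """Return the next chunk, or b'' once the file is exhausted."""
        if self._exhausted:
            return b''
        first = self._position
        last = first + self._chunk_size - 1
        chunk = self._fetch_range(first, last)
        self._position += len(chunk)
        # A short (or empty) chunk means we reached the end of the file.
        if len(chunk) < self._chunk_size:
            self._exhausted = True
        return chunk
```

Each call to get_bytes() costs one request, which is the trade-off mentioned above. A real implementation would also have to decide how to handle a server that rejects a range starting past the end of the file rather than returning an empty body.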

landscape-bot · 2015-07-17T17:01:52Z

Repository health decreased by 0.22% when pulling 4cd98d4 on kd2718:master into e4388ac on box:master.

12 new problems were found (including 1 error and 3 code smells).
No problems were fixed.

kd2718 · 2015-08-26T21:00:43Z

I got distracted and had to step away from this for a while. Is this something you are still interested in? @Jeff-Meadows @jmoldow

jmoldow · 2015-12-14T22:55:07Z

I'm closing this PR for now. The work will be tracked in issue #95, which I've started work on. This will be a more generic and Pythonic implementation, along the lines of what I mentioned in my comment. Thanks a lot for the initial pass at this!

With regard to your other comment:

This plays into an idea I had. If the file being downloaded is large, it may be more beneficial to download it in chunks, rather than all at once. What if the download_to function returned an object that could be called repeatedly to get specified chunks from that file? However, at this point, we would have to make a new API call for each chunk that is downloaded.

I definitely agree that we should support chunked downloading. However, we don't need Byte Ranges for this: HTTP/1.0 already supports receiving the response to a non-byte-range request in chunks, and the requests library already exposes this in its API. I've opened a ticket for my idea in #96.
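
As a rough illustration of the requests-based approach, the sketch below copies a streamed response body to a file-like object in chunks. The function name save_streamed_response is hypothetical and this is not the SDK's implementation; it assumes `response` is a requests.Response obtained with stream=True, and it relies only on that object's iter_content method.

```python
def save_streamed_response(response, writeable_stream, chunk_size=8192):
    """Write a streamed response body to `writeable_stream` chunk by chunk.

    Returns the total number of bytes written.
    """
    total = 0
    # iter_content yields the body incrementally instead of loading it all.
    for chunk in response.iter_content(chunk_size=chunk_size):
        if chunk:  # skip empty keep-alive chunks
            writeable_stream.write(chunk)
            total += len(chunk)
    return total


# Usage, assuming the requests library is installed and `url` is a
# placeholder for a download URL:
#
#     import requests
#     response = requests.get(url, stream=True)
#     with open('out.bin', 'wb') as handle:
#         save_streamed_response(response, handle)
```

Unlike the byte-range approach, this uses a single request for the whole file, so no per-chunk API call is needed.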

kd2718 added 2 commits July 8, 2015 23:59

updated gitignore

ea9a06d

added bytes option to content to pull file in chunks.

0a32f72

updated Download_to

da6f9d8

jmoldow reviewed Jul 10, 2015

fixed a few spelling mistakes and implement check on array size

4cd98d4

jmoldow mentioned this pull request Dec 14, 2015

Implement Byte Range requests for file content downloads #95

Closed

jmoldow closed this Dec 14, 2015

mgrytsai mentioned this pull request Jul 1, 2022

Generator for chunks of content stream #96

Open
