Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Content bytes option #56

Closed
wants to merge 4 commits into from
Closed

Content bytes option #56

wants to merge 4 commits into from

Conversation

kd2718
Copy link

@kd2718 kd2718 commented Jul 9, 2015

Added the option to include bytes when requesting the contents of a file. This is a feature that is available in API calls, but not originally in the boxsdk for python.

Example:

bytes = [0,60]
client.file(file_id=id).content(bytes=bytes)

CLA signed

@boxcla
Copy link

boxcla commented Jul 9, 2015

Hi @kd2718, thanks for the pull request. Before we can merge it, we need you to sign our Contributor License Agreement. You can do so electronically here: http://opensource.box.com/cla

Once you have signed, just add a comment to this pull request saying, "CLA signed". Thanks!

@landscape-bot
Copy link

Code Health
Repository health decreased by 0.46% when pulling 0a32f72 on kd2718:master into e4388ac on box:master.

@Jeff-Meadows
Copy link
Contributor

Hi, thanks for the PR. Before we can merge it, the unit tests will need to be updated, and the option should probably also be added to the download method.

Are you interested in doing that? If not, I can pick up where you've left off - just let me know.

@kd2718
Copy link
Author

kd2718 commented Jul 9, 2015

@Jeff-Meadows I see what I can finish over the next few days. Thanks!

@landscape-bot
Copy link

Code Health
Repository health decreased by 0.39% when pulling da6f9d8 on kd2718:master into e4388ac on box:master.

"""
Get the content of a file on Box.

:peram bytes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: param, not peram.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, bytes is a built-in type in Python 3. Even though this is valid to do in Python, it still might be better to avoid that name. Something like byte_range or byte_range_set is more descriptive anyway.

@kd2718
Copy link
Author

kd2718 commented Jul 14, 2015

I updated @jmoldow suggestions.

I am not familiar with building unit tests. I could make a few guesses, but I felt it would be better if you all implemented them correctly. Once they are in, I can look over them for an idea for next time.

Thanks,
Kory

@Jeff-Meadows
Copy link
Contributor

Hi @kd2718 - thanks for the updates. I can merge this to a branch and work on some unit tests for it. In the meantime, could you make sure you've signed the CLA, and once you have, leave a comment here saying "CLA signed"?

@jmoldow
Copy link
Contributor

jmoldow commented Jul 14, 2015

I was originally going to suggest that the function should take a start and end position, something like

def content(self, first_byte_position=None, last_byte_position=None)

I find this to be more Pythonic, as it is similar to functions like range. I think we should avoid passing lists instead of using distinct parameters.

But then I realized that there are a few different scenarios that the RFC calls for:

  • "{first}-{last}"
  • "{first}-"
  • "-{suffix-length}" (not supported by Box right now)
  • multiple comma-separated byte ranges (not supported by Box right now)

We could do this within the function, but I think this would be hard to do in a way that is future-proof (in case the other forms become allowed in the future), and (if more forms become available in the future) hard to do in a way that is not confusing to users of the SDK. The docstring would need to explain the semantics of first_byte_position, last_byte_position, and suffix_length; and explain the three different valid combinations of parameters. And the function would need to figure out what to do depending on what parameters are passed.

Instead, I think we should move all of this into a new class with a smart constructor. My proposal is something like this:

class ByteRange(object):
  """Represents a byte range for a byte range retrieval request, from section 14.35 of W3 RFC 2616: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35."""

  def range_header_value(self):
    return 'bytes={}'.format(self.byte_range_set())

  @abc.abstractmethod
  def byte_range_set(self):
    pass

class ByteRangeSpec(ByteRange):
  """Represents a single, continuous byte range, where first_byte_position is the (0-indexed) position of the first byte of the entity that should be concluded, and last_byte_position is the position of the last byte that should be included (None is taken to mean EOF). E.g., ByteRangeSpec(2, 4) represents skipping bytes 0 and 1, including bytes 2, 3, and 4, and skipping all bytes beyond that. And ByteRangeSpec(2) includes all bytes except for 0 and 1.
  """
  def __init__(self, first_byte_position, last_byte_position=None):
    self.first_byte_position = first_byte_position
    self.last_byte_position = last_byte_position

  def byte_range_set(self):
    return '{}-{}'.format(self.first_byte_position, (self.last_byte_position or ''))

# In the hypothetical future.
#class SuffixByteRangeSpec(ByteRange):
#  def __init__(self, suffix_length):
#    self.suffix_length = suffix_length
#  def byte_range_set(self):
#    return '-{}'.format(self.suffix_length)

We can use these class's docstrings to explain all the semantics and behaviors, rather than cluttering the content and download_to functions. For example, we'll want to explain:

  • 0 <= start <= end
  • It is a closed interval. So [0, 5] will give you bytes 0, 1, 2, 3, 4, and 5 (6 bytes total). This is different than the typical Python function. I suppose we could make this more like a typical Python range function, but I don't think it's worth the potential confusion - mimicking the RFC is probably best.
  • The end range can be omitted (e.g. "bytes=50-"), in which case it will return everything after (and including) the 50th byte.

This way, we also don't have to duplicate logic in the content and download_to functions.

I guess one downside is that right now, only the ByteRangeSpec form is supported, so this seems a little heavy-handed, especially if none of the other forms ever get implemented. I still like it though.

@Jeff-Meadows @kd2718 what do you think?

@kd2718
Copy link
Author

kd2718 commented Jul 17, 2015

@jmoldow and @Jeff-Meadows
Sorry for the delay in my response. I had thought about making the start and stop bytes their own variables. Making them an object would help control what we get from the user.

This plays into an idea I had. If the file being downloaded was large, it may be more beneficial to download it in chunks, rather than all at once. What if Download_to function returned an an object that could be repeatedly called to get specified chunks from that file? However, at this point, we would have to make a new api call for each chunk that is downloaded.

If I am not making my self clear, Think of this as similar to fid.readlines(). But we would call file.get_bytes() and the object would keep track of where we are in the file.

let me know if this sounds ok.

@landscape-bot
Copy link

Code Health
Repository health decreased by 0.22% when pulling 4cd98d4 on kd2718:master into e4388ac on box:master.

@kd2718
Copy link
Author

kd2718 commented Aug 26, 2015

I got distracted and had to step away from this for a while. Is this still something you are still interested in? @Jeff-Meadows @jmoldow

@jmoldow
Copy link
Contributor

jmoldow commented Dec 14, 2015

I'm closing this PR for now. The work will be tracked in issue #95, which I've started work on. This will be a more generic and Pythonic implementation, along the lines of what I mentioned in my comment. Thanks a lot for the initial pass at this!

With regards to your other comment:

This plays into an idea I had. If the file being downloaded was large, it may be more beneficial to download it in chunks, rather than all at once. What if Download_to function returned an an object that could be repeatedly called to get specified chunks from that file? However, at this point, we would have to make a new api call for each chunk that is downloaded.

I definitely agree that we should support chunked downloading. However, we don't need to do this via Byte Ranges, as HTTP/1.0 already has support for receiving a non-byte-range request in chunks, and the requests library already exposes this in its API. I've opened a ticket for my idea in #96.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants