Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

object_store: suffix requests #5206

Closed
wants to merge 6 commits into from
Closed

Conversation

clbarnes
Copy link
Contributor

Which issue does this PR close?

Relates to #4611, but should not close it.

Rationale for this change

Certain use cases require suffix requests (see linked issue). Most stores support such requests directly; a workaround for Azure is trivial although not fast.

What changes are included in this PR?

pub trait GetSuffixClient, with an implementation for all clients except Azure. Additionally, ObjectStore::get_suffix, with a default implementation with the workaround (HEAD then GET), which is then overridden in all stores except Azure.

Also includes some of the infrastructure around full support for HTTP ranges, which is the direction this functionality should go in when we next make a breaking API change. That infra is probably overcomplicated for its use here, but gives a good foundation for full support later, e.g. 80cdf66

Are there any user-facing changes?

Downstream implementors of ObjectStore should override the default get_suffix method, if their clients support a direct form, preferably by implementing GetSuffixClient on their client.

Remaining questions

  • GetSuffixClient probably doesn't need to be a trait as it doesn't depend on any behaviour and isn't used in generics; get_suffix could just be implemented directly on the client structs. However, it will make it much easier to delete suffix-specific behaviour when a breaking API change is allowed.
  • This is a lot of code in a lot of places (much of it copy-pasted), but I don't think there's a way around that with the current API.
  • This probably needs checking for errors where the requested nbytes are longer than the resource. However, AFAICT that isn't currently done with current range requests outside of the local and memory stores.

Better solution

Per the linked issue, GetOptions should contain a full representation of an HTTP range (implementing From<Range<usize>>) and this functionality should be built into get_opts.

Ranges requested beyond the length of the resource can no longer
cause panics in the `coalesce_ranges` function.
With default implementation using a HEAD and then GET request.

See apache#4611.
@tustvold
Copy link
Contributor

Thank you for this, I'll try to take some time over the next week or so to see if we can't make this slightly less of a "bodge", to borrow your phrasing 😅

@clbarnes
Copy link
Contributor Author

Apologies, I do understand the pressures of maintaining a stable API, just don't want to add tech debt which people may be lumbered with in future.

The build failure seems to be from tokio-macro requiring rust 1.63, where the MSRV is 1.62.1.

@tustvold
Copy link
Contributor

I wonder how urgently you require this functionality, if you're happy to wait until the new year we can probably look to make a breaking release then?

@clbarnes
Copy link
Contributor Author

Not urgent at all! My zarr library isn't (yet) based on object_store, so nothing's being held up - I just didn't want to start the work of rebasing it if I didn't know whether the suffix feature would land.

@tustvold
Copy link
Contributor

What do you think then of updating this PR so that GetOptions holds an HttpRange, and we just look to release this as a breaking change in the new year?

@clbarnes
Copy link
Contributor Author

clbarnes commented Dec 15, 2023

Sure thing, I have a different branch which has most of that put together already.

With the possible advent of multipart ranges #4612 , what do you think about storing a multi-range type (which still implements From<Range<usize>>)? For now that could have limitations to prevent construction of multiranges (so that we don't need to build all the rope stuff I mentioned in that other issue), it would just be the right type for if/when multiranges were accessible.

Eh actually I guess that makes get_opts tricky because it should return a Response, so splitting it up into several requests would be painful to sort of merge back together.

@tustvold
Copy link
Contributor

With the possible advent of multipart ranges #4612

I am personally rather lukewarm on multipart ranges as they're not currently supported by any stores, and as you've articulated on the ticket, have rather opaque semantics. I say we cross that bridge when we come to it - it may be we support them outside the main ObjectStore trait.

@clbarnes
Copy link
Contributor Author

And before I go through all the implementations again, for the sake of object safety would you prefer get_range to take the new HttpRange object directly rather than taking T: Into<HttpRange> (which would allow people to continue to pass in Ranges)?

I have a preference for reading the GetResult.range from the response headers if available rather than assuming it's the same as the request, because (especially for the raw HTTP case) it's not impossible that the server sends us something weird and users should know about it sooner rather than later. Because we know the resource length at that point, we can use a regular Range there. This would also allow us to recover from a 200 (we just slice the whole response) although an error is probably more correct here because we don't want users to be accidentally fetching the whole resource over and over.

@tustvold
Copy link
Contributor

tustvold commented Dec 15, 2023

And before I go through all the implementations again, for the sake of object safety would you prefer get_range to take the new HttpRange object directly rather than taking T: Into (which would allow people to continue to pass in Ranges)?

I would leave get_range and get_ranges unchanged, and only change GetOptions, especially given get_ranges, in particular colaesce_ranges, won't work with partial ranges. This will also avoid breaking existing code that uses get_range/get_ranges.

for the sake of object safety

We cannot break object safety, it is critical to being able to abstract across stores based on runtime configuration.

pub(crate) enum HttpRange {
/// A range with a given first and last bytes.
/// If `last > first`, the response should be empty.
Range {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could avoid ambiguity and make it easier to port existing code by using an actual Range here, i.e. Range(Range<usize>), I appreciate this changes this to an exclusive bound but I think people are more familiar with these

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the impl From<Range> makes porting existing code easy enough, could do a impl TryInto<Range> going the other way to complete the loop. By renaming it from HttpRange -> GetRange we aren't tied so closely to the HTTP spec so an exclusive bound would be fine.

Tuple variants all round does make construction easier, but with the different constructors, people shouldn't need to construct variants explicitly. I've been going back and forth about whether to have separate variants for ::Range and ::Offset or whether ::Range.last should just be optional; having tuple structs everywhere is a good argument in favour of all separate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A slight annoyance in using a Range object here is that it isn't Copy, although presumably the .clone() is very cheap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah IIRC this is just because it created a confusing situation when using Range as an iterator

///
/// At present, this is only used internally.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub(crate) enum HttpRange {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub(crate) enum HttpRange {
pub(crate) enum GetRange {

Perhaps, as it isn't just for HTTP

}
}

pub(crate) fn concrete_range(range: Option<HttpRange>, size: usize) -> Range<usize> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably be a member function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted it to cover the None case too because turning an Option<HttpRange> into a concrete Range is something we do quite a lot, saves seeing opt_range.map_or(0..len, |r| r.concrete_range(len)) in several places. But it's not too much of an issue.

Comment on lines +200 to +209
/// A range defined only by the first byte requested (requests all remaining bytes).
Offset {
/// Offset of the first byte requested.
first: usize,
},
/// A range defined as the number of bytes at the end of the resource.
Suffix {
/// Number of bytes requested.
nbytes: usize,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of making these tuple variants instead, this makes them easier to construct and match on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are already extra constructors for each variant, but you're right that matching would be easier. The structs are a little more explicit but with good docs it should be clear.

@tustvold tustvold marked this pull request as draft December 15, 2023 18:26
@tustvold
Copy link
Contributor

I've marked this PR as a draft to reflect the above discussion

@clbarnes
Copy link
Contributor Author

And before I go through all the implementations again, for the sake of object safety would you prefer get_range to take the new HttpRange object directly rather than taking T: Into<HttpRange> (which would allow people to continue to pass in Ranges)?

I have a preference for reading the GetResult.range from the response headers if available rather than assuming it's the same as the request, because (especially for the raw HTTP case) it's not impossible that the server sends us something weird and users should know about it sooner rather than later. Because we know the resource length at that point, we can use a std Range there. This would also allow us to recover from a 200 (we just slice the whole response) although an error is probably more correct there because we don't want users to be accidentally fetching the whole resource over and over.

@tustvold
Copy link
Contributor

As commented above we should only change GetOptions as this avoids breaking most existing workloads and object safety.

With regards to inspecting the response Range, returning an error if we get something else back makes sense to me

@clbarnes
Copy link
Contributor Author

get_ranges, in particular colaesce_ranges, won't work with partial ranges.

I have an implementation of coalesce_ranges which uses the new struct. There's a helper function which returns (Vec<HttpRange::Range | HttpRange::Offset>, Option<HttpRange::Suffix>), and so could feasibly cover the same bytes twice but that's on the user.

This will also avoid breaking existing code that uses get_range/get_ranges.

Should be fine, I can see the GetResult has public methods for easily getting the Bytes out of it.

@clbarnes
Copy link
Contributor Author

Thanks for all your feedback! I have raised a new PR with the suggested (breaking) changes; this can be closed at your convenience.

@tustvold
Copy link
Contributor

Closing in favor of #5222

@tustvold tustvold closed this Dec 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants