Retrieval Optimization To Singularity #143

hannahhoward · 2023-10-04T19:39:41Z

Currently, retrieval through Singularity is implemented through a read seeker abstraction. Since the io.ReadSeeker abstraction does not know about the overall range parameters of the incoming HTTP request to Motion, each call to the io.Reader.Read method is implemented by making a separate range request into Singularity for the exact parameters of that individual read operation.

Under the hood, the implementation of http.ServeContent that we employ uses io.Copy, which in turn executes read operations in the neighborhood of 32K, an extremely small request. Worse, inside Singularity, we duplicate this issue, turning read requests into 32K Filecoin retrievals, which is obviously extremely inefficient.
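
To make the chunking concrete, here is a small sketch that counts how many Read calls io.Copy issues against a source that only exposes Read. The `countingReader`, `plainWriter`, and `countReads` names are made up for this illustration; the wrappers exist only to hide the io.WriterTo / io.ReaderFrom fast paths so the default 32 KiB copy buffer is what gets exercised. In the setup described above, each of these reads would translate into one range request.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// countingReader records how many Read calls return data and the largest
// buffer it was handed. It exposes only Read, so io.Copy cannot take the
// io.WriterTo fast path.
type countingReader struct {
	src     io.Reader
	reads   int
	maxSize int
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.src.Read(p)
	if n > 0 {
		c.reads++
	}
	if len(p) > c.maxSize {
		c.maxSize = len(p)
	}
	return n, err
}

// plainWriter hides io.Discard's ReaderFrom method for the same reason.
type plainWriter struct{ io.Writer }

// countReads copies a blob of the given size and reports the read pattern.
func countReads(size int) (reads, maxSize int) {
	r := &countingReader{src: bytes.NewReader(make([]byte, size))}
	if _, err := io.Copy(plainWriter{io.Discard}, r); err != nil {
		panic(err)
	}
	return r.reads, r.maxSize
}

func main() {
	reads, maxSize := countReads(1 << 20) // 1 MiB blob
	fmt.Printf("%d reads, at most %d bytes each\n", reads, maxSize)
}
```

For a 1 MiB blob this reports 32 data-bearing reads of at most 32768 bytes each, i.e. 32 separate retrievals where one would do.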

The proposed optimization here is as follows:

- modify Store.Get to take an optional list of configuration parameters
- make the first one some kind of "Ranges" struct
- parse the HTTP Range header -- the code to do this is surprisingly not easily available, but there are examples in the http.ServeContent implementation (https://cs.opensource.google/go/go/+/refs/tags/go1.21.1:src/net/http/fs.go;l=887), and I found a good example here (https://github.com/gotd/contrib/blob/v0.19.0/http_range/range.go).
  - one caveat: you need to get the size of the overall blob first to parse ranges correctly. So I'd recommend the Ranges struct look like:

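// Ranges carries the raw Range header until the blob size is known.
type Ranges struct {
	Header string
}

// Range is a single resolved byte range.
type Range struct {
	Start uint64
	End   uint64
}

// Parse resolves the header against the overall blob size.
func (r Ranges) Parse(size uint64) ([]Range, error) {
...
}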
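
As a rough sketch of what the elided Parse body could look like, the following follows the general shape of the parser in net/http (prefix check, comma-separated specs, suffix form, clamping to the blob size). It is an assumption-laden illustration, not the code that was merged: the error values are hypothetical, End is treated as inclusive, and an empty header is taken to mean "serve the whole blob".

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Ranges holds the raw Range header until the blob size is known.
type Ranges struct {
	Header string
}

// Range is an inclusive byte range within the blob (assumption: End inclusive).
type Range struct {
	Start uint64
	End   uint64
}

// Hypothetical error values for this sketch.
var (
	errMalformed     = errors.New("malformed range header")
	errUnsatisfiable = errors.New("no satisfiable range")
)

// Parse resolves the header against size. An empty header yields no ranges.
func (r Ranges) Parse(size uint64) ([]Range, error) {
	if r.Header == "" {
		return nil, nil
	}
	if !strings.HasPrefix(r.Header, "bytes=") {
		return nil, errMalformed
	}
	var out []Range
	for _, part := range strings.Split(strings.TrimPrefix(r.Header, "bytes="), ",") {
		first, last, ok := strings.Cut(strings.TrimSpace(part), "-")
		if !ok {
			return nil, errMalformed
		}
		if first == "" { // suffix form "-N": the last N bytes
			n, err := strconv.ParseUint(last, 10, 64)
			if err != nil {
				return nil, errMalformed
			}
			if n > size {
				n = size
			}
			if n > 0 {
				out = append(out, Range{Start: size - n, End: size - 1})
			}
			continue
		}
		start, err := strconv.ParseUint(first, 10, 64)
		if err != nil {
			return nil, errMalformed
		}
		if start >= size {
			continue // starts past the end; unsatisfiable on its own
		}
		end := size - 1 // open-ended "N-" runs to the last byte
		if last != "" {
			e, err := strconv.ParseUint(last, 10, 64)
			if err != nil || e < start {
				return nil, errMalformed
			}
			if e < end {
				end = e
			}
		}
		out = append(out, Range{Start: start, End: end})
	}
	if len(out) == 0 {
		return nil, errUnsatisfiable
	}
	return out, nil
}

func main() {
	for _, h := range []string{"bytes=0-99", "bytes=500-", "bytes=-100", "bytes=0-0,900-5000"} {
		rs, err := Ranges{Header: h}.Parse(1000)
		fmt.Println(h, rs, err)
	}
}
```

The size argument is what makes the suffix form ("-100") and open-ended form ("500-") resolvable, which is why the header is stored raw and parsed only once the blob size is known.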

Now you can pass your ranges to your read seeker, and change its implementation as follows:
- on first read, check if your current offset lies within one of the ranges
  - if so, make a pipe reader
  - start a goroutine and execute a range request to Singularity for the current position until the end of the current range -- pass it the pipe writer for a destination (sidebar: it's super annoying that the Swagger generated API takes a writer instead of returning a reader)
  - now save the pipe reader on the struct and use its read method to serve the current read operation
  - for subsequent read operations if an existing reader is already present, read from it (remove it from the struct if you hit EOF)
  - otherwise follow the operations above to execute another larger range request
  - wipe the saved reader whenever there's a Seek and cancel the HTTP request in progress.

Sidebar: all this works but it seems like we're in the guts a bit, and I wonder if ReadSeeker + http.ServeContent is the right abstraction anymore.

hannahhoward · 2023-10-04T19:48:20Z

To achieve the actual benefits of this you need to also implement data-preservation-programs/singularity#366, which is nearly identical in nature, and there will likely be a lot of code you can share.

xinaxu · 2023-10-20T16:35:40Z

@gammazero is currently fixing a critical issue:
data-preservation-programs/singularity#388

The retrieval optimization work is ongoing.

Optimize retrieval so that when requested retrieval ranges do not align with Singularity file ranges, only the minimal number of retrieval requests are made. This is accomplished by creating a separate reader for each Singularity file range. For reads that are larger than a range, multiple ranges are read until the read request is satisfied or until all data is read. For reads smaller than the amount of data remaining in the range, the range reader is maintained so that it can continue to be read from by subsequent reads.

This approach associates a reader with each Singularity file range, and not with the ranges requested via the API (in the HTTP range header). This avoids needing to parse the range header in order to create readers where each reads some number of Singularity ranges. Rather, as arbitrary requested ranges are read, an existing reader for the corresponding Singularity range(s) is reused if the requested range falls on a Singularity range from a previous read. This also means that there is only a single retrieval for each Singularity range, whereas if readers were associated with requested ranges then multiple readers could overlap the same Singularity range and require multiple retrievals of the same range.

Fixes #366
Fixes filecoin-project/motion#143

As an optimization, only one Singularity range reader is maintained at a time. This works because once a new Singularity range is selected by the requested range read, it is highly unlikely that a subsequent read request will fall on a Singularity range that was already read from before the new one.

Additional changes:
- The `filecoinReader` implementation supports the `io.WriteTo` interface to allow direct copying to an `io.Writer`.
- The `FilecoinRetriever` interface supports the `RetrieveReader` function that returns an `io.ReadCloser` to read data from.
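
To illustrate the mapping from a requested read onto Singularity file ranges described above, here is a small sketch. The `fileRange`, `segment`, `findRange`, and `plan` names are hypothetical and not taken from the Singularity code base; the sketch assumes the file ranges are sorted by offset and contiguous. Consecutive reads whose segments land on the same range index are the case where a single open range reader can be reused.

```go
package main

import (
	"fmt"
	"sort"
)

// fileRange is a hypothetical description of one Singularity file range.
type fileRange struct{ Offset, Length int64 }

// segment is the part of one file range needed by a read.
type segment struct {
	Index  int   // which file range
	Start  int64 // offset relative to the start of that range
	Length int64
}

// findRange returns the index of the range containing off, or -1.
// ranges must be sorted by Offset.
func findRange(ranges []fileRange, off int64) int {
	i := sort.Search(len(ranges), func(i int) bool {
		return ranges[i].Offset+ranges[i].Length > off
	})
	if i < len(ranges) && ranges[i].Offset <= off {
		return i
	}
	return -1
}

// plan splits a read of n bytes at off into per-range segments, stopping when
// the read is satisfied or the data runs out. It assumes contiguous ranges.
func plan(ranges []fileRange, off, n int64) []segment {
	var segs []segment
	for i := findRange(ranges, off); i >= 0 && i < len(ranges) && n > 0; i++ {
		start := off - ranges[i].Offset
		length := ranges[i].Length - start
		if length > n {
			length = n
		}
		segs = append(segs, segment{Index: i, Start: start, Length: length})
		off += length
		n -= length
	}
	return segs
}

func main() {
	ranges := []fileRange{{0, 100}, {100, 100}, {200, 50}}
	// A 120-byte read at offset 50 spans the tail of range 0 and the head of range 1.
	fmt.Println(plan(ranges, 50, 120))
}
```

A follow-up read at offset 170 would start in range 1 again, at relative offset 70, which is exactly where a retained reader for that range would already be positioned.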

gammazero · 2023-10-27T21:25:13Z

Reopened - finishing optimization to minimize separate requests to singularity within the same range.

Read HTTP range headers and do fetch for entire range instead of allowing the io.CopyN, used by http.ServeContent, to do multiple small fetches. Fixes #143

Pass retrieval request through to singularity. Singularity will handle range requests. Fixes #143

hannahhoward mentioned this issue Oct 4, 2023

Retrieval Optimization for Singularity + Filecoin data-preservation-programs/singularity#366

Closed

masih added the beta Feature for Beta release label Oct 5, 2023

masih assigned gammazero Oct 5, 2023

masih mentioned this issue Oct 5, 2023

[EPIC] Motion Beta Release #139

Closed

masih added the P1 label Oct 5, 2023

xinaxu mentioned this issue Oct 26, 2023

Optimize retrieval from Filecoin data-preservation-programs/singularity#404

Merged

gammazero closed this as completed in data-preservation-programs/singularity#404 Oct 27, 2023

gammazero reopened this Oct 27, 2023

gammazero added a commit that referenced this issue Oct 30, 2023

Optimize retrieval to do minimal fetcher from singularity

9d86eae

Read HTTP range headers and do fetch for entire range instead of allowing the io.CopyN, used by http.ServeContent, to do multiple small fetches. Fixes #143

gammazero mentioned this issue Oct 30, 2023

Optimize retrieval to do minimal fetcher from singularity #195

Merged

gammazero added a commit that referenced this issue Oct 30, 2023

Optimize retrieval to do minimal fetcher from singularity

f249e19

Read HTTP range headers and do fetch for entire range instead of allowing the io.CopyN, used by http.ServeContent, to do multiple small fetches. Fixes #143

gammazero added a commit that referenced this issue Oct 30, 2023

Optimize retrieval to do minimal fetcher from singularity

d0c744c

Read HTTP range headers and do fetch for entire range instead of allowing the io.CopyN, used by http.ServeContent, to do multiple small fetches. Fixes #143

gammazero added a commit that referenced this issue Oct 31, 2023

Optimize retrieval to do minimal fetcher from singularity

bb874c8

Read HTTP range headers and do fetch for entire range instead of allowing the io.CopyN, used by http.ServeContent, to do multiple small fetches. Fixes #143

gammazero added a commit that referenced this issue Nov 2, 2023

Optimize retrieval to do minimal fetcher from singularity

81dccd1

Read HTTP range headers and do fetch for entire range instead of allowing the io.CopyN, used by http.ServeContent, to do multiple small fetches. Fixes #143

gammazero closed this as completed in #195 Nov 3, 2023

gammazero added a commit that referenced this issue Nov 3, 2023

Optimize retrieval to do minimal fetcher from singularity (#195)

9218c3e

Pass retrieval request through to singularity. Singularity will handle range requests. Fixes #143
