This repository has been archived by the owner on May 15, 2024. It is now read-only.

Retrieval Optimization To Singularity #143

Closed
Tracked by #139
hannahhoward opened this issue Oct 4, 2023 · 3 comments · Fixed by data-preservation-programs/singularity#404 or #195

Comments

@hannahhoward

Currently, retrieval through Singularity is implemented through a read seeker abstraction. Since the io.ReadSeeker abstraction does not know about the overall range parameters of the incoming HTTP request to motion, each call to the io.Reader.Read method is implemented by making a separate range request into Singularity for the exact parameters of that individual read operation.

Under the hood, the implementation of http.ServeContent that we employ uses io.Copy, which in turn executes read operations in the neighborhood of 32K, an extremely small request. Worse, inside Singularity, we duplicate this issue, turning read requests into 32K Filecoin retrievals, which is extremely inefficient.
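
To make the read pattern concrete, here is a minimal sketch that counts the reads io.Copy issues against a plain reader. The names `countingReader`, `plainWriter`, and `countReads` are made up for this illustration and are not part of motion or Singularity. Both helper types deliberately avoid `io.WriterTo` and `io.ReaderFrom`, since io.Copy would otherwise bypass its internal buffer.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// countingReader records how many Read calls returned data and the largest
// buffer it was handed. It does not implement io.WriterTo, so io.Copy falls
// back to its internal buffer.
type countingReader struct {
	r       io.Reader
	reads   int
	maxSize int
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	if n > 0 {
		c.reads++
	}
	if len(p) > c.maxSize {
		c.maxSize = len(p)
	}
	return n, err
}

// plainWriter discards data without implementing io.ReaderFrom, which would
// otherwise let io.Copy skip its own buffer.
type plainWriter struct{}

func (plainWriter) Write(p []byte) (int, error) { return len(p), nil }

// countReads copies size bytes through io.Copy and reports the number of
// non-empty reads and the largest buffer size seen.
func countReads(size int) (reads, maxSize int) {
	src := &countingReader{r: bytes.NewReader(make([]byte, size))}
	if _, err := io.Copy(plainWriter{}, src); err != nil {
		panic(err)
	}
	return src.reads, src.maxSize
}

func main() {
	reads, maxSize := countReads(1 << 20) // a 1 MiB object
	fmt.Printf("reads=%d maxBuffer=%d\n", reads, maxSize)
}
```

If every one of those reads becomes its own range request, a 1 MiB object costs dozens of round trips, and the count grows linearly with object size.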

The proposed optimization here is as follows:

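type Ranges struct {
   Header string // raw value of the HTTP Range header
}

type Range struct {
   Start uint64
   End   uint64
}

// Parse resolves the header into concrete ranges for a file of the given size.
func (r Ranges) Parse(size uint64) ([]Range, error) {
...
}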
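
The body of Parse is elided above. As one possible shape for it, here is a hedged sketch written as a free function, `parseRanges`, which is a hypothetical name. It assumes inclusive `Start`/`End` offsets and rejects unsatisfiable ranges with an error, which is stricter than what net/http does internally; treat both choices as assumptions rather than the intended behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Range is an inclusive byte range, following the field names in the issue.
type Range struct {
	Start uint64
	End   uint64
}

// parseRanges is a hypothetical stand-in for Ranges.Parse. It resolves a
// header such as "bytes=0-99,500-,-10" against a file of the given size.
func parseRanges(header string, size uint64) ([]Range, error) {
	const prefix = "bytes="
	if !strings.HasPrefix(header, prefix) {
		return nil, errors.New("unsupported range unit")
	}
	var out []Range
	for _, part := range strings.Split(strings.TrimPrefix(header, prefix), ",") {
		first, last, ok := strings.Cut(strings.TrimSpace(part), "-")
		if !ok {
			return nil, fmt.Errorf("malformed range %q", part)
		}
		if first == "" { // suffix form "-N": the final N bytes
			n, err := strconv.ParseUint(last, 10, 64)
			if err != nil || n == 0 || size == 0 {
				return nil, fmt.Errorf("invalid suffix range %q", part)
			}
			if n > size {
				n = size
			}
			out = append(out, Range{Start: size - n, End: size - 1})
			continue
		}
		start, err := strconv.ParseUint(first, 10, 64)
		if err != nil || start >= size {
			return nil, fmt.Errorf("invalid range start %q", part)
		}
		end := size - 1 // open-ended form "N-" runs to the end of the file
		if last != "" {
			end, err = strconv.ParseUint(last, 10, 64)
			if err != nil || end < start {
				return nil, fmt.Errorf("invalid range end %q", part)
			}
			if end > size-1 {
				end = size - 1 // clamp to the file size
			}
		}
		out = append(out, Range{Start: start, End: end})
	}
	return out, nil
}

func main() {
	ranges, err := parseRanges("bytes=0-99,500-,-10", 1000)
	fmt.Println(ranges, err)
}
```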
  • Now you can pass your ranges to your read seeker, and change its implementation as follows:
    • on first read, check if your current offset lies within one of the ranges
      • if so, make a pipe reader
      • start a goroutine and execute a range request to Singularity for the current position until the end of the current range -- pass it the pipe writer as the destination (sidebar: it's super annoying that the Swagger generated API takes a writer instead of returning a reader)
      • now save the pipe reader on the struct and use its read method to serve the current read operation
      • for subsequent read operations if an existing reader is already present, read from it (remove it from the struct if you hit EOF)
      • otherwise follow the operations above to execute another larger range request
      • wipe the saved reader whenever there's a Seek and cancel the HTTP request in progress.
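
The steps above could be sketched roughly as follows. This is not the motion implementation: `fetchFunc` is a hypothetical stand-in for the Swagger-generated Singularity call that takes a writer, and `byteRange`, `rangeReadSeeker`, and `demo` are illustrative names. The sketch also assumes that an offset outside every parsed range falls back to fetching through the end of the file, which the list above does not specify.

```go
package main

import (
	"context"
	"fmt"
	"io"
)

// byteRange is an inclusive byte range, mirroring Range in the issue.
type byteRange struct{ Start, End int64 }

// fetchFunc is a hypothetical stand-in for the Singularity client call: it
// writes bytes [start, end] of the file to w and should honour ctx.
type fetchFunc func(ctx context.Context, start, end int64, w io.Writer) error

type rangeReadSeeker struct {
	fetch             fetchFunc
	ranges            []byteRange
	size, offset, end int64 // end: inclusive end of the in-flight request
	pr                *io.PipeReader
	cancel            context.CancelFunc
	requests          int // fetches started, for illustration only
}

// reset cancels and forgets the in-flight request, if any.
func (r *rangeReadSeeker) reset() {
	if r.pr != nil {
		r.cancel()
		r.pr.Close()
		r.pr = nil
	}
}

func (r *rangeReadSeeker) Seek(off int64, whence int) (int64, error) {
	switch whence {
	case io.SeekCurrent:
		off += r.offset
	case io.SeekEnd:
		off += r.size
	}
	r.reset() // any Seek invalidates the saved reader
	r.offset = off
	return off, nil
}

// open starts one request from the current offset to the end of the
// enclosing range, or to the end of the file if no range matches. A nil
// fetch error closes the pipe cleanly, so the reader then sees io.EOF.
func (r *rangeReadSeeker) open() {
	r.end = r.size - 1
	for _, rg := range r.ranges {
		if r.offset >= rg.Start && r.offset <= rg.End {
			r.end = rg.End
		}
	}
	ctx, cancel := context.WithCancel(context.Background())
	pr, pw := io.Pipe()
	start, end := r.offset, r.end
	go func() { pw.CloseWithError(r.fetch(ctx, start, end, pw)) }()
	r.pr, r.cancel = pr, cancel
	r.requests++
}

func (r *rangeReadSeeker) Read(p []byte) (int, error) {
	for r.offset < r.size {
		if r.pr == nil {
			r.open()
		}
		n, err := r.pr.Read(p)
		r.offset += int64(n)
		if err == nil {
			return n, nil
		}
		r.reset()
		if err != io.EOF {
			return n, err
		}
		if r.offset <= r.end {
			return n, io.ErrUnexpectedEOF // the fetch ended early
		}
		// Range fully consumed: loop and open the next request.
	}
	return 0, io.EOF
}

// demo mimics http.ServeContent serving "Range: bytes=100-199999".
func demo() (copied int64, requests int) {
	blob := make([]byte, 1<<20)
	fetch := func(_ context.Context, start, end int64, w io.Writer) error {
		_, err := w.Write(blob[start : end+1])
		return err
	}
	rs := &rangeReadSeeker{fetch: fetch, size: int64(len(blob)), ranges: []byteRange{{100, 199999}}}
	rs.Seek(100, io.SeekStart)
	copied, _ = io.CopyN(io.Discard, rs, 199900)
	return copied, rs.requests
}

func main() { fmt.Println(demo()) }
```

In `demo`, the whole requested range is served by a single fetch even though the copy loop reads it in many small chunks, which is the point of saving the pipe reader on the struct.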

Sidebar: all this works, but it seems like we're in the guts a bit, and I wonder if io.ReadSeeker + http.ServeContent is the right abstraction any more.

@hannahhoward

To achieve the actual benefits of this you need to also implement data-preservation-programs/singularity#366, which is nearly identical in nature, and there will likely be a lot of code you can share.

@masih masih added the beta Feature for Beta release label Oct 5, 2023
@masih masih added the P1 label Oct 5, 2023
@xinaxu

xinaxu commented Oct 20, 2023

@gammazero is currently fixing a critical issue:
data-preservation-programs/singularity#388

The retrieval optimization work is ongoing.

gammazero added a commit to data-preservation-programs/singularity that referenced this issue Oct 27, 2023
Optimize retrieval so that when requested retrieval ranges do not align
with singularity file ranges, only the minimal number of retrieval
requests are made.

This is accomplished by creating a separate reader for each singularity
file range. For reads that are larger than a range, multiple ranges are
read until the read request is satisfied or until all data is read. For
reads smaller than the amount of data remaining in the range, the range
reader is maintained so that it can continue to be read from by
subsequent reads.

This approach associates a reader with each Singularity file range, and
not the ranges requested via the API (in HTTP range header). This avoids
needing to parse the range header in order to create readers where each
reads some number of Singularity ranges. Rather, as arbitrary requested
ranges are read, an existing reader for the corresponding singularity
range(s) is reused if the requested range falls on a singularity range
from a previous read. This also means that there is only a single
retrieval for each singularity range, whereas if readers were associated
with requested ranges then multiple readers could overlap the same
singularity range and require multiple retrievals of the same range.

Fixes #366
Fixes filecoin-project/motion#143

As an optimization, only one singularity range reader is maintained at a
time. This works because once a new singularity range is selected by the
requested range read, then it is highly unlikely that a subsequent read
request will fall on a singularity range that was already read from,
previous to the new one.

Additional changes:

- The `filecoinReader` implementation supports the `io.WriterTo`
interface to allow direct copying to an `io.Writer`.
- The `FilecoinRetriever` interface supports the `RetrieveReader`
function that returns an `io.ReadCloser` to read data from.
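
For readers unfamiliar with the direct-copy path mentioned in the commit message: in Go's standard library the interface is `io.WriterTo`, and io.Copy prefers its `WriteTo` method over looping on Read. The sketch below only illustrates that mechanism. `streamReader`, its `retrieve` field, and `copyAll` are hypothetical names, not the `filecoinReader` code from the commit.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// streamReader is a hypothetical reader whose data source can write
// straight into a destination, like a retriever that takes an io.Writer.
type streamReader struct {
	retrieve func(w io.Writer) (int64, error)
}

// Read is unused in this sketch; a real reader would implement it as well.
func (s *streamReader) Read(p []byte) (int, error) {
	return 0, errors.New("Read is not implemented in this sketch")
}

// WriteTo satisfies io.WriterTo, so io.Copy calls it instead of looping
// over Read with an intermediate buffer.
func (s *streamReader) WriteTo(w io.Writer) (int64, error) {
	return s.retrieve(w)
}

// copyAll runs io.Copy over a streamReader and returns what was copied.
func copyAll(payload string) (string, error) {
	src := &streamReader{retrieve: func(w io.Writer) (int64, error) {
		n, err := io.WriteString(w, payload)
		return int64(n), err
	}}
	var dst bytes.Buffer
	_, err := io.Copy(&dst, src)
	return dst.String(), err
}

func main() {
	out, err := copyAll("retrieved bytes")
	fmt.Println(out, err)
}
```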
@gammazero gammazero reopened this Oct 27, 2023
@gammazero

Reopened - finishing optimization to minimize separate requests to singularity within the same range.

gammazero added a commit that referenced this issue Oct 30, 2023
Read HTTP range headers and do fetch for entire range instead of allowing the io.CopyN, used by http.ServeContent, to do multiple small fetches.

Fixes #143
gammazero added a commit that referenced this issue Oct 31, 2023
Read HTTP range headers and do fetch for entire range instead of allowing the io.CopyN, used by http.ServeContent, to do multiple small fetches.

Fixes #143
gammazero added a commit that referenced this issue Nov 2, 2023
Read HTTP range headers and do fetch for entire range instead of allowing the io.CopyN, used by http.ServeContent, to do multiple small fetches.

Fixes #143
gammazero added a commit that referenced this issue Nov 3, 2023
Pass retrieval request through to singularity. Singularity will handle range requests.

Fixes #143