Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabling deduplication when retrieving overlapping selectors #224

Open
tchardin opened this issue Jul 5, 2021 · 7 comments
Open

Enabling deduplication when retrieving overlapping selectors #224

tchardin opened this issue Jul 5, 2021 · 7 comments

Comments

@tchardin
Copy link
Contributor

tchardin commented Jul 5, 2021

I have observed when I execute multiple transfers with the same root and different selectors, overlapping blocks are re-transferred. This means clients may pay for the same blocks multiple times.
I am wondering if deduplication should be achieved at the transport level or if simply exposing the doNotSendCids list in user land could let me gather some CIDs with a quick traversal before opening the pull data request so the provider doesn't charge me for the same blocks again. Thanks for your help!
cc @dirkmc @hannahhoward

@dirkmc
Copy link
Contributor

dirkmc commented Jul 13, 2021

The do not send CIDs list is used in the case where a data transfer must be restarted because the connection goes down. When the connection comes back up, the recipient sends a list of CIDs to the sender so that the sender doesn't resend them. The use case is similar but not exactly the same as what you're proposing.

The use case you describe is somewhat unusual because typically I wouldn't expect a user to download overlapping parts of the DAG. I would expect them to either download separate parts of the DAG (eg when navigating a tree structure like Wikipedia pages) or download the entire DAG (eg when retrieving a whole file).

If the DAG is very large (eg Wikipedia) you probably don't want to send a list of CIDs of all blocks that you've already downloaded with each new request.

@tchardin
Copy link
Contributor Author

Thanks for the reply @dirkmc. I'm not sure I understand how to think with selectors.

In the case of wikipedia, would it be right to think of it as an IPLD Map interface (prob sharded), in which case the keys would be the slugs of each page. With the root CID of Wikipedia, I would use an ExploreFields selector to say: only visit this slug and its children. When I use this selector, I find that the root Map node is selected every time so that block would be transferred each time I retrieve a different page. My use case wouldn't be to send all the CIDs of blocks selected by previous selectors but only the CIDs from the Map or any intermediary node before the selected fields.

Would your solution then be to first pull all the key-values and then use the links as root CIDs for subsequent transfers?

@dirkmc
Copy link
Contributor

dirkmc commented Jul 14, 2021

Would your solution then be to first pull all the key-values and then use the links as root CIDs for subsequent transfers?

That's how I would imagine it working, but I'm not an expert in IPLD.

Could you describe your use case?

@tchardin
Copy link
Contributor Author

We can take a simple blog or static website for example, the content publisher can add all the articles/pages as a unixfs DAG for which we persist the root CID in a DNS record.

Now I'd like to retrieve this blog from a Filecoin secondary retrieval market, so my client node queries a content routing service with the root CID and a selector for all the children. Once I find a provider who's storing the content, I get an offer containing the total size of the content for the blog and the price so I can load a payment channel between my client and the provider with funds to pay for retrieving the entire blog. Then during the user session, the client node will progressively retrieve only the pages and content needed without interrupting the flow to wait for a blockchain confirmation when adding more funds to the channel.

This means all my pages can be loaded using a CID and a field selector for the name of the unixfs entry. All my web content is loaded with relative paths that can also be progressively retrieved say as the user scrolls down the page etc. When the client stops requesting content for a given period of time, the provider can settle the channel and the client gets their unused funds back.

My initial thinking was to use the data transfer out of the box but I see now that it might require an additional abstraction on top of it to resolve keys and make data requests with different roots. Maybe the new DAG service will solve that?

@dirkmc
Copy link
Contributor

dirkmc commented Jul 19, 2021

The data systems team probably has more insight than I do but I think you're right that you'd want to build an abstraction on top of the data transfer module.

@hannahhoward does that sound right to you?

@hannahhoward
Copy link
Collaborator

hannahhoward commented Jul 24, 2021

@tchardin

Yes DoNotSend CIDS is one way. There's nothing inherently preventing exposing it in user land.

Another way, which is disabled in Filecoin, is that Graphsync will deduplicate concurrent requests from the same peer. Note that deduplication lasts only as long requests are in progress from the same peer. This only works when you're working from the same blockstore for both requests. This is disabled by filecoin by default cause it sends to different stores and the deduplication logic on the receiving side gets super complicated. If you look closely at the wire transfers in Filecoin, you'll see the ExtensionDeDupByKey (https://github.com/ipfs/go-graphsync/blob/master/graphsync.go#L44) by is used so two transfers for overlapping data from the same client and peer do not deduplicate. I don't think this will make a difference overall for your scenario since you're not sending concurrent requests. It wouldn't make sense to do anything other than ephemeral tracking of deduplication of concurrent requests on the responding side -- otherwise you end up effecitvely maintaining information about another person's state.

I believe though ultimately you just want to start your next request from subsequent roots. This should be doable with carefully chosen selectors for each request (i.e. limiting your recursion levels). And I think you just want multiple data transfers no? You should be able to reuse a payment channel across data transfers without settling or collecting. My only anxiety is that currently I think in default filecoin retrieval protocol the payment channel manager may automatically submit a chain message to deposit more funds, even if there's already enough funds in the channel? I'm not sure.

Are you guys using default go-fil-markets retrieval code or your own?

@hannahhoward
Copy link
Collaborator

hannahhoward commented Jul 24, 2021

oh there are various other proposals for improving upon graphsync deduplication.

We've thought about various variations on DoNotSendCIDs. One is do not send overlapping selector -- but that's very hard to implement efficiently. There's also stop traversal at a CID (different than DoNotSendCids that just drops the individual CID -- this would drop the CID and everything under it. Again, all future stuff.

I've thought a lot about CID only requests -- requests that wouldn't translate data but would tell you everything you need to know about the CIDs that are part of a traversal. You can read about these here:
ipld/specs#354
ipld/specs#355
And if any of these excite you I will happily accept a PR!

This was referenced Oct 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants