This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

Spam protection of Swarm through modifying GC rules #757

Open
nagydani opened this issue Jun 28, 2018 · 11 comments

Comments

@nagydani

Nodes get paid through Swap only when they serve a chunk upon a retrieval request. Thus, Swap incentivizes them to store profitable chunks, i.e. ones that can be expected to be requested soon.
Therefore it is in the node's best interest, and, as argued below, in that of the network, not to treat fresh uploads and syncs identically to retrieval requests when deciding which chunk to garbage-collect once storage is saturated.

One common method of keeping chunks ordered by expected profitability is to keep them in a queue, and move a chunk that has been requested to the head of that queue, whereas garbage collection is performed on the tail. With a suitable data structure, such as a doubly linked list, all aforementioned operations have O(1) complexity. Newly synced or uploaded chunks (the two are indistinguishable) can be inserted at the k=int(αn)th position where n is the size of the queue and α is a constant real parameter between 0 and 1. Note that while finding the kth element in a queue is an O(k) complexity operation, tracking the kth element requires O(1) operations at each update of the queue.

Even an arbitrarily powerful DDoS attack flooding the network with bogus chunks cannot force a guaranteed fraction (namely α) of the most popularly requested chunks out of Swarm. Yet, new uploads are not immediately garbage collected, thus a simple flooding attack won't make uploads impossible either, especially if the uploader is willing to pay the Swap price of moving their content to the front of the queue.

Of course, the latter option is available to the DDoS attacker as well, but it is not free. The larger the Swarm and the more storage space the nodes have, the more expensive it becomes to effectively DDoS it, and a large part of the costs gets directly transferred from the attacker to honest nodes.

@cobordism

this makes sense to me.

The only other enhancement I'd suggest is that (if it can be done) retrieval requests that originate on the local node do not move chunks to the top of the queue either.
My thinking is that if I download a 20GB movie through swarm on my local node, I don't want to have my entire chunk store (of proximate chunks) flushed out because a 20GB retrieval request passed through my node. There is no reason to think that these chunks will be profitable in future, because the retrieval requests were not sent to me due to my location in the network, nor due to the chunks' popularity.

On the other hand, if I use a dapp often, I would want the dapp data cached locally...

In the end I think we may have to explore keeping a separate chunk queue for locally requested chunks. (i.e. any chunk requested as a result of an HTTP request enters the local chunk queue and any chunk received over bzz enters the network chunk queue).
Does that make sense?

@cobordism

or as we just discussed in the standup, we can treat chunks that were requested locally through HTTP requests similarly to synced chunks. That is, we don't insert them at the top of the chunk queue but at some α (either the same α as for syncing or a different one).
This prevents "local binge watching of swarmflix" from completely flushing out the local swarm cache of popular/proximate chunks.
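This variant can be captured by making the insertion position depend on the chunk's origin. A hedged sketch: the origin names and α values below are illustrative, since the issue leaves the exact constants (and whether local requests share the syncing α) open:

```go
package main

import "fmt"

// ChunkOrigin distinguishes how a chunk arrived at the node.
// These names are illustrative, not taken from the Swarm codebase.
type ChunkOrigin int

const (
	RemoteRequest    ChunkOrigin = iota // retrieval request from a peer
	LocalRequest                        // fetched via the local HTTP interface
	SyncedOrUploaded                    // received through syncing or upload
)

// insertPos returns the queue index at which a chunk of the given origin
// is inserted, for current queue length n. Remote retrieval requests go
// to the head; local requests and synced/uploaded chunks enter at their
// respective α marks (made-up values here).
func insertPos(origin ChunkOrigin, n int) int {
	const (
		alphaSync  = 0.25 // synced/uploaded chunks
		alphaLocal = 0.5  // locally requested chunks (could equal alphaSync)
	)
	switch origin {
	case RemoteRequest:
		return 0 // head of the queue
	case LocalRequest:
		return int(alphaLocal * float64(n))
	default:
		return int(alphaSync * float64(n))
	}
}

func main() {
	fmt.Println(insertPos(RemoteRequest, 1000))    // head
	fmt.Println(insertPos(LocalRequest, 1000))     // at the local mark
	fmt.Println(insertPos(SyncedOrUploaded, 1000)) // at the sync mark
}
```

With this, a 20GB local download enters the queue well behind the most popular chunks, so it cannot flush them out, while still being cached for repeated local use.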

@holisticode
Contributor

holisticode commented Jul 1, 2018

It is not clear to me why we first say

and move a chunk that has been requested to the head of that queue,

but then later:

Newly synced or uploaded chunks (the two are indistinguishable) can be inserted at the k=int(αn)th position where n is the size of the queue and α is a constant real parameter between 0 and 1.

In this context it's not clear why *α* is needed at all, where it comes from and what determines it (why between 0 and 1? What's the difference between a value of 0.2 and 0.8?).

Does it mean that a constant number of chunks will never be evicted, i.e. the actual "configurable" capacity is smaller than the actual capacity? More clarity please, be aware that there are math morons like me :)

So the queue would look like:

     |   R1  |  R2  |  R3 | ... |  Rα  |  SU1  |  SU2 |  SU3 | ... | SUc |

where the Rn are requested chunks and the SUn, up to the queue capacity c, are synced/uploaded chunks?

@holisticode
Contributor

It sounds like the following:

if the uploader is willing to pay the Swap price of moving their content to the front of the queue.

isn't the usual insurance (or is it?), but rather a new concept: a premium paid for GC priority. This is interesting, but if so, it should be explicitly emphasized as a new feature.

@holisticode
Contributor

@homotopycolimit I think your

separate chunk queue for locally requested chunks.

proposal is very good. A local cache. It would also solve the "binge" swarmflix issue. But it probably requires more work.

@janos
Member

janos commented Jul 2, 2018

A single queue with one constant α for synced chunks and another constant α' for uploaded chunks, prioritized for GC relative to requested chunks, should be good enough. Keeping two or three queues would likely add more work to each GC run. The proposal is very good and it does make sense to have it implemented.

@nagydani
Author

nagydani commented Jul 2, 2018

Yes, I agree with the latter option. Should I amend the text of the ticket?

@cobordism

In this context it's not clear why α is needed at all, where it comes from and what determines it (why between 0 and 1? What's the difference between a value of 0.2 and 0.8?).

0 means the head of the queue and 1 means the end of the queue. Everything else means: somewhere in the middle.

Does it mean that a constant number of chunks will never be evicted?

No, because everything that is at the head of the queue can eventually move down as other chunks move to the front of the queue through retrieval requests. But it does mean that at any one time, there are chunks (the entire head of the queue up to alpha * n) that cannot be evicted through new syncing / uploads.
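As a concrete worked example of this (the numbers are illustrative, not proposed values):

```go
package main

import "fmt"

func main() {
	const alpha = 0.2
	n := 1000 // current queue length
	k := int(alpha * float64(n))
	// Synced/uploaded chunks enter at index k, so the k chunks ahead of
	// that mark can only be displaced by other retrieval requests moving
	// to the front, never by a flood of fresh uploads.
	fmt.Printf("queue length %d, sync insertion index %d: "+
		"the %d most recently requested chunks cannot be flushed by uploads\n",
		n, k, k)
}
```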

@holisticode
Contributor

It does mean though that both caches (up to alpha for RetrieveRequests and from alpha to the end for SyncRequests) become smaller.

I wonder if it makes sense to make alpha configurable for the user (a user may be running a node solely for profitability)

@cobordism

There still is only one cache and it stays the same size. The question is just, when the cache is full, what gets garbage collected first?

@cobordism

Our main site at theswarm.eth got GC'd. We need to implement these "anti-spam" rules soon.
