Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merkel-CRDTs for data consistency #43

Open
TheRook opened this issue Dec 12, 2022 · 15 comments
Open

Merkel-CRDTs for data consistency #43

TheRook opened this issue Dec 12, 2022 · 15 comments

Comments

@TheRook
Copy link

TheRook commented Dec 12, 2022

How does peerbit make sure that a user can add a data record to the database, and make sure no one else removes it or modifies it? If i am not mistaken peerbit and orbitdb expects users to be trusted when modifying a shared collection.

A Merkel-CRDT can be used to make it hard to deny that data was ever seen and provides some source of truth without a full blockchain:
https://research.protocol.ai/publications/merkle-crdts-merkle-dags-meet-crdts/psaras2020.pdf
https://github.com/ipfs/go-ds-crdt

@marcus-pousette
Copy link
Member

The main idea is that every peer need to run their own checks for every commit to verify whether a party can modify a state or not. This means in practice if every party is running the same ACL logic the state would end up to be the same for everyone. However if you don't assume that we can trust parties then there are multiple answers to this question depending on what kind of reliability you want. One way is to define an ACL database with Peerbit itself, then you bind this database with the canAppend and canRead hooks that you can access on setup with your database in mind, like this:

@variant("document")
class Document {
    @field({ type: "string" })
    id: string;

    constructor(props: { id: string }) {
            this.id = props.id;
    }
}

@variant("test_store")
class TestStore extends Program {
    @field({ type: Documents })
    store: Documents<Document>;

    @field({ type: IdentityAccessController })
    accessController: IdentityAccessController; // IdentityAccessController is part of Peerbit standard lib

    constructor(properties: {
        id?: string;
        identity: Identity;
        accessControllerName?: string;
    }) {
        super(properties);
        this.store = new Documents({
            index: new DocumentIndex({
                indexBy: "id",
                query: new RPC(),
            }),
        });
        this.accessController = new IdentityAccessController({
            id: properties.accessControllerName || "test-acl",
            rootTrust: properties.identity.publicKey,
        });
    }

    async setup() {
        await this.accessController.setup();
        await this.store.setup({
            type: Document,
            canRead: this.accessController.canRead.bind(this.accessController),
            canAppend: this.accessController.canRead.bind(this.accessController),
        });
    }
}

Now you might think: If the ACL of one database is defined by another database, what defines the ACL for that database?

The IdentityAccessController database/program is uniquely defined by its rootTrust publicKey, which is basically the creator of the access controller. When someone wants to add a permission to IdentityAccessController it either has to be signed by the root trust, or it has to be signed by an identity that has been trusted by the root trust (trust chain).

The hash of the program TestStore will uniquely be defined by the rootTrust of the accessController if you keep other properties constant, this means, anyone who opens TestStore by its CID will be sure that everyone else that also have it open, will observe the same root trust for the access controller. If someone tries to change the rootTrust for the ACL then all programs that rely on it will change because the hash will change, which means that it is impossible to do that without changing the address for the whole program itself.

With this way of dealing with trust, you can ~ be sure ~, or at least assume that everyone who is writing to TestStore that are part of the ACL canAppend also respects the access controller. In extension, this means that if you search for data that is defined in TestStore and make sure that you verify that the responses are signed by members of the ACL, then you can be "somewhat sure" that query responses are accurate. If you want to make sure that a party is not behaving maliciously just for your particular query, you would aggregate responses from many peers that are part of the ACL and make sure that there is a consensus about what the response should be for a particular query.

A simple example is a database that is a counter. If you query the database to check what the counter is, you could wait for at least 70% of the network to send you a response, then you could take the median value as the true value for your particular query/observation.

If you want to write an application that does not assume any kind of trust, you need to replicate all content by yourself, on your devices. If you are writing an application where resilience is not that important you can just make sure that you approve and reject commits as they come with your local ACL logic.

@marcus-pousette
Copy link
Member

Other than that Peerbit uses Merkel CRDTs for consistency (Operation-based CRDTs also called commutative replicated data types, or CmRDTs )

@TheRook
Copy link
Author

TheRook commented Dec 15, 2022 via email

@marcus-pousette
Copy link
Member

I am interested in the same thing.. The idea of building private and social media application is one of the reason why this project started in the first place.

In this case users are incentivised to lie in order to boost their own
post and to obscure or deboost the competition. There are already enough
fake posts on social media, we don't also need to allow these users to
affect other content.

Sybil resistance is a hard problem, but there are solution you can apply in the distributed settings. You can have psybil resistance as a service provided by different networks (that have different root trusts) and the user could choose to rely on one of them ( for example Worldcoin or some telephone number verification service). Then when doing an upvote, you would sign the message with multiple signatures: 1. Your signature, 2. The service that verifiies that you are a human (and perhaps have not made a vote prior). 3. A service that verifies that the timestamp is accurate. (Peerbit as of now supports multi-signatures for this very reason).

If it was trust-less, except for trusting the root. Then the root could
define super peers that are trusted to provide an accurate count, or
popularity listing of or more genetically a specific order of records.
Using an append-only collection with signed additions or subtractions
(upvotes or downvotes) would work, but it would have to be sorted locally
and if you have millions of tweets going out that doesn't seem feasible.

In terms of the database, you could define a collection with row-level
security which is a drop in REST replacement for some application
features. So long as you don't have a malicious peer that can obscure or
alter records. For sort order and a trusted counter, I can't think of any
other way than a trusted super peer.

Yes, this is true and make sense. You can not circumvent scaling problems for a trusted counter. A trusted super peer that can rely trust onto other peers to distribute work is a solution that you can easily understand.

The main bottle neck I would say, is that if you shard the post database into thousands of shards (to be able to handle millions of posts per second), you are going to pay for the overhead of coordinating search on all the shards and replication reorganization when peers join and leave. If you don't do sharding, then you are going to be bounded by the capabilities of a single machine and network. I do not believe in the federation approach to solve this problem. Instead, I believe this should be a smooth function based on what you are posting. If you just want to post to your 10 subscribers, you should not pay for the computational cost of making the post available for millions of people. However if, for some reason, you would to get greater reach, you should have the opportunity to do so (and only pay the price for that by then).

..
An additional note, there are ways to do voting with ZK proofs to make sure a key can only vote once, however, this does not solve the problem that a user could have multiple keys.

@TheRook
Copy link
Author

TheRook commented Dec 17, 2022 via email

@TheRook
Copy link
Author

TheRook commented Dec 18, 2022

An additional note, there are ways to do voting with ZK proofs to make sure a key can only vote once, however, this does not solve the problem that a user could have multiple keys.

One solution necessitates the other. If someone was upmost concerned it could be done on chain and use approved wallets who have undergone KYC/AML. Assume that a social media app still has the same security requirements for spam & abuse and fake account detection. We should assume that the app forces validation with a phone number or something more difficult to forge - the unsolved problem is making it non-trivial to boost content and repercussions for cheating. A ZK one-time update I think is needed for shared counters - prevent as much bad behavior as possible.

@marcus-pousette
Copy link
Member

marcus-pousette commented Dec 20, 2022

Apache has two shard-management projects. One of the OtterTune is a top rated project by the Apache foundation, and monitors the health of a sharded database cluster and orchestrates its configuration in realtime. Carnegie Mellon's Database lab first published research and the code that created https://ottertune.com/ . There is also https://github.com/apache/shardingsphere which I think is more primitive. So if each database supernode reported back health information on a logging channel, then a trusted node run by the admin could host OtterTune to make sure that the system is efficient. Or even the databases themselves could make informed decisions on what shard key to replicate. Filecoin runs into the oracle problem of relying on an agent to give out reliable information. One of the benefits we have here over filecoin is the use of cryptography to enforce some level of consistency. Filecoin's methods are open source, and there really isn't any magic in this oracle, there is an agent that randomly checks to make sure the data is there and valid. Another option is quorum computing to detect sybil attacks, I think this topic needs more research. The homomorphic world has some interesting research around privacy preserving targeted advertising that is applicable to social networks. In that - lets say a collection of social media posts are shared by the #topic name, and the client can choose what collections to subscribe to, which can be used by the client to build their own feed. The smallest amount of information that holds the most amount of trust is the reactions, views and other ranking metrics - which could be hosted by a dedicated K/V instance - which could be sharded and run on cheap hardware, still being compatible with the "totally free" hacker-ethos. https://crypto.stanford.edu/adnostic/adnostic.pdf https://eprint.iacr.org/2021/1032.pdf

On Sat, Dec 17, 2022 at 11:56 AM Marcus Pousette @.> wrote: I am interested in the same thing.. The idea of building private and social media application is one of the reason why this project started in the first place. In this case users are incentivised to lie in order to boost their own post and to obscure or deboost the competition. There are already enough fake posts on social media, we don't also need to allow these users to affect other content. Sybil resistance is a hard problem, but there are solution you can apply in the distributed settings. You can have psybil resistance as a service provided by different networks (that have different root trusts) and the user could choose to rely on one of them ( for example Worldcoin or some telephone number verification service). Then when doing an upvote, you would sign the message with multiple signatures: 1. Your signature, 2. The service that verifiies that you are a human (and perhaps have not made a vote prior). 3. A service that verifies that the timestamp is accurate. (Peerbit as of now supports multi-signatures for this very reason). If it was trust-less, except for trusting the root. Then the root could define super peers that are trusted to provide an accurate count, or popularity listing of or more genetically a specific order of records. Using an append-only collection with signed additions or subtractions (upvotes or downvotes) would work, but it would have to be sorted locally and if you have millions of tweets going out that doesn't seem feasible. In terms of the database, you could define a collection with row-level security which is a drop in REST replacement for some application features. So long as you don't have a malicious peer that can obscure or alter records. For sort order and a trusted counter, I can't think of any other way than a trusted super peer. Yes, this is true and make sense. You can not circumvent scaling problems for a trusted counter. A trusted super peer that can rely trust onto other peers to distribute work is a solution that you can easily understand. The main bottle neck I would say, is that if you shard the post database into thousands of shards (to be able to handle millions of posts per second), you are going to pay for the overhead of coordinating search on all the shards and replication reorganization when peers join and leave. If you don't do sharding, then you are going to be bounded by the capabilities of a single machine and network. I do not believe in the federation approach to solve this problem. Instead, I believe this should be a smooth function based on what you are posting. If you just want to post to your 10 subscribers, you should not pay for the computational cost of making the post available for millions of people. However if, for some reason, you would to get greater reach, you should have the opportunity to do so (and only pay the price for that by then). .. An additional note, there are ways to do voting with ZK proofs to make sure a key can only vote once, however, this does not solve the problem that a user could have multiple keys. — Reply to this email directly, view it on GitHub <#43 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7MN256NLCW2D2QFRIVJDWNYLIFANCNFSM6AAAAAAS4QFQ54 . You are receiving this because you authored the thread.Message ID: @.>

Thanks for the references. I am going to check it out.

Yes filecoin among other storage providers is going with the trust-less approach and it has its benefits and drawbacks. I like to imagine the permissioned approach can give cost benifits in terms of storage and speed. Though I would still like to in the end see some tokenomics systems that would allow people will low electricity costs that maybe have a mobile phone laying around to use it participate and get some revenue overtime.

The smallest amount of
information that holds the most amount of trust is the reactions, views and
other ranking metrics - which could be hosted by a dedicated K/V instance -
which could be sharded and run on cheap hardware, still being compatible
with the "totally free" hacker-ethos.

Yes. If the data amount is small enough and there are at least 5 persons online at all times, you don't even need a dedicated kv instance since data will always be available from some peer to share.

One solution necessitates the other. If someone was upmost concerned it could be done on chain and use approved wallets who have undergone KYC/AML. Assume that a social media app still has the same security requirements for spam & abuse and fake account detection. We should assume that the app forces validation with a phone number or something more difficult to forge - the unsolved problem is making it non-trivial to boost content and repercussions for cheating. A ZK one-time update I think is needed for shared counters - prevent as much bad behavior as possible.

yes

@TheRook
Copy link
Author

TheRook commented Dec 20, 2022 via email

@marcus-pousette
Copy link
Member

marcus-pousette commented Dec 20, 2022

That is actually super cool, didn't know that about early Blizzard (and Skype and KaZaa). It is actually really exciting to think about how cost efficient and performant things we can build considering how capable our hand held devices and cheap laptops are and all the improved networking capabilities that exist today.

Having a smaller set of semi-trusted nodes means there are fewer nodes to
verify. If every response from the database is reputable - then at some
predefined rate you could sample the same query against a collection
of nodes - and if the responses are signed and there is a method of
reputation gathering then clients can report sybil attacks.

Yes, this was exactly how I am imagine it. But I would maybe even go further a see trust as a continuous thing where we at one end have the "trustless" nodes governed through some PoS and on the other end have devices with unreliable network and storage capacity (mobile phones etc). Nodes that are actively malicious and nodes that performs badly will have kind of the same outcome, they will degrade the network, create data-loss or malformed data and unresponsive services if people rely on them. These might be mitigated with different measures, but no matter what, you need to determine whether you are to trust a device or not, and the parameters that you have to consider is not just whether the device owner is trustworthy, but also if the device itself can be relied upon.

Regarding this, I have started working a little bit on a sharding algorithm that improves itself through mathematical optimization. You would have set of parameters that you map with a function (that you optimize) to a value that would define the probability (or slot hash width) that a peer would be elected as a leader for storing and maintaining a state. I imagine that you could condition this function on various things, like device capacity, battery life, mean network speeds, and how trusted this node is by direct peers etc.

Trust carries its own value and in this system it incurs a performance
penalty or currency cost. Expecting an app owner to run a server agent on
their laptop or an old phone plugged into a wall outlet in the corner is a
low bar and provides a central authority. This agent could be used to run
something like ShardingSphere and a reputation+diagnostic system for
supernodes. This can give the app owner a dashboard to view and manage the
health of the cluster. An admin could manually promote other peers to
admin or supernode roles or add paid resources to the cluster.

Yes. And you need this low bar if you want adoption in the first place, because many apps today still need to run with a "server client" mindset with a central authority. What you describe later is basically a way for a developer with "server client" way of working to smoothly migrate to a "peer to peer" way of working

@TheRook
Copy link
Author

TheRook commented Dec 21, 2022 via email

@marcus-pousette
Copy link
Member

marcus-pousette commented Dec 21, 2022

Very interesting. Having, shared management is a great feature. I'll have
to take a peek at your implantation.

I have not written the implementation for yet, but will let you know when I have anything meaningful

There should be no single type of supernode. Some have a lot of bandwidth,
or storage or uptime. Pairing a node with good uptime with one of a lot of
bandwidth for example can make a better combined supernode. If a
collection is heavily using pub/sub for big data then it may need more read
replicas for example.

Ah, yes, basically you would also then condition the function on the work that is to be performed also. That makes sense!

It would be great to set some kind of record, like 'searching the largest
distributed dataset' something along those lines. I am working with
Decentaralied Science (DeSci) which is apart of LabDAO on a distributed
citizen sciences project on sharing biological data between citizen science
projects.

String search is useful for bioinformatics, and being able to perform other
searches - like regex searches would enable scientists to make use of a
vast data set. Time series data and collection of sensor data from IoT
devices is another use case.

Cool that you are involved with that! I am going to read more about it.

Yes, that is actually a real good use case for Peerbit since its is heavily optimized for creating sharded Document store that could distributed content and index it on a large amount of nodes, and then allow everything to be searchable at once. There is some work to improve the indexing parts of it (which we have discussed briefly before).
Timeseries are also interesting, and its not hugely different from livestreaming and collecting video (which I am currently optimizing the project to be able to handle in terms of performance). Right now the video demo I am working on is actually storing Webm Matroska elements in independent "documents" in a Document store, just like how you would store text documents, so that you could shard the video segments onto multiple nodes. This could also be applied for arbitrary use-cases where if you consider the time-series events to be independent of each other. What is interesting is that for popular content, like video segments, or documents, music etc that a few people always are actively listen on, you would not need that many super nodes that replicate the content as a backup, if you trust your peers to "seed" then the content will be available somewhere.

@TheRook
Copy link
Author

TheRook commented Dec 22, 2022 via email

@marcus-pousette
Copy link
Member

marcus-pousette commented Dec 22, 2022

Maybe it would be best to focus on search related features and if you are
interested I can have you meet with the LabDAO leadership to talk about
distributed infrastructure for big data in the sciences.

We also have grants available for science tooling.

Exploring different kinds of applications early on will help you understand what limitations you are building into the protocol, and will let you mitigate issues that are hard to change later. But I agree with you that it is important to focus on one thing, and do it really good if you want to bring value to the dev community at all.

I am very interested to have a meeting with the LabDAO leadership and discuss and learn about your problem domain more.

Grants are something that would help a lot at this stage!

How should we go on about this meeting? I just joined your Discord (username: marcus-pousette#1106)

@TheRook
Copy link
Author

TheRook commented Dec 22, 2022 via email

@marcus-pousette
Copy link
Member

Yes, sound good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants