suggestion: add support for packing many files into a single transaction #81
Comments
@shea256 thanks for this writeup. The problem I see is that the proposed method requires a relationship between the "bucket" name and the files inside of it: I need to know the key. In #79, we will be hashing all of the events in our system at an interval; to parallel your example:
The problem is that there is no relationship between the timestamp-based key and the array of canonical identifiers in the JSON file, so there would need to be an index somewhere else. Question:
Can you elaborate on this?
Yes. In my example, looking up
I suppose in theory we could support this, but what is the difference between this and using the subdomain method?

Can the timestamp-based key have forced uniqueness based on calculation? As in, is there a hash-derived component, and if not, can it be added? If it can, then we would have unique keys for each image entry, enforced not by the namespace but by the hashing function.

If the answer is no to the above (and you're using something like UUIDs), can we have timestamp-based keys for groups of images, and have the unique key for an image be the timestamp-based key combined with the local key?
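To illustrate what I mean (the values below are made up, not a proposed API):

```python
# Purely illustrative: a per-image key composed of the group's
# timestamp-based key plus a local identifier (e.g. a UUID), so
# uniqueness comes from the combination rather than from the namespace.
timestamp_key = "2015-05-19T00:00:00Z"                # hypothetical group key
local_key = "de305d54-75b4-431b-adb2-eb6b9e546013"    # hypothetical per-image UUID
image_key = timestamp_key + "/" + local_key
```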
A potential hash-derived component is illustrated in ipfs-inactive/faq#15, but it seems infeasible as a key because of the number of possible collisions when you have a very big dataset. I'd welcome your feedback on that issue in IPFS.

Adding the time-based key to the local key sounds like it should work, since now you know which bucket to look in. Will think some more about this.
What do you mean by collisions? Collisions with hashing functions shouldn't even be a concern.
Cool, let me know what you think when you've had a chance to mull it over for a bit. I think this might be the way to go.
@denisnazarov Any updates on how you're thinking about this?
@denisnazarov @shea256 @muneeb-ali
I'm guessing the reason you ask about decentralization is that the reader wants assurances that you're not lying when you resolve "de305d54-75b4-431b-adb2-eb6b9e546013" to this metadata? If so, and if we can set it up so that you can't convincingly lie anyway, then I'd argue that the degree of centralization doesn't matter here.
Conceptually, what if you just treated each content's metadata like a file in a git repository? I'm not suggesting actually using git, but the point is that you would maintain the history of changes to each metadata record as a Merkle tree. Then, once the reader had the latest Merkle root for the metadata, they could "clone" it. The challenge for the reader is in getting an authentic and fresh Merkle root for a given content metadata record.

What you could do is pull the commit histories for each metadata record that changed in the interval of time between snapshots, build a Merkle tree out of them, and include the root in a blockchain transaction. This basically attests that "these are the sequences of edits to all pieces of content that changed in this time interval." Once you do this, you could run an arbitrary index service, since the reader would be able to catch you if you were lying.

The reader uses the service to resolve the CCID to the Merkle root of the snapshot in the blockchain (which it can verify is authentic by auditing the blockchain). The reader verifies that this is a valid Merkle root by "cloning" the metadata (performing the integrity check) and verifying that it has the right CCID. The reader verifies that this is the latest Merkle root by auditing your snapshots in the blockchain: the snapshot root would resolve to the snapshot's Merkle tree, and each hash in the snapshot Merkle tree would resolve to the sequence of new Merkle roots observed for the metadata at that snapshot's point in time. The reader simply confirms that the latest Merkle root in the sequence is the same as the Merkle root of the metadata's edit history that it just cloned.

You could store the index itself as well as the JSON documents in the DHT. Would this work?
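As a rough sketch of the reader-side check (the proof format and function names here are illustrative assumptions, not an existing API), verifying that a metadata record's edit-history root is included in a published snapshot root could look like this:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_inclusion(leaf: bytes, proof: list, snapshot_root: bytes) -> bool:
    """Walk a Merkle branch from a leaf up to the snapshot root.

    `proof` is a list of (sibling_hash, side) pairs, where side is
    "left" or "right" depending on where the sibling sits.
    """
    node = leaf
    for sibling, side in proof:
        node = h(sibling + node) if side == "left" else h(node + sibling)
    return node == snapshot_root

# The reader "clones" the metadata's edit history, computes its latest root,
# asks the (untrusted) index service for an inclusion proof, and checks it
# against the snapshot root it read out of the blockchain.
```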
From what I'm reading here, it seems like the system has changed significantly since this discussion. I'm going to close this. Reopen if I'm wrong!
Currently, you can register a key-value pair on blockstore and perform lookups on the pair by entering the key and getting back the value.
This works great for certain applications like registering user identities, but it runs into problems when other types of objects need to be registered in the blockchain.
For example, validating/timestamping/proving the existence of a billion documents in the blockchain would require a lot of transactions, a lot of data embedding, and a lot of money, to the point where the process would simply not be feasible.
A way to get around this would be to pack the registration of multiple objects into a single transaction. A single transaction/key-value pair registration could have one unique key associated with the root hash of a Merkle tree, where each leaf in the tree is the hash of an object to be registered.
So to register 128 image files on the blockchain, you could do the following:
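A minimal sketch of that packing step (assuming SHA-256 and hypothetical file paths; this is illustrative, not the actual blockstore code):

```python
import glob
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaf_hashes):
    """Reduce a list of leaf hashes to a single root by pairwise hashing."""
    level = list(leaf_hashes)
    if not level:
        raise ValueError("need at least one leaf")
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])          # duplicate the last node on odd levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Hash all 128 image files and register only the 32-byte root in the
# key-value transaction; the per-file hashes live in the JSON file / DHT.
image_paths = sorted(glob.glob("birthday-album/*.jpg"))   # hypothetical paths
leaves = [sha256(open(p, "rb").read()) for p in image_paths]
root = merkle_root(leaves)
print(root.hex())
```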
To perform lookups, you would need to provide the domain and the subdomain of the packed file.
NOTE: the files can't realistically be stored in the DHT, as that would lead to an insane amount of DHT bloat (imagine storing 1B images averaging 100KB each). We're thinking of controlling bloat by requiring a bitcoin transaction for each stored item (with a max size), and packing multiple files into a single transaction wouldn't be compatible with this system.
Here's an example of how this would work:
Let's say I have an album of 128 images from my birthday party and I want to register them in the blockchain.
And let's say their filenames are as follows:
To do this, I first register "ryan-may2015" in a namespace for image storage.
Next, I create a JSON file and include that as the value associated with the name (where the JSON file goes into the DHT and the hash of the file is put in the blockchain):
Last, I make sure to store my files in my personal content-addressed fileserver.
That means my file 2015-05-19-birthday-1 would be stored here:

Then, I can perform lookups as follows:
...and I would get back the raw file.
This new blockstore lookup would do the following:
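Roughly, something like this (the helper callables stand in for the blockchain, DHT, and fileserver interfaces and are assumptions, not existing blockstore APIs; I'm also assuming the JSON file maps each local filename to its content hash):

```python
import hashlib
import json

def packed_lookup(domain, subdomain, blockchain_lookup, dht_get, fileserver_get):
    """Resolve domain/subdomain to a verified raw file.

    blockchain_lookup, dht_get and fileserver_get are caller-supplied
    callables standing in for the blockchain, DHT and content-addressed
    fileserver. The shape of `record` (a dict with a "value_hash" field)
    is likewise an assumption for illustration.
    """
    record = blockchain_lookup(domain)                     # 1. resolve the registered key
    manifest = json.loads(dht_get(record["value_hash"]))   # 2. fetch the JSON file from the DHT
    expected_hash = manifest[subdomain]                    # 3. look up this file's hash by its local name
    data = fileserver_get(domain, subdomain)               # 4. fetch the raw file from the fileserver
    if hashlib.sha256(data).hexdigest() != expected_hash:  # 5. verify integrity before returning
        raise ValueError("file hash does not match the registered manifest entry")
    return data
```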
Yay, I just registered an entire photo album in the blockchain!
I think we could realistically store 100-1K files in each transaction, and maybe 10K if we pushed the data limits of the DHT.
Thoughts @denisnazarov @muneeb-ali @jessewalden @moudy?