Change the way posts data are stored #61
Comments
I am against this approach. I agree the binary media data should be on an external storage service like IPFS, but not the content of the post and its metadata. They should be on the chain, as they are the result of consensus and the values the users would like to store under their accounts. If we only store the IPFS CID, the original content creator loses sovereignty and ownership of the content.

Do you mean the chain software should handle the file upload to IPFS? Do the node operators need to maintain IPFS nodes and pay the fee for the entire process? Unless we have another layer in the protocol dedicated to storing content, like the data replicator in Solana, the provider in Akash, or the storage node in Oasis, I do not agree with this approach.

I don't see text-based storage as a severe issue. Yes, maybe in the long run the storage can be 3 times larger than just storing a CID, but reducing storage size at the cost of sovereignty is not what I would expect. Some longer-term solutions could be data compression in the KV store (I'm not sure if Tendermint is already doing this; as far as I know, goleveldb can have data compression), or keeping some old data in archive mode.

We can take Steem as an example. Here is one of their posts stored on chain: https://steemd.com/fiction/@stormlight24/change-of-tide-university-path-short-fantasy-story-part-20 There is a lot of data, but it is still text only (markdown). As long as the data is stored in binary and compressed, it won't be too heavy, I think.
I don't think that's what Riccardo meant.
What do you mean by archive mode? Compress the data after it has been on chain for some time? I agree that media data should be catered to IPFS; I really don't know what will happen when we have a huge amount of posts.
Archive the block data so it won't be accessible directly. Maybe this is not applicable in our case.
I agree with this. I think there are different scenarios. A tweet-like message or a Facebook-like post should be stored on chain. For a Medium-like blog, storing the full text body on IPFS and only the metadata on chain is a better solution. This matches the case of storing photos and music. For example, if I wanted to build an IG-like app, I would store the photos on IPFS, store only the description in the post, and link it to the CID of the photo.

Maybe we should think about limiting the length of text in a post. This helps limit the size of a tx and the gas used. Then we can build some demos on how Desmos can work with IPFS to build interesting applications.
I think it's a good idea: it avoids polluting the chain and, at the same time, it keeps track of users' posts on chain and preserves sovereignty.
Yes, we could do that, sounds good to me! What do you think about all this, @RiccardoM?
I personally think that this solution might be the best one:
I would go as far as to say that the maximum post length should be something like 500 characters, which is 1.78 times the length of a Tweet.
This can be done inside Dwitter too, which can use IPFS for posts longer than 500 characters.
I'm OK with 500 characters. Does it matter if it has to be a power of 2? The limit of a memo in a tx is 256, for instance. I think Dwitter (Mooncake) should not go for the IPFS long post; that should be another dapp, aimed at blog posting. We should keep each dapp focused on a single problem.
No, it doesn't really matter. We can fix any length we want 😄
Ok perfect!
Closing this in favor of #67
Context
During the same conversation already mentioned inside #60 that I had with @angelorc yesterday, he raised another possible problem regarding our chain: the storage of posts.
He told me that in Bitsong they have been testing, throughout the year, different methods of storing song data and metadata so that the chain can always perform well without becoming larger than it should. The best way they found in the end was storing all the metadata on IPFS and linking the CID (the IPFS hash of the file containing such data) on the chain itself.
After that conversation, I made some estimates of how large our chain could become if we keep all the post contents on-chain. This is what I got.
Chain space analysis
Premises
All the following analyses have been done referring to the JSON format of posts, which is the one used when exporting them inside the genesis or importing them during an upgrade process. This is because it is simpler to verify how large a JSON file is than how much space it occupies inside the KV stores.
Current status
Currently Posts are stored using the following JSON structure:
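The original snippet was not captured in this copy of the issue; a plausible sketch of such a structure, with purely illustrative field names and values, could be:

```json
{
  "id": "1",
  "parent_id": "0",
  "message": "An average-length post message that a user might write on Desmos.",
  "created": "2020-01-01T12:00:00.000Z",
  "last_edited": "0001-01-01T00:00:00.000Z",
  "allows_comments": true,
  "subspace": "mooncake",
  "optional_data": { "external_id": "some-value" },
  "creator": "desmos1qwertyuiopasdfghjklzxcvbnm"
}
```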
In this case I've put an average length value for each field, considering the `optional_data` value to have only one key inside, which should be the average case. This JSON has a size on disk of 378 bytes.

Now, with the introduction of media-supporting posts (#36), an average JSON file of such type could look like the following:
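The original media-post snippet is also missing from this copy; a sketch with purely illustrative field names, extending a text post with a `medias` array, might be:

```json
{
  "id": "2",
  "parent_id": "0",
  "message": "Check out this picture!",
  "created": "2020-01-01T12:00:00.000Z",
  "last_edited": "0001-01-01T00:00:00.000Z",
  "allows_comments": true,
  "subspace": "mooncake",
  "optional_data": { "external_id": "some-value" },
  "creator": "desmos1qwertyuiopasdfghjklzxcvbnm",
  "medias": [
    {
      "uri": "https://ipfs.io/ipfs/QmExampleMediaCid",
      "mime_type": "image/jpeg"
    }
  ]
}
```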
Due to the higher amount of information stored, such JSON has an on-disk size of 768 bytes.
Let's now consider the case in which Desmos reaches 1/1000th of Twitter's usage. This could lead to 2 million posts per year (source of Twitter usage stats: InternetLiveStats).
Considering an average of 1 in 5 posts being a media post, this would lead to 400,000 media posts/year and 1,600,000 text posts/year. At 378 bytes per text post and 768 bytes per media post, this sums up to approximately 0.92GB of posts per year.
If we scale up to 1/100th of Twitter's usage, we would then collect 9.2GB/year of posts alone. This amount is extremely high, and I personally think we should reduce it: in the long run it might become a problem, considering that there are a lot of post types we still need to define (e.g. #14).
Solution
Storing all the post content into an IPFS file could be a good way to decrease the on-disk size of the chain. What we could do is define a new Post structure that is stored like the following:
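The original snippet is not present in this copy; a hedged sketch of what such a slimmed-down, metadata-only structure could look like (field names are illustrative, with `content` holding the IPFS CID):

```json
{
  "id": "1",
  "parent_id": "0",
  "content": "QmExampleContentCid",
  "created": "2020-01-01T12:00:00.000Z",
  "subspace": "mooncake",
  "creator": "desmos1qwertyuiopasdfghjklzxcvbnm"
}
```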
Inside the IPFS file reachable through the `content` reference, we can then have different JSON structures based on the type of the post itself. As an example, we could have:

Text post

Media post
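The two example snippets were not captured in this copy; plausible sketches of the two variants, with illustrative field names, could be, for a text post:

```json
{
  "message": "The full text body of the post, however long it may be."
}
```

and for a media post:

```json
{
  "message": "Check out this picture!",
  "medias": [
    { "uri": "ipfs://QmExampleMediaCid", "mime_type": "image/jpeg" }
  ]
}
```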
Of course, these are just generic examples; better schemas should be defined more in depth, and maybe even stored on chain as a generic reference.
On-disk space changes
Thanks to this approach, all posts, regardless of type, would have an on-disk size of just 166 bytes, which would reduce the yearly disk-space increase above from 0.92GB/year to 0.332GB/year (roughly −64%) considering 2,000,000 posts/year.
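As a quick sanity check, the arithmetic behind these figures can be reproduced with a throwaway Go sketch (not project code; the byte sizes are the averages measured above):

```go
package main

import "fmt"

// Average on-disk sizes taken from the analysis above (bytes per JSON post).
const (
	textPostBytes   = 378
	mediaPostBytes  = 768
	linkedPostBytes = 166 // IPFS-linked post, content stored off-chain
)

// yearlyBytes returns the total on-disk bytes for the given yearly post counts.
func yearlyBytes(textPosts, mediaPosts int) int {
	return textPosts*textPostBytes + mediaPosts*mediaPostBytes
}

func main() {
	current := yearlyBytes(1_600_000, 400_000)
	linked := 2_000_000 * linkedPostBytes
	fmt.Printf("current: %.3f GB/year\n", float64(current)/1e9) // ~0.912, rounded to 0.92 above
	fmt.Printf("linked:  %.3f GB/year\n", float64(linked)/1e9)  // 0.332
	fmt.Printf("saving:  %.1f%%\n", 100*(1-float64(linked)/float64(current)))
}
```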
The best thing is that this approach keeps the disk-space growth linear and completely independent of the post contents. If all users suddenly started creating media posts, on chain they would weigh the same as text posts, so it also acts as a spam prevention mechanism.
Querying
Thanks to the system proposed inside #60, the parser could simply read the posts from the chain, get the real content from IPFS, index it, and put it into the database, allowing for easier queries.
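A minimal Go sketch of that flow, assuming a hypothetical on-chain `Post` shape and stubbing out the IPFS fetch and the database write (the gateway URL is just one way a parser could resolve a CID):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Post mirrors the hypothetical slimmed-down on-chain structure:
// only metadata plus a CID pointing at the full content.
type Post struct {
	ID      string `json:"id"`
	Content string `json:"content"` // IPFS CID of the full post content
	Creator string `json:"creator"`
}

// gatewayURL builds a public-gateway URL for a CID; any IPFS node would do.
func gatewayURL(cid string) string {
	return "https://ipfs.io/ipfs/" + cid
}

func main() {
	// In the real parser these posts would be read from the chain (see #60).
	raw := `[{"id":"1","content":"QmExampleCid","creator":"desmos1example"}]`

	var posts []Post
	if err := json.Unmarshal([]byte(raw), &posts); err != nil {
		panic(err)
	}
	for _, p := range posts {
		// Fetching and indexing are stubbed: just print where the content lives.
		fmt.Printf("post %s -> fetch %s, then index into the DB\n", p.ID, gatewayURL(p.Content))
	}
}
```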
Consideration on IPFS
IPFS has proven to be reliable, and we can even improve its availability by creating a cluster or a private network so that post contents will always be available, although I don't think this will be needed.
Cons of this
The only con is that the chain would not be able to check the contents of the posts during transaction processing. However, I don't know whether that might be a problem for us or not.
Also, clients would have to upload the content before sending the transaction, so we are moving this responsibility onto them. This can, however, be mitigated by creating a REST API that performs the operation for the client.
Conclusion
I personally think this is the way to follow, as it would allow us to:

- Focus less on the development of the posts module. Once the specification for each post type is defined, we do not need to change the way posts are stored on chain.
- Achieve higher scalability of the chain itself. As posts are lighter on disk, the chain can scale better and transactions will be processed faster.
I would love to hear @kwunyeung's and @bragaz's feedback on this, and what might be improved or changed in this approach.