Change the way posts data are stored #61

Closed
RiccardoM opened this issue Dec 20, 2019 · 8 comments

@RiccardoM
Contributor

Context

During the conversation I had with @angelorc yesterday, already mentioned in #60, another possible problem he raised about our chain concerns the storage of posts.

He told me that in Bitsong they have been testing, throughout the year, different methods of storing song data and metadata so that the chain always performs well without growing larger than it should. The best solution they found in the end was storing all the metadata on IPFS and keeping only the CID (the IPFS hash of the file containing such data) on the chain itself.

After that conversation, I made some estimates of how large our chain could become if we keep all the post contents on chain. This is what I got.

Chain space analysis

Premises

All the following analysis has been done using the JSON format of posts, which is the one used when exporting them into the genesis file or importing them during an upgrade. This is because it is simpler to measure how large a JSON file is than how much space it occupies inside the KV stores.

Current status

Currently Posts are stored using the following JSON structure:

{
  "id": "134",
  "parent_id": "156",
  "message": "This is an estimated average long post message",
  "created": "566879",
  "last_edited": "899978225",
  "allows_comments": false,
  "subspace": "dwitter",
  "optional_data": {
    "local_id": "e3b98a4da31a127d4bde6e43033f66ba274cab0eb7eb1c70ec41402bf6273dd8"
  },
  "owner": "desmos1wjtg20d7hl9y409hhfydeqaph5pnfmzxlgxjg0"
}

Here I've used an average-length value for each field and assumed the optional_data object contains only one key, which should be the average case. This JSON has an on-disk size of 378 bytes.

Now, with the introduction of media-supporting posts (#36), an average JSON file of such a post could look like the following:

{
  "id": "134",
  "parent_id": "156",
  "message": "This is an estimated average long post message",
  "created": "566879",
  "last_edited": "899978225",
  "allows_comments": false,
  "subspace": "dwitter",
  "optional_data": {
    "local_id": "e3b98a4da31a127d4bde6e43033f66ba274cab0eb7eb1c70ec41402bf6273dd8"
  },
  "owner": "desmos1wjtg20d7hl9y409hhfydeqaph5pnfmzxlgxjg0",
  "medias": [
    {
      "uri": "http://ipfs.pics/QmXr1iejP2zHptFFDr3hycZvbaXaQNwrK6VVXYbxFAYQ7x",
      "mime_type": "image/png"
    },
    {
      "uri": "http://ipfs.pics/QmTbx9HLaN7gsKiunNc5NWvNRytydKn5uWWmpUzAHPU3NS",
      "mime_type": "image/png"
    },
    {
      "uri": "http://ipfs.pics/QmQZWZxfSHB3FpU8w4EMfX9bF52wUyPCiVNjtjf6jeNBBp",
      "mime_type": "image/png"
    }
  ]
}

Due to the higher amount of information stored, this JSON has an on-disk size of 768 bytes.

Let's now consider the case in which Desmos reaches 1/1000th of Twitter's usage. This would mean roughly 2 million posts per year (source of Twitter usage stats: InternetLiveStats).

Assuming an average of 1 in 5 posts being a media post, this would mean 400,000 media posts/year and 1,600,000 text posts/year. At 378 bytes per text post and 768 bytes per media post, this sums up to approximately 0.92 GB of posts per year.
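To make the arithmetic explicit, here is a minimal Go sketch of the estimate above (the byte sizes and the 1-in-5 media ratio are simply the assumptions stated in this section):

package main

import "fmt"

func main() {
	const (
		textPostBytes  = 378       // on-disk JSON size of an average text post
		mediaPostBytes = 768       // on-disk JSON size of an average media post (#36)
		postsPerYear   = 2_000_000 // roughly 1/1000th of Twitter usage
	)

	mediaPosts := postsPerYear / 5         // 1 in 5 posts carries media
	textPosts := postsPerYear - mediaPosts // the remaining 4/5 are text-only

	totalBytes := textPosts*textPostBytes + mediaPosts*mediaPostBytes
	fmt.Printf("≈ %.2f GB of posts per year\n", float64(totalBytes)/1e9) // ≈ 0.91 GB/year
}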

If we scale up to 1/100th of Twitter's usage, we would collect 9.2 GB/year of posts alone. This amount is extremely high, and I personally think we should reduce it: in the long run it might become a problem, considering there are a lot of post types we still need to define (e.g. #14).

Solution

Storing all the post content in an IPFS file could be a good way to decrease the on-disk size of the chain. What we could do is define a new Post structure that is stored like the following:

{
  "id": "87556",
  "created": "566879",
  "content": "QmP8jTG1m9GSDJLCbeWhVSVgEzCPPwXRdCRuJtQ5Tz9Kc9",
  "owner": "desmos1wjtg20d7hl9y409hhfydeqaph5pnfmzxlgxjg0"
}
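
As a rough sketch of how this could translate into code (the Go field names and types here are my assumption, simply mirroring the JSON above):

package types

// Post is the minimal on-chain record: everything except the ID, the
// creation time and the owner lives in the IPFS file referenced by Content.
type Post struct {
	ID      uint64 `json:"id,string"`
	Created uint64 `json:"created,string"`
	Content string `json:"content"` // IPFS CID of the file holding the actual post data
	Owner   string `json:"owner"`   // bech32 address of the post creator
}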

Inside the IPFS file reachable through the content reference, we can then have different JSON structures based on the type of the post itself. As an example, we could have:

Text post

{
  "type": "post/Text",
  "parent_id": "156",
  "message": "This is an estimated average long post message",
  "allows_comments": false,
  "subspace": "dwitter",
  "optional_data": {
    "local_id": "e3b98a4da31a127d4bde6e43033f66ba274cab0eb7eb1c70ec41402bf6273dd8"
  }
}

Media post

{
  "type": "post/Media",
  "parent_id": "156",
  "message": "This is an estimated average long post message",
  "allows_comments": false,
  "subspace": "dwitter",
  "optional_data": {
    "local_id": "e3b98a4da31a127d4bde6e43033f66ba274cab0eb7eb1c70ec41402bf6273dd8"
  },
  "medias": [
    {
      "uri": "http://ipfs.pics/QmXr1iejP2zHptFFDr3hycZvbaXaQNwrK6VVXYbxFAYQ7x",
      "mime_type": "image/png"
    },
    {
      "uri": "http://ipfs.pics/QmTbx9HLaN7gsKiunNc5NWvNRytydKn5uWWmpUzAHPU3NS",
      "mime_type": "image/png"
    },
    {
      "uri": "http://ipfs.pics/QmQZWZxfSHB3FpU8w4EMfX9bF52wUyPCiVNjtjf6jeNBBp",
      "mime_type": "image/png"
    }
  ]
}

Of course these are just generic examples; better schemas should be defined more in depth and maybe even stored on chain as a generic reference.

On-disk space changes

Thanks to this approach, every post, regardless of its type, would have an on-disk size of just 166 bytes. Considering 2,000,000 posts/year, this would reduce the yearly disk-space increase estimated above from 0.92 GB/year to 0.332 GB/year (a reduction of roughly 64%).

The best thing is that this approach keeps the disk-space growth linear and independent of the post contents. If all users suddenly started creating media posts, on chain they would weigh the same as text posts, so it also acts as a spam-prevention measure.

Querying

Thanks to the system proposed in #60, the parser could simply read the posts from the chain, fetch the real content from IPFS, index it, and put it into the database, allowing for easier queries.
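
As a minimal sketch of what the parser side could do, assuming a local IPFS daemon and the github.com/ipfs/go-ipfs-api client (the helper function here is hypothetical, not existing Desmos or parser code):

package main

import (
	"encoding/json"
	"fmt"
	"io"

	shell "github.com/ipfs/go-ipfs-api"
)

// fetchPostContent resolves the content CID stored on chain into the
// full JSON document kept on IPFS.
func fetchPostContent(sh *shell.Shell, cid string) (map[string]interface{}, error) {
	reader, err := sh.Cat(cid) // stream the file from IPFS
	if err != nil {
		return nil, err
	}
	defer reader.Close()

	raw, err := io.ReadAll(reader)
	if err != nil {
		return nil, err
	}

	var content map[string]interface{}
	err = json.Unmarshal(raw, &content)
	return content, err
}

func main() {
	sh := shell.NewShell("localhost:5001") // local IPFS API endpoint
	content, err := fetchPostContent(sh, "QmP8jTG1m9GSDJLCbeWhVSVgEzCPPwXRdCRuJtQ5Tz9Kc9")
	if err != nil {
		panic(err)
	}
	fmt.Println(content["type"]) // e.g. "post/Text" or "post/Media"
}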

Consideration on IPFS

IPFS has proven to be reliable, and we could even improve availability by creating a cluster or a private network so that post contents are always reachable, although I don't think this will be needed.

Cons of this approach

The main con is that the chain would not be able to check the contents of the posts during transaction processing. However, I don't know whether that would be a problem for us or not.

Also, clients would have to upload the content before sending the transaction, so we are moving this responsibility onto them. This could however be addressed by creating a REST API that performs this operation on behalf of the client.
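
Either way, the upload itself could be as simple as the following sketch (again assuming the go-ipfs-api client; this is not an existing Desmos API):

package main

import (
	"fmt"
	"strings"

	shell "github.com/ipfs/go-ipfs-api"
)

func main() {
	sh := shell.NewShell("localhost:5001") // local IPFS API endpoint

	// The full post content that would otherwise have been stored on chain.
	content := `{"type":"post/Text","message":"This is an estimated average long post message","subspace":"dwitter"}`

	// Add the content to IPFS; the returned CID is what the client then
	// references in the post-creation transaction.
	cid, err := sh.Add(strings.NewReader(content))
	if err != nil {
		panic(err)
	}
	fmt.Println("content CID to reference on chain:", cid)
}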

Conclusion

I personally think this is the way to go, as it would allow us to:

  • focus less on the development of the posts module.
    Once we define the specification for each post type, we no longer need to change the way posts are stored on chain.

  • achieve a higher scalability of the chain itself.
    As posts are lighter on disk, the chain can scale better and transactions will be processed faster.

I would love to hear @kwunyeung's and @bragaz's feedback on this, and what might be improved or should change in this approach.

@RiccardoM RiccardoM added kind/enhancement Enhance an already existing feature; no "New feature" to add x/posts Post module status/specification This feature is currently in the specification process labels Dec 20, 2019
@RiccardoM RiccardoM added this to To do in Desmos via automation Dec 20, 2019
@kwunyeung
Contributor

I am against this approach.

I agree that binary media data should be on an external storage service like IPFS, but not the content of the post and its metadata. Those should be on the chain, as they are the result of consensus and the values users would like to store under their accounts. If we only store the IPFS CID, it lacks the sovereignty and ownership of the original content creator.

Do you mean the chain software should handle the file upload to IPFS? Do node operators need to maintain IPFS nodes and pay the fees for the entire process? Unless we have another layer in the protocol dedicated to storing content, like the data replicators in Solana, the providers in Akash or the storage nodes in Oasis, I do not agree with this approach.

I don't see text-based storage as a severe issue. Yes, maybe in the long run the storage can be 3 times larger than just storing CIDs, but reducing storage size at the cost of sovereignty is not what I would expect. Some longer-term solutions could be data compression in the KV store (I'm not sure if Tendermint is already doing this, as I know goleveldb can have data compression), or keeping some old data in archive mode.

We can take Steem as an example. Here is a piece of their data stored on chain.

https://steemd.com/fiction/@stormlight24/change-of-tide-university-path-short-fantasy-story-part-20

There is a lot of data but still, it is text only (markdown). As long as the data is stored in binary and compressed, it won't be too heavy, I think.

@leobragaz
Contributor

leobragaz commented Jan 2, 2020

Do you mean the chain software should handle the file upload to IPFS? Do node operators need to maintain IPFS nodes and pay the fees for the entire process?

I don't think that's what Riccardo meant.
He proposed to move this responsibility to the clients: they would manage the upload to IPFS.

Some longer-term solutions could be data compression in the KV store (I'm not sure if Tendermint is already doing this, as I know goleveldb can have data compression), or keeping some old data in archive mode.

What do you mean by archive mode? Compressing the data after it has been on chain for some time?

I agree that media data should be stored on IPFS, but I really don't know what will happen once we have a huge amount of posts.
Will this data slow the system dramatically? And if it causes such a problem, how will we be able to face it? By changing the whole system? It would be crazy.
For now I would probably agree with the IPFS solution for both media and text posts, to ensure a faster service in the long term.
I remember that once, while talking with Alessio from Cosmos, he warned us not to put much data on chain and said it would be better for us to keep the chain as light as possible, but this is only his opinion; we could ask anyone from the Cosmos/Tendermint team what they think about that.

@kwunyeung
Contributor

What do you mean by archive mode? Compressing the data after it has been on chain for some time?

Archiving the block data so it won't be accessible directly. Maybe this is not applicable in our case.

I remember that once, while talking with Alessio from Cosmos, he warned us not to put much data on chain and said it would be better for us to keep the chain as light as possible

I agree with this. I think there are different scenarios. For a tweet-like message or a Facebook-like post, the content should be stored on chain. For a Medium-like blog, storing the full text body on IPFS and only the metadata on chain is a better solution. This matches the case of storing photos and music. For example, if I wanted to build an IG-like app, I would store the photos on IPFS and only store the description in the post, linking it to the CID of the photo.

Maybe we should think about limiting the length of text in a post. This helps limit the size of a tx and the gas used. And then we could build some demos on how Desmos can work with IPFS to create some interesting applications.

@leobragaz
Contributor

For a tweet-like message or a Facebook-like post, the content should be stored on chain. For a Medium-like blog, storing the full text body on IPFS and only the metadata on chain is a better solution. This matches the case of storing photos and music. For example, if I wanted to build an IG-like app, I would store the photos on IPFS and only store the description in the post, linking it to the CID of the photo.

I think it's a good idea: it avoids polluting the chain and at the same time keeps track of users' posts on chain, preserving sovereignty.

Maybe we should think about limiting the length of text in a post. This helps limit the size of a tx and the gas used. And then we could build some demos on how Desmos can work with IPFS to create some interesting applications.

Yes we could do that, sounds good to me!

What do you think about all this, @RiccardoM?

@RiccardoM
Contributor Author

I personally think that this solution might be the best one:

Maybe we should think about limiting the length of text in a post.

I would go as far as to say that the maximum post length should be something like 500 characters, which is roughly 1.8 times the length of a tweet.
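
As an illustrative sketch of how such a limit could be enforced during message validation (the constant name, function and error message are hypothetical, not actual Desmos code):

package types

import "fmt"

// MaxPostMessageLength is a hypothetical upper bound on the post message length.
const MaxPostMessageLength = 500

// validateMessageLength would be called from the message's ValidateBasic.
func validateMessageLength(message string) error {
	if len([]rune(message)) > MaxPostMessageLength {
		return fmt.Errorf("post message cannot be longer than %d characters", MaxPostMessageLength)
	}
	return nil
}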

And then we could build some demos on how Desmos can work with IPFS to create some interesting applications.

This could be done inside Dwitter too, which could use IPFS for posts longer than 500 characters.

@kwunyeung
Contributor

I'm ok with 500 characters. Does it matter if it has to be a power of 2? The limit of a memo in a tx is 256.

I think Dwitter (Mooncake) should not go for IPFS-backed long posts. That should be another dapp aimed at blog posting. We should keep each dapp focused on a single problem.

@RiccardoM
Contributor Author

I'm ok with 500 characters. Does it matter if it has to be a power of 2? The limit of a memo in a tx is 256.

No, it doesn't really matter. We can fix any length we want 😄

I think Dwitter (Mooncake) should not go for IPFS-backed long posts. That should be another dapp aimed at blog posting. We should keep each dapp focused on a single problem.

Ok perfect!

@RiccardoM
Contributor Author

Closing this in favor of #67

Desmos automation moved this from To do to Done Jan 7, 2020