Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VIP: Improve storage variable layout #769

Closed
daejunpark opened this issue Apr 9, 2018 · 12 comments
Closed

VIP: Improve storage variable layout #769

daejunpark opened this issue Apr 9, 2018 · 12 comments
Labels
VIP: Approved VIP Approved

Comments

@daejunpark
Copy link
Contributor

daejunpark commented Apr 9, 2018

Preamble

VIP: #769
Title: Improve storage variable layout
Author: Daejun Park
Type: Standard Track 
Status: Draft
Created: 2018-04-09
Requires (*optional): <VIP number(s)>
Replaces (*optional): <VIP number(s)>

Simple Summary

Improve the layout of dynamically-sized state variables in the storage, adopting that of Solidity.

Specification

A map entry m[k] is stored at the location hash(k . index(m)), where "." is byte-concatenation.

A nested map entry's location can be defined recursively. For example, m[k1][k2] is stored at hash(k2 . hash(k1 . index(m)))

A formal specification can be found here (in progress).


Below is a quick reference of loc that computes the storage location.

Lists:

loc(a[i])    =   #(a) + i
loc(a[i][j]) = #(#(a) + i) + j

Structs:

loc(s.x)   =   #(s) + x
loc(s.x.y) = #(#(s) + x) + y

Maps:

loc(m[k])    =   #(m . k)
loc(m[k][l]) = #(#(m . k) . l)

In general: (E is an arbitrary nested data structure, e.g., a list of structs of maps ...):

loc(E[i]) = #(loc(E)) + i    // list of ...
loc(E.x)  = #(loc(E)) + x    // struct of ...
loc(E[k]) = #(loc(E) . k)    // map of ...

loc(a) = a                   // list name
loc(s) = s                   // struct name
loc(m) = m                   // map name

NOTE:

  • # denotes the keccak256 hash.
  • . is byte-concatenation.
  • Assume there is the implicit index conversion for the list name a, the struct name s, the struct field names x and y, and the map name m.

Rationale:

The number of elements (or fields) of a list (or a struct) is expected to be much smaller than the size of the storage address space. Thus, even if the elements (or the fields) are stored in a consecutive region of the storage, it is very unlikely the region is overlapped with the other regions, since the starting locations of the regions will be well distributed (well spread across the entire storage) by keccak256.

Motivation

The main difference from the current scheme is the use of . instead of + to compute the offset. This change is critical for avoiding potential collisions, since . is injective, while + is not due to the modulo arithmetic. (Note that this is orthogonal to the potential hash collision of keccak256.)

@jacqueswww jacqueswww added the VIP: Approved VIP Approved label Apr 9, 2018
@jacqueswww
Copy link
Contributor

As we've discussed this across many meetings, I have marked this as approved ;)

@jacqueswww jacqueswww added this to Backlog in Vyper - Final Countdown via automation Apr 9, 2018
@jacqueswww jacqueswww moved this from Backlog to In Progress in Vyper - Final Countdown Apr 11, 2018
@jacqueswww
Copy link
Contributor

jacqueswww commented Apr 11, 2018

@daejunpark I started working on the above, can we just update the spec, the resulting lll comes out as follows (just order of concat is different):

self.a[1][3] = 131313
 [sstore, [sha3_64, [sha3_64, 0 <self.a>, 1], 3], 131313],

Also please mention that list and structs also need to use the new method ;)

@daejunpark
Copy link
Contributor Author

daejunpark commented Apr 11, 2018

@jacqueswww that seems more natural to me. Indeed, I have no good idea why Solidity uses the swapped order. It would be great if you know Solidity developers and can ask them to confirm that there is no critical reason for choosing the swapped order.

I'll update the spec as you suggested. Thanks!

@daejunpark
Copy link
Contributor Author

@jacqueswww I think you can keep the current layout scheme for lists and structs.

Below is a quick reference of loc that computes the storage location.

Lists:

loc(a[i])    =   #(a) + i
loc(a[i][j]) = #(#(a) + i) + j

Structs:

loc(s.x)   =   #(s) + x
loc(s.x.y) = #(#(s) + x) + y

Maps:

loc(m[k])    =   #(m . k)
loc(m[k][l]) = #(#(m . k) . l)

In general: (E is an arbitrary nested data structure, e.g., a list of structs of maps ...):

loc(E[i]) = #(loc(E)) + i    // list of ...
loc(E.x)  = #(loc(E)) + x    // struct of ...
loc(E[k]) = #(loc(E) . k)    // map of ...

loc(a) = a                   // list name
loc(s) = s                   // struct name
loc(m) = m                   // map name

NOTE:

  • # denotes the keccak256 hash.
  • Assume there is the implicit index conversion for the list name a, the struct name s, the struct field names x and y, and the map name m.

Rationale:

The number of elements (or fields) of a list (or a struct) is expected to be much smaller than the size of the storage address space. Thus, even if the elements (or the fields) are stored in a consecutive region of the storage, it is very unlikely the region is overlapped with the other regions, since the starting locations of the regions will be well distributed (well spread across the entire storage) by keccak256.


Question.

Could you confirm that the above scheme for lists and structs is the same with the current implementation?

@jacqueswww
Copy link
Contributor

jacqueswww commented Apr 13, 2018

@daejunpark I actually went and did the work on all structures yesterday. The other one to consider is a ByteArray, which uses the same scheme. Are we 100% it's safe to have those structures not to use the
#(#(m).k) approach. What happens when we store a list, bytearray or struct within a map? Feels like in that scenario the risk is the same as using map of #(m) + k ?
If we are 100% abouth using it for maps only; I'll adapt it ;) as there is obviously a higher gas cost.

@daejunpark
Copy link
Contributor Author

daejunpark commented Apr 13, 2018

@jacqueswww I discussed with @yzhang90, we think that it seems to be safe, but not 100% sure to be honest (because I'm not an expert for the kaccak256 hash). Indeed, the above scheme is similar to that of Solidity.

I think we need to have an extra discussion and confirmation of its safety with other experts (e.g., Solidity developers). Do you think the Solidity gitter channel is a good place for it? If so, I can initiate the discussion there.

(BTW, you're so quick to develop. Sorry for making you go back and forth. I think you can leave what you already implemented for now, just in case you change the scheme again.)

@jacqueswww
Copy link
Contributor

I agree, let's get this sorted before I develop it further ;) gitter channel should help yes. Also look for @chriseth in our channel directly I know he works on solidity as well.

@daejunpark
Copy link
Contributor Author

daejunpark commented Apr 18, 2018

Just for a record, copy the gitter communication here:


Daejun Park @daejunpark 12:16
Hi @chriseth
I have a question regarding the layout of state variables in storage:
http://solidity.readthedocs.io/en/v0.4.21/miscellaneous.html#layout-of-state-variables-in-storage

(It seems weird to ask about Solidity in Vyper channel, but this will help to improve the Vyper's layout scheme.)

The current scheme, for example, works as follows:

  • A dynamically-sized array element, a[i], is stored at keccak256(slot(a)) + i.
  • A mapping entry, m[k] is stored at keccak256(k . slot(m)), where . is concatenation.

Now, I have two questions:

Q1. Is there a critical reason that the location scheme of dynamic arrays is different from that of mappings? In other word, what will happen if a[i] is stored at keccak256(i . slot(a))? I know it will consume more gas, but will it affect any security (or hash collision probability)?

Q2. For a map entry location, is there any critical difference (in terms of security or hash distribution) between the following two?
keccak256(k . slot(m))
vs
keccak256(slot(m) . k)


chriseth @chriseth 13:21
@daejunpark dynamic arrays use a different scheme than mappings for efficiency. Using the same scheme as mappings reduces collision probability. We have to disallow large arrays because you can easily find collisions in storage otherwise. Another reason is also that if you use the same scheme, there is not a big reason to have arrays in general.

for the collisions see https://chriseth.github.io/notes/talks/safe_solidity/#/8

For Q2, I hope that keccak256 ensures that the order does not matter.

@daejunpark
Copy link
Contributor Author

@jacqueswww It turns out (thanks to @chriseth ) that the current scheme for lists is somewhat necessary, otherwise there is no reason to have lists in addition to maps.

But we need to have a compile-time check to reject a very large list, which can be problematic for the same reason we discussed before.

It would be good to check the size of structs as well (although it is very unlikely to declare such a large struct).

So, are lists, structs, and maps all we have?

@chriseth
Copy link
Contributor

In Solidity we issue warnings for large statically-sized arrays and we plan to remove the ability to arbitrarily increase the size of dynamically-sized arrays. If you have structures whose size in storage scales linearly with the amount of symbols required in source code, you should be fine, since there is still a lot of space in 2**256.

@chriseth
Copy link
Contributor

Perhaps to clarify a little more: If you can only increase the length of a dynamically-sized structure by a single element, then the gas costs for that operation keep the structure small enough until the end of the universe.

@daejunpark
Copy link
Contributor Author

Thanks @chriseth for your help! Now, we're clear what to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
VIP: Approved VIP Approved
Projects
No open projects
Development

No branches or pull requests

3 participants