Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CIP-0041? | UPLC Serialization Optimizations #314

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
327 changes: 327 additions & 0 deletions CIP-optimized-uplc-serialization/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,327 @@
---
CIP: "???"
Title: optimized UPLC serialization
Authors: Michele Nuzzi <michele.nuzzi.2014@gamil.com>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Authors: Michele Nuzzi <michele.nuzzi.2014@gamil.com>
Authors: Michele Nuzzi <michele.nuzzi.2014@gmail.com>

Discussions-To: TBD
Comments-Summary: TBD
Comments-URI: TBD
Status: Draft
Type: Standards Track
Created: <date created on, in ISO 8601 (yyyy-mm-dd) format>
License: CC-BY-4.0
License-Code: Apache-2.0
Post-History: <dates of postings to Cardano Dev Forum, or link to thread>
---

## Abstract

this document describes the parts of the current serialization algorithm that can be improved and provides the specification and documentation needed in order to implement an optimized version of this one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to clean up the formatting, e.g. use capital case at the beginning of sentences.


## Motivation

all Untyped Plutus Core Programs do need to be serialized in order to be included in a transaction (therefore stored on-chain);

the current serialization algorithm is based on the [flat](http://quid2.org/docs/Flat.pdf) algorithm;

this allows for a bit-oriented serialized program; but there are aspects that can be improved to further reduce the size of the serialized program;

the major benefit of a uplc-optimized algorithm is the possibility to store more complex logic (or more in general, more complex transactions) on chain without the need to increase the transaction size limit.

## Specification

### Table of contents

- [```pad0``` should add nothing](#pad0)
- [constant's types list should work as Integer ones](#constT)
- [remove the "type application" tag for constant types](#tyApp)
- [```ByteString```s encoded as unsinged integer followed by the bytes](#ByteStrings)
- [```data``` serialization
](#data)

<a name="pad0"></a>

### ```pad0``` should add nothing rather than an entire byte

sometimes it is necessary to allign the serialized program (partial or total) to fit into a byte;

in order to do so, some padding is added based on the current bit-length of the program;

the needed padding can be obtained as follows
```ts
pad( 8 - (currBitLength % 8) )
```
where the ```pad``` function could be defined as follows in typescript
```ts
pad( missingBits: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 ): string
{
switch( missingBits )
{
case 0: return "00000001";
case 1: return "1";
case 2: return "01";
case 3: return "001";
case 4: return "0001";
case 5: return "00001";
case 6: return "000001";
case 7: return "0000001";
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What language is this? Why not describe this in Haskell?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typescript just because I feel the description is more intuitive using it

it translates to haskell as

pad :: Int -> String
pad 0 = "00000001"
pad 1 = "1"
pad 2 = "01"
pad 3 = "001"
pad 4 = "0001"
pad 5 = "00001"
pad 6 = "000001"
pad 7 = "0000001"
pad _ = undefined

```
the case in which the ```missingBits``` is 0 implies that the current serialized program is already byte-alligned and, since this padding carries no usefull informations, the current ```pad( 0 )``` adds a useless byte each time a padding is needed and the number of used bits is a multiple of 8.

An alternative implementation would be
```ts
pad( missingBits: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 ): string
{
switch( missingBits )
{
case 0: return "";
case 1: return "1";
case 2: return "01";
case 3: return "001";
case 4: return "0001";
case 5: return "00001";
case 6: return "000001";
case 7: return "0000001";
}
}
```

this modification would make the padding deserialization contex-dependent (just like the serialization already is) but saves a byte 1 time every 8 ( probably more ofthen due to the high presence of tags and items of which serialization has a bit-length of a multiple of 4 )

<a name="constT"></a>

### constant's types list should work as Integer ones

citing the draft plutus core specification paper of june 2022

> (Appendix D, Section D.3.3, page 31 )
> We define the encoder and decoder for [ constant ] types by combining 𝖾 𝗍𝗒𝗉𝖾 and 𝖽 𝗍𝗒𝗉𝖾 with 𝖤 and decoder for lists of four-bit integers (See section D.2)

> (Appendix D, Section D.2.2, page 31 )
> Suppose that we have a set 𝑋 for which we have defined an encoder 𝖤 𝑋 and a decoder 𝖣 𝑋 ; we define an 𝑋 which encodes lists of elements of 𝑋 by emitting the encodings of the elements of the list, encoder 𝖤 each preceded by a 𝟷 bit, then emitting a 𝟶 bit to mark the end of the list.

from what cited above we can understand that the current constant's types serialization works exactly as the list one, in particoular a list of 4 bit values.

this encoding takes care also of the "empty list" case which is never the case for well formed list of types.

this means that an empty list encodes as ```0``` (nil constructor + no data) and a generic (simple) type encodes as ```1 xxxx 0``` ( cons construcotr + tag + nil constructor )

for lists that are non empty ( such as 7 bits lists used for unsigned integers serialization ) the constructor always carries data (since of course is non-empty) and this allows to remove the last bit.

the same should be applied to constant's types so that a generic (simple) type serializes as ```0 xxxx```( nil construcotr + tag )

#### some examples

```
int -> 0 0000
# before 1 0000 0
```

> **NOTE**: see "[remove the "type application" tag for constant types](#tyApp)" below
```
list int -> 1 0101 0 0000
# before 1 0111 1 0101 0 0000
```

> **NOTE**: see "[remove the "type application" tag for constant types](#tyApp)" below
```
pair int bytestring -> 1 0110 1 0000 0 0001
# before 1 0111 1 0111 1 0110 1 0000 1 0001 0
```

<a name="tyApp"></a>

### remove the "type application" tag for constant types

as per specification constant terms are typed using tags.

these tags are serialized as 4 bits and currently so 16 options ( 0 to 15 inclusive ) of wich 9 are used as follows

type | binary | decimal
------------|--------|--------
integer | 0000 | 0
bytestring | 0001 | 1
string | 0010 | 2
unit | 0011 | 3
bool | 0100 | 4
list | 0101 | 5
pair | 0110 | 6
type app | 0111 | 7
data | 1000 | 8

tags from ```integer``` to ```bool``` and the ```data``` one are directly followed by the respective value encoding;

tags ```list``` and ```pair``` are the only reason the ```type application``` tag is presente, and those tags always require some other type in order to be valid, so the type application is implcit and it should be removed.

during serialization this saves 4 bits for each ```lsit``` tag ( ```0111 0101``` becomes ```0101``` ) and 8 bits for pair tags which requires 2 type applications to be satisfied ( ```0111 0111 0110``` becomes ```0110``` )

to help backwards compatibitlity with exsisting serializaations algoritms the new table would look like

type | binary | decimal
------------|--------|--------
integer | 0000 | 0
bytestring | 0001 | 1
string | 0010 | 2
unit | 0011 | 3
bool | 0100 | 4
list | 0101 | 5
pair | 0110 | 6
data | 1000 | 8

even if the remotion of the tag would allow to reduce the size of every tag, further reducing the size of constants by 1 bit per constant at the cost of missing bits for expansion in the evenience new constant types will be added

if the cost is acceptable ( needs to be discussed with the comunity ) an even better soultion would be to encode constant types as follows

type | binary | decimal
------------|--------|--------
integer | 000 | 0
bytestring | 001 | 1
string | 010 | 2
unit | 011 | 3
bool | 100 | 4
list | 101 | 5
pair | 110 | 6
data | 111 | 7

<a name="ByteStrings"></a>

### ```ByteString```s encoded as unsinged integer followed by the bytes

ByteStrings are probably the values that have the most important impact on the serialized output;

Bytestrings are used to represent a wide variety of types (```TokenName```, ```ValidatorHash```, ```PubKeyHash```, ```CurrencySymbol```, etc. are all ```ByteString``` wrappers ) ence widely used in smart contracts.

citing the plutus core specificaiton draft released in june 2022 (latest specification at the time of writing)

> Bytestrings are encoded by dividing them into nonempty blocks of up to 255 bytes and emitting each block in sequence.
>
> Each block is preceded by a single unsigned byte containing its length, and the end of the encoding is marked by a zero-length block (so the empty bytestring is encoded just as a zero-length block).
>
> Before emitting a bytestring, the preceding output is padded so that its length (in bits) is a multiple of 8;
>
> if this is already the case a single padding byte is still added; this ensures that contents of the bytestring are aligned to byte boundaries in the output.

given the importance of bytestrings too much hoverhead is introduced; currently a ```ByteString``` serializes as

```
1) some padding that, depending on the context, may take up to 1 byte
2) 1 (byte * chunk) indicating how many bytes are in the very next chunk
3) 0 to 255 meaningful (bytes * chunk)
4) 1 useless byte indicating no more chiunks are following
```
this implies ```(~1) + #chunks + 1``` meaningless bytes are added per ```ByteString```

in the descripton above
- step ```1``` allows for an easy serialization and deserialization but doesn't carries any meaningful informations; given the importance of ByteStrings it should be removed at the cost of an added shift while serializing/deserializing the value
Copy link
Contributor

@kwxm kwxm Oct 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that might slow things down significantly, since I think we'd need to shift every byte in a bytestring if it didn't start at a byte boundary (or have I got that wrong?). It's important that on-chain deserialisation be as fast as possible, so the extra expense might be problematic. I'd like to see some benchmark results for this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be possible to implement deserialization without shifting bytes for efficiency

I'll leave here the plu-ts implementation of the readNBits function on which the deserialization process is based

I believe a very similar result can be achieved in Haskell using the Data.Binary.Get monad

- step ```4``` can be removed as it is simply useless
- step ```3``` can be optimized using unsigned integer encoding

the new algorithm for ```ByteString``` serialization would consist of two parts:
```
1) Unsigned (indefinite length) Integer indicating how many bytes ( or chunks of 8 bits ) are following
2) as many bytes as specified in the Unsigned Integer at step 1
```

using the new algorithm the ```ByteString``` serialization space complexity goes form ```O(n)``` to ```O(log n)``` where ```n``` is the number of bytes in the ```ByteString```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. I need to think more carefully about this. It's a while since I've thought about bytestring encodings and I'll need to remind myself of the details of how we currently do it.


<a name="data"></a>

### ```data``` serialization

All the effort of minimizing the size of on-chain scripts by prefering ```flat``` over ```CBOR``` serialization are ignored when it comes to ```data``` serialization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be good to have a more efficient encoding of Data rather than just wrapping the CBOR representation. However, I think the reason for this is that Data is used elsewhere on the chain (including things that are passed to validator scripts as parameters), and CBOR is the preferred format there. It might be difficult to switch because of this.

Copy link
Author

@HarmonicPool HarmonicPool Oct 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conversion from CBOR to a specific data encoding can be done by the node prior to being passed as argument;

CBOR encoding is definitely too expansive to be included in a script


citing the june 2022 plutus core specification ( Appendix B, Section B.2, Note 1; page 23 )

> The ```serialiseData``` function takes a 𝚍𝚊𝚝𝚊 object and converts it into a CBOR (Bormann and Hoffman [2020]) object. The encoding is based on the Haskell Data type described in Section A.1.
>
> A detailed description of the encoding will appear here at a later date, but for the time being see the Haskell code in
> [plutus-core/plutus-core/src/PlutusCore/Data.hs](https://github.com/input-output-hk/plutus/blob/master/plutus-core/plutus-core/src/PlutusCore/Data.hs) ([```encodeData``` line](https://github.com/input-output-hk/plutus/blob/9ef6a65067893b4f9099215ff7947da00c5cd7ac/plutus-core/plutus-core/src/PlutusCore/Data.hs#L139))
> in the [Plutus GitHub repository](https://github.com/input-output-hk/plutus) IOHK [2019] for a definitive implementation.

so currently ```Data``` types are encoded by converting the value to ```CBOR``` and the using the ```ByteString``` encoder [using chunks of 64 bytes max each](https://github.com/input-output-hk/plutus/blob/9ef6a65067893b4f9099215ff7947da00c5cd7ac/plutus-core/plutus-core/src/PlutusCore/Data.hs#L84)

this implementation is very inconvinient considering that ```Data``` has only [5 construtors](https://github.com/input-output-hk/plutus/blob/9ef6a65067893b4f9099215ff7947da00c5cd7ac/plutus-core/plutus-core/src/PlutusCore/Data.hs#L40).

```hs
data Data =
Constr Integer [Data]
| Map [(Data, Data)]
| List [Data]
| I Integer
| B ByteString
```

a more efficient implementation could be obtained by introducing some tags for the various construcors, each taking 3 bits:

data constructor | binary | decimal
-----------------|--------|--------
Constr | 000 | 0
Map | 001 | 1
List | 010 | 2
I | 011 | 3
B | 100 | 4

directly followed by the respective serialized value; in particoular

- ```Constr``` is followed by an **unsigned** integer and the a list of ```Data```
- ```Map``` is followed by a list of ```Data``` pairs; where the paris are serialized as in the constant pair specification (only the value)
- ```List``` is folloed by a list of ```Data``` elements
- ```I``` is followed by a **signed** integer
- ```B``` is followed by a ```ByteString```

An important **alternative design** that should be considered and discussed with the comunity would see the ```Data``` type requiring only 4 tags which would then be:

data tag | binary | decimal
----------|--------|--------
data-pair | 00 | 0
data-list | 01 | 1
data-int | 10 | 2
data-bs | 11 | 3

and the constructors would map to:

- ```Constr``` becomes a ```data-pair``` expecting an unsigned integer and a list of ```Data```
- ```Map``` becomes a ```data-list``` containing 2 ```Data``` per element
- ```List``` becomes a ```data-list``` containing 1 ```Data``` per element
- ```I``` becomes a ```data-int``` tag expecting a signed integer
- ```B``` becomes a ```data-bs``` tag expecting a ```ByteString```

this design is unambigous since ```data-list``` that maps back to two of the constructors contains different values depending on what Constructor is intended.

the only edge case for the ```data-list``` constructor is the empty list, but the problem can be easly solved by expecting one single bit only in the case of the empty list indicating the intended constructor.

so the empty ```Map``` would serialize as
```
01 # data-list tag
0 # list nil construcor ( empty list )
0 # bit indicating Map data constructor
```
and the empty ```List``` as
```
01 # data-list tag
0 # list nil construcor ( empty list )
1 # bit indicating List data constructor
```

the second design also implies no more ```Data``` constructors will be added in the future, or that for those added it will be possible to find an unambigous representation using the ones specified.

## Rationale

This proposal suggest ways to reduce serialized scripts size.

the changes where designed to keep the bit oriented style of flat minimizing the number of required bits for values where possible.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale section is a bit thin, though there's a bit of rationale sprinkled in various places of the specification. It'd be nice to have a summary of what motivates each design choices. This can most likely be done by reworking a bit how the specification is written (that is, write the specification without justifications, and move the justifications to the rationale section).

For example, most of the "data serialization" section is about explaining what is currently wrong with the Data CBOR serialization. Instead, leave the specification focused to what the proposal is about, and move the rationale to this section.

It'd be nice to consider some benchmarks? Taking some reference UPLC and serializing them side-by-side with the two methods to see how much gain one can expect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @KtorZ; will refactor as suggested ( and correct the various typos 😄 )


## Backward Compatibility

the proposed changes to the algorithm will cause the same UPLC Abstract Syntaxt Tree to serialize in a different way based on the algorithm used;

In order to allow the deserializaton process to handle the old serialization algorithm the version of the program sholud be chcked first.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In order to allow the deserializaton process to handle the old serialization algorithm the version of the program sholud be chcked first.
In order to allow the deserializaton process to handle the old serialization algorithm the version of the program should be checked first.


since the new algorithm introduces changes that are breaking the major number in the version should change;

in particoular programs using versions that match ```0.*.* <= version <= 1.*.*``` will be serialized using the old specification, while programs with version ```2.*.* <= version``` will be serialized using the modifications specified in this CIP.

## Copyright

This CIP is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode)