Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CIP-0041? | UPLC Serialization Optimizations #314

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

HarmonicPool
Copy link

this CIP proposes some changes to the UPLC AST serialization in order to reduce the size of the serialized output


This proposal suggest ways to reduce serialized scripts size.

the changes where designed to keep the bit oriented style of flat minimizing the number of required bits for values where possible.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale section is a bit thin, though there's a bit of rationale sprinkled in various places of the specification. It'd be nice to have a summary of what motivates each design choices. This can most likely be done by reworking a bit how the specification is written (that is, write the specification without justifications, and move the justifications to the rationale section).

For example, most of the "data serialization" section is about explaining what is currently wrong with the Data CBOR serialization. Instead, leave the specification focused to what the proposal is about, and move the rationale to this section.

It'd be nice to consider some benchmarks? Taking some reference UPLC and serializing them side-by-side with the two methods to see how much gain one can expect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @KtorZ; will refactor as suggested ( and correct the various typos 😄 )


## Abstract

this document describes the parts of the current serialization algorithm that can be improved and provides the specification and documentation needed in order to implement an optimized version of this one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to clean up the formatting, e.g. use capital case at the beginning of sentences.

case 6: return "000001";
case 7: return "0000001";
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What language is this? Why not describe this in Haskell?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typescript just because I feel the description is more intuitive using it

it translates to haskell as

pad :: Int -> String
pad 0 = "00000001"
pad 1 = "1"
pad 2 = "01"
pad 3 = "001"
pad 4 = "0001"
pad 5 = "00001"
pad 6 = "000001"
pad 7 = "0000001"
pad _ = undefined

@KtorZ KtorZ changed the title CIP-optimized-uplc-serialization CIP-0041? | UPLC Serialization Optimizations Aug 17, 2022
---
CIP: "???"
Title: optimized UPLC serialization
Authors: Michele Nuzzi <michele.nuzzi.2014@gamil.com>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Authors: Michele Nuzzi <michele.nuzzi.2014@gamil.com>
Authors: Michele Nuzzi <michele.nuzzi.2014@gmail.com>


the proposed changes to the algorithm will cause the same UPLC Abstract Syntaxt Tree to serialize in a different way based on the algorithm used;

In order to allow the deserializaton process to handle the old serialization algorithm the version of the program sholud be chcked first.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In order to allow the deserializaton process to handle the old serialization algorithm the version of the program sholud be chcked first.
In order to allow the deserializaton process to handle the old serialization algorithm the version of the program should be checked first.

@KtorZ
Copy link
Member

KtorZ commented Sep 6, 2022

cc @michaelpj

@michaelpj
Copy link
Contributor

This could probably have started life as a plutus issue, but this is also fine!

@kwxm can you take a look?

@KtorZ
Copy link
Member

KtorZ commented Oct 25, 2022

Note

We'll be reviewing this proposal briefly in today's editor meeting but since this change concerns Plutus internals, the ultimate decision and approval is up to the current Plutus core team in accordance with CIP-0035.

cc @michaelpj @kwxm

@michaelpj
Copy link
Contributor

Apart from anything else, this could do with an impact assessment that assess how much of an improvement this is: actual numbers are very relevant.

```
the case in which the ```missingBits``` is 0 implies that the current serialized program is already byte-alligned

since this padding carries no usefull informations, the current ```pad( 0 )``` adds a useless byte each time a padding is needed and the number of used bits is a multiple of 8.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the reason for this is so that the decoder can easily discard padding at the end without having to check whether it's at a byte boundary, and this makes it faster. The proposed change would on average save one bit per bytestring, so I'm not sure if it's worth it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the proposed changes are accepted the only place where padding will be used is at the end of the script; since bytestring will no longer require to be byte alligned

the check for pad( 0 ) then becomes just lengthInBits `mod` 8 == 0


tags from ```integer``` to ```bool``` and the ```data``` one are directly followed by the respective value encoding;

tags ```list``` and ```pair``` are the only tags that do require some other tag in order to be a defined type; since the twos always require some other type in order to be valid, the type application is implcit and it should be removed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a bit annoying: it's like that because it reflects how the internal representations of types work. The extra generality might permit us to do some more complex things in future, but it's not clear if we'll ever actually need that. Simplifying this might be a good thing to do, since it complicates the encoding/decoding process as well as taking up extra room.


### ```data``` serialization

All the effort of minimizing the size of on-chain scripts by prefering ```flat``` over ```CBOR``` serialization are ignored when it comes to ```data``` serialization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be good to have a more efficient encoding of Data rather than just wrapping the CBOR representation. However, I think the reason for this is that Data is used elsewhere on the chain (including things that are passed to validator scripts as parameters), and CBOR is the preferred format there. It might be difficult to switch because of this.

Copy link
Author

@HarmonicPool HarmonicPool Oct 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conversion from CBOR to a specific data encoding can be done by the node prior to being passed as argument;

CBOR encoding is definitely too expansive to be included in a script

@HarmonicPool
Copy link
Author

HarmonicPool commented Oct 25, 2022

Apart from anything else, this could do with an impact assessment that assess how much of an improvement this is: actual numbers are very relevant.

@michaelpj I will work on a CIP implementation these days to have a comparison in terms of size

this implies ```(~1) + #chunks + 1``` meaningless bytes are added per ```ByteString```

in the descripton above
- step ```1``` allows for an easy serialization and deserialization but doesn't carries any meaningful informations; given the importance of ByteStrings it should be removed at the cost of an added shift while serializing/deserializing the value
Copy link
Contributor

@kwxm kwxm Oct 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that might slow things down significantly, since I think we'd need to shift every byte in a bytestring if it didn't start at a byte boundary (or have I got that wrong?). It's important that on-chain deserialisation be as fast as possible, so the extra expense might be problematic. I'd like to see some benchmark results for this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be possible to implement deserialization without shifting bytes for efficiency

I'll leave here the plu-ts implementation of the readNBits function on which the deserialization process is based

I believe a very similar result can be achieved in Haskell using the Data.Binary.Get monad

2) as many bytes as specified in the Unsigned Integer at step 1
```

using the new algorithm the ```ByteString``` serialization space complexity goes form ```O(n)``` to ```O(log n)``` where ```n``` is the number of bytes in the ```ByteString```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. I need to think more carefully about this. It's a while since I've thought about bytestring encodings and I'll need to remind myself of the details of how we currently do it.

Copy link
Contributor

@kwxm kwxm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ideas here aren't nonsensical, but I'd like to see some benchmark results to see how they affect (a) the space taken up by serialised scripts, and (b) deserialisation times. I think that the benefits would have to be quite big in order to accept this.

I'll add that we're aware that there's room for improving things, but the reason that things are the way that they currently are is because we're using a pre-existing library and that's the way that it does things. We're definitely not averse to changing things, but then we'd need to maintain our own version of the flat library,

Apologies for a slightly hurried review: this had slipped off the bottom of my list. I'll think a bit more about this, in particular the bytestring encoding.

@KtorZ KtorZ added the Waiting for Author Proposal showing lack of activity or participation from their authors. label Nov 30, 2022
@KtorZ KtorZ added Category: Plutus Proposals belonging to the 'Plutus' category. and removed Candidate CIP labels Mar 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Category: Plutus Proposals belonging to the 'Plutus' category. Waiting for Author Proposal showing lack of activity or participation from their authors.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants