Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

protocol hardfork: fixed-length OutputID replacing Outpoint #417

Closed
wants to merge 26 commits into from

Conversation

oleganza
Copy link
Contributor

@oleganza oleganza commented Jan 20, 2017

REPLACED BY #421

Problem

Outpoint is a variable-length structure <txid>:<index> which is 33-40 bytes long (33 bytes for most transactions). It is used by transaction inputs to identify exact output in the UTXO set ("Assets Merkle Tree"). The tree leafs contain SHA3(output) which allows save space and requires transactions to carry redundant copies of spent outputs to perform validation (otherwise nodes would have to store the entire outputs instead of their hashes — over 2x more data, and the ratio is much bigger in protocol 2). Also, for HSM-friendliness the TXSIGHASH must contain a redundant output's hash: SHA3(txid || input index || SHA3(output)).

Solution:

We define two new terms:

  • OutputID = SHA3(TxHash || OutputIndex)
  • UnspentID = SHA3(OutputID || SHA3(OutputCommitment))

How are these used:

  1. Transaction input contains OutputID to identify the output being spent. This is a unique identifier of the output.
  2. Transaction input uses second serialization flag to indicate if it contains the entire previous Output Commitment, or its hash (instead of empty place).
  3. UTXO set becomes a proper set containing UnspentIDs instead of {Outpoint -> SHA3(OutputCommitment)}.

When a node validates a transaction, it computes UnspentID using provided OutputID and previous OutputCommitment. If the given unspent ID is present in the UTXO set, then previous output is proved to be both authentic and available for spending.

Upsides:

  1. The outputID is constant-size and shorter: 32 bytes instead of 33-40 bytes. This simplifies merkle tree design, transaction data structure and all pieces of software that need to handle outpoints.
  2. All outputs (via unspentIDs) in the transaction are randomized across the Assets Merkle Tree instead of being crammed inside a common subpath <txid>||....
  3. Inputs automatically commit directly to the spent outputs, so TXSIGHASH does not need to do that and can be simplified to SHA3(txid || input index). HSM is able to verify which output this input commits to without having access to the entire parent transaction.
  4. We keep the term outpoint to mean a pair (txid, index), but is internal to Chain Core to support random access to UTXOs. Validation protocol no longer uses outpoints.
  5. UTXO takes 2x less RAM because it only contains unpent IDs (32 bytes) instead of a key-value pair (64+ bytes).
  6. When we get to tx entries design, we'll generalize the idea of OutputID to EntryID, so that any entry can have a unique identifier.

Downsides:

  1. OutputID no longer indicates the transaction ID which makes it impossible to navigate the chain of transactions without also having a mapping outpoint -> txid:index. UTXO tree is not enough as it's only reflecting the latest state of the chain and throws away spent outpoints. Note that in order to navigate the transactions in practice one still needs the mapping txid -> tx, so maintaining one more index might not be a significant increase in complexity. Chain is doing this indexing already and we keep that mapping.
  2. Chain Core no longer returns (txid,position) pair for annotated txinputs (called spent_output:{transaction_id:String,position:Int}), but instead returns output_id (spent_output_id:String). To maintain full compatibility, we'd need to make an additional request to locate the previous output's txid and position, but I'm not sure any application actually relies on such historical data. For spending (locating unspents), we fully maintain compatibility with clients using (txid,position) pairs.

This is a part of a package of breaking changes in P1: #239

REPLACED BY #421

@oleganza
Copy link
Contributor Author

@@ -93,11 +93,10 @@ func (a *spendAction) Build(ctx context.Context, b *txbuilder.TemplateBuilder) e
return nil
}

func (m *Manager) NewSpendUTXOAction(outpoint bc.Outpoint) txbuilder.Action {
func (m *Manager) NewSpendUTXOAction(outputid bc.OutputID) txbuilder.Action {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/outputid/outputID/

@@ -89,24 +99,23 @@ func (m *Manager) indexAccountUTXOs(ctx context.Context, b *bc.Block) error {
}

// Delete consumed account UTXOs.
deltxhash, delindex := prevoutDBKeys(b.Transactions...)
deloutputids := prevoutDBKeys(b.Transactions...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/deloutputids/delOutputIDs/

prevoutHashes = append(prevoutHashes, outpoint.Hash[:])
prevoutIndexes = append(prevoutIndexes, outpoint.Index)
oid := in.OutputID()
prevoutOIDs = append(prevoutOIDs, oid[:])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: OID might get confused with Postgres's object identifiers. Maybe rename to prevoutIDs?

@@ -7,6 +7,7 @@

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is Postgres 9.6 only. should probably delete this since some of us are still on 9.5

import (
"chain/encoding/blockchain"
"io"
"strconv"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: import grouping

@oleganza
Copy link
Contributor Author

Thanks, @jbowens. Addressed all nits.

Copy link
Contributor

@kr kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment on the main description for the PR: that text will be used as the commit message, so the wording should probably be tweaked so it's not presented as a "proposal" but instead as "here's what we did".

}
],
"env": {
"GO_INSTALL_PACKAGE_SPEC": "./cmd/...",
"CMD": "cored"
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: looks like this file didn't actually change, so ideally it shouldn't appear in the diff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, reverted.

return txbuilder.MissingFieldsError("output_id")
}
outid = *a.OutputID
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should collapse the error cases into one:

if a.OutputID == nil && (a.TxHash == nil || a.TxOut == nil) {
    return txbuilder.MissingFieldsError("output_id")
}
if a.OutputID == nil {
    outid = *a.OutputID
} else {
    outid = bc.ComputeOutputID(*a.TxHash, *a.TxOut)
}

If someone is using the old interface, and using it incorrectly, such that they would have got a MissingFieldsError for transaction_id or position, they'll have to update their code anyway, at which point they might as well start using output_id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's harder to parse what's being meant here... But I get the point, how about this flat list of 3 options: 36ad7c0

@@ -51,8 +51,14 @@ func (m *Manager) indexAnnotatedAccount(ctx context.Context, a *Account) error {
})
}

type output struct {
type outputWithOutpoint struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the names output and outputWithOutpoint side by side, it seems like outputWithOutpoint would have a superset of the field of output, but actually it's the opposite. How do you feel about something like accountOutput and output (respectively) instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Renamed to rawOutput and accountOutput: 31e2a03

@@ -107,4 +107,12 @@ var migrations = []migration{
{Name: "2017-01-19.0.asset.drop-mutable-flag.sql", SQL: `
ALTER TABLE assets DROP COLUMN definition_mutable;
`},
{Name: "2017-01-20.0.core.add-output-id-to-outputs.sql", SQL: `
ALTER TABLE annotated_outputs ADD COLUMN output_id bytea NOT NULL;
ALTER TABLE ONLY annotated_outputs ADD CONSTRAINT annotated_outputs_unique_output_id UNIQUE (output_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines are equivalent to:

ALTER TABLE annotated_outputs
	ADD COLUMN output_id bytea UNIQUE NOT NULL;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Addressed: 3e96e6b

ALTER TABLE account_utxos ADD COLUMN output_id bytea NOT NULL;
ALTER TABLE ONLY account_utxos ADD CONSTRAINT account_utxos_unique_output_id UNIQUE (output_id);
ALTER TABLE account_utxos ADD COLUMN unspent_id bytea NOT NULL;
ALTER TABLE ONLY account_utxos ADD CONSTRAINT account_utxos_unique_unspent_id UNIQUE (unspent_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do more than one thing in a single ALTER TABLE ... statement. Consider this:

ALTER TABLE account_utxos
	ADD COLUMN output_id bytea UNIQUE NOT NULL,
	ADD COLUMN unspent_id bytea UNIQUE NOT NULL;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Addressed: 3e96e6b

)

// OutputID identifies previous transaction output in transaction inputs.
type OutputID Hash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a case where embedding the other type would be preferable, since we want to keep all the methods.

// OutputID identifies previous transaction output in transaction inputs.
type OutputID struct{ Hash }

// UnspentID identifies and commits to unspent output.
type UnspentID struct{ Hash }

If we did that, converting from a Hash to OutputID and back would look like:

outputID = OutputID{hash}
hash = outputID.Hash

The memory layout is the same either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I didn't know about this feature. I've also tried to use outid.Bytes() instead of tmp := outid.Hash; tmp[:] when I need a byteslice. Is that the right way? 9042b59

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also tried to use outid.Bytes() instead of tmp := outid.Hash; tmp[:] when I need a byteslice. Is that the right way?

I see nothing wrong with writing outid.Hash[:], but I guess when the expression outid is actually a function call, it can't be written without a temporary variable. The Bytes method seems reasonable to me! But I wouldn't get dogmatic about using it everywhere; if outid.Hash[:] in one line is also an option, that's fine too IMO.

}

// WriteTo writes p to w.
// It assumes w has sticky errors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence isn't true; it should be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

o.WriteTo(w)
return b.Bytes()
func OutputKey(o bc.UnspentID) (bkey []byte) {
// TODO(oleg): check if we no longer need this buffer writing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, we don't need it. The buffer writing was just to serialize the outpoint. Since an UnspentID's internal representation is the actual bytes we want to put in the tree, we can just return them as you're doing here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, maybe we should inline (get rid of) this whole function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. And added convenience Bytes()->[]byte methods to avoid creating local var all the time to do a slice. Is it a good idea? 7327600

@@ -193,7 +193,7 @@ func opIndex(vm *virtualMachine) error {
return vm.pushInt64(int64(vm.inputIndex), true)
}

func opOutpoint(vm *virtualMachine) error {
func opOutputid(vm *virtualMachine) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: opOutputID

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@kr
Copy link
Contributor

kr commented Jan 22, 2017

wording should probably be tweaked

Actually this might just consist of replacing the one word "proposal" with "solution" or something along those lines.

@oleganza
Copy link
Contributor Author

Replaced by #421

@oleganza oleganza closed this Jan 23, 2017
iampogo pushed a commit that referenced this pull request Jan 26, 2017
**Problem**

Outpoint is a variable-length structure `<txid>:<index>` which is 33-40
bytes long (33 bytes for most transactions). It is used by transaction
inputs to identify exact output in the UTXO set ("Assets Merkle Tree"). The
tree leafs contain `SHA3(output)` which allows save space and requires
transactions to carry redundant copies of spent outputs to perform
validation (otherwise nodes would have to store the entire outputs instead
of their hashes — over 2x more data, and the ratio is much bigger in
protocol 2). Also, for HSM-friendliness the TXSIGHASH must contain a
redundant output's hash: `SHA3(txid || input index || SHA3(output))`.

**Solution:**

We define two new terms:

* `OutputID = SHA3(TxHash || OutputIndex)`
* `UnspentID = SHA3(OutputID || SHA3(OutputCommitment))`

How are these used:

1. Transaction input contains **OutputID** to identify the output being
spent. This is a unique identifier of the output.
2. Transaction input uses **second serialization flag** to indicate if it
contains the entire previous Output Commitment, or its hash (instead of
empty place).
3. UTXO set becomes a **proper set** containing **UnspentIDs** instead of
`{Outpoint -> SHA3(OutputCommitment)}`.

When a node validates a transaction, it computes `UnspentID` using provided
`OutputID` and previous `OutputCommitment`. If the given unspent ID is
present in the UTXO set, then previous output is proved to be both
authentic and available for spending.

**Upsides:**

1. The outputID is constant-size and shorter: 32 bytes instead of 33-40
bytes. This simplifies merkle tree design, transaction data structure and
all pieces of software that need to handle outpoints.
2. All outputs (via unspentIDs) in the transaction are randomized across
the Assets Merkle Tree instead of being crammed inside a common subpath
`<txid>||...`.
3. Inputs automatically commit directly to the spent outputs, so TXSIGHASH
does not need to do that and can be simplified to `SHA3(txid || input
index)`. HSM is able to verify which output this input commits to without
having access to the entire parent transaction.
4. We keep the term _outpoint_ to mean a pair `(txid, index)`, but is
internal to Chain Core to support random access to UTXOs. Validation
protocol no longer uses outpoints.
5. UTXO takes 2x less RAM because it only contains unpent IDs (32 bytes)
instead of a key-value pair (64+ bytes).
6. When we get to _tx entries_ design, we'll generalize the idea of
OutputID to EntryID, so that any entry can have a unique identifier.

**Downsides:**

1. OutputID no longer indicates the transaction ID which makes it
impossible to navigate the chain of transactions without also having a
mapping `outpoint -> txid:index`. UTXO tree is not enough as it's only
reflecting the latest state of the chain and throws away spent outpoints.
Note that in order to navigate the transactions in practice one still needs
the mapping `txid -> tx`, so maintaining one more index might not be a
significant increase in complexity. Chain is doing this indexing already
and we keep that mapping.
2. Chain Core no longer returns (txid,position) pair for annotated txinputs
(called `spent_output:{transaction_id:String,position:Int}`), but instead
returns output_id (`spent_output_id:String`). To maintain full
compatibility, we'd need to make an additional request to locate the
previous output's txid and position, but I'm not sure any application
actually relies on such historical data. For spending (locating unspents),
we fully maintain compatibility with clients using (txid,position) pairs.

This is a part of a package of breaking changes in P1: #239

See previous reviews: #417


Closes #421
@oleganza oleganza deleted the protocol-outputid branch January 26, 2017 04:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants