
[WIP] EIP-7594: PeerDAS protocol #3574

Merged: 29 commits on Apr 4, 2024
Changes shown are from 16 of the 29 commits.

Commits (29)
d6a37ec
Copied from Danny's ethresearch post
hwwhww Nov 9, 2023
93dddd1
wip
hwwhww Nov 14, 2023
504b4f9
Migrating to latest crypto APIs
hwwhww Jan 11, 2024
696d443
Merge branch 'dev' into peer-das
hwwhww Jan 19, 2024
2cc7c87
Fix conflict
hwwhww Jan 19, 2024
665e6fa
Add `RowIndex`, `ColumnIndex` custom types in crypto doc
hwwhww Jan 19, 2024
9553d54
fix typo
hwwhww Jan 19, 2024
a72ece8
Apply suggestions from code review
hwwhww Jan 19, 2024
65be5b0
Make `CUSTODY_REQUIREMENT` unit be subnets; move some depended helper…
hwwhww Jan 19, 2024
55db861
Apply suggestions from code review
hwwhww Jan 20, 2024
4477cc6
Fix column computation
hwwhww Jan 20, 2024
56e6a98
`verify_data_column_sidecar_kzg_proof` -> `verify_data_column_sidecar…
hwwhww Jan 20, 2024
edeef07
toc
hwwhww Jan 28, 2024
b2a4657
Merge branch 'peer-das-req-subnet-count' into peer-das
hwwhww Jan 29, 2024
7aab577
Merge branch 'dev' into peer-das
hwwhww Jan 29, 2024
170dae5
Apply suggestions from code review
hwwhww Jan 29, 2024
547460c
Apply PR feedback
hwwhww Jan 30, 2024
d23452d
Deprecate `blob_sidecar_{subnet_id}`
hwwhww Jan 31, 2024
87e9702
Fix `DataColumnSidecarsByRoot`
hwwhww Jan 31, 2024
428c166
Apply suggestions from code review
hwwhww Feb 1, 2024
c47d5f3
Add `recover_matrix` and remove unused `FlatExtendedMatrix` type
hwwhww Feb 1, 2024
91dbbb3
Implement `compute_extended_matrix`
hwwhww Feb 1, 2024
e7c0d5f
Update specs/_features/eip7594/das-core.md
hwwhww Feb 2, 2024
8150f76
Apply @cskiraly's suggestion
hwwhww Feb 20, 2024
bb33f90
Change List length of `DataColumn` from `MAX_BLOBS_PER_BLOCK` to `MAX…
hwwhww Feb 20, 2024
1acb1ff
minor arrange
hwwhww Feb 20, 2024
cebf78a
Apply PR feedback
hwwhww Feb 27, 2024
8728561
Merge branch 'dev' into peer-das
hwwhww Apr 4, 2024
5535e6a
fix conflict
hwwhww Apr 4, 2024
6 changes: 4 additions & 2 deletions configs/mainnet.yaml
@@ -59,6 +59,9 @@ EIP7002_FORK_EPOCH: 18446744073709551615
# WHISK
WHISK_FORK_VERSION: 0x06000000 # temporary stub
WHISK_FORK_EPOCH: 18446744073709551615
# EIP7594
EIP7594_FORK_VERSION: 0x06000001
EIP7594_FORK_EPOCH: 18446744073709551615


# Time parameters
@@ -156,5 +159,4 @@ WHISK_EPOCHS_PER_SHUFFLING_PHASE: 256
WHISK_PROPOSER_SELECTION_GAP: 2

# EIP7594
EIP7594_FORK_VERSION: 0x06000001
EIP7594_FORK_EPOCH: 18446744073709551615
DATA_COLUMN_SIDECAR_SUBNET_COUNT: 32
Contributor:
For initial parametrization, I'd default to these values being equal. I'd only deviate from `len(columns) == len(subnets)` if simulation/analysis shows we need to in order to support whatever we are targeting.

7 changes: 4 additions & 3 deletions configs/minimal.yaml
@@ -58,7 +58,9 @@ EIP7002_FORK_EPOCH: 18446744073709551615
# WHISK
WHISK_FORK_VERSION: 0x06000001
WHISK_FORK_EPOCH: 18446744073709551615

# EIP7594
EIP7594_FORK_VERSION: 0x06000001
EIP7594_FORK_EPOCH: 18446744073709551615

# Time parameters
# ---------------------------------------------------------------
@@ -155,5 +157,4 @@ WHISK_EPOCHS_PER_SHUFFLING_PHASE: 4
WHISK_PROPOSER_SELECTION_GAP: 1

# EIP7594
EIP7594_FORK_VERSION: 0x06000001
EIP7594_FORK_EPOCH: 18446744073709551615
DATA_COLUMN_SIDECAR_SUBNET_COUNT: 32
4 changes: 4 additions & 0 deletions presets/mainnet/eip7594.yaml
@@ -4,3 +4,7 @@
# ---------------------------------------------------------------
# `uint64(2**6)` (= 64)
FIELD_ELEMENTS_PER_CELL: 64
# uint64(floorlog2(get_generalized_index(BeaconBlockBody, 'blob_kzg_commitments'))
KZG_COMMITMENTS_INCLUSION_PROOF_DEPTH: 4
# `uint64((FIELD_ELEMENTS_PER_BLOB * 2) // FIELD_ELEMENTS_PER_CELL)` (= 128)
NUMBER_OF_COLUMNS: 128
4 changes: 4 additions & 0 deletions presets/minimal/eip7594.yaml
@@ -4,3 +4,7 @@
# ---------------------------------------------------------------
# `uint64(2**6)` (= 64)
FIELD_ELEMENTS_PER_CELL: 64
# uint64(floorlog2(get_generalized_index(BeaconBlockBody, 'blob_kzg_commitments'))
KZG_COMMITMENTS_INCLUSION_PROOF_DEPTH: 4
# `uint64((FIELD_ELEMENTS_PER_BLOB * 2) // FIELD_ELEMENTS_PER_CELL)` (= 128)
NUMBER_OF_COLUMNS: 128
5 changes: 2 additions & 3 deletions pysetup/helpers.py
@@ -208,8 +208,7 @@ def dependency_order_class_objects(objects: Dict[str, str], custom_types: Dict[s
for item in [dep, key] + key_list[key_list.index(dep)+1:]:
objects[item] = objects.pop(item)


def combine_ssz_objects(old_objects: Dict[str, str], new_objects: Dict[str, str], custom_types) -> Dict[str, str]:
def combine_ssz_objects(old_objects: Dict[str, str], new_objects: Dict[str, str]) -> Dict[str, str]:
"""
Takes in old spec and new spec ssz objects, combines them,
and returns the newer versions of the objects in dependency order.
@@ -231,7 +230,7 @@ def combine_spec_objects(spec0: SpecObject, spec1: SpecObject) -> SpecObject:
config_vars = combine_dicts(spec0.config_vars, spec1.config_vars)
ssz_dep_constants = combine_dicts(spec0.ssz_dep_constants, spec1.ssz_dep_constants)
func_dep_presets = combine_dicts(spec0.func_dep_presets, spec1.func_dep_presets)
ssz_objects = combine_ssz_objects(spec0.ssz_objects, spec1.ssz_objects, custom_types)
ssz_objects = combine_ssz_objects(spec0.ssz_objects, spec1.ssz_objects)
dataclasses = combine_dicts(spec0.dataclasses, spec1.dataclasses)
return SpecObject(
functions=functions,
17 changes: 16 additions & 1 deletion pysetup/md_doc_paths.py
@@ -35,6 +35,11 @@
BELLATRIX: "sync/optimistic.md"
}

DEFAULT_ORDER = (
"beacon-chain",
"polynomial-commitments",
)


def is_post_fork(a, b) -> bool:
"""
@@ -62,15 +67,25 @@ def get_fork_directory(fork):
raise FileNotFoundError(f"No directory found for fork: {fork}")


def sort_key(s):
for index, key in enumerate(DEFAULT_ORDER):
if key in s:
return (index, s)
return (len(DEFAULT_ORDER), s)


def get_md_doc_paths(spec_fork: str) -> str:
md_doc_paths = ""

for fork in ALL_FORKS:
if is_post_fork(spec_fork, fork):
# Append all files in fork directory recursively
for root, dirs, files in os.walk(get_fork_directory(fork)):
for root, _, files in os.walk(get_fork_directory(fork)):
filepaths = []
for filename in files:
filepath = os.path.join(root, filename)
filepaths.append(filepath)
for filepath in sorted(filepaths, key=sort_key):
if filepath.endswith('.md') and filepath not in IGNORE_SPEC_FILES:
md_doc_paths += filepath + "\n"
# Append extra files if any
9 changes: 8 additions & 1 deletion pysetup/spec_builders/eip7594.py
@@ -14,7 +14,14 @@ def imports(cls, preset_name: str):
'''

@classmethod
def hardcoded_custom_type_dep_constants(cls, spec_object) -> Dict[str, str]:
def hardcoded_custom_type_dep_constants(cls, spec_object) -> str:
return {
'FIELD_ELEMENTS_PER_CELL': spec_object.preset_vars['FIELD_ELEMENTS_PER_CELL'].value,
'NUMBER_OF_COLUMNS': spec_object.preset_vars['NUMBER_OF_COLUMNS'].value,
}

@classmethod
def hardcoded_func_dep_presets(cls, spec_object) -> Dict[str, str]:
return {
'KZG_COMMITMENTS_INCLUSION_PROOF_DEPTH': spec_object.preset_vars['KZG_COMMITMENTS_INCLUSION_PROOF_DEPTH'].value,
}
262 changes: 262 additions & 0 deletions specs/_features/eip7594/das-core.md
@@ -0,0 +1,262 @@
# EIP-7594 -- Data Availability Sampling Core

**Notice**: This document is a work-in-progress for researchers and implementers.

## Table of contents

<!-- TOC -->
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Custom types](#custom-types)
- [Configuration](#configuration)
- [Data size](#data-size)
- [Networking](#networking)
- [Custody setting](#custody-setting)
- [Containers](#containers)
- [`DataColumnSidecar`](#datacolumnsidecar)
- [Helper functions](#helper-functions)
- [`get_custody_columns`](#get_custody_columns)
- [`compute_extended_data`](#compute_extended_data)
- [`compute_extended_matrix`](#compute_extended_matrix)
- [`get_data_column_sidecars`](#get_data_column_sidecars)
- [Custody](#custody)
- [Custody requirement](#custody-requirement)
- [Public, deterministic selection](#public-deterministic-selection)
- [Peer discovery](#peer-discovery)
- [Extended data](#extended-data)
- [Column gossip](#column-gossip)
- [Parameters](#parameters)
- [Reconstruction and cross-seeding](#reconstruction-and-cross-seeding)
- [Peer sampling](#peer-sampling)
- [Peer scoring](#peer-scoring)
- [DAS providers](#das-providers)
- [A note on fork choice](#a-note-on-fork-choice)
- [FAQs](#faqs)
- [Row (blob) custody](#row-blob-custody)
- [Subnet stability](#subnet-stability)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->
<!-- /TOC -->

## Custom types

We define the following Python custom types for type hinting and readability:

| Name | SSZ equivalent | Description |
| - | - | - |
| `DataColumn` | `List[Cell, MAX_BLOBS_PER_BLOCK]` | The data of each column in EIP-7594 |
| `ExtendedMatrix` | `List[Cell, MAX_BLOBS_PER_BLOCK * NUMBER_OF_COLUMNS]` | The full data of one-dimensional erasure coding extended blobs (in row major format) |
| `FlatExtendedMatrix` | `List[BLSFieldElement, MAX_BLOBS_PER_BLOCK * FIELD_ELEMENTS_PER_BLOB * NUMBER_OF_COLUMNS]` | The flattened format of `ExtendedMatrix` |

## Configuration

### Data size

| Name | Value | Description |
| - | - | - |
| `NUMBER_OF_COLUMNS` | `uint64((FIELD_ELEMENTS_PER_BLOB * 2) // FIELD_ELEMENTS_PER_CELL)` (= 128) | Number of columns in the extended data matrix. |
Contributor:
Suggested change:
- | `NUMBER_OF_COLUMNS` | `uint64((FIELD_ELEMENTS_PER_BLOB * 2) // FIELD_ELEMENTS_PER_CELL)` (= 128) | Number of columns in the extended data matrix. |
+ | `NUMBER_OF_COLUMNS` | `uint64(FIELD_ELEMENTS_PER_EXT_BLOB // FIELD_ELEMENTS_PER_CELL)` (= 128) | Number of columns in the extended data matrix. |

Low priority, feel free to merge without this.


### Networking

| Name | Value | Description |
| - | - | - |
| `DATA_COLUMN_SIDECAR_SUBNET_COUNT` | `32` | The number of data column sidecar subnets used in the gossipsub protocol |

### Custody setting

| Name | Value | Description |
| - | - | - |
| `SAMPLES_PER_SLOT` | `8` | Number of `DataColumn` random samples a node queries per slot |
Contributor:
Does it make sense to define a config/preset value for the probability of the data being available/unavailable and derive SAMPLES_PER_SLOT from that number? I'm not clear on how we arrived at this number.

Contributor:
Yes, this should be related to probability; however, I would prefer to have this as a config/preset constant. The probability analysis vs. sample size could be attached to the specs.

Contributor:
It would be nice to have a justification for why this number is optimal and why no more/less is needed.

Member:
> Does it make sense to define a config/preset value for the probability of the data being available/unavailable and derive SAMPLES_PER_SLOT from that number? I'm not clear on how we arrived at this number.

If >50% of the columns are missing, the probability that `is_data_available` returns true is at most 0.5^SAMPLES_PER_SLOT.

We have 8 because then an expected at most 1/256 of the validators would have `is_data_available` return true when it should be false.
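
A quick numeric check of that bound (plain Python, not spec code; it only restates the arithmetic in the comment above):

```python
# If the adversary publishes at most 50% of the columns, each of the
# SAMPLES_PER_SLOT independent, uniformly random column queries succeeds
# with probability at most 0.5, so a false "available" verdict has
# probability at most 0.5 ** SAMPLES_PER_SLOT.
SAMPLES_PER_SLOT = 8

false_available_bound = 0.5 ** SAMPLES_PER_SLOT
print(false_available_bound)  # 0.00390625 == 1/256
```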

Contributor:
I don't think that SAMPLES_PER_SLOT should be as low as 8. Sampling isn't the bandwidth bottleneck (distribution through subnets is), and 8 samples leaves a very-far-from-negligible probability of a full node experiencing a safety violation (chain looks available and confirmed, but it's not), even without a targeted attack.
I would advocate for SAMPLES_PER_SLOT = 16.

Contributor:
Just brainstorming about resiliency without adding significant security risk, some alternatives came to mind:

- Is there any downside to collecting samples from a fixed number of peers? (e.g. getting 16 samples from 8 peers, instead of potentially up to 16 peers) I think the current proposal is more flexible (in terms of param tweaking) and covers sampling against more peers than this; however, I think there's a tradeoff on robustness, i.e. successfully getting all samples from all 16 peers could be harder than getting them from 8 peers.
- Is there any security risk if we randomly select 8 connected peers that are assigned different columns, rather than selecting 8 columns and then looking for the peers if we aren't already connected to them?

Contributor:
> Is there any downside to collecting samples from a fixed number of peers? (e.g. getting 16 samples from 8 peers, instead of potentially up to 16 peers) I think the current proposal is more flexible (in terms of param tweaking) and covers sampling against more peers than this; however, I think there's a tradeoff on robustness, i.e. successfully getting all samples from all 16 peers could be harder than getting them from 8 peers.

It depends on what kind of security you are concerned with. From the network's perspective, it makes no difference how you get your samples; it is only important that you do. Everyone could get them from a single source and it wouldn't make any difference.

From an individual node's perspective, it does matter somewhat, because getting them from multiple nodes (assuming they are actually unrelated) would make you less susceptible to targeted release of data. Without getting into details too much, I don't think this is something to be particularly concerned about, and I don't think there is much of a reason to try hard to get each sample from a different peer.

> Is there any security risk if we randomly select 8 connected peers that are assigned different columns, rather than selecting 8 columns and then looking for the peers if we aren't already connected to them?

Yes, there is. The columns you choose must be random. You should not treat the node ids of your peers as random.

| `CUSTODY_REQUIREMENT` | `1` | Minimum number of subnets an honest node custodies and serves samples from |
Contributor:
As discussed with @dankrad, I think this should be higher than 1, because it is the most important parameter in determining how strong the pre-sampling guarantees we get are, i.e., how hard it is for an unavailable block to get a majority of the votes and become canonical, even when employing the trailing fork-choice. Ideally, we would want to ensure that only a small fraction of the validators can be tricked into voting for an unavailable block.

Imo, a suitable value would be >= 6. We might then also want to increase the number of subnets to 64, to keep the bandwidth consumption low. For reference, (CUSTODY_REQUIREMENT = 6, DATA_COLUMN_SIDECAR_SUBNET_COUNT = 64) and everything else unchanged means that the total bandwidth consumption with MAX_BLOBS_PER_BLOCK = 16 is 60% of 4844. With 32 blobs, it's 120% of 4844. It can also be lowered by increasing the subnet count further if needed.

Contributor:
Note that this is currently subnet custody, which covers multiple columns depending on the parametrization (currently 4, if the parametrization is 128 columns across 32 subnets (128/32)).

I find the way we are doing this parametrization a bit confusing or at least difficult to parse (as seen by the above misunderstanding)

The name should at least say explicitly if it's COLUMN_ or SUBNET_CUSTODY_REQUIREMENT

Contributor (@fradamt, Mar 29, 2024):
I agree that the name should be SUBNET_CUSTODY_REQUIREMENT. Still, custodying a single subnet is not the same as custodying 4 random columns, and in fact gives you almost no pre-sampling guarantees.

For example, someone can make 49% of the subnets available (not making a subnet available here means not publishing any of the columns for that subnet), and 49% of validators will still succeed in downloading all the columns they need. If, on the other hand, the custody requirement had been 4 random columns, making 49% of columns available would only trick 1/16 of the validators. This is also the case if validators custody 4 subnets.

Another way to think about it is: whatever unit we choose for custody ends up being the relevant sampling unit for the purpose of pre-peer-sampling security. If we choose subnet custody, we are doing subnet sampling during distribution and column sampling later. If we want subnet sampling to be meaningfully secure by itself, we need the subnet custody requirement to be high enough.
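
A quick numeric check of the 49% example above (plain Python; the numbers follow the comment, not the spec):

```python
# Rough comparison of how many validators an adversary can "trick" (they see
# everything they custody) when only a fraction p of the data is published.
p = 0.49  # fraction of subnets (or columns) the adversary publishes

# Custody = 1 whole subnet: a validator is tricked iff its single subnet
# happens to be among the published ones.
tricked_subnet_custody = p  # ~49% of validators

# Custody = 4 independently random columns: all 4 must be published.
tricked_column_custody = p ** 4  # ~0.058, roughly 1 in 16-17 validators

print(tricked_subnet_custody, tricked_column_custody)
```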

| `TARGET_NUMBER_OF_PEERS` | `70` | Suggested minimum peer count |
Contributor:
Suggested change:
- | `TARGET_NUMBER_OF_PEERS` | `70` | Suggested minimum peer count |
+ | `TARGET_NUMBER_OF_PEERS` | `70` | Suggested minimum peer count to support peer sampling |


### Containers

#### `DataColumnSidecar`

```python
class DataColumnSidecar(Container):
    index: ColumnIndex  # Index of column in extended matrix
    column: DataColumn
    kzg_commitments: List[KZGCommitment, MAX_BLOB_COMMITMENTS_PER_BLOCK]
    kzg_proofs: List[KZGProof, MAX_BLOB_COMMITMENTS_PER_BLOCK]
    signed_block_header: SignedBeaconBlockHeader
    kzg_commitments_inclusion_proof: Vector[Bytes32, KZG_COMMITMENTS_INCLUSION_PROOF_DEPTH]
```

### Helper functions

#### `get_custody_columns`

```python
def get_custody_columns(node_id: NodeID, custody_subnet_count: uint64) -> Sequence[ColumnIndex]:
Contributor:
We currently have a function that maps node-id -> subnets and I think we plan to also do node-id -> sync-committees etc.
@Nashatyrev had an idea called Network Sharding, which just generalizes this concept.

The idea is we just have a single function that maps node-id -> shard.
Then we assign attestation subnets, sync-committees and now data columns to shards.

The benefits here are:

- We have a single function that does the mapping and we don't have to make three different forms of it, making the implementation easier and easier to debug
- If we implement it like in "Modify long-lived attestation subnet calculation to allow pre-determined calculation" (#3545), then we make it faster for us to discover peers on columns/subnets/sync committees
- I think the specs might simplify, because this function would just be a mapping from column -> shard.

If others agree, I can draft a spec PR that implements this idea, perhaps it would be more explicit that way.

Contributor:
If there are 64 shards but we want to split the blob data in 32 parts, do we map each of the 32 parts to two shards?

Contributor:
Yep. The number of shards would be the maximum granularity of splitting nodes. Like we could have 128 shards, in case we wanted to divide the network into 128 and assign subnets etc. I think 64 makes sense at the moment tho, because we currently subscribe to two subnets per node, so 2 shards there also.

However if we want to have 128 data columns and split this up equally, we might want to have 128 shards, then subnets would be mapped to 4 shards (or vice versa)

Contributor Author:
> The idea is we just have a single function that maps node-id -> shard

Can it be extended to consider custody_subnet_count? i.e., to map node-id to multiple shards.

> assign attestation subnets, sync-committees and now data columns to shards

Would it separate the long-lived subscriptions and the mid-term-lived subscriptions? We may add a PeerDAS assignment period into the deterministic function.

> I can draft a spec PR that implements this idea, perhaps it would be more explicit that way.

I think it would be good to see a PR for it. One minor suggestion: I like the "Network Sharding" naming. However, it might be ambiguous with the data sharding we're mostly talking about.

Member:
For reference, the ethresearch post of network sharding is here https://ethresear.ch/t/network-shards-concept-idea/17417

Member:
> If others agree, I can draft a spec PR that implements this idea, perhaps it would be more explicit that way.

@AgeManning please feel free to ping me if you want my assistance or maybe want to delegate that to me

Contributor:
> Can it be extended to consider custody_subnet_count? i.e., to map node-id to multiple shards.

Yep, that should be easy to do (if I understand your question).

> Would it separate the long-lived subscriptions and the mid-term-lived subscriptions? We may add a PeerDAS assignment period into the deterministic function.

This is a good point I'd need to think about. The period of rotation might need to be made a parameter of the function. This would mean that nodes will be on different shards based on this parameter. I guess we shouldn't force rotation periods for different use-cases to be the same.
In this case, I guess the benefit still is that we have a single function that handles the mappings in a discovery-efficient way.

> @AgeManning please feel free to ping me if you want my assistance or maybe want to delegate that to me

Thanks. I have been traveling and have a conference next week, but I will endeavor to get a PR up for this that we can work from. Feel free to make one earlier if you have the time though.

Contributor:
I'm in support of moving in this direction. Open a PR against this one. We'll likely get this merged in the meantime and can review the mapping proposal in isolation.

Contributor:
Apologies for the delay. Here is the proposal: #3623

Member:
> Here is the proposal: #3623

Great! Thanks @AgeManning

If this PR is adopted, then I'd suggest making DATA_COLUMN_SIDECAR_SUBNET_COUNT equal to NUMBER_OF_COLUMNS, i.e. one subnet per column. (That should potentially work well even if the number of columns, and thus the number of subnets, is on the order of 2048.)

The above would allow a push sampling approach: a sampling node could be connected to all (or a subset) of the network shards without subscribing to any column subnets. Some short (sub-slot) period prior to sampling, the node subscribes to the selected subnets and immediately unsubscribes upon receiving the sample. This push approach would:

- avoid extra throughput for sampling: a node receives only the required samples (columns)
- almost preserve unlinkability of queries: the gossip subscribe request could be issued as shortly before the expected sample as a second or two
- let a node always fall back to the pull method (req/resp data_column_sidecars_by_root) if it fails to get a sample via push

    assert custody_subnet_count <= DATA_COLUMN_SIDECAR_SUBNET_COUNT

    subnet_ids = []
    i = 0
    while len(subnet_ids) < custody_subnet_count:
        subnet_id = (
            bytes_to_uint64(hash(uint_to_bytes(uint64(node_id + i)))[0:8])
Contributor:
Just sanity checking that this 256-bit to 64-bit casting is well-defined in the spec (or at least done somewhere else).

My initial check shows that we are just using the ssz-typing lib here, which will be opinionated but not in a way that is actually defined in the SSZ spec (no mention of casting, as it's a serialization spec).

Contributor:
Why cast this down to 64-bit? Why not just pass the 256-bit value to hash?

Contributor:
Actually, we're doing it under the hood in places with bytes_to_uint64, so I guess it's well defined.

Member:
> Why cast this down to 64-bit? Why not just pass the 256-bit value to hash?

I think it's because you need to do the `+ i` thing. You need to cast it to 64-bit first to do the math. I guess doing the math in uint256 is not supported by all programming languages.

However, I think it should be `uint64(node_id) + i` rather than `uint64(node_id + i)`.

            % DATA_COLUMN_SIDECAR_SUBNET_COUNT
        )
        if subnet_id not in subnet_ids:
            subnet_ids.append(subnet_id)
        i += 1
    assert len(subnet_ids) == len(set(subnet_ids))
Contributor:
This seems like it shouldn't be here. It's an assertion we should have in our tests, but it's not a possible condition here given the `if subnet_id not in subnet_ids` condition above for the set append, and we generally don't have non-triggerable asserts in our functions.

Member:
Agree with this. This assertion can never fail; the intention is only to catch a bug.


    columns_per_subnet = NUMBER_OF_COLUMNS // DATA_COLUMN_SIDECAR_SUBNET_COUNT
    return [
        ColumnIndex(DATA_COLUMN_SIDECAR_SUBNET_COUNT * i + subnet_id)
        for i in range(columns_per_subnet)
        for subnet_id in subnet_ids
Contributor:
The double list comprehension is a bit too Pythonic.

Some of the difficulty in parsing this relates to the double list comprehension, but I think more of it comes from the confusing column vs. subnet custody.

    ]
```
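
For readability, here is the same `get_custody_columns` logic assembled into a self-contained, plain-Python sketch (not spec code): the spec's `hash` is modelled as SHA-256, `uint_to_bytes`/`bytes_to_uint64` as little-endian conversions, `uint64(node_id + i)` as truncation modulo 2**64, and the constants use the mainnet values from this PR.

```python
import hashlib

DATA_COLUMN_SIDECAR_SUBNET_COUNT = 32
NUMBER_OF_COLUMNS = 128


def get_custody_columns(node_id: int, custody_subnet_count: int) -> list:
    assert custody_subnet_count <= DATA_COLUMN_SIDECAR_SUBNET_COUNT

    subnet_ids = []
    i = 0
    while len(subnet_ids) < custody_subnet_count:
        # Hash the (node_id + i) counter (truncated to 64 bits, little-endian)
        # and take the first 8 bytes of the digest as a uint64.
        preimage = ((node_id + i) % 2**64).to_bytes(8, "little")
        digest = hashlib.sha256(preimage).digest()
        subnet_id = int.from_bytes(digest[0:8], "little") % DATA_COLUMN_SIDECAR_SUBNET_COUNT
        if subnet_id not in subnet_ids:
            subnet_ids.append(subnet_id)
        i += 1

    # Each custodied subnet maps to NUMBER_OF_COLUMNS // SUBNET_COUNT columns.
    columns_per_subnet = NUMBER_OF_COLUMNS // DATA_COLUMN_SIDECAR_SUBNET_COUNT
    return [
        DATA_COLUMN_SIDECAR_SUBNET_COUNT * i + subnet_id
        for i in range(columns_per_subnet)
        for subnet_id in subnet_ids
    ]
```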

#### `compute_extended_data`

```python
def compute_extended_data(data: Sequence[BLSFieldElement]) -> Sequence[BLSFieldElement]:
    # TODO
    # pylint: disable=unused-argument
    ...
```
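
`compute_extended_data` is still a TODO in this PR. Purely as a toy illustration of what a one-dimensional erasure-coding extension provides (this is not the spec's KZG-compatible construction, which works over the BLS scalar field with roots-of-unity evaluation domains; the small prime modulus and evaluation points here are invented for the example):

```python
MODULUS = 65537  # small prime stand-in for the real scalar field modulus


def _interpolate_eval(xs, ys, x):
    # Lagrange-interpolate the unique degree < len(xs) polynomial through
    # (xs, ys) and evaluate it at x, all modulo MODULUS.
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (x - xj) % MODULUS
                den = den * (xi - xj) % MODULUS
        total = (total + yi * num * pow(den, -1, MODULUS)) % MODULUS
    return total


def compute_extended_data_toy(data):
    # Treat the n data points as evaluations of a degree < n polynomial and
    # output its evaluations on 2n points; any n of the 2n outputs suffice
    # to recover the original data.
    n = len(data)
    xs = list(range(n))
    return [_interpolate_eval(xs, data, x) for x in range(2 * n)]


data = [3, 1, 4, 1]
ext = compute_extended_data_toy(data)
assert ext[:4] == data  # the first half is the original data
# The second half alone is enough to recover the original four values:
recovered = [_interpolate_eval(list(range(4, 8)), ext[4:], x) for x in range(4)]
assert recovered == data
```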

#### `compute_extended_matrix`

```python
def compute_extended_matrix(blobs: Sequence[Blob]) -> FlatExtendedMatrix:
    matrix = [compute_extended_data(blob) for blob in blobs]
    return FlatExtendedMatrix(matrix)
```

#### `get_data_column_sidecars`

```python
def get_data_column_sidecars(signed_block: SignedBeaconBlock,
                             blobs: Sequence[Blob]) -> Sequence[DataColumnSidecar]:
    signed_block_header = compute_signed_block_header(signed_block)
    block = signed_block.message
    kzg_commitments_inclusion_proof = compute_merkle_proof(
        block.body,
        get_generalized_index(BeaconBlockBody, 'blob_kzg_commitments'),
    )
    cells_and_proofs = [compute_cells_and_proofs(blob) for blob in blobs]
    blob_count = len(blobs)
    cells = [cells_and_proofs[i][0] for i in range(blob_count)]
    proofs = [cells_and_proofs[i][1] for i in range(blob_count)]
    sidecars = []
    for column_index in range(NUMBER_OF_COLUMNS):
        column = DataColumn([cells[row_index][column_index]
Contributor:
I don't think we usually use this syntax; we generally prefer:

var = Stuff([
    value[i]
    for i in list
])

                             for row_index in range(blob_count)])
        kzg_proof_of_column = [proofs[row_index][column_index]
                               for row_index in range(blob_count)]
        sidecars.append(DataColumnSidecar(
            index=column_index,
            column=column,
            kzg_commitments=block.body.blob_kzg_commitments,
            kzg_proofs=kzg_proof_of_column,
            signed_block_header=signed_block_header,
            kzg_commitments_inclusion_proof=kzg_commitments_inclusion_proof,
        ))
    return sidecars
```

## Custody

### Custody requirement

Each node downloads and custodies a minimum of `CUSTODY_REQUIREMENT` subnets per slot. The particular columns that the node is required to custody are selected pseudo-randomly (more on this below).

A node *may* choose to custody and serve more than the minimum honesty requirement. Such a node explicitly advertises a number greater than `CUSTODY_REQUIREMENT` via the peer discovery mechanism -- for example, in its ENR (e.g. `custody_lines: 4` if the node custodies `4` subnets each slot) -- up to `DATA_COLUMN_SIDECAR_SUBNET_COUNT` (i.e. a super-full node).

A node stores the custodied columns for the duration of the pruning period and responds to peer requests for samples on those columns.

### Public, deterministic selection

The particular columns that a node custodies are selected pseudo-randomly as a function (`get_custody_columns`) of the node-id and custody size -- importantly this function can be run by any party as the inputs are all public.

*Note*: increasing the `custody_size` parameter for a given `node_id` extends the returned list (rather than being an entirely new shuffle) such that if `custody_size` is unknown, the default `CUSTODY_REQUIREMENT` will be correct for a subset of the node's custody.
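
Using the plain-Python sketch of `get_custody_columns` given after the function above, this note can be checked directly (illustrative only; the node id is arbitrary):

```python
# The subnets selected for a small custody count are re-selected first for a
# larger one, so the minimal-custody columns remain a subset.
cols_default = set(get_custody_columns(node_id=123456789, custody_subnet_count=1))
cols_larger = set(get_custody_columns(node_id=123456789, custody_subnet_count=4))
assert cols_default <= cols_larger
```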

## Peer discovery

At each slot, a node needs to be able to readily sample from *any* set of columns. To this end, a node should find and maintain a set of diverse and reliable peers that can regularly satisfy their sampling demands.

A node runs a background peer discovery process, maintaining at least `TARGET_NUMBER_OF_PEERS` peers of various custody distributions (both `custody_size` and column assignments). The combination of advertised `custody_size` and public node-id makes this readily and publicly accessible.

`TARGET_NUMBER_OF_PEERS` should be tuned upward in the event of failed sampling.

*Note*: while high-capacity and super-full nodes are high value with respect to satisfying sampling requirements, a node should maintain a distribution across node capacities so as not to centralize the p2p graph too much (which in the extreme becomes hub/spoke) and to distribute sampling load better across all nodes.

*Note*: A DHT-based peer discovery mechanism is expected to be utilized in the above. The beacon-chain network currently utilizes discv5 in a similar method as described for finding peers of particular distributions of attestation subnets. Additional peer discovery methods are valuable to integrate (e.g., latent peer discovery via libp2p gossipsub) to add a defense in breadth against one of the discovery methods being attacked.

## Extended data

In this construction, we extend the blobs using a one-dimensional erasure coding extension. The matrix comprises a maximum of `MAX_BLOBS_PER_BLOCK` rows and a fixed `NUMBER_OF_COLUMNS` columns, with each row containing a `Blob` and its corresponding extension.
Member:
> with each row containing a `Blob` and its corresponding extension.

It would be nice if there were a defined type for this, such as `ExtendedBlob`:

| Name | SSZ equivalent | Description |
| - | - | - |
| `ExtendedBlob` | `ByteVector[2 * BYTES_PER_BLOB]` | An extended blob in EIP-7594 |

Member:
How often will ExtendedBlob be used? I guess the new type will be rarely used in the code. I'm not sure. I may be wrong.

Member:
I imagine ExtendedBlob would be used as an intermediate type before converting the data into columns. So it would be used when publishing and reconstructing. You're right though, it might not be very useful. I don't have a strong opinion about having this type or not.


## Column gossip

### Parameters

For each column -- use `data_column_sidecar_{subnet_id}` subnets, where each column index maps to the `subnet_id`. The sidecars can be computed with the `get_data_column_sidecars(signed_block: SignedBeaconBlock, blobs: Sequence[Blob])` helper.

To custody a particular column, a node joins the respective gossip subnet. Verifiable samples from their respective column are gossiped on the assigned subnet.
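
A minimal sketch of the column-to-subnet mapping implied here (assuming the simple modulo assignment that is consistent with `get_custody_columns` above; the helper name is borrowed from the networking side of this work and used only for illustration):

```python
DATA_COLUMN_SIDECAR_SUBNET_COUNT = 32


def compute_subnet_for_data_column_sidecar(column_index: int) -> int:
    # Columns of the form SUBNET_COUNT * i + subnet_id all land on subnet_id.
    return column_index % DATA_COLUMN_SIDECAR_SUBNET_COUNT


assert compute_subnet_for_data_column_sidecar(0) == 0
assert compute_subnet_for_data_column_sidecar(33) == 1
```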

### Reconstruction and cross-seeding

If the node obtains 50%+ of all the columns, it can reconstruct the full data matrix via the `recover_samples_impl` helper.

If a node fails to sample a peer or fails to get a column on the column subnet, it can utilize the Req/Resp message to query the missing column from other peers.

Once the node obtains the missing columns, it should send them to their respective column subnets.

*Note*: A node always maintains a matrix view of the rows and columns they are following, able to cross-reference and cross-seed in either direction.

*Note*: There are timing considerations to analyze -- at what point does a node consider samples missing and choose to reconstruct and cross-seed.

*Note*: There may be anti-DoS and quality-of-service considerations around how to send samples and consider samples -- is each individual sample a message or are they sent in aggregate forms.
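
A rough, plain-Python sketch of the reconstruction-and-cross-seeding flow described above (illustrative only; `recover_fn` and `publish_fn` are stand-ins for the spec's `recover_samples_impl` helper and a client's gossip publish):

```python
def maybe_reconstruct_and_cross_seed(received_columns, recover_fn, publish_fn,
                                     number_of_columns=128):
    """received_columns: dict of column_index -> column data already obtained."""
    # Reconstruction requires holding 50%+ of all columns.
    if 2 * len(received_columns) < number_of_columns:
        return None  # below the threshold: keep fetching missing columns via Req/Resp
    matrix = recover_fn(received_columns)  # stand-in for recover_samples_impl
    # Cross-seed: publish the columns this node was missing to their subnets.
    for column_index in set(range(number_of_columns)) - set(received_columns):
        publish_fn(column_index, matrix)
    return matrix
```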

## Peer sampling

At each slot, a node makes (locally randomly determined) `SAMPLES_PER_SLOT` queries for samples from its peers via `DataColumnSidecarByRoot` requests. A node utilizes the `get_custody_columns` helper to determine which peer(s) to request from. If a node has enough good/honest peers across all rows and columns, this has a high chance of success.
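
A hedged sketch of how a client might pick its per-slot sampling targets (illustrative only; peer bookkeeping is client-specific, and `get_custody_columns` refers to the plain-Python sketch given earlier):

```python
import random

SAMPLES_PER_SLOT = 8
NUMBER_OF_COLUMNS = 128


def select_sampling_targets(peers):
    """peers: dict of node_id -> advertised custody_subnet_count.
    Returns column_index -> list of peer node_ids expected to custody it."""
    sampled_columns = random.sample(range(NUMBER_OF_COLUMNS), SAMPLES_PER_SLOT)
    return {
        column: [
            node_id for node_id, custody_count in peers.items()
            if column in get_custody_columns(node_id, custody_count)
        ]
        for column in sampled_columns
    }
```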

## Peer scoring
Contributor:
Along with poor/no responses to requested samples, if a peer isn't subscribed to its required computed column subnets, it should be disconnected and banned.


Due to the deterministic custody functions, a node knows exactly what a peer should be able to respond to. In the event that a peer does not respond to samples of their custodied rows/columns, a node may downscore or disconnect from a peer.
Contributor:
I think it would be way better implementation-wise if the spec defined a particular time within the slot when we should do the sampling.
Otherwise, there's no distinction between the scenario where a peer that is supposed to custody column X

  1. hasn't yet received the full sample from the block producer
  2. has received it but isn't serving it

which would complicate peer scoring considerably implementation-wise.

If the trailing fork choice filter described below is viable security-wise, then maybe we only sample at the 4-second mark where validators have to attest to slot N.

Contributor:
I think, since it's for the past slot, it should definitely be completed before you import the block; one could try doing it in the last 3 seconds of the previous slot?

Member:
I think timing is TBD.

Contributor:
Imo it would make sense to start sampling as soon as possible, e.g. as soon as you receive either the block or a sidecar (which has all of the commitments). I am not sure how this affects peer scoring. Perhaps a node could downscore a peer only if they fail to provide a sample after some deadline, and if the node has good certainty that this sample should be available? E.g. for samples for slot n, you could wait until you receive attestations from slot n, and only downscore peers if the block is canonical.


## DAS providers

A DAS provider is a consistently-available-for-DAS-queries, super-full (or high-capacity) node. To the p2p network, these look just like other nodes but with high advertised capacity, and they should generally be able to be latently found via normal discovery.

DAS providers can also be found out-of-band and configured into a node to connect to directly and prioritize. Nodes can add some set of these to their local configuration for persistent connection to bolster their DAS quality of service.

Such direct peering utilizes a feature supported out of the box today on all nodes and can complement alternative peer discovery mechanisms (reducing attackability and increasing quality-of-service).

## A note on fork choice

*Fork choice spec TBD, but it will just be a replacement of the `is_data_available()` call in Deneb with column sampling instead of full download. Note that `is_data_available(slot_N)` will likely use a `-1` follow distance so that you just need to check the availability of slot `N-1` for slot `N` (starting with the block proposer of `N`).*
Contributor:
Is there more information on how the trailing slot fork choice works during network partitions or longer forks? During any event where you might temporarily have peers with split views (hard fork, consensus bug), the peers that you are connected to for the particular column subnets will have neither the block nor the data. So when sampling for n-1 from these peers, they will return nothing and you would mark the data as unavailable.

How does sampling work when the network is undergoing stress and possible churn (constant peer disconnections and new connections to find peers with the desired column subnets)? Is there a mode we can fall back to so that a block and its data which were correctly broadcast and propagated are marked as available rather than unavailable during such a situation?


The fork choice rule (essentially a DA filter) is *orthogonal to a given DAS design*, other than the efficiency of a particular design impacting it.

In any DAS design, there are probably a few degrees of freedom around timing, acceptability of short-term re-orgs, etc.

For example, the fork choice rule might require validators to do successful DAS on slot N to be able to include block of slot `N` in its fork choice. That's the tightest DA filter. But trailing filters are also probably acceptable, knowing that there might be some failures/short re-orgs but that they don't hurt the aggregate security. For example, the rule could be — DAS must be completed for slot N-1 for a child block in N to be included in the fork choice.
Contributor:
Suggested change:
- For example, the fork choice rule might require validators to do successful DAS on slot N to be able to include block of slot `N` in its fork choice. That's the tightest DA filter. But trailing filters are also probably acceptable, knowing that there might be some failures/short re-orgs but that they don't hurt the aggregate security. For example, the rule could be — DAS must be completed for slot N-1 for a child block in N to be included in the fork choice.
+ For example, the fork choice rule might require validators to do successful DAS on the parent's slot to be able to include block of slot `N` in its fork choice. That's the tightest DA filter. But trailing filters are also probably acceptable, knowing that there might be some failures/short re-orgs but that they don't hurt the aggregate security. For example, the rule could be — DAS must be completed for slot N-1 for a child block in N to be included in the fork choice.

Contributor:
I think the example is correct for DAS in the same slot? (tightest DA filter)


Such trailing techniques and their analysis will be valuable for any DAS construction. The question is — can you relax how quickly you need to do DA and in the worst case not confirm unavailable data via attestations/finality, and what impact does it have on short-term re-orgs and fast confirmation rules.
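
Purely as an illustration of the trailing-filter idea discussed above (the fork choice integration is explicitly TBD in this PR; `das_succeeded` is a hypothetical predicate standing in for "peer sampling for that slot's data completed successfully"):

```python
def include_in_fork_choice(block, das_succeeded, trailing=True):
    if trailing:
        # Trailing (-1) filter: only require DAS to have completed for the
        # parent slot, accepting possible short re-orgs at the tip.
        return das_succeeded(block.slot - 1)
    # Tightest filter: the block's own slot must already have passed DAS.
    return das_succeeded(block.slot)
```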

## FAQs

### Row (blob) custody

In the one-dimensional construction, a node samples peers by requesting whole `DataColumn`s. For reconstruction, a node can reconstruct all the blobs from 50% of the columns. Note that nodes can still download rows via `blob_sidecar_{subnet_id}` subnets.

The potential benefits of having row custody could include:

1. Allow for a more "natural" distribution of data to consumers -- e.g., roll-ups -- but honestly, they won't know a priori which row their blob is going to be included in within the block, so they would either need to listen to all rows or download a particular row after seeing the block. The former looks just like listening to columns [0, N) and the latter is req/resp instead of gossiping.
2. Help with some sort of distributed reconstruction. Those with full rows can compute extensions and seed missing samples to the network. This would either need to be able to send individual points on the gossip or would need some sort of req/resp faculty, potentially similar to an `IHAVEPOINTBITFIELD` and `IWANTSAMPLE`.

However, for simplicity, we don't assign row custody assignments to nodes in the current design.

### Subnet stability

To start with a simple, stable backbone, for now, we don't shuffle the subnet assignments via the deterministic custody selection helper `get_custody_columns`. However, staggered rotation likely needs to happen on the order of the pruning period to ensure subnets can be utilized for recovery. For example, introducing an `epoch` argument allows the function to maintain stability over many epochs.
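
A purely hypothetical sketch of the staggered-rotation idea (neither the constant name nor the mixing scheme is from the spec; it only reuses the plain-Python `get_custody_columns` sketch from earlier):

```python
CUSTODY_ROTATION_PERIOD_EPOCHS = 4096  # invented value, roughly "order of the pruning period"


def get_custody_columns_with_rotation(node_id, custody_subnet_count, epoch):
    period = epoch // CUSTODY_ROTATION_PERIOD_EPOCHS
    # Mix the rotation period into the node id used for selection, so the
    # assignment is stable within a period and re-shuffles between periods.
    return get_custody_columns(node_id ^ period, custody_subnet_count)
```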