Extensible forward- and backward-compatible HashIndex

Currently:

```
| 32 byte Key | ValueA | Value B | Value C |
| 32 byte Key | ValueA | Value B | Value C |
| 32 byte Key | ValueA | Value B | Value C |
| 32 byte Key | ValueA | Value B | Value C |
| 32 byte Key | ValueA | Value B | Value C |
| 32 byte Key | ValueA | Value B | Value C |
```

Problems:

(1) Can't add fields (e.g. csize to Repository) without breaking compatibility
(2) Load factor has a large associated factor (key+value length
    40-44 bytes)
(3) Resizing moves a lot of stuff around and requires a lot of extra memory
(4) Iteration moves over uninteresting data, e.g. summarize_chunks doesn't care about
    the key, yet wastes 70 % of its memory accesses on them.

Instead:

```
index array, length = num_buckets, indexed by hash % num_buckets.
indices = uin32_t, thus low load factors are basically free.

| index |
| index |
| index |
| index |
| index |
| index |

separate, compact arrays for each column/field

| key |
| key |
| key |
| key |
| key |

| ValueA |
| ValueA |
| ValueA |
| ValueA |
| ValueA |

...

All key/value arrays are coindexed by the index from the index array
```

On disk-format has to tell the number of value arrays and their geometry
(length is implicit = num_entries, width must be noted).

Thus, an older version can load an index with unknown columns with no issues.
When saving, it would discard the extra columns (or preserve them... but then
a bit mask is needed to tell which columns were treated properly, e.g. updated).

This could technically be tacked onto an existing HashIndex, although the advantages
would be fewer.

- Lowering the load factor and thus collisions is a lot more viable, since the
  space factor is much lower (only 4 bytes per index entry). (2)
- Lookups require one extra pointer indirection (index array -> keys array),
  which may be mitigated by making the index much more sparse for basically free.
- Reading all fields has higher memory concurrency for the same amount of work done,
  which I think should not be an issue in this instance.
- Quicker iteration when only interested in e.g. the values,
  such as in summarize_chunks (similar to columnar stores). (4)
- Removes in-band signalling for DELETED/EMPTY
- Since only the index array is resized when resizing the table,
  less memory is required to do so. (3)
- Saving to disk can require back-shifting in the value/key arrays to make them compact. Or you save the free-list/array.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Extensible forward- and backward-compatible HashIndex #2757

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Extensible forward- and backward-compatible HashIndex #2757

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions