Spec: add multi-arg transform support #8579

Merged Jan 25, 2024 · 6 commits · Changes from 4 commits
91 changes: 71 additions & 20 deletions format/spec.md
@@ -296,9 +296,9 @@ Data files are stored in manifests with a tuple of partition values that are use

Tables are configured with a **partition spec** that defines how to produce a tuple of partition values from a record. A partition spec has a list of fields that consist of:

* A **source column id** from the table’s schema
* A **source column id** or a list of **source column ids** from the table’s schema
* A **partition field id** that is used to identify a partition field and is unique within a partition spec. In v2 table metadata, it is unique across all partition specs.
* A **transform** that is applied to the source column to produce a partition value
* A **transform** that is applied to the source column(s) to produce a partition value
* A **partition name**

The source column, selected by id, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C.
@@ -314,7 +314,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
| **`identity`** | Source value, unmodified | Any | Source type |
| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
| **`bucket[N]`** | Hash of value(s), mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
**Contributor (author):** I'm wondering whether we should simply add a new bucketV2 partition transform to distinguish it from the single-arg one.
**Collaborator:** I am ok with adding bucketv2 here. Let's see what @aokolnychyi @rdblue think.
**Contributor:** I'd be inclined to keep just bucket as long as we would return UnknownTransform in old readers/writers. If keeping it as bucket leads to exceptions, then we have to consider another name.
| **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string` | Source type |
| **`year`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
@@ -329,19 +329,35 @@ The `void` transform may be used to replace the transform in an existing partiti

#### Bucket Transform Details

Bucket partition transforms use a 32-bit hash of the source value. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.
Bucket partition transforms use a 32-bit hash of the source value(s). The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.

Transforms are parameterized by a number of buckets [1], `N`. The hash mod `N` must produce a positive value by first discarding the sign bit of the hash value. In pseudo-code, the function is:

```
def bucket_N(x) = (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N
```

When the bucket transform is applied to a list of values, the hash is applied to the concatenated byte representations of all non-null values. In pseudo-code, the function is:

```
def murmur3_x86_32_hashes(x1, x2, x3, ...) = {
  byte[] bytes;
  for (x in [x1, x2, x3, ...]) {
    if (x != null) bytes.append(bytesOf(x))
  }
  return murmur3_x86_32_hash(bytes)
}

def bucket_N(x1, x2, x3, ...) = (murmur3_x86_32_hashes(x1, x2, x3, ...) & Integer.MAX_VALUE) % N
```

Notes:

1. Changing the number of buckets as a table grows is possible by evolving the partition spec.
2. `murmur3_x86_32_hashes` produces the same result as `murmur3_x86_32_hash` when applied on a single value.
3. NULL input in the list of values is ignored when computing the hash. If all the input values are NULL, NULL should be produced.
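The pseudo-code above can be made concrete with a short sketch. The following Python is illustrative only, not the reference implementation: `murmur3_x86_32` is a pure-Python transcription of the standard 32-bit x86 Murmur3 with seed 0, and `bucket_n` applies the sign-bit mask, modulo, and NULL handling described in the notes. Callers are assumed to pass values already encoded to bytes per Appendix B.

```python
from typing import Optional

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3, x86 variant. Returns a signed 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h, n = seed, len(data)
    # Body: 4-byte little-endian blocks.
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: mix in the 1-3 trailing bytes, if any.
    tail = data[n - n % 4:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h - (1 << 32) if h >= (1 << 31) else h

def bucket_n(n_buckets: int, *values: Optional[bytes]) -> Optional[int]:
    """Hash the concatenated byte representations of all non-NULL
    values, discard the sign bit, and take the result mod N.
    If every value is NULL, the partition value is NULL."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return (murmur3_x86_32(b"".join(present)) & 0x7FFFFFFF) % n_buckets
```

With a single already-encoded value this reduces to the single-arg `bucket_N` above; for example, `bucket_n(16, (34).to_bytes(8, "little"))` buckets the `int` value 34 via its 8-byte little-endian long representation (note 2 above guarantees the single-arg and multi-arg forms agree on one value).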

For hash function details by type, see Appendix B.
For hash function details and the byte representation of each type, see Appendix B.


#### Truncate Transform Details
@@ -383,8 +399,8 @@ Users can sort their data within partitions by columns to gain performance. The

A sort order is defined by a sort order id and a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field consists of:

* A **source column id** from the table's schema
* A **transform** that is used to produce values to be sorted on from the source column. This is the same transform as described in [partition transforms](#partition-transforms).
* A **source column id** or a list of **source column ids** from the table's schema
* A **transform** that is used to produce values to be sorted on from the source column(s). This is the same transform as described in [partition transforms](#partition-transforms).
* A **sort direction**, that can only be either `asc` or `desc`
* A **null order** that describes the order of null values when sorted. Can only be either `nulls-first` or `nulls-last`

@@ -1060,6 +1076,27 @@ The types below are not currently valid for bucketing, and so are not hashed. Ho
| **`float`** | `hashLong(doubleToLongBits(double(v)))` [4]| `1.0F` → `-142385009`, `0.0F` → `1669671676`, `-0.0F` → `1669671676` |
| **`double`** | `hashLong(doubleToLongBits(v))` [4]| `1.0D` → `-142385009`, `0.0D` → `1669671676`, `-0.0D` → `1669671676` |

For multiple arguments, `hashBytes()` is applied to the concatenated byte representations of the arguments:

| Primitive type | Bytes representation |
|----------------------|------------------------------------------------|
| **`int`** | `littleEndianBytes(long(v))` |
| **`long`** | `littleEndianBytes(v)` |
| **`decimal(P,S)`** | `minBigEndian(unscaled(v))` |
| **`date`** | `littleEndianBytes(daysFromUnixEpoch(v))` |
| **`time`** | `littleEndianBytes(microsecsFromMidnight(v))` |
| **`timestamp`** | `littleEndianBytes(microsecsFromUnixEpoch(v))` |
| **`timestamptz`** | `littleEndianBytes(microsecsFromUnixEpoch(v))` |
| **`timestamp_ns`** | `littleEndianBytes(nanosecsFromUnixEpoch(v))` |
| **`timestamptz_ns`** | `littleEndianBytes(nanosecsFromUnixEpoch(v))` |
| **`string`** | `utf8Bytes(v)` |
| **`uuid`** | `uuidBytes(v)` |
| **`fixed(L)`** | `v` |
| **`binary`** | `v` |

For example, the hash representation of `(a:int, b:string)` will be `hashBytes(concatenation(littleEndianBytes(long(a)), utf8Bytes(b)))`
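These byte representations can be sketched in Python as follows. The helper names mirror the table and are illustrative; `struct` handles the little-endian long encoding, and `min_big_endian` imitates the minimal two's-complement form of Java's `BigInteger.toByteArray()`, which is assumed here for decimal unscaled values.

```python
import struct

def little_endian_bytes(v: int) -> bytes:
    # 8-byte little-endian two's-complement long. `int` sources are
    # widened to long first, so int and long columns hash identically.
    return struct.pack("<q", v)

def utf8_bytes(s: str) -> bytes:
    return s.encode("utf-8")

def min_big_endian(unscaled: int) -> bytes:
    # Minimal-length big-endian two's-complement of a decimal's
    # unscaled value, e.g. 14.20 at scale 2 has unscaled value 1420.
    magnitude = unscaled if unscaled >= 0 else ~unscaled
    return unscaled.to_bytes(max(1, (magnitude.bit_length() + 8) // 8),
                             "big", signed=True)

# Hash input for a multi-arg transform over (a: int, b: string),
# e.g. (34, "iceberg"): the per-type representations are concatenated.
payload = little_endian_bytes(34) + utf8_bytes("iceberg")
```

The resulting `payload` is the byte string that `hashBytes` is applied to.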


Notes:

1. Integer and long hash results must be identical for all integer values. This ensures that schema evolution does not change bucket partition values if integer types are promoted.
@@ -1119,21 +1156,30 @@ Partition specs are serialized as a JSON object with the following fields:

Each partition field in the fields list is stored as an object. See the table for more detail:

|Transform or Field|JSON representation|Example|
|--- |--- |--- |
|**`identity`**|`JSON string: "identity"`|`"identity"`|
|**`bucket[N]`**|`JSON string: "bucket[<N>]"`|`"bucket[16]"`|
|**`truncate[W]`**|`JSON string: "truncate[<W>]"`|`"truncate[20]"`|
|**`year`**|`JSON string: "year"`|`"year"`|
|**`month`**|`JSON string: "month"`|`"month"`|
|**`day`**|`JSON string: "day"`|`"day"`|
|**`hour`**|`JSON string: "hour"`|`"hour"`|
|**`Partition Field`**|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": <id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br />&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}`|
| Transform or Field | JSON representation | Example |
|----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **`identity`** | `JSON string: "identity"` | `"identity"` |
| **`bucket[N]`** | `JSON string: "bucket[<N>]"` | `"bucket[16]"` |
| **`bucket[N]`** (multi-arg bucket [1]) | `JSON string: "bucketV2[<N>]"` | `"bucketV2[16]"` |
| **`truncate[W]`** | `JSON string: "truncate[<W>]"` | `"truncate[20]"` |
| **`year`** | `JSON string: "year"` | `"year"` |
| **`month`** | `JSON string: "month"` | `"month"` |
| **`day`** | `JSON string: "day"` | `"day"` |
| **`hour`** | `JSON string: "hour"` | `"hour"` |
| **`Partition Field`** [2] | `JSON object: {`<br />&nbsp;&nbsp;`"source-id": <id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br />`}` | `{`<br />&nbsp;&nbsp;`"source-id": 1,`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br />&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}` |
| **`Partition Field with multi-arg transform`** [3] | `JSON object: {`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": <list of ids>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br />`}` | `{`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": [1,2],`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_type_bucket",`<br />&nbsp;&nbsp;`"transform": "bucketV2[16]"`<br />`}` |

In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec.

The `field-id` property was added for each partition field in v2. In v1, the reference implementation assigned field ids sequentially in each spec starting at 1,000. See Partition Evolution for more details.

Notes:

1. For multi-arg bucket, the serialized form is `bucketV2[N]` instead of `bucket[N]` to distinguish it from the single-arg bucket transform. Old readers and writers therefore see it as an unknown transform: an old writer will stop writing the table if it encounters this transform, but old readers can still read the table by scanning all partitions. This makes adding a multi-arg transform a forward-compatible change, but not a backward-compatible one.
2. For partition fields with a transform with a single argument, the id of the source field is set on `source-id`, and `source-ids` is omitted.
3. For partition fields with a transform of multiple arguments, the ids of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
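The two layouts can be resolved with a small branch when deserializing. The sketch below is illustrative; the `PartitionField` type and function names are stand-ins rather than the reference implementation, and the JSON keys follow the table above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PartitionField:
    source_ids: List[int]  # a single id for single-arg transforms
    field_id: int
    name: str
    transform: str

def parse_partition_field(obj: dict) -> PartitionField:
    # Multi-arg fields carry "source-ids" and set "source-id" to -1,
    # so old readers that require "source-id" can still parse the JSON.
    ids = list(obj["source-ids"]) if "source-ids" in obj else [obj["source-id"]]
    return PartitionField(ids, obj["field-id"], obj["name"], obj["transform"])

single = parse_partition_field({
    "source-id": 1, "field-id": 1000,
    "name": "id_bucket", "transform": "bucket[16]",
})
multi = parse_partition_field({
    "source-id": -1, "source-ids": [1, 2], "field-id": 1001,
    "name": "id_type_bucket", "transform": "bucketV2[16]",
})
```

A writer would do the reverse: emit `source-id` alone for a single-arg transform, and `source-ids` plus `source-id: -1` for a multi-arg one.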

### Sort Orders

Sort orders are serialized as a list of JSON object, each of which contains the following fields:
@@ -1145,9 +1191,14 @@ Sort orders are serialized as a list of JSON objects, each of which contains the

Each sort field in the fields list is stored as an object with the following properties:

|Field|JSON representation|Example|
|--- |--- |--- |
|**`Sort Field`**|`JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform JSON>,`<br />&nbsp;&nbsp;`"source-id": <source id int>,`<br />&nbsp;&nbsp;`"direction": <direction string>,`<br />&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}`|`{`<br />&nbsp;&nbsp;` "transform": "bucket[4]",`<br />&nbsp;&nbsp;` "source-id": 3,`<br />&nbsp;&nbsp;` "direction": "desc",`<br />&nbsp;&nbsp;` "null-order": "nulls-last"`<br />`}`|
| Field | JSON representation | Example |
|-----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **`Sort Field`** [1] | `JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform JSON>,`<br />&nbsp;&nbsp;`"source-id": <source id int>,`<br />&nbsp;&nbsp;`"direction": <direction string>,`<br />&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}` | `{`<br />&nbsp;&nbsp;` "transform": "bucket[4]",`<br />&nbsp;&nbsp;` "source-id": 3,`<br />&nbsp;&nbsp;` "direction": "desc",`<br />&nbsp;&nbsp;` "null-order": "nulls-last"`<br />`}` |
| **`Sort Field with multi-arg transform`** [2] | `JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform JSON>,`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": <list of ids>,`<br />&nbsp;&nbsp;`"direction": <direction string>,`<br />&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}` | `{`<br />&nbsp;&nbsp;` "transform": "bucketV2[4]",`<br />&nbsp;&nbsp;` "source-id": -1,`<br />&nbsp;&nbsp;` "source-ids": [1,2],`<br />&nbsp;&nbsp;` "direction": "desc",`<br />&nbsp;&nbsp;` "null-order": "nulls-last"`<br />`}` |

Notes:
1. For sort fields with a transform with a single argument, the id of the source field is set on `source-id`, and `source-ids` is omitted.
2. For sort fields with a transform of multiple arguments, the ids of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
**Contributor:** Is writing an explicit source-id required to avoid exceptions compared to not writing that at all?

**Contributor (author):** Unfortunately, I believe an explicit source-id is required to avoid exceptions in old versions.

See (the same line appears at two call sites):

```
int sourceId = JsonUtil.getInt(SOURCE_ID, element);
```

The following table describes the possible values for some of the fields within a sort field:
