[Docs] Add documentation for Row Tracking #2939

Merged
merged 4 commits into from
Jun 18, 2024
Changes from 1 commit
113 changes: 113 additions & 0 deletions docs/source/delta-row-tracking.md
@@ -0,0 +1,113 @@
---
description: Learn how <Delta> row tracking allows tracking how rows change across table versions.
orphan: 1
---

# Use row tracking for Delta tables

Row tracking allows <Delta> to track row-level lineage in a <Delta> table. When enabled on a <Delta> table, row tracking adds two new metadata fields to the table:

- **Row IDs** provide rows with an identifier that is unique within the table. A row keeps the same ID whenever it is modified using a `MERGE` or `UPDATE` statement.

- **Row commit versions** record the last version of the table in which the row was modified. A row is assigned a new version whenever it is modified using a `MERGE` or `UPDATE` statement.
Comment on lines +10 to +12
Collaborator

It could be worth quickly highlighting here that these can be accessed via `_metadata.row_id` and `_metadata.row_commit_version`, even if there's more detail about it below.

Contributor Author

I'm trying to keep this first section a little bit more high-level.


.. note:: This feature is available in <Delta> 3.2.0 and above. This feature is in experimental support mode with [_](#limitations).
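
As a rough mental model of the two fields described above, the following pure-Python sketch mimics their behavior on a toy in-memory table. The `ToyTable` class and its methods are hypothetical and exist only for illustration. The sketch does not use <Delta> and does not reflect how <Delta> stores or assigns these values, and its version numbering is simplified.

```python
from dataclasses import dataclass


@dataclass
class Row:
    row_id: int              # stable identifier, unique within the table
    row_commit_version: int  # table version that last modified the row
    data: dict


class ToyTable:
    """Illustrative in-memory model only; not how Delta works internally."""

    def __init__(self):
        self.version = -1    # no commits yet; the first commit is version 0
        self.next_row_id = 0
        self.rows = []

    def insert(self, records):
        self.version += 1
        for record in records:
            self.rows.append(Row(self.next_row_id, self.version, dict(record)))
            self.next_row_id += 1

    def update(self, predicate, changes):
        self.version += 1
        for row in self.rows:
            if predicate(row.data):
                row.data.update(changes)
                # The commit version moves forward; the row ID is left untouched.
                row.row_commit_version = self.version


table = ToyTable()
table.insert([{"name": "Ada", "age": 20}, {"name": "Bo", "age": 22}])  # version 0
table.update(lambda d: d["name"] == "Bo", {"age": 23})                 # version 1

snapshot = [(r.row_id, r.row_commit_version, r.data["age"]) for r in table.rows]
print(snapshot)  # [(0, 0, 20), (1, 1, 23)]
```

The updated row keeps its ID of `1` but moves to commit version `1`, while the untouched row stays at commit version `0`.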

## Enable row tracking

You must explicitly enable row tracking using one of the following methods:

- **New table**: Set the table property `'delta.enableRowTracking' = 'true'` in the `CREATE TABLE` command.

```sql
-- Create an empty table
CREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIES ('delta.enableRowTracking' = 'true');

-- Using a CTAS statement
CREATE TABLE course_new
TBLPROPERTIES ('delta.enableRowTracking' = 'true')
AS SELECT * FROM course_old;

-- Using a LIKE statement to copy configuration
CREATE TABLE graduate LIKE student;

-- Using a CLONE statement to copy configuration
CREATE TABLE graduate CLONE student;
```

- **Existing table**: Set the table property `'delta.enableRowTracking' = 'true'` in the `ALTER TABLE` command.

```sql
ALTER TABLE grade SET TBLPROPERTIES ('delta.enableRowTracking' = 'true');
```

- **All new tables**: Set the configuration `spark.databricks.delta.properties.defaults.enableRowTracking = true` for the current session using the `SET` command.

.. code-language-tabs::
```sql
SET spark.databricks.delta.properties.defaults.enableRowTracking = true;
```

```python
spark.conf.set("spark.databricks.delta.properties.defaults.enableRowTracking", True)
```

```scala
spark.conf.set("spark.databricks.delta.properties.defaults.enableRowTracking", true)
```

.. important:: Because cloning a <Delta> table creates a separate history, the row ids and row commit versions on cloned tables do not match those of the original table.

.. important:: Enabling row tracking on an existing table automatically assigns row ids and row commit versions to all existing rows in the table. This process may create multiple new versions of the table and may take a long time.

.. warning:: Tables created with row tracking enabled have the row tracking <Delta> table feature enabled at creation and use <Delta> writer version 7. Table protocol versions cannot be downgraded, and tables with row tracking enabled are not writeable by <Delta> clients that do not support all enabled <Delta> writer protocol table features. See [_](/versioning.md).

### Row tracking storage

Enabling row tracking may increase the size of the table. <Delta> stores row tracking metadata fields in hidden metadata columns in the data files. Some operations, such as insert-only operations, do not use these hidden columns and instead track the row ids and row commit versions using metadata in the <Delta> log. Data reorganization operations such as `OPTIMIZE` and `REORG` cause the row ids and row commit versions to be tracked using the hidden metadata columns, even when they were previously tracked using metadata in the log.
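
The two storage paths described above can be sketched as a simple fallback: use the value from the hidden column when one was written, otherwise derive it from per-file metadata in the log. This is a minimal sketch under assumptions that this page does not state: the function and parameter names are hypothetical, and the rule that a non-materialized row ID equals a per-file base row ID plus the row's position in the file is taken from the row tracking description in the <Delta> protocol, not from this page.

```python
from typing import Optional


def resolve_row_id(materialized_row_id: Optional[int],
                   base_row_id: int,
                   position_in_file: int) -> int:
    """Return the row ID, preferring the hidden column value if present."""
    if materialized_row_id is not None:
        return materialized_row_id
    # Assumption: default IDs are the file's base row ID plus the row position.
    return base_row_id + position_in_file


def resolve_row_commit_version(materialized_version: Optional[int],
                               default_version: int) -> int:
    """Return the commit version, preferring the hidden column value if present."""
    if materialized_version is not None:
        return materialized_version
    # Assumption: rows without a stored value share the file's default version.
    return default_version


# File written by an insert-only operation: nothing is stored in hidden columns.
fresh = [resolve_row_id(None, 100, pos) for pos in range(3)]

# File rewritten by a data reorganization: every row carries its stored ID,
# so the new file's base row ID (500 here) is never used.
rewritten = [resolve_row_id(rid, 500, pos) for pos, rid in enumerate([102, 7, 100])]

# One row without a stored commit version, one row with a stored version of 4.
versions = [resolve_row_commit_version(v, 9) for v in [None, 4]]

print(fresh, rewritten, versions)
```

The fallback keeps insert-only files free of extra columns, while rewritten files preserve the IDs and versions that rows had before the rewrite.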

## Read row tracking metadata fields

The row ids and row commit versions metadata fields are not automatically included when reading the table. Instead, you must explicitly select these metadata fields from the hidden `_metadata` column, which is available for all tables in <AS>.

.. code-language-tabs::
```sql
SELECT _metadata.row_id, _metadata.row_commit_version, * FROM table_name;
```

```python
spark.read.table("table_name") \
.select("_metadata.row_id", "_metadata.row_commit_version", "*")
```

```scala
spark.read.table("table_name")
.select("_metadata.row_id", "_metadata.row_commit_version", "*")
```
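
One possible use of these fields is to list the rows that were inserted or updated after a given table version. The helper below is a hypothetical sketch that only builds the SQL text. It assumes that `_metadata.row_commit_version` can be used in a `WHERE` clause like any other column, which this page does not state, and that `table_name` comes from a trusted source because it is interpolated directly into the statement.

```python
def changed_rows_query(table_name: str, since_version: int) -> str:
    """Build a query for rows whose commit version is newer than since_version.

    Hypothetical helper: it builds a SQL string and does not run anything.
    """
    if not isinstance(since_version, int) or since_version < 0:
        raise ValueError("since_version must be a non-negative integer")
    return (
        "SELECT _metadata.row_id, _metadata.row_commit_version, * "
        f"FROM {table_name} "
        f"WHERE _metadata.row_commit_version > {since_version}"
    )


query = changed_rows_query("student", 3)
print(query)
```

With an active session, the result could be passed to `spark.sql(query)`. Deleted rows do not appear in the output, because the query only sees rows that still exist in the table.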

## What is the schema of the row tracking metadata fields?

Row tracking adds the following metadata fields that can be accessed when reading a table:

| Column name | Type | Values |
|--------------------------------|------|------------------------------------------------------------------------------|
| `_metadata.row_id` | Long | The unique identifier of the row. |
| `_metadata.row_commit_version` | Long | The table version at which the row was last inserted or updated. |

## Disable row tracking

Row tracking can be disabled to reduce the storage overhead of the metadata fields. After you disable row tracking, the metadata fields remain available, but rows no longer keep stable values: every row is assigned a new id and commit version whenever an operation touches it.

```sql
ALTER TABLE table_name SET TBLPROPERTIES ('delta.enableRowTracking' = 'false');
```

.. important:: Disabling row tracking does not remove the corresponding table feature and does not downgrade the table protocol version.

## Limitations

The following limitations exist:

- The row ids and row commit versions metadata fields cannot be accessed while reading the [Change data feed](/delta-change-data-feed.md).

.. include:: /shared/replacements.md
1 change: 1 addition & 0 deletions docs/source/index.md
@@ -26,6 +26,7 @@ This is the documentation site for <Delta>.
delta-clustering
delta-deletion-vectors
delta-drop-feature
delta-row-tracking
delta-apidoc
delta-storage
delta-uniform
2 changes: 2 additions & 0 deletions docs/source/versioning.md
@@ -25,6 +25,7 @@ The following <Delta> features break forward compatibility. Features are enabled
V2 Checkpoints, [Delta Lake 3.0.0](https://github.com/delta-io/delta/releases/tag/v3.0.0),[V2 Checkpoint Spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#v2-spec)
Domain metadata, [Delta Lake 3.0.0](https://github.com/delta-io/delta/releases/tag/v3.0.0),[Domain Metadata Spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#domain-metadata)
Clustering, [Delta Lake 3.1.0](https://github.com/delta-io/delta/releases/tag/v3.1.0),[_](/delta-clustering.md)
Row Tracking, [Delta Lake 3.2.0](https://github.com/delta-io/delta/releases/tag/v3.2.0),[_](/delta-row-tracking.md)

<a id="table-protocol"></a>

@@ -105,6 +106,7 @@ The following table shows minimum protocol versions required for <Delta> feature
Timestamp without Timezone,7,3,[TimestampNTZType](https://spark.apache.org/docs/latest/sql-ref-datatypes.html)
Iceberg Compatibility V1,7,2,[IcebergCompatV1](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#iceberg-compatibility-v1)
V2 Checkpoints,7,3,[V2 Checkpoint Spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#v2-spec)
Row Tracking,7,3,[_](/delta-row-tracking.md)

<a id="upgrade"></a>
