[Feature Request] Identity Column #1959

Open
1 of 5 tasks
felipepessoto opened this issue Aug 3, 2023 · 8 comments

@felipepessoto
Contributor

felipepessoto commented Aug 3, 2023

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

Identity Column (writer version 6) as defined by https://github.com/delta-io/delta/blob/master/PROTOCOL.md#identity-columns.

Design doc: https://docs.google.com/document/d/1G8Vj6wOxswMx1JklllLoSn-obEpJ-iE_Lhpbd-RfIr4/edit?usp=sharing

PR:

Motivation

This is probably the biggest missing part in Open Source Spark Delta.

Further details

Willingness to contribute

@c27kwan volunteered to work on this feature and posted a design doc here.

@felipepessoto felipepessoto added the enhancement New feature or request label Aug 3, 2023
@felipepessoto
Contributor Author

@dennyglee, @allisonport-db, do you have any updates on this? It is probably the most important feature missing from OSS Delta.

Thanks.

@felipepessoto
Contributor Author

@tdas, is there any chance this can be prioritized for the next release?

Thanks.

@keen85

keen85 commented Feb 8, 2024

duplicate of #1072?

@felipepessoto
Contributor Author

I think so, but I would update #1072 to be broader. The way that request is worded, it seems the Identity feature is already done and only the DeltaTableBuilder API is missing.

@c27kwan
Contributor

c27kwan commented Mar 26, 2024

I'm interested in working on this!

@c27kwan
Contributor

c27kwan commented Mar 27, 2024

I can't modify the main comment because I'm not a maintainer. Here's the design doc: https://docs.google.com/document/d/1G8Vj6wOxswMx1JklllLoSn-obEpJ-iE_Lhpbd-RfIr4/edit?usp=sharing

@felipepessoto
Contributor Author

@c27kwan that is great.

Have you discussed your intention to contribute with any of the maintainers? I’m asking because this is a big feature, and I want to make sure they aren’t working on it internally and that they are open to accepting your implementation.

Thanks.

@vkorukanti
Collaborator

Hi @felipepessoto, we don't have anyone else working on this feature. Had an offline chat with @c27kwan before assigning the issue to @c27kwan. Feel free to look at the design doc and post any questions you have.

vkorukanti pushed a commit that referenced this issue Apr 11, 2024
## Description
This PR is part of #1959

In this PR, we introduce the IdentityColumnsTableFeature as a test-only
feature so that we can start developing with it.

Note that we do not yet add support for minWriterVersion 6 to
properties.defaults.minWriterVersion, because that would enable the table
feature outside of testing.

## How was this patch tested?
Existing tests pass. 

## Does this PR introduce _any_ user-facing changes?
No, this is a test-only change.
andreaschat-db pushed a commit to andreaschat-db/delta that referenced this issue Apr 16, 2024
tdas pushed a commit that referenced this issue Apr 18, 2024
#### Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This PR is part of #1959

In this PR, we introduce IdentityColumn.scala, a common file which
contains most of the helpers for Identity Columns, necessary for
unblocking future PRs.

## How was this patch tested?
This PR commits dead code. Existing tests pass.

## Does this PR introduce _any_ user-facing changes?
No.
@tdas tdas added this to the 3.3.0 milestone Apr 19, 2024
scottsand-db pushed a commit that referenced this issue Apr 25, 2024
#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

This PR is part of #1959

In this PR, we introduce the `GenerateIdentityValues` UDF used for
populating Identity Column values. The UDF is not used in Delta in this
PR yet.

`GenerateIdentityValues` is a simple non-deterministic UDF which keeps a
counter with the user specified `start` and `step`. It counts in
increments of `numPartitions` so that it can be parallelized in
different tasks.
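
The strided counting described above can be illustrated with a small, self-contained sketch. It is not the actual `GenerateIdentityValues` code; the class name `StridedIdentityGenerator`, its parameters, and the exact formula are assumptions made for illustration. The idea is that each partition starts at a different offset and then advances by `numPartitions * step`, so tasks never hand out the same value.

```scala
// Hypothetical sketch of a partition-strided identity counter.
// Names and formula are illustrative assumptions, not Delta's implementation.
final class StridedIdentityGenerator(
    start: Long,
    step: Long,
    numPartitions: Int,
    partitionId: Int) {

  require(step != 0L, "step must be non-zero")
  require(numPartitions > 0, "numPartitions must be positive")
  require(partitionId >= 0 && partitionId < numPartitions, "partitionId out of range")

  // Each partition begins at its own offset from `start`.
  // The *Exact methods throw ArithmeticException on overflow instead of wrapping.
  private var nextValue: Long =
    Math.addExact(start, Math.multiplyExact(partitionId.toLong, step))

  // Distance between consecutive values produced by the same partition.
  private val stride: Long = Math.multiplyExact(numPartitions.toLong, step)

  /** Returns the next identity value for this partition. */
  def next(): Long = {
    val current = nextValue
    nextValue = Math.addExact(nextValue, stride)
    current
  }
}

object StridedIdentityDemo {
  /** Generates `rows` values for one partition, with start = 1, step = 2 and 3 partitions. */
  def valuesFor(partitionId: Int, rows: Int): Seq[Long] = {
    val gen = new StridedIdentityGenerator(
      start = 1L, step = 2L, numPartitions = 3, partitionId = partitionId)
    Seq.fill(rows)(gen.next())
  }
}
```

With these illustrative settings, partition 0 yields 1, 7, 13, partition 1 yields 3, 9, 15, and partition 2 yields 5, 11, 17, so the three tasks produce disjoint values without coordinating with each other.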

## How was this patch tested?
New test suite and unit tests for the UDF.

## Does this PR introduce _any_ user-facing changes?
No.
allisonport-db pushed a commit that referenced this issue Apr 30, 2024
#### Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This PR is part of #1959
* We introduce the `generatedAlwaysAsIdentity` and
`generatedByDefaultAsIdentity` APIs into DeltaColumnBuilder so that users
can create Delta tables with identity columns.
* We guard the creation of identity column tables with a feature flag
until development is complete.

## How was this patch tested?
New tests. 

## Does this PR introduce _any_ user-facing changes?

Yes, we introduce the `generatedAlwaysAsIdentity` and
`generatedByDefaultAsIdentity` interfaces to DeltaColumnBuilder for
creating identity columns.
**Interfaces**
```
def generatedAlwaysAsIdentity(): DeltaColumnBuilder
def generatedAlwaysAsIdentity(start: Long, step: Long): DeltaColumnBuilder
def generatedByDefaultAsIdentity(): DeltaColumnBuilder
def generatedByDefaultAsIdentity(start: Long, step: Long): DeltaColumnBuilder
```
When the `start` and `step` parameters are not specified, they both
default to `1L`. `generatedByDefaultAsIdentity` allows users to insert
their own values into the column, while a column specified with
`generatedAlwaysAsIdentity` can only ever have system-generated
values.

**Example Usage**
```
// Assumes `spark` is an active SparkSession.
import org.apache.spark.sql.types.LongType

// Creates a Delta identity column.
io.delta.tables.DeltaTable.columnBuilder(spark, "id")
      .dataType(LongType)
      .generatedAlwaysAsIdentity()
// Which is equivalent to the call
io.delta.tables.DeltaTable.columnBuilder(spark, "id")
      .dataType(LongType)
      .generatedAlwaysAsIdentity(start = 1L, step = 1L)
```
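
The difference between the two modes can be modelled in plain Scala, without Spark. This is only a conceptual sketch: the types `IdentityMode`, `GeneratedAlways`, `GeneratedByDefault` and the `resolve` helper are hypothetical names, not part of the Delta API, and rejecting an explicit value with an error message is an assumption about how "can only ever have system generated values" would be enforced.

```scala
// Hypothetical model of the two identity modes; not part of the Delta API.
sealed trait IdentityMode
case object GeneratedAlways extends IdentityMode
case object GeneratedByDefault extends IdentityMode

object IdentityResolution {
  /**
   * Decides which value ends up in the identity column for one row.
   *
   * @param mode          how the column was declared
   * @param userValue     the value the user tried to insert, if any
   * @param nextGenerated supplies the next system-generated value
   */
  def resolve(
      mode: IdentityMode,
      userValue: Option[Long],
      nextGenerated: () => Long): Either[String, Long] =
    (mode, userValue) match {
      // Assumption: explicit values are rejected for "always" columns.
      case (GeneratedAlways, Some(_)) =>
        Left("explicit values are not allowed in a generated-always identity column")
      // "By default" columns accept a user-supplied value as-is.
      case (GeneratedByDefault, Some(v)) =>
        Right(v)
      // With no user value, both modes fall back to the generator.
      case (_, None) =>
        Right(nextGenerated())
    }
}
```

Under this model, both modes behave identically when the user omits the column, and they only diverge when an explicit value is supplied.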