Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Add column mapping metadata update functionality #3393

Merged
merged 2 commits into from
Jul 29, 2024

Conversation

nastra
Copy link
Contributor

@nastra nastra commented Jul 18, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This adds column mapping metadata when the column mapping feature is enabled on a table and fixes #3383

How was this patch tested?

New tests have been added to verify the behavior

Does this PR introduce any user-facing changes?

@nastra nastra force-pushed the column-mapping-meta-update branch 7 times, most recently from 0f5ac79 to 3169b2e Compare July 18, 2024 14:53
@vkorukanti vkorukanti self-requested a review July 18, 2024 16:07
@raveeram-db
Copy link
Contributor

Thanks for the PR @nastra ! Change LGTM so far. cc: @scottsand-db

Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Left some comments!

@nastra nastra force-pushed the column-mapping-meta-update branch 3 times, most recently from 7df2b70 to bcd839a Compare July 19, 2024 08:33
@nastra nastra requested a review from scottsand-db July 19, 2024 10:34
@nastra nastra force-pushed the column-mapping-meta-update branch 4 times, most recently from f6ba330 to 599bd06 Compare July 22, 2024 06:36
////////////////////////////

/** Visible for testing */
static int findMaxColumnId(StructType schema) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't handle nested schemas, right? are you planning on adding that in a subsequent PR?

I worry that adding it in a subsequent PR will leave us open to forgetting spots to add it ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.

Copy link
Contributor Author

@nastra nastra Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I initially discussed this with @raveeram-db to add nested schema support in a separate PR. However, I can also add it as part of this PR if necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened #3427 to handled nested schema

if (field.getMetadata().contains(COLUMN_MAPPING_PHYSICAL_NAME_KEY)) {
return (String) field
.getMetadata()
.get(COLUMN_MAPPING_PHYSICAL_NAME_KEY);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you create that issue about getString on FieldMetadata class? we can leave a TODO here

Copy link
Contributor Author

@nastra nastra Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on #3393 (comment) I was waiting for a response whether we should do it in the first place, so I haven't opened an issue yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened #3419 to address that

Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for working on this.

Qn: looks like the data writing path is not yet handled which means we can create tables with column mapping enabled, but the data written is not correct. Can we split out the physical name/id mapping info method and complete the implementation (i.e handling the nested and array/map types)?

private static boolean allowMappingModeChange(String oldMode, String newMode) {
// only upgrade from none to name mapping is allowed
return oldMode.equals(newMode) ||
(COLUMN_MAPPING_MODE_NONE.equals(oldMode) &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if we should just define an enum and let the TableConfig return the enum value. Otherwise we need to deal with the case sensitive issues.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know, I can make a separate PR to change it to use the enum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would that enum be in ColumnMapping or in TableConfig? No worries, I can add that enum. However, that feels like it would be better to do in a follow-up PR in order to not increase the diff here, wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum in either places is fine.

private static Metadata assignColumnIdAndPhysicalName(Metadata metadata, boolean isNewTable) {
StructType schema = metadata.getSchema();
int maxColumnId = Math.max(
Integer.parseInt(metadata.getConfiguration()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the MAX_ID property to TableConfig and use it to parse? btw Delta-Spark is parsing it as Long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know why delta-spark actually parses this as Long instead of an Int? Schema Ids should be ints

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only reason I haven't added the MAX_COLUMN_ID to TableConfig because it's an internal table property that should be set by users and TableConfig seems to be made for properties that can be explicitly set by users.
I think TableConfig would need a concept of whether a property should be settable by a user or not, wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we need a property in the TableConfig to say it is changeable by the use or not. It can be a separate PR.

////////////////////////////

/** Visible for testing */
static int findMaxColumnId(StructType schema) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.

.putString(COLUMN_MAPPING_PHYSICAL_NAME_KEY, physicalName)
.build());
}
newSchema = newSchema.add(updatedField);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we need to handle the nested types and array/map elements?

@nastra
Copy link
Contributor Author

nastra commented Jul 24, 2024

Looks great! Thanks for working on this.

Qn: looks like the data writing path is not yet handled which means we can create tables with column mapping enabled, but the data written is not correct. Can we split out the physical name/id mapping info method and complete the implementation (i.e handling the nested and array/map types)?

@vkorukanti are you referring to what's being handled as part of #3384 w.r.t the data write path?

@nastra nastra force-pushed the column-mapping-meta-update branch from 599bd06 to be1dc76 Compare July 24, 2024 14:57
@nastra nastra force-pushed the column-mapping-meta-update branch from be1dc76 to bb55534 Compare July 24, 2024 15:13
Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor follow ups and in addition to the data path change. This PR LGTM.

private static boolean allowMappingModeChange(String oldMode, String newMode) {
// only upgrade from none to name mapping is allowed
return oldMode.equals(newMode) ||
(COLUMN_MAPPING_MODE_NONE.equals(oldMode) &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum in either places is fine.

private static Metadata assignColumnIdAndPhysicalName(Metadata metadata, boolean isNewTable) {
StructType schema = metadata.getSchema();
int maxColumnId = Math.max(
Integer.parseInt(metadata.getConfiguration()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we need a property in the TableConfig to say it is changeable by the use or not. It can be a separate PR.

@vkorukanti
Copy link
Collaborator

Looks great! Thanks for working on this.
Qn: looks like the data writing path is not yet handled which means we can create tables with column mapping enabled, but the data written is not correct. Can we split out the physical name/id mapping info method and complete the implementation (i.e handling the nested and array/map types)?

@vkorukanti are you referring to what's being handled as part of #3384 w.r.t the data write path?

Yes, that's the issue.

@vkorukanti vkorukanti merged commit ec3f6be into delta-io:master Jul 29, 2024
10 checks passed
@nastra nastra deleted the column-mapping-meta-update branch July 29, 2024 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request][Kernel] Add column mapping metadata update functionality
4 participants