Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Robust schema usage and checks for batch CDF queries #1509

Closed
wants to merge 4 commits into from

Conversation

jackierwzhang
Copy link
Contributor

@jackierwzhang jackierwzhang commented Dec 6, 2022

Description

Properly handle read-incompatible schema changes when querying batch CDF (e.g. table_changes() TVF, and SQL/DF APIs).
Right now, batch CDF is almost always serving past data using the latest schema, but the latest schema may not be read-compatible with the data files.

this PR introduces the following:

  1. added checks for read-incompatibility in batch CDF
  2. a new Delta SQL conf changeDataFeed.defaultSchemaModeForColumnMappingTable that can be set to endVersion, latest or legacy. If legacy it would fallback to the current behavior in which either latest schema is used or a time-travel version schema is used, if endVersion is set, we will use the end version's schema to serve the batch and if latest is set, the latest schema will always be used. Note, this is orthogonal to 1), checks will be triggered all the time to ensure read-compatibility, even when endVersion is used.
  3. The SQL conf cannot be used when time-travel options versionOf is specified. Apparently ppl can time-travel the schema during querying batch CDF, this is probably unintentional, but since it exists, i explicitly blocked two options from being used concurrently.

How was this patch tested?

New Unit tests.

Does this PR introduce any user-facing changes?

No

*
* However, if there are schema changes between analysis and execution, since we froze this
* schema, our schema incompatibility checks will kick in during the scan so we will always
* be safe - Although it is a notable caveat that user should be aware of because the CDC query
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are users aware of this? should we update any documentation to make this clearer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing is documented yet for this behavior, but we probably should cc @jose-torres

* may break.
*/
private lazy val endingVersionForBatchSchema: Long = endingVersion.map { v =>
// As defined in docs, if ending version is greater than the latest version, we will just use
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which docs? method docs? public website docs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

method doc, i could make it clearer.

Comment on lines 53 to 56
sql(
s"""
|ALTER TABLE delta.`${dir.getCanonicalPath}` ADD COLUMN (name string)
|""".stripMargin)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be 1 line instead of 4?

Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Left minor comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants