[SPARK-55276][DOCS] Document how SDP datasets are stored and refreshed#55277

Open
moomindani wants to merge 1 commit into apache:master from moomindani:sdp-doc-storage-refresh

Conversation

Contributor

@moomindani moomindani commented Apr 9, 2026

Closes #55276.

What changes were proposed in this pull request?

Add a new "How Datasets are Stored and Refreshed" section to the Spark Declarative Pipelines programming guide. This section covers:

  • Table Format: Default format (parquet via spark.sql.sources.default) and how to specify a different format with Python and SQL examples
  • How Materialized Views are Refreshed: Full recomputation (TRUNCATE + append) on every pipeline run, and how this differs from database-native materialized views
  • How Streaming Tables are Refreshed: Incremental processing with checkpoints and schema evolution support
  • Full Refresh: Behavior differences between materialized views and streaming tables
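The two refresh strategies listed above can be illustrated with a toy model. This is not the SDP API or implementation, just plain Python sketching the semantics the new section documents: a materialized view is fully recomputed (TRUNCATE + append) on every run, while a streaming table uses a checkpoint to process only data that arrived since the last run. The `row * 2` transform and the `offset` checkpoint field are illustrative placeholders.

```python
# Toy model of the two refresh strategies (NOT the SDP API):
# materialized views are fully recomputed; streaming tables are incremental.

def refresh_materialized_view(source_rows, table):
    """Full recomputation: discard table contents and rebuild from source."""
    table.clear()                                  # TRUNCATE
    table.extend(row * 2 for row in source_rows)   # recompute + append

def refresh_streaming_table(source_rows, table, checkpoint):
    """Incremental: process only rows beyond the checkpointed offset."""
    new_rows = source_rows[checkpoint["offset"]:]
    table.extend(row * 2 for row in new_rows)
    checkpoint["offset"] = len(source_rows)        # persist progress

source = [1, 2, 3]
mv, st, ckpt = [], [], {"offset": 0}

refresh_materialized_view(source, mv)
refresh_streaming_table(source, st, ckpt)
print(mv, st, ckpt)   # [2, 4, 6] [2, 4, 6] {'offset': 3}

source.append(4)      # new data arrives
refresh_materialized_view(source, mv)      # recomputes all four rows
refresh_streaming_table(source, st, ckpt)  # processes only the new row
print(mv, st, ckpt)   # [2, 4, 6, 8] [2, 4, 6, 8] {'offset': 4}
```

Both tables end up with the same contents, but the streaming table touched only one new row on the second run, which is why it needs durable checkpoint storage while the materialized view does not.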

Why are the changes needed?

The current programming guide explains how to define datasets but does not explain how they are stored or refreshed. Users need to understand:

  • What format their tables are stored in by default
  • That materialized views perform a full recomputation on every run (unlike PostgreSQL-style MVs)
  • That streaming tables require checkpoint storage on a Hadoop-compatible file system
  • What --full-refresh actually does for each dataset type

Without this information, users cannot make informed decisions about table formats, storage configurations, or pipeline performance.
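Continuing the toy model, a sketch of what `--full-refresh` means for a streaming table under the behavior described above (again not the SDP implementation): the table is truncated and the checkpoint is discarded, so the next refresh reprocesses the source from the beginning. The `offset` field is an illustrative stand-in for real checkpoint state.

```python
# Toy sketch of --full-refresh for a streaming table (NOT the SDP code):
# reset the checkpoint, truncate the table, then refresh from scratch.

def full_refresh_streaming_table(source_rows, table, checkpoint):
    table.clear()              # truncate existing contents
    checkpoint["offset"] = 0   # discard streaming progress
    # an ordinary incremental refresh now starts from the beginning
    new_rows = source_rows[checkpoint["offset"]:]
    table.extend(new_rows)
    checkpoint["offset"] = len(source_rows)

source = [10, 20, 30]
table, ckpt = [999], {"offset": 3}   # stale contents and checkpoint
full_refresh_streaming_table(source, table, ckpt)
print(table, ckpt)   # [10, 20, 30] {'offset': 3}
```

For a materialized view, a full refresh is no different from a normal refresh, since every run already recomputes the table in full.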

Does this PR introduce any user-facing change?

No. Documentation only.

How was this patch tested?

Documentation change only. Verified the content is accurate by reading the SDP implementation (DatasetManager.scala, FlowExecution.scala).

Was this patch authored or co-authored using generative AI tooling?

Yes.

Add a new section to the Spark Declarative Pipelines programming guide
that explains the storage and refresh mechanics, including:
- Default table format and how to specify a different format
- How materialized views are refreshed (full recomputation via TRUNCATE + append)
- How streaming tables are refreshed (incremental processing with checkpoints)
- Full refresh behavior for both dataset types
moomindani force-pushed the sdp-doc-storage-refresh branch from 5763f95 to 723afa3 on April 9, 2026 at 08:08.