[SPARK-55276][DOCS] Document how SDP datasets are stored and refreshed#55277

Open
moomindani wants to merge 1 commit into apache:master from moomindani:sdp-doc-storage-refresh

Conversation

Contributor

@moomindani moomindani commented Apr 9, 2026

Closes #55276.

What changes were proposed in this pull request?

Add a new "How Datasets are Stored and Refreshed" section to the Spark Declarative Pipelines programming guide. This section covers:

  • Table Format: Default format (parquet via spark.sql.sources.default) and how to specify a different format with Python and SQL examples
  • How Materialized Views are Refreshed: Full recomputation (TRUNCATE + append) on every pipeline run, and how this differs from database-native materialized views
  • How Streaming Tables are Refreshed: Incremental processing with checkpoints and schema evolution support
  • Full Refresh: Behavior differences between materialized views and streaming tables
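The two refresh strategies listed above can be illustrated with a toy model. This is not the SDP API or implementation, just plain Python sketching the semantics the new section documents: a materialized view is fully recomputed (TRUNCATE + append) on every run, while a streaming table uses a checkpoint to process only data that arrived since the last run. The `row * 2` transform and the `offset` checkpoint field are illustrative placeholders.

```python
# Toy model of the two refresh strategies (NOT the SDP API):
# materialized views are fully recomputed; streaming tables are incremental.

def refresh_materialized_view(source_rows, table):
    """Full recomputation: discard table contents and rebuild from source."""
    table.clear()                                  # TRUNCATE
    table.extend(row * 2 for row in source_rows)   # recompute + append

def refresh_streaming_table(source_rows, table, checkpoint):
    """Incremental: process only rows beyond the checkpointed offset."""
    new_rows = source_rows[checkpoint["offset"]:]
    table.extend(row * 2 for row in new_rows)
    checkpoint["offset"] = len(source_rows)        # persist progress

source = [1, 2, 3]
mv, st, ckpt = [], [], {"offset": 0}

refresh_materialized_view(source, mv)
refresh_streaming_table(source, st, ckpt)
print(mv, st, ckpt)   # [2, 4, 6] [2, 4, 6] {'offset': 3}

source.append(4)      # new data arrives
refresh_materialized_view(source, mv)      # recomputes all four rows
refresh_streaming_table(source, st, ckpt)  # processes only the new row
print(mv, st, ckpt)   # [2, 4, 6, 8] [2, 4, 6, 8] {'offset': 4}
```

Both tables end up with the same contents, but the streaming table touched only one new row on the second run, which is why it needs durable checkpoint storage while the materialized view does not.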

Why are the changes needed?

The current programming guide explains how to define datasets but does not explain how they are stored or refreshed. Users need to understand:

  • What format their tables are stored in by default
  • That materialized views perform a full recomputation on every run (unlike PostgreSQL-style MVs)
  • That streaming tables require checkpoint storage on a Hadoop-compatible file system
  • What --full-refresh actually does for each dataset type

Without this information, users cannot make informed decisions about table formats, storage configurations, or pipeline performance.
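Continuing the toy model, a sketch of what `--full-refresh` means for a streaming table under the behavior described above (again not the SDP implementation): the table is truncated and the checkpoint is discarded, so the next refresh reprocesses the source from the beginning. The `offset` field is an illustrative stand-in for real checkpoint state.

```python
# Toy sketch of --full-refresh for a streaming table (NOT the SDP code):
# reset the checkpoint, truncate the table, then refresh from scratch.

def full_refresh_streaming_table(source_rows, table, checkpoint):
    table.clear()              # truncate existing contents
    checkpoint["offset"] = 0   # discard streaming progress
    # an ordinary incremental refresh now starts from the beginning
    new_rows = source_rows[checkpoint["offset"]:]
    table.extend(new_rows)
    checkpoint["offset"] = len(source_rows)

source = [10, 20, 30]
table, ckpt = [999], {"offset": 3}   # stale contents and checkpoint
full_refresh_streaming_table(source, table, ckpt)
print(table, ckpt)   # [10, 20, 30] {'offset': 3}
```

For a materialized view, a full refresh is no different from a normal refresh, since every run already recomputes the table in full.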

Does this PR introduce any user-facing change?

No. Documentation only.

How was this patch tested?

Documentation change only. Verified the content is accurate by reading the SDP implementation (DatasetManager.scala, FlowExecution.scala).

Was this patch authored or co-authored using generative AI tooling?

Yes.

Add a new section to the Spark Declarative Pipelines programming guide
that explains the storage and refresh mechanics, including:
- Default table format and how to specify a different format
- How materialized views are refreshed (full recomputation via TRUNCATE + append)
- How streaming tables are refreshed (incremental processing with checkpoints)
- Full refresh behavior for both dataset types
moomindani force-pushed the sdp-doc-storage-refresh branch from 5763f95 to 723afa3 on April 9, 2026 at 08:08.