[SPARK-54022][SPARK-56617][CONNECT][TESTS] Add DSv2 CACHE TABLE tests using Spark Connect #55577
longvu-db wants to merge 23 commits into apache:master from
Conversation
…park Connect

Add DataSourceV2CacheConnectSuite that tests CACHE TABLE behavior with two Connect sessions. Session 1 caches and reads; session 2 acts as an external writer. Uses SharedInMemoryTableCatalog so both sessions share the same underlying table data via a static ConcurrentHashMap.

Tests cover all five CACHE TABLE scenarios from the design doc:
- S1: external data write after CACHE TABLE
- S2: session write then external write
- S3: external schema change (ADD COLUMN)
- S4: session schema change then external write
- S5: external drop and recreate table

Also covered: REFRESH TABLE and UNCACHE TABLE interactions.

Co-authored-by: Isaac
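The sharing mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the actual SharedInMemoryTableCatalog code; the object name, field names, and row representation are all assumptions:

```scala
import java.util.concurrent.{ConcurrentHashMap, CopyOnWriteArrayList}

// Hypothetical sketch: because the backing map lives in a companion-style
// object, it is a JVM-wide static, so every catalog instance -- and therefore
// every session that registers this catalog -- reads and writes the same tables.
object SharedBackingStore {
  val tables = new ConcurrentHashMap[String, CopyOnWriteArrayList[Array[Any]]]()
}

class SharedInMemoryTableCatalogSketch {
  // All catalog instances delegate to the shared static map.
  private def store = SharedBackingStore.tables

  def createTable(name: String): Unit =
    store.putIfAbsent(name, new CopyOnWriteArrayList[Array[Any]]())

  def appendRow(name: String, row: Array[Any]): Unit =
    store.get(name).add(row)
}
```

With this layout, a write performed through one session's catalog instance is immediately visible to reads through the other session's instance, which is what lets session 2 act as an "external" writer.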
Two categories of tests following the SPARK-54022 pattern:
1. Catalog API tests: modify the table directly via the catalog API (bypasses the CacheManager). Cached reads return pinned/stale data. Session writes invalidate and recache.
2. External session tests: a second SparkSession (shared CacheManager) modifies the table via SQL (triggers refreshCache). These verify the shared-CacheManager behavior for all 5 design doc scenarios.

Co-authored-by: Isaac
The DataSourceV2CacheSuite imports SharedInMemoryTableCatalog but the class was not included in this branch. Add it to fix CI. Co-authored-by: Isaac
SparkSession.builder().sparkContext(sc).create() creates a new SharedState with a separate CacheManager, so ext-session writes never refresh the primary session's cache. Additionally, extSession.close() stops the shared SparkContext, breaking all subsequent tests. Fix by using spark.newSession() which shares the same SharedState (and CacheManager), and does not stop the SparkContext on cleanup. Co-authored-by: Isaac
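The distinction above can be sketched like this. This is a hedged illustration assembled from the commit message, not verified against the PR; the table name is assumed, and `SparkSession.Builder.sparkContext` is an internal API:

```scala
// Buggy: builds a session with a fresh SharedState, so its CacheManager is
// disjoint from the primary session's -- external-session writes never refresh
// the primary session's cache. Worse, closing it stops the shared SparkContext.
// val extSession = SparkSession.builder().sparkContext(sc).create()

// Fixed: newSession() shares the parent's SharedState (and thus its
// CacheManager), and does not stop the SparkContext on cleanup.
val extSession = spark.newSession()
extSession.sql("INSERT INTO testcat.ns.t VALUES (3, 'external')") // triggers refreshCache
```

The key invariant is that CacheManager lives on SharedState: two sessions refresh each other's caches only if they share a SharedState.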
Remove DataSourceV2CacheSuite (classic) as this PR focuses on Connect tests only. Trim DataSourceV2CacheConnectSuite to strictly the five design doc scenarios (S1-S5), removing REFRESH TABLE and UNCACHE TABLE extra tests. Each test documents both the current behavior (shared CacheManager, external writes visible) and the proposed behavior (per-session caching, external writes pinned) in comments. Co-authored-by: Isaac
…ents Co-authored-by: Isaac
…xternal writes Replace the Connect client tests (which shared a CacheManager and didn't actually test cache pinning) with proper in-process SparkConnectServerTest tests. External writes now go through the catalog API (InMemoryBaseTable.withData), which bypasses the CacheManager entirely. This verifies that external writes are truly invisible to cached reads. Remove SharedInMemoryTableCatalog and RemoteSparkSession changes (no longer needed). The tests use testcat (InMemoryTableCatalog) directly. Co-authored-by: Isaac
Use getServerSession() after first RPC to access the isolated server session for assertCached/checkAnswer. Use loadTable with write privileges to get the original table (not a copy) for external writes. Fix S1+S2 expected values to account for both external and session rows. All 4 tests pass. Co-authored-by: Isaac
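The external-write path described above might look roughly like this. It is a sketch assembled from the commit messages; the identifier, privilege set, row values, and helper names are assumptions, not the PR's exact code:

```scala
// Load the live table (not a copy) by requesting write privileges, then append
// rows via the in-memory table's own API. Nothing passes through the session's
// write path, so the CacheManager is never notified.
val ident = Identifier.of(Array("ns"), "t")
val table = catalog
  .loadTable(ident, java.util.EnumSet.of(TableWritePrivilege.INSERT))
  .asInstanceOf[InMemoryBaseTable]
table.withData(externalRows) // invisible to any cached plan

// Cached reads still return the pinned data:
checkAnswer(serverSession.table("testcat.ns.t"), pinnedRows)
```

Requesting write privileges on loadTable matters here: without them the catalog may hand back a read-only copy, and writes to the copy would not affect the table the cached plan was built from.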
Add assertRows() calls that read data through the Connect client (connectSession.sql("SELECT * FROM T").collect()) alongside the existing server-side assertions (checkAnswer(serverSession.table(T))). This verifies the full Connect round-trip for cached data reads, not just server-side cache state.

Co-authored-by: Isaac
Read data through connectSession (the Connect client) instead of serverSession (classic) for all data assertions. Only assertCached remains on serverSession since cache plan internals are not exposed through Connect. Co-authored-by: Isaac
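The split described above (data assertions through the Connect client, cache assertions on the server session) can be sketched as follows; the helper names, table name, and expected rows are assumptions for illustration:

```scala
// Data correctness goes through the Connect client, exercising the full
// round-trip over the Connect protocol.
val rows = connectSession.sql("SELECT * FROM testcat.ns.t").collect()
assert(rows.map(r => (r.getInt(0), r.getString(1))).toSet === Set((1, "a"), (2, "b")))

// Cache-plan introspection stays server-side: InMemoryRelation internals are
// not exposed through Connect, so assertCached runs against the classic session.
assertCached(serverSession.table("testcat.ns.t"))
```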
Verify that after external drop/recreate, the table schema is preserved as (id, salary). Co-authored-by: Isaac
What changes were proposed in this pull request?
Add DSv2 CACHE TABLE test coverage for Spark Connect in a new DataSourceV2CacheConnectSuite. The tests use an in-process Connect server (SparkConnectServerTest) so that a Connect client performs cache and SQL operations, while external writes go through the catalog API (InMemoryBaseTable.withData), which bypasses the CacheManager. This simulates a truly external writer whose changes are invisible to cached reads.

New tests:
- "CACHE TABLE pins state; session write invalidates, external does not": An external data write via the catalog API is invisible to the cached table. A session INSERT invalidates and recaches. A subsequent external write is again invisible.
- "cached table pinned against external schema change": An external ADD COLUMN via the catalog API is invisible to the cached table.
- "session schema change invalidates cache, external write invisible": A session ALTER TABLE ADD COLUMN via Connect invalidates and rebuilds the cache with the new 3-column schema.
- "cached table after external drop and recreate sees empty table": An external drop+recreate via the catalog API produces a new table with a different ID; the query sees the new empty table.

Why are the changes needed?
The classic (non-Connect) CACHE TABLE tests in DataSourceV2DataFrameSuite verify cache pinning behavior using direct catalog API access. This PR adds the equivalent coverage for Spark Connect, ensuring that cache pinning works correctly when operations flow through the Connect protocol.

Does this PR introduce any user-facing change?
No. This PR only adds tests.
How was this patch tested?
New tests in DataSourceV2CacheConnectSuite (all 4 tests pass).

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-6)