
[SPARK-54022][SPARK-56617][CONNECT][TESTS] Add DSv2 CACHE TABLE tests using Spark Connect#55577

Open
longvu-db wants to merge 23 commits into apache:master from longvu-db:dsv2-cache-connect-tests

Conversation

@longvu-db
Contributor

What changes were proposed in this pull request?

Add DSv2 CACHE TABLE test coverage for Spark Connect in a new DataSourceV2CacheConnectSuite.

The tests use an in-process Connect server (SparkConnectServerTest) so that a Connect client performs cache and SQL operations, while external writes go through the catalog API (InMemoryBaseTable.withData), which bypasses the CacheManager. This simulates a truly external writer whose changes are invisible to cached reads.
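The pinning behavior under test can be sketched with a minimal, self-contained Scala model. This is not the actual Spark code: `Table`, `CacheManager`, and their methods here are simplified stand-ins for `InMemoryBaseTable` and Spark's `CacheManager`, used only to illustrate why a catalog-API write is invisible to a cached read while a session write invalidates and recaches.

```scala
// Illustrative sketch only; these classes are simplified stand-ins,
// not the real Spark InMemoryBaseTable / CacheManager.
object CachePinningSketch {
  final class Table(var rows: Vector[Int])

  final class CacheManager {
    private var cached: Option[Vector[Int]] = None
    def cache(t: Table): Unit = cached = Some(t.rows)           // snapshot rows
    def read(t: Table): Vector[Int] = cached.getOrElse(t.rows)  // pinned if cached
    def invalidateAndRecache(t: Table): Unit = cached = Some(t.rows)
  }

  def main(args: Array[String]): Unit = {
    val table = new Table(Vector(1, 2))
    val cm = new CacheManager
    cm.cache(table)                        // CACHE TABLE

    table.rows :+= 3                       // external write via catalog API: bypasses cm
    assert(cm.read(table) == Vector(1, 2)) // cached read is pinned; row 3 invisible

    table.rows :+= 4                       // session INSERT goes through the CacheManager
    cm.invalidateAndRecache(table)
    assert(cm.read(table) == Vector(1, 2, 3, 4)) // recached, all rows visible
    println("ok")
  }
}
```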

New tests:

  • Scenario 1+2 ("CACHE TABLE pins state; session write invalidates, external does not"): External data write via catalog API is invisible to the cached table. Session INSERT invalidates and recaches. Subsequent external write is again invisible.
  • Scenario 3 ("cached table pinned against external schema change"): External ADD COLUMN via catalog API is invisible to the cached table.
  • Scenario 4 ("session schema change invalidates cache, external write invisible"): Session ALTER TABLE ADD COLUMN via Connect invalidates and rebuilds cache with the new 3-column schema.
  • Scenario 5 ("cached table after external drop and recreate sees empty table"): External drop+recreate via catalog API produces a new table with a different ID; query sees the new empty table.
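Scenario 5 hinges on cache entries being tied to a table's identity, so a drop-and-recreate never matches the stale entry. A minimal sketch, again with hypothetical stand-in classes rather than the real Spark internals:

```scala
// Illustrative model: cache entries are keyed by a per-instance table id,
// so a dropped-and-recreated table never hits the old cache entry.
object DropRecreateSketch {
  final case class TableId(value: Long)
  final class Table(val id: TableId, var rows: Vector[Int])

  private var nextId = 0L
  def createTable(): Table = { nextId += 1; new Table(TableId(nextId), Vector.empty) }

  final class CacheManager {
    private var cached: Map[TableId, Vector[Int]] = Map.empty
    def cache(t: Table): Unit = cached += (t.id -> t.rows)
    def read(t: Table): Vector[Int] = cached.getOrElse(t.id, t.rows)
  }

  def main(args: Array[String]): Unit = {
    val cm = new CacheManager
    var table = createTable()
    table.rows = Vector(1, 2)
    cm.cache(table)                  // CACHE TABLE on the original table

    table = createTable()            // external drop + recreate: new table id
    assert(cm.read(table).isEmpty)   // query sees the new, empty table
    println("ok")
  }
}
```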

Why are the changes needed?

The classic (non-Connect) CACHE TABLE tests in DataSourceV2DataFrameSuite verify cache pinning behavior using direct catalog API access. This PR adds the equivalent coverage for Spark Connect, ensuring that cache pinning works correctly when operations flow through the Connect protocol.

Does this PR introduce any user-facing change?

No. This PR only adds tests.

How was this patch tested?

New tests in DataSourceV2CacheConnectSuite (all 4 tests pass).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-6)

…park Connect

Add DataSourceV2CacheConnectSuite that tests CACHE TABLE behavior with
two Connect sessions. Session 1 caches and reads; session 2 acts as an
external writer. Uses SharedInMemoryTableCatalog so both sessions share
the same underlying table data via a static ConcurrentHashMap.

Tests cover the five CACHE TABLE scenarios from the design doc, plus cache-command interactions:
- S1: external data write after CACHE TABLE
- S2: session write then external write
- S3: external schema change (ADD COLUMN)
- S4: session schema change then external write
- S5: external drop and recreate table
- REFRESH TABLE and UNCACHE TABLE interactions

Co-authored-by: Isaac
Two categories of tests following the SPARK-54022 pattern:

1. Catalog API tests: modify table directly via catalog API (bypasses
   CacheManager). Cached reads return pinned/stale data. Session writes
   invalidate and recache.

2. External session tests: second SparkSession (shared CacheManager)
   modifies table via SQL (triggers refreshCache). Tests verify
   the shared CacheManager behavior for all 5 design doc scenarios.

Co-authored-by: Isaac
The DataSourceV2CacheSuite imports SharedInMemoryTableCatalog but
the class was not included in this branch. Add it to fix CI.

Co-authored-by: Isaac
SparkSession.builder().sparkContext(sc).create() creates a new
SharedState with a separate CacheManager, so ext-session writes
never refresh the primary session's cache. Additionally,
extSession.close() stops the shared SparkContext, breaking all
subsequent tests.

Fix by using spark.newSession() which shares the same SharedState
(and CacheManager), and does not stop the SparkContext on cleanup.
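The distinction can be sketched with a toy model (hypothetical classes, not the real SparkSession/SharedState): a builder-created session owns a fresh shared state, so refreshes in one never reach the other, while `newSession()` reuses the same state.

```scala
// Toy model of SharedState ownership; names are simplified stand-ins.
object SharedStateSketch {
  final class CacheManager { var refreshed = false }
  final class SharedState { val cacheManager = new CacheManager }
  final class Session(val sharedState: SharedState) {
    def newSession(): Session = new Session(sharedState) // shares state
  }
  def builderCreate(): Session = new Session(new SharedState) // fresh state

  def main(args: Array[String]): Unit = {
    val primary = new Session(new SharedState)

    val viaBuilder = builderCreate()
    viaBuilder.sharedState.cacheManager.refreshed = true
    assert(!primary.sharedState.cacheManager.refreshed) // primary never sees it

    val viaNewSession = primary.newSession()
    viaNewSession.sharedState.cacheManager.refreshed = true
    assert(primary.sharedState.cacheManager.refreshed)  // shared CacheManager
    println("ok")
  }
}
```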

Co-authored-by: Isaac
Remove DataSourceV2CacheSuite (classic) as this PR focuses on Connect
tests only. Trim DataSourceV2CacheConnectSuite to strictly the five
design doc scenarios (S1-S5), removing REFRESH TABLE and UNCACHE TABLE
extra tests. Each test documents both the current behavior (shared
CacheManager, external writes visible) and the proposed behavior
(per-session caching, external writes pinned) in comments.

Co-authored-by: Isaac
Co-authored-by: Isaac
…xternal writes

Replace the Connect client tests (which shared a CacheManager and didn't
actually test cache pinning) with proper in-process SparkConnectServerTest
tests. External writes now go through the catalog API
(InMemoryBaseTable.withData), which bypasses the CacheManager entirely.
This verifies that external writes are truly invisible to cached reads.

Remove SharedInMemoryTableCatalog and RemoteSparkSession changes (no
longer needed). The tests use testcat (InMemoryTableCatalog) directly.

Co-authored-by: Isaac
Use getServerSession() after first RPC to access the isolated server
session for assertCached/checkAnswer. Use loadTable with write
privileges to get the original table (not a copy) for external writes.
Fix S1+S2 expected values to account for both external and session rows.

All 4 tests pass.

Co-authored-by: Isaac
Co-authored-by: Isaac
Add assertRows() calls that read data through the Connect client
(connectSession.sql("SELECT * FROM T").collect()) alongside the
existing server-side assertions (checkAnswer(serverSession.table(T))).
This verifies the full Connect round-trip for cached data reads,
not just server-side cache state.

Co-authored-by: Isaac
Read data through connectSession (the Connect client) instead of
serverSession (classic) for all data assertions. Only assertCached
remains on serverSession since cache plan internals are not exposed
through Connect.

Co-authored-by: Isaac
Verify that after external drop/recreate, the table schema is
preserved as (id, salary).

Co-authored-by: Isaac
Co-authored-by: Isaac