[WIP][SPARK-41367][SQL] Enable V2 file tables in read paths in session catalog #38885
What changes were proposed in this pull request?
Currently the config spark.sql.sources.useV1SourceList has no effect on V2 file tables in the session catalog; the V1 path is always used. This PR enables V2 file tables (if they are not in spark.sql.sources.useV1SourceList) in read paths via the session catalog and fixes a few issues where V2 behaves differently from V1.
Why are the changes needed?
It would be good if we could use the already available V2 file source implementations with the session catalog. We ran into a few problems with V2 optimization paths that we want to fix in the future. But currently Spark doesn't have built-in catalog support for any of the V2 file table implementations. As a first step, this PR enables V2, controlled by spark.sql.sources.useV1SourceList, for select query plans only. All commands and InsertIntoStatement continue to use the V1 implementations.
The PR also contains some test changes:
SQLQuerySuite is split into V1 and V2 versions.
OrcPartitionDiscoverySuite and ParquetPartitionDiscoverySuite are modified to behave like the V1 versions do. Basically, the order of output columns changed in the edge case when partitioning and data columns overlap.
Does this PR introduce any user-facing change?
Yes: the order of output columns changes when partitioning and data columns overlap.
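To make the edge case concrete, here is a small Python toy model of two possible column orderings when a partition column also appears in the data files. The description does not spell out the exact ordering rules, so both functions and their names are illustrative assumptions, not Spark APIs: one keeps an overlapping column at its data-schema position, the other moves it to the end with the partition columns.

```python
from typing import List


def overlap_in_data_position(data_cols: List[str], partition_cols: List[str]) -> List[str]:
    """Assumed ordering A: overlapping columns stay where the data schema has them;
    partition columns that are not in the data schema are appended at the end."""
    data_set = set(data_cols)
    return list(data_cols) + [c for c in partition_cols if c not in data_set]


def overlap_moved_to_partition_tail(data_cols: List[str], partition_cols: List[str]) -> List[str]:
    """Assumed ordering B: overlapping columns are dropped from the data part
    and all partition columns are appended at the end."""
    part_set = set(partition_cols)
    return [c for c in data_cols if c not in part_set] + list(partition_cols)


# Hypothetical table: the files contain column "p", and "p" is also a partition column.
data_cols = ["id", "p", "value"]
partition_cols = ["p"]

order_a = overlap_in_data_position(data_cols, partition_cols)
order_b = overlap_moved_to_partition_tail(data_cols, partition_cols)
print(order_a)
print(order_b)
```

When the two schemas do not overlap, both orderings give the same result, which is why only the overlap case is user-visible.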
How was this patch tested?
Existing and new UTs.
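For reviewers, the selection rule from the first section can be sketched as a tiny Python model. It is only an illustration of the rule as described here, not Spark code: the function name is hypothetical, and treating the config value as a comma-separated, case-insensitive list of source names is an assumption.

```python
def uses_v2_read_path(source_name: str, use_v1_source_list: str, is_select_query: bool = True) -> bool:
    """Toy model: return True if a file source would take the V2 read path."""
    # Assumption: the config value is a comma-separated list, matched case-insensitively.
    v1_sources = {
        name.strip().lower()
        for name in use_v1_source_list.split(",")
        if name.strip()
    }
    # Per the description, only select query plans are affected;
    # commands and inserts stay on the V1 implementations.
    return is_select_query and source_name.strip().lower() not in v1_sources


# Hypothetical config values, for illustration only.
print(uses_v2_read_path("parquet", "orc,csv"))
print(uses_v2_read_path("ORC", "orc, csv"))
print(uses_v2_read_path("parquet", "", is_select_query=False))
```

With these inputs the first call selects V2, the second stays on V1 because orc is listed, and the third stays on V1 because it is not a select query.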