DRILL-7055: Revise SELECT * to exclude partitions #1675
Closed
Historically, a SELECT * (wildcard) query on a partitioned table included the partition directory names as a set of "dir0", "dir1", ... columns. When the table contains files at different directory depths, this can lead to schema change exceptions, because some readers create, say, "dir0" and "dir1", while others create just "dir0".
The result is that either 1) things just work, 2) the client gets some batches with two partition columns, others with one, or 3) a hard schema change occurs as the project operator creates missing columns as nullable int.
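To make the depth mismatch concrete, here is a minimal Python sketch of how per-file partition columns can differ. It is an illustration only, not Drill code: the function name `partition_columns` and the example paths are hypothetical, and it assumes only that each directory level between the table root and the file yields one "dirN" column.

```python
from pathlib import PurePosixPath


def partition_columns(table_root: str, file_path: str) -> dict:
    """Map dir0, dir1, ... to the directory names between the root and the file.

    Hypothetical helper for illustration; not part of Drill.
    """
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    dirs = rel.parts[:-1]  # drop the file name, keep only directory levels
    return {f"dir{i}": name for i, name in enumerate(dirs)}


# Two files in the same (hypothetical) table, at different depths.
shallow = partition_columns("/data/sales", "/data/sales/2018/summary.csv")
deep = partition_columns("/data/sales", "/data/sales/2018/Q1/jan.csv")

# The reader of the deeper file produces a column the other reader lacks.
missing_in_shallow = sorted(set(deep) - set(shallow))
```

In this sketch the shallow file yields only "dir0" while the deep file yields "dir0" and "dir1", which is the per-reader mismatch that downstream operators then have to reconcile.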
This change proposes that the wildcard expand to table columns only and no longer include partition columns. Partition columns will now work the way the "implicit" file columns already work, so this change improves consistency.
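The projection rule described here can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not Drill's planner logic: `resolve_projection`, the `legacy` flag, and the column names "a" and "b" are all hypothetical, and duplicate handling is reduced to skipping names that are already projected.

```python
def resolve_projection(select_list, table_cols, partition_cols, legacy=False):
    """Expand a SELECT list into concrete column names.

    Hypothetical model: with legacy=True the wildcard also pulls in partition
    columns (old behavior); with legacy=False it expands to table columns only.
    """
    resolved = []
    for item in select_list:
        if item == "*":
            resolved.extend(table_cols)
            if legacy:
                resolved.extend(partition_cols)
        elif item not in resolved:  # simplification: ignore repeated names
            resolved.append(item)
    return resolved


table_cols = ["a", "b"]
partition_cols = ["dir0", "dir1"]

old_wildcard = resolve_projection(["*"], table_cols, partition_cols, legacy=True)
new_wildcard = resolve_projection(["*"], table_cols, partition_cols)
explicit = resolve_projection(["*", "dir0", "dir1"], table_cols, partition_cols)
```

Under this model, the old wildcard and the new wildcard plus explicitly named partition columns resolve to the same list, while the new wildcard alone resolves to the table columns only.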
The partition columns are still available, but they must be requested explicitly by name (for example, "dir0" and "dir1") in the SELECT list.
Both before and after this change, when including the partition columns explicitly, the nullable int issue described above will occur. However, this change positions us for the revised scan framework that will properly provide the partition columns as nullable VARCHAR whether a matching directory exists or not.
This is a potentially breaking change. Any user who relies on SELECT * returning partition columns (and who has managed to work around the schema change issues) will see different behavior and will have to revise those queries to name the partition columns explicitly.