[SPARK-25894][SQL] Add a ColumnarFileFormat type which returns the column count for a given schema #22905

Closed. mallman wants to merge 3 commits into apache:master from VideoAmp:spark-25894-file_source_scan_exec_column_count_metadata.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala (32 additions, 0 deletions)
```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.datasources

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

/**
 * An optional mix-in for columnar [[FileFormat]]s. This trait provides some helpful metadata when
 * debugging a physical query plan.
 */
private[sql] trait ColumnarFileFormat {
  _: FileFormat =>

  /** Returns the number of columns required to satisfy the given schema. */
  def columnCountForSchema(conf: SQLConf, schema: StructType): Int
}
```
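The self-type annotation `_: FileFormat =>` means this trait can only be mixed into a `FileFormat`. A minimal, self-contained sketch of the pattern, using toy stand-ins rather than Spark's real `FileFormat`, `SQLConf`, and `StructType` (all names below are illustrative assumptions, not Spark code):

```scala
// Toy stand-ins for Spark's types, just to illustrate the mix-in pattern.
trait FileFormat
case class SQLConf(caseSensitive: Boolean = false)

sealed trait DataType
case object IntType extends DataType
case class StructType(fields: Seq[(String, DataType)]) extends DataType

// The mix-in: the self-type restricts it to FileFormat implementations.
trait ColumnarFileFormat { _: FileFormat =>
  def columnCountForSchema(conf: SQLConf, schema: StructType): Int
}

// A toy columnar format that counts leaf fields, recursing into nested structs.
class ToyColumnarFormat extends FileFormat with ColumnarFileFormat {
  override def columnCountForSchema(conf: SQLConf, schema: StructType): Int =
    schema.fields.map {
      case (_, s: StructType) => columnCountForSchema(conf, s)
      case _                  => 1
    }.sum
}
```

A nested schema with fields `a` and `b.{x, y}` would report three columns under this toy counting scheme.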
Can't we know the column count from the `requiredSchema` metadata?
You can "guess-timate" the physical column count by counting the leaf fields in the `ReadSchema` metadata value, but the true answer is an implementation detail of the file format. For example, in the implementation of `ColumnarFileFormat` for Parquet, we convert the Catalyst schema to the Parquet schema before counting columns. I suppose a similar approach would be required for ORC and other columnar formats.

That being said, this new metadata value isn't really meant to provide new and essential information, per se. Its purpose is to provide easy-to-read, practical information that's useful for quickly validating that schema pruning is working as expected. For example, seeing that a query is reading all 423 columns from a table instead of 15 tells us pretty quickly that schema pruning is not working (unless we really are trying to read the entire table schema). I've found the `ReadSchema` value difficult to read in practice because of its terse syntax, and because its printout is truncated.
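To illustrate why counting leaf fields in the logical schema is only an estimate: in a Parquet-like physical layout, a single map-typed field materializes as two physical columns (one for keys, one for values). A self-contained sketch with toy types (these are illustrative stand-ins, not Spark's or Parquet's actual schema classes):

```scala
sealed trait DataType
case object LongType extends DataType
case class MapType(key: DataType, value: DataType) extends DataType
case class StructType(fields: Seq[(String, DataType)]) extends DataType

// Naive guess: count the leaf fields of the logical (Catalyst-like) schema.
// A map field counts as a single leaf here.
def logicalLeafCount(dt: DataType): Int = dt match {
  case StructType(fs) => fs.map { case (_, t) => logicalLeafCount(t) }.sum
  case _              => 1
}

// Physical count for a Parquet-like layout, where a map stores its keys and
// values as separate columns.
def physicalColumnCount(dt: DataType): Int = dt match {
  case StructType(fs) => fs.map { case (_, t) => physicalColumnCount(t) }.sum
  case MapType(k, v)  => physicalColumnCount(k) + physicalColumnCount(v)
  case _              => 1
}
```

For a schema with a long `id` and a `map<long, long>` field, the logical leaf count is 2 while the physical column count is 3, which is why the format implementation, not the logical schema, has to answer the question.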
Shall we only include this info when the columnar reader is on?
Is this something we really should include in the metadata? If the purpose is to check whether column pruning works or not, logging should be good enough. Adding a trait for it sounds like overkill at the current stage. Let's not add an abstraction based only on a rough guess that it can be generalized.
What's your basis for this assertion?
Also, what kind of logging are you suggesting?
No... I don't think we should add it only because it was requested once. These look like the same kind of request to me. I'll have no argument if this one is added and other people later request others, but we should make it clear why this one specifically should be added. We're not going to add every piece of information to the metadata on request.

If the purpose of adding it is to check whether the pushdown is actually working, logging sounds appropriate for that purpose.
I mean, I really think it's more appropriate to check whether something works as expected via logging. I'm not underestimating your statement. Let's be very clear about why this should go in metadata rather than a log: how and why is it more useful than logging, and in what cases?

For clarification, the scope of this information is narrower than just checking whether column pruning is working, since we already print out the requested columns on the Spark side.
I'll reiterate a sample use case: this also matters to users planning and debugging queries in a Jupyter notebook, as we have at VideoAmp. The level of effort for these users to go dig through a driver log file is quite high compared to inspecting a query plan.

When you refer to logging, which log are you referring to? When would this information be logged? And at what log level?
That basically says logging is useless when using beeline. I don't think this info is super important to (non-advanced) users.

I mean log4j, which is Spark's logging module, and I meant the information you're including in the metadata. Maybe at info level, or debug level.
My experience says otherwise, and advanced users use beeline and Jupyter, too.