[SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation #12319
…SourceStrategy mode
@rajeshbalamohan can you fix your title / description? The title is spilling over into the end of the description.
Two other things:
Sure @rxin. I just updated it. Thanks
Thanks!
import java.io.IOException;
import java.util.List;

public class OrcRecordReader extends RecordReader<NullWritable, OrcStruct> {
if this is based on a file from hive, can you say that in the classdoc and explain what the differences are?
Sure. This is based on OrcNewInputFormat.OrcRecordReader (which is marked private). The only addition is getObjectInspector, which is intended to reduce NameNode calls later. I will update the doc.
cc @liancheng for review
// file. Would be helpful for partitioned datasets.
new OrcRecordReader(OrcFile.createReader(new Path(new URI(file
  .filePath)), OrcFile.readerOptions(conf)), conf, fileSplit.getStart(),
  fileSplit.getLength())
Nit: Please split this call into the following two for better readability:
val orcReader = OrcFile.createReader(
new Path(new URI(file.filePath)), OrcFile.readerOptions(conf))
new OrcRecordReader(orcReader, conf, fileSplit.getStart(), fileSplit.getLength())
@rajeshbalamohan Sorry for the late review and thanks for working on this! My major concern is that the newly added …
Thanks for the review @liancheng
test this please
add to whitelist
LGTM pending Jenkins.
Test build #2859 has finished for PR 12319 at commit
Test build #56770 has finished for PR 12319 at commit
Thanks - merging in master.
Thanks @liancheng, @rxin
What changes were proposed in this pull request?
When FileSourceStrategy is used, a record reader is created, which incurs a NameNode (NN) call internally. Later, OrcRelation.unwrapOrcStructs ends up reading the file information again to get the ObjectInspector, which incurs an additional NN call. It would be good to avoid this additional call, especially for partitioned datasets.
Added OrcRecordReader, which is very similar to OrcNewInputFormat.OrcRecordReader but additionally exposes the ObjectInspector. This eliminates the need to look up the file later to generate the object inspector, which is especially useful for partitioned tables/datasets.
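To illustrate the idea only, here is a minimal, dependency-free sketch of why exposing the inspector from the record reader saves one lookup per file. It does not use the Hive or Spark APIs: CountingFileSystem, SketchFileReader, SketchRecordReader, the schema string and the file path are all hypothetical stand-ins, and each open() is assumed to cost exactly one NameNode call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a file system; each open() models one NameNode call.
final class CountingFileSystem {
    private final Map<String, String> schemas = new HashMap<>();
    private int nameNodeCalls = 0;

    void put(String path, String schema) {
        schemas.put(path, schema);
    }

    SketchFileReader open(String path) {
        nameNodeCalls++; // assumption: one metadata round trip per open
        String schema = schemas.get(path);
        if (schema == null) {
            throw new IllegalArgumentException("no such file: " + path);
        }
        return new SketchFileReader(schema);
    }

    int nameNodeCalls() {
        return nameNodeCalls;
    }
}

// Hypothetical stand-in for an opened ORC file reader that already knows its schema.
final class SketchFileReader {
    private final String schema;

    SketchFileReader(String schema) {
        this.schema = schema;
    }

    String schema() {
        return schema;
    }
}

// Mirrors the idea of the added record reader: it keeps the file reader it was
// built from and exposes the schema, so callers need not reopen the file.
final class SketchRecordReader {
    private final SketchFileReader reader;

    SketchRecordReader(SketchFileReader reader) {
        this.reader = reader;
    }

    String getObjectInspector() {
        return reader.schema();
    }
}

final class NameNodeCallSketch {
    // Old flow: open the file for the record reader, then open it again for the schema.
    static int callsWithSecondLookup(CountingFileSystem fs, String path) {
        int before = fs.nameNodeCalls();
        new SketchRecordReader(fs.open(path));
        fs.open(path).schema();
        return fs.nameNodeCalls() - before;
    }

    // New flow: open the file once and ask the record reader for the schema.
    static int callsWithExposedInspector(CountingFileSystem fs, String path) {
        int before = fs.nameNodeCalls();
        SketchRecordReader recordReader = new SketchRecordReader(fs.open(path));
        recordReader.getObjectInspector();
        return fs.nameNodeCalls() - before;
    }

    public static void main(String[] args) {
        CountingFileSystem fs = new CountingFileSystem();
        String path = "/table/part=1/file.orc"; // hypothetical path
        fs.put(path, "struct<a:int>");
        System.out.println(callsWithSecondLookup(fs, path));     // prints 2
        System.out.println(callsWithExposedInspector(fs, path)); // prints 1
    }
}
```

In this toy model the saving is one call per file, so it grows with the number of files scanned, which is why partitioned datasets are called out above.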
How was this patch tested?
Ran TPC-DS queries manually and also verified by running org.apache.spark.sql.hive.orc.OrcSuite, org.apache.spark.sql.hive.orc.OrcQuerySuite, org.apache.spark.sql.hive.orc.OrcPartitionDiscoverySuite, OrcPartitionDiscoverySuite.OrcHadoopFsRelationSuite, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite