[SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation #12319

rajeshbalamohan · 2016-04-12T02:36:32Z

What changes were proposed in this pull request?

When FileSourceStrategy is used, a record reader is created, which incurs a NameNode (NN) call internally. Later, OrcRelation.unwrapOrcStructs ends up reading the file information again to get the ObjectInspector, which incurs an additional NN call. It would be good to avoid this additional call, particularly for partitioned datasets.

Added OrcRecordReader, which is very similar to OrcNewInputFormat.OrcRecordReader but can also expose the ObjectInspector. This eliminates the need to look up the file later to generate the object inspector, which is especially useful for partitioned tables/datasets.
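To make the saving concrete, here is a minimal, self-contained sketch of the idea rather than the code in this patch. It assumes a stand-in file system (`CountingFileSystem`) in which every metadata lookup models one NameNode round trip, and a stand-in reader (`SchemaExposingReader`) that keeps the schema it obtained when it was opened. All class and method names here are hypothetical, and a plain string stands in for the ObjectInspector.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a file system whose metadata lookups hit the NameNode.
class CountingFileSystem {
    private final Map<String, String> schemas = new HashMap<>();
    private int metadataCalls = 0;

    void addFile(String path, String schema) {
        schemas.put(path, schema);
    }

    // Each call models one round trip to the NameNode.
    String openAndReadSchema(String path) {
        metadataCalls++;
        String schema = schemas.get(path);
        if (schema == null) {
            throw new IllegalArgumentException("No such file: " + path);
        }
        return schema;
    }

    int metadataCalls() {
        return metadataCalls;
    }
}

// Hypothetical reader that keeps the schema it obtained when it was opened.
class SchemaExposingReader {
    private final String schema;

    SchemaExposingReader(CountingFileSystem fs, String path) {
        this.schema = fs.openAndReadSchema(path); // the only lookup this reader makes
    }

    // Plays the role of the accessor described above (assumption: a string
    // stands in for the real ObjectInspector).
    String getSchema() {
        return schema;
    }
}

class NameNodeCallDemo {
    // Before: open the reader, then look the file up again to get the schema.
    static int lookupsWithoutReuse(CountingFileSystem fs, String path) {
        int before = fs.metadataCalls();
        new SchemaExposingReader(fs, path);
        fs.openAndReadSchema(path); // second, avoidable lookup
        return fs.metadataCalls() - before;
    }

    // After: ask the reader for the schema it already holds.
    static int lookupsWithReuse(CountingFileSystem fs, String path) {
        int before = fs.metadataCalls();
        SchemaExposingReader reader = new SchemaExposingReader(fs, path);
        reader.getSchema();
        return fs.metadataCalls() - before;
    }

    public static void main(String[] args) {
        CountingFileSystem fs = new CountingFileSystem();
        fs.addFile("/t/part=1/f.orc", "struct<a:int>");
        System.out.println("without reuse: " + lookupsWithoutReuse(fs, "/t/part=1/f.orc"));
        System.out.println("with reuse: " + lookupsWithReuse(fs, "/t/part=1/f.orc"));
    }
}
```

In this toy model the saving is one lookup per file, which is why the effect would be most visible on partitioned datasets made up of many files.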

How was this patch tested?

Ran TPC-DS queries manually and also verified by running org.apache.spark.sql.hive.orc.OrcSuite, org.apache.spark.sql.hive.orc.OrcQuerySuite, org.apache.spark.sql.hive.orc.OrcPartitionDiscoverySuite, OrcPartitionDiscoverySuite.OrcHadoopFsRelationSuite, and org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.

…SourceStrategy mode

rxin · 2016-04-12T03:45:31Z

@rajeshbalamohan can you fix your title / description? You are having the title spilling over to the end of the description.

rxin · 2016-04-12T04:02:06Z

Two other things:

Can you follow the format used by other PRs, i.e. [SPARK-14551][SQL] ...?
Can you spell out name node? Most people don't know that NN = namenode. It is a very Hadoop-specific thing.

rajeshbalamohan · 2016-04-12T04:11:45Z

Sure @rxin. I just updated it. Thanks

rxin · 2016-04-12T06:14:50Z

Thanks!

rxin · 2016-04-12T06:15:59Z

sql/core/src/main/java/org/apache/hadoop/hive/ql/io/orc/OrcRecordReader.java

+import java.io.IOException;
+import java.util.List;
+
+public class OrcRecordReader extends RecordReader<NullWritable, OrcStruct> {


If this is based on a file from Hive, can you say that in the classdoc and explain what the differences are?

rajeshbalamohan · 2016-04-12

Sure. This is based on OrcNewInputFormat.OrcRecordReader (which is marked private). The only addition is getObjectInspector, which is intended to reduce NameNode calls later. I will update the doc.
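As a rough illustration of the class shape described in this reply, the sketch below shows a record reader with the usual `nextKeyValue` / `getCurrentValue` iteration plus one extra accessor. It is not the class from the patch: `OpenedOrcFile` and `InspectorExposingRecordReader` are hypothetical stand-ins that avoid any Hive or Hadoop dependency, and plain strings stand in for rows and for the ObjectInspector.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an ORC file that has already been opened once.
class OpenedOrcFile {
    private final String objectInspector;
    private final List<String> rows;

    OpenedOrcFile(String objectInspector, List<String> rows) {
        this.objectInspector = objectInspector;
        this.rows = rows;
    }

    String objectInspector() {
        return objectInspector;
    }

    Iterator<String> rowIterator() {
        return rows.iterator();
    }
}

// Hypothetical record reader: ordinary row iteration plus one extra accessor.
class InspectorExposingRecordReader {
    private final OpenedOrcFile file;
    private final Iterator<String> rows;
    private String current;

    InspectorExposingRecordReader(OpenedOrcFile file) {
        this.file = file;
        this.rows = file.rowIterator();
    }

    // Advances to the next row; returns false once the rows are exhausted.
    boolean nextKeyValue() {
        if (!rows.hasNext()) {
            current = null;
            return false;
        }
        current = rows.next();
        return true;
    }

    String getCurrentValue() {
        return current;
    }

    // The one addition: hand out the inspector of the file that is already open,
    // so the caller does not have to open the file a second time.
    String getObjectInspector() {
        return file.objectInspector();
    }
}

class RecordReaderDemo {
    public static void main(String[] args) {
        OpenedOrcFile file =
            new OpenedOrcFile("struct<a:int>", Arrays.asList("row-1", "row-2"));
        InspectorExposingRecordReader reader = new InspectorExposingRecordReader(file);
        while (reader.nextKeyValue()) {
            System.out.println(reader.getObjectInspector() + " -> " + reader.getCurrentValue());
        }
    }
}
```

A caller that unwraps rows can take the inspector from the reader it is already iterating over, which is the role the reply assigns to getObjectInspector.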

rxin · 2016-04-12T06:16:12Z

cc @liancheng for review

liancheng · 2016-04-20T08:04:41Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala

+          // file. Would be helpful for partitioned datasets.
+          new OrcRecordReader(OrcFile.createReader(new Path(new URI(file
+            .filePath)), OrcFile.readerOptions(conf)), conf, fileSplit.getStart(),
+            fileSplit.getLength())


Nit: Please split this call into the following two for better readability:

    val orcReader = OrcFile.createReader(
      new Path(new URI(file.filePath)), OrcFile.readerOptions(conf))
    new OrcRecordReader(orcReader, conf, fileSplit.getStart(), fileSplit.getLength())

liancheng · 2016-04-20T08:28:23Z

@rajeshbalamohan Sorry for the late review and thanks for working on this! My major concern is that the newly added OrcRecordReader should live in spark-hive rather than spark-sql. Otherwise it looks good except for a few styling issues.

rajeshbalamohan · 2016-04-21T06:29:26Z

Thanks for the review, @liancheng.
The latest commit addresses the review comments. The changes are as follows:

Moved the OrcRecordReader changes to SparkOrcNewRecordReader in spark-hive.
Fixed the comment in OrcRelation to state that it is a custom ORC record reader. Also, the name SparkOrcNewRecordReader should make it clear that this is specific to Spark.
Removed the pom.xml related changes.
Fixed styling issues.

liancheng · 2016-04-23T03:24:44Z

test this please

liancheng · 2016-04-23T03:25:00Z

add to whitelist

liancheng · 2016-04-23T03:25:46Z

LGTM pending Jenkins.

SparkQA · 2016-04-23T04:28:05Z

Test build #2859 has finished for PR 12319 at commit d6bc52d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-23T04:31:22Z

Test build #56770 has finished for PR 12319 at commit d6bc52d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-04-23T05:51:16Z

Thanks - merging in master.

rajeshbalamohan · 2016-04-25T05:08:10Z

Thanks @liancheng, @rxin

SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation with File…

1b99d95

…SourceStrategy mode

rajeshbalamohan changed the title ~~SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation with File…~~ SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation Apr 12, 2016

rajeshbalamohan changed the title ~~SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation~~ [SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation Apr 12, 2016

rxin reviewed Apr 12, 2016

[SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation

e16018e

liancheng reviewed Apr 20, 2016
View reviewed changes

rbalamohan added 2 commits April 21, 2016 11:00

Merge branch 'master' into SPARK-14551

19502a8

[SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation

d6bc52d

asfgit closed this in e5226e3 Apr 23, 2016
