Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation #12319

Closed
wants to merge 4 commits into from

Conversation

rajeshbalamohan
Copy link

What changes were proposed in this pull request?

When FileSourceStrategy is used, record reader is created which incurs a NN call internally. Later in OrcRelation.unwrapOrcStructs, it ends ups reading the file information to get the ObjectInspector. This incurs additional NN call. It would be good to avoid this additional NN call (specifically for partitioned datasets).

Added OrcRecordReader which is very similar to OrcNewInputFormat.OrcRecordReader with an option of exposing the ObjectInspector. This eliminates the need to look up the file later for generating the object inspector. This would be specifically be useful for partitioned tables/datasets.

How was this patch tested?

Ran tpc-ds queries manually and also verified by running org.apache.spark.sql.hive.orc.OrcSuite,org.apache.spark.sql.hive.orc.OrcQuerySuite,org.apache.spark.sql.hive.orc.OrcPartitionDiscoverySuite,OrcPartitionDiscoverySuite.OrcHadoopFsRelationSuite,org.apache.spark.sql.hive.execution.HiveCompatibilitySuite

…SourceStrategy mode

@rxin
Copy link
Contributor

rxin commented Apr 12, 2016

@rajeshbalamohan can you fix your title / description? You are having the title spilling over to the end of the description.

@rajeshbalamohan rajeshbalamohan changed the title SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation with File… SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation Apr 12, 2016
@rxin
Copy link
Contributor

rxin commented Apr 12, 2016

Two other things:

  1. can you follow the format used by other prs, i.e. [SPARK-14551][SQL] ...
  2. can you spell out name node? Most people don't know that NN = namenode. It is a very Hadoop specific thing.

@rajeshbalamohan rajeshbalamohan changed the title SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation [SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation Apr 12, 2016
@rajeshbalamohan
Copy link
Author

Sure @rxin. I just updated it. Thanks

@rxin
Copy link
Contributor

rxin commented Apr 12, 2016

Thanks!

import java.io.IOException;
import java.util.List;

public class OrcRecordReader extends RecordReader<NullWritable, OrcStruct> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is based on a file from hive, can you say that in the classdoc and explain what the differences are?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. This is based on OrcNewInputFormat.OrcRecordReader (which is marked private). Only addition is the getObjectInspector targeted to reduce namenode calls later. I will update the doc.

@rxin
Copy link
Contributor

rxin commented Apr 12, 2016

cc @liancheng for review

// file. Would be helpful for partitioned datasets.
new OrcRecordReader(OrcFile.createReader(new Path(new URI(file
.filePath)), OrcFile.readerOptions(conf)), conf, fileSplit.getStart(),
fileSplit.getLength())
Copy link
Contributor

@liancheng liancheng Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please split this call into the following two for better readability:

          val orcReader = OrcFile.createReader(
            new Path(new URI(file.filePath)), OrcFile.readerOptions(conf))
          new OrcRecordReader(orcReader, conf, fileSplit.getStart(), fileSplit.getLength())

@liancheng
Copy link
Contributor

@rajeshbalamohan Sorry for the late review and thanks for working on this! My major concern is that the newly added OrcRecordReader should be live in spark-hive rather than spark-sql. Otherwise it looks good except for a few styling issues.

@rajeshbalamohan
Copy link
Author

rajeshbalamohan commented Apr 21, 2016

Thanks for the review @liancheng
Latest commit addresses the review comments. Changes are as follows

  • Moved OrcRecordReader changes to SparkOrcNewRecordReader in spark-hive
  • Fixed comment in OrcRelation stating it is a custom Orc record reader. Also, SparkOrcNewRecordReader would suggest that this is specific to Spark.
  • Removed pom.xml related changes
  • Fixed styling issues.

@liancheng
Copy link
Contributor

test this please

@liancheng
Copy link
Contributor

add to whitelist

@liancheng
Copy link
Contributor

LGTM pending Jenkins.

@SparkQA
Copy link

SparkQA commented Apr 23, 2016

Test build #2859 has finished for PR 12319 at commit d6bc52d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 23, 2016

Test build #56770 has finished for PR 12319 at commit d6bc52d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Apr 23, 2016

Thanks - merging in master.

@asfgit asfgit closed this in e5226e3 Apr 23, 2016
@rajeshbalamohan
Copy link
Author

Thanks @liancheng , @rxin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants