
[SPARK-14116][SQL] Implements buildReader() for ORC data source #11936

Closed

Conversation

liancheng
Contributor

What changes were proposed in this pull request?

This PR implements FileFormat.buildReader() for our ORC data source. It also fixes several minor styling issues in the HadoopFsRelation planning code path.

Note that OrcNewInputFormat doesn't rely on OrcNewSplit for creating OrcRecordReaders; a plain FileSplit is just fine. That's why we can simply create the record reader using OrcNewInputFormat and a FileSplit.
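
For reference, here is a minimal sketch of that construction, assuming Hadoop's new MapReduce API and Hive's OrcNewInputFormat; the helper name createOrcReader and its conf/filePath/start/length parameters are hypothetical stand-ins for values the scan supplies, not the PR's exact code:

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.{OrcNewInputFormat, OrcStruct}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.{JobID, RecordReader, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.lib.input.FileSplit
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

// Hypothetical helper: builds an ORC record reader for one slice of a file.
def createOrcReader(
    conf: Configuration,
    filePath: String,
    start: Long,
    length: Long): RecordReader[NullWritable, OrcStruct] = {
  // A plain FileSplit carries everything OrcNewInputFormat needs;
  // no OrcNewSplit is required.
  val fileSplit =
    new FileSplit(new Path(new URI(filePath)), start, length, Array.empty[String])
  val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
  val attemptContext = new TaskAttemptContextImpl(conf, attemptId)
  new OrcNewInputFormat().createRecordReader(fileSplit, attemptContext)
}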

How was this patch tested?

Existing test cases should cover this change.

@liancheng liancheng force-pushed the spark-14116-build-reader-for-orc branch from e219295 to 057b6f2 on March 24, 2016 15:58
@SparkQA

SparkQA commented Mar 24, 2016

Test build #54048 has finished for PR 11936 at commit 057b6f2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends Partition

@liancheng liancheng changed the title [SPARK-14116][SQL][WIP] Implements buildReader() for ORC data source [SPARK-14116][SQL] Implements buildReader() for ORC data source Mar 25, 2016
@SparkQA

SparkQA commented Mar 25, 2016

Test build #54169 has finished for PR 11936 at commit b1d630b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

cc @cloud-fan @yhuai @marmbrus

@@ -81,10 +81,10 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
val bucketColumns =
Contributor

seems this variable is never used?

Contributor Author

Yeah... I didn't remove it since I haven't gone through that part thoroughly yet.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54172 has finished for PR 11936 at commit 7d628ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@marmbrus Unfortunately, our current strategy for generating Spark partitions doesn't play well with the ORC version of buildReader(), or more specifically, with OrcRecordReader. The following Spark shell snippet shows the problem:

import org.apache.spark.sql.types._

// Just creates a random file that is large enough
val df = sqlContext.range(1000000).select(
  'id cast StringType as 'a,
  'id cast StringType as 'b,
  'id cast StringType as 'c
)

val path = "/tmp/large.orc"
df.write.mode("overwrite").orc(path)

sqlContext.sql(s"SET spark.sql.files.maxPartitionBytes=${1024 * 1024}")
sqlContext.sql(s"SET spark.sql.parquet.enableVectorizedReader=false")

sqlContext.read.orc(path).count() // <-- Gives 500,000 instead of 1,000,000

Please refer to the discussion here for details.

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54182 has finished for PR 11936 at commit 2cbb56d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Just realized the problem while washing dishes...

The above comment is a false alarm. The real problem is that I ignored the start position and length of the PartitionedFile when creating ORC splits with OrcNewInputFormat. After digging for a while, I found that although instantiating an OrcNewSplit is pretty complicated, OrcNewInputFormat doesn't actually require one; a plain FileSplit is enough. So now the OrcRecordReader is simply created from OrcNewInputFormat and a FileSplit, which works well.

// Appends partition values
val fullOutput = dataSchema.toAttributes ++ partitionSchema.toAttributes
val joinedRow = new JoinedRow()
val appendPartitionColumns = GenerateUnsafeProjection.generate(fullOutput, fullOutput)
Contributor

Should we move this out of the function? Then we don't need to create them every time the function is invoked.

Contributor Author

Yeah, we should. Michael also mentioned this earlier in this thread. We can do it in a follow-up PR after finishing the other data sources, to avoid conflicts.
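
For illustration, a sketch of what that follow-up could look like, hoisting the one-time setup out of the per-file closure; buildReaderSketch and readFile are hypothetical stand-ins, and in a distributed setting the generated projection captured by the closure would still need to be constructible on the executor side:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.JoinedRow
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.types.StructType

// Hypothetical shape of the suggested refactoring: `readFile` stands in for
// the per-file ORC iterator built elsewhere in buildReader().
def buildReaderSketch(
    dataSchema: StructType,
    partitionSchema: StructType,
    readFile: PartitionedFile => Iterator[InternalRow])
  : PartitionedFile => Iterator[InternalRow] = {
  // Built once per buildReader() call rather than once per file.
  val fullOutput = dataSchema.toAttributes ++ partitionSchema.toAttributes
  val appendPartitionColumns = GenerateUnsafeProjection.generate(fullOutput, fullOutput)

  (file: PartitionedFile) => {
    val joinedRow = new JoinedRow()
    readFile(file).map { dataRow =>
      appendPartitionColumns(joinedRow(dataRow, file.partitionValues))
    }
  }
}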

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54183 has finished for PR 11936 at commit a26973e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// iterator.
val maybePhysicalSchema = OrcFileOperator.readSchema(Seq(file.filePath), Some(conf))

maybePhysicalSchema.fold(Iterator.empty: Iterator[InternalRow]) { physicalSchema =>
Contributor

I feel that using a fold here makes the code harder to understand.

Contributor

How about we use a more straightforward way?

Contributor Author

Done.
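
For reference, the "more straightforward way" can be an explicit pattern match instead of Option.fold; rowsOrEmpty and rowsFor are hypothetical stand-ins for the actual reader body built around the result of OrcFileOperator.readSchema:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Sketch only: an explicit match reads more directly than Option.fold here.
def rowsOrEmpty(
    maybePhysicalSchema: Option[StructType],
    rowsFor: StructType => Iterator[InternalRow]): Iterator[InternalRow] =
  maybePhysicalSchema match {
    case Some(physicalSchema) => rowsFor(physicalSchema) // normal read path
    case None => Iterator.empty // empty ORC file without a footer schema
  }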

@SparkQA

SparkQA commented Mar 26, 2016

Test build #54254 has finished for PR 11936 at commit 25d894f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  if (files.fileFormat.toString == "TestFileFormat" ||
-     files.fileFormat.isInstanceOf[parquet.DefaultSource]) &&
+     files.fileFormat.isInstanceOf[parquet.DefaultSource] ||
+     files.fileFormat.toString == "ORC") &&
      files.sqlContext.conf.parquetFileScan =>
Contributor

Let's rename this conf.

@yhuai
Contributor

yhuai commented Mar 26, 2016

Thanks. I am merging this to master.

@liancheng @cloud-fan Let's address https://github.com/apache/spark/pull/11936/files#r57520723 in either of your PRs for the other formats.

@asfgit asfgit closed this in b547de8 Mar 26, 2016
@liancheng liancheng deleted the spark-14116-build-reader-for-orc branch March 29, 2016 16:58