
[SPARK-22712][SQL] Use `buildReaderWithPartitionValues` in native OrcFileFormat #19907

Closed

Conversation

3 participants
@dongjoon-hyun (Member) commented Dec 6, 2017

What changes were proposed in this pull request?

To support vectorization in the native `OrcFileFormat` later, we need to use `buildReaderWithPartitionValues` instead of `buildReader`, as `ParquetFileFormat` does. This PR replaces `buildReader` with `buildReaderWithPartitionValues`.
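The contract difference between the two methods can be sketched in plain Scala (this is a conceptual illustration, not Spark's actual `FileFormat` API; the names and row shapes here are hypothetical): `buildReader` yields only the data columns read from the file, while `buildReaderWithPartitionValues` yields rows that already carry the file's partition values appended.

```scala
// Minimal self-contained sketch (plain Scala, not Spark's actual classes) of
// the contract difference between the two reader-building methods.
object ReaderSketch {
  type Row = Seq[Any]

  // Hypothetical reader over data columns only (what buildReader-style code returns).
  def buildReader(dataRows: Seq[Row]): Iterator[Row] = dataRows.iterator

  // Hypothetical reader that appends the file's partition values to each row
  // (what buildReaderWithPartitionValues-style code returns).
  def buildReaderWithPartitionValues(dataRows: Seq[Row], partitionValues: Row): Iterator[Row] =
    buildReader(dataRows).map(_ ++ partitionValues)

  def main(args: Array[String]): Unit = {
    val data = Seq(Seq[Any](1, "a"), Seq[Any](2, "b"))
    val part = Seq[Any]("2017-12-06")
    // Each emitted row now carries the partition value at the end.
    buildReaderWithPartitionValues(data, part).foreach(println)
  }
}
```

This mirrors why the switch matters for vectorization: the reader itself becomes responsible for producing complete rows, including partition columns.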

How was this patch tested?

Pass Jenkins with the existing test cases.

```diff
@@ -124,7 +124,7 @@ class OrcFileFormat
     true
   }

-  override def buildReader(
+  override def buildReaderWithPartitionValues(
```

@dongjoon-hyun (Author, Member) commented Dec 6, 2017
Hi, @cloud-fan . During the previous ORC PR, we left this behind.

@SparkQA commented Dec 6, 2017

Test build #84543 has finished for PR 19907 at commit 199c835.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
```diff
-      val unsafeProjection = UnsafeProjection.create(requiredSchema)
       val deserializer = new OrcDeserializer(dataSchema, requiredSchema, requestedColIds)
+      val colIds = requestedColIds ++ List.fill(partitionSchema.length)(-1).toArray[Int]
+      val unsafeProjection = UnsafeProjection.create(resultSchema)
```
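The column-id padding in that hunk can be sketched in isolation (a plain-Scala illustration with hypothetical values, not the actual Spark code path): partition columns are not read from the ORC file itself, so their slots in the requested-column-id array are filled with `-1`.

```scala
// Sketch of the column-id padding: data columns keep their file column ids,
// while partition columns (not present in the ORC file) are padded with -1.
val requestedColIds = Array(0, 2)   // hypothetical data columns to read
val partitionSchemaLength = 1       // hypothetical number of partition columns
val colIds = requestedColIds ++ List.fill(partitionSchemaLength)(-1).toArray[Int]
println(colIds.mkString(", "))      // 0, 2, -1
```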

@cloud-fan (Contributor) commented Dec 6, 2017

Can we follow Parquet and just join the data row and partition row, and do a final unsafe projection? It's much easier and there is no performance difference.
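The pattern being suggested can be sketched in plain Scala (conceptual only; Spark's actual implementation uses its internal `JoinedRow` and `UnsafeProjection` classes, which are not reproduced here): concatenate each data row with the partition row, then apply one final projection over the combined schema.

```scala
// Conceptual sketch of "join the data row and partition row, then do a final
// projection" (plain Scala, not Spark's JoinedRow/UnsafeProjection).
def joinRows(dataRow: Seq[Any], partitionRow: Seq[Any]): Seq[Any] =
  dataRow ++ partitionRow

// A "projection" here is just a selection of positions from the joined row
// (a Seq is a function Int => A, so positions.map(row) picks those columns).
def project(row: Seq[Any], positions: Seq[Int]): Seq[Any] =
  positions.map(row)

val joined = joinRows(Seq(1, "a"), Seq("2017-12-06"))
val result = project(joined, Seq(0, 2))  // keep data column 0 and the partition column
println(result)  // List(1, 2017-12-06)
```

Doing the concatenation first and projecting once at the end keeps the reader code simple, which is the point of the suggestion.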

@dongjoon-hyun (Author, Member) commented Dec 6, 2017

Parquet vectorization works like the following:

```scala
// UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
if (parquetReader.isInstanceOf[VectorizedParquetRecordReader] &&
    enableVectorizedReader) {
  iter.asInstanceOf[Iterator[InternalRow]]
}
```

@dongjoon-hyun (Author, Member) commented Dec 6, 2017

Oh, I see. You meant the non-vectorized path. Sorry, I was confused because I focused too much on the vectorized path. I'll do that.

@SparkQA commented Dec 6, 2017

Test build #84570 has finished for PR 19907 at commit f69fc4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@dongjoon-hyun (Author, Member) commented Dec 7, 2017

Hi, @gatorsmile.
Could you review this, too, please?

@cloud-fan (Contributor) commented Dec 7, 2017

Thanks, merging to master!

@asfgit asfgit closed this in dd59a4b Dec 7, 2017

@dongjoon-hyun (Author, Member) commented Dec 7, 2017

Thank you, @cloud-fan !

@dongjoon-hyun dongjoon-hyun deleted the dongjoon-hyun:SPARK-ORC-BUILD-READER branch Dec 7, 2017
