Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Oct 21, 2015

This PR change InMemoryTableScan to output UnsafeRow, and optimize the unrolling and scanning by coping the bytes for var-length types between UnsafeRow and ByteBuffer directly without creating the wrapper objects. When scanning the decimals in TPC-DS store_sales table, it's 80% faster (copy it as long without create Decimal objects).

@SparkQA
Copy link

SparkQA commented Oct 21, 2015

Test build #44083 has finished for PR 9203 at commit 15ebb3b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * abstract class ColumnarIterator extends Iterator[InternalRow]\n * class MutableUnsafeRow(val writer: UnsafeRowWriter) extends GenericMutableRow(null)\n * class SpecificColumnarIterator extends $\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed offline, maybe change this to assert?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do in separate PR

@JoshRosen
Copy link
Contributor

I'm really excited to see how much of a performance difference this makes when scanning string columns, since this could potentially provide a big perf. boost to the sc.textFile replacement that I showed the other day.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ever start from the middle of an existing buffer? If not, can we move startingOffset into holder class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in UnsafeProjection, there could be multiple UnsafeRowWriter share a single BufferHolder.

@rxin
Copy link
Contributor

rxin commented Oct 21, 2015

LGTM otherwise.

@SparkQA
Copy link

SparkQA commented Oct 21, 2015

Test build #44089 has finished for PR 9203 at commit 81e3fad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class MutableUnsafeRow(val writer: UnsafeRowWriter) extends GenericMutableRow(null)\n

@davies
Copy link
Contributor Author

davies commented Oct 21, 2015

@JoshRosen If we only scan the cached string and access it (without adding a ConvertToUnsafe), this could be a little bit slower, because we copy the bytes into UnsafeRow now. For the UTF8String object, we still will create it anyway, before this patch, we create it in ColumnAccessor, after this patch, we create in UnsafeRow.getUTF8String() (we could re-use that actually).

@SparkQA
Copy link

SparkQA commented Oct 21, 2015

Test build #44092 has finished for PR 9203 at commit e33170d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class MutableUnsafeRow(val writer: UnsafeRowWriter) extends GenericMutableRow(null)\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why implement these primitive write method in writer class not in codegen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those in UnsafeRowWriter are needed for ColumnAccessor. The purpose of these *Writer is to simplify generated code (which is already hard to understand), these method also go in this direction.

In terms of performance, I think there will not difference, they all could be JITed and inlined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is simpler to put them there rather than in generated code.

@SparkQA
Copy link

SparkQA commented Oct 22, 2015

Test build #44098 has finished for PR 9203 at commit 1421bce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class MutableUnsafeRow(val writer: UnsafeRowWriter) extends GenericMutableRow(null)\n

@SparkQA
Copy link

SparkQA commented Oct 22, 2015

Test build #44100 has finished for PR 9203 at commit cab9286.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class MutableUnsafeRow(val writer: UnsafeRowWriter) extends GenericMutableRow(null)\n

@rxin
Copy link
Contributor

rxin commented Oct 22, 2015

I'm going to merge this first. @davies please take a look at @cloud-fan 's comments, and address them in follow-up prs if necessary.

@asfgit asfgit closed this in 1d97332 Oct 22, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants