
[SPARK-12879][SQL] improve the unsafe row writing framework #10809

Closed · wants to merge 4 commits

Conversation

cloud-fan (Contributor)

As we begin to use the unsafe row writing framework (BufferHolder and UnsafeRowWriter) in more and more places (UnsafeProjection, UnsafeRowParquetRecordReader, GenerateColumnAccessor, etc.), we should document it better and make it easier to use.

This PR abstracts the technique used in UnsafeRowParquetRecordReader: avoid unnecessary operations as much as possible. For example, instead of re-pointing the row to the buffer at the end of every record, we only need to update the row size; if all fields are of primitive type, we can even skip that size update. With this abstraction, the technique can be applied in more places easily.
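
A minimal sketch of the write loop this enables, assuming the post-PR BufferHolder/UnsafeRowWriter API; the (long, string) schema, inputs, and consume below are hypothetical, and the constructor shapes are illustrative rather than exact:

import java.util.function.Consumer;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
import org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;

class UnsafeWriteLoopSketch {
  static void project(Iterable<InternalRow> inputs, Consumer<UnsafeRow> consume) {
    int numFields = 2;
    UnsafeRow row = new UnsafeRow(numFields);
    BufferHolder holder = new BufferHolder(row);     // points the row at its buffer once, up front
    UnsafeRowWriter writer = new UnsafeRowWriter(holder, numFields);

    for (InternalRow input : inputs) {
      holder.reset();                                // rewind the cursor; no pointTo per record
      writer.zeroOutNullBytes();                     // clear null bits left by the previous record

      if (input.isNullAt(0)) writer.setNullAt(0);
      else writer.write(0, input.getLong(0));        // fixed-width write, no size bookkeeping

      if (input.isNullAt(1)) writer.setNullAt(1);
      else writer.write(1, input.getUTF8String(1));  // variable-length write, may grow the buffer

      row.setTotalSize(holder.totalSize());          // the only per-record size update; it can be
                                                     // skipped entirely for all-primitive schemas
      consume.accept(row);
    }
  }
}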

A local benchmark shows UnsafeProjection is up to 1.7x faster after this PR:

Old version:

Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             2616.04           102.61         1.00 X
single nullable long                    3032.54            88.52         0.86 X
primitive types                         9121.05            29.43         0.29 X
nullable primitive types               12410.60            21.63         0.21 X

New version:

Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             1533.34           175.07         1.00 X
single nullable long                    2306.73           116.37         0.66 X
primitive types                         8403.93            31.94         0.18 X
nullable primitive types               12448.39            21.56         0.12 X

For a single non-nullable long (the best case), we get about a 1.7x speedup. Even when it's nullable, we still get about a 1.3x speedup. For the other cases the boost is smaller, as the saved operations take up only a small proportion of the whole process. The benchmark code is included in this PR.

cloud-fan (Contributor, Author)

cc @davies @nongli

// need to clear it out every time.
""
} else {
s"$rowWriter.zeroOutNullBites();"
cloud-fan (Contributor, Author) commented on this diff:

Here I made a different decision compared to the unsafe parquet reader: we can clear out the null bits at the beginning and then call UnsafeRowWriter.write instead of UnsafeRow.setXXX, which saves one null-bits update per non-null field. If null values are rare, this should be faster. I'll benchmark it later.
cc @nongli
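
A hedged sketch of the strategy being described; the all-long schema and the per-field loop below are illustrative, not the actual generated code:

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;

class NullBitsSketch {
  // Strategy chosen here: clear the whole null bitset once per record, so
  // non-null writes skip per-field null-bit maintenance and only null fields
  // touch the bitset. The unsafe parquet reader instead skips the upfront
  // clear but pays one null-bit reset on every non-null UnsafeRow.setXXX call.
  static void writeRecord(UnsafeRowWriter writer, InternalRow input, int numFields) {
    writer.zeroOutNullBytes();                // one pass over the bitset words
    for (int i = 0; i < numFields; i++) {
      if (input.isNullAt(i)) {
        writer.setNullAt(i);                  // rare path if nulls are rare
      } else {
        writer.write(i, input.getLong(i));    // no null-bit update needed
      }
    }
  }
}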


SparkQA commented Jan 18, 2016

Test build #49602 has finished for PR 10809 at commit 3978711.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

zeroOutNullBites();
}

public void zeroOutNullBites() {
Contributor commented on this diff:

Typo: zeroOutNullBites should be zeroOutNullBytes.


SparkQA commented Jan 18, 2016

Test build #49606 has finished for PR 10809 at commit 9a63852.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// need to clear it out every time.
""
} else {
s"$rowWriter.zeroOutNullBytes();"
cloud-fan (Contributor, Author) commented on this diff:

Here I made a different decision compared to the unsafe parquet reader: we can clear out the null bits at the beginning and then call UnsafeRowWriter.write instead of UnsafeRow.setXXX, which saves one null-bits update per non-null field. If null values are rare, this should be faster. I'll benchmark it later.
cc @nongli

Contributor:

Makes sense to me.


SparkQA commented Jan 19, 2016

Test build #49647 has finished for PR 10809 at commit 5567ef1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

nongli (Contributor) commented Jan 19, 2016

LGTM

* A helper class to write data into a global row buffer using the `UnsafeRow` format.
*
* It remembers the offset of the row buffer where it starts to write, and moves the cursor of
* the row buffer while writing. If a new record comes, the cursor of the row buffer will be reset, so we need
Contributor commented on this diff:

Does "a new record" mean a nested struct?

cloud-fan (Contributor, Author):

It is the record that this writer is responsible for writing; it can be the whole row record, a nested struct, or even a struct-type element in an array.

Contributor:

I mean "a new record" is not clear to me; it should say nested struct.
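
To illustrate what the reset means in the nested case, here is a hedged sketch with a hypothetical (long, struct<long>) schema; setOffsetAndSize and the public holder.cursor field follow the Spark 2.x-era API, which may differ in detail from this PR:

import org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;

class NestedStructSketch {
  // rowWriter covers the top-level (long, struct) record; structWriter shares
  // the same BufferHolder and is re-anchored at the cursor for each struct value.
  static void writeRecord(BufferHolder holder, UnsafeRowWriter rowWriter,
                          UnsafeRowWriter structWriter) {
    holder.reset();
    rowWriter.zeroOutNullBytes();
    rowWriter.write(0, 42L);                  // hypothetical top-level long field

    int structStart = holder.cursor;          // where the nested struct's bytes begin
    structWriter.reset();                     // the "new record" case: re-anchor the inner
                                              // writer at the cursor, reserve its fixed-width
                                              // region, and clear its null bits
    structWriter.write(0, 7L);                // hypothetical struct field
    // record the (offset, size) of the nested struct in the parent's second field
    rowWriter.setOffsetAndSize(1, structStart, holder.cursor - structStart);
  }
}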


SparkQA commented Jan 26, 2016

Test build #50026 has finished for PR 10809 at commit f79f63c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

davies (Contributor) commented Jan 26, 2016

LGTM, merging this into master, thanks!
