
[SPARK-12879][SQL] improve the unsafe row writing framework #10809

Closed · wants to merge 4 commits

Conversation

cloud-fan (Contributor)

As we begin to use the unsafe row writing framework (BufferHolder and UnsafeRowWriter) in more and more places (UnsafeProjection, UnsafeRowParquetRecordReader, GenerateColumnAccessor, etc.), we should document it better and make it easier to use.

This PR abstracts the technique used in UnsafeRowParquetRecordReader: avoid unnecessary operations as much as possible. For example, instead of re-pointing the row to the buffer at the end of every record, we only need to update the row size; if all fields are of primitive type, we can even skip that size update. With this abstraction, the technique can be applied in more places easily.
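
A minimal sketch of the write loop this enables, assuming the post-PR BufferHolder/UnsafeRowWriter API; the (long, string) schema, inputs, and consume below are hypothetical, and the constructor shapes are illustrative rather than exact:

import java.util.function.Consumer;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
import org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;

class UnsafeWriteLoopSketch {
  static void project(Iterable<InternalRow> inputs, Consumer<UnsafeRow> consume) {
    int numFields = 2;
    UnsafeRow row = new UnsafeRow(numFields);
    BufferHolder holder = new BufferHolder(row);     // points the row at its buffer once, up front
    UnsafeRowWriter writer = new UnsafeRowWriter(holder, numFields);

    for (InternalRow input : inputs) {
      holder.reset();                                // rewind the cursor; no pointTo per record
      writer.zeroOutNullBytes();                     // clear null bits left by the previous record

      if (input.isNullAt(0)) writer.setNullAt(0);
      else writer.write(0, input.getLong(0));        // fixed-width write, no size bookkeeping

      if (input.isNullAt(1)) writer.setNullAt(1);
      else writer.write(1, input.getUTF8String(1));  // variable-length write, may grow the buffer

      row.setTotalSize(holder.totalSize());          // the only per-record size update; it can be
                                                     // skipped entirely for all-primitive schemas
      consume.accept(row);
    }
  }
}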

A local benchmark shows UnsafeProjection is up to 1.7x faster after this PR:

Old version:

Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             2616.04           102.61         1.00 X
single nullable long                    3032.54            88.52         0.86 X
primitive types                         9121.05            29.43         0.29 X
nullable primitive types               12410.60            21.63         0.21 X

New version:

Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             1533.34           175.07         1.00 X
single nullable long                    2306.73           116.37         0.66 X
primitive types                         8403.93            31.94         0.18 X
nullable primitive types               12448.39            21.56         0.12 X

For a single non-nullable long (the best case), we get about a 1.7x speedup. Even when it's nullable, we still get about a 1.3x speedup. For the other cases the boost is smaller, as the saved operations take up only a small proportion of the whole process. The benchmark code is included in this PR.

cloud-fan (Contributor, Author)

cc @davies @nongli

// need to clear it out every time.
""
} else {
s"$rowWriter.zeroOutNullBites();"
cloud-fan (Contributor, Author) commented on this diff:

Here I made a different decision compared to the unsafe parquet reader: we can clear out the null bits at the beginning and then call UnsafeRowWriter.write instead of UnsafeRow.setXXX, which saves one null-bits update per non-null field. If null values are rare, this should be faster. I'll benchmark it later.
cc @nongli
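
A hedged sketch of the strategy being described; the all-long schema and the per-field loop below are illustrative, not the actual generated code:

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;

class NullBitsSketch {
  // Strategy chosen here: clear the whole null bitset once per record, so
  // non-null writes skip per-field null-bit maintenance and only null fields
  // touch the bitset. The unsafe parquet reader instead skips the upfront
  // clear but pays one null-bit reset on every non-null UnsafeRow.setXXX call.
  static void writeRecord(UnsafeRowWriter writer, InternalRow input, int numFields) {
    writer.zeroOutNullBytes();                // one pass over the bitset words
    for (int i = 0; i < numFields; i++) {
      if (input.isNullAt(i)) {
        writer.setNullAt(i);                  // rare path if nulls are rare
      } else {
        writer.write(i, input.getLong(i));    // no null-bit update needed
      }
    }
  }
}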


SparkQA commented Jan 18, 2016

Test build #49602 has finished for PR 10809 at commit 3978711.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

zeroOutNullBites();
}

public void zeroOutNullBites() {
Contributor commented on this diff:

Typo: zeroOutNullBites should be zeroOutNullBytes.


SparkQA commented Jan 18, 2016

Test build #49606 has finished for PR 10809 at commit 9a63852.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// need to clear it out every time.
""
} else {
s"$rowWriter.zeroOutNullBytes();"
cloud-fan (Contributor, Author) commented on this diff:

Here I made a different decision compared to the unsafe parquet reader: we can clear out the null bits at the beginning and then call UnsafeRowWriter.write instead of UnsafeRow.setXXX, which saves one null-bits update per non-null field. If null values are rare, this should be faster. I'll benchmark it later.
cc @nongli

Contributor:

Makes sense to me.


SparkQA commented Jan 19, 2016

Test build #49647 has finished for PR 10809 at commit 5567ef1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

nongli (Contributor) commented Jan 19, 2016

LGTM

* A helper class to write data into a global row buffer using the `UnsafeRow` format.
*
* It remembers the offset of the row buffer where it starts to write, and moves the cursor of
* the row buffer while writing. If a new record comes, the cursor of the row buffer will be reset, so we need
Contributor commented on this diff:

Does "a new record" mean a nested struct?

cloud-fan (Contributor, Author):

It is the record that this writer is responsible for writing; it can be the whole row record, a nested struct, or even a struct-type element in an array.

Contributor:

I mean "a new record" is not clear to me; it should say nested struct.
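
To illustrate what the reset means in the nested case, here is a hedged sketch with a hypothetical (long, struct<long>) schema; setOffsetAndSize and the public holder.cursor field follow the Spark 2.x-era API, which may differ in detail from this PR:

import org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;

class NestedStructSketch {
  // rowWriter covers the top-level (long, struct) record; structWriter shares
  // the same BufferHolder and is re-anchored at the cursor for each struct value.
  static void writeRecord(BufferHolder holder, UnsafeRowWriter rowWriter,
                          UnsafeRowWriter structWriter) {
    holder.reset();
    rowWriter.zeroOutNullBytes();
    rowWriter.write(0, 42L);                  // hypothetical top-level long field

    int structStart = holder.cursor;          // where the nested struct's bytes begin
    structWriter.reset();                     // the "new record" case: re-anchor the inner
                                              // writer at the cursor, reserve its fixed-width
                                              // region, and clear its null bits
    structWriter.write(0, 7L);                // hypothetical struct field
    // record the (offset, size) of the nested struct in the parent's second field
    rowWriter.setOffsetAndSize(1, structStart, holder.cursor - structStart);
  }
}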


SparkQA commented Jan 26, 2016

Test build #50026 has finished for PR 10809 at commit f79f63c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

davies (Contributor) commented Jan 26, 2016

LGTM, merging this into master, thanks!
