[SPARK-19843] [SQL] UTF8String => (int / long) conversion expensive for invalid inputs #17184

tejasapatil · 2017-03-07T02:38:08Z

What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-19843

Created wrapper classes (IntWrapper, LongWrapper) to wrap the result of parsing (which are primitive types). In case of problem in parsing, the method would return a boolean.

How was this patch tested?

Added new unit tests
Ran a prod job which had conversion from string -> int and verified the outputs

Performance

Tiny regression when all strings are valid integers

conversion to int:       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------
trunk                         502 /  522         33.4          29.9       1.0X
SPARK-19843                   493 /  503         34.0          29.4       1.0X

Huge gain when all strings are invalid integers

conversion to int:      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------
trunk                     33913 / 34219          0.5        2021.4       1.0X
SPARK-19843                  154 /  162        108.8           9.2     220.0X

tejasapatil · 2017-03-07T02:38:24Z

Jenkins test this please

SparkQA · 2017-03-07T05:16:09Z

Test build #74061 has finished for PR 17184 at commit adcb405.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-03-07T05:49:07Z

cc @cloud-fan for review

tejasapatil · 2017-03-07T05:49:51Z

also, cc @gatorsmile

…or invalid inputs

cloud-fan · 2017-03-07T09:36:41Z

Ideally we need a better way to indicate an invalid input, as exception handling is too expensive. Adding a check before process is one approach, but as you said, it has performance penalty if all records are valid.

How about we use C-style function and return the result via parameter?. e.g.

class ToIntResult {
  int res = 0;
  boolean valid = true;
}

public void toInt(ToIntResult t) {
  ...
  if (something wrong) {
    t.valid = false;
  }
  ...
  t.res = xxx;
}

The caller side can reuse the result object to reduce cost

srowen · 2017-03-07T09:47:35Z

common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java

+
+  @Test
+  public void testIsIntMaybe() throws IOException {
+    assertEquals(true, UTF8String.fromString("1").isIntMaybe());


assertTrue?

srowen · 2017-03-07T09:48:10Z

common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java

+    assertEquals(true, UTF8String.fromString(String.valueOf(Integer.MIN_VALUE)).isIntMaybe());
+
+    Random rand = new Random();
+    for(int i = 0; i < 10; i++) {


Nit, which lint might catch: space after for.
Seed the RNG for determinism?

fixed linter

I dont want determinism so not setting seed

srowen · 2017-03-07T09:48:30Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

+
+    if (!Character.isDigit(firstByte)) {
+      // if the first character isn't a digit, then it has to be either `+` OR `-`
+      if ((firstByte == '-' || firstByte == '+')) {


Nit: extra parens here

this code is removed as I have changed the approach

srowen · 2017-03-07T09:49:25Z

sql/catalyst/src/test/scala/org/apache/spark/sql/UTF8StringBenchmark.scala

+import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.util.Benchmark
+
+object UTF8StringBenchmark {


Not sure what the convention is, but I don't know if we need to preserve this benchmarking code

Removed the benchmark code. Had originally attached in case anyone wanted to verify if the benchmarking was sane

cloud-fan · 2017-03-08T01:03:57Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

-    if (b >= '0' && b <= '9') {
-      return b - '0';
+  public static class LongWrapper {
+    private long value = 0;


it's for internal use only, shall we make it public? We can probably save some function calls

removed getter and setter. changed to public

cloud-fan · 2017-03-08T01:05:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

-        case _: NumberFormatException => null
+      val result = new IntWrapper()
+      buildCast[UTF8String](_, s => if (s.toShort(result)) {
+        result.getValue.asInstanceOf[Short]


use toShort?

tejasapatil · 2017-03-08T01:15:04Z

Generated code (in case anyone reviewing the PR wants to see):

/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private scala.collection.Iterator inputadapter_input;
/* 009 */   private UTF8String.IntWrapper project_wrapper;
/* 010 */   private UTF8String.LongWrapper project_wrapper1;
/* 011 */   private UTF8String.IntWrapper project_wrapper2;
/* 012 */   private UTF8String.IntWrapper project_wrapper3;
/* 013 */   private UnsafeRow project_result;
/* 014 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
/* 015 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter;
/* 016 */
/* 017 */   public GeneratedIterator(Object[] references) {
/* 018 */     this.references = references;
/* 019 */   }
/* 020 */
/* 021 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 022 */     partitionIndex = index;
/* 023 */     this.inputs = inputs;
/* 024 */     inputadapter_input = inputs[0];
/* 025 */     project_wrapper = new UTF8String.IntWrapper();
/* 026 */     project_wrapper1 = new UTF8String.LongWrapper();
/* 027 */     project_wrapper2 = new UTF8String.IntWrapper();
/* 028 */     project_wrapper3 = new UTF8String.IntWrapper();
/* 029 */     project_result = new UnsafeRow(4);
/* 030 */     this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 0);
/* 031 */     this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 4);
/* 032 */
/* 033 */   }
/* 034 */
/* 035 */   protected void processNext() throws java.io.IOException {
/* 036 */     while (inputadapter_input.hasNext() && !stopEarly()) {
/* 037 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 038 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 039 */       UTF8String inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getUTF8String(0));
/* 040 */
/* 041 */       boolean project_isNull = inputadapter_isNull;
/* 042 */       int project_value = -1;
/* 043 */       if (!inputadapter_isNull) {
/* 044 */         if (inputadapter_value.toInt(project_wrapper)) {
/* 045 */           project_value = project_wrapper.value;
/* 046 */         } else {
/* 047 */           project_isNull = true;
/* 048 */         }
/* 049 */
/* 050 */       }
/* 051 */       boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1);
/* 052 */       UTF8String inputadapter_value1 = inputadapter_isNull1 ? null : (inputadapter_row.getUTF8String(1));
/* 053 */       boolean project_isNull2 = inputadapter_isNull1;
/* 054 */       long project_value2 = -1L;
/* 055 */       if (!inputadapter_isNull1) {
/* 056 */         if (inputadapter_value1.toLong(project_wrapper1)) {
/* 057 */           project_value2 = project_wrapper1.value;
/* 058 */         } else {
/* 059 */           project_isNull2 = true;
/* 060 */         }
/* 061 */
/* 062 */       }
/* 063 */       boolean project_isNull4 = inputadapter_isNull;
/* 064 */       byte project_value4 = (byte)-1;
/* 065 */       if (!inputadapter_isNull) {
/* 066 */         if (inputadapter_value.toByte(project_wrapper2)) {
/* 067 */           project_value4 = (byte) project_wrapper2.value;
/* 068 */         } else {
/* 069 */           project_isNull4 = true;
/* 070 */         }
/* 071 */
/* 072 */       }
/* 073 */       boolean project_isNull6 = inputadapter_isNull;
/* 074 */       short project_value6 = (short)-1;
/* 075 */       if (!inputadapter_isNull) {
/* 076 */         if (inputadapter_value.toShort(project_wrapper3)) {
/* 077 */           project_value6 = (short) project_wrapper3.value;
/* 078 */         } else {
/* 079 */           project_isNull6 = true;
/* 080 */         }
/* 081 */
/* 082 */       }
/* 083 */       project_rowWriter.zeroOutNullBytes();
/* 084 */
/* 085 */       if (project_isNull) {
/* 086 */         project_rowWriter.setNullAt(0);
/* 087 */       } else {
/* 088 */         project_rowWriter.write(0, project_value);
/* 089 */       }
/* 090 */
/* 091 */       if (project_isNull2) {
/* 092 */         project_rowWriter.setNullAt(1);
/* 093 */       } else {
/* 094 */         project_rowWriter.write(1, project_value2);
/* 095 */       }
/* 096 */
/* 097 */       if (project_isNull4) {
/* 098 */         project_rowWriter.setNullAt(2);
/* 099 */       } else {
/* 100 */         project_rowWriter.write(2, project_value4);
/* 101 */       }
/* 102 */
/* 103 */       if (project_isNull6) {
/* 104 */         project_rowWriter.setNullAt(3);
/* 105 */       } else {
/* 106 */         project_rowWriter.write(3, project_value6);
/* 107 */       }
/* 108 */       append(project_result);
/* 109 */       if (shouldStop()) return;
/* 110 */     }
/* 111 */   }
/* 112 */ }

tejasapatil · 2017-03-08T01:31:52Z

Updated the perf. numbers with latest version of code

SparkQA · 2017-03-08T03:09:15Z

Test build #74156 has finished for PR 17184 at commit a91ca1b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public static class LongWrapper
public static class IntWrapper

SparkQA · 2017-03-08T03:13:02Z

Test build #74159 has finished for PR 17184 at commit 7ae2f73.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-08T04:03:47Z

Test build #74161 has finished for PR 17184 at commit 51e7e3f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-03-08T04:19:56Z

thanks, merging to master!

rxin · 2017-03-08T04:29:02Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

-      return b - '0';
-    }
-    throw new NumberFormatException(toString());
+  public static class LongWrapper {


@tejasapatil can you submit a follow up small pr to add classdoc for this? would be great to explain why we have this.

followup PR : #17205

rxin · 2017-03-08T04:29:26Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

+    return true;
+  }
+
+  public static class IntWrapper {


and perhaps move this closer to LongWrapper

actually why bother having an IntWrapper? Can't you use LongWrapper?

yea the calculation still use int/long respectively, the wrapper is only used for holding the result. There should be not much performance penalty to always use LongWrapper.

Using IntWrapper for integers gives better perf. If we use LongWrapper for integers, there would be a conversion needed from long -> int. Its not that big of a difference but given that ints are used heavily in workloads, I dont want to leave that behind.

Here is microbenchmark result for current approach VS using LongWrapper everywhere

conversion to int: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ IntWrapper 20397 / 20564 26.3 38.0 1.0X LongWrapper 20855 / 21530 25.7 38.8 1.0X

OK that's fair. In that case let's document this (otherwise somebody might come in and remove this in the future..)

rxin · 2017-03-08T04:32:59Z

I believe IBM J9 actually improved this specific case (their JIT handles tons of exceptions better). Oh well -- if only JIT is perfect.

## What changes were proposed in this pull request? This is as per suggestion by rxin at : #17184 (comment) ## How was this patch tested? NA as this is a documentation change Author: Tejas Patil <tejasp@fb.com> Closes #17205 from tejasapatil/SPARK-19843_followup.

[SPARK-19843] [SQL] UTF8String => (int / long) conversion expensive f…

4a7df9f

…or invalid inputs

srowen requested changes Mar 7, 2017

View reviewed changes

review comment

a91ca1b

tejasapatil force-pushed the SPARK-19843_is_numeric_maybe branch from adcb405 to a91ca1b Compare March 8, 2017 00:49

self review

7ae2f73

cloud-fan reviewed Mar 8, 2017

View reviewed changes

remove getter and setter

51e7e3f

asfgit closed this in c96d14a Mar 8, 2017

rxin reviewed Mar 8, 2017

View reviewed changes

tejasapatil deleted the SPARK-19843_is_numeric_maybe branch March 8, 2017 04:33

tejasapatil mentioned this pull request Mar 8, 2017

[SPARK-19843] [SQL] [Followup] Classdoc for IntWrapper and LongWrapper #17205

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-19843] [SQL] UTF8String => (int / long) conversion expensive for invalid inputs #17184

[SPARK-19843] [SQL] UTF8String => (int / long) conversion expensive for invalid inputs #17184

tejasapatil commented Mar 7, 2017 •

edited

tejasapatil commented Mar 7, 2017

SparkQA commented Mar 7, 2017

tejasapatil commented Mar 7, 2017

tejasapatil commented Mar 7, 2017

cloud-fan commented Mar 7, 2017 •

edited

srowen Mar 7, 2017

tejasapatil Mar 8, 2017

srowen Mar 7, 2017

tejasapatil Mar 8, 2017

srowen Mar 7, 2017

tejasapatil Mar 8, 2017

srowen Mar 7, 2017

tejasapatil Mar 8, 2017

cloud-fan Mar 8, 2017

tejasapatil Mar 8, 2017

cloud-fan Mar 8, 2017

tejasapatil Mar 8, 2017

tejasapatil commented Mar 8, 2017

tejasapatil commented Mar 8, 2017

SparkQA commented Mar 8, 2017

SparkQA commented Mar 8, 2017

SparkQA commented Mar 8, 2017

cloud-fan commented Mar 8, 2017

rxin Mar 8, 2017

tejasapatil Mar 8, 2017

tejasapatil Mar 8, 2017

rxin Mar 8, 2017

rxin Mar 8, 2017

cloud-fan Mar 8, 2017

tejasapatil Mar 8, 2017

rxin Mar 8, 2017

rxin commented Mar 8, 2017

[SPARK-19843] [SQL] UTF8String => (int / long) conversion expensive for invalid inputs #17184

[SPARK-19843] [SQL] UTF8String => (int / long) conversion expensive for invalid inputs #17184

Conversation

tejasapatil commented Mar 7, 2017 • edited

What changes were proposed in this pull request?

How was this patch tested?

Performance

tejasapatil commented Mar 7, 2017

SparkQA commented Mar 7, 2017

tejasapatil commented Mar 7, 2017

tejasapatil commented Mar 7, 2017

cloud-fan commented Mar 7, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tejasapatil commented Mar 8, 2017

tejasapatil commented Mar 8, 2017

SparkQA commented Mar 8, 2017

SparkQA commented Mar 8, 2017

SparkQA commented Mar 8, 2017

cloud-fan commented Mar 8, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rxin commented Mar 8, 2017

tejasapatil commented Mar 7, 2017 •

edited

cloud-fan commented Mar 7, 2017 •

edited