Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19843] [SQL] UTF8String => (int / long) conversion expensive for invalid inputs #17184

Closed

Conversation

tejasapatil
Copy link
Contributor

@tejasapatil tejasapatil commented Mar 7, 2017

What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-19843

Created wrapper classes (IntWrapper, LongWrapper) to wrap the result of parsing (which are primitive types). In case of problem in parsing, the method would return a boolean.

How was this patch tested?

  • Added new unit tests
  • Ran a prod job which had conversion from string -> int and verified the outputs

Performance

Tiny regression when all strings are valid integers

conversion to int:       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------
trunk                         502 /  522         33.4          29.9       1.0X
SPARK-19843                   493 /  503         34.0          29.4       1.0X

Huge gain when all strings are invalid integers

conversion to int:      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------
trunk                     33913 / 34219          0.5        2021.4       1.0X
SPARK-19843                  154 /  162        108.8           9.2     220.0X

@tejasapatil
Copy link
Contributor Author

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Mar 7, 2017

Test build #74061 has finished for PR 17184 at commit adcb405.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Copy link
Contributor Author

cc @cloud-fan for review

@tejasapatil
Copy link
Contributor Author

also, cc @gatorsmile

@cloud-fan
Copy link
Contributor

cloud-fan commented Mar 7, 2017

Ideally we need a better way to indicate an invalid input, as exception handling is too expensive. Adding a check before process is one approach, but as you said, it has performance penalty if all records are valid.

How about we use C-style function and return the result via parameter?. e.g.

class ToIntResult {
  int res = 0;
  boolean valid = true;
}

public void toInt(ToIntResult t) {
  ...
  if (something wrong) {
    t.valid = false;
  }
  ...
  t.res = xxx;
}

The caller side can reuse the result object to reduce cost


@Test
public void testIsIntMaybe() throws IOException {
assertEquals(true, UTF8String.fromString("1").isIntMaybe());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertTrue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

assertEquals(true, UTF8String.fromString(String.valueOf(Integer.MIN_VALUE)).isIntMaybe());

Random rand = new Random();
for(int i = 0; i < 10; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, which lint might catch: space after for.
Seed the RNG for determinism?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • fixed linter
  • I dont want determinism so not setting seed


if (!Character.isDigit(firstByte)) {
// if the first character isn't a digit, then it has to be either `+` OR `-`
if ((firstByte == '-' || firstByte == '+')) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: extra parens here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is removed as I have changed the approach

import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.util.Benchmark

object UTF8StringBenchmark {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the convention is, but I don't know if we need to preserve this benchmarking code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the benchmark code. Had originally attached in case anyone wanted to verify if the benchmarking was sane

if (b >= '0' && b <= '9') {
return b - '0';
public static class LongWrapper {
private long value = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's for internal use only, shall we make it public? We can probably save some function calls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed getter and setter. changed to public

case _: NumberFormatException => null
val result = new IntWrapper()
buildCast[UTF8String](_, s => if (s.toShort(result)) {
result.getValue.asInstanceOf[Short]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use toShort?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

@tejasapatil
Copy link
Contributor Author

Generated code (in case anyone reviewing the PR wants to see):

/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private scala.collection.Iterator inputadapter_input;
/* 009 */   private UTF8String.IntWrapper project_wrapper;
/* 010 */   private UTF8String.LongWrapper project_wrapper1;
/* 011 */   private UTF8String.IntWrapper project_wrapper2;
/* 012 */   private UTF8String.IntWrapper project_wrapper3;
/* 013 */   private UnsafeRow project_result;
/* 014 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
/* 015 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter;
/* 016 */
/* 017 */   public GeneratedIterator(Object[] references) {
/* 018 */     this.references = references;
/* 019 */   }
/* 020 */
/* 021 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 022 */     partitionIndex = index;
/* 023 */     this.inputs = inputs;
/* 024 */     inputadapter_input = inputs[0];
/* 025 */     project_wrapper = new UTF8String.IntWrapper();
/* 026 */     project_wrapper1 = new UTF8String.LongWrapper();
/* 027 */     project_wrapper2 = new UTF8String.IntWrapper();
/* 028 */     project_wrapper3 = new UTF8String.IntWrapper();
/* 029 */     project_result = new UnsafeRow(4);
/* 030 */     this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 0);
/* 031 */     this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 4);
/* 032 */
/* 033 */   }
/* 034 */
/* 035 */   protected void processNext() throws java.io.IOException {
/* 036 */     while (inputadapter_input.hasNext() && !stopEarly()) {
/* 037 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 038 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 039 */       UTF8String inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getUTF8String(0));
/* 040 */
/* 041 */       boolean project_isNull = inputadapter_isNull;
/* 042 */       int project_value = -1;
/* 043 */       if (!inputadapter_isNull) {
/* 044 */         if (inputadapter_value.toInt(project_wrapper)) {
/* 045 */           project_value = project_wrapper.value;
/* 046 */         } else {
/* 047 */           project_isNull = true;
/* 048 */         }
/* 049 */
/* 050 */       }
/* 051 */       boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1);
/* 052 */       UTF8String inputadapter_value1 = inputadapter_isNull1 ? null : (inputadapter_row.getUTF8String(1));
/* 053 */       boolean project_isNull2 = inputadapter_isNull1;
/* 054 */       long project_value2 = -1L;
/* 055 */       if (!inputadapter_isNull1) {
/* 056 */         if (inputadapter_value1.toLong(project_wrapper1)) {
/* 057 */           project_value2 = project_wrapper1.value;
/* 058 */         } else {
/* 059 */           project_isNull2 = true;
/* 060 */         }
/* 061 */
/* 062 */       }
/* 063 */       boolean project_isNull4 = inputadapter_isNull;
/* 064 */       byte project_value4 = (byte)-1;
/* 065 */       if (!inputadapter_isNull) {
/* 066 */         if (inputadapter_value.toByte(project_wrapper2)) {
/* 067 */           project_value4 = (byte) project_wrapper2.value;
/* 068 */         } else {
/* 069 */           project_isNull4 = true;
/* 070 */         }
/* 071 */
/* 072 */       }
/* 073 */       boolean project_isNull6 = inputadapter_isNull;
/* 074 */       short project_value6 = (short)-1;
/* 075 */       if (!inputadapter_isNull) {
/* 076 */         if (inputadapter_value.toShort(project_wrapper3)) {
/* 077 */           project_value6 = (short) project_wrapper3.value;
/* 078 */         } else {
/* 079 */           project_isNull6 = true;
/* 080 */         }
/* 081 */
/* 082 */       }
/* 083 */       project_rowWriter.zeroOutNullBytes();
/* 084 */
/* 085 */       if (project_isNull) {
/* 086 */         project_rowWriter.setNullAt(0);
/* 087 */       } else {
/* 088 */         project_rowWriter.write(0, project_value);
/* 089 */       }
/* 090 */
/* 091 */       if (project_isNull2) {
/* 092 */         project_rowWriter.setNullAt(1);
/* 093 */       } else {
/* 094 */         project_rowWriter.write(1, project_value2);
/* 095 */       }
/* 096 */
/* 097 */       if (project_isNull4) {
/* 098 */         project_rowWriter.setNullAt(2);
/* 099 */       } else {
/* 100 */         project_rowWriter.write(2, project_value4);
/* 101 */       }
/* 102 */
/* 103 */       if (project_isNull6) {
/* 104 */         project_rowWriter.setNullAt(3);
/* 105 */       } else {
/* 106 */         project_rowWriter.write(3, project_value6);
/* 107 */       }
/* 108 */       append(project_result);
/* 109 */       if (shouldStop()) return;
/* 110 */     }
/* 111 */   }
/* 112 */ }

@tejasapatil
Copy link
Contributor Author

Updated the perf. numbers with latest version of code

@SparkQA
Copy link

SparkQA commented Mar 8, 2017

Test build #74156 has finished for PR 17184 at commit a91ca1b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public static class LongWrapper
  • public static class IntWrapper

@SparkQA
Copy link

SparkQA commented Mar 8, 2017

Test build #74159 has finished for PR 17184 at commit 7ae2f73.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 8, 2017

Test build #74161 has finished for PR 17184 at commit 51e7e3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in c96d14a Mar 8, 2017
return b - '0';
}
throw new NumberFormatException(toString());
public static class LongWrapper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tejasapatil can you submit a follow up small pr to add classdoc for this? would be great to explain why we have this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

followup PR : #17205

return true;
}

public static class IntWrapper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and perhaps move this closer to LongWrapper

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually why bother having an IntWrapper? Can't you use LongWrapper?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea the calculation still use int/long respectively, the wrapper is only used for holding the result. There should be not much performance penalty to always use LongWrapper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using IntWrapper for integers gives better perf. If we use LongWrapper for integers, there would be a conversion needed from long -> int. Its not that big of a difference but given that ints are used heavily in workloads, I dont want to leave that behind.

Here is microbenchmark result for current approach VS using LongWrapper everywhere

conversion to int:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
IntWrapper                                  20397 / 20564         26.3          38.0       1.0X
LongWrapper                                 20855 / 21530         25.7          38.8       1.0X

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK that's fair. In that case let's document this (otherwise somebody might come in and remove this in the future..)

@rxin
Copy link
Contributor

rxin commented Mar 8, 2017

I believe IBM J9 actually improved this specific case (their JIT handles tons of exceptions better). Oh well -- if only JIT is perfect.

@tejasapatil tejasapatil deleted the SPARK-19843_is_numeric_maybe branch March 8, 2017 04:33
asfgit pushed a commit that referenced this pull request Mar 8, 2017
## What changes were proposed in this pull request?

This is as per suggestion by rxin at : #17184 (comment)

## How was this patch tested?

NA as this is a documentation change

Author: Tejas Patil <tejasp@fb.com>

Closes #17205 from tejasapatil/SPARK-19843_followup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants