Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22771][SQL] Concatenate binary inputs into a binary output #19977

Closed
wants to merge 16 commits into from

Conversation

maropu
Copy link
Member

@maropu maropu commented Dec 14, 2017

What changes were proposed in this pull request?

This pr modified concat to concat binary inputs into a single binary output.
concat in the current master always output data as a string. But, in some databases (e.g., PostgreSQL), if all inputs are binary, concat also outputs binary.

How was this patch tested?

Added tests in SQLQueryTestSuite and TypeCoercionSuite.

@SparkQA
Copy link

SparkQA commented Dec 14, 2017

Test build #84904 has finished for PR 19977 at commit 8b7dcc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 14, 2017

Test build #84910 has finished for PR 19977 at commit a252bab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Could you confirm whether Hive behaves the same?

@maropu
Copy link
Member Author

maropu commented Dec 14, 2017

ok

@maropu
Copy link
Member Author

maropu commented Dec 15, 2017

checked;

$ hive --version
Hive 2.3.0

hive> create table t1(a string, b string);
hive> create view v1 as select a || b from t1;
hive> describe v1;
OK
_c0                 	string   

hive> create table t2(a binary, b binary);
hive> create view v2 as select a || b from t2;
hive> describe v2;
_c0                 	binary              	                    


override def inputTypes: Seq[AbstractDataType] =
Seq.fill(children.size)(if (isBinaryMode) BinaryType else StringType)
override def dataType: DataType = if (isBinaryMode) BinaryType else StringType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we worry about backward compatibility?

Copy link
Member Author

@maropu maropu Dec 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, should be. Any existing option for keeping back compatibility? Or, how about adding a new option?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conf is needed for sure. We also need a Migration Guide.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to add a new option for this case only? If we keep adding new options for each case, options could blow up?

@@ -50,15 +51,23 @@ import org.apache.spark.unsafe.types.{ByteArray, UTF8String}
""")
case class Concat(children: Seq[Expression]) extends Expression with ImplicitCastInputTypes {

override def inputTypes: Seq[AbstractDataType] = Seq.fill(children.size)(StringType)
override def dataType: DataType = StringType
private lazy val isBinaryMode = children.nonEmpty && children.forall(_.dataType == BinaryType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all inputs are binary, concat also outputs binary.

Is this true in Hive and others?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will check some patterns

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pg and hive have the same;

postgres=# create table t1(a bytea, b bytea, c varchar, d varchar);
postgres=# create view v1 as select a || b || c || d from t1;
postgres=# \d v1
   View "public.view41_1"
  Column  | Type | Modifiers 
----------+------+-----------
 ?column? | text | 


hive> create table t1(a binary, b binary, c text, d test);
hive> create view v1 as select a || b || c || d from t1;
hive> describe v1;
_c0                 	string  

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aha, thanks for the info!
I checked the db2 behaviour and I found db2 seems to have a bit different casting rule.
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0000736.html?view=kc
IIUC, in db2, the type of concat(binary, string) is binary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also checked mysql: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_concat

recap:
hive, postgresql: concat(binary, string) => string
mysql, db2: conat(binary, string) => binary

@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84950 has finished for PR 19977 at commit 0d5cd5d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ConcatCoercion(conf: SQLConf) extends TypeCoercionRule
  • case class Concat(children: Seq[Expression], isBinaryMode: Boolean = false)

@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84951 has finished for PR 19977 at commit c3ea879.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FunctionArgumentConversion(conf: SQLConf) extends TypeCoercionRule
  • case class Concat(children: Seq[Expression], isBinaryMode: Boolean = false)

@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84953 has finished for PR 19977 at commit b5fe52c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FunctionArgumentConversion(conf: SQLConf) extends TypeCoercionRule
  • case class Concat(children: Seq[Expression], isBinaryMode: Boolean = false)

@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84963 has finished for PR 19977 at commit ffb6e1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ConcatCoercion(conf: SQLConf) extends TypeCoercionRule
  • case class Concat(children: Seq[Expression], isBinaryMode: Boolean = false)

@SparkQA
Copy link

SparkQA commented Dec 16, 2017

Test build #84987 has finished for PR 19977 at commit 2d3926e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FunctionArgumentConversion(conf: SQLConf) extends TypeCoercionRule
  • case class Concat(children: Seq[Expression], isBinaryMode: Boolean = false)

@maropu
Copy link
Member Author

maropu commented Dec 16, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Dec 16, 2017

Test build #84996 has finished for PR 19977 at commit 2d3926e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FunctionArgumentConversion(conf: SQLConf) extends TypeCoercionRule
  • case class Concat(children: Seq[Expression], isBinaryMode: Boolean = false)

@maropu
Copy link
Member Author

maropu commented Dec 16, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Dec 16, 2017

Test build #85000 has finished for PR 19977 at commit 2d3926e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FunctionArgumentConversion(conf: SQLConf) extends TypeCoercionRule
  • case class Concat(children: Seq[Expression], isBinaryMode: Boolean = false)

@maropu
Copy link
Member Author

maropu commented Dec 16, 2017

oh...

@maropu
Copy link
Member Author

maropu commented Dec 16, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Dec 16, 2017

Test build #85005 has finished for PR 19977 at commit 2d3926e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FunctionArgumentConversion(conf: SQLConf) extends TypeCoercionRule
  • case class Concat(children: Seq[Expression], isBinaryMode: Boolean = false)

@maropu
Copy link
Member Author

maropu commented Dec 17, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Dec 18, 2017

Test build #85031 has finished for PR 19977 at commit 2d3926e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FunctionArgumentConversion(conf: SQLConf) extends TypeCoercionRule
  • case class Concat(children: Seq[Expression], isBinaryMode: Boolean = false)

@@ -1035,6 +1035,12 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val ConcatBinaryModeEnabled = buildConf("spark.sql.expression.concat.binaryMode.enabled")
Copy link
Member

@gatorsmile gatorsmile Dec 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> spark.sql.typeCoercion.concatBinaryAsString

Copy link
Member Author

@maropu maropu Dec 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update after reviews finished

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe CONCAT_BINARY_AS_STRING_ENABLED?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> spark.sql.function.concatBinaryAsString

@gatorsmile
Copy link
Member

@maropu No need to re-trigger it. The failure is not caused by this PR.

@gatorsmile
Copy link
Member

Will review it tomorrow. Thanks!

@maropu
Copy link
Member Author

maropu commented Dec 18, 2017

I found different behaviours in a string functions elt though, is this expected (binray output is a small surprise?);

hive> create table t(a binary, b binary);
hive> create view v as select elt(1, a, b) from t;
hive> describe v;
_c0                 	string    

mysql> create table t(a binary, b binary);
mysql> create view v as select elt(1, a, b) from t;
mysql> SHOW COLUMNS FROM v;
+--------------+--------------+------+-----+---------+-------+
| Field        | Type         | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| elt(1, a, b) | varbinary(1) | YES  |     | NULL    |       |
+--------------+--------------+------+-----+---------+-------+
1 row in set (0.00 sec)

scala> val df = Seq(("aaa".getBytes, "bbb".getBytes)).toDF("a", "b")
scala> df.selectExpr("elt(1, a, b)").printSchema
root
 |-- elt(1, CAST(a AS STRING), CAST(b AS STRING)): string (nullable = true)

@gcz2022
Copy link

gcz2022 commented Dec 19, 2017

You mean answers of mysql is unexpected? I think it's common these dbs get different behaviors, while Spark mainly follows Hive.

@SparkQA
Copy link

SparkQA commented Dec 28, 2017

Test build #85451 has finished for PR 19977 at commit 179c6fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 28, 2017

Test build #85454 has finished for PR 19977 at commit 1c94418.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -24,3 +24,17 @@ select left("abcd", 2), left("abcd", 5), left("abcd", '2'), left("abcd", null);
select left(null, -2), left("abcd", -2), left("abcd", 0), left("abcd", 'a');
select right("abcd", 2), right("abcd", 5), right("abcd", '2'), right("abcd", null);
select right(null, -2), right("abcd", -2), right("abcd", 0), right("abcd", 'a');

-- turn on concatBinaryAsString
set spark.sql.function.concatBinaryAsString=false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turn on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since most of other dbms-like systems concat binary inputs as binary, IMO turning off by default is okay to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant you said turn on in the comment (L28).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh....

@viirya
Copy link
Member

viirya commented Dec 28, 2017

LGTM

@SparkQA
Copy link

SparkQA commented Dec 28, 2017

Test build #85473 has finished for PR 19977 at commit 1e13b70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Try

SELECT (col1 || (col3 || col4)) col
FROM (
  SELECT
    string(id) col1,
    encode(string(id + 2), 'utf-8') col3,
    encode(string(id + 3), 'utf-8') col4
  FROM range(10)
)

@maropu
Copy link
Member Author

maropu commented Dec 29, 2017

ah, ok. good catch. I'll fix soon.

@SparkQA
Copy link

SparkQA commented Dec 29, 2017

Test build #85495 has finished for PR 19977 at commit 57a9d1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -653,7 +660,11 @@ object CombineConcats extends Rule[LogicalPlan] {
}

def apply(plan: LogicalPlan): LogicalPlan = plan.transformExpressionsDown {
case concat: Concat if concat.children.exists(_.isInstanceOf[Concat]) =>
case concat: Concat if concat.children.exists {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a dedicated helper function for the if condition?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@SparkQA
Copy link

SparkQA commented Dec 29, 2017

Test build #85511 has finished for PR 19977 at commit b9febbd.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Dec 30, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Dec 30, 2017

Test build #85535 has finished for PR 19977 at commit b9febbd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

Thanks! Merged to master.

@asfgit asfgit closed this in f2b3525 Dec 30, 2017
@maropu
Copy link
Member Author

maropu commented Dec 30, 2017

thanks, I'll fix elt too in following pr

asfgit pushed a commit that referenced this pull request Jan 4, 2018
…tDataTypes

## What changes were proposed in this pull request?
This pr is a follow-up to fix a bug left in #19977.

## How was this patch tested?
Added tests in `StringExpressionsSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #20149 from maropu/SPARK-22771-FOLLOWUP.

(cherry picked from commit 6f68316)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
asfgit pushed a commit that referenced this pull request Jan 4, 2018
…tDataTypes

## What changes were proposed in this pull request?
This pr is a follow-up to fix a bug left in #19977.

## How was this patch tested?
Added tests in `StringExpressionsSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #20149 from maropu/SPARK-22771-FOLLOWUP.
asfgit pushed a commit that referenced this pull request Jan 6, 2018
## What changes were proposed in this pull request?
This pr modified `elt` to output binary for binary inputs.
`elt` in the current master always output data as a string. But, in some databases (e.g., MySQL), if all inputs are binary, `elt` also outputs binary (Also, this might be a small surprise).
This pr is related to #19977.

## How was this patch tested?
Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #20135 from maropu/SPARK-22937.

(cherry picked from commit e8af7e8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
asfgit pushed a commit that referenced this pull request Jan 6, 2018
## What changes were proposed in this pull request?
This pr modified `elt` to output binary for binary inputs.
`elt` in the current master always output data as a string. But, in some databases (e.g., MySQL), if all inputs are binary, `elt` also outputs binary (Also, this might be a small surprise).
This pr is related to #19977.

## How was this patch tested?
Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #20135 from maropu/SPARK-22937.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants