Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8245][SQL] FormatNumber/Length Support for Expression #7034

Closed
wants to merge 4 commits into from

Conversation

chenghao-intel
Copy link
Contributor

  • BinaryType for Length
  • FormatNumber

@chenghao-intel
Copy link
Contributor Author

@rxin @cloud-fan @marmbrus

case e if !e.childrenResolved => e

case e: ExpressionConstraint =>
val newChildren = e.children.zip(e.constraint).map { case (expr, constraint) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we require e.children.length == e.constraint.length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that also, but this is will be solved during the debugging time, and in a release version, we never run into the case e.children.length != e.constraint.length, doesn't it?

@cloud-fan
Copy link
Contributor

Thanks for doing it! We do need an abstraction to do type checking and supported-type cast together, the ExpectsInputTypes is really bad...
However, input type constraint may be complex like we need all children be same type for Coalesce and BinaryExpression. I think we need a more general abstraction or use DataTypeConstraint to only replace ExpectsInputTypes.

@SparkQA
Copy link

SparkQA commented Jun 26, 2015

Test build #35836 has finished for PR 7034 at commit aa16d12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class AcceptType
    • case class AcceptSpecifiedType(val types: Set[DataType]) extends AcceptType
    • case class DataTypeConstraint(
    • trait ExpressionConstraint
    • case class Sha2(left: Expression, right: Expression)
    • case class Length(child: Expression) extends UnaryExpression with ExpressionConstraint
    • case class FormatNumber(x: Expression, d: Expression) extends Expression with ExpressionConstraint

@rxin
Copy link
Contributor

rxin commented Jun 26, 2015

This seems too complicated. Why not just rename ExpectsInputTypes to AutoCastInputTypes, and then add another ExpectsInputTypes to do type checking without auto casting?

@chenghao-intel
Copy link
Contributor Author

Thank you all for reviewing the code & suggestions.
@cloud-fan DataTypeConstraint doesn't aim to replace the existed generic type coercion rules in HiveTypeCoercion, particularly not for those cases need to make the identical data type for all of its children expression cases, like Coalesce or BinaryArithmetic
@rxin ExpectsInputTypes couldn't describe the case of supporting multiple datatypes for single child expression. For example: Length(child: Expression) supports both StringType and BinaryType of the child, AutoCastInputTypes may also not able to auto infer the right casting, that's why I expose both of the interface in DataTypeConstraint

Recently, we've been working on adding more expression, but always be challenged for the data type supporting / casting issues, that's would be great if we can figure out how to solve that ASAP, either in this PR or someone else's PR.

@SparkQA
Copy link

SparkQA commented Jun 28, 2015

Test build #35913 has finished for PR 7034 at commit 58e67ef.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class AcceptType
    • case class AcceptSpecifiedType(val types: Set[DataType]) extends AcceptType
    • case class DataTypeConstraint(
    • trait ExpressionConstraint
    • case class Length(child: Expression) extends UnaryExpression with ExpressionConstraint
    • case class FormatNumber(x: Expression, d: Expression) extends Expression with ExpressionConstraint

@chenghao-intel
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jun 28, 2015

Test build #35933 has finished for PR 7034 at commit 58e67ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class AcceptType
    • case class AcceptSpecifiedType(val types: Set[DataType]) extends AcceptType
    • case class DataTypeConstraint(
    • trait ExpressionConstraint
    • case class Length(child: Expression) extends UnaryExpression with ExpressionConstraint
    • case class FormatNumber(x: Expression, d: Expression) extends Expression with ExpressionConstraint

* Accept all of the data types, except the [[UserDefinedType]] for the child expression.
*/
case object AcceptAllTypeExceptUserDefinedType extends AcceptType {
def accept(dt: DataType): Boolean = if (dt.isInstanceOf[UserDefinedType[_]]) false else true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!(dt.isInstanceOf[UserDefinedType[_]])

asfgit pushed a commit that referenced this pull request Jul 2, 2015
Jira:
https://issues.apache.org/jira/browse/SPARK-8223
https://issues.apache.org/jira/browse/SPARK-8224

~~I am aware of #7174 and will update this pr, if it's merged.~~ Done
I don't know if #7034 can simplify this, but we can have a look on it, if it gets merged

rxin In the Jira ticket the function as no second argument. I added a `numBits` argument that allows to specify the number of bits. I guess this improves the usability. I wanted to add `shiftleft(value)` as well, but the `selectExpr` dataframe tests crashes, if I have both. I order to do this, I added the following to the functions.scala `def shiftRight(e: Column): Column = ShiftRight(e.expr, lit(1).expr)`, but as I mentioned this doesn't pass tests like `df.selectExpr("shiftRight(a)", ...` (not enough arguments exception).

If we need the bitwise shift in order to be hive compatible, I suggest to add `shiftLeft` and something like `shiftLeftX`

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7178 from tarekauel/8223 and squashes the following commits:

8023bb5 [Tarek Auel] [SPARK-8223][SPARK-8224] fixed test
f3f64e6 [Tarek Auel] [SPARK-8223][SPARK-8224] Integer -> Int
f628706 [Tarek Auel] [SPARK-8223][SPARK-8224] removed toString; updated function description
3b56f2a [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
5189690 [Tarek Auel] [SPARK-8223][SPARK-8224] minor fix and style fix
9434a28 [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
44ee324 [Tarek Auel] [SPARK-8223][SPARK-8224] docu fix
ac7fe9d [Tarek Auel] [SPARK-8223][SPARK-8224] right and left bit shift
@marmbrus
Copy link
Contributor

marmbrus commented Jul 9, 2015

Can this be closed now that #5796 is merged?

@chenghao-intel
Copy link
Contributor Author

It still have the expression FormatNumber, I will update the title and rebase the code.

child.dataType match {
case StringType => defineCodeGen(ctx, ev, c => s"($c).numChars()")
case BinaryType => defineCodeGen(ctx, ev, c => s"($c).length")
case NullType => defineCodeGen(ctx, ev, c => s"-1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need to support NullType here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will causes exception in StringFunctionSuite, as we will not run Analyzer at all there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just remove that test case, can't you?

checkEvaluation(Length(Literal.create(null, NullType)), null, create_row(null))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yes, we can do that now since you've handled the NullType in a single place.

@chenghao-intel chenghao-intel changed the title [SPARK-8653][SPARK-8245][SQL] Add DataTypeConstraint Support for Expression [SPARK-8245][SQL] FormatNumber/Length Support for Expression Jul 15, 2015
* and returns the result as a string. If D is 0, the result has no decimal point or
* fractional part.
*/
case class FormatNumber(x: Expression, d: Expression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override prettyName

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this is done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, yes, it's done, but in the end of this class code.

@rxin
Copy link
Contributor

rxin commented Jul 15, 2015

LGTM other than that.

@SparkQA
Copy link

SparkQA commented Jul 15, 2015

Test build #37333 has finished for PR 7034 at commit 73a7cc3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo) extends SparkListenerEvent
    • case class BlockUpdatedInfo(
    • class StorageListener(storageStatusListener: StorageStatusListener) extends BlockStatusListener
    • class LDAModel(JavaModelWrapper):
    • class LDA(object):
    • trait ImplicitCastInputTypes extends ExpectsInputTypes
    • abstract class BinaryOperator extends BinaryExpression with ExpectsInputTypes
    • case class UnaryMinus(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class UnaryPositive(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Abs(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class BitwiseNot(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Factorial(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Hex(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Unhex(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Round(child: Expression, scale: Expression)
    • case class Md5(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Sha1(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Not(child: Expression)
    • case class And(left: Expression, right: Expression) extends BinaryOperator with Predicate
    • case class Or(left: Expression, right: Expression) extends BinaryOperator with Predicate
    • trait StringRegexExpression extends ImplicitCastInputTypes
    • trait String2StringExpression extends ImplicitCastInputTypes
    • trait StringComparison extends ImplicitCastInputTypes
    • case class StringSpace(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Length(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Ascii(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Base64(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class UnBase64(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class FormatNumber(x: Expression, d: Expression)
    • case class Exchange(newPartitioning: Partitioning, child: SparkPlan) extends UnaryNode

@SparkQA
Copy link

SparkQA commented Jul 15, 2015

Test build #37338 has finished for PR 7034 at commit f282180.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModelWrapper):
    • class LDA(object):
    • case class Round(child: Expression, scale: Expression)
    • case class Length(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class FormatNumber(x: Expression, d: Expression)

@rxin
Copy link
Contributor

rxin commented Jul 15, 2015

@chenghao-intel you need to update Python to reflect the strlen -> length naming change.

@chenghao-intel
Copy link
Contributor Author

oh, yes, thanks for the reminding.

@SparkQA
Copy link

SparkQA commented Jul 16, 2015

Test build #37436 has finished for PR 7034 at commit 601bbf5.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Length(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class FormatNumber(x: Expression, d: Expression)

@rxin
Copy link
Contributor

rxin commented Jul 16, 2015

LGTM

@SparkQA
Copy link

SparkQA commented Jul 16, 2015

Test build #37443 has finished for PR 7034 at commit e534b87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Length(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class FormatNumber(x: Expression, d: Expression)

@rxin
Copy link
Contributor

rxin commented Jul 16, 2015

Thanks - merging this.

@asfgit asfgit closed this in 42dea3a Jul 16, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants