-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8270][SQL] levenshtein distance #7214
Conversation
s"$res = $stringUtils.getLevenshteinDistance($left.toString(), $right.toString());") | ||
} | ||
|
||
override def dataType: DataType = IntegerType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move this right after inputTypes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
Jenkins, test this please. |
Jenkins, test this please. |
LGTM. |
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringFunctionsSuite.scala # sql/core/src/main/scala/org/apache/spark/sql/functions.scala # sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
Test build #36526 has finished for PR 7214 at commit
|
* Computes the length of a given string value | ||
* | ||
* Computes the length of a given string value. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for being nitpicky, but can you remove this extra space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually never mind - i can fix it while i merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tarek Auel: Instead of calling the Apache library, can you implement LevenshteinDistance by yourself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but is there a benefit? The StringUtils
implementation does look reasonable, doesn't it? Do you have a special concern about their implementation? (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.commons/commons-lang3/3.0/org/apache/commons/lang3/StringUtils.java#StringUtils.getLevenshteinDistance%28java.lang.CharSequence%2Cjava.lang.CharSequence%29)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few benefits:
- We don't need to depend on commons-util. In general it is good to reduce external dependency due to dependency conflicts.
- Our version ideally should work directly against UTF8String, so we don't need conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note I filed https://issues.apache.org/jira/browse/SPARK-8830
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin Can I implement this as part of UTF8String
or should it be in stringOperations? Else I do have to implement it in Java (codeGen) and Scala, don't I?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add it to utf8string, and just call it from here
Thanks. I've merged this. |
you doesn't need to check left or right string are null or not since the getLevenshteinDistance check if the strings are null or not. Duplicate checks will make programs slower. |
Jira: https://issues.apache.org/jira/browse/SPARK-8270
Info: I can not build the latest master, it stucks during the build process:
[INFO] Dependency-reduced POM written at: /Users/tarek/test/spark/bagel/dependency-reduced-pom.xml