Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8270][SQL] levenshtein distance #7214

Closed
wants to merge 5 commits into from

Conversation

tarekbecker
Copy link
Contributor

Jira: https://issues.apache.org/jira/browse/SPARK-8270

Info: I can not build the latest master, it stucks during the build process: [INFO] Dependency-reduced POM written at: /Users/tarek/test/spark/bagel/dependency-reduced-pom.xml

s"$res = $stringUtils.getLevenshteinDistance($left.toString(), $right.toString());")
}

override def dataType: DataType = IntegerType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this right after inputTypes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@rxin
Copy link
Contributor

rxin commented Jul 4, 2015

Jenkins, test this please.

@rxin
Copy link
Contributor

rxin commented Jul 4, 2015

Jenkins, test this please.

@rxin
Copy link
Contributor

rxin commented Jul 4, 2015

LGTM.

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringFunctionsSuite.scala
#	sql/core/src/main/scala/org/apache/spark/sql/functions.scala
#	sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
@SparkQA
Copy link

SparkQA commented Jul 4, 2015

Test build #36526 has finished for PR 7214 at commit a2ad318.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression

* Computes the length of a given string value
*
* Computes the length of a given string value.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for being nitpicky, but can you remove this extra space?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually never mind - i can fix it while i merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tarek Auel: Instead of calling the Apache library, can you implement LevenshteinDistance by yourself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but is there a benefit? The StringUtils implementation does look reasonable, doesn't it? Do you have a special concern about their implementation? (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.commons/commons-lang3/3.0/org/apache/commons/lang3/StringUtils.java#StringUtils.getLevenshteinDistance%28java.lang.CharSequence%2Cjava.lang.CharSequence%29)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few benefits:

  1. We don't need to depend on commons-util. In general it is good to reduce external dependency due to dependency conflicts.
  2. Our version ideally should work directly against UTF8String, so we don't need conversion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rxin Can I implement this as part of UTF8String or should it be in stringOperations? Else I do have to implement it in Java (codeGen) and Scala, don't I?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add it to utf8string, and just call it from here

@rxin
Copy link
Contributor

rxin commented Jul 4, 2015

Thanks. I've merged this.

@asfgit asfgit closed this in 6b3574e Jul 4, 2015
@hujy
Copy link
Contributor

hujy commented Jul 6, 2015

you doesn't need to check left or right string are null or not since the getLevenshteinDistance check if the strings are null or not. Duplicate checks will make programs slower.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants