[SPARK-9578] [ML] Stemmer feature transformer #10272

hhbyyh · 2015-12-12T03:21:42Z

jira: https://issues.apache.org/jira/browse/SPARK-9578

A classical Porter stemmer, implemented with reference to scalanlp/chalk:
https://github.com/scalanlp/chalk/blob/master/src/main/scala/chalk/text/analyze
@jasonbaldridge Let me know if you're interested.

I compared the following implementations:
http://tartarus.org/martin/PorterStemmer/scala.txt
https://github.com/ifesdjeen/jReadability/blob/master/src/scala/main/com/jreadability/main/Stemmer.scala
https://github.com/aztek/porterstemmer/blob/master/src/main/scala/com/github/aztek/porterstemmer/PorterStemmer.scala

SparkQA · 2015-12-12T04:13:16Z

Test build #47606 has finished for PR 10272 at commit ddb26da.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class Stemmer (override val uid: String)

BenFradet · 2015-12-14T13:31:18Z

mllib/src/main/scala/org/apache/spark/ml/feature/Stemmer.scala

The different orElse calls need to be made uniform.
Same thing goes for step4.

BenFradet · 2015-12-14T13:33:48Z

There are a few quirks regarding formatting.

Also, I'm wondering if the different "step" methods should be documented or renamed so we get what they're doing without having to skim over the code.
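To illustrate what a documented "step" method could look like, here is a minimal sketch of step 1a of the classical Porter algorithm (plural suffix removal). It is not the code from this PR: the object name `PorterStep1aSketch` and the method name `removePluralSuffix` are hypothetical, and the rules are taken from Porter's published description of the algorithm.

```scala
object PorterStep1aSketch {

  /** Step 1a of the Porter algorithm: strips plural suffixes.
    * Rules are tried in order and only the first match applies:
    *   SSES -> SS   (caresses -> caress)
    *   IES  -> I    (ponies   -> poni)
    *   SS   -> SS   (caress   -> caress)
    *   S    ->      (cats     -> cat)
    * The input is assumed to be lowercased already.
    * (Hypothetical name, not taken from the PR.)
    */
  def removePluralSuffix(word: String): String =
    if (word.endsWith("sses")) word.dropRight(2)     // SSES -> SS
    else if (word.endsWith("ies")) word.dropRight(2) // IES  -> I
    else if (word.endsWith("ss")) word               // SS   -> SS (unchanged)
    else if (word.endsWith("s")) word.dropRight(1)   // S    -> (removed)
    else word
}
```

Naming the method after its effect and listing the rules in the doc comment lets a reviewer check the behaviour without reading the body, which may be a reasonable middle ground between undocumented `stepN` names and listing every condition inline.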

hhbyyh · 2015-12-17T06:24:18Z

@BenFradet, thanks for helping with the review. I added some comments, though listing every condition may not be necessary.

SparkQA · 2015-12-17T08:02:39Z

Test build #47903 has finished for PR 10272 at commit ff03152.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class Stemmer (override val uid: String)

BenFradet · 2015-12-17T08:05:41Z

LGTM

jasonbaldridge · 2015-12-17T12:23:10Z

FWIW, I did a Scala adaptation of a Java PorterStemmer here too: https://github.com/utcompling/Scalabha/blob/master/src/main/scala/opennlp/scalabha/lang/eng/PorterStemmer.scala

hhbyyh · 2015-12-18T03:01:38Z

Thanks @jasonbaldridge for taking a look. If you're interested, I'll close this PR so that you can send a new one. I can help review it and compare the performance.

Thanks @BenFradet for helping review.

jasonbaldridge · 2015-12-18T15:43:52Z

@hhbyyh: I'm pretty slammed right now with other things, so if you'd like to go ahead and compare and choose whichever, that's totally fine with me. Thanks!

hhbyyh · 2015-12-24T07:09:32Z

The current implementation still outperforms the one from https://github.com/utcompling/Scalabha/blob/master/src/main/scala/opennlp/scalabha/lang/eng/PorterStemmer.scala by about 70%.

This is ready for review now.

mengxr · 2016-03-03T18:02:27Z

@hhbyyh I would try to avoid maintaining a stemmer implementation in MLlib. This is not a distributed algorithm, and several implementations already exist in NLP libraries. The best option is to introduce a dependency and wrap the stemmer implementation there. If we make improvements to an existing stemmer implementation, we should consider contributing them to it.

I checked chalk's dependency: https://repo1.maven.org/maven2/org/scalanlp/chalk/1.3.0/chalk-1.3.0.pom. It looks okay once we remove the scala actors. Could you help check the details?

hhbyyh · 2016-03-10T00:28:51Z

@mengxr
Thanks for taking a look.
The comment from Joseph (https://issues.apache.org/jira/browse/SPARK-5571?focusedCommentId=14632052&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14632052) seems to prefer adding the code directly.
We can put this on hold if necessary.

mengxr · 2016-03-21T23:28:53Z

@hhbyyh Joseph's comment was about carefully introducing new dependencies. If we pick a library that doesn't depend on many others, it should be safe for us. There are several packages containing a Porter stemmer, e.g., lucene, CoreNLP, and chalk. Lucene is a lightweight library, but it is used by many other systems, so it is not a safe choice. CoreNLP is licensed under LGPL, so it is not an option here. Judging by its dependencies, chalk seems okay to me.

I'm a little worried about the cost if we maintain our own implementation in MLlib. We cannot leverage other NLP projects (where the experts are) on possible improvements. So could you take a look at chalk?

@jasonbaldridge To add chalk as a dependency, we need chalk releases for both Scala 2.10 and 2.11. But I only see 2.10 releases on maven central. Do you have plans for publishing new releases for both 2.10 and 2.11?

jasonbaldridge · 2016-03-22T04:20:52Z

@mengxr Chalk isn't under current development unfortunately, and I'm not sure whether I'll be getting back to it. Another option might be to add it to the lib-text library, which is being maintained and updated, and might have some other useful things for you:

https://github.com/peoplepattern/lib-text

cc @eponvert and @dlwh

dlwh · 2016-03-22T04:25:58Z

(Totally not paying attention to this issue.)

Epic has a PorterStemmer as well. (Well, I re-took it from Chalk.)
https://github.com/dlwh/epic

I've been neglecting Epic to some extent of late, but it's there and available.

I'm not sure it's worth adding a large dependency just for that, but just FYI.

MLnick · 2016-03-22T08:33:07Z

IMO more specific or complex domain-specific stuff should live outside of core, until such time as there is clear demand across a wider user base that justifies the maintenance cost of including it. Already Spark ML has a large maintenance & code review burden just with the algos and feature transformers that are already in there.

The whole point of an API for pipelines is to enable external libraries for more specific use cases. This is doubly the case when well-known and robust libraries already provide the functionality. As you can see from your PR, implementing one's own stemmer transformer using one of the external NLP libs is a few lines of code.

Things like NLP (and image, video and audio processing, for example) should start life as a Spark package. How about looking at contributing to https://github.com/mengxr/spark-corenlp and wrapping the CoreNLP stemmer functionality as a transformer?

eponvert · 2016-03-22T13:30:04Z

@jasonbaldridge @mengxr I went ahead and opened peoplepattern/lib-text#21, though I've not scheduled it for a target release yet

hhbyyh · 2016-03-22T13:46:09Z

I'm fine with putting this on hold, as previously stated. Thank you all for the discussion. It should have provided enough information for anyone interested in using Stemmer with Spark.

I plan to put this in the topic modeling Spark package for now. Will send a link here afterwards.

jasonbaldridge · 2016-03-22T16:10:42Z

@MLnick CoreNLP is GPL, so I would worry that some people would use it as though it were part of an ASL suite. Spark-CoreNLP is correctly licensed as GPL, but there should be some big warning flags in the README so that people don't inadvertently use it in a way that is inconsistent with the GPL.

FWIW, I would strongly prefer an "NLP for Spark" package to be ASL, so Spark-CoreNLP isn't useful (though it's great stuff, objectively speaking).

hhbyyh added 3 commits December 11, 2015 21:38

initial stemmer

857f68d

line fix

211ac04

case fix

ddb26da

BenFradet reviewed Dec 14, 2015

hhbyyh added 2 commits December 17, 2015 13:23

Merge remote-tracking branch 'upstream/master' into stem

dc419a1

add comments and fix style

ff03152

hhbyyh closed this Mar 22, 2016

8 participants