-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-9578] [ML] Stemmer feature transformer #10272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #47606 has finished for PR 10272 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The different orElse need to be uniformized.
Same thing goes for step4.
|
There are a few quirks regarding formatting. Also, I'm wondering if the different "step" methods should be documented or renamed so we get what they're doing without having to skim over the code. |
|
@BenFradet, Thanks for helping review. I added some comments, yet listing all conditions may not be necessary. |
|
Test build #47903 has finished for PR 10272 at commit
|
|
LGTM |
|
FWIW, I did a Scala adaptation of a Java PorterStemmer here too: https://github.com/utcompling/Scalabha/blob/master/src/main/scala/opennlp/scalabha/lang/eng/PorterStemmer.scala |
|
Thanks @jasonbaldridge for taking a look. If you're interested, I'll close the PR thus you can send a new one. I can help review and compare the performance. Thanks @BenFradet for helping review. |
|
@hhbyyh: I'm pretty slammed right now with other things, so if you'd like to go ahead and compare and choose whichever, that's totally fine with me. Thanks! |
|
Current implementation still outperforms the one from https://github.com/utcompling/Scalabha/blob/master/src/main/scala/opennlp/scalabha/lang/eng/PorterStemmer.scala, by about 70%. This is ready for review now. |
|
@hhbyyh I would try to avoid maintaining a stemmer implementation in MLlib. This is not a distributed algorithm and there exist several implementations from NLP libraries. The best option is to introduce a dependency and wrap the stemmer implementation there. If we made some improvements to an existing stemmer implementation, we should consider contributing to it. I checked chalk's dependency: https://repo1.maven.org/maven2/org/scalanlp/chalk/1.3.0/chalk-1.3.0.pom. It looks okay once we removed the scala actors. Could you help check the details? |
|
@mengxr |
|
@hhbyyh Joseph's comment was about carefully introducing new dependencies. If we pick a library that doesn't depend on many others, it should be safe for us. There are several packages containing Porter stemmer, e.g., lucene, CoreNLP, and chalk. Lucene is a lightweight library, but used by many other systems. So it is not a safe choice. CoreNLP is licensed under LGPL, so not an option here. chalk seems okay to me by looking at its dependencies. I'm a little worried about the cost if we maintain our own implementation in MLlib. We cannot leverage other NLP projects (where the experts are) on possible improvements. So could you take a look at chalk? @jasonbaldridge To add chalk as a dependency, we need chalk releases for both Scala 2.10 and 2.11. But I only see 2.10 releases on maven central. Do you have plans for publishing new releases for both 2.10 and 2.11? |
|
@mengxr Chalk isn't under current development unfortunately, and I'm not sure whether I'll be getting back to it. Another option might be to add it to the lib-text library, which is being maintained and updated, and might have some other useful things for you: |
|
(Totally not paying attention to this issue.) Epic has a PorterStemmer as well. (Well, I re-took it from Chalk.) I've been neglecting Epic to some extent of late, but it's there and I'm not sure it's worth adding a large dependency just for that, but just On Mon, Mar 21, 2016 at 9:21 PM, Jason Baldridge notifications@github.com
|
|
IMO more specific or complex domain-specific stuff should live outside of core, until such time as there is clear demand across a wider user base that justifies the maintenance cost of including it. Already Spark ML has a large maintenance & code review burden just with the algos and feature transformers that are already in there. The whole point of an API for pipelines is to enable external libraries for more specific use cases. This is doubly the case when well-known and robust libraries already provide the functionality. As you can see from your PR, implementing one's own stemmer transformer using one of the external NLP libs is a few lines of code. Things like NLP (and image, video and audio processing, for example) should start life as a Spark package. How about looking at contributing to https://github.com/mengxr/spark-corenlp and wrapping the CoreNLP stemmer functionality as a transformer? |
|
@jasonbaldridge @mengxr I went ahead and opened peoplepattern/lib-text#21, though I've not scheduled it for a target release yet |
|
I'm fine to put this on hold like previously stated.Thank you all for the discussion. It should have provided enough information for anyone interested in using Stemmer with Spark. I plan to put this in the topic modeling Spark package for now. Will send a link here afterwards. |
|
@MLnick CoreNLP is GPL, so I would worry that some people would use it as though it were part of an ASL suite. Spark-CoreNLP is correctly licensed as GPL, but there should be some big warning flags in the README so that people don't inadvertently use it in a way that is inconsistent with the GPL. FWIW, I would strongly prefer an "NLP for Spark" package to be ASL, so Spark-CoreNLP isn't useful (though it's great stuff, objectively speaking). |
jira: https://issues.apache.org/jira/browse/SPARK-9578
Classical Porter stemmer, which is implemented referring to scalanlp/chalk
https://github.com/scalanlp/chalk/blob/master/src/main/scala/chalk/text/analyze
@jasonbaldridge Let me know if you're interested.
I compared the following implementations:
http://tartarus.org/martin/PorterStemmer/scala.txt
https://github.com/ifesdjeen/jReadability/blob/master/src/scala/main/com/jreadability/main/Stemmer.scala
https://github.com/aztek/porterstemmer/blob/master/src/main/scala/com/github/aztek/porterstemmer/PorterStemmer.scala