Skip to content

Conversation

@SparkQA
Copy link

SparkQA commented Dec 12, 2015

Test build #47606 has finished for PR 10272 at commit ddb26da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class Stemmer (override val uid: String)\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The different orElse need to be uniformized.
Same thing goes for step4.

@BenFradet
Copy link
Contributor

There are a few quirks regarding formatting.

Also, I'm wondering if the different "step" methods should be documented or renamed so we get what they're doing without having to skim over the code.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Dec 17, 2015

@BenFradet, Thanks for helping review. I added some comments, yet listing all conditions may not be necessary.

@SparkQA
Copy link

SparkQA commented Dec 17, 2015

Test build #47903 has finished for PR 10272 at commit ff03152.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class Stemmer (override val uid: String)\n

@BenFradet
Copy link
Contributor

LGTM

@jasonbaldridge
Copy link

FWIW, I did a Scala adaptation of a Java PorterStemmer here too: https://github.com/utcompling/Scalabha/blob/master/src/main/scala/opennlp/scalabha/lang/eng/PorterStemmer.scala

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Dec 18, 2015

Thanks @jasonbaldridge for taking a look. If you're interested, I'll close the PR thus you can send a new one. I can help review and compare the performance.

Thanks @BenFradet for helping review.

@jasonbaldridge
Copy link

@hhbyyh: I'm pretty slammed right now with other things, so if you'd like to go ahead and compare and choose whichever, that's totally fine with me. Thanks!

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Dec 24, 2015

Current implementation still outperforms the one from https://github.com/utcompling/Scalabha/blob/master/src/main/scala/opennlp/scalabha/lang/eng/PorterStemmer.scala, by about 70%.

This is ready for review now.

@mengxr
Copy link
Contributor

mengxr commented Mar 3, 2016

@hhbyyh I would try to avoid maintaining a stemmer implementation in MLlib. This is not a distributed algorithm and there exist several implementations from NLP libraries. The best option is to introduce a dependency and wrap the stemmer implementation there. If we made some improvements to an existing stemmer implementation, we should consider contributing to it.

I checked chalk's dependency: https://repo1.maven.org/maven2/org/scalanlp/chalk/1.3.0/chalk-1.3.0.pom. It looks okay once we removed the scala actors. Could you help check the details?

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Mar 10, 2016

@mengxr
Thanks for taking a look.
The comment from Joseph (https://issues.apache.org/jira/browse/SPARK-5571?focusedCommentId=14632052&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14632052) seems to prefer to add the code directly.
We can put this on hold if necessary.

@mengxr
Copy link
Contributor

mengxr commented Mar 21, 2016

@hhbyyh Joseph's comment was about carefully introducing new dependencies. If we pick a library that doesn't depend on many others, it should be safe for us. There are several packages containing Porter stemmer, e.g., lucene, CoreNLP, and chalk. Lucene is a lightweight library, but used by many other systems. So it is not a safe choice. CoreNLP is licensed under LGPL, so not an option here. chalk seems okay to me by looking at its dependencies.

I'm a little worried about the cost if we maintain our own implementation in MLlib. We cannot leverage other NLP projects (where the experts are) on possible improvements. So could you take a look at chalk?

@jasonbaldridge To add chalk as a dependency, we need chalk releases for both Scala 2.10 and 2.11. But I only see 2.10 releases on maven central. Do you have plans for publishing new releases for both 2.10 and 2.11?

@jasonbaldridge
Copy link

@mengxr Chalk isn't under current development unfortunately, and I'm not sure whether I'll be getting back to it. Another option might be to add it to the lib-text library, which is being maintained and updated, and might have some other useful things for you:

https://github.com/peoplepattern/lib-text

cc @eponvert and @dlwh

@dlwh
Copy link

dlwh commented Mar 22, 2016

(Totally not paying attention to this issue.)

Epic has a PorterStemmer as well. (Well, I re-took it from Chalk.)
https://github.com/dlwh/epic

I've been neglecting Epic to some extent of late, but it's there and
available.

I'm not sure it's worth adding a large dependency just for that, but just
FYI.

On Mon, Mar 21, 2016 at 9:21 PM, Jason Baldridge notifications@github.com
wrote:

@mengxr https://github.com/mengxr Chalk isn't under current development
unfortunately, and I'm not sure whether I'll be getting back to it. Another
option might be to add it to the lib-text library, which is being
maintained and updated, and might have some other useful things for you:

https://github.com/peoplepattern/lib-text

cc @eponvert https://github.com/eponvert and @dlwh
https://github.com/dlwh


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#10272 (comment)

@MLnick
Copy link
Contributor

MLnick commented Mar 22, 2016

IMO more specific or complex domain-specific stuff should live outside of core, until such time as there is clear demand across a wider user base that justifies the maintenance cost of including it. Already Spark ML has a large maintenance & code review burden just with the algos and feature transformers that are already in there.

The whole point of an API for pipelines is to enable external libraries for more specific use cases. This is doubly the case when well-known and robust libraries already provide the functionality. As you can see from your PR, implementing one's own stemmer transformer using one of the external NLP libs is a few lines of code.

Things like NLP (and image, video and audio processing, for example) should start life as a Spark package. How about looking at contributing to https://github.com/mengxr/spark-corenlp and wrapping the CoreNLP stemmer functionality as a transformer?

@eponvert
Copy link

@jasonbaldridge @mengxr I went ahead and opened peoplepattern/lib-text#21, though I've not scheduled it for a target release yet

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Mar 22, 2016

I'm fine to put this on hold like previously stated.Thank you all for the discussion. It should have provided enough information for anyone interested in using Stemmer with Spark.

I plan to put this in the topic modeling Spark package for now. Will send a link here afterwards.

@hhbyyh hhbyyh closed this Mar 22, 2016
@jasonbaldridge
Copy link

@MLnick CoreNLP is GPL, so I would worry that some people would use it as though it were part of an ASL suite. Spark-CoreNLP is correctly licensed as GPL, but there should be some big warning flags in the README so that people don't inadvertently use it in a way that is inconsistent with the GPL.

FWIW, I would strongly prefer an "NLP for Spark" package to be ASL, so Spark-CoreNLP isn't useful (though it's great stuff, objectively speaking).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants