-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark-8703] [ML] Add CountVectorizer as a ml transformer to convert document to words count vector #7084
Conversation
Test build #35981 has finished for PR 7084 at commit
|
Test build #35982 has finished for PR 7084 at commit
|
* @group param | ||
*/ | ||
val minTermCounts: IntParam = new IntParam(this, "minTermCounts", | ||
"lower bound of effective term counts (>= 0)", ParamValidators.gtEq(1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be "(>= 1)" instead of "(>= 0)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, already changed that in the new commit.
Test build #36075 has finished for PR 7084 at commit
|
@Experimental | ||
class CountVectorizer (override val uid: String, vocabulary: Array[String]) extends HashingTF{ | ||
|
||
def this(vocabulary: Array[String]) = this(Identifiable.randomUID("countVectorizer"), vocabulary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably fine for now, but I had some thoughts about having an empty constructor for including every word encountered if no vocabulary is provided. If it requires significant modification, we should make a separate JIRA for it.
Yes that's the plan (an estimator). And I know jkbradley has a similar implementation in LDA example. If @jkbradley is interested in migrating it here ( perhaps another jira) , we can keep the scope of this to transformer only. Any way, I think the constructor for passing in a vocabulary is useful. |
Test build #36329 has finished for PR 7084 at commit
|
I agree we should add an Estimator version of CountVectorizer which first fits on the data to build a dictionary. Because of that, maybe we should call this PR's class CountVectorizerModel, and a later PR can add the CountVectorizer (which will be the Estimator version). I'll take a look through the PR now. |
* @param vocabulary An Array over terms. Only the terms in the vocabulary will be counted. | ||
*/ | ||
@Experimental | ||
class CountVectorizer (override val uid: String, vocabulary: Array[String]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make vocabulary be a val? That will be good when we make an Estimator version to let users access the dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point!
That's all for a first pass! |
Thank a lot @jkbradley. I sent an update with:
|
Test build #36560 has finished for PR 7084 at commit
|
extends UnaryTransformer[Seq[String], Vector, CountVectorizerModel] { | ||
|
||
def this(vocabulary: Array[String]) = | ||
this(Identifiable.randomUID("countVectorizerModel"), vocabulary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed this: we generally use very short uid names. How about "cntVec"?
@hhbyyh Thank you for the updates! Other than those 2 nits, it looks good. |
Test build #36922 has finished for PR 7084 at commit
|
LGTM merging with master |
@hhbyyh Could you please make follow-up JIRAs?
Thanks! |
Thanks @jkbradley , just want to know if you are interested in CountVectorizer. I assume it will be similar to the pre-process in LDA example. |
I think it'd be nice to have. Feel free to take code from that example. The CountVectorizer PR or a later PR could modify the LDA example to use CountVectorizer. |
jira: https://issues.apache.org/jira/browse/SPARK-8703
Converts a text document to a sparse vector of token counts.
I can further add an estimator to extract vocabulary from corpus if that's appropriate.