[Spark-8703] [ML] Add CountVectorizer as a ml transformer to convert document to words count vector #7084

hhbyyh · 2015-06-29T11:52:24Z

jira: https://issues.apache.org/jira/browse/SPARK-8703

Converts a text document to a sparse vector of token counts.

I can further add an estimator to extract vocabulary from corpus if that's appropriate.
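
For illustration, the conversion described above can be sketched in plain Scala without Spark. This is not the PR's implementation: the object name `CountVectorSketch` and the helper `countTokens` are hypothetical, and the sparse vector is represented here as a sequence of (index, count) pairs rather than an MLlib `Vector`.

```scala
// Minimal sketch, assuming a fixed vocabulary and an already tokenized document.
// Names are hypothetical and do not come from the PR.
object CountVectorSketch {

  /** Returns (vocabulary index, count) pairs sorted by index.
    * Tokens that are not in the vocabulary are ignored, and zero counts are not stored. */
  def countTokens(vocabulary: Array[String], tokens: Seq[String]): Seq[(Int, Double)] = {
    // Map each vocabulary term to its position in the output vector.
    val index: Map[String, Int] = vocabulary.zipWithIndex.toMap
    tokens
      .flatMap(index.get)            // keep only in-vocabulary tokens, as indices
      .groupBy(identity)             // group equal indices together
      .map { case (i, hits) => (i, hits.size.toDouble) }
      .toSeq
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val vocab = Array("a", "b", "c")
    val doc = Seq("a", "b", "a", "d") // "d" is out of vocabulary and is dropped
    println(countTokens(vocab, doc))
  }
}
```

Only non-zero entries are kept, which is what makes the representation sparse when the vocabulary is much larger than a single document.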

SparkQA · 2015-06-29T13:07:21Z

Test build #35981 has finished for PR 7084 at commit 7c61fb3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-29T13:40:47Z

Test build #35982 has finished for PR 7084 at commit 809fb59.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CountVectorizer (override val uid: String, vocabulary: Array[String])

feynmanliang · 2015-06-29T23:51:14Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

+   * @group param
+   */
+  val minTermCounts: IntParam = new IntParam(this, "minTermCounts",
+    "lower bound of effective term counts (>= 0)", ParamValidators.gtEq(1))


Should be "(>= 1)" instead of "(>= 0)"

Thanks, already changed that in the new commit.
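
One way to keep a param's doc string and its validator from drifting apart, as happened in the snippet above, is to derive both from the same constant. The following is a hedged sketch in plain Scala: `gtEq` here is a local stand-in written for this example, not Spark's `ParamValidators.gtEq`, and `ParamDocSketch` is a hypothetical name.

```scala
// Minimal sketch: the documented bound and the enforced bound share one constant.
object ParamDocSketch {
  // Local stand-in for a lower-bound validator (assumption, not the Spark API).
  def gtEq(lowerBound: Int): Int => Boolean = value => value >= lowerBound

  val lowerBound: Int = 1

  // The doc string is built from the same constant the validator uses.
  val doc: String = s"lower bound of effective term counts (>= $lowerBound)"

  val isValid: Int => Boolean = gtEq(lowerBound)
}
```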

SparkQA · 2015-06-30T04:03:54Z

Test build #36075 has finished for PR 7084 at commit 7ee1c31.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CountVectorizer (override val uid: String, vocabulary: Array[String]) extends HashingTF

feynmanliang · 2015-07-01T20:43:53Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

+@Experimental
+class CountVectorizer (override val uid: String, vocabulary: Array[String]) extends HashingTF{
+
+  def this(vocabulary: Array[String]) = this(Identifiable.randomUID("countVectorizer"), vocabulary)


This is probably fine for now, but I had some thoughts about having an empty constructor for including every word encountered if no vocabulary is provided. If it requires significant modification, we should make a separate JIRA for it.

hhbyyh · 2015-07-02T00:28:40Z

Yes, that's the plan (an estimator). I know jkbradley has a similar implementation in the LDA example. If @jkbradley is interested in migrating it here (perhaps in another JIRA), we can keep the scope of this PR to the transformer only. Anyway, I think the constructor for passing in a vocabulary is useful.

SparkQA · 2015-07-02T03:54:39Z

Test build #36329 has finished for PR 7084 at commit 99b0c14.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CountVectorizer (override val uid: String, vocabulary: Array[String])

jkbradley · 2015-07-05T20:03:37Z

I agree we should add an Estimator version of CountVectorizer which first fits on the data to build a dictionary. Because of that, maybe we should call this PR's class CountVectorizerModel, and a later PR can add the CountVectorizer (which will be the Estimator version).

I'll take a look through the PR now.
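
As a rough illustration of the Estimator idea discussed above (fit on the data to build a dictionary, then hand that dictionary to the model), here is a plain-Scala sketch. It is an assumption about how such a step could look, not code from this PR or from the LDA example: `VocabularyFitSketch` and `fitVocabulary` are hypothetical names, and ordering terms by descending frequency is a choice made only for this example.

```scala
// Minimal sketch, assuming the corpus is already tokenized and fits in memory.
object VocabularyFitSketch {

  /** Builds a vocabulary from a corpus: most frequent term first,
    * with ties broken alphabetically so the result is deterministic. */
  def fitVocabulary(corpus: Seq[Seq[String]]): Array[String] =
    corpus.flatten
      .groupBy(identity)
      .map { case (term, hits) => (term, hits.size) }
      .toSeq
      .sortBy { case (term, count) => (-count, term) }
      .map(_._1)
      .toArray
}
```

The resulting array could then be passed to a vocabulary-taking constructor such as the one this PR adds.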

jkbradley · 2015-07-05T23:05:45Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

+ * @param vocabulary An Array over terms. Only the terms in the vocabulary will be counted.
+ */
+@Experimental
+class CountVectorizer (override val uid: String, vocabulary: Array[String])


Should we make vocabulary be a val? That will be good when we make an Estimator version to let users access the dictionary.

Great point!

jkbradley · 2015-07-05T23:06:12Z

That's all for a first pass!

hhbyyh · 2015-07-06T05:24:59Z

Thanks a lot, @jkbradley. I sent an update with these changes:

changed the class name to CountVectorizerModel.
made vocab a val.
renamed minTermCount to minTermFreq and improved the doc.
other minor fixes.

SparkQA · 2015-07-06T06:04:34Z

Test build #36560 has finished for PR 7084 at commit 24728e4.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CountVectorizerModel (override val uid: String, val vocabulary: Array[String])
- case class ScalaUDF(
- case class CurrentDate() extends LeafExpression
- case class CurrentTimestamp() extends LeafExpression
- case class Hex(child: Expression) extends UnaryExpression with ExpectsInputTypes
- case class UnHex(child: Expression) extends UnaryExpression with ExpectsInputTypes
- case class ShiftLeft(left: Expression, right: Expression)
- case class ShiftRight(left: Expression, right: Expression)
- case class ShiftRightUnsigned(left: Expression, right: Expression)
- case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
- case class Ascii(child: Expression) extends UnaryExpression with ExpectsInputTypes
- case class Base64(child: Expression) extends UnaryExpression with ExpectsInputTypes
- case class UnBase64(child: Expression) extends UnaryExpression with ExpectsInputTypes
- case class Decode(bin: Expression, charset: Expression) extends Expression with ExpectsInputTypes
- case class Encode(value: Expression, charset: Expression)
- case class UserDefinedFunction protected[sql] (

jkbradley · 2015-07-09T00:02:45Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala

+  extends UnaryTransformer[Seq[String], Vector, CountVectorizerModel] {
+
+  def this(vocabulary: Array[String]) =
+    this(Identifiable.randomUID("countVectorizerModel"), vocabulary)


I just noticed this: we generally use very short uid names. How about "cntVec"?

jkbradley · 2015-07-09T00:04:08Z

@hhbyyh Thank you for the updates! Other than those 2 nits, it looks good.

SparkQA · 2015-07-09T10:43:28Z

Test build #36922 has finished for PR 7084 at commit 5f3f655.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CountVectorizerModel (override val uid: String, val vocabulary: Array[String])

jkbradley · 2015-07-09T17:26:07Z

LGTM, merging with master.
Thanks!

jkbradley · 2015-07-09T20:24:11Z

@hhbyyh Could you please make follow-up JIRAs?

CountVectorizer (which does estimation)
Python API
documentation

Thanks!

hhbyyh · 2015-07-10T01:39:47Z

Thanks @jkbradley, I just want to know if you are interested in CountVectorizer. I assume it will be similar to the preprocessing in the LDA example.

jkbradley · 2015-07-10T18:05:27Z

I think it'd be nice to have. Feel free to take code from that example. The CountVectorizer PR or a later PR could modify the LDA example to use CountVectorizer.

hhbyyh added 2 commits June 29, 2015 19:42

add countVectorizer

7c61fb3

minor fix for ut

809fb59

feynmanliang reviewed Jun 29, 2015

extends HashingTF

7ee1c31

feynmanliang reviewed Jul 1, 2015

hhbyyh added 2 commits July 2, 2015 09:30

Merge remote-tracking branch 'upstream/master' into countVectorization

12c2dc8

undo extension from HashingTF

99b0c14

jkbradley reviewed Jul 5, 2015

hhbyyh added 3 commits July 6, 2015 09:40

Merge remote-tracking branch 'upstream/master' into countVectorization

1deca28

rename to model and some fix

576728a

style improvement

24728e4

jkbradley reviewed Jul 9, 2015

text change

5f3f655

asfgit closed this in 0cd84c8 Jul 9, 2015

feynmanliang mentioned this pull request Jul 9, 2015

[Spark-8169] [ML] Add StopWordsRemover as a transformer #6742

Closed
