[Spark-8703] [ML] Add CountVectorizer as a ml transformer to convert document to words count vector #7084

Closed
wants to merge 9 commits into from


hhbyyh
Contributor

@hhbyyh hhbyyh commented Jun 29, 2015

jira: https://issues.apache.org/jira/browse/SPARK-8703

Converts a text document to a sparse vector of token counts.

I can also add an estimator that extracts the vocabulary from a corpus, if that's appropriate.
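
For readers following along, here is a minimal sketch of what "a sparse vector of token counts" could look like in plain Scala. It is not the code from this PR: `CountSketch` and `countTokens` are hypothetical names, and the `(size, indices, values)` tuple only mimics the usual sparse-vector layout, assuming tokens outside the vocabulary are ignored.

```scala
// Hypothetical helper, not the PR's implementation.
object CountSketch {
  /** Counts vocabulary terms in one tokenized document.
    * Returns (vector size, sorted term indices, counts at those indices). */
  def countTokens(
      vocabulary: Array[String],
      tokens: Seq[String]): (Int, Array[Int], Array[Double]) = {
    // Map each vocabulary term to its position in the output vector.
    val index: Map[String, Int] = vocabulary.zipWithIndex.toMap
    val counts = scala.collection.mutable.Map.empty[Int, Double]
    tokens.foreach { token =>
      // Assumption: tokens that are not in the vocabulary are skipped.
      index.get(token).foreach { i =>
        counts(i) = counts.getOrElse(i, 0.0) + 1.0
      }
    }
    val sorted = counts.toSeq.sortBy(_._1)
    (vocabulary.length, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }
}
```

With the vocabulary `Array("a", "b", "c")`, the document `Seq("a", "c", "a", "d")` would yield indices `0` and `2` with counts `2.0` and `1.0`; the token `"d"` is dropped because it is not in the vocabulary.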

@SparkQA

SparkQA commented Jun 29, 2015

Test build #35981 has finished for PR 7084 at commit 7c61fb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2015

Test build #35982 has finished for PR 7084 at commit 809fb59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountVectorizer (override val uid: String, vocabulary: Array[String])

* @group param
*/
val minTermCounts: IntParam = new IntParam(this, "minTermCounts",
"lower bound of effective term counts (>= 0)", ParamValidators.gtEq(1))
Contributor

Should be "(>= 1)" instead of "(>= 0)"

Contributor Author

Thanks, already changed that in the new commit.

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36075 has finished for PR 7084 at commit 7ee1c31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountVectorizer (override val uid: String, vocabulary: Array[String]) extends HashingTF

@Experimental
class CountVectorizer (override val uid: String, vocabulary: Array[String]) extends HashingTF{

def this(vocabulary: Array[String]) = this(Identifiable.randomUID("countVectorizer"), vocabulary)
Contributor

This is probably fine for now, but I had some thoughts about having an empty constructor for including every word encountered if no vocabulary is provided. If it requires significant modification, we should make a separate JIRA for it.

@hhbyyh
Contributor Author

hhbyyh commented Jul 2, 2015

Yes, that's the plan (an estimator). I know @jkbradley has a similar implementation in the LDA example. If @jkbradley is interested in migrating it here (perhaps in another JIRA), we can keep the scope of this PR to the transformer only. Anyway, I think the constructor for passing in a vocabulary is useful.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36329 has finished for PR 7084 at commit 99b0c14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountVectorizer (override val uid: String, vocabulary: Array[String])

@jkbradley
Member

I agree we should add an Estimator version of CountVectorizer which first fits on the data to build a dictionary. Because of that, maybe we should call this PR's class CountVectorizerModel, and a later PR can add the CountVectorizer (which will be the Estimator version).

I'll take a look through the PR now.
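
To illustrate the Estimator/Model split discussed above, here is a rough sketch of the "fit" half in plain Scala. The thread does not say how the future Estimator would choose its vocabulary, so the rule used here (keep the most frequent terms across the corpus, ties broken alphabetically) is purely an assumption, and `VocabSketch`, `fitVocabulary`, and `maxSize` are hypothetical names.

```scala
// Hypothetical sketch of the "fit" step; the selection rule is an assumption.
object VocabSketch {
  /** Builds a vocabulary from a tokenized corpus by keeping the `maxSize`
    * most frequent terms (ties broken alphabetically for determinism). */
  def fitVocabulary(corpus: Seq[Seq[String]], maxSize: Int): Array[String] = {
    // Total occurrences of each term over all documents.
    val totals: Map[String, Int] =
      corpus.flatten.groupBy(identity).map { case (term, occ) => (term, occ.size) }
    totals.toSeq
      .sortBy { case (term, count) => (-count, term) }
      .take(maxSize)
      .map(_._1)
      .toArray
  }
}
```

The resulting array is the kind of value that could be passed to the vocabulary-based constructor in this PR, which is why exposing the vocabulary on the model is useful.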

* @param vocabulary An Array over terms. Only the terms in the vocabulary will be counted.
*/
@Experimental
class CountVectorizer (override val uid: String, vocabulary: Array[String])
Member

Should we make vocabulary a val? That would let users access the dictionary once we add an Estimator version.

Contributor Author

Great point!

@jkbradley
Member

That's all for a first pass!

@hhbyyh
Contributor Author

hhbyyh commented Jul 6, 2015

Thanks a lot, @jkbradley. I sent an update with:

  1. change the class name to CountVectorizerModel.
  2. make vocab a val.
  3. change minTermCount to minTermFreq and improve doc.
  4. other minor fixes.
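
As a rough illustration of the renamed parameter, the sketch below applies a per-document lower bound to term counts. The thread only describes it as a "lower bound of effective term counts (>= 1)", so the exact semantics (terms counted fewer than `minTermFreq` times in a document are dropped) are an assumption, and `MinFreqSketch` and `filterCounts` are hypothetical names.

```scala
// Hypothetical sketch; the filtering semantics are assumed, not taken from the PR.
object MinFreqSketch {
  /** Keeps only the terms whose count in this document reaches `minTermFreq`. */
  def filterCounts(counts: Map[String, Int], minTermFreq: Int): Map[String, Int] = {
    // Mirrors the (>= 1) validation mentioned in the review comments.
    require(minTermFreq >= 1, "minTermFreq must be >= 1")
    counts.filter { case (_, count) => count >= minTermFreq }
  }
}
```

With `minTermFreq = 1` every counted term is kept, so the filter only has an effect for thresholds of 2 or more.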

@SparkQA

SparkQA commented Jul 6, 2015

Test build #36560 has finished for PR 7084 at commit 24728e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountVectorizerModel (override val uid: String, val vocabulary: Array[String])
    • case class ScalaUDF(
    • case class CurrentDate() extends LeafExpression
    • case class CurrentTimestamp() extends LeafExpression
    • case class Hex(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class UnHex(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class ShiftLeft(left: Expression, right: Expression)
    • case class ShiftRight(left: Expression, right: Expression)
    • case class ShiftRightUnsigned(left: Expression, right: Expression)
    • case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
    • case class Ascii(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Base64(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class UnBase64(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Decode(bin: Expression, charset: Expression) extends Expression with ExpectsInputTypes
    • case class Encode(value: Expression, charset: Expression)
    • case class UserDefinedFunction protected[sql] (

extends UnaryTransformer[Seq[String], Vector, CountVectorizerModel] {

def this(vocabulary: Array[String]) =
this(Identifiable.randomUID("countVectorizerModel"), vocabulary)
Member

I just noticed this: we generally use very short uid names. How about "cntVec"?

@jkbradley
Member

@hhbyyh Thank you for the updates! Other than those 2 nits, it looks good.

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36922 has finished for PR 7084 at commit 5f3f655.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountVectorizerModel (override val uid: String, val vocabulary: Array[String])

@jkbradley
Member

LGTM. Merging with master.
Thanks!

@asfgit asfgit closed this in 0cd84c8 Jul 9, 2015
@jkbradley
Member

@hhbyyh Could you please make follow-up JIRAs?

  • CountVectorizer (which does estimation)
  • Python API
  • documentation

Thanks!

@hhbyyh
Contributor Author

hhbyyh commented Jul 10, 2015

Thanks, @jkbradley. I just want to know whether you are interested in CountVectorizer. I assume it will be similar to the preprocessing in the LDA example.

@jkbradley
Member

I think it'd be nice to have. Feel free to take code from that example. The CountVectorizer PR or a later PR could modify the LDA example to use CountVectorizer.

4 participants