
[SPARK-8455][ML] Implement n-gram feature transformer #6887

Closed
wants to merge 3 commits into master from feynmanliang:ngram-featurizer

Conversation

feynmanliang
Contributor

Implementation of n-gram feature transformer for ML.

@feynmanliang feynmanliang changed the title Implement n-gram feature transformer [SPARK-8455][ML] Implement n-gram feature transformer Jun 18, 2015
* Default: 2, bigram features
* @group param
*/
val NGramLength: IntParam = new IntParam(this, "NGramLength", "number of elements per n-gram (>=1)",
Contributor

NGramLength -> nGramLength. Actually, I think calling it n should be sufficient given the context.

Contributor Author

OK.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35173 has finished for PR 6887 at commit fe93873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NGram(override val uid: String)

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35214 timed out for PR 6887 at commit 9fadd36 after a configured wait of 175m.

)))
testNGram(NGramTransformer, dataset)
}
test("input array < n yields a single n-gram consisting of input array") {
Contributor

add a blank line here

Contributor Author

OK.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35288 has finished for PR 6887 at commit d2c839f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NGram(override val uid: String)

@jkbradley
Member

Jenkins test this please

@SparkQA

SparkQA commented Jun 21, 2015

Test build #35404 has finished for PR 6887 at commit d2c839f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NGram(override val uid: String)

@jkbradley
Member

2 thoughts:

  • @mengxr suggested that for a sequence of length < n, we return nothing. That does not seem ideal since it throws out information. (I would be surprised if I applied a transformer and got back empty sequences.) Using the default behavior of Scala's grouped seems better to me.
  • (future) In general, people will want to apply: Tokenizer, NGrams, HashingTF. Later on, we should provide something which handles this directly, rather than creating a bunch of intermediate objects.
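The trade-off under discussion can be sketched in plain Scala (illustrative only — this is not the PR's code; note that for n-grams the relevant method is `sliding`, whose default keeps a partial window, which is the lenient behavior jkbradley describes):

```scala
// A length-2 input with n = 3: what should the transformer emit?
val words = Seq("a", "b")

// Default sliding behavior: partial windows are kept, so the whole
// input comes back as one shorter "gram".
val lenient = words.iterator.sliding(3).map(_.mkString(" ")).toSeq
// lenient == Seq("a b")

// Strict alternative (@mengxr's suggestion): drop partial windows,
// so n > input length yields empty output.
val strict = words.iterator.sliding(3).withPartial(false).map(_.mkString(" ")).toSeq
// strict == Seq()
```

Per the commit log, the strict variant is what was ultimately merged ("Make n > input length yield empty output").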

@feynmanliang
Contributor Author

I actually think @mengxr's suggestion makes sense; it guarantees that the output is an actual n-gram rather than depending on whether the input length exceeds n.

Another possibility could be to pad with some null word. The use case I am imagining is an n-gram HMM language model, in which case a partial sequence of k words (k < n) would be represented by padding null words on the left.
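That padding idea can be sketched as follows (hypothetical only — `paddedNGrams` and the `"<null>"` sentinel are my own names, not part of this PR):

```scala
// Hypothetical sentinel token; the name is illustrative.
val NullToken = "<null>"

// Left-pad with n-1 null tokens so a k-word input (k < n) still
// produces at least one full n-gram, as in HMM language models.
def paddedNGrams(words: Seq[String], n: Int): Seq[String] = {
  val padded = Seq.fill(n - 1)(NullToken) ++ words
  padded.iterator.sliding(n).withPartial(false).map(_.mkString(" ")).toSeq
}

// paddedNGrams(Seq("a"), 3) == Seq("<null> <null> a")
```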

@jkbradley
Member

Yeah, I guess I'm OK either way. I figure the normal user won't care if it's a strict n-gram but might be upset if a non-empty document produces an all-zero feature vector.

@jkbradley
Member

LGTM. Merging with master.
Thanks!

@asfgit asfgit closed this in afe35f0 Jun 22, 2015
@feynmanliang feynmanliang deleted the ngram-featurizer branch June 24, 2015 21:35
animeshbaranawal pushed a commit to animeshbaranawal/spark that referenced this pull request Jun 25, 2015
Implementation of n-gram feature transformer for ML.

Author: Feynman Liang <fliang@databricks.com>

Closes apache#6887 from feynmanliang/ngram-featurizer and squashes the following commits:

d2c839f [Feynman Liang] Make n > input length yield empty output
9fadd36 [Feynman Liang] Add empty and corner test cases, fix names and spaces
fe93873 [Feynman Liang] Implement n-gram feature transformer