
[SPARK-8455][ML] Implement n-gram feature transformer #6887

Closed
wants to merge 3 commits into master from feynmanliang:ngram-featurizer

Conversation

feynmanliang
Contributor

Implementation of n-gram feature transformer for ML.

@feynmanliang feynmanliang changed the title Implement n-gram feature transformer [SPARK-8455][ML] Implement n-gram feature transformer Jun 18, 2015
* Default: 2, bigram features
* @group param
*/
val NGramLength: IntParam = new IntParam(this, "NGramLength", "number of elements per n-gram (>=1)",
Contributor

NGramLength -> nGramLength. Actually, I think calling it n should be sufficient given the context.

Contributor Author

OK.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35173 has finished for PR 6887 at commit fe93873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NGram(override val uid: String)

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35214 timed out for PR 6887 at commit 9fadd36 after a configured wait of 175m.

)))
testNGram(NGramTransformer, dataset)
}
test("input array < n yields a single n-gram consisting of input array") {
Contributor

add a blank line here

Contributor Author

OK.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35288 has finished for PR 6887 at commit d2c839f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NGram(override val uid: String)

@jkbradley
Member

Jenkins test this please

@SparkQA

SparkQA commented Jun 21, 2015

Test build #35404 has finished for PR 6887 at commit d2c839f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NGram(override val uid: String)

@jkbradley
Member

2 thoughts:

  • @mengxr suggested that for a sequence of length < n, we return nothing. That does not seem ideal since it throws out information. (I would be surprised if I applied a transformer and got back empty sequences.) Using the default behavior of Scala's grouped seems better to me.
  • (future) In general, people will want to apply: Tokenizer, NGrams, HashingTF. Later on, we should provide something which handles this directly, rather than creating a bunch of intermediate objects.
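The trade-off under discussion can be sketched in plain Scala (illustrative only — this is not the PR's code; note that for n-grams the relevant method is `sliding`, whose default keeps a partial window, which is the lenient behavior jkbradley describes):

```scala
// A length-2 input with n = 3: what should the transformer emit?
val words = Seq("a", "b")

// Default sliding behavior: partial windows are kept, so the whole
// input comes back as one shorter "gram".
val lenient = words.iterator.sliding(3).map(_.mkString(" ")).toSeq
// lenient == Seq("a b")

// Strict alternative (@mengxr's suggestion): drop partial windows,
// so n > input length yields empty output.
val strict = words.iterator.sliding(3).withPartial(false).map(_.mkString(" ")).toSeq
// strict == Seq()
```

Per the commit log, the strict variant is what was ultimately merged ("Make n > input length yield empty output").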

@feynmanliang
Contributor Author

I actually think @mengxr's suggestion makes sense; it guarantees that the output is an actual n-gram rather than depending on whether the input length exceeds n.

Another possibility could be to pad with some null word. The use case I am imagining is an n-gram HMM language model, in which case a partial sequence of k words (k < n) would be represented by padding null words on the left.
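That padding idea can be sketched as follows (hypothetical only — `paddedNGrams` and the `"<null>"` sentinel are my own names, not part of this PR):

```scala
// Hypothetical sentinel token; the name is illustrative.
val NullToken = "<null>"

// Left-pad with n-1 null tokens so a k-word input (k < n) still
// produces at least one full n-gram, as in HMM language models.
def paddedNGrams(words: Seq[String], n: Int): Seq[String] = {
  val padded = Seq.fill(n - 1)(NullToken) ++ words
  padded.iterator.sliding(n).withPartial(false).map(_.mkString(" ")).toSeq
}

// paddedNGrams(Seq("a"), 3) == Seq("<null> <null> a")
```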

@jkbradley
Member

Yeah, I guess I'm OK either way. I figure the normal user won't care if it's a strict n-gram but might be upset if a non-empty document produces an all-zero feature vector.

@jkbradley
Member

LGTM. Merging with master.
Thanks!

@asfgit asfgit closed this in afe35f0 Jun 22, 2015
@feynmanliang feynmanliang deleted the ngram-featurizer branch June 24, 2015 21:35
animeshbaranawal pushed a commit to animeshbaranawal/spark that referenced this pull request Jun 25, 2015
Implementation of n-gram feature transformer for ML.

Author: Feynman Liang <fliang@databricks.com>

Closes apache#6887 from feynmanliang/ngram-featurizer and squashes the following commits:

d2c839f [Feynman Liang] Make n > input length yield empty output
9fadd36 [Feynman Liang] Add empty and corner test cases, fix names and spaces
fe93873 [Feynman Liang] Implement n-gram feature transformer