[SPARK-10299][ML] word2vec should allow users to specify the window size #8513

Contributor

@holdenk holdenk commented Aug 28, 2015

Currently word2vec has the window size hard-coded at 5, but some users may want different sizes (for example, when using it on n-gram input or similar). The user request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size .
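
To illustrate what the window size controls, here is a minimal, self-contained sketch in plain Scala (no Spark dependency). It is not the Spark implementation: `WindowSketch` and `contextWords` are hypothetical names introduced only for this example, and it assumes a fixed symmetric window, whereas some word2vec implementations vary the effective window per position.

```scala
object WindowSketch {
  /** Context words for the token at `pos`: up to `window` tokens on each side. */
  def contextWords(tokens: IndexedSeq[String], pos: Int, window: Int): IndexedSeq[String] = {
    require(window > 0, "window must be positive")
    require(pos >= 0 && pos < tokens.length, "pos out of range")
    // Offsets [-window, window] around pos, clipped to the sentence bounds.
    val start = math.max(0, pos - window)
    val end = math.min(tokens.length - 1, pos + window)
    // Skip the center word itself and look up the remaining positions.
    (start to end).filter(_ != pos).map(tokens)
  }

  def main(args: Array[String]): Unit = {
    val sentence = "the quick brown fox jumps over the lazy dog".split(" ").toIndexedSeq
    val center = 4 // index of "jumps"
    for (w <- Seq(1, 2, 5)) {
      println(s"window=$w -> ${contextWords(sentence, center, w).mkString(" ")}")
    }
  }
}
```

With the center word "jumps", a window of 1 yields "fox over" and a window of 2 yields "brown fox over the", so a larger window pairs each word with more distant neighbors.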

@SparkQA

SparkQA commented Aug 28, 2015

Test build #41762 has finished for PR 8513 at commit f0fd13c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk changed the title from "[SPARK-10299][ML][WIP] word2vec should allow users to specify the window size" to "[SPARK-10299][ML] word2vec should allow users to specify the window size" on Aug 30, 2015
@@ -49,6 +49,17 @@ private[feature] trait Word2VecBase extends Params
def getVectorSize: Int = $(vectorSize)

/**
* The window size (context words from [-window, window])
Contributor

nit: end line with "."

Contributor

Unrelated to this PR, but we should also have the defaults documented in the scaladocs

Contributor Author

Should we maybe make a cleanup JIRA to do this for all the params?

Contributor

Yes, that would be great!

@feynmanliang
Contributor

LGTM, minor doc comments which could be addressed in separate PR

@SparkQA

SparkQA commented Sep 1, 2015

Test build #41847 has finished for PR 8513 at commit c125c3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor Author

holdenk commented Sep 18, 2015

cc @Ishiihara who I think was maybe the original author of the fixed window size.

@holdenk
Contributor Author

holdenk commented Oct 1, 2015

ping @mengxr who has some recent commits in this file.

@Ishiihara
Contributor

@holdenk LGTM. The window size was made constant because it does not affect the result much given a large corpus.

@holdenk
Contributor Author

holdenk commented Oct 14, 2015

@Ishiihara do you think it is worth merging then, or not so much? The documentation I've seen for different word2vec implementations seems to indicate that changing the window size can make a difference (and there is the user request to make it configurable).

@holdenk
Contributor Author

holdenk commented Dec 1, 2015

ping @mengxr or @jkbradley: if this looks OK to you, it would be nice to get it merged in.

@srowen
Member

srowen commented Dec 2, 2015

Although the window size doesn't matter a lot, yeah, it seems desirable to make it configurable.

@holdenk
Contributor Author

holdenk commented Dec 4, 2015

@srowen Would you be comfortable merging given the existing review by the original author? Or should I get another set of eyes to take a look?

@srowen
Member

srowen commented Dec 5, 2015

I'm OK with this; I'm only uncertain about merging for 1.6.0. Eh, 1.7.0? 2.0.0? It just matters because of the version label in @since. I had preferred writing 1.7.0 and changing it later to 2.0.0 if needed. Maybe this is also a good time to check whether @rxin is OK with at least tagging things for 1.7 at the moment, which may be revised later.

@rxin
Contributor

rxin commented Dec 7, 2015

Feel free to do whatever.

@MLnick
Contributor

MLnick commented Dec 7, 2015

@srowen @marmbrus @rxin since 1.6.0-RC2 will still be cut, as there seem to be a few critical bugs (e.g. https://issues.apache.org/jira/browse/SPARK-12155 and https://issues.apache.org/jira/browse/SPARK-12165), and this is a very minor change, can we put it in 1.6.0? If not, I think we should just target 1.7.0 for now.

LGTM too.

@srowen
Member

srowen commented Dec 7, 2015

I'm OK with that; it's quite safe and minor. I'd understand if someone objected since it's not a fix. Let me pause for that.

@@ -49,6 +49,17 @@ private[feature] trait Word2VecBase extends Params
def getVectorSize: Int = $(vectorSize)

/**
* The window size (context words from [-window, window]).
Member

State default

@jkbradley
Member

Just had minor comments, but I feel like the SQLContext issue should probably be fixed before merging. I'm OK with putting it in 1.6

}
}

Contributor

Why remove this final line? I think this would fail the style checker.

Contributor Author

I'm not sure why this shows up this way in the GitHub diff viewer; there is a newline after the window size test. (I'll merge master in again and see if that fixes the diff view.)

@SparkQA

SparkQA commented Dec 8, 2015

Test build #2182 has finished for PR 8513 at commit c125c3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 9, 2015

Test build #47372 has finished for PR 8513 at commit 76d7b5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor Author

holdenk commented Dec 9, 2015

@jkbradley I've addressed the issues (and also cleaned up the rest of the tests in the same file).

asfgit pushed a commit that referenced this pull request Dec 9, 2015
Currently word2vec has the window size hard-coded at 5, but some users may want different sizes (for example, when using it on n-gram input or similar). The user request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size .

Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>

Closes #8513 from holdenk/SPARK-10299-word2vec-should-allow-users-to-specify-the-window-size.

(cherry picked from commit 22b9a87)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@srowen
Member

srowen commented Dec 9, 2015

Merged to master/1.6

@asfgit asfgit closed this in 22b9a87 Dec 9, 2015