Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7794] [MLLIB] update RegexTokenizer default settings #6330

Closed
wants to merge 1 commit into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented May 21, 2015

The previous default is {gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}. The default pattern is hard to understand. This PR changes the default to {gaps: true, pattern: "\\s+"}. @jkbradley

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33281 has finished for PR 6330 at commit 5ee7cde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
My only comment is that I'd prefer to have the Python doc (including doc embedded in Param) match the Scala doc, rather than being a short version of it. I also like having the embedded doc match the Scala/Python generated doc so that explainParam prints out more info.

asfgit pushed a commit that referenced this pull request May 22, 2015
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:

5ee7cde [Xiangrui Meng] update RegexTokenizer default settings

(cherry picked from commit f5db4b4)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in f5db4b4 May 22, 2015
@mengxr
Copy link
Contributor Author

mengxr commented May 22, 2015

Agree. Right now it has huge overhead of keeping the docs in sync. It would be nice if this could be done automatically, e.g., geting the Scala Param.doc via Py4J. I merged this into master and branch-1.4.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#6330 from mengxr/SPARK-7794 and squashes the following commits:

5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#6330 from mengxr/SPARK-7794 and squashes the following commits:

5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#6330 from mengxr/SPARK-7794 and squashes the following commits:

5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants