[Spark-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer by GayathriMurali · Pull Request #13745 · apache/spark

GayathriMurali · 2016-06-18T00:26:16Z

What changes were proposed in this pull request?

Made changes to HashingTF, QuantileVectorizer and CountVectorizer

mengxr · 2016-06-18T04:19:45Z

add to whitelist

mengxr · 2016-06-18T04:22:01Z

docs/ml-features.md

Could you limit the line width to 100?

SparkQA · 2016-06-18T05:17:27Z

Test build #60752 has finished for PR 13745 at commit 01e4a08.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-18T05:22:54Z

Test build #60758 has finished for PR 13745 at commit 3b01f11.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-06-20T10:14:50Z

docs/ml-features.md

Space missing: .Then -> . Then

MLnick · 2016-06-20T10:46:49Z

@GayathriMurali could you update the title to be SPARK-15997 as opposed to Spark 15997?

MLnick · 2016-06-20T11:00:56Z

@jkbradley we could force a single partition in the data with repartition in the example code (per your comment on #13176). Perhaps we can hide that from the doc with example off? It is not ideal to have that in the example code either, but I agree it's better than relativeError(0).

The only other solution I can think of off-hand is to change the example input data to be large enough that it shouldn't matter about partitioning.

GayathriMurali · 2016-06-21T21:19:24Z

@jkbradley @MLnick I agree with the repartition idea. That said, it may not be a bad idea to call out that the approxQuantile calculation for smaller datasets may differ across machines, depending on the number of cores available, and leave the example and code as is. Please let me know what's best and I can change the documentation accordingly.

jkbradley · 2016-06-22T17:25:05Z

+1 for hiding repartition using example off/on
I'd add a small comment next to the repartition to make it clear why it is there.

GayathriMurali · 2016-06-22T23:35:23Z

@jkbradley @MLnick repartition needs to be added along with the creation of the DataFrame, like this:

    val df = spark.createDataFrame(data).toDF("id", "hour").repartition(1)

Since df is a val, we cannot hide this statement. I could make df mutable (a var), but that would seem inconsistent. Am I missing something here?

jkbradley · 2016-06-23T01:20:56Z

Does this work?

    val df = spark.createDataFrame(data).toDF("id", "hour")
    // $example off$
      .repartition(1)
    // $example on$

GayathriMurali · 2016-06-23T02:07:04Z

Oops! That works.
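
For readers unfamiliar with the markers, here is a rough sketch of why the trick above hides `.repartition(1)` from the rendered guide. It is a simplified stand-in written in Python, not Spark's actual doc tooling: it assumes the doc build keeps only the lines between `$example on$` and `$example off$`. The function name `visible_lines`, the leading `$example on$` line, and the trailing `df.show()` line are hypothetical additions for illustration.

```python
from typing import List


def visible_lines(source: str) -> List[str]:
    """Return only the lines that sit between `$example on$` and `$example off$`."""
    shown = []
    showing = False  # assumption: nothing is shown until the first "on" marker
    for line in source.splitlines():
        if "$example on$" in line:
            showing = True
        elif "$example off$" in line:
            showing = False
        elif showing:
            shown.append(line)
    return shown


# Hypothetical excerpt of the Scala example file, using the marker layout above.
scala_example = """\
// $example on$
val df = spark.createDataFrame(data).toDF("id", "hour")
// $example off$
  .repartition(1)
// $example on$
df.show()
// $example off$
"""

for line in visible_lines(scala_example):
    print(line)
```

Under that assumption, the compiled Scala file still chains `.repartition(1)` onto the `val df` expression, while the snippet shown in the guide contains only the `val df` line and whatever follows the second `$example on$`.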

SparkQA · 2016-06-23T04:07:55Z

Test build #61092 has finished for PR 13745 at commit 1074351.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-23T04:53:18Z

Test build #61094 has finished for PR 13745 at commit bbc0868.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-06-23T06:53:04Z

examples/src/main/python/ml/quantile_discretizer_example.py

Did you check that this works? I think it will throw a SyntaxError. You may need to do dataFrame = dataFrame.repartition(1).

Also, can we just call it df to match the other examples? (I just noticed that.)

@MLnick I ran all unit tests and also tested them manually, and it works fine. But I guess writing df = df.repartition(1) on the next line looks better. I will modify that.

This is what I get:

    ./bin/spark-submit ~/workspace/quantile_discretizer_example.py
      File "/Users/nick/workspace/quantile_discretizer_example.py", line 18
        .repartition(1)
        ^
    SyntaxError: invalid syntax
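
The error above can be reproduced without Spark. The following is a minimal sketch that only parses source text, under the assumption that the Python example mirrored the Scala layout with `.repartition(1)` on its own line. The `createDataFrame(data, ["id", "hour"])` call inside the strings is a guess at the example's contents, and `parses` is a hypothetical helper name.

```python
import ast


def parses(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Leading-dot continuation, as in the Scala version. In Python the first line
# is already a complete statement, so a line starting with "." cannot follow it.
leading_dot = (
    'df = spark.createDataFrame(data, ["id", "hour"])\n'
    '# $example off$\n'
    '.repartition(1)\n'
    '# $example on$\n'
)

# The fix suggested in the review: reassign on a separate line.
reassigned = (
    'df = spark.createDataFrame(data, ["id", "hour"])\n'
    '# $example off$\n'
    'df = df.repartition(1)\n'
    '# $example on$\n'
)

print(parses(leading_dot))  # False
print(parses(reassigned))   # True
```

Scala accepts a method call that begins on the next line with a leading dot, but Python ends the statement at the newline unless the expression is inside brackets, which is why the separate reassignment is needed here.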

MLnick · 2016-06-23T07:14:39Z

@GayathriMurali a couple of final comments, then I think it's good to go. Thanks!

SparkQA · 2016-06-23T15:18:06Z

Test build #61115 has finished for PR 13745 at commit 2225c0a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-06-24T11:25:45Z

LGTM. Merged to master/branch-2.0. Thanks!

…rizer and CountVectorizer

## What changes were proposed in this pull request?

Made changes to HashingTF,QuantileVectorizer and CountVectorizer

Author: GayathriMurali <gayathri.m@intel.com>

Closes #13745 from GayathriMurali/SPARK-15997.

(cherry picked from commit be88383)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>

GayathriMurali mentioned this pull request Jun 18, 2016

[SPARK-15100][DOC] Modified user guide and examples for CountVectorizer, HashingTF and QuantileDiscretizer #13176

Closed

GayathriMurali changed the title ~~[Spark 15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer~~ [Spark-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer Jun 21, 2016

GayathriMurali added 8 commits June 23, 2016 07:31

User guide changes to CountVectorizer, QuantileDiscretizer and HashingTF

75a8e52

Review comments

3c319f4

Review comments

b8ce97a

Review comments

c40fd4b

Review comments

c3bd40e

Review Comments

4372682

Fixing QuantileDiscretizer doc and example

cc88685

Including relativeError in all examples with a note

b874123

GayathriMurali added 7 commits June 23, 2016 07:31

Fixed python style issue

2241181

Review comments

7dfa33b

Remove default value inclusion

7b9204f

Limit line width

d3beddf

used repartition to fix the inconsistent output issue with DiscreteQuantizer

c0fb5ef

Typo fix

767bb5f

Review comments

2225c0a

GayathriMurali force-pushed the SPARK-15997 branch from bbc0868 to 2225c0a Compare June 23, 2016 14:54

asfgit closed this in be88383 Jun 24, 2016

5 participants
