Skip to content

[Spark-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer#13745

Closed
GayathriMurali wants to merge 15 commits intoapache:masterfrom
GayathriMurali:SPARK-15997
Closed

[Spark-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer#13745
GayathriMurali wants to merge 15 commits intoapache:masterfrom
GayathriMurali:SPARK-15997

Conversation

@GayathriMurali
Copy link
Contributor

What changes were proposed in this pull request?

Made changes to HashingTF,QuantileVectorizer and CountVectorizer

@mengxr
Copy link
Contributor

mengxr commented Jun 18, 2016

add to whitelist

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you limit the line width to 100?

@SparkQA
Copy link

SparkQA commented Jun 18, 2016

Test build #60752 has finished for PR 13745 at commit 01e4a08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 18, 2016

Test build #60758 has finished for PR 13745 at commit 3b01f11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space missing: .Then -> . Then

@MLnick
Copy link
Contributor

MLnick commented Jun 20, 2016

@GayathriMurali could you update the title to be SPARK-15997 as opposed to Spark 15997

@MLnick
Copy link
Contributor

MLnick commented Jun 20, 2016

@jkbradley we could force a single partition in the data with repartition in the example code (per your comment #13176 (comment)). Perhaps we can hide that from the doc with example off? It is not ideal to have that in the example code either really, but I agree it's better than relativeError(0).

The only other solution I can think of off-hand is to change the example input data to be large enough that it shouldn't matter about partitioning.

@GayathriMurali GayathriMurali changed the title [Spark 15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer [Spark-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer Jun 21, 2016
@GayathriMurali
Copy link
Contributor Author

@jkbradley @MLnick I agree with repartition idea. Although I think that it may not be a bad idea to call out that approxquantile calcultion for smaller datasets may be different on different machines depending on underlying cores available and leave the example and code as is. Please let me know whats best and I can change the documentation accordingly.

@jkbradley
Copy link
Member

+1 for hiding repartition using example off/on
I'd add a small comment next to the repartition to make it clear why it is there.

@GayathriMurali
Copy link
Contributor Author

@jkbradley @MLnick repartition needs to be added along with the creation of the dataframe like this.
val df = spark.createDataFrame(data).toDF("id","hour").repartition(1) since df is of type val. we cannot hide this statement. I could convert df to mutable object, but that would seem inconsistent. Am i missing something here?

@jkbradley
Copy link
Member

Does this work?

    val df = spark.createDataFrame(data).toDF("id", "hour")
    // $example off$
      .repartition(1)
    // $example on$

@GayathriMurali
Copy link
Contributor Author

GayathriMurali commented Jun 23, 2016

Oops! That works.

@SparkQA
Copy link

SparkQA commented Jun 23, 2016

Test build #61092 has finished for PR 13745 at commit 1074351.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 23, 2016

Test build #61094 has finished for PR 13745 at commit bbc0868.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@MLnick MLnick Jun 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check this works? I think it will throw SyntaxError. You may need to do dataFrame = dataFrame.repartition(1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can we just call it df to match the other examples (i just noticed that)

Copy link
Contributor Author

@GayathriMurali GayathriMurali Jun 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MLnick I ran all unit tests and also tested them manually. It works fine. But I guess, writing it df = df.repartition(1) in the next line makes it look better. I will modify that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I get:

./bin/spark-submit ~/workspace/quantile_discretizer_example.py
  File "/Users/nick/workspace/quantile_discretizer_example.py", line 18
    .repartition(1)
    ^
SyntaxError: invalid syntax

@MLnick
Copy link
Contributor

MLnick commented Jun 23, 2016

@GayathriMurali couple final comments, then I think it's good to go. Thanks!

@SparkQA
Copy link

SparkQA commented Jun 23, 2016

Test build #61115 has finished for PR 13745 at commit 2225c0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Jun 24, 2016

LGTM. Merged to master/branch-2.0. Thanks!

asfgit pushed a commit that referenced this pull request Jun 24, 2016
…rizer and CountVectorizer

## What changes were proposed in this pull request?

Made changes to HashingTF,QuantileVectorizer and CountVectorizer

Author: GayathriMurali <gayathri.m@intel.com>

Closes #13745 from GayathriMurali/SPARK-15997.

(cherry picked from commit be88383)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
@asfgit asfgit closed this in be88383 Jun 24, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants