[Spark-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer#13745
GayathriMurali wants to merge 15 commits into apache:master
Conversation
|
add to whitelist |
docs/ml-features.md
Outdated
Could you limit the line width to 100?
|
Test build #60752 has finished for PR 13745 at commit
|
|
Test build #60758 has finished for PR 13745 at commit
|
docs/ml-features.md
Outdated
Space missing: .Then -> . Then
|
@GayathriMurali could you update the title to be |
|
@jkbradley we could force a single partition in the data with The only other solution I can think of off-hand is to change the example input data to be large enough that it shouldn't matter about partitioning. |
|
@jkbradley @MLnick I agree with the repartition idea. That said, it may not be a bad idea to call out that the approxQuantile calculation for smaller datasets may differ between machines, depending on the number of underlying cores available, and leave the example and code as is. Please let me know what's best and I can change the documentation accordingly.
|
+1 for hiding repartition using example off/on |
|
@jkbradley @MLnick |
|
Does this work? |
|
Oops! That works. |
|
Test build #61092 has finished for PR 13745 at commit
|
|
Test build #61094 has finished for PR 13745 at commit
|
Did you check this works? I think it will throw SyntaxError. You may need to do dataFrame = dataFrame.repartition(1)
Also, can we just call it df to match the other examples? (I just noticed that.)
@MLnick I ran all the unit tests and also tested them manually. It works fine. But I guess writing df = df.repartition(1) on the next line makes it look better. I will modify that.
This is what I get:
./bin/spark-submit ~/workspace/quantile_discretizer_example.py
File "/Users/nick/workspace/quantile_discretizer_example.py", line 18
.repartition(1)
^
SyntaxError: invalid syntax
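The error above can be reproduced without Spark. The following is a minimal sketch, not the code from the PR: `FakeFrame` and `make_frame` are hypothetical stand-ins for a DataFrame and for whatever call creates it, and only the shape of the method chain matters. It checks that a continuation line starting with `.repartition(1)` does not parse, while the two usual fixes (reassigning on the next line, or wrapping the chain in parentheses) do.

```python
# Hypothetical stand-in for a DataFrame; only the chaining shape matters here.
class FakeFrame:
    def __init__(self, num_partitions=4):
        self.num_partitions = num_partitions

    def repartition(self, n):
        # Return a new frame with the requested number of partitions.
        return FakeFrame(n)


def make_frame():
    # Hypothetical placeholder for the call that builds the example data.
    return FakeFrame()


def compiles(source):
    """Return True if `source` parses as a Python module, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# A line that starts with "." begins a new statement, so this does not parse.
broken = "df = make_frame()\n.repartition(1)\n"

# Fix 1: reassign on the following line.
reassigned = "df = make_frame()\ndf = df.repartition(1)\n"

# Fix 2: wrap the chain in parentheses so it is a single expression.
parenthesized = "df = (make_frame()\n      .repartition(1))\n"

for name, source in [
    ("broken", broken),
    ("reassigned", reassigned),
    ("parenthesized", parenthesized),
]:
    print(name, compiles(source))

# Run the reassigned form against the stand-in to confirm it behaves as intended.
namespace = {"make_frame": make_frame}
exec(reassigned, namespace)
print(namespace["df"].num_partitions)
```

The reassignment form is the one suggested in the review; the parenthesized form is an alternative that keeps the chain in one statement.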
|
@GayathriMurali couple final comments, then I think it's good to go. Thanks! |
bbc0868 to 2225c0a
|
Test build #61115 has finished for PR 13745 at commit
|
|
LGTM. Merged to master/branch-2.0. Thanks! |
…rizer and CountVectorizer

## What changes were proposed in this pull request?

Made changes to HashingTF, QuantileVectorizer and CountVectorizer

Author: GayathriMurali <gayathri.m@intel.com>

Closes #13745 from GayathriMurali/SPARK-15997.

(cherry picked from commit be88383)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
What changes were proposed in this pull request?
Made changes to HashingTF, QuantileVectorizer and CountVectorizer