[SPARK-12632][PYSPARK][DOC] PySpark fpm and als parameter desc to consistent format #11186
Conversation
…ark MLlib FPM and Recommendation]
Test build #51195 has finished for PR 11186 at commit
itemsets.
:param minSupport:
  The minimal support level of the sequential pattern, any
  pattern appears more than (minSupport * size-of-the-dataset)
Can we change this from "appears" -> "appearing" (or "... pattern that appears ...")?
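Applied to the PySpark docstring, the revised description might read roughly as follows. This is only a sketch of the suggested wording; the closing "times will be output." line is an assumption, since the diff excerpt above is truncated:

  :param minSupport:
    The minimal support level of the sequential pattern, any
    pattern that appears more than (minSupport * size-of-the-dataset)
    times will be output.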
@BryanCutler made a quick pass. While we're doing the format change, we may as well make a few little doc clean ups as per my comments.
jenkins retest please
Jenkins retest this please
Test build #51401 has finished for PR 11186 at commit
*
* @param ratings RDD of (userID, productID, rating) pairs
* @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
* @param iterations number of iterations of ALS (recommended: 10-20)
* @param lambda regularization factor (recommended: 0.01)
I'd prefer "regularization parameter" rather than "factor". Could we also change the PySpark doc from "smoothing parameter" -> "regularization parameter", to be consistent?
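On the PySpark side, that rename might end up looking something like the fragment below. This is a sketch, not an exact excerpt from mllib/recommendation.py; the "(default: 0.01)" line reflects the default mentioned later in this thread:

  :param lambda_:
    Regularization parameter.
    (default: 0.01)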
While we're at it, I don't like the "(recommended: 0.01)" in the comment for lambda. This is the default for the method call without specifying lambda, but I certainly don't think this is the recommended regularization. Lambda can vary widely depending on the dataset characteristics, and should be selected in the usual manner (e.g. through cross-validation).
Good suggestions, I agree with what you said. The recommendation for the number of iterations is also a bit strange. I can see giving a ballpark number so people don't use a huge value, since ALS can converge quickly and each iteration takes a bit longer relative to other algorithms. But the PySpark default is 5, while the Scala doc recommends 10-20. Maybe it would be better to just put a sentence about this in the online docs and remove the recommendation here?
Agreed, let's remove the recommended part of the iterations doc, and the doc here can be amended accordingly to something like: "iterations is the number of iterations to run. ALS typically converges to a reasonable solution within 10-20 iterations."
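With the recommendation dropped and the 10-20 iterations guidance moved to the online ALS documentation, the PySpark description could read roughly like the sketch below (the exact wording is an assumption; the "(default: 5)" line matches the PySpark default mentioned above):

  :param iterations:
    Number of iterations of ALS.
    (default: 5)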
…ed iteration to online docs
Test build #51572 has finished for PR 11186 at commit
LGTM, merged into master. Thanks @BryanCutler!
Part of task for SPARK-11219 to make PySpark MLlib parameter description formatting consistent. This is for the fpm and recommendation modules.
Closes #10602
Closes #10897
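For context, the consistent format targeted by SPARK-11219 puts each parameter description on indented lines under its ":param name:" tag, with the default value on its own line. A minimal, purely illustrative sketch (the method and parameter names here are hypothetical, not taken from the Spark source):

  def train(cls, data, exampleParam=1.0):
      """
      Train a model on the given RDD.

      :param data:
        The input RDD of training records.
      :param exampleParam:
        Description of the parameter, wrapped onto indented lines
        under the parameter name rather than on the same line.
        (default: 1.0)
      """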