[SPARK-11965] [ML] [Doc] Update user guide for RFormula feature interactions #10222
Test build #47420 has finished for PR 10222 at commit
Test build #47520 has finished for PR 10222 at commit
Jenkins, test this please.
Test build #47562 has finished for PR 10222 at commit
`RFormula` produces a vector column of features and a double or string column of label.
Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles.
If the label column is string type, it will be first transformed to double label with `StringIndexer`.
If the label does not already present in the DataFrame, the output label column will be created from the specified response variable in the formula.
"If the label is not"
A few remarks regarding phrasing but otherwise it lgtm.
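For reference, the behavior described in the hunk above (one-hot encoding of string inputs, indexing of a string label) could be exercised with something like the sketch below. The column names and values are made up for illustration, and it assumes a Spark version that provides `SparkSession` (2.0 or later), which is newer than the release this PR targets.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.{DataFrame, SparkSession}

object RFormulaSketch {
  // Builds a tiny DataFrame and applies RFormula to it.
  // All column names and values here are illustrative assumptions.
  def transformExample(spark: SparkSession): DataFrame = {
    val dataset = spark.createDataFrame(Seq(
      (7, "US", 18, "yes"),
      (8, "CA", 12, "no"),
      (9, "NZ", 15, "no")
    )).toDF("id", "country", "hour", "clicked")

    // `country` is a string input column and `clicked` is a string label column.
    val formula = new RFormula()
      .setFormula("clicked ~ country + hour")
      .setFeaturesCol("features")
      .setLabelCol("label")

    // fit() inspects the data (e.g. the string categories); transform() appends
    // the `features` and `label` columns to the input.
    formula.fit(dataset).transform(dataset)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("RFormulaSketch")
      .getOrCreate()
    try {
      transformExample(spark).select("features", "label").show(truncate = false)
    } finally {
      spark.stop()
    }
  }
}
```

Since no `label` column exists in the input here, the output label column is created from the response variable `clicked`.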
Test build #47799 has finished for PR 10222 at commit
Suppose `a` and `b` are double columns, we use the following simple examples to illustrate the effect of `RFormula`:

* `y ~ a + b` means model `y = w0 + w1 * a + w2 * b` where `w0` is the intercept and `w1, w2` are coefficients
`=` -> `~`, because the model family is not yet determined
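To make the point concrete: the formula only fixes which terms enter the feature vector, and what happens to the weighted sum depends on the estimator that consumes it. Below is a minimal plain-Scala sketch of that split (no Spark involved). The object and method names are hypothetical, and the `a:b` case assumes the interaction of two numeric columns is their product.

```scala
object FormulaTermsSketch {
  // Features for the terms `a + b`: numeric columns pass through unchanged.
  def mainEffects(a: Double, b: Double): Array[Double] = Array(a, b)

  // Features for `a + b + a:b`, assuming a numeric interaction is the product.
  def withInteraction(a: Double, b: Double): Array[Double] = Array(a, b, a * b)

  // The weighted sum w0 + sum(w_i * x_i). Whether the response equals this
  // value directly (linear regression) or some function of it is decided by
  // the model family, not by the formula.
  def linearPredictor(w0: Double, w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "one weight per feature")
    w0 + w.zip(x).map { case (wi, xi) => wi * xi }.sum
  }

  def main(args: Array[String]): Unit = {
    val x = withInteraction(2.0, 3.0)
    println(x.mkString("[", ", ", "]"))
    println(linearPredictor(1.0, Array(0.5, 0.25, 0.1), x))
  }
}
```

Writing `y ~ w0 + w1 * a + w2 * b` in the guide keeps that distinction visible.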
Made some minor comments inline. We should consider copying the content to the API doc to keep the API doc and user guide in sync. It would be nice if we could define a process for updating the docs. My recommendation is to keep the Scala API doc up to date and then derive the user guide and the Python/R API docs from it.
Test build #49786 has finished for PR 10222 at commit
Test build #49787 has finished for PR 10222 at commit
Test build #49790 has finished for PR 10222 at commit
*
* The basic operators are:
*
* `~` separate target and terms
This is not the correct ScalaDoc syntax for a list. See https://wiki.scala-lang.org/display/SW/Syntax. You can run `build/sbt mllib/doc` to check the generated HTML API doc.
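As a rough illustration of the list syntax being asked for: ScalaDoc treats lines starting with `-` as unordered list items when the marker is indented past the comment margin. The class name below is a hypothetical placeholder, and the `+` entry is included only so the list has more than one item; the exact wording is for the PR author to decide.

```scala
/**
 * Hypothetical placeholder used only to show the doc-comment layout.
 *
 * The basic operators are:
 *  - `~` separate target and terms
 *  - `+` concat terms
 */
class FormulaDocSketch
```

Regenerating the API doc is the reliable way to confirm the list renders as intended.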
LGTM except one minor issue on ScalaDoc syntax.
Test build #49850 has finished for PR 10222 at commit
Merged into master. Thanks!
Update user guide for RFormula feature interactions. This also documents other new features, such as string label support in Spark 1.6.