Commit

Update user guide for RFormula feature interactions
yanboliang committed Dec 9, 2015
1 parent f6883bb commit b08e063
Showing 1 changed file with 19 additions and 1 deletion.
20 changes: 19 additions & 1 deletion docs/ml-features.md
@@ -2036,7 +2036,25 @@ System.out.println(output.select("userFeatures", "features").first());

## RFormula

`RFormula` selects columns specified by an [R model formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html). It produces a vector column of features and a double column of labels. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If not already present in the DataFrame, the output label column will be created from the specified response variable in the formula.
`RFormula` selects columns specified by an [R model formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html).
Currently we support a limited subset of the R operators, including '~', '.', ':', '+', and '-'.
The basic operators are:

* `~` separates the target from the terms
* `+` concatenates terms; `+ 0` removes the intercept
* `-` removes a term; `- 1` removes the intercept
* `:` denotes an interaction (multiplication for numeric values or binarized categorical values)
* `.` denotes all columns except the target

Suppose `a` and `b` are double columns. The following simple examples illustrate the effect of `RFormula`:

* `y ~ a + b` means the model `y = w0 + w1 * a + w2 * b`, where `w0` is the intercept and `w1, w2` are coefficients
* `y ~ a + b + a:b - 1` means the model `y = w1 * a + w2 * b + w3 * a * b`, where `w1, w2, w3` are coefficients; in R this is the same as `y ~ (a + b)^2 + 0`
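
To make the two formulas above concrete, here is a minimal plain-Python sketch (not Spark code) of the feature values each one corresponds to for a single row of numeric columns. The helper name `expand_terms` and the sample row are hypothetical, introduced only for illustration. The sketch handles numeric columns and `:` interactions only, and it assumes the intercept is left to the downstream model rather than added as a feature.

```python
from math import prod


def expand_terms(row, terms):
    """Build a feature list for one row from formula terms.

    Each term is either a column name ("a") or an interaction ("a:b").
    For numeric columns an interaction is the product of the named columns.
    Hypothetical helper for illustration; not part of Spark.
    """
    features = []
    for term in terms:
        columns = term.split(":")
        features.append(prod(float(row[c]) for c in columns))
    return features


row = {"a": 2.0, "b": 3.0}

# y ~ a + b            ->  features [a, b]
main_effects = expand_terms(row, ["a", "b"])

# y ~ a + b + a:b - 1  ->  features [a, b, a * b], fitted without an intercept
with_interaction = expand_terms(row, ["a", "b", "a:b"])

print(main_effects)      # [2.0, 3.0]
print(with_interaction)  # [2.0, 3.0, 6.0]
```

The `- 1` in the second formula does not change the feature values; it only tells the model not to fit the constant term `w0`.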

`RFormula` produces a vector column of features and a double or string column of labels.
As when formulas are used in R for linear regression, string input columns are one-hot encoded and numeric columns are cast to doubles.
If the label column is of string type, it is first transformed to a double label with `StringIndexer`.
If the label column is not already present in the DataFrame, the output label column is created from the response variable specified in the formula.
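
The two string transformations described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark's implementation: `index_labels` and `one_hot` are hypothetical helpers, label indices are assigned by descending frequency (the default ordering of `StringIndexer`), and the one-hot sketch emits one indicator per category, whereas a real encoder may drop one category as a reference level.

```python
from collections import Counter


def index_labels(values):
    """Map string labels to doubles; the most frequent label gets 0.0.

    Hypothetical helper; ties are broken alphabetically here only to
    keep the sketch deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {v: float(i) for i, v in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


def one_hot(values):
    """Encode each string as a list of 0.0/1.0 indicators.

    Hypothetical helper; categories are ordered alphabetically and
    none is dropped, which is an assumption of this sketch.
    """
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


labels = ["yes", "no", "yes", "yes"]
indexed, mapping = index_labels(labels)

countries = ["US", "CA", "NZ", "US"]
encoded = one_hot(countries)

print(indexed)     # [0.0, 1.0, 0.0, 0.0]
print(encoded[0])  # [0.0, 0.0, 1.0]  (categories ordered CA, NZ, US)
```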

**Examples**
