From b08e0636f9fa3c260b1479156838bd65587434af Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Wed, 9 Dec 2015 16:26:33 +0800
Subject: [PATCH] Update user guide for RFormula feature interactions

---
 docs/ml-features.md | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/docs/ml-features.md b/docs/ml-features.md
index 55e401221917e..fc247ce48fbfc 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -2036,7 +2036,25 @@ System.out.println(output.select("userFeatures", "features").first());
 
 ## RFormula
 
-`RFormula` selects columns specified by an [R model formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html). It produces a vector column of features and a double column of labels. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If not already present in the DataFrame, the output label column will be created from the specified response variable in the formula.
+`RFormula` selects columns specified by an [R model formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html).
+Currently we support a limited subset of the R operators, including '~', '.', ':', '+', and '-'.
+The basic operators are:
+
+* `~` separates the target from the terms
+* `+` concatenates terms; `+ 0` removes the intercept
+* `-` removes a term; `- 1` removes the intercept
+* `:` denotes interaction (multiplication for numeric values, or binarized categorical values)
+* `.` selects all columns except the target
+
+Suppose `a` and `b` are double columns. The following simple examples illustrate the effect of `RFormula`:
+
+* `y ~ a + b` means the model `y = w0 + w1 * a + w2 * b`, where `w0` is the intercept and `w1, w2` are coefficients
+* `y ~ a + b + a:b - 1` means the model `y = w1 * a + w2 * b + w3 * a * b`, where `w1, w2, w3` are coefficients; in R notation this is the same as `y ~ (a + b)^2 + 0`
+
+`RFormula` produces a vector column of features and a double or string column of labels.
+As when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles.
+If the label column is of string type, it will first be transformed to a double label with `StringIndexer`.
+If the label column is not already present in the DataFrame, the output label column will be created from the specified response variable in the formula.
 
 **Examples**
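
The arithmetic behind the two example formulas (`y ~ a + b` and `y ~ a + b + a:b - 1`) can be sketched in plain Python. This is only an illustration of how the terms expand for numeric columns, not Spark's `RFormula` implementation; `expand_terms`, `predict`, and the weight values are hypothetical names and numbers chosen for the example.

```python
# Plain-Python illustration of the two example formulas; this is not Spark code.
# All names and weights below are assumptions made for the sketch.

def expand_terms(a, b, terms):
    """Return the numeric feature values for the given formula terms.

    `terms` uses the formula's own spelling: "a", "b", or "a:b".
    For numeric columns, the interaction "a:b" is the product a * b.
    """
    values = {"a": float(a), "b": float(b)}
    features = []
    for term in terms:
        product = 1.0
        for name in term.split(":"):
            product *= values[name]
        features.append(product)
    return features


def predict(features, weights, intercept=0.0):
    """Evaluate y = intercept + sum(w_i * x_i)."""
    return intercept + sum(w * x for w, x in zip(weights, features))


# y ~ a + b  ->  y = w0 + w1 * a + w2 * b
x1 = expand_terms(2.0, 3.0, ["a", "b"])
y1 = predict(x1, weights=[0.5, 0.25], intercept=1.0)

# y ~ a + b + a:b - 1  ->  y = w1 * a + w2 * b + w3 * a * b (no intercept)
x2 = expand_terms(2.0, 3.0, ["a", "b", "a:b"])
y2 = predict(x2, weights=[0.5, 0.25, 0.125])

print(x1, y1)
print(x2, y2)
```

In the second formula the `- 1` drops the intercept, so `predict` is called with its default `intercept=0.0`, and the `a:b` term contributes one extra feature equal to the product of the two columns.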