[SPARK-15957] [ML] RFormula supports forcing to index label #13675
yanboliang wants to merge 5 commits into apache:master
Test build #60538 has finished for PR 13675 at commit
Test build #60598 has finished for PR 13675 at commit
Rename to "forceIndexLabel" since we can still index String labels when indexLabel=false
Test build #3301 has finished for PR 13675 at commit
Force-pushed from a7015ef to 3a8660f
Test build #66661 has finished for PR 13675 at commit
Test build #66662 has finished for PR 13675 at commit
 * Usually we index label only when it is string type.
 * If the formula was used by classification algorithms,
 * we can force to index label even it is numeric type by setting this param with true.
 * Default: false.
I just noticed that last item, but otherwise, this looks ready to me. Thanks!
Test build #66693 has finished for PR 13675 at commit
Does this affect R code? Could we add some R tests for this?
@felixcheung This PR does not affect R code. I will send another PR to fix issues like SPARK-15153, which will need some R tests.
I'll merge this into master, thanks for the review! @jkbradley @felixcheung
…ceIndexLabel.

## What changes were proposed in this pull request?
Follow-up work of apache#13675: add Python API for `RFormula forceIndexLabel`.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#15430 from yanboliang/spark-15957-python.
## What changes were proposed in this pull request?
Currently, `RFormula` indexes the label only when it is of string type. If the label is numeric and `RFormula` is used to describe a classification model, the label column metadata contains no label attributes. These attributes are useful when making predictions for classification, so for classification we can force the label to be indexed by `StringIndexer` whether it is numeric or string. SparkR wrappers can then extract the label attributes from the label column metadata. This feature helps fix bugs similar to [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153). For regression, the label is still kept as a numeric type.

In this PR, we add a param `indexLabel` to control whether to force indexing of the label in `RFormula`.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#13675 from yanboliang/spark-15957.
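
The label-handling rule described in this PR can be illustrated with a small plain-Python model. This is only a sketch of the behavior, not Spark code: the function `index_labels` and its parameter `force_index_label` are hypothetical names chosen for this example (the param is called `indexLabel` in the description and `forceIndexLabel` in the review and the follow-up PR). The ordering rule (most frequent value gets index 0) is assumed to mirror `StringIndexer`'s default frequency-descending ordering, and the alphabetical tie-break is an assumption made here only to keep the result deterministic.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple, Union

Label = Union[str, float, int]


def index_labels(
    labels: Sequence[Label], force_index_label: bool = False
) -> Tuple[List[float], Optional[List[str]]]:
    """Hypothetical model of the label handling described in the PR.

    Returns (output_labels, label_attributes). label_attributes is None when
    the label is passed through unchanged, which stands in for the
    "no label attributes in the column metadata" case.
    """
    is_string = all(isinstance(v, str) for v in labels)
    if not (is_string or force_index_label):
        # Numeric label and indexing not forced: keep the values as-is
        # (the regression case), so no label attributes are produced.
        return [float(v) for v in labels], None

    # Assumed ordering: most frequent value first; ties are broken by the
    # string form of the value so this sketch is deterministic.
    counts = Counter(str(v) for v in labels)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    position = {value: float(i) for i, value in enumerate(ordered)}
    return [position[str(v)] for v in labels], ordered
```

In this model, a numeric label column such as `[10, 20, 20, 30]` passes through unchanged by default, while calling `index_labels([10, 20, 20, 30], force_index_label=True)` yields indexed labels together with the list of original values, which is the information the SparkR wrappers are said to need from the label metadata.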