[SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast #6466

mengxr · 2015-05-28T19:00:48Z

This PR contains two major changes to `OneHotEncoder`:

1. More robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index.
2. Change `includeFirst` to `dropLast` and keep the default as `true`. There are a couple of benefits:

   a. It is consistent with other tutorials on one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm).
   b. It keeps the indices unmodified in the output vector. If we dropped the first, all indices would be shifted by 1.
   c. If users use `StringIndexer`, the last element is the least frequent one.
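
To make the effect of `dropLast` concrete, here is a minimal plain-Python sketch. It is not Spark's implementation, and the function name `one_hot` and its signature are made up for illustration. It shows that dropping the last category leaves every other index where it was, and that the last category is encoded as an all-zeros vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a dense list of 0.0/1.0 values.

    Illustrative only: a hypothetical helper, not the Spark API.
    """
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} out of range for {num_categories} categories")
    # With drop_last, the vector has one fewer component.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    # The dropped (last) category has no component, so it stays all zeros.
    if index < size:
        vec[index] = 1.0
    return vec


first = one_hot(0, 3)                       # [1.0, 0.0]
last = one_hot(2, 3)                        # [0.0, 0.0]
last_kept = one_hot(2, 3, drop_last=False)  # [0.0, 0.0, 1.0]
```

Category 0 still lands in component 0, which is the "indices unmodified" point in (b); dropping the first category instead would move category 1 into component 0.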

Sorry for including two changes in one PR! I'll update the user guide in another PR.

sryza · 2015-05-28T19:15:45Z

The original rationale for includeFirst=true was to be consistent with scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

Do you think it's more important to be consistent with those tutorials?

sryza · 2015-05-28T19:18:46Z

To clarify, I mean that scikit-learn uses a component in the vector for every category. No preference on includeFirst vs. includeLast.

SparkQA · 2015-05-28T20:20:01Z

Test build #33672 has finished for PR 6466 at commit 00dfd96.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class OneHotEncoder(override val uid: String) extends Transformer

mengxr · 2015-05-28T20:26:50Z

@sryza I think keeping all categories by default is a bad practice. It throws in an extra column that carries no extra info and causes numerical problems. I checked several tutorials online and it seems that only sklearn keeps all by default.
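
One way to see the redundancy: when every category gets its own component, each encoded row sums to exactly 1, so the encoded columns add up to the constant (intercept) column and a linear model's design matrix becomes rank-deficient. The plain-Python sketch below checks the row sums with and without the last category. It is illustrative only; `encode` is a hypothetical helper, not a Spark API.

```python
def encode(index, num_categories, drop_last):
    """Hypothetical dense one-hot helper used only for this illustration."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]


indices = [0, 1, 2, 1, 0]  # category indices for five example rows

full = [encode(i, 3, drop_last=False) for i in indices]
dropped = [encode(i, 3, drop_last=True) for i in indices]

# Keeping all categories: every row sums to 1, duplicating an intercept column.
full_row_sums = [sum(row) for row in full]
# Dropping the last category: rows of the dropped category sum to 0.
dropped_row_sums = [sum(row) for row in dropped]
```

Here `full_row_sums` is all ones, while `dropped_row_sums` has a zero for the row in category 2, so the exact linear dependence with the intercept goes away.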

sryza · 2015-05-28T20:42:46Z

Ok, that's fine with me. Can we include a note in the doc that mentions we differ from sklearn?

mengxr · 2015-05-28T21:17:47Z

Good idea! Added a sentence in the doc.

SparkQA · 2015-05-28T22:38:12Z

Test build #33680 has finished for PR 6466 at commit 171b276.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class OneHotEncoder(override val uid: String) extends Transformer

sryza · 2015-05-28T23:00:04Z

mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala

+    if (outputAttrGroup.size < 0) {
+      // If the number of attributes is unknown, we check the values from the input column.
+      val numAttrs = dataset.select(col(inputColName).cast(DoubleType)).map(_.getDouble(0))
+        .aggregate(0.0)(


So this means that if we have a bunch of input columns with unknown attributes that we want to encode as categorical, we'll need to run a job for each one to determine its cardinality. Thousands of categorical columns is probably not unreasonable for some workloads we'd like to target. Any thoughts on how we might be able to smush these into a single job?

mengxr · 2015-05-29

We could add a parameter that specifies the storage level of the input, to control whether the transformer should cache it. If there are hundreds of categorical columns, VectorIndexer may be a better fit.

jkbradley · 2015-05-29

It would make sense to vectorize a lot of these feature transformers. We should discuss how best to add this support in future releases. (My initial thought is to have each feature transformer accept single scalar columns, Vector columns, and sequences of scalar columns. Internally, we could have a single implementation using Vector columns.)

jkbradley · 2015-05-29

Relatedly, I'm not sure how many columns Spark SQL has been tested with. Keeping data in Vectors might be necessary for many ML datasets.

mengxr · 2015-05-29

+1 on making OneHotEncoder support vector input.

sryza · 2015-05-29

Even if we cache the data, thousands of passes is still pretty suboptimal.

Regarding @jkbradley's suggestion, those seem like good ideas. I don't think I have a comprehensive enough understanding of the spark.ml APIs to know whether anything done here makes it hard to add those in the future. But assuming it doesn't, this patch LGTM.

mengxr · 2015-05-29

If we first group all categorical columns into a single vector column, then we only need a single pass to generate nominal attributes, and then the OneHotEncoder can encode this vector column in a single pass. This should be easy to add later. We can accept both double and vector columns instead of only double.
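
The single-pass idea can be sketched outside Spark as a fold over rows that tracks one running maximum per column, mirroring the seqOp/combOp shape of an `aggregate` call. This is a plain-Python illustration under assumed names (`seq_op`, `comb_op`, `category_counts`), with partitions modeled as lists of rows; it is not the patch's code.

```python
from functools import reduce


def seq_op(maxes, row):
    """Fold one row of category indices into the per-column running maxima."""
    for value in row:
        if value < 0 or value != int(value):
            raise ValueError(f"expected a non-negative integer index, got {value}")
    return [max(m, v) for m, v in zip(maxes, row)]


def comb_op(left, right):
    """Merge the per-column maxima computed on two partitions."""
    return [max(a, b) for a, b in zip(left, right)]


def category_counts(partitions, num_cols):
    """Return the number of categories per column using one pass over the rows."""
    zero = [0.0] * num_cols  # assumption: indices start at 0
    partials = [reduce(seq_op, part, zero) for part in partitions]
    maxes = reduce(comb_op, partials, zero)
    # The number of categories is the max index plus one.
    return [int(m) + 1 for m in maxes]


# Two "partitions" of rows, each row holding two categorical columns.
partitions = [[(0.0, 2.0), (1.0, 0.0)], [(3.0, 1.0)]]
counts = category_counts(partitions, 2)  # [4, 3]
```

Every row is visited once regardless of the number of columns, which is the difference from running one job per column.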

SparkQA · 2015-05-29T07:14:37Z

Test build #33714 has finished for PR 6466 at commit a280dca.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…ttributes and change includeFirst to dropLast

This PR contains two major changes to `OneHotEncoder`:

1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index
2. change `includeFirst` to `dropLast` and leave the default to `true`. There are a couple of benefits:
   a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
   b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
   c. If users use `StringIndexer`, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR.

jkbradley sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:

a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast

(cherry picked from commit 23452be)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

mengxr · 2015-05-29T07:51:56Z

Merged into master and branch-1.4.

mengxr · 2015-05-29T07:55:57Z

Merged into master and branch-1.4.

mengxr added 2 commits May 28, 2015 10:53

update OneHotEncoder to handle ML attributes and change includeFirst to dropLast

208ddad

update OneHotEncoder in Python

00dfd96

mengxr force-pushed the SPARK-7912 branch from d5ac64b to 00dfd96 Compare May 28, 2015 19:01

mengxr changed the title ~~[SPARK-7912] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast~~ [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast May 28, 2015

mention the difference between our impl vs sklearn's

171b276

sryza reviewed May 28, 2015

mengxr added 2 commits May 28, 2015 21:24

Merge remote-tracking branch 'apache/master' into SPARK-7912

d8f234d

fix tests

a280dca

asfgit closed this in 23452be May 29, 2015
