[SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast #6466
Conversation
The original rationale for includeFirst=true was to be consistent with scikit-learn. You think it's more important to be consistent with those tutorials?
To clarify, I mean that scikit-learn uses a component in the vector for every category. No preference on includeFirst vs. includeLast.
Test build #33672 has finished for PR 6466 at commit
@sryza I think keeping all categories by default is a bad practice. It throws in an extra column that carries no extra info and causes numerical problems. I checked several tutorials online, and it seems that only sklearn keeps all categories by default.
Ok, that's fine with me. Can we include a note in the doc that mentions we differ from sklearn?
Good idea! Added a sentence in the doc.
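The dropLast semantics settled on above can be shown with a minimal Python sketch (not Spark code; the function name is illustrative). With `drop_last=True`, the last category maps to the all-zeros vector, so no redundant column is produced and the remaining indices stay unshifted:

```python
def one_hot(index, size, drop_last=True):
    """Encode a 0-based category index as a one-hot list.

    With drop_last=True, the output has size - 1 components and the
    last category is represented implicitly by the all-zeros vector.
    """
    length = size - 1 if drop_last else size
    vec = [0.0] * length
    if index < length:
        vec[index] = 1.0
    return vec

# With 3 categories: category 0 -> [1.0, 0.0], category 2 -> [0.0, 0.0]
```

Dropping the last category (rather than the first, as some texts do) is what keeps property (b) from the PR description: indices 0..size-2 map directly to vector components.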
Test build #33680 has finished for PR 6466 at commit
if (outputAttrGroup.size < 0) {
  // If the number of attributes is unknown, we check the values from the input column.
  val numAttrs = dataset.select(col(inputColName).cast(DoubleType)).map(_.getDouble(0))
    .aggregate(0.0)(
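What this extra pass computes can be sketched in plain Python (illustrative, not the Spark implementation): fold over the column with max, validate that each value is a non-negative integer index, and add one to get the cardinality.

```python
def infer_num_categories(values):
    """Infer the number of categories from raw 0-based index values
    by folding with max, as the pass above does when attributes
    are unknown."""
    max_index = 0.0
    for v in values:
        if v < 0 or v != int(v):
            raise ValueError(f"Category index must be a non-negative integer, got {v}")
        max_index = max(max_index, v)
    return int(max_index) + 1
```

This is the per-column job the reviewers are discussing: one full scan of the data for every input column whose attribute metadata is missing.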
So this means that if we have a bunch of input columns with unknown attributes that we want to encode as categorical, we'll need to run a job for each one to determine its cardinality. Thousands of categorical columns is probably not unreasonable for some workloads we'd like to target. Any thoughts on how we might be able to smush these into a single job?
We could add a parameter to specify the storage level of input to control whether the transformer should cache the input or not. If there are hundreds of categorical columns, VectorIndexer may be a better fit.
It would make sense to vectorize a lot of these feature transformers. We should discuss how best to add this support in future releases. (My initial thought is to have each feature transformer accept single scalar columns, Vector columns, and sequences of scalar columns. Internally, we could have a single implementation using Vector columns.)
Relatedly, I'm not sure how many columns Spark SQL has been tested with. Keeping data in Vectors might be necessary for many ML datasets.
+1 on making OneHotEncoder support vector input.
Even if we cache the data, thousands of passes are still pretty suboptimal.
Regarding @jkbradley 's suggestion, those seem like good ideas. I don't think I have a comprehensive enough understanding of the spark.ml APIs to understand whether anything done here makes it hard to add those in the future. But assuming it doesn't, this patch LGTM.
If we first group all categorical columns into a single vector column, then we only need a single pass to generate nominal attributes, and then the OneHotEncoder can encode this vector column in a single pass. This should be easy to add later. We can accept both double and vector columns instead of only double.
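The single-pass idea above can be sketched in Python (illustrative only; names are not Spark APIs): one scan computes every column's max index at once, and a second pass encodes each row by concatenating per-column one-hot blocks.

```python
def single_pass_cardinalities(rows):
    """One scan over rows of category indices: track each column's max
    index simultaneously, then convert maxes to cardinalities."""
    maxes = None
    for row in rows:
        maxes = list(row) if maxes is None else [max(m, v) for m, v in zip(maxes, row)]
    return [int(m) + 1 for m in maxes]

def encode_row(row, sizes, drop_last=True):
    """Concatenate the one-hot encodings of every column in the row."""
    out = []
    for index, size in zip(row, sizes):
        length = size - 1 if drop_last else size
        block = [0.0] * length
        if int(index) < length:
            block[int(index)] = 1.0
        out.extend(block)
    return out
```

Grouping columns this way replaces N cardinality jobs with one, which is the cost the reviewers above were worried about.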
Test build #33714 has finished for PR 6466 at commit
[SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast

This PR contains two major changes to `OneHotEncoder`:

1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index
2. change `includeFirst` to `dropLast` and leave the default to `true`. There are couple benefits:
   a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
   b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
   c. If users use `StringIndex`, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR. jkbradley sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:

a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast

(cherry picked from commit 23452be)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Merged into master and branch-1.4.
This PR contains two major changes to `OneHotEncoder`:

1. More robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index.
2. Change `includeFirst` to `dropLast` and leave the default as `true`. There are a couple of benefits:
   a. Consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm).
   b. The indices in the output vector stay unmodified. If we dropped the first category instead, all indices would be shifted by 1.
   c. If users use `StringIndexer`, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR.

@jkbradley @sryza