Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast #6466

Closed
wants to merge 5 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented May 28, 2015

This PR contains two major changes to OneHotEncoder:

  1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index

  2. change includeFirst to dropLast and leave the default to true. There are couple benefits:

    a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
    b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
    c. If users use StringIndex, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR.

@jkbradley @sryza

@mengxr mengxr changed the title [SPARK-7912] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast May 28, 2015
@sryza
Copy link
Contributor

sryza commented May 28, 2015

The original rationale for includeFirst=true was to be consistent with scikit-learn
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.

You think it's more important to be consistent with those tutorials?

@sryza
Copy link
Contributor

sryza commented May 28, 2015

To clarify, I mean that scikit-learn uses a component in the vector for every category. No preference on includeFirst vs. includeLast.

@SparkQA
Copy link

SparkQA commented May 28, 2015

Test build #33672 has finished for PR 6466 at commit 00dfd96.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class OneHotEncoder(override val uid: String) extends Transformer

@mengxr
Copy link
Contributor Author

mengxr commented May 28, 2015

@sryza I think keeping all categories by default is a bad practice. It throws in an extra column that carries no extra info and causes numerical problems. I checked several tutorials online and it seems that only sklearn keeps all by default.

  1. http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm
  2. http://en.wikiversity.org/wiki/Dummy_variable_(statistics)
  3. http://dss.princeton.edu/online_help/analysis/dummy_variables.htm

@sryza
Copy link
Contributor

sryza commented May 28, 2015

Ok, that's fine with me. Can we include a note in the doc that mentions we differ from sklearn?

@mengxr
Copy link
Contributor Author

mengxr commented May 28, 2015

Good idea! Added a sentence in the doc.

@SparkQA
Copy link

SparkQA commented May 28, 2015

Test build #33680 has finished for PR 6466 at commit 171b276.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class OneHotEncoder(override val uid: String) extends Transformer

if (outputAttrGroup.size < 0) {
// If the number of attributes is unknown, we check the values from the input column.
val numAttrs = dataset.select(col(inputColName).cast(DoubleType)).map(_.getDouble(0))
.aggregate(0.0)(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this means that if we have a bunch of input columns with unknown attributes that we want to encode as categorical, we'll need to run a job for each one to determine its cardinality. Thousands of categorical columns is probably not unreasonable for some workloads we'd like to target. Any thoughts on how we might be able to smush these into a single job?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a parameter to specify the storage level of input to control whether the transformer should cache the input or not. If there are hundreds of categorical columns, VectorIndexer may be a better fit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make sense to vectorize a lot of these feature transformers. We should discuss how best to add this support in future releases. (My initial thought is to have each feature transformer accept single scalar columns, Vector columns, and sequences of scalar columns. Internally, we could have a single implementation using Vector columns.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relatedly, I'm not sure how many columns Spark SQL has been tested with. Keeping data in Vectors might be necessary for many ML datasets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on making OneHotEncoder support vector input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we cache the data, thousands passes is still pretty suboptimal.

Regarding @jkbradley 's suggestion, those seem like good ideas. I don't think I have a comprehensive enough understanding of the spark.ml APIs to understand whether anything done here makes it hard to add those in the future. But assuming it doesn't, this patch LGTM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we first group all categorical columns into a single vector column, then we only need a single pass to generate nominal attributes, and then the OneHotEncoder can encode this vector column in a single pass. This should be easy to add later. We can accept both double and vector columns instead of only double.

@SparkQA
Copy link

SparkQA commented May 29, 2015

Test build #33714 has finished for PR 6466 at commit a280dca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 29, 2015
…ttributes and change includeFirst to dropLast

This PR contains two major changes to `OneHotEncoder`:

1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index
2. change `includeFirst` to `dropLast` and leave the default to `true`. There are couple benefits:

    a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
    b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
    c. If users use `StringIndex`, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR.

jkbradley sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:

a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast

(cherry picked from commit 23452be)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr
Copy link
Contributor Author

mengxr commented May 29, 2015

Merged into master and branch-1.4.

@asfgit asfgit closed this in 23452be May 29, 2015
@mengxr
Copy link
Contributor Author

mengxr commented May 29, 2015

Merged into master and branch-1.4.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
…ttributes and change includeFirst to dropLast

This PR contains two major changes to `OneHotEncoder`:

1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index
2. change `includeFirst` to `dropLast` and leave the default to `true`. There are couple benefits:

    a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
    b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
    c. If users use `StringIndex`, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR.

jkbradley sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#6466 from mengxr/SPARK-7912 and squashes the following commits:

a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…ttributes and change includeFirst to dropLast

This PR contains two major changes to `OneHotEncoder`:

1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index
2. change `includeFirst` to `dropLast` and leave the default to `true`. There are couple benefits:

    a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
    b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
    c. If users use `StringIndex`, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR.

jkbradley sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#6466 from mengxr/SPARK-7912 and squashes the following commits:

a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants