
[SPARK-13969][ML] Add FeatureHasher transformer #18513

Closed · 13 commits · 5 participants
@MLnick (Contributor) commented on Jul 3, 2017

This PR adds a FeatureHasher transformer, modeled on scikit-learn and Vowpal Wabbit.

The transformer operates on multiple input columns in one pass. Current behavior is:

  • for numeric columns, the values are assumed to be real-valued: the feature index is hash(columnName) and the feature value is the column value
  • for string columns, the values are assumed to be categorical: the feature index is hash(columnName=value) and the feature value is 1.0
  • on hash collisions, feature values are summed
  • null (missing) values are ignored
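
The per-column rule can be sketched in plain Scala roughly as follows (a hedged illustration only: the actual PR uses Spark's murmur3-based hashing via Utils and an OpenHashMap; hashCode and a mutable Map stand in here):

def hashColumn(colName: String, raw: Any, numFeatures: Int,
               acc: scala.collection.mutable.Map[Int, Double]): Unit = raw match {
  case null => // null (missing) values are ignored
  case s: String =>
    // categorical: index from hash of "columnName=value", feature value 1.0
    val idx = Math.floorMod((colName + "=" + s).hashCode, numFeatures)
    acc(idx) = acc.getOrElse(idx, 0.0) + 1.0
  case n: Number =>
    // numeric: index from hash of the column name, feature value is the value;
    // accumulating into the map is what sums colliding features
    val idx = Math.floorMod(colName.hashCode, numFeatures)
    acc(idx) = acc.getOrElse(idx, 0.0) + n.doubleValue()
  case _ => // other types are out of scope for this sketch
}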

The following dataframe illustrates the basic semantics:

+---+------+-----+---------+------+-----------------------------------------+
|int|double|float|stringNum|string|features                                 |
+---+------+-----+---------+------+-----------------------------------------+
|3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
|6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
+---+------+-----+---------+------+-----------------------------------------+
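
For reference, a minimal usage sketch that would produce output like the above (setter names follow the code snippets reviewed below; the merged API may differ slightly):

import org.apache.spark.ml.feature.FeatureHasher

val hasher = new FeatureHasher()
  .setInputCols("int", "double", "float", "stringNum", "string")
  .setOutputCol("features")
  .setNumFeatures(16)  // 16 buckets, matching the example above

hasher.transform(df).show(truncate = false)  // df is the input DataFrame above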

How was this patch tested?

New unit tests and manual experiments.

@SparkQA commented on Jul 3, 2017

Test build #79092 has finished for PR 18513 at commit 9edb3bd.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@MLnick (Contributor, Author) commented on Jul 3, 2017

Note 1: this is distinct from HashingTF, which handles vectorizing text to term frequencies (analogous to scikit-learn's HashingVectorizer). This feature hasher could be extended to also handle Seq[String] input columns, but I feel that conflates concerns - e.g. HashingTF handles min term frequencies, binarization, etc.

However we could later add basic support for Seq[String] columns - this would handle raw text in a similar way to Vowpal Wabbit, i.e. it all gets hashed into one feature vector (can be combined with namespaces later).

Note 2: some potential follow ups:

  • support specifying categorical columns explicitly. This would allow forcing columns that are in numeric format to be treated as categorical. Strings would still be treated as categorical.
  • support using the sign of the hashed value as the sign of the feature value, together with a non_negative param (see scikit-learn); a sketch follows this list
  • support feature namespaces and feature interactions similar to Vowpal Wabbit (see here for an outline of the code used). This could provide an efficient and scalable form of PolynomialExpansion.
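
A hedged sketch of the signed-hashing idea in the second bullet (this mirrors scikit-learn's FeatureHasher; `hash` stands in for whatever hash function the implementation would use):

def signedUpdate(acc: scala.collection.mutable.Map[Int, Double],
                 hash: Int, value: Double, numFeatures: Int): Unit = {
  // one bit of the hash supplies the sign, so colliding features tend to
  // cancel in expectation instead of always accumulating upward
  val idx = Math.floorMod(hash, numFeatures)
  val sign = if (hash >= 0) 1.0 else -1.0
  acc(idx) = acc.getOrElse(idx, 0.0) + sign * value
}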

cc @srowen @jkbradley @sethah @hhbyyh @yanboliang @BryanCutler @holdenk

@MLnick (Contributor, Author) commented on Jul 3, 2017

I've moved HashingTF's numFeatures param to sharedParams, which causes the MiMa failure since the param would now be marked final. I can't quite recall what we've done previously in this case - whether we accept that it breaks user code, on the grounds that users should not really have been extending or overriding these params, or leave it as is.

I'm ok with the latter - numFeatures doesn't really need to be a shared param.

@hhbyyh (Contributor) left a comment

This should be useful for reducing dimensionality.

Agree that numFeatures does not need to be in sharedParams.

I understand this may be WIP; several general comments below.

import org.apache.spark.util.Utils
import org.apache.spark.util.collection.OpenHashMap


@hhbyyh (Contributor), Jul 10, 2017:
comment

@MLnick (Author, Contributor), Jul 12, 2017:
Yup, forgot that!


/** @group setParam */
@Since("2.3.0")
def setNumFeatures(value: Int): this.type = set(numFeatures, value)

@hhbyyh (Contributor), Jul 10, 2017:
need a way to know the default value.

@MLnick (Author, Contributor), Jul 12, 2017:
Not sure what you mean exactly.

override def transformSchema(schema: StructType): StructType = {
  val fields = schema($(inputCols).toSet)
  require(fields.map(_.dataType).forall { case dt =>
    dt.isInstanceOf[NumericType] || dt.isInstanceOf[StringType]

@hhbyyh (Contributor), Jul 10, 2017:
This require needs a failure message.

val field = dataset.schema(colName)
field.dataType match {
  case DoubleType | StringType => dataset(field.name)
  case _: NumericType | BooleanType => dataset(field.name).cast(DoubleType).alias(field.name)

@hhbyyh (Contributor), Jul 10, 2017:
Is it possible to avoid casting to Double, since one key target of feature hashing is reducing memory usage?

@MLnick (Author, Contributor), Jul 12, 2017:
Fair point, have updated to handle this.


implicit val vectorEncoder = ExpressionEncoder[Vector]()

test("params") {

@hhbyyh (Contributor), Jul 10, 2017:
Maybe add a test for Unicode column names (e.g. Chinese, "中文").

@MLnick (Contributor, Author) commented on Jul 12, 2017

@hhbyyh thanks for the comments. Have updated accordingly.

Thought about it: while numFeatures could be shared, only 2 transformers use it, so to avoid any binary-compatibility issues I backed out the shared-param version.

@SparkQA commented on Jul 12, 2017

Test build #79558 has finished for PR 18513 at commit b580a5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@sethah (Contributor) commented on Jul 14, 2017

Just to clarify:

  • If I want to treat a column of integers as categorical, I'd have to map those integers to strings first, right? I believe that's one of your bullets above.
  • This is effectively going to one-hot encode categorical columns, which will create linearly dependent columns since there is no parameter to drop the last category. Maybe there's a good solution, but I don't think we have to address it here. Just wanted to check.
@sethah (Contributor) left a comment

Nice PR! The tests are great. Only minor comments.

.setNumFeatures(n)
val output = hasher.transform(df)
val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
require(attrGroup.numAttributes === Some(n))

@sethah (Contributor), Jul 14, 2017:
make this an assert


val hashFeatures = udf { row: Row =>
  val map = new OpenHashMap[Int, Double]()
  $(inputCols).foreach { case colName =>

@sethah (Contributor), Jul 14, 2017:
case does nothing here

@sethah (Contributor), Jul 14, 2017:
Also, I think you'll serialize the entire object here by using $(inputCols) inside the udf. Maybe you can make a local pointer to it before the udf.
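
A hedged sketch of the suggested fix, reusing names from the snippet above - copy the param value into a local val so the udf closure captures only the Array[String], not the whole transformer:

val localInputCols = $(inputCols)  // evaluated once, outside the closure
val hashFeatures = udf { row: Row =>
  val map = new OpenHashMap[Int, Double]()
  // the closure now captures localInputCols (a plain Array[String])
  // rather than `this`, so only the array is serialized to executors
  localInputCols.foreach { colName =>
    // ... hash the row's value for colName into `map`, as in the original code
  }
  // ... build the output Vector from `map`
}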

@MLnick (Author, Contributor), Jul 17, 2017:
Ah thanks - this was left over from a previous code version.


override def transformSchema(schema: StructType): StructType = {
  val fields = schema($(inputCols).toSet)
  fields.foreach { case fieldSchema =>

@sethah (Contributor), Jul 14, 2017:
case does nothing

@MLnick (Author, Contributor), Jul 17, 2017:
Again, I think it was left over from some previous version; will update.


import HashingTFSuite.murmur3FeatureIdx

implicit val vectorEncoder = ExpressionEncoder[Vector]()

@sethah (Contributor), Jul 14, 2017:
private

hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
}

override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)

@sethah (Contributor), Jul 14, 2017:
@Since tags on all public methods (copy, transformSchema, transform)

@hhbyyh (Contributor) left a comment

I'm a little worried that categorical data will be overwhelmed by the Double values in the case of hash collisions.
If it helps, we could divide the output vector space into two parts. Since each real-valued column maps to a single feature index, we can just reserve a certain range for the real values. E.g., if there are 5 categorical values and 3 Double columns, and numFeatures=100, we could reserve the last 3 indices (97, 98, 99) for the Double values and map the categorical values into the first 97 indices. But I guess there's a problem when there are a lot of Double columns and the user just wants to combine and shrink them.

Another option is to document this and demo it in the example code: users can use one FeatureHasher for categorical values and another for Doubles, then assemble the output features (a sketch follows).

Just to bring up the idea, and sorry it's a little late. Other parts LGTM.
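
A hedged sketch of that second option, borrowing column names from the example dataframe in the PR description (not code from this PR):

import org.apache.spark.ml.feature.{FeatureHasher, VectorAssembler}

// hash categorical and real-valued columns into separate vectors...
val catHasher = new FeatureHasher()
  .setInputCols("stringNum", "string")
  .setOutputCol("catFeatures")
val realHasher = new FeatureHasher()
  .setInputCols("int", "double", "float")
  .setOutputCol("realFeatures")

// ...then concatenate them, keeping collisions within each part
val assembler = new VectorAssembler()
  .setInputCols(Array("catFeatures", "realFeatures"))
  .setOutputCol("features")

val out = assembler.transform(realHasher.transform(catHasher.transform(df)))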

f.dataType.isInstanceOf[NumericType]
}.map(_.name).toSet

def getDouble(x: Any): Double = {

@hhbyyh (Contributor), Jul 15, 2017:
maybe val getDouble...

@MLnick (Author, Contributor), Jul 17, 2017:
why?

@MLnick (Author, Contributor), Jul 19, 2017:
Hmm, this is a method, not a function - so I don't think a val would be faster in this case?

val metadata = outputSchema($(outputCol)).metadata
dataset.select(
  col("*"),
  hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))

@hhbyyh (Contributor), Jul 15, 2017:
.map(col)

@MLnick (Contributor, Author) commented on Jul 17, 2017

@hhbyyh can you elaborate on your concerns in comment #18513 (review)?

I tend to agree that the hasher is best suited to categorical features, while known real features could be "assembled" onto the resulting hashed feature vector. However, one nice thing about hashing is that it can handle everything in a single pass. In practice, even with very high-cardinality categorical features and some real features, for "normal" settings of hash bits the collision rate is relatively low and has very little impact on performance (at least from my experiments). Of course this assumes highly sparse data - if the data is not sparse, it's usually best to use other mechanisms.
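
A rough back-of-envelope supporting this (the numbers are illustrative, not from the experiments): hashing k active features into n buckets gives about k * (k - 1) / (2n) expected colliding pairs (the birthday approximation). With n = 2^18 buckets and k = 1000 active features per row, that is roughly 1000 * 999 / (2 * 262144) ≈ 1.9 collisions per row.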

@MLnick (Contributor, Author) commented on Jul 17, 2017

@sethah thanks for reviewing.

For the 1st question:

Yes, currently categorical columns that are numeric would need to be explicitly encoded as strings (a sketch of the cast workaround follows below). I mentioned it as a follow-up improvement. It's easy to handle; it's just the API I'm not certain of yet. Here are the two options I see:

  1. The user can specify a categoricalCols param to explicitly set categorical columns. But do we then assume that all other string columns not in that list are categorical? I.e., is this param effectively only for numeric columns that must be treated as categorical? Or do we ignore all other non-numeric columns? Etc.
  2. The user can specify a realCols param to explicitly set the numeric columns; all other columns are treated as categorical.

We could potentially offer both, though I gravitate towards (2), since the default use case will be encoding many (usually high-cardinality) categorical columns, with maybe a few real columns mixed in.
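
In the meantime, a hedged sketch of the cast workaround mentioned above (intCat is a hypothetical column name):

import org.apache.spark.sql.functions.col

// cast the numeric "categorical" column to string before hashing, so its
// index becomes hash("intCat=<value>") rather than hash("intCat")
val dfForHashing = df.withColumn("intCat", col("intCat").cast("string"))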

For the second issue:

There is no way (at least that I know of) to provide a dropLast feature, since we don't know how many features there are - the whole point of hashing is to avoid keeping the feature <-> index mapping, for speed and memory efficiency.

@SparkQA commented on Jul 18, 2017

Test build #79699 has finished for PR 18513 at commit 990b816.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@sethah (Contributor) commented on Jul 18, 2017

Let's make sure to create doc and Python JIRAs before this gets merged, btw.

@MLnick (Contributor, Author) commented on Jul 19, 2017

@hhbyyh

@hhbyyh (Contributor) approved these changes on Jul 20, 2017

Thanks for the reply. The PR looks good to me.

* to map features to indices in the feature vector.
*
* The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
* (representing a real feature) or string (representing a categorical feature). Boolean columns

@sethah (Contributor), Jul 21, 2017:
It might be good to make the behavior for each type of column clearer here, specifically for numeric columns that are meant to be categories. Something like:

/**
 * Behavior
 *  - Numeric columns: For numeric features, the hash value of the column name is used to map the
 *                     feature value to its index in the feature vector. Numeric features are never
 *                     treated as categorical, even when they are integers. You must convert
 *                     categorical columns to strings first.
 *  - String columns: ...
 *  - Boolean columns: ...
 */

Anyway, this is a very minor suggestion and I think it's also ok to leave as is.

@sethah

This comment has been minimized.

Copy link
Contributor

commented Jul 21, 2017

LGTM!

@MLnick (Contributor, Author) commented on Jul 25, 2017

Thanks @sethah @hhbyyh for the review. I updated the behavior doc string as suggested.

Any other comments? cc @srowen @jkbradley @yanboliang

@SparkQA commented on Jul 25, 2017

Test build #79934 has finished for PR 18513 at commit a91b53f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented on Jul 26, 2017

Test build #79961 has finished for PR 18513 at commit d6a3117.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@yanboliang (Contributor) left a comment

I originally had the same concern as @hhbyyh that categorical data would be overwhelmed by the Double values in the case of hash collisions, but the comments convinced me. This looks pretty good now. Thanks.

s"FeatureHasher requires columns to be of NumericType, BooleanType or StringType. " +
s"Column $fieldName was $dataType")
}
val attrGroup = new AttributeGroup($(outputCol), $(numFeatures))

@yanboliang (Contributor), Jul 27, 2017:
It seems that we don't store Attributes in the AttributeGroup, but we do in VectorAssembler, and both FeatureHasher and VectorAssembler can be followed directly by ML algorithms. I'd like to confirm: is this intentional? I understand it may be due to performance considerations, and users may not be interested in the attributes of hashed features. We can leave it as is until we find it affects some scenarios.

@MLnick (Author, Contributor), Jul 27, 2017:

Feature hashing doesn't keep the feature -> idx mapping for memory efficiency, so by extension it won't keep attribute info. This is by design, and the tradeoff is speed & efficiency vs. not being able to do the reverse mapping (or knowing the cardinality of each feature, for example).

If users want to keep the mapping & attribute info, then of course they can just use one-hot encoding and vector assembler.

@yanboliang (Contributor), Jul 27, 2017:

@MLnick Thanks for clarifying.

@MLnick (Contributor, Author) commented on Aug 16, 2017

jenkins retest this please

@SparkQA commented on Aug 16, 2017

Test build #80724 has finished for PR 18513 at commit d6a3117.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@MLnick (Contributor, Author) commented on Aug 16, 2017

Merged to master. Thanks all for reviews.

@asfgit closed this in 0bb8d1f on Aug 16, 2017
