[SPARK-13969][ML] Add FeatureHasher transformer #18513
Conversation
Test build #79092 has finished for PR 18513 at commit
Note 1: this is distinct from … However, we could later add basic support for …
Note 2: some potential follow-ups:
cc @srowen @jkbradley @sethah @hhbyyh @yanboliang @BryanCutler @holdenk
I've moved … I'm ok with the latter.
This should be useful for reducing dimensionality.
Agree that numFeatures does not need to be in SharedParams.
I understand this may be WIP; a few general comments below.
import org.apache.spark.util.Utils
import org.apache.spark.util.collection.OpenHashMap
comment
Yup, forgot that!
/** @group setParam */
@Since("2.3.0")
def setNumFeatures(value: Int): this.type = set(numFeatures, value)
need a way to know the default value.
Not sure what you mean exactly
override def transformSchema(schema: StructType): StructType = {
  val fields = schema($(inputCols).toSet)
  require(fields.map(_.dataType).forall { case dt =>
    dt.isInstanceOf[NumericType] || dt.isInstanceOf[StringType]
This require should include an error message.
val field = dataset.schema(colName)
field.dataType match {
  case DoubleType | StringType => dataset(field.name)
  case _: NumericType | BooleanType => dataset(field.name).cast(DoubleType).alias(field.name)
Is it possible to avoid casting to Double, since one key target of feature hashing is reducing memory usage?
Fair point, have updated to handle this.
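For reference, a minimal sketch of the no-cast approach (the getDouble name matches a helper visible later in the diff, but this body is assumed, not the PR's actual code):

```scala
// Convert each raw value at hash time instead of casting whole columns
// to DoubleType up front. Sketch only; error handling simplified.
def getDouble(x: Any): Double = x match {
  case n: java.lang.Number => n.doubleValue()     // Int, Long, Float, Double, ...
  case b: Boolean          => if (b) 1.0 else 0.0 // booleans as binary features
  case other =>
    throw new IllegalArgumentException(s"Cannot convert value $other to Double")
}
```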
implicit val vectorEncoder = ExpressionEncoder[Vector]()

test("params") {
Maybe add a test for Unicode column names (e.g. Chinese, "中文").
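A sketch of what such a test might look like (data and assertions are illustrative, not from the PR; assumes a varargs setInputCols as in the final API and the suite's SparkSession named spark):

```scala
test("hashing works for Unicode column names") {
  import spark.implicits._
  val df = Seq((2.0, "foo")).toDF("中文", "cat")
  val hasher = new FeatureHasher()
    .setInputCols("中文", "cat")
    .setOutputCol("features")
    .setNumFeatures(16)
  val output = hasher.transform(df)
  // the output vector size should match numFeatures regardless of column names
  val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
  assert(attrGroup.numAttributes === Some(16))
}
```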
@hhbyyh thanks for the comments. Have updated accordingly. Thought about it, and while …
Test build #79558 has finished for PR 18513 at commit
Just to clarify:
Nice PR! The tests are great. Only minor comments.
  .setNumFeatures(n)
val output = hasher.transform(df)
val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
require(attrGroup.numAttributes === Some(n))
make this an assert
val hashFeatures = udf { row: Row =>
  val map = new OpenHashMap[Int, Double]()
  $(inputCols).foreach { case colName =>
case does nothing here
Also, I think you'll serialize the entire object here by using $(inputCols). Maybe you can make a local pointer to it before the udf (sketched below).
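A sketch of that pattern, as a fragment of the transform method quoted above (local names are assumed):

```scala
// Copy params into local vals so the UDF closure captures only these
// values, not the whole FeatureHasher instance.
val localInputCols = $(inputCols)
val localNumFeatures = $(numFeatures)
val hashFeatures = udf { row: Row =>
  val map = new OpenHashMap[Int, Double]()
  localInputCols.foreach { colName =>
    // ... hash each column's value into 'map', as in the diff above ...
  }
  Vectors.sparse(localNumFeatures, map.toSeq)
}
```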
Ah thanks - this was left over from a previous code version
override def transformSchema(schema: StructType): StructType = {
  val fields = schema($(inputCols).toSet)
  fields.foreach { case fieldSchema =>
case does nothing
Again, think it was left over from some previous version, will update
import HashingTFSuite.murmur3FeatureIdx

implicit val vectorEncoder = ExpressionEncoder[Vector]()
Make this private.
    hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
}

override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)
Add @Since tags on all public methods (copy, transformSchema, transform).
I'm a little worried that categorical data will be overwhelmed by the Double values in the case of hash collisions.
If it's better, we can divide the output vector space into two parts. Since each real-valued column is always mapped to the same feature index, we can just reserve a certain range for the real values. E.g., if there are 5 categorical values and 3 Double values, and numFeatures = 100, then we can reserve the last 3 indices (97, 98, 99) for the Double values and map the categorical values into the first 97 indices. But I guess there's a problem when there are a lot of Double columns and users just want to combine and shrink them.
Another option is that we just inform users of this in the documentation and demo it in the example code, where users can use one FeatureHasher for categorical values and another for Doubles, and then assemble the output features (see the sketch below).
Just to bring up the idea, and sorry it's a little late. Other parts LGTM.
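For illustration, the two-hasher idea from the last paragraph might look like this (column names and sizes are made up):

```scala
import org.apache.spark.ml.feature.{FeatureHasher, VectorAssembler}

// Hash high-cardinality categorical columns into a large space...
val catHasher = new FeatureHasher()
  .setInputCols(Array("gender", "city"))
  .setOutputCol("catFeatures")
  .setNumFeatures(1 << 12)

// ...and the real-valued columns into a small, separate space.
val realHasher = new FeatureHasher()
  .setInputCols(Array("age", "income"))
  .setOutputCol("realFeatures")
  .setNumFeatures(1 << 3)

// Concatenate the two hashed vectors into the final feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("catFeatures", "realFeatures"))
  .setOutputCol("features")
```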
  f.dataType.isInstanceOf[NumericType]
}.map(_.name).toSet

def getDouble(x: Any): Double = {
maybe val getDouble...
why?
I read it from here, but never tested it.
https://stackoverflow.com/questions/18887264/what-is-the-difference-between-def-and-val-to-define-a-function
Hmm, this is a method, not a function - so I don't think it will be faster to use a val in this case?
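For reference, the distinction under discussion (illustrative only):

```scala
// A method: no object is allocated; the JVM can often inline calls.
def getDoubleAsDef(x: Double): Double = x * 2

// A function value: a Function1 object is allocated once; each call goes
// through its apply() method.
val getDoubleAsVal: Double => Double = x => x * 2
```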
val metadata = outputSchema($(outputCol)).metadata
dataset.select(
  col("*"),
  hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
.map(col)
@hhbyyh can you elaborate on your concerns in comment #18513 (review)? I tend to agree that the hasher is perhaps best used for categorical features, while known real features could be "assembled" onto the resulting hashed feature vector. However, one nice thing about hashing is that it can handle everything at once in one pass. In practice, even with very high-cardinality categorical features and some real features, for "normal" settings of hash bits the collision rate is relatively low and has very little impact on performance (at least in my experiments). Of course this assumes highly sparse data - if the data is not sparse then it's usually best to use other mechanisms.
@sethah thanks for reviewing. For the 1st question: yes, currently categorical columns that are numeric would need to be explicitly encoded as strings. I mentioned it as a follow-up improvement. It's easy to handle; it's just the API for this that I'm not certain of yet. Here are the two options I see:
We could potentially offer both formats, though I tend to gravitate towards option (2) above, since the default use case will be encoding many (usually high-cardinality) categorical columns, with maybe a few real columns in there. For the second issue: there is no way (at least that I know of) to provide a …
Test build #79699 has finished for PR 18513 at commit
Let's make sure to create doc and Python JIRAs before this gets merged, btw.
Created https://issues.apache.org/jira/browse/SPARK-21468 and https://issues.apache.org/jira/browse/SPARK-21469 for docs and Python API.
Thanks for the reply. The PR looks good to me.
 * to map features to indices in the feature vector.
 *
 * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
 * (representing a real feature) or string (representing a categorical feature). Boolean columns
It might be good to make the behavior for each type of column clearer here, specifically for numeric columns that are meant to be categorical. Something like:
/**
 * Behavior:
 * - Numeric columns: for numeric features, the hash value of the column name is used to map
 *   the feature value to its index in the feature vector. Numeric features are never treated
 *   as categorical, even when they are integers. You must convert categorical columns to
 *   strings first.
 * - String columns: ...
 * - Boolean columns: ...
 */
Anyway, this is a very minor suggestion and I think it's also ok to leave as is.
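As a usage note, converting a numeric categorical column to string before hashing could look like this (hypothetical column names):

```scala
import org.apache.spark.sql.functions.col

// 'zip' holds integer codes that are really categories; cast to string so the
// hasher produces one-hot style terms like "zip=90210" instead of treating
// the codes as real values.
val prepared = df.withColumn("zip", col("zip").cast("string"))
val hasher = new FeatureHasher()
  .setInputCols(Array("zip", "income"))
  .setOutputCol("features")
val hashed = hasher.transform(prepared)
```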
LGTM!
Thanks @sethah @hhbyyh for the review. I updated the behavior doc string as suggested. Any other comments? cc @srowen @jkbradley @yanboliang
Test build #79934 has finished for PR 18513 at commit
Test build #79961 has finished for PR 18513 at commit
s"FeatureHasher requires columns to be of NumericType, BooleanType or StringType. " + | ||
s"Column $fieldName was $dataType") | ||
} | ||
val attrGroup = new AttributeGroup($(outputCol), $(numFeatures)) |
It seems that we don't store Attributes in the AttributeGroup, but we do in VectorAssembler, and both FeatureHasher and VectorAssembler can be followed by ML algorithms directly. I'd like to confirm: is this intentional? I understand this may be due to performance considerations, and users may not be interested in the attributes of hashed features. We can leave it as is until we find it affects some scenarios.
Feature hashing doesn't keep the feature -> idx mapping for memory efficiency, so by extension it won't keep attribute info. This is by design, and the tradeoff is speed & efficiency vs. not being able to do the reverse mapping (or knowing the cardinality of each feature, for example).
If users want to keep the mapping & attribute info, then of course they can just use one-hot encoding and vector assembler.
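That mapping-preserving alternative, sketched with the Spark 2.x API (hypothetical column names):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// StringIndexer + OneHotEncoder keep the category -> index mapping (exposed
// as vector attributes), at the cost of fitting a dictionary over the data.
val indexer = new StringIndexer().setInputCol("city").setOutputCol("cityIdx")
val encoder = new OneHotEncoder().setInputCol("cityIdx").setOutputCol("cityVec")
val assembler = new VectorAssembler()
  .setInputCols(Array("cityVec", "income"))
  .setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler))
```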
@MLnick Thanks for clarifying.
jenkins retest this please
Test build #80724 has finished for PR 18513 at commit
Merged to master. Thanks all for the reviews.
This PR adds a FeatureHasher transformer, modeled on scikit-learn and Vowpal Wabbit. The transformer operates on multiple input columns in one pass. Current behavior is:
- for numeric columns, the feature index is hash(columnName), while the feature value is feature_value
- for string columns, the feature index is hash(column_name=feature_value), while the feature value is 1.0
- null (missing) values are ignored

The following dataframe illustrates the basic semantics:
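(The original example output isn't reproduced here; a minimal usage sketch with made-up data, assuming a SparkSession named spark:)

```scala
import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar")
)).toDF("real", "bool", "stringNum", "string")

// Hash all four columns into a single sparse feature vector in one pass.
val hasher = new FeatureHasher()
  .setInputCols(Array("real", "bool", "stringNum", "string"))
  .setOutputCol("features")

hasher.transform(dataset).show(truncate = false)
```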
How was this patch tested?
New unit tests and manual experiments.