New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-29914][ML] ML models attach metadata in transform
/transformSchema
#26547
Conversation
Test build #113875 has finished for PR 26547 at commit
|
Test build #113876 has finished for PR 26547 at commit
|
5c621bb
to
9aa6ae0
Compare
Test build #113982 has finished for PR 26547 at commit
|
Test build #113994 has finished for PR 26547 at commit
|
transform
/transformSchema
transform
/transformSchema
@viirya hi, I noticed that you had some works on attach output attributes. Would you like to help reviewing this? Thanks |
also friendly ping @srowen |
This PR aims to attach inferrable attributes to output columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a big change. Is there any downside? do any of these take non-trivial extra time to compute and update? conversely, does adding them help anything else optimize its operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for pinging me. I will be in flight today and can not review this. I may have time to take look in next days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change adds metadata to many classes, is metadata useful for them all?
|
||
val vectorSize = data.head.size | ||
|
||
// Can not infer size of ouput vector, since no metadata is provided |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: typo ouput
vecSize: Int): Unit = { | ||
import dataframe.sparkSession.implicits._ | ||
val group = AttributeGroup.fromStructField(dataframe.schema(vecColName)) | ||
assert(group.size === vecSize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add some error message to explain it when the condition fails?
There should not be non-trival cost in update schema, since its logic is simple (similar operations like
Some downstream impls in the pipeline will try to use the meta if provided, otherwise it need to trigger a job, such as a Thanks for reviewing. |
@viirya Thanks for reviewing!
I think it maybe nice to provide as much metadata as possible metadata in the output datasets, since downstream impls may use it in some way. |
I think the change is OK if it improves consistency. |
val attr = if (numValues == 2) { | ||
BinaryAttribute.defaultAttr | ||
.withName(colName) | ||
} else { | ||
NominalAttribute.defaultAttr | ||
.withName(colName) | ||
.withNumValues(numValues) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this. Is numValues == 2 case always BinaryAttribute? NominalAttribute can not have two number of values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
I found that existing impls like Bucketizer
will check whether numValues==2
,
so I think it is safe to only use NominalAttribute
here.
} | ||
|
||
/** | ||
* Update the metadata of an existing column. If this column do not exist, append it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has update and overwrite two functions. We should add to this doc.
def updateField( | ||
schema: StructType, | ||
field: StructField, | ||
overrideMeta: Boolean = true): StructType = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
override or overwrite?
@@ -202,13 +203,23 @@ class DecisionTreeClassificationModel private[ml] ( | |||
rootNode.predictImpl(features).prediction | |||
} | |||
|
|||
@Since("1.4.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm..3.0.0?
efe911f
to
a690eb7
Compare
Test build #114399 has finished for PR 26547 at commit
|
Test build #114401 has finished for PR 26547 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK by me if you're done and tests pass.
retest this please |
Test build #114813 has finished for PR 26547 at commit
|
3d26d74
to
3eb87f6
Compare
Test build #114821 has finished for PR 26547 at commit
|
Merged to master, thanks all for reviewing! |
val attrs: Array[Attribute] = vocabulary.map(_ => new NumericAttribute) | ||
val field = new AttributeGroup($(outputCol), attrs).toStructField() | ||
outputSchema = SchemaUtils.updateField(outputSchema, field) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vocabulary is a big number, for example 1 << 18 by default. We will keep a big attribute array here. Do we actually need this metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this just moved old code. so just wondering if this will be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable, I think we may change this place by only attach a size
. I will send a follow up.
sorry for late. looks fine to me. |
@viirya Thanks very much for helping review this PR! |
…Schema` ### What changes were proposed in this pull request? 1, `predictionCol` in `ml.classification` & `ml.clustering` add `NominalAttribute` 2, `rawPredictionCol` in `ml.classification` add `AttributeGroup` containing vectorsize=`numClasses` 3, `probabilityCol` in `ml.classification` & `ml.clustering` add `AttributeGroup` containing vectorsize=`numClasses`/`k` 4, `leafCol` in GBT/RF add `AttributeGroup` containing vectorsize=`numTrees` 5, `leafCol` in DecisionTree add `NominalAttribute` 6, `outputCol` in models in `ml.feature` add `AttributeGroup` containing vectorsize 7, `outputCol` in `UnaryTransformer`s in `ml.feature` add `AttributeGroup` containing vectorsize ### Why are the changes needed? Appened metadata can be used in downstream ops, like `Classifier.getNumClasses` There are many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` that append appropriate metadata in `transform`/`transformSchema` method. However there are also many impls return no metadata in transformation, even some metadata like `vector.size`/`numAttrs`/`attrs` can be ealily inferred. ### Does this PR introduce any user-facing change? Yes, add some metadatas in transformed dataset. ### How was this patch tested? existing testsuites and added testsuites Closes apache#26547 from zhengruifeng/add_output_vecSize. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
colName: String, | ||
numValues: Int): Unit = { | ||
import dataframe.sparkSession.implicits._ | ||
val n = Attribute.fromStructField(dataframe.schema(colName)) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scala compiler prints the warning here:
Warning:(88, 38) match may not be exhaustive.
It would fail on the following inputs: NumericAttribute(), UnresolvedAttribute
val n = Attribute.fromStructField(dataframe.schema(colName)) match {
Just in case, do we cover all cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's all the cases that need to be covered. The warning could be avoided by adding a case that throws an exception. That kind of cleanup is fine across the code. It won't matter too much here as it'll already generate an exception (correctly)
What changes were proposed in this pull request?
1,
predictionCol
inml.classification
&ml.clustering
addNominalAttribute
2,
rawPredictionCol
inml.classification
addAttributeGroup
containing vectorsize=numClasses
3,
probabilityCol
inml.classification
&ml.clustering
addAttributeGroup
containing vectorsize=numClasses
/k
4,
leafCol
in GBT/RF addAttributeGroup
containing vectorsize=numTrees
5,
leafCol
in DecisionTree addNominalAttribute
6,
outputCol
in models inml.feature
addAttributeGroup
containing vectorsize7,
outputCol
inUnaryTransformer
s inml.feature
addAttributeGroup
containing vectorsizeWhy are the changes needed?
Appened metadata can be used in downstream ops, like
Classifier.getNumClasses
There are many impls (like
Binarizer
/Bucketizer
/VectorAssembler
/OneHotEncoder
/FeatureHasher
/HashingTF
/VectorSlicer
/...) in.ml
that append appropriate metadata intransform
/transformSchema
method.However there are also many impls return no metadata in transformation, even some metadata like
vector.size
/numAttrs
/attrs
can be ealily inferred.Does this PR introduce any user-facing change?
Yes, add some metadatas in transformed dataset.
How was this patch tested?
existing testsuites and added testsuites