[SPARK-29914][ML] ML models attach metadata in transform/transformSchema #26547

Closed
wants to merge 14 commits into from

Conversation

zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Nov 15, 2019

What changes were proposed in this pull request?

1. `predictionCol` in `ml.classification` & `ml.clustering`: add `NominalAttribute`
2. `rawPredictionCol` in `ml.classification`: add `AttributeGroup` with vector size = `numClasses`
3. `probabilityCol` in `ml.classification` & `ml.clustering`: add `AttributeGroup` with vector size = `numClasses`/`k`
4. `leafCol` in GBT/RF: add `AttributeGroup` with vector size = `numTrees`
5. `leafCol` in DecisionTree: add `NominalAttribute`
6. `outputCol` of models in `ml.feature`: add `AttributeGroup` with vector size
7. `outputCol` of `UnaryTransformer`s in `ml.feature`: add `AttributeGroup` with vector size
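For illustration, a hedged sketch (not the PR's exact code; the object and method names here are invented) of how the existing `ml.attribute` API attaches such metadata to a prediction column, and how downstream code reads it back:

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.types.StructField

object PredictionMetadataSketch {
  // Build a metadata-bearing field for a prediction column
  // (illustrative; the PR derives numClasses from the fitted model).
  def predictionField(numClasses: Int): StructField =
    NominalAttribute.defaultAttr
      .withName("prediction")
      .withNumValues(numClasses)
      .toStructField()

  // Downstream code can recover numClasses from the schema alone,
  // without running any job over the data.
  def numClassesFrom(field: StructField): Int =
    Attribute.fromStructField(field) match {
      case nominal: NominalAttribute => nominal.getNumValues.getOrElse(-1)
      case _ => -1
    }
}
```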

Why are the changes needed?

The appended metadata can be used in downstream ops, like `Classifier.getNumClasses`.

Many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` already append appropriate metadata in the `transform`/`transformSchema` methods.

However, many other impls return no metadata from transformation, even when metadata like `vector.size`/`numAttrs`/`attrs` can be easily inferred.

Does this PR introduce any user-facing change?

Yes, some metadata is added to the transformed dataset.

How was this patch tested?

Existing test suites and newly added test suites.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113875 has finished for PR 26547 at commit f3a42d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113876 has finished for PR 26547 at commit 5c621bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2019

Test build #113982 has finished for PR 26547 at commit 9aa6ae0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2019

Test build #113994 has finished for PR 26547 at commit efe911f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [SPARK-29914][ML] ML models append metadata in transform/transformSchema [SPARK-29914][ML] ML models attach metadata in transform/transformSchema Nov 19, 2019
@zhengruifeng
Contributor Author

@viirya hi, I noticed that you have done some work on attaching output attributes. Would you like to help review this? Thanks

@zhengruifeng
Contributor Author

also friendly ping @srowen

@zhengruifeng
Contributor Author

This PR aims to attach inferrable attributes to output columns.

Member

@srowen srowen left a comment

It's a big change. Is there any downside? Do any of these take non-trivial extra time to compute and update? Conversely, does adding them help anything else optimize its operation?

Member

@viirya viirya left a comment

Thanks for pinging me. I will be in flight today and cannot review this. I may have time to take a look in the next few days.

Member

@viirya viirya left a comment

This change adds metadata to many classes; is the metadata useful for all of them?


val vectorSize = data.head.size

// Can not infer size of ouput vector, since no metadata is provided
Member

nit: typo ouput

vecSize: Int): Unit = {
import dataframe.sparkSession.implicits._
val group = AttributeGroup.fromStructField(dataframe.schema(vecColName))
assert(group.size === vecSize)
Member

Can we add an error message explaining the failure when this assertion does not hold?

@zhengruifeng
Contributor Author

zhengruifeng commented Nov 21, 2019

@srowen

do any of these take non-trivial extra time to compute and update?

There should be no non-trivial cost in updating the schema, since its logic is simple (similar operations like `withColumn` are widely used), and it should not affect `fit`/`transform` much.

does adding them help anything else optimize its operation?

Some downstream impls in the pipeline will try to use the metadata if provided; otherwise they need to trigger a job, such as a `first` job to get the vector size, or a whole pass over the data to get `numClasses`. Providing more inferrable metadata will help minimize the computation cost of the whole pipeline.
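As a sketch of that trade-off (the helper names here are invented, but `AttributeGroup` is the real API): a negative group size means "unknown", in which case the caller would have to run a job over the data to discover the vector size.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.StructField

object VectorSizeSketch {
  // Prefer the schema metadata; None signals the caller must fall back
  // to an actual Spark job (e.g. a first()) to learn the vector size.
  def sizeFromMetadata(field: StructField): Option[Int] = {
    val group = AttributeGroup.fromStructField(field)
    if (group.size >= 0) Some(group.size) else None
  }

  // Helpers to build vector-typed fields with and without a known size.
  def knownField(n: Int): StructField = new AttributeGroup("features", n).toStructField()
  def unknownField(): StructField = new AttributeGroup("features").toStructField()
}
```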

Thanks for reviewing.

@zhengruifeng
Contributor Author

@viirya Thanks for reviewing!

this change adds metadata to many classes, is metadata useful for them all?

I think it may be nice to provide as much metadata as possible in the output datasets, since downstream impls may use it in some way.
Currently, some impls provide metadata and others do not; I do not think there is a clear criterion for when to attach it.

@srowen
Member

srowen commented Nov 23, 2019

I think the change is OK if it improves consistency.

Comment on lines 138 to 145
val attr = if (numValues == 2) {
BinaryAttribute.defaultAttr
.withName(colName)
} else {
NominalAttribute.defaultAttr
.withName(colName)
.withNumValues(numValues)
}
Member

Not sure about this. Is the numValues == 2 case always a BinaryAttribute? Can't a NominalAttribute have two values?

Contributor Author

Good point.
I found that existing impls like Bucketizer check whether numValues == 2,
so I think it is safe to use only NominalAttribute here.
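For illustration, a small sketch (invented object name, real `ml.attribute` API) showing that a `NominalAttribute` with two values round-trips through a `StructField` on its own, so a separate `BinaryAttribute` branch is not required for correctness:

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}

object TwoValueNominalSketch {
  // Build a two-valued nominal attribute, store it in a StructField,
  // and read the number of values back from the field's metadata.
  def roundTripNumValues(): Option[Int] = {
    val field = NominalAttribute.defaultAttr
      .withName("binarized")
      .withNumValues(2)
      .toStructField()
    Attribute.fromStructField(field) match {
      case n: NominalAttribute => n.getNumValues
      case _ => None
    }
  }
}
```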

}

/**
* Update the metadata of an existing column. If the column does not exist, append it.
Member

This method has two functions, update and overwrite. We should document both here.

def updateField(
schema: StructType,
field: StructField,
overrideMeta: Boolean = true): StructType = {
Member

override or overwrite?

@@ -202,13 +203,23 @@ class DecisionTreeClassificationModel private[ml] (
rootNode.predictImpl(features).prediction
}

@Since("1.4.0")
Member

hmm..3.0.0?

@SparkQA

SparkQA commented Nov 25, 2019

Test build #114399 has finished for PR 26547 at commit a690eb7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2019

Test build #114401 has finished for PR 26547 at commit 3d26d74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

OK by me if you're done and tests pass.

@zhengruifeng
Contributor Author

@srowen Thanks for reviewing!
@viirya I have updated this PR according to your comments; could you please take a glance at it?

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114813 has finished for PR 26547 at commit 3d26d74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114821 has finished for PR 26547 at commit 3eb87f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

Merged to master, thanks all for reviewing!

@zhengruifeng zhengruifeng deleted the add_output_vecSize branch December 4, 2019 08:41
Comment on lines +337 to +339
val attrs: Array[Attribute] = vocabulary.map(_ => new NumericAttribute)
val field = new AttributeGroup($(outputCol), attrs).toStructField()
outputSchema = SchemaUtils.updateField(outputSchema, field)
Member

The vocabulary can be large, for example 1 << 18 entries by default, so we will keep a big attribute array here. Do we actually need this metadata?

Member

Looks like this just moved old code, so I'm just wondering if this will be a problem.

Contributor Author

Sounds reasonable; I think we can change this place to attach only a size. I will send a follow-up.
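A hedged sketch of that follow-up direction (invented object name; `AttributeGroup` has a real size-only constructor): attach only the vector size instead of materializing one attribute object per vocabulary entry, keeping the metadata tiny while still letting downstream stages read the size from the schema.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.StructField

object SizeOnlyGroupSketch {
  // Store only the number of attributes, not 1 << 18 attribute objects.
  def sizeOnlyField(vocabSize: Int): StructField =
    new AttributeGroup("features", vocabSize).toStructField()

  // Downstream stages can still read the vector size back.
  def readSize(field: StructField): Int =
    AttributeGroup.fromStructField(field).size
}
```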

@viirya
Member

viirya commented Dec 4, 2019

Sorry for being late. Looks fine to me.

@zhengruifeng
Contributor Author

@viirya Thanks very much for helping review this PR!

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…Schema`

Closes apache#26547 from zhengruifeng/add_output_vecSize.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
colName: String,
numValues: Int): Unit = {
import dataframe.sparkSession.implicits._
val n = Attribute.fromStructField(dataframe.schema(colName)) match {
Member

Scala compiler prints the warning here:

Warning:(88, 38) match may not be exhaustive.
It would fail on the following inputs: NumericAttribute(), UnresolvedAttribute
    val n = Attribute.fromStructField(dataframe.schema(colName)) match {

Just in case, do we cover all cases?

Member

I think that's all the cases that need to be covered. The warning could be avoided by adding a case that throws an exception; that kind of cleanup is fine across the code. It won't matter too much here, as it'll already generate an exception (correctly).
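A minimal pure-Scala analogue of that cleanup (toy types, not Spark's actual `Attribute` hierarchy): a catch-all case silences the "match may not be exhaustive" warning while still failing loudly on unexpected inputs.

```scala
object ExhaustiveMatchSketch {
  sealed trait Attr
  case class Nominal(numValues: Int) extends Attr
  case object Binary extends Attr
  case object Numeric extends Attr

  def numValues(a: Attr): Int = a match {
    case Nominal(n) => n
    case Binary     => 2
    // Catch-all: turns the compiler's exhaustivity warning into an
    // explicit, descriptive runtime failure for unhandled cases.
    case other =>
      throw new IllegalArgumentException(s"Unexpected attribute: $other")
  }
}
```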
