[SPARK-29914][ML] ML models attach metadata in transform/transformSchema #26547

Closed
wants to merge 14 commits into from

Conversation

zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Nov 15, 2019

What changes were proposed in this pull request?

1. `predictionCol` in `ml.classification` & `ml.clustering`: add `NominalAttribute`
2. `rawPredictionCol` in `ml.classification`: add `AttributeGroup` with vector size = `numClasses`
3. `probabilityCol` in `ml.classification` & `ml.clustering`: add `AttributeGroup` with vector size = `numClasses`/`k`
4. `leafCol` in GBT/RF: add `AttributeGroup` with vector size = `numTrees`
5. `leafCol` in DecisionTree: add `NominalAttribute`
6. `outputCol` of models in `ml.feature`: add `AttributeGroup` with vector size
7. `outputCol` of `UnaryTransformer`s in `ml.feature`: add `AttributeGroup` with vector size
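For illustration, a hedged sketch (not the PR's exact code; the object and method names here are invented) of how the existing `ml.attribute` API attaches such metadata to a prediction column, and how downstream code reads it back:

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.types.StructField

object PredictionMetadataSketch {
  // Build a metadata-bearing field for a prediction column
  // (illustrative; the PR derives numClasses from the fitted model).
  def predictionField(numClasses: Int): StructField =
    NominalAttribute.defaultAttr
      .withName("prediction")
      .withNumValues(numClasses)
      .toStructField()

  // Downstream code can recover numClasses from the schema alone,
  // without running any job over the data.
  def numClassesFrom(field: StructField): Int =
    Attribute.fromStructField(field) match {
      case nominal: NominalAttribute => nominal.getNumValues.getOrElse(-1)
      case _ => -1
    }
}
```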

Why are the changes needed?

The appended metadata can be used in downstream ops, like `Classifier.getNumClasses`.

Many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` already append appropriate metadata in the `transform`/`transformSchema` methods.

However, many other impls return no metadata from transformation, even when metadata like `vector.size`/`numAttrs`/`attrs` can be easily inferred.

Does this PR introduce any user-facing change?

Yes, some metadata is added to the transformed dataset.

How was this patch tested?

Existing test suites and newly added test suites.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113875 has finished for PR 26547 at commit f3a42d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113876 has finished for PR 26547 at commit 5c621bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2019

Test build #113982 has finished for PR 26547 at commit 9aa6ae0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2019

Test build #113994 has finished for PR 26547 at commit efe911f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [SPARK-29914][ML] ML models append metadata in transform/transformSchema [SPARK-29914][ML] ML models attach metadata in transform/transformSchema Nov 19, 2019
@zhengruifeng
Contributor Author

@viirya hi, I noticed that you have done some work on attaching output attributes. Would you like to help review this? Thanks

@zhengruifeng
Contributor Author

also friendly ping @srowen

@zhengruifeng
Contributor Author

This PR aims to attach inferrable attributes to output columns.

Member

@srowen srowen left a comment

It's a big change. Is there any downside? Do any of these take non-trivial extra time to compute and update? Conversely, does adding them help anything else optimize its operation?

Member

@viirya viirya left a comment

Thanks for pinging me. I will be in flight today and cannot review this. I may have time to take a look in the next few days.

Member

@viirya viirya left a comment

This change adds metadata to many classes; is the metadata useful for all of them?


val vectorSize = data.head.size

// Can not infer size of ouput vector, since no metadata is provided
Member

nit: typo ouput

vecSize: Int): Unit = {
import dataframe.sparkSession.implicits._
val group = AttributeGroup.fromStructField(dataframe.schema(vecColName))
assert(group.size === vecSize)
Member

Can we add an error message explaining the failure when this assertion does not hold?

@zhengruifeng
Contributor Author

zhengruifeng commented Nov 21, 2019

@srowen

do any of these take non-trivial extra time to compute and update?

There should be no non-trivial cost in updating the schema, since its logic is simple (similar operations like `withColumn` are widely used), and it should not affect `fit`/`transform` much.

does adding them help anything else optimize its operation?

Some downstream impls in the pipeline will try to use the metadata if provided; otherwise they need to trigger a job, such as a `first` job to get the vector size, or a whole pass over the data to get `numClasses`. Providing more inferrable metadata will help minimize the computation cost of the whole pipeline.
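As a sketch of that trade-off (the helper names here are invented, but `AttributeGroup` is the real API): a negative group size means "unknown", in which case the caller would have to run a job over the data to discover the vector size.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.StructField

object VectorSizeSketch {
  // Prefer the schema metadata; None signals the caller must fall back
  // to an actual Spark job (e.g. a first()) to learn the vector size.
  def sizeFromMetadata(field: StructField): Option[Int] = {
    val group = AttributeGroup.fromStructField(field)
    if (group.size >= 0) Some(group.size) else None
  }

  // Helpers to build vector-typed fields with and without a known size.
  def knownField(n: Int): StructField = new AttributeGroup("features", n).toStructField()
  def unknownField(): StructField = new AttributeGroup("features").toStructField()
}
```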

Thanks for reviewing.

@zhengruifeng
Contributor Author

@viirya Thanks for reviewing!

this change adds metadata to many classes, is metadata useful for them all?

I think it may be nice to provide as much metadata as possible in the output datasets, since downstream impls may use it in some way.
Currently, some impls provide metadata and others do not; I do not think there is a clear criterion for when to attach it.

@srowen
Member

srowen commented Nov 23, 2019

I think the change is OK if it improves consistency.

Comment on lines 138 to 145
val attr = if (numValues == 2) {
BinaryAttribute.defaultAttr
.withName(colName)
} else {
NominalAttribute.defaultAttr
.withName(colName)
.withNumValues(numValues)
}
Member

Not sure about this. Is the numValues == 2 case always a BinaryAttribute? Can't a NominalAttribute have two values?

Contributor Author

Good point.
I found that existing impls like Bucketizer check whether numValues == 2,
so I think it is safe to use only NominalAttribute here.
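For illustration, a small sketch (invented object name, real `ml.attribute` API) showing that a `NominalAttribute` with two values round-trips through a `StructField` on its own, so a separate `BinaryAttribute` branch is not required for correctness:

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}

object TwoValueNominalSketch {
  // Build a two-valued nominal attribute, store it in a StructField,
  // and read the number of values back from the field's metadata.
  def roundTripNumValues(): Option[Int] = {
    val field = NominalAttribute.defaultAttr
      .withName("binarized")
      .withNumValues(2)
      .toStructField()
    Attribute.fromStructField(field) match {
      case n: NominalAttribute => n.getNumValues
      case _ => None
    }
  }
}
```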

}

/**
* Update the metadata of an existing column. If the column does not exist, append it.
Member

This method has two functions, update and overwrite. We should document both here.

def updateField(
schema: StructType,
field: StructField,
overrideMeta: Boolean = true): StructType = {
Member

override or overwrite?

@@ -202,13 +203,23 @@ class DecisionTreeClassificationModel private[ml] (
rootNode.predictImpl(features).prediction
}

@Since("1.4.0")
Member

hmm..3.0.0?

@SparkQA

SparkQA commented Nov 25, 2019

Test build #114399 has finished for PR 26547 at commit a690eb7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2019

Test build #114401 has finished for PR 26547 at commit 3d26d74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

OK by me if you're done and tests pass.

@zhengruifeng
Contributor Author

@srowen Thanks for reviewing!
@viirya I have updated this PR according to your comments; could you please take a glance at it?

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114813 has finished for PR 26547 at commit 3d26d74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114821 has finished for PR 26547 at commit 3eb87f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

Merged to master, thanks all for reviewing!

@zhengruifeng zhengruifeng deleted the add_output_vecSize branch December 4, 2019 08:41
Comment on lines +337 to +339
val attrs: Array[Attribute] = vocabulary.map(_ => new NumericAttribute)
val field = new AttributeGroup($(outputCol), attrs).toStructField()
outputSchema = SchemaUtils.updateField(outputSchema, field)
Member

The vocabulary can be large, for example 1 << 18 entries by default, so we will keep a big attribute array here. Do we actually need this metadata?

Member

Looks like this just moved old code, so I'm just wondering if this will be a problem.

Contributor Author

Sounds reasonable; I think we can change this place to attach only a size. I will send a follow-up.
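A hedged sketch of that follow-up direction (invented object name; `AttributeGroup` has a real size-only constructor): attach only the vector size instead of materializing one attribute object per vocabulary entry, keeping the metadata tiny while still letting downstream stages read the size from the schema.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.StructField

object SizeOnlyGroupSketch {
  // Store only the number of attributes, not 1 << 18 attribute objects.
  def sizeOnlyField(vocabSize: Int): StructField =
    new AttributeGroup("features", vocabSize).toStructField()

  // Downstream stages can still read the vector size back.
  def readSize(field: StructField): Int =
    AttributeGroup.fromStructField(field).size
}
```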

@viirya
Member

viirya commented Dec 4, 2019

Sorry for being late. Looks fine to me.

@zhengruifeng
Contributor Author

@viirya Thanks very much for helping review this PR!

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…Schema`

Closes apache#26547 from zhengruifeng/add_output_vecSize.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
colName: String,
numValues: Int): Unit = {
import dataframe.sparkSession.implicits._
val n = Attribute.fromStructField(dataframe.schema(colName)) match {
Member

Scala compiler prints the warning here:

Warning:(88, 38) match may not be exhaustive.
It would fail on the following inputs: NumericAttribute(), UnresolvedAttribute
    val n = Attribute.fromStructField(dataframe.schema(colName)) match {

Just in case, do we cover all cases?

Member

I think that's all the cases that need to be covered. The warning could be avoided by adding a case that throws an exception; that kind of cleanup is fine across the code. It won't matter too much here, as it'll already generate an exception (correctly).
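A minimal pure-Scala analogue of that cleanup (toy types, not Spark's actual `Attribute` hierarchy): a catch-all case silences the "match may not be exhaustive" warning while still failing loudly on unexpected inputs.

```scala
object ExhaustiveMatchSketch {
  sealed trait Attr
  case class Nominal(numValues: Int) extends Attr
  case object Binary extends Attr
  case object Numeric extends Attr

  def numValues(a: Attr): Int = a match {
    case Nominal(n) => n
    case Binary     => 2
    // Catch-all: turns the compiler's exhaustivity warning into an
    // explicit, descriptive runtime failure for unhandled cases.
    case other =>
      throw new IllegalArgumentException(s"Unexpected attribute: $other")
  }
}
```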
