Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Nov 19, 2019

What changes were proposed in this pull request?

Summarizer support more metrics: sum, std

Why are the changes needed?

Those metrics are widely used, it will be convenient to directly obtain them other than a conversion.
in NaiveBayes: we want the sum of vectors, mean & weightSum need to computed then multiplied
in StandardScaler,AFTSurvivalRegression,LinearRegression,LinearSVC,LogisticRegression: we need to obtain variance and then sqrt it to get std

Does this PR introduce any user-facing change?

yes, new metrics are exposed to end users

How was this patch tested?

added testsuites

@SparkQA
Copy link

SparkQA commented Nov 19, 2019

Test build #114086 has finished for PR 26596 at commit dc5f57b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 19, 2019

Test build #114087 has finished for PR 26596 at commit cf0c043.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 19, 2019

Test build #114090 has finished for PR 26596 at commit 4523533.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
Vectors.dense(realStd)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe reuse common code between std and variance?

@SparkQA
Copy link

SparkQA commented Nov 20, 2019

Test build #114133 has finished for PR 26596 at commit 40be7c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


We provide vector column summary statistics for `Dataframe` through `Summarizer`.
Available metrics are the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
Available metrics are the column-wise max, min, mean, sum, variance, std, squared sum, and number of nonzeros,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see providing the sum maybe, but sum of squares (not squared sum right?)? is that useful? I know it's available as a statistic, but i think the question is what will people expect to see here as compared to say https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html This doesn't provide sum although it provides percentiles, but I don't know if we should compute those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also see providing stddev instead of variance, but now we'd provide both, which is a little redundant, and we don't want to take away variance as it would unnecessarily break backwards compatibility. I don't know if the sqrt is such a big deal to compute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is mainly for convenience:
1, sum: existing summarizer does not provide sum of weights, so if I want to compute sum in NaiveBayes, I need to use two udafs like .agg(sum(w).as("weightSum"), Summarizer.metrics("mean", "count").summary(validateUDF(col($(featuresCol))), w)), and then I need to call BLAS.scal(weightSum, mean)
2, std: I noticed that std is more widely used than variance;
3, weightSum: it is provided in mllib.MultivariateOnlineSummarizer, I donot know why Summarizer do not expose it.

In general I guess that the exisintg OPs of Vectors are not enough, if we can provide new OPs like sqrt,square,abs (like np.sqrt,np.square,np.abs in numpy), then we do not need to add above std,sumL2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sum and std I can understand. Maybe not weightSum? not sure.
What new ops are you imagining here as an alternative? You can already take the sqrt of variance without any new ops.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will update the metrics to only add std and sum.

What new ops are you imagining here as an alternative?

I guess we may add ops sqrt, square in ml.Vector.

@zhengruifeng zhengruifeng force-pushed the summarizer_add_metrics branch from 40be7c6 to dcf792b Compare November 25, 2019 09:38
@SparkQA
Copy link

SparkQA commented Nov 25, 2019

Test build #114400 has finished for PR 26596 at commit dcf792b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
def sum: Vector = {
require(requestedMetrics.contains(Sum))
require(totalWeightSum > 0, s"Nothing has been added to this summarizer.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, do we need to require this? It's valid to ask for the sum over nothing (0) in a way that isn't valid for variance / stdev. It is checked below, but see comments below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should still require(totalWeightSum > 0),
since if no instance is added to the summarizer, then we do not know the size of vector and can not return a valid vector

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the answer is certainly a vector of 0s in this case. It's different from central moments where there is no value when there is no data. I don't feel strongly about it though.

- variance: a vector tha contains the coefficient-wise variance.
- std: a vector tha contains the coefficient-wise standard deviation.
- count: the count of all vectors seen.
- weightSum: the sum of all weights seen.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a supported metric? I don't see it in the scala code above. Same with other new values below.

@SparkQA
Copy link

SparkQA commented Dec 2, 2019

Test build #114697 has finished for PR 26596 at commit 886bdfa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng deleted the summarizer_add_metrics branch December 2, 2019 06:45
@zhengruifeng
Copy link
Contributor Author

Merged to master, thanks all for reviewing!

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
### What changes were proposed in this pull request?
Summarizer support more metrics: sum, std

### Why are the changes needed?
Those metrics are widely used, it will be convenient to directly obtain them other than a conversion.
in `NaiveBayes`: we want the sum of vectors,  mean & weightSum need to computed then multiplied
in `StandardScaler`,`AFTSurvivalRegression`,`LinearRegression`,`LinearSVC`,`LogisticRegression`: we need to obtain `variance` and then sqrt it to get std

### Does this PR introduce any user-facing change?
yes, new metrics are exposed to end users

### How was this patch tested?
added testsuites

Closes apache#26596 from zhengruifeng/summarizer_add_metrics.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants