-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-29959][ML][PYSPARK] Summarizer support more metrics #26596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #114086 has finished for PR 26596 at commit
|
|
Test build #114087 has finished for PR 26596 at commit
|
|
Test build #114090 has finished for PR 26596 at commit
|
| } | ||
| Vectors.dense(realStd) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe reuse common code between std and variance?
|
Test build #114133 has finished for PR 26596 at commit
|
docs/ml-statistics.md
Outdated
|
|
||
| We provide vector column summary statistics for `Dataframe` through `Summarizer`. | ||
| Available metrics are the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count. | ||
| Available metrics are the column-wise max, min, mean, sum, variance, std, squared sum, and number of nonzeros, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see providing the sum maybe, but sum of squares (not squared sum right?)? is that useful? I know it's available as a statistic, but i think the question is what will people expect to see here as compared to say https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html This doesn't provide sum although it provides percentiles, but I don't know if we should compute those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can also see providing stddev instead of variance, but now we'd provide both, which is a little redundant, and we don't want to take away variance as it would unnecessarily break backwards compatibility. I don't know if the sqrt is such a big deal to compute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is mainly for convenience:
1, sum: existing summarizer does not provide sum of weights, so if I want to compute sum in NaiveBayes, I need to use two udafs like .agg(sum(w).as("weightSum"), Summarizer.metrics("mean", "count").summary(validateUDF(col($(featuresCol))), w)), and then I need to call BLAS.scal(weightSum, mean)
2, std: I noticed that std is more widely used than variance;
3, weightSum: it is provided in mllib.MultivariateOnlineSummarizer, I donot know why Summarizer do not expose it.
In general I guess that the exisintg OPs of Vectors are not enough, if we can provide new OPs like sqrt,square,abs (like np.sqrt,np.square,np.abs in numpy), then we do not need to add above std,sumL2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sum and std I can understand. Maybe not weightSum? not sure.
What new ops are you imagining here as an alternative? You can already take the sqrt of variance without any new ops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will update the metrics to only add std and sum.
What new ops are you imagining here as an alternative?
I guess we may add ops sqrt, square in ml.Vector.
40be7c6 to
dcf792b
Compare
|
Test build #114400 has finished for PR 26596 at commit
|
| */ | ||
| def sum: Vector = { | ||
| require(requestedMetrics.contains(Sum)) | ||
| require(totalWeightSum > 0, s"Nothing has been added to this summarizer.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, do we need to require this? It's valid to ask for the sum over nothing (0) in a way that isn't valid for variance / stdev. It is checked below, but see comments below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should still require(totalWeightSum > 0),
since if no instance is added to the summarizer, then we do not know the size of vector and can not return a valid vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the answer is certainly a vector of 0s in this case. It's different from central moments where there is no value when there is no data. I don't feel strongly about it though.
python/pyspark/ml/stat.py
Outdated
| - variance: a vector tha contains the coefficient-wise variance. | ||
| - std: a vector tha contains the coefficient-wise standard deviation. | ||
| - count: the count of all vectors seen. | ||
| - weightSum: the sum of all weights seen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a supported metric? I don't see it in the scala code above. Same with other new values below.
|
Test build #114697 has finished for PR 26596 at commit
|
|
Merged to master, thanks all for reviewing! |
### What changes were proposed in this pull request? Summarizer support more metrics: sum, std ### Why are the changes needed? Those metrics are widely used, it will be convenient to directly obtain them other than a conversion. in `NaiveBayes`: we want the sum of vectors, mean & weightSum need to computed then multiplied in `StandardScaler`,`AFTSurvivalRegression`,`LinearRegression`,`LinearSVC`,`LogisticRegression`: we need to obtain `variance` and then sqrt it to get std ### Does this PR introduce any user-facing change? yes, new metrics are exposed to end users ### How was this patch tested? added testsuites Closes apache#26596 from zhengruifeng/summarizer_add_metrics. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
What changes were proposed in this pull request?
Summarizer support more metrics: sum, std
Why are the changes needed?
Those metrics are widely used, it will be convenient to directly obtain them other than a conversion.
in
NaiveBayes: we want the sum of vectors, mean & weightSum need to computed then multipliedin
StandardScaler,AFTSurvivalRegression,LinearRegression,LinearSVC,LogisticRegression: we need to obtainvarianceand then sqrt it to get stdDoes this PR introduce any user-facing change?
yes, new metrics are exposed to end users
How was this patch tested?
added testsuites