
[SPARK-39446][MLLIB] Add relevance score for nDCG evaluation #36843

Closed
wants to merge 11 commits

Conversation


@uchiiii uchiiii commented Jun 11, 2022

What changes were proposed in this pull request?

  • Add a relevance score for nDCG evaluation in the `ndcgAt` function.
  • Extend the constructor interface of the `RankingMetrics` class.

Why are the changes needed?

  • The precise definition of nDCG (on Wikipedia) uses a graded relevance score; it is restated below. Currently, the nDCG implementation in Spark MLlib treats relevance as binary (0 or 1). This PR extends the `ndcgAt` function to support graded relevance scores.
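
For reference, the discounted-gain form of the definition (with binary relevance, the numerator reduces to an indicator of membership in the ground-truth set, which is what the current implementation computes):

```latex
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)},
\qquad
\mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
```

where $rel_i$ is the graded relevance of the item at rank $i$, and $\mathrm{IDCG}@k$ is the $\mathrm{DCG}@k$ of the ideal (relevance-sorted) ranking.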

Does this PR introduce any user-facing change?

  • The constructor interface of the `RankingMetrics` class was extended for the `ndcgAt` function: it now accepts either `RDD[(Array[T], Array[T], Array[Double])]` or `RDD[(Array[T], Array[T])]`, whereas previously only `RDD[(Array[T], Array[T])]` was accepted. A usage sketch follows below.
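
A minimal sketch of the extended constructor, assuming a live `SparkContext` named `sc`; the data are made up, and the explicit `[Int]` type argument is shown for clarity:

```scala
import org.apache.spark.mllib.evaluation.RankingMetrics

// Each record: (predicted ranking, ground truth set, relevance of ground truth).
val predictionLabelsAndRelevance = sc.parallelize(Seq(
  (Array(1, 6, 2, 7, 8), Array(1, 2, 3, 4, 5), Array(3.0, 2.0, 1.0, 1.0, 1.0)),
  (Array(4, 1, 5, 6, 2), Array(1, 2, 3), Array(2.0, 0.0, 1.0))
))

val metrics = new RankingMetrics[Int](predictionLabelsAndRelevance)
println(metrics.ndcgAt(5)) // graded nDCG@5
```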

How was this patch tested?

@github-actions github-actions bot added the MLLIB label Jun 11, 2022
@uchiiii uchiiii changed the title [WIP] Add relevance score for nDCG evaluation [WIP][MLLIB] Add relevance score for nDCG evaluation Jun 11, 2022
@uchiiii uchiiii changed the title [WIP][MLLIB] Add relevance score for nDCG evaluation [WIP][SPARK-39446][MLLIB] Add relevance score for nDCG evaluation Jun 11, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@uchiiii uchiiii changed the title [WIP][SPARK-39446][MLLIB] Add relevance score for nDCG evaluation [SPARK-39446][MLLIB] Add relevance score for nDCG evaluation Jun 13, 2022
class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
  extends Logging with Serializable {

class RankingMetrics[T: ClassTag](
    predictionAndLabels: RDD[(Array[T], Array[T], Array[(T, Double)])])
Member:
Hm, why does the last element need to be (T, Double) pairs? Wouldn't this be an attribute of the ground truth? Looking this up via a Map seems clunky later. The problem is not changing the binary signature; but, at the least, how about a third array of Double only, parallel to the second array?

@uchiiii (Contributor Author) Jun 14, 2022

Thank you for your review.
You definitely have a point. I changed the third element from (T, Double) to Double, as shown below.
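
With that revision, the signature would read roughly as follows (a sketch of the shape under discussion, mirroring the class in org.apache.spark.mllib.evaluation, not the final merged code):

```scala
import scala.reflect.ClassTag
import org.apache.spark.internal.Logging
import org.apache.spark.rdd.RDD

// Third array: relevance values of the ground truth, parallel to the second array.
class RankingMetrics[T: ClassTag](
    predictionAndLabels: RDD[(Array[T], Array[T], Array[Double])])
  extends Logging with Serializable
```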

if (i < pred.length && labSet.contains(pred(i))) {
  dcg += gain

predictionAndLabels
  .map {
Member:

Can you revert these changes that made one line into three? The original was better IMHO.

    maxDcg += gain
  }
} else {
  if (i < pred.length) {
Member:

Does this not need labSet.contains(pred(i))?
I would imagine the only difference between these two cases is the computation of gain.

@uchiiii (Contributor Author) Jun 15, 2022

We handle this with relMap.getOrElse(pred(i), 0.0) below.

> I would imagine the only difference is the computation of gain, between these two cases

Basically yes, so we could write it the way you suggest, without getOrElse.
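
A hedged sketch of that graded branch as a standalone helper (not the exact merged code): the getOrElse lookup plays the role of the binary branch's contains() check, since predicted items absent from the ground truth get relevance 0.0 and therefore zero gain:

```scala
// Graded DCG@k over a predicted ranking; relMap maps ground-truth items
// to their relevance values. The log base cancels in the nDCG ratio.
def dcgAt[T](pred: Array[T], relMap: Map[T, Double], k: Int): Double = {
  var dcg = 0.0
  var i = 0
  while (i < math.min(pred.length, k)) {
    // rel = 0.0 yields a gain of 2^0 - 1 = 0, so misses contribute nothing
    dcg += (math.pow(2.0, relMap.getOrElse(pred(i), 0.0)) - 1) / math.log(i + 2)
    i += 1
  }
  dcg
}
```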

  countRelevantItemRatio(pred, lab, k, k)
}.mean()

predictionAndLabels.map { case (pred, lab, _) =>
  countRelevantItemRatio(pred, lab, k, k)
Member:

Unindent - just match how the code was

Contributor Author:

Fixed!

@@ -35,8 +35,15 @@ import org.apache.spark.rdd.RDD
* @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
*/
@Since("1.2.0")
Member:

Oh, we need to update the @Since tags. I think you can somehow add a @Since annotation to the default constructor args? as it's since 3.4.0. I'm not sure exactly where it goes. The old constructor can remain since 1.2.0. If that doesn't work, maybe we can leave this constructor and add the new one as a new def this(...) since 3.4.0?
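
One placement Scala does accept is annotating the primary constructor itself, between the type parameters and the parameter list. A hedged sketch of what that could look like here, assuming the three-array parameter shape discussed above (this mirrors Spark-internal code, since Logging and Since are spark-private):

```scala
import scala.reflect.ClassTag
import org.apache.spark.annotation.Since
import org.apache.spark.internal.Logging
import org.apache.spark.rdd.RDD

// @Since on the primary constructor, keeping the class itself at 1.2.0.
class RankingMetrics[T: ClassTag] @Since("3.4.0") (
    predictionAndLabels: RDD[(Array[T], Array[T], Array[Double])])
  extends Logging with Serializable
```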

Contributor Author:

I made changes and updated the docs.

Contributor:

I think that we should refer to BinaryClassificationMetrics and MulticlassMetrics, in which RDD[_ <: Product] was used as the input.

Contributor:

Since end users now mainly use the .ml package, is there any plan to expose this function in .ml?

Contributor Author:

Sorry in advance for my poor understanding. I have some questions.

  • What is .ml? Do you mean org.apache.spark.ml?
  • Is RDD[_ <: Product] used to make the MulticlassMetrics class available to .ml?

Member:

For the first one - yeah, this change is in the 'older' .mllib package. I don't think there is an equivalent for it in the DataFrame-based .ml package, so maybe we can ignore that here. But if nDCG is supported somewhere in the .ml package and I forgot it, it would be good to add it there too.

The declaration suggested here might actually work for both input types without a separate constructor. Try it, maybe? If it works, yes, that is simpler, and it lets this API support even more inputs.
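
A hedged sketch of that single-constructor idea: accept `RDD[_ <: Product]` (as BinaryClassificationMetrics and MulticlassMetrics do) and normalize each record to a triple, defaulting the relevance array to empty for pairs. This illustrates the approach, not the merged code:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[_ <: Product])
  extends Serializable {

  // Normalize both accepted shapes to (pred, labels, relevance).
  // The ClassTag makes the Array[T] type tests checkable at runtime.
  private val records: RDD[(Array[T], Array[T], Array[Double])] =
    predictionAndLabels.map {
      case (pred: Array[T], lab: Array[T]) =>
        (pred, lab, Array.empty[Double])
      case (pred: Array[T], lab: Array[T], rel: Array[Double]) =>
        (pred, lab, rel)
      case other =>
        throw new IllegalArgumentException(s"Expected a pair or a triple, got: $other")
    }
}
```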

Contributor Author:

Thank you for your explanation!

Could you review this?
#36920

@srowen srowen closed this in 8120899 Jun 16, 2022
zhengruifeng pushed a commit that referenced this pull request Jun 29, 2022
…thon

### What changes were proposed in this pull request?
- Updated `RankingMetrics` for Java and Python
- Modified the interface for Java and Python
- Added test for Java

### Why are the changes needed?

- To expose the change in #36843 to Java and Python.
- To update the document for Java and Python.

### Does this PR introduce _any_ user-facing change?

- Java users can use a JavaRDD of (predicted ranking, ground truth set, relevance value of ground truth set) for `RankingMetrics`

### How was this patch tested?
- Added test for Java

Closes #37019 from uchiiii/modify_ranking_metrics_for_java_and_python.

Authored-by: uchiiii <uchikun17@gmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
srowen pushed a commit that referenced this pull request Jul 28, 2023
…cgAk

### What changes were proposed in this pull request?

This PR fixes the condition for raising the following warning in MLlib's RankingMetrics `ndcgAt` function: "# of ground truth set and # of relevance value set should be equal, check input data"

The warning logic is currently faulty: it raises a warning if the `rel` input is empty and `lab.size` and `rel.size` are not equal.

It should instead raise a warning if the `rel` input is **not empty** and `lab.size` and `rel.size` are not equal, as sketched below.
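
A minimal sketch of the corrected guard, assuming the surrounding class mixes in Spark's Logging trait and that `lab`/`rel` are the per-record arrays named above:

```scala
// Warn only when relevance values were supplied but do not line up
// with the ground-truth labels; an empty rel means binary mode.
if (rel.nonEmpty && lab.length != rel.length) {
  logWarning(
    "# of ground truth set and # of relevance value set should be equal, " +
      "check input data")
}
```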

This warning was added in the following PR: #36843

### Why are the changes needed?

With the current logic, RankingMetrics will:
- raise an incorrect warning when the user runs it in "binary" mode (i.e. no relevance values in the input)
- fail to raise a potentially necessary warning when the user runs it in "non-binary" mode (i.e. with relevance values in the input)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No change made to the test suite for RankingMetrics: https://github.com/uchiiii/spark/blob/a172172329cc78b50f716924f2a344517deb71fc/mllib/src/test/scala/org/apache/spark/mllib/evaluation/RankingMetricsSuite.scala

Closes #42207 from guilhem-depop/patch-1.

Authored-by: Guilhem Vuillier <101632595+guilhem-depop@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
srowen pushed a commit that referenced this pull request Jul 28, 2023
…cgAk
srowen pushed a commit that referenced this pull request Jul 28, 2023
…cgAk
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
…cgAk
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
…cgAk