
[MLLIB] SPARK-2329 Add multi-label evaluation metrics #1270

Closed
wants to merge 10 commits

Conversation

@avulanov (Contributor)

Implementation of various multi-label classification measures, including Hamming loss, strict and default accuracy, document-based and label-based macro-averaged precision, recall, and F1-measure, and micro-averaged measures: https://issues.apache.org/jira/browse/SPARK-2329
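The example-based measures listed above can be sketched on plain Scala collections (no Spark); the function names and the formulas' arrangement here are illustrative only, not the PR's actual API:

```scala
// Illustrative sketch of the example-based multi-label measures,
// on plain Scala collections rather than RDDs. Names are hypothetical.
type Doc = (Set[Double], Set[Double]) // (predictions, labels)

// Default accuracy: mean Jaccard similarity per document.
def accuracy(docs: Seq[Doc]): Double =
  docs.map { case (p, l) => l.intersect(p).size.toDouble / l.union(p).size }.sum / docs.size

// Strict (subset) accuracy: the predicted set must equal the label set exactly.
def strictAccuracy(docs: Seq[Doc]): Double =
  docs.count { case (p, l) => p == l }.toDouble / docs.size

// Hamming loss: symmetric-difference size, normalized by numDocs * numLabels.
def hammingLoss(docs: Seq[Doc], numLabels: Int): Double =
  docs.map { case (p, l) => (l.diff(p).size + p.diff(l).size).toDouble }.sum /
    (docs.size * numLabels)
```

For example, with two documents `(Set(0.0, 1.0), Set(0.0, 2.0))` and `(Set(0.0, 2.0), Set(0.0, 2.0))`, only the second is an exact match, so strict accuracy is 0.5 while default accuracy averages the per-document Jaccard scores 1/3 and 1.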

Multi-class measures are currently in the following pull request: #1155

@AmplabJenkins

Can one of the admins verify this patch?

```scala
 * @return Accuracy.
 */
lazy val accuracy = predictionAndLabels.map { case (predictions, labels) =>
  labels.intersect(predictions).size.toDouble / labels.union(predictions).size }.
```
Review comment (Contributor):

Since intersect is called multiple times in different metrics, how about factoring it out so it is only calculated once?

Review comment (Contributor, Author):

Do you suggest extracting "labels.intersect(predictions).size" as a lazy val? Will it then be calculated only once? The operation is performed on a Scala Set (not on an RDD). Another option might be to store in an RDD all the intermediate calculations (including the intersection) that are used in the six different measures. In that case, I would need to fold over a six-element tuple, which will look kind of scary, but it would be the most efficient way to compute everything at once.

@BaiGang (Contributor) commented Jul 2, 2014

@avulanov Cool, Alexander. Are you working on a multi-label classifier? We are expecting a multi-class, multi-label classifier. I'm planning to implement MultiBoost.MH on Spark, but I'm not sure if you've already started working on it.

@avulanov (Contributor, Author) commented Jul 2, 2014

@BaiGang Thanks! I'm implementing the decomposition of multiclass and multilabel problems into binary classification problems that can be solved with the built-in MLlib classifiers, using one-vs-one and one-vs-all approaches. As far as I understand, MultiBoost.MH is a C++ implementation of AdaBoost.MH, and the latter uses another kind of problem decomposition in addition to boosting, so our efforts are complementary. Let's stay in touch. By the way, I would be glad to benchmark your classifier on the classification tasks that I'm solving.

@BaiGang (Contributor) commented Jul 9, 2014

@avulanov Thanks Alexander! I just started to implement the base learner. The algorithms described in the MultiBoost document and the paper are straightforward, but it will take some effort to implement them optimally on Apache Spark. I will keep you posted when I get the skeleton up.

@mengxr (Contributor) commented Jul 16, 2014

Jenkins, test this please.

@SparkQA commented Jul 16, 2014

QA tests have started for PR 1270. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16709/consoleFull

@SparkQA commented Jul 16, 2014

QA results for PR 1270:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class MultilabelMetrics(predictionAndLabels:RDD[(Set[Double], Set[Double])]) extends Logging{

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16709/consoleFull

@avulanov (Contributor, Author)

@mengxr I've fixed the Scala style.

```scala
 * Evaluator for multilabel classification.
 * @param predictionAndLabels an RDD of (predictions, labels) pairs, both are non-null sets.
 */
class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], Set[Double])]) {
```
Review comment (Contributor):

Another feasible representation of predictions/labels is mllib.linalg.Vector. It's basically a Vector of +1s and -1s, either dense or sparse. So it will be great if we add another function to do the transformation.

It's up to you. Transforming the data outside this evaluation module is also OK. : )

Review comment (Contributor):

RDD[(Set[Double], Set[Double])] may be hard for Java users. We can ask users to input RDD[(Array[Double], Array[Double])], requiring that the labels are ordered. It is easier for Java users and faster to compute intersection and other set operations.

Review comment (Contributor, Author):

@mengxr We need to ensure that they don't contain repeating elements as well. It should be an optional constructor, I think.

Review comment (Contributor):

Both Set and Double are Scala types. It is very hard for Java users to construct such RDDs. Also, the input labels and output predictions are usually stored as Array[Double]. Shall we change the input to RDD[(Array[Double], Array[Double])] and internally convert it to RDD[(Set[Double], Set[Double])] and cache? We can put a contract that both labels and predictions are unique and ordered within a single instance. We don't need it if we use Set internally. But later we can switch to Array[Double] based solution for speed, because those are very small arrays.
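The sorted, duplicate-free Array contract proposed here admits a linear-time set intersection without allocating any Set objects; a hypothetical sketch (this helper is not part of the PR):

```scala
// One linear pass over two sorted, duplicate-free arrays, counting
// common elements. Avoids Set allocation and the GC pressure of boxing.
def sortedIntersectSize(a: Array[Double], b: Array[Double]): Int = {
  var i = 0; var j = 0; var n = 0
  while (i < a.length && j < b.length) {
    if (a(i) == b(j)) { n += 1; i += 1; j += 1 } // common element
    else if (a(i) < b(j)) i += 1                 // advance the smaller side
    else j += 1
  }
  n
}
```

The same two-pointer pattern gives union and difference sizes, which covers all the set operations the metrics need.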

Review comment (Contributor, Author):

@mengxr Can we have RDD[(java.util.HashSet[Double], java.util.HashSet[Double])] as an optional constructor? Internally, we will use scala.collection.JavaConversions.asScalaSet.

Review comment (Contributor):

@avulanov Let's think about what is most natural for the input data to a multi-label classifier and the output from the model it produces. They should match the input type here, so we can chain them easily. If we use either a Java or a Scala Set, we are going to have compatibility issues on the other side. Also, a set stores small objects, which increases GC pressure. These are the reasons I recommend using Array[Double].

@mengxr (Contributor) commented Sep 8, 2014

this is ok to test

@mengxr (Contributor) commented Sep 8, 2014

test this please

@SparkQA commented Sep 8, 2014

QA tests have started for PR 1270 at commit 1843f73.

  • This patch merges cleanly.

@mengxr (Contributor) commented Sep 8, 2014

@avulanov Sorry for getting back late! For the implementation, shall we define an aggregator and then compute all the necessary information in a single pass, instead of triggering a job for each metric?

For the metric names, I think our reference is Mining Multi-label Data and we should follow the naming there:

  1. strictAccuracy -> subsetAccuracy
  2. microPrecisionDoc -> microPrecision (and update other metric names)
  3. add precision, recall, fMeasure and accuracy (example-based)
  4. For per-class metrics, I suggest removing the `Class` suffix and overloading the metric method, as in MulticlassMetrics.
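The single-pass aggregation suggested above can be sketched with an ordinary fold; `Counts` and `aggregate` are hypothetical names, and with Spark this would be an `RDD.aggregate` (or `treeAggregate`) over `predictionAndLabels` rather than a `foldLeft`:

```scala
// Accumulate all per-document counts in one pass; every metric is then
// a cheap function of the final Counts, so no extra jobs are launched.
// Only three of the needed counters are shown, for brevity.
case class Counts(tp: Long, fp: Long, fn: Long)

def aggregate(docs: Seq[(Set[Double], Set[Double])]): Counts =
  docs.foldLeft(Counts(0, 0, 0)) { case (c, (p, l)) =>
    val inter = p.intersect(l).size // computed once per document
    Counts(c.tp + inter, c.fp + (p.size - inter), c.fn + (l.size - inter))
  }
```

Micro-precision is then `tp / (tp + fp)` and micro-recall `tp / (tp + fn)`, computed from the single aggregated result.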

@SparkQA commented Sep 8, 2014

QA tests have finished for PR 1270 at commit 1843f73.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], Set[Double])])
    • logDebug("isMulticlass = " + metadata.isMulticlass)
    • logDebug("isMulticlass = " + metadata.isMulticlass)

@avulanov (Contributor, Author) commented Sep 9, 2014

@mengxr Thank you for comments!

  1. I'll do the renaming you suggested. It's also worth implementing fMeasure with a parameter, as in MulticlassMetrics.
  2. For a single-pass computation, I can aggregate left and right intersections and differences per class and per doc (six in total). Did you mean the same?
  3. Should we discuss [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square feature selection #1484 as well?

@SparkQA commented Sep 10, 2014

QA tests have started for PR 1270 at commit cf4222b.

  • This patch merges cleanly.

@SparkQA commented Sep 10, 2014

QA tests have finished for PR 1270 at commit cf4222b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], Set[Double])])

```scala
 * Returns Hamming-loss
 */
lazy val hammingLoss: Double = (predictionAndLabels.map { case (predictions, labels) =>
  labels.diff(predictions).size + predictions.diff(labels).size }.
```
Review comment (Contributor):

This may be faster: `labels.size + predictions.size - 2 * labels.intersect(predictions).size`
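The identity behind this suggestion, |L \ P| + |P \ L| = |L| + |P| − 2·|L ∩ P|, can be checked on a small example:

```scala
// Symmetric-difference size via two diffs vs. via sizes and one intersect.
val labels      = Set(1.0, 2.0, 3.0)
val predictions = Set(2.0, 3.0, 4.0)
val viaDiff  = labels.diff(predictions).size + predictions.diff(labels).size
val viaSizes = labels.size + predictions.size - 2 * labels.intersect(predictions).size
// Both count the elements in exactly one of the two sets (here: 1.0 and 4.0).
```

The size-based form needs one set operation instead of two, which is what makes it faster.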

```scala
private lazy val numDocs: Long = predictionAndLabels.count

private lazy val numLabels: Long = predictionAndLabels.flatMap { case (_, labels) =>
  labels }.distinct.count
```
Review comment (Contributor):

`predictionAndLabels.values.flatMap(l => l).distinct().count()`

Review comment (Contributor, Author):

@mengxr Could you elaborate on this?

@SparkQA commented Sep 15, 2014

QA tests have started for PR 1270 at commit 517a594.

  • This patch merges cleanly.

@SparkQA commented Sep 15, 2014

QA tests have finished for PR 1270 at commit 517a594.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], Set[Double])])

@mengxr (Contributor) commented Sep 26, 2014

@avulanov I just want to check with you on the input type. I still feel Array[Double] is more suitable than Set[Double]. BitSet may be better for storage, but we want simple types for the Python/Java APIs.

@mengxr (Contributor) commented Oct 30, 2014

@avulanov We are close to the feature freeze deadline. Do you plan to update the PR? If you are busy, do you mind me taking it over? Thanks!

@avulanov (Contributor, Author)

@mengxr It was a busy month for me (I moved to the Bay Area), but I am able to update it this week. I am also currently working on #1290 together with @bgreeven. When is the deadline?

@mengxr (Contributor) commented Oct 30, 2014

Welcome to the Bay Area! The deadline (soft for MLlib) is this Saturday. But since the only thing that needs to change is the input type, it should be trivial to update. Using Array is more consistent with the other APIs, and it also works nicely with the new dataset APIs.

@avulanov (Contributor, Author)

@mengxr Thanks! I've replaced Set with Array, fixed two functions that didn't pass the tests (because union works differently on Arrays), and added a note that the Arrays must have unique elements.
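The union difference mentioned here is easy to see: `union` on Arrays (sequences) concatenates and keeps duplicates, while `union` on Sets deduplicates, so a Set-based formula ported to Arrays needs an explicit `distinct`. A small illustration:

```scala
// Seq/Array union concatenates (duplicates kept); Set union deduplicates.
val arrayUnion = Array(0.0, 1.0).union(Array(1.0, 2.0)) // 0.0, 1.0, 1.0, 2.0
val setUnion   = Set(0.0, 1.0).union(Set(1.0, 2.0))     // 0.0, 1.0, 2.0

// A Set-based denominator like labels.union(predictions).size must become:
val unionSize = Array(0.0, 1.0).union(Array(1.0, 2.0)).distinct.length
```

With the contract that each Array already has unique elements, `distinct` after the concatenation is all that is needed to recover the set semantics.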

@SparkQA commented Oct 31, 2014

Test build #22624 has started for PR 1270 at commit fc8175e.

  • This patch merges cleanly.

@SparkQA commented Oct 31, 2014

Test build #22624 has finished for PR 1270 at commit fc8175e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MultilabelMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])])

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22624/

@mengxr (Contributor) commented Oct 31, 2014

test this please

@SparkQA commented Oct 31, 2014

Test build #22659 has started for PR 1270 at commit fc8175e.

  • This patch merges cleanly.

@SparkQA commented Nov 1, 2014

Test build #22659 has finished for PR 1270 at commit fc8175e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22659/

@mengxr (Contributor) commented Nov 1, 2014

LGTM. Merged into master. Thanks!

@asfgit closed this in 62d01d2 on Nov 1, 2014
@avulanov (Contributor, Author) commented Nov 1, 2014

@mengxr Thank you!
