
[SPARK-4362] [MLLIB] Make prediction probability available in NaiveBayesModel #6761

Closed
wants to merge 9 commits

Conversation

acidghost

There is currently no way to get the posterior probability of a prediction with the Naive Bayes model. This PR makes it available along with the predicted label.

@jkbradley @srowen

@srowen
Member

srowen commented Jun 11, 2015

Ok to test

@acidghost
Author

Is it normal that the test hasn't started yet?

@srowen
Member

srowen commented Jun 12, 2015

No, I triggered it manually. There may still be a problem with the PR builder.

@SparkQA

SparkQA commented Jun 12, 2015

Test build #900 timed out for PR 6761 at commit 7f53d08 after a configured wait of 175m.

@acidghost
Author

@srowen I can't figure out why the tests failed. I don't think my PR is the cause of it.

@srowen
Member

srowen commented Jun 12, 2015

Yes, look at the PR builder queue. I think something's wrong with it.

@acidghost
Author

Yep, build 901 failed on the same test suite with java.lang.OutOfMemoryError: Java heap space.

@acidghost
Author

@srowen Should somebody trigger a new test?

@SparkQA

SparkQA commented Jun 12, 2015

Test build #902 timed out for PR 6761 at commit 7f53d08 after a configured wait of 175m.

@acidghost
Author

@srowen Is the PR builder working again?

@srowen
Member

srowen commented Jun 14, 2015

@acidghost you can click the links to see for yourself

@srowen
Member

srowen commented Jun 14, 2015

Jenkins, retest this please

@acidghost
Author

Sorry for bothering you, but I'm new to Jenkins and I don't really know how it works internally.

@SparkQA

SparkQA commented Jun 14, 2015

Test build #34891 has finished for PR 6761 at commit 7f53d08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      case Bernoulli =>
        val prob = bernoulliCalculation(testData)
        posteriorProbabilities(prob)
      case _ =>
Member


Will this generate a compiler warning without this case, or does it already prove there aren't other possibilities? In that case I think you can let it produce a match error if it ever somehow hits that case. UnknownError seems too strong, since it's at best an IllegalStateException or something, not a VM error.
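For illustration only (this is not the PR's code, and the type names below are made up for the sketch): with a sealed type the compiler checks exhaustiveness, and an impossible value at runtime surfaces as a MatchError rather than a hand-thrown UnknownError.

  sealed trait ModelType
  case object Multinomial extends ModelType
  case object Bernoulli extends ModelType

  def calculation(modelType: ModelType): String = modelType match {
    case Multinomial => "multinomial calculation"
    case Bernoulli   => "bernoulli calculation"
    // No catch-all needed: the compiler checks exhaustiveness over the sealed trait,
    // and an unexpected value at runtime would raise scala.MatchError on its own.
  }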

Author


I didn't write that code, so I didn't make any decision about it, but now that you've highlighted it I think you're right. Removing the last case doesn't generate any compiler warning or error, so I just removed it.

I don't get any warning from my IDE, but running sbt mllib/compile gives me the following errors:

[info] Compiling 15 Java sources to /media/SB-1TB/workarea/apache-spark/unsafe/target/scala-2.10/classes...
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java:20: error: cannot find symbol
[error] import javax.annotation.Nullable;
[error]                        ^
[error]   symbol:   class Nullable
[error]   location: package javax.annotation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java:20: error: cannot find symbol
[error] import javax.annotation.Nullable;
[error]                        ^
[error]   symbol:   class Nullable
[error]   location: package javax.annotation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/ExecutorMemoryManager.java:24: error: package javax.annotation.concurrent does not exist
[error] import javax.annotation.concurrent.GuardedBy;
[error]                                   ^
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java:37: error: cannot find symbol
[error]   MemoryBlock(@Nullable Object obj, long offset, long length) {
[error]                ^
[error]   symbol:   class Nullable
[error]   location: class MemoryBlock
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java:28: error: cannot find symbol
[error]   @Nullable
[error]    ^
[error]   symbol:   class Nullable
[error]   location: class MemoryLocation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java:33: error: cannot find symbol
[error]   public MemoryLocation(@Nullable Object obj, long offset) {
[error]                          ^
[error]   symbol:   class Nullable
[error]   location: class MemoryLocation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/ExecutorMemoryManager.java:42: error: cannot find symbol
[error]   @GuardedBy("this")
[error]    ^
[error]   symbol:   class GuardedBy
[error]   location: class ExecutorMemoryManager
[error] 7 errors
[error] (unsafe/compile:compile) javac returned nonzero exit code
[error] Total time: 1 s, completed Jun 15, 2015 8:00:11 PM

@acidghost
Author

Okay, I committed the updated changes. I'm a bit disappointed by the results I'm getting in my own project, so I hope someone else can test this functionality as well.

@srowen
Member

srowen commented Jun 16, 2015

Re: Infinity, when do you see that? If you subtract the largest (least negative) log probability from everything then the largest log prob is 0 so its exp is 1.
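As a minimal sketch of that normalization (illustrative only, not the PR's code): subtract the maximum log probability before exponentiating, then divide by the sum.

  // Sketch only: turn an array of log probabilities into posteriors without overflow.
  def posteriorsFromLogProbs(logProbs: Array[Double]): Array[Double] = {
    val maxLog = logProbs.max                                    // the least negative entry
    val unnormalized = logProbs.map(lp => math.exp(lp - maxLog)) // largest becomes exp(0) = 1
    val total = unnormalized.sum
    unnormalized.map(_ / total)                                  // posteriors now sum to 1
  }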

@acidghost
Author

Sorry @srowen, I was erroneously subtracting the smallest one.

Thank you @squito, my main error was scaling using multiplication instead of addition in log space, as @srowen already pointed out.

Something else I was missing is that Naive Bayes has some tendency to "overestimate" its confidence. Is this a characteristic of generative algorithms?

Thanks @srowen for the patience and time you're putting into this.

Do you have any clue on how to unit test this?

@@ -19,6 +19,8 @@ package org.apache.spark.mllib.classification

import java.lang.{Iterable => JIterable}

import breeze.linalg.{max, min}
Member


No longer needed?

@srowen
Member

srowen commented Jun 16, 2015

Yeah, a unit test would be good. I imagine you can add calls to the new methods in existing tests? Just verify that the posteriors are correct, or sane?

@acidghost
Author

By verifying that the posterior probabilities are correct, do you mean concordant with the predict method's results?

And what do you mean by sane? That they should approximately sum to 1, and with what accepted error?

@srowen
Member

srowen commented Jun 16, 2015

Yes, would be good to check that the predicted class is where the max probability occurs, that they sum to 1 too. I don't know if there's a standard epsilon but use the surrounding code as a guideline to what seems like the general practice for this test. In some cases, maybe the toy example in a test means you know exactly what the probabilities should be and you can verify them directly, because you do want to check that they're not just some set of numbers that happens to sum to 1 and is in the right order. If that's not easy maybe try to make a really trivial example model where the output probabilities are easy to know.
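A rough sketch of those checks against this PR's RDD-based API (the validationData fixture and the 1e-8 tolerance are assumptions, not existing code):

  val features = validationData.map(_.features)                    // RDD[Vector] from LabeledPoints
  val predictions = model.predict(features).collect()
  val posteriors = model.predictProbabilities(features).collect()  // Map[Double, Double] per example
  predictions.zip(posteriors).foreach { case (predicted, probs) =>
    assert(math.abs(probs.values.sum - 1.0) < 1e-8)                // posteriors sum to ~1
    assert(probs.maxBy(_._2)._1 == predicted)                      // argmax agrees with predict()
  }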

}
}

def predictProbabilities(testData: RDD[Vector]): RDD[Map[Double, Double]] = {
Member


I'd still prefer a Vector instead of a Map. That will make this method more Java-friendly and should (I think) also be more efficient in terms of object creation.

Author


So, in the end, should I make it return a Vector or a Map?

The labels are not in the right order, so I need to map them to array indices.

Member


I would go with @jkbradley's suggestion here.
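A hedged sketch of that Vector-returning shape (the classPosteriorsFor helper is hypothetical; only the signature reflects the suggestion): the i-th entry corresponds to model.labels(i), so callers map labels to indices once instead of building a Map per example.

  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.rdd.RDD

  // Sketch only: return class probabilities as a Vector ordered like model.labels.
  def predictProbabilities(testData: RDD[Vector]): RDD[Vector] =
    testData.map { row =>
      val posteriors = classPosteriorsFor(row)   // hypothetical helper: Array[Double] in label order
      Vectors.dense(posteriors)
    }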

@jkbradley
Member

@acidghost Thanks for the updated PR! I'll try to make a closer pass soon. In the meantime, unit tests do sound great. I think one of the best ways to test correctness would be to compare with another library (ideally R or sklearn) on a tiny dataset. See e.g. the test for LogisticRegression: [https://github.com/apache/spark/blob/0fc4b96f3e3bf81724ac133a6acc97c1b77271b4/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala]

@acidghost
Author

@srowen @jkbradley now we have a "sanity" unit test. What is missing is testing that the probabilities have the correct values.

@acidghost
Author

I'm checking correctness by comparing the posterior probabilities with the ones given by e1071 in R.

The problem is that even with an epsilon of 0.1 the test fails with errors like:
0.9999997639182312 was greater than 0.7099445, but 0.9999997639182312 was not less than 0.9099444999999999

So an epsilon of 0.1 is not enough; the smallest working epsilon is 0.31. Any suggestions @srowen @jkbradley?

@srowen
Member

srowen commented Jun 22, 2015

@acidghost the epsilon shouldn't be anywhere near as large as 0.1 for a probability. This looks like a problem with the test then, as it clearly expects ~0.81 but the answer is very near 1. Are you sure the output from R is what you intend, and the result is the probability you are trying to compare to?

val sum = probabilities.sum
// Check that prediction probabilities sum up to one
// with an epsilon of 10^-2
assert(sum > 0.99 && sum < 1.01)
Member


We have a syntax like assert(sum ~== 1.0 relTol 0.01) for this. I think the bound should be tighter here. I would not expect probabilities to sum to 1.001 or 1.0001.
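For reference, the approximate-equality helper referred to above lives in MLlib's test utilities; usage would be along these lines (the exact tolerance is a choice for this test, not a project rule):

  import org.apache.spark.mllib.util.TestingUtils._

  val sum = probabilities.sum
  assert(sum ~== 1.0 relTol 1e-8)   // far tighter than the 0.99 .. 1.01 window above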

@srowen
Member

srowen commented Jun 22, 2015

Your R code looks good to me, just from scanning it. I wonder if somehow the examples aren't in the same order in both cases? Or the classes aren't in the same order? Maybe that would explain the failure.

@acidghost
Author

Indeed, I used the wrong data in the R scripts: I used the training data also for predictions, and that is not the same data used by the validation methods in the suite.

I'm going to rewrite the R script right now.

@acidghost
Author

@srowen I updated the R scripts to use the train & test data.

Now, with an epsilon of 0.1, the multinomial test passes and the Bernoulli test fails with: 0.5280344109727607 was not greater than 0.731882

EDIT: something is wrong with the order of the examples: in both multinomial and Bernoulli I have the same examples with wrong predictions!

@acidghost
Author

I just tried saving all the predictions from R to a file and then loading them in Spark to test against, but almost all of the Spark predictions disagree with the R ones.

@srowen
Member

srowen commented Jun 22, 2015

Have you ruled out the example ordering? It looks like the Spark test does produce the same examples deterministically, but it might be worth just printing what it's using to make sure it matches your expectation.

Also, how did you generate the Bernoulli results? The e1071 implementation looks like it's only doing Multinomial, but I am not an expert on that package. Is the Multinomial result correct? Then I'd expect that maybe you're not actually getting the results of a true Bernoulli model from this R package; Bernoulli isn't just Multinomial with counts > 1 mapped to 1.

@acidghost
Author

@srowen I wasn't able to find anything more about e1071, so I opted for scikit-learn, which features both multinomial and Bernoulli models (link).

This is the Python code I'm using:

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
import pandas as pd
import numpy as np

multi = MultinomialNB()

multi_train = pd.read_csv('multinomial.data.train', header = -1)
multi_train_labels = multi_train.iloc[:, 0]
multi_train_features = multi_train.iloc[:, 1:]

multi_model = multi.fit(multi_train_features, multi_train_labels)

multi_test = pd.read_csv('multinomial.data.test', header = -1)
multi_test_labels = multi_test.iloc[:, 0]
multi_test_features = multi_test.iloc[:, 1:]
multi_pred = multi_model.predict(multi_test_features)

print("Multinomial:\nNumber of mislabeled points out of a total %d points : %d" % (len(multi_test),(multi_test_labels != multi_pred).sum()))

multi_probs = multi_model.predict_proba(multi_test_features)
np.savetxt('multinomial.probs', multi_probs, delimiter=" ")



bernoulli = BernoulliNB()

bernoulli_train = pd.read_csv('bernoulli.data.train', header = -1)
bernoulli_train_labels = bernoulli_train.iloc[:, 0]
bernoulli_train_features = bernoulli_train.iloc[:, 1:]

bernoulli_model = bernoulli.fit(bernoulli_train_features, bernoulli_train_labels)

bernoulli_test = pd.read_csv('bernoulli.data.test', header = -1)
bernoulli_test_labels = bernoulli_test.iloc[:, 0]
bernoulli_test_features = bernoulli_test.iloc[:, 1:]
bernoulli_pred = bernoulli_model.predict(bernoulli_test_features)

print("Bernoulli:\nNumber of mislabeled points out of a total %d points : %d" % (len(bernoulli_test),(bernoulli_test_labels != bernoulli_pred).sum()))

bernoulli_probs = bernoulli_model.predict_proba(bernoulli_test_features)
np.savetxt('bernoulli.probs', bernoulli_probs, delimiter=" ")

These are the results from the tests:

  • the Bernoulli test passes (the predictions are all the same as scikit-learn's),
  • for the multinomial I get different results at every run: on average 20 examples out of 1000 have different predictions, and on average 85 probability pairs out of 1000 * 3 are discordant.

The strangest thing is that I get different values at every run! For example, in one run 16 examples diverge and in another 24 (and they are not even the same examples). I am sure that I'm using the right data and that it is in the right order (otherwise the fact that the Bernoulli test passes would be even stranger).

@srowen
Member

srowen commented Jun 25, 2015

So you get different results in scikit-learn? Between R and scikit-learn, do you have a stable result for multinomial and Bernoulli respectively? And do they match MLlib? Then that's probably good enough.

If it's still not working I think we can hand-craft a very simple problem where the answer can just be calculated by hand.

Thanks for your hard work on developing a test for this. It is definitely additive.
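For example, a hand-computable multinomial case might look like this (all numbers are made up for illustration, not taken from the PR's test data): two classes with equal priors, two features, scored for the input (1, 0).

  // Two classes with equal priors; per-class multinomial feature log-probabilities.
  val pi    = Array(math.log(0.5), math.log(0.5))
  val theta = Array(Array(math.log(0.8), math.log(0.2)),   // class 0
                    Array(math.log(0.3), math.log(0.7)))   // class 1
  val x     = Array(1.0, 0.0)                              // one count of feature 0

  val logUnnormalized = pi.indices.map { c =>
    pi(c) + theta(c).zip(x).map { case (t, xi) => t * xi }.sum
  }
  val unnormalized = logUnnormalized.map(math.exp)          // (0.4, 0.15)
  val posteriors   = unnormalized.map(_ / unnormalized.sum) // (0.7272..., 0.2727...)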

@acidghost
Author

I found that e1071 uses a Gaussian distribution (page 34), so I wouldn't use the results from that package.

The MLlib prediction tests (summing to one and more than 80% correct predictions) both pass for Bernoulli and Multinomial.

Comparing the scikit-learn and MLlib probabilities, I have a stable result (everything matches) only with the Bernoulli model; with the Multinomial I get different results at every run. If I could use another library to compute the probabilities, I would compare those with the MLlib ones, as you suggest. Do you know of any with both Bernoulli and Multinomial models?

Anyway, it is strange that only the Multinomial results are wrong. Might it be that the data generation function for Multinomial data is more random? Or is it the prediction algorithm?

@srowen
Member

srowen commented Jun 26, 2015

OK, maybe this is getting too far down a rabbit hole to hard-code results from an implementation that we're not 100% sure is saying what we want. Maybe it's simpler to just directly compute in the test what the probabilities should be, given the model.

For example, in the case of Multinomial, you have this vector pi of C values, and this matrix theta with C rows and D columns. The log probability of class 0 is the sum of pi(0) and the dot product of row 0 of theta with your data; e raised to that sum gives a final unnormalized probability for class 0. Then the unnormalized probabilities over all classes are normalized to sum to 1. Those results ought to be very close to the output of the model -- since it's what the model computes almost word for word!

In that sense it almost feels redundant, but it is encoding the definition of the prediction in the test, which is appropriate. Later, if the implementation changes, the test is still checking against the naive, straightforward computation.

For Bernoulli, it's similar, except that you're adding pi(0), then adding the elements of theta where the input is 1, but log(1 - exp(theta)) where the input is 0.

I realize that's not a great description, so I can assist in writing this part if it would help.

Also, I noticed a potential tiny inaccuracy in how the Naive Bayes Bernoulli computation works: math.log(1.0 - math.exp(value)) becomes inaccurate when value is very negative, and math.log1p(-math.exp(value)) is more accurate in that case. It could matter at some level if we're asserting on the exact probability, and probabilities are often tiny in the output.
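A quick way to see the difference (illustrative values only):

  val v = -40.0                       // a very negative log probability
  val naive  = math.log(1.0 - math.exp(v))   // 0.0: 1.0 - exp(v) rounds to exactly 1.0
  val stable = math.log1p(-math.exp(v))      // ≈ -4.25e-18: the tiny value survives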

@srowen
Member

srowen commented Jun 30, 2015

@acidghost would it help if I created a rough draft of the code I'm sketching above, which maybe you can bake into your test? With any luck that's it, and it verifies the result.

@acidghost
Author

@srowen Yeah, sorry for the scarce dedication over the last few days; a draft of the idea you're presenting would really help me! As you can imagine, this test was becoming not so pleasant to work with, so your further help is greatly appreciated!

As soon as you get back to me I'll try to make it fit into the test.

@srowen
Member

srowen commented Jun 30, 2015

Here is roughly the code to compute expected class probabilities for a given piece of input:

  // (BDV and BDM here are breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM}.)

  def expectedMultinomialProbabilities(model: NaiveBayesModel, testData: Vector) = {
    val piVector = new BDV(model.pi)
    // model.theta is row-major; treat it as the col-major representation of its transpose, and transpose:
    val thetaMatrix = new BDM(model.theta(0).length, model.theta.length, model.theta.flatten).t
    val logClassProbs = piVector + (thetaMatrix * testData.toBreeze)
    logClassProbs.toArray.map(math.exp)
  }

  def expectedBernoulliProbabilities(model: NaiveBayesModel, testData: Vector) = {
    val piVector = new BDV(model.pi)
    val thetaMatrix = new BDM(model.theta(0).length, model.theta.length, model.theta.flatten).t
    val negThetaMatrix = new BDM(model.theta(0).length, model.theta.length,
      model.theta.flatten.map(v => math.log(1.0 - math.exp(v)))).t
    val testBreeze = testData.toBreeze
    val negTestBreeze = new BDV(Array.fill(testBreeze.size)(1.0)) - testBreeze
    val logClassProbs = piVector + (thetaMatrix * testBreeze) + (negThetaMatrix * negTestBreeze)
    logClassProbs.toArray.map(math.exp)
  }

I intentionally approached it differently from the implementation in NaiveBayesModel; this is probably more straightforward but less performant, but that's the point of a cross-check in a test I suppose.
I haven't tested it; maybe I can test my test tomorrow. It should at least compile and be 90+% there.
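One way such helpers could then be used in the suite (a sketch under assumptions: a per-Vector predictProbabilities overload, a validationData collection of LabeledPoints, and the chosen tolerance are all mine, not the PR's):

  import org.apache.spark.mllib.util.TestingUtils._

  validationData.foreach { case LabeledPoint(_, features) =>
    val expected = expectedMultinomialProbabilities(model, features)
    val expectedPosteriors = expected.map(_ / expected.sum)      // normalize to sum to 1
    val actual = model.predictProbabilities(features).toArray    // assumed per-Vector overload
    expectedPosteriors.zip(actual).foreach { case (e, a) => assert(e ~== a relTol 1e-8) }
  }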

@jkbradley
Member

@acidghost Sorry for the delay, just now catching up! One quick comment: should epsilon be set to 0? It looks like e1071 disables smoothing by default. If @srowen's code does not fix the test issue, I can take a closer look. Thanks for the careful testing!

@srowen
Member

srowen commented Jul 8, 2015

@acidghost are you still working on this? I can bring it home if you're occupied otherwise. Credit to you for the change, of course.

@acidghost
Author

@srowen not at the moment, I'm having pretty busy days at work. If you could take care of this PR, it would be much appreciated!

@srowen
Member

srowen commented Jul 13, 2015

@acidghost take a look at #7376. You can close this PR.

@acidghost acidghost closed this Jul 14, 2015
asfgit pushed a commit that referenced this pull request Jul 14, 2015
…yesModel

Add predictProbabilities to Naive Bayes, return class probabilities.

Continues #6761

Author: Sean Owen <sowen@cloudera.com>

Closes #7376 from srowen/SPARK-4362 and squashes the following commits:

23d5a76 [Sean Owen] Fix model.labels -> model.theta
95d91fb [Sean Owen] Check that predicted probabilities sum to 1
b32d1c8 [Sean Owen] Add predictProbabilities to Naive Bayes, return class probabilities