
[SPARK-4362] [MLLIB] Make prediction probability available in NaiveBayesModel #6761

Closed
wants to merge 9 commits

Conversation

acidghost

There is currently no way to get the posterior probability of a prediction with the Naive Bayes model. This PR makes it available along with the predicted label.

@jkbradley @srowen

@srowen
Member

srowen commented Jun 11, 2015

Ok to test

@acidghost
Author

Is it normal that the test hasn't started yet?

@srowen
Member

srowen commented Jun 12, 2015

No, I triggered it manually. There may still be a problem with the PR builder.

@SparkQA

SparkQA commented Jun 12, 2015

Test build #900 timed out for PR 6761 at commit 7f53d08 after a configured wait of 175m.

@acidghost
Author

@srowen I can't figure out why the tests failed. I don't think my PR is the cause of it.

@srowen
Member

srowen commented Jun 12, 2015

Yes, look at the PR builder queue. I think something's wrong with it.

@acidghost
Author

Yep, build 901 failed on the same test suite with java.lang.OutOfMemoryError: Java heap space.

@acidghost
Author

@srowen Should somebody trigger a new test?

@SparkQA

SparkQA commented Jun 12, 2015

Test build #902 timed out for PR 6761 at commit 7f53d08 after a configured wait of 175m.

@acidghost
Author

@srowen Is the PR builder working again?

@srowen
Member

srowen commented Jun 14, 2015

@acidghost you can click the links to see for yourself

@srowen
Member

srowen commented Jun 14, 2015

Jenkins, retest this please

@acidghost
Author

Sorry for bothering you, but I'm new to Jenkins and I don't really know how it works internally.

@SparkQA

SparkQA commented Jun 14, 2015

Test build #34891 has finished for PR 6761 at commit 7f53d08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      case Bernoulli =>
        val prob = bernoulliCalculation(testData)
        posteriorProbabilities(prob)
      case _ =>
Member


Will this generate a compiler warning without this case, or does it already prove there aren't other possibilities? In that case I think you can let it produce a match error if it ever somehow hits that case. UnknownError seems too strong, since it's at best an IllegalStateException or something, not a VM error.
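For illustration only (this is not the PR's code, and the type names below are made up for the sketch): with a sealed type the compiler checks exhaustiveness, and an impossible value at runtime surfaces as a MatchError rather than a hand-thrown UnknownError.

  sealed trait ModelType
  case object Multinomial extends ModelType
  case object Bernoulli extends ModelType

  def calculation(modelType: ModelType): String = modelType match {
    case Multinomial => "multinomial calculation"
    case Bernoulli   => "bernoulli calculation"
    // No catch-all needed: the compiler checks exhaustiveness over the sealed trait,
    // and an unexpected value at runtime would raise scala.MatchError on its own.
  }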

Author


I didn't write that code, so I didn't make any decision about it, but now that you've highlighted it I think you're right. Removing the last case doesn't generate any compiler warning or error, so I just removed it.

I don't get any warning from my IDE, but running sbt mllib/compile gives me the following errors:

[info] Compiling 15 Java sources to /media/SB-1TB/workarea/apache-spark/unsafe/target/scala-2.10/classes...
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java:20: error: cannot find symbol
[error] import javax.annotation.Nullable;
[error]                        ^
[error]   symbol:   class Nullable
[error]   location: package javax.annotation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java:20: error: cannot find symbol
[error] import javax.annotation.Nullable;
[error]                        ^
[error]   symbol:   class Nullable
[error]   location: package javax.annotation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/ExecutorMemoryManager.java:24: error: package javax.annotation.concurrent does not exist
[error] import javax.annotation.concurrent.GuardedBy;
[error]                                   ^
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java:37: error: cannot find symbol
[error]   MemoryBlock(@Nullable Object obj, long offset, long length) {
[error]                ^
[error]   symbol:   class Nullable
[error]   location: class MemoryBlock
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java:28: error: cannot find symbol
[error]   @Nullable
[error]    ^
[error]   symbol:   class Nullable
[error]   location: class MemoryLocation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java:33: error: cannot find symbol
[error]   public MemoryLocation(@Nullable Object obj, long offset) {
[error]                          ^
[error]   symbol:   class Nullable
[error]   location: class MemoryLocation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/ExecutorMemoryManager.java:42: error: cannot find symbol
[error]   @GuardedBy("this")
[error]    ^
[error]   symbol:   class GuardedBy
[error]   location: class ExecutorMemoryManager
[error] 7 errors
[error] (unsafe/compile:compile) javac returned nonzero exit code
[error] Total time: 1 s, completed Jun 15, 2015 8:00:11 PM

@acidghost
Author

Okay, I committed the updated changes. I'm a bit disappointed by the results I'm getting in my own project, so I hope someone else can test this functionality as well.

@srowen
Member

srowen commented Jun 16, 2015

Re: Infinity, when do you see that? If you subtract the largest (least negative) log probability from everything then the largest log prob is 0 so its exp is 1.
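As a minimal sketch of that normalization (illustrative only, not the PR's code): subtract the maximum log probability before exponentiating, then divide by the sum.

  // Sketch only: turn an array of log probabilities into posteriors without overflow.
  def posteriorsFromLogProbs(logProbs: Array[Double]): Array[Double] = {
    val maxLog = logProbs.max                                    // the least negative entry
    val unnormalized = logProbs.map(lp => math.exp(lp - maxLog)) // largest becomes exp(0) = 1
    val total = unnormalized.sum
    unnormalized.map(_ / total)                                  // posteriors now sum to 1
  }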

@acidghost
Author

Sorry @srowen, I was erroneously subtracting the smallest one.

Thank you @squito, my main error was scaling using multiplication instead of addition in log space, as @srowen already pointed out.

Something else I was missing is that Naive Bayes has some tendency to "overestimate" its confidence. Is this a characteristic of generative algorithms?

Thanks @srowen for the patience and time you're putting into this.

Do you have any clue on how to unit test this?

@@ -19,6 +19,8 @@ package org.apache.spark.mllib.classification

import java.lang.{Iterable => JIterable}

import breeze.linalg.{max, min}
Member


No longer needed?

@srowen
Member

srowen commented Jun 16, 2015

Yeah, a unit test would be good. I imagine you can add calls to the new methods in existing tests? Just verify that the posteriors are correct, or sane?

@acidghost
Author

By verifying that the posterior probabilities are correct, do you mean concordant with the predict method's results?

And what do you mean by sane? That they should approximately sum to 1, and with what accepted error?

@srowen
Member

srowen commented Jun 16, 2015

Yes, would be good to check that the predicted class is where the max probability occurs, that they sum to 1 too. I don't know if there's a standard epsilon but use the surrounding code as a guideline to what seems like the general practice for this test. In some cases, maybe the toy example in a test means you know exactly what the probabilities should be and you can verify them directly, because you do want to check that they're not just some set of numbers that happens to sum to 1 and is in the right order. If that's not easy maybe try to make a really trivial example model where the output probabilities are easy to know.
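A rough sketch of those checks against this PR's RDD-based API (the validationData fixture and the 1e-8 tolerance are assumptions, not existing code):

  val features = validationData.map(_.features)                    // RDD[Vector] from LabeledPoints
  val predictions = model.predict(features).collect()
  val posteriors = model.predictProbabilities(features).collect()  // Map[Double, Double] per example
  predictions.zip(posteriors).foreach { case (predicted, probs) =>
    assert(math.abs(probs.values.sum - 1.0) < 1e-8)                // posteriors sum to ~1
    assert(probs.maxBy(_._2)._1 == predicted)                      // argmax agrees with predict()
  }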

}
}

def predictProbabilities(testData: RDD[Vector]): RDD[Map[Double, Double]] = {
Member


I'd still prefer a Vector instead of a Map. That will make this method more Java-friendly and should (I think) also be more efficient in terms of object creation.

Author


So, in the end, should I make it return a Vector or a Map?

The labels are not in the right order, so I need to map them to array indices.

Member


I would go with @jkbradley's suggestion here.
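A hedged sketch of that Vector-returning shape (the classPosteriorsFor helper is hypothetical; only the signature reflects the suggestion): the i-th entry corresponds to model.labels(i), so callers map labels to indices once instead of building a Map per example.

  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.rdd.RDD

  // Sketch only: return class probabilities as a Vector ordered like model.labels.
  def predictProbabilities(testData: RDD[Vector]): RDD[Vector] =
    testData.map { row =>
      val posteriors = classPosteriorsFor(row)   // hypothetical helper: Array[Double] in label order
      Vectors.dense(posteriors)
    }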

@jkbradley
Member

@acidghost Thanks for the updated PR! I'll try to make a closer pass soon. In the meantime, unit tests do sound great. I think one of the best ways to test correctness would be to compare with another library (ideally R or sklearn) on a tiny dataset. See e.g. the test for LogisticRegression: [https://github.com/apache/spark/blob/0fc4b96f3e3bf81724ac133a6acc97c1b77271b4/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala]

@acidghost
Author

@srowen @jkbradley now we have a "sanity" unit test. What is missing is testing that the probabilities have the correct values.

@acidghost
Author

I'm checking correctness by comparing the posterior probabilities with the ones given by e1071 in R.

The problem is that even with an epsilon of 0.1 the test fails with errors like:
0.9999997639182312 was greater than 0.7099445, but 0.9999997639182312 was not less than 0.9099444999999999

So an epsilon of 0.1 is not enough; the smallest working epsilon is 0.31. Any suggestions @srowen @jkbradley?

@srowen
Member

srowen commented Jun 22, 2015

@acidghost the epsilon shouldn't be anywhere near as large as 0.1 for a probability. This looks like a problem with the test then, as it clearly expects ~0.81 but the answer is very near 1. Are you sure the output from R is what you intend, and the result is the probability you are trying to compare to?

val sum = probabilities.sum
// Check that prediction probabilities sum up to one
// with an epsilon of 10^-2
assert(sum > 0.99 && sum < 1.01)
Member


We have a syntax like assert(sum ~== 1.0 relTol 0.01) for this. I think the bound should be tighter here. I would not expect probabilities to sum to 1.001 or 1.0001.
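For reference, the approximate-equality helper referred to above lives in MLlib's test utilities; usage would be along these lines (the exact tolerance is a choice for this test, not a project rule):

  import org.apache.spark.mllib.util.TestingUtils._

  val sum = probabilities.sum
  assert(sum ~== 1.0 relTol 1e-8)   // far tighter than the 0.99 .. 1.01 window above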

@srowen
Member

srowen commented Jun 22, 2015

Your R code looks good to me, just from scanning it. I wonder if somehow the examples aren't in the same order in both cases? Or the classes aren't in the same order? Maybe that would explain the failure.

@acidghost
Author

Indeed, I used the wrong data in the R scripts: I used the training data also for predictions, and that is not the same data used by the validation methods in the suite.

I'm going to rewrite the R script right now.

@acidghost
Author

@srowen I updated the R scripts to use the train & test data.

Now, with an epsilon of 0.1, the multinomial test passes and the Bernoulli test fails with: 0.5280344109727607 was not greater than 0.731882

EDIT: something is wrong with the order of the examples: in both multinomial and Bernoulli I have the same examples with wrong predictions!

@acidghost
Author

I just tried saving all the predictions from R to a file and then loading them in Spark to test against, but almost all of the Spark predictions disagree with the R ones.

@srowen
Member

srowen commented Jun 22, 2015

Have you ruled out the example ordering? It looks like the Spark test does produce the same examples deterministically, but it might be worth just printing what it's using to make sure it matches your expectation.

Also, how did you generate the Bernoulli results? The e1071 implementation looks like it's only doing Multinomial, but I am not an expert on that package. Is the Multinomial result correct? Then I'd expect that maybe you're not actually getting the results of a true Bernoulli model from this R package; Bernoulli isn't just Multinomial with counts > 1 mapped to 1.

@acidghost
Author

@srowen I wasn't able to find anything more about e1071, so I opted for scikit-learn, which features both multinomial and Bernoulli models (link).

This is the Python code I'm using:

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
import pandas as pd
import numpy as np

multi = MultinomialNB()

multi_train = pd.read_csv('multinomial.data.train', header = -1)
multi_train_labels = multi_train.iloc[:, 0]
multi_train_features = multi_train.iloc[:, 1:]

multi_model = multi.fit(multi_train_features, multi_train_labels)

multi_test = pd.read_csv('multinomial.data.test', header = -1)
multi_test_labels = multi_test.iloc[:, 0]
multi_test_features = multi_test.iloc[:, 1:]
multi_pred = multi_model.predict(multi_test_features)

print("Multinomial:\nNumber of mislabeled points out of a total %d points : %d" % (len(multi_test),(multi_test_labels != multi_pred).sum()))

multi_probs = multi_model.predict_proba(multi_test_features)
np.savetxt('multinomial.probs', multi_probs, delimiter=" ")



bernoulli = BernoulliNB()

bernoulli_train = pd.read_csv('bernoulli.data.train', header = -1)
bernoulli_train_labels = bernoulli_train.iloc[:, 0]
bernoulli_train_features = bernoulli_train.iloc[:, 1:]

bernoulli_model = bernoulli.fit(bernoulli_train_features, bernoulli_train_labels)

bernoulli_test = pd.read_csv('bernoulli.data.test', header = -1)
bernoulli_test_labels = bernoulli_test.iloc[:, 0]
bernoulli_test_features = bernoulli_test.iloc[:, 1:]
bernoulli_pred = bernoulli_model.predict(bernoulli_test_features)

print("Bernoulli:\nNumber of mislabeled points out of a total %d points : %d" % (len(bernoulli_test),(bernoulli_test_labels != bernoulli_pred).sum()))

bernoulli_probs = bernoulli_model.predict_proba(bernoulli_test_features)
np.savetxt('bernoulli.probs', bernoulli_probs, delimiter=" ")

These are the results from the tests:

  • the Bernoulli test passes (the predictions are all the same as scikit-learn's),
  • for the multinomial I get different results at every run: on average 20 examples out of 1000 have different predictions, and on average 85 probability pairs out of 1000 * 3 are discordant.

The strangest thing is that I get different values at every run! For example, in one run 16 examples diverge and in another 24 (and they are not even the same examples). I am sure that I'm using the right data and that it is in the right order (otherwise the fact that the Bernoulli test passes would be even stranger).

@srowen
Member

srowen commented Jun 25, 2015

So you get different results in scikit-learn? Between R and scikit-learn, do you have a stable result for multinomial and Bernoulli respectively? And do they match MLlib? Then that's probably good enough.

If it's still not working I think we can hand-craft a very simple problem where the answer can just be calculated by hand.

Thanks for your hard work on developing a test for this. It is definitely additive.
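For example, a hand-computable multinomial case might look like this (all numbers are made up for illustration, not taken from the PR's test data): two classes with equal priors, two features, scored for the input (1, 0).

  // Two classes with equal priors; per-class multinomial feature log-probabilities.
  val pi    = Array(math.log(0.5), math.log(0.5))
  val theta = Array(Array(math.log(0.8), math.log(0.2)),   // class 0
                    Array(math.log(0.3), math.log(0.7)))   // class 1
  val x     = Array(1.0, 0.0)                              // one count of feature 0

  val logUnnormalized = pi.indices.map { c =>
    pi(c) + theta(c).zip(x).map { case (t, xi) => t * xi }.sum
  }
  val unnormalized = logUnnormalized.map(math.exp)          // (0.4, 0.15)
  val posteriors   = unnormalized.map(_ / unnormalized.sum) // (0.7272..., 0.2727...)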

@acidghost
Author

I found that e1071 uses a Gaussian distribution (page 34), so I wouldn't use the results from that package.

The MLlib prediction tests (summing to one and more than 80% correct predictions) both pass for Bernoulli and Multinomial.

Comparing the scikit-learn and MLlib probabilities, I have a stable result (everything matches) only with the Bernoulli model; with the Multinomial I get different results at every run. If I could use another library to compute the probabilities, I would compare those with the MLlib ones, as you suggest. Do you know of any with both Bernoulli and Multinomial models?

Anyway, it is strange that only the Multinomial results are wrong. Might it be that the data generation function for Multinomial data is more random? Or is it the prediction algorithm?

@srowen
Member

srowen commented Jun 26, 2015

OK, maybe this is getting too far down a rabbit hole to hard-code results from an implementation that we're not 100% sure is saying what we want. Maybe it's simpler to just directly compute in the test what the probabilities should be, given the model.

For example, in the case of Multinomial, you have this vector pi of C values, and this matrix theta with C rows and D columns. The log probability of class 0 is the sum of pi(0) and the dot product of row 0 of theta with your data; e raised to that sum gives a final unnormalized probability for class 0. Then the unnormalized probabilities over all classes are normalized to sum to 1. Those results ought to be very close to the output of the model -- since it's what the model computes almost word for word!

In that sense it almost feels redundant, but it is encoding the definition of the prediction in the test, which is appropriate. Later, if the implementation changes, the test is still checking against the naive, straightforward computation.

For Bernoulli, it's similar, except that you're adding pi(0), then adding the elements of theta where the input is 1, but log(1 - exp(theta)) where the input is 0.

I realize that's not a great description, so I can assist in writing this part if it would help.

Also, I noticed a potential tiny inaccuracy in how the Naive Bayes Bernoulli computation works: math.log(1.0 - math.exp(value)) becomes inaccurate when value is very negative, and math.log1p(-math.exp(value)) is more accurate in that case. It could matter at some level if we're asserting on the exact probability, and probabilities are often tiny in the output.
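A quick way to see the difference (illustrative values only):

  val v = -40.0                       // a very negative log probability
  val naive  = math.log(1.0 - math.exp(v))   // 0.0: 1.0 - exp(v) rounds to exactly 1.0
  val stable = math.log1p(-math.exp(v))      // ≈ -4.25e-18: the tiny value survives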

@srowen
Member

srowen commented Jun 30, 2015

@acidghost would it help if I created a rough draft of the code I'm sketching above, which maybe you can bake into your test? With any luck that's it, and it verifies the result.

@acidghost
Author

@srowen Yeah, sorry for the scarce dedication over the last few days; a draft of the idea you're presenting would really help me! As you can imagine, this test was becoming not so pleasant to work with, so your further help is greatly appreciated!

As soon as you get back to me I'll try to make it fit into the test.

@srowen
Member

srowen commented Jun 30, 2015

Here is roughly the code to compute expected class probabilities for a given piece of input:

  // (BDV and BDM here are breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM}.)

  def expectedMultinomialProbabilities(model: NaiveBayesModel, testData: Vector) = {
    val piVector = new BDV(model.pi)
    // model.theta is row-major; treat it as the col-major representation of its transpose, and transpose:
    val thetaMatrix = new BDM(model.theta(0).length, model.theta.length, model.theta.flatten).t
    val logClassProbs = piVector + (thetaMatrix * testData.toBreeze)
    logClassProbs.toArray.map(math.exp)
  }

  def expectedBernoulliProbabilities(model: NaiveBayesModel, testData: Vector) = {
    val piVector = new BDV(model.pi)
    val thetaMatrix = new BDM(model.theta(0).length, model.theta.length, model.theta.flatten).t
    val negThetaMatrix = new BDM(model.theta(0).length, model.theta.length,
      model.theta.flatten.map(v => math.log(1.0 - math.exp(v)))).t
    val testBreeze = testData.toBreeze
    val negTestBreeze = new BDV(Array.fill(testBreeze.size)(1.0)) - testBreeze
    val logClassProbs = piVector + (thetaMatrix * testBreeze) + (negThetaMatrix * negTestBreeze)
    logClassProbs.toArray.map(math.exp)
  }

I intentionally approached it differently from the implementation in NaiveBayesModel; this is probably more straightforward but less performant, but that's the point of a cross-check in a test I suppose.
I haven't tested it; maybe I can test my test tomorrow. It should at least compile and be 90+% there.
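One way such helpers could then be used in the suite (a sketch under assumptions: a per-Vector predictProbabilities overload, a validationData collection of LabeledPoints, and the chosen tolerance are all mine, not the PR's):

  import org.apache.spark.mllib.util.TestingUtils._

  validationData.foreach { case LabeledPoint(_, features) =>
    val expected = expectedMultinomialProbabilities(model, features)
    val expectedPosteriors = expected.map(_ / expected.sum)      // normalize to sum to 1
    val actual = model.predictProbabilities(features).toArray    // assumed per-Vector overload
    expectedPosteriors.zip(actual).foreach { case (e, a) => assert(e ~== a relTol 1e-8) }
  }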

@jkbradley
Member

@acidghost Sorry for the delay, just now catching up! One quick comment: should epsilon be set to 0? It looks like e1071 disables smoothing by default. If @srowen's code does not fix the test issue, I can take a closer look. Thanks for the careful testing!

@srowen
Member

srowen commented Jul 8, 2015

@acidghost are you still working on this? I can bring it home if you're occupied otherwise. Credit to you for the change, of course.

@acidghost
Author

@srowen not at the moment, I'm having pretty busy days at work. If you could take care of this PR, it would be much appreciated!

@srowen
Member

srowen commented Jul 13, 2015

@acidghost take a look at #7376. You can close this PR.

@acidghost acidghost closed this Jul 14, 2015
asfgit pushed a commit that referenced this pull request Jul 14, 2015
…yesModel

Add predictProbabilities to Naive Bayes, return class probabilities.

Continues #6761

Author: Sean Owen <sowen@cloudera.com>

Closes #7376 from srowen/SPARK-4362 and squashes the following commits:

23d5a76 [Sean Owen] Fix model.labels -> model.theta
95d91fb [Sean Owen] Check that predicted probabilities sum to 1
b32d1c8 [Sean Owen] Add predictProbabilities to Naive Bayes, return class probabilities