[SPARK-4362] [MLLIB] Make prediction probability available in NaiveBayesModel #6761
Conversation
Ok to test |
Is it normal that the test hasn't started yet? |
No, I triggered it manually. There may still be a problem with the PR builder |
Test build #900 timed out for PR 6761 at commit |
@srowen I can't figure out why the tests failed... I don't think that my PR is the cause of it |
Yes, look at the PR builder queue. I think something's wrong with it. |
Yep, build 901 has failed on the same test suite with |
@srowen Should somebody trigger a new test? |
Test build #902 timed out for PR 6761 at commit |
@srowen Is the PR builder working again? |
@acidghost you can click the links to see for yourself |
Jenkins, retest this please |
Sorry for having bothered you, but I'm new to Jenkins and I don't really know how the internals work... |
Test build #34891 has finished for PR 6761 at commit |
case Bernoulli =>
  val prob = bernoulliCalculation(testData)
  posteriorProbabilities(prob)
case _ =>
Will this generate a compiler warning without this case, or does it already prove there aren't other possibilities? In that case I think you can let it produce a match error if it ever somehow hits this case. UnknownError seems too strong, as it's at best an IllegalStateException or something, not a VM error.
I didn't write that code, so I didn't make any decision on it, but now that you've highlighted it I think you're right. Removing the last case doesn't generate any compiler warning or error, so I just removed it.
I don't get any warning from my IDE, but trying to run sbt mllib/compile gives me the following errors:
[info] Compiling 15 Java sources to /media/SB-1TB/workarea/apache-spark/unsafe/target/scala-2.10/classes...
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java:20: error: cannot find symbol
[error] import javax.annotation.Nullable;
[error] ^
[error] symbol: class Nullable
[error] location: package javax.annotation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java:20: error: cannot find symbol
[error] import javax.annotation.Nullable;
[error] ^
[error] symbol: class Nullable
[error] location: package javax.annotation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/ExecutorMemoryManager.java:24: error: package javax.annotation.concurrent does not exist
[error] import javax.annotation.concurrent.GuardedBy;
[error] ^
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java:37: error: cannot find symbol
[error] MemoryBlock(@Nullable Object obj, long offset, long length) {
[error] ^
[error] symbol: class Nullable
[error] location: class MemoryBlock
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java:28: error: cannot find symbol
[error] @Nullable
[error] ^
[error] symbol: class Nullable
[error] location: class MemoryLocation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryLocation.java:33: error: cannot find symbol
[error] public MemoryLocation(@Nullable Object obj, long offset) {
[error] ^
[error] symbol: class Nullable
[error] location: class MemoryLocation
[error] /media/SB-1TB/workarea/apache-spark/unsafe/src/main/java/org/apache/spark/unsafe/memory/ExecutorMemoryManager.java:42: error: cannot find symbol
[error] @GuardedBy("this")
[error] ^
[error] symbol: class GuardedBy
[error] location: class ExecutorMemoryManager
[error] 7 errors
[error] (unsafe/compile:compile) javac returned nonzero exit code
[error] Total time: 1 s, completed Jun 15, 2015 8:00:11 PM
Okay, so I committed the updated changes. I am a bit disappointed by the results I'm obtaining with my project. I hope there's someone else who can test this functionality... |
Re: Infinity, when do you see that? If you subtract the largest (least negative) log probability from everything then the largest log prob is 0 so its exp is 1. |
Sorry @srowen, I was erroneously subtracting the smallest... Thank you @squito; my main error was scaling using multiplication instead of addition in log space, as already pointed out by @srowen. Something else I was missing is that NaiveBayes has some tendency to "overestimate" its confidence. Is this a characteristic of generative algorithms? Thanks @srowen for the patience and time you're giving this. Do you have any clue on how to unit test this? |
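As background for the Infinity discussion above, here is a minimal sketch of the log-space normalization being described (the function name is illustrative, not the PR's actual code): subtracting the maximum log posterior before exponentiating keeps every exponent at or below 0, so nothing overflows.

// Normalize log posteriors into probabilities without overflow: subtract the maximum
// log value first, so the largest exponent is 0 (exp = 1) and exp never reaches Infinity.
def posteriorsFromLogs(logPosteriors: Array[Double]): Array[Double] = {
  val maxLog = logPosteriors.max
  val unnormalized = logPosteriors.map(lp => math.exp(lp - maxLog)) // all values in (0, 1]
  val normalizer = unnormalized.sum
  unnormalized.map(_ / normalizer) // probabilities that sum to 1
}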
@@ -19,6 +19,8 @@ package org.apache.spark.mllib.classification

import java.lang.{Iterable => JIterable}

import breeze.linalg.{max, min}
No longer needed?
Yeah, a unit test would be good. I imagine you can add calls to the new methods in existing tests? Just verify that the posteriors are correct, or sane? |
For verifying that the posterior probabilities are correct, you mean...? And what do you mean by sane? That they should approximately sum to 1? |
Yes, would be good to check that the predicted class is where the max probability occurs, that they sum to 1 too. I don't know if there's a standard epsilon but use the surrounding code as a guideline to what seems like the general practice for this test. In some cases, maybe the toy example in a test means you know exactly what the probabilities should be and you can verify them directly, because you do want to check that they're not just some set of numbers that happens to sum to 1 and is in the right order. If that's not easy maybe try to make a really trivial example model where the output probabilities are easy to know. |
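A rough sketch of the sanity checks being discussed, assuming the PR's proposed predictProbabilities(RDD[Vector]) returning RDD[Map[Double, Double]]; the helper name and tolerance are placeholders, not actual test code:

import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sanity checks: each point's posteriors should sum to ~1, and the label with the
// largest posterior should match what predict() returns for that point.
def checkPosteriors(model: NaiveBayesModel, testData: RDD[Vector]): Unit = {
  val predictions = model.predict(testData).collect()
  val posteriors = model.predictProbabilities(testData).collect() // proposed method under review
  predictions.zip(posteriors).foreach { case (predicted, probs) =>
    assert(math.abs(probs.values.sum - 1.0) < 1e-8) // placeholder tolerance
    assert(probs.maxBy(_._2)._1 == predicted)       // argmax agrees with predict()
  }
}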
  }
}

def predictProbabilities(testData: RDD[Vector]): RDD[Map[Double, Double]] = { |
I'd still prefer a Vector instead of a Map. That will make this method more Java-friendly and should (I think) also be more efficient in terms of object creation.
So, in the end, should I make it return a Vector or a Map? The labels are not in the right order, so I need to transform them to array indexes.
I would go with @jkbradley 's suggestion here
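To illustrate the Vector-over-Map suggestion and the label-ordering concern, here is a small self-contained sketch (toy values; the shapes and names are assumptions, not the merged implementation): the posteriors are packed into a Vector whose index i corresponds to labels(i).

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// model.labels need not be sorted, so posteriors keyed by label are re-ordered
// into a Vector where position i corresponds to labels(i).
val labels: Array[Double] = Array(2.0, 0.0, 1.0)
val probsByLabel: Map[Double, Double] = Map(0.0 -> 0.2, 1.0 -> 0.7, 2.0 -> 0.1)

val probabilities: Vector = Vectors.dense(labels.map(probsByLabel))

// A caller can recover the most probable label from the Vector:
val bestLabel = labels(probabilities.toArray.zipWithIndex.maxBy(_._1)._2)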
@acidghost Thanks for the updated PR! I'll try to make a closer pass soon. In the meantime, unit tests do sound great. I think one of the best ways to test correctness would be to compare with another library (ideally R or sklearn) on a tiny dataset. See e.g. the test for LogisticRegression: [https://github.com/apache/spark/blob/0fc4b96f3e3bf81724ac133a6acc97c1b77271b4/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala] |
@srowen @jkbradley now we have a "sanity" unit test. What is missing is testing that the probabilities have the correct values. |
I'm doing the correctness checking by comparing the posterior probabilities with the ones given by e1071 in R. The problem is that even with an epsilon of 0.1 the test fails with errors like: So an epsilon of 0.1 is not enough. The smallest working epsilon is 0.31. Any suggestion @srowen @jkbradley ? |
@acidghost the epsilon shouldn't be anywhere near as large as 0.1 for a probability. This looks like a problem with the test then, as it clearly expects ~0.81 but the answer is very near 1. Are you sure the output from R is what you intend, and the result is the probability you are trying to compare to? |
val sum = probabilities.sum
// Check that prediction probabilities sum up to one
// with an epsilon of 10^-2
assert(sum > 0.99 && sum < 1.01)
We have a syntax like assert(sum ~== 1.0 relTol 0.01) for this. I think the bound should be tighter here. I would not expect probabilities to sum to 1.001 or 1.0001.
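For reference, a small usage example of the ~== syntax mentioned above (the import path for MLlib's test-scope TestingUtils and the sample values are assumptions):

import org.apache.spark.mllib.util.TestingUtils._

// Relative-tolerance comparison instead of hand-rolled bounds; a tolerance much
// tighter than 0.01 is reasonable for a probability sum.
val probabilities = Array(0.2, 0.5, 0.3)
val sum = probabilities.sum
assert(sum ~== 1.0 relTol 1e-8)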
Your R code looks good to me, just scanning it. I wonder if somehow the examples aren't in the same order in both cases? Or the classes aren't somehow in the same order? Maybe that would explain the failure. |
Indeed, I used the wrong data in the R scripts... I used the training data for the predictions too, and those are not the same as the data used in the validation methods in the suite... I'm going to rewrite the R script right now. |
@srowen I updated the R scripts to use the train & test data. Now with epsilon 0.1 the multinomial test passes and the bernoulli test fails with: EDIT: something is wrong with the order of examples... both in multinomial and bernoulli I have the same examples with wrong predictions! |
I just tried saving all the predictions from R in a file and then loading them in Spark to test them, but almost all of the predictions disagree with the R ones... |
Have you ruled out the example ordering? It looks like the Spark test does produce the same examples deterministically, but it might be worth just printing what it's using to make sure it matches your expectation. Also, how did you generate the Bernoulli results? The e1071 implementation looks like it's only doing Multinomial, but I am not an expert on that package. Is the Multinomial result correct? Then I'd expect the issue is that you're not actually getting results from a Bernoulli model from this R package? It's not just Multinomial with counts > 1 mapped to 1. |
@srowen So I wasn't able to find anything more about e1071, so I opted for scikit-learn, which features both multinomial and bernoulli models (link). This is the Python code I'm using:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
import pandas as pd
import numpy as np
multi = MultinomialNB()
multi_train = pd.read_csv('multinomial.data.train', header = -1)
multi_train_labels = multi_train.iloc[:, 0]
multi_train_features = multi_train.iloc[:, 1:]
multi_model = multi.fit(multi_train_features, multi_train_labels)
multi_test = pd.read_csv('multinomial.data.test', header = -1)
multi_test_labels = multi_test.iloc[:, 0]
multi_test_features = multi_test.iloc[:, 1:]
multi_pred = multi_model.predict(multi_test_features)
print("Multinomial:\nNumber of mislabeled points out of a total %d points : %d" % (len(multi_test),(multi_test_labels != multi_pred).sum()))
multi_probs = multi_model.predict_proba(multi_test_features)
np.savetxt('multinomial.probs', multi_probs, delimiter=" ")
bernoulli = BernoulliNB()
bernoulli_train = pd.read_csv('bernoulli.data.train', header = -1)
bernoulli_train_labels = bernoulli_train.iloc[:, 0]
bernoulli_train_features = bernoulli_train.iloc[:, 1:]
bernoulli_model = bernoulli.fit(bernoulli_train_features, bernoulli_train_labels)
bernoulli_test = pd.read_csv('bernoulli.data.test', header = -1)
bernoulli_test_labels = bernoulli_test.iloc[:, 0]
bernoulli_test_features = bernoulli_test.iloc[:, 1:]
bernoulli_pred = bernoulli_model.predict(bernoulli_test_features)
print("Bernoulli:\nNumber of mislabeled points out of a total %d points : %d" % (len(bernoulli_test),(bernoulli_test_labels != bernoulli_pred).sum()))
bernoulli_probs = bernoulli_model.predict_proba(bernoulli_test_features)
np.savetxt('bernoulli.probs', bernoulli_probs, delimiter=" ")
These are the results from the tests:
The strangest thing is that I get different values on every run! For example, in one run 16 examples diverge and in another 24 (and they are not even the same examples). I am sure that I'm using the right data and that it should be in the right order (otherwise, the fact that the bernoulli one passes would be even stranger). |
So you get different results in scikit? Between R and scikit, do you have a stable result for multinomial and bernoulli respectively? And do they match mllib? Then that's probably good enough. If it's still not working, I think we can hand-craft a very simple problem where the answer can just be calculated by hand. Thanks for your hard work on developing a test for this. It is definitely additive. |
I found that e1071 uses a Gaussian distribution (page 34), so I wouldn't use the results from that package. The mllib prediction tests (sum to one and more than 80% correct predictions) both pass for Bernoulli and Multinomial. Comparing the scikit and mllib probabilities, I have a stable result (everything matches) only with Bernoulli. With Multinomial I get different results on every run. If I could use another library to compute the probabilities, I would compare those with the mllib ones, as you suggest. Do you know of any with both Bernoulli and Multinomial models? Anyway, it is strange that only the Multinomial results are wrong. Might it be that the data generation function for Multinomial data is more random? Or is it the prediction algorithm? |
OK, maybe this is getting too far down a rabbit hole to hard-code results from an implementation that we're not 100% sure is saying what we want. Maybe it's simpler to just directly compute in the test what the probabilities should be, given the model. For example, in the case of Multinomial, you have this vector pi of C values, and this matrix theta with C rows and D columns. The probability of class 0 is the sum of pi(0) and the dot product of row 0 of theta with your data, with that whole sum exponentiated by e to get a final unnormalized probability for class 0. Then the unnormalized probs over all classes are normalized to sum to 1. Those results ought to be very close to the output of the model -- since it's what the model computes almost word for word! In that sense it almost feels redundant, but it is coding the definition of the prediction in the test, which is appropriate. Later, if the implementation changes, the test is still checking against the naive straightforward computation. For Bernoulli, it's similar except that you're adding pi(0), and then adding the elements of theta where the input is 1, but log(1-exp(theta)) where the input is 0. I realize that's not a great description, so I can assist writing this part if it would help. Also, I noticed a potential tiny inaccuracy in how the Naive Bayes Bernoulli computation works. |
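To make the description above concrete, here is a minimal plain-Scala sketch of that cross-check (pi and theta are taken as the model's log parameters; the function name and exact shape are illustrative, and this is not the code posted later in the thread):

// Compute expected class posteriors directly from the model parameters, mirroring the
// definition above rather than the mllib implementation, so a test using it remains a
// naive cross-check. pi holds log class priors (length C); theta holds log feature
// parameters (C rows by D columns); x is one input point (length D).
def expectedPosteriors(
    pi: Array[Double],
    theta: Array[Array[Double]],
    x: Array[Double],
    multinomial: Boolean): Array[Double] = {
  val logUnnormalized = pi.indices.map { c =>
    val logLikelihood = theta(c).zip(x).map { case (logP, xj) =>
      if (multinomial) {
        logP * xj                      // Multinomial: x_j * log(theta_cj); summed, this is the dot product
      } else if (xj == 1.0) {
        logP                           // Bernoulli: feature present
      } else {
        math.log(1.0 - math.exp(logP)) // Bernoulli: feature absent
      }
    }.sum
    pi(c) + logLikelihood
  }
  val unnormalized = logUnnormalized.map(math.exp)
  val normalizer = unnormalized.sum
  unnormalized.map(_ / normalizer).toArray
}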
@acidghost would it help if I created a rough draft of the code I'm sketching above, which maybe you can bake into your test? with any luck, that's it, it verifies the result. |
@srowen Yeah, sorry for the scant attention over the last few days; a draft of the idea you're presenting would really help me! As you can imagine, this test was becoming not so pleasant to work with... So yeah, your further help is greatly appreciated! As soon as you get back to me I'll try to make it fit into the test. |
Here is roughly the code to compute expected class probabilities for a given piece of input:
I intentionally approached it differently from the implementation in |
@acidghost Sorry for the delay, just now catching up! One quick comment: Should epsilon be set to 0? It looks like e1071 disables smoothing by default. If @srowen 's code does not fix the test issue, I can take a closer look. Thanks for the careful testing! |
@acidghost are you still working on this? I can bring it home if you're occupied otherwise. Credit to you for the change of course. |
@srowen not at the moment, I'm having pretty busy days at work... If you could take care of this PR it would be much appreciated! |
@acidghost take a look at #7376 You can close this PR. |
…yesModel

Add predictProbabilities to Naive Bayes, return class probabilities. Continues #6761

Author: Sean Owen <sowen@cloudera.com>

Closes #7376 from srowen/SPARK-4362 and squashes the following commits:

23d5a76 [Sean Owen] Fix model.labels -> model.theta
95d91fb [Sean Owen] Check that predicted probabilities sum to 1
b32d1c8 [Sean Owen] Add predictProbabilities to Naive Bayes, return class probabilities
There is currently no way to get the posterior probability of a prediction with the Naive Bayes model. This change makes it available along with the label.
@jkbradley @srowen