
Conditional parents #34

Closed
tom-christie opened this issue Nov 30, 2015 · 24 comments

@tom-christie

First of all, I love this package, the documentation, and especially your (@jluttine) enthusiasm and willingness to help people learn how to use it. You are a huge credit to the community, so thanks!

Second, I have a question. I'm trying to implement the "Mixed Number Subtraction" Bayesian network example discussed in this paper and in Chapter 11 of the book "Bayesian Networks in Educational Assessment." The basic structure is that there's a 'proficiency model' of latent variables THETA, and an 'evidence model' of measurable/observable variables X. The full problem looks like this:

[image: full Bayesian network structure]

The latent variables (Skills) are modeled as Bernoulli with Beta priors. Each observable variable (Item) has one or more latent variables that map to it, such that for a given observable variable X, if all of the THETA values that map to it have a value of 1, then X has a prior of PI=Beta(20,5); otherwise, X has a prior of PI=Beta(5,20).

The full plate notation looks like this:

[image: plate notation for the full model]

I'm omitting the Evidence Model (the s plate) from my implementation until I can get just the i and j plates to work. The delta is a deterministic indicator variable that says, for an item X, whether subject i has all of the prerequisite skills; the prior on X changes depending on whether the answer to this (encoded by delta) is true or false. I'm also just setting i=1 to have a single subject.

For a single variable THETA and 10 items I can get the following to work:

from bayespy.nodes import Beta, Categorical, Gate, Binomial

lambda1 = Beta([[20,5]])
# using a Categorical here instead of Bernoulli because
# otherwise it won't work in the Gate below - I get a NoConverterError
theta1 = Categorical(lambda1)
pi = Beta([[20,5], [5,20]], plates=(10,2))  # two possible Beta priors for each of X_1 through X_10
delta = Gate(theta1, pi, gated_plate=-1)    # which one is used is based on the value of theta1
X = Binomial(1, delta)
X.observe([1,1,1,1,0,0,0,0,0,0])  # 10 separate observations, one for each X_i

This seems to work great - i.e., the case where X depends on the value of only a single THETA. However, I also have the situation in which the prior for theta2 depends on the value of theta1, and the prior for X depends on the values of BOTH variables. What I have so far is:

from bayespy.nodes import Dirichlet, Categorical, Beta, Gate, Binomial

lambda1 = Dirichlet([20,5])
lambda2 = Dirichlet([[5,20],[20,5]])
theta1 = Categorical(lambda1)
theta2 = Categorical(Gate(theta1,lambda2)) #the prior for theta2 depends on the value of theta1

pi1 = Beta([[[5,20], [20,5]]],plates=(10,2)) #two possible beta priors for each question
pi2 = Beta([[[5,20], [20,5]]],plates=(10,2))
delta1 = Gate(theta1,pi1, gated_plate=-1)
delta2 = Gate(theta2,pi2, gated_plate=-1)

X1 = Binomial(1,delta1, plates=(10,))
X2 = Binomial(1,delta2, plates=(10,))

X1.observe([1,1,1,1,1,1,1,1,1,1])
X2.observe([0,0,0,0,0,0,0,0,0,0])

I suspect there's a more compact way to write all the pi1, pi2, etc., but this seems to work. Now what I need to add is a pi3/delta3/X3 whose prior depends on the values of BOTH theta1 and theta2. I suspect the solution has to do with a clever use of gates and mixtures as in issue #23, but I don't really understand how to use mixtures with non-Gaussian nodes.

I would really appreciate any guidance! I have this full model implemented in PyMC2 but it's super slow and uses SO much memory. I would really love to switch to variational Bayes using this package.

Edit: Once I get this working I'd be very happy to contribute it as an example in the repo if you're interested.

Edit 2: Fixed an error in the code

jluttine added a commit to bayespy/bayespy-notebooks that referenced this issue Dec 3, 2015
@jluttine
Member

jluttine commented Dec 3, 2015

Note: This message is a Jupyter Notebook, available for download or as an interactive session. It is a response to BayesPy issue #34.

Thanks for the positive feedback! Glad to hear you like the package. :)

So, you asked about conditioning theta3 on both theta1 and theta2. It is easy to do by "nesting" Mixture nodes as follows:

from bayespy.nodes import Categorical, Beta, Mixture
lambda1 = Beta([20,5]) 
lambda2 = Beta([[5,20],[20,5]])
lambda3 = Beta([ [[5,20],[20,5]],
                 [[3,40],[40,3]] ])
theta1 = Categorical(lambda1)
theta2 = Mixture(theta1, Categorical, lambda2)
theta3 = Mixture(theta1, Mixture, theta2, Categorical, lambda3)

You could use Gate nodes as you did; it doesn't really matter in your case. There are cases where only one or the other works correctly, but here it's a matter of preference. With nested Gate nodes, theta3 = Categorical(Gate(theta1, Gate(theta2, lambda3))) should work. I used Mixture so you could see an alternative.
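For reference, the nested-Gate construction would be written out like this (an untested sketch, using the same lambda nodes as above):

from bayespy.nodes import Gate
# The inner Gate picks a row of lambda3 according to theta2,
# the outer Gate picks according to theta1, and the result parameterizes theta3.
theta3 = Categorical(Gate(theta1, Gate(theta2, lambda3)))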

I also used Beta instead of Dirichlet. It doesn't matter, just a thought. And thanks for reporting that Bernoulli isn't accepted by the Gate/Mixture node - I made a bug report of it. I used Mixture for the X nodes too, instead of Gate, and Bernoulli instead of Binomial with one trial. I also removed the delta nodes. None of this really affects anything; it's just another way of writing the model.

pi1 = Beta([[[5,20], [20,5]]], plates=(10,2))
pi2 = Beta([[[5,20], [20,5]]], plates=(10,2))
pi3 = Beta([[[5,20], [20,5]]], plates=(10,2))

from bayespy.nodes import Bernoulli
X1 = Mixture(theta1, Bernoulli, pi1)
X2 = Mixture(theta2, Bernoulli, pi2)
X3 = Mixture(theta3, Bernoulli, pi3)

X1.observe([1,1,1,1,1,1,1,1,1,1])
X2.observe([0,0,0,0,0,0,0,0,0,0])
X3.observe([0,1,0,1,0,1,0,1,0,1])

Then just run the inference:

from bayespy.inference import VB
Q = VB(X1, X2, X3, pi1, pi2, pi3, theta3, theta2, theta1, lambda1, lambda2, lambda3)
Q.update(repeat=100)

A minor detail: I list the nodes in such an order that child nodes come before their parents, thus theta3 is before theta2. I think this makes sense when you observe the leaf nodes, so that information flows from children to parents.

Finally, you can, for instance, look at the posterior probabilities for theta:

print(theta1.get_moments()[0])
print(theta2.get_moments()[0])
print(theta3.get_moments()[0])

You may have noticed that you need to use separate nodes for each X1, X2, X3, etc. This is because the theta nodes have dependencies that can't be represented by any single built-in node and thus need to be separated, so all of their child nodes need to be separate too. If you have a large number of X1, ..., Xn nodes, you probably want to generate them programmatically, maybe using list comprehensions or for loops. There could be a node class to concatenate several nodes into a single "array", but it's not implemented yet, and I doubt it would make any significant difference here.
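For instance, a rough sketch of generating the per-item nodes in a loop (the response lists and the theta-per-item mapping below are just made-up placeholders):

# Hypothetical data: one list of 10 responses per item
observations = [[1]*10, [0]*10, [0,1]*5]
# Which (possibly nested-mixture) skill node governs each item
item_thetas = [theta1, theta2, theta3]
pis = [Beta([[[5, 20], [20, 5]]], plates=(10, 2)) for _ in observations]
Xs = [Mixture(theta, Bernoulli, pi) for theta, pi in zip(item_thetas, pis)]
for X, data in zip(Xs, observations):
    X.observe(data)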

I hope this helps. If something was unclear or if I misunderstood something, please comment.

@tom-christie
Author

Hi @jluttine, thanks for the help and suggestions! So I think I actually had two distinct questions (not sure if I wrote them very distinctly), and your answer definitely answered one, namely conditioning theta3 on theta1 and theta2, so thank you very much!

For the other question, I need a deterministic node (delta in the model) to be True if and only if both theta1 and theta2 are True (assuming 1 and True are equivalent here). That means that delta is entirely deterministic, dependent on its parents. I was thinking that I needed some sort of mixture or nested gate. I see that the deltas are just convenience nodes and that we could make the prior for X depend on the values of both theta1 and theta2 directly, but I'm not sure we've achieved that yet...please correct me if I'm wrong!

So to make this clearer, I need:

prior-for-X        theta1        theta2
pi=Beta(5,20)       0               0
pi=Beta(5,20)       0               1
pi=Beta(5,20)       1               0
pi=Beta(20,5)       1               1

In the model, the status of theta1 and theta2 is coded by the indicator variable delta, and then delta is used as a Gate to select which Beta() to use.

Additionally, to make sure I understand what's going on I've tried instantiating variables and sampling from them. If I do this:

lambda1 = Beta([20,5]) 
X1 = Bernoulli(lambda1)
samples = [X1.random() for i in range(100000)]
print(sum(samples)/len(samples))

I regularly get a value of about 0.813 (rather than 0.8). Is this because the inference hasn't been run yet? And of course, I can't run the inference without data... so what's happening here? Just to make sure I'm not nuts I ran the same thing in PyMC and got 0.8 as expected:

import numpy as np
import pymc
theta1 = pymc.Beta('theta1', alpha=20, beta=5)
y1 = pymc.Bernoulli('y1', p=theta1, observed=False)
model = pymc.MCMC([y1, theta1])
model.sample(iter=10000, burn=1000, thin=2)
print
print "Mean is:", np.mean(model.trace('y1')[:])

Thanks again for your prompt and thorough responses to my question and others'. I wasn't sure whether to ask this here or on Stack Overflow, because I know it's not your job to answer my questions :-) But you are so responsive that I decided to ask here in any case.

@tom-christie
Author

OK, regarding the deterministic node I MIGHT have found a solution.

import bayespy as bp
from bayespy.nodes import Categorical, Beta, Mixture, Bernoulli, Binomial, Dirichlet, Gate
lambda1 = Beta([20,5]) 
lambda2 = Beta([[5,20],  #in which order are the rows here?
                [20,5]])
theta1 = Categorical(lambda1)
theta2 = Mixture(theta1, Categorical, lambda2)

#set up a group of nodes where p = 1 iff theta1=True and theta2=True
lambda3 = Beta([ [[1,1000],[1,1000]],
                 [[1,1000],[1000,1]] ]) 
p = Mixture(theta1, Mixture, theta2, Categorical, lambda3)

#possible priors for X
pi = Beta([[5,20], [20,5]])
#which prior to use with X depends on the value of p
X_prior = Gate(p,pi)
X = Bernoulli(X_prior, plates=(10,)) #10 observations

#simulate answering 10 questions incorrectly
X.observe([0,0,0,0,0,0,0,0,0,0])

#answering incorrectly means we should get low values for thetas
#lambda1 should be ~Beta([20,6]) 
#lambda2 should be ~Beta([[6,20],[20,5]])
Q = bp.inference.VB(X, X_prior, pi, p, lambda3, theta2, theta1, lambda2, lambda1)
Q.update(repeat=20)

print(lambda1)
print(lambda2)
print(theta1.get_moments()[0])
print(theta2.get_moments()[0])

I get the predicted result:

Converged at iteration 4.
 ~ Beta(a, b)
  a = 
20.01955714039129
  b = 
5.9804428596087105

# the second parameter b is about 6, so I must have mixed up the order in the array above
 ~ Beta(a, b)
  a = 
[  5.00004194  20.00210246]
  b = 
[ 20.0195152    5.97834039]

[ 0.01955714  0.98044286]
[ 0.0021444  0.9978556]

with the warning:

/Users/thomas.christie/.virtualenvs/python3/lib/python3.5/
site-packages/bayespy/inference/vmp/vmp.py:678: 
UserWarning: Lower bound decreased 6.663974e-03!     
Bug somewhere or numerical inaccuracy?
  "numerical inaccuracy?" % L_diff)

I'm going to work on the more complex part now, but I just wanted to report that I think this part at least seems to be working...

@jluttine
Member

jluttine commented Dec 5, 2015

Just a very quick comment. I looked into this yesterday a bit, and I think I need to implement a few simple nodes. Your idea is correct in principle, but in practice stochastic intermediate nodes with almost deterministic properties cause problems for the VB approximation. Thus, you might get a very bad posterior approximation which captures only one of the modes.

I was planning to implement a Take node (similar to np.take) and a ConditionalProbabilityTable node. Either one would help you construct your model easily.

I'll get back to you in a few days with more detailed comments.

@tom-christie
Author

That sounds great. Sorry to create more work. I originally thought that an _and() version of your _or() function in #23 might be of some help but I wasn't quite sure how to apply it to this case.

jluttine added a commit to bayespy/bayespy-notebooks that referenced this issue Dec 8, 2015
@jluttine
Member

jluttine commented Dec 8, 2015

Note: This message is a Jupyter Notebook, available for download or as an interactive session. It is a response to BayesPy issue #34.

It's always fun to implement new features that might be useful for someone, so no worries! I implemented a Take node which should, I hope, enable you to construct your model. It is similar to numpy.take, so you may want to take a look at that too. I haven't made a new release yet, so you need to use the develop branch. Use, for instance,

pip install https://github.com/bayespy/bayespy/archive/develop.zip

If you notice any bugs, please report. You can read the docstring here: https://github.com/bayespy/bayespy/blob/develop/bayespy/inference/vmp/nodes/take.py#L14
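As a quick reminder of the numpy counterpart, indexing a 1-D array with a 2x2 index array gives a 2x2 result:

import numpy as np
print(np.take(np.array([10, 20, 30]), [[0, 0], [0, 1]]))
# [[10 10]
#  [10 20]]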

Anyway, so I removed lambda3 and theta3, as theta3 was supposed to be just a deterministic function of theta1 and theta2. Here are the parameters of your model:

from bayespy.nodes import Categorical, Beta, Mixture
lambda1 = Beta([20,5]) 
lambda2 = Beta([[5,20],[20,5]])
theta1 = Categorical(lambda1)
theta2 = Mixture(theta1, Categorical, lambda2)

pi1 = Beta([[[5,20], [20,5]]], plates=(10,2))
pi2 = Beta([[[5,20], [20,5]]], plates=(10,2))
pi3 = Beta([[[5,20], [20,5]]], plates=(10,2))

In order to create an AND operation for theta1 and theta2, I map the two elements along the last plate axis of pi3 into a 2x2 matrix so that I can use two nested Mixture nodes:

from bayespy.nodes import Bernoulli, Take
X1 = Mixture(theta1, Bernoulli, pi1)
X2 = Mixture(theta2, Bernoulli, pi2)
X3 = Mixture(theta1, Mixture, theta2, Bernoulli, Take(pi3, [[0, 0], [0, 1]]))

X1.observe([0,1,0,1,0,1,0,1,0,1])
X2.observe([0,1,0,1,0,1,0,1,0,1])
X3.observe([1,1,1,1,1,1,1,1,1,1])

Basically, you can think of that 2x2 index table as a table which tells how to map values of theta1 and theta2 into plate indices for pi3. Can't find the words to explain it in a simple way, but it's very simple.. :)
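Spelled out as a sketch (reading the table with theta1 selecting the row and theta2 the column; for this symmetric table the result is the same either way):

# index_table[theta1][theta2] -> which element of pi3's last plate axis to use
index_table = [[0, 0],   # theta1=0: always use pi3 element 0, i.e. Beta(5,20)
               [0, 1]]   # theta1=1: use element 1, i.e. Beta(20,5), only when theta2=1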

from bayespy.inference import VB
Q = VB(X1, X2, X3, pi1, pi2, pi3, theta2, theta1, lambda1, lambda2)
Q.update(repeat=100)
print(theta1.get_moments()[0])
print(theta2.get_moments()[0])
Iteration 1: loglike=-2.490204e+01 (0.005 seconds)
Iteration 2: loglike=-2.451496e+01 (0.005 seconds)
Iteration 3: loglike=-2.454422e+01 (0.004 seconds)
Converged at iteration 3.
[  1.09166015e-05   9.99989083e-01]
[  2.96225661e-06   9.99997038e-01]


/home/jluttine/Workspace/bayespy/bayespy/inference/vmp/vmp.py:678: UserWarning: Lower bound decreased 2.926577e-02! Bug somewhere or numerical inaccuracy?
  "numerical inaccuracy?" % L_diff)

Oh, it looks like there is a bug in lower bound computation.. I need to take a look.. Anyway, the result looks reasonable for that data.

I'm planning to implement a node which would allow users to construct complex discrete graphs as part of a model and perform "exact"/non-factorized inference within that set of variables. See issue #37 for details. But currently, when you construct these discrete-variable graphs in BayesPy, the posterior factorizes with respect to the nodes, so it is important to understand what you lose. Let me demonstrate one problem of the mean-field approximation by using a classic XOR example. There are two booleans v1 and v2. A third boolean variable v3 depends on them in such a way that it is probably True if exactly one of v1 or v2 is True; if both or neither of them is True, then v3 is probably False. Here's the model:

v1 = Categorical([0.5, 0.5])
v2 = Categorical([0.5, 0.5])
v3 = Mixture(v1, Mixture, v2, Categorical,
             [[[0.9,0.1], [0.1,0.9]],
              [[0.1,0.9], [0.9,0.1]]])
v3.observe(1)

Intuitively, because of the symmetry, observing v3 doesn't give any information about the marginal posterior of either v2 or v1. It only couples them. As we observed v3=1, there's high probability mass on (v1=1,v2=0) and (v1=0,v2=1), and low probability mass on (v1=0,v2=0) and (v1=1,v2=1). This probability table can't be represented as a product of two marginals. And because the used VB approximation factorizes as $q(v_1)q(v_2)$, it loses the coupling and captures only one mode. So, if you run the inference algorithm, you'll notice that the result has captured either mode (v1=1,v2=0) or (v1=0,v2=1). The mode is chosen randomly, which you'll notice if you re-run the inference several times:

v1.initialize_from_random()
v2.initialize_from_random()
Q = VB(v1, v2, v3)
Q.update(repeat=100, verbose=False)
print(v1.get_moments()[0])
print(v2.get_moments()[0])
[ 0.24894775  0.75105225]
[ 0.75086601  0.24913399]

The true marginals would have been [0.5, 0.5] for both v1 and v2. So, whenever you have very strong coupling between different nodes, the approximation tends to capture only one mode and loses the dependency between the variables. This is basic stuff which is good to keep in mind. Maybe it was obvious to you already, but just wanted to point it out. The new CategoricalGraph node would get rid of this problem because the node would handle all the dependencies among the variables.

And finally, you asked about getting an incorrect mean (0.813) when sampling a node. Yes, the random() method of a node does not give you samples from the generative prior, nor does it give you samples from the true posterior. It gives you samples from the current state of the node. As the nodes are factorized, the initial state of the nodes isn't the exact prior distribution, because in the prior the variables are not factorized. Thus, in general, you will not get samples from the true prior, but from an approximation of the prior/posterior, or whatever happens to be the state of the node at the moment. If you add more prior "samples" to your Beta distribution, you'll get a better mean estimate:

import numpy as np
print(np.mean(Bernoulli(Beta([20, 5]), plates=(100000,)).random()))
print(np.mean(Bernoulli(Beta([200, 50]), plates=(100000,)).random()))
print(np.mean(Bernoulli(Beta([20000, 5000]), plates=(100000,)).random()))
print(np.mean(Bernoulli(Beta([2000000, 500000]), plates=(100000,)).random()))
0.81294
0.79931
0.79989
0.80094

I hope I understood your model correctly this time. Please don't hesitate to ask if you have further questions or comments!

@tom-christie
Author

This is fantastic, thank you so much for your help! I will go through your suggestions in detail and try to implement the full model over the next couple of days and will let you know if I run into any snags. Either way I will report back. Thank you again!

@tom-christie
Author

I have a quick followup about the interpretation of results. If I have the code:

lambda1 = Beta([20,5]) #a is virtual 'TRUE'/success count, b is virtual 'FALSE'/failure count
lambda2 = Beta([[5,20], #(a,b) if lambda1=FALSE
                [20,5]]) #(a,b) if lambda1=TRUE
    #where a indicates an additional theta2=true, and 
    #b indicates an additional theta2=false
theta1 = Categorical(lambda1)
theta2 = Mixture(theta1, Categorical, lambda2)

with data

#similar to the results we'd get if the true value for theta1 and theta2 are both True or 1
X1.observe([1,1,1,1,1,1,1,1,1,1])
X2.observe([1,1,1,1,1,1,1,1,1,1])
X3.observe([1,1,1,1,1,1,1,1,1,1]) 

and results

lambda1=  ~ Beta(a, b)
  a = 
20.000000000010093
  b = 
5.9999999999899085  #additional virtual theta1=FALSE/failure count

lambda2=  ~ Beta(a, b)
  a = 
[  5.  20.]
  b = 
[ 20.   6.] #additional virtual theta2=FALSE count, but an increment in the second row indicates that lambda1=TRUE

[  1.00916393e-11   1.00000000e+00] #indicates theta1=TRUE
[  2.73820053e-12   1.00000000e+00] #indicates theta2=TRUE

The 6 in lambda1's b, and in the second element of b in lambda2, suggests to me that the inference result means that the generator of the data (the person giving the responses X in the model) does NOT have proficiency theta1. But the direct theta1 moments suggest that they in fact do have those proficiencies.

I bet I'm mixing up the order of something. I thought the order of parameters in the plate was index0=false, index1=true, so that

lambda2 = Beta([[5,20], # used if theta1=FALSE in theta2 = Mixture(theta1, Categorical, lambda2)
                [20,5]]) # used if theta1=TRUE in theta2 = Mixture(theta1, Categorical, lambda2)

But maybe I'm wrong about this and this is where my confusion lies?

Edit:
Just to clarify, if the individual generating responses X has theta1=TRUE and theta2=TRUE, I would think the results should be:

lambda1=  ~ Beta(a, b)
  a = 
21 #reflecting theta1=true
  b = 
5

lambda2=  ~ Beta(a, b)
  a = 
[  5.  21.] #a is incremented (since theta2=true) for index 1 (indicating theta1=true)
  b = 
[ 20.   5.] 

But clearly I'm not understanding something about the indexing!

Edit 2:

When I switch the indexing in the 2-D parameters to reflect [ [a-if-true,b-if-true],[a-if-false,b-if-false]] and have the model:

lambda1 = Beta([20,5]) #(a,b)
lambda2 = Beta([[20,5], #(a,b) if lambda1=TRUE
            [5,20]]) #(a,b) if lambda1=FALSE
theta1 = Categorical(lambda1)
theta2 = Mixture(theta1, Categorical, lambda2)

pi1 = Beta([[20,5], #now used if theta1 is TRUE
            [5,20]], plates=(10,2)) #for question 1
pi2 = Beta([[20,5], 
            [5,20]], plates=(10,2)) #for question 2
pi3 = Beta([[20,5], 
            [5,20]], plates=(10,2)) #for question 3

and I get the results I'd expect:

lambda1=  ~ Beta(a, b)
  a = 
20.99999998010658 #reflects 'observed' theta1=true
  b = 
5.0000000198934185

lambda2=  ~ Beta(a, b)
  a = 
[ 20.99999989   5.00000002] #reflects observed theta2=true (a is incremented) and theta1 is true (first index is used)
  b = 
[  5.00000009  20.        ]

[  9.99999980e-01   1.98934186e-08] #reflects theta1=true if first index means true
[  9.99999913e-01   8.69033118e-08] #reflects theta2=true if first index means true

I assumed that the true/false values for variables mapped to 1 and 0 respectively, making the indexes of the 2-D parameter arrays (like for lambda2 and the pi values) [0,1] or [FALSE,TRUE]. I can handle using [TRUE,FALSE] instead, but then shouldn't the Take node be:

Take(pi3, [[1, 0], [0, 0]])

rather than

Take(pi3, [[0, 0], [0, 1]])

?

@jluttine
Member

Ok, I think the confusion comes from the Beta node. You interpret Beta([20, 5]) as 20 virtual TRUE samples and 5 virtual FALSE samples, which is the correct interpretation of the Beta distribution in general. This interpretation works well when a Beta node is used with a Bernoulli node. However, with a Categorical node, it is interpreted as prior samples for the classes. In this case, the number of classes is two, so there are 20 samples for class 0 and 5 samples for class 1. And then class 1 is interpreted as TRUE, which contradicts how the Beta distribution is typically interpreted. So yes, confusing.

My conclusion from this would be that it's not good to use a Beta node with Categorical because it will just confuse people. I think I suggested Bernoulli and Beta nodes to you - sorry about that. So I'd suggest you replace all Beta nodes with Dirichlet nodes and all Bernoulli nodes with Categorical nodes; then there should be much less confusion. You can of course choose any interpretation for class 0 and class 1, but you probably want to interpret class 0 as FALSE and class 1 as TRUE. These classes map to the plates of the same index in mixtures, so the [[0, 0], [0, 1]] mapping should work.
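To make that substitution concrete, here is a minimal sketch of the skill part of the model using only Dirichlet and Categorical, reading class 0 as FALSE and class 1 as TRUE:

from bayespy.nodes import Dirichlet, Categorical, Mixture
lambda1 = Dirichlet([5, 20])      # prior mass: 5 on class 0 (FALSE), 20 on class 1 (TRUE)
lambda2 = Dirichlet([[20, 5],     # row 0: prior for theta2 when theta1 = FALSE
                     [5, 20]])    # row 1: prior for theta2 when theta1 = TRUE
theta1 = Categorical(lambda1)
theta2 = Mixture(theta1, Categorical, lambda2)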

So sorry about suggesting incorrect/confusing solutions, and thanks for pointing it out. :) I hope this helps you solve the issue.

@jluttine
Member

I'm not sure how best to fix this in BayesPy. Would it be better to make Beta([a, b]) equivalent to Dirichlet([b, a]) (i.e., the classes would map as FALSE->0 and TRUE->1), or to remove the ability to use Beta as a prior for Categorical and tell the user to use Dirichlet instead? Either way, I suppose the current behaviour, where Beta([a, b]) is equivalent to Dirichlet([a, b]), is confusing. If you have any thoughts, I'd be glad to hear them.

@tom-christie
Author

Yes! That's exactly it. The model in the example I'm trying to replicate uses Beta -> Bernoulli, but I switched to Dirichlet -> Categorical because of the NoConverterError, thinking that the results should be equivalent. Then you suggested I switch back to Beta and I didn't think to switch back to Bernoulli, though I should have. So as I suspected, the issue was rooted in my confusion, though it wasn't in the place I thought it was.

I think this gives me everything I need. Thank you so much for your help. I've thought about building some convenience nodes to help with the multiple nesting of gates or mixture nodes with lots of parameters. If I come up with anything particularly useful I'll let you know for your consideration.

I don't think making Beta([a,b]) equivalent to Dirichlet([b,a]) would be wise, because it's non-standard and will be confusing to others. And removing compatibility where it technically should exist may not be a good idea either. Perhaps printing some kind of warning when Beta is used as a prior for Categorical would be sufficient, though not strictly necessary if the user knows what they're doing, unlike me apparently!

Most of the knowledge I have about Bayesian modeling comes from the books "Bayesian Data Analysis" by Gelman and "Doing Bayesian Data Analysis" by Kruschke. Neither of those talks about Mixture nodes or Gates as far as I'm aware, so I've had a little bit of trouble figuring out what's really going on in your implementation. The other implementation of this model I've done was in PyMC, which allows you to construct your own deterministic nodes, which obviates the need for these nodes (I think...). I did find a recent paper by Microsoft describing Gate notation, though I haven't been able to find anything similar about a Mixture node. It might be worth adding a few references to the documentation to help users get a better idea of how you're using these nodes, and the plate notation that goes along with them.

Thank you again!

@jluttine
Member

Glad to hear it became clearer. Even I had to think for a while about how the Beta / Bernoulli / Categorical mappings go, so it probably isn't very obvious to new users. I have to think about it. Probably I'll at least add a warning, and then maybe make Beta([a, b]) equivalent to Dirichlet([b, a]), because then Categorical(p) and Bernoulli(p) would be equivalent regardless of whether p is a Beta or a Dirichlet. But yes, at least some message/warning should be printed (which could be easily filtered out). And yes, I should improve the docstring documentation of the Mixture node, thanks for pointing that out.

I'm planning to add support for black box variational inference in which case one could write arbitrary deterministic or stochastic nodes easily (similar to PyMC), but using black box variational inference is not as efficient because it's extremely general. But in any case it would be a great and essential feature. It's on the TODO list...

And again, please do not hesitate to ask further questions or to report problems. It's extremely valuable to hear what problems users are facing.

@tom-christie
Author

OK, I'm trying to build the dependencies between my latent variables and I keep running into different NoConverterErrors, one with Gates and one with Mixtures. Here are the two I've seen. It may be that I'm not doing something right...

lambda1 = Beta([5,20]) #Beta(a,b) but parameters switched since it's prior for Categorical not Bernoulli
lambda2 = Beta([[20,5], # if lambda1=FALSE
                [5,20]]) # if lambda1=TRUE

lambda5 = Beta([[20,5], # none of (theta1,theta2) are true
               [12.5,12.5],# exactly one of (theta1,theta2) is true
               [5,20]]) # both of (theta1,theta2) are true
lambdaMN = Dirichlet([[5, 7, 15], # none of (theta1,theta2,theta5) are true
                      [7, 9, 11], # one of (theta1,theta2,theta5) is true
                      [11, 9, 7], # two of (theta1,theta2,theta5) are true
                      [15, 7, 5]]) # ALL of (theta1,theta2,theta5) are true

theta1 = Categorical(lambda1)
theta2 = Gate(theta1, lambda2) 

theta5 = Gate(theta1, 
              Gate(theta2, 
              Bernoulli(Take(lambda5, [[0, 1], 
                                       [1, 2]])
                       )
                  )
              )

gives

NoConverterError: No conversion defined from BetaMoments to CategoricalMoments

while

theta5 = Mixture(theta1, 
                 Mixture, theta2, 
                 Bernoulli, Take(lambda5, [[0, 1], 
                                           [1, 2]]))

gives

NoConverterError: No conversion defined from BetaMoments to CategoricalMoments

I'm not super confident using Mixtures/Gates/Take together yet so I'm not sure if this is a problem with my code or something just not implemented yet (like you mentioned above).

@jluttine
Member

  1. For theta2, you are gating lambda2, which is a Beta. You probably want a Categorical mixture instead:
     theta2 = Mixture(theta1, Categorical, lambda2)
  2. Same thing for theta5: theta2 is not a categorical variable but a beta variable, so you can't use it to choose the clusters in a mixture. Maybe I should report which node raises the NoConverterError so it would be easier to debug.

In general, I would suggest avoiding Gate node. There are cases when you can't use Mixture and you must use Gate, but at least I find it clearer and less error-prone to use Mixture. But of course, it's a matter of preference (in this case).

@jluttine
Member

You also commented about some convenience node. Here's a sketch that you may find useful. It's probably not perfect as-is, but I hope it gives a starting point. Note that you don't necessarily need to write node classes; in some cases you can use simple functions.

from bayespy.nodes import Mixture, Categorical, Take
from bayespy.utils import misc

def MappedCategoricalMixture(thetas, indices, p):
    thetas = list(thetas)
    N = len(thetas)
    args = thetas[:1] + misc.zipper_merge((N-1) * [Mixture], thetas[1:])
    args = args + [Categorical, Take(p, indices)]
    return Mixture(*args)

theta5 = MappedCategoricalMixture([theta1, theta2], [[0, 1], [1, 2]], lambda5)
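For two thetas, the args list that the helper assembles is just the nested-mixture call written out, so the line above is equivalent to:

# What MappedCategoricalMixture([theta1, theta2], [[0, 1], [1, 2]], lambda5) builds:
theta5 = Mixture(theta1,
                 Mixture, theta2,
                 Categorical, Take(lambda5, [[0, 1],
                                             [1, 2]]))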

@jluttine
Member

FYI, there is a bug when using nested mixtures (#39). I'm currently looking at it, but at the moment the results are most probably incorrect and you may get warnings about the lower bound decreasing. Sorry for the inconvenience; I'll inform you as soon as I get it fixed.

@jluttine
Member

FYI, I have already tracked down the bug. Yep, nested mixtures were totally incorrect. I already know how to fix it, I'll probably try to do it tomorrow.

@tom-christie
Author

Awesome, thank you for following up about that.

The MappedCategoricalMixture is fantastic and has already saved me lots of trial and error getting nested mixtures to work, so thank you!! It all seems to work if I don't use plates, but now I need to put some of the variables on a plate.

The question I'm working on now is getting plates to work with nested mixtures. I can get everything to work properly with the rest of the model. I modified the helper function to accept a plates parameter:

def MappedCategoricalMixture(thetas, indices, p, plates = None):
    thetas = list(thetas)
    N = len(thetas)
    args = thetas[:1] + misc.zipper_merge((N-1) * [Mixture], thetas[1:])
    args = args + [Categorical, Take(p, indices)]
    if plates:
        return Mixture(*args, plates=plates)
    else:
        return Mixture(*args)

The minimal example is:

# these are shared across all students so there is no plate here
lambda1 = Beta([5,20])
lambda2 = Beta([[20,5], # if lambda1=False
                [5,20]]) # if lambda1=True

lambda5 = Beta([[20,5], # none of (theta1,theta2) are True
               [12.5,12.5],# exactly one of (theta1,theta2) is True
               [5,20]]) # both of (theta1,theta2) are True

#--------------------- plates i
numStudents = 11
theta1 = Categorical(lambda1,plates=(numStudents,)) #these are plated fine
theta2 = Mixture(theta1, Categorical,lambda2) #inherits plates from theta1 just fine
theta5 = MappedCategoricalMixture([theta1,theta2],  [[0, 1], [1, 2]], lambda5, plates=????)

Seemingly no matter what I put for the plates parameter in theta5 (including omitting the parameter), I get the exact same error:

ValueError: Shapes ([2], (11,)) do not broadcast
...
During handling of the above exception, another exception occurred:
....
ValueError: The plates of the parents do not broadcast.

11 is clearly the numStudents value, but I don't know where it gets [2] from - shouldn't plate dimensions be tuples in any case?

Section 2.3.3 of the documentation (Irregular plates) seems to suggest that the Mixture inherits plates from the first parameter (the categorical-like node or array), and that seems to be the case for theta2, but I can't get it to work for theta5.

The higher-level view of what I'm trying to do is to get all of the theta variables on the same "i" plate, like in the plate-notation image above.

I'm ignoring the "j" and "s" plates and am going to just make multiple nodes, but I do need the "i" plate, and I need all the thetas and Xs to be on the plate, but NOT the lambda or pi nodes. And like I said, it all works when I make a simpler model that doesn't have a nested mixture. I'd appreciate any suggestions!

Edit: If I don't plate the theta values and JUST plate the X values, I do seem to get a model that works:

lambda1 = Beta([5,20]) 
lambda2 = Beta([[20,5],
                [5,20]])
lambda5 = Beta([[20,5],
               [12.5,12.5],
               [5,20]]) 
lambdaMN = Dirichlet([[5, 7, 15],
                      [7, 9, 11],
                      [11, 9, 7],
                      [15, 7, 5]])

#proficiency model
theta1 = Categorical(lambda1)
theta2 = Mixture(theta1, Categorical,lambda2) 
theta5 = MappedCategoricalMixture([theta1,theta2],  [[0, 1], [1, 2]], lambda5)
thetaMN = MappedCategoricalMixture([theta1,theta2,theta5], [[[0,1],[1,2]], [[1,2],[2,3]]], lambdaMN)

pi1 = Beta([[20,5], 
            [5,20]]) #for question 1
pi2 = Beta([[20,5], 
            [5,20]]) #for question 2
pi3 = Beta([[20,5], 
            [5,20]]) #for question 3

X1 = Mixture(theta1, Categorical, pi1, plates=(10,)) # prior pi1 for X1 depends on Theta1
X2 = Mixture(theta2, Categorical, pi2, plates=(10,)) # prior pi2 for X2 depends on Theta2
X3 = MappedCategoricalMixture([theta1,theta2],[[0, 0], [0, 1]], pi3, plates=(10,))

This works great. So I'm not sure what I'm doing wrong above. Maybe the issue has to do with the nested Mixture inheriting plate values?

@jluttine
Member

The plate broadcasting issue is most likely related to the nested-mixture bug I discovered (#39). So, with the current code, you will most likely see incorrect results and problems such as plates not matching. Actually, I would say that you are certain to get incorrect results when using nested mixtures. They are the right way to go; the issue is in BayesPy. Thus, your code might be OK (I haven't taken a careful look yet), but at the least the Mixture node is doing the plate-axis mapping incorrectly, and I hope to get that issue fixed ASAP.

jluttine added a commit to bayespy/bayespy-notebooks that referenced this issue Dec 13, 2015
@jluttine
Member

Note: This message is a Jupyter Notebook, available for download or as an interactive session. It is a response to BayesPy issue #34.

I fixed the bug in nested mixtures, so they should work now. I emphasize that even if you were able to use them before, the results were almost certainly incorrect.

I added a convenience function MultiMixture which creates a mixture given multiple categorical variables. It hides some complexities. Thus, MappedCategoricalMixture simplifies into:

from bayespy.nodes import *
from bayespy.utils import misc
def MappedCategoricalMixture(thetas, indices, p, **kwargs):
    return MultiMixture(thetas, Categorical, Take(p, indices), **kwargs)

Note that you will need to use the develop branch. And now you should be able to construct the model without errors:

lambda1 = Dirichlet([5,20])
lambda2 = Dirichlet([[20,5], # if lambda1=False
                     [5,20]]) # if lambda1=True

lambda5 = Dirichlet([[20,5], # none (theta1,theta2) are True
                     [12.5,12.5],# exactly one of (theta1,theta2) is True
                     [5,20]]) # both of (theta1,theta2) are True

numStudents = 11
theta1 = Categorical(lambda1, plates=(numStudents,))
theta2 = Mixture(theta1, Categorical, lambda2)
theta5 = MappedCategoricalMixture([theta1, theta2],
                                  [[0, 1], [1, 2]],
                                  lambda5)

As I wrote all of this into BayesPy quite quickly, there might be errors. If you notice any weird issues, please report them. And again, please do ask further questions if you have any; this is really helping me improve BayesPy. Thank you for your patience with all these issues.

@tom-christie
Author

I'm so pumped - everything seems to work now. I really appreciate your help, both in catching my mistakes and in your responsiveness to fixes in BayesPy.

I'm going to do some fairly thorough model validation. If anything seems awry I will let you know. Otherwise, I'll send you a fairly in-depth IPython notebook with the model write-up that you can use as another example if you wish.

@jluttine
Member

Great! Glad to hear it finally works. 😋 And yes, it'd be great to have an IPython Notebook with explanations about this in the examples. You can make a pull request for the notebook. Currently, the other examples aren't notebooks, but I'm in the process of converting all the examples into notebooks, so notebooks are preferred over ReST files (in case you happened to wonder).

@jluttine
Member

jluttine commented Nov 4, 2016

Might not be relevant to you anymore, but I just found a critical bug in the Take node. I believe that any model that has used that node has incorrect results. Just wanted to inform you in case it matters. I fixed it and made release 0.5.5.

@tom-christie
Author

Great, thank you for letting me know!
