Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Latest commit 0fd196b
Jun 15, 2011
|Failed to load latest commit information.|
======================= PyMC Multilevel Example ======================= A nice example of using PyMC for multilevel (aka "Random Effects") modeling came through on the PyMC mailing list today. I've put it into a git repo so that I can play around with it a little, and collect up the feedback that the list generates. The original message, from Scott B, lays out the situation with refreshing precision. I continue to be impressed with the skill-level that new PyMC users frequently begin with (for fun, compare to the questions on the NetworkX mailing list sometime... I guess this is an indication of the relative success in the popularization of network science and Bayesian statistics). Anyways, here is the original message:: Hi All, New to this group and to PyMC (and mostly new to Python). In any case, I'm writing to ask for help in specifying a multilevel model (mixed model) for what I call partially clustered designs. An example of a partially clustered design is a 2-condition randomized psychotherapy study where subjects are randomly assigned to conditions. In condition 1, subjects are treated in small groups of say 8 people a piece. In the condition 2, usually a control, subjects are only assessed on the outcome (they don't receive an intervention). Thus you have clustering in condition 1 but not condition 2. The model for a 2-condition study looks like (just a single time point to keep things simpler): y_ij = b_0 + b_1*Tx + u_j*Tx + e_ij where y_ij is the outcome for the ith person in cluster j (in most multilevel modeling software and in PROC MCMC in SAS, subjects in the unclustered condition are all in clusters of just 1 person), b_0 is the overall intercept, b_1 is the treatment effect, Tx is a dummy coded variable coded as 1 for the clustered condition and 0 for the unclustered condition, u_j is the random effect for cluster j, and e_ij is the residual error. The variance among clusters is \sigma^2_u and the residual term is \sigma^2_e (ideally you would estimate a unique residual by condition). Because u_j interacts with Tx, the random effect is only contributes to the clustered condition. In my PyMC model, I expressed the model in matrix form - I find it easier to deal with especially for dealing with the cluster effects. Namely: Y = XB + ZU + R where X is an n x 2 design matrix for the overall intercept and intervention effect, B is a 1 x 2 matrix of regression coefficients, Z is an n x c design matrix for the cluster effects (where c is equal to the number of clusters in the clustered condition), and U is a c x 1 matrix of cluster effects. The way I've written the model below, I have R as an n x n diagonal matrix with \sigma^2_e on the diagonal and 0's on the off-diagonal. All priors below are based on a model fit in PROC MCMC in SAS. I'm trying to replicate the analyses in PyMC so I don't want to change the priors. The data for my code can be downloaded here: http://dl.dropbox.com/u/613463/run1.csv The data consist of 200 total subjects. 100 in the clustered condition and 100 in the unclustered. In the clustered condition there are 10 clusters of 10 people each. There is a single outcome variable. The PyMC set-up file can be downloaded here (also pasted below): http://dl.dropbox.com/u/613463/analysis1.py The file for running the PyMC set-up file can be downloaded here: http://dl.dropbox.com/u/613463/run_analysis1.py I have 3 specific questions about the model: 1 - Given the description of the model, have I successfully specified the model? The results are quite similar to PROC MCMC, though the cluster variance (\sigma^2_u) differs more than I would expect due to Monte Carlo error. The differences make me wonder if I haven't got it quite right in PyMC. 2 - Is there a better (or more efficient) way to set up the model? The model runs quickly but I am trying to learn Python and to get better at optimizing how to set up models (especially multilevel models). 3 - How can change my specification so that I can estimate unique residual variances for clustered and unclustered conditions? Right now I've estimated just a single residual variance. But I typically want separate estimates for the residual variances per intervention condition. Thanks so much for your help. My code follows my signature. Best, Scott Here are my notes on the matter: 1. This code worked for me without any modification. :) Although when I tried to run it twice in the same ipython session, it gave me strange complaints. (for pymc version 2.1alpha, wall time 78s). For the newest version in the git repo (pymc version 2.2grad, commit ca77b7aa28c75f6d0e8172dd1f1c3f2cf7358135, wall time 75s) it didn't complain. 2. I find the data wrangling section of the model quite opaque. If there is a difference between the PROC MCMC and PyMC results, this is the first place I would look. I've been searching for the most transparent ways to deal with data in Python, so I can share some of my findings as applied to this block of code. 3. The model could probably be faster. This is the time for me to recall the two cardinal rules of program optimization: 1) Don't Optimize, and 2) (for experts only) Don't Optimize Yet. That said, the biggest change to the time PyMC takes to run is in the step methods. But adjusting this is a delicate operation. More on this to follow. 4. Changing the specification is the true power of the PyMC approach, and why this is worth the extra effort, since a random effects model like yours is one line of STATA. So I'd like to write out in detail how to change it. More on this to follow. 5. Indentation should be 4 spaces. Diverging from this inane detail will make python people itchy. I think that speeding up the MCMC is more fun than cleaning up the data wrangling, so I'll assume that the data is being set up as desired for now and mess around with the step methods. Commit linked here https://github.com/aflaxman/pymc-multilevel-example/commit/03d0e64653a99085ac5f391acd0ad5839ddfaae4 is a small example of this, which doesn't speed things up, but could mean less samples are required of the chain. To actually speed things up (and also potentially reduce the number of samples required), I'm going to combine the beta0 and beta1 stochs. This reduces the wall time to 58s, but the diagnostic plots make me think the it needs a longer burn-in period. I like to throw out the first half of my samples, so I'll make that change, too. Pro tip: I added a <code>reload(analysis1)</code> to the run_analysis1.py script, so that my changes to the model automatically make it into ipython when I say <code>time run run_analysis1</code>. Commit linked here https://github.com/aflaxman/pymc-multilevel-example/commit/6bd55cef9c5d854f7755924c705ab8e111e20b3c I'm not keeping careful track of if the results change as I modify the step methods, which is probably not a recommended practice. But it is fun to see the methods get faster and faster as I group more and more in the AM step method. (43s wall time for M.B and M.U in one AM, commit here; 32s wall time for all stochs together, but the results are not pretty) https://github.com/aflaxman/pymc-multilevel-example/commit/49dea954c0c681ad74a2b427a2e6ca594e501c4f Sometimes I've had good luck with choosing initial values for the MCMC by maximizing the posterior or something related to it. Here is code to do that, which doesn't take very long (36s wall time, and respectible looking results are back) https://github.com/aflaxman/pymc-multilevel-example/commit/e8c376b8685c64c7183a8aeb14ec05b095042652 So that little burst of messing around has halved the run time. Too bad I didn't work on the things I was supposed to be doing for the last hour. Another hour of messing around has given me some confidence that 200K samples is certainly enough, and 20K samples probably is enough to guide model development. To assess this, I relied on the autocorrelation plot of the stoch traces, which I added to the Matplot.plot routine, since I think everyone should use it approach. Commit linked here https://github.com/aflaxman/pymc-multilevel-example/commit/e8545ca5bec6e3a1e5dbb7ba20bb957bc67c0d95 Now I'm ready to try simplifying the data wrangling section, and then everything will be in order to extend the model to have different variance for the clustered and unclustered observations. I like to use pylab's csv2rec function for loading data files. It deals nicely with the column headings, and converts data to the correct type pretty routinely. I'm going to do two commits in the process of changing this code, one which checks carefully that I'm not breaking anything, and then another to yield a nice short block of code. Commits linked here https://github.com/aflaxman/pymc-multilevel-example/commit/8c06499b287e1bbd9dd7a4146bc627d6e9187ad8 and here https://github.com/aflaxman/pymc-multilevel-example/commit/11ff14dab2377491e8d4d155acfd74b7ca6c93d0 Now, on to changing the model! I added a separate variance parameter for the treatment and control groups in a slightly sneaky way. It only took 10 keystrokes (roughly), but maybe it would be better to use more code and have a clearer approach. Commit linked here https://github.com/aflaxman/pymc-multilevel-example/commit/a46531085c673e62458b8b28153f50aebae036de Before: B: [-0.07 0.3 ] u: [-0.14 0.02 0.3 0.5 0.36 -1.02 0.33 0.23 -0.28 -0.35] var_e1: 0.94 var_u: 0.4 After: B: [-0.06 0.31] u: [-0.13 0. 0.28 0.53 0.39 -1.08 0.34 0.22 -0.31 -0.35] var_e1: [ 1.12 0.79] var_u: 0.41 Since I sneakily used the indicator run1.treat to index the var_e1 array, var_e1 is the control variance and var_e1 is the treatment variance. So the change in the model doesn't change any of the parameter means, but it does show that the variance of the control group is higher than for the treatment group. Too bad I didn't generate before-and-after numbers on the uncertainty intervals, that is something that might have changed in an interesting way with this more complicated model. In conclusion, here is why I think that this approach is worth the extra effort, compared to using xtmixed in STATA. Although the STATA model is almost done when figure out the single line <code>xtmixed y treat ||j:</code>, that is not quite what you want. The control groups are not supposed to have random effects, so you have to find a hack for that, maybe <code>xtmixed y treat ||j: treat, nointercept</code>. And then you want to add different variances for treatment and control, which is doable, but also hacky, maybe with <code>xtmixed y treat ||j: treat, nointercept ||treat:</code>. And now you've got a very short line, but one that is going to be very hard to understand in the future. And what if you want to fiddle with the priors? Or change the noise model? Much better to have 30 clear lines of PyMC, in my opinion. p.s. Django and Rails have gotten a lot of mileage out of emphasizing _convention_ in frequently performed tasks, and I think that PyMC models could also benefit from this approach. I'm sure I can't develop our conventions myself, but I have changed all the file names to move towards what I think we might want them to look like. Commit linked here https://github.com/aflaxman/pymc-multilevel-example/commit/603112c1de3d0fd8fa8fdc76a21ec3f290d30a06 My analyses often have these basic parts: data, model, fitting code, graphics code. Maybe your do to.