
Radon Example Implementation #440

Merged
merged 31 commits into bambinos:main on Feb 14, 2022

Conversation

juanitorduz (Contributor)

@juanitorduz juanitorduz commented Jan 14, 2022

This PR aims to solve #439.

I have pushed the first version of the models. Here are the formulas I used:

  • pooled_model: log_radon ~ floor
  • unpooled_model: log_radon ~ 0 + county + county:floor
  • partial_pooling_model: log_radon ~ 0 + (0 + 1|county)
  • varying_intercept_model: log_radon ~ 0 + (1|county) + floor # ! How to do complete one-hot-encoding for county?
  • varying_intercept_slope_model: log_radon ~ 0 + (floor|county) # ! How to do complete one-hot-encoding for county?

I would like to have all the counties encoded (not dropping the first one), and I am not auto-scaling the data, so that the results are comparable with the original PyMC example. Does this look good? I am still getting used to the formula-like notation for hierarchical models (which is why I am working on this PR 😅).

After double-checking the model specification I will continue to add text and some additional plots.

@review-notebook-app
Check out this pull request on ReviewNB

@juanitorduz juanitorduz marked this pull request as draft January 14, 2022 14:21
review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:01Z
----------------------------------------------------------------

We could convert the variable "floor" to categorical in pandas. This would help users see the advantage of Bambi when dealing with categorical predictors.

Right now, "floor" is 0 and 1, which is a dummy-coded categorical variable. But if we map 0 -> Basement and 1 -> First floor, we can take advantage of automatic dimension labelling.

This is what I mean

srrs_mn["floor"] = srrs_mn["floor"].map({0: "Basement", 1: "Floor"})
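For anyone following along, the suggested recoding can be sketched on a toy frame like this (the toy data is illustrative, not the actual `srrs_mn` contents):

```python
import pandas as pd

# Toy stand-in for the radon data; the real srrs_mn frame has many more columns.
df = pd.DataFrame({"floor": [0, 1, 1, 0]})

# Map the 0/1 dummy coding to descriptive labels so Bambi can
# label the posterior dimensions automatically.
df["floor"] = df["floor"].map({0: "Basement", 1: "Floor"})

print(df["floor"].tolist())  # ['Basement', 'Floor', 'Floor', 'Basement']
```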




juanitorduz commented on 2022-01-15T18:56:36Z
----------------------------------------------------------------

Makes sense! Solved in https://github.com//pull/440/commits/1cf6df3fa259dde97921045f9504fbeba89a22d7

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:02Z
----------------------------------------------------------------

"Estimate a single radon level for every floor level" would be clearer about the meaning of the predictor "x" I think


juanitorduz commented on 2022-01-16T20:02:37Z
----------------------------------------------------------------

Added!

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:03Z
----------------------------------------------------------------

If we decide to use the categorical version of floor, it would be good to remove the intercept from the formula so the first level of floor is not dropped.

If we keep the intercept, the first level is dropped to ensure full-rankness of the underlying design matrix*

It would be log_radon ~ 0 + floor

(*) We have priors, which are a kind of soft constraint, and the model would fit even with rank-deficient design matrices. But that can lead to other problems with the sampler. That's why Bambi sometimes drops columns for categorical variables (brms does the same thing)
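The full-rankness point can be checked directly with NumPy on a toy design matrix (illustrative data, not Bambi internals):

```python
import numpy as np

# Two-level factor dummy-coded alongside an intercept: the three columns
# are linearly dependent (basement + floor == intercept), so the design
# matrix is rank-deficient unless one level is dropped.
intercept = np.ones(4)
basement = np.array([1.0, 0.0, 0.0, 1.0])
floor = 1.0 - basement

X_full = np.column_stack([intercept, basement, floor])
X_dropped = np.column_stack([intercept, floor])  # reference level dropped

print(np.linalg.matrix_rank(X_full))     # 2 -> not full column rank (3 columns)
print(np.linalg.matrix_rank(X_dropped))  # 2 -> full column rank (2 columns)
```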


juanitorduz commented on 2022-01-15T18:57:50Z
----------------------------------------------------------------

This was added in https://github.com//pull/440/commits/1cf6df3fa259dde97921045f9504fbeba89a22d7

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:04Z
----------------------------------------------------------------

This is what I mean when I suggest using the categorical version of floor: https://imgur.com/a/KKQZ1eL

(I uploaded the image when I was using First floor instead of Floor)


review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:04Z
----------------------------------------------------------------

If you decide to use floor as categorical, this would be the updated code (I don't want to make you work a lot because of me! haha)

Note that the coordinate for floor is automatically set to floor_coord

pooled_results_posterior_stacked = pooled_results.posterior.stack(sample=("chain", "draw"))

fig, ax = plt.subplots()
sns.kdeplot(
    x=pooled_results_posterior_stacked["floor"].sel({"floor_coord": "Basement"}),
    fill=True,
    alpha=0.5,
    label="Basement",
    ax=ax,
)
sns.kdeplot(
    x=pooled_results_posterior_stacked["floor"].sel({"floor_coord": "Floor"}),
    fill=True,
    alpha=0.5,
    label="Floor",
    ax=ax,
)
ax.legend(loc="upper left")
ax.set(title="Posterior Density for Floor Levels", xlabel="log(radon + 0.1)", ylabel="Density");



juanitorduz commented on 2022-01-15T19:10:13Z
----------------------------------------------------------------

Thanks for adding it to the notebook!

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:05Z
----------------------------------------------------------------

The equivalent formula for the model in the PyMC version is log_radon ~ 0 + county:floor which means no intercept and interaction between county and floor.

If we do something like log_radon ~ 0 + county + county:floor you would see Bambi drops columns in the interaction county:floor. Again that's because it keeps full-rankness in the design matrix.

0 + county:floor and 0 + county + county:floor encode the same information. They just do it in a different way.
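That equivalence can be illustrated with a toy design matrix in NumPy (hypothetical data; the one-hot bookkeeping below only mimics what formula parsers do):

```python
import numpy as np

# Toy data covering every (county, floor) cell: 3 counties x 2 floor levels.
county = np.array([0, 0, 1, 1, 2, 2])
floor = np.array([0, 1, 0, 1, 0, 1])

C = np.eye(3)[county]  # one-hot county
F = np.eye(2)[floor]   # one-hot floor
# county:floor -> one column per (county, floor) cell
CF = np.einsum("ni,nj->nij", C, F).reshape(6, 6)

X1 = CF                                      # like 0 + county:floor
X2 = np.column_stack([C, CF[:, [1, 3, 5]]])  # like 0 + county + county:floor with columns dropped

# Equal ranks, and stacking the two adds no new directions:
# the column spaces coincide, so both encode the same information.
r1 = np.linalg.matrix_rank(X1)
r2 = np.linalg.matrix_rank(X2)
r12 = np.linalg.matrix_rank(np.column_stack([X1, X2]))
print(r1, r2, r12)  # 6 6 6
```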

This post from the Patsy documentation is very insightful. I confess I had to read it several times, in different days, before I started understanding in more depth :D


juanitorduz commented on 2022-01-15T19:01:14Z
----------------------------------------------------------------

Thanks for the input!

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:06Z
----------------------------------------------------------------

With the new formula we would have https://imgur.com/iRaqmKi which is more like what is in the PyMC version


review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:07Z
----------------------------------------------------------------

I also think you've found a bug, or something similar. I would have expected an error to be raised when you set the prior for 1|county to Prior(name="Normal", mu=0.0, sigma=10.0). Since it is a varying effect, the value of sigma should be a prior too (the hyperprior).

Here we should have something along these lines

partial_pooling_priors = {
    "Intercept": bmb.Prior(name="Normal", mu=0, sigma=10),
    "1|county": bmb.Prior(name="Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
    "sigma": bmb.Prior(name="Exponential", lam=1.0),
}

partial_pooling_model = bmb.Model(
    formula="log_radon ~ 1 + (1|county)",
    data=srrs_mn,
    priors=partial_pooling_priors,
)

If you want to be closer to the PyMC specification, you could pass noncentered=False to bmb.Model so that the parametrization matches the one used in the PyMC version.

If you're familiar with the PyMC way of using hierarchies, the Bambi way (borrowed from mixed effects models way) may be confusing in the beginning.

A good explanation is found in Chapter 16 from Bayes Rules book, specifically section 16.3.2



juanitorduz commented on 2022-01-15T19:06:32Z
----------------------------------------------------------------

I would actually suggest adding some of these comments (and the reference) to the notebook.

juanitorduz commented on 2022-01-15T19:07:44Z
----------------------------------------------------------------

For future reference, the bug was solved in https://github.com//pull/441

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:08Z
----------------------------------------------------------------

Line #7.        formula="log_radon ~ 0 + (0 + 1|county)",

The 0 in 0 + 1 |county is not having any effect. The 0 means "remove the intercept, if there's one". Since there's an implicit intercept, it is being removed. But then we have 1|county which means "add the intercept". So that term is equivalent to 1|county.

Feel free to ask if there's something you'd like to double check in the formula specification :)


review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:08Z
----------------------------------------------------------------

This is a similar issue to the one above

varying_intercept_priors = {
    "Intercept": bmb.Prior(name="Normal", mu=0, sigma=10),
    "floor": bmb.Prior(name="Normal", mu=0.0, sigma=10.0),
    "1|county": bmb.Prior(name="Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
    "sigma": bmb.Prior(name="Exponential", lam=1),
}

varying_intercept_model = bmb.Model(
    formula="log_radon ~ 0 + floor + (1|county)",
    data=srrs_mn,
    priors=varying_intercept_priors,
)

I think that updating the model solves your question about one-hot encoding for county?

Note: I removed the intercept so we have the two coefficients for floor.



juanitorduz commented on 2022-01-15T19:08:26Z
----------------------------------------------------------------

Now everything makes sense!

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:09Z
----------------------------------------------------------------

Here the model would be

varying_intercept_slope_priors = {
    "Intercept": bmb.Prior(name="Normal", mu=0, sigma=5),
    "1|county": bmb.Prior(name="Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
    "floor|county": bmb.Prior(name="Normal", mu=0.0, sigma=bmb.Prior("Exponential", lam=0.5)),
    "sigma": bmb.Prior(name="Exponential", lam=1.0),
}

varying_intercept_slope_model = bmb.Model(
    formula="log_radon ~ (floor|county)",
    data=srrs_mn,
    priors=varying_intercept_slope_priors,
)

but I've tried running it and it was taking hours, so I guessed there is a bug when using categorical predictors in varying effects

I did the following and it worked like a charm

srrs_mn2 = srrs_mn.copy()
srrs_mn2["floor2"] = np.where(srrs_mn2["floor"] == "Basement", 0, 1)

varying_intercept_slope_priors = {
    "Intercept": bmb.Prior(name="Normal", mu=0, sigma=2.5),
    "1|county": bmb.Prior(name="Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
    "floor2|county": bmb.Prior(name="Normal", mu=0.0, sigma=bmb.Prior("Exponential", lam=0.5)),
    "sigma": bmb.Prior(name="Exponential", lam=1.0),
}

varying_intercept_slope_model = bmb.Model(
    formula="log_radon ~ (floor2|county)",
    data=srrs_mn2,
    priors=varying_intercept_slope_priors,
)

So I'll dig into this issue to see what's wrong in the first model and try to fix it.



@tomicapretto (Collaborator)

@juanitorduz thanks for this massive contribution!

I've added several comments to the notebook. I think you've made a lot of progress!

As you may see in the comments, I've found two bugs associated with Bambi. One is related to hyperpriors and the other is associated with a varying effect when using a categorical predictor. I think I have to fix the second issue before we can move on with this example.

Again, thanks so much!

@tomicapretto (Collaborator)

tomicapretto commented Jan 14, 2022

I have found the problem. It's a shape problem in the underlying PyMC model. We're basically adding things with shape (n,) and (n, 1) and PyMC does not like that.

I'll create a new PR with the fix and make a new patch release. I have introduced this bug accidentally when modifying other underlying code I guess. I'll also add a test so this does not happen again 😅

For the record, this is what is happening in the underlying model

x = np.array([1, 2, 3])
y = np.array([[1, 2, 3]]).T
x + y
array([[2, 3, 4],
       [3, 4, 5],
       [4, 5, 6]])

when Bambi actually meant

array([[2, 4, 6]])
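The intended elementwise sum can be recovered by flattening the (n, 1) array first — a minimal sketch of the shape mismatch, not the actual fix that went into Bambi:

```python
import numpy as np

x = np.array([1, 2, 3])      # shape (n,)
y = np.array([[1, 2, 3]]).T  # shape (n, 1)

# (n,) + (n, 1) broadcasts to an (n, n) matrix...
assert (x + y).shape == (3, 3)

# ...while flattening y gives the elementwise sum that was intended.
print(x + y.ravel())  # [2 4 6]
```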

@tomicapretto (Collaborator)

The shape problem is solved in the new release Bambi 0.7.1.

I could push my changes to the notebook so you don't have to write them (if you want)

@juanitorduz (Contributor Author)

> The shape problem is solved in the new release Bambi 0.7.1.
> I could push my changes to the notebook so you don't have to write them (if you want)

Hey @tomicapretto, thank you very much for your feedback! Here are some comments regarding your points:

  1. It totally makes sense to encode floor as a ["Basement", "Floor"] categorical variable 👍.
  2. I actually did not know how to pass the hyperpriors to Bambi 😅! I did not know one could use Prior inside the priors dictionary in the Model object! I spent a lot of time trying to figure out how to add it somehow to the formula. Hence, in my opinion, this Radon example is going to be super useful for the documentation, especially for those coming from PyMC.

Feel free to push your changes to the notebook! Just push whatever you have ready and I will work on top of it the open comments 💪 !

About the bug you discovered (I was also a bit surprised I could pass the prior as I did without getting an error... sometimes things are too good to be true 😆): the fix was merged in #441, right? Which means I would need to re-install Bambi from the main branch, right?

@tomicapretto (Collaborator)

@juanitorduz

  1. Great!
  2. We prevent hyperpriors for common terms (you can't have a hyperprior there) but we missed the case where you don't pass a hyperprior for a term that actually needs it.

I've quickly made a new patch release to solve this issue because it may be affecting other people too 😨. I'm about to push my changes now.

codecov-commenter commented Jan 15, 2022

Codecov Report

Merging #440 (a6b80d9) into main (2846488) will decrease coverage by 0.11%.
The diff coverage is 58.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #440      +/-   ##
==========================================
- Coverage   89.38%   89.26%   -0.12%     
==========================================
  Files          31       31              
  Lines        2420     2431      +11     
==========================================
+ Hits         2163     2170       +7     
- Misses        257      261       +4     
Impacted Files Coverage Δ
bambi/families/link.py 65.00% <0.00%> (ø)
bambi/backend/pymc.py 79.80% <14.28%> (-1.57%) ⬇️
bambi/models.py 85.93% <66.66%> (+0.07%) ⬆️
bambi/backend/terms.py 96.27% <100.00%> (ø)
bambi/priors/scaler_default.py 95.38% <100.00%> (ø)
bambi/priors/scaler_mle.py 75.32% <100.00%> (ø)
bambi/tests/test_model_construction.py 100.00% <100.00%> (ø)
bambi/version.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor Author

Fixed in https://github.com//pull/440/commits/3c68d074c045ab63af7ded6bf3e73958ea0b630b


View entire conversation on ReviewNB


Contributor Author

Oops! I think I have lost another comment about the model specification. Here it is for future reference:

Here you mention that you are removing the intercept and then using one coefficient per floor level. But when you describe the mathematical model, you call the coefficient beta the "Intercept for the Floor level j". The pattern repeats for the other models, so it would be better to clarify the nomenclature.

View entire conversation on ReviewNB

Collaborator

I agree that saying we don't have an Intercept and then saying "intercept for the floor" may be confusing (even though we can consider those coefficients to be intercepts). Let's use "coefficient for the Floor level j". I'm fixing that now.


View entire conversation on ReviewNB

@tomicapretto (Collaborator)

tomicapretto commented Feb 5, 2022

@juanitorduz it's sad you caught it too but I'm glad you're feeling better :D

I added several changes, with the aim to simplify the models, but still be correct in what we're doing.

I compared our Bambi models with the PyMC models and with the models in the original example in the book by Gelman and Hill. There are so many ways of writing the same underlying model, and so many terms that mean the same or something very similar, that it's been very painful.

On top of that, I realized this particular example brings a lot of confusion. I'll try to explain in the following points:

Intercepts and slopes

Both the PyMC example and the original example in the book talk about the intercept and the floor slope. I think they do this because floor is a numerical 0-1 variable.

For example, Gelman and Hill use y ~ 1 + floor and say that 1 is the intercept and floor is the slope. This forces Bambi (and what they do in R too) to drop one level in the floor variable to prevent linear dependencies in the design matrix and then you are left with one coefficient for floor (instead of 2, the number of levels).

What I find confusing is that they talk about the "intercept" and the "slope" as if they were unrelated things. Here, 1 is not like any other intercept in a regression model. In this case, it has a particular meaning. It represents the mean for the reference floor level, and floor represents a shift from that reference level. So if the model has y ~ 1 + floor, floor is a difference between intercepts instead of a slope.

Consequently, I think it is better to omit the global intercept when we use floor in the model so all its levels are preserved. In other words, I mean to use y ~ 0 + floor. In this case, since there's no intercept, Bambi (and any other tool that works with model formulas) won't drop the reference level in floor and it will use 2 coefficients. One coefficient represents the mean for "basement" and the other represents the mean for "floor".
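A quick way to see why the cell-means coding gives one coefficient per level is a toy ordinary least-squares fit (illustrative numbers, and a plain OLS stand-in rather than the Bayesian model):

```python
import numpy as np

# Toy log-radon values: three basement and two first-floor measurements.
y = np.array([1.0, 1.2, 0.8, 2.0, 2.2])
is_basement = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

# Cell-means design, as in y ~ 0 + floor: one indicator column per level.
X = np.column_stack([is_basement, 1.0 - is_basement])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # [1.0, 2.1] -> the mean of each floor level
```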

County-level predictors

When the uranium level is added, both examples say that a county-level predictor is added. However this does not mean that there's a county-specific slope for uranium (i.e. a hierarchical effect). What they mean is that there is a unique uranium measurement for each county, but the variable enters as a regular slope in the model.

Trying to mimic PyMC parametrizations

I think we should not try to replicate the same parameterizations that are used in the PyMC example. Two main reasons come to my mind:

  • It uses the terms "intercepts" and "slopes" when I think it's not a good idea for what I said above. They are the coefficients for the two levels of the same variable (floor), not unrelated things.
  • What's natural to write in PyMC may not be natural in Bambi and forcing a similar parametrization adds more confusion. In particular, it's natural to write hierarchical models in PyMC in a hierarchical manner (with hyperpriors). On the other hand, Bambi uses the mixed model notation where the group-specific component is a deviation from a common component. The sum of these two components is the hierarchical coefficient in the PyMC model.
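The last point can be sketched numerically (made-up numbers): the PyMC-style hierarchical coefficient equals the Bambi/mixed-model common part plus the group-specific deviation.

```python
import numpy as np

# Mixed-model bookkeeping: a common effect plus per-county deviations.
common_floor = 0.5
county_deviation = np.array([-0.2, 0.0, 0.3])

# The PyMC-style county-specific coefficient is their sum.
county_coef = common_floor + county_deviation
print(county_coef)  # [0.3 0.5 0.8]
```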

I'll try to have a second read of this notebook soon. I think I've left many "intercepts" written there that I could express in a different way.

CC @aloctavodia feel free to add your opinion here.

@juanitorduz (Contributor Author)

Hey @tomicapretto , thank you very much for your input! Here are some comments:

  • First, I agree with you that these equivalent parametrisations and notations are a bit confusing for the user. In my experience, having some experience in pure PyMC and not much in the formula-like specification, it was not straightforward. Still, I believe your comments above are very valuable, and I think it could be nice to add them to this notebook (maybe as a final remark?); however, I do not have a strong opinion about this.
  • I agree that using y ~ 0 + floor to get parameters for each floor level is the way to go, especially thinking about the priors: if we dropped one of the floor levels, then the priors for the intercept and for the difference would not be that clear to interpret (this is also the recommendation from Statistical Rethinking).
  • Again, because of all these discussions, this example is already paying off and I am sure it is going to help Bambi users.

Is there anything I could support you with? Or maybe I wait for your second read and feedback from others?

@tomicapretto (Collaborator)

tomicapretto commented Feb 10, 2022

According to what we've discussed yesterday, these are the points we need to complete to consider this finished

  • Remove the "i" index from models and coefficients.
  • Remove redundant forest plots.
  • Make sure titles match the ones in the original PyMC post.
  • Add a sentence explaining why we use log(radon) and not just radon.
  • Check the correctness of the "intercept" word everywhere it is used.
  • Have an example where we build the model, plot the graph, and then fit. This is to show that fitting is not necessary to see the model graph.

We're getting closer to wrap this up!! 😄 🚀

@tomicapretto (Collaborator)

I think this is done if you want to have a last review @juanitorduz @aloctavodia

review-notebook-app bot commented Feb 13, 2022

View / edit / reply to this conversation on ReviewNB

juanitorduz commented on 2022-02-13T07:17:51Z
----------------------------------------------------------------

A small notation detail: above we used Basement and Floor to refer to the floor levels, and here we are using "basement" and "floor". I can fix it if you want. Any preference on the label notation :)?


juanitorduz commented on 2022-02-13T07:28:17Z
----------------------------------------------------------------

fixed

review-notebook-app bot commented Feb 13, 2022

View / edit / reply to this conversation on ReviewNB

juanitorduz commented on 2022-02-13T07:17:51Z
----------------------------------------------------------------

There is a typo here in y = ... cadon


juanitorduz commented on 2022-02-13T07:28:01Z
----------------------------------------------------------------

fixed

review-notebook-app bot commented Feb 13, 2022

View / edit / reply to this conversation on ReviewNB

juanitorduz commented on 2022-02-13T07:17:52Z
----------------------------------------------------------------

y = ... is missing (log)


juanitorduz commented on 2022-02-13T07:28:44Z
----------------------------------------------------------------

fixed

review-notebook-app bot commented Feb 13, 2022

View / edit / reply to this conversation on ReviewNB

juanitorduz commented on 2022-02-13T07:17:53Z
----------------------------------------------------------------

here as well (log) is missing


juanitorduz commented on 2022-02-13T07:28:58Z
----------------------------------------------------------------

fixed


@juanitorduz (Contributor Author)

Great work @tomicapretto! The change in the notation to make it more lightweight really improves readability IMO. Also, the explanation of how to build the model is quite handy :)

In 497b0f1 and a6b80d9 I fixed some small typos and harmonised some formulas (lower/upper case). From my side it is ready to go! We can always get feedback and keep improving this valuable notebook 🚀

@aloctavodia aloctavodia merged commit 57ab32e into bambinos:main Feb 14, 2022