
Radon Example Implementation #440

Merged
merged 31 commits into bambinos:main on Feb 14, 2022

Conversation

juanitorduz (Contributor)

@juanitorduz juanitorduz commented Jan 14, 2022

This PR aims to solve #439.

I have pushed the first version of the models. Here are the formulas I used:

  • pooled_model: log_radon ~ floor
  • unpooled_model: log_radon ~ 0 + county + county:floor
  • partial_pooling_model: log_radon ~ 0 + (0 + 1|county)
  • varying_intercept_model: log_radon ~ 0 + (1|county) + floor # ! How to do complete one-hot-encoding for county?
  • varying_intercept_slope_model: log_radon ~ 0 + (floor|county) # ! How to do complete one-hot-encoding for county?

I would like to have all the counties encoded (not dropping the first one), and I am not auto-scaling the data, so that the results are comparable with the original PyMC example. Does this look good? I am still getting used to the formula-like notation for hierarchical models (which is why I am working on this PR 😅).

After double-checking the model specification I will continue to add text and some additional plots.

@review-notebook-app
Check out this pull request on ReviewNB

@juanitorduz juanitorduz marked this pull request as draft January 14, 2022 14:21
review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:01Z
----------------------------------------------------------------

We could convert the variable "floor" to categorical in pandas. This would help users see the advantage of Bambi when dealing with categorical predictors.

Right now, "floor" is 0 and 1, which is a dummy-coded categorical variable. But if we map 0 -> Basement and 1 -> First floor, we can take advantage of automatic dimension labelling.

This is what I mean

srrs_mn["floor"] = srrs_mn["floor"].map({0: "Basement", 1: "Floor"})
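For anyone following along, the suggested recoding can be sketched on a toy frame like this (the toy data is illustrative, not the actual `srrs_mn` contents):

```python
import pandas as pd

# Toy stand-in for the radon data; the real srrs_mn frame has many more columns.
df = pd.DataFrame({"floor": [0, 1, 1, 0]})

# Map the 0/1 dummy coding to descriptive labels so Bambi can
# label the posterior dimensions automatically.
df["floor"] = df["floor"].map({0: "Basement", 1: "Floor"})

print(df["floor"].tolist())  # ['Basement', 'Floor', 'Floor', 'Basement']
```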




juanitorduz commented on 2022-01-15T18:56:36Z
----------------------------------------------------------------

Makes sense! Solved in https://github.com//pull/440/commits/1cf6df3fa259dde97921045f9504fbeba89a22d7

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:02Z
----------------------------------------------------------------

"Estimate a single radon level for every floor level" would be clearer about the meaning of the predictor "x" I think


juanitorduz commented on 2022-01-16T20:02:37Z
----------------------------------------------------------------

Added!

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:03Z
----------------------------------------------------------------

If we decide to use the categorical version of floor, it would be good to remove the intercept from the formula so the first level of floor is not dropped.

If we keep the intercept, the first level is dropped to ensure full-rankness of the underlying design matrix*

It would be log_radon ~ 0 + floor

(*) We have priors, which are a kind of soft constraint, and the model would fit even with rank-deficient design matrices. But that can lead to other problems with the sampler. That's why Bambi sometimes drops columns for categorical variables (brms does the same thing)
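The full-rankness point can be checked directly with NumPy on a toy design matrix (illustrative data, not Bambi internals):

```python
import numpy as np

# Two-level factor dummy-coded alongside an intercept: the three columns
# are linearly dependent (basement + floor == intercept), so the design
# matrix is rank-deficient unless one level is dropped.
intercept = np.ones(4)
basement = np.array([1.0, 0.0, 0.0, 1.0])
floor = 1.0 - basement

X_full = np.column_stack([intercept, basement, floor])
X_dropped = np.column_stack([intercept, floor])  # reference level dropped

print(np.linalg.matrix_rank(X_full))     # 2 -> not full column rank (3 columns)
print(np.linalg.matrix_rank(X_dropped))  # 2 -> full column rank (2 columns)
```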


juanitorduz commented on 2022-01-15T18:57:50Z
----------------------------------------------------------------

This was added in https://github.com//pull/440/commits/1cf6df3fa259dde97921045f9504fbeba89a22d7

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:04Z
----------------------------------------------------------------

This is what I mean when I suggest using the categorical version of floor: https://imgur.com/a/KKQZ1eL

(I uploaded the image when I was using First floor instead of Floor)


review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:04Z
----------------------------------------------------------------

If you decide to use floor as categorical, this would be the updated code (I don't want to make you work a lot because of me! haha)

Note that the coordinate for floor is automatically set to floor_coord

pooled_results_posterior_stacked = pooled_results.posterior.stack(sample=("chain", "draw"))

fig, ax = plt.subplots()
sns.kdeplot(
    x=pooled_results_posterior_stacked["floor"].sel({"floor_coord": "Basement"}),
    fill=True,
    alpha=0.5,
    label="Basement",
    ax=ax,
)
sns.kdeplot(
    x=pooled_results_posterior_stacked["floor"].sel({"floor_coord": "Floor"}),
    fill=True,
    alpha=0.5,
    label="Floor",
    ax=ax,
)
ax.legend(loc="upper left")
ax.set(title="Posterior Density for Floor Levels", xlabel="log(radon + 0.1)", ylabel="Density");



juanitorduz commented on 2022-01-15T19:10:13Z
----------------------------------------------------------------

Thanks for adding it to the notebook!

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:05Z
----------------------------------------------------------------

The equivalent formula for the model in the PyMC version is log_radon ~ 0 + county:floor which means no intercept and interaction between county and floor.

If we do something like log_radon ~ 0 + county + county:floor you would see Bambi drops columns in the interaction county:floor. Again that's because it keeps full-rankness in the design matrix.

0 + county:floor and 0 + county + county:floor encode the same information. They just do it in a different way.
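That equivalence can be illustrated with a toy design matrix in NumPy (hypothetical data; the one-hot bookkeeping below only mimics what formula parsers do):

```python
import numpy as np

# Toy data covering every (county, floor) cell: 3 counties x 2 floor levels.
county = np.array([0, 0, 1, 1, 2, 2])
floor = np.array([0, 1, 0, 1, 0, 1])

C = np.eye(3)[county]  # one-hot county
F = np.eye(2)[floor]   # one-hot floor
# county:floor -> one column per (county, floor) cell
CF = np.einsum("ni,nj->nij", C, F).reshape(6, 6)

X1 = CF                                      # like 0 + county:floor
X2 = np.column_stack([C, CF[:, [1, 3, 5]]])  # like 0 + county + county:floor with columns dropped

# Equal ranks, and stacking the two adds no new directions:
# the column spaces coincide, so both encode the same information.
r1 = np.linalg.matrix_rank(X1)
r2 = np.linalg.matrix_rank(X2)
r12 = np.linalg.matrix_rank(np.column_stack([X1, X2]))
print(r1, r2, r12)  # 6 6 6
```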

This post from the Patsy documentation is very insightful. I confess I had to read it several times, in different days, before I started understanding in more depth :D


juanitorduz commented on 2022-01-15T19:01:14Z
----------------------------------------------------------------

Thanks for the input!

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:06Z
----------------------------------------------------------------

With the new formula we would have https://imgur.com/iRaqmKi which is more like what is in the PyMC version


review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:07Z
----------------------------------------------------------------

I also think you've found a bug, or something similar. I would have expected an error to be raised when you set the prior for 1|county to Prior(name="Normal", mu=0.0, sigma=10.0). Since it is a varying effect, the value of sigma should be a prior too (the hyperprior).

Here we should have something along these lines

partial_pooling_priors = {
    "Intercept": bmb.Prior(name="Normal", mu=0, sigma=10),
    "1|county": bmb.Prior(name="Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
    "sigma": bmb.Prior(name="Exponential", lam=1.0),
}

partial_pooling_model = bmb.Model(
    formula="log_radon ~ 1 + (1|county)",
    data=srrs_mn,
    priors=partial_pooling_priors,
)

If you want to be closer to the PyMC specification, you could pass noncentered=False to bmb.Model so that the parametrization matches the one used in the PyMC version.

If you're familiar with the PyMC way of using hierarchies, the Bambi way (borrowed from mixed effects models way) may be confusing in the beginning.

A good explanation is found in Chapter 16 from Bayes Rules book, specifically section 16.3.2



juanitorduz commented on 2022-01-15T19:06:32Z
----------------------------------------------------------------

I would actually suggest adding some of these comments (and the reference) to the notebook.

juanitorduz commented on 2022-01-15T19:07:44Z
----------------------------------------------------------------

For future reference, the bug was solved in https://github.com//pull/441

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:08Z
----------------------------------------------------------------

Line #7.        formula="log_radon ~ 0 + (0 + 1|county)",

The 0 in 0 + 1 |county is not having any effect. The 0 means "remove the intercept, if there's one". Since there's an implicit intercept, it is being removed. But then we have 1|county which means "add the intercept". So that term is equivalent to 1|county.

Feel free to ask if there's something you'd like to double check in the formula specification :)


review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:08Z
----------------------------------------------------------------

This is a similar issue to the one above

varying_intercept_priors = {
    "Intercept": bmb.Prior(name="Normal", mu=0, sigma=10),
    "floor": bmb.Prior(name="Normal", mu=0.0, sigma=10.0),
    "1|county": bmb.Prior(name="Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
    "sigma": bmb.Prior(name="Exponential", lam=1),
}

varying_intercept_model = bmb.Model(
    formula="log_radon ~ 0 + floor + (1|county)",
    data=srrs_mn,
    priors=varying_intercept_priors,
)

I think that updating the model solves your question about one-hot encoding for county?

Note: I removed the intercept so we have the two coefficients for floor.



juanitorduz commented on 2022-01-15T19:08:26Z
----------------------------------------------------------------

Now everything makes sense!

review-notebook-app bot commented Jan 14, 2022

View / edit / reply to this conversation on ReviewNB

tomicapretto commented on 2022-01-14T22:38:09Z
----------------------------------------------------------------

Here the model would be

varying_intercept_slope_priors = {
    "Intercept": bmb.Prior(name="Normal", mu=0, sigma=5),
    "1|county": bmb.Prior(name="Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
    "floor|county": bmb.Prior(name="Normal", mu=0.0, sigma=bmb.Prior("Exponential", lam=0.5)),
    "sigma": bmb.Prior(name="Exponential", lam=1.0),
}

varying_intercept_slope_model = bmb.Model(
    formula="log_radon ~ (floor|county)",
    data=srrs_mn,
    priors=varying_intercept_slope_priors,
)

but I've tried running it and it was taking hours, so I guessed there is a bug when using categorical predictors in varying effects

I did the following and it worked like a charm

srrs_mn2 = srrs_mn.copy()
srrs_mn2["floor2"] = np.where(srrs_mn2["floor"] == "Basement", 0, 1)

varying_intercept_slope_priors = {
    "Intercept": bmb.Prior(name="Normal", mu=0, sigma=2.5),
    "1|county": bmb.Prior(name="Normal", mu=0, sigma=bmb.Prior("Exponential", lam=1)),
    "floor2|county": bmb.Prior(name="Normal", mu=0.0, sigma=bmb.Prior("Exponential", lam=0.5)),
    "sigma": bmb.Prior(name="Exponential", lam=1.0),
}

varying_intercept_slope_model = bmb.Model(
    formula="log_radon ~ (floor2|county)",
    data=srrs_mn2,
    priors=varying_intercept_slope_priors,
)

So I'll dig into this issue to see what's wrong in the first model and try to fix it.



@tomicapretto (Collaborator)

@juanitorduz thanks for this massive contribution!

I've added several comments to the notebook. I think you've made a lot of progress!

As you may see in the comments, I've found two bugs associated with Bambi. One is related to hyperpriors and the other is associated with a varying effect when using a categorical predictor. I think I have to fix the second issue before we can move on with this example.

Again, thanks so much!

@tomicapretto (Collaborator)

tomicapretto commented Jan 14, 2022

I have found the problem. It's a shape problem in the underlying PyMC model. We're basically adding things with shape (n,) and (n, 1) and PyMC does not like that.

I'll create a new PR with the fix and make a new patch release. I have introduced this bug accidentally when modifying other underlying code I guess. I'll also add a test so this does not happen again 😅

For the record, this is what is happening in the underlying model

x = np.array([1, 2, 3])
y = np.array([[1, 2, 3]]).T
x + y
array([[2, 3, 4],
       [3, 4, 5],
       [4, 5, 6]])

when Bambi actually meant

array([[2, 4, 6]])
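The intended elementwise sum can be recovered by flattening the (n, 1) array first — a minimal sketch of the shape mismatch, not the actual fix that went into Bambi:

```python
import numpy as np

x = np.array([1, 2, 3])      # shape (n,)
y = np.array([[1, 2, 3]]).T  # shape (n, 1)

# (n,) + (n, 1) broadcasts to an (n, n) matrix...
assert (x + y).shape == (3, 3)

# ...while flattening y gives the elementwise sum that was intended.
print(x + y.ravel())  # [2 4 6]
```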

@tomicapretto (Collaborator)

The shape problem is solved in the new release Bambi 0.7.1.

I could push my changes to the notebook so you don't have to write them (if you want)

@juanitorduz (Contributor Author)

> The shape problem is solved in the new release Bambi 0.7.1.
> I could push my changes to the notebook so you don't have to write them (if you want)

Hey @tomicapretto, thank you very much for your feedback! Here are some comments regarding your points:

  1. It totally makes sense to encode floor as a ["Basement", "Floor"] categorical variable 👍.
  2. I actually did not know how to pass the hyperpriors to Bambi 😅! I did not know one could use Prior inside the priors dictionary in the Model object! I spent a lot of time trying to figure out how to add it somehow to the formula. Hence, in my opinion, this Radon example is going to be super useful for the documentation, especially for those coming from PyMC.

Feel free to push your changes to the notebook! Just push whatever you have ready and I will work on top of it the open comments 💪 !

About the bug you discovered (I was also a bit surprised I could pass the prior as I did without getting an error... sometimes things are too good to be true 😆): the fix was merged in #441, right? Which means I would need to re-install Bambi from the main branch, right?

@tomicapretto (Collaborator)

@juanitorduz

  1. Great!
  2. We prevent hyperpriors for common terms (you can't have a hyperprior there) but we missed the case where you don't pass a hyperprior for a term that actually needs it.

I've quickly made a new patch release to solve this issue because it may be affecting other people too 😨. I'm about to push my changes now.

codecov-commenter commented Jan 15, 2022

Codecov Report

Merging #440 (a6b80d9) into main (2846488) will decrease coverage by 0.11%.
The diff coverage is 58.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #440      +/-   ##
==========================================
- Coverage   89.38%   89.26%   -0.12%     
==========================================
  Files          31       31              
  Lines        2420     2431      +11     
==========================================
+ Hits         2163     2170       +7     
- Misses        257      261       +4     
Impacted Files Coverage Δ
bambi/families/link.py 65.00% <0.00%> (ø)
bambi/backend/pymc.py 79.80% <14.28%> (-1.57%) ⬇️
bambi/models.py 85.93% <66.66%> (+0.07%) ⬆️
bambi/backend/terms.py 96.27% <100.00%> (ø)
bambi/priors/scaler_default.py 95.38% <100.00%> (ø)
bambi/priors/scaler_mle.py 75.32% <100.00%> (ø)
bambi/tests/test_model_construction.py 100.00% <100.00%> (ø)
bambi/version.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor Author

Fixed in https://github.com//pull/440/commits/3c68d074c045ab63af7ded6bf3e73958ea0b630b


View entire conversation on ReviewNB


Contributor Author

Oops! I think I have lost another comment about the model specification. Here it is for future reference:

Here you mention that you are removing the intercept and then using one coefficient per floor level. But when you describe the mathematical model, you call the coefficient beta the "Intercept for the Floor level j". The pattern repeats for the other models, so it would be better to clarify the nomenclature.

View entire conversation on ReviewNB

Collaborator

I agree that saying we don't have an Intercept and then saying "intercept for the floor" may be confusing (even though we can consider those coefficients to be intercepts). Let's use "coefficient for the Floor level j". I'm fixing that now.


View entire conversation on ReviewNB

@tomicapretto (Collaborator)

tomicapretto commented Feb 5, 2022

@juanitorduz it's sad you caught it too but I'm glad you're feeling better :D

I added several changes, with the aim to simplify the models, but still be correct in what we're doing.

I compared our Bambi models with the PyMC models and with the models in the original example in the book by Gelman and Hill. There are so many ways of writing the same underlying model, and so many terms that mean the same or something very similar, that it's been very painful.

On top of that, I realized this particular example brings a lot of confusion. I'll try to explain in the following points:

Intercepts and slopes

Both the PyMC example and the original example in the book talk about the intercept and the floor slope. I think they do this because floor is a numerical 0-1 variable.

For example, Gelman and Hill use y ~ 1 + floor and say that 1 is the intercept and floor is the slope. This forces Bambi (and what they do in R too) to drop one level in the floor variable to prevent linear dependencies in the design matrix and then you are left with one coefficient for floor (instead of 2, the number of levels).

What I find confusing is that they talk about the "intercept" and the "slope" as if they were unrelated things. Here, 1 is not like any other intercept in a regression model. In this case, it has a particular meaning. It represents the mean for the reference floor level, and floor represents a shift from that reference level. So if the model has y ~ 1 + floor, floor is a difference between intercepts instead of a slope.

Consequently, I think it is better to omit the global intercept when we use floor in the model so all its levels are preserved. In other words, I mean to use y ~ 0 + floor. In this case, since there's no intercept, Bambi (and any other tool that works with model formulas) won't drop the reference level in floor and it will use 2 coefficients. One coefficient represents the mean for "basement" and the other represents the mean for "floor".
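A quick way to see why the cell-means coding gives one coefficient per level is a toy ordinary least-squares fit (illustrative numbers, and a plain OLS stand-in rather than the Bayesian model):

```python
import numpy as np

# Toy log-radon values: three basement and two first-floor measurements.
y = np.array([1.0, 1.2, 0.8, 2.0, 2.2])
is_basement = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

# Cell-means design, as in y ~ 0 + floor: one indicator column per level.
X = np.column_stack([is_basement, 1.0 - is_basement])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # [1.0, 2.1] -> the mean of each floor level
```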

County-level predictors

When the uranium level is added, both examples say that a county-level predictor is added. However this does not mean that there's a county-specific slope for uranium (i.e. a hierarchical effect). What they mean is that there is a unique uranium measurement for each county, but the variable enters as a regular slope in the model.

Trying to mimic PyMC parametrizations

I think we should not try to replicate the same parameterizations that are used in the PyMC example. Two main reasons come to my mind:

  • It uses the terms "intercepts" and "slopes" when I think it's not a good idea for what I said above. They are the coefficients for the two levels of the same variable (floor), not unrelated things.
  • What's natural to write in PyMC may not be natural in Bambi and forcing a similar parametrization adds more confusion. In particular, it's natural to write hierarchical models in PyMC in a hierarchical manner (with hyperpriors). On the other hand, Bambi uses the mixed model notation where the group-specific component is a deviation from a common component. The sum of these two components is the hierarchical coefficient in the PyMC model.
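The last point can be sketched numerically (made-up numbers): the PyMC-style hierarchical coefficient equals the Bambi/mixed-model common part plus the group-specific deviation.

```python
import numpy as np

# Mixed-model bookkeeping: a common effect plus per-county deviations.
common_floor = 0.5
county_deviation = np.array([-0.2, 0.0, 0.3])

# The PyMC-style county-specific coefficient is their sum.
county_coef = common_floor + county_deviation
print(county_coef)  # [0.3 0.5 0.8]
```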

I'll try to have a second read of this notebook soon. I think I've left many "intercepts" written there that I could express in a different way.

CC @aloctavodia feel free to add your opinion here.

@juanitorduz (Contributor Author)

Hey @tomicapretto , thank you very much for your input! Here are some comments:

  • First, I agree with you that these equivalent parametrisations and notations are a bit confusing for the user. In my experience, having some experience in pure PyMC and not much in the formula-like specification, it was not straightforward. Still, I believe your comments above are very valuable, and I think it could be nice to add them to this notebook (maybe as a final remark?); however, I do not have a strong opinion about this.
  • I agree that using y ~ 0 + floor to get parameters for each floor level is the way to go, especially thinking about the priors: if we dropped one of the floor levels, then the priors for the intercept and for the difference would not be that clear to interpret (this is also the recommendation from Statistical Rethinking).
  • Again, because of all these discussions, this example is already paying off and I am sure it is going to help Bambi users.

Is there anything I could support you with? Or maybe I wait for your second read and feedback from others?

@tomicapretto (Collaborator)

tomicapretto commented Feb 10, 2022

According to what we've discussed yesterday, these are the points we need to complete to consider this finished

  • Remove the "i" index from models and coefficients.
  • Remove redundant forest plots.
  • Make sure titles match the ones in the original PyMC post.
  • Add a sentence explaining why we use log(radon) and not just radon.
  • Check the correctness of the "intercept" word everywhere it is used.
  • Have an example where we build the model, plot the graph, and then fit. This is to show that fitting is not necessary to see the model graph.

We're getting closer to wrap this up!! 😄 🚀

@tomicapretto (Collaborator)

I think this is done if you want to have a last review @juanitorduz @aloctavodia

review-notebook-app bot commented Feb 13, 2022

View / edit / reply to this conversation on ReviewNB

juanitorduz commented on 2022-02-13T07:17:51Z
----------------------------------------------------------------

A small notation detail: above we used Basement and Floor to refer to the floor levels, and here we are using "basement" and "floor". I can fix it if you want. Any preference on the label notation :)?


juanitorduz commented on 2022-02-13T07:28:17Z
----------------------------------------------------------------

fixed

review-notebook-app bot commented Feb 13, 2022

View / edit / reply to this conversation on ReviewNB

juanitorduz commented on 2022-02-13T07:17:51Z
----------------------------------------------------------------

There is a typo here in y = ... cadon


juanitorduz commented on 2022-02-13T07:28:01Z
----------------------------------------------------------------

fixed

review-notebook-app bot commented Feb 13, 2022

View / edit / reply to this conversation on ReviewNB

juanitorduz commented on 2022-02-13T07:17:52Z
----------------------------------------------------------------

y = ... is missing (log)


juanitorduz commented on 2022-02-13T07:28:44Z
----------------------------------------------------------------

fixed

review-notebook-app bot commented Feb 13, 2022

View / edit / reply to this conversation on ReviewNB

juanitorduz commented on 2022-02-13T07:17:53Z
----------------------------------------------------------------

here as well (log) is missing


juanitorduz commented on 2022-02-13T07:28:58Z
----------------------------------------------------------------

fixed


@juanitorduz (Contributor Author)

Great work @tomicapretto! The change in the notation to make it more lightweight really improves readability IMO. Also, the explanation of how to build the model is quite handy :)

In 497b0f1 and a6b80d9 I fixed some small typos and harmonised some formulas (lower/upper case). From my side it is ready to go! We can always get feedback and keep improving this valuable notebook 🚀

@aloctavodia aloctavodia merged commit 57ab32e into bambinos:main Feb 14, 2022