Statistical Rethinking #216
Chapter 2
Neat counting example in 2.1! I tried to write up
In note 41, from page 24, McElreath advocates Cox-style probability. I also wrote up (still years ago) a cute example trying to explain
I kind of miss seeing "evidence" in Bayes' rule... Maybe
I like this,
(The P's everywhere would kind of obfuscate the nice counting
Then, note that
Ah, here on page 37 is his version:
Also nice! He reminds me that I'm using "evidence" (above) in a
There's also the way of doing it that's more like this:
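Something like the odds form of Bayes' rule (my reconstruction), with the likelihood ratio pulled out front:

$$
\frac{P(H \mid D)}{P(\lnot H \mid D)} = \frac{P(D \mid H)}{P(D \mid \lnot H)} \times \frac{P(H)}{P(\lnot H)}
$$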
And then we can talk about the first term as the likelihood ratio,
Likelihood ratio is a nice thing to think about, especially in
There's also a nice connection to the error mode of getting the
I don't really like "average probability of the data" as a term, I
On page 39, he doesn't include Hamiltonian Monte Carlo as one of the
Oh no! Very ugly page break from 42 to 43, with the header of a table
Interesting; really not explaining what's going on with
Wow! I do not understand how this Metropolis algorithm on page 45
Ooh fun, some people have problem solutions online... Here's one:
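Coming back to that Metropolis algorithm on page 45: here's a minimal sketch of a generic Metropolis sampler aimed at the globe-tossing posterior (my own reconstruction with a made-up proposal width, not the book's code), mostly to convince myself the samples land in the right place:

import math
import random

def metropolis(log_post, start, n_samples=10_000, step=0.1):
    # Generic Metropolis: symmetric uniform proposal, accept with probability
    # min(1, posterior ratio), otherwise keep the current value
    current = start
    samples = []
    for _ in range(n_samples):
        proposal = current + random.uniform(-step, step)
        log_accept = min(0.0, log_post(proposal) - log_post(current))
        if random.random() < math.exp(log_accept):
            current = proposal
        samples.append(current)
    return samples

# Unnormalized log-posterior for the globe toss: 6 waters in 9 tosses, flat prior
def globe_log_post(p):
    if p <= 0 or p >= 1:
        return -math.inf
    return 6 * math.log(p) + 3 * math.log(1 - p)

draws = metropolis(globe_log_post, start=0.5)
print(sum(draws) / len(draws))  # roughly 0.64, near the analytic posterior mean 7/11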
Chapter 3
I'm reading Ellenberg's How Not to Be Wrong, and he says on page 49:
I have that feeling in connection with Gelman and Pearl (not sure they
Also Ellenberg:
I think he's referring to multi-level modeling, in the Gelman
This common medical testing scenario appeared in a recent
I solved it by seeing that 0.01 * 0.99 == 0.99 * 0.01, which is sort
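Spelling it out, with the numbers that equality suggests (1% prevalence and a test that's right 99% of the time either way; my assumption about the scenario):

# Hypothetical numbers implied by 0.01 * 0.99 == 0.99 * 0.01
p_disease = 0.01            # prevalence
p_pos_given_disease = 0.99  # sensitivity
p_pos_given_healthy = 0.01  # false positive rate

p_pos = p_disease * p_pos_given_disease + (1 - p_disease) * p_pos_given_healthy
print(p_disease * p_pos_given_disease / p_pos)  # 0.5: the two joint probabilities are equal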
The "Why statistics can't save bad science" box on page 51 is neat. Just to establish equivalence between R and Python... dbinom(6, size=9, prob=0.5)
## [1] 0.1640625
import scipy.stats
scipy.stats.binom(n=9, p=0.5).pmf(6)
## 0.16406250000000006
Interesting: using "compatibility interval" rather than "credible interval".
Hmm; his HPDI (Highest Posterior Density Interval) implementation
How hard is this really to implement? If you have a histogram or just samples... (a sketch follows the birth data below)
birth1 <- c(1,0,0,0,1,1,0,1,0,1,0,0,1,1,0,1,1,0,0,0,1,0,0,0,1,0,
0,0,0,1,1,1,0,1,0,1,1,1,0,1,0,1,1,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,
1,1,0,1,0,0,1,0,0,0,1,0,0,1,1,1,1,0,1,0,1,1,1,1,1,0,0,1,0,1,1,0,
1,0,1,1,1,0,1,1,1,1)
birth2 <- c(0,1,0,1,0,1,1,1,0,0,1,1,1,1,1,0,0,1,1,1,0,0,1,1,1,0,
1,1,1,0,1,1,1,0,1,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,0,1,1,0,1,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,1,1,
0,0,0,1,1,1,0,0,0,0)
table(birth1, birth2)
## birth2
## birth1 0 1
## 0 10 39
## 1 30 21
So really, what is up with this data?
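On the HPDI question above: with raw samples it seems easy enough. A sketch (sort, slide a window covering the desired mass, keep the narrowest one), which I assume is roughly what rethinking::HPDI does under the hood:

import random

def hpdi(samples, prob=0.89):
    # Narrowest contiguous interval containing `prob` of the samples
    s = sorted(samples)
    window = max(1, int(round(prob * len(s))))
    # Every window of that size covers `prob` of the mass; keep the narrowest
    width, i = min((s[j + window - 1] - s[j], j) for j in range(len(s) - window + 1))
    return s[i], s[i + window - 1]

draws = [random.betavariate(7, 4) for _ in range(10_000)]
print(hpdi(draws, prob=0.89))  # the narrowest 89% interval for these draws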
Chapter 4
Frank's The common patterns of nature seems pretty neat, getting
On page 76 he shows "precision" as τ, meaning 1/σ^2, and it shows up
"procrustean" (on page 77): "(especially of a framework or system)
I like the spark histograms! Oh, neat, they're even called "histosparks"... And I might have guessed... they're from Hadley.
So there are unicode characters that do blocks of various sizes, by
sparks <- c("\u2581", "\u2582", "\u2583", "\u2585", "\u2587")
# 1/8 2/8 3/8 5/8 7/8
Can look these up for example here: https://www.fileformat.info/info/unicode/char/2581/index.htm
" ▁▂▃▄▅▆▇█" has all the heights, with a normal blank at the beginning.
So why does Hadley only use some of the available heights? Not sure.
Oh look at that! In my terminal those all look fine, but in a browser (maybe it depends on font?) the half and full blocks go lower than the others!
Still doesn't explain why 6/8 is missing from Hadley's list... Maybe it looks bad in other fonts? Let's try it fixed-width:
Yup, looks much nicer in fixed width. Here's another nice place to see these: https://en.wikipedia.org/wiki/Block_Elements
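And a toy version of the histospark idea, just to convince myself it's simple (my own sketch, not Hadley's implementation; the binning and scaling are arbitrary):

import random

def histospark(values, bins=10):
    # Bin the values and map each bin's count to one of the block characters
    blocks = " \u2581\u2582\u2583\u2585\u2587"  # blank plus Hadley's subset
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    top = max(counts)
    return "".join(blocks[round(c / top * (len(blocks) - 1))] for c in counts)

print(histospark([random.gauss(0, 1) for _ in range(1000)]))
# something like " ▁▃▇▇▅▂▁ " -- peaked in the middle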
And a fun reference to Monkeys, Kangaroos, and N:
I haven't heard about this as such, I think. Dimension_al_ analysis is
Interesting to recall that in the first edition, what's now called
He repeats it in different ways here and there, but I noted it again
When doing the quadratic example, he z-scores but does not
Okay I'll look at sklearn... Here's somebody with a nice Python page 111:
I really liked section 4.5.2 on splines; I don't think I ever saw a
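To see what a B-spline basis looks like concretely, here's a quick scipy sketch (my own, not the book's code; knot placement is just evenly spaced and clamped):

import numpy as np
from scipy.interpolate import BSpline

degree = 3
# Evenly spaced knots on [0, 1], with the boundary knots repeated ("clamped")
knots = np.r_[[0.0] * degree, np.linspace(0, 1, 7), [1.0] * degree]
n_basis = len(knots) - degree - 1

x = np.linspace(0, 1, 200)
# Each basis function is a BSpline whose coefficient vector is one-hot
basis = np.column_stack([BSpline(knots, np.eye(n_basis)[i], degree)(x)
                         for i in range(n_basis)])
print(basis.shape)  # (200, 9): one column per basis function
# The columns should sum to ~1 everywhere (a partition of unity), and a fitted
# spline is just basis @ weights for some weight vector.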
In both R listings 4.76 and 4.79, it's a little unintuitive to me in
For question 4H8 on page 122, it asks what else would have to change
Chapter 5: The Many Variables & The Spurious Waffles
In the first paragraph of 5.1.1, I don't really see how Figure 5.2
I really like dagitty. Learning about it is one of the best things
It is a little weird that the web interface uses ⊥ (falsum)... Hmm;
The
It took me a little bit to understand what he was getting at with the
The section 5.2 "Masked relationship" is neat.
What use of "magnitude" is this? Hmm... Looks like star brightness is Seems like this is less surprising to others, and it makes sense as "order of magnitude."
I was unfamiliar with Melanesia.
On page 155, he makes index variables seem fundamentally different
Question 5E3 on page 159 jokes (I think?) about the effects of amount
Chapter 6: The Haunted DAG & The Causal Terror
On page 161 he starts with Berkson's Paradox, suggesting
This can "smear out" your estimates, because it isn't clear which Section 6.2 (page 170) uses "post-treatment bias" to refer to what I
The thing haunting it (as in the title of the chapter) is unmeasured
I like "the four elemental confounds" on page 185. They're pretty
The explanation of shutting the back-door on pages 184-185 is better
Front-door isn't mentioned until page 460.
On page 186, he says "Conditioning on C is the better idea, from the
Say more? Like, a footnote? An endnote? Some kind of reference to more
Chapter 7: Ulysses' Compass
McElreath has slides and video of his lectures online.
There's some parallel between statistical models and scientific models
"Stargazing" is a cute way to criticize fixation on stars that On page 194 he uses "hominin" which I wasn't familiar with. Hominins
He loves pointing out this kind of thing.
This alludes to Shakespeare's famous Marc Antony speech in Julius
Saucy!
Wikipedia says "In its most basic form, MDL is a model selection McElreath points to Grünwald's The Minimum Description Length He's trying to develop "out-of-sample deviance" in 7.2 "Entropy and "Likelihood" as in the likelihood of the data, given the model, on
Interesting "Rethinking" box on page 204:
It has an endnote on page 563 that includes:
On page 207 he points out that when probability is zero, L'Hopital's
Endnote 110 on page 564 begins:
On page 207 he just says:
He just says "divergence" to mean Kullback-Leibler divergence, and
There he goes again!
His package has
There's more explanation in endnote 112 on page 564:
(Recall we're interested in the difference between these things; they
I was briefly befuddled by the positive log-likelihoods on page 210,
On page 211 he talks about
I had cause to do this recently! It comes down to this:
import math
def sum_log_prob(a, b):
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
I based that on a post from Kevin Karplus. McElreath's is:
log_sum_exp <- function( x ) {
xmax <- max(x)
xsum <- sum( exp( x - xmax ) )
xmax + log(xsum)
}
(Found on rdrr.)
Section 7.4 (page 217) is "predicting predictive accuracy."
"PSIS" is "Pareto-smoothed importance sampling cross-validation."
It seems like WAIC can only be used when you have a posterior
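For reference, the WAIC computation as I understand it, starting from a matrix of pointwise log-likelihoods over posterior samples (just a sketch, not the rethinking implementation):

import numpy as np

def waic(log_lik):
    # log_lik: (n_posterior_samples, n_observations) pointwise log-likelihoods
    # lppd: per observation, log of the average likelihood over posterior samples
    # (a real implementation would use log_sum_exp here to avoid underflow)
    lppd = np.log(np.exp(log_lik).mean(axis=0)).sum()
    # penalty: variance of the pointwise log-likelihood across posterior samples
    p_waic = log_lik.var(axis=0, ddof=1).sum()
    return -2 * (lppd - p_waic)

# Toy check with fake log-likelihoods: 1000 posterior draws, 5 observations
rng = np.random.default_rng(0)
print(waic(rng.normal(loc=-1.0, scale=0.1, size=(1000, 5))))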
Endnote 133 references
And then the chapter summary doesn't even mention cross-validation!
Acronyms:
Practice problem 7E1: State the three motivating criteria that define information entropy.
Practice problem 7E2: Suppose a coin is weighted such that, when it is
Well, entropy is the negative sum of p*log(p), so:
import math
# Truth (as in Problem 7E2)
p = [0.7, 0.3]
# Entropy, H(p)
H = lambda p: -sum(p_i * math.log(p_i) for p_i in p)
H(p) # 0.6108643020548935
# Candidate "models"
q = [0.5, 0.5]
r = [0.9, 0.1]
# Cross-Entropy, H(p, q), xH here because Python
xH = lambda p, q: -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))
xH(p, q) # 0.6931471805599453
xH(p, r) # 0.764527888858692
# KL Divergence, D(p, q)
D = lambda p, q: sum(p_i * math.log(p_i/q_i) for p_i, q_i in zip(p, q))
D(p, q) # 0.08228287850505178
D(p, r) # 0.15366358680379852
# D(p, q) = H(p, q) - H(p)
D(p, q) == xH(p, q) - H(p) # True
# We wish we could do this (use D) but we can't, because we don't have p.
# Data
d = [0, 0, 1]
# Log probability (likelihood) score
S = lambda d, p: sum(math.log(p[d_i]) for d_i in d)
S(d, q) # -2.0794415416798357
S(d, r) # -2.513306124309698
# True vs. predictive
S(d, p) # -1.917322692203401
S(d, [2/3, 1/3])
# -1.9095425048844388
# Deviance
deviance = lambda d, p: -2 * S(d, p)
# Positive log likelihoods! Gasp!
# Note the log probabilities here are really probabilities, because
# I'm just using point estimates, not real distributions. Really,
# you'll have densities, which can be greater than one.
And he really likes them, largely forgetting about cross-validation.
Chapter 8: Conditional Manatees
It's the interactions chapter!
Propeller marks on manatees are unpleasant, but DID YOU KNOW you see
Why not split data to condition on some categorical variable? (page
On page 245, he explains (again?) that using indicator variables is
On using fancy Greek letters in your model specification:
Section 8.2 (page 250) on "Symmetry of interactions" is pretty neat.
In endnote 142, McElreath recommends Grafen and Hails'
Main effects vs. interaction effects.
On weakly informative priors:
out of order! Chapter 11: God spiked the integers
It's the GLM (Generalized Linear Model) chapter!
This is a little loose, maybe; Poisson is the limit of binomial, which
Logistic regression as a special case of binomial regression, okay.
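The Poisson-as-a-limit-of-binomial bit is easy to check numerically (a quick sketch):

import scipy.stats

lam, k = 3.0, 2
print(scipy.stats.poisson(mu=lam).pmf(k))  # ~0.224
for n in (10, 100, 10000):
    # Binomial with many trials, tiny success probability, same mean lam
    print(n, scipy.stats.binom(n=n, p=lam / n).pmf(k))
# the binomial values approach the Poisson value as n grows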
Mutant helper functions? Is this a common term?
He shows a table that includes logistic regression coefficients, but
(When using data that is counts of outcomes.)
He really seems to want to make gamma-Poisson happen (replacing
Probit doesn't appear anywhere! (At least, I haven't seen it and it
Chapter outline:
out of order! Chapter 13: Models with memory
It's the varying effects ("random effects") chapter! Multilevel models!
This is a pretty cool property to have. The problem of data imbalance
Is this really the case? It would be neat to see an example where a
Costs of multilevel models (page 400, paraphrase):
Synonyms (page 401):
With parameters of multilevel models "most commonly known as random
(A "group" could be an individual, depending on the nature of the I don't love that "hyperparameter" is used for parameters that are Reasons for using a Gaussian prior (page 403):
Oh my! A coefficient for every observation! Take that, frequentist
It would be interesting to see a direct comparison, using e.g.
Page 408 itemizes three perspectives:
This in particular reminds me of
If you use the grand average to determine the Laplace binomial values,
I did a version of Laplace smoothing back when I was helping use
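The flavor I mean (a sketch; pseudo-counts at the grand-average rate, so small groups get pulled toward the overall mean, which is the same idea as partial pooling):

def smoothed_rate(successes, trials, grand_mean, weight=2.0):
    # Add `weight` pseudo-observations at the grand-average rate
    return (successes + weight * grand_mean) / (trials + weight)

grand_mean = 0.4  # made-up overall rate across all groups
print(smoothed_rate(0, 1, grand_mean))     # ~0.27: a 0-for-1 group is pulled toward 0.4
print(smoothed_rate(45, 100, grand_mean))  # ~0.45: a big group barely moves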
I enjoy that my preferred way of writing the logistic function is used
Ah! Here's where he mentions Mister P: Multilevel Regression and Poststratification.
out of order! Chapter 9: Markov Chain Monte Carlo
In an endnote, McElreath recommends Kruschke's
Chapter 10: Big entropy and the Generalized Linear Model
He suggests sensitivity assumptions, presumably including trying
He also points out on page 320 that with a different likelihood (and
Chapter 12: Monsters and Mixtures
This could almost be a summary of the book, maybe.
He offers the cumulative link function.
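My understanding of how the cumulative link works in the ordered-logit setup (a sketch with made-up cutpoints):

import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ordered_probs(cutpoints, phi=0.0):
    # Cumulative probabilities from the cutpoints minus the linear predictor phi,
    # then successive differences give the per-category probabilities
    cum = [inv_logit(c - phi) for c in cutpoints] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

cutpoints = [-1.5, 0.0, 1.5]  # made-up cutpoints for a 4-category outcome
print(ordered_probs(cutpoints))           # sums to 1
print(ordered_probs(cutpoints, phi=1.0))  # positive phi shifts mass to higher categories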
Chapter 14: Adventures in Covariance
That doesn't reflect the actual order, which has IV in the middle of
Estimates are pooled/shrunk, so parameters don't fit "tightly" to the
This chimpanzee example continues to be fairly dull, for the level of
The introduction to instrumental variables is based on the classic
Discussing the front-door criterion, he points to a blog post and
This is a little puzzling. Swapping axes shouldn't change correlation.
Ahhh... It doesn't swap the axes (unless there are only two
Why does this happen...
Some labeling is essentially arbitrary, so that "giver" and "receiver"
Consider a three-point graph. Our "point of view" node is attached to
Cool.
I like the phrase the author uses to describe GP regression:
The Stan documentation has more on fitting GP regressions.
I think the thing that keeps this kind of GP from fitting the data
But really, why doesn't it fit the data perfectly? In the primates
Oh! It's because the kernel matrix doesn't enter into the mean! ...
So the effect of just changing the covariance matrix is like this:
install.packages('mvtnorm')
library(mvtnorm)
data <- c(1, 1, -1, -1)
# the mean here defaults to c(0, 0, 0, 0)
# "standard" 4d normal (identity for covariance matrix)
dmvnorm(data)
# 0.003428083
# covariance matrix that expects clustering
sigma <- matrix(c(1, 0.5, 0, 0,
0.5, 1, 0, 0,
0, 0, 1, 0.5,
0, 0, 0.5, 1), nrow=4)
dmvnorm(data, sigma=sigma)
# 0.008902658 (more likely than when assuming independence)
So when expecting clustering, you don't have to explain via the mean
For the primates example, he gets a significant coefficient on group
This is a little weird, isn't it? Just because the relationship
In the final example he gets less covariance and the coefficient on
Ah: For the earlier example, it's a Poisson regression anyway, so it's
And the multi-variate normal bit has mean zero! It can only pull out
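(And to keep up the R/Python equivalence from earlier, the same dmvnorm check with scipy:)

import scipy.stats

data = [1, 1, -1, -1]
# "standard" 4d normal (zero mean, identity covariance)
print(scipy.stats.multivariate_normal(mean=[0, 0, 0, 0]).pdf(data))
# ~0.003428
# covariance matrix that expects clustering (first pair together, second pair together)
sigma = [[1, 0.5, 0, 0],
         [0.5, 1, 0, 0],
         [0, 0, 1, 0.5],
         [0, 0, 0.5, 1]]
print(scipy.stats.multivariate_normal(mean=[0, 0, 0, 0], cov=sigma).pdf(data))
# ~0.008903, matching dmvnorm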
Chapter 15: Missing data and other opportunities
(The ruthlessness is ruthlessness in applying rules of conditional
It sounds like instrumental variables are often (originally?) about
In the model, when we have data, the distribution we enter is
This is maybe a little strong; he's explaining here that multiple
He refs this paper, which has some missing data:
Chapter 16: Generalized Linear Madness
"GLMM" is "Generalized linear mixed model" where "mixed" means
The 1985 "Consider a Spherical Cow: A Course in Environmental Problem Three cites here:
Learning curves and teaching when acquiring nut-cracking in humans and chimpanzees
I think this is too strong, and he walks it back a little...
This particular model is a famous one, the Lotka-Volterra Model.
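The predator-prey dynamics are fun to simulate directly (a sketch with made-up parameters, not the chapter's fitted values):

import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, state, birth, predation, efficiency, death):
    prey, predators = state
    d_prey = birth * prey - predation * prey * predators
    d_predators = efficiency * predation * prey * predators - death * predators
    return [d_prey, d_predators]

# Made-up parameters and initial populations, just to see the cycles
solution = solve_ivp(lotka_volterra, t_span=(0, 50), y0=[10, 5],
                     args=(0.5, 0.05, 0.5, 0.5), dense_output=True)
t = np.linspace(0, 50, 500)
prey, predators = solution.sol(t)
print(prey.min(), prey.max())  # populations oscillate rather than settling down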
Chapter 17: Horoscopes
This makes me wonder whether there could be some proactive system to
Chapter 1
I think there's a little bit of linguistic confusion between
scientific hypotheses ("I think the world works like...") and
statistical hypotheses (also called "Statistical models" in Figure
1.2).
I think this is too strong.
also: importance of measurement
He has some neat examples of evidence that it's not trivial to draw inferences from: whether the ivory-billed woodpecker was extinct, and the faster-than-light neutrinos... To these, also add the Piltdown Man fraud.