### AIC vs. BIC
#### Q:
I often use fit criteria like AIC and BIC to choose between models. I know that they try to balance good fit with parsimony, but beyond that I’m not sure what exactly they mean. What are they really doing? Which is better? What does it mean if they disagree? 

#### Answer

As you know, AIC and BIC are both penalized-likelihood criteria. They are sometimes used for choosing the best predictor subsets in regression and are often used for comparing nonnested models, which ordinary likelihood ratio tests cannot compare. The AIC or BIC for a model is usually written in the form $-2\log L + kp$, where $L$ is the maximized likelihood of the model, $p$ is the number of parameters in the model, and $k$ is 2 for AIC and $\log(n)$ for BIC, with $n$ the sample size.
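The formula above can be sketched in a few lines of Python. The function name `information_criteria` and the two fitted models (their log-likelihoods, parameter counts, and the sample size of 200) are hypothetical values chosen only for illustration.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return (AIC, BIC) computed as -2 log L + k p.

    k = 2 for AIC and k = log(n) for BIC.
    """
    aic = -2.0 * log_likelihood + 2.0 * n_params
    bic = -2.0 * log_likelihood + math.log(n_obs) * n_params
    return aic, bic


# Hypothetical maximized log-likelihoods for two models fitted to n = 200 cases.
small_aic, small_bic = information_criteria(-520.0, n_params=3, n_obs=200)
large_aic, large_bic = information_criteria(-516.5, n_params=5, n_obs=200)

print(f"small model: AIC={small_aic:.2f}  BIC={small_bic:.2f}")
print(f"large model: AIC={large_aic:.2f}  BIC={large_bic:.2f}")
```

With these made-up numbers, AIC is lower for the larger model while BIC is lower for the smaller one, because BIC charges $\log(200) \approx 5.3$ per parameter instead of 2.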

AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so a lower AIC means a model is considered to be closer to the truth. BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so a lower BIC means a model is considered more likely to be the true model. Both criteria rest on various assumptions and asymptotic approximations, and each, despite its heuristic usefulness, has therefore been criticized as having questionable validity for real-world data. Despite these subtle theoretical differences, the only difference in practice is the size of the penalty: BIC penalizes model complexity more heavily whenever $\log(n) > 2$, that is, for $n \geq 8$. So when the two criteria disagree, it should be because AIC chooses a larger model than BIC.

AIC and BIC are each approximately correct under a different goal and a different set of asymptotic assumptions. Both sets of assumptions have been criticized as unrealistic. Understanding the difference in their practical behavior is easiest if we consider the simple case of comparing two nested models that differ by one parameter. In such a case, several authors have pointed out that information criteria become equivalent to likelihood ratio tests with different alpha levels. Checking a chi-squared table with one degree of freedom, we see that AIC behaves like a significance test at alpha = .16, and BIC behaves like a significance test with alpha depending on sample size, e.g., .13 for n = 10, .032 for n = 100, .0086 for n = 1000, and .0024 for n = 10000. Remember that power for any given alpha is increasing in n. Thus, AIC always has a chance of choosing too big a model, regardless of n. BIC has very little chance of choosing too big a model if n is sufficient, but it has a larger chance than AIC, for any given n, of choosing too small a model.
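The alpha levels quoted above can be reproduced with a short sketch. It assumes the two nested models differ by exactly one parameter, so the larger model wins when the likelihood ratio statistic exceeds the per-parameter penalty $k$, and the statistic is referred to a chi-squared distribution with one degree of freedom. The helper names are illustrative; only the standard library is used.

```python
import math


def chi2_1df_tail(x):
    """P(chi-squared with 1 df > x), via the identity erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


def implied_alpha(penalty_per_param):
    """Alpha of the likelihood ratio test that matches an IC penalty k."""
    return chi2_1df_tail(penalty_per_param)


aic_alpha = implied_alpha(2.0)  # AIC: k = 2
bic_alphas = {n: implied_alpha(math.log(n)) for n in (10, 100, 1000, 10000)}

print(f"AIC: alpha = {aic_alpha:.3f}")
for n, alpha in bic_alphas.items():
    print(f"BIC, n = {n}: alpha = {alpha:.4f}")
```

For models that differ by more than one parameter, the threshold becomes $k$ times the number of extra parameters and the reference distribution has that many degrees of freedom, so the implied alpha changes.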

So what’s the bottom line? In general, it might be best to use AIC and BIC together in model selection. For example, in selecting the number of latent classes in a model, if BIC points to a three-class model and AIC points to a five-class model, it makes sense to select from models with 3, 4 and 5 latent classes. AIC is better in situations when a false negative finding would be considered more misleading than a false positive, and BIC is better in situations where a false positive is as misleading as, or more misleading than, a false negative.

### Q: Why is the likelihood not a pdf?

### Answer

- The likelihood function is a function of the unknown parameter θ (conditioned on the data). As such, it typically does not have area 1 (i.e., its integral over all possible values of θ is not 1) and is therefore, by definition, not a pdf.

- A probability density function (pdf) is a non-negative function that integrates to 1.

- The likelihood is defined as the joint density of the observed data as a function of the parameter.
- The likelihood function is a function of the parameter only, with the data held as a fixed constant. So the fact that it is a density as a function of the data is irrelevant.
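A small numerical check can make this concrete. The sketch below uses a binomial model, which is an assumption made for illustration (it is discrete, so the "density" in the data is a probability mass function), with hypothetical data of 3 successes in 10 trials. Summing over all possible data for a fixed θ gives 1, while integrating the likelihood over θ for the fixed data does not.

```python
import math


def binom_pmf(k, n, theta):
    """Probability of k successes in n trials with success probability theta."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)


n_trials, k_observed = 10, 3  # hypothetical data

# As a function of the data (theta fixed at 0.3): sums to 1.
total_over_data = sum(binom_pmf(k, n_trials, 0.3) for k in range(n_trials + 1))

# As a function of theta (data fixed): midpoint-rule integral over [0, 1].
steps = 10_000
area_over_theta = sum(
    binom_pmf(k_observed, n_trials, (i + 0.5) / steps) for i in range(steps)
) / steps

print(f"sum over data:       {total_over_data:.6f}")
print(f"integral over theta: {area_over_theta:.6f}")
```

For this model the integral over θ equals $1/(n+1) = 1/11$, not 1, whatever the observed count is.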


### Q: Machine learning (ML) vs. statistical models (SM)

### A statistical model may be the better choice if

- Uncertainty is inherent and the signal:noise ratio is not large—even with identical twins, one twin may get colon cancer and the other not; one should model tendencies instead of doing classification when there is randomness in the outcome
- One doesn’t have perfect training data, e.g., cannot repeatedly test one subject and have outcomes assessed without error
- One wants to isolate effects of a small number of variables
- One wants to estimate the uncertainty in an overall prediction or in the effect of a predictor
- Additivity is the dominant way that predictors affect the outcome, or interactions are relatively small in number and can be pre-specified
- The sample size isn’t huge
- One wants to isolate (with a predominantly additive effect) the effects of “special” variables such as treatment or a risk factor
- One wants the entire model to be interpretable

### Machine learning may be the better choice if

- The signal:noise ratio is large and the outcome being predicted doesn’t have a strong component of randomness; e.g., in visual pattern recognition an object must be an E or not an E
- The learning algorithm can be trained on an unlimited number of exact replications (e.g., 1000 repetitions of each letter in the alphabet or of a certain word to be translated to German)
- Overall prediction is the goal, without being able to succinctly describe the impact of any one variable (e.g., treatment)
- One is not very interested in estimating uncertainty in forecasts or in effects of selected predictors
- Non-additivity is expected to be strong and can’t be isolated to a few pre-specified variables (e.g., in visual pattern recognition the letter L must have both a dominating vertical component and a dominating horizontal component and these two must intersect at their endpoints)
- The sample size is huge
- One does not need to isolate the effect of a special variable such as treatment
- One does not care that the model is a “black box”

### Q:  Precision Matrix


- Let X be multivariate normal with covariance matrix Σ.

- The precision matrix, Ω, is simply defined to be the inverse of the covariance matrix:
$\Omega := \Sigma^{-1}$

- The key property of the precision matrix is that its zeros tell you about conditional independence. Specifically:
$\Omega_{ij} = 0$ if and only if $X_i$ and $X_j$ are conditionally independent given all other coordinates of X.

- It may help to compare and contrast this with the analogous property of the covariance matrix:
$\Sigma_{ij} = 0$ if and only if $X_i$ and $X_j$ are independent.

- That is, whereas zeros of the covariance matrix tell you about independence, zeros of the precision matrix tell you about conditional independence.
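The contrast can be illustrated with a small worked example. The sketch below assumes a three-variable covariance matrix with entries $\Sigma_{ij} = \rho^{|i-j|}$ and $\rho = 1/2$ (a chain-like structure chosen purely for illustration), and inverts it exactly with a hand-written Gauss-Jordan routine over fractions so that zeros are exact rather than approximate. The helper `invert` is illustrative and does not handle singular matrices.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a square matrix exactly by Gauss-Jordan elimination.

    Illustrative helper: raises StopIteration if the matrix is singular.
    """
    n = len(matrix)
    # Build the augmented matrix [A | I].
    aug = [
        [Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


rho = Fraction(1, 2)
sigma = [[rho ** abs(i - j) for j in range(3)] for i in range(3)]
omega = invert(sigma)

print("Sigma[0][2] =", sigma[0][2])  # non-zero: X1 and X3 are dependent
print("Omega[0][2] =", omega[0][2])  # zero: independent given X2
```

Here $\Sigma_{13} = 1/4$ is non-zero, so the first and third coordinates are marginally dependent, while $\Omega_{13} = 0$, so they are conditionally independent given the second coordinate.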


