Reading notes on **Data Analysis Using Regression and Multilevel/Hierarchical Models**


Chapter 1: Why?
====

The two key parts of a multilevel model are varying coefficients and a
model for those varying coefficients.

Fixed effects can be viewed as special cases of random effects, in which
the higher level variance is set to 0 or $\infty$.
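The two limits can be illustrated with the usual approximate partial-pooling formula for a group mean, in which the multilevel estimate is a precision-weighted average of the group's own mean and the overall mean. This is only a sketch: the formula is an assumption brought in for illustration (it is not stated in these notes), and the function name, argument names, and numbers are all hypothetical.

```python
import math

def partial_pool(ybar_j, n_j, ybar_all, sigma_y, sigma_alpha):
    """Precision-weighted average of a group mean and the overall mean.

    sigma_y is the within-group sd and sigma_alpha the higher-level
    (between-group) sd. All names here are hypothetical.
    """
    if sigma_alpha == 0:          # no group-level variation: complete pooling
        return ybar_all
    if math.isinf(sigma_alpha):   # unbounded group-level variation: no pooling
        return ybar_j
    w_group = n_j / sigma_y ** 2
    w_all = 1 / sigma_alpha ** 2
    return (w_group * ybar_j + w_all * ybar_all) / (w_group + w_all)

# Made-up numbers: a group of 4 observations averaging 10, overall mean 6.
pooled = partial_pool(10.0, 4, 6.0, sigma_y=2.0, sigma_alpha=1.0)
complete_pooling = partial_pool(10.0, 4, 6.0, sigma_y=2.0, sigma_alpha=0.0)
no_pooling = partial_pool(10.0, 4, 6.0, sigma_y=2.0, sigma_alpha=math.inf)
```

With a finite, nonzero higher-level sd the estimate falls between the group mean and the overall mean; the two extreme settings recover the two classical "fixed effects" answers.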

Chapter 2: Concepts and methods from basic probability and statistics
==================================

Mean and variance of sums of correlated random variables: If $x$ and $y$
are random variables with means $\mu_{x},\,\mu_{y}$, and standard
deviations $\sigma_{x},\,\sigma_{y}$, and correlation $\rho$, then $x+y$
has mean $\mu_{x}+\mu_{y}$ and standard deviation
$\sqrt{\sigma_{x}^{2}+\sigma_{y}^{2}+2\rho\sigma_{x}\sigma_{y}}$. More
generally, the weighted sum $ax+by$ has mean $a\mu_{x}+b\mu_{y}$ and its
standard deviation is
$\sqrt{a^{2}\sigma_{x}^{2}+b^{2}\sigma_{y}^{2}+2ab\rho\sigma_{x}\sigma_{y}}$.
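As a quick check of the weighted-sum formula, here is a minimal sketch that evaluates it directly. The function name and the input numbers are made up for illustration.

```python
import math

def weighted_sum_mean_sd(a, b, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Mean and sd of a*x + b*y for x, y with correlation rho."""
    mean = a * mu_x + b * mu_y
    var = (a ** 2 * sigma_x ** 2 + b ** 2 * sigma_y ** 2
           + 2 * a * b * rho * sigma_x * sigma_y)
    return mean, math.sqrt(var)

# Difference x - y (a = 1, b = -1) with positive correlation:
# the cross term is negative, so the sd is smaller than under independence.
mean_diff, sd_diff = weighted_sum_mean_sd(1, -1, 10.0, 4.0, 3.0, 4.0, 0.5)

# Sum x + y with zero correlation: variances simply add.
mean_sum, sd_sum = weighted_sum_mean_sd(1, 1, 10.0, 4.0, 3.0, 4.0, 0.0)
```

Here `sd_diff` is $\sqrt{13}$ and `sd_sum` is 5, which matches the formula term by term.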

Estimated regression coefficients are themselves linear combinations of
data: $\hat{\beta}=(X^{t}X)^{-1}X^{t}y$, i.e. $\hat{\beta}$ is a linear
combination of the data values $y$.
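For a regression with one predictor and an intercept, the linear-combination property can be made explicit: each coefficient is a weighted sum of the $y$ values, with weights that depend only on $x$. The sketch below writes out those weights for a small made-up data set; the variable names are illustrative.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Weights are functions of x only, not of y.
w_slope = [(xi - x_bar) / s_xx for xi in x]
w_intercept = [1 / n - x_bar * wi for wi in w_slope]

# Each coefficient is a weighted sum (linear combination) of the y values.
slope = sum(wi * yi for wi, yi in zip(w_slope, y))
intercept = sum(wi * yi for wi, yi in zip(w_intercept, y))

# Linearity in y: doubling every y value doubles the slope.
slope_doubled = sum(wi * (2 * yi) for wi, yi in zip(w_slope, y))
```

These weights are the rows of $(X^{t}X)^{-1}X^{t}$ for the two-column design matrix with a constant and $x$.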

Two pitfalls of the approach of summarizing by statistical significance:
1) statistical significance does not equal practical significance. 2)
changes in statistical significance are not themselves significant. Even
large changes in significance levels can correspond to small,
nonsignificant changes in the underlying variables.
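The second pitfall can be seen in a small calculation. The sketch below uses illustrative numbers for two estimates, and assumes the two estimates are independent so that their variances add.

```python
import math

# Illustrative estimates and standard errors from two independent studies.
est_1, se_1 = 25.0, 10.0
est_2, se_2 = 10.0, 10.0

z_1 = est_1 / se_1            # 2.5 standard errors from zero
z_2 = est_2 / se_2            # 1.0 standard error from zero

# The difference between the two estimates and its standard error
# (independence assumed, so variances add).
diff = est_1 - est_2
se_diff = math.sqrt(se_1 ** 2 + se_2 ** 2)
z_diff = diff / se_diff
```

One estimate is clearly "significant" and the other is not, yet the difference between them is only about one standard error from zero.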

Chapter 3: Linear regression: the basics
=============================

In a regression with a single binary predictor, the regression
coefficient is the difference between the averages of the two groups,
and the intercept is the average of the group coded 0.
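A minimal check of this statement on made-up data, fitting a one-predictor least squares line by hand:

```python
x = [0, 0, 0, 1, 1]                 # binary predictor (group indicator)
y = [1.0, 2.0, 3.0, 5.0, 7.0]       # made-up outcomes

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept for y = a + b*x.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Group averages computed directly.
mean_0 = sum(yi for xi, yi in zip(x, y) if xi == 0) / x.count(0)
mean_1 = sum(yi for xi, yi in zip(x, y) if xi == 1) / x.count(1)
```

The fitted slope equals `mean_1 - mean_0` and the intercept equals `mean_0`.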

With multiple predictors, typical advice is to interpret each
coefficient “with all other predictors held constant.” But it is not
always possible to change one predictor while holding all others
constant. For example, if a model includes both $x$ and $x^{2}$ as
predictors, it does not make sense to consider changes in $x$ with
$x^{2}$ held constant.

Interactions can be important. Including interactions is a way to allow
a model to be fit differently to different subsets of data. Models with
interactions can often be more easily interpreted if we first
pre-process the data by centering each input variable about its mean or
some other convenient reference point.

The least squares estimate is also the maximum likelihood estimate if
the errors $\epsilon_{i}$ are independent with equal variance and
normally distributed. In any case, the least squares estimate can be
expressed in matrix notation as $\hat{\beta}=(X^{t}X)^{-1}X^{t}y$. As a
byproduct of the least squares estimation of $\beta$, the residuals
$r_{i}=y_{i}-X_{i}\hat{\beta}$ will be uncorrelated with all the
predictors in the model. If the model includes a constant term, then the
residuals must be uncorrelated with a constant, which means they must
have mean 0. ***This is a byproduct of how the model is estimated; it is
not a regression assumption***.
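The residual property can be verified numerically. This sketch fits a one-predictor line to made-up data and checks that the residuals average to zero and have zero cross-product with the predictor; `fit_line` is a helper defined here, not a library function.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Both quantities are zero up to floating point error, whatever the data.
mean_resid = sum(residuals) / len(residuals)
cross_product = sum(ri * xi for ri, xi in zip(residuals, x))
```

Replacing `y` with any other numbers gives the same two zeros, which is why this is a property of the fitting procedure and not something to be checked in the data.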

Residual standard deviation:
$\hat{\sigma}=\sqrt{\sum_{i=1}^{n}r_{i}^{2}/(n-k)}$, where $n$ is the
number of data points and $k$ is the number of estimated coefficients.
The fit of the model can be summarized by $\hat{\sigma}$ (the smaller
the residual variance, the better the fit) and by $R^{2}$, the fraction
of variance “explained” by the model. The unexplained variance is
$\hat{\sigma}^{2}$, and if we label $s_{y}$ as the sd of the data, then
$R^{2}=1-\hat{\sigma}^{2}/s_{y}^{2}$. $\hat{\sigma}^{2}$ has a sampling
distribution centered at the true value, $\sigma^{2}$, and proportional
to a $\chi^{2}$ distribution with $n-k$ degrees of freedom.
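A small sketch of these two summaries on made-up data with one predictor plus an intercept (so $k=2$). It assumes $s_{y}$ is computed with an $n-1$ denominator, and for comparison also computes the ratio of raw sums of squares.

```python
import math
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, k = len(x), 2                     # k counts the intercept and the slope

x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
rss = sum(r ** 2 for r in residuals)          # residual sum of squares

sigma_hat = math.sqrt(rss / (n - k))          # residual standard deviation
s_y_sq = statistics.variance(y)               # sample variance, n - 1 denominator

r2_notes = 1 - sigma_hat ** 2 / s_y_sq        # formula as written in the notes
tss = sum((yi - y_bar) ** 2 for yi in y)
r2_sums_of_squares = 1 - rss / tss            # ratio of raw sums of squares
```

The two versions of $R^{2}$ differ only through the $n-k$ and $n-1$ denominators, so they are close unless $n$ is small relative to $k$.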

When an estimate is statistically significant, we are fairly sure that
the sign (+/-) of the estimate is stable, and not just an artifact of
small sample size.

It is fine to have nonsignificant coefficients in a model, as long as
they make sense.

**Assumptions** of the regression model: Validity (the data you are
analyzing should map to the research question you are trying to answer);
additivity and linearity (the deterministic component is a linear
function of the predictors); independence of errors; equal variance of
errors; normality of errors.

Chapter 4: Linear regression: before and after fitting the model
================================

Linear transformations do not affect the fit of a classical regression
model, and they do not affect predictions: the changes in the inputs and
the coefficients cancel in forming the predicted value $X\beta$.
However, well-chosen linear transformations can improve the
interpretability of coefficients and make a fitted model easier to
understand. For example, in the model $earnings\sim height+male$, the
intercept corresponds to a height of 0, which is not meaningful. If we
center height at its mean value, the intercept becomes the predicted
earnings at the mean height (with $male=0$).
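The effect of centering can be checked on a one-predictor version of the example. The data below are invented, and the model drops the `male` term to keep the sketch short; `fit_line` is a helper defined here.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

# Hypothetical heights (inches) and earnings (thousands of dollars).
height = [60.0, 62.0, 65.0, 68.0, 70.0, 72.0]
earnings = [18.0, 20.0, 21.0, 25.0, 24.0, 30.0]

mean_height = sum(height) / len(height)
height_c = [h - mean_height for h in height]

a_raw, b_raw = fit_line(height, earnings)
a_cen, b_cen = fit_line(height_c, earnings)

# Predictions (and hence residuals) are identical under both codings.
pred_raw = [a_raw + b_raw * h for h in height]
pred_cen = [a_cen + b_cen * h for h in height_c]
```

The slope is unchanged, the predictions agree, and the centered intercept `a_cen` equals the prediction at the mean height, while `a_raw` is an extrapolation to height 0.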

Standardizing predictors using z-scores will change our interpretation
of the intercept to the mean of $y$ when all predictor values are at
their mean values.

We actually prefer to divide by 2 standard deviations, so that
inferences are more consistent with those for binary inputs.
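One way to see why 2 standard deviations is the natural scale: a balanced binary input has sd 0.5, so the jump from 0 to 1 is exactly 2 sd. The sketch below assumes the population sd (`statistics.pstdev`) and uses a hypothetical helper `rescale_2sd` with made-up data.

```python
import statistics

def rescale_2sd(values):
    """Center at the mean and divide by 2 sd (population sd assumed)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(v - m) / (2 * s) for v in values]

# A balanced binary input: after rescaling the two groups are 1 unit apart,
# the same gap as in the original 0/1 coding.
male = [0, 0, 1, 1]
male_scaled = rescale_2sd(male)
gap_binary = max(male_scaled) - min(male_scaled)

# A continuous input rescaled the same way has sd 0.5, like the binary one.
height = [60.0, 64.0, 68.0, 72.0]
height_scaled = rescale_2sd(height)
sd_height_scaled = statistics.pstdev(height_scaled)
```

After this rescaling, a 1-unit change in a continuous input corresponds to a 2-sd change, which is roughly comparable to switching a binary input from 0 to 1.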

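Centering the predictors is a linear transformation, so it does not
affect the fit of a classical regression model: the residual sd and
$R^{2}$ do not change, and neither do the coefficient and standard error
of the interaction. The intercept and main effects, however, do change,
along with their interpretation.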

Consider a model $y=a+bx+error$. If both $x$ and $y$ are standardized (1
sd), then the regression intercept is zero and the slope is simply the
correlation between $x$ and $y$. Thus the slope of a regression of two
standardized variables must always be between -1 and 1. In general, the
slope of a regression with one predictor is
$b=\rho\sigma_{y}/\sigma_{x}$.
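Both statements can be checked numerically. The sketch below uses made-up data and population standard deviations; `fit_line` is a helper defined here.

```python
import statistics

def fit_line(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

mx, my = statistics.fmean(x), statistics.fmean(y)
sx, sy = statistics.pstdev(x), statistics.pstdev(y)

# Sample correlation between x and y.
r = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) * sx * sy)

# Standardize both variables, then regress.
zx = [(xi - mx) / sx for xi in x]
zy = [(yi - my) / sy for yi in y]
a_std, b_std = fit_line(zx, zy)

# Regression on the raw scale.
a_raw, b_raw = fit_line(x, y)
```

On the standardized scale the intercept is zero and the slope equals `r`; on the raw scale the slope equals `r * sy / sx`.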

The principal component line: goes through the cloud of points, in the
sense of minimizing the sum of squared Euclidean distances between the
points and the line.

The regression line: minimizes the sum of the squares of the vertical
distances between the points and the line.

For the goal of predicting $y$ from $x$, or for estimating the average
of $y$ for any given value of $x$, the regression line is in fact
better.
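The contrast between the two lines can be sketched on standardized data. The code assumes the standard closed-form slope for the first principal axis of a two-variable cloud, uses made-up data, and compares the vertical prediction error of the two lines.

```python
import math
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
n = len(x)

# Standardize both variables (population sd), so both lines pass through 0.
zx = [(v - statistics.fmean(x)) / statistics.pstdev(x) for v in x]
zy = [(v - statistics.fmean(y)) / statistics.pstdev(y) for v in y]

s_xx = sum(v * v for v in zx) / n
s_yy = sum(v * v for v in zy) / n
s_xy = sum(u * v for u, v in zip(zx, zy)) / n

# Regression line: minimizes squared vertical distances.
reg_slope = s_xy / s_xx

# Principal component line: minimizes squared perpendicular distances.
pc_slope = (s_yy - s_xx + math.sqrt((s_yy - s_xx) ** 2 + 4 * s_xy ** 2)) / (2 * s_xy)

def vertical_sse(slope):
    """Sum of squared vertical errors when predicting zy from zx."""
    return sum((v - slope * u) ** 2 for u, v in zip(zx, zy))

sse_reg = vertical_sse(reg_slope)
sse_pc = vertical_sse(pc_slope)
```

For these positively correlated standardized variables the principal component line has slope 1, while the regression line has the shallower slope equal to the correlation and the smaller vertical prediction error.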

Regression to the mean: when $x$ and $y$ are standardized (placed on a
common scale), the regression line always has slope less than 1 in
absolute value, unless the two are perfectly correlated. Thus, with a
positive correlation, when $x$ is 1 sd above the mean, the predicted
value of $y$ is somewhere between 0 and 1 sd above the mean. This
phenomenon in linear models, that $y$ is predicted to be closer to the
mean (in sd units) than $x$, is called regression to the mean. A naive
interpretation of regression
to the mean is that heights or other variable phenomena necessarily
become more and more average over time. This view is mistaken because it
ignores the error in the regression predicting $y$ from $x$. For any
data point $x_{i}$, the point prediction for its $y_{i}$ will be
regressed toward the mean, but the actual $y_{i}$ that is observed will
not be exactly where it is predicted.
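A short simulation makes the point about the naive interpretation concrete. It assumes standardized, jointly normal $x$ and $y$ with a hypothetical correlation of 0.5; the seed and sample size are arbitrary.

```python
import math
import random
import statistics

random.seed(0)
rho = 0.5          # hypothetical correlation between x and y
n = 20000

# x and y are both standardized; y = rho*x + independent error.
x = [random.gauss(0, 1) for _ in range(n)]
y = [rho * xi + math.sqrt(1 - rho ** 2) * random.gauss(0, 1) for xi in x]

# Among cases with x more than 1 sd above the mean, y is closer to the mean.
high_x = [xi for xi in x if xi > 1.0]
y_given_high_x = [yi for xi, yi in zip(x, y) if xi > 1.0]
mean_x_high = statistics.fmean(high_x)
mean_y_high = statistics.fmean(y_given_high_x)

# Yet the overall spread of y is not smaller than the spread of x.
sd_x = statistics.pstdev(x)
sd_y = statistics.pstdev(y)
```

The average $y$ for high-$x$ cases is pulled toward the mean, but `sd_y` stays close to `sd_x`: the error term restores the spread, so the population does not become more average.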

A linear model on the logarithmic scale corresponds to a multiplicative
model on the original scale.

Log-log model: the coefficient can be interpreted as the expected
proportional change in $y$ per proportional change in $x$.
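Both interpretations reduce to simple arithmetic once the coefficients are exponentiated. The coefficient values below are hypothetical, chosen only to illustrate the calculation.

```python
import math

# Model on the log scale: log(y) = a + b*x + error.
# A one-unit increase in x multiplies y by exp(b).
b_log = 0.06                       # hypothetical coefficient
multiplier_per_unit = math.exp(b_log)

# Log-log model: log(y) = a + b*log(x) + error.
# Multiplying x by a factor c multiplies y by c**b.
b_loglog = 0.8                     # hypothetical coefficient
ratio_for_1pct = 1.01 ** b_loglog  # effect of a 1% increase in x
ratio_for_double = 2 ** b_loglog   # effect of doubling x
```

For small changes the approximation is direct: a 1% increase in $x$ corresponds to roughly a $b$% increase in $y$ (here about 0.8%), and a coefficient of 0.06 on the log scale corresponds to roughly a 6% increase per unit of $x$.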
