# Chapter 14. Overview of the GLM

## -----------------------------------------------------------------------------------------------------------------------------
## Contents
### 14.1 The GLM
### 14.2 Cases of the GLM
## -----------------------------------------------------------------------------------------------------------------------------

### 14.1 The Generalized Linear Model (GLM)

### 14.1.1 Predictor and predicted variables

Suppose we want to predict someone’s weight from their height. 
> Predicted variable (dependent variable): weight

> Predictor (explanatory variable): height

Suppose we want to predict high school grade point average (GPA) from Scholastic Aptitude Test (SAT) score and family income.
> Predicted variable (dependent variable): GPA

> Predictor (explanatory variable): SAT and income

The value of the predictor variable comes from “outside” the system being modeled, whereas the value of the predicted variable depends on the value of the predictor variable.

#### The key mathematical difference between predictor and predicted variables is that the likelihood function expresses the probability of values of the predicted variable as a function of values of the predictor variable.
The likelihood function does not describe the probabilities of values of the predictor variable.

In experimental settings, the variables that are actually manipulated and set by the
experimenter are the independent variables. In this context of experimental manipulation, the values of the independent variables truly are (in principle, at least) independent of the values of other variables, because the experimenter has intervened to arbitrarily set the values of the independent variables. But sometimes a non-manipulated variable is also referred to as “independent”, merely as a way to indicate that it is being used as a predictor variable.

Among non-manipulated variables, the roles of predicted and predictor are arbitrary, determined only by the interpretation of the analysis. 

> Consider, for example, people’s weights and heights. We could be interested in predicting a person’s weight from his/her height, or we could be interested in predicting a person’s height from his/her weight.

Prediction is merely a mathematical dependency, not necessarily a description of underlying causal relationship.

Just as “prediction” does not imply causation, “prediction” also does not imply any temporal relation between the variables. 

> For example, we may want to predict a person’s
sex, male or female, from his/her height. Because males tend to be taller than females, this
prediction can be made with better than chance accuracy. But a person’s sex is not caused
by his/her height, nor does a person’s sex occur only after their height is measured. 

Thus,
we can “predict” a person’s sex from his/her height, but this does not mean that the person’s
sex occurred later in time than his/her height.

### In summary: 
#### All manipulated independent variables are predictor variables, not predicted. 
#### Some dependent variables can take on the role of predictor variables, if desired. All predicted variables are dependent variables. 



### Why we care.
We care about these distinctions between predicted and predictor variables <B>because the likelihood function is a mathematical description of the dependency of the predicted variable on the predictor variable.</B>

The first thing we have to do in statistical inference is identify what variables we are interested in predicting, on the basis of what
predictors.

### 14.1.2 Scale types: metric, ordinal, nominal

Items can be measured on different scales. 

> For example, the participants in a foot race can be measured by
>> * The time they took to run the race. -> <B>metric</B>
>> * Their placing in the race (1st, 2nd, 3rd, etc.). -> <B>ordinal</B>
>> * The name of the team they represent. -> <B>nominal</B>


Examples of <B>metric-scaled data</B>  include response time (i.e., latency or duration), temperature,
height, and weight. 

> * <B>Ratio scale</B>: a metric scale that has a natural zero point, such as response time, height, and weight.
> * <B>Interval scale</B>: a metric scale with no natural zero point, so all we know is the amount of stuff in an interval on the scale, not the amount of stuff at a point on the scale.
> * <B>Count data (or frequency data)</B>: a special case of metric-scaled data.


Examples of <B>ordinal</B> scales include placing in a race, or rating of degree of agreement.


Examples of <B>nominal (or categorical) scales</B> include political party affiliation, the face of a rolled die, and the result of a flipped coin. 

### Why we care.
We care about the scale type because the likelihood function must specify a probability distribution on the appropriate scale. 

If the scale has two nominal values, then a Bernoulli likelihood function may be appropriate. 

If the scale is metric, then a normal distribution may be appropriate as a likelihood function. 

★ Whenever we are choosing a model for data, we must answer the question: What kind of scale are we dealing with?

### 14.1.3 Linear function of a single metric predictor

> Predicted variable (dependent variable): y (metric)

> Predictor (explanatory variable): x (metric)

> Simplest assumption: Linear relationship

>> Linear functions preserve proportionality. If you double the input, then you double the output.
Despite the fact that many real-world dependencies are non-linear, most are at least approximately linear over
moderate ranges of the variables. 


The general mathematical form for a linear function of a single variable is

\begin{eqnarray*}
y = β_0 + β_1 x  ~~~~~~~~~~~~~~~~~~~~~~(14.1)              
\end{eqnarray*}

The value of parameter $β_0$ is called the y-intercept because it is where the line intersects the y-axis when x = 0. 

The value of parameter $β_1$ is called the slope because it indicates how much y increases when x increases by 1. 

<figure id="fig.redline0" style="float: none"><img src="1.png"><figcaption> 
</figcaption></figure>

In strict mathematical terminology, the type of transformation in Equation 14.1 is called <B><I>affine</I></B>. When $β_0 \neq 0$, the transformation does not preserve proportionality.

For example, consider y = 10 + 2x. When x is doubled from x = 1 to x = 2, y increases from y = 12 to y = 14, which is not doubling y. Nevertheless, the rate of increase in y is the same for all values of x: Whenever x increases by 1, y increases by 2.

> ### 14.1.3.1 Re-parameterization to $x$ threshold form

Equation 14.1 can be algebraically re-arranged as follows:

\begin{eqnarray*}
y = β_0 + β_1 x = \beta_1(x-(-\beta_0/\beta_1))~~~~~~~~~~~~~~~~~~~~~~(14.2)              
\end{eqnarray*}


This form of the equation is useful because it explicitly shows the value of the x-intercept, also known as the threshold, denoted $\theta = -\beta_0/\beta_1$. (The threshold is the value of x when y is zero.)


The x-threshold form preserves proportionality for $x− \theta$. 

> As an example, consider again the case of y = 10 + 2x. When changed to x-threshold form, it becomes y = 2(x + 5). When x changes from 1 to 2, x + 5 changes from 6 to 7, which is an increase of (7 − 6)/6 = 1/6. The resulting change in y is from 12 to 14, which is an increase of (14 − 12)/12 = 1/6. <B>Thus, a 1/6 increase in x − $\theta$ results in a 1/6 increase in y.</B>


<B>The threshold (i.e., x intercept) is often more meaningful than the y-intercept. </B>


For example, suppose we are piloting a tugboat upstream on the Mississippi river, and we want to predict how much headway y we will gain against the current for a given setting of the throttle x. Suppose it is the case that y = −2 + 4x. This form of the equation indicates
that when we apply zero engine power, that is when x = 0, then we lose 2 miles an hour, i.e., y = −2. 


<B>In other words, the y intercept tells us the baseline speed of the river current that we are trying to overcome. 

What may be more useful to know, however, is the amount of engine power we need to apply in order to overcome the current: How big must x be so that we are just matching the downstream pressure? </B>

The answer to this question is the <B>threshold</B>, i.e., the value of x that makes y = 0. In our example, wherein y = −2 + 4x, the threshold is $\theta$ = −(−2/4) = 0.5. In other words, when the throttle is set above the threshold of 0.5, then we make progress upstream because y > 0, but when the throttle is set below the threshold of 0.5, then we drift downstream because y < 0. 

<B>Thus, the more intuitive form of the “headway” equation is the x intercept form, y = 4(x− 0.5), because it shows explicitly that our headway is proportional to how much the throttle exceeds 0.5.</B>
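
The conversion between the two forms is easy to check numerically. The following is a minimal sketch, not part of the original text; the function names `to_threshold_form` and `headway` are made up for illustration, and the coefficients are the tugboat values used above.

```python
def to_threshold_form(beta0, beta1):
    """Return (slope, threshold) so that beta0 + beta1*x == slope*(x - threshold)."""
    if beta1 == 0:
        raise ValueError("the threshold is undefined when the slope is zero")
    return beta1, -beta0 / beta1


def headway(throttle, beta0=-2.0, beta1=4.0):
    """Tugboat example from the text in y-intercept form: y = -2 + 4x."""
    return beta0 + beta1 * throttle


slope, theta = to_threshold_form(-2.0, 4.0)
print(slope, theta)                            # slope and threshold of the tugboat line
print(headway(theta))                          # headway at the threshold
print(slope * (0.75 - theta), headway(0.75))   # both forms at throttle 0.75
```

Applying the same function to y = 10 + 2x gives a threshold of −5, which matches the form y = 2(x + 5) used earlier.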

### Summary of why we care. 
The likelihood function specifies the form of the dependency of y on x. When y and x are metric variables, the simplest form of dependency, both mathematically and intuitively, is one that preserves proportionality. The mathematical expression of this relation is a so-called linear function. The usual mathematical expression of a line is the y intercept form, but often a more intuitive expression is the x threshold form.

<B>Linear functions form the core of most statistical models, so it is important to become facile with their algebraic forms and graphical representations.</B>

### 14.1.4 Additive combination of metric predictors

If we want an increase in one predictor variable to predict the <B><I>same</I></B> proportional increase in the predicted variable <B><I>for any value of the other predictor variables,</I></B> then the predictions of the individual predictor variables must be added.


In general, a linear combination of $K$ predictor variables has the form

\begin{eqnarray*}
y = β_0 + β_1 x_1 +\cdots +β_K x_K  = \beta_0 + \sum_{k=1}^{K} \beta_k x_k~~~~~~~~~~~~~~~~~~~~~~(14.3)              
\end{eqnarray*}

Figure 14.2 shows examples of linear functions of <B>two</B> variables, $x_1$ and $x_2$. 

<figure id="fig.redline0" style="float: none"><img src="2.png"><figcaption> 
</figcaption></figure>

> ### 14.1.4.1 Reparameterization to $x$ threshold form

>For notational convenience, define the length of a vector $\vec{β} = <\beta_1, ..., β_K >$ to be $||\vec{β}|| = (\sum_k β_k^2)^{1/2}$. 

> With this new notation for length, Equation 14.3 can be algebraically re-expressed as

<figure id="fig.redline0" style="float: none"><img src="3.png"><figcaption> 
</figcaption></figure> $~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~(14.4)$

> Notice that when there is only a single predictor variable, i.e., when K = 1, then ||$\vec{β}$|| = |$\beta_1$|
and Equation 14.4 reduces to Equation 14.2.


> In Equation 14.4, the value of $\theta$ is the (Euclidean) length of $\vec{x}$ when y = 0 and when $\vec{x}$ is in the direction of vector $<\beta_1, ..., β_K >$. In other words, when $\vec{x} = \theta \, \vec{β} / ||\vec{β}||$ then y = 0.


> When the length of $\vec{x}$ (in that direction) exceeds the threshold $\theta$, then y > 0. The x threshold form in Equation 14.4 becomes especially useful when we consider <B>logistic regression</B> in future chapters.
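
Because Equation 14.4 appears only as an image, it may help to check one algebraically equivalent way of writing it. The sketch below assumes the form $y = ||\vec{β}|| (\sum_k w_k x_k - \theta)$ with $w_k = β_k / ||\vec{β}||$ and $\theta = -β_0 / ||\vec{β}||$, which reduces to Equation 14.2 when K = 1 and $β_1 > 0$. The coefficient values and function names are hypothetical.

```python
import math


def linear(beta0, betas, xs):
    """Equation 14.3: y = beta0 + sum_k beta_k * x_k."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))


def threshold_form(beta0, betas):
    """Return (length of beta, unit-length weights, threshold theta)."""
    norm = math.sqrt(sum(b * b for b in betas))
    weights = [b / norm for b in betas]
    theta = -beta0 / norm
    return norm, weights, theta


def linear_via_threshold(beta0, betas, xs):
    """Same prediction, computed as norm * (sum_k w_k * x_k - theta)."""
    norm, weights, theta = threshold_form(beta0, betas)
    return norm * (sum(w * x for w, x in zip(weights, xs)) - theta)


beta0, betas = -6.0, [3.0, 4.0]           # hypothetical coefficients
norm, weights, theta = threshold_form(beta0, betas)

# A point of length theta in the direction of beta should give y = 0.
x_on_threshold = [theta * w for w in weights]
print(norm, weights, theta)
print(linear(beta0, betas, x_on_threshold))
print(linear(beta0, betas, [1.0, 2.0]), linear_via_threshold(beta0, betas, [1.0, 2.0]))
```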

> ### Summary of section: 
When the influence of every individual predictor is unchanged
by changing the values of other predictors, then the influences are additive. The combined
influence of two or more predictors can be additive even if the individual influences are
nonlinear. But if the individual influences are linear, and the combined influence is additive,
then the overall combined influence is also linear. The formula of Equation 14.3, or its
reparameterization in Equation 14.4, is known as the <B>linear model</B>. It forms the core of
many statistical models.

### 14.1.5 Nonadditive interaction of metric predictors

The combined influence of two predictors does <B>not</B> have to be additive. 


Consider, for example, a person’s self-rating of happiness, predicted from his/her overall health and annual income. It’s likely that if a person’s health is very poor, then the person is not happy, regardless of his/her income. And if the person has zero income, then the person is probably not happy, regardless of his/her health. But if the person is both healthy and rich, then the person is probably happy (despite celebrated counter-examples in the popular media).

<figure id="fig.redline0" style="float: none"><img src="4.png"><figcaption> 
</figcaption></figure>

A graph of this sort of non-additive interaction between predictors appears in the upper left panel of Figure 14.3. 

<B>Notice that the graph of the interaction has a twist in it, but the graph of the additive combination is flat. Also, a non-additive interaction of predictors does not have to be multiplicative.</B>
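
One way to see the difference is to measure the effect of a one-unit change in $x_1$ at several fixed values of $x_2$. The sketch below uses a multiplicative term as just one possible kind of interaction (the text notes that interactions need not be multiplicative); all coefficients and function names are hypothetical.

```python
def additive(x1, x2, b0=0.0, b1=1.0, b2=1.0):
    """Additive combination: y = b0 + b1*x1 + b2*x2."""
    return b0 + b1 * x1 + b2 * x2


def multiplicative(x1, x2, b0=0.0, b1=0.0, b2=0.0, b12=1.0):
    """Adds a product term: y = b0 + b1*x1 + b2*x2 + b12*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2


def effect_of_x1(model, x2, step=1.0):
    """Change in y when x1 goes from 0 to `step`, holding x2 fixed."""
    return model(step, x2) - model(0.0, x2)


for x2 in (0.0, 5.0, 10.0):
    # Additive: the effect of x1 is the same at every x2.
    # Multiplicative: the effect of x1 grows with x2.
    print(x2, effect_of_x1(additive, x2), effect_of_x1(multiplicative, x2))
```

With the purely multiplicative model, y is zero whenever either predictor is zero, which mirrors the happiness example: poor health or zero income alone is enough to keep the prediction low.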

### 14.1.6 Nominal predictors

> ### 14.1.6.1 Linear model for a single nominal predictor

> The previous sections assumed that the predictor was metric. But what if the predictor is nominal, such as political party affiliation or gender? 
What is the simplest generic model for a metric variable predicted from a nominal variable? 

> Answer: The “natural” model has each value of x generate a particular deflection of y away from its baseline level. 

> For example, consider predicting height from sex (male or female). We can consider the overall average height across both sexes as the baseline height. When an individual has the value “male”, that adds an upward deflection to the predicted height. When an individual has the value “female”, that adds a downward deflection to the predicted height.


> Expressing that idea in mathematical notation can get a little tricky. First consider the nominal predictor. We can’t represent it appropriately as a single scalar value, such as 1 through 5, because that would mean that level 1 is closer to level 2 than it is to level 5, which is not true of nominal values. Therefore, instead of representing the value of the nominal predictor by a single scalar value x, we will represent the nominal predictor by a vector $\vec{x}$ = $<x_1, ..., x_J >$, where $J$ is the number of categories that the predictor has. When an individual has level j of the nominal predictor, this is represented by setting $x_j$ = 1 and $x_{i \neq j}$ = 0. 


> Now that we have a formal representation for the nominal predictor variable, we can create a formal representation for the generic model of how the predictor influences the predicted variable. As mentioned above, the idea is that there is a baseline level of the predicted variable, and each category of the predictor indicates a deflection above or below that baseline level. We will denote the baseline value of the prediction as $β_0$. The deflection for the $j^{th}$ level of the predictor is denoted $β_j$. Then the predicted value is


\begin{eqnarray*}
y = β_0 + β_1 x_1 +\cdots +β_J x_J  = \beta_0 + \vec{β}·\vec{x}~~~~~~~~~~~~~~~~~~~~~~(14.5)              
\end{eqnarray*}


> where the notation $\vec{β}$·$\vec{x}$ is sometimes called the “dot product” of the vectors.
Notice that Equation 14.5 has a form very similar to the basic linear form of Equation 14.1. The conceptual analogy is this: In Equation 14.1 for a metric predictor, the slope $β_1$ indicates how much y changes when x changes from 0 to 1. In Equation 14.5 for a nominal predictor, the coefficient $β_j$ indicates how much y changes when x changes from neutral
to category j.



> The baseline is constrained so that the deflections sum to zero across the categories:


\begin{eqnarray*}
\sum_{j=1}^{J} β_j = 0 ~~~~~~~~~~~~~~~~~~~~~~(14.6)              
\end{eqnarray*}


> The expression of the model in Equation 14.5 is not complete without the constraint in 14.6.

<figure id="fig.redline0" style="float: none"><img src="5.png"><figcaption> 
</figcaption></figure>

> Figure 14.4 shows examples of a nominal predictor, expressed in terms of Equations 14.5 and 14.6. The left panel shows a case for which J = 2, and the right panel shows a case in which J = 5. Notice that the deflections from baseline sum to zero, as
demanded by the constraint in Equation 14.6.
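
Equations 14.5 and 14.6 can be sketched in a few lines. The example below is illustrative only: the group means are made-up heights, the function names are hypothetical, and the baseline is taken as the unweighted mean of the category means, which is one simple way to satisfy the sum-to-zero constraint.

```python
def one_hot(level, n_levels):
    """Represent level j (0-based) as a vector with a 1 in position j, 0 elsewhere."""
    return [1 if i == level else 0 for i in range(n_levels)]


def sum_to_zero_effects(group_means):
    """Baseline = mean of the category means; deflection = category mean - baseline."""
    baseline = sum(group_means) / len(group_means)
    deflections = [m - baseline for m in group_means]
    return baseline, deflections


def predict(baseline, deflections, x):
    """Equation 14.5: y = beta0 + dot(beta, x)."""
    return baseline + sum(b * xi for b, xi in zip(deflections, x))


group_means = [172.0, 160.0]   # hypothetical mean heights for two categories
baseline, deflections = sum_to_zero_effects(group_means)

print(baseline, deflections)                       # baseline and its two deflections
print(sum(deflections))                            # constraint of Equation 14.6
print(predict(baseline, deflections, one_hot(0, 2)))
print(predict(baseline, deflections, one_hot(1, 2)))
```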

> ### 14.1.6.2 Additive combination of nominal predictors

> Suppose we have two (or more) nominal predictors of a metric value. 

> For example, we might be interested in predicting income as a function of political party affiliation and gender. Figure 14.4 showed examples of each of those predictors individually. What we do
now is consider the joint influence of those predictors. If the two influences are merely
additive, then the model from Equation 14.5 becomes
<figure id="fig.redline0" style="float: none"><img src="13.png"><figcaption> 
</figcaption></figure>

<figure id="fig.redline0" style="float: none"><img src="6.png"><figcaption> 
</figcaption></figure>

> The left panel of Figure 14.5 shows an example of two nominal predictors that have additive effects on the predicted variable. In this case, the overall baseline is y = 6. When $x_1$ = < 1, 0 >, there is a deflection in y of −1, and when $x_1$ = < 0, 1 >, there is a deflection in y of +1. This deflection by $x_1$ is the same at every level of $x_2$. The deflections for the three levels of $x_2$ are +3, −2, and −1. These deflections are the same at all levels of $x_1$. Formally, the left panel of Figure 14.5 is expressed mathematically by the additive combination: 

> $y = 6 + <-1, 1>\vec{x_1} + <3, -2, -1>\vec{x_2}$
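
The additive combination above can be evaluated for all six cells with a short script. This is a minimal sketch using the baseline and deflections quoted in the text; the helper names are made up for illustration.

```python
def one_hot(level, n_levels):
    """Vector with a 1 at position `level` (0-based) and 0 elsewhere."""
    return [1 if i == level else 0 for i in range(n_levels)]


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


baseline = 6
beta1 = [-1, 1]        # deflections for the two levels of x1
beta2 = [3, -2, -1]    # deflections for the three levels of x2


def predict_additive(level1, level2):
    """y = baseline + beta1 . x1 + beta2 . x2 for one-hot x1 and x2."""
    return baseline + dot(beta1, one_hot(level1, 2)) + dot(beta2, one_hot(level2, 3))


# Rows are levels of x1, columns are levels of x2.
cells = [[predict_additive(i, j) for j in range(3)] for i in range(2)]
for row in cells:
    print(row)

# The shift from the first to the second level of x1 is the same in every column.
print([cells[1][j] - cells[0][j] for j in range(3)])
```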

> ### 14.1.6.3 Nonadditive interaction of nominal predictors

> When the predictor variables are non-metric, it does not even make sense to talk about a multiplicative interaction, because there are no numerical values to multiply. For example, consider predicting annual income from political party affiliation and gender. Both predictors are nominal, so it makes no sense to “multiply” them. But it does make sense to consider a non-additive combination of their influences.

> For example, the overall influence of gender is that men, on average, have a higher income than women. The overall influence of political party affiliation is that Republicans, on average, have higher income than Democrats. But it may be that the influences combine non-additively: Perhaps people who are both Republican and male have a higher average income than would be predicted by merely adding the average income boosts for being Republican and for being male. (This interaction is not claimed to be true; it is being used only as a hypothetical example.)

> We need new notation to formalize the non-additive influence of a combination of nominal values. Just as $\vec{x}_1$ refers to the value of predictor 1, and $\vec{x}_2$ refers to the value of predictor 2, the notation $\vec{x}_{1 \times 2}$ will refer to a particular combination of values of predictors 1 and 2. If there are $J_1$ levels of predictor 1 and $J_2$ levels of predictor 2, then there are $J_1 \times J_2$ combinations of the two predictors.

> A non-additive interaction of predictors is formally represented by including a term for the influence of combinations of predictors, beyond the additive influences, as follows: $y = β_0 + \vec{β}_1 \vec{x}_1 + \vec{β}_2 \vec{x}_2 + \vec{β}_{1 \times 2} \vec{x}_{1 \times 2}$. Whenever the interaction coefficient $\vec{β}_{1 \times 2}$ is non-zero, the predicted value of y is not a mere addition of the separate influences of the predictors.

> The right panel of Figure 14.5 shows a graphical example of two nominal predictors that have interactive (i.e., non-additive) effects on the predicted variable. Notice, in the left pair of bars ($x_2$ = < 1, 0, 0 >), that a change from $x_1$ = < 1, 0 > to $x_1$ = < 0, 1 > produces an increase of +2 in y, from y = 8 to y = 10. But for the middle pair of bars ($x_2$ = < 0, 1, 0 >), a change from $x_1$ = < 1, 0 > to $x_1$ = < 0, 1 > produces a change of −2 in y, from y = 5 to y = 3. Thus, the influence of $x_1$ is not the same at all levels of $x_2$.

> An interesting aspect of the pattern in the right panel of Figure 14.5 is that the average influences of $x_1$ and $x_2$ are the same as in the left panel. Overall, on average, going from $x_1$ = < 1, 0 > to $x_1$ = < 0, 1 > produces a change of +2 in y, in both the left and right panels. And overall, on average, for both panels it is the case that $x_2$ = < 1, 0, 0 > is 3 above baseline, $x_2$ = < 0, 1, 0 > is 2 below baseline, and $x_2$ = < 0, 0, 1 > is 1 below baseline. The only difference between the two panels is that the combined influence of the two predictors equals the sum of the individual influences in the left panel, but the combined influence of the two predictors does not equal the sum of the individual influences in the right panel.

> An interaction between nominal predictors consists of a distinct deflection, for each specific combination of categorical values, away from the additive combination. The magnitude of the interactive deflection is whatever is left over after the additive effects have been applied to the baseline. The model that includes an interaction term can be written as


<figure id="fig.redline0" style="float: none"><img src="14.png"><figcaption> 
</figcaption></figure>
<figure id="fig.redline0" style="float: none"><img src="15.png"><figcaption> 
</figcaption></figure>


> In these equations, the term $\vec{x}_{1 \times 2}$ has $J_1$ times $J_2$ components, all of which are zero except for a 1 at the particular combination of levels of $x_1$ and $x_2$. This mysterious and arcane notation will be revealed in all its majestic grandeur in Chapter 19. For now, the main point is to understand that the term “interaction” refers to a non-additive influence of the predictors on the predicted, regardless of whether the predictors are measured on a nominal scale or a metric scale.
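
The idea that the interaction is "whatever is left over" can be made concrete by decomposing a table of cell values. The sketch below is illustrative: the first two columns use the values quoted for the right panel of Figure 14.5 (8, 10 and 5, 3), while the third column (2 and 8) is not stated in the text and is inferred here from the claim that the marginal averages match the left panel. The function name `decompose` and the use of unweighted cell averages are assumptions.

```python
def decompose(cells):
    """Split a J1 x J2 table into baseline, two sets of additive deflections,
    and the interaction deflections left over after the additive part."""
    n1, n2 = len(cells), len(cells[0])
    baseline = sum(sum(row) for row in cells) / (n1 * n2)
    beta1 = [sum(row) / n2 - baseline for row in cells]
    beta2 = [sum(cells[i][j] for i in range(n1)) / n1 - baseline for j in range(n2)]
    beta12 = [
        [cells[i][j] - (baseline + beta1[i] + beta2[j]) for j in range(n2)]
        for i in range(n1)
    ]
    return baseline, beta1, beta2, beta12


# Rows are levels of x1, columns are levels of x2.
# Third column is inferred, not read from the figure (see the note above).
cells = [[8, 5, 2],
         [10, 3, 8]]

baseline, beta1, beta2, beta12 = decompose(cells)
print(baseline)   # overall baseline
print(beta1)      # additive deflections for x1
print(beta2)      # additive deflections for x2
print(beta12)     # interaction deflections, one per cell
```

Under these assumptions the additive deflections come out the same as in the left panel, and the interaction deflections sum to zero across every row and every column.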

### 14.1.7 Linking combined predictors to the predicted

 Once the predictor variables are combined, they need to be mapped to the predicted variable. This mathematical mapping is called the (inverse) link function, denoted by f() in the
following equation:

<figure id="fig.redline0" style="float: none"><img src="16.png"><figcaption> 
</figcaption></figure>


Until now, we have been assuming that the link function is merely the identity function,
$f(x) = x$. For example, in Equation 14.9, y equals the linear combination of the predictors;
there is no transformation of the linear combination before mapping the result to y.


Before describing different link functions, it is important to make some clarifications
of terminology and corresponding concepts. First, the function $f()$ in Equation 14.11 is
usually called the inverse link function, because the link function itself is thought of as
transforming the value y into a form that can be linked to the linear model. I will abuse
convention and simply refer to either $f()$ or $f^{−1}()$ as <B>the link function</B>, and rely on context
to disambiguate which direction of linkage is intended. The reason for this terminological
sadism is that the arrows in hierarchical diagrams of Bayesian models will flow from the
linear model toward the data, and therefore it is natural for the functions to map toward the
data, as in Equation 14.11. But repeatedly referring to this function as the “inverse” link
would strain my patience and violate my aesthetic sensibilities. Second, the value y that
results from the link function $f(x)$ is not a data value per se. Instead, $f(x)$ is the value of
a parameter that expresses some characteristic of the data, usually their mean. Therefore
the function f() in Equation 14.11 is sometimes called the mean function, and is written
$\mu = f()$ instead of $y = f()$. I will not use this terminology because most students already
think that “mean” means something else, namely the sum divided by N. The fact that y in
Equation 14.11 is a parameter value and not a data value will become clear in subsequent
sections as we encounter specific cases and examples.


There are situations in which a non-identity link function is appropriate. Consider,
for example, predicting response time as a function of amount of caffeine consumed. Response time declines as caffeine dosage increases, and therefore a linear prediction of RT from dosage would have a negative slope. This negative slope implies that for a very large
dosage of caffeine, response time would become negative, which is impossible unless caffeine causes precognition (i.e., foreseeing events before they occur). Therefore a direct linear function cannot be used for extrapolation to large doses, and we might instead want
to use an exponential link function such as $y = \exp(β_0 + β_1 x)$.
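
The contrast between the two links can be seen by extrapolating both to a large dose. This is a minimal sketch with made-up coefficients (a 500 ms baseline response time is an assumption, not a value from the text).

```python
import math


def linear_rt(dose, b0=500.0, b1=-50.0):
    """Identity link: y = b0 + b1*dose. Goes negative for large doses."""
    return b0 + b1 * dose


def exp_rt(dose, b0=math.log(500.0), b1=-0.1):
    """Exponential link: y = exp(b0 + b1*dose). Always positive."""
    return math.exp(b0 + b1 * dose)


for dose in (0, 5, 10, 20):
    print(dose, linear_rt(dose), round(exp_rt(dose), 1))
```

With a negative slope, the exponential form decays toward zero but never crosses it, so extrapolated response times stay physically meaningful.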

<figure id="fig.redline0" style="float: none"><img src="7.png"><figcaption> 
</figcaption></figure>

> ### 14.1.7.1 The sigmoid (a.k.a. logistic) function

> A frequently used link function is the sigmoid, also known as the logistic:


\begin{eqnarray*}
y = \mathrm{sig}(x) = 1 / (1 + \exp(-x))~~~~~~~~~~~~~~~~~~~~~~(14.12)              
\end{eqnarray*}


>Notice the negative sign in front of the x. The sigmoid function ranges between 0 and 1.
The sigmoid is nearly 0 when x is large negative, and is nearly 1 when x is large positive.
For linear combinations of predictors, the sigmoid link function is most conveniently
parameterized in x threshold form. For a single predictor variable, the sigmoid link function
applied to the linear function of the predictor yields

\begin{eqnarray*}
y = \mathrm{sig}(x; \gamma, \theta) = 1 / (1 + \exp(-\gamma (x - \theta)))~~~~~~~~~~~~~~~~~~~~~~(14.13)              
\end{eqnarray*}


>where $\gamma$, called the gain, corresponds to $β_1$ in Equation 14.2, and where $\theta$, called the threshold, corresponds to $−β_0/β_1$ in Equation 14.2.


>Examples of Equation 14.13, i.e., the sigmoid of a single predictor, are shown in Figure 14.6. Notice that the threshold is the point on the x axis for which $y = 0.5.$ The gain indicates how steeply the sigmoid rises through that point.
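
Equation 14.13 is short enough to try directly. The sketch below is illustrative; the function names are hypothetical, and the two-branch form of `sig` is only a precaution against overflow for large negative inputs.

```python
import math


def sig(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sig_threshold(x, gain, theta):
    """Equation 14.13: sigmoid with gain and threshold."""
    return sig(gain * (x - theta))


# At x = theta the output is 0.5 regardless of the gain.
print(sig_threshold(3.0, gain=1.0, theta=3.0))
print(sig_threshold(3.0, gain=5.0, theta=3.0))

# One unit above the threshold, a higher gain gives a value closer to 1.
print(sig_threshold(4.0, gain=1.0, theta=3.0))
print(sig_threshold(4.0, gain=5.0, theta=3.0))
```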


>Figure 14.7 shows examples of a sigmoid of two predictor variables. Above each panel
is the equation for the corresponding graph. The equations are parameterized in x threshold
form, as in Equation 14.4. In other words, $y = \mathrm{sig}(\gamma (\sum_k w_k x_k - \theta))$, with $(\sum_k w_k^2)^{1/2} = 1$.
Notice, in particular, that the coefficients of $x_1$ and $x_2$ in the plotted equations do indeed have
Euclidean length of 1.0. For example, in the upper-right panel, $(0.71^2 + 0.71^2)^{1/2} = 1.0$,
except for rounding error.


>The coefficients of the x variables determine the orientation of the sigmoidal “cliff”. For
example, compare the two top panels in Figure 14.7, which differ only in the coefficients, not in gain or threshold. In the top left panel, the coefficients are $w_1 = 0$ and $w_2 = 1$, and
the cliff rises in the $x_2$ direction. In the top right panel, the coefficients are $w_1 = 0.71$ and
$w_2 = 0.71$, and the cliff rises in the positive diagonal direction.


>The threshold determines the position of the sigmoidal cliff. In other words, the threshold determines the x values for which y = 0.5. For example, compare the two left panels
of Figure 14.7. The coefficients are the same, but the thresholds (and gains) are different.
In the upper left panel, the threshold is zero, and therefore the mid-level of the cliff is over
$x_2 = 0$. In the lower left panel, the threshold is −3, and therefore the mid-level of the cliff
is over $x_2 = −3$.


>The gain determines the steepness of the sigmoidal cliff. Again compare the two left
panels of Figure 14.7. The gain of the upper left is 1, whereas the gain of the lower left is 2.


>Terminology: The logit function. The inverse of the logistic function is called the
logit function. For 0 < p < 1, logit(p) = log (p/(1 − p)). It is easy to show (try it!)
that logit(sig(x)) = x, which is to say that the logit is indeed the inverse of the sigmoid.
Some authors, and programmers, prefer to express the connection between predictors and
predicted in the opposite direction, by first transforming the predicted variable to match the linear model. In other words, you may see the link expressed either of these ways:

\begin{eqnarray*}
y &=& \mathrm{logistic}(β_0 + β_1 x_1 + \ldots) \\
\mathrm{logit}(y) &=& β_0 + β_1 x_1 + \ldots
\end{eqnarray*}


The two expressions achieve the same result, mathematically. The difference between them
is merely a matter of emphasis. In the first expression, the combination of predictors is
transformed so it maps onto y expressed in its original scale. In the second expression, y is
transformed onto a new scale, and that transformed value is modeled as a combination of
predictors.
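
The inverse relationship follows in one step: if $p = 1/(1 + \exp(-x))$ then $p/(1-p) = \exp(x)$, so $\log(p/(1-p)) = x$. The short sketch below checks this numerically; the function names are made up for illustration.

```python
import math


def sig(x):
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p):
    """Inverse of the logistic: log(p / (1 - p)) for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is defined only for 0 < p < 1")
    return math.log(p / (1.0 - p))


for x in (-3.0, -0.5, 0.0, 2.0):
    # The round trip should return x up to floating-point error.
    print(x, logit(sig(x)))
```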

<figure id="fig.redline0" style="float: none"><img src="8.png"><figcaption> 
</figcaption></figure>

<figure id="fig.redline0" style="float: none"><img src="9.png"><figcaption> 
</figcaption></figure>

> ### 14.1.7.2 The cumulative normal (a.k.a. Phi) function

> Another frequently used link function is the cumulative normal distribution. It is qualitatively very similar to the sigmoid or logistic function. Modelers will use the logistic or the cumulative normal depending on mathematical convenience or ease of interpretation. For example, when we consider ordinal predicted variables (in Chapter 21), it will be natural to model the responses in terms of a continuous underlying variable that has normally distributed variability, which leads to using the cumulative normal as a model of response probabilities.

> The cumulative normal is denoted $\Phi(x, \mu, \tau)$, where x is a real number and where $\mu$ and $\tau$ are parameter values, called the mean and precision of the normal distribution. The parameter $\mu$ governs the point at which the cumulative normal, $\Phi(x)$, equals 0.5. In other words, $\mu$ plays the same role as the threshold in the logistic sigmoid. The parameter $\tau$ governs the steepness of the cumulative normal function at $x = \mu$. The $\tau$ parameter plays the same role as the gain parameter in the logistic sigmoid. A graph of a cumulative normal appears in Figure 14.8. For this example, $\mu = 0$, and notice that $\Phi(0) = 0.5$. This means that the area under the normal density to the left of 0 is 0.5.

> Terminology: The probit function. The inverse of the cumulative normal is called the probit function. (“Probit” stands for “probability unit”; Bliss, 1934). The probit function maps a value p, for 0.0 ≤ p ≤ 1.0, onto the infinite real line, and a graph of the probit function looks very much like the logit function. You may see the link expressed either of these ways:

\begin{eqnarray*}
y &=& \Phi(β_0 + β_1 x_1 + \ldots) \\
\mathrm{probit}(y) &=& β_0 + β_1 x_1 + \ldots
\end{eqnarray*}

> Traditionally, the transformation of y (in this case, the probit function) is called the link function, and the transformation of the linear combination of x (in this case, the $\Phi$ function) is called the inverse link function. As mentioned before, I abuse the traditional terminology and call either one a link function, relying on context to disambiguate.
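
The cumulative normal and its inverse are available in Python's standard library, so the claims above are easy to check. This sketch assumes the usual convention that precision is the reciprocal of the variance, so the standard deviation is $1/\sqrt{\tau}$; the wrapper names `phi` and `probit` are made up for illustration.

```python
import math
from statistics import NormalDist


def phi(x, mu=0.0, tau=1.0):
    """Cumulative normal with mean mu and precision tau (sd = 1/sqrt(tau))."""
    sd = 1.0 / math.sqrt(tau)
    return NormalDist(mu, sd).cdf(x)


def probit(p):
    """Inverse of the standard cumulative normal, defined for 0 < p < 1."""
    return NormalDist().inv_cdf(p)


print(phi(0.0))                              # value at x = mu
print(phi(0.5, tau=1.0), phi(0.5, tau=4.0))  # higher precision, steeper rise
print(probit(phi(1.3)))                      # round trip back to about 1.3
```

Note that `inv_cdf` rejects p = 0 and p = 1, which correspond to the infinite ends of the real line.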

### 14.1.8 Probabilistic prediction

In the real world, there is always variation in y that we cannot predict from x. This unpredictable “noise” in y might be deterministically caused by sundry factors we have neither measured nor controlled, or the noise might be caused by inherent non-determinism in y. It does not matter either way because in practice the best we can do is predict the probability that y will have any particular value, dependent upon x. Therefore we use the deterministic value predicted by Equation 14.11 as the predicted tendency of y as a function of the predictors. We do not predict that y is exactly $f(β_0 + β_1 x_1 + β_2 x_2 + β_{1 \times 2} x_{1 \times 2})$ because we would surely be wrong. Instead, we predict that y tends to be near $f(β_0 + β_1 x_1 + β_2 x_2 + β_{1 \times 2} x_{1 \times 2})$.

To make this notion of probabilistic tendency precise, we need to specify a probability distribution for y that depends on $f(β_0 + β_1 x_1 + β_2 x_2 + β_{1 \times 2} x_{1 \times 2})$. To keep the notation tractable, first define $\mu = f(β_0 + β_1 x_1 + β_2 x_2 + β_{1 \times 2} x_{1 \times 2})$. Do not confuse this use of $\mu$ with the unrelated $\mu$ mentioned in the cumulative normal function. With this notation, we then denote the probability distribution of y as some to-be-specified probability density function, abbreviated as “pdf”:

\begin{eqnarray*}
y \sim \mathrm{pdf}(\mu \, [, \tau, ...])
\end{eqnarray*}

The pdf might have various additional parameters, denoted by $\tau$, ..., to specify its shape. Examples are provided in the next section, where all these ideas are brought together.

### 14.1.9 Formal expression of the GLM

In general, the likelihood function specifies the probability of each possible predicted value
y as a function of the predictor values xj and various parameter values β, τ, etc. The generalized linear model can be written as the following two equations, numbered 14.14 and 14.15:

<figure id="fig.redline0" style="float: none"><img src="10.png"><figcaption> 
</figcaption></figure>

<figure id="fig.redline0" style="float: none"><img src="11.png"><figcaption> 
</figcaption></figure>

The function f in Equation 14.14 is called the “link” function, because it links the combination of predictors to the predicted tendency. The optional parameters [, τ, ...] in Equation 14.15 may be needed for the various types of probability density function (pdf) that
describe the probability distribution of y around the tendency µ.

Figure 14.9 shows a random sample of points normally distributed around a line or
plane. The upper panel illustrates a case of the generalized linear model of Equations 14.14
and 14.15 in which there is a single predictor x, with β0 = 10 and β1 = 2. The link function
is simply the identity function, f(β0 + β1 x) = β0 + β1 x. The probability density function is
normal with a standard deviation of 2.0. Profiles of this normal density are superimposed
on the graph to make it explicit. Notice that the normal density is always centered on the
line that marks the predicted tendency as a function of the predictor.
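
The single-predictor case just described can be simulated directly. The sketch below uses the values stated in the text (β0 = 10, β1 = 2, standard deviation 2.0, identity link); the range of x, the sample size, the seed, and the function name `simulate_upper_panel` are assumptions made for illustration and are not taken from the figure.

```python
import random
import statistics

def simulate_upper_panel(xs, beta0=10.0, beta1=2.0, sigma=2.0, seed=1):
    """Draw y ~ normal(mu, sigma) with mu = beta0 + beta1 * x (identity link)."""
    rng = random.Random(seed)        # seeded so the sketch is reproducible
    rows = []
    for x in xs:
        mu = beta0 + beta1 * x       # predicted tendency: a point on the line
        y = rng.gauss(mu, sigma)     # observed value scattered around the line
        rows.append((x, mu, y))
    return rows

# Hypothetical predictor values spread evenly over 0 to 10.
xs = [10.0 * i / 4999 for i in range(5000)]
rows = simulate_upper_panel(xs)

# Vertical deviations of the points from the line.
residuals = [y - mu for _, mu, y in rows]
resid_mean = statistics.mean(residuals)
resid_sd = statistics.pstdev(residuals)

print(resid_mean, resid_sd)  # should be close to 0 and to sigma, respectively
```

The residuals have roughly the same spread at every x, which corresponds to the normal density being centered on the line everywhere.
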
The lower panel of Figure 14.9 shows a case with two predictor variables. The predictors are combined linearly, with no interaction. The link function is the identity. The
probability function is normal with a standard deviation of 4. Each randomly generated
point is connected to the underlying linear core by a vertical dotted line, to explicitly indicate the random variation of the point from the plane. The plane marks the predicted
tendency as a function of the predictors, and the data values are normally distributed above
and below that tendency.
Figure 14.10 shows another case of the GLM. In this case, the points are Bernoulli
distributed around a sigmoid function of two predictors, as annotated at the top of the graph.
There is a linear combination of predictors, with a sigmoid link function, and a Bernoulli
probability function that defines the probability that y = 1. The graph shows that values of y
can only be 0 or 1, and the sigmoid function defines the probability that y is 1 for particular predictor values. The sigmoidal surface plots the tendency that y = 1 as a function of the
predictors.
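
A minimal sketch of the Bernoulli case follows. The text does not state the coefficients used in Figure 14.10, so the values of `beta0`, `beta1`, `beta2`, the predictor grid, and the seed below are all hypothetical.

```python
import math
import random

def sigmoid(z: float) -> float:
    """Logistic function: maps the real line onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate_bernoulli(x_pairs, beta0, beta1, beta2, seed=1):
    """Draw y ~ Bernoulli(mu) with mu = sigmoid(beta0 + beta1*x1 + beta2*x2)."""
    rng = random.Random(seed)
    rows = []
    for x1, x2 in x_pairs:
        mu = sigmoid(beta0 + beta1 * x1 + beta2 * x2)  # probability that y = 1
        y = 1 if rng.random() < mu else 0              # y can only be 0 or 1
        rows.append((x1, x2, mu, y))
    return rows

# Hypothetical grid of predictor values from -3 to 3 in steps of 0.5.
x_pairs = [(a * 0.5, b * 0.5) for a in range(-6, 7) for b in range(-6, 7)]
rows = simulate_bernoulli(x_pairs, beta0=0.0, beta1=1.0, beta2=2.0)

# Where the sigmoid surface is high, most simulated y values should be 1.
high = [y for _, _, mu, y in rows if mu > 0.9]
print(len(high), sum(high) / len(high))
```

Only the probability surface is smooth; each individual y is a 0 or a 1, so the tendency shows up as the proportion of 1s in a region of predictor space.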

### 14.2 Cases of the GLM

Table 14.1, p. 312, displays the various cases of the generalized linear model that are considered in this book. Subsequent chapters of the book progress through the table in reading
order: left to right within rows, then top to bottom across rows.
The first row of Table 14.1 lists cases for which the predicted variable is metric. Moving
from left to right within this row, the first column indicates a situation in which there is only
a single group, and the predicted value for the group is simply the mean of the group. In
this situation, there is no need to explicitly denote a predictor variable, and instead the mean
of the group can be denoted by a single parameter, β0. This situation corresponds to what
classical null hypothesis significance testing (NHST) calls a single-group t-test. This case
is described in its Bayesian setting in Chapter 15.
Moving to the next column, there is a single metric predictor. This corresponds to so-called “simple linear regression”, and is explored in Chapter 16. By inspecting the equation
for the GLM in the cell, you can see that the only difference from the previous cell is the
inclusion of the predictor x1 and its coefficient β1 .
Moving rightward to the next column, we come to the scenario involving two or more
metric predictors, which corresponds to “multiple regression”, and is explored in Chapter 17. By examining the equations for the GLM in the cell, you can see that the basic form
is the same, but merely with extra terms added for the additional predictors.
The next two columns involve nominal predictors, instead of metric predictors, with the
penultimate column devoted to a single predictor and the final column devoted to two or
more predictors. The last two columns correspond to what NHST calls “oneway ANOVA”
and “multifactor ANOVA”. If that terminology is unfamiliar to you, don’t worry, it will be
explained in Chapters 18 and 19.
In all the cases in the first row, the link function is the identity, and the probability distribution for the metric predicted values is assumed to be normal. When we move to the
second row, however, the predicted variable is dichotomous, and therefore the probability
distribution for y is a Bernoulli distribution. The link function, which connects the predictors to the probability that y = 1, is assumed to be the sigmoid, i.e., logistic function. When
the predictors are metric, this situation is generically referred to as “logistic regression” and
is discussed in Chapter 20. The case of nominal predictors is also discussed.
Finally, the bottom row of Table 14.1 lists cases for which the predicted variable is
ordinal. These cases are considered in Chapter 21. Notice that the link function is the
cumulative normal instead of the sigmoid, and the ordinal values are generated by a multiple-category generalization of the Bernoulli function, denoted by dcat. Again, this will be
explained at length in the forthcoming chapters. The point here is for you to see the overall
organization of topics, and to see how all these cases are variations of the same underlying
structure.
The table can be expanded with additional rows and columns, but then it gets too big
to display easily. Additional columns would include combinations of metric and nominal
predictors. But it turns out that it is easy in Bayesian models to combine metric and nominal predictors, once you know how to handle metric and nominal predictors individually.
Additional rows would involve different types of predicted variables. In particular, a fourth row would include count data for the predicted values. We will, in fact, cover one such
case, as described in the next section. When the predicted data are count values, a natural
link function is the exponential, and a natural pdf is the Poisson distribution, which will
be defined later in the book (Section 22.1.3). In summary, the rows of the table refer to
differently scaled predicted values, with their corresponding link functions and pdf’s:

<figure id="fig.redline0" style="float: none"><img src="12.png"><figcaption> 
</figcaption></figure>
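
The pairings named in the text can be collected into a small lookup table. This Python sketch restates only what the prose says (it is not a transcription of the figure), and the dictionary name `GLM_CASES` and the string labels are my own.

```python
import math
from statistics import NormalDist

def identity(z: float) -> float:
    return z

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def cumulative_normal(z: float) -> float:
    return NormalDist().cdf(z)

# Scale of the predicted variable -> (function applied to the linear combination, pdf name)
GLM_CASES = {
    "metric":      (identity,          "normal"),
    "dichotomous": (sigmoid,           "Bernoulli"),
    "ordinal":     (cumulative_normal, "categorical (dcat)"),
    "count":       (math.exp,          "Poisson"),
}

lin = 0.0  # a hypothetical value of the linear combination of predictors
for scale, (link, pdf_name) in GLM_CASES.items():
    print(f"{scale:12s} mu = {link(lin):.3f}   y ~ {pdf_name}")
```

Each row changes only the link and the pdf; the linear combination of predictors that feeds the link stays the same.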

### 14.2.1 Two or more nominal variables predicting frequency

Finally, we will also consider the situation in which there are two or more nominal variables
used as predictors of a frequency count. A frequency count, i.e., how many times something
happened, is a special case of a metric scale, but because its values fall at discrete levels,
namely non-negative integers, this situation will have a different sort of likelihood distribution. This type of situation, with nominal predictors and frequency-count predicted values,
is often called “contingency table analysis” and a typical NHST analysis conducts a “chi-square test of independence of attributes”. We explore Bayesian analysis of this situation in
Chapter 22.
Here is a brief summary of how contingency tables are analyzed using a model much
like those in Table 14.1. In fact, a fourth row could be added to Table 14.1, with the predicted type labeled frequency count, and the model falling in the final column, under two
nominal predictors. As a concrete example, suppose we measure political affiliation and
religious affiliation of a set of people, and for a sample of people we count how many
occurrences there are of each combination. We are interested in analyzing possible relationships between political and religious affiliations. Suppose we conduct a poll for one
week. We happen to record 27 people who are Democrats and Unitarians. This observed
frequency reflects an underlying rate at which that combination is generated by this sort
of poll, i.e., the underlying rate for Unitarian Democrats is roughly 27 people per week.
The observed rate (i.e., frequency per unit time) for each combination of nominal values
is thought to reflect the true underlying rate at which that combination is generated by the
world. We conceive of the observed rate as being a random sample from a true underlying
rate denoted by λ. The probability of any particular observed rate, given an underlying
rate of λ, is modeled by a Poisson distribution, which is denoted as freq ∼ dpois(λ). The
Poisson distribution was smuggled into the text back in Exercise 11.3, p. 235, which I’m
sure is still as fresh in your memory as a beached fish. Don’t worry, the Poisson will be
explained again later (Section 22.1.3). The Poisson distribution specifies a probability for
each possible observed rate. The Poisson puts highest probabilities on rates near λ.
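
The claim that the Poisson puts its highest probabilities near λ can be checked numerically. The sketch below uses the standard Poisson probability mass function with λ = 27, matching the Unitarian Democrat example; the function name `poisson_pmf` and the cutoff of 100 are choices made for this illustration.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(freq = k | lam) = exp(-lam) * lam**k / k!, computed on the log scale."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

lam = 27.0  # underlying rate from the example in the text

# Probabilities of observed counts 0..100 (the mass beyond 100 is negligible here).
probs = {k: poisson_pmf(k, lam) for k in range(0, 101)}

most_probable = max(probs, key=probs.get)
total = sum(probs.values())

# When lam is a whole number, counts lam - 1 and lam are equally probable,
# so the reported maximum may be either 26 or 27.
print(most_probable, total)
```
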
Our goal is to estimate the underlying rates at which each nominal combination is produced. But more than that, we would like to know if the attributes occur independently of
each other, or instead covary in some way. For example, if political and religious affiliation
are independent, then there should be the same proportion of Unitarians among Democrats
as among Republicans. Mathematically, independence means p(Unitarian&Democrat) =
p(Unitarian) × p(Democrat) and p(Unitarian&Republican) = p(Unitarian) × p(Republican)
and so on for every combination of attribute values. To shorten the expressions, I’ll substitute U for Unitarian and D for Democrat, whereby independence means

$$p(U \,\&\, D) = p(U) \times p(D)$$

and so on for every combination of attributes. That expression for probabilities corresponds
to the following expression in terms of frequencies:

$$\frac{\mathrm{freq}(U \,\&\, D)}{N} = \frac{\mathrm{freq}(U)}{N} \times \frac{\mathrm{freq}(D)}{N}$$

which can be re-arranged as

$$\mathrm{freq}(U \,\&\, D) = \mathrm{freq}(U) \times \mathrm{freq}(D) \times \frac{1}{N}.$$

Notice that independence is expressed as a multiplicative product of attribute influences.
But all the models we’ve considered in this chapter used an additive sum of predictor influences. To be able to use our familiar additive models, we’ll transform the frequencies by a
logarithm, because the logarithm of the product of values equals the sum of the logarithms
of the values. In other words,

$$\log(\mathrm{freq}(U \,\&\, D)) = \log(\mathrm{freq}(U)) + \log(\mathrm{freq}(D)) + \log(1/N).$$

The notation “log(freq(value))” gets cumbersome, so we’ll substitute the notation $\beta_v$. Thus,

$$\log(\mathrm{freq}(U \,\&\, D)) = \beta_U + \beta_D + \beta_0,$$

where $\beta_0$ stands in for the constant log(1/N). Finally, it’s unintuitive to talk about the
logarithms of frequencies, so we’ll exponentiate to get rid of the leading logarithm, yielding:

$$\mathrm{freq}(U \,\&\, D) = \exp(\beta_U + \beta_D + \beta_0).$$

To summarize, if independence is true, then the expression above should be true, for every
combination of attribute values.
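
The algebra above can be verified with a small numeric example. The marginal counts below are invented for illustration (they are not data from the text); the sketch only checks that the multiplicative form and the exponentiated additive form give the same cell frequencies under independence.

```python
import math

# Hypothetical marginal counts for N = 200 people.
N = 200
freq_religion = {"U": 40, "other": 160}   # U = Unitarian
freq_party = {"D": 90, "R": 110}          # D = Democrat, R = Republican

# beta_v stands for log(freq(v)); beta0 stands for log(1/N).
beta0 = math.log(1.0 / N)
beta_religion = {v: math.log(f) for v, f in freq_religion.items()}
beta_party = {v: math.log(f) for v, f in freq_party.items()}

def expected_freq(religion: str, party: str) -> float:
    """Cell frequency implied by independence, in the additive (log) form."""
    return math.exp(beta_religion[religion] + beta_party[party] + beta0)

for r in freq_religion:
    for p in freq_party:
        multiplicative = freq_religion[r] * freq_party[p] / N
        print(r, p, multiplicative, expected_freq(r, p))
```

With these counts, independence implies 40 × 90 / 200 = 18 Unitarian Democrats, and the additive form reproduces that figure up to rounding error.
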
But of course the attributes are usually not independent, and we would like some measure of lack of independence. We already have such a measure in the context of linear
models, namely, the interaction term. Thus, we will include an interaction term that estimates deviation from independence. Thus, our model ends up being as follows. For two
nominal attributes, we put the observed frequencies in a table, with one attribute’s values
listed down the rows, and the other attribute’s values listed across the columns. The frequency in the rth row and cth column is denoted freqrc, and the underlying rate for that cell
is denoted λ
rc. The model looks like this:
λ
rc = exp (β0 + βr + βc + βr ×c)
freqrc ∼ dpois(λrc) (14.16)
with the usual constraints (from Equation 14.10)
Xr
βr = 0 and X
c
βc = 0 and X
r
βr ×c,r,c = 0 ∀c and X
c
βr×c,r,c = 0 ∀r
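
To see how the interaction term measures departure from independence, the sketch below splits the log of a known table of rates into β0, row, column, and interaction deflections that obey the sum-to-zero constraints. This is a plain algebraic decomposition of made-up rates, not the Bayesian estimation carried out in Chapter 22, and the function names and both tables are hypothetical.

```python
import math

def decompose(rates):
    """Split log-rates into b0, row, column, and interaction sum-to-zero deflections."""
    n_r, n_c = len(rates), len(rates[0])
    logs = [[math.log(v) for v in row] for row in rates]
    b0 = sum(sum(row) for row in logs) / (n_r * n_c)              # grand mean
    b_r = [sum(row) / n_c - b0 for row in logs]                   # row deflections
    b_c = [sum(logs[r][c] for r in range(n_r)) / n_r - b0
           for c in range(n_c)]                                   # column deflections
    b_rc = [[logs[r][c] - b0 - b_r[r] - b_c[c] for c in range(n_c)]
            for r in range(n_r)]                                  # what is left over
    return b0, b_r, b_c, b_rc

def rate(b0, b_r, b_c, b_rc, r, c):
    """lambda_rc = exp(b0 + b_r + b_c + b_rc), as in the model above."""
    return math.exp(b0 + b_r[r] + b_c[c] + b_rc[r][c])

# Two hypothetical 2x2 tables with identical row and column totals.
independent = [[18.0, 22.0], [72.0, 88.0]]   # rows and columns do not covary
dependent = [[27.0, 13.0], [63.0, 97.0]]     # excess in the top-left cell

for name, table in (("independent", independent), ("dependent", dependent)):
    b0, b_r, b_c, b_rc = decompose(table)
    print(name, [[round(v, 3) for v in row] for row in b_rc])
```

The interaction deflections come out as zero (up to rounding) for the first table and clearly nonzero for the second, even though the marginal totals match.
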
The point of this over-fast prelude to contingency table analysis is merely to demonstrate that the core of the model we’ll be using is the same as the linear model that was
mentioned for multifactor ANOVA in Table 14.1. Thus, all the applied analyses we’ll see
in the remainder of the book are based on the GLM.