-
Notifications
You must be signed in to change notification settings - Fork 7
/
loglinear.Rmd
254 lines (202 loc) · 9.64 KB
/
loglinear.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
---
title: "Loglinear Models"
author: "Michael Friendly"
date: "`r Sys.Date()`"
package: vcdExtra
output:
rmarkdown::html_vignette:
fig_caption: yes
bibliography: ["vcd.bib", "vcdExtra.bib"]
csl: apa.csl
vignette: >
%\VignetteIndexEntry{Loglinear Models}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = FALSE,
fig.height = 6,
fig.width = 7,
fig.path = "fig/tut03-",
dev = "png",
comment = "##"
)
# save some typing
knitr::set_alias(w = "fig.width",
h = "fig.height",
cap = "fig.cap")
# preload datasets ???
set.seed(1071)
library(vcd)
library(vcdExtra)
library(ggplot2)
data(HairEyeColor)
data(PreSex)
data(Arthritis, package="vcd")
art <- xtabs(~Treatment + Improved, data = Arthritis)
if(!file.exists("fig")) dir.create("fig")
```
You can use the `loglm()` function in the `MASS` package to fit log-linear
models. Equivalent models can also be fit (from a different perspective) as generalized
linear models with the `glm()` function using the `family='poisson'` argument,
and the `gnm` package provides a wider range of generalized *nonlinear* models,
particularly for testing structured associations.
The visualization methods for these models were originally developed for models fit using `loglm()`,
so this approach is emphasized here. Some extensions of these methods for models
fit using `glm()` and `gnm()` are contained in the `vcdExtra` package
and illustrated in @ref(sec:glm).
Assume we have a 3-way contingency table based on
variables A, B, and C.
The possible different forms of loglinear models for a 3-way table are shown in the table below.
\@(tab:loglin-3way)
The **Model formula** column shows how to express each model for `loglm()` in R.
^[For `glm()`, or `gnm()`, with the data in the form of a frequency data.frame, the same model is specified
in the form `glm(Freq` $\sim$ `..., family="poisson")`, where `Freq` is the name
of the cell frequency variable and `...` specifies the *Model formula*.]
In the **Interpretation** column, the symbol "$\perp$" is to be read as "is independent of,"
and "$\;|\;$" means "conditional on," or "adjusting for," or just "given".
| **Model** | **Model formula** | **Symbol** | **Interpretation** |
|:-------------------------|:-------------------|:---------------|:-----------------------|
| Mutual independence | `~A + B + C` | $[A][B][C]$ | $A \perp B \perp C$ |
| Joint independence | `~A*B + C` | $[AB][C]$ | $(A \: B) \perp C$ |
| Conditional independence | `~(A+B)*C` | $[AC][BC]$ | $(A \perp B) \;|\; C$ |
| All two-way associations | `~A*B + A*C + B*C` | $[AB][AC][BC]$ | homogeneous association|
| Saturated model | `~A*B*C` | $[ABC]$ | 3-way association |
For example, the formula `~A + B + C` specifies the model of *mutual independence* with
no associations among the three factors. In standard notation for the expected frequencies
$m_{ijk}$, this corresponds to
$$ \log ( m_{ijk} ) = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^C \equiv A + B + C $$
The parameters $\lambda_i^A , \lambda_j^B$ and $\lambda_k^C$ pertain to the differences among the
one-way marginal frequencies for the factors A, B and C.
Similarly, the model of *joint independence*, $(A \: B) \perp C$, allows an association between A and B, but specifies that
C is independent of both of these and their combinations,
$$ \log ( m_{ijk} ) = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} \equiv A * B + C $$
where the parameters $\lambda_{ij}^{AB}$ pertain to the overall association between A and B (collapsing over C).
In the literature or text books, you will often find these models expressed in shorthand symbolic notation,
using brackets, `[ ]` to enclose the *high-order terms* in the model.
Thus, the joint independence model can be denoted `[AB][C]`, as shown in the **Symbol** column in the table.
\@(tab:loglin-3way).
Models of *conditional independence* allow (and fit) two of the three possible
two-way associations. There are three such models, depending on which variable is conditioned upon.
For a given conditional independence model, e.g., `[AB][AC]`, the given variable is the one
common to all terms, so this example has the interpretation $(B \perp C) \;|\; A$.
## Fitting with `loglm()` {#sec:loglm}
For example, we can fit the model of mutual independence among hair color, eye color and sex
in `HairEyeColor` as
```{r, loglm-hec1}
library(MASS)
## Independence model of hair and eye color and sex.
hec.1 <- loglm(~Hair+Eye+Sex, data=HairEyeColor)
hec.1
```
Similarly, the models of conditional independence and joint independence are specified as
```{r, loglm-hec2}
## Conditional independence
hec.2 <- loglm(~(Hair + Eye) * Sex, data=HairEyeColor)
hec.2
```
```{r, loglm-hec3}
## Joint independence model.
hec.3 <- loglm(~Hair*Eye + Sex, data=HairEyeColor)
hec.3
```
Note that printing the model gives a brief summary of the goodness of fit.
A set of models can be compared using the `anova()` function.
```{r, loglm-anova}
anova(hec.1, hec.2, hec.3)
```
## Fitting with `glm()` and `gnm()` {#sec:glm}
The `glm()` approach, and extensions of this in the `gnm` package allows a
much wider class of models for frequency data to be fit than can be handled by
`loglm()`. Of particular importance are models for ordinal factors and for
square tables, where we can test more structured hypotheses about the patterns
of association than are provided in the tests of general association under
`loglm()`. These are similar in spirit to the
non-parametric CMH tests described in \@ref(sec:CMH).
***Example***:
The data `Mental` in the `vcdExtra` package gives a two-way table in frequency form
classifying young people by
their mental health status and parents' socioeconomic status (SES), where
both of these variables are ordered factors.
```{r, mental1}
data(Mental, package = "vcdExtra")
str(Mental)
xtabs(Freq ~ mental + ses, data=Mental) # display the frequency table
```
Simple ways of handling ordinal variables involve assigning scores to the table
categories, and the simplest cases are to use integer scores, either for the row variable (``column
effects'' model), the column variable (``row effects'' model), or both (``uniform association''
model).
```{r, mental2}
indep <- glm(Freq ~ mental + ses, family = poisson, data = Mental) # independence model
```
To fit more parsimonious models than general association, we can define
numeric scores for the row and column categories
```{r, mental3}
# Use integer scores for rows/cols
Cscore <- as.numeric(Mental$ses)
Rscore <- as.numeric(Mental$mental)
```
Then, the row effects model, the column effects model, and the uniform association
model can be fit as follows. The essential idea is to replace a factor variable
with its numeric equivalent in the model formula for the association term.
```{r, mental4}
# column effects model (ses)
coleff <- glm(Freq ~ mental + ses + Rscore:ses, family = poisson, data = Mental)
# row effects model (mental)
roweff <- glm(Freq ~ mental + ses + mental:Cscore, family = poisson, data = Mental)
# linear x linear association
linlin <- glm(Freq ~ mental + ses + Rscore:Cscore, family = poisson, data = Mental)
```
The `LRstats()` function in `vcdExtra` provides a nice, compact summary of
the fit statistics for a set of models, collected into a *glmlist* object.
Smaller is better for AIC and BIC.
```{r, mental4a}
# compare models using AIC, BIC, etc
vcdExtra::LRstats(glmlist(indep, roweff, coleff, linlin))
```
For specific model comparisons, we can also carry out tests of *nested* models with
`anova()` when those models are listed from smallest to largest.
Here, there are two separate paths from the most restrictive (independence) model
through the model of uniform association, to those that allow only one of
row effects or column effects.
```{r, mental5}
anova(indep, linlin, coleff, test="Chisq")
anova(indep, linlin, roweff, test="Chisq")
```
The model of linear by linear association seems best on all accounts.
For comparison, one might try the CMH tests on these data:
```{r, mental6}
CMHtest(xtabs(Freq~ses+mental, data=Mental))
```
## Non-linear terms
The strength of the `gnm` package is that it handles a wide variety of models
that handle non-linear terms, where the parameters enter the model beyond a simple
linear function.
The simplest example is the Goodman RC(1) model [@Goodman:79],
which allows a multiplicative
term to account for the association of the table variables.
In the notation of generalized linear models with a log link, this can be expressed as
$$ \log \mu_{ij} = \alpha_i + \beta_j + \gamma_{i} \delta_{j} ,$$
where the row-multiplicative effect parameters $\gamma_i$ and
corresponding column parameters $\delta_j$ are estimated from the data.%
^[This is similar in spirit to a correspondence analysis with a single dimension, but as a statistical model.]
Similarly, the RC(2) model adds two multiplicative terms to
the independence model,
$$ \log \mu_{ij} = \alpha_i + \beta_j + \gamma_{i1} \delta_{j1} + \gamma_{i2} \delta_{j2} . $$
In the `gnm` package, these models may be fit using the `Mult()`
to specify the multiplicative term, and `instances()` to specify several
such terms.
***Example***:
For the `Mental` data, we fit the RC(1) and RC(2) models, and compare
these with the independence model.
```{r, mental7}
RC1 <- gnm(Freq ~ mental + ses + Mult(mental,ses), data=Mental,
family=poisson, verbose=FALSE)
RC2 <- gnm(Freq ~ mental+ses + instances(Mult(mental,ses),2), data=Mental,
family=poisson, verbose=FALSE)
anova(indep, RC1, RC2, test="Chisq")
```
## References