/
visualisation_matrix.Rmd
347 lines (266 loc) · 11.8 KB
/
visualisation_matrix.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
---
title: "Data grids"
output:
rmarkdown::html_vignette:
toc: true
fig_width: 10.08
fig_height: 6
tags: [r, estimate, data grid, reference grid, matrix grid]
vignette: >
%\VignetteIndexEntry{Data grids}
\usepackage[utf8]{inputenc}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
bibliography: bibliography.bib
---
```{r, include=FALSE}
library(knitr)
options(knitr.kable.NA = "")
knitr::opts_chunk$set(
comment = ">",
message = FALSE,
warning = FALSE,
out.width = "100%",
dpi = 300
)
options(digits = 2)
if (!requireNamespace("parameters", quietly = TRUE) ||
!requireNamespace("poorman", quietly = TRUE) ||
!requireNamespace("see", quietly = TRUE) ||
!requireNamespace("ggplot2", quietly = TRUE) ||
!requireNamespace("lme4", quietly = TRUE) ||
!requireNamespace("gamm4", quietly = TRUE)) {
knitr::opts_chunk$set(eval = FALSE)
} else {
library(gamm4)
}
set.seed(333)
```
Sometimes, for instance for **visualization** purposes, we want to extract a
**reference grid** (or *data grid*) of our dataset, that we will call a
**visualisation matrix**. This reference grid usually contains the same
variables than the original dataset, but reorganized in a particular,
**balanced**, way. For instance, it might contain all the combinations of
factors, or equally spread points of a continuous variable. These reference
grids are often used as data for predictions of statistical models, to help us
represent and understand them.
> NOTE: the `visualisation_matrix()` function showcased in this vignette is now an alias (another name) for the [`get_datagrid()`](https://easystats.github.io/insight/reference/get_datagrid.html) function in the *insight* package.
# Simple linear regression
For instance, let's fit a simple linear model that models the relationship
between `Sepal.Width` and `Sepal.Length`.
```{r}
library(parameters)
model <- lm(Sepal.Width ~ Sepal.Length, data = iris)
model_parameters(model)
```
The most obvious way of representing this model is to plot the data points and
add the regression line using the `geom_smooth` function from `ggplot2`:
```{r}
library(ggplot2)
library(see)
library(poorman)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
geom_smooth(method = "lm") +
theme_modern()
```
But how to "access" the data of this regression line? One good option is to select
some values of of the predictor (`Sepal.Length`), and **predict** (using the
base R `predict()` method for now) the response (`Sepal.Width`) using the model.
Using these *x* and *y* points, we can then create the regression line.
Let's try the `visualisation_matrix` function from the `modelbased` package (note that this function is the same as [`insight::get_datagrid()`](https://easystats.github.io/insight/reference/get_datagrid.html), just with a different name).
```{r}
library(modelbased)
visualisation_matrix(iris["Sepal.Length"])
```
If we pass a numeric column to the function, it will return a vector of
**equally spread points** (having the same range, i.e., the same minimum and
maximum, than the original data). The default **length** is 10, but we can
adjust that through the `length` argument. For instance, for linear relationships
(i.e., a straight line), two points are in theory sufficient. Let's generate
predictions using this reference grid of the predictor.
```{r}
vizdata <- visualisation_matrix(iris["Sepal.Length"], length = 2)
vizdata$Predicted <- predict(model, vizdata)
vizdata
```
Now that we have our *x* and *y* values, we can plot the line as an overlay to
the actual data points:
```{r}
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
geom_line(data = vizdata, aes(y = Predicted), size = 1, color = "red") +
theme_modern()
```
As we can see, it is quite similar to the previous plot. So, when can this be
useful?
# Mixed models
Data grids are useful to represent more complex models. For instance, in the
models above, the **negative** relationship between the length and width of the
sepals is in fact biased by the presence of three different species. One way of
adjusting the model for this grouping structure is to add it as a **random effect**
in a **mixed model**. In the model below, the "fixed" effects (the parameters of
interest) will be adjusted ("averaged over") to the random effects.
```{r}
library(lme4)
model <- lmer(Sepal.Width ~ Sepal.Length + (1 | Species), data = iris)
model_parameters(model)
```
As we can see, when adjusting for the species, the relationship between the two
variables **has become positive**!
We can represent it using the same procedure as above, but note that instead of
using the base R `predict()` function, we will be using [*get_predicted()*](https://easystats.github.io/insight/reference/get_predicted.html),
from the **insight** package, which is a more robust and user-friendly version
of `predict()`.
```{r}
vizdata <- visualisation_matrix(iris["Sepal.Length"])
vizdata$Predicted <- insight::get_predicted(model, vizdata)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species)) +
geom_line(data = vizdata, aes(y = Predicted), size = 1) +
theme_modern()
```
# Fixed variables
The above way of constructing the reference grid, i.e., by providing a single
column of data to the function, is almost equivalent to the following:
```{r}
vizdata <- visualisation_matrix(iris, at = "Sepal.Length")
vizdata
```
However, the other variables (present in the dataframe but not selected as
`at`) are "fixed", *i.e.*, they are maintained at specific values. This is
useful when we have other variables in the model in whose effect we are not
interested.
By default, **factors** are fixed by their **"reference" level** and **numeric variables** are fixed at their **mean**. However, this can be easily changed:
```{r}
vizdata <- visualisation_matrix(iris, at = "Sepal.Length", numerics = "min")
vizdata
```
# Target variables
If more than one target variable is selected, `visualisation_matrix` will return
the **combination** of them (*i.e.*, all unique values crossed together). This
can be useful in the case of an **interaction** between a numeric variable and a
factor.
Let's visualise the regression line for each of the levels of `Species`:
```{r}
model <- lm(Sepal.Width ~ Sepal.Length * Species, data = iris)
vizdata <- visualisation_matrix(iris, at = c("Sepal.Length", "Species"), length = 5)
vizdata$Predicted <- insight::get_predicted(model, vizdata)
vizdata
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
geom_line(data = vizdata, aes(y = Predicted), size = 1) +
theme_modern()
```
# Preserve range
However, it is generally not a good practice to extend the regression lines
beyond the range of its original data, as it is the case here for the **red
line**. The `preserve_range` option allows to remove observations that are
"outside" the original dataset (however, the length should be increased to
improve the precision toward the edges):
```{r}
vizdata <- visualisation_matrix(iris,
at = c("Sepal.Length", "Species"),
length = 100,
preserve_range = TRUE
)
vizdata$Predicted_Sepal.Width <- insight::get_predicted(model, vizdata)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
geom_line(data = vizdata, aes(y = Predicted_Sepal.Width), size = 1) +
theme_modern()
```
# Visualising an interaction between two numeric variables (three-way interaction)
```{r}
model <- lm(Sepal.Length ~ Petal.Length * Petal.Width, data = iris)
model_parameters(model)
```
This idea can also be used to visualise interactions between two numeric
variables, **aka the nightmare of every scientist**. One possibility is to
basically represent the relationship between the response and one predictor **at
a few representative values of the second predictor**.
In this case, we will represent the regression line between `Sepal.Length` and
`Petal.Length` and a 5 equally spaced values of `Petal.Length`, to get a feel for
the interaction.
We can obtain the right reference grid quite easily by chaining two
`visualisation_matrix` together as follows:
```{r}
vizdata <- iris %>%
visualisation_matrix(c("Petal.Length", "Petal.Width"), length = 10) %>%
visualisation_matrix("Petal.Width", length = 5, numerics = "all")
```
What did we do here? We started by generating a reference grid containing all
the combinations between the 10 equally spread values of the two target
variables, creating `10 * 10 = 100` rows. The next step was to reduce
`Petal.Length` to a set of 5 values, but without touching the other variables
(*i.e.*, keeping the 10 values created for `Petal.Length`). This was achieved
using `numerics = "all"`.
We can then visualise it as follows:
```{r}
vizdata$Predicted <- insight::get_predicted(model, vizdata)
iris %>%
ggplot(aes(x = Petal.Length, y = Sepal.Length, color = Petal.Width)) +
geom_point() +
geom_line(data = vizdata, aes(y = Predicted, group = Petal.Width), size = 1) +
scale_color_viridis_c() +
theme_modern()
```
Such plot can be more clear by expressing the interaction variable in terms of
deviations from the mean (as a standardized variable).
```{r}
# Express values in an abstract way
vizdata$Petal.Width <- effectsize::format_standardize(vizdata$Petal.Width, reference = iris$Petal.Width)
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
# Only shapes from 21 to 25 have a fill aesthetic
geom_point2(aes(fill = Petal.Width), color = "white", shape = 21, size = 5) +
geom_line(data = vizdata, aes(y = Predicted, color = Petal.Width), size = 1) +
scale_color_viridis_d(direction = -1) +
scale_fill_viridis_c(guide = "none") +
theme_modern()
```
As the `Petal.Width` increases (becomes yellow), the coefficient between
`Petal.Length` and `Sepal.Length` increases (the slope is more steep). Although,
as we can guess, this in fact captures the underlying effect of species... **but
we'll leave discussing the meaningfulness of your models to you :)**
# `visualization_matrix()` also runs directly on model objects
To illustrate this, let's set up a **general additive mixed model (GAMM)**,
where we are going to specify a **smooth term** (a non-linear relationship;
specified by `s()` function) and some **random effects** structure.
```{r}
library(gamm4)
model <- gamm4::gamm4(
formula = Petal.Length ~ Petal.Width + s(Sepal.Length),
random = ~ (1 | Species),
data = iris
)
```
One can directly extract the visualization matrix for this model by entering the entire object into the function:
```{r}
visualisation_matrix(model, length = 3, include_random = FALSE)
```
We also skip the smooth term if we are interested only in the fixed effects:
```{r}
visualisation_matrix(model, length = 3, include_random = FALSE, include_smooth = FALSE)
```
We can also include random effects:
```{r}
visualisation_matrix(model, length = 5, include_random = TRUE)
```
# Controlled standardized change
Although the plot above is nice, and all, we would like the standardized changes in SD to be smoother (e.g., by increments of 1 SD). This can be achieved by first requesting the values that we want, and then **unstandardizing** it.
Let's use the same model as above, and then obtain a data grid with specific values for `Petal.Width`.
```{r message=FALSE, warning=FALSE}
vizdata <- lm(Sepal.Length ~ Petal.Length * Petal.Width, data = iris) %>%
visualisation_matrix(at = c("Petal.Length", "Petal.Width = seq(-3, 3)")) %>%
unstandardize(vizdata, select = "Petal.Width") %>%
estimate_relation(vizdata)
vizdata$Petal.Width <- effectsize::format_standardize(vizdata$Petal.Width, reference = iris$Petal.Width)
# 6. Plot
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point2(aes(fill = Petal.Width), shape = 21, size = 5) +
geom_line(data = vizdata, aes(y = Predicted, color = Petal.Width), size = 1) +
scale_color_viridis_d(direction = -1) +
scale_fill_viridis_c(guide = "none") +
theme_modern()
```