---
title: "Training models"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Training models}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup, message=FALSE}
library(LSTbook)
```
To "train a model" involves three components:

1. A data frame with training data.
2. A model specification naming the response variable and the explanatory variables. This is formatted in the same tilde-expression manner as for `lm()` and `glm()`.
3. A model-fitting function that puts (1) and (2) together into a **model object**. Examples of model-fitting functions are `lm()` and `glm()`. In *Lessons in Statistical Thinking* and the corresponding `{LSTbook}` package, we almost always use `model_train()`.

Once the model object has been constructed, you can plot the model, create summaries such as regression reports or ANOVA reports, and evaluate the model for new inputs.
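
The three components can be seen separately in a minimal base-R sketch. It uses `lm()` and the built-in `mtcars` data rather than `model_train()`, and the names `training_data`, `specification`, and `model_object` are chosen here purely for illustration.

```r
# (1) A data frame with training data (built-in example data)
training_data <- mtcars

# (2) A model specification as a tilde expression: response ~ explanatory
specification <- mpg ~ wt + hp

# (3) A model-fitting function combines (1) and (2) into a model object
model_object <- lm(specification, data = training_data)

# The model object can then be summarized or evaluated
coef(model_object)
```

The same division of labor applies whichever fitting function is used: the data and the specification stay the same, and only step (3) changes.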

## Using `model_train()`

`model_train()` is a wrapper around some commonly used model-fitting functions from the `{stats}` package, particularly `lm()` and `glm()`. It's worth explaining the motivation for introducing a new model-fitting function.

1. `model_train()` is pipeline-ready. Example: `Galton |> model_train(height ~ mother)`.
2. `model_train()` has internal logic to figure out automatically which type of model (e.g. linear, binomial, Poisson) to fit. (You can also specify this with the `family=` argument.) The automatic nature of `model_train()` means, e.g., you can use it with neophyte students for logistic regression without having to introduce a new function.
3. `model_train()` saves a copy of the training data as an attribute of the model object being produced. This is helpful in plotting the model, cross-validation, etc., particularly when the model specification involves nonlinear explanatory terms (e.g., `splines::ns(mother, 3)`).
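
To make the three points concrete, here is a rough sketch of a wrapper with the same general shape. `train_sketch()` is a hypothetical function written for this illustration; its rule for detecting a binary response (logical or all 0/1 values) and the attribute name `"training_data"` are assumptions, not the logic or naming that `model_train()` actually uses.

```r
# Hypothetical wrapper: data frame first (pipeline-ready), automatic choice
# between lm() and glm(), training data kept as an attribute.
train_sketch <- function(data, spec) {
  # Evaluate the left-hand side of the tilde expression within the data
  response <- eval(spec[[2]], envir = data, enclos = environment(spec))

  # Assumed rule: a logical or 0/1 response gets logistic regression
  is_binary <- is.logical(response) || all(response %in% c(0, 1))

  mod <- if (is_binary) {
    glm(spec, data = data, family = binomial())
  } else {
    lm(spec, data = data)
  }

  # Keep a copy of the training data with the model object
  attr(mod, "training_data") <- data
  mod
}

linear_fit   <- mtcars |> train_sketch(mpg ~ wt)  # numeric response
logistic_fit <- mtcars |> train_sketch(am ~ wt)   # 0/1 response

class(linear_fit)
class(logistic_fit)
```

Because the data frame is the first argument, the wrapper slots directly into a `|>` pipeline, and the stored copy of the data is available later without re-supplying it.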

## Using a model object

As examples, consider these two models:

- modeling `height` of a (fully grown) child with the `sex` of the child, and the `mother`'s and `father`'s height. Linear regression is an appropriate technique here.

```{r}
height_model <- mosaicData::Galton |> model_train(height ~ sex + mother + father)
```

- modeling the probability that a voter will vote in an election (`primary2006`) given the household size (`hhsize`), `yearofbirth`, and whether the voter voted in a previous primary election (`primary2004`). Since having voted is a yes-or-no proposition, *logistic* regression is an appropriate technique.

```{r}
vote_model <-
  Go_vote |>
  model_train(zero_one(primary2006, one = "voted") ~ yearofbirth * primary2004 * hhsize)
```

Note that `zero_one()` marks the response variable as a candidate for logistic regression.
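
The kind of recoding involved can be sketched in base R. `to_zero_one()` and the `votes` vector are hypothetical, made up for this illustration; this is not how `zero_one()` is implemented, only the sort of 0/1 encoding that makes a yes-or-no response usable for logistic regression.

```r
# Hypothetical stand-in for a 0/1 encoding; not LSTbook's zero_one() itself
to_zero_one <- function(x, one) as.integer(x == one)

votes <- c("voted", "abstained", "voted")
encoded <- to_zero_one(votes, one = "voted")
encoded
#> [1] 1 0 1
```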

The output of `model_train()` is in the format of whichever `{stats}` package function has been used, e.g. `lm()` or `glm()`. (The training data is stored as an "attribute," so it stays out of the way of functions that expect an ordinary `lm()` or `glm()` result.) Consequently, you can use the model object as an input to whatever model-plotting or summarizing function you like.

In *Lessons in Statistical Thinking* we use `{LSTbook}` functions for plotting and summarizing:

- `model_plot()`
- `R2()`
- `conf_interval()`
- Late in *Lessons*, `regression_summary()` and `anova_summary()`

Let's apply some of these to the modeling examples introduced above.
```{r}
height_model |> model_plot()
height_model |> conf_interval()
vote_model |> model_plot()
vote_model |> R2()
```
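
Since the model object is in `{stats}` format, the usual base-R summaries can be applied to it as well. The sketch below uses a plain `lm()` fit on the built-in `mtcars` data; `confint()` and `summary()$r.squared` are shown only as roughly analogous base-R calls, and their output layout is not claimed to match that of `conf_interval()` or `R2()`.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# 95% confidence intervals on the coefficients
coef_intervals <- confint(fit, level = 0.95)
coef_intervals

# Proportion of the response variance accounted for by the model
r_squared <- summary(fit)$r.squared
r_squared
```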
The `model_eval()` function from this package allows you to provide inputs and receive the model output, with a prediction interval by default. (For logistic regression, only a confidence interval is available.)
```{r}
vote_model |> model_eval(yearofbirth=c(1960, 1980), primary2004="voted", hhsize=4)
```
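
The distinction between the two kinds of interval can be illustrated with `predict()` from `{stats}`. This is a sketch under stated assumptions, not a description of how `model_eval()` works internally: the models are fit to the built-in `mtcars` data, and the logistic-regression interval is a normal-approximation interval computed on the link scale and then transformed to probabilities.

```r
# Linear model: predict() can return a prediction interval directly
linear_fit <- lm(mpg ~ wt + hp, data = mtcars)
new_inputs <- data.frame(wt = c(2.5, 3.5), hp = 110)
linear_out <- predict(linear_fit, newdata = new_inputs,
                      interval = "prediction", level = 0.95)
cbind(new_inputs, linear_out)

# Logistic model: no prediction interval, so build a confidence interval
# on the link (log-odds) scale and map it back to the probability scale
logistic_fit <- glm(am ~ wt, data = mtcars, family = binomial())
link <- predict(logistic_fit, newdata = data.frame(wt = c(2.5, 3.5)),
                type = "link", se.fit = TRUE)
z <- qnorm(0.975)
logistic_out <- data.frame(
  fit = plogis(link$fit),
  lwr = plogis(link$fit - z * link$se.fit),
  upr = plogis(link$fit + z * link$se.fit)
)
logistic_out
```

A prediction interval describes where an individual new observation is likely to fall, which is not meaningful for a 0/1 outcome; the confidence interval instead describes uncertainty in the estimated probability.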