-
Notifications
You must be signed in to change notification settings - Fork 0
/
model_logtransform_and_backtransform.Rmd
169 lines (110 loc) · 5.13 KB
/
model_logtransform_and_backtransform.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
---
title: "model_when to logtransform and how to backtransforming log transformed estimates"
author: "Anisha Nagpal"
date: "`r Sys.Date()`"
output: html_document
---
When to use this template: You may have log-transformed your outcome variable prior to analysis by regression in order to approximate your residuals to a Normal distribution, or you may have analysed count data using a Poisson generalized linear model (GLM) with a log link function. In both types of situations, the estimates you obtain from your analysis, and their standard errors, are on the transformed scale. However, when reporting your results, you may want to transform them back to the original scale to aid interpretation.
```{r}
library(lme4)
library(ggplot2)
library(tidyverse)
library(dfoptim)
library(optimx)
```
## Simulating some fake data
- In this fake dataset we have id, a daily outcome variable (outcome), cycleday (menses = 0), and a trait level depression score (dep). We will use this to write a MLM model where cycleday, and dep are predicting a daily outcome.
```{r}
# Create the id variable
id <- rep(seq(1, 50), each = 18)
# Create the cycleday variable
cycleday <- rep(-7:10, times = 50)
#create dep variable
dep <- rep(rnorm(50, 3, 1.25), each = 18)
outcome <- exp(rnorm(900, 1, 0.3))
# Combine id and cycleday into a dataframe
df <- data.frame(id, cycleday, dep, outcome)
```
```{r}
hist(df$outcome)
```
As you can see, this data is slightly right-skewed .
### Create MLM
```{r}
model <-lmer(outcome ~ dep + (1 |id) + (1|cycleday), data = df, REML = T, control = lmerControl(optimizer ="bobyqa", optCtrl=list(maxfun=2e5)))
summary(model)
```
There is singularity because I simulated the data and there is alevel of collinearity- just ignore.
## Test Assumptions of MLM
The assumptions underlying MLMs are:
- The model is correctly specified (i.e., all the predictors associated with the outcome and relevant random effects are included);
- The functional form is correct (e.g., the relationship between the predictors and outcome is linear if using a linear model);
- Level-1 residuals are independent and normally distributed;
- Level-2 residuals are independent and multivariate normally distributed;
- Residuals at level-1 and level-2 are unrelated;
- Predictors at one level are not related to errors at another level (homoscedasticity).
### Are the level 1 residuals normally distributed?
```{r}
df$resid <- residuals(model)
hist(df$resid)
```
Residuals are also right-skewed.
We can also examine normality of residuals in qqplot. If the line is roughly straight, our distribution is roughly normal. This plots quantiles for two distributions against each other: what proportion of points in our distribution is in a given quantile (y-axis), and how does that compare to the proportion of points in a given quantile for our theoretical normal distribution (x-axis). The line is NOT straight.
```{r}
df %>%
ggplot(mapping = aes(sample = resid)) +
stat_qq()
```
### Testing if level 1 residuals are independent
```{r}
df %>%
ggplot(mapping = aes(x = dep, y = resid)) +
geom_point() +
labs(x = "depression", y = "residuals")
cor.test(df$resid, df$dep)
```
There is no real pattern, so it seems the level 1 residuals are independent.
### Log-transforming the outcome
Lets log-transform the outcome and see if it improves the normality of results.You may need to +1 before log-tranforming, because you cannot take the log of zero.
```{r}
df$outcome.log <- (log(df$outcome))
hist(df$outcome.log)
```
```{r}
model2 <- lmer(outcome.log ~ dep + cycleday + (1 + cycleday|id), data = df, REML = T, control = lmerControl(optimizer ="bobyqa", optCtrl=list(maxfun=2e5)))
summary(model2)
```
Singularity still exists because I did a bad job simulating data, but let's ignore that. Let's check out the normality of our residuals.
```{r}
df$resid2 <- residuals(model2, na.rm = T)
hist(df$resid2)
```
```{r}
df %>%
ggplot(mapping = aes(sample = resid2)) +
stat_qq()
```
Our qqplot is more line-like, log-tranforming the data allowed our residuals to be normally distributed!
## Back-transforming
```{r}
summary(model2)
```
So what if we want to back-transform our estimates? We cannot simply exponentiate standard errors.
### Back-transforming standrard errors
$$exp(\mu)*se = \text{back-transformed standard error}$$
$$exp(\mu) = \text{back-transformed mean}$$
However, back-transformed confidence intervals are more stable than backtransformed standard errors.
### Calculating Back-transformed Confidence Intervals
- We can extract our mean depression estimate and its standard error from our model to calculate a precise confidence interval. Then we exponetiate the boundaries to calculate the back-transformed confidence interval.
```{r}
m <- summary(model2)$coef[2] #mean
sderr <- summary(model2)$coef[5] #standard error
lb <- m - qnorm(.975)*sderr #lower bound of confidence interval
ub <- m + qnorm(.975)*sderr #upper bound of confidence interval
```
Exponentiate the estimate, and the upper and lower bounds of the confidence interval
```{r}
exp(m) #depression estimate
exp(lb) #lower bound of confidence interval
exp(ub) #upper bound of confidence interval
```