-   [Non-independent data](#chap:c13)
    -   [Introduction](#introduction)
    -   [Spatial dependence](#sec:regspa)
        -   [Random blocks](#random-blocks)
        -   [Spatial regression](#spatial-regression)
    -   [Repeated measures of in-host mouse malaria](#sec:c13rep)
    -   [*B. bronchiseptica* in rabbits](#sec:c13bb)

Non-independent data
====================

Introduction
------------

Many infectious disease experiments result in non-independent data because of spatial autocorrelation across fields (such as discussed in chapter \[chap:c12\]), repeated measures on experimental animals (such as the in-host *Plasmodium* data discussed in section \[sec:c6mal\]), or other sources of correlated experimental responses among experimental units (such as the possibility of correlated infection fates among the rabbit litter-mates discussed in section \[sec:c3cat\]). Statistical methods that assume independence of observations are not strictly valid, and may not be fully effective, on such data. ‘Mixed-effects models’ and ‘generalized linear mixed-effects models’ (GLMMs) have been developed, and continue to be developed, to improve the analysis of such data.

While the full topic is outside the main scope of this text, it is very pertinent to analyses of disease data, so we will consider three case studies.

In [None]:
require(nlme)
require(ncf)
require(lme4)
require(splines)

Spatial dependence
------------------

We use the rust example introduced in section \[sec:koslow\] (fig. \[fig:koslow\]) to illustrate two approaches to accounting for spatial dependence in disease data: (i) random blocks and (ii) spatial regression. This experiment looked at the severity of a foliar rust infection on three focal individuals of flat-top goldenrod in each of 120 plots across a field divided into 4 blocks. The experimental treatments were (i) watering or not and (ii) whether surrounding non-focal host plants were conspecifics only, a mixture of conspecifics and an alternative host (the Canadian goldenrod), or the alternative host only.

### Random blocks

As in our spatial pattern analysis, we `jitter` the coordinates because some methods require unique coordinates for each data point.

In [None]:
data(gra)
gra$jx = jitter(gra$xloc)
gra$jy = jitter(gra$yloc)

We first use `lme` to fit two random effect models. The first considers individuals in blocks. The second considers plots nested in blocks.

In [None]:
fit = lme(score~comp+water, random = ~1 | block, 
     data =  gra, na.action = na.omit)
fit2 = lme(score~comp+water, random = ~1 | block / plot, 
     data = gra, na.action = na.omit)

We next use a likelihood-ratio test (provided by `anova`) to compare the two fits. The test shows that the nested model provides the more parsimonious fit.

In [None]:
options(width = 50)
anova(fit, fit2)
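
The likelihood-ratio comparison that `anova` reports can also be computed by hand. The following is a minimal sketch on simulated data, not the rust data: the layout (4 blocks, 10 plots per block, 3 plants per plot), the variance values and the names `d`, `m1` and `m2` are all hypothetical and chosen only for illustration.

```r
library(nlme)

set.seed(1)
# Hypothetical layout: 4 blocks, 10 plots per block, 3 plants per plot
d <- expand.grid(plant = 1:3, plot = 1:10, block = 1:4)
d$block <- factor(d$block)
d$plot <- factor(d$plot)

# Simulated block-level and plot-level random effects (assumed values)
block_eff <- rnorm(4, sd = 1)
plot_eff <- rnorm(40, sd = 1.5)
plot_index <- (as.integer(d$block) - 1) * 10 + as.integer(d$plot)
d$y <- 5 + block_eff[as.integer(d$block)] + plot_eff[plot_index] +
  rnorm(nrow(d))

# Blocks only versus plots nested in blocks
m1 <- lme(y ~ 1, random = ~1 | block, data = d)
m2 <- lme(y ~ 1, random = ~1 | block / plot, data = d)

# Likelihood-ratio statistic: twice the difference in log-likelihoods,
# referred to a chi-square with one degree of freedom (one extra variance)
lr_stat <- as.numeric(2 * (logLik(m2) - logLik(m1)))
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
c(lr_stat = lr_stat, p_value = p_value)
```

Because the extra variance component is tested at the boundary of its parameter space (a variance cannot be negative), the chi-square p-value tends to be conservative, so a small p-value here is, if anything, an understatement of the evidence for the nested model.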

The `intervals`-call shows that the between-plot variance is about twice as large as the between-block variance, and watered plots have a significantly higher rust burden.

In [None]:
intervals(fit2)

### Spatial regression

The above randomized block mixed-effects models are the classic solution to analyzing experiments with spatial structure. An alternative is to formulate a regression model that considers the spatial dependence among observations as a function of separating distance. To investigate how proximate observations on different experimental treatments may be spatially autocorrelated, we can explore the spatial dependence among the *residuals* from a simple linear analysis of the data. We use the nonparametric spatial covariance function (as implemented in the `spline.correlog`-function in the `ncf`-package) discussed in chapter \[chap:c12\]. We first fit the simple regression model that ignores space altogether.

In [None]:
fitlm = lm(score ~ comp + water, data = gra)

Next we calculate the spatial correlation function among the residuals of the fit (fig. \[fig:residcor\]).

In [None]:
fitc = spline.correlog(gra$xloc, gra$yloc, resid(fitlm))

The nonparametric spatial correlation function reveals strong spatial autocorrelation that decays to zero around 38 meters (with a CI of 31-43m).

In [None]:
plot(fitc, ylim = c(-0.5, 1))

To fit the spatial regression model we use the `gls`-function from the `nlme`-package. This function fits models with correlated errors to data that have a single dependence group (one spatial map, one time series, etc.). With multiple groups we use the `lme`-function (see section \[sec:c13rep\]). There are many possible models for spatial dependence. We compare the exponential model (which assumes the correlation to decay with distance according to $\exp(-d/a)$, where $d$ is distance and $a$ is the scale) and the Gaussian model ($\exp(-(d/a)^2)$). The `nugget`-flag means that the function is not anchored at one at distance zero. We compare these to the nonspatial model (`fitn`) and the best random block model (`fit2`) using AIC.

In [None]:
# Exponential and Gaussian spatial correlation, each with a nugget
fite = gls(score~comp+water, correlation = corSpatial(form =
      ~jx + jy, type = "exponential", nugget = TRUE),
     data = gra, na.action = na.omit)
fitg = gls(score~comp+water, correlation = corSpatial(form = 
     ~jx + jy, type = "gaussian", nugget = TRUE), data = gra, 
     na.action = na.omit)
# Nonspatial reference model
fitn = gls(score~comp+water,  data = gra, na.action = na.omit)
AIC(fite, fitg, fitn, fit2)

The AICs show that the exponential model provides the best fit. Moreover, the spatial regression model provides a better fit than the nested random effect model. This is presumably because of the gradual decay in correlation with distance (fig. \[fig:residcor\]).

In [None]:
options(width = 50)
summary(fite, corr = FALSE)

The parametrically estimated range of 9.8m is a bit longer than (but within the confidence interval of) the e-folding scale (5.5m) estimated by the spline correlogram; 1-nugget = 0.64 is comparable to (but a little greater than) the 0.55 y-intercept. We can use the `Variogram`-function from the `nlme`-package to see if the spatial model adequately reflects the spatial dependence (fig. \[fig:variog\]). It looks like a plausible fit.

In [None]:
plot(Variogram(fite))
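
To make the roles of the range and the nugget concrete, the two correlation models can be written out as small functions. This is a sketch under the assumption that, with a nugget, the correlation at distance $d > 0$ is $(1-\text{nugget})\exp(-d/a)$ for the exponential model and $(1-\text{nugget})\exp(-(d/a)^2)$ for the Gaussian model. The helper names `exp_cor` and `gauss_cor` are hypothetical; the numerical values are the range and 1-nugget quoted in the text.

```r
# Exponential correlation with a nugget: 1 at distance zero,
# (1 - nugget) * exp(-d / range) at any positive distance
exp_cor <- function(d, range, nugget = 0) {
  ifelse(d == 0, 1, (1 - nugget) * exp(-d / range))
}

# Gaussian correlation with a nugget
gauss_cor <- function(d, range, nugget = 0) {
  ifelse(d == 0, 1, (1 - nugget) * exp(-(d / range)^2))
}

range_hat <- 9.8    # range quoted in the text (meters)
nugget_hat <- 0.36  # implied by 1 - nugget = 0.64 in the text

# Correlation at a few separating distances (meters)
d <- c(1, 5, 9.8, 20, 40)
round(cbind(d,
            exponential = exp_cor(d, range_hat, nugget_hat),
            gaussian = gauss_cor(d, range_hat, nugget_hat)), 3)
```

At a distance equal to the range, the exponential correlation has dropped to $1/e$ of its near-zero value of 1-nugget, which is why the range is comparable to an e-folding scale.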

Repeated measures of in-host mouse malaria
------------------------------------------

Repeated measurements usually result in non-independent data because of the inherent serial dependence. Consider Huijben’s data on anemia of mice infected by five different strains of *Plasmodium chabaudi* introduced in section \[sec:c6mal\], with measurements taken on days 3 through 21, 24, 26, 28, 31, 33 and 35. We will study the red blood cell counts (RBCs) of mice infected by one of five different clones, as well as of the control group. The sample sizes per treatment were 10 for AQ, BC, CB, and ER, 7 for AT and 5 for control. Eleven of the animals died. `SH9` has the data (in long format)[1]. For the analysis we strip columns 1, 3, 4, 7, 8, 10 and 11, which are extraneous to the RBC counts:

In [None]:
data(SH9)
SH9RBC = SH9[, -c(1, 3, 4, 7, 8, 10, 11)]

For the repeated measures analyses we create a `groupedData`-object from the data frame using the `nlme`-package. The call below declares that the RBC counts represent a time series for each mouse. Note that mice that died are scored with a zero RBC count in the data set, and these zeros end up dominating the patterns. We therefore re-score these data as missing (`NA`) and plot the grouped data object to visualize the anemia by treatment (fig. \[fig:rbc\]).

In [None]:
RBC = groupedData(RBC ~ Day | Ind2, data = SH9RBC)
RBC$RBC[RBC$RBC == 0] = NA
plot(RBC, outer = ~Treatment, key = FALSE)

The main difference is between control and treatments, but the maximum anemia varies somewhat among strains. To test for significant differences we use `lme` to build a repeated measures model. In the simplest case we follow standard convention and model the time series using day as an ordered factor and assume the treatment effect to be additive. The `random = ~1 | Ind2` argument indicates that we assume the intercept (but not the slopes) to vary among individuals. We then use the `ACF` function to look for evidence of serial dependence in the residuals from the fit. As is apparent from the ACF plot, there is temporal autocorrelation in the residuals out to at least 4 days (fig. \[fig:rbcacf\]).

In [None]:
mle.rbc = lme(RBC~Treatment+ordered(Day), random =
   ~1|Ind2, data = RBC, na.action = na.omit, method = "ML")
plot(ACF(mle.rbc))

There are many models for serial dependence. We use a first-order autoregressive process (AR1). This is specified by the `correlation = corAR1(form = ~Day | Ind2)` argument. Note that this is one of a variety of time series models available in the `nlme`-package, the most general of which is the ARMA(p, q) model discussed in section \[sec:arma\].

In [None]:
options(width=58)
mle.rbc2 = lme(RBC~Treatment+ordered(Day), random=
     ~1|Ind2, data = RBC, correlation = corAR1(form=~
     Day|Ind2), na.action = na.omit, method = "ML")
mle.rbc2

The Phi1 parameter of 0.7088 represents the estimated day-to-day correlation, which is substantial. We can plot the predicted and observed correlation. The AR1-model seems to be a nice fit (fig. \[fig:rbcacf2\]).

In [None]:
tmp = ACF(mle.rbc2)
plot(ACF ~ lag, data = tmp)
lines(0:15, 0.7088^(0:15))
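
The curve `0.7088^(0:15)` drawn above is the AR1 prediction that the correlation between two residuals from the same mouse decays geometrically with the lag between them. As a minimal sketch, assuming the correlation depends only on the difference in days, the implied within-mouse correlation matrix for the sampling days listed earlier can be built directly (the names `phi`, `days` and `R` are introduced here for illustration):

```r
phi <- 0.7088                             # estimated day-to-day correlation
days <- c(3:21, 24, 26, 28, 31, 33, 35)   # sampling days from the text

# Absolute lag (in days) between every pair of sampling days
lag <- abs(outer(days, days, "-"))

# AR1 correlation: phi raised to the lag
R <- phi^lag
dimnames(R) <- list(days, days)

# Adjacent days are correlated at phi; the 3-day gap between
# days 21 and 24 gives phi^3
R["3", "4"]
R["21", "24"]
```

The unevenly spaced later samples are therefore expected to be markedly less correlated with each other than the daily samples early in the infection.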

Moreover, a formal likelihood-ratio test provided by the `anova` function reveals that the correlated error model provides a significantly better fit to the data:

In [None]:
options(width = 50)
anova(mle.rbc, mle.rbc2)

Statistically, the time-by-treatment interaction model, rather than the additive model, is better still:

In [None]:
options(width=50)
mle.rbc3=lme(RBC~Treatment*ordered(Day), random= 
     ~1|Ind2, data=RBC, correlation=corAR1(form=
     ~Day|Ind2), na.action=na.omit, method="ML")
anova(mle.rbc2, mle.rbc3)

Finally we can plot the predicted values against time, filtering out predictions for the missing values in the original data (fig. \[fig:rbc2\]). There is a distinct ordering in the virulence of the strains:

In [None]:
# Predictions exist only for rows with observed RBC counts
pr=predict(mle.rbc3)
RBC$pr=NA
RBC$pr[!is.na(RBC$RBC)]=pr
plot(RBC$pr~RBC$Day, col=as.numeric(RBC$Treatment), 
     pch=as.numeric(RBC$Treatment),xlab="Day", 
     ylab="RBC count")
# Index the legend by factor level so labels, symbols and colours agree
legend("bottomright",       
     legend=levels(RBC$Treatment),
     pch=1:6, col=1:6)

Modeling time as an ordered factor is quite wasteful of parameters (the full interaction model has 153 parameters). A flexible yet more economical approach may be to model time using splines. The following example uses B-splines with 5 degrees of freedom (fig. \[fig:rbc3\]). The qualitative features are similar to those of the more parameter-rich model (fig. \[fig:rbc2\]).

In [None]:
require(splines)
mle.rbc4=lme(RBC~Treatment*bs(Day, df=5), random=
   ~1|Ind2, data=RBC, correlation=corAR1(form=
   ~Day|Ind2), na.action=na.omit, method="ML")
# Predictions exist only for rows with observed RBC counts
pr=predict(mle.rbc4)
RBC$pr=NA
RBC$pr[!is.na(RBC$RBC)]=pr
plot(RBC$pr~RBC$Day, col=as.numeric(RBC$Treatment), 
   pch=as.numeric(RBC$Treatment),  xlab="Day", 
   ylab="RBC count")
# Index the legend by factor level so labels, symbols and colours agree
legend("bottomright",       
   legend=levels(RBC$Treatment), 
   pch=1:6, col=1:6)
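
The difference in parameter cost between the two ways of modeling time can be checked without fitting anything, by counting the columns of the fixed-effects design matrix. The sketch below builds a hypothetical balanced grid of the six treatments and 25 sampling days (the object names `design`, `X_factor` and `X_spline` are introduced here for illustration; the treatment labels and days are those given in the text).

```r
library(splines)

# Hypothetical balanced design: every treatment observed on every day
design <- expand.grid(
  Day = c(3:21, 24, 26, 28, 31, 33, 35),
  Treatment = factor(c("AQ", "AT", "BC", "CB", "Control", "ER"))
)

# Fixed-effects design matrices for the two interaction models
X_factor <- model.matrix(~ Treatment * ordered(Day), data = design)
X_spline <- model.matrix(~ Treatment * bs(Day, df = 5), data = design)

# Each model adds three further parameters: the random-intercept
# standard deviation, the residual standard deviation and the AR1 phi
c(ordered_factor = ncol(X_factor) + 3,
  b_spline = ncol(X_spline) + 3)
```

The ordered-factor model has 6 x 25 = 150 fixed-effect coefficients, which with the three variance and correlation parameters gives the 153 quoted above. The B-spline model needs only 6 x 6 = 36 fixed-effect coefficients.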

*B. bronchiseptica* in rabbits
------------------------------

*Bordetella bronchiseptica* causes a respiratory infection in a range of mammals. Its congeners, *B. pertussis* and *B. parapertussis*, cause whooping cough in humans, but *B. bronchiseptica* is usually relatively asymptomatic (though it can cause snuffles in rabbits and kennel cough in dogs). The data come from a study of transmission paths in a commercial rabbitry that breeds New Zealand White (NZW) rabbits. They are from the same study we used to investigate the age-specific force of infection in section \[sec:c3cat\]. Nasal swabs of female rabbits and their young were taken at weaning ($\sim$ 4 weeks old). A total of 86 does and 408 kits were included in the study.

In [None]:
data(litter)

To investigate whether (a) offspring of infected mothers have an increased instantaneous risk of becoming infected and (b) offspring of the same litter tend to share the same infection fate because of within-litter transmission, we use a mixed-effects logistic regression (a generalized linear mixed model, GLMM) with litter as a random effect. We first do some data formatting.

In [None]:
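# Per-litter summaries: litter size and number of infected kits
tdat=data.frame(lsize=as.vector(table(litter$Litter)), 
  Litter=names(table(litter$Litter)), 
  anysick=sapply(split(litter$sick,litter$Litter),sum))
ldat=merge(litter, tdat, by="Litter")
# Number of infected litter-mates, excluding the focal kit
ldat$othersick=ldat$anysick-ldat$sick
ldat$anyothersick=ldat$othersick>0
# Unique identifier per kit (used for the independence model)
ldat$X=1:408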

Here, the concern is whether litter-mates share correlated fates. Unlike for spatial or temporal autocorrelation, there are no canned functions to quantify this correlation. However, following our discussion of autocorrelation in section \[sec:c12spa\], it is easy to customize our own calculations. Below, the double loop makes a sibling-sibling ‘contact-matrix’, `tmp`, that flags pairs of kits from the same litter. Next, `tmp2` standardizes the binary `sick` vector that flags whether or not an animal was infected, and `tmp3` holds the pairwise cross-products of these standardized values. Finally, `mean(tmp3*tmp, na.rm = TRUE)` provides the within-litter autocorrelation in infection status averaged across all within-litter pairs.

In [None]:
tmp = matrix(NA, ncol = length(ldat$Litter), 
     nrow = length(ldat$Litter))
for(i in 1:length(ldat$Litter)){
     for(j in 1:length(ldat$Litter)){
        if(ldat$Litter[i]==ldat$Litter[j]){
          tmp[i,j] = 1
        }
     }
}
diag(tmp) = NA
tmp2 = scale(ldat$sick)[,1]
tmp3 = outer(tmp2, tmp2, "*")
mean(tmp3*tmp, na.rm = TRUE)
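
The same calculation can be written without explicit loops. The following is a sketch on a small made-up data set, not the rabbit data: the vectors `litter_id` and `sick` are hypothetical, and the logical matrix `same` plays the role of the contact matrix `tmp`.

```r
# Hypothetical toy data: three litters with binary infection status
litter_id <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
sick <- c(1, 1, 0, 0, 0, 0, 1, 1, 1)

# TRUE for pairs of distinct kits from the same litter
same <- outer(litter_id, litter_id, "==")
diag(same) <- FALSE

# Standardize infection status and form all pairwise cross-products
z <- scale(sick)[, 1]
zz <- outer(z, z)

# Average cross-product over within-litter pairs only
within_cor <- mean(zz[same])
within_cor
```

Using `outer` with `"=="` builds the full litter-membership matrix in one step, which avoids the double loop and scales better as the number of kits grows.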

The within-litter correlation of 0.53 represents a substantial interdependence among litter-mates. Since the response variable is binary (infected vs non-infected) we cannot use `lme`. Instead we use the `glmer`-function from the `lme4`-package and specify that the response is binomial using the `family` argument. Using AICs we contrast the fit with within-litter correlation (`fitL`) with the fit that assumes independence (`fit0`). The appropriate independence fit is generated by declaring that each of the 408 individuals is in its own group (variable `X` in the data set).

In [None]:
require(lme4)
fitL=glmer(sick~msick+lsize+Facility+anyothersick+
     (1|Litter), family=binomial(), data=ldat)
fit0=glmer(sick~msick+lsize+Facility+anyothersick+
     (1|X), family=binomial(), data=ldat)
AIC(fitL, fit0)

The litter-dependent model is clearly best (no surprise given the strong empirical intra-litter correlation). The summary of the best model reveals that the key predictor of infection fate is whether or not a sibling was infected (`anyothersickTRUE`). The infection status of the mother was not significant. The mixed-effects logistic regression thus suggests that the most important route of infection is likely to be sib-to-sib transmission.

In [None]:
options(width = 50)
summary(fitL, corr = FALSE)

[1] With repeated measures data we often use both `long`-format, with one line for each observation, and `wide`-format, with one line for each experimental unit.