## RK Curve Panel Regression Analysis for StationSim

This notebook pairs with `stationsim_validation.ipynb`, which I suggest you read first.

Given data containing two groups of Ripley's K (RK) curves, we wish to determine whether the two groups are statistically distinct. We do this using backwards stepwise regression on a linear mixed model fitted with R's `nlme` package.

The backwards regression works as follows:

1. We fit a fully saturated linear mixed model `mmod1`, which assumes that the two groups are indeed separated. We model the Ripley's K value `y` at a given radius `x`, for groups $i = 1, 2$ and models $j = 1,...,n$ within each group, as follows.

$$ mmod1 = y_{ij} = (\alpha_0 + \beta_0 + \gamma_{0}) + (\alpha_1 + \beta_1 + \gamma_1)x_{ij} + (\beta_2 + \gamma_2)x^2_{ij} + \varepsilon_{ij} $$

Here $\alpha_i \sim N_3(0, \Sigma_{\alpha})$ are normally distributed random effects, $\beta_i$ are fixed effects, $\gamma_i$ are level effects (also called factor effects), and $\varepsilon_{ij} \sim N(0, \sigma_{\varepsilon}^2)$ is IID normal residual noise. Through trial and error, this choice of fixed and random effects terms has repeatedly given the best results, although we could easily perform backwards regression on these terms as well. We are particularly interested in the factor effects $\gamma_i$: testing whether all of these parameters are statistically insignificant also tests whether there is indeed separation between the groups.

2. We then remove the level terms $\gamma_i$ one at a time, fitting a new model after each removal, until no level terms remain. This gives us three further models, `mmod2` through `mmod4`.

$$ mmod2 = y_{ij} = (\alpha_0 + \beta_0 + \gamma_{0}) + (\alpha_1 + \beta_1 + \gamma_1)x_{ij} + (\beta_2)x^2_{ij} + \varepsilon_{ij} $$

$$ mmod3 = y_{ij} = (\alpha_0 + \beta_0 + \gamma_{0}) + (\alpha_1 + \beta_1)x_{ij} + (\beta_2)x^2_{ij} + \varepsilon_{ij} $$

$$ mmod4 = y_{ij} = (\alpha_0 + \beta_0) + (\alpha_1 + \beta_1)x_{ij} + (\beta_2)x^2_{ij} + \varepsilon_{ij} $$

3. We now test whether `mmod4`, which has no level terms at all, performs better than the other three models. If `mmod4` performs best, we have no evidence to suggest that the control and test RK groups are distinct. We use R's `anova` function to calculate the Akaike Information Criterion (AIC) value for each model, and check whether `mmod4` has the lowest AIC and is thus the best-fitting model.
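
The three steps above can be sketched on synthetic data as follows. This is a minimal illustration, not the code in `RK_population_modelling.R`: the data frame `sim`, its columns `group` and `id`, the simulated coefficients, and the mapping of the $\gamma$ terms onto `group`, `group:x` and `group:I(x^2)` are all assumptions made for the example. The models are fitted with `method = "ML"` because AIC values of models with different fixed effects are not comparable under REML.

```r
library(nlme)

set.seed(1)

# Hypothetical synthetic data: 2 groups x 10 curves per group, 20 radii per curve.
n_curves <- 10
radii <- seq(0.05, 1, length.out = 20)
sim <- expand.grid(x = radii, curve = seq_len(2 * n_curves))
sim$group <- factor(ifelse(sim$curve <= n_curves, "control", "test"))
sim$id <- factor(sim$curve)

# Per-curve random intercept and slope (the alpha terms in the notation above).
a0 <- rnorm(2 * n_curves, sd = 0.1)
a1 <- rnorm(2 * n_curves, sd = 0.1)

# Assumed group separation: the test group gets a larger quadratic coefficient.
shift <- ifelse(sim$group == "test", 0.5, 0)
sim$y <- a0[sim$curve] + (1 + a1[sim$curve]) * sim$x +
  (pi + shift) * sim$x^2 + rnorm(nrow(sim), sd = 0.02)

# Use optim as the optimiser; ML fits so AIC is comparable across fixed effects.
ctrl <- lmeControl(opt = "optim")

# Step 1: saturated model with all three level terms.
mmod1 <- lme(y ~ group + x + group:x + I(x^2) + group:I(x^2),
             random = ~ x | id, data = sim, method = "ML", control = ctrl)

# Step 2: drop the level terms one at a time.
mmod2 <- lme(y ~ group + x + group:x + I(x^2),
             random = ~ x | id, data = sim, method = "ML", control = ctrl)
mmod3 <- lme(y ~ group + x + I(x^2),
             random = ~ x | id, data = sim, method = "ML", control = ctrl)
mmod4 <- lme(y ~ x + I(x^2),
             random = ~ x | id, data = sim, method = "ML", control = ctrl)

# Step 3: compare the models and pick the one with the lowest AIC.
print(anova(mmod1, mmod2, mmod3, mmod4))
aic <- sapply(list(mmod1 = mmod1, mmod2 = mmod2, mmod3 = mmod3, mmod4 = mmod4), AIC)
best <- names(which.min(aic))
print(best)
```

If `best` is `"mmod4"`, there is no evidence that the two groups differ; any other result suggests that at least one level term is worth keeping. Because the simulated groups are deliberately separated here, `mmod4` should not be selected.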

In [None]:
# Load dependencies; library() stops with an error if a package is missing, unlike require().
library(nlme)
library(ggplot2)
library(docstring)

In [None]:
library(nlme)
source("RK_population_modelling.R")