# Complexity in fitting Linear Mixed Models

Linear mixed-effects models are increasingly used for the analysis of data from experiments in fields like psychology where several subjects are each exposed to each of several different items.
In addition to the response, assumed here to be on a continuous scale (a _response time_, for example), the data record a number of experimental conditions that are systematically varied during the experiment.
In the language of statistical experimental design the latter variables are called _experimental factors_ whereas factors like `Subject` and `Item` are _blocking factors_.
That is, these are known sources of variation that usually are not of interest by themselves but still should be accounted for when looking for systematic variation in the response.

## An example data set

The data from experiment 2 in [_Kronmueller and Barr (2007)_](https://www.sciencedirect.com/science/article/pii/S0749596X06000581) consist of the response times (ms) for target selection by 56 participants (undergraduate students) on 32 items (a block of displays) under combinations of three two-level experimental factors: `Speaker` (old/new), `Precedent` (break/maintain), and `Load` (Yes/No).

These data are available as the `kb07` data frame in the [_RePsychLing_ package](https://github.com/dmbates/RePsychLing/) for [__R__](http://www.R-project.org).

A copy is also available in the `test/dat.rda` file in the [_MixedModels_ package](https://github.com/dmbates/MixedModels.jl) repository.
This file can be loaded into a [__Julia__](https://julialang.org) session using the [_RData_ package](https://github.com/JuliaData/RData.jl).

In [1]:
using BenchmarkTools, DataFrames, Distributions, FreqTables
using MixedModels, RData, Statistics, StatsBase, Tables

In [2]:
kb07 = load(joinpath(dirname(pathof(MixedModels)), "..", "test", "dat.rda"))["kb07"]

| Row | G | H | Y | S | T | U | V | W | X | Z |
|---|---|---|---|---|---|---|---|---|---|---|
| | Categorical… | Categorical… | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 30 | 1 | 2267.0 | 1.0 | -1.0 | 1.0 | -1.0 | 1.0 | -1.0 | -1.0 |
| 2 | 30 | 2 | 3856.0 | -1.0 | 1.0 | -1.0 | -1.0 | 1.0 | -1.0 | 1.0 |
| 3 | 30 | 3 | 1567.0 | -1.0 | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 | -1.0 |
| 4 | 30 | 4 | 1732.0 | 1.0 | 1.0 | -1.0 | 1.0 | -1.0 | -1.0 | -1.0 |
| 5 | 30 | 5 | 2660.0 | 1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 | 1.0 |
| 6 | 30 | 6 | 2763.0 | -1.0 | 1.0 | 1.0 | -1.0 | -1.0 | 1.0 | -1.0 |
| 7 | 30 | 7 | 3528.0 | -1.0 | -1.0 | 1.0 | 1.0 | -1.0 | -1.0 | 1.0 |
| 8 | 30 | 8 | 1741.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 9 | 30 | 9 | 3692.0 | 1.0 | -1.0 | 1.0 | -1.0 | 1.0 | -1.0 | -1.0 |
| 10 | 30 | 10 | 1949.0 | -1.0 | 1.0 | -1.0 | -1.0 | 1.0 | -1.0 | 1.0 |


In this data frame the names of the columns have been shortened to single characters:
- `Y`: response time (ms.)
- `G`: subject
- `H`: item
- `S`: speaker
- `T`: precedent
- `U`: load

The three experimental factors have been converted to $\pm1$ coding and their two- and three-factor interactions have been created explicitly.
For example, `V` is the `S * T` interaction.

In [3]:
describe(kb07)

| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Union… | Nothing | DataType |
| 1 | G | | 30.0 | | 103.0 | 56.0 | | CategoricalString{UInt8} |
| 2 | H | | 1.0 | | 32.0 | 32.0 | | CategoricalString{UInt8} |
| 3 | Y | 2180.49 | 336.0 | 1936.5 | 5143.65 | | | Float64 |
| 4 | S | 0.0 | -1.0 | 0.0 | 1.0 | | | Float64 |
| 5 | T | 0.0 | -1.0 | 0.0 | 1.0 | | | Float64 |
| 6 | U | 0.00111732 | -1.0 | 1.0 | 1.0 | | | Float64 |
| 7 | V | 0.00111732 | -1.0 | 1.0 | 1.0 | | | Float64 |
| 8 | W | 0.0 | -1.0 | 0.0 | 1.0 | | | Float64 |
| 9 | X | 0.0 | -1.0 | 0.0 | 1.0 | | | Float64 |
| 10 | Z | -0.00111732 | -1.0 | -1.0 | 1.0 | | | Float64 |


In [4]:
all(kb07.V .== kb07.S .* kb07.T)

true

(Note that columns of data frames are extracted as "properties" using the dot extractor in Julia, like `kb07.V`.
The dot is also used for broadcasting operations over vectors so the operator for multiplying two columns component-wise is "`.*`" and an element-by-element equality comparison is written "`.==`".)
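As a small illustration of the dot syntax, the sketch below works on two short vectors instead of the full data frame. The values are the `S` and `T` entries from the first four rows displayed earlier; the variable names `s`, `t`, `v` and `matches` are chosen only for this example.

```julia
# S and T values copied from the first four rows of kb07 shown above
s = [1.0, -1.0, -1.0, 1.0]
t = [-1.0, 1.0, -1.0, 1.0]

# `.*` multiplies element by element and returns a new vector
v = s .* t

# `.==` compares element by element, giving a vector of Booleans;
# `all` then checks that every comparison is true
matches = all(v .== [-1.0, -1.0, 1.0, 1.0])
```

The expected vector in the last line is the `V` column for the same four rows.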

The `G` and `H` factors are crossed but not completely balanced.

In [5]:
GHtbl = freqtable(kb07, :G, :H)

56×32 Named Array{Int64,2}
G ╲ H │  1   2   3   4   5   6   7   8  …  25  26  27  28  29  30  31  32
──────┼──────────────────────────────────────────────────────────────────
30    │  1   1   1   1   1   1   1   1  …   1   1   1   1   1   1   1   1
31    │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
34    │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
35    │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
36    │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
37    │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
38    │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
39    │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
41    │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
42    │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
43    │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
44    │  1 

In [6]:
countmap(vec(GHtbl))

Dict{Int64,Int64} with 2 entries:
  0 => 2
  1 => 1790

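Two of the 56 × 32 = 1792 subject-item combinations are missing, which is why there are 1790 observations.
All of the experimental factors vary within subject and within item.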

## Formulating a simple model

A simple model with main effects for each of the experimental factors and with random effects for subject and for item is described by the formula `Y ~ 1 + S + T + U + (1|G) + (1|H)`.
In the _MixedModels_ package, which uses the formula specifications from the [_StatsModels_ package](https://github.com/JuliaStats/StatsModels.jl), a formula must be wrapped in a call to the `@formula` macro.
The model is created as an instance of a `LinearMixedModel` type then fit with a call to the `fit!` generic.

(By convention, the names in Julia of _mutating functions_, which modify the value of one or more of their arguments, end in `!` as a warning to the user that arguments, usually just the first argument, can be overwritten with new values.)
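The naming convention can be seen with a pair of functions from Julia's standard library, `sort` and `sort!`. This is a generic illustration that does not involve _MixedModels_; the variable names are arbitrary.

```julia
v = [3, 1, 2]

# Non-mutating: returns a sorted copy and leaves `v` alone
w = sort(v)
unchanged = (v == [3, 1, 2])

# Mutating: sorts `v` in place, overwriting its contents
sort!(v)
```

After the call to `sort!`, `v` holds the same values as `w`. In the same way, `fit!` overwrites parts of the model object it is given.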

In [7]:
m1 = fit!(LinearMixedModel(@formula(Y ~ 1 + S + T + U + (1|G) + (1|H)), kb07))

Linear mixed model fit by maximum likelihood
 Y ~ 1 + S + T + U + (1 | G) + (1 | H)
     logLik        -2 logLik          AIC             BIC       
 -1.44157021×10⁴  2.88314042×10⁴  2.88454042×10⁴   2.8883834×10⁴

Variance components:
            Variance Std.Dev. 
 G         101978.30 319.3404
 H         131612.90 362.7849
 Residual  518767.52 720.2552
 Number of obs: 1790; levels of grouping factors: 56, 32

  Fixed-effects parameters:
             Estimate Std.Error  z value P(>|z|)
(Intercept)   2180.72   78.8909  27.6422  <1e-99
S            -67.1968   17.0244 -3.94709   <1e-4
T            -333.674   17.0244 -19.5998  <1e-84
U             78.8989   17.0244  4.63447   <1e-5


The first fit of such a model can take several seconds because the Just-In-Time (JIT) compiler must analyze and compile a considerable amount of code.  (All of the code in the _MixedModels_ package is Julia code.)
Subsequent fits of this or similar models are much faster.

In [8]:
const f1 = @formula(Y ~ 1 + S + T + U + (1|G) + (1|H));
@time fit!(LinearMixedModel(f1, kb07));

  0.304641 seconds (496.90 k allocations: 27.518 MiB, 2.83% gc time)


When timing function calls that take less than a few seconds, it is more stable to use the `@benchmark` macro from the [_BenchmarkTools_ package](https://github.com/JuliaCI/BenchmarkTools.jl).

In [9]:
@benchmark fit!(LinearMixedModel($f1, $kb07))

BenchmarkTools.Trial: 
  memory estimate:  701.09 KiB
  allocs estimate:  2414
  --------------
  minimum time:     1.777 ms (0.00% GC)
  median time:      1.816 ms (0.00% GC)
  mean time:        2.108 ms (6.01% GC)
  maximum time:     84.739 ms (97.33% GC)
  --------------
  samples:          2368
  evals/sample:     1

(The use of `$f1` and `$kb07` is an interpolation syntax that has the effect in this macro call of timing only the function execution and not the argument name lookup.)

### Model construction versus model optimization

The `m1` object is created in the call to the constructor function, `LinearMixedModel`, then the parameters are optimized or fit in the call to `fit!`.
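Usually the process of fitting a model will take longer than creating the numerical representation but for simple models the creation time can be a significant portion of the overall running time.
For `m1`, construction accounts for about 0.67 ms of the roughly 1.8 ms total, as the following benchmarks show.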

In [10]:
@benchmark LinearMixedModel($f1, $kb07)

BenchmarkTools.Trial: 
  memory estimate:  667.63 KiB
  allocs estimate:  1117
  --------------
  minimum time:     652.910 μs (0.00% GC)
  median time:      666.951 μs (0.00% GC)
  mean time:        738.542 μs (8.46% GC)
  maximum time:     3.715 ms (70.62% GC)
  --------------
  samples:          6737
  evals/sample:     1

In [11]:
@benchmark fit!($m1)

BenchmarkTools.Trial: 
  memory estimate:  33.38 KiB
  allocs estimate:  1296
  --------------
  minimum time:     1.024 ms (0.00% GC)
  median time:      1.045 ms (0.00% GC)
  mean time:        1.131 ms (0.67% GC)
  maximum time:     17.637 ms (0.00% GC)
  --------------
  samples:          4410
  evals/sample:     1

In [12]:
@benchmark refit!($m1)

BenchmarkTools.Trial: 
  memory estimate:  33.58 KiB
  allocs estimate:  1302
  --------------
  minimum time:     1.052 ms (0.00% GC)
  median time:      1.075 ms (0.00% GC)
  mean time:        1.169 ms (0.63% GC)
  maximum time:     19.542 ms (0.00% GC)
  --------------
  samples:          4268
  evals/sample:     1

(The `refit!` method allows for specifying a new response vector and reinitializing some of the structure.
It is useful for simulations.)

### Factors affecting the time to optimize the parameters

The optimization process is summarized in the `optsum` property of the model.

In [13]:
m1.optsum

Initial parameter vector: [1.0, 1.0]
Initial objective value:  28889.20544069451

Optimizer (from NLopt):   LN_BOBYQA
Lower bounds:             [0.0, 0.0]
ftol_rel:                 1.0e-12
ftol_abs:                 1.0e-8
xtol_rel:                 0.0
xtol_abs:                 [1.0e-10, 1.0e-10]
initial_step:             [0.75, 0.75]
maxfeval:                 -1

Function evaluations:     21
Final parameter vector:   [0.443371, 0.503689]
Final objective value:    28831.404175679785
Return code:              FTOL_REACHED


For this model there are two parameters to be optimized because the objective function, negative twice the log-likelihood, can be _profiled_ with respect to all the other parameters.
(See section 3 of [_Bates et al. 2015_](https://www.jstatsoft.org/article/view/v067i01) for details.)
Both these parameters must be non-negative (they have a lower bound of zero) and both have an initial value of one.
After 21 function evaluations an optimum is declared according to the function value tolerance, either $10^{-8}$ in absolute terms or $10^{-12}$ relative to the current value.
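The function-value tolerance can be pictured with the following minimal sketch. It is an illustration of the stopping rule as described above, not the code used by NLopt; the function name `ftol_reached` and its arguments are hypothetical, with defaults taken from the `optsum` display.

```julia
# Hypothetical helper: has the objective changed by less than either tolerance?
function ftol_reached(fprev, fcurr; ftol_rel = 1.0e-12, ftol_abs = 1.0e-8)
    Δ = abs(fcurr - fprev)
    # Stop when the change is small in absolute terms or relative to the current value
    return Δ ≤ ftol_abs || Δ ≤ ftol_rel * abs(fcurr)
end

# The step from the initial to the final objective is far too large to stop on
big_step = ftol_reached(28889.20544069451, 28831.404175679785)

# A change of about 1e-9 is below the absolute tolerance of 1e-8
tiny_step = ftol_reached(28831.404175679785, 28831.404175679785 + 1.0e-9)
```

With an objective near 28831, the relative tolerance corresponds to a change of about 3e-8, so the two criteria are of similar size for this model.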

The optimization itself has a certain amount of setup and summary time but the majority of the time is spent evaluating the objective, negative twice the profiled log-likelihood.

Each function evaluation is of the form

In [14]:
const θ = m1.θ

2-element Array{Float64,1}:
 0.443371205404996 
 0.5036894347196534

In [15]:
@benchmark objective(updateL!(setθ!($m1, $θ)))

BenchmarkTools.Trial: 
  memory estimate:  1.42 KiB
  allocs estimate:  57
  --------------
  minimum time:     41.958 μs (0.00% GC)
  median time:      45.220 μs (0.00% GC)
  mean time:        48.382 μs (0.00% GC)
  maximum time:     9.708 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

At a bit under 50 microseconds each, the 21 evaluations take about 1 ms in total, which is practically all of the time needed to fit the model.

The majority of the time for the function evaluation for this model is spent in the call to `updateL!`.

In [16]:
@benchmark updateL!($m1)

BenchmarkTools.Trial: 
  memory estimate:  320 bytes
  allocs estimate:  17
  --------------
  minimum time:     31.557 μs (0.00% GC)
  median time:      33.958 μs (0.00% GC)
  mean time:        38.019 μs (0.00% GC)
  maximum time:     9.700 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

This is an operation that updates the lower Cholesky factor (often written as `L`) of a blocked sparse matrix.

There are 4 rows and columns of blocks.
The first row and column correspond to the random effects for subject, the second to the random effects for item, the third to the fixed-effects parameters and the fourth to the response.
Their sizes and types are

In [17]:
describeblocks(m1)

1,1: LinearAlgebra.Diagonal{Float64,Array{Float64,1}} (56, 56) LinearAlgebra.Diagonal{Float64,Array{Float64,1}}
2,1: Array{Float64,2} (32, 56) Array{Float64,2}
2,2: LinearAlgebra.Diagonal{Float64,Array{Float64,1}} (32, 32) Array{Float64,2}
3,1: Array{Float64,2} (4, 56) Array{Float64,2}
3,2: Array{Float64,2} (4, 32) Array{Float64,2}
3,3: Array{Float64,2} (4, 4) Array{Float64,2}
4,1: Array{Float64,2} (1, 56) Array{Float64,2}
4,2: Array{Float64,2} (1, 32) Array{Float64,2}
4,3: Array{Float64,2} (1, 4) Array{Float64,2}
4,4: Array{Float64,2} (1, 1) Array{Float64,2}


There are two lower-triangular blocked matrices: `A` with fixed entries determined by the model and data, and `L` which is updated for each evaluation of the objective function.
The type of the `A` block is given before the size and the type of the `L` block is after the size.
For scalar random effects, generated by a random-effects term like `(1|G)`, the (1,1) block is always diagonal for both `A` and `L`.
Its size is the number of levels of the grouping factor, `G`.

Because subject and item are crossed, the (2,1) block of `A` is dense, as is the (2,1) block of `L`.
The (2,2) block of `A` is diagonal because, like the (1,1) block, it is generated from a scalar random effects term.
However, the (2,2) block of `L` ends up being dense as a result of "fill-in" in the sparse Cholesky factorization.
All the blocks associated with the fixed-effects or the response are stored as dense matrices but their dimensions are (relatively) small.
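Fill-in can be reproduced on a toy problem using only the `LinearAlgebra` standard library. The sketch below is not the blocked representation used by _MixedModels_; it assumes a hypothetical design of 3 subjects fully crossed with 2 items, sets both relative standard deviations to 1 so that the matrix to be factored is `Z'Z + I`, and uses a dense Cholesky factorization.

```julia
using LinearAlgebra

# Hypothetical toy design: 3 subjects crossed with 2 items (6 observations)
subj = [1, 1, 2, 2, 3, 3]
item = [1, 2, 1, 2, 1, 2]

# Indicator matrices for the two scalar random-effects terms
Z1 = Float64[s == j for s in subj, j in 1:3]
Z2 = Float64[i == j for i in item, j in 1:2]
Z = hcat(Z1, Z2)

# Matrix to be factored when both relative standard deviations equal 1
M = Symmetric(Z' * Z + I)
L = cholesky(M).L

M22 = M[4:5, 4:5]   # item block before factoring: diagonal
L22 = L[4:5, 4:5]   # item block of the factor: off-diagonal entry is nonzero
```

The item block of `M` is diagonal, but eliminating the subject block couples the two items, so the corresponding block of `L` has a nonzero below-diagonal entry.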

## Increasing the complexity

In general, adding more terms to a model will increase the time required to fit the model.
However, there is a big difference between adding fixed-effects terms and adding complexity to the random effects.

Adding the two- and three-factor interactions to the fixed-effects terms increases the time required to fit the model.

In [18]:
const f2 = @formula(Y ~ 1 + S + T + U + V + W + X + Z + (1|G) + (1|H));

In [19]:
const m2 = fit!(LinearMixedModel(f2, kb07))

Linear mixed model fit by maximum likelihood
 Y ~ 1 + S + T + U + V + W + X + Z + (1 | G) + (1 | H)
     logLik        -2 logLik          AIC             BIC       
 -1.44132195×10⁴  2.88264391×10⁴  2.88484391×10⁴  2.89088288×10⁴

Variance components:
            Variance   Std.Dev. 
 G         101993.015 319.36345
 H         131638.811 362.82063
 Residual  517261.888 719.20921
 Number of obs: 1790; levels of grouping factors: 56, 32

  Fixed-effects parameters:
             Estimate Std.Error  z value P(>|z|)
(Intercept)   2180.66   78.8924   27.641  <1e-99
S            -67.1689   16.9997 -3.95119   <1e-4
T            -333.702   16.9997 -19.6299  <1e-85
U             78.9525   16.9997  4.64436   <1e-5
V             22.1173   16.9997  1.30104  0.1932
W            -18.7454   16.9997 -1.10269  0.2702
X             5.08292   16.9997 0.299001  0.7649
Z            -23.9165   16.9997 -1.40688  0.1595


(Notice that none of the interactions are statistically significant.)

In [20]:
@benchmark fit!($m2)

BenchmarkTools.Trial: 
  memory estimate:  60.94 KiB
  allocs estimate:  2376
  --------------
  minimum time:     2.119 ms (0.00% GC)
  median time:      2.161 ms (0.00% GC)
  mean time:        2.309 ms (0.50% GC)
  maximum time:     17.728 ms (0.00% GC)
  --------------
  samples:          2163
  evals/sample:     1

In this case, however, the increase occurs because the number of function evaluations needed to reach the optimum increases, from 21 to 39.

In [21]:
m2.optsum

Initial parameter vector: [1.0, 1.0]
Initial objective value:  28884.061399826605

Optimizer (from NLopt):   LN_BOBYQA
Lower bounds:             [0.0, 0.0]
ftol_rel:                 1.0e-12
ftol_abs:                 1.0e-8
xtol_rel:                 0.0
xtol_abs:                 [1.0e-10, 1.0e-10]
initial_step:             [0.75, 0.75]
maxfeval:                 -1

Function evaluations:     39
Final parameter vector:   [0.444048, 0.504472]
Final objective value:    28826.439072750934
Return code:              FTOL_REACHED


The time for each objective function evaluation has not increased substantially.

In [22]:
const θ2 = m2.θ;

In [23]:
@benchmark objective(updateL!(setθ!($m2, $θ2)))

BenchmarkTools.Trial: 
  memory estimate:  1.42 KiB
  allocs estimate:  57
  --------------
  minimum time:     47.349 μs (0.00% GC)
  median time:      51.091 μs (0.00% GC)
  mean time:        55.579 μs (0.00% GC)
  maximum time:     9.753 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

### Increasing complexity of the random effects

Another way in which the model can be extended is to switch to vector-valued random effects.
Sometimes this is described as having _random slopes_, so that a subject not only brings their own shift in the typical response but also their own shift in the change due to, say, `Load` versus `No Load`.
Instead of just one, scalar, change associated with each subject there is an entire vector of changes in the coefficients.

A model with random slopes for each of the experimental factors for both subject and item is specified as

In [24]:
const f3 = @formula(Y ~ 1 + S+T+U+V+W+X+Z + (1+S+T+U|G) + (1+S+T+U|H));

In [25]:
const m3 = fit!(LinearMixedModel(f3, kb07))

Linear mixed model fit by maximum likelihood
 Y ~ 1 + S + T + U + V + W + X + Z + (1 + S + T + U | G) + (1 + S + T + U | H)
     logLik        -2 logLik          AIC             BIC       
 -1.43221551×10⁴  2.86443102×10⁴  2.87023102×10⁴  2.88615194×10⁴

Variance components:
             Variance   Std.Dev.    Corr.
 G          91237.7942 302.055945
             1943.2438  44.082239 -0.77
             3772.6955  61.422272 -0.58 -0.04
             4190.1286  64.731203  0.35 -0.78  0.54
 H         130618.3194 361.411565
             1728.1976  41.571596 -0.43
            60908.7918 246.797066 -0.68 -0.37
             1892.7740  43.506022  0.31 -0.19 -0.16
 Residual  444339.3955 666.587875
 Number of obs: 1790; levels of grouping factors: 56, 32

  Fixed-effects parameters:
             Estimate Std.Error  z value P(>|z|)
(Intercept)   2180.51   77.1967  28.2462  <1e-99
S            -67.0753   18.3566 -3.65402  0.0003
T            -333.796   47.1065 -7.08598  <1e-11
U             79.0991 

There are several interesting aspects of this model fit.

First, the number of parameters optimized directly has increased substantially.
What was previously a 2-dimensional optimization has now become 20-dimensional.
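The count of 20 can be reproduced by assuming, as in Bates et al. (2015), that the optimized parameters for each grouping factor are the elements of a lower-triangular factor stored in column-major order. The helper names `nθ`, `lowerbounds` and `initial` below are hypothetical and are not part of _MixedModels_.

```julia
# Number of free elements in a k×k lower-triangular factor
nθ(k) = k * (k + 1) ÷ 2

# Lower bounds in column-major order: 0 on the diagonal, unbounded elsewhere
lowerbounds(k) = [i == j ? 0.0 : -Inf for j in 1:k for i in j:k]

# Initial values in the same order: the elements of an identity matrix
initial(k) = [i == j ? 1.0 : 0.0 for j in 1:k for i in j:k]

k = 4                    # intercept plus slopes for S, T and U
total = 2 * nθ(k)        # one factor for G and one for H
lb = vcat(lowerbounds(k), lowerbounds(k))
```

Under this assumption `total` is 20, and the patterns produced by `lowerbounds` and `initial` are the ones that appear in the lower bounds and the initial parameter vector of `m3.optsum`.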

In [26]:
m3.optsum

Initial parameter vector: [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0]
Initial objective value:  29309.04765501624

Optimizer (from NLopt):   LN_BOBYQA
Lower bounds:             [0.0, -Inf, -Inf, -Inf, 0.0, -Inf, -Inf, 0.0, -Inf, 0.0, 0.0, -Inf, -Inf, -Inf, 0.0, -Inf, -Inf, 0.0, -Inf, 0.0]
ftol_rel:                 1.0e-12
ftol_abs:                 1.0e-8
xtol_rel:                 0.0
xtol_abs:                 [1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10]
initial_step:             [0.75, 1.0, 1.0, 1.0, 0.75, 1.0, 1.0, 0.75, 1.0, 0.75, 0.75, 1.0, 1.0, 1.0, 0.75, 1.0, 1.0, 0.75, 1.0, 0.75]
maxfeval:                 -1

Function evaluations:     652
Final parameter vector:   [0.453137, -0.0511007, -0.0536417, 0.0340426, 0.0419768, -0.0708832, -0.0785956, 0.0242634, 0.0457586, 0.0, 0.542181, -0.0265869

and the number of function evaluations to convergence has gone from under 40 to over 650.

The time required for each function evaluation has also increased considerably,

In [27]:
const θ3 = m3.θ;

In [28]:
@benchmark objective(updateL!(setθ!($m3, $θ3)))

BenchmarkTools.Trial: 
  memory estimate:  29.98 KiB
  allocs estimate:  559
  --------------
  minimum time:     585.824 μs (0.00% GC)
  median time:      595.414 μs (0.00% GC)
  mean time:        626.143 μs (0.54% GC)
  maximum time:     20.788 ms (0.00% GC)
  --------------
  samples:          7948
  evals/sample:     1

resulting in much longer times for model fitting.

In [29]:
@benchmark fit!($m3)

BenchmarkTools.Trial: 
  memory estimate:  19.19 MiB
  allocs estimate:  367103
  --------------
  minimum time:     420.348 ms (0.58% GC)
  median time:      425.642 ms (0.57% GC)
  mean time:        428.377 ms (0.47% GC)
  maximum time:     465.773 ms (0.52% GC)
  --------------
  samples:          12
  evals/sample:     1
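As a rough consistency check, the number of function evaluations multiplied by the mean time per evaluation can be compared with the measured fitting time for each of the three models. The numbers are copied from the `optsum` and benchmark output above; the names `runs`, `estimate_ms` and `estimates` are used only in this sketch.

```julia
# (function evaluations, mean μs per evaluation, median fit! time in ms),
# all copied from the optsum and benchmark output shown above
runs = (
    m1 = (21, 48.382, 1.045),
    m2 = (39, 55.579, 2.161),
    m3 = (652, 626.143, 425.642),
)

# Estimated fitting time in ms if evaluating the objective were the only cost
estimate_ms(nevals, μs_per_eval) = nevals * μs_per_eval / 1000
estimates = map(r -> estimate_ms(r[1], r[2]), runs)

for name in keys(runs)
    println(name, ": estimated ", round(estimates[name], digits = 2),
            " ms, measured ", runs[name][3], " ms")
end
```

By this accounting the objective evaluations explain most of the fitting time for each model, so both the number of evaluations and the cost per evaluation matter.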

Notice that the estimates of the fixed-effects coefficients and their standard errors are not changed very much except for the standard error of `T` (_Precedent_), which is also the largest coefficient.
(Changing from _Break_ to _Maintain_ decreases the typical response time by about 667 ms. The "effect" is twice the coefficient because of the $\pm1$ coding. See Table 3 in [_Kronmueller and Barr (2007)_](https://www.sciencedirect.com/science/article/pii/S0749596X06000581) or evaluate the group means directly.)

In [30]:
by(kb07, :T, :Y => mean)

| Row | T | Y_Statistics.mean |
|---|---|---|
| | Float64 | Float64 |
| 1 | -1.0 | 2514.18 |
| 2 | 1.0 | 1846.8 |
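With the $\pm1$ coding, moving from $T=-1$ to $T=+1$ changes the fixed-effects prediction by twice the coefficient. Using the `T` estimate from `m3` and the two group means:

$$
2 \times (-333.796) = -667.592, \qquad 1846.8 - 2514.18 = -667.38
$$

The model-based effect and the difference in raw means agree closely.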


Furthermore, the estimates of the standard deviations of the "slope" random effects are much smaller than those of the intercept random effects, except for the `T` coefficient random effect for `H` (_Item_). This suggests that the model could be reduced to `Y ~ 1 + S+T+U+V+W+X+Z + (1|G) + (1+T|H)` or even `Y ~ 1 + S+T+U + (1|G) + (1+T|H)`.

In [31]:
const f4 = @formula(Y ~ 1 + S+T+U + (1|G) + (1+T|H));

In [32]:
const m4 = fit!(LinearMixedModel(f4, kb07))

Linear mixed model fit by maximum likelihood
 Y ~ 1 + S + T + U + (1 | G) + (1 + T | H)
     logLik        -2 logLik          AIC             BIC       
 -1.43356888×10⁴  2.86713775×10⁴  2.86893775×10⁴  2.87387873×10⁴

Variance components:
            Variance  Std.Dev.   Corr.
 H         132370.10 363.82702
            63784.32 252.55558 -0.69
 G          89087.40 298.47512
 Residual  460058.54 678.27615
 Number of obs: 1790; levels of grouping factors: 32, 56

  Fixed-effects parameters:
             Estimate Std.Error  z value P(>|z|)
(Intercept)   2180.62   77.3592  28.1883  <1e-99
S            -67.1353   16.0323 -4.18752   <1e-4
T            -333.736   47.4373 -7.03531  <1e-11
U              78.992   16.0323  4.92707   <1e-6


One way of comparing models `m3` and `m4` is a likelihood ratio test.

The difference in the objective, negative twice the log-likelihood, is similar to the change in the residual sum of squares in a linear model fit.
This objective would be called the _deviance_ if there were a way of defining a saturated model, but it is not clear what that model should be.
However, if there were a way to define a deviance, the difference in the deviances would be the same as the difference in these objectives, which is

In [33]:
diff(objective.([m3, m4]))

1-element Array{Float64,1}:
 27.06734432979283

The difference in the degrees of freedom, in one way of counting, is

In [34]:
diff(dof.([m4, m3]))

1-element Array{Int64,1}:
 20

producing a p-value of

In [36]:
ccdf(Chisq(20), first(diff(objective.([m3,m4]))))

0.13337992341386884
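The pieces of this test can be checked by hand from the two model summaries. Since the AIC is the objective plus twice the number of parameters, the parameter counts are $(28702.3102 - 28644.3102)/2 = 29$ for `m3` and $(28689.3775 - 28671.3775)/2 = 9$ for `m4`, so that

$$
\Delta = 28671.3775 - 28644.3102 = 27.0673, \qquad
\Delta_{\mathrm{df}} = 29 - 9 = 20, \qquad
p = P\left(\chi^2_{20} > 27.0673\right) \approx 0.133
$$

Here $\Delta$ and $\Delta_{\mathrm{df}}$ are notation introduced only for this summary.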