\documentclass{article}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amscd}
\usepackage{indentfirst}
\usepackage{natbib}
\usepackage{url}
\usepackage[utf8]{inputenc}
\hyphenation{Wa-gen-ius}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator{\cl}{cl}
\newcommand{\real}{\mathbb{R}}
\newcommand{\set}[1]{\{\, #1 \,\}}
\newcommand{\inner}[1]{\langle #1 \rangle}
\newcommand{\abs}[1]{\lvert #1 \rvert}
\newcommand{\norm}[1]{\lVert #1 \rVert}
\newcommand{\fatdot}{\,\cdot\,}
\newcommand{\opand}{\mathbin{\rm and}}
\newcommand{\opor}{\mathbin{\rm or}}
% \setlength{\textheight}{\paperheight}
% \addtolength{\textheight}{- 2 in}
% \setlength{\topmargin}{0.25 pt}
% \setlength{\headheight}{0 pt}
% \setlength{\headsep}{0 pt}
% \addtolength{\textheight}{- \topmargin}
% \addtolength{\textheight}{- \topmargin}
% \addtolength{\textheight}{- \footskip}
% \setlength{\textwidth}{\paperwidth}
% \addtolength{\textwidth}{- 2 in}
% \setlength{\oddsidemargin}{0.25 in}
% \setlength{\evensidemargin}{0.25 in}
% \addtolength{\textwidth}{- \oddsidemargin}
% \addtolength{\textwidth}{- \evensidemargin}
\addtolength{\textheight}{\headsep}
\addtolength{\textheight}{\headheight}
\setlength{\headheight}{0 pt}
\setlength{\headsep}{0 pt}
\begin{document}
\vspace*{0.9375in}
\begin{center}
{\bfseries Aster Models with Random Effects
via Penalized Likelihood} \\
By \\
Charles J. Geyer, Caroline E. Ridley, Robert G. Latta, Julie R. Etterson,
and Ruth G. Shaw \\
Technical Report No.~692 \\
School of Statistics \\
University of Minnesota \\
July 30, 2012 \\
% \today
\end{center}
\thispagestyle{empty}
\cleardoublepage
\setcounter{page}{1}
\thispagestyle{empty}
\begin{abstract}
This technical report works out details of approximate maximum likelihood
estimation for aster models with random effects.
Fixed and random effects are estimated by penalized log likelihood.
Variance components are estimated by integrating out the random effects
in the Laplace approximation of the complete data likelihood following
\citet{b+c}, which can be done analytically,
and maximizing the resulting approximate missing data likelihood.
A further approximation treats the second derivative matrix of the cumulant
function of the exponential family where it appears in the approximate
missing data log likelihood as a constant (not a function of parameters).
Then first and second derivatives of the approximate missing data log
likelihood can be computed analytically. Minus the second derivative matrix
of the approximate missing data log likelihood is treated as approximate
Fisher information and used to estimate standard errors.
\end{abstract}
\section{Theory} \label{sec:theo}
Aster models \citep*{gws,aster2} have attracted much recent attention.
Several researchers have raised the issue of incorporating random effects
in aster models, and we do so here.
\subsection{Complete Data Log Likelihood} \label{sec:complete}
Although we are particularly interested in aster models \citep{gws}, our
theory works for any exponential family model. The log likelihood can
be written
$$
l(\varphi) = y^T \varphi - c(\varphi),
$$
where $y$ is the canonical statistic vector,
$\varphi$ is the canonical parameter vector,
and the cumulant function $c$ satisfies
\begin{align}
\mu(\varphi) & = E_{\varphi}(y) = c'(\varphi)
\label{eq:moo}
\\
W(\varphi) & = \var_{\varphi}(y) = c''(\varphi)
\label{eq:w}
\end{align}
where $c'(\varphi)$ denotes the vector of first partial derivatives
and $c''(\varphi)$ denotes the matrix of second partial derivatives.
We assume a canonical affine submodel with random effects determined by
\begin{equation} \label{eq:can-aff-sub}
\varphi = a + M \alpha + Z b,
\end{equation}
where $a$ is a known vector, $M$ and $Z$ are known matrices, and $b$
is a normal random vector with mean vector zero and
variance matrix $D$.
The vector $a$ is called the \emph{offset vector} and the matrices
$M$ and $Z$ are called the \emph{model matrices} for fixed and random
effects, respectively, in the terminology of the R function \texttt{glm}.
(The vector $a$ is called the \emph{origin} in the terminology of the
R function \texttt{aster}. \emph{Design matrix} is alternative
terminology for model matrix.)
The matrix $D$ is assumed to be diagonal,
so the random effects are independent random variables.
The diagonal components of $D$ are called \emph{variance components}
in the classical terminology of random effects models \citep{searle}.
Typically the components of $b$ are divided into blocks having the same
variance \citep[Section~6.1]{searle}, so there are only
a few variance components but many random effects, although nothing in this
document uses this fact.
The unknown parameter vectors are $\alpha$ and $\nu$, where $\nu$ is
the vector of variance components. Thus $D$ is a function of $\nu$,
although this is not indicated by the notation.
The ``complete data log likelihood'' (i.~e., what the log likelihood would
be if the random effect vector $b$ were observed) is
\begin{equation} \label{eq:foo}
l_c(\alpha, b, \nu)
=
l(a + M \alpha + Z b)
- \tfrac{1}{2} b^T D^{- 1} b
- \tfrac{1}{2} \log \det(D)
\end{equation}
in case none of the variance components are zero.
We deal with the case of zero variance components in Sections~\ref{sec:zero},
\ref{sec:raw}, and~\ref{sec:roots} below.
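To make the notation concrete, the following chunk is a minimal R sketch
of \eqref{eq:foo} for a toy Poisson family, so that
$c(\varphi) = \sum_i e^{\varphi_i}$, rather than for a genuine aster model.
Everything in it (the matrices $M$ and $Z$, the offset $a$, and all parameter
values) is invented for illustration; none of this code comes from the
R package \texttt{aster}.
<<toy-setup>>=
## toy Poisson family: c(phi) = sum(exp(phi)), mu(phi) = exp(phi),
## W(phi) = diag(exp(phi)); all data and parameters are made up
set.seed(42)
n <- 20
M <- cbind(1, rnorm(n))                     # fixed effect model matrix
Z <- model.matrix(~ 0 + g,
    data.frame(g = factor(rep(1:4, each = 5))))  # random effect model matrix
a <- rep(0, n)                              # offset vector
alpha <- c(1, 1 / 2)
nu <- 1 / 4                                 # one variance component: D = nu I
b <- rnorm(ncol(Z), 0, sqrt(nu))            # random effects
y <- rpois(n, exp(drop(a + M %*% alpha + Z %*% b)))

## complete data log likelihood l_c; here D = nu I,
## so log det(D) = ncol(Z) * log(nu)
lc <- function(alpha, b, nu) {
    phi <- drop(a + M %*% alpha + Z %*% b)
    sum(y * phi - exp(phi)) - sum(b^2) / (2 * nu) - ncol(Z) * log(nu) / 2
}
lc(alpha, b, nu)
@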
\subsection{Missing Data Likelihood}
Ideally, inference about the parameters should be based on
the \emph{missing data likelihood}, which is the complete data likelihood with
random effects $b$ integrated out
\begin{equation} \label{eq:bar}
L_m(\alpha, \nu) = \int e^{l_c(\alpha, b, \nu)} \, d b
\end{equation}
Maximum likelihood estimates (MLEs) of $\alpha$ and $\nu$ are the values that
maximize \eqref{eq:bar}. However, MLEs are hard to find. The integral in
\eqref{eq:bar} cannot be done analytically, nor can it be done by numerical
integration except in very simple cases.
There does exist a large literature on doing such integrals
by ordinary or Markov chain Monte Carlo
\citep*{penttinen,thompson,g+t,mcmcml,promislow,shaw-geyer-shaw,booth-hobert,sung,hunter-et-al,okabayashi-geyer,hummel-et-al},
but these methods take a great
deal of computing time and are difficult for ordinary users to apply.
We wish to avoid that route if at all possible.
\subsection{A Digression on Minimization} \label{sec:minimize}
The theory of constrained optimization (Section~\ref{sec:raw} below)
has a bias in favor of minimization rather than maximization.
The explication below will be simpler if we switch now from maximization
to minimization (minimizing minus the log likelihood) rather than switch
later.
\subsection{Laplace Approximation} \label{sec:laplace}
\citet{b+c} proposed to replace the integrand in \eqref{eq:bar} by
its Laplace approximation, which is proportional to
a normal probability density function,
so the random effects can be integrated out analytically.
Let $b^*$ denote the result of maximizing \eqref{eq:foo} considered
as a function of $b$ for fixed $\alpha$ and $\nu$.
Then $- \log L_m(\alpha, \nu)$ is approximated by
$$
q(\alpha, \nu)
=
\tfrac{1}{2} \log \det[\kappa''(b^*)] + \kappa(b^*)
$$
where
\begin{align*}
\kappa(b)
& =
- l_c(\alpha, b, \nu)
\\
\kappa'(b)
& =
- Z^T [ y - \mu(a + M \alpha + Z b) ] + D^{-1} b
\\
\kappa''(b)
& =
Z^T W(a + M \alpha + Z b) Z + D^{-1}
\end{align*}
Hence
\begin{equation} \label{eq:log-pickle}
\begin{split}
q(\alpha, \nu)
& =
- l_c(\alpha, b^*, \nu)
+ \tfrac{1}{2} \log
\det\bigl[ \kappa''(b^*) \bigr]
\\
& =
- l(a + M \alpha + Z b^*)
+ \tfrac{1}{2} (b^*)^T D^{-1} b^*
+ \tfrac{1}{2} \log \det(D)
\\
& \qquad
+ \tfrac{1}{2} \log
\det\bigl[ Z^T W(a + M \alpha + Z b^*) Z + D^{-1} \bigr]
\\
& =
- l(a + M \alpha + Z b^*)
+ \tfrac{1}{2} (b^*)^T D^{-1} b^*
\\
& \qquad
+ \tfrac{1}{2} \log
\det\bigl[ Z^T W(a + M \alpha + Z b^*) Z D + I \bigr]
\end{split}
\end{equation}
where $I$ denotes the identity matrix of the appropriate dimension
(which must be the same as the dimension of $D$ for the expression it
appears in to make sense),
where $b^*$ is a function of $\alpha$ and $\nu$
and $D$ is a function of $\nu$, although this is not indicated by the
notation, and where
the last equality uses the rule that a sum of logs is the log of the product
and the rule that a product of determinants is the determinant of the matrix
product
\citep[Theorem~13.3.4]{harville}.
Since minus the log likelihood of an exponential family is a convex function
\citep[Theorem~9.1]{barndorff} and minus the middle term on the right-hand side
of \eqref{eq:foo} is a strictly convex function, it follows that
minus \eqref{eq:foo} considered as a function of $b$ for fixed $\alpha$
and $\nu$ is a strictly convex function.
Moreover, this function has bounded level sets, because its first term,
minus the first term on the right-hand side of \eqref{eq:foo}, is bounded
below \citep[Theorems~4 and~6]{gdor} and its quadratic term has bounded
level sets.
It follows that there is a unique global minimizer
\citep[Theorems~1.9 and~2.6]{raw}.
Thus $b^*(\alpha, \nu)$ is well defined for all values
of $\alpha$ and $\nu$.
The key idea is to use \eqref{eq:log-pickle} as if it were minus the log
likelihood for the unknown parameters ($\alpha$ and $\nu$), although it is
only an approximation. However, this is also problematic. In doing likelihood
inference using \eqref{eq:log-pickle} we need first and second derivatives
of it (to calculate Fisher information), but $W$ is already
the second derivative matrix of the cumulant function, so
first derivatives of \eqref{eq:log-pickle} would involve third derivatives
of the cumulant function and
second derivatives of \eqref{eq:log-pickle} would involve fourth derivatives
of the cumulant function. There are no published
formulas for derivatives higher than second of the aster model cumulant
function, nor does
software (the R package \texttt{aster},
\citealp{package-aster}) provide them ---
the derivatives do, of course, exist because every cumulant function of
a regular exponential family is infinitely differentiable at every
point of the canonical parameter space \citep[Theorem~8.1]{barndorff} ---
they are just not readily available.
\citet{b+c} noted the same problem in the context of GLMM, and proceeded as
if $W$ were a constant function of its argument, so all derivatives of $W$
were zero. This is not a bad approximation because ``in asymptopia'' the
aster model log likelihood is exactly quadratic and $W$ is a constant function,
this being a general property of likelihoods \citep{simple}. Hence we adopt
this idea too, more because we are forced to by the difficulty of
differentiating $W$ than by our belief that we are ``in asymptopia.''
This leads to the following idea. Rather than basing inference on
\eqref{eq:log-pickle}, we actually use
\begin{equation} \label{eq:log-pickle-too}
q(\alpha, \nu)
=
- l(a + M \alpha + Z b^*)
+ \tfrac{1}{2} (b^*)^T D^{-1} b^*
+ \tfrac{1}{2} \log
\det\bigl[ Z^T \widehat{W} Z D + I \bigr]
\end{equation}
where $\widehat{W}$ is a constant matrix (not a function of $\alpha$ and
$\nu$). This makes sense for any
choice of $\widehat{W}$ that is symmetric and positive semidefinite,
but we will choose $\widehat{W}$ to be close to
$W(a + M \hat{\alpha} + Z \hat{b})$, where
$\hat{\alpha}$ and $\hat{\nu}$ are the joint minimizers
of \eqref{eq:log-pickle} and $\hat{b} = b^*(\hat{\alpha}, \hat{\nu})$.
Note that \eqref{eq:log-pickle-too} is a redefinition of $q(\alpha, \nu)$.
Hereafter we will no longer use the definition \eqref{eq:log-pickle}.
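Continuing the toy example of Section~\ref{sec:complete}, the following
sketch evaluates \eqref{eq:log-pickle-too}, computing $b^*$ by an inner
minimization. The particular $\widehat{W}$ used (the Poisson $W$ evaluated
at the simulation truth) is only one plausible choice, made up for
illustration.
<<q-sketch>>=
## fixed What: Poisson W at the simulation truth (illustration only)
What <- diag(drop(exp(a + M %*% alpha + Z %*% b)))
ZWZ <- t(Z) %*% What %*% Z

## q(alpha, nu) as redefined above: profile out b numerically
qfun <- function(alpha, nu) {
    kappa <- function(b) {    # minus l(a + M alpha + Z b) plus penalty
        phi <- drop(a + M %*% alpha + Z %*% b)
        - sum(y * phi - exp(phi)) + sum(b^2) / (2 * nu)
    }
    bstar <- optim(rep(0, ncol(Z)), kappa, method = "BFGS")$par
    kappa(bstar) + log(det(ZWZ * nu + diag(ncol(Z)))) / 2
}
qfun(alpha, nu)
@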
\subsection{A Key Concept}
Introduce
\begin{equation} \label{eq:key}
p(\alpha, b, \nu)
=
- l(a + M \alpha + Z b)
+ \tfrac{1}{2} b^T D^{-1} b
+ \tfrac{1}{2} \log
\det\bigl[ Z^T \widehat{W} Z D + I \bigr]
\end{equation}
where, as the left-hand side says, $\alpha$, $b$, and $\nu$ are all
free variables and, as usual, $D$ is a function of $\nu$, although
the notation does not indicate this.
Since the terms of \eqref{eq:key} that contain $b$ are just minus the
corresponding terms of \eqref{eq:foo}, $b^*$ can also be defined as the result of
minimizing \eqref{eq:key} considered
as a function of $b$ for fixed $\alpha$ and $\nu$.
Thus \eqref{eq:log-pickle-too} is a profile of \eqref{eq:key}
and $(\hat{\alpha}, \hat{b}, \hat{\nu})$ is the joint minimizer
of \eqref{eq:key}.
Since $p(\alpha, b, \nu)$ is a much simpler function than $q(\alpha, \nu)$,
the latter having no closed form expression and requiring an optimization
as part of each evaluation, it is much simpler to find
$(\hat{\alpha}, \hat{b}, \hat{\nu})$
by minimizing the former rather than the latter.
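A sketch of this joint minimization for the toy example follows. To keep
the sketch unconstrained we parameterize $\nu = e^\theta$, which keeps $\nu$
strictly positive; this device is ours, and it cannot find zero variance
components (the subject of Section~\ref{sec:zero} and following).
<<p-sketch>>=
## the key function p of the display above (D = nu I in the toy example)
pfun <- function(alpha, b, nu) {
    phi <- drop(a + M %*% alpha + Z %*% b)
    - sum(y * phi - exp(phi)) + sum(b^2) / (2 * nu) +
        log(det(ZWZ * nu + diag(ncol(Z)))) / 2
}
## joint minimization over (alpha, b, theta) with nu = exp(theta)
obj <- function(par)
    pfun(par[1:2], par[3:(2 + ncol(Z))], exp(par[length(par)]))
fit <- optim(c(alpha, b, log(nu)), obj, method = "BFGS")
fit$par[c(1, 2, length(fit$par))]    # alpha hat and log(nu hat)
@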
\subsection{A Digression on Partial Derivatives}
Let $f(\alpha, b, \nu)$ be a scalar-valued function of three vector
variables. We write partial derivative vectors using subscripts:
$f_\alpha(\alpha, b, \nu)$ denotes the vector of partial derivatives
with respect to components of $\alpha$. Our convention is that we take
this to be a column vector. Similarly for $f_b(\alpha, b, \nu)$.
We also use this convention for partial derivatives with respect to
single variables: $f_{\nu_k}(\alpha, b, \nu)$, which are, of course,
scalars. We use this convention for any scalar-valued function
of any number of vector variables.
We continue this convention for second partial derivatives:
$f_{\alpha b}(\alpha, b, \nu)$ denotes the matrix of partial derivatives
having $i, j$ component that is the (mixed) second partial derivative of
$f$ with respect to $\alpha_i$ and $b_j$. Thus the row dimension of
$f_{\alpha b}(\alpha, b, \nu)$ is the dimension of $\alpha$,
the column dimension is the dimension of $b$, and
$f_{b \alpha}(\alpha, b, \nu)$ is the transpose of
$f_{\alpha b}(\alpha, b, \nu)$.
This convention allows easy indication of points at which partial derivatives
are evaluated. For example, $f_{\alpha b}(\alpha, b^*, \nu)$ indicates
that $b^*$ is plugged in for $b$ in the expression for
$f_{\alpha b}(\alpha, b, \nu)$.
We also use this convention of subscripts denoting partial derivatives
with vector-valued functions.
If $f(\alpha, b, \nu)$ is a column-vector-valued function of vector
variables, then $f_\alpha(\alpha, b, \nu)$ denotes the matrix of
partial derivatives having $i, j$ component that is the partial derivative
of the $i$-th component of $f(\alpha, b, \nu)$ with respect to
$\alpha_j$.
Thus the row dimension of
$f_{\alpha}(\alpha, b, \nu)$ is the dimension of
$f(\alpha, b, \nu)$ and
the column dimension is the dimension of $\alpha$.
\subsection{First Derivatives}
Start with \eqref{eq:key}. Its derivatives are
\begin{equation} \label{eq:p-alpha}
p_\alpha(\alpha, b, \nu)
=
- M^T \bigl[ y - \mu(a + M \alpha + Z b) \bigr]
\end{equation}
\begin{equation} \label{eq:p-c}
p_b(\alpha, b, \nu)
=
- Z^T \bigl[ y - \mu(a + M \alpha + Z b) \bigr] + D^{- 1} b
\end{equation}
and
\begin{equation} \label{eq:p-theta}
p_{\nu_k}(\alpha, b, \nu)
=
- \tfrac{1}{2} b^T D^{- 1} E_k D^{- 1} b
+ \tfrac{1}{2} \tr \Bigl(
\bigl[ Z^T \widehat{W} Z D + I \bigr]^{- 1}
Z^T \widehat{W} Z E_k
\Bigr)
\end{equation}
where
\begin{equation} \label{eq:eek}
E_k = D_{\nu_k}(\nu)
\end{equation}
is the diagonal matrix whose components are equal to one
if the corresponding components of $D$ are equal to $\nu_k$ by
definition (rather than by accident when some other component of $\nu$
also has the same value) and whose components are otherwise zero.
The formula for the derivative of a matrix inverse
comes from \citet[Chapter~15, Equation~8.15]{harville}.
The formula for the derivative of the log of a determinant
comes from \citet[Chapter~15, Equation~8.6]{harville}.
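In the toy example there is one variance component, so $E_1 = I$ and
$D = \nu I$, and these formulas transcribe directly (a sketch, using the
objects set up earlier):
<<p-gradient>>=
## gradient of p: the formulas p_alpha, p_b, and p_nu_k above,
## specialized to E_1 = I and D = nu I
pgrad <- function(alpha, b, nu) {
    phi <- drop(a + M %*% alpha + Z %*% b)
    mu <- exp(phi)                 # Poisson mean value parameter
    list(alpha = - drop(t(M) %*% (y - mu)),
         b = - drop(t(Z) %*% (y - mu)) + b / nu,
         nu = - sum(b^2) / (2 * nu^2) +
             sum(diag(solve(ZWZ * nu + diag(ncol(Z)), ZWZ))) / 2)
}
pgrad(alpha, b, nu)
@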
The estimating equation for $b^*$ can be written
\begin{equation} \label{eq:estimating-c-star}
p_b\bigl(\alpha, b^*, \nu\bigr) = 0
\end{equation}
and by the multivariate chain rule \citep[Theorem~8.15]{browder}
we have
\begin{equation} \label{eq:q-alpha}
\begin{split}
q_\alpha(\alpha, \nu)
& =
p_\alpha(\alpha, b^*, \nu)
+
b^*_\alpha(\alpha, \nu)^T p_b(\alpha, b^*, \nu)
\\
& =
p_\alpha(\alpha, b^*, \nu)
\end{split}
\end{equation}
by \eqref{eq:estimating-c-star}, and
\begin{equation} \label{eq:q-theta}
\begin{split}
q_{\nu_k}(\alpha, \nu)
& =
b^*_{\nu_k}(\alpha, \nu)^T
p_b(\alpha, b^*, \nu)
+
p_{\nu_k}(\alpha, b^*, \nu)
\\
& =
p_{\nu_k}(\alpha, b^*, \nu)
\end{split}
\end{equation}
again by \eqref{eq:estimating-c-star}.
\subsection{Second Derivatives} \label{sec:second}
By the multivariate chain rule \citep[Theorem~8.15]{browder}
\begin{align*}
q_{\alpha \alpha}(\alpha, \nu)
& =
p_{\alpha \alpha}(\alpha, b^*, \nu)
+
p_{\alpha b}(\alpha, b^*, \nu) b^*_\alpha(\alpha, \nu)
\\
q_{\alpha \nu}(\alpha, \nu)
& =
p_{\alpha \nu}(\alpha, b^*, \nu)
+
p_{\alpha b}(\alpha, b^*, \nu) b^*_{\nu}(\alpha, \nu)
\\
q_{\nu \nu}(\alpha, \nu)
& =
p_{\nu \nu}(\alpha, b^*, \nu)
+
p_{\nu b}(\alpha, b^*, \nu) b^*_{\nu}(\alpha, \nu)
\end{align*}
The estimating equation \eqref{eq:estimating-c-star}
defines $b^*$ implicitly. Thus derivatives of $b^*$ are computed using
the implicit function theorem \citep[Theorem~8.29]{browder}
\begin{align}
b^*_\alpha(\alpha, \nu)
& =
-
p_{b b}(\alpha, b^*, \nu)^{-1} p_{b \alpha}(\alpha, b^*, \nu)
\label{eq:c-star-alpha-rewrite}
\\
b^*_{\nu}(\alpha, \nu)
& =
-
p_{b b}(\alpha, b^*, \nu)^{-1} p_{b \nu}(\alpha, b^*, \nu)
\label{eq:c-star-theta-rewrite}
\end{align}
This theorem requires that $p_{b b}(\alpha, b^*, \nu)$ be invertible,
and we shall see below that it is.
Then the second derivatives above can be rewritten
\begin{align*}
q_{\alpha \alpha}(\alpha, \nu)
& =
p_{\alpha \alpha}(\alpha, b^*, \nu)
-
p_{\alpha b}(\alpha, b^*, \nu)
p_{b b}(\alpha, b^*, \nu)^{-1} p_{b \alpha}(\alpha, b^*, \nu)
\\
q_{\alpha \nu}(\alpha, \nu)
& =
p_{\alpha \nu}(\alpha, b^*, \nu)
-
p_{\alpha b}(\alpha, b^*, \nu)
p_{b b}(\alpha, b^*, \nu)^{-1} p_{b \nu}(\alpha, b^*, \nu)
\\
q_{\nu \nu}(\alpha, \nu)
& =
p_{\nu \nu}(\alpha, b^*, \nu)
-
p_{\nu b}(\alpha, b^*, \nu)
p_{b b}(\alpha, b^*, \nu)^{-1} p_{b \nu}(\alpha, b^*, \nu)
\end{align*}
a particularly simple and symmetric form. If we combine all the parameters
in one vector $\psi = (\alpha, \nu)$ and write $p(\psi, b)$ instead
of $p(\alpha, b, \nu)$ we have
\begin{equation} \label{eq:psi-psi}
q_{\psi \psi}(\psi)
=
p_{\psi \psi}(\psi, b^*)
-
p_{\psi b}\bigl(\psi, b^*\bigr)
p_{b b}\bigl(\psi, b^*\bigr)^{- 1}
p_{b \psi}\bigl(\psi, b^*\bigr)
\end{equation}
This form is familiar from the conditional variance formula
for normal distributions: if
\begin{equation} \label{eq:fat}
\begin{pmatrix} \Sigma_{1 1} & \Sigma_{1 2}
\\ \Sigma_{2 1} & \Sigma_{2 2} \end{pmatrix}
\end{equation}
is the partitioned variance matrix of a partitioned normal random vector
with components $X_1$ and $X_2$, then the variance matrix of the conditional
distribution of $X_1$ given $X_2$ is
\begin{equation} \label{eq:thin}
\Sigma_{1 1} - \Sigma_{1 2} \Sigma_{2 2}^{- 1} \Sigma_{2 1}
\end{equation}
assuming that $X_2$ is nondegenerate \citep[Theorem~2.5.1]{anderson}.
Moreover, if the conditional distribution is degenerate, that is, if there
exists a nonrandom vector $v$ such that $\var(v^T X_1 \mid X_2) = 0$, then
$$
v^T X_1 = v^T \Sigma_{1 2} \Sigma_{2 2}^{- 1} X_2
$$
with probability one, assuming $X_1$ and $X_2$ have mean zero
\citep[also by][Theorem~2.5.1]{anderson}, and the joint distribution
of $X_1$ and $X_2$ is also degenerate. Thus we conclude that if
the (joint) Hessian matrix of $p$ is nonsingular, then so is the (joint)
Hessian matrix of $q$ given by \eqref{eq:psi-psi}.
The remaining work for this section is deriving second derivatives of $p$
\begin{align*}
p_{\alpha \alpha}(\alpha, b, \nu)
& =
M^T W(a + M \alpha + Z b) M
\\
p_{\alpha b}(\alpha, b, \nu)
& =
M^T W(a + M \alpha + Z b) Z
\\
p_{b b}(\alpha, b, \nu)
& =
Z^T W(a + M \alpha + Z b) Z + D^{- 1}
\\
p_{\alpha \nu_k}(\alpha, b, \nu)
& =
0
\\
p_{b \nu_k}(\alpha, b, \nu)
& =
- D^{- 1} E_k D^{- 1} b
\\
p_{\nu_j \nu_k}(\alpha, b, \nu)
& =
b^T D^{- 1} E_j D^{- 1} E_k D^{- 1} b
\\
& \qquad
-
\tfrac{1}{2} \tr \Bigl(
\bigl[ Z^T \widehat{W} Z D + I \bigr]^{- 1}
Z^T \widehat{W} Z E_j
\\
& \qquad \qquad
\bigl[ Z^T \widehat{W} Z D + I \bigr]^{- 1}
Z^T \widehat{W} Z E_k
\Bigr)
\end{align*}
This finishes the derivation of all the derivatives we need.
Recall that in our use of the implicit function theorem we needed
$p_{b b}(\alpha, b^*, \nu)$ to be invertible. From the explicit form
given above we see that it is actually positive definite, because
$W(a + M \alpha + Z b)$ is positive semidefinite by \eqref{eq:w}
and $D^{-1}$ is positive definite (no variance component being zero here).
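For the toy example, \eqref{eq:psi-psi} can be assembled directly from
these formulas. The following sketch evaluates the approximate Fisher
information at any point $(\alpha, b, \nu)$, again with one variance
component, so $E_1 = I$:
<<fisher-sketch>>=
## approximate Fisher information: Schur complement of p_bb
## in the joint Hessian of p, as in the display above
fisher <- function(alpha, b, nu) {
    phi <- drop(a + M %*% alpha + Z %*% b)
    W <- diag(exp(phi))            # Poisson: W(phi) = diag(exp(phi))
    nb <- ncol(Z)
    p.aa <- t(M) %*% W %*% M
    p.ab <- t(M) %*% W %*% Z
    p.bb <- t(Z) %*% W %*% Z + diag(nb) / nu
    p.bn <- - b / nu^2             # p_{b nu}, since E_1 = I
    B <- solve(ZWZ * nu + diag(nb), ZWZ)
    p.nn <- sum(b^2) / nu^3 - sum(B * t(B)) / 2  # tr(B B) = sum(B * t(B))
    p.psipsi <- rbind(cbind(p.aa, 0), c(rep(0, ncol(M)), p.nn))
    p.psib <- rbind(p.ab, p.bn)
    p.psipsi - p.psib %*% solve(p.bb, t(p.psib))
}
fisher(alpha, b, nu)
@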
\subsection{Zero Variance Components} \label{sec:zero}
When some variance components are zero, the corresponding diagonal components
of $D$ are zero, and the corresponding components of $b$ are zero almost surely.
However we deal with this situation, it must have the same effect as omitting
those variance components and the corresponding random effects from the model.
\citet[Section 2.3]{b+c} suggest using the Moore-Penrose pseudoinverse
\citep[Chapter~20]{harville}. Let $D^+$ denote the diagonal matrix whose
diagonal components are the reciprocals of the corresponding diagonal
components of $D$ when those are nonzero and are zero otherwise.
Then
\begin{equation} \label{eq:log-pickle-too-too}
q(\alpha, \nu)
=
- l(a + M \alpha + Z b^*)
+ \tfrac{1}{2} (b^*)^T D^+ b^*
+ \tfrac{1}{2} \log
\det\bigl[ Z^T \widehat{W} Z D + I \bigr]
\end{equation}
approximates minus the missing data log likelihood, but in the calculation of
$b^*$ constrained penalized maximum likelihood must be used:
elements of $b$ corresponding to zero variance components must be
constrained to be zero, because \eqref{eq:log-pickle-too-too} does
not force them to be zero.
Although this proposal \citep[Section 2.3]{b+c} does deal with the
situation where the zero variance components are somehow known, it
does not adequately deal with estimating which variance components are
zero. That is the subject of the following two sections.
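Since $D$ is diagonal, $D^+$ needs only the vector of diagonal components;
a one-line sketch:
<<d-plus>>=
## Moore-Penrose pseudoinverse of diagonal D, represented by diag(D)
dplus <- function(d) ifelse(d > 0, 1 / d, 0)
dplus(c(1 / 4, 0, 3 / 2))
@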
\subsection{The Theory of Constrained Optimization} \label{sec:raw}
\subsubsection{Incorporating Constraints in the Objective Function}
When zero variance components arise, optimization of \eqref{eq:key}
puts us in the
realm of constrained optimization. The theory of constrained optimization
\citep{raw} has a notational bias towards minimization \citep[p.~5]{raw}.
Thus, as explained above (Section~\ref{sec:minimize}) we have switched
from maximization to minimization.
Readers who are familiar with Karush-Kuhn-Tucker\
(\citealp[Section~9.1]{fletcher}; \citealp[Section~12.2]{nocedal-wright})
theory should be warned
that that theory is not adequate for the problem at hand, because the
constraint set is not a closed set and so cannot be defined in terms of
smooth constraint functions.
Thus the need for the more general theory \citep{raw}.
The theory of constrained optimization incorporates constraints in the
objective function by the simple device of defining the objective function
(for a minimization problem) to have the value $+ \infty$ off the constraint
set \citep[Section~1A]{raw}.
Since no point where the objective function has the value $+ \infty$
can minimize it, unless the objective function has the value $+ \infty$
everywhere, which is not the case in any application, the unconstrained
minimizer of this sort of objective function is the same as the constrained
minimizer.
Thus we need to impose constraints on our key function
\eqref{eq:key}, requiring that
each component of $\nu$ be nonnegative and when any component of $\nu$ is
zero the corresponding components of $b$ are also zero. However,
the formula \eqref{eq:key} does not make sense when components of $\nu$
are zero, so we proceed differently.
\subsubsection{Lower Semicontinuous Regularization}
Since all but the middle term on the right-hand side of
\eqref{eq:key} are actually defined on some neighborhood of each point
of the constraint set and differentiable at each point
of the constraint set, we only need to deal with the middle term.
It is the sum of terms of the form $b_i^2 / \nu_k$, where $\nu_k$ is
the variance of $b_i$. Thus we investigate functions of this form
\begin{equation} \label{eq:h}
h(b, \nu)
=
b^2 / \nu
\end{equation}
where, temporarily, $b$ and $\nu$ are scalars rather than vectors
(representing single components of the vectors). In case $\nu > 0$
we have derivatives
\begin{align*}
h_b(b, \nu) & = 2 b / \nu
\\
h_\nu(b, \nu) & = - b^2 / \nu^2
\\
h_{b b}(b, \nu) & = 2 / \nu
\\
h_{b \nu}(b, \nu) & = - 2 b / \nu^2
\\
h_{\nu \nu}(b, \nu) & = 2 b^2 / \nu^3
\end{align*}
The Hessian matrix
$$
h''(b, \nu)
=
\begin{pmatrix}
2 / \nu & - 2 b / \nu^2
\\
- 2 b / \nu^2 & 2 b^2 / \nu^3
\end{pmatrix}
$$
has nonnegative determinants of its principal submatrices, since the
diagonal components are positive and $\det\bigl(h''(b, \nu)\bigr)$ is zero.
Thus the Hessian matrix is nonnegative definite
\citep[Theorem~14.9.11]{harville}, which implies that $h$ itself is
convex \citep[Theorem~2.14]{raw} on the set where $\nu > 0$.
We then extend $h$ to the whole of the constraint set (this just adds
the origin to the points already considered) in two steps.
First we define it to have the value $+ \infty$ at all points not
yet considered (those where $\nu$ is nonpositive).
This gives us an extended-real-valued convex function defined on
all of $\real^2$.
Second we take it to be the
lower semicontinuous (LSC) regularization \citep[p.~14]{raw} of the
function just defined. It is clear that
$$
\liminf_{\substack{b \to \bar{b}\\\nu \searrow 0}} h(b, \nu)
=
\begin{cases}
0, & \bar{b} = 0
\\
+ \infty, & \text{otherwise}
\end{cases}
$$
Thus the LSC regularization is
\begin{equation} \label{eq:h-lsc}
h(b, \nu)
=
\begin{cases}
b^2 / \nu, & \nu > 0
\\
0, & \nu = 0 \opand b = 0
\\
+ \infty, & \text{otherwise}
\end{cases}
\end{equation}
The LSC regularization of a convex function is convex
\citep[Proposition~2.32]{raw}, so \eqref{eq:h-lsc} defines
an extended-real-valued convex function.
Note that $h(b, 0)$ considered as a function of $b$
is minimized at $b = 0$ because that is the only point where this function
is finite. Hence this does enforce the constraint that random effects
corresponding to zero variance components must be zero.
Let $k$ denote the map from indices for $b$ to indices for $\nu$ that gives
corresponding components: $\nu_{k(i)}$ is the variance of $b_i$.
Let $\dim(b)$ denote the number of random effects. Then our objective function
can be written
\begin{equation} \label{eq:key-lsc}
p(\alpha, b, \nu)
=
- l(a + M \alpha + Z b)
+ \tfrac{1}{2} \sum_{i = 1}^{\dim(b)} h(b_i, \nu_{k(i)})
+ \tfrac{1}{2} \log
\det\bigl[ Z^T \widehat{W} Z D + I \bigr]
\end{equation}
where $h$ is given by \eqref{eq:h-lsc}, provided all of the components of
$\nu$ are nonnegative. The proviso is necessary because the third term
on the right-hand side is not defined for all values of $\nu$, only those
such that the argument of the determinant is a positive definite matrix.
Hence, we must separately define $p(\alpha, b, \nu) = + \infty$
whenever any component of $\nu$ is negative.
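The function \eqref{eq:h-lsc} transcribes directly into R (scalar
arguments, as in the display):
<<h-lsc>>=
## LSC regularization h of the display above: three cases
h <- function(b, nu) {
    if (nu > 0) b^2 / nu
    else if (nu == 0 && b == 0) 0
    else Inf
}
c(h(1, 1 / 2), h(0, 0), h(1, 0), h(1, -1))
@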
\subsubsection{Subderivatives}
In calculus we learn that the first derivative is zero at a local minimum
and, therefore, to check points where the first derivative is zero.
This is called Fermat's rule. This
rule no longer works for nonsmooth functions, including those that incorporate
constraints, such as \eqref{eq:key-lsc}. It does, of course, still work
at points in the interior of the constraint set where \eqref{eq:key-lsc}
is differentiable. It does not work to check points on the boundary.
There we need what \citet[Theorem~10.1]{raw}
call \emph{Fermat's rule, generalized:} at a local minimum the
\emph{subderivative function} is nonnegative.
For any extended-real-valued function $f$ on $\real^d$,
the \emph{subderivative function}, denoted $d f(x)$
is also an extended-real-valued function on $\real^d$ defined by
$$
d f(x)(\bar{w}) = \liminf_{\substack{\tau \searrow 0 \\ w \to \bar{w}}}
\frac{f(x + \tau w) - f(x)}{\tau}
$$
\citep[Definition~8.1]{raw}. The notation on the left-hand side is read
the subderivative of $f$ at the point $x$ in the direction $\bar{w}$.
Fortunately, we do not have to use this
definition to calculate subderivatives we want, because the calculus
of subderivatives allows us to use simpler formulas in special cases.
Firstly, there is the notion of subdifferential regularity
\citep[Definition~7.25]{raw}, which we can use without knowing the definition.
The sum of regular functions is regular and the subderivative of a sum
is the sum of the subderivatives \citep[Corollary~10.9]{raw}.
A smooth function is regular and the subderivative is given by
\begin{equation} \label{eq:subderivative-smooth}
d f(x)(w) = w^T f'(x),
\end{equation}
where, as in Sections~\ref{sec:complete} and~\ref{sec:laplace} above,
$f'(x)$ denotes the gradient vector (the vector of partial derivatives)
of $f$ at the point $x$ \citep[Exercise~8.20]{raw}. Every LSC convex function
is regular \citep[Example~7.27]{raw}. Thus in computing subderivatives
of \eqref{eq:key-lsc} we may compute them term by term, and for the
first and last terms, they are given in terms of the partial derivatives
already computed by \eqref{eq:subderivative-smooth}.
For an LSC convex function $f$, we have the following characterization
of the subderivative \citep[Proposition~8.21]{raw}. At any point $x$
where $f(x)$ is finite, the limit
$$
g(w) = \lim_{\tau \searrow 0} \frac{f(x + \tau w) - f(x)}{\tau}
$$
exists and defines a sublinear function $g$, and then $d f(x)$
is the LSC regularization of $g$.
An extended-real-valued
function $g$ is \emph{sublinear} if $g(0) = 0$ and
$$
g(a_1 x_1 + a_2 x_2) \le a_1 g(x_1) + a_2 g(x_2)
$$
for all vectors $x_1$ and $x_2$ and positive scalars $a_1$ and $a_2$
\citep[Definition~3.18]{raw}.
The subderivative function of every regular LSC function
is sublinear \citep[Theorem~7.26]{raw}.
So let us proceed to calculate the subderivative of \eqref{eq:h-lsc}.
In the interior of the constraint set, where this function is smooth,
we can use the partial derivatives already calculated
$$
d h(b, \nu)(u, v) = \frac{2 b u}{\nu} - \frac{b^2 v}{\nu^2}
$$
where the notation on the left-hand side means the subderivative of $h$
at the point $(b, \nu)$ in the direction $(u, v)$.
On the boundary of the constraint set, which consists of the single point
$(0, 0)$, we take limits. In case $v > 0$, we have
$$
\lim_{\tau \searrow 0}
\frac{h(\tau u, \tau v) - h(0, 0)}{\tau}
=
\lim_{\tau \searrow 0}
\frac{\tau^2 u^2 / (\tau v)}{\tau}
=
\lim_{\tau \searrow 0}
\frac{u^2}{v}
=
\frac{u^2}{v}
$$
In case $v < 0$ or in case $v = 0$ and $u \neq 0$,
\begin{equation} \label{eq:difference-quotient}
\frac{h(\tau u, \tau v) - h(0, 0)}{\tau}
\end{equation}
is equal to $+ \infty$ for all $\tau > 0$ so the limit inferior is $+ \infty$.
In case $v = 0$ and $u = 0$, \eqref{eq:difference-quotient} is equal to zero
for all $\tau > 0$ so the limit inferior is zero.
Thus we see that the limit inferior already defines an LSC function and
$$
d h(0, 0)(u, v) = h(u, v).
$$
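A quick numerical check of this calculation, using the R function
\texttt{h} defined above: at the origin the difference quotient is constant
in $\tau$ and equals $h(u, v)$.
<<subderivative-check>>=
## difference quotient (h(tau u, tau v) - h(0, 0)) / tau at the origin
u <- 1; v <- 2
sapply(10^-(1:4), function(tau) (h(tau * u, tau * v) - h(0, 0)) / tau)
h(u, v)    # the same
@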
\subsubsection{Applying the Generalization of Fermat's Rule}
The theory of constrained optimization tells us nothing we did not already
know (from Fermat's rule) about smooth functions. The only way we can
have $d f(x)(w) = w^T f'(x) \ge 0$ for all vectors $w$ is if $f'(x) = 0$.
It is only at points where the function is nonsmooth (in the cases of
interest to us, points on the boundary of the constraint set) that
the theory of constrained optimization tells us things we did not already
know and need to know.
Even on the boundary, the conclusions of the theory about components
of the state that are not on the boundary agree with what we already knew.
We have
$$
d p(\alpha, b, \nu)(s, u, v)
=
s^T p_\alpha(\alpha, b, \nu)
+
\text{terms not containing $s$}
$$
and the only way this can be nonnegative for all $s$ is if
\begin{equation} \label{eq:descent-alpha}
p_\alpha(\alpha, b, \nu) = 0
\end{equation}
in which case
$d p(\alpha, b, \nu)(s, u, v)$ is a constant function of $s$, or, what
is the same thing in other words, the terms of
$d p(\alpha, b, \nu)(s, u, v)$ that appear to involve $s$ are all zero
(and so do not actually involve $s$).
Similarly, $d p(\alpha, b, \nu)(s, u, v) \ge 0$ for all $u_i$ and $v_j$
such that $\nu_j > 0$ and $k(i) = j$ only if
\begin{equation} \label{eq:descent-other-smooth}
\begin{split}
p_{\nu_j}(\alpha, b, \nu) & = 0, \qquad \text{$j$ such that $\nu_j > 0$}
\\
p_{b_i}(\alpha, b, \nu) & = 0, \qquad \text{$i$ such that $\nu_{k(i)} > 0$}
\end{split}
\end{equation}
in which case we conclude that
$d p(\alpha, b, \nu)(s, u, v)$ is a constant function of such $u_i$ and $v_j$.
Thus, assuming that we are at a point $(\alpha, b, \nu)$
where \eqref{eq:descent-alpha}
and \eqref{eq:descent-other-smooth} hold,
and we do assume this throughout the rest of this section,
$d p(\alpha, b, \nu)(s, u, v)$ actually involves only $v_j$ and $u_i$ such
that $\nu_j = 0$ and $k(i) = j$. Define
\begin{equation} \label{eq:key-smooth}
\bar{p}(\alpha, b, \nu)
=
- l(a + M \alpha + Z b)
+ \tfrac{1}{2} \log
\det\bigl[ Z^T \widehat{W} Z D + I \bigr]
\end{equation}
(the part of \eqref{eq:key-lsc} consisting of the smooth terms).
Then
\begin{equation} \label{eq:descent-other-nonsmooth}
\begin{split}
d p(\alpha, b, \nu)(s, u, v)
& =
\sum_{j \in J}
\Biggl[
v_j \bar{p}_{\nu_j}(\alpha, b, \nu)
\\
& \qquad
+
\sum_{i \in k^{- 1}(j)}
\Biggl(
u_i \bar{p}_{b_i}(\alpha, b, \nu)
+
h(u_i, v_j)
\Biggr)
\Biggr]
\end{split}
\end{equation}
where $J$ is the set of $j$ such that $\nu_j = 0$,
where $k^{- 1}(j)$ denotes the set of $i$ such that $k(i) = j$,
and where $h$ is defined by \eqref{eq:h-lsc}.
Fermat's rule generalized says we must consider all of the terms
of \eqref{eq:descent-other-nonsmooth} together.
We cannot consider partial derivatives, because the partial derivatives
do not exist. To check that we are at a local minimum we need to show
that \eqref{eq:descent-other-nonsmooth} is nonnegative
for all vectors $u$ and $v$.
Conversely, to verify that we are not at a local minimum, we need to find
one pair of vectors $u$ and $v$ such that \eqref{eq:descent-other-nonsmooth}
is negative. Such a pair $(u, v)$ we call a \emph{descent direction}.
Since Fermat's rule generalized is a necessary but not sufficient condition
(like the ordinary Fermat's rule), the check that we are at a local minimum
is not definitive, but the check that we are not is. If a descent direction
is found, then moving in that direction away from
the current value of $(\alpha, b, \nu)$ will decrease the objective
function \eqref{eq:key-lsc}.
So how do we find a descent direction? We want to minimize
\eqref{eq:descent-other-nonsmooth} considered as a function of $u$ and $v$
for fixed $\alpha$, $b$, and $\nu$.
Moreover, we can treat the terms of
\eqref{eq:descent-other-nonsmooth} for each $j$ separately.
If the minimum of
\begin{equation} \label{eq:descent-other-nonsmooth-j}
v_j \bar{p}_{\nu_j}(\alpha, b, \nu)
+
\sum_{i \in k^{- 1}(j)}
\Biggl(
u_i \bar{p}_{b_i}(\alpha, b, \nu)
+
h(u_i, v_j)
\Biggr)
\end{equation}
over all vectors $u$ and $v$ is nonnegative, then the minimum is zero,
because \eqref{eq:descent-other-nonsmooth-j} has the value zero
when $u = 0$ and $v = 0$. Thus we can ignore this $j$ in calculating
the descent direction.
On the other hand, if the minimum is negative,
then the minimum does not occur at $v = 0$ and the minimum is actually
$- \infty$ by the sublinearity of the subderivative,
one consequence of sublinearity being positive homogeneity
$$
d f(x)(\tau w) = \tau d f(x)(w), \qquad \tau \ge 0
$$
which holds
for any subderivative. Thus (as our terminology hints) we are only trying
to find a descent \emph{direction}; the length of the vector $(u, v)$ does
not matter, only its direction. Thus to get a finite minimum we can
do a constrained minimization of \eqref{eq:descent-other-nonsmooth-j},
constraining $(u, v)$ to lie in a ball. This is found by the well-known
Karush-Kuhn-Tucker theory of constrained optimization
(\citealp[Section~9.1]{fletcher}; \citealp[Section~12.2]{nocedal-wright})
to be the minimum of the Lagrangian function
\begin{equation} \label{eq:laggard}
L(u, v)
=
\lambda v_j^2
+
v_j \bar{p}_{\nu_j}(\alpha, b, \nu)
+
\sum_{i \in k^{- 1}(j)}
\Biggl(
\lambda
u_i^2
+
u_i \bar{p}_{b_i}(\alpha, b, \nu)
+
\frac{u_i^2}{v_j}
\Biggr)
\end{equation}
where $\lambda > 0$ is the Lagrange multiplier, which would have to be
adjusted if we were interested in constraining $(u, v)$ to lie in a particular
ball. Since we do not care about the length of $(u, v)$ we can use any
$\lambda$. We have replaced $h(u_i, v_j)$ by $u_i^2 / v_j$ because we
know that if we are finding an actual descent direction, then we will
have $v_j > 0$. Now
\begin{align*}
L_{u_i}(u, v)
& =