# LMM predictions and prediction intervals

Below we will fit a linear mixed model using the Ruby gem [mixed\_models](https://github.com/agisga/mixed_models) and demonstrate the available prediction methods.

## Data and linear mixed model

We use the same data and model formulation as in several previous examples, where we looked at various parameter estimates ([1](http://nbviewer.ipython.org/github/agisga/mixed_models/blob/master/notebooks/LMM_model_fitting.ipynb)) and demonstrated many types of hypothesis tests as well as confidence intervals ([2](http://nbviewer.ipython.org/github/agisga/mixed_models/blob/master/notebooks/LMM_tests_and_intervals.ipynb)).

The data set, which is simulated, contains two numeric variables *Age* and *Aggression*, and two categorical variables *Location* and *Species*. These data are available for 100 (human and alien) individuals.

We model the *Aggression* level of an individual of *Species* $spcs$ who is at the *Location* $lctn$ as:

$$Aggression = \beta_{0} + \gamma_{spcs} + Age \cdot \beta_{1} + b_{lctn,0} + Age \cdot b_{lctn,1} + \epsilon,$$

where $\epsilon$ is a random residual, and the random vector $(b_{lctn,0}, b_{lctn,1})^T$ follows a multivariate normal distribution (the same distribution but different realizations of the random vector for each *Location*).

We fit this model in `mixed_models` using a syntax familiar from the `R` package `lme4`.

In [1]:
require 'mixed_models'
model_fit = LMM.from_formula(formula: "Aggression ~ Age + Species + (Age | Location)", 
                             data: Daru::DataFrame.from_csv("../examples/data/alien_species.csv"))
model_fit.fix_ef_summary

Daru::DataFrame:69998224639280 rows: 5 cols: 4

| | coef | sd | z_score | WaldZ_p_value |
|---|---|---|---|---|
| intercept | 1016.2867207696772 | 60.19727495932258 | 16.882603431075875 | 0.0 |
| Age | -0.0653161534346766 | 0.0898848636725385 | -0.7266646548258817 | 0.4674314106158888 |
| Species_lvl_Human | -499.693695290209 | 0.2682523406941927 | -1862.77478137594 | 0.0 |
| Species_lvl_Ood | -899.5693213535769 | 0.2814470814004366 | -3196.2289922406044 | 0.0 |
| Species_lvl_WeepingAngel | -199.5889580420076 | 0.2757835779525997 | -723.7158917283754 | 0.0 |


## Predictions and prediction intervals

Often, the objective of a statistical model is the prediction of future observations based on new data input.

We consider the following new data set containing age, geographic location and species for ten individuals.

In [2]:
newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'

Daru::DataFrame:69998224313500 rows: 10 cols: 3

| | Age | Location | Species |
|---|---|---|---|
| 0 | 209 | OodSphere | Dalek |
| 1 | 90 | Earth | Ood |
| 2 | 173 | Asylum | Ood |
| 3 | 153 | Asylum | Human |
| 4 | 255 | OodSphere | WeepingAngel |
| 5 | 256 | Asylum | WeepingAngel |
| 6 | 37 | Earth | Dalek |
| 7 | 146 | Earth | WeepingAngel |
| 8 | 127 | Asylum | WeepingAngel |
| 9 | 41 | Asylum | Ood |


#### Point estimates

Based on the fitted linear mixed model, we can predict the aggression levels of these individuals. The `with_ran_ef` argument specifies whether the random effects estimates are included in the calculation.

In [3]:
puts "Predictions of aggression levels on a new data set:"
pred = model_fit.predict(newdata: newdata, with_ran_ef: true)

Predictions of aggression levels on a new data set:


[1070.9125752531213, 182.45206492790766, -17.064468754763425, 384.78815861991046, 876.1240725686444, 674.711339114886, 1092.6985606350875, 871.150885526236, 687.4629975728096, -4.0162601001437395]

We can now add the computed predictions to the data set, which makes it easier to see which individuals are likely to be particularly dangerous.

In [4]:
newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'
newdata[:Predicted_Agression] = pred
newdata

Daru::DataFrame:69998223261360 rows: 10 cols: 4

| | Age | Location | Species | Predicted_Agression |
|---|---|---|---|---|
| 0 | 209 | OodSphere | Dalek | 1070.9125752531213 |
| 1 | 90 | Earth | Ood | 182.45206492790768 |
| 2 | 173 | Asylum | Ood | -17.064468754763425 |
| 3 | 153 | Asylum | Human | 384.78815861991046 |
| 4 | 255 | OodSphere | WeepingAngel | 876.1240725686444 |
| 5 | 256 | Asylum | WeepingAngel | 674.711339114886 |
| 6 | 37 | Earth | Dalek | 1092.6985606350877 |
| 7 | 146 | Earth | WeepingAngel | 871.150885526236 |
| 8 | 127 | Asylum | WeepingAngel | 687.4629975728096 |
| 9 | 41 | Asylum | Ood | -4.01626010014374 |
9,41,Asylum,Ood,-4.01626010014374


#### Interval estimates

Since the estimated fixed and random effects coefficients are most likely not exactly equal to the true values, interval estimates of the predictions are more informative than the point estimates computed above.

Two types of such interval estimates are currently available in `LMM`. On the one hand, a *confidence interval* is an interval estimate of the mean value of the response for given covariates (i.e. a population parameter); on the other hand, a *prediction interval* is an interval estimate of a future observation (for further explanation of this distinction see for example <https://stat.ethz.ch/education/semesters/ss2010/seminar/06_Handout.pdf>).

In [5]:
puts "88% confidence intervals for the predictions:"
ci = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :confidence)
Daru::DataFrame.new(ci, order: [:pred, :lower88, :upper88])

88% confidence intervals for the predictions:


Daru::DataFrame:69998222468540 rows: 10 cols: 3

| | pred | lower88 | upper88 |
|---|---|---|---|
| 0 | 1002.6356447018298 | 906.2754747934688 | 1098.9958146101908 |
| 1 | 110.83894560697944 | 17.15393227635039 | 204.5239589376085 |
| 2 | 105.41770487190126 | 10.164689101505488 | 200.67072064229703 |
| 3 | 506.59965400396266 | 411.85192033760063 | 601.3473876703247 |
| 4 | 800.0421436018271 | 701.9091186954804 | 898.1751685081738 |
| 5 | 799.9768274483924 | 701.8009464989634 | 898.1527083978215 |
| 6 | 1013.8700230925942 | 920.4439324626674 | 1107.296113722521 |
| 7 | 807.1616043262068 | 712.5717603652894 | 901.7514482871242 |
| 8 | 808.4026112414656 | 714.1916412760585 | 902.6135812068728 |
| 9 | 114.0394371252786 | 20.614036014106333 | 207.46483823645087 |
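
The `pred` column above can be reproduced by hand from the fixed effects estimates alone. The sketch below is plain Ruby with the coefficients copied from the `fix_ef_summary` output. It assumes that *Dalek* is the reference level of *Species*, which is inferred from the fact that it has no coefficient of its own in the summary. The helper `fixed_effects_prediction` is a hypothetical name and not part of `mixed_models`.

```ruby
# Fixed effects estimates copied from the fix_ef_summary output.
INTERCEPT = 1016.2867207696772
AGE_COEF  = -0.0653161534346766

# Assumption: Dalek is the reference level, so its offset is zero.
SPECIES_COEF = {
  'Dalek'        => 0.0,
  'Human'        => -499.693695290209,
  'Ood'          => -899.5693213535769,
  'WeepingAngel' => -199.5889580420076
}.freeze

# Population-level prediction: beta_0 + gamma_spcs + Age * beta_1,
# i.e. the model equation with the random effects set to zero.
def fixed_effects_prediction(age, species)
  INTERCEPT + AGE_COEF * age + SPECIES_COEF.fetch(species)
end

# Rows 0, 1 and 3 of the new data set.
[[209, 'Dalek'], [90, 'Ood'], [153, 'Human']].each do |age, species|
  puts format('%-6s age %3d: %.4f', species, age,
              fixed_effects_prediction(age, species))
end
```

Up to floating point rounding, the results agree with the `pred` values in rows 0, 1 and 3 of the table.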


In [6]:
puts "88% prediction intervals for the predictions:"
pi = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :prediction)
Daru::DataFrame.new(pi, order: [:pred, :lower88, :upper88])

88% prediction intervals for the predictions:


Daru::DataFrame:69998216536580 rows: 10 cols: 3

| | pred | lower88 | upper88 |
|---|---|---|---|
| 0 | 1002.6356447018298 | 809.9100524363739 | 1195.3612369672858 |
| 1 | 110.83894560697944 | -76.53615661744246 | 298.21404783140133 |
| 2 | 105.41770487190126 | -85.09352637970153 | 295.92893612350406 |
| 3 | 506.59965400396266 | 317.0989018065296 | 696.1004062013957 |
| 4 | 800.0421436018271 | 603.7714004192385 | 996.3128867844156 |
| 5 | 799.9768274483924 | 603.6203800394777 | 996.3332748573072 |
| 6 | 1013.8700230925942 | 827.0127254555641 | 1200.7273207296244 |
| 7 | 807.1616043262068 | 617.9767326615571 | 996.3464759908564 |
| 8 | 808.4026112414656 | 619.9754814901145 | 996.8297409928168 |
| 9 | 114.0394371252786 | -72.81614249215542 | 300.89501674271264 |
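
The practical difference between the two interval types is easiest to see by comparing their widths. The following sketch uses only the numbers reported in the two tables (rows 0 to 2) and makes no assumption about how `mixed_models` computes the intervals internally; `half_width` is a helper defined here for illustration.

```ruby
# Point predictions and interval bounds for rows 0..2, copied from the tables.
pred = [1002.6356447018298, 110.83894560697944, 105.41770487190126]
confidence = [[906.2754747934688, 1098.9958146101908],
              [17.15393227635039, 204.5239589376085],
              [10.164689101505488, 200.67072064229703]]
prediction = [[809.9100524363739, 1195.3612369672858],
              [-76.53615661744246, 298.21404783140133],
              [-85.09352637970153, 295.92893612350406]]

# Half of the distance between the lower and the upper bound.
def half_width(bounds)
  lower, upper = bounds
  (upper - lower) / 2.0
end

ci_half = confidence.map { |b| half_width(b) }
pi_half = prediction.map { |b| half_width(b) }

pred.each_index do |i|
  puts format('row %d: pred %.2f, CI half-width %.2f, PI half-width %.2f',
              i, pred[i], ci_half[i], pi_half[i])
end
```

For every row the prediction interval is wider than the confidence interval, as expected: it has to account for the variability of a single future observation in addition to the uncertainty about the mean response.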


**Remark**: You may notice that `#predict` with `with_ran_ef: true` produces some values outside of the confidence intervals. This happens because the confidence intervals are computed from `#predict` with `with_ran_ef: false`.
In contrast, `#predict` with `with_ran_ef: false` should always give values that lie at the center of the confidence or prediction intervals.
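
The gap between the two kinds of point prediction can be examined with the numbers already shown. According to the model equation, the predictions with and without random effects should differ by $b_{lctn,0} + Age \cdot b_{lctn,1}$, a linear function of *Age* for each *Location*. The sketch below works under that assumption: it recovers the two random effects for *Earth* from rows 1 and 6 and checks them against row 7. The recovered values are derived from the printed predictions, not read from the fitted model object.

```ruby
# Earth individuals (rows 1, 6 and 7): age, prediction with random effects,
# and prediction without random effects (the `pred` column of the intervals).
earth = [
  { age: 90,  with_ran_ef: 182.45206492790766, fixed_only: 110.83894560697944 },
  { age: 37,  with_ran_ef: 1092.6985606350875, fixed_only: 1013.8700230925942 },
  { age: 146, with_ran_ef: 871.150885526236,   fixed_only: 807.1616043262068 }
]

# Random effects contribution for each individual.
diffs = earth.map { |r| r[:with_ran_ef] - r[:fixed_only] }

# Solve b0 + age * b1 = diff using the first two individuals.
b1 = (diffs[0] - diffs[1]) / (earth[0][:age] - earth[1][:age])
b0 = diffs[1] - earth[1][:age] * b1

# The third individual was not used above, so it serves as a check.
residual = (b0 + earth[2][:age] * b1) - diffs[2]

puts format('random intercept for Earth: %.4f', b0)
puts format('random Age slope for Earth: %.4f', b1)
puts format('check on row 7 (should be close to zero): %.2e', residual)
```

The third individual is reproduced almost exactly, which is consistent with the random effects entering the prediction as a location-specific intercept and *Age* slope.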