Commit
Update default_priors notebook; change 'exp' link to 'log'
The README correctly stated that Poisson models should use the log link function
(as in https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function),
but in the actual code it was implemented as the exponential.
jake-westfall committed Sep 19, 2016
1 parent 84acda7 commit 8b5f22f
Showing 4 changed files with 67 additions and 23 deletions.
2 changes: 1 addition & 1 deletion bambi/backends.py
@@ -39,7 +39,7 @@ class PyMC3BackEnd(BackEnd):
'identity': lambda x: x,
'logit': theano.tensor.nnet.sigmoid,
'inverse': theano.tensor.inv,
-    'exp': theano.tensor.exp
+    'log': theano.tensor.log
}

def __init__(self):
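An editor's note for orientation (not part of the commit): in GLM terminology, 'log' names the link g(mu) = log(mu) used for Poisson models, with exp as its inverse mapping the linear predictor back to the mean, so the rename aligns the key with the link's conventional name. A small self-contained numpy sketch of that round trip:

```python
# Editor's sketch (numpy, independent of the backend code above): the log
# link and its inverse for a Poisson-style model.
import numpy as np

def log_link(mu):
    """g(mu): map the mean onto the linear-predictor scale."""
    return np.log(mu)

def inv_log_link(eta):
    """g^{-1}(eta): map the linear predictor back to the mean."""
    return np.exp(eta)

mu = np.array([0.5, 1.0, 4.0])                  # candidate Poisson means
assert np.allclose(inv_log_link(log_link(mu)), mu)
```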
2 changes: 1 addition & 1 deletion bambi/config/priors.json
@@ -37,7 +37,7 @@
"mu": "#halfcauchy"
}
],
"link": "exp",
"link": "log",
"parent": "mu"
},
"t": {
14 changes: 7 additions & 7 deletions bambi/models.py
@@ -130,28 +130,28 @@ def build(self):
'logit': self.y.data.std(),
'probit': self.y.data.std(),
'inverse': self.y.data.std(),
-    'exp': self.y.data.std()
+    'log': self.y.data.std()
},
'binomial': {
'identity': self.y.data.std(),
'logit': np.pi / 3**.5,
'probit': 1,
'inverse': self.y.data.std(),
-    'exp': self.y.data.std()
+    'log': self.y.data.std()
},
'poisson': {
'identity': self.y.data.std(),
'logit': self.y.data.std(),
'probit': self.y.data.std(),
'inverse': self.y.data.std(),
-    'exp': self.y.data.std()
+    'log': self.y.data.std()
},
't': {
'identity': self.y.data.std(),
'logit': self.y.data.std(),
'probit': self.y.data.std(),
'inverse': self.y.data.std(),
-    'exp': self.y.data.std()
+    'log': self.y.data.std()
}
}
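One constant above is worth a note: np.pi / 3**.5 is the standard deviation of the standard logistic distribution (variance pi^2/3), which is presumably why it serves as the scale term for the binomial family under the logit link. A quick check (editor's sketch; assumes scipy is installed):

```python
# Verify that pi/sqrt(3) equals the SD of the standard logistic distribution.
import numpy as np
from scipy.stats import logistic

print(np.pi / 3**.5)   # ~1.8138
print(logistic.std())  # ~1.8138, identical
```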

@@ -229,7 +229,7 @@ def fit(self, fixed=None, random=None, priors=None, family='gaussian',
link (str): The model link function to use. Can be either a string
(must be one of the options defined in the current backend;
typically this will include at least 'identity', 'logit',
-    'inverse', and 'exp'), or a callable that takes a 1D ndarray
+    'inverse', and 'log'), or a callable that takes a 1D ndarray
or theano tensor as the sole argument and returns one with
the same shape.
run (bool): Whether or not to immediately begin fitting the model
@@ -285,7 +285,7 @@ def add_formula(self, fixed=None, random=None, priors=None,
link (str): The model link function to use. Can be either a string
(must be one of the options defined in the current backend;
typically this will include at least 'identity', 'logit',
-    'inverse', and 'exp'), or a callable that takes a 1D ndarray
+    'inverse', and 'log'), or a callable that takes a 1D ndarray
or theano tensor as the sole argument and returns one with
the same shape.
categorical (str, list): The names of any variables to treat as
@@ -411,7 +411,7 @@ def add_y(self, variable, prior=None, family='gaussian', link=None, *args,
link (str): The model link function to use. Can be either a string
(must be one of the options defined in the current backend;
typically this will include at least 'identity', 'logit',
-    'inverse', and 'exp'), or a callable that takes a 1D ndarray
+    'inverse', and 'log'), or a callable that takes a 1D ndarray
or theano tensor as the sole argument and returns one with
the same shape.
args, kwargs: Optional positional and keyword arguments to pass
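Following the docstring change, a hypothetical call that exercises the renamed link (editor's sketch: the DataFrame `data` and the columns `count` and `x` are invented for illustration; argument names follow the docstring above):

```python
# Hypothetical usage sketch: fit a Poisson regression with the 'log' link.
from bambi import Model

model = Model(data)                 # `data` is an invented pandas DataFrame
results = model.fit('count ~ x', family='poisson', link='log')
```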
72 changes: 58 additions & 14 deletions examples/about_those_default_priors.ipynb
@@ -11,7 +11,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fixed effects"
"# Fixed effects only, Normal response distribution (LMs)"
]
},
{
@@ -20,21 +20,33 @@
"source": [
"Consider a regression equation (row indices omitted for simplicity):\n",
"\n",
"$Y = \\beta_0 + \\beta_1X_1 + \\ldots + \\beta_pX_p + e$,\n",
"$Y = \\beta_0 + \\beta_1X_1 + \\beta_2X_2 + \\ldots + e$,\n",
"\n",
"where $e \\sim \\text{Normal}(0,\\sigma)$. Our goal is to devise a set of default priors for the regression coefficients that--in the absence of any additional information supplied by the user--will still allow one to obtain reasonable parameter estimates in general. One obviously unsatisfactory solution would be to just use, say, standard Normal distributions for all the priors, i.e.:\n",
"where $e \\sim \\text{Normal}(0,\\sigma^2)$. Our goal is to devise a set of default priors for the regression coefficients that--in the absence of any additional information supplied by the user--will still allow one to obtain reasonable parameter estimates in general. One obviously unsatisfactory solution would be to just use, say, standard Normal distributions for all the priors, i.e.:\n",
"\n",
"$\\beta_j \\sim \\text{Normal}(0, 1)$.\n",
"\n",
"However, this ignores the fact that the observed variables may all be on wildly different scales. So for some predictors the prior may be extremely narrow or “informative,” shrinking estimates strongly toward 0, while for other predictors the prior may be extremely wide or “vague,” leaving the estimates essentially unchanged. \n",
"This ignores the fact that the observed variables may all be on wildly different scales. So for some predictors the prior may be extremely narrow or “informative,” shrinking estimates strongly toward 0, while for other predictors the prior may be extremely wide or “vague,” leaving the estimates essentially unchanged. \n",
"\n",
"One remedy to this differential informativeness issue could be to set the standard deviation of the prior to a very large value, so that the prior is likely to be wide relative to almost any predictor variables. For example,\n",
"\n",
"$\\beta_j \\sim \\text{Normal}(0, 10^{10})$.\n",
"\n",
"This is better, although in principle it suffers from the same problem--that is, a user could still conceivably use variables for which even this prior is implausibly narrow (although in practice it would be unlikely). A different but related worry is that different scalings of the variables in the same dataset--for example, due to changing the units of measurement--will lead to the priors having different levels of informativeness. This seems undesirable because scaling and shifting the variables has no meaningful consequence for traditional test statistics and standardized effect sizes (with some obvious exceptions pertaining to intercepts and models with interaction terms).\n",
"\n",
"The approach we take for the default priors in bambi is to set the prior indirectly by defining the prior on the corresponding partial correlation, and then seeing what scale this implies for the prior on the raw regression coefficient scale. \n",
"This is better, although in principle it suffers from the same problem--that is, a user could still conceivably use variables for which even this prior is implausibly narrow (although in practice it would be unlikely). A different but related worry is that different scalings of the same variable in the same dataset--for example, due to changing the units of measurement--will lead to the prior having different levels of informativeness. This seems undesirable because scaling and shifting the variables has no meaningful consequence for traditional test statistics and standardized effect sizes (with some obvious exceptions pertaining to intercepts and models with interaction terms)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fixed slopes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The approach we take for the default priors on fixed slopes in bambi is to set the prior indirectly, by defining the prior on the corresponding partial correlation and then seeing what scale this implies for the prior on the raw regression coefficient scale. \n",
"\n",
"One can transform the multiple regression coefficient for the predictor $X_j$ into its corresponding partial correlation $\\rho^p_{YX_j}$ (i.e., the partial correlation between the outcome and $X_j$, controlling for all the other predictors) using the identity:\n",
"\n",
@@ -44,10 +44,29 @@
"\n",
"Now we define some prior distribution for $\\rho^p_{YX_j}$ that has mean zero and standard deviation $\\sigma_\\rho$. This implies that\n",
"\n",
"$\\begin{aligned} \\text{var}(\\beta_j) &= \\text{var}(\\rho^p_{YX_j}\\sqrt{\\frac{(1-R^2_{YX_{-j}})\\text{var}(Y)}{(1-R^2_{X_j})\\text{var}(X_j)}}) \\\\ &= \\frac{(1-R^2_{YX_{-j}})\\text{var}(Y)}{(1-R^2_{X_j})\\text{var}(X_j)}\\sigma^2_\\rho \\end{aligned}$.\n",
"$\\begin{aligned} \\text{var}(\\beta_j) &= \\text{var}(\\rho^p_{YX_j}\\sqrt{\\frac{(1-R^2_{YX_{-j}})\\text{var}(Y)}{(1-R^2_{X_jX_{-j}})\\text{var}(X_j)}}) \\\\ &= \\frac{(1-R^2_{YX_{-j}})\\text{var}(Y)}{(1-R^2_{X_jX_{-j}})\\text{var}(X_j)}\\sigma^2_\\rho \\end{aligned}$.\n",
"\n",
"One can tune the width or informativeness of this prior by setting different values of $\\sigma_\\rho$, corresponding to different standard deviations of the distribution of plausible partial correlations. Our default prior is a Normal distribution with zero mean and standard deviation following this scheme, with $\\sigma_\\rho = \\sqrt{1/3} \\approx .577$, which is the standard deviation of a flat prior in the interval [-1,1]. We allow users to specify their priors in terms of $\\sigma_\\rho$, or in terms of four labels: “narrow” meaning $\\sigma_\\rho=.2$, “medium” meaning $\\sigma_\\rho=.4$, “wide” meaning $\\sigma_\\rho=\\sqrt{1/3}$ (i.e., the default), or “superwide” meaning $\\sigma_\\rho=.8$. (Note that the maximum possible standard deviation of a distribution of partial correlations is 1, which would be a distribution with half of the values at $\\rho^p_{YX_j}=-1$ and the other half at $\\rho^p_{YX_j}=1$.) Viewed from this partial correlation perspective, it seems hard to theoretically justify anything wider than our “wide” default, since this would correspond to something that is wider than a flat prior on the partial correlation scale (although we note that, purely practically speaking, there are often no discernible problems in using such a wider prior).\n",
"One can tune the width or informativeness of this prior by setting different values of $\\sigma_\\rho$, corresponding to different standard deviations of the distribution of plausible partial correlations. Our default prior is a Normal distribution with zero mean and standard deviation following this scheme, with $\\sigma_\\rho = \\sqrt{1/3} \\approx .577$, which is the standard deviation of a flat prior in the interval [-1,1]. We allow users to specify their priors in terms of $\\sigma_\\rho$, or in terms of four labels:\n",
"\n",
" - “narrow” meaning $\\sigma_\\rho=.2$,\n",
" - “medium” meaning $\\sigma_\\rho=.4$,\n",
" - “wide” meaning $\\sigma_\\rho=\\sqrt{1/3} \\approx .577$ (i.e., the default), or\n",
" - “superwide” meaning $\\sigma_\\rho=.8$.\n",
"\n",
"Note that the maximum possible standard deviation of a distribution of partial correlations is 1, which would be a distribution with half of the values at $\\rho^p_{YX_j}=-1$ and the other half at $\\rho^p_{YX_j}=1$. Viewed from this partial correlation perspective, it seems hard to theoretically justify anything wider than our “wide” default, since this would correspond to something that is wider than a flat prior on the partial correlation scale (although we note that, purely practically speaking, there are often no discernible problems in using such a wider prior)."
]
},
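To make the scheme concrete, here is an editor's illustration (not from the notebook) that turns the variance identity above into numbers: estimate the two R^2 terms by OLS on toy data, then scale sigma_rho up to the raw-coefficient scale.

```python
# Editor's sketch: the prior SD implied for a raw slope beta_j by a prior
# SD sigma_rho on its partial correlation.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))                        # toy predictors
y = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

def r_squared(outcome, predictors):
    """R^2 from an OLS regression with an intercept."""
    Z = np.column_stack([np.ones(len(outcome)), predictors])
    resid = outcome - Z @ np.linalg.lstsq(Z, outcome, rcond=None)[0]
    return 1 - resid.var() / outcome.var()

j = 0                                              # slope of interest
others = [k for k in range(X.shape[1]) if k != j]
r2_y = r_squared(y, X[:, others])                  # R^2 of Y on X_{-j}
r2_x = r_squared(X[:, j], X[:, others])            # R^2 of X_j on X_{-j}

sigma_rho = np.sqrt(1 / 3)                         # the "wide" default
sd_beta_j = sigma_rho * np.sqrt(
    (1 - r2_y) * y.var() / ((1 - r2_x) * X[:, j].var()))
print(sd_beta_j)
```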
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The intercept / constant term"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The default prior for the intercept $\\beta_0$ must follow a different scheme, since partial correlations with the constant term are undefined. We first note that in ordinary least squares (OLS) regression $\\beta_0 = \\bar{Y} - \\beta_1\\bar{X}_1 - \\beta_2\\bar{X}_2 - \\dots$. So we can set the mean of the prior on $\\beta_0$ to\n",
"\n",
"$\\text{E}[\\beta_0] = \\bar{Y} - \\text{E}[\\beta_1]\\bar{X}_1 - \\text{E}[\\beta_2]\\bar{X}_2 - \\dots$.\n",
@@ -58,29 +89,42 @@
"\n",
"$\\text{var}(\\beta_0) = \\bar{X}_1^2\\text{var}(\\beta_1) + \\bar{X}_2^2\\text{var}(\\beta_2) + \\dots$.\n",
"\n",
"In other words, once we have defined the priors on the slopes, we can combine this with the means of the predictors to find the implied variance of $\\beta_0$. Our default prior for the intercept is a Normal distribution following this scheme, and it assumes that all of the slopes were set to have “wide” priors (as defined above), regardless of what the user actually selected for the width of the slope priors.\n",
"\n"
"In other words, once we have defined the priors on the slopes, we can combine this with the means of the predictors to find the implied variance of $\\beta_0$. Our default prior for the intercept is a Normal distribution following this scheme, and it assumes that all of the slopes were set to have “wide” priors (as defined above), regardless of what the user actually selected for the width of the slope priors."
]
},
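An editor's illustration of the intercept scheme with made-up numbers (it assumes the slope priors are independent, so their variances add):

```python
# Editor's sketch: implied prior moments for the intercept, given slope
# prior means/SDs and the predictor means.
import numpy as np

ybar = 10.0                                  # sample mean of Y
xbar = np.array([2.0, -1.0, 0.5])            # predictor means (made up)
slope_means = np.zeros(3)                    # default slope priors center at 0
slope_sds = np.array([0.8, 1.2, 0.6])        # made-up slope prior SDs

b0_mean = ybar - slope_means @ xbar          # E[beta_0]
b0_sd = np.sqrt((xbar**2) @ (slope_sds**2))  # sqrt of var(beta_0)
print(b0_mean, b0_sd)
```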
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Residual standard deviation ($\\sigma$)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We know that necessarily $0 \\le \\sigma \\le \\text{SD}(Y)$. Ideally we would have a prior with support bounded in $[0,\\text{SD}(Y)]$, negatively skewed with minimum value at 0 (which implies a coefficient of determination $R^2=1$) and maximum value at $\\text{SD}(Y)$ (which implies $R^2=0$). Because there is not such a ready-made distribution in the backend packages underlying Bambi--and because it will tend to make little difference in practice--we simply use $\\sigma \\sim \\text{Uniform}(0,\\text{SD}(Y))$."
]
},
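The bound $0 \le \sigma \le \text{SD}(Y)$ that motivates this Uniform prior is easy to verify numerically (editor's sketch): as long as the model includes an intercept, the OLS residual SD cannot exceed SD(Y).

```python
# Editor's sketch: with an intercept in the model, the OLS residual SD is
# bounded above by SD(Y), so Uniform(0, SD(Y)) spans the plausible range.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = rng.normal(size=200)                     # pure noise still obeys the bound
Z = np.column_stack([np.ones(200), X])
resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
print(resid.std(), "<=", y.std())
assert resid.std() <= y.std()
```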
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Edge cases"
"### Models without a constant term"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random effects"
"# Fixed and random effects, Normal response distribution (LMMs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Other link functions and non-Normal response distributions (i.e., GLMMs)"
"# Non-Normal response distribution (GLMs and GLMMs)"
]
},
{
