Add method for simulating from the posterior (or just add an example to the documentation) #113

Closed
cbrummitt opened this issue Jul 27, 2017 · 9 comments

@cbrummitt (Collaborator)

Estimating the mean and confidence intervals (using prediction_intervals) is great. In some cases, it can be useful to simulate from the posterior distribution of the model's coefficients. An example is given in pages 242–243 of [1].

I think the following code snippet does the trick for a LinearGAM:

import numpy as np


def simulate_from_posterior(linear_gam, X, n_simulations):
    """Simulate from the posterior of a LinearGAM a certain number of times.

    Inputs
    ------
    linear_gam : pygam.LinearGAM

    X : array of shape (n_samples, n_features)

    n_simulations : int
        The number of simulations from the posterior to compute

    Returns
    -------
    simulations : array of shape (n_samples, n_simulations)
    """
    beta_replicates = np.random.multivariate_normal(
        linear_gam.coef_, linear_gam.statistics_['cov'], size=n_simulations)
    return linear_gam._modelmat(X).dot(beta_replicates.T)
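
For illustration, a minimal usage sketch (gam is assumed to be an already-fitted LinearGAM; summarizing the draws with percentiles is just one option):

# Hypothetical usage: empirical 95% intervals from posterior draws
simulations = simulate_from_posterior(gam, X, n_simulations=1000)
lower, upper = np.percentile(simulations, [2.5, 97.5], axis=1)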

I'm not sure if this should be added as an example in the documentation or added to the code (or both).

To implement this in general, I think we'd want to add a method to each Distribution that draws a certain number of samples (called sample or random_variate?), so we'd have NormalDist.sample, BinomialDist.sample, and so on. Then GAM.simulate could just call self.dist.sample(self.coef_, self.statistics_['cov'], size=n_simulations)? I'm not sure yet how best to handle the link functions for these simulations...

As pointed out on pages 256–257 of [1], this procedure simulates the coefficients conditioned on the estimated smoothing parameters, lambda (lam). To simulate the coefficients without conditioning on lam, one may use bootstrap samples to obtain simulations of both the coefficients and the smoothing parameters; an example is given on page 257 of [1].
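
As a rough sketch of that unconditional version (hypothetical helper name; this assumes pygam's LinearGAM.gridsearch re-estimates lam on each bootstrap refit, and reuses the internal _modelmat as in the snippet above):

import numpy as np
from pygam import LinearGAM

def bootstrap_coef_simulations(X, y, n_bootstrap=19, n_simulations=100):
    """For each bootstrap refit (with its own lam), draw coefficient replicates."""
    n_samples = X.shape[0]
    sims = []
    for _ in range(n_bootstrap):
        idx = np.random.randint(0, n_samples, size=n_samples)  # resample rows
        gam = LinearGAM().gridsearch(X[idx], y[idx])           # refit; re-picks lam
        beta_replicates = np.random.multivariate_normal(
            gam.coef_, gam.statistics_['cov'], size=n_simulations)
        sims.append(np.asarray(gam._modelmat(X).dot(beta_replicates.T)))
    # shape (n_samples, n_bootstrap * n_simulations)
    return np.concatenate(sims, axis=1)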

[1] S. Wood. Generalized Additive Models: An Introduction with R (First Edition). Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2006.

@dswah (Owner) commented Jul 27, 2017

@cbrummitt Awesome code sample :) I'm impressed you were able to dig that up.

We should definitely add that functionality to the models.

For linear GAMs your example looks solid. (Ignoring the question of estimating the scale parameter... p. 68, Section 2.1.5 of Wood's book.)

I see your point about each distribution needing its own sampler... I suspect that we can go a long way with numpy's random module.

Perhaps adding the link functions could be similar to the way the gam.predict_mu method handles it? Here is my suggestion, assuming this would be a GAM method (plus or minus a few transposes...)

def simulate_from_posterior(self, X, n_simulations):
    """Simulate from the posterior of a GAM a certain number of times.

    Inputs
    ------
    X : array of shape (n_samples, n_features)

    n_simulations : int
        The number of simulations from the posterior to compute

    Returns
    -------
    simulations : array of shape (n_samples, n_simulations)
    """
    beta_replicates = np.random.multivariate_normal(
        self.coef_, linear_gam.statistics_['cov'], size=n_simulations)
    
    linear_predictor = self._modelmat(X).dot(beta_replicates.T)
    
    mu = self.link.mu(linear_predictor, self.distribution)

    return self.distribution.sample(mu)

Does that look right to you?

Bootstrapping makes sense to me for getting a distribution of (B, lam).
However, fitting multiple GAMs via bootstrap sounds potentially expensive for most models...

Do you have an intuition of how many bootstrap samples is enough?

I will do some reading on the properties of the bootstrap.

@cbrummitt (Collaborator, Author)

That new method is beautiful 😍 Just one edit: linear_gam.statistics_ ➡️ self.statistics_

An alternative to using numpy's random module is the "random variates" method rvs of scipy.stats distributions, which draws samples (in the generic case, by inverse-transform sampling of the CDF).

I don't have good intuition right now for how many bootstrap samples would be a good default value. Wood simulates the coefficients n_simulations=1000 times when conditioning on the estimated smoothing parameters (page 256, Sec. 5.4.2). On the next page (257) he takes 19 bootstrap samples of the smoothing parameters sp, and for each of those values of sp he makes n_simulations=100 simulations of the coefficients. I guess just make it a parameter (n_bootstrap_samples?), and allow some status messages to be printed (verbose=1), because it could take a while?

@cbrummitt (Collaborator, Author)

I'm also not sure about where to transpose. The last line may need to be self.distribution.sample(mu.T).T depending on what shape the distribution.sample method takes.

Because the first axis tends to index samples, I chose to make the shape of the output (n_samples, n_simulations): the first axis is samples (as for X), and the second axis is simulations.
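
A quick check of that shape convention with elementwise sampling (illustrative only; Poisson stands in for any numpy sampler that broadcasts over mu):

import numpy as np

n_samples, n_simulations = 100, 1000
mu = np.random.uniform(size=(n_samples, n_simulations))
draws = np.random.poisson(lam=mu)  # size=None: one draw per entry of mu
assert draws.shape == (n_samples, n_simulations)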

@dswah (Owner) commented Jul 28, 2017

Nice. Yeah, something like n_simulations and n_bootstrap_samples seems like a reasonable API.

The scipy RV sampling looks cool. I'll play around with that a bit and compare it to numpy; let's see which one has the most distributions, the easiest parameterization, is fastest, etc.

@cbrummitt TBH I would love for you to contribute to this code base (even if it's just a few lines!)
Would you make a new branch and add the method that we've been working on?

I can then branch off of your branch, and collaborate on the PR that way.

@cbrummitt (Collaborator, Author)

Sure! I started working on it.

For one point of reference, PyMC3's sample function returns a Trace object that can be queried like a dictionary mapping variable names to numpy arrays, with the first axis being the index of the random sample and the remaining axes being the axes of that variable. This suggests we insert a transpose in the code snippets above.

The sample function runs the step method(s) assigned (or passed) to it for the given number of iterations and returns a Trace object containing the samples collected, in the order they were collected. The trace object can be queried in a similar way to a dict containing a map from variable names to numpy.arrays. The first dimension of the array is the sampling index and the later dimensions match the shape of the variable.
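
To line up with that draw-first convention, the (n_samples, n_simulations) arrays above would simply be transposed (trivial numpy sketch):

import numpy as np

simulations = np.empty((100, 1000))  # (n_samples, n_simulations), as above
draw_first = simulations.T           # (n_simulations, n_samples), PyMC3-style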

@cbrummitt (Collaborator, Author)

numpy.random seems to be about 15% faster than scipy:

import numpy as np
from scipy import stats

n_simulations = 1000
n_samples = 100
mu_samples = np.random.uniform(size=(n_samples, n_simulations))

%timeit -n 200 -r 30 np.random.normal(mu_samples)
# 3.77 ms ± 109 µs per loop (mean ± std. dev. of 30 runs, 200 loops each)

%timeit -n 200 -r 30 stats.norm(loc=mu_samples).rvs()
# 4.3 ms ± 86.7 µs per loop (mean ± std. dev. of 30 runs, 200 loops each)

@cbrummitt (Collaborator, Author) commented Jul 29, 2017

I forked the repo and made a new branch, simulate-from-posterior. I added a method simulate_from_coef_posterior_conditioned_on_smoothing_parameters (a clunky, long name that should be improved) and added a sample method to each distribution using numpy. I have not yet had a chance to test these sample methods or to make sure that I got all the parameters right, in particular the scale parameter.

• Normal: np.random.normal(loc=mu, scale=standard_deviation, size=None) where standard_deviation = self.scale**0.5 if self.scale else 1.0.
• Binomial: np.random.binomial(n=number_of_trials, p=success_probability, size=None) where number_of_trials = self.levels and success_probability = mu / number_of_trials.
• Poisson: np.random.poisson(lam=mu, size=None).
• Gamma: np.random.gamma(shape=shape, scale=scale, size=None) where shape = 1. / self.scale and scale = mu / shape.
• InvGaussian: np.random.wald(mean=mu, scale=self.scale, size=None).

The size parameters are all None so that the result has the same shape as mu.
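
For instance, the Normal case could look roughly like this (class skeleton assumed, not the final pygam API; self.scale here is the estimated variance):

import numpy as np

class NormalDist:
    def __init__(self, scale=None):
        self.scale = scale  # estimated variance, if available

    def sample(self, mu):
        """Draw one normal response per entry of mu (size=None keeps mu's shape)."""
        standard_deviation = self.scale ** 0.5 if self.scale else 1.0
        return np.random.normal(loc=mu, scale=standard_deviation, size=None)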

I also made sample an abstract method. I'm not sure whether you'd like to make Distribution an abstract base class.

I haven't yet tried implementing bootstrap samples to get random samples of the smoothing parameters, too.

Update: Fixed the arguments to the numpy methods in the list above.

@dswah (Owner) commented Jul 29, 2017

😍😭 Love it. I'm planning on working on pygam today and tomorrow; I've got some more work on the other PR.

Push your work whenever you want, and that way I can add to it!

@dswah (Owner) commented Dec 10, 2017

@cbrummitt so awesome man!!

thanks for this :)

dswah closed this as completed Dec 10, 2017