Add method for simulating from the posterior (or just add an example to the documentation) #113

Closed
cbrummitt opened this issue Jul 27, 2017 · 9 comments

@cbrummitt (Collaborator)

Estimating the mean and confidence intervals (using prediction_intervals) is great. In some cases, it can be useful to simulate from the posterior distribution of the model's coefficients. An example is given in pages 242–243 of [1].

I think the following code snippet does the trick for a LinearGAM:

import numpy as np


def simulate_from_posterior(linear_gam, X, n_simulations):
    """Simulate from the posterior of a LinearGAM a certain number of times.

    Inputs
    ------
    linear_gam : pygam.LinearGAM

    X : array of shape (n_samples, n_features)

    n_simulations : int
        The number of simulations from the posterior to compute

    Returns
    -------
    simulations : array of shape (n_samples, n_simulations)
    """
    beta_replicates = np.random.multivariate_normal(
        linear_gam.coef_, linear_gam.statistics_['cov'], size=n_simulations)
    return linear_gam._modelmat(X).dot(beta_replicates.T)
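
For illustration, a minimal usage sketch (gam is assumed to be an already-fitted LinearGAM; summarizing the draws with percentiles is just one option):

# Hypothetical usage: empirical 95% intervals from posterior draws
simulations = simulate_from_posterior(gam, X, n_simulations=1000)
lower, upper = np.percentile(simulations, [2.5, 97.5], axis=1)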

I'm not sure if this should be added as an example in the documentation or added to the code (or both).

To implement this in general, I think we'd want to add a method to each Distribution that draws a certain number of samples (called sample or random_variate?), so we'd have NormalDist.sample, BinomialDist.sample, and so on. Then GAM.simulate could just call self.dist.sample(self.coef_, self.statistics_['cov'], size=n_simulations)? I'm not sure yet how best to handle the link functions for these simulations...

As pointed out on pages 256–257 of [1], this procedure simulates the coefficients conditioned on the estimated smoothing parameters, lambda (lam). To simulate the coefficients without conditioning on lam, one may use bootstrap samples to obtain simulations of both the coefficients and the smoothing parameters; an example is given on page 257 of [1].
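
As a rough sketch of that unconditional version (hypothetical helper name; this assumes pygam's LinearGAM.gridsearch re-estimates lam on each bootstrap refit, and reuses the internal _modelmat as in the snippet above):

import numpy as np
from pygam import LinearGAM

def bootstrap_coef_simulations(X, y, n_bootstrap=19, n_simulations=100):
    """For each bootstrap refit (with its own lam), draw coefficient replicates."""
    n_samples = X.shape[0]
    sims = []
    for _ in range(n_bootstrap):
        idx = np.random.randint(0, n_samples, size=n_samples)  # resample rows
        gam = LinearGAM().gridsearch(X[idx], y[idx])           # refit; re-picks lam
        beta_replicates = np.random.multivariate_normal(
            gam.coef_, gam.statistics_['cov'], size=n_simulations)
        sims.append(np.asarray(gam._modelmat(X).dot(beta_replicates.T)))
    # shape (n_samples, n_bootstrap * n_simulations)
    return np.concatenate(sims, axis=1)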

[1] S. Wood. Generalized Additive Models: An Introduction with R (First Edition). Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2006.

@dswah (Owner) commented Jul 27, 2017

@cbrummitt Awesome code sample :) I'm impressed you were able to dig that up.

We should definitely add that functionality to the models.

For linear GAMs your example looks solid. (Ignoring the question of estimating the scale parameter... p. 68, Section 2.1.5 of Wood's book.)

I see your point about each distribution needing its own sampler... I suspect that we can go a long way with numpy's random module.

Perhaps adding the link functions could be similar to the way the gam.predict_mu method handles it? Here is my suggestion, assuming this would be a GAM method (plus or minus a few transposes...)

def simulate_from_posterior(self, X, n_simulations):
    """Simulate from the posterior of a GAM a certain number of times.

    Inputs
    ------
    X : array of shape (n_samples, n_features)

    n_simulations : int
        The number of simulations from the posterior to compute

    Returns
    -------
    simulations : array of shape (n_samples, n_simulations)
    """
    beta_replicates = np.random.multivariate_normal(
        self.coef_, linear_gam.statistics_['cov'], size=n_simulations)
    
    linear_predictor = self._modelmat(X).dot(beta_replicates.T)
    
    mu = self.link.mu(linear_predictor, self.distribution)

    return self.distribution.sample(mu)

Does that look right to you?

Bootstrapping makes sense to me for getting a distribution of (B, lam).
However, fitting multiple GAMs via bootstrap sounds potentially expensive for most models...

Do you have an intuition of how many bootstrap samples is enough?

I will do some reading on the properties of the bootstrap.

@cbrummitt (Collaborator, Author)

That new method is beautiful 😍 Just one edit: linear_gam.statistics_ ➡️ self.statistics_

An alternative to using numpy's random module is the "random variates" method rvs of scipy.stats distributions, which draws samples (in the generic case, by inverse-transform sampling of the CDF).

I don't have good intuition right now for how many bootstrap samples would be a good default value. Wood simulates the coefficients n_simulations=1000 times when conditioning on the estimated smoothing parameters (page 256, Sec. 5.4.2). On the next page (257) he takes 19 bootstrap samples of the smoothing parameters sp, and for each of those values of sp he makes n_simulations=100 simulations of the coefficients. I guess just make it a parameter (n_bootstrap_samples?), and allow some status messages to be printed (verbose=1), because it could take a while?

@cbrummitt (Collaborator, Author)

I'm also not sure about where to transpose. The last line may need to be self.distribution.sample(mu.T).T depending on what shape the distribution.sample method takes.

Because the first axis tends to index samples, I chose to make the shape of the output (n_samples, n_simulations): the first axis is samples (as for X), and the second axis is simulations.
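
A quick check of that shape convention with elementwise sampling (illustrative only; Poisson stands in for any numpy sampler that broadcasts over mu):

import numpy as np

n_samples, n_simulations = 100, 1000
mu = np.random.uniform(size=(n_samples, n_simulations))
draws = np.random.poisson(lam=mu)  # size=None: one draw per entry of mu
assert draws.shape == (n_samples, n_simulations)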

@dswah (Owner) commented Jul 28, 2017

Nice. Yeah, something like n_simulations and n_bootstrap_samples seems like a reasonable API.

The scipy RV sampling looks cool. I'll play around with that a bit and compare it to numpy; let's see which one has the most distributions, the easiest parameterization, is fastest, etc.

@cbrummitt TBH I would love for you to contribute to this code base (even if it's just a few lines!)
Would you make a new branch and add the method that we've been working on?

I can then branch off of your branch, and collaborate on the PR that way.

@cbrummitt (Collaborator, Author)

Sure! I started working on it.

For one point of reference, PyMC3's sample function returns a Trace object that can be queried like a dictionary mapping variable names to numpy arrays, with the first axis being the index of the random sample and the remaining axes being the axes of that variable. This suggests we insert a transpose in the code snippets above.

The sample function runs the step method(s) assigned (or passed) to it for the given number of iterations and returns a Trace object containing the samples collected, in the order they were collected. The trace object can be queried in a similar way to a dict containing a map from variable names to numpy.arrays. The first dimension of the array is the sampling index and the later dimensions match the shape of the variable.
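
To line up with that draw-first convention, the (n_samples, n_simulations) arrays above would simply be transposed (trivial numpy sketch):

import numpy as np

simulations = np.empty((100, 1000))  # (n_samples, n_simulations), as above
draw_first = simulations.T           # (n_simulations, n_samples), PyMC3-style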

@cbrummitt (Collaborator, Author)

numpy.random seems to be about 15% faster than scipy:

import numpy as np
from scipy import stats

n_simulations = 1000
n_samples = 100
mu_samples = np.random.uniform(size=(n_samples, n_simulations))

%timeit -n 200 -r 30 np.random.normal(mu_samples)
# 3.77 ms ± 109 µs per loop (mean ± std. dev. of 30 runs, 200 loops each)

%timeit -n 200 -r 30 stats.norm(loc=mu_samples).rvs()
# 4.3 ms ± 86.7 µs per loop (mean ± std. dev. of 30 runs, 200 loops each)

@cbrummitt (Collaborator, Author) commented Jul 29, 2017

I forked the repo and made a new branch, simulate-from-posterior. I added a method simulate_from_coef_posterior_conditioned_on_smoothing_parameters (a clunky, long name that should be improved) and added a sample method to each distribution using numpy. I have not yet had a chance to test these sample methods or to make sure that I got all the parameters right, in particular the scale parameter.

• Normal: np.random.normal(loc=mu, scale=standard_deviation, size=None) where standard_deviation = self.scale**0.5 if self.scale else 1.0.
• Binomial: np.random.binomial(n=number_of_trials, p=success_probability, size=None) where number_of_trials = self.levels and success_probability = mu / number_of_trials.
• Poisson: np.random.poisson(lam=mu, size=None).
• Gamma: np.random.gamma(shape=shape, scale=scale, size=None) where shape = 1. / self.scale and scale = mu / shape.
• InvGaussian: np.random.wald(mean=mu, scale=self.scale, size=None).

The size parameters are all None so that the result has the same shape as mu.
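
For instance, the Normal case could look roughly like this (class skeleton assumed, not the final pygam API; self.scale here is the estimated variance):

import numpy as np

class NormalDist:
    def __init__(self, scale=None):
        self.scale = scale  # estimated variance, if available

    def sample(self, mu):
        """Draw one normal response per entry of mu (size=None keeps mu's shape)."""
        standard_deviation = self.scale ** 0.5 if self.scale else 1.0
        return np.random.normal(loc=mu, scale=standard_deviation, size=None)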

I also made sample an abstract method. I'm not sure whether you'd like to make Distribution an abstract base class.

I haven't yet tried implementing bootstrap samples to get random samples of the smoothing parameters, too.

Update: Fixed the arguments to the numpy methods in the list above.

@dswah (Owner) commented Jul 29, 2017

😍😭 Love it. I'm planning on working on pygam today and tomorrow; I've got some more work on the other PR.

Push your work whenever you want, and that way I can add to it!

@dswah (Owner) commented Dec 10, 2017

@cbrummitt so awesome man!!

thanks for this :)

dswah closed this as completed Dec 10, 2017