
Fitting distributions and computing likelihoods #5

Merged (30 commits) on May 29, 2019

Conversation

@1danjordan (Contributor) commented May 14, 2019

Hi Alex,

I was literally thinking about the need for this package yesterday, then lo and behold it popped up on Twitter today! I want to use this package in one of my own packages - in particular, I need to estimate distributions from data.

I've added a fit generic like in Distributions.jl and implemented methods for normal, binomial and bernoulli. These are the simplest parameters to estimate without going into full MLE machinery, so I figured it was a good place to start.

There are also some simple tests added to test-methods.R. I haven't written any examples because I wasn't sure where you wanted to include them - maybe under each distribution function like pdf, cdf and quantile?

I'll continue to add fit methods to this branch if this pull request is welcome.

Thanks,
Dan

@alexpghayes (Owner)
This is very welcome and part of the longer term plan for sure. Misc thoughts on designing this right from the beginning to avoid having to redo things later:

  • The generic should be called fit_mle() if it returns models fit using MLE estimators.
  • fit_mle() should probably calculate sufficient statistics and pass these to a generic fit_mle_suff_stat(). I think Distributions.jl does something like this but haven't looked into the details.
  • The documentation for bernoulli distributions should include the likelihood, log-likelihood and sufficient statistics, and fit_mle.bernoulli and fit_mle_suff_stat.bernoulli should inherit this documentation.
  • Once we have a generic log_pdf() (and an aliased log_pmf()) we can have a fallback MLE estimator that uses optim() to maximize sum(log_pdf(distribution, data)). Perhaps we should have likelihood() and log_likelihood() methods as well?
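A rough sketch of what that optim()-based fallback could look like. Everything here is illustrative: update_params() is a hypothetical helper for rebuilding a distribution object from a parameter vector, and a real version would need to handle parameter constraints (e.g. optimizing over log(sigma) rather than sigma).

```r
# Hypothetical fallback MLE: maximize the log-likelihood numerically.
# Assumes a log_pdf() generic exists; update_params() is a made-up helper
# that returns a copy of `d` with new parameter values.
fit_mle_default <- function(d, x, init, ...) {
  neg_log_lik <- function(theta) {
    -sum(log_pdf(update_params(d, theta), x))
  }
  opt <- stats::optim(init, neg_log_lik, method = "BFGS")
  update_params(d, opt$par)
}
```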

I've punted so far on describing the support of a distribution / deciding on behavior when the input data isn't in that support, but this becomes more important once we start fitting distributions.

Another question is whether estimated distributions should have different classes from "pure" distributions to enable additional S3 dispatch for things like diagnostic plots, etc.

Sorry, that's a lot. Don't worry about all of it, just pick a tiny piece and start there.

@alexpghayes (Owner)

Nice! For the next change let's move to fit_mle() and fit_mle_suff_stat()!

@1danjordan (Contributor, Author)

So I had a look through Distributions.jl to see how they handle fitting distributions using sufficient statistics - they have a suffstat generic which works really nicely. I could see it being used outside of fitting distributions to data, even as a way to summarise data in the context of a distribution. So I've implemented a suff_stat generic and written methods for bernoulli and binomial.

suff_stat(bernoulli(), c(0,0,1,1,1))
#>  $successes
#> [1] 3
#>
#> $failures
#> [1] 2

suff_stat(binomial(5), c(3,3,3))
#> $successes
#> [1] 9
#>
#> $experiments
#> [1] 3
#>
#> $trials
#> [1] 5

fit_mle just calls suff_stat and builds the distribution object with them. suff_stat is used to check if input data is correct. Do you still think you need a fit_mle_suff_stat generic?
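For concreteness, a minimal sketch of that pattern using the Bernoulli case (names follow the snippets above; treat this as illustrative rather than the exact code in the PR):

```r
# fit_mle() delegates data validation and summarisation to suff_stat(),
# then plugs the closed-form MLE into the constructor.
fit_mle.bernoulli <- function(d, x, ...) {
  ss <- suff_stat(d, x, ...)  # errors if x is not a vector of 0s and 1s
  p_hat <- ss$successes / (ss$successes + ss$failures)
  bernoulli(p_hat)            # the MLE of p is the sample proportion
}
```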

@alexpghayes (Owner)

Oh that's nice. No need for fit_mle_suff_stat()! I'm a big fan of where this is going.

@1danjordan (Contributor, Author)

There are now working and tested suff_stat and fit_mle methods for bernoulli, binomial, exponential, normal and poisson.

fit(bernoulli(), rbinom(100, 1, 0.1))
#> Bernoulli distribution (p = 0.12)

fit(binomial(10), rbinom(100, 10, 0.9))
#> Binomial distribution (size = 10, p = 0.888)

fit(normal(), rnorm(100))
#> normal distribution (mu = -0.0962394013206253, sigma = 1.09850673011021)

fit(exponential(), rexp(100))
#> Exponential distribution (rate = 0.827082214289625)

fit(poisson(1), rpois(100, 1.5))
#> Poisson distribution (lambda = 1.55)

I've also added a likelihood generic, but haven't implemented any methods or documented it yet. That and a log_likelihood generic will make it easy to do MLE for the harder distributions. Do you think this is going in the right direction?

Commit messages pushed to this branch:

  • This is the exact same as the bernoulli dist, except it creates a binomial object. The code duplication seems reasonable in this case.
  • This is stated in the documentation, but no default was set. Also correctly check inputs are counts and not greater than `size`.
  • Checks simple working examples and cases where data passed to fit the binomial are incorrect (negative counts or counts greater than the size parameter of the binomial).
  • `fit_mle` is now the generic for fitting a distribution with data. This commit also includes some basic tests for both functions.
  • suff_stat.binomial returns a named list {successes, experiments, trials} to be as explicit as possible for users.
@1danjordan (Contributor, Author)

I've created a log_likelihood generic and implemented methods for bernoulli and binomial. Maybe you could review and let me know if this is what you wanted?

@alexpghayes (Owner)

Will review this afternoon!

@alexpghayes (Owner) left a comment

I'm very excited about this! Left some comments, but I think we're close to merging and moving to next steps!

R/bernoulli.R (outdated):

#' @export
likelihood.bernoulli <- function(d, x, ...) {
  prod(dbinom(x = x, size = 1, prob = d$p))
}
@alexpghayes (Owner)

Two thoughts:

  1. Let's use pdf() here rather than dbinom().
  2. Products of probabilities tend to underflow. I would compute the sum of log probabilities and then exponentiate it at the end, so this can just call log_likelihood().

R/bernoulli.R (outdated):

#' @return the log-likelihood
#' @export
log_likelihood.bernoulli <- function(d, x, ...) {
  sum(dbinom(x = x, size = 1, prob = d$p, log = TRUE))
}
@alexpghayes (Owner)

I like this but think we should use the generic based approach here. Which means implementing a bunch of log_pdf() methods ugh. If you're willing to do this, great. If not, I'll merge and take care of it down the line.

@1danjordan (Contributor, Author) commented May 17, 2019

Yeah, actually a log_pdf generic would solve for log_likelihood and in turn likelihood. That would also mean that potentially we wouldn't need generics for them? Just have something like

log_likelihood <- function(d, x) {
  sum(log_pdf(d, x))
}

likelihood <- function(d, x) {
  exp(log_likelihood(d, x))
}

Letting log_pdf do all the dispatching?

@1danjordan (Contributor, Author)

I was wondering whether just using log(pdf(d, x)) would be a good idea, rather than implementing an entirely new generic that would be nearly identical. But the docs mention that the log = TRUE argument in all the base distributions has a greater range:

n <- 2000
k <- seq(0, n, by = 20)
plot (k, dbinom(k, n, pi/10, log = TRUE), type = "l", ylab = "log density",
      main = "dbinom(*, log=TRUE) is better than  log(dbinom(*))")
lines(k, log(dbinom(k, n, pi/10)), col = "red", lwd = 2)

So it's worth it really

@alexpghayes (Owner)

Yeah definitely worth it for numerical reasons. Taking the log of tiny numbers is sketchy and I'm sure the log = TRUE option does smart stuff.

@1danjordan (Contributor, Author)

So I've been thinking about this over the last few days and am just coming into some time to work on it, but wanted to get your thoughts first. I think there are ultimately three options, each with their own pros and cons, and I wanted to put them past you.

Option 1

Create a log_pdf generic and write methods for every distribution. This is straightforward but results in a lot of code duplication and makes updating and adding code down the line slightly more difficult. For example:

pdf.binomial <- function(d, x, ...) {
  dbinom(x = x, size = d$size, prob = d$p)
}

log_pdf.binomial <- function(d, x, ...) {
  dbinom(x = x, size = d$size, prob = d$p, log = TRUE)
}

This seems slightly ridiculous when you look at it.

Option 2

Add a log argument to pdf and create a log_pdf function that uses it:

pdf.binomial <- function(d, x, log = FALSE, ...) {
  dbinom(x = x, size = d$size, prob = d$p, log = log)
}

log_pdf <- function(d, x, ...) {
  pdf(d, x, log = TRUE, ...)
}

This cuts down on duplication considerably and would be very easy to update existing distributions for. It also falls in line with the APIs for the base R distributions.

Option 3

Option 3 is to make use of dot-dot-dot. I never really use dot-dot-dot though, and am not sure of the implications, if any:

pdf.binomial <- function(d, x, ...) {
  dbinom(x = x, size = d$size, prob = d$p, ...)
}

log_pdf <- function(d, x, ...) {
  pdf(d, x, log = TRUE, ...)
}

Option 1 gives the most control, because we can write an exact log_pdf function for every distribution. I'm not sure if there is much benefit in that, but maybe there are cases where this is preferable?

Option 2 and 3 are simple which is nice. Especially given really we just want to take the log of a probability. A downside is that every pdf method needs to handle the log implementation internally, but that is already done for the base R functions.

My preference would be Option 2 because it's the most explicit. Any thoughts yourself?

@alexpghayes (Owner)

I am quite strongly in favor of Option 1!

@1danjordan (Contributor, Author)

Grand, consider it done

@1danjordan (Contributor, Author)

Right, they're all written. The documentation just points to the pdf method for each distribution, and I've added a log_pdf example for each too. No tests included though; I can start building them out next.

#' Compute the sufficient statistics for the binomial distribution from data
#'
@alexpghayes (Owner)

It would be good to document the @return for all the suff_stat() methods. This should have two parts: 1. that the return is a list, and the names of the elements of the list, and 2. adding some notes on the sufficient statistics to the original @details in the bernoulli object documentation.
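As an illustration of the requested documentation (the exact wording and validation logic here are hypothetical, not the code in the PR), a suff_stat() method with a documented @return might look like:

```r
#' Compute the sufficient statistics of the Bernoulli distribution from data
#'
#' @return A named list with two elements: `successes`, the number of ones
#'   in `x`, and `failures`, the number of zeroes in `x`.
#' @export
suff_stat.bernoulli <- function(d, x, ...) {
  if (!all(x %in% c(0, 1))) stop("`x` must only contain 0s and 1s")
  list(successes = sum(x == 1), failures = sum(x == 0))
}
```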

@1danjordan (Contributor, Author)

Each suff_stat method's @return is now documented. I haven't gotten to adding documentation to the @details yet.

R/methods.R (outdated):
#'
#' @param d A probability distribution object such as those created by
#' a call to [bernoulli()], [beta()], or [binomial()].
#' @param x A vector of data to estimate the parameters of the
@alexpghayes (Owner)

We're not estimating parameters of a distribution here, but calculating a likelihood / joint density.

@1danjordan (Contributor, Author)

Sorry that's from copy/pasting. I'll do a proper clean up and reorg of the docs - they're mostly thrown together.

@1danjordan (Contributor, Author)

This has been updated to "A vector of data to compute the likelihood."

#' @export
fit_mle.poisson <- function(d, x, ...) {
  ss <- suff_stat(d, x, ...)
  poisson(ss$sum / ss$samples)
}
@alexpghayes (Owner)

Let's start a conventions vignette to keep track of all these names for things to make sure to be consistent down the line. We didn't do that in broom and it bit us in the ass.

@1danjordan (Contributor, Author)

Yeah, there's already some divergence on my part: suff_stat.binomial uses trials rather than size. I was aware it didn't match the size argument, but size just didn't seem informative or clear enough in that context. I'm happy to follow whatever convention, of course.

@alexpghayes (Owner)

I started a list in #4 that we can use to track for the time being

@1danjordan 1danjordan changed the title Fit generic Fitting distributions and computing likelihoods May 26, 2019
@1danjordan (Contributor, Author)

This has become quite a large pull request at this point and encapsulates more than just creating a fit generic, so I've updated the title of the pull request to reflect that. To summarise what has been completed at this point - this pull request contains:

  • log_pdf methods for each distribution
  • likelihood and log_likelihood functions that rely on each distribution's log_pdf method
  • a suff_stat generic for computing sufficient statistics with methods for the following distributions:
    • Bernoulli
    • Binomial
    • Normal
    • Exponential
    • Poisson
  • a fit function and fit_mle generic for doing maximum likelihood estimation, with methods for
    • Bernoulli
    • Binomial
    • Normal
    • Exponential
    • Poisson
  • a number of tests

There is more documentation to be written in the @details sections of each distribution, and obviously more distributions to write suff_stat and fit_mle methods for.
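Tying the summary together, an illustrative end-to-end use of the pieces in this pull request (exact printed values depend on the random draw, so they are omitted):

```r
set.seed(27)
x <- rpois(100, 1.5)

suff_stat(poisson(1), x)     # named list of sufficient statistics
d_hat <- fit(poisson(1), x)  # dispatches to fit_mle.poisson()
log_likelihood(d_hat, x)     # equivalent to sum(log_pdf(d_hat, x))
```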

@alexpghayes (Owner)

This is incredible. I'd like to merge this and then do more documentation / distributions in other pull requests. Will you add yourself as a contributor in DESCRIPTION and then I'll merge?

@1danjordan (Contributor, Author)

This is incredible. I'd like to merge this and then do more documentation / distributions in other pull requests. Will you add yourself as a contributor in DESCRIPTION and then I'll merge?

I've done that there. Honestly this is my pleasure, I've really enjoyed working on this pull request and think distributions has great potential.

@alexpghayes alexpghayes merged commit d20a390 into alexpghayes:master May 29, 2019
1danjordan pushed a commit to 1danjordan/distributions that referenced this pull request Jul 7, 2019