-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fitting distributions and computing likelihoods #5
Conversation
This is very welcome and part of the longer term plan for sure. Misc thoughts on designing this right from the beginning to avoid having to redo things later:
I've punted so far on describing the support of a distribution / deciding on behavior when the input data isn't in that support, but this becomes more important once we start fitting distributions. Another question is whether estimated distributions should have different classes from "pure" distributions to enable additional S3 dispatch for things like diagnostic plots, etc. Sorry, that's a lot. Don't worry about all of it, just pick a tiny piece and start there. |
Nice! For the next change let's move to |
So I had a look through Distributions.jl on how they handled fitting distributions using sufficient statistics - they actually have suff_stat(bernoulli(), c(0,0,1,1,1))
#> $successes
#> [1] 3
#>
#> $failures
#> [1] 2
suff_stat(binomial(5), c(3,3,3))
#> $successes
#> [1] 9
#>
#> $experiments
#> [1] 3
#>
#> $trials
#> [1] 5
|
Oh that's nice. No need for |
There's now working and tested fit(bernoulli(), rbinom(100, 1, 0.1))
#> Bernoulli distribution (p = 0.12)
fit(binomial(10), rbinom(100, 10, 0.9))
#>Binomial distribution (size = 10, p = 0.888)
fit(normal(), rnorm(100))
#> normal distribution (mu = -0.0962394013206253, sigma = 1.09850673011021)
fit(exponential(), rexp(100))
#> Exponential distribution (rate = 0.827082214289625)
fit(poisson(1), rpois(100, 1.5))
#> Poisson distribution (lambda = 1.55) I've also added a |
This is the exact same as the bernoulli dist, except it creates a binomial object. The code duplication seems reasonable in this case.
This is stated in the documentation, but no default was set
Also correctly check inputs are counts and not greater than `size`
Checks simple working examples and cases where data passed to fit the binomial are incorrect (negative counts or counts greater than the size parameter of the binomial.
`fit_mle` is now the generic for fitting a distribution with data.
This commit also includes some basic tests for the both functions.
suff_stat.binomial returns a named list {successes, experiments, trials} to be explicit as possible for users.
Along with basic tests
I've created a |
Will review this afternoon! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very excited about this! Left some comments, but I think we're close to merging and moving to next steps!
R/bernoulli.R
Outdated
#' @export | ||
|
||
likelihood.bernoulli <- function(d, x, ...) { | ||
prod(dbinom(x = x, size = 1, prob = d$p)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two thoughts:
- Let's use
pdf()
here rather thandbinom()
. - Products of probabilities tend to underflow. I would compute the sum of log probabilities and then exponential it at the end. So this can just call
log_likelihood()
.
R/bernoulli.R
Outdated
#' @return the log-likelihood | ||
#' @export | ||
log_likelihood.bernoulli <- function(d, x, ...) { | ||
sum(dbinom(x = x, size = 1, prob = d$p, log = TRUE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this but think we should use the generic based approach here. Which means implementing a bunch of log_pdf()
methods ugh. If you're willing to do this, great. If not, I'll merge and take care of it down the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, actually a log_pdf
generic would solve for log_likelihood
and in turn likelihood
. That would also mean that potentially we wouldn't need generics for them? Just have something like
log_likelihood <- function(d, x) {
sum(log_pdf(d, x))
}
likelihood <- function(d, x) {
exp(log_likelihood(d, x))
}
Letting log_pdf
do all the dispatching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if just using log(pdf(d, x))
was a good idea over implementing an entirely new generic that would be nearly identical but it mentions in the docs that the log = TRUE
argument in all the base distributions has a greater range.
n <- 2000
k <- seq(0, n, by = 20)
plot (k, dbinom(k, n, pi/10, log = TRUE), type = "l", ylab = "log density",
main = "dbinom(*, log=TRUE) is better than log(dbinom(*))")
lines(k, log(dbinom(k, n, pi/10)), col = "red", lwd = 2)
So it's worth it really
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah definitely worth it for numerical reasons. Taking the log of tiny numbers is sketchy and I'm sure the log = TRUE
option does smart stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've been thinking about this over the last few days and am just coming into some time to work on it, but wanted to get your thoughts. I think ultimately there's 3 options, and each has their pros and cons and wanted to put them past you.
Option 1
Create a log_pdf
generic and write methods for every distribution. This is straight forward but results in a lot of code duplication and makes updating and adding code down the line slightly more difficult. For example
pdf.binomial <- function(d, x, ...) {
dbinom(x = x, size = d$size, prob = d$p)
}
log_pdf.binomial <- function(d, x, ...) {
dbinom(x = x, size = d$size, prob = d$p, log = TRUE)
}
This seems slightly ridiculous when you look at it.
Option 2
Add a log
argument to pdf
and just create a function log_pdf
that uses this
pdf.binomial <- function(d, x, log = FALSE, ...) {
dbinom(x = x, size = d$size, prob = d$p, log = log)
}
log_pdf <- function(d, x, ...) {
pdf(d, x, log = TRUE, ...)
}
This cuts down on duplication considerably and would be very easy to update existing distributions for. It also falls inline with the APIs for the base R distributions.
Option 3
Option 3 is to make use of dot-dot-dot, but I never really use dot-dot-dot and am not sure of the implications if any
pdf.binomial <- function(d, x, ...) {
dbinom(x = x, size = d$size, prob = d$p, ...)
}
log_pdf <- function(d, x, ...) {
pdf(d, x, log = TRUE, ...)
}
Option 1 gives them most control, because we can write an exact log_pdf
function for every distribution. I'm not sure if there is much benefit in that, but maybe there are cases where this is preferable?
Option 2 and 3 are simple which is nice. Especially given really we just want to take the log of a probability. A downside is that every pdf
method needs to handle the log implementation internally, but that is already done for the base R functions.
My preference would be Option 2 because it's the most explicit. Any thoughts yourself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am quite strongly in favor of Option 1!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grand, consider it done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right they're all written - documentation just points to the pdf
method for each distribution and I've added a log_pdf
example for each too. No tests included though. I can start building them out next.
} | ||
|
||
#' Compute the sufficient statistics for the binomial distribution from data | ||
#' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to document the @return
for all the suff_stat()
methods. This should have two parts: 1. that the return is a list, and the names of the elements of the list, and 2. adding some notes on the sufficient statistics to the original @details
in the bernoulli
object documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each suff_stat
method's @return
is now documented. I haven't gotten to adding documentation to the @details
yet.
R/methods.R
Outdated
#' | ||
#' @param d A probability distribution object such as those created by | ||
#' a call to [bernoulli()], [beta()], or [binomial()]. | ||
#' @param x A vector of data to estimate the parameters of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not estimating parameters of a distribution here, but calculating a likelihood / joint density.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that's from copy/pasting. I'll do a proper clean up and reorg of the docs - they're mostly thrown together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been updated to "A vector of data to compute the likelihood."
#' @export | ||
fit_mle.poisson <- function(d, x, ...) { | ||
ss <- suff_stat(d, x, ...) | ||
poisson(ss$sum / ss$samples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's start a conventions
vignette to keep track of all these names for things to make sure to be consistent down the line. We didn't do that in broom
and it bit us in the ass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, already some divergence on my part. suff_stat.binomial
uses the trials
rather than size
. I was aware that it didn't match the size
argument, but it just didn't seem informative or clear enough in that context. I'm happy to follow whatever convention though of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started a list in #4 that we can use to track for the time being
Also includes a likelihood and log_likelihood function that uses log_pdf
This has become quite a large pull request at this point and encapsulates more than just creating a
There is more documentation that to be written in the |
This is incredible. I'd like to merge this and then do more documentation / distributions in other pull requests. Will you add yourself as a contributor in DESCRIPTION and then I'll merge? |
I've done that there. Honestly this is my pleasure, I've really enjoyed working on this pull request and think |
Hi Alex,
I was literally thinking about the need for this package yesterday, then lo and behold it pops up on Twitter today! I want to use this package in one of my own packages - in particular I need to estimate distributions with data.
I've added a
fit
generic like in Distributions.jl. I've implemented methods fornormal
,binomial
andbernoulli
. They are the simplest parameters to estimate without going into MLE. I figured it was a good place to start.There's also some simple tests added to
test-methods.R
as well. I haven't written any examples because I wasn't sure where you wanted to include it - maybe under each distribution function likepdf
,cdf
andquantile
?I'll continue to add
fit
methods to this branch if this pull request is welcome.Thanks,
Dan