
Fitting distributions and computing likelihoods #5

Merged (30 commits) on May 29, 2019

Conversation

@1danjordan (Contributor) commented May 14, 2019

Hi Alex,

I was literally thinking about the need for this package yesterday, then lo and behold it popped up on Twitter today! I want to use this package in one of my own packages - in particular, I need to estimate distributions from data.

I've added a fit generic like in Distributions.jl and implemented methods for normal, binomial and bernoulli. These are the simplest parameters to estimate without going into full MLE machinery, so I figured it was a good place to start.

There are also some simple tests added to test-methods.R. I haven't written any examples because I wasn't sure where you wanted to include them - maybe under each distribution function like pdf, cdf and quantile?

I'll continue to add fit methods to this branch if this pull request is welcome.

Thanks,
Dan

@alexpghayes (Owner)
This is very welcome and part of the longer term plan for sure. Misc thoughts on designing this right from the beginning to avoid having to redo things later:

  • The generic should be called fit_mle() if it returns models fit using MLE estimators.
  • fit_mle() should probably calculate sufficient statistics and pass these to a generic fit_mle_suff_stat(). I think Distributions.jl does something like this but haven't looked into the details.
  • The documentation for bernoulli distributions should include the likelihood, log-likelihood and sufficient statistics, and fit_mle.bernoulli and fit_mle_suff_stat.bernoulli should inherit this documentation.
  • Once we have a generic log_pdf() (and an aliased log_pmf()) we can have a fallback MLE estimator that uses optim() to maximize sum(log_pdf(distribution, data)). Perhaps we should have likelihood() and log_likelihood() methods as well?
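A rough sketch of what that optim()-based fallback could look like. Everything here is illustrative: update_params() is a hypothetical helper for rebuilding a distribution object from a parameter vector, and a real version would need to handle parameter constraints (e.g. optimizing over log(sigma) rather than sigma).

```r
# Hypothetical fallback MLE: maximize the log-likelihood numerically.
# Assumes a log_pdf() generic exists; update_params() is a made-up helper
# that returns a copy of `d` with new parameter values.
fit_mle_default <- function(d, x, init, ...) {
  neg_log_lik <- function(theta) {
    -sum(log_pdf(update_params(d, theta), x))
  }
  opt <- stats::optim(init, neg_log_lik, method = "BFGS")
  update_params(d, opt$par)
}
```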

I've punted so far on describing the support of a distribution / deciding on behavior when the input data isn't in that support, but this becomes more important once we start fitting distributions.

Another question is whether estimated distributions should have different classes from "pure" distributions to enable additional S3 dispatch for things like diagnostic plots, etc.

Sorry, that's a lot. Don't worry about all of it, just pick a tiny piece and start there.

@alexpghayes (Owner)

Nice! For the next change let's move to fit_mle() and fit_mle_suff_stat()!

@1danjordan (Contributor, Author)

So I had a look through Distributions.jl to see how they handle fitting distributions using sufficient statistics - they have a suffstat generic which works really nicely. I could see it being used outside of fitting distributions to data, even as a way to summarise data in the context of a distribution. So I've implemented a suff_stat generic and written methods for bernoulli and binomial.

suff_stat(bernoulli(), c(0,0,1,1,1))
#>  $successes
#> [1] 3
#>
#> $failures
#> [1] 2

suff_stat(binomial(5), c(3,3,3))
#> $successes
#> [1] 9
#>
#> $experiments
#> [1] 3
#>
#> $trials
#> [1] 5

fit_mle just calls suff_stat and builds the distribution object with them. suff_stat is used to check if input data is correct. Do you still think you need a fit_mle_suff_stat generic?
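For concreteness, a minimal sketch of that pattern using the Bernoulli case (names follow the snippets above; treat this as illustrative rather than the exact code in the PR):

```r
# fit_mle() delegates data validation and summarisation to suff_stat(),
# then plugs the closed-form MLE into the constructor.
fit_mle.bernoulli <- function(d, x, ...) {
  ss <- suff_stat(d, x, ...)  # errors if x is not a vector of 0s and 1s
  p_hat <- ss$successes / (ss$successes + ss$failures)
  bernoulli(p_hat)            # the MLE of p is the sample proportion
}
```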

@alexpghayes (Owner)

Oh that's nice. No need for fit_mle_suff_stat()! I'm a big fan of where this is going.

@1danjordan (Contributor, Author)

There are now working and tested suff_stat and fit_mle methods for bernoulli, binomial, exponential, normal and poisson.

fit(bernoulli(), rbinom(100, 1, 0.1))
#> Bernoulli distribution (p = 0.12)

fit(binomial(10), rbinom(100, 10, 0.9))
#> Binomial distribution (size = 10, p = 0.888)

fit(normal(), rnorm(100))
#> normal distribution (mu = -0.0962394013206253, sigma = 1.09850673011021)

fit(exponential(), rexp(100))
#> Exponential distribution (rate = 0.827082214289625)

fit(poisson(1), rpois(100, 1.5))
#> Poisson distribution (lambda = 1.55)

I've also added a likelihood generic, but haven't implemented any methods or documented it yet. That and a log_likelihood generic will make it easy to do MLE for the harder distributions. Do you think this is going in the right direction?

Commit messages pushed to this branch:

  • This is the exact same as the bernoulli dist, except it creates a binomial object. The code duplication seems reasonable in this case.
  • This is stated in the documentation, but no default was set. Also correctly check inputs are counts and not greater than `size`.
  • Checks simple working examples and cases where data passed to fit the binomial are incorrect (negative counts or counts greater than the size parameter of the binomial).
  • `fit_mle` is now the generic for fitting a distribution with data. This commit also includes some basic tests for both functions.
  • suff_stat.binomial returns a named list {successes, experiments, trials} to be as explicit as possible for users.
@1danjordan (Contributor, Author)

I've created a log_likelihood generic and implemented methods for bernoulli and binomial. Maybe you could review and let me know if this is what you wanted?

@alexpghayes (Owner)

Will review this afternoon!

@alexpghayes (Owner) left a comment

I'm very excited about this! Left some comments, but I think we're close to merging and moving to next steps!

R/bernoulli.R (outdated):

#' @export
likelihood.bernoulli <- function(d, x, ...) {
  prod(dbinom(x = x, size = 1, prob = d$p))
}
@alexpghayes (Owner)

Two thoughts:

  1. Let's use pdf() here rather than dbinom().
  2. Products of probabilities tend to underflow. I would compute the sum of log probabilities and then exponentiate it at the end, so this can just call log_likelihood().

R/bernoulli.R (outdated):

#' @return the log-likelihood
#' @export
log_likelihood.bernoulli <- function(d, x, ...) {
  sum(dbinom(x = x, size = 1, prob = d$p, log = TRUE))
}
@alexpghayes (Owner)

I like this but think we should use the generic based approach here. Which means implementing a bunch of log_pdf() methods ugh. If you're willing to do this, great. If not, I'll merge and take care of it down the line.

@1danjordan (Contributor, Author) commented May 17, 2019

Yeah, actually a log_pdf generic would solve for log_likelihood and in turn likelihood. That would also mean that potentially we wouldn't need generics for them? Just have something like

log_likelihood <- function(d, x) {
  sum(log_pdf(d, x))
}

likelihood <- function(d, x) {
  exp(log_likelihood(d, x))
}

Letting log_pdf do all the dispatching?

@1danjordan (Contributor, Author)

I was wondering whether just using log(pdf(d, x)) would be a good idea, rather than implementing an entirely new generic that would be nearly identical. But the docs mention that the log = TRUE argument in all the base distributions has a greater range:

n <- 2000
k <- seq(0, n, by = 20)
plot (k, dbinom(k, n, pi/10, log = TRUE), type = "l", ylab = "log density",
      main = "dbinom(*, log=TRUE) is better than  log(dbinom(*))")
lines(k, log(dbinom(k, n, pi/10)), col = "red", lwd = 2)

So it's worth it really

@alexpghayes (Owner)

Yeah definitely worth it for numerical reasons. Taking the log of tiny numbers is sketchy and I'm sure the log = TRUE option does smart stuff.

@1danjordan (Contributor, Author)

So I've been thinking about this over the last few days and am just coming into some time to work on it, but wanted to get your thoughts first. I think there are ultimately three options, each with their own pros and cons, and I wanted to put them past you.

Option 1

Create a log_pdf generic and write methods for every distribution. This is straightforward but results in a lot of code duplication and makes updating and adding code down the line slightly more difficult. For example:

pdf.binomial <- function(d, x, ...) {
  dbinom(x = x, size = d$size, prob = d$p)
}

log_pdf.binomial <- function(d, x, ...) {
  dbinom(x = x, size = d$size, prob = d$p, log = TRUE)
}

This seems slightly ridiculous when you look at it.

Option 2

Add a log argument to pdf and create a log_pdf function that uses it:

pdf.binomial <- function(d, x, log = FALSE, ...) {
  dbinom(x = x, size = d$size, prob = d$p, log = log)
}

log_pdf <- function(d, x, ...) {
  pdf(d, x, log = TRUE, ...)
}

This cuts down on duplication considerably and would be very easy to update existing distributions for. It also falls in line with the APIs for the base R distributions.

Option 3

Option 3 is to make use of dot-dot-dot. I never really use dot-dot-dot though, and am not sure of the implications, if any:

pdf.binomial <- function(d, x, ...) {
  dbinom(x = x, size = d$size, prob = d$p, ...)
}

log_pdf <- function(d, x, ...) {
  pdf(d, x, log = TRUE, ...)
}

Option 1 gives the most control, because we can write an exact log_pdf function for every distribution. I'm not sure if there is much benefit in that, but maybe there are cases where this is preferable?

Option 2 and 3 are simple which is nice. Especially given really we just want to take the log of a probability. A downside is that every pdf method needs to handle the log implementation internally, but that is already done for the base R functions.

My preference would be Option 2 because it's the most explicit. Any thoughts yourself?

@alexpghayes (Owner)

I am quite strongly in favor of Option 1!

@1danjordan (Contributor, Author)

Grand, consider it done

@1danjordan (Contributor, Author)

Right, they're all written. The documentation just points to the pdf method for each distribution, and I've added a log_pdf example for each too. No tests included though; I can start building them out next.

#' Compute the sufficient statistics for the binomial distribution from data
#'
@alexpghayes (Owner)

It would be good to document the @return for all the suff_stat() methods. This should have two parts: 1. that the return is a list, and the names of the elements of the list, and 2. adding some notes on the sufficient statistics to the original @details in the bernoulli object documentation.
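As an illustration of the requested documentation (the exact wording and validation logic here are hypothetical, not the code in the PR), a suff_stat() method with a documented @return might look like:

```r
#' Compute the sufficient statistics of the Bernoulli distribution from data
#'
#' @return A named list with two elements: `successes`, the number of ones
#'   in `x`, and `failures`, the number of zeroes in `x`.
#' @export
suff_stat.bernoulli <- function(d, x, ...) {
  if (!all(x %in% c(0, 1))) stop("`x` must only contain 0s and 1s")
  list(successes = sum(x == 1), failures = sum(x == 0))
}
```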

@1danjordan (Contributor, Author)

Each suff_stat method's @return is now documented. I haven't gotten to adding documentation to the @details yet.

R/methods.R (outdated):
#'
#' @param d A probability distribution object such as those created by
#' a call to [bernoulli()], [beta()], or [binomial()].
#' @param x A vector of data to estimate the parameters of the
@alexpghayes (Owner)

We're not estimating parameters of a distribution here, but calculating a likelihood / joint density.

@1danjordan (Contributor, Author)

Sorry that's from copy/pasting. I'll do a proper clean up and reorg of the docs - they're mostly thrown together.

@1danjordan (Contributor, Author)

This has been updated to "A vector of data to compute the likelihood."

#' @export
fit_mle.poisson <- function(d, x, ...) {
  ss <- suff_stat(d, x, ...)
  poisson(ss$sum / ss$samples)
}
@alexpghayes (Owner)

Let's start a conventions vignette to keep track of all these names for things to make sure to be consistent down the line. We didn't do that in broom and it bit us in the ass.

@1danjordan (Contributor, Author)

Yeah, there's already some divergence on my part: suff_stat.binomial uses trials rather than size. I was aware it didn't match the size argument, but size just didn't seem informative or clear enough in that context. I'm happy to follow whatever convention, of course.

@alexpghayes (Owner)

I started a list in #4 that we can use to track for the time being

@1danjordan 1danjordan changed the title Fit generic Fitting distributions and computing likelihoods May 26, 2019
@1danjordan (Contributor, Author)

This has become quite a large pull request at this point and encapsulates more than just creating a fit generic, so I've updated the title of the pull request to reflect that. To summarise what has been completed at this point - this pull request contains:

  • log_pdf methods for each distribution
  • likelihood and log_likelihood functions that rely on each distribution's log_pdf method
  • a suff_stat generic for computing sufficient statistics with methods for the following distributions:
    • Bernoulli
    • Binomial
    • Normal
    • Exponential
    • Poisson
  • a fit function and fit_mle generic for doing maximum likelihood estimation, with methods for
    • Bernoulli
    • Binomial
    • Normal
    • Exponential
    • Poisson
  • a number of tests

There is more documentation to be written in the @details sections of each distribution, and obviously more distributions to write suff_stat and fit_mle methods for.
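Tying the summary together, an illustrative end-to-end use of the pieces in this pull request (exact printed values depend on the random draw, so they are omitted):

```r
set.seed(27)
x <- rpois(100, 1.5)

suff_stat(poisson(1), x)     # named list of sufficient statistics
d_hat <- fit(poisson(1), x)  # dispatches to fit_mle.poisson()
log_likelihood(d_hat, x)     # equivalent to sum(log_pdf(d_hat, x))
```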

@alexpghayes (Owner)

This is incredible. I'd like to merge this and then do more documentation / distributions in other pull requests. Will you add yourself as a contributor in DESCRIPTION and then I'll merge?

@1danjordan (Contributor, Author)

This is incredible. I'd like to merge this and then do more documentation / distributions in other pull requests. Will you add yourself as a contributor in DESCRIPTION and then I'll merge?

I've done that there. Honestly this is my pleasure, I've really enjoyed working on this pull request and think distributions has great potential.

@alexpghayes alexpghayes merged commit d20a390 into alexpghayes:master May 29, 2019
1danjordan pushed a commit to 1danjordan/distributions that referenced this pull request Jul 7, 2019