Add Poisson regression support (WIP). #46
Merged: 10 commits into dask:master, May 3, 2017

Conversation

@mpancia (Contributor) commented May 2, 2017

This adds an implementation of Poisson regression for count data, mimicking what is implemented for Logistic regression. I'm having some trouble debugging it, so I thought I'd submit this PR for a little help.

After testing it, the solver methods all give me wacky output except newton, which seems sane. The other methods give me division-by-zero errors in a logarithm, which I can't manage to track down. This seems to imply that something is off in my implementation of the likelihood.
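For context, here is roughly what the new family looks like as submitted (before the sign fix discussed later in the thread), pieced together from the diff snippets quoted in the review threads below. The Poisson class name, the absolute dask_glm import paths, and the final return of hessian are inferred rather than quoted:

import dask.array as da
from dask_glm.utils import dot, exp  # exp is added to the utils imports in this PR

class Poisson(object):
    @staticmethod
    def loglike(Xbeta, y):
        eXbeta = exp(Xbeta)
        return (y * Xbeta - eXbeta).sum()

    @staticmethod
    def gradient(Xbeta, X, y):
        eXbeta = exp(Xbeta)
        return dot(X.T, eXbeta - y)

    @staticmethod
    def hessian(Xbeta, X):
        eXbeta = exp(Xbeta)
        diag_eXbeta = da.diag(eXbeta)
        return dot(X.T, dot(diag_eXbeta, X))  # final contraction inferred from context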

-from .utils import (
-    sigmoid, dot, add_intercept, mean_squared_error, accuracy_score
-)
+from .utils import (sigmoid, dot, add_intercept, mean_squared_error, accuracy_score, exp, poisson_deviance)
@mpancia (author):

Blame pycharm for cleaning this up.

@@ -38,7 +39,7 @@ def hessian(Xbeta, X):
 class Normal(object):
     @staticmethod
     def loglike(Xbeta, y):
-        return ((y - Xbeta)**2).sum()
+        return ((y - Xbeta) ** 2).sum()
@mpancia (author):
Similar pycharm PEPping.

@mrocklin (Member):
From my perspective most of these changes are welcome.

@mrocklin (Member) left a comment:

Thanks for working on this @mpancia. I'm glad to see how clean this is to add.

I'm unable to review the mathematical contents here. I'll leave that to @moody-marlin and @TomAugspurger.

Question: is it possible to extend some of the larger parametrized tests in tests/test_algos_families.py with the new Poisson family? This would guarantee proper functioning in a number of cases. If this isn't possible (for example, perhaps the test data is inappropriate for this family), can you think of a way to improve those tests to make it possible?

@@ -8,6 +8,7 @@
 from multipledispatch import dispatch

+@dispatch(np.ndarray)
@mrocklin (Member):

I don't think that this is necessary. I also think that this is causing some of your errors.

This function should work the same on all objects. The different array classes are handled by dispatching on exp within this function.
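A minimal sketch of the dispatch pattern being described; the actual definitions in dask_glm/utils.py may differ:

import numpy as np
import dask.array as da
from multipledispatch import dispatch

@dispatch(np.ndarray)
def exp(A):
    # numpy arrays get numpy's eager exp
    return np.exp(A)

@dispatch(da.Array)
def exp(A):
    # dask arrays get dask's lazy exp
    return da.exp(A)

# Family code can then call exp(Xbeta) directly; the correct backend is picked
# by the type of Xbeta, so the family functions need no @dispatch of their own.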


    @staticmethod
    def gradient(Xbeta, X, y):
        eXbeta = exp(Xbeta)
        return dot(X.T, eXbeta - y)
Collaborator:
This looks like the negative gradient to me.

Collaborator:
Actually, see comment above.
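For reference (my own working, not quoted from the thread): dropping the constant log(y_i!) term, the Poisson negative log-likelihood is

    NLL(beta) = sum_i ( exp(x_i' beta) - y_i * (x_i' beta) )

and its gradient with respect to beta is

    grad NLL(beta) = X' (exp(X beta) - y)

which is exactly dot(X.T, eXbeta - y). So the gradient as written matches the negative log-likelihood rather than the log-likelihood that loglike returns.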

    def loglike(Xbeta, y):
        eXbeta = exp(Xbeta)
        yXbeta = y*Xbeta
        return (yXbeta - eXbeta).sum()
Collaborator:
Ah, actually this is probably the issue; what you have written appears to be the true log-likelihood (which makes sense, considering we called this loglike). In fact, this should be the negative log-likelihood, because all of our functions attempt to minimize.

This is a failure of documentation / naming on my part.

@cicdw (Collaborator) commented May 2, 2017

So the problem is most likely due to a sign error; loglike should be the negative log-likelihood. Your implementations of the gradient and the hessian are already for the negative log-likelihood, which explains why newton worked (it never calls the loglike method) and why all the other algorithms failed.

We might want to consider changing this method name altogether to neg_loglike or something...

cc: @TomAugspurger (in case you're working on doc strings).
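A minimal sketch of the fix being described, keeping the current loglike name; the dask_glm.utils import path is assumed:

from dask_glm.utils import exp  # assumed helper; dispatches to np.exp / da.exp

class Poisson(object):
    @staticmethod
    def loglike(Xbeta, y):
        # negative Poisson log-likelihood (up to the constant log(y!) term),
        # so that solvers which minimize this quantity behave correctly;
        # this matches the sign convention already used by gradient and hessian
        eXbeta = exp(Xbeta)
        return (eXbeta - y * Xbeta).sum()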

@mpancia (Contributor, author) commented May 2, 2017

@moody-marlin Well, changing the sign seems to make the output sane. I should have considered a sign error and looked at the other implementations, but I think that (like you mentioned) it might be a good idea to rename the method to neg_loglike, or at least to add a comment that gives the appropriate caveat. :)

@stoneyv commented May 2, 2017

It would be good to include a quasi-Poisson regression method (quasi-likelihood, Wedderburn 1974) similar to R's glm, as well as generalized estimating equations like the R "sandwich" package. You need these to check how close the mean is to the variance, which determines the robustness of your Poisson regression model as an estimator: when the variance is less than or greater than the mean, is this nonlinear estimator better than adding a squared term to the linear model, or than using a negative binomial model? Poisson regression can be a good choice for unbounded counts, rates, or proportions.

Blog article about using quasi-Poisson for count data with variance greater than the mean:
http://www.theanalysisfactor.com/glm-r-overdispersion-count-regression/

Sandwich (agnostic variance / generalized estimating equations):
https://www.jstatsoft.org/article/view/v016i09
https://cran.r-project.org/web/packages/sandwich/sandwich.pdf
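Not part of this PR, but to make the over-dispersion point concrete, one simple diagnostic is the Pearson dispersion statistic (values well above 1 suggest the variance exceeds the mean, pointing toward quasi-Poisson or negative binomial). A sketch:

import numpy as np

def pearson_dispersion(y, mu, n_params):
    # sum of squared Pearson residuals over residual degrees of freedom;
    # mu is the fitted mean exp(X beta), n_params the number of coefficients
    return np.sum((y - mu) ** 2 / mu) / (len(y) - n_params)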

@cicdw (Collaborator) commented May 3, 2017

@stoneyv I agree that over-dispersion is an issue and quasi-likelihood methods are worthy of consideration -- maybe you could raise an issue for either including quasi-likelihood based families or a metrics module that helps diagnose when quasi-likelihood methods are needed?

I don't think over-dispersion metrics / implementations should be included in this PR though - IMO PoissonRegression should avoid interpreting the user's intent and implement straightforward Poisson Regression, which I think is what this PR does.

@cicdw (Collaborator) commented May 3, 2017

@mpancia if you add the Poisson family to the parametrized tests in test_algos_families.py, and then flake the code I think we'll be good to go here.
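A hypothetical sketch of what that might look like (not the repo's actual test): it assumes the families live in dask_glm.families, that they share the loglike(Xbeta, y) / gradient(Xbeta, X, y) signatures seen in this diff, and that loglike is the quantity to minimize per the sign discussion above:

import numpy as np
import pytest
from dask_glm.families import Logistic, Normal, Poisson  # assumed import path

@pytest.mark.parametrize('family', [Logistic, Normal, Poisson])
def test_gradient_step_decreases_loss(family):
    # smoke test: one small gradient step should reduce the family's loss
    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 3))
    y = rng.poisson(np.exp(X.dot(np.array([0.1, -0.2, 0.3])))).astype(float)
    beta = np.zeros(3)
    loss0 = family.loglike(X.dot(beta), y)
    beta = beta - 1e-4 * family.gradient(X.dot(beta), X, y)
    assert family.loglike(X.dot(beta), y) < loss0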

    def hessian(Xbeta, X):
        eXbeta = exp(Xbeta)
        diag_eXbeta = da.diag(eXbeta)
        x_diag_eXbeta = dot(diag_eXbeta, X)
Collaborator:
FWIW I think eXbeta[:, None] * X should broadcast appropriately and save some memory (avoids storing all the 0's from the diag creation).

@mpancia (author):
True. Weird numpy idioms abound.
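A sketch of the broadcasting version suggested above, assuming the Hessian being assembled is X^T diag(exp(Xbeta)) X and reusing the repo's dot/exp helpers:

from dask_glm.utils import dot, exp  # assumed helpers from dask_glm

def hessian(Xbeta, X):
    eXbeta = exp(Xbeta)
    # scale each row of X by exp(x_i . beta) via broadcasting instead of
    # materializing an n-by-n diagonal matrix with da.diag
    return dot(X.T, eXbeta[:, None] * X)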

@mpancia (Contributor, author) commented May 3, 2017

@moody-marlin Looks good, though as @mrocklin mentioned, it seems a bit weird to be using the logistic test data from make_y to test the Normal or Poisson regressions - some family-specific tests based on appropriately-generated data might make sense.

At the same time, those are tests that gradient descent works, so regardless of whether or not the family is appropriate for the actual data, the descent should decrease the loss. shrug
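If family-specific test data is ever added, count targets for the Poisson family are cheap to generate; a hypothetical helper (names are mine, not from the repo):

import numpy as np

def make_poisson_counts(n=1000, p=3, seed=0):
    # draw counts from a Poisson whose mean is the exponentiated linear predictor
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(n, p))
    beta_true = rng.normal(scale=0.5, size=p)
    y = rng.poisson(np.exp(X.dot(beta_true)))
    return X, y, beta_true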

@cicdw (Collaborator) commented May 3, 2017

@mpancia Ah, yeah, good call-out; ultimately this re-raises the question of how to make a truly robust GLM testing framework (#7, #9, etc.). However, I think that opens a new can of worms that is beyond the scope of this PR. I think this provides an informative (and useful) example of how to include future GLM families in dask-glm.

LGTM; I'll let this hang out until tomorrow in case anyone else has further feedback.

@mpancia (Contributor, author) commented May 3, 2017

@moody-marlin Great, thanks. The Travis build is passing now, BTW; you might want to squash this when you merge so as not to muss up the main commit history, or I can rebase it to get a cleaner history.

@cicdw merged commit b251bfc into dask:master on May 3, 2017