Add Poisson regression support (WIP). #46
Merged: 10 commits into dask:master, May 3, 2017

Conversation

@mpancia (Contributor) commented May 2, 2017

This adds an implementation of Poisson regression for count data, mimicking what is implemented for Logistic regression. I'm having some trouble debugging it, so I thought I'd submit this PR for a little help.

After testing it, the solver methods all give me wacky output except newton, which seems sane. The other methods give me division-by-zero errors in a logarithm, which I can't manage to track down. This seems to imply that something is off in my implementation of the likelihood.
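For context, here is roughly what the new family looks like as submitted (before the sign fix discussed later in the thread), pieced together from the diff snippets quoted in the review threads below. The Poisson class name, the absolute dask_glm import paths, and the final return of hessian are inferred rather than quoted:

import dask.array as da
from dask_glm.utils import dot, exp  # exp is added to the utils imports in this PR

class Poisson(object):
    @staticmethod
    def loglike(Xbeta, y):
        eXbeta = exp(Xbeta)
        return (y * Xbeta - eXbeta).sum()

    @staticmethod
    def gradient(Xbeta, X, y):
        eXbeta = exp(Xbeta)
        return dot(X.T, eXbeta - y)

    @staticmethod
    def hessian(Xbeta, X):
        eXbeta = exp(Xbeta)
        diag_eXbeta = da.diag(eXbeta)
        return dot(X.T, dot(diag_eXbeta, X))  # final contraction inferred from context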

-from .utils import (
-    sigmoid, dot, add_intercept, mean_squared_error, accuracy_score
-)
+from .utils import (sigmoid, dot, add_intercept, mean_squared_error, accuracy_score, exp, poisson_deviance)
@mpancia (author):

Blame pycharm for cleaning this up.

@@ -38,7 +39,7 @@ def hessian(Xbeta, X):
 class Normal(object):
     @staticmethod
     def loglike(Xbeta, y):
-        return ((y - Xbeta)**2).sum()
+        return ((y - Xbeta) ** 2).sum()
@mpancia (author):
Similar pycharm PEPping.

@mrocklin (Member):
From my perspective most of these changes are welcome.

@mrocklin (Member) left a comment:

Thanks for working on this @mpancia. I'm glad to see how clean this is to add.

I'm unable to review the mathematical contents here. I'll leave that to @moody-marlin and @TomAugspurger.

Question: is it possible to extend some of the larger parametrized tests in tests/test_algos_families.py with the new Poisson family? This would guarantee proper functioning in a number of cases. If this isn't possible (for example, perhaps the test data is inappropriate for this family), can you think of a way to improve those tests to make it possible?

@@ -8,6 +8,7 @@
 from multipledispatch import dispatch

+@dispatch(np.ndarray)
@mrocklin (Member):

I don't think that this is necessary. I also think that this is causing some of your errors.

This function should work the same on all objects. The different array classes are handled by dispatching on exp within this function.
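A minimal sketch of the dispatch pattern being described; the actual definitions in dask_glm/utils.py may differ:

import numpy as np
import dask.array as da
from multipledispatch import dispatch

@dispatch(np.ndarray)
def exp(A):
    # numpy arrays get numpy's eager exp
    return np.exp(A)

@dispatch(da.Array)
def exp(A):
    # dask arrays get dask's lazy exp
    return da.exp(A)

# Family code can then call exp(Xbeta) directly; the correct backend is picked
# by the type of Xbeta, so the family functions need no @dispatch of their own.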


    @staticmethod
    def gradient(Xbeta, X, y):
        eXbeta = exp(Xbeta)
        return dot(X.T, eXbeta - y)
Collaborator:
This looks like the negative gradient to me.

Collaborator:
Actually, see comment above.
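For reference (my own working, not quoted from the thread): dropping the constant log(y_i!) term, the Poisson negative log-likelihood is

    NLL(beta) = sum_i ( exp(x_i' beta) - y_i * (x_i' beta) )

and its gradient with respect to beta is

    grad NLL(beta) = X' (exp(X beta) - y)

which is exactly dot(X.T, eXbeta - y). So the gradient as written matches the negative log-likelihood rather than the log-likelihood that loglike returns.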

    def loglike(Xbeta, y):
        eXbeta = exp(Xbeta)
        yXbeta = y*Xbeta
        return (yXbeta - eXbeta).sum()
Collaborator:
Ah, actually this is probably the issue; what you have written appears to be the true log-likelihood (which makes sense, considering we called this loglike). In fact, this should be the negative log-likelihood, because all of our functions attempt to minimize.

This is a failure of documentation / naming on my part.

@cicdw (Collaborator) commented May 2, 2017

So the problem is most likely due to a sign error; loglike should be the negative log-likelihood. Your implementations of the gradient and the hessian are already for the negative log-likelihood, which explains why newton worked (it never calls the loglike method) and why all the other algorithms failed.

We might want to consider changing this method name altogether to neg_loglike or something...

cc: @TomAugspurger (in case you're working on doc strings).
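A minimal sketch of the fix being described, keeping the current loglike name; the dask_glm.utils import path is assumed:

from dask_glm.utils import exp  # assumed helper; dispatches to np.exp / da.exp

class Poisson(object):
    @staticmethod
    def loglike(Xbeta, y):
        # negative Poisson log-likelihood (up to the constant log(y!) term),
        # so that solvers which minimize this quantity behave correctly;
        # this matches the sign convention already used by gradient and hessian
        eXbeta = exp(Xbeta)
        return (eXbeta - y * Xbeta).sum()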

@mpancia (Contributor, author) commented May 2, 2017

@moody-marlin Well, changing the sign seems to make the output sane. I should have considered a sign error and looked at the other implementations, but I think that (like you mentioned) it might be a good idea to rename the method to neg_loglike, or at least to add a comment that gives the appropriate caveat. :)

@stoneyv commented May 2, 2017

It would be good to include a quasi-Poisson regression method (quasi-likelihood, Wedderburn 1974) similar to R's glm, as well as generalized estimating equations like the R "sandwich" package. You need these to check how close the mean is to the variance, which determines the robustness of your Poisson regression model as an estimator: when the variance is less than or greater than the mean, is this nonlinear estimator better than adding a squared term to the linear model, or than using a negative binomial model? Poisson regression can be a good choice for unbounded counts, rates, or proportions.

Blog article about using quasi-Poisson for count data with variance greater than the mean:
http://www.theanalysisfactor.com/glm-r-overdispersion-count-regression/

Sandwich (agnostic variance / generalized estimating equations):
https://www.jstatsoft.org/article/view/v016i09
https://cran.r-project.org/web/packages/sandwich/sandwich.pdf
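Not part of this PR, but to make the over-dispersion point concrete, one simple diagnostic is the Pearson dispersion statistic (values well above 1 suggest the variance exceeds the mean, pointing toward quasi-Poisson or negative binomial). A sketch:

import numpy as np

def pearson_dispersion(y, mu, n_params):
    # sum of squared Pearson residuals over residual degrees of freedom;
    # mu is the fitted mean exp(X beta), n_params the number of coefficients
    return np.sum((y - mu) ** 2 / mu) / (len(y) - n_params)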

@cicdw (Collaborator) commented May 3, 2017

@stoneyv I agree that over-dispersion is an issue and quasi-likelihood methods are worthy of consideration -- maybe you could raise an issue for either including quasi-likelihood based families or a metrics module that helps diagnose when quasi-likelihood methods are needed?

I don't think over-dispersion metrics / implementations should be included in this PR though - IMO PoissonRegression should avoid interpreting the user's intent and implement straightforward Poisson Regression, which I think is what this PR does.

@cicdw (Collaborator) commented May 3, 2017

@mpancia if you add the Poisson family to the parametrized tests in test_algos_families.py, and then flake the code I think we'll be good to go here.
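A hypothetical sketch of what that might look like (not the repo's actual test): it assumes the families live in dask_glm.families, that they share the loglike(Xbeta, y) / gradient(Xbeta, X, y) signatures seen in this diff, and that loglike is the quantity to minimize per the sign discussion above:

import numpy as np
import pytest
from dask_glm.families import Logistic, Normal, Poisson  # assumed import path

@pytest.mark.parametrize('family', [Logistic, Normal, Poisson])
def test_gradient_step_decreases_loss(family):
    # smoke test: one small gradient step should reduce the family's loss
    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 3))
    y = rng.poisson(np.exp(X.dot(np.array([0.1, -0.2, 0.3])))).astype(float)
    beta = np.zeros(3)
    loss0 = family.loglike(X.dot(beta), y)
    beta = beta - 1e-4 * family.gradient(X.dot(beta), X, y)
    assert family.loglike(X.dot(beta), y) < loss0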

    def hessian(Xbeta, X):
        eXbeta = exp(Xbeta)
        diag_eXbeta = da.diag(eXbeta)
        x_diag_eXbeta = dot(diag_eXbeta, X)
Collaborator:
FWIW I think eXbeta[:, None] * X should broadcast appropriately and save some memory (avoids storing all the 0's from the diag creation).

@mpancia (author):
True. Weird numpy idioms abound.
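A sketch of the broadcasting version suggested above, assuming the Hessian being assembled is X^T diag(exp(Xbeta)) X and reusing the repo's dot/exp helpers:

from dask_glm.utils import dot, exp  # assumed helpers from dask_glm

def hessian(Xbeta, X):
    eXbeta = exp(Xbeta)
    # scale each row of X by exp(x_i . beta) via broadcasting instead of
    # materializing an n-by-n diagonal matrix with da.diag
    return dot(X.T, eXbeta[:, None] * X)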

@mpancia (Contributor, author) commented May 3, 2017

@moody-marlin Looks good, though as @mrocklin mentioned, it seems a bit weird to be using the logistic test data from make_y to test the Normal or Poisson regressions - some family-specific tests based on appropriately-generated data might make sense.

At the same time, those are tests that gradient descent works, so regardless of whether or not the family is appropriate for the actual data, the descent should decrease the loss. shrug
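If family-specific test data is ever added, count targets for the Poisson family are cheap to generate; a hypothetical helper (names are mine, not from the repo):

import numpy as np

def make_poisson_counts(n=1000, p=3, seed=0):
    # draw counts from a Poisson whose mean is the exponentiated linear predictor
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(n, p))
    beta_true = rng.normal(scale=0.5, size=p)
    y = rng.poisson(np.exp(X.dot(beta_true)))
    return X, y, beta_true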

@cicdw (Collaborator) commented May 3, 2017

@mpancia Ah, yeah, good call-out; ultimately this re-raises the question of how to make a truly robust GLM testing framework (#7, #9, etc.). However, I think that opens a new can of worms that is beyond the scope of this PR. I think this provides an informative (and useful) example of how to include future GLM families in dask-glm.

LGTM; I'll let this hang out until tomorrow in case anyone else has further feedback.

@mpancia (Contributor, author) commented May 3, 2017

@moody-marlin Great, thanks. The Travis build is passing now, BTW; you might want to squash this when you merge so as not to muss up the main commit history, or I can rebase it to get a cleaner history.

@cicdw merged commit b251bfc into dask:master on May 3, 2017