Add Poisson regression support (WIP). #46
Conversation
dask_glm/estimators.py
Outdated
```diff
-from .utils import (
-    sigmoid, dot, add_intercept, mean_squared_error, accuracy_score
-)
+from .utils import (sigmoid, dot, add_intercept, mean_squared_error, accuracy_score, exp, poisson_deviance)
```
Blame pycharm for cleaning this up.
```diff
@@ -38,7 +39,7 @@ def hessian(Xbeta, X):
 class Normal(object):
     @staticmethod
     def loglike(Xbeta, y):
-        return ((y - Xbeta)**2).sum()
+        return ((y - Xbeta) ** 2).sum()
```
Similar pycharm PEPping.
From my perspective most of these changes are welcome.
Thanks for working on this @mpancia. I'm glad to see how clean this is to add.
I'm unable to review the mathematical contents here. I'll leave that to @moody-marlin and @TomAugspurger.
Question: is it possible to extend some of the larger parametrized tests in `tests/test_algos_families.py` with the new Poisson family? This would guarantee proper functioning in a number of cases. If this isn't possible (for example, perhaps the test data is inappropriate for this family), can you think of a way to improve those tests to make it possible?
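One possible route, sketched here with hypothetical names (the `make_poisson` helper and its signature are assumptions for illustration, not existing dask_glm API): generate counts whose distribution actually matches the family, then feed that into the existing parametrized checks.

```python
import numpy as np

def make_poisson(n=1000, p=5, seed=20009):
    """Hypothetical test-data helper: draw Poisson counts from a
    log-linear model, mirroring the shape of the existing fixtures."""
    rng = np.random.RandomState(seed)
    X = rng.randn(n, p)
    beta = 0.3 * rng.randn(p)          # small coefficients keep exp() tame
    y = rng.poisson(np.exp(X @ beta))  # counts with E[y | X] = exp(X beta)
    return X, y

X, y = make_poisson()
assert X.shape == (1000, 5)
assert (y >= 0).all()                  # Poisson counts are non-negative
```

Data generated this way could then be added alongside the logistic fixtures in the parametrize lists.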
dask_glm/utils.py
Outdated
```diff
@@ -8,6 +8,7 @@
 from multipledispatch import dispatch


+@dispatch(np.ndarray)
```
I don't think that this is necessary. I also think that this is causing some of your errors.
This function should work the same on all objects; the different array classes are handled by dispatching on `exp` within this function.
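To illustrate the dispatch pattern being described: dask_glm uses `multipledispatch`, but the same idea can be sketched with the standard library's `functools.singledispatch` (this is an illustration only, not dask_glm's actual code):

```python
from functools import singledispatch
import math

@singledispatch
def exp(x):
    # scalar fallback; array types register their own implementations
    return math.exp(x)

@exp.register(list)
def _(x):
    # stand-in for an ndarray / dask.array implementation
    return [math.exp(v) for v in x]

def poisson_mean(Xbeta):
    # family code calls plain exp(); the type branching happens inside
    # exp itself, so no per-array-class variant of this function is needed
    return exp(Xbeta)

print(poisson_mean(0.0))         # 1.0
print(poisson_mean([0.0, 1.0]))  # [1.0, 2.718281828459045]
```

This is why adding a separate `@dispatch(np.ndarray)` variant of the outer function is redundant: only `exp` needs to know about array types.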
```python
    @staticmethod
    def gradient(Xbeta, X, y):
        eXbeta = exp(Xbeta)
        return dot(X.T, eXbeta - y)
```
This looks like the negative gradient to me.
Actually, see comment above.
dask_glm/families.py
Outdated
```python
    def loglike(Xbeta, y):
        eXbeta = exp(Xbeta)
        yXbeta = y * Xbeta
        return (yXbeta - eXbeta).sum()
```
Ah, actually this is probably the issue: what you have written appears to be the true log-likelihood (which makes sense, considering we called this `loglike`). In fact, this should be the negative log-likelihood, because all of our functions attempt to minimize.
This is a documentation / naming failure on my part.
So the problem is most likely due to a sign error. We might want to consider changing this method name altogether. cc: @TomAugspurger (if you're working on doc strings).
@moody-marlin Well, changing the sign seems to make the output sane. I should have considered a sign error and looked at the other implementations, but I think (like you mentioned) it might be a good idea to rename the method.
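A quick way to confirm the sign convention is a finite-difference check. This is a standalone NumPy sketch, not dask_glm code; `nll` and `gradient` simply restate the formulas discussed in this PR under the minimization convention:

```python
import numpy as np

def nll(beta, X, y):
    # negative Poisson log-likelihood, dropping the constant log(y!) term
    Xbeta = X @ beta
    return (np.exp(Xbeta) - y * Xbeta).sum()

def gradient(beta, X, y):
    # matches the PR's dot(X.T, eXbeta - y): the gradient of the
    # *negative* log-likelihood, which is what a minimizer wants
    Xbeta = X @ beta
    return X.T @ (np.exp(Xbeta) - y)

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
beta = 0.1 * rng.randn(3)
y = rng.poisson(np.exp(X @ beta)).astype(float)

# central finite differences along each coordinate
eps = 1e-6
fd = np.array([(nll(beta + eps * e, X, y) - nll(beta - eps * e, X, y)) / (2 * eps)
               for e in np.eye(3)])
assert np.allclose(fd, gradient(beta, X, y), rtol=1e-3, atol=1e-5)
```

If `loglike` (the positive log-likelihood) were paired with this gradient, the check would fail with the opposite sign, which is consistent with the divergence seen in the non-Newton solvers.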
It would be good to include a quasi-Poisson regression method (quasi-likelihood, Wedderburn 1974) similar to the R `glm` package, as well as generalized estimating equations like the R `sandwich` package. You need these functions to check how close the mean is to the variance, to determine the robustness of your Poisson regression model as an estimator: is this nonlinear estimator better than adding a squared term to the linear model, or than using a negative binomial model, when the variance is less or greater than the mean? Poisson regression may be a good choice for unbounded counts, rates, or proportions.
@stoneyv I agree that over-dispersion is an issue and quasi-likelihood methods are worthy of consideration -- maybe you could raise an issue for including quasi-likelihood based families? I don't think over-dispersion metrics / implementations should be included in this PR, though.
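For reference, the mean-vs-variance check being discussed is usually done with the Pearson dispersion statistic. A minimal sketch (`dispersion_estimate` is a hypothetical helper for illustration, not part of this PR or dask_glm):

```python
import numpy as np

def dispersion_estimate(y, mu, n_params):
    """Quasi-Poisson dispersion estimate: Pearson chi-square divided by
    residual degrees of freedom.

    Values well above 1 suggest over-dispersion (quasi-Poisson or a
    negative binomial model may fit better); values near 1 are
    consistent with plain Poisson.
    """
    pearson_chi2 = ((y - mu) ** 2 / mu).sum()
    return pearson_chi2 / (len(y) - n_params)

# perfectly fitted means give zero dispersion
assert dispersion_estimate(np.array([1., 2., 3.]),
                           np.array([1., 2., 3.]), 1) == 0.0
```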
@mpancia if you add the …
dask_glm/families.py
Outdated
```python
    def hessian(Xbeta, X):
        eXbeta = exp(Xbeta)
        diag_eXbeta = da.diag(eXbeta)
        x_diag_eXbeta = dot(diag_eXbeta, X)
```
FWIW I think `eXbeta[:, None] * X` should broadcast appropriately and save some memory (avoids storing all the `0`'s from the `diag` creation).
True. Weird `numpy` idioms abound.
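The equivalence is easy to check with plain NumPy (illustration only; `dask.array` broadcasting follows the same rules):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(5, 3)
eXbeta = np.exp(rng.randn(5))

dense = np.diag(eXbeta) @ X       # builds a 5x5 matrix that is mostly zeros
broadcast = eXbeta[:, None] * X   # same result, no intermediate matrix

assert np.allclose(dense, broadcast)

# so the Hessian X^T diag(eXbeta) X can be written without the diag:
hessian = X.T @ (eXbeta[:, None] * X)
assert hessian.shape == (3, 3)
```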
@moody-marlin Looks good, though it seems a bit weird, as @mrocklin mentioned, to be using the logistic test data. At the same time, those are tests for gradient descent working, so regardless of whether or not the family is appropriate for the actual data, the descent should decrease loss. shrug
@mpancia Ah yeah, good callout; ultimately this re-raises the question of how to make a truly robust GLM testing framework (#7, #9, etc.). However, I think that opens a new can of worms that is beyond the scope of this PR. I think this provides an informative (and useful) example of how to include future GLM families.
LGTM; I'll let this hang out until tomorrow in case anyone else has further feedback.
@moody-marlin Great, thanks. The Travis build is passing now. BTW, you might want to squash this when you merge so as not to muss up the main commit history, or I can rebase it to get a cleaner history.
This adds an implementation of Poisson regression for count data, mimicking what is implemented for logistic regression. I'm having some trouble debugging it, so I thought I'd submit this PR for a little help.
After testing it, the solver methods all give me wacky output except `newton`, which seems sane. The other methods give me division-by-zero errors in a logarithm, which I can't manage to track down. This seems to imply that there's something messed up in my implementation of the likelihood.