LogisticRegression cannot train from Dask DataFrame #84

julioasotodv opened this Issue Nov 4, 2017 · 15 comments


julioasotodv commented Nov 4, 2017

A simple example:

from dask import dataframe as dd
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=2)

X = dd.from_dask_array(X, columns=["a","b"])
y = dd.from_array(y)

lr = LogisticRegression()
lr.fit(X, y)

Returns KeyError: (<class 'dask.dataframe.core.DataFrame'>,)

I did not have time to check whether this is also the case for other models.


TomAugspurger commented Nov 6, 2017

Thanks. At the moment the dask_glm based estimators just work with dask arrays, not dataframes. You can use .values to get the array.

I'm hoping to put in some helpers for handling all the extra DataFrame metadata sometime soon, so this will be more consistent across estimators.


julioasotodv commented Nov 6, 2017

Thank you so much for the quick response!

The problem is that when fitting a GLM with an intercept (which is usually the case), the dask array containing the features needs to have a defined chunk size, which I believe is not possible when the array comes from a dataframe.

Anyways, I will reach out to the main dask issue page and ask there.

Thank you!
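For reference, a quick way to see the chunk-size problem (a sketch continuing the snippet from the first comment; the exact chunk sizes depend on the dataset defaults):

from dask import dataframe as dd
from dask_glm.datasets import make_classification

# X is a dask array with known chunk sizes along both axes
X, y = make_classification(n_samples=10000, n_features=2)
df = dd.from_dask_array(X, columns=["a", "b"])

print(X.chunks)          # known sizes along both axes
print(df.values.chunks)  # first axis comes back as (nan, nan, ...) after the round-trip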


TomAugspurger commented Nov 6, 2017

@julioasotodv, yes I forgot about that case. Let me put something together quick.


julioasotodv commented Nov 6, 2017

Do you think there is a way to achieve this without making changes to dask's engine itself?


TomAugspurger commented Nov 7, 2017


julioasotodv commented Nov 7, 2017

I see. Would it work with that fix, even if chunksize is not defined for the underlying dask array?


TomAugspurger commented Nov 7, 2017

Yes, that should work. The solvers only require that the shape along the second axis is known:

from dask import dataframe as dd
from dask_ml.linear_model import LinearRegression
from dask_ml.datasets import make_regression

X, y = make_regression(chunks=50)

df = dd.from_dask_array(X)
X2 = df.values  # dask.array with unknown chunks along first dim

lm = LinearRegression(fit_intercept=False)
lm.fit(X2, y)

Note that fit_intercept does not currently work with unknown chunks. But when dask/dask-glm@master...TomAugspurger:add-intercept-dd is merged, you'd just do

lm = LinearRegression()  # fit_intercept=True
lm.fit(df, y)

And the intercept is added during the fit.


julioasotodv commented Nov 12, 2017

That's awesome!

But let me be just a little picky with that change (dask/dask-glm@master...TomAugspurger:add-intercept-dd):

In theory, when using either L1 or L2 regularization (or Elastic Net), the penalty term should not affect the intercept (that is, the "ones" column that serves as the intercept should not be included in the regularization penalty).

However, it would still be better than not having an intercept. What do you think about this?
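For illustration, a minimal NumPy sketch (not dask-glm's actual objective) of an L2-penalized least-squares loss where the penalty skips the intercept coefficient:

import numpy as np

def ridge_loss(beta, X, y, lam):
    # X is assumed to carry a leading column of ones, so beta[0] is the intercept.
    residuals = y - X @ beta
    penalty = lam * np.sum(beta[1:] ** 2)  # beta[0] is excluded from the penalty
    return 0.5 * np.sum(residuals ** 2) + penalty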


TomAugspurger commented Nov 13, 2017

Thanks, I'll take a look at how other packages handle regularization of the intercept, but I think you're correct. cc @moody-marlin, thoughts on that?


cicdw commented Nov 13, 2017

Yeah, I agree that the intercept should not be included in the regularization; I believe this is recommended best practice. Not regularizing the intercept also ensures that all regularizers still produce estimates whose residuals have mean 0, which preserves the standard interpretation of things like R^2.
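For the squared-error case, a short sketch of why leaving the intercept unpenalized forces the residuals to have mean zero (notation is illustrative, not taken from dask-glm):

\min_{b,\,w}\ \tfrac{1}{2}\lVert y - Xw - b\mathbf{1}\rVert^2 + \lambda\lVert w\rVert^2,
\qquad
\frac{\partial}{\partial b}\colon\ -\mathbf{1}^\top\!\left(y - Xw - b\mathbf{1}\right) = 0
\ \Longrightarrow\ \sum_i \left(y_i - x_i^\top w - b\right) = 0 .

If b were included in the penalty, the stationarity condition would pick up an extra 2\lambda b term and the residual mean would no longer be exactly zero.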


TomAugspurger commented Nov 14, 2017

Opened dask/dask-glm#65 to track that.

I'll deprecate the estimators in dask_glm and move them over here later today.


jakirkham commented Jun 6, 2018

I see there is a PR ( dask/dask-glm#66 ) to deprecate the dask-glm estimators and a PR ( #94 ) that seems to have migrated the bulk of that content to dask-ml. Is this still the plan?


TomAugspurger commented Jun 6, 2018


asifali22 commented Sep 5, 2018

I'm facing the same issue.

Traceback (most recent call last):
  File "diya_libs/alog_main.py", line 20, in <module>
    clf.fit(X, y)
  File "/Users/asifali/workspace/pythonProjects/ML-engine-DataX/pre-processing/diya_libs/lib/algorithms/diya_logit.py", line 67, in fit
    self.estimator.fit(X, y)
  File "/anaconda3/lib/python3.6/site-packages/dask_ml/linear_model/glm.py", line 153, in fit
    X = self._check_array(X)
  File "/anaconda3/lib/python3.6/site-packages/dask_ml/linear_model/glm.py", line 167, in _check_array
    X = add_intercept(X)
  File "/anaconda3/lib/python3.6/site-packages/multipledispatch/dispatcher.py", line 164, in __call__
    return func(*args, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/dask_glm/utils.py", line 147, in add_intercept
    raise NotImplementedError("Can not add intercept to array with "
NotImplementedError: Can not add intercept to array with unknown chunk shape

Initially I tried with a Dask DataFrame, then changed to a Dask Array using X = X.values, which resulted in nan chunks and caused the above error.
What am I supposed to do now? How do I install the fix mentioned above? It is not present in the version available on pip.
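One possible workaround, assuming a dask version new enough to support it (to_dask_array with lengths= was added after this thread started), is to have dask compute the partition lengths up front so the resulting arrays have known chunks:

from dask import dataframe as dd
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=2)
X = dd.from_dask_array(X, columns=["a", "b"])
y = dd.from_array(y)

# lengths=True makes dask compute each partition's length, so the arrays
# have known chunks and add_intercept no longer raises NotImplementedError.
X_arr = X.to_dask_array(lengths=True)
y_arr = y.to_dask_array(lengths=True)

lr = LogisticRegression()
lr.fit(X_arr, y_arr)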


TomAugspurger commented Sep 5, 2018

@asifali22 that looks strange. Can you provide a full example? Does the following work for you?

from dask import dataframe as dd
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=2)

X = dd.from_dask_array(X, columns=["a","b"])
y = dd.from_array(y)

lr = LogisticRegression()
lr.fit(X.values, y.values)