LogisticRegression cannot train from Dask DataFrame #84

julioasotodv · 2017-11-04T02:26:26Z

A simple example:

from dask import dataframe as dd
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=2)

X = dd.from_dask_array(X, columns=["a","b"])
y = dd.from_array(y)

lr = LogisticRegression()
lr.fit(X, y)

Returns KeyError: (<class 'dask.dataframe.core.DataFrame'>,)

I have not had time to check whether this also happens with other models.


TomAugspurger · 2017-11-06T16:50:54Z

Thanks. At the moment the dask_glm based estimators just work with dask arrays, not dataframes. You can use .values to get the array.

I'm hoping to put in some helpers for handling all the extra DataFrame metadata sometime soon, so this will be more consistent across estimators.

julioasotodv · 2017-11-06T19:21:26Z

Thank you so much for the quick response!

The problem is that when fitting a GLM with an intercept (which is usually the case), the dask array containing the features needs to have a defined chunk size, which I believe is not possible when the array comes from a dataframe.

Anyway, I will reach out on the main dask issue page and ask there.
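To illustrate why known chunk sizes matter here, the following is a minimal sketch of appending a column of ones to a dask array. The helper name `add_intercept_sketch` is hypothetical and this is not claimed to be dask-glm's actual implementation; it only shows that building the ones column requires the row chunk sizes of `X`.

```python
import numpy as np
import dask.array as da


def add_intercept_sketch(X):
    """Append a column of ones to a 2-D dask array (hypothetical helper)."""
    # The ones column must be chunked like X along axis 0, so those
    # chunk sizes have to be known (not nan).
    if np.isnan(np.array(X.shape, dtype=float)).any():
        raise NotImplementedError("row chunk sizes are unknown")
    ones = da.ones((X.shape[0], 1), chunks=(X.chunks[0], (1,)), dtype=X.dtype)
    return da.concatenate([X, ones], axis=1)


X = da.random.random((100, 2), chunks=(25, 2))
Xb = add_intercept_sketch(X)
print(Xb.shape)
```

With an array whose first dimension is `nan`, the shape of the ones column cannot be written down, which is why this sketch raises instead.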

Thank you!

TomAugspurger · 2017-11-06T19:24:47Z

@julioasotodv, yes, I forgot about that case. Let me put something together quickly.

julioasotodv · 2017-11-06T23:05:53Z

Do you think there is a way to achieve this without making changes to dask's engine itself?

TomAugspurger · 2017-11-07T01:56:11Z

What do you mean by "dask's engine"? See dask/dask-glm#63 for a discussion on the relationship between dask-ml and dask-glm, and dask/dask-glm@master...TomAugspurger:add-intercept-dd for what the fix will look like.

julioasotodv · 2017-11-07T15:06:39Z

I see. Would it work with that fix, even if chunksize is not defined for the underlying dask array?

TomAugspurger · 2017-11-07T15:51:13Z

Yes, that should work. The solvers only require that the shape along the second axis is known:

from dask import dataframe as dd
from dask_ml.linear_model import LinearRegression
from dask_ml.datasets import make_regression

X, y = make_regression(chunks=50)

df = dd.from_dask_array(X)
X2 = df.values  # dask.array with unknown chunks along first dim

lm = LinearRegression(fit_intercept=False)
lm.fit(X2, y)

Note that fit_intercept does not currently work with unknown chunks. But when dask/dask-glm@master...TomAugspurger:add-intercept-dd is merged, you'd just do

lm = LinearRegression()  # fit_intercept=True
lm.fit(df)

And the intercept is added during the fit.

julioasotodv · 2017-11-12T22:08:44Z

That's awesome!

But let me be just a little picky with that change (dask/dask-glm@master...TomAugspurger:add-intercept-dd):

In theory, if using either L1 or L2 regularization (or Elastic Net), the penalty term should not affect the intercept (that is, the "ones" column that works as the intercept should not be multiplied by the Lagrange multipliers that perform the actual regularization).

However, it would still be better than not having intercept. What do you think about this?

TomAugspurger · 2017-11-13T14:16:54Z

Thanks, I'll take a look at how other packages handle regularization of the intercept, but I think you're correct. cc @moody-marlin thoughts on that?

cicdw · 2017-11-13T17:49:44Z

Yea, I agree that the intercept should not be included in the regularization; I believe this is recommended best practice, and also not regularizing the intercept ensures that all regularizers still produce estimates which satisfy that the residuals have mean 0, which preserves the standard interpretation of things like R^2, etc.
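A small NumPy sketch of the point about residuals, using ridge (L2-penalized least squares) as the simplest case. The data, the penalty strength `lam`, and the helper `ridge` are illustrative assumptions, not code from dask-glm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Append the "ones" column that acts as the intercept.
Xb = np.hstack([X, np.ones((n, 1))])
lam = 10.0


def ridge(Xb, y, penalty_diag):
    # Solve (Xb^T Xb + diag(penalty)) w = Xb^T y.
    return np.linalg.solve(Xb.T @ Xb + np.diag(penalty_diag), Xb.T @ y)


w_all = ridge(Xb, y, np.full(p + 1, lam))           # intercept penalized
w_free = ridge(Xb, y, np.r_[np.full(p, lam), 0.0])  # intercept left alone

res_all = y - Xb @ w_all
res_free = y - Xb @ w_free
print("mean residual, intercept penalized:  ", res_all.mean())
print("mean residual, intercept unpenalized:", res_free.mean())
```

When the intercept's penalty entry is zero, the normal equation for that column forces the residuals to sum to zero; with the penalty applied, the mean residual is pulled away from zero.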

TomAugspurger · 2017-11-14T16:49:27Z

Opened dask/dask-glm#65 to track that.

I'll deprecate the estimators in dask_glm and move them over here later today.

jakirkham · 2018-06-06T02:02:04Z

I see there is a PR (dask/dask-glm#66) to deprecate the dask-glm estimators and a PR (#94), which seems to have migrated the bulk of that content to dask-ml. Is this still the plan?

TomAugspurger · 2018-06-06T12:26:40Z

Yes, in my mind dask-glm has the optimizers, and dask-ml has the estimators built on top of those.


asifali22 · 2018-09-05T08:14:39Z

I'm facing the same issue.

Traceback (most recent call last):
  File "diya_libs/alog_main.py", line 20, in <module>
    clf.fit(X, y)
  File "/Users/asifali/workspace/pythonProjects/ML-engine-DataX/pre-processing/diya_libs/lib/algorithms/diya_logit.py", line 67, in fit
    self.estimator.fit(X, y)
  File "/anaconda3/lib/python3.6/site-packages/dask_ml/linear_model/glm.py", line 153, in fit
    X = self._check_array(X)
  File "/anaconda3/lib/python3.6/site-packages/dask_ml/linear_model/glm.py", line 167, in _check_array
    X = add_intercept(X)
  File "/anaconda3/lib/python3.6/site-packages/multipledispatch/dispatcher.py", line 164, in __call__
    return func(*args, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/dask_glm/utils.py", line 147, in add_intercept
    raise NotImplementedError("Can not add intercept to array with "
NotImplementedError: Can not add intercept to array with unknown chunk shape

Initially I tried with a Dask DataFrame, then changed to a Dask Array using
X = X.values  # resulted in nan chunks, which causes the above error
What am I supposed to do now? How do I install the fix mentioned above? It is not present in the version available on pip.

TomAugspurger · 2018-09-05T10:51:29Z

@asifali22 that looks strange. Can you provide a full example? Does the following work for you?

from dask import dataframe as dd
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=2)

X = dd.from_dask_array(X, columns=["a","b"])
y = dd.from_array(y)

lr = LogisticRegression()
lr.fit(X.values, y.values)

thebeancounter · 2019-06-13T15:03:27Z

Having a similar issue with a dask array. @TomAugspurger, see my SO question. Any idea?

TomAugspurger · 2019-06-13T20:26:38Z

@thebeancounter do you have a minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

thebeancounter · 2019-06-14T05:31:55Z

@TomAugspurger
Hi. The code is in the SO question, do you mean copy it here?

TomAugspurger · 2019-06-14T12:11:37Z

It looks like data isn’t defined. Also the error says you have multiple columns with no variance. You probably don’t want that.


thebeancounter · 2019-06-16T10:21:44Z

@TomAugspurger

Data is defined
It's regular CIFAR-10 data, passed through a pre-trained ResNet50 for feature extraction. It trains well with sklearn. I can't guarantee that there are no zero-variance columns, but those should not prevent learning anyway! They only waste some processing time.

Here is the data zipped (read from a folder with a generator, just to prevent memory from exploding)

i = ImageDataGenerator(preprocessing_function=preprocess_input)

train_flow = i.flow_from_directory(directory=test_dir, target_size=(224, 224), class_mode="sparse", batch_size=1024, shuffle=True)

pre_model = ResNet50(weights="imagenet", include_top=False)
pre_model.compile(optimizer=Adam(), loss=categorical_crossentropy)

labels = []
data = []
for i in range(len(train_flow)):
    imgs, l = next(train_flow)
    data.append(pre_model.predict(imgs))
    labels.append(l)

labels = np.concatenate(labels)
data = np.concatenate(data, axis=0)
data = data.reshape(-1, np.prod(data.shape[1:]))

Data is under
github.com/thebeancounter/data

TomAugspurger · 2019-06-16T11:44:01Z

http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports may be helpful for writing an example. Does the error show up if you have a dummy dataset where two columns have no variance?


thebeancounter · 2019-06-16T15:49:17Z

@TomAugspurger

Hi, I posted the code and the data. It's a solid example :-)

Anyhow, can you maybe post a working example of using a numpy array for logistic regression in dask?

TomAugspurger · 2019-06-16T16:09:12Z

I’m guessing it’s not minimal. Simplifying it may reveal the issue. Why do you want to use dask-ml’s LR on a numpy array?


thebeancounter · 2019-06-17T09:11:06Z

@TomAugspurger
My data originally comes from a numpy array, and I need to convert it to some form that dask can learn on. I can't find any example of that in the tutorial; maybe that's the issue. Can you point me to something of that kind?

TomAugspurger · 2019-06-17T13:36:29Z

https://docs.dask.org/en/latest/array-creation.html documents creating dask arrays, including from array-like things like NumPy arrays. Though my (vague) question was a bit deeper. Why do you want to use dask's LR, rather than scikit-learn's or Scipy's? If you're coming from a NumPy array, then does your data fit in memory? If so, you should just use one of those.
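For reference, a minimal sketch of wrapping an in-memory NumPy array as a dask array with `da.from_array`. The array contents and the chunk size of 250 rows are arbitrary choices for illustration.

```python
import numpy as np
import dask.array as da

# Stand-in for features that already live in memory as a NumPy array.
data = np.random.RandomState(0).rand(1000, 8)

# Split the rows into blocks of 250; keep all 8 columns in one block.
X = da.from_array(data, chunks=(250, 8))

print(X.shape)   # known shape
print(X.chunks)  # known chunk sizes along both axes
```

Because the array is built from data with a known shape, its chunk sizes are known as well, unlike an array obtained from a dask DataFrame via `.values`.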


xiaozhongtian · 2019-06-19T10:06:37Z

@TomAugspurger

Unknown chunksize

I saw this case above:

X2 = df.values  # dask.array with unknown chunks along first dim

For me, if I use .values, I do not know the chunk size of the resulting array:

x= df_train.values
dask.array<values, shape=(nan, 11), dtype=float64, chunksize=(nan, 11)>

Will this affect the distributed computation, for example memory management or speed?

fit_intercept:
The same question applies to the block above:

m_dkl.fit(df_train.values,df["target"])

NotImplementedError: Can not add intercept to array with unknown chunk shape

Will I need to use fit_intercept=False? Will the performance be the same as scikit-learn's?

The difference between the dask-ml GLM and the scikit-learn GLM

import dask_ml.linear_model as dkl  
import sklearn.linear_model as skl 
m_skl = skl.LogisticRegression(C=0.01, penalty='l1', n_jobs=-1,random_state=0)
m_dkl = dkl.LogisticRegression(C=0.01, penalty='l1', n_jobs=-1,random_state=0)

m_skl.fit(df_train,df["target"])
m_dkl.fit(df_train.values,df["target"])

In my case, I find that the scikit-learn estimator accepts dask data formats (array, dataframe), so what is the big difference between the two?
Is dask-glm just a better fit for the "big data" case with a specific chunk size? If we don't know the chunk size, as above, will dask-ml's GLM behave like scikit-learn, or will a chunk size be chosen automatically for distribution?

thebeancounter · 2019-06-19T10:33:47Z

@TomAugspurger

Scikit learn will not utilize the machines cores, and takes way way way too long to run...
Looking for a multithreaded solution.

thebeancounter · 2019-06-19T10:34:38Z

@xiaozhongtian can you please clarify? are you asking a question? Not sure I see the connection to this thread.

xiaozhongtian · 2019-06-19T11:18:25Z

@TomAugspurger
I'm asking a question about the same confusion discussed above.

xiaozhongtian · 2019-06-19T11:23:30Z

@thebeancounter

Scikit learn will not utilize the machines cores, and takes way way way too long to run...

With n_jobs=-1, scikit-learn uses multiple processes to fit, no?

But here, I want to understand how memory is managed in scikit-learn and dask-ml.
If we don't use chunks to divide the dataset, there will be no difference from scikit-learn, in my opinion.

carloszanella · 2020-04-26T11:33:50Z

I'm having the same problem when building a dataframe from dask arrays, then calling .values just before passing it to a dask_ml.linear_model.LinearRegression model. Has anyone figured this out?

stsievert · 2020-04-26T16:48:38Z

I'm having the same problem

I presume you mean a NotImplementedError: Can not add intercept to array with unknown chunk shape from #84 (comment). Try dask.dataframe.DataFrame.to_dask_array(lengths=True) https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_dask_array

This will compute the chunk sizes and the length of the array.

Abhishekdutt9 · 2022-09-30T04:23:06Z

Use lr.fit(X.values, y.values) instead

TomAugspurger mentioned this issue Nov 14, 2017

Intercepts should not be regularized dask/dask-glm#65

Open

TomAugspurger mentioned this issue Sep 5, 2018

CI: Unpin openpyxl pandas-dev/pandas#22605

Closed

stsievert mentioned this issue Apr 26, 2020

MAINT: provide potential solutions in warning on "dataframes not accepted" #653

Merged
