
ColumnTransformer: 'DataFrame' object has no attribute 'take' with sklearn >= 1.0.0 #887

Open
zexuan-zhou opened this issue Nov 19, 2021 · 10 comments

@zexuan-zhou

from dask_ml.compose import ColumnTransformer as dd_column_transformer
from sklearn.compose import ColumnTransformer as sk_column_transformer
from dask_ml.preprocessing import StandardScaler as dd_standard_scaler
from sklearn.preprocessing import StandardScaler as sk_standard_scaler
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 1]])

# Sklearn
sk_p = sk_column_transformer([('standard_scaler', sk_standard_scaler(), [0, 1])])
print("sk_p.fit_transform(df)")
print(sk_p.fit_transform(df))
print()

# dask
dd_p = dd_column_transformer([('standard_scaler', dd_standard_scaler(), [0, 1])])
ddf = dd.from_pandas(df, npartitions=2)
print("dd_p.fit_transform(ddf).compute()")
print(dd_p.fit_transform(ddf).compute())
sk_p.fit_transform(df)
[[0. 0.]
 [0. 0.]]

dd_p.fit_transform(ddf).compute()
Traceback (most recent call last):
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/joblib/parallel.py", line 822, in dispatch_one_batch
    tasks = self._ready_batches.get(block=False)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tests/local_test/dask_dataframe_no_take.py", line 20, in <module>
    print(dd_p.fit_transform(ddf).compute())
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 675, in fit_transform
    result = self._fit_transform(X, y, _fit_transform_one)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 615, in _fit_transform
    for idx, (name, trans, column, weight) in enumerate(transformers, 1)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/joblib/parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/joblib/parallel.py", line 833, in dispatch_one_batch
    islice = list(itertools.islice(iterator, big_batch_size))
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 615, in <genexpr>
    for idx, (name, trans, column, weight) in enumerate(transformers, 1)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/sklearn/utils/__init__.py", line 375, in _safe_indexing
    return _pandas_indexing(X, indices, indices_dtype, axis=axis)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/sklearn/utils/__init__.py", line 217, in _pandas_indexing
    return X.take(key, axis=axis)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/dask/dataframe/core.py", line 4167, in __getattr__
    raise AttributeError("'DataFrame' object has no attribute %r" % key)
AttributeError: 'DataFrame' object has no attribute 'take'

What happened:
from dask_ml.compose import ColumnTransformer doesn't work with scikit-learn >= 1.0.0, even though dask-ml declares support for it in its requirements:

"scikit-learn>=1.0.0",

What you expected to happen:
With scikit-learn==0.24.0 the following code works:

from dask_ml.compose import ColumnTransformer as dd_column_transformer
from sklearn.compose import ColumnTransformer as sk_column_transformer
from sklearn.preprocessing import StandardScaler as sk_standard_scaler
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 1]])

# Sklearn
sk_p = sk_column_transformer([('standard_scaler', sk_standard_scaler(), [0, 1])])
print("sk_p.fit_transform(df)")
print(sk_p.fit_transform(df))
print()

# dask
dd_p = dd_column_transformer([('standard_scaler', sk_standard_scaler(), [0, 1])])  # note here I use the sklearn standard scaler transformer
ddf = dd.from_pandas(df, npartitions=2)
print("dd_p.fit_transform(ddf).compute()")
print(dd_p.fit_transform(ddf).compute())
sk_p.fit_transform(df)
[[0. 0.]
 [0. 0.]]

dd_p.fit_transform(ddf).compute()
[[0. 0.]
 [0. 0.]]

Minimal Complete Verifiable Example: see the code above.

Environment:

  • Versions: dask==2021.11.1 dask-ml==2021.11.16 pandas==1.3.4 scikit-learn==1.0.0
  • Python version: 3.7.10
  • Operating System: macOS Big Sur
  • Install method (conda, pip, source): pip
@zexuan-zhou (Author)

@TomAugspurger

@TomAugspurger (Member)

Thanks for the report @zexuan-zhou. Are you able to debug it further? Most likely scikit-learn previously cast a (dask) DataFrame to an ndarray, but no longer does that. We were apparently relying on that behavior for this example.

That said, scikit-learn not casting is probably a good thing. So we might want to update accordingly.

@zexuan-zhou (Author)

zexuan-zhou commented Nov 19, 2021

I'm not 100% sure, but fairly confident that's the case: with the new version of scikit-learn I had to update my test cases to distinguish pandas objects from NumPy objects.

@jakirkham (Member)

cc @VibhuJawa (in case this is of interest)

@ashokrayal

Is anyone working on this, or is there any update?

@jakirkham (Member)

Not aware of anyone working on this. If this is of interest, feel free to pick it up :)

@ashokrayal

ashokrayal commented May 25, 2022

@jakirkham, unfortunately I don't have deep knowledge of dask internals.
The problem is that we can't downgrade scikit-learn, and with the current version the ColumnTransformer does not work, so we are stuck. Can you please suggest a workaround, or a direction for a fix?

@melgazar9

@TomAugspurger please see the below reproducible example.

I went through other GitHub issues about dask and ColumnTransformer and found comments suggesting downgrading to sklearn 0.24, dask_ml==1.9.0, and dask=='2021-12-03'. Below is a reproducible example of what passes and what fails with the dask ColumnTransformer.

# python version 3.9.13
import pandas as pd
import numpy as np
from dask_ml.compose import ColumnTransformer as DaskColumnTransformer
from sklearn.compose import ColumnTransformer as SklearnColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import dask.dataframe as dd
from sklearn.preprocessing import MinMaxScaler
from category_encoders import TargetEncoder

df = pd.DataFrame({'hi': [1, 2, 3, np.nan, 5],
                   'bye': [1, 3, 2, 3, 5],
                   'hc': ['hi', 'h', np.nan, 'hi', 'no'],
                   'target': [0, 0, 1, 1, 1]})

#     hi  bye   hc  target
# 0  1.0    1   hi       0
# 1  2.0    3    h       0
# 2  3.0    2  NaN       1
# 3  NaN    3   hi       1
# 4  5.0    5   no       1

X = df.drop('target', axis=1)
y = df['target']

# sklearn ColumnTransformer with one pipeline passes
ct_sklearn1 = SklearnColumnTransformer([('hc', make_pipeline(SimpleImputer(strategy='most_frequent'), TargetEncoder()), ['hc'])], remainder='passthrough')
ct_sklearn1.fit_transform(X, y)

# dask ColumnTransformer with one pipeline passes
dX = dd.from_pandas(X, npartitions=3)
dy = dd.from_pandas(y, npartitions=3)

ct_dask1 = DaskColumnTransformer([('hc', make_pipeline(SimpleImputer(strategy='most_frequent'), TargetEncoder()), ['hc'])], remainder='passthrough')
ct_dask1.fit_transform(dX, dy).compute()

### here is where it gets interesting ###

# sklearn ColumnTransformer with multiple pipelines on mixed datatypes passes
ct_sklearn2 = SklearnColumnTransformer([
    ('hc', make_pipeline(SimpleImputer(strategy='most_frequent'), TargetEncoder()), ['hc']),
    ('num', make_pipeline(SimpleImputer(), MinMaxScaler()), ['hi', 'bye'])],
    remainder='passthrough')

ct_sklearn2.fit_transform(X, y)

# dask ColumnTransformer with multiple pipelines on mixed datatypes fails

ct_dask2 = DaskColumnTransformer([
    ('hc', make_pipeline(SimpleImputer(strategy='most_frequent'), TargetEncoder()), ['hc']),
    ('num', make_pipeline(SimpleImputer(), MinMaxScaler()), ['hi', 'bye'])],
    remainder='passthrough')

ct_dask2.fit_transform(dX, dy) # fails!

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

When I upgrade dask, dask_ml, and scikit-learn to 1.1.1 I get the same error as above:
'DataFrame' object has no attribute 'take'
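The TypeError in the second case looks like a pandas concat error: a plain sklearn MinMaxScaler returns a NumPy ndarray, and pandas/dask concat refuses ndarrays. A minimal illustration of that error class (this reproduces the message, not the actual dask-ml code path):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
arr = np.zeros((3, 1))  # what a NumPy-returning transformer hands back
try:
    pd.concat([s, arr], axis=1)
except TypeError as exc:
    # "cannot concatenate object of type '<class 'numpy.ndarray'>'; ..."
    print(exc)
```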

@TomAugspurger (Member)

Thanks for the reproducible example. We'll need someone to step through and figure out exactly what changed in scikit-learn / pandas and adapt. I won't have time to work on this anytime soon.

@Buckler89

Buckler89 commented Jan 16, 2023

Any update on that?

I got the same issue with the dask ColumnTransformer using
ColumnTransformer(transformers, remainder='passthrough')
resulting in
'DataFrame' object has no attribute 'take'
If I use
ColumnTransformer(transformers, remainder='drop')
it works, but with my transformer settings it drops all boolean columns.

This is the code I used:

        transformers = []
        scaler_infos = {
            "MinMaxScaler": {
                'columns_type': ['float64', 'int64'],
            },
            "Categorizer": {
            }
        }
        for scaler_name, infos in scaler_infos.items():
            ncols = ddf.columns
            scaler = None
            if "columns_type" in infos.keys():
                ncols = ddf.select_dtypes(include=infos["columns_type"]).columns
            elif "columns_name" in infos.keys():
                ncols = infos["columns_name"]

            if scaler_name == "MinMaxScaler":
                scaler = MinMaxScaler()
            elif scaler_name == "StandardScaler":
                scaler = StandardScaler()
            elif scaler_name == "RobustScaler":
                scaler = RobustScaler()
            elif scaler_name == "Categorizer":
                ncols = ddf.select_dtypes(include=['object', 'string', 'category']).columns
                scaler = Categorizer()
            transformers.append((str(scaler), scaler, list(ncols)))

        column_transformer = ColumnTransformer(transformers, remainder='passthrough')
        ddf2 = column_transformer.fit_transform(ddf)
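A possible explanation for the drop/passthrough asymmetry: scikit-learn resolves remainder='passthrough' columns to integer positions internally, which appears to be what routes dask DataFrames into the failing .take path, while named columns use label-based indexing. A sketch of the alternative pattern, shown here with plain scikit-learn on pandas (that the same pattern avoids the dask failure is an assumption, not verified against every version): list the leftover columns by name in an explicit 'passthrough' entry and use remainder='drop'.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0],
                   "keep": [True, False, True, False]})

num_cols = ["x"]
rest_cols = [c for c in df.columns if c not in num_cols]

ct = ColumnTransformer(
    [("scale", MinMaxScaler(), num_cols),
     ("rest", "passthrough", rest_cols)],  # leftover columns listed by name
    remainder="drop",  # nothing is left for the positional remainder logic
)
out = ct.fit_transform(df)
print(out.shape)  # (4, 2)
```

This keeps the boolean columns without relying on the remainder machinery at all.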

Version
dask 2023.1.0
dask-glm 0.2.0
dask-ml 2022.5.27
