
ColumnTransformer: 'DataFrame' object has no attribute 'take' with sklearn >= 1.0.0 #887

Open
zexuan-zhou opened this issue Nov 19, 2021 · 10 comments

@zexuan-zhou

from dask_ml.compose import ColumnTransformer as dd_column_transformer
from sklearn.compose import ColumnTransformer as sk_column_transformer
from dask_ml.preprocessing import StandardScaler as dd_standard_scaler
from sklearn.preprocessing import StandardScaler as sk_standard_scaler
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 1]])

# Sklearn
sk_p = sk_column_transformer([('standard_scaler', sk_standard_scaler(), [0, 1])])
print("sk_p.fit_transform(df)")
print(sk_p.fit_transform(df))
print()

# dask
dd_p = dd_column_transformer([('standard_scaler', dd_standard_scaler(), [0, 1])])
ddf = dd.from_pandas(df, npartitions=2)
print("dd_p.fit_transform(ddf).compute()")
print(dd_p.fit_transform(ddf).compute())
sk_p.fit_transform(df)
[[0. 0.]
 [0. 0.]]

dd_p.fit_transform(ddf).compute()
Traceback (most recent call last):
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/joblib/parallel.py", line 822, in dispatch_one_batch
    tasks = self._ready_batches.get(block=False)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tests/local_test/dask_dataframe_no_take.py", line 20, in <module>
    print(dd_p.fit_transform(ddf).compute())
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 675, in fit_transform
    result = self._fit_transform(X, y, _fit_transform_one)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 615, in _fit_transform
    for idx, (name, trans, column, weight) in enumerate(transformers, 1)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/joblib/parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/joblib/parallel.py", line 833, in dispatch_one_batch
    islice = list(itertools.islice(iterator, big_batch_size))
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 615, in <genexpr>
    for idx, (name, trans, column, weight) in enumerate(transformers, 1)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/sklearn/utils/__init__.py", line 375, in _safe_indexing
    return _pandas_indexing(X, indices, indices_dtype, axis=axis)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/sklearn/utils/__init__.py", line 217, in _pandas_indexing
    return X.take(key, axis=axis)
  File "/opt/anaconda3/envs/mlpl/lib/python3.7/site-packages/dask/dataframe/core.py", line 4167, in __getattr__
    raise AttributeError("'DataFrame' object has no attribute %r" % key)
AttributeError: 'DataFrame' object has no attribute 'take'

What happened:
from dask_ml.compose import ColumnTransformer doesn't work with scikit-learn >= 1.0.0, even though dask-ml declares support for it in its requirements:

"scikit-learn>=1.0.0",

What you expected to happen:
With scikit-learn==0.24.0 the following code works:

from dask_ml.compose import ColumnTransformer as dd_column_transformer
from sklearn.compose import ColumnTransformer as sk_column_transformer
from sklearn.preprocessing import StandardScaler as sk_standard_scaler
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 1]])

# Sklearn
sk_p = sk_column_transformer([('standard_scaler', sk_standard_scaler(), [0, 1])])
print("sk_p.fit_transform(df)")
print(sk_p.fit_transform(df))
print()

# dask
dd_p = dd_column_transformer([('standard_scaler', sk_standard_scaler(), [0, 1])])  # note here I use the sklearn standard scaler transformer
ddf = dd.from_pandas(df, npartitions=2)
print("dd_p.fit_transform(ddf).compute()")
print(dd_p.fit_transform(ddf).compute())
sk_p.fit_transform(df)
[[0. 0.]
 [0. 0.]]

dd_p.fit_transform(ddf).compute()
[[0. 0.]
 [0. 0.]]

Minimal Complete Verifiable Example: see the code above.

Environment:

  • Versions: dask==2021.11.1 dask-ml==2021.11.16 pandas==1.3.4 scikit-learn==1.0.0
  • Python version: 3.7.10
  • Operating System: macOS Big Sur
  • Install method (conda, pip, source): pip
@zexuan-zhou (Author)

@TomAugspurger

@TomAugspurger (Member)

Thanks for the report @zexuan-zhou. Are you able to debug it further? Most likely scikit-learn previously cast a (dask) DataFrame to an ndarray, but no longer does that. We were apparently relying on that behavior for this example.

That said, scikit-learn not casting is probably a good thing. So we might want to update accordingly.

@zexuan-zhou (Author)

zexuan-zhou commented Nov 19, 2021

I'm not 100% sure, but fairly confident that's the case: with the new version of scikit-learn I had to update my test cases to distinguish pandas objects from NumPy objects.

@jakirkham (Member)

cc @VibhuJawa (in case this is of interest)

@ashokrayal

Is anyone working on this, or is there any update?

@jakirkham (Member)

Not aware of anyone working on this. If this is of interest, feel free to pick it up :)

@ashokrayal

ashokrayal commented May 25, 2022

@jakirkham, unfortunately I don't have deep knowledge of dask internals.
The problem is that we can't downgrade scikit-learn, and with the current version the ColumnTransformer does not work, so we are stuck. Can you please suggest a workaround, or a direction for a fix?

@melgazar9

@TomAugspurger please see the below reproducible example.

I went through other GitHub issues about dask and ColumnTransformer and found comments suggesting downgrading to sklearn 0.24, dask_ml==1.9.0, and dask=='2021-12-03'. Below is a reproducible example of what passes and what fails with the dask ColumnTransformer.

# python version 3.9.13
import pandas as pd
import numpy as np
from dask_ml.compose import ColumnTransformer as DaskColumnTransformer
from sklearn.compose import ColumnTransformer as SklearnColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import dask.dataframe as dd
from sklearn.preprocessing import MinMaxScaler
from category_encoders import TargetEncoder

df = pd.DataFrame({'hi': [1, 2, 3, np.nan, 5],
                   'bye': [1, 3, 2, 3, 5],
                   'hc': ['hi', 'h', np.nan, 'hi', 'no'],
                   'target': [0, 0, 1, 1, 1]})

#     hi  bye   hc  target
# 0  1.0    1   hi       0
# 1  2.0    3    h       0
# 2  3.0    2  NaN       1
# 3  NaN    3   hi       1
# 4  5.0    5   no       1

X = df.drop('target', axis=1)
y = df['target']

# sklearn ColumnTransformer with one pipeline passes
ct_sklearn1 = SklearnColumnTransformer([('hc', make_pipeline(SimpleImputer(strategy='most_frequent'), TargetEncoder()), ['hc'])], remainder='passthrough')
ct_sklearn1.fit_transform(X, y)

# dask ColumnTransformer with one pipeline passes
dX = dd.from_pandas(X, npartitions=3)
dy = dd.from_pandas(y, npartitions=3)

ct_dask1 = DaskColumnTransformer([('hc', make_pipeline(SimpleImputer(strategy='most_frequent'), TargetEncoder()), ['hc'])], remainder='passthrough')
ct_dask1.fit_transform(dX, dy).compute()

### here is where it gets interesting ###

# sklearn ColumnTransformer with multiple pipelines on mixed datatypes passes
ct_sklearn2 = SklearnColumnTransformer([
    ('hc', make_pipeline(SimpleImputer(strategy='most_frequent'), TargetEncoder()), ['hc']),
    ('num', make_pipeline(SimpleImputer(), MinMaxScaler()), ['hi', 'bye'])],
    remainder='passthrough')

ct_sklearn2.fit_transform(X, y)

# dask ColumnTransformer with multiple pipelines on mixed datatypes fails

ct_dask2 = DaskColumnTransformer([
    ('hc', make_pipeline(SimpleImputer(strategy='most_frequent'), TargetEncoder()), ['hc']),
    ('num', make_pipeline(SimpleImputer(), MinMaxScaler()), ['hi', 'bye'])],
    remainder='passthrough')

ct_dask2.fit_transform(dX, dy) # fails!

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

When I upgrade dask, dask_ml, and scikit-learn to 1.1.1 I get the same error as above:
'DataFrame' object has no attribute 'take'
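The TypeError in the second case looks like a pandas concat error: a plain sklearn MinMaxScaler returns a NumPy ndarray, and pandas/dask concat refuses ndarrays. A minimal illustration of that error class (this reproduces the message, not the actual dask-ml code path):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
arr = np.zeros((3, 1))  # what a NumPy-returning transformer hands back
try:
    pd.concat([s, arr], axis=1)
except TypeError as exc:
    # "cannot concatenate object of type '<class 'numpy.ndarray'>'; ..."
    print(exc)
```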

@TomAugspurger (Member)

Thanks for the reproducible example. We'll need someone to step through and figure out exactly what changed in scikit-learn / pandas and adapt. I won't have time to work on this anytime soon.

@Buckler89

Buckler89 commented Jan 16, 2023

Any update on that?

I got the same issue with the dask ColumnTransformer using
ColumnTransformer(transformers, remainder='passthrough')
resulting in
'DataFrame' object has no attribute 'take'
If I use
ColumnTransformer(transformers, remainder='drop')
it works, but with my transformer settings it drops all boolean columns.

This is the code I used:

        transformers = []
        scaler_infos = {
            "MinMaxScaler": {
                'columns_type': ['float64', 'int64'],
            },
            "Categorizer": {
            }
        }
        for scaler_name, infos in scaler_infos.items():
            ncols = ddf.columns
            scaler = None
            if "columns_type" in infos.keys():
                ncols = ddf.select_dtypes(include=infos["columns_type"]).columns
            elif "columns_name" in infos.keys():
                ncols = infos["columns_name"]

            if scaler_name == "MinMaxScaler":
                scaler = MinMaxScaler()
            elif scaler_name == "StandardScaler":
                scaler = StandardScaler()
            elif scaler_name == "RobustScaler":
                scaler = RobustScaler()
            elif scaler_name == "Categorizer":
                ncols = ddf.select_dtypes(include=['object', 'string', 'category']).columns
                scaler = Categorizer()
            transformers.append((str(scaler), scaler, list(ncols)))

        column_transformer = ColumnTransformer(transformers, remainder='passthrough')
        ddf2 = column_transformer.fit_transform(ddf)
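A possible explanation for the drop/passthrough asymmetry: scikit-learn resolves remainder='passthrough' columns to integer positions internally, which appears to be what routes dask DataFrames into the failing .take path, while named columns use label-based indexing. A sketch of the alternative pattern, shown here with plain scikit-learn on pandas (that the same pattern avoids the dask failure is an assumption, not verified against every version): list the leftover columns by name in an explicit 'passthrough' entry and use remainder='drop'.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0],
                   "keep": [True, False, True, False]})

num_cols = ["x"]
rest_cols = [c for c in df.columns if c not in num_cols]

ct = ColumnTransformer(
    [("scale", MinMaxScaler(), num_cols),
     ("rest", "passthrough", rest_cols)],  # leftover columns listed by name
    remainder="drop",  # nothing is left for the positional remainder logic
)
out = ct.fit_transform(df)
print(out.shape)  # (4, 2)
```

This keeps the boolean columns without relying on the remainder machinery at all.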

Version
dask 2023.1.0
dask-glm 0.2.0
dask-ml 2022.5.27
