Bug in ColumnTransformer #962

Open
aparnakesarkar opened this issue Jan 18, 2023 · 2 comments


aparnakesarkar commented Jan 18, 2023

I have a straightforward use case: label (ordinal) encode some columns, one-hot encode some columns, and pass through some columns of a pandas DataFrame, dropping the remainder.

Code:

from dask_ml.compose import ColumnTransformer
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.read_csv('path/to/csv')

ordinal_cols = [<list of ordinal columns>]
nominal_cols = [<list of nominal columns>]
passthrough_cols = [<list of passthrough columns>]

transformers = [
    ("ordinal_encoding", OrdinalEncoder(), ordinal_cols),
    ("onehot_encoding", OneHotEncoder(), nominal_cols),
    ('select', 'passthrough', passthrough_cols)
]

preprocessor = ColumnTransformer(transformers=transformers)
df_t = preprocessor.fit_transform(df)

This fails with the following traceback:

Traceback (most recent call last):
  File ".../helpers/pydev/pydevd.py", line 1496, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File ".../python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File ".../dask_testing.py", line 80, in <module>
    df_t = preprocessor.fit_transform(df)
  File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File ".../lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 750, in fit_transform
    return self._hstack(list(Xs))
  File ".../lib/python3.8/site-packages/dask_ml/compose/_column_transformer.py", line 198, in _hstack
    return pd.concat(Xs, axis="columns")
  File ".../lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 368, in concat
    op = _Concatenator(
  File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 458, in __init__
    raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

On further debugging, the three steps in the transformer produce three different output types:

  1. OrdinalEncoder() returns a 2-D numpy.ndarray
  2. OneHotEncoder() returns a scipy.sparse csr_matrix
  3. "passthrough" returns a pandas DataFrame
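The three output types listed above can be checked on a small generated frame. This is a minimal sketch that uses scikit-learn directly, without dask-ml; the column names and values are hypothetical stand-ins for the CSV in the report, and the encoders are left at their default settings.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data; not the data from the original report.
df = pd.DataFrame(
    {
        "size": ["small", "large", "medium", "small"],
        "color": ["red", "blue", "red", "green"],
        "price": [1.0, 2.5, 3.0, 4.5],
    }
)

# Run each step on its own columns, as ColumnTransformer would.
ordinal_out = OrdinalEncoder().fit_transform(df[["size"]])
onehot_out = OneHotEncoder().fit_transform(df[["color"]])
passthrough_out = df[["price"]]  # "passthrough" hands the selection back unchanged

for name, out in [
    ("ordinal", ordinal_out),
    ("onehot", onehot_out),
    ("passthrough", passthrough_out),
]:
    print(name, type(out).__name__, out.shape)
```

With default settings the one-hot step is the only sparse output, so any code that combines the three results has to handle a dense array, a sparse matrix, and a DataFrame together.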

The failure occurs in the dask-ml package at .../python3.8/site-packages/dask_ml/compose/_column_transformer.py, line 198, where it tries to concatenate the three different types into a single output DataFrame.

Code snippet:

elif self.preserve_dataframe and (pd.Series in types or pd.DataFrame in types):
    return pd.concat(Xs, axis="columns")
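The failing call can be isolated from dask-ml entirely. The sketch below builds hypothetical stand-ins for the three transformer outputs, evaluates a condition of the same shape as the branch quoted above (assuming `types` is the set of output types and `preserve_dataframe` is true), and passes the mixed list to `pd.concat`.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins for the three transformer outputs.
Xs = [
    np.zeros((3, 2)),                          # like the OrdinalEncoder output
    sparse.csr_matrix(np.eye(3)),              # like the OneHotEncoder output
    pd.DataFrame({"price": [1.0, 2.0, 3.0]}),  # like the "passthrough" output
]

# Assumption: `types` holds the type of each output, as the quoted branch implies.
types = {type(X) for X in Xs}
takes_concat_branch = pd.Series in types or pd.DataFrame in types

try:
    pd.concat(Xs, axis="columns")
    error_message = None
except TypeError as exc:
    # pandas only accepts Series and DataFrame objects here.
    error_message = str(exc)

print(takes_concat_branch)
print(error_message)
```

A single DataFrame in the list is enough to select the `pd.concat` branch, even though the other entries are not pandas objects.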

Anything else we need to know?:
My data has shape (1000, 1076):

  • label encoding: 109 columns
  • one-hot encoding: 1 column
  • passthrough: the rest of the columns

I do not want to use the remainder="passthrough" parameter; I want to specify the passthrough columns in the transformers list.

Environment:

  • Dask version:
dask               2023.1.0
dask-glm           0.2.0
dask-ml            2022.5.27
  • Python version: 3.8
  • Operating System: MacOS
  • Install method (conda, pip, source): pip

aparnakesarkar commented Jan 18, 2023

Possible solution:
scikit-learn handles this case by converting any sparse matrix to a dense ndarray before stacking the outputs.

Sklearn code snippet:

Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
return np.hstack(Xs)
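Applied to mixed outputs like the ones in this report, the same idea could look roughly as follows. `hstack_dense` is a hypothetical helper written for illustration, not a function in scikit-learn or dask-ml, and the inputs are made-up stand-ins for the three transformer outputs.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def hstack_dense(Xs):
    """Hypothetical helper: densify sparse blocks, then stack column-wise."""
    dense = [X.toarray() if sparse.issparse(X) else np.asarray(X) for X in Xs]
    return np.hstack(dense)


# Made-up stand-ins for the three transformer outputs.
Xs = [
    np.zeros((3, 2)),                          # dense ndarray
    sparse.csr_matrix(np.eye(3)),              # sparse matrix
    pd.DataFrame({"price": [1.0, 2.0, 3.0]}),  # DataFrame
]

stacked = hstack_dense(Xs)
print(type(stacked).__name__, stacked.shape)  # ndarray (3, 6)
```

The result is a plain ndarray, so the column names of the passthrough DataFrame are lost; keeping a DataFrame output would require wrapping each block in a DataFrame before concatenating instead.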

mmccarty (Member) commented Mar 6, 2023

Hi @aparnakesarkar - Thank you for opening an issue. Would you please update your example to include generated data? See this blog for an example of generating data that reproduces the problem.
