TypeError: split() got an unexpected keyword argument 'expand' | string split function doesn't work [TypeError] | dask 0.20 #4179

Closed
ZiyadMoraished opened this issue Nov 6, 2018 · 11 comments
Labels: dataframe, good first issue

Comments

@ZiyadMoraished

Hi,

I'm trying to split a column by space as follows:
df.CUSTOMER.str.split(expand=True)

Here is the error I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-35-07645d325084> in <module>()
----> 1 df.CUSTOMER.str.split(expand=True).head()

TypeError: split() got an unexpected keyword argument 'expand'

When I run it on only the first 5 records (a plain pandas DataFrame), it works perfectly:
df.head().CUSTOMER.str.split(expand=True)

I'm using Python 3.6 and dask 0.20.

@mrocklin
Member

mrocklin commented Nov 6, 2018 via email

@ZiyadMoraished
Author

The pandas version is 0.23.4.

Here is a quick failing example:

import pandas as pd
import dask.dataframe as dd

dask_df = dd.from_pandas(pd.DataFrame({'name': ['Ziyad Moraished'] * 1000000}), npartitions=10000)

dask_df['name'].str.split(' ', expand=True)

TypeError                                 Traceback (most recent call last)
<ipython-input-22-2833952eac0a> in <module>()
----> 1 dask_df['name'].str.split(' ', expand=True)

TypeError: split() got an unexpected keyword argument 'expand'

@mrocklin
Member

mrocklin commented Nov 8, 2018

Reproduced. Thank you @ZiyadMoraished .

At first it looked like we could just pass the expand= keyword through from our version to pandas' (though we would want to verify that this also works well on previous pandas versions). However, when I tried this, it looked like we weren't getting the metadata correct. Presumably we don't correctly infer that this produces a dataframe rather than a series.

If you start diving in from here:

def split(self, pat=None, n=-1):
    return self._function_map('split', pat=pat, n=n)

You'll eventually get to here:

meta = self._delegate_method(self._series._meta_nonempty,
                             self._accessor_name, attr, args, kwargs)

Which should be a dataframe with a few text columns, but seems not to be.

If anyone wants to investigate this further that would be welcome.

@jcrist added the "good first issue" and "dataframe" labels on Nov 30, 2018
@nixphix
Contributor

nixphix commented Dec 5, 2018

@mrocklin after adding the expand argument to the split function, it failed the metadata check here:

dask/dask/dataframe/core.py

Lines 3691 to 3694 in 113457b

if not np.array_equal(np.nan_to_num(meta.columns),
                      np.nan_to_num(df.columns)):
    raise ValueError("The columns in the computed data do not match"
                     " the columns in the provided metadata")

@mrocklin
Member

mrocklin commented Dec 5, 2018

What are the differences between meta.columns and df.columns? (Arguably this should also be in the exception.) I wonder if that information would direct you to the fix.
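As a rough illustration of the suggestion to put the difference into the exception, a check along the following lines would report both column sets. This is only a sketch: `check_columns_match` is a hypothetical helper, not dask's actual code, and it compares the columns with `Index.equals` instead of the `np.array_equal` call quoted above.

```python
import pandas as pd


def check_columns_match(meta: pd.DataFrame, df: pd.DataFrame) -> None:
    """Hypothetical check: raise an error that shows both column sets."""
    if not meta.columns.equals(df.columns):
        raise ValueError(
            "The columns in the computed data do not match "
            "the columns in the provided metadata.\n"
            f"  computed: {df.columns.tolist()}\n"
            f"  metadata: {meta.columns.tolist()}"
        )


# Example: metadata with one column, computed data with two.
meta = pd.DataFrame(columns=[0])
df = pd.DataFrame(columns=[0, 1])
try:
    check_columns_match(meta, df)
except ValueError as err:
    message = str(err)
    print(message)
```

With a message like this, a mismatch in the number of columns would be visible directly in the traceback.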

@nixphix
Contributor

nixphix commented Dec 5, 2018

meta.columns is RangeIndex(start=0, stop=1, step=1), whereas df.columns is RangeIndex(start=0, stop=2, step=1)
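The mismatch can be reproduced with pandas alone. The sketch below assumes the metadata is inferred from a small sample of placeholder strings that contain no separator; the placeholder value `"foo"` is an assumption made for illustration, not something confirmed in this thread.

```python
import pandas as pd

# Assumption: the metadata sample holds placeholder strings without spaces.
meta_input = pd.Series(["foo", "foo"], name="name")
# Real data, as in the failing example above.
real_partition = pd.Series(["Ziyad Moraished"] * 3, name="name")

meta = meta_input.str.split(" ", expand=True)
df = real_partition.str.split(" ", expand=True)

# The placeholder sample yields one column, the real data yields two.
print(meta.shape[1], df.shape[1])
```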

@nixphix
Contributor

nixphix commented Dec 5, 2018

This is just for the code sample quoted above:

dask_df = dd.from_pandas(pd.DataFrame({'name': ['Ziyad Moraished'] * 1000000}), npartitions=10000)

dask_df['name'].str.split(' ', expand=True)

we really can't predict the number of splits ahead of time
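A small pandas sketch of why the width cannot be known in advance: the number of columns produced by `expand=True` depends on the longest row in each chunk of data. The names below are made-up sample values, and the two Series stand in for two partitions.

```python
import pandas as pd

# Two stand-in "partitions" with made-up names.
partitions = [
    pd.Series(["Ziyad Moraished", "Jane Doe"]),
    pd.Series(["Mary Ann Smith", "Cher"]),
]

# Each chunk is split independently, so each can have a different width.
widths = [p.str.split(" ", expand=True).shape[1] for p in partitions]
print(widths)
```

The first chunk produces two columns and the second produces three, so no single piece of metadata inferred without looking at the data fits both.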

@nixphix
Contributor

nixphix commented Dec 8, 2018

@mrocklin we could make the number-of-splits parameter mandatory when expansion is requested; that way we can be sure of the output width. Let me know what you're thinking.

@mrocklin
Member

mrocklin commented Dec 8, 2018

> we really can't predict the number of splits ahead of time

Hrm, you're right. That is unfortunate.

> Let me know what you're thinking

I don't know of a good general solution here. I wonder if anyone else has a suggestion.

As you suggest we could ask the user for the information. We could also compute things directly (this would be safer, but more expensive). I don't have strong thoughts on what is best here.
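The two options mentioned here can be sketched in plain pandas. Both helpers are hypothetical and are not taken from dask or from the eventual fix: `split_fixed` takes the number of splits from the user and always returns `n + 1` columns, while `count_columns` scans the data to find the width a full split would need, which is the safer but more expensive route.

```python
import re

import pandas as pd


def split_fixed(series: pd.Series, pat: str, n: int) -> pd.DataFrame:
    """Split at most n times and always return exactly n + 1 columns."""
    out = series.str.split(pat, n=n, expand=True)
    # Pad with NaN columns when no row in this chunk reached n splits.
    return out.reindex(columns=range(n + 1))


def count_columns(series: pd.Series, pat: str) -> int:
    """Scan all rows to find how many columns a full split would produce.

    Assumes the series has at least one non-missing string.
    """
    # str.count treats its argument as a regex, so escape the separator.
    return int(series.str.count(re.escape(pat)).max()) + 1


short = split_fixed(pd.Series(["Cher", "Prince"]), " ", n=1)
full = split_fixed(pd.Series(["Mary Ann Smith"]), " ", n=1)
needed = count_columns(pd.Series(["Mary Ann Smith", "Cher"]), " ")
print(short.shape, full.iloc[0].tolist(), needed)
```

Because `split_fixed` has a width that depends only on `n`, the output columns are known before any data is read, which is the property the metadata check needs.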

@jakirkham
Member

@TomAugspurger, do you have thoughts on this issue?

@mrocklin
Member

Fixed in #4744
