Skip to content

Add index to meta after .str.split with expand#7026

Merged
jsignell merged 5 commits intodask:masterfrom
rubenvdg:add-index-to-meta-str-split
Jan 6, 2021
Merged

Add index to meta after .str.split with expand#7026
jsignell merged 5 commits intodask:masterfrom
rubenvdg:add-index-to-meta-str-split

Conversation

@rubenvdg
Copy link
Copy Markdown
Contributor

@rubenvdg rubenvdg commented Jan 4, 2021

This PR proposes a fix for #7021.

Without the proposed fix, dask/dataframe/tests/test_accessors.py::test_str_accessor_expand fails if the index is of any other type than int.

  • Tests added / passed
  • Passes black dask / flake8 dask

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks nice, just one suggestion and a question. Good to go once those are addressed.

@@ -129,6 +129,7 @@ def split(self, pat=None, n=-1, expand=False):
delimiter = " " if pat is None else pat
meta = type(self._series._meta)([delimiter.join(["a"] * (n + 1))])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small suggestion, to make this more consistent with the rest of dask.

Suggested change
meta = type(self._series._meta)([delimiter.join(["a"] * (n + 1))])
meta = self._series._constructor([delimiter.join(["a"] * (n + 1))], index=self._series._meta_nonempty[:1].index)

That may not be properly linted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The self._series._meta_nonempty[:1].index is certainly nice. But if I'm not mistaken, the self._series._constructor needs to be called with dsk, name, meta, divisions (cf. new_dd_object(dsk, name, meta, divisions)). Hence, your suggestion results in TypeError: new_dd_object() got an unexpected keyword argument 'index'.

Maybe I miss the point.

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, sorry. Should be self._series._meta._constructor. That's similar to type(self._series._meta)(...), but a bit more correct (newer parts of dask are using the ._constructor approach).

delimiter = " " if pat is None else pat
meta = type(self._series._meta)([delimiter.join(["a"] * (n + 1))])
meta = meta.str.split(n=n, expand=expand, pat=pat)
meta = meta.iloc[:0].set_index(self._series._meta.index)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed if you apply my suggestion from above. It's not needed now that we create a meta with the proper index.

@rubenvdg
Copy link
Copy Markdown
Contributor Author

rubenvdg commented Jan 5, 2021

Should be good to go @TomAugspurger

My first ever OS contribution 😄 (apart from docs). Huuraaaah. Thanks for the review and feedback.

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dask/maintenance there's some failures on windows-3.8 at https://github.com/dask/dask/pull/7026/checks?check_run_id=1650637862. Are those know failures? (I've been a bit out of the loop recently 😄)

If those can be ignored then this is good to go. Thanks @rubenvdg.

@martindurant
Copy link
Copy Markdown
Member

Yes, we've seen the mysterious disappearance of dot on Windows in other PRs

@jsignell jsignell linked an issue Jan 6, 2021 that may be closed by this pull request
@jsignell
Copy link
Copy Markdown
Member

jsignell commented Jan 6, 2021

That failing windows test is indeed unrelated. It's fixed in #7031. But this one can just be merged as is. Thanks @rubenvdg!

@jsignell jsignell merged commit 5d3df51 into dask:master Jan 6, 2021
abduhbm pushed a commit to abduhbm/dask that referenced this pull request Jan 19, 2021
* Add index to meta after .str.split

* Remove old version

* iloc[:0] more expressive than .drop(0)

* beautify
@rubenvdg rubenvdg deleted the add-index-to-meta-str-split branch March 10, 2021 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

index.dtype changes from date to int after .str.split

4 participants