Add missing methods to Series #1259

Open
mrocklin opened this issue Jun 9, 2016 · 23 comments · Fixed by #7236
Labels
dataframe good first issue Clearly described and easy to accomplish. Good for beginners to the project.

Comments

@mrocklin
Member

mrocklin commented Jun 9, 2016

Pandas methods like to_timestamp are trivial to add to dask.dataframe. We should go through the API and verify that we've implemented everything like this that is more-or-less trivial to do.

@mrocklin mrocklin added good first issue Clearly described and easy to accomplish. Good for beginners to the project. dataframe labels Jun 9, 2016
@postelrich
Contributor

postelrich commented Oct 1, 2016

This is the list I came up with of missing methods that should be relatively simple to add.

  • T
  • align
  • all
  • any
  • applymap
  • axes
  • combine
  • combineAdd
  • combineMult
  • combine_first
  • compound
  • corrwith
  • diff
  • divide
  • dot
  • duplicated
  • eq
  • ewm
  • expanding
  • first_valid_index
  • ge
  • get_value
  • gt
  • iget_value
  • irow
  • isin
  • items
  • iteritems
  • kurt -> manually calculate 4th moment?
  • kurtosis ^^^
  • last_valid_index
  • le
  • lt
  • mad -> manually calculate?
  • ne
  • nsmallest
  • prod
  • product
  • reindex
  • reindex_axis
  • round
  • select_dtypes
  • stack
  • std -> manually calculate?
  • to_string
  • to_timestamp
  • transpose
  • unstack
  • var -> manually calculate?

@mrocklin
Member Author

mrocklin commented Oct 1, 2016

Nice list. It might be interesting to organize these methods by communication patterns that are already well supported with generic functions like elemwise (round, to_timestamp, to_string, ge, le, ...) and reduction (nsmallest, any, all, ...). There will also be others, like kurt, that will require more thought, and transpose, which we probably can't easily support.
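As a toy illustration of the elemwise bucket mentioned above, the per-partition delegation can be sketched like this. The `elemwise_method` helper and the plain-list `partitions` are stand-ins for dask's real graph machinery, not its actual API:

```python
import pandas as pd

def elemwise_method(partitions, method_name, *args, **kwargs):
    # Apply a row-wise pandas Series method to each partition independently;
    # no data moves between partitions, which is what makes these methods
    # "more-or-less trivial" to add.
    return [getattr(part, method_name)(*args, **kwargs) for part in partitions]

parts = [pd.Series([1.234, 5.678]), pd.Series([9.999])]
rounded = elemwise_method(parts, "round", 1)
```

Reductions like `any`, `all`, and `prod` need an extra combine step across partitions, which is what the generic `reduction` helper handles.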

JamesJeffryes added a commit to JamesJeffryes/dask that referenced this issue Oct 8, 2016
…tests not yet passing due to unexpected fill_values error.
mrocklin pushed a commit to mrocklin/dask that referenced this issue Oct 14, 2016
…tests not yet passing due to unexpected fill_values error.
mrocklin added a commit that referenced this issue Oct 14, 2016
* add applymap (#1259)

* add DataFrame.round (#1259)

* add series.round (#1259)

* add dataframe.to_timestamp (#1259)

* add dataframe and series elementwise comparisons. (#1259) Series tests not yet passing due to unexpected fill_values error.

* removed fill_value from Series (in pandas 19.0 but not 18.0). Passes tests now

* meta parameter for applymap

* update to_timestamp tests

* removed copy kwarg in to_timestamp

* removed unused StringIO import

* flake8 is picky

* Moved comparison tests to test_arithmetics_reduction

* updated timestamp tests. divisions are properly cast to TimeStamp

* remove old dt accessors

* apply to_timestamp to divisions

* add Series.to_timestamp
mrocklin pushed a commit that referenced this issue Oct 14, 2016
@jcrist jcrist changed the title Add to_timestamp to Series Add missing methods to Series Jan 30, 2018
@Cherrymelon

I really hope the unstack() method gets completed.

@kfeaginsiii

Howdy, I came across this while trying to use isin(). Looks like this has been open for some time. Is the idea that contributors will add functionality as needed?

@mrocklin
Member Author

Looks like this has been open for some time. Is the idea that contributors will add functionality as needed?

I suppose so, yes. If you have an interest in contributing improvements to isin that would be welcome.

We might also consider closing this. I suspect that it is out of date.

@kfeaginsiii

I can see if I can contribute anything worthwhile.

I think this issue is still valid, because at least for me, the ability to interact with dataframes using many of the above methods is pretty useful, if not a requirement. If there are other ways to accomplish the same things, it might be good to note how. That would help with the transition from pandas to dask, which I have struggled with somewhat. My use for dask isn't so much for memory size, but for compute. Yes, I could use other methods, but I'm lazy.

@tsktsktsk123

tsktsktsk123 commented Nov 6, 2018

Looks like this has been open for some time. Is the idea that contributors will add functionality as needed?

I suppose so, yes. If you have an interest in contributing improvements to isin that would be welcome.

We might also consider closing this. I suspect that it is out of date.

I'm afraid it is not yet out of date. I ran into .dt.to_timestamp raising an AttributeError today, and after some searching in the core I think I know what to change, but I'm unsure yet whether it's optimal, or even possible. Perhaps you could shed some of your expertise on the proposition?

The current implementation

https://github.com/dask/dask/blob/master/dask/dataframe/accessor.py#L88 Here, in the class below, the accessor is set once and never again. This is fine for every attribute found in the Series.dt accessor.

pd.Series.dt → CombinedDatetimelikeProperties, which contains:

 'to_pydatetime',
 'to_pytimedelta',

However, to_timestamp only occurs in dir(PeriodProperties):

class DatetimeAccessor(Accessor):
    _accessor = pd.Series.dt
    _accessor_name = 'dt'

The pandas implementation

However, the pandas Series DatetimeAccessor is set at runtime, meaning that whenever you call .dt on a Series object, it fires off the evaluation found in https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/accessors.py#L294

which actually calls the is_period_arraylike and/or is_datetime_arraylike functions on the data to evaluate the type (Period or Datetime) and thus returns the corresponding properties.

A possible solution?

I think this problem can be solved within Dask by allowing the DatetimeAccessor object to:

  1. Call a sample of the underlying Series
  2. Apply the resolution order of pandas with the is_period_arraylike and is_datetime_arraylike functions. (I believe this can be done as simply as pushing the entire sample through the CombinedDatetimelikeProperties class within pandas.)
  3. Set the correct _accessor based on the result. (Datetime / Timedelta / Period properties)

It doesn't seem too difficult to implement this, and I'm willing to do it sometime this week. I'm not yet sure, though, if this is possible. @mrocklin what do you think?
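The dtype-dependence described above is easy to see in plain pandas, which is why a static `_accessor = pd.Series.dt` cannot surface to_timestamp — the set of attributes behind .dt depends on the data at call time:

```python
import pandas as pd

# A datetime-backed series resolves to DatetimeProperties; a period-backed
# series resolves to PeriodProperties, which is the class that carries
# to_timestamp.
datetimes = pd.Series(pd.date_range("2018-01-01", periods=3))
periods = pd.Series(pd.period_range("2018-01", periods=3, freq="M"))

assert not hasattr(datetimes.dt, "to_timestamp")
assert hasattr(periods.dt, "to_timestamp")

converted = periods.dt.to_timestamp()  # Period -> Timestamp values
```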

@TomAugspurger
Member

I'd recommend waiting for pandas 0.24 before attempting this. Currently pandas doesn't have a proper period dtype, so the meta for a dask series with period data isn't "correct".

In [2]: ser = dd.from_pandas(pd.Series(pd.period_range('2017', periods=4)), 2)

In [3]: ser._meta_nonempty
Out[3]:
0    foo
1    foo
dtype: object

With pandas 0.24, the dtype will be Period[D] and dask can have some correct metadata.

Then, we can maybe just update with the following

diff --git a/dask/dataframe/accessor.py b/dask/dataframe/accessor.py
index 4b7920b8..11398617 100644
--- a/dask/dataframe/accessor.py
+++ b/dask/dataframe/accessor.py
@@ -93,9 +93,12 @@ class DatetimeAccessor(Accessor):
 
     >>> s.dt.microsecond  # doctest: +SKIP
     """
-    _accessor = pd.Series.dt
     _accessor_name = 'dt'
 
+    @property
+    def _accessor(self):
+        return self._series.dt
+
 
 class StringAccessor(Accessor):
     """ Accessor object for string properties of the Series values.

and things will all just work (maybe).
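The property-based pattern in the diff can be illustrated with a simplified stand-in class; the names mirror dask's accessor.py, but this is a sketch rather than the real implementation:

```python
import pandas as pd

class DatetimeAccessor:
    """Simplified stand-in for dask's dt accessor wrapper."""
    _accessor_name = "dt"

    def __init__(self, series):
        self._series = series

    @property
    def _accessor(self):
        # Resolved per instance at access time, so period-backed data picks
        # up PeriodProperties (and with it, to_timestamp).
        return self._series.dt

    def __getattr__(self, key):
        # Delegate unknown attributes to the live pandas .dt accessor.
        return getattr(self._accessor, key)

s = pd.Series(pd.period_range("2017", periods=4))
converted = DatetimeAccessor(s).to_timestamp()
```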

@tsktsktsk123

That'd be the neatest implementation. Sure worth the wait in that case.

@lazarillo

I'd like to start contributing to Dask, and this was tagged as a "good first issue", but it's not clear to me which items to focus on. Can I assume (a) that I should stay away from time-related API methods, and (b) that any of the unchecked items in the OP's checklist are equally helpful / worthwhile?

I'm thinking I'd start with T and reindex.

@TomAugspurger
Member

TomAugspurger commented Apr 21, 2019 via email

@lazarillo

OK. I'll look for another issue with clearer goals where I can help w/out causing more hassle. ;-p

Madhu94 added a commit to Madhu94/dask that referenced this issue Feb 16, 2021
Madhu94 added a commit to Madhu94/dask that referenced this issue Mar 19, 2021
jsignell pushed a commit that referenced this issue Mar 19, 2021
* Add Series.dot method to dataframe module

Ref: #1259

* Add meta kwarg to Series.dot method

* Add validation if other operand is not dask series / dataframe

* Address review comments

* Update comment

* Update tests
@Madhu94
Contributor

Madhu94 commented Mar 21, 2021

I believe this ticket was closed by mistake while merging #7236. Can this be reopened?

@martindurant martindurant reopened this Mar 22, 2021
@freyam
Contributor

freyam commented Apr 1, 2021

I'd like to take up this issue and help add the following method(s) to Dask's Series:

  • axes
  • product
  • divide (I believe this has already been added)
  • (feel free to suggest any other method from the complete list which is important or easy to implement)

@mrocklin @TomAugspurger which one of the listed ones or the ones I mentioned would be easier to implement?

This would be my first contribution to Dask 😄!

@TomAugspurger
Member

I think product or axes. product would be more useful, and would follow the pattern of Series.add.

@freyam
Contributor

freyam commented Apr 2, 2021

I think product or axes. product would be more useful, and would follow the pattern of Series.add.

Perfect! I will start working on dd.Series.product

@freyam
Contributor

freyam commented Apr 3, 2021

@TomAugspurger

Turns out, dd.DataFrame.prod already exists.

dask/dask/dataframe/core.py

Lines 1718 to 1740 in c5633c2

@derived_from(pd.DataFrame)
def prod(
    self,
    axis=None,
    skipna=True,
    split_every=False,
    dtype=None,
    out=None,
    min_count=None,
):
    result = self._reduction_agg(
        "prod", axis=axis, skipna=skipna, split_every=split_every, out=out
    )
    if min_count:
        cond = self.notnull().sum(axis=axis) >= min_count
        if is_series_like(cond):
            return result.where(cond, other=np.NaN)
        else:
            return _scalar_binary(
                lambda x, y: result if x is y else np.NaN, cond, True
            )
    else:
        return result

And I found out that it is the same as product, so I feel it would be redundant to implement a new method, dd.DataFrame.product.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.product.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.prod.html

And np.prod and ndarray.prod both end up calling umath.multiply.reduce, so there is really no difference between them, besides the fact that the free functions can accept array-like types (like Python lists) in addition to NumPy arrays.

  • For an ndarray, prod() and product() are equivalent.
  • For an ndarray, prod() and product() will both call um.multiply.reduce().
  • If the object type is not ndarray but it still has a prod method, then prod() will return prod(axis=axis, dtype=dtype, out=out, **kwargs) whereas product will try to use um.multiply.reduce.
  • If the object is not an ndarray and it does not have a prod method, then it will behave as product().
  • The ndarray.prod() is equivalent to prod()

Source: https://stackoverflow.com/questions/49863633/numpy-product-vs-numpy-prod-vs-ndarray-prod

According to my understanding, this is also relevant to Dask. So, I propose adding an alias to the dd.DataFrame.prod method call. Would you be fine with that?

This could be as simple as inserting a single line below the prod method definition,
freyam@68941c5?branch=68941c5d3ccc93da7f3515fc98d7062716ba5b06&diff=unified
and adding the required test methods similar to the prod tests in dask/dataframe/tests/test_arithmetics_reduction.py and the other test files.

Should I proceed with this, or do you have other plans in mind?
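For what it's worth, pandas itself already exposes product as an alias of prod, which is exactly the behavior the proposed one-line alias would mirror on the dask side:

```python
import pandas as pd

# product is documented as an alias of prod in pandas, so the two calls
# agree element-for-element.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

assert (df.prod() == df.product()).all()
# Column-wise products: a -> 1*2*3, b -> 4*5*6
totals = df.product()
```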

@freyam
Contributor

freyam commented Apr 4, 2021

I have opened a new pull request, #7517, regarding the product() method. It passed all the tests and has no conflicts.

Now, I shall start working on adding the axes() and duplicated() methods, while waiting for your review on the PR.

@AbhiSingam

Hey!!
I'd like to work on this issue, specifically on the following methods:

  • to_string
  • first_valid_index
  • last_valid_index

These seemed like good methods to pick for my first contribution to Dask. I'd love any input or suggestions on what other methods I should take up, or any specifics regarding the methods chosen.
For reference, I've gone through the pandas documentation and looked at the descriptions given there.

CC: @mrocklin @TomAugspurger @postelrich
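A rough sketch of how first_valid_index might work across ordered partitions, illustrated with plain pandas objects; the `first_valid_index` helper and the plain-list `parts` here are stand-ins for dask's partition machinery, not its actual implementation:

```python
import pandas as pd

def first_valid_index(partitions):
    # Scan ordered partitions front to back and return the first index
    # label holding a non-null value; later partitions never need to be
    # inspected once a hit is found.
    for part in partitions:
        idx = part.first_valid_index()
        if idx is not None:
            return idx
    return None

parts = [
    pd.Series([None, None], index=[0, 1]),
    pd.Series([None, 3.0], index=[2, 3]),
]
found = first_valid_index(parts)
```

last_valid_index would be the mirror image, scanning partitions back to front.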

@martindurant
Member

I'd love any input or suggestions on what other methods should take up or any specifics regarding the methods chosen.

I think you're best off implementing what you can, and then coming back for more.

@freyam
Contributor

freyam commented Apr 16, 2021

What do you expect as the return value of axes() when it comes to Dask DataFrames?

Over at Pandas, pandas.DataFrame.axes

Return a list representing the axes of the DataFrame.
It has the row axis labels and column axis labels as the only members. They are returned in that order.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.axes.html

@jsignell
Member

I would expect axes() to return a list containing:

  • df.index (which is a dask.dataframe Index)
  • df.columns (which is a pandas Index)
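The description above can be sketched on a small wrapper class; in dask, df.index would be a lazy dask.dataframe Index while df.columns is already a concrete pandas Index, but the ordering is the same:

```python
import pandas as pd

class FrameLike:
    """Illustrative stand-in for a dataframe exposing an axes property."""

    def __init__(self, pdf):
        self._pdf = pdf

    @property
    def axes(self):
        # Row axis first, then column axis, matching the pandas ordering.
        return [self._pdf.index, self._pdf.columns]

row_axis, col_axis = FrameLike(pd.DataFrame({"a": [1], "b": [2]})).axes
```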

dotNomad added a commit to dotNomad/dask that referenced this issue Aug 19, 2021
@mobley-trent

Hello @postelrich, is this issue still open?
