Add missing methods to Series #1259

Open
mrocklin opened this issue Jun 9, 2016 · 23 comments · Fixed by #7236
Labels
dataframe good first issue Clearly described and easy to accomplish. Good for beginners to the project.

Comments

@mrocklin
Member

mrocklin commented Jun 9, 2016

Pandas methods like to_timestamp are trivial to add to dask.dataframe. We should go through the API and verify that we've implemented everything like this that is more-or-less trivial to do.

@mrocklin mrocklin added good first issue Clearly described and easy to accomplish. Good for beginners to the project. dataframe labels Jun 9, 2016
@postelrich
Contributor

postelrich commented Oct 1, 2016

This is the list I came up with of missing methods that should be relatively simple to add.

  • T
  • align
  • all
  • any
  • applymap
  • axes
  • combine
  • combineAdd
  • combineMult
  • combine_first
  • compound
  • corrwith
  • diff
  • divide
  • dot
  • duplicated
  • eq
  • ewm
  • expanding
  • first_valid_index
  • ge
  • get_value
  • gt
  • iget_value
  • irow
  • isin
  • items
  • iteritems
  • kurt -> manually calculate 4th moment?
  • kurtosis ^^^
  • last_valid_index
  • le
  • lt
  • mad -> manually calculate?
  • ne
  • nsmallest
  • prod
  • product
  • reindex
  • reindex_axis
  • round
  • select_dtypes
  • stack
  • std -> manually calculate?
  • to_string
  • to_timestamp
  • transpose
  • unstack
  • var -> manually calculate?

@mrocklin
Member Author

mrocklin commented Oct 1, 2016

Nice list. It might be interesting to organize these methods by communication patterns that are already well supported with generic functions like elemwise (round, to_timestamp, to_string, ge, le, ...) and reduction (nsmallest, any, all, ...). There will also be others, like kurt, that will require more thought, and transpose, which we probably can't easily support.
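As a toy illustration of the elemwise bucket mentioned above, the per-partition delegation can be sketched like this. The `elemwise_method` helper and the plain-list `partitions` are stand-ins for dask's real graph machinery, not its actual API:

```python
import pandas as pd

def elemwise_method(partitions, method_name, *args, **kwargs):
    # Apply a row-wise pandas Series method to each partition independently;
    # no data moves between partitions, which is what makes these methods
    # "more-or-less trivial" to add.
    return [getattr(part, method_name)(*args, **kwargs) for part in partitions]

parts = [pd.Series([1.234, 5.678]), pd.Series([9.999])]
rounded = elemwise_method(parts, "round", 1)
```

Reductions like `any`, `all`, and `prod` need an extra combine step across partitions, which is what the generic `reduction` helper handles.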

JamesJeffryes added a commit to JamesJeffryes/dask that referenced this issue Oct 8, 2016
…tests not yet passing due to unexpected fill_values error.
mrocklin pushed a commit to mrocklin/dask that referenced this issue Oct 14, 2016
…tests not yet passing due to unexpected fill_values error.
mrocklin added a commit that referenced this issue Oct 14, 2016
* add applymap (#1259)

* add DataFrame.round (#1259)

* add series.round (#1259)

* add dataframe.to_timestamp (#1259)

* add dataframe and series elementwise comparisons. (#1259) Series tests not yet passing due to unexpected fill_values error.

* removed fill_value from Series (in pandas 19.0 but not 18.0). Passes tests now

* meta parameter for applymap

* update to_timestamp tests

* removed copy kwarg in to_timestamp

* removed unused StringIO import

* flake8 is picky

* Moved comparison tests to test_arithmetics_reduction

* updated timestamp tests. divisions are properly cast to TimeStamp

* remove old dt accessors

* apply to_timestamp to divisions

* add Series.to_timestamp
mrocklin pushed a commit that referenced this issue Oct 14, 2016
@jcrist jcrist changed the title Add to_timestamp to Series Add missing methods to Series Jan 30, 2018
@Cherrymelon

I really hope the unstack() method gets completed.

@kfeaginsiii

Howdy, I came across this while trying to use isin(). Looks like this has been open for some time. Is the idea that contributors will add functionality as needed?

@mrocklin
Member Author

Looks like this has been open for some time. Is the idea that contributors will add functionality as needed?

I suppose so, yes. If you have an interest in contributing improvements to isin that would be welcome.

We might also consider closing this. I suspect that it is out of date.

@kfeaginsiii

I can see if I can contribute anything worthwhile.

I think this issue is still valid, because at least for me, the ability to interact with dataframes using many of the above methods is pretty useful, if not a requirement. If there are other ways to accomplish the same things, it might be good to note how. That would help with the transition from pandas to dask, which I have struggled with somewhat. My use for dask isn't so much for memory size, but for compute. Yes, I could use other methods, but I'm lazy.

@tsktsktsk123

tsktsktsk123 commented Nov 6, 2018

Looks like this has been open for some time. Is the idea that contributors will add functionality as needed?

I suppose so, yes. If you have an interest in contributing improvements to isin that would be welcome.

We might also consider closing this. I suspect that it is out of date.

I'm afraid it is not yet out of date. I ran into .dt.to_timestamp raising an AttributeError today, and after some searching in the core I think I know what to change, but I'm unsure yet whether it's optimal, or even possible. Perhaps you could shed some of your expertise on the proposition?

The current implementation

https://github.com/dask/dask/blob/master/dask/dataframe/accessor.py#L88 Here, in the class below, the accessor is set once and never again. This is fine for every attribute found in the Series.dt accessor.

pd.Series.dt → CombinedDatetimelikeProperties, which contains:

 'to_pydatetime',
 'to_pytimedelta',

However, to_timestamp only occurs in dir(PeriodProperties):

class DatetimeAccessor(Accessor):
    _accessor = pd.Series.dt
    _accessor_name = 'dt'

The pandas implementation

However, the pandas Series DatetimeAccessor is set at runtime, meaning that whenever you call .dt on a Series object, it fires off the evaluation found in https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/accessors.py#L294

which actually calls the is_period_arraylike and/or is_datetime_arraylike functions on the data to evaluate the type (Period or Datetime) and thus returns the corresponding properties.

A possible solution?

I think this problem can be solved within Dask by allowing the DatetimeAccessor object to:

  1. Call a sample of the underlying Series
  2. Apply the resolution order of pandas with the is_period_arraylike and is_datetime_arraylike functions. (I believe this can be done as simply as pushing the entire sample through the CombinedDatetimelikeProperties class within pandas.)
  3. Set the correct _accessor based on the result. (Datetime / Timedelta / Period properties)

It doesn't seem too difficult to implement this, and I'm willing to do it sometime this week. I'm not yet sure, though, if this is possible. @mrocklin what do you think?
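The dtype-dependence described above is easy to see in plain pandas, which is why a static `_accessor = pd.Series.dt` cannot surface to_timestamp — the set of attributes behind .dt depends on the data at call time:

```python
import pandas as pd

# A datetime-backed series resolves to DatetimeProperties; a period-backed
# series resolves to PeriodProperties, which is the class that carries
# to_timestamp.
datetimes = pd.Series(pd.date_range("2018-01-01", periods=3))
periods = pd.Series(pd.period_range("2018-01", periods=3, freq="M"))

assert not hasattr(datetimes.dt, "to_timestamp")
assert hasattr(periods.dt, "to_timestamp")

converted = periods.dt.to_timestamp()  # Period -> Timestamp values
```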

@TomAugspurger
Member

I'd recommend waiting for pandas 0.24 before attempting this. Currently pandas doesn't have a proper period dtype, so the meta for a dask series with period data isn't "correct".

In [2]: ser = dd.from_pandas(pd.Series(pd.period_range('2017', periods=4)), 2)

In [3]: ser._meta_nonempty
Out[3]:
0    foo
1    foo
dtype: object

With pandas 0.24, the dtype will be Period[D] and dask can have some correct metadata.

Then, we can maybe just update with the following

diff --git a/dask/dataframe/accessor.py b/dask/dataframe/accessor.py
index 4b7920b8..11398617 100644
--- a/dask/dataframe/accessor.py
+++ b/dask/dataframe/accessor.py
@@ -93,9 +93,12 @@ class DatetimeAccessor(Accessor):
 
     >>> s.dt.microsecond  # doctest: +SKIP
     """
-    _accessor = pd.Series.dt
     _accessor_name = 'dt'
 
+    @property
+    def _accessor(self):
+        return self._series.dt
+
 
 class StringAccessor(Accessor):
     """ Accessor object for string properties of the Series values.

and things will all just work (maybe).
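The property-based pattern in the diff can be illustrated with a simplified stand-in class; the names mirror dask's accessor.py, but this is a sketch rather than the real implementation:

```python
import pandas as pd

class DatetimeAccessor:
    """Simplified stand-in for dask's dt accessor wrapper."""
    _accessor_name = "dt"

    def __init__(self, series):
        self._series = series

    @property
    def _accessor(self):
        # Resolved per instance at access time, so period-backed data picks
        # up PeriodProperties (and with it, to_timestamp).
        return self._series.dt

    def __getattr__(self, key):
        # Delegate unknown attributes to the live pandas .dt accessor.
        return getattr(self._accessor, key)

s = pd.Series(pd.period_range("2017", periods=4))
converted = DatetimeAccessor(s).to_timestamp()
```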

@tsktsktsk123

That'd be the neatest implementation. Sure worth the wait in that case.

@lazarillo

I'd like to start contributing to Dask, and this was tagged as a "good first issue", but it's not clear to me which items to focus on. Can I assume (a) that I should stay away from time-related API methods, and (b) that any of the unchecked items in the OP's checklist are equally helpful / worthwhile?

I'm thinking I'd start with T and reindex.

@TomAugspurger
Member

TomAugspurger commented Apr 21, 2019 via email

@lazarillo

OK. I'll look for another issue with clearer goals where I can help w/out causing more hassle. ;-p

Madhu94 added a commit to Madhu94/dask that referenced this issue Feb 16, 2021
Madhu94 added a commit to Madhu94/dask that referenced this issue Mar 19, 2021
jsignell pushed a commit that referenced this issue Mar 19, 2021
* Add Series.dot method to dataframe module

Ref: #1259

* Add meta kwarg to Series.dot method

* Add validation if other operand is not dask series / dataframe

* Address review comments

* Update comment

* Update tests
@Madhu94
Contributor

Madhu94 commented Mar 21, 2021

I believe this ticket was closed by mistake while merging #7236. Can this be reopened?

@martindurant martindurant reopened this Mar 22, 2021
@freyam
Contributor

freyam commented Apr 1, 2021

I'd like to take up this issue and help add the following method(s) to Dask's Series:

  • axes
  • product
  • divide (I believe this has already been added)
  • (feel free to suggest any other method from the complete list which is important or easy to implement)

@mrocklin @TomAugspurger which one of the listed ones or the ones I mentioned would be easier to implement?

This would be my first contribution to Dask 😄!

@TomAugspurger
Member

I think product or axes. product would be more useful, and would follow the pattern of Series.add.

@freyam
Contributor

freyam commented Apr 2, 2021

I think product or axes. product would be more useful, and would follow the pattern of Series.add.

Perfect! I will start working on dd.Series.product

@freyam
Contributor

freyam commented Apr 3, 2021

@TomAugspurger

Turns out, dd.DataFrame.prod already exists.

dask/dask/dataframe/core.py

Lines 1718 to 1740 in c5633c2

@derived_from(pd.DataFrame)
def prod(
    self,
    axis=None,
    skipna=True,
    split_every=False,
    dtype=None,
    out=None,
    min_count=None,
):
    result = self._reduction_agg(
        "prod", axis=axis, skipna=skipna, split_every=split_every, out=out
    )
    if min_count:
        cond = self.notnull().sum(axis=axis) >= min_count
        if is_series_like(cond):
            return result.where(cond, other=np.NaN)
        else:
            return _scalar_binary(
                lambda x, y: result if x is y else np.NaN, cond, True
            )
    else:
        return result

And I found out that it is the same as product, so I feel it would be redundant to implement a new method, dd.DataFrame.product.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.product.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.prod.html

And np.prod and ndarray.prod both end up calling umath.multiply.reduce, so there is really no difference between them, besides the fact that the free functions can accept array-like types (like Python lists) in addition to NumPy arrays.

  • For an ndarray, prod() and product() are equivalent.
  • For an ndarray, prod() and product() will both call um.multiply.reduce().
  • If the object type is not ndarray but it still has a prod method, then prod() will return prod(axis=axis, dtype=dtype, out=out, **kwargs) whereas product will try to use um.multiply.reduce.
  • If the object is not an ndarray and it does not have a prod method, then it will behave as product().
  • The ndarray.prod() is equivalent to prod()

Source: https://stackoverflow.com/questions/49863633/numpy-product-vs-numpy-prod-vs-ndarray-prod

According to my understanding, this is also relevant to Dask. So, I propose adding an alias to the dd.DataFrame.prod method call. Would you be fine with that?

This could be as simple as inserting a single line below the prod method definition,
freyam@68941c5?branch=68941c5d3ccc93da7f3515fc98d7062716ba5b06&diff=unified
and adding the required test methods similar to the prod tests in dask/dataframe/tests/test_arithmetics_reduction.py and the other test files.

Should I proceed with this, or do you have other plans in mind?
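For what it's worth, pandas itself already exposes product as an alias of prod, which is exactly the behavior the proposed one-line alias would mirror on the dask side:

```python
import pandas as pd

# product is documented as an alias of prod in pandas, so the two calls
# agree element-for-element.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

assert (df.prod() == df.product()).all()
# Column-wise products: a -> 1*2*3, b -> 4*5*6
totals = df.product()
```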

@freyam
Contributor

freyam commented Apr 4, 2021

I have opened a new pull request, #7517, regarding the product() method. It passed all the tests and has no conflicts.

Now, I shall start working on adding the axes() and duplicated() methods, while waiting for your review on the PR.

@AbhiSingam

Hey!!
I'd like to work on this issue, specifically on the following methods:

  • to_string
  • first_valid_index
  • last_valid_index

These seemed like good methods to pick for my first contribution to Dask. I'd love any input or suggestions on what other methods I should take up, or any specifics regarding the methods chosen.
For reference, I've gone through the pandas documentation and looked at the descriptions given there.

CC: @mrocklin @TomAugspurger @postelrich
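A rough sketch of how first_valid_index might work across ordered partitions, illustrated with plain pandas objects; the `first_valid_index` helper and the plain-list `parts` here are stand-ins for dask's partition machinery, not its actual implementation:

```python
import pandas as pd

def first_valid_index(partitions):
    # Scan ordered partitions front to back and return the first index
    # label holding a non-null value; later partitions never need to be
    # inspected once a hit is found.
    for part in partitions:
        idx = part.first_valid_index()
        if idx is not None:
            return idx
    return None

parts = [
    pd.Series([None, None], index=[0, 1]),
    pd.Series([None, 3.0], index=[2, 3]),
]
found = first_valid_index(parts)
```

last_valid_index would be the mirror image, scanning partitions back to front.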

@martindurant
Member

I'd love any input or suggestions on what other methods should take up or any specifics regarding the methods chosen.

I think you're best off implementing what you can, and then coming back for more.

@freyam
Contributor

freyam commented Apr 16, 2021

What do you expect as the return value of axes() when it comes to Dask DataFrames?

Over at Pandas, pandas.DataFrame.axes

Return a list representing the axes of the DataFrame.
It has the row axis labels and column axis labels as the only members. They are returned in that order.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.axes.html

@jsignell
Member

I would expect axes() to return a list containing:

  • df.index (which is a dask.dataframe Index)
  • df.columns (which is a pandas Index)
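The description above can be sketched on a small wrapper class; in dask, df.index would be a lazy dask.dataframe Index while df.columns is already a concrete pandas Index, but the ordering is the same:

```python
import pandas as pd

class FrameLike:
    """Illustrative stand-in for a dataframe exposing an axes property."""

    def __init__(self, pdf):
        self._pdf = pdf

    @property
    def axes(self):
        # Row axis first, then column axis, matching the pandas ordering.
        return [self._pdf.index, self._pdf.columns]

row_axis, col_axis = FrameLike(pd.DataFrame({"a": [1], "b": [2]})).axes
```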

dotNomad added a commit to dotNomad/dask that referenced this issue Aug 19, 2021
@mobley-trent

Hello @postelrich, is this issue still open?
