Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some of the lowhanging fruit from dask.dataframe #1317

Merged
merged 21 commits into from
Dec 2, 2015

Conversation

cowlicks
Copy link
Member

No description provided.

def compute_up(expr, data, **kwargs):
values = [compute(val, {expr._child: data}) for val in expr.values]
if expr.keepdims:
return DataFrame([values], columns=expr.fields)
if isinstance(data, Dataframe):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be type(data)(...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also a typo Dataframe should be DataFrame ... i think type(self)(...) would solve the problem

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that is much better.

@cpcloud
Copy link
Member

cpcloud commented Nov 24, 2015

looks like dask dataframes are missing a shape attribute ... does that even make sense for that object?

@mrocklin
Copy link
Member

No, dask.dataframes do not know their own length, except through computation.

@cpcloud
Copy link
Member

cpcloud commented Nov 24, 2015

@cowlicks You'll probably have to implement a new dispatched version of those expressions.

@cpcloud
Copy link
Member

cpcloud commented Nov 24, 2015

@cowlicks You could probably do nelements with series.isnull().sum() + series.count() and count should probably be series.count() if it isn't already

@cowlicks cowlicks force-pushed the dataframe branch 2 times, most recently from 2cb3a94 to 238c963 Compare November 24, 2015 16:50
@cowlicks
Copy link
Member Author

@mrocklin dask.dataframes do have a len method, which silently does computation because it is usually cheap.

@cpcloud hows this? 238c963

@mrocklin
Copy link
Member

The cost of len(df) depends strongly on the graph that creates the dataframe. If it is based off of an in-memory pandas dataframe then yes, it is cheap. However, if it is based off of something more expensive, like a CSV file, then it can be quite expensive to compute.

@cowlicks
Copy link
Member Author

@mrocklin I'll leave it out then.

@cpcloud
Copy link
Member

cpcloud commented Nov 24, 2015

I'd be okay with leaving out nelements but we can't leave out count

@cpcloud
Copy link
Member

cpcloud commented Nov 24, 2015

@cowlicks On second thought, I still think nelements should work. I don't think we should fail because it may take a long time to compute. A user is going to need to have some knowledge of the thing they are computing on to be able to predict performance. A warning would suffice I think

@cowlicks
Copy link
Member Author

@cpcloud any idea why this call to count does not dispatch to here?

@cpcloud
Copy link
Member

cpcloud commented Nov 24, 2015

because the test is operating on a Field (which corresponds to a Series) and that implementation is for count on an entire Record datashape object (corresponds to a pandas DataFrame)

@cowlicks cowlicks removed the wip label Dec 1, 2015
@cowlicks cowlicks changed the title WIP: Add some of the lowhanging fruit from dask.dataframe Add some of the lowhanging fruit from dask.dataframe Dec 1, 2015
@cowlicks
Copy link
Member Author

cowlicks commented Dec 1, 2015

This is not comprehensive, but includes a lot of easy stuff from dask dataframe. I believe the tests cover all the added compute_up methods.

@cpcloud
Copy link
Member

cpcloud commented Dec 1, 2015

@cowlicks can you add a release note in docs/source/whatsnew/0.9.0.txt? then good to merge

@cowlicks
Copy link
Member Author

cowlicks commented Dec 2, 2015

Added release notes. Merging soon unless there are further comments.

cpcloud added a commit that referenced this pull request Dec 2, 2015
Add some of the lowhanging fruit from dask.dataframe
@cpcloud cpcloud merged commit d076183 into blaze:master Dec 2, 2015
@cowlicks cowlicks deleted the dataframe branch December 3, 2015 16:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants