
Replace chunks backend with dask for CSV reading. #1549

Merged · 5 commits · Jul 25, 2016

Conversation

stefanseefeld
Contributor

This implements the simple backend replacement. It was tested against dask master. While explicitly specifying a blocksize affects performance, even the default is now faster than the chunks backend was.

This change may obsolete a number of other functions that are no longer reachable, so a little more cleanup may be warranted.
I'm also not quite sure how much testing this needs.

result = compute(s[s.sepal_length > 5.0].species,
                 csv, comfortable_memory=10)
assert len(result) == 118
Member

To verify we're doing the lean projection, this needs to be something like:

result = pre_compute(s[s.sepal_length > 5.0].species, csv, comfortable_memory=10)
assert set(result.columns) == set(['sepal_length', 'species'])

Specifically, we need to test the pre_compute() stage explicitly.

Contributor Author

Fixed.

@coveralls

coveralls commented Jul 22, 2016

Coverage Status

Coverage decreased (-0.02%) to 89.567% when pulling 21d6632 on stefanseefeld:dask into e601b98 on blaze:master.

@coveralls

coveralls commented Jul 22, 2016

Coverage Status

Coverage decreased (-0.02%) to 89.567% when pulling 5b5e024 on stefanseefeld:dask into e601b98 on blaze:master.

@kwmsmith
Member

kwmsmith commented Jul 22, 2016

We can remove these functions now that we no longer need chunks with CSV:

https://github.com/blaze/blaze/blob/master/blaze/compute/csv.py#L76-L97

Looking at the coverage reports, these functions aren't hit anymore during testing.

@coveralls

coveralls commented Jul 22, 2016

Coverage Status

Coverage increased (+0.2%) to 89.741% when pulling b2b3575 on stefanseefeld:dask into e601b98 on blaze:master.

@llllllllll
Member

What version of pandas is required by dask.dataframe?

@stefanseefeld
Contributor Author

pandas>=0.18.0

@llllllllll
Member

I was going to say that I'm using blaze on a project that supported pandas 0.17, but we just talked it over and decided to drop pandas 0.17 support, so that's no longer an issue.

@kwmsmith
Member

@llllllllll great -- any other comments on this PR? Otherwise I'll merge.

@llllllllll
Member

We need to add dask[dataframe] to the requirements; other than that, LGTM.
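A hedged sketch of pulling in that extra; the exact layout of blaze's requirements files is not shown in this thread:

```shell
# dask's "dataframe" extra pulls in pandas and the other
# dependencies dask.dataframe needs.
pip install "dask[dataframe]"
```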

@kwmsmith kwmsmith merged commit 9b110f9 into blaze:master Jul 25, 2016