Replace chunks backend with dask for CSV reading. #1549
set(['sepal_length', 'species'])
result = compute(s[s.sepal_length > 5.0].species,
                 csv, comfortable_memory=10)
assert len(result) == 118
To verify we're doing the lean projection, this needs to be something like:
    result = pre_compute(s[s.sepal_length > 5.0].species, csv, comfortable_memory=10)
    assert set(result.columns) == set(['sepal_length', 'species'])
Specifically, we need to test the pre_compute() stage explicitly.
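As a rough illustration of why the column check matters, here is a stdlib-only sketch of what a lean projection does: only the columns the expression touches are carried forward before the filter runs. This is not blaze's `pre_compute()`; the sample rows and the `project` helper are made up for the example.

```python
import csv
import io

# Hypothetical sample rows standing in for the iris CSV used in the test.
SAMPLE = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""


def project(lines, columns):
    """Yield one dict per CSV row, keeping only the requested columns."""
    for row in csv.DictReader(lines):
        yield {name: row[name] for name in columns}


# Columns referenced by s[s.sepal_length > 5.0].species
needed = ['sepal_length', 'species']
projected = list(project(io.StringIO(SAMPLE), needed))

# The intermediate result carries only the columns the expression needs.
assert set(projected[0]) == set(needed)

# The filter and the final column selection then run on the narrow rows.
species = [r['species'] for r in projected if float(r['sepal_length']) > 5.0]
```

Checking only `len(result)` on the final output would pass even if every column had been loaded, which is why the assertion has to look at the intermediate columns.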
Fixed.
We can remove these functions now that we no longer need chunks with CSV: https://github.com/blaze/blaze/blob/master/blaze/compute/csv.py#L76-L97
Looking at the coverage reports, these functions aren't hit anymore during testing.
What version of pandas is required by dask.dataframe?
pandas>=0.18.0
I was going to say that I am using blaze on a project that supported pandas 0.17, but we just had a talk and decided to drop pandas 0.17 support so that is no longer an issue.
@llllllllll great -- any other comments on this PR? Otherwise will merge.
We need to add
This implements the simple backend replacement. It was tested against dask master. While explicitly specifying a blocksize affects performance, even the default now performs better than the chunks backend.
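For context on why blocksize matters: roughly speaking, a blocksize-based reader cuts the file into byte ranges of about that size, aligned to line boundaries, and parses each range as its own partition. A minimal sketch of that idea using only the standard library follows. It is not dask's implementation, and `iter_blocks` and the sample data are hypothetical.

```python
import io


def iter_blocks(f, blocksize):
    """Yield byte blocks of roughly `blocksize` bytes, each ending on a newline."""
    while True:
        block = f.read(blocksize)
        if not block:
            return
        # Extend to the end of the current line so that no row is split
        # across two blocks.
        if not block.endswith(b"\n"):
            block += f.readline()
        yield block


# Small in-memory CSV: a header line followed by ten rows.
data = b"a,b\n" + b"".join(b"%d,%d\n" % (i, i * 2) for i in range(10))

blocks = list(iter_blocks(io.BytesIO(data), blocksize=16))
```

Smaller blocks give more partitions to schedule and parse, larger ones fewer, which is where the performance trade-off comes from. Note that in this sketch only the first block contains the header line, so a real reader has to pass the column names along to the other blocks.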
This change may make a number of other functions obsolete because they are no longer reachable, so a little more cleanup may be warranted.
I'm also not quite sure how much testing this needs.