Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

very minor change: clearer exception message when using index or index_col in read_csv #4651

Merged
merged 1 commit into from Mar 29, 2019

Conversation

Projects
None yet
3 participants
@alvaroabascar
Copy link
Contributor

commented Mar 29, 2019

Using dask.dataframe.read_csv with either index or index_col keyword produces the same message: ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead. This just changes it to ValueError: Keywords 'index' and 'index_col' are not supported. Use dd.read_csv(...).set_index('my-index') instead.

  • Tests added / passed
  • Passes flake8 dask
@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

I think this is reasonable. For reference, use of index_col= currently causes a more obscure unexpected keyword exception.

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Although, before merging, it may be worthwhile asking the question of why setting the index in the reader doesn't work - should produce a meta with that index, but unknown divisions, right?

If no further comments on the change or my question, will merge later.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Setting the index would trigger a full sort operation

So is the worry that allowing index_col= suggests to the user that a sort would take place, when in fact we'd be doing the simpler pandas operation? Since the result would have unknown divisions, I think it is a useful option to allow. Disallowing the kwarg and suggesting to call set_index seems to force users into the slow shuffle path, which they may not need.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

So is the worry that allowing index_col= suggests to the user that a sort would take place, when in fact we'd be doing the simpler pandas operation?

Yes, I suspect so, though it's been a long time since I've thought about this.

Since the result would have unknown divisions, I think it is a useful option to allow. Disallowing the kwarg and suggesting to call set_index seems to force users into the slow shuffle path, which they may not need.

I agree that there might be some use here. I suggest that if this is important to you that we move this to a separate issue and merge this in. This PR is clearly a step in a positive direction. I don't think that there is a need to change behavior in this PR.

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

This PR is clearly a step in a positive direction.

agree

@martindurant martindurant merged commit c7d6024 into dask:master Mar 29, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@alvaroabascar alvaroabascar deleted the alvaroabascar:clearer_exception branch Mar 30, 2019

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.