Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

very minor change: clearer exception message when using index or index_col in read_csv #4651

Merged
merged 1 commit into from Mar 29, 2019

Conversation

@alvaroabascar
Copy link
Contributor

@alvaroabascar alvaroabascar commented Mar 29, 2019

Using dask.dataframe.read_csv with either index or index_col keyword produces the same message: ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead. This just changes it to ValueError: Keywords 'index' and 'index_col' are not supported. Use dd.read_csv(...).set_index('my-index') instead.

  • Tests added / passed
  • Passes flake8 dask
@martindurant
Copy link
Member

@martindurant martindurant commented Mar 29, 2019

I think this is reasonable. For reference, use of index_col= currently causes a more obscure unexpected keyword exception.

@martindurant
Copy link
Member

@martindurant martindurant commented Mar 29, 2019

Although, before merging, it may be worthwhile asking the question of why setting the index in the reader doesn't work - should produce a meta with that index, but unknown divisions, right?

If no further comments on the change or my question, will merge later.

@mrocklin
Copy link
Member

@mrocklin mrocklin commented Mar 29, 2019

@martindurant
Copy link
Member

@martindurant martindurant commented Mar 29, 2019

Setting the index would trigger a full sort operation

So is the worry that allowing index_col= suggests to the user that a sort would take place, when in fact we'd be doing the simpler pandas operation? Since the result would have unknown divisions, I think it is a useful option to allow. Disallowing the kwarg and suggesting to call set_index seems to force users into the slow shuffle path, which they may not need.

@mrocklin
Copy link
Member

@mrocklin mrocklin commented Mar 29, 2019

So is the worry that allowing index_col= suggests to the user that a sort would take place, when in fact we'd be doing the simpler pandas operation?

Yes, I suspect so, though it's been a long time since I've thought about this.

Since the result would have unknown divisions, I think it is a useful option to allow. Disallowing the kwarg and suggesting to call set_index seems to force users into the slow shuffle path, which they may not need.

I agree that there might be some use here. I suggest that if this is important to you that we move this to a separate issue and merge this in. This PR is clearly a step in a positive direction. I don't think that there is a need to change behavior in this PR.

@martindurant
Copy link
Member

@martindurant martindurant commented Mar 29, 2019

This PR is clearly a step in a positive direction.

agree

@martindurant martindurant merged commit c7d6024 into dask:master Mar 29, 2019
2 checks passed
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@alvaroabascar alvaroabascar deleted the clearer_exception branch Mar 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants