Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support in read_csv for parameter skiprows to not only be int or list (like in pandas) #4560

Merged
merged 4 commits into from Mar 7, 2019

Conversation

Projects
None yet
2 participants
@JulianWgs
Copy link
Contributor

commented Mar 6, 2019

  • The dask.dataframe.read_csv function now mimics more the behavior of
    the pandas read_csv function
  • The pandas read_csv function allows the
    skiprows parameter to be of type range or set
  • Also case when not assigning skiprows is now explicit

Test it out for yourself:
import dask.dataframe as dd
import pandas as pd
df = pd.read_csv("file.csv", skiprows=range(18)) # this works anyway
df = dd.read_csv("file.csv", skiprows=range(18)) # this works only after fix

  • Tests added / passed
  • Passes flake8 dask
add support in read_csv(skiprows != list, int)
- The dask.dataframe.read_csv function now mimics more the behavior of
the pandas read_csv function
- The pandas read_csv function allows the
skiprows parameter to be of type range or set
- Also case when not assigning skiprows is now explicit

Test it out for yourself:
import dask.dataframe as dd
import pandas as pd
df = pd.read_csv("file.csv", skiprows=range(18)) # this works anyway
df = dd.read_csv("file.csv", skiprows=range(18)) # this works only after fix
@mrocklin

This comment has been minimized.

Copy link
Member

commented Mar 6, 2019

Thanks @JulianWgs . Would you be willing to add a small test to https://github.com/dask/dask/blob/master/dask/dataframe/io/tests/test_csv.py ?

@JulianWgs

This comment has been minimized.

Copy link
Contributor Author

commented Mar 6, 2019

Is this OK? Fails with old version, but passes after fix.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Mar 6, 2019

Yes! That looks great. Thank you!

Update dask/dataframe/io/tests/test_csv.py test_read_csv_skiprows_range
Pass dask object directly into the camparison to pandas dataframe in test_read_csv_skiprows_range.

Co-Authored-By: JulianWgs <31596773+JulianWgs@users.noreply.github.com>
skiprows = lastskiprow = firstrow = kwargs.get('skiprows')
elif kwargs.get('skiprows') is None:
skiprows = lastskiprow = firstrow = 0
elif not callable(kwargs.get('skiprows')):

This comment has been minimized.

Copy link
@mrocklin

mrocklin Mar 7, 2019

Member

Sorry, one more comment/question: In what context is skiprows callable?

This comment has been minimized.

Copy link
@JulianWgs

JulianWgs Mar 7, 2019

Author Contributor

The pandas library did it that way, but here it is not necessary. I committed a new version.

change iscallable to else and rm other else
- pandas does this and I looked at their code. However for dask this should
not be necessary.
- Code now fails early
@mrocklin

This comment has been minimized.

Copy link
Member

commented Mar 7, 2019

Thanks @JulianWgs . Merging.

Also I notice that this is your first commit to the dask repository. Welcome!

@mrocklin mrocklin merged commit 8d4e165 into dask:master Mar 7, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Add support in read_csv for parameter skiprows for other iterables (d…
…ask#4560)

- The dask.dataframe.read_csv function now mimics more the behavior of
the pandas read_csv function
- The pandas read_csv function allows the
skiprows parameter to be of type range or set
- Also case when not assigning skiprows is now explicit

Test it out for yourself:

```python
import dask.dataframe as dd
import pandas as pd
df = pd.read_csv("file.csv", skiprows=range(18)) # this works anyway
df = dd.read_csv("file.csv", skiprows=range(18)) # this works only after fix
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.