Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support in read_csv for parameter skiprows to not only be int or list (like in pandas) #4560

Merged
merged 4 commits into from Mar 7, 2019
Merged

Conversation

@JulianWgs
Copy link
Contributor

@JulianWgs JulianWgs commented Mar 6, 2019

  • The dask.dataframe.read_csv function now mimics more the behavior of
    the pandas read_csv function
  • The pandas read_csv function allows the
    skiprows parameter to be of type range or set
  • Also case when not assigning skiprows is now explicit

Test it out for yourself:
import dask.dataframe as dd
import pandas as pd
df = pd.read_csv("file.csv", skiprows=range(18)) # this works anyway
df = dd.read_csv("file.csv", skiprows=range(18)) # this works only after fix

  • Tests added / passed
  • Passes flake8 dask
- The dask.dataframe.read_csv function now mimics more the behavior of
the pandas read_csv function
- The pandas read_csv function allows the
skiprows parameter to be of type range or set
- Also case when not assigning skiprows is now explicit

Test it out for yourself:
import dask.dataframe as dd
import pandas as pd
df = pd.read_csv("file.csv", skiprows=range(18)) # this works anyway
df = dd.read_csv("file.csv", skiprows=range(18)) # this works only after fix
@mrocklin
Copy link
Member

@mrocklin mrocklin commented Mar 6, 2019

Thanks @JulianWgs . Would you be willing to add a small test to https://github.com/dask/dask/blob/master/dask/dataframe/io/tests/test_csv.py ?

@JulianWgs
Copy link
Contributor Author

@JulianWgs JulianWgs commented Mar 6, 2019

Is this OK? Fails with old version, but passes after fix.

dask/dataframe/io/tests/test_csv.py Outdated Show resolved Hide resolved
@mrocklin
Copy link
Member

@mrocklin mrocklin commented Mar 6, 2019

Yes! That looks great. Thank you!

Pass dask object directly into the camparison to pandas dataframe in test_read_csv_skiprows_range.

Co-Authored-By: JulianWgs <31596773+JulianWgs@users.noreply.github.com>
skiprows = lastskiprow = firstrow = kwargs.get('skiprows')
elif kwargs.get('skiprows') is None:
skiprows = lastskiprow = firstrow = 0
elif not callable(kwargs.get('skiprows')):
Copy link
Member

@mrocklin mrocklin Mar 7, 2019

Sorry, one more comment/question: In what context is skiprows callable?

Copy link
Contributor Author

@JulianWgs JulianWgs Mar 7, 2019

The pandas library did it that way, but here it is not necessary. I committed a new version.

- pandas does this and I looked at their code. However for dask this should
not be necessary.
- Code now fails early
@mrocklin
Copy link
Member

@mrocklin mrocklin commented Mar 7, 2019

Thanks @JulianWgs . Merging.

Also I notice that this is your first commit to the dask repository. Welcome!

@mrocklin mrocklin merged commit 8d4e165 into dask:master Mar 7, 2019
2 checks passed
jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this issue May 14, 2019
…ask#4560)

- The dask.dataframe.read_csv function now mimics more the behavior of
the pandas read_csv function
- The pandas read_csv function allows the
skiprows parameter to be of type range or set
- Also case when not assigning skiprows is now explicit

Test it out for yourself:

```python
import dask.dataframe as dd
import pandas as pd
df = pd.read_csv("file.csv", skiprows=range(18)) # this works anyway
df = dd.read_csv("file.csv", skiprows=range(18)) # this works only after fix
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants