Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add include_path option for dask.bag.read_text #5836

Merged
merged 5 commits into from Feb 6, 2020

Conversation

gyf304
Copy link
Contributor

@gyf304 gyf304 commented Jan 28, 2020

  • Tests added / passed
  • Passes black dask / flake8 dask

@jakirkham
Copy link
Member

@mrocklin, would you have time to review this or recommend someone who could?

@gyf304
Copy link
Contributor Author

gyf304 commented Jan 29, 2020

I'm internally debating if include_path should return a tuple of (bag, paths) like dask.bytes does, or return a bag of (line, path) like what dask.dataframe does. Not sure which way is more useful.

@TomAugspurger
Copy link
Member

@gyf304 just FYI, when you force push to the branch we don't get notifications and it makes things a bit harder to review. Can you just push additional commits? We squash on merge.

internally debating if include_path should return a tuple of (bag, paths) like dask.bytes does, or return a bag of (line, path) like what dask.dataframe does.

Can you include documentation / examples for what this does? That may aid in determining what is more natural. My initial reaction is that a keyword probably shouldn't change the return type of a method (from Bag to Tuple), though it's hard to say.

@gyf304
Copy link
Contributor Author

gyf304 commented Feb 4, 2020

Sorry about the force push, was used to the workflow of gerrit of doing git commit --amend. Will directly commit next time.

Example for first approach

>>> bag, paths = dask.bag.read_text("./test/*.txt", include_path=True)
>>> paths
["/home/.../test/1.txt", "/home/.../test/2.txt"]
>>> len(paths) == bag.npartitions
True

Example for second approach

>>> bag = dask.bag.read_text("./test/*.txt", include_path=True)
>>> bag.take(1)
(("hello", "/home/.../test/1.txt"),)

@TomAugspurger
Copy link
Member

Do you have a suggestion for which is more useful?

I would think that the first is relatively easily accomplished by

paths = glob.glob("*/test/*.txt")
bag = db.read_text(paths)

The second is harder to achieve without the include_path keyword.

@gyf304
Copy link
Contributor Author

gyf304 commented Feb 4, 2020

I agree. This also mirrors well with the include_path_column option in dask.dataframe.read_csv
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
I already updated the patch to the second behavior after the same thought process.

dask/bag/tests/test_text.py Show resolved Hide resolved
dask/bag/tests/test_text.py Show resolved Hide resolved
dask/bag/text.py Show resolved Hide resolved
dask/bag/text.py Outdated Show resolved Hide resolved
dask/bag/text.py Outdated

>>> b = read_text('myfiles.*.txt', include_path=True) # doctest: +SKIP
>>> b.pluck(1).take(1) # doctest: +SKIP
('.../myfiles.0.txt',)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be a bit easier to understand with something like

>>> b.pluck(1)
(('first line of the first file\n', 'myfiles.0.txt'),)

Why is the ... present in your output? Are the paths relative or absolute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think it should be

>>> b.take(1)
(('first line of the first file\n', '.../myfiles.0.txt'),)

Paths are absolute. The ... is just ellipsis for path. Open to suggestions for alternatives to ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something generic like /home/dask/myfiles.0.txt?

@TomAugspurger TomAugspurger merged commit ae25001 into dask:master Feb 6, 2020
@TomAugspurger
Copy link
Member

Thanks @gyf304!

@SultanOrazbayev
Copy link
Contributor

This is great, thanks @gyf304!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants