Add include_path option for dask.bag.read_text #5836
@mrocklin, would you have time to review this or recommend someone who could? |
I'm internally debating if |
@gyf304 just FYI, when you force push to the branch we don't get notifications and it makes things a bit harder to review. Can you just push additional commits? We squash on merge.
Can you include documentation / examples for what this does? That may aid in determining what is more natural. My initial reaction is that a keyword probably shouldn't change the return type of a method (from |
Sorry about the force push, was used to the workflow of gerrit of doing
Example for first approach
>>> bag, paths = dask.bag.read_text("./test/*.txt", include_path=True)
>>> paths
["/home/.../test/1.txt", "/home/.../test/2.txt"]
>>> len(paths) == bag.npartitions
True
Example for second approach
>>> bag = dask.bag.read_text("./test/*.txt", include_path=True)
>>> bag.take(1)
(("hello", "/home/.../test/1.txt"),) |
Do you have a suggestion for which is more useful? I would think that the first is relatively easily accomplished by
paths = glob.glob("*/test/*.txt")
bag = db.read_text(paths)
The second is harder to achieve without the |
I agree. This also mirrors well with the |
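To make the difference between the two approaches concrete, here is a minimal plain-Python sketch. It does not use dask and is not the implementation from this PR; `read_lines` and `read_lines_with_path` are hypothetical helper names, and the two input files are made up for illustration. The first helper mirrors globbing the paths yourself and reading only lines, the second mirrors tagging each line with its source file.

```python
import tempfile
from pathlib import Path


def read_lines(paths):
    """Stand-in for the first approach: lines only, paths kept by the caller."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            yield from f


def read_lines_with_path(paths):
    """Stand-in for the second approach: every line is paired with its file."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield (line, str(path))


with tempfile.TemporaryDirectory() as tmp:
    # Two small hypothetical input files.
    for name, text in [("1.txt", "hello\n"), ("2.txt", "world\nagain\n")]:
        Path(tmp, name).write_text(text, encoding="utf-8")

    paths = sorted(Path(tmp).glob("*.txt"))

    plain = list(read_lines(paths))             # source file is no longer known
    tagged = list(read_lines_with_path(paths))  # source file travels with the line

# File name attached to each line in the tagged output.
tagged_names = [Path(p).name for _, p in tagged]
```

Once the lines are flattened into `plain`, nothing in the data says which file a line came from, which is why the per-element form is the one that is hard to reconstruct after the fact.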
dask/bag/text.py
>>> b = read_text('myfiles.*.txt', include_path=True) # doctest: +SKIP
>>> b.pluck(1).take(1) # doctest: +SKIP
('.../myfiles.0.txt',)
I think it'd be a bit easier to understand with something like
>>> b.pluck(1)
(('first line of the first file\n', 'myfiles.0.txt'),)
Why is the ... present in your output? Are the paths relative or absolute?
Think it should be
>>> b.take(1)
(('first line of the first file\n', '.../myfiles.0.txt'),)
Paths are absolute. The ... is just ellipsis for path. Open to suggestions for alternatives to ...
Maybe something generic like /home/dask/myfiles.0.txt?
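For readers following the docstring discussion, this small sketch shows how the two calls under review relate when elements are (line, path) tuples. It is plain Python, not dask: `take` and `pluck` here are hypothetical stand-ins that only imitate the shape of `Bag.take` and `Bag.pluck`, and the path is the generic placeholder suggested in the review.

```python
# Elements shaped like the (line, path) tuples discussed in this thread.
items = [
    ("first line of the first file\n", "/home/dask/myfiles.0.txt"),
    ("second line of the first file\n", "/home/dask/myfiles.0.txt"),
]


def take(seq, n):
    """Return the first n elements as a tuple (stand-in for Bag.take)."""
    return tuple(seq[:n])


def pluck(seq, index):
    """Select one field from every element (stand-in for Bag.pluck)."""
    return [element[index] for element in seq]


first_pair = take(items, 1)            # full (line, path) pair
first_path = take(pluck(items, 1), 1)  # only the path field
```

Under these assumptions, taking directly shows both the line and the path, while plucking index 1 first keeps only the path, which matches the two output shapes quoted in the review comments.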
Thanks @gyf304! |
This is great, thanks @gyf304! |
black dask / flake8 dask