Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use recursive glob in LocalFileSystem #4186

Merged
merged 2 commits into from Mar 12, 2019

Conversation

Projects
None yet
3 participants
@bnaul
Copy link
Contributor

commented Nov 8, 2018

Currently using a wildcard like dd.read_csv(dir/**/*.csv) works with a remote path (at least for GCS) but not locally. That functionality is quite useful so it seems nice to have locally in case where it's possible, which it is for the std lib glob in 3.5+.

@bnaul bnaul force-pushed the bnaul:patch-1 branch from d3de5c1 to 5d7ba70 Nov 8, 2018

@martindurant

This comment has been minimized.

Copy link
Member

commented Nov 8, 2018

A generic implementation like this should always be possible - in fact, that is exactly what is used in the hdfs wrapper within Dask.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Nov 8, 2018

Thanks for the contribution @bnaul !

From my perspective this could use a test.

@martindurant I'm not sure I understand your comment. Are you recommending that we use the glob function in fsspec instead of the proposal here?

@martindurant

This comment has been minimized.

Copy link
Member

commented Nov 8, 2018

@mrocklin - no (well, maybe yes in the long term).

I suppose I should have linked here in the existing Dask code, which may do the job for Local files as-is. That would mean you get "**" working whether the system has the newer glob module or not.

@bnaul

This comment has been minimized.

Copy link
Contributor Author

commented Nov 8, 2018

I also noticed generic_glob when I was looking for the relevant code here and wasn't quite sure how it's supposed to be used; it seems like that would make the most sense as a method on some FileSystem base class rather than being called inside LocalFileSystem?

@mrocklin what exactly would a test entail? Both branches are currently tested by the existing glob tests, as I noticed when I pushed this version with a typo :) https://travis-ci.org/dask/dask/jobs/452188831

@mrocklin

This comment has been minimized.

Copy link
Member

commented Nov 8, 2018

what exactly would a test entail?

Perhaps a test where the ** globstring was important? Presumably you want this change for some reason. There is some functionality that Dask doesn't currently do that you'd like to ensure that it does. Adding a test would be a good way to ensure that Dask continues to have that functionality into the future.

@martindurant

This comment has been minimized.

Copy link
Member

commented Nov 8, 2018

It is used in the hdfs code: https://github.com/dask/dask/blob/master/dask/bytes/hdfs3.py#L27 and I assume it would be the same.

Making a base class with this kind of generic functionality is exactly what fsspec is for, and why I mentioned it, but it isn't ready yet. It seemed to me that such file-system stuff didn't really belong here in Dask.

@bnaul

This comment has been minimized.

Copy link
Contributor Author

commented Nov 9, 2018

@mrocklin are there any tests that currently use files in nested directories? I didn't see anything like that in dask.utils. My preference for this is only mild and largely aesthetic (it seems wrong that the behavior is inconsistent between local and remote files) so personally I am not very invested in making sure there's not a regression and holding out for fsspec to eventually ensure consistency between implementations.

@martindurant

This comment has been minimized.

Copy link
Member

commented Feb 7, 2019

@bnaul - sorry that this discussion has been allowed to go stale.
I suggest that you follow the use of generic_glob in hdfs and call it in the local file-system in the same way. The test would only need to check that a set of files can be found by 'open_files` (i.e., no need to apply to CSVs or actually load any files).

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 6, 2019

@bnaul , any progress on this?

@bnaul bnaul force-pushed the bnaul:patch-1 branch from 5d7ba70 to 6301ecb Mar 7, 2019

@bnaul

This comment has been minimized.

Copy link
Contributor Author

commented Mar 7, 2019

@martindurant took a stab at using generic_glob, if it looks ok to you i'll add a simple test.

curiously this actually is different from the last version in its handling of multiple levels of files; probably not important but i figured i'd mention it

(Pdb) path
'/tmp/files/**/*.csv'

(Pdb) from glob import glob
(Pdb) glob(path, recursive=True)
['/tmp/files/a.csv', '/tmp/files/b.csv', '/tmp/files/v/a.csv', '/tmp/files/q/a.csv']

(Pdb) sorted(generic_glob(self, os.path, path))
['/tmp/files/q/a.csv', '/tmp/files/v/a.csv']
@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 7, 2019

I don't think "**" is supported, it's being treated as a single "*"

@bnaul

This comment has been minimized.

Copy link
Contributor Author

commented Mar 7, 2019

I guess not; I'm not sure what the advantage of this change would be if ** is not supported, were you mistaken about it being supported in your previous comment or am I misunderstanding something? I guess when I have a chance I'll revert to the version I had before and add a test per @mrocklin's suggestion.

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 11, 2019

Sorry to have let this lie.
I suppose it's up to you, @bnaul - I'm OK with passing to glob.glob after all, it if provides the ** support not seen elsewhere. It would be still be nice to have this applicable in all file-systems.

@bnaul bnaul force-pushed the bnaul:patch-1 branch from 6301ecb to 6bd5672 Mar 12, 2019

@bnaul

This comment has been minimized.

Copy link
Contributor Author

commented Mar 12, 2019

thanks @martindurant; I definitely don't feel too strongly about any of this but I do think it'd be handy so I pushed the old version w/ a test. lmk if it looks ok

@bnaul bnaul force-pushed the bnaul:patch-1 branch from 6bd5672 to e7433b6 Mar 12, 2019

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 12, 2019

OK, I think it's good for now.
However, I think this ought to be cleared up/factored out in favour of fsspec as soon as Dask goes py3-only. On the other hand, since it would have to be me that made the necessary changes, we'll see!

@martindurant martindurant referenced this pull request Mar 12, 2019

Open

add recursive glob #36

@martindurant martindurant merged commit 4bcd0fb into dask:master Mar 12, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 12, 2019

Thanks, @bnaul

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Use recursive glob in LocalFileSystem (dask#4186)
* Use recursive glob in LocalFileSystem

* Add recursive glob test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.