Unable to read small files from GCE bucket with large number of files #2746
Comments
Yes, I can confirm that calling
Alright, thanks for confirming my suspicions :)
Yes, I suppose my use-case is kind of unusual.
Alright, should I open an issue for that over there? It would definitely be a useful feature.
You mean so that if there is no asterisk in the path then it does not retrieve the file listing?
The natural place for the issue is on gcsfs. Not sure which solution to follow, though. Recoding for prefix/delimiter would be more complex, although it should be very similar to the s3fs version.
Yes.
Alright, I'll open an issue over there then.
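For illustration, here is a minimal sketch of the two ideas discussed in the last few comments: skip the bucket listing entirely when the path contains no glob characters, and otherwise list only under the fixed prefix of the pattern. The helper name is hypothetical, and the google-cloud-storage client is used only as a stand-in; gcsfs would implement this against its own API.

```python
import fnmatch

from google.cloud import storage

GLOB_CHARS = set("*?[")


def resolve_paths(bucket_name, path):
    """Return the object names to read without listing the whole bucket.

    Hypothetical helper for illustration only; gcsfs would do this
    against its own API rather than via google.cloud.storage.
    """
    bucket = storage.Client().bucket(bucket_name)

    if not GLOB_CHARS & set(path):
        # No glob characters: treat the path as an exact object name and
        # skip the (potentially huge) bucket listing entirely.
        return [path]

    # Otherwise list only under the fixed prefix before the first glob
    # character, so a bucket with millions of objects is not enumerated
    # in full for a pattern like "data/2017/*.json".
    # (list_blobs also accepts delimiter='/' for directory-style listing.)
    first_glob = min(path.index(c) for c in GLOB_CHARS if c in path)
    names = (b.name for b in bucket.list_blobs(prefix=path[:first_glob]))
    return [n for n in names if fnmatch.fnmatch(n, path)]
```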
As an added feature, it would also be useful to be able to specify a list of filenames. This could also work without having to download the entire file listing. If there's no reason why this couldn't be a feature, then I'll open a new issue for that :)
There is a cost to each lookup, so it would be good for your situation, but most people probably still want the full listing in one call when reading multiple files.
@martindurant Yes, I agree that it's an unusual need. But would it be difficult to support both? If the user passes a string, use the current method; if the user passes a list, iterate over the list. The main reason I see not to support both is that it increases the code complexity a bit. Of course that's a valid concern, so I'm just asking because I'm curious.
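A rough sketch of the kind of dispatch being suggested. It is purely illustrative: the helper name is made up, and the filesystem object is assumed to expose a glob() method for patterns, which is not a claim about gcsfs's actual code.

```python
def expand_paths(fs, paths):
    """Accept either a glob string or an explicit list of object names.

    Hypothetical helper for illustration; 'fs' is assumed to expose a
    glob() method for patterns and direct reads for exact names.
    """
    if isinstance(paths, str):
        # A single string keeps the current behaviour: expand it as a
        # glob pattern, which needs a listing.
        return fs.glob(paths)
    # An explicit list is taken at face value, so no bucket listing is
    # required; each name costs one lookup when it is opened.
    return list(paths)
```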
Hi there
I'm using dask to read files from a bucket in google cloud storage, and I've encountered some strange behaviour.
The bucket I'm reading from has around 4.5 million files. What I do is point a dask bag at a single file in that bucket. The file that I'm trying to read is around 1MB.
When I execute bag.take(1), the client starts downloading at several MB/s for a long time, but never manages to finish reading the file. When I place this single file in a separate bucket with just a few files, dask downloads and loads the file immediately. So the problem seems to be fetching files from a bucket with a large number of files.
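For context, a minimal sketch of the kind of read described above. The bucket and file names are placeholders, and dask.bag.read_text over the gcs:// protocol (provided by gcsfs) is an assumption; the exact call from the original report is not shown here.

```python
import dask.bag as db

# Placeholder names; the real bucket holds around 4.5 million objects.
bag = db.read_text("gcs://my-huge-bucket/path/to/one-small-file.json")

# Even though a single, explicitly named ~1MB file is requested, this
# call stalls for a long time while data is being downloaded.
print(bag.take(1))
```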
My uneducated guess is that dask tries to get a listing of all the files in the directory, which takes around 15 minutes to download. It takes the same amount of time to get a list of the files using the google.cloud.storage library.
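For comparison, a minimal sketch of listing the bucket directly with google.cloud.storage; the bucket name is a placeholder.

```python
from google.cloud import storage

client = storage.Client()
# Enumerating all objects in the bucket takes on the order of 15 minutes,
# comparable to the time dask appears to spend before reading anything.
names = [blob.name for blob in client.bucket("my-huge-bucket").list_blobs()]
print(len(names))
```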
I'm using Python 3.6.2 on OSX, and relevant packages are:
Cheers,
Halfdan
PS. I suppose it's likely that this is an issue with gcsfs, but I thought I'd post here first and see what you think :).