Unable to read small files from GCE bucket with large number of files #2746

Closed

halfdanrump opened this issue Oct 5, 2017 · 6 comments

@halfdanrump

Hi there

I'm using dask to read files from a bucket in google cloud storage, and I've encountered some strange behaviour.

The bucket I'm reading from has around 4.5 million files. What I do is this

```python
from dask import bag as db
import gcsfs

bag = db.read_text('gs://BUCKET/FILENAME')
bag.take(1)
```

The file that I'm trying to read is around 1 MB. When I execute bag.take(1), the client starts downloading at several MB/s for a long time, but never manages to finish reading the file. When I place this single file in a separate bucket with just a few files, dask downloads and loads the file immediately.

So the problem seems to be fetching files from a bucket with a large number of files.

My uneducated guess is that dask tries to get a listing of all the files in the bucket, which takes around 15 minutes to download. It takes about the same amount of time to list the files using the google.cloud.storage library.
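
For reference, fetching a single object's metadata and contents directly doesn't need a listing at all; a minimal sketch with google.cloud.storage (BUCKET and FILENAME are placeholders):

```python
from google.cloud import storage

# One metadata request for a single object; no bucket listing involved.
client = storage.Client()
bucket = client.bucket('BUCKET')
blob = bucket.get_blob('FILENAME')   # GET for this object's metadata only
print(blob.size)                     # size in bytes
text = blob.download_as_string().decode('utf-8')
```

If this is fast on the big bucket too, that would point at the listing rather than the download itself being the bottleneck.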

I'm using Python 3.6.2 on OSX, and relevant packages are

dask==0.15.3
distributed==1.19.1
gcsfs==0.0.3

Cheers,
Halfdan

P.S. I suppose it's likely that this is an issue with gcsfs, but I thought I'd post here first to see what you think :).

@martindurant
Member

Yes, I can confirm that calling info on a file in gcsfs in turn calls ls, which fetches the listing of all the files in the bucket. This is required every time a file is opened, to get the download URL and the file size.
This is the right thing to do if you anticipate that the user will access multiple files out of a listing that is not too massive.
Unlike s3fs, gcsfs does not yet implement prefix/delimiter listing (i.e., separating out the "folders"), but it could. Alternatively, a call to info could default to getting the metadata for one object only if there is no file listing in the cache.
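
For concreteness, the two options map onto different requests against the GCS JSON API; a rough sketch (BUCKET, TOKEN and the object names are placeholders, and this is not gcsfs code):

```python
import requests

BASE = "https://www.googleapis.com/storage/v1/b/BUCKET/o"
HEADERS = {"Authorization": "Bearer TOKEN"}

# Option 1: prefix/delimiter listing, as s3fs does. Returns only the objects
# under one "folder" plus the pseudo-subdirectories, not the whole bucket.
listing = requests.get(
    BASE, params={"prefix": "some/dir/", "delimiter": "/"}, headers=HEADERS
).json()

# Option 2: metadata for a single object (object name must be URL-encoded).
# This is what an info() call could fall back to when nothing is cached.
meta = requests.get(BASE + "/some%2Fdir%2FFILENAME", headers=HEADERS).json()
size, url = meta["size"], meta["mediaLink"]
```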

@halfdanrump
Author

halfdanrump commented Oct 6, 2017

> Yes, I can confirm that calling info on a file in gcsfs in turn calls ls, which fetches the listing of all the files in the bucket. This is required every time a file is opened, to get the download URL and the file size.

Alright, thanks for confirming my suspicions :)

> This is the right thing to do if you anticipate that the user will access multiple files out of a listing that is not too massive.

Yes, I suppose my use-case is kind of unusual.

> Unlike s3fs, gcsfs does not yet implement prefix/delimiter listing (i.e., separating out the "folders"), but it could.

Alright, should I open an issue for that over there? It would definitely be a useful feature.

> Alternatively, a call to info could default to getting the metadata for one object only if there is no file listing in the cache.

You mean so that if there is no asterisk in the path then it does not retrieve the file listing?

@martindurant
Member

> should I open an issue for that over there?

The natural place for the issue is on gcsfs. Not sure which solution to follow, though. Recoding for prefix/delimiter would be more complex, although it should be very similar to the s3fs version.

> You mean so that if there is no asterisk in the path then it does not retrieve the file listing?

yes
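
Something like this sketch; the helper and attribute names are made up, not the actual gcsfs internals:

```python
def info(self, path):
    bucket, key = split_path(path)            # hypothetical helper
    if bucket in self._listing_cache:         # full listing already fetched
        return self._listing_cache[bucket][key]
    # No cached listing (and no glob to expand): one request for this object.
    return self._get_object_metadata(bucket, key)
```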

@halfdanrump
Author

> The natural place for the issue is on gcsfs. Not sure which solution to follow, though. Recoding for prefix/delimiter would be more complex, although it should be very similar to the s3fs version.

Alright, I'll open an issue over there then.

> You mean so that if there is no asterisk in the path then it does not retrieve the file listing?
>
> yes

As an added feature, it would also be useful to be able to specify a list of filenames. This could also work without having to download the entire file listing. If there's no reason why this couldn't be a feature, then I'll open a new issue for that :)

@martindurant
Member

> specify a list of filenames

There is a cost to each lookup, so this would be good for your situation, but most people probably still want the full listing in one go when accessing multiple files.

@halfdanrump
Author

@martindurant Yes, I agree that it's an unusual need. But would it be difficult to support both? If the user passes a string, use the current method; if the user passes a list, iterate over the list.

The main reason I can see not to support both is that it increases the code complexity a bit. Of course that's a valid concern; I'm just asking because I'm curious.
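
Something like this sketch, with made-up helper names, is what I have in mind:

```python
def open_files(paths):
    if isinstance(paths, str):
        # Current behaviour: list the bucket and expand any glob pattern.
        paths = expand_glob(paths)            # hypothetical helper
    # An explicit list is taken literally: one metadata request per file,
    # never a full bucket listing.
    return [open_single(p) for p in paths]    # hypothetical helper
```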
