Unable to read small files from GCE bucket with large number of files #2746

Closed

halfdanrump opened this issue Oct 5, 2017 · 6 comments

@halfdanrump

Hi there

I'm using dask to read files from a bucket in google cloud storage, and I've encountered some strange behaviour.

The bucket I'm reading from has around 4.5 million files. What I do is this

```python
from dask import bag as db
import gcsfs

bag = db.read_text('gs://BUCKET/FILENAME')
bag.take(1)
```

The file that I'm trying to read is around 1 MB. When I execute bag.take(1), the client starts downloading at several MB/s for a long time, but never manages to finish reading the file. When I place this single file in a separate bucket with just a few files, dask downloads and loads the file immediately.

So the problem seems to be fetching files from a bucket with a large number of files.

My uneducated guess is that dask tries to get a listing of all the files in the bucket, which takes around 15 minutes to download. It takes about the same amount of time to list the files using the google.cloud.storage library.
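
For reference, fetching a single object's metadata and contents directly doesn't need a listing at all; a minimal sketch with google.cloud.storage (BUCKET and FILENAME are placeholders):

```python
from google.cloud import storage

# One metadata request for a single object; no bucket listing involved.
client = storage.Client()
bucket = client.bucket('BUCKET')
blob = bucket.get_blob('FILENAME')   # GET for this object's metadata only
print(blob.size)                     # size in bytes
text = blob.download_as_string().decode('utf-8')
```

If this is fast on the big bucket too, that would point at the listing rather than the download itself being the bottleneck.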

I'm using Python 3.6.2 on OSX, and relevant packages are

dask==0.15.3
distributed==1.19.1
gcsfs==0.0.3

Cheers,
Halfdan

P.S. I suppose it's likely that this is an issue with gcsfs, but I thought I'd post here first to see what you think :).

@martindurant
Member

Yes, I can confirm that calling info on a file in gcsfs in turn calls ls, which fetches the listing of all the files in the bucket. This is required every time a file is opened, to get the download URL and the file size.
This is the right thing to do if you anticipate that the user will access multiple files out of a listing that is not too massive.
Unlike s3fs, gcsfs does not yet implement prefix/delimiter listing (i.e., separating out the "folders"), but it could. Alternatively, a call to info could default to getting the metadata for one object only if there is no file listing in the cache.
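
For concreteness, the two options map onto different requests against the GCS JSON API; a rough sketch (BUCKET, TOKEN and the object names are placeholders, and this is not gcsfs code):

```python
import requests

BASE = "https://www.googleapis.com/storage/v1/b/BUCKET/o"
HEADERS = {"Authorization": "Bearer TOKEN"}

# Option 1: prefix/delimiter listing, as s3fs does. Returns only the objects
# under one "folder" plus the pseudo-subdirectories, not the whole bucket.
listing = requests.get(
    BASE, params={"prefix": "some/dir/", "delimiter": "/"}, headers=HEADERS
).json()

# Option 2: metadata for a single object (object name must be URL-encoded).
# This is what an info() call could fall back to when nothing is cached.
meta = requests.get(BASE + "/some%2Fdir%2FFILENAME", headers=HEADERS).json()
size, url = meta["size"], meta["mediaLink"]
```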

@halfdanrump
Author

halfdanrump commented Oct 6, 2017

> Yes, I can confirm that calling info on a file in gcsfs in turn calls ls, which fetches the listing of all the files in the bucket. This is required every time a file is opened, to get the download URL and the file size.

Alright, thanks for confirming my suspicions :)

> This is the right thing to do if you anticipate that the user will access multiple files out of a listing that is not too massive.

Yes, I suppose my use-case is kind of unusual.

> Unlike s3fs, gcsfs does not yet implement prefix/delimiter listing (i.e., separating out the "folders"), but it could.

Alright, should I open an issue for that over there? It would definitely be a useful feature.

> Alternatively, a call to info could default to getting the metadata for one object only if there is no file listing in the cache.

You mean so that if there is no asterisk in the path then it does not retrieve the file listing?

@martindurant
Member

> should I open an issue for that over there?

The natural place for the issue is on gcsfs. Not sure which solution to follow, though. Recoding for prefix/delimiter would be more complex, although it should be very similar to the s3fs version.

> You mean so that if there is no asterisk in the path then it does not retrieve the file listing?

yes
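
Something like this sketch; the helper and attribute names are made up, not the actual gcsfs internals:

```python
def info(self, path):
    bucket, key = split_path(path)            # hypothetical helper
    if bucket in self._listing_cache:         # full listing already fetched
        return self._listing_cache[bucket][key]
    # No cached listing (and no glob to expand): one request for this object.
    return self._get_object_metadata(bucket, key)
```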

@halfdanrump
Author

> The natural place for the issue is on gcsfs. Not sure which solution to follow, though. Recoding for prefix/delimiter would be more complex, although it should be very similar to the s3fs version.

Alright, I'll open an issue over there then.

> You mean so that if there is no asterisk in the path then it does not retrieve the file listing?
>
> yes

As an added feature, it would also be useful to be able to specify a list of filenames. This could also work without having to download the entire file listing. If there's no reason why this couldn't be a feature, then I'll open a new issue for that :)

@martindurant
Member

> specify a list of filenames

There is a cost to each lookup, so this would be good for your situation, but most people probably still want the full listing in one go when accessing multiple files.

@halfdanrump
Author

@martindurant Yes, I agree that it's an unusual need. But would it be difficult to support both? If the user passes a string, use the current method; if the user passes a list, iterate over the list.

The main reason I can see not to support both is that it increases the code complexity a bit. Of course that's a valid concern; I'm just asking because I'm curious.
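
Something like this sketch, with made-up helper names, is what I have in mind:

```python
def open_files(paths):
    if isinstance(paths, str):
        # Current behaviour: list the bucket and expand any glob pattern.
        paths = expand_glob(paths)            # hypothetical helper
    # An explicit list is taken literally: one metadata request per file,
    # never a full bucket listing.
    return [open_single(p) for p in paths]    # hypothetical helper
```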
