Treat container name as directory #35
Conversation
* the container_name is inferred as the first directory listing in the path
* removed special implementations for several member methods
* pyarrow works but fs.walk() does not
* some globs work
* no need for info, exists; they are based on .ls()
* added rm
* ls, info, glob, open_file
Thanks for delving in here, @AlbertDeFusco! I haven't looked at the code, but your explanation of expected behaviour sounds very reasonable.
I just noticed that there is a corner case I also need to consider. Since there are no real directories and blob names can contain `/`, consider two blobs:
^ That is an issue for s3fs (et al.) too. The logic there is supposed to be (and I hope is!):

In this case,
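The file-vs-directory disambiguation discussed above can be sketched as follows (a toy helper over a flat name list; `classify` and `blob_names` are illustrative, not actual s3fs/adlfs code):

```python
# Toy sketch: with no real directories, a name that matches a blob
# exactly is a file, while a name that prefixes other blobs
# (i.e. "name/..." exists) is also a directory. Both can be true.

def classify(path, blob_names):
    """Return the entry kinds for `path`: 'file', 'directory', or both."""
    kinds = []
    if path in blob_names:
        kinds.append("file")
    prefix = path.rstrip("/") + "/"
    if any(name.startswith(prefix) for name in blob_names):
        kinds.append("directory")
    return kinds

blobs = ["data", "data/part.0"]
print(classify("data", blobs))  # ['file', 'directory'] -- the corner case
```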
Ah! I can replicate that. If this gets merged before I complete it, that's fine.
I think this would be a great addition, and from my first look through the code it seems like it should be good. I tried cloning the repo, reading a partitioned csv file with Dask, and writing it back to the datalake, which returned an encoding error. It came from
Here's how I read and write partitioned parquet files. Notice that on Azure there is a problem reading a pyarrow-partitioned file with fastparquet. It may be a bug with my branch. I'll keep investigating.
I assume that apparent bug does not manifest against s3 or other storage?
Same error on S3. https://anaconda.org/defusco/partitioned-s3/notebook
And same error on local filesystem
OK, so we conclude that adlfs is fine here, but there is a bug that needs reporting to dask.
@AlbertDeFusco -- Appreciate all the work here. You've mentioned a few other updates to this PR. I'll hold off on merging into master until these are added.
Fix chunked upload incompatibilities with #35
To begin with, a few issues were identified in version 0.1.5:

* the error `"Collision between inferred and specified storage options:\n- 'container_name'"`
* the `.walk()` routine would get very confused.

Taking inspiration from s3fs, I propose treating the leading directory name as the container name. This allows a single FS object to access data across multiple containers.
During this effort I discovered that I only needed to implement `.ls()`. Everything else is derived from it, like `.glob()`, `.walk()`, and `.info()`.

New behavior
To instantiate, just pass storage options.
Using the blobs defined in the updated tests as an example, the following scenarios are provided in this PR:

* `fs.ls('')` or `fs.ls('/')` returns the list of containers accessible to the account name: `['data/']`
* `fs.ls('<container-name>')` will return both subdirectories (prefixes) and files (blobs): `['data/root/', 'data/top_file.txt']`
* `fs.ls()` will utilize the `bbs.list_blobs().next_marker` value (see `_generate_blobs`) to read all files in a subdirectory or container
New methods

* `fs.rm(<path-to-file>)` will delete files; use `recursive=True` to remove subdirectories
* `fs.mkdir(<container>)` will create a new container
* `fs.rmdir(<container>)` will delete containers
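The semantics of the new methods can be illustrated with an in-memory stand-in (`ToyFS` is a hypothetical sketch, not the real filesystem class):

```python
# In-memory stand-in mimicking the semantics of the new rm / mkdir /
# rmdir methods. All names here are illustrative, not adlfs code.

class ToyFS:
    def __init__(self):
        self.store = {}                  # container -> set of blob names

    def mkdir(self, container):
        self.store.setdefault(container, set())

    def rmdir(self, container):
        del self.store[container]

    def rm(self, path, recursive=False):
        container, _, key = path.partition("/")
        blobs = self.store[container]
        if recursive:                    # drop every blob under the prefix
            blobs -= {b for b in blobs if b == key or b.startswith(key + "/")}
        else:
            blobs.remove(key)

fs = ToyFS()
fs.mkdir("data")
fs.store["data"] = {"root/a.txt", "root/b.txt", "top.txt"}
fs.rm("data/top.txt")                    # delete a single file
fs.rm("data/root", recursive=True)       # remove a "subdirectory"
print(sorted(fs.store["data"]))          # []
```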
Intake catalog structure

Starting with the last issue, our catalog is stored at `<container-name>/<sub-dir>/catalog.yml` and the parquet files are stored in nested subdirectories at `<container-name>/<sub-dir>/data.parquet`. Here's the catalog.yml file:

I then read the catalog as:
In version 0.1.5 the `<container-name>` would get stripped from the path at the level where Intake reads the catalog. Adlfs would then attempt to infer `<sub-dir>` as the container name and attempt to merge that with the inherited `<container-name>`, which led to the collision.
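The collision comes down to where the leading path segment is split off; a sketch of the split-once rule this PR adopts from s3fs (`split_container` is a hypothetical helper, not adlfs code):

```python
def split_container(path):
    """Treat the first path segment as the container name and the
    remainder as the blob key (hypothetical helper for illustration)."""
    path = path.lstrip("/")
    container, _, key = path.partition("/")
    return container, key

print(split_container("/data/root/a/file.txt"))
# ('data', 'root/a/file.txt')
```

Splitting exactly once at the top level means the container is never re-inferred from a sub-path, which is what avoids merging a guessed container with an inherited one.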