
Treat container name as directory #35

Merged (14 commits) on Feb 6, 2020

Conversation

AlbertDeFusco
Contributor

To begin with, a few issues were identified in version 0.1.5:

  • ls would not reliably list files at the top level of a container (i.e., not in a subdirectory)
  • ls would not reliably list some files deeply nested in subdirectories
  • ls would only return ~3000 files from a directory known to contain 31,000 files
  • Reading an Intake catalog stored on Azure would throw errors
    • "Collision between inferred and specified storage options:\n- 'container_name'"
    • This was the motivating factor for treating containers as directories since the .walk() routine would get very confused.
    • see below for more details

Taking inspiration from s3fs, I propose treating the leading directory name as the container name. This allows a single FS object to access data across multiple containers.

During this effort I discovered that I only needed to implement .ls(). Everything else is derived from it, like .glob(), .walk(), and .info().
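For illustration, the convention amounts to splitting a path on its first separator; split_path below is a hypothetical helper name, not necessarily what the PR implements.

    # Hypothetical sketch: the first path segment names the container,
    # the remainder is the blob prefix inside that container.
    def split_path(path):
        path = path.lstrip('/')
        if '/' not in path:
            return path, ''                  # container only, e.g. 'data'
        container, blob = path.split('/', 1)
        return container, blob               # e.g. ('data', 'root/a/file.txt')

    split_path('data/root/a/file.txt')       # -> ('data', 'root/a/file.txt')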

New behavior

To instantiate, just pass storage options.

import fsspec
fs = fsspec.filesystem('abfs', is_emulated=True)

Using the blobs defined in the updated tests as an example, the following scenarios are covered by this PR:

    data = b'0123456789'
    bbs.create_container("data")

    bbs.create_blob_from_bytes("data", "top_file.txt", data)
    bbs.create_blob_from_bytes("data", "root/rfile.txt", data)
    bbs.create_blob_from_bytes("data", "root/a/file.txt", data)
    bbs.create_blob_from_bytes("data", "root/b/file.txt", data)
    bbs.create_blob_from_bytes("data", "root/c/file1.txt", data)
    bbs.create_blob_from_bytes("data", "root/c/file2.txt", data)
  • fs.ls('') or fs.ls('/') returns the list of containers accessible to the account
    • ['data/']
  • fs.ls('<container-name>') will return both subdirectories (prefixes) and files (blobs)
    • ['data/root/', 'data/top_file.txt']
    • this behavior holds at every level of subdirectories ('data/root/' is a directory)
  • fs.ls() uses the bbs.list_blobs().next_marker value (see _generate_blobs) to read all files in a subdirectory or container; see the paging sketch below
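Roughly, exhausting a listing by following next_marker looks like the sketch below. The list_blobs signature is assumed from the legacy azure-storage BlockBlobService API, and the prefix/delimiter values are illustrative.

    # Page through all blobs under 'root/' in the 'data' container,
    # following next_marker until the service stops returning one.
    blobs = []
    marker = None
    while True:
        batch = bbs.list_blobs('data', prefix='root/', delimiter='/',
                               num_results=5000, marker=marker)
        blobs.extend(batch)
        marker = batch.next_marker
        if not marker:
            break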

New methods

  • fs.rm(<path-to-file>) deletes files; pass recursive=True to remove subdirectories (see the usage sketch after this list)
  • fs.mkdir(<container>) will create a new container
  • fs.rmdir(<container>) will delete containers
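For illustration only, using the container and blobs from the test setup above plus a hypothetical new container name:

    fs.mkdir('newcontainer')             # create a container
    fs.rm('data/top_file.txt')           # delete a single blob
    fs.rm('data/root', recursive=True)   # remove a virtual directory and its contents
    fs.rmdir('newcontainer')             # delete the (now empty) container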

Intake catalog structure

Starting with the last issue: our catalog is stored at <container-name>/<sub-dir>/catalog.yml and the parquet files are stored in nested subdirectories at <container-name>/<sub-dir>/data.parquet. Here's the catalog.yml file:

sources:
  big_data:
    args:
      engine: pyarrow
      urlpath: '{{ CATALOG_DIR }}/data.parquet'
    description: Partitioned file written with PyArrow
    driver: parquet

I then read the catalog as

import intake
catalog = intake.open_catalog('abfs://<container-name>/<sub-dir>/catalog.yml',
                              storage_options=STORAGE_OPTIONS)

df = catalog.big_data.to_dask()

In version 0.1.5 the <container-name> would get stripped from the path at the level where Intake reads the catalog. Adlfs would then attempt to infer <sub-dir> as the container name and attempt to merge that with the inherited <container-name>, which led to the collision.

* the container_name is inferred as the first directory listing in the path
* removed special implementations for several member methods
* pyarrow works but fs.walk() does not
* some globs work
* no need for info, exists. they are based on .ls()
* added rm
* ls, info, glob, open_file
@martindurant
Member

Thanks for delving in here, @AlbertDeFusco ! I haven't looked at the code, but your explanation of expected behaviour sounds very reasonable.

@AlbertDeFusco
Contributor Author

I just noticed that there is a corner case I also need to consider. Since there are no real directories and blob names can contain /, it is possible to have the same name used for both a file and a virtual directory.

Two blobs:

<container-name>/a
<container-name>/a/file.txt

a acts as both a file and a directory. This PR prioritizes a as a directory in both .ls() and .open(). I'll submit an issue and another PR to work out what to do here.
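To make the collision concrete, a small sketch with a hypothetical container name:

    # Two blobs exist: 'mycontainer/a' and 'mycontainer/a/file.txt',
    # so 'a' is simultaneously a blob and a virtual directory.
    fs.ls('mycontainer/a')    # -> ['mycontainer/a/file.txt'] (directory wins in this PR)
    fs.open('mycontainer/a')  # also resolves 'a' as the directory, not the blob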

@martindurant
Member

^ that is an issue for s3fs (et al) too. The logic there is supposed to be (and I hope is!):

  • ls('container/a') returns ['container/a/file.txt'], i.e., assumes you meant a directory, whether or not the final "/" is included in the path
  • info('container/a')
    • first checks the parent directory listing, if it's in the cache, for an exact match, which would include both the file and the prefix/directory, with the file first
    • then does a direct HEAD lookup on the exact path, which would find the file
    • finally does ls, and if that returns a listing, concludes that the path is a directory

In this case, info would return the file's details. It is, of course, bad practice to have sets of keys that don't make a posix-compliant tree..., but unfortunately the S3 console emulates creating directories exactly by creating an empty key (real file, no content) with the prefix's name. I suppose they want you to be able to have an empty directory to put things into. I don't know if the other key stores have any such convention.
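A rough sketch of that resolution order; the cache and helper names here are illustrative, not the actual s3fs code.

    def info(fs, path):
        # 1. Look for an exact match in the cached parent listing; the
        #    cached entries include both the file and the prefix, file first.
        parent = path.rstrip('/').rsplit('/', 1)[0]
        for entry in fs.dircache.get(parent, []):
            if entry['name'].rstrip('/') == path.rstrip('/'):
                return entry
        # 2. Try a direct HEAD lookup on the exact key, which finds a real file.
        try:
            return fs.head_object(path)      # hypothetical helper for the HEAD call
        except FileNotFoundError:
            pass
        # 3. Fall back to ls(); a non-empty listing means the path is a directory.
        if fs.ls(path):
            return {'name': path, 'type': 'directory', 'size': 0}
        raise FileNotFoundError(path)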

@AlbertDeFusco
Contributor Author

Ah! I can replicate that. If this gets merged before I complete it, that's fine.

@hayesgb
Collaborator

hayesgb commented Jan 22, 2020

I think this would be a great addition, and my first look through the code suggests it should be good. I tried cloning the repo, reading a partitioned csv file with Dask, and writing it back to the datalake, which returned an encoding error. It came from create_blob_from_bytes, so I'm not sure it's related, but I'd like to dig into this a bit before merging it in...

@AlbertDeFusco
Contributor Author

AlbertDeFusco commented Jan 22, 2020

Here's how I read and write partitioned parquet files. Notice that on Azure there is a problem reading a pyarrow-partitioned file with fastparquet. It may be a bug with my branch. I'll keep investigating.

https://anaconda.org/defusco/partitioned
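For context, the round trip is roughly the following sketch; the paths and STORAGE_OPTIONS are placeholders, not the notebook's exact code.

    import dask.dataframe as dd

    # Write a partitioned dataset with pyarrow, then read it back.
    df = dd.read_parquet('abfs://data/root/data.parquet',
                         engine='pyarrow', storage_options=STORAGE_OPTIONS)
    df.to_parquet('abfs://data/root/data_copy.parquet',
                  engine='pyarrow', storage_options=STORAGE_OPTIONS)

    # Reading the pyarrow-partitioned output back with engine='fastparquet'
    # is where the error appears.
    df2 = dd.read_parquet('abfs://data/root/data_copy.parquet',
                          engine='fastparquet', storage_options=STORAGE_OPTIONS)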

@martindurant
Member

I assume that apparent bug does not manifest against s3 or other storage?

@AlbertDeFusco
Contributor Author

Same error on S3.

https://anaconda.org/defusco/partitioned-s3/notebook

# Name                    Version                   Build  Channel
fastparquet               0.3.2            py37h1d22016_0  

@AlbertDeFusco
Contributor Author

And same error on local filesystem

https://anaconda.org/defusco/partitioned-local

@martindurant
Member

OK, so we conclude that adlfs is fine here, but there is a bug that needs reporting to dask.

@hayesgb
Collaborator

hayesgb commented Jan 23, 2020

@AlbertDeFusco -- Appreciate all the work here. You've mentioned a few other updates to this PR. I'll hold off on merging into master until these are added.

@hayesgb hayesgb merged commit 1f08c4c into fsspec:master Feb 6, 2020
cjalmeida pushed a commit to cjalmeida/adlfs that referenced this pull request Feb 7, 2020
hayesgb added a commit that referenced this pull request Feb 8, 2020
Fix chunked upload incompatibilities with #35