
Treat container name as directory #35

Merged (14 commits) on Feb 6, 2020

Conversation

AlbertDeFusco
Contributor

To begin with, a few issues were identified in version 0.1.5:

  • ls would not reliably list files at the top level of a container (i.e., not in a subdirectory)
  • ls would not reliably list some files deeply nested in subdirectories
  • ls would only return ~3000 files from a directory known to contain 31,000 files
  • Reading an Intake catalog stored on Azure would throw errors
    • "Collision between inferred and specified storage options:\n- 'container_name'"
    • This was the motivating factor for treating containers as directories since the .walk() routine would get very confused.
    • see below for more details

Taking inspiration from s3fs, I propose treating the leading directory name as the container name. This allows a single FS object to access data across multiple containers.

During this effort I discovered that I only needed to implement .ls(). Everything else is derived from it, like .glob(), .walk(), and .info().
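For illustration, the convention amounts to splitting a path on its first separator; split_path below is a hypothetical helper name, not necessarily what the PR implements.

    # Hypothetical sketch: the first path segment names the container,
    # the remainder is the blob prefix inside that container.
    def split_path(path):
        path = path.lstrip('/')
        if '/' not in path:
            return path, ''                  # container only, e.g. 'data'
        container, blob = path.split('/', 1)
        return container, blob               # e.g. ('data', 'root/a/file.txt')

    split_path('data/root/a/file.txt')       # -> ('data', 'root/a/file.txt')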

New behavior

To instantiate, just pass storage options.

import fsspec
fs = fsspec.filesystem('abfs', is_emulated=True)

Using the blobs defined in the updated tests as an example, the following scenarios are covered by this PR:

    data = b'0123456789'
    bbs.create_container("data")

    bbs.create_blob_from_bytes("data", "top_file.txt", data)
    bbs.create_blob_from_bytes("data", "root/rfile.txt", data)
    bbs.create_blob_from_bytes("data", "root/a/file.txt", data)
    bbs.create_blob_from_bytes("data", "root/b/file.txt", data)
    bbs.create_blob_from_bytes("data", "root/c/file1.txt", data)
    bbs.create_blob_from_bytes("data", "root/c/file2.txt", data)
  • fs.ls('') or fs.ls('/') returns the list of containers accessible to the account
    • ['data/']
  • fs.ls('<container-name>') will return both subdirectories (prefixes) and files (blobs)
    • ['data/root/', 'data/top_file.txt']
    • this behavior holds at every level of subdirectories ('data/root/' is a directory)
  • fs.ls() uses the bbs.list_blobs().next_marker value (see _generate_blobs) to read all files in a subdirectory or container; see the paging sketch below
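Roughly, exhausting a listing by following next_marker looks like the sketch below. The list_blobs signature is assumed from the legacy azure-storage BlockBlobService API, and the prefix/delimiter values are illustrative.

    # Page through all blobs under 'root/' in the 'data' container,
    # following next_marker until the service stops returning one.
    blobs = []
    marker = None
    while True:
        batch = bbs.list_blobs('data', prefix='root/', delimiter='/',
                               num_results=5000, marker=marker)
        blobs.extend(batch)
        marker = batch.next_marker
        if not marker:
            break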

New methods

  • fs.rm(<path-to-file>) deletes files; pass recursive=True to remove subdirectories (see the usage sketch after this list)
  • fs.mkdir(<container>) will create a new container
  • fs.rmdir(<container>) will delete containers
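For illustration only, using the container and blobs from the test setup above plus a hypothetical new container name:

    fs.mkdir('newcontainer')             # create a container
    fs.rm('data/top_file.txt')           # delete a single blob
    fs.rm('data/root', recursive=True)   # remove a virtual directory and its contents
    fs.rmdir('newcontainer')             # delete the (now empty) container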

Intake catalog structure

Starting with the last issue: our catalog is stored at <container-name>/<sub-dir>/catalog.yml and the parquet files are stored in nested subdirectories at <container-name>/<sub-dir>/data.parquet. Here's the catalog.yml file:

sources:
  big_data:
    args:
      engine: pyarrow
      urlpath: '{{ CATALOG_DIR }}/data.parquet'
    description: Partitioned file written with PyArrow
    driver: parquet

I then read the catalog as

import intake
catalog = intake.open_catalog('abfs://<container-name>/<sub-dir>/catalog.yml',
                              storage_options=STORAGE_OPTIONS)

df = catalog.big_data.to_dask()

In version 0.1.5 the <container-name> would get stripped from the path at the level where Intake reads the catalog. Adlfs would then attempt to infer <sub-dir> as the container name and attempt to merge that with the inherited <container-name>, which led to the collision.

* the container_name is inferred as the first directory listing in the path
* removed special implementations for several member methods
* pyarrow works but fs.walk() does not
* some globs work
* no need for info, exists. they are based on .ls()
* added rm
* ls, info, glob, open_file
@martindurant
Member

Thanks for delving in here, @AlbertDeFusco ! I haven't looked at the code, but your explanation of expected behaviour sounds very reasonable.

@AlbertDeFusco
Contributor Author

I just noticed that there is a corner case I also need to consider. Since there are no real directories and blob names can contain /, it is possible to have the same name used for both a file and a virtual directory.

Two blobs:

<container-name>/a
<container-name>/a/file.txt

a acts as both a file and a directory. This PR prioritizes a as a directory in both .ls() and .open(). I'll submit an issue and another PR to work out what to do here.
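To make the collision concrete, a small sketch with a hypothetical container name:

    # Two blobs exist: 'mycontainer/a' and 'mycontainer/a/file.txt',
    # so 'a' is simultaneously a blob and a virtual directory.
    fs.ls('mycontainer/a')    # -> ['mycontainer/a/file.txt'] (directory wins in this PR)
    fs.open('mycontainer/a')  # also resolves 'a' as the directory, not the blob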

@martindurant
Member

^ that is an issue for s3fs (et al) too. The logic there is supposed to be (and I hope is!):

  • ls('container/a') returns ['container/a/file.txt'], i.e., assumes you meant a directory, whether or not the final "/" is included in the path
  • info('container/a')
    • first checks the parent directory listing, if it's in the cache, for an exact match, which would include both the file and the prefix/directory, with the file first
    • then does a direct HEAD lookup on the exact path, which would find the file
    • finally does ls, and if that returns a listing, concludes that the path is a directory

In this case, info would return the file's details. It is, of course, bad practice to have sets of keys that don't make a posix-compliant tree..., but unfortunately the S3 console emulates creating directories exactly by creating an empty key (real file, no content) with the prefix's name. I suppose they want you to be able to have an empty directory to put things into. I don't know if the other key stores have any such convention.
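A rough sketch of that resolution order; the cache and helper names here are illustrative, not the actual s3fs code.

    def info(fs, path):
        # 1. Look for an exact match in the cached parent listing; the
        #    cached entries include both the file and the prefix, file first.
        parent = path.rstrip('/').rsplit('/', 1)[0]
        for entry in fs.dircache.get(parent, []):
            if entry['name'].rstrip('/') == path.rstrip('/'):
                return entry
        # 2. Try a direct HEAD lookup on the exact key, which finds a real file.
        try:
            return fs.head_object(path)      # hypothetical helper for the HEAD call
        except FileNotFoundError:
            pass
        # 3. Fall back to ls(); a non-empty listing means the path is a directory.
        if fs.ls(path):
            return {'name': path, 'type': 'directory', 'size': 0}
        raise FileNotFoundError(path)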

@AlbertDeFusco
Contributor Author

Ah! I can replicate that. If this gets merged before I complete it, that's fine.

@hayesgb
Collaborator

hayesgb commented Jan 22, 2020

I think this would be a great addition, and my first look through the code suggests it should be good. I tried cloning the repo, reading a partitioned csv file with Dask, and writing it back to the datalake, which returned an encoding error. It came from create_blob_from_bytes, so I'm not sure it's related, but I'd like to dig into this a bit before merging it in...

@AlbertDeFusco
Contributor Author

AlbertDeFusco commented Jan 22, 2020

Here's how I read and write partitioned parquet files. Notice that on Azure there is a problem reading a pyarrow-partitioned file with fastparquet. It may be a bug with my branch. I'll keep investigating.

https://anaconda.org/defusco/partitioned
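For context, the round trip is roughly the following sketch; the paths and STORAGE_OPTIONS are placeholders, not the notebook's exact code.

    import dask.dataframe as dd

    # Write a partitioned dataset with pyarrow, then read it back.
    df = dd.read_parquet('abfs://data/root/data.parquet',
                         engine='pyarrow', storage_options=STORAGE_OPTIONS)
    df.to_parquet('abfs://data/root/data_copy.parquet',
                  engine='pyarrow', storage_options=STORAGE_OPTIONS)

    # Reading the pyarrow-partitioned output back with engine='fastparquet'
    # is where the error appears.
    df2 = dd.read_parquet('abfs://data/root/data_copy.parquet',
                          engine='fastparquet', storage_options=STORAGE_OPTIONS)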

@martindurant
Member

I assume that apparent bug does not manifest against s3 or other storage?

@AlbertDeFusco
Contributor Author

Same error on S3.

https://anaconda.org/defusco/partitioned-s3/notebook

# Name                    Version                   Build  Channel
fastparquet               0.3.2            py37h1d22016_0  

@AlbertDeFusco
Contributor Author

And same error on local filesystem

https://anaconda.org/defusco/partitioned-local

@martindurant
Member

OK, so we conclude that adlfs is fine here, but there is a bug that needs reporting to dask.

@hayesgb
Collaborator

hayesgb commented Jan 23, 2020

@AlbertDeFusco -- Appreciate all the work here. You've mentioned a few other updates to this PR. I'll hold off on merging into master until these are added.

@hayesgb hayesgb merged commit 1f08c4c into fsspec:master Feb 6, 2020
cjalmeida pushed a commit to cjalmeida/adlfs that referenced this pull request Feb 7, 2020
hayesgb added a commit that referenced this pull request Feb 8, 2020
Fix chunked upload incompatibilities with #35