fsspec-compatible Azure Datalake and Azure Blob Storage access

Dask interface to Azure-Datalake Gen1 and Gen2 Storage

Warning: this code is experimental and untested.

Quickstart

This package is on PyPI and can be installed using:

pip install adlfs

To use the Gen1 filesystem:

import dask.dataframe as dd
from fsspec.registry import known_implementations
known_implementations['adl'] = {'class': 'adlfs.AzureDatalakeFileSystem'}
STORAGE_OPTIONS = {'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}

dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=STORAGE_OPTIONS)

To use the Gen2 filesystem:

import dask.dataframe as dd
from fsspec.registry import known_implementations
known_implementations['abfs'] = {'class': 'adlfs.AzureBlobFileSystem'}
STORAGE_OPTIONS = {'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}

ddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=STORAGE_OPTIONS)
ddf = dd.read_parquet('abfs://{CONTAINER}/folder.parquet', storage_options=STORAGE_OPTIONS)

Details

The package includes pythonic filesystem implementations for both Azure Datalake Gen1 and Azure Datalake Gen2, which facilitate interactions between either Datalake implementation and Dask. This is done by leveraging the intake/filesystem_spec base class and the Azure Python SDKs.
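
Because the filesystems build on the fsspec base class, the same credentials can also be used outside of Dask through fsspec itself. The following is a minimal sketch, not taken from the package documentation; {CONTAINER} and {FILE} are placeholders, and ACCOUNT_NAME / ACCOUNT_KEY are the same credentials used in the Quickstart:

import fsspec
from fsspec.registry import known_implementations

known_implementations['abfs'] = {'class': 'adlfs.AzureBlobFileSystem'}

# fsspec.open forwards extra keyword arguments to the filesystem constructor
with fsspec.open('abfs://{CONTAINER}/{FILE}.csv', mode='rb',
                 account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY) as f:
    header = f.readline()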

Operations against the Gen1 Datalake currently work only with an Azure ServicePrincipal that has suitable credentials to perform operations on the resources of choice.
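
As a hedged sketch, the Gen1 filesystem class can also be constructed directly with service principal credentials. The keyword names below simply mirror the storage_options shown in the Quickstart; depending on the adlfs version, the store name may need to be supplied as well (that keyword name is not confirmed here):

from adlfs import AzureDatalakeFileSystem

fs = AzureDatalakeFileSystem(
    tenant_id=TENANT_ID,          # Azure AD tenant of the service principal
    client_id=CLIENT_ID,          # application (client) id
    client_secret=CLIENT_SECRET,  # secret issued for that application
)
print(fs.ls('{FOLDER}'))  # standard fsspec AbstractFileSystem call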

Operations against the Gen2 Datalake are implemented by leveraging multi-protocol access, using the Azure Blob Storage Python SDK. Authentication is currently implemented only via an ACCOUNT_NAME and ACCOUNT_KEY.
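
For illustration only, the Gen2/Blob filesystem class can likewise be used directly. The keyword names mirror the storage_options in the Quickstart; {CONTAINER}, {FOLDER} and data.csv are placeholders for an existing container and blob:

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)
print(fs.ls('{CONTAINER}'))  # list blobs in a container
with fs.open('{CONTAINER}/{FOLDER}/data.csv', 'rb') as f:  # read a single blob
    head = f.read(1024)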
