AzureHttpError - unable to read file from blob #48

Closed
raybellwaves opened this issue Apr 16, 2020 · 14 comments

@raybellwaves
Contributor

Following up from an SO question here: https://stackoverflow.com/questions/61220615/dask-read-parquet-from-azure-blob-azurehttperror/61229497#61229497

Unfortunately, I'm still getting an AzureHttpError. Has anyone here encountered this? It's been persistent for me.

@hayesgb
Collaborator

hayesgb commented Apr 16, 2020

This is the first report I've gotten of this error. I've noted mdurant's suggestion, which seems the most likely explanation. Can I assume your filepath is formatted as "abfs://{filesystem_name}/file.parquet"? Also, which Azure region are you in?

@raybellwaves
Contributor Author

I was actually using abfs://{filesystem_name}/file, but I updated my code (and the SO question) to abfs://{filesystem_name}/file.parquet. However, I still get the AzureHttpError.

I'm in East US 2.

@hayesgb
Collaborator

hayesgb commented Apr 17, 2020

For reference, I'm working in East US 2 daily without issue, so I would assume it's not an availability problem. Can you answer a few other questions?

  • What package versions are you running? (adlfs, fsspec, dask, and azure-storage-blob).
  • Are you running Dask locally or distributed? If distributed, what version.
  • Is this parquet file one that was written to abfs with Dask? If no, does a simple read-write operation with another file work, and how was the existing parquet file created? If yes, does a read-write operation to-from CSV work successfully?
  • Have you recreated the problem with a minimal working example (small example dummy dataframe)? If so can you share that example so I can try to re-create your issue?

@raybellwaves
Contributor Author

Thanks for the prompt for an MCVE.

  • What package versions are you running? (adlfs, fsspec, dask, and azure-storage-blob).

Windows 10
adlfs==0.2.0, fsspec==0.6.2, dask==2.10.1, azure-storage-blob==2.1.0

Further details below:

> conda list:

_anaconda_depends 2019.03 py37_0
_ipyw_jlab_nb_ext_conf 0.1.0 py37_0
adal 1.2.2 pypi_0 pypi
adlfs 0.2.0 pypi_0 pypi
alabaster 0.7.12 py37_0
alembic 1.4.0 py_0 conda-forge
anaconda custom py37_1
anaconda-client 1.7.2 py37_0
anaconda-navigator 1.9.7 py37_0
anaconda-project 0.8.4 py_0
appdirs 1.4.3 pypi_0 pypi
argh 0.26.2 py37_0
arrow-cpp 0.13.0 py37h49ee12d_0
asn1crypto 1.3.0 py37_0
astroid 2.3.3 py37_0
astropy 4.0 py37he774522_0
atomicwrites 1.3.0 py37_1
attrs 19.3.0 py_0
autopep8 1.4.4 py_0
azure-common 1.1.25 pypi_0 pypi
azure-core 1.2.2 pypi_0 pypi
azure-datalake-store 0.0.48 pypi_0 pypi
azure-storage-blob 2.1.0 pypi_0 pypi
azure-storage-common 2.1.0 pypi_0 pypi
babel 2.8.0 py_0
backcall 0.1.0 py37_0
backports 1.0 py_2
backports.functools_lru_cache 1.6.1 py_0
backports.os 0.1.1 py37_0
backports.shutil_get_terminal_size 1.0.0 py37_2
backports.tempfile 1.0 py_1
backports.weakref 1.0.post1 py_1
bcrypt 3.1.7 py37he774522_0
beautifulsoup4 4.8.2 py37_0
bitarray 1.2.1 py37he774522_0
bkcharts 0.2 py37_0
black 19.10b0 pypi_0 pypi
blackcellmagic 0.0.2 pypi_0 pypi
blas 1.0 mkl
bleach 3.1.0 py37_0
blosc 1.16.3 h7bd577a_0
bokeh 1.4.0 py37_0
boost-cpp 1.67.0 hfa6e2cd_4
boto 2.49.0 py37_0
bottleneck 1.3.1 py37h8c2d366_0
brotli 1.0.7 h33f27b4_0
bzip2 1.0.8 he774522_0
ca-certificates 2020.4.5.1 hecc5488_0 conda-forge
certifi 2020.4.5.1 py37hc8dfbb8_0 conda-forge
cffi 1.14.0 py37h7a1dbc1_0
chardet 3.0.4 py37_1003
click 7.0 py37_0
cloudpickle 1.3.0 py_0
clyent 1.2.2 py37_1
colorama 0.4.3 py_0
colorcet 2.0.2 py_0
comtypes 1.1.7 py37_0
conda 4.8.3 py37hc8dfbb8_1 conda-forge
conda-build 3.18.11 py37_0
conda-env 2.6.0 1
conda-package-handling 1.6.0 py37h62dcd97_0
conda-verify 3.4.2 py_1
configparser 3.7.3 py37_1 conda-forge
console_shortcut 0.1.1 3
contextlib2 0.6.0.post1 py_0
cryptography 2.8 py37h7a1dbc1_0
cudatoolkit 10.1.243 h74a9793_0
curl 7.68.0 h2a8f88b_0
cx-oracle 7.3.0 pypi_0 pypi
cycler 0.10.0 py37_0
cymem 2.0.2 py37h74a9793_0
cython 0.29.15 py37ha925a31_0
cython-blis 0.2.4 py37hfa6e2cd_1 fastai
cytoolz 0.10.1 py37he774522_0
dask 2.10.1 py_0
dask-core 2.10.1 py_0
databricks-cli 0.9.1 py_0 conda-forge
dataclasses 0.6 py_0 fastai
decorator 4.4.1 py_0
defusedxml 0.6.0 py_0
diff-match-patch 20181111 py_0
distributed 2.10.0 py_0
doc8 0.8.0 pypi_0 pypi
docker-py 4.1.0 py37_0 conda-forge
docker-pycreds 0.4.0 py_0 conda-forge
docutils 0.16 py37_0
double-conversion 3.1.5 ha925a31_1
entrypoints 0.3 py37_0
et_xmlfile 1.0.1 py37_0
fastai 1.0.60 1 fastai
fastcache 1.1.0 py37he774522_0
fastparquet 0.3.3 py37hc8d92b1_0 conda-forge
fastprogress 0.2.2 py_0 fastai
filelock 3.0.12 py_0
flake8 3.7.9 py37_0
flask 1.1.1 py_0
freetype 2.9.1 ha9979f8_1
fsspec 0.6.2 py_0
future 0.18.2 py37_0
get_terminal_size 1.0.0 h38e98db_0
gevent 1.4.0 py37he774522_0
gflags 2.2.2 ha925a31_0
gitdb2 3.0.2 py_0 conda-forge
gitpython 3.0.5 py_0 conda-forge
glob2 0.7 py_0
glog 0.4.0 h33f27b4_0
gorilla 0.3.0 py_0 conda-forge
greenlet 0.4.15 py37hfa6e2cd_0
h5py 2.10.0 py37h5e291fa_0
hdf5 1.10.4 h7ebc959_0
heapdict 1.0.1 py_0
holoviews 1.12.7 py_0
html5lib 1.0.1 py37_0
hvplot 0.5.2 py_0 conda-forge
hypothesis 5.4.1 py_0
icc_rt 2019.0.0 h0cc432a_1
icu 58.2 ha66f8fd_1
idna 2.8 py37_0
imageio 2.6.1 py37_0
imagesize 1.2.0 py_0
importlib_metadata 1.5.0 py37_0
intel-openmp 2020.0 166
intervaltree 3.0.2 py_0
ipykernel 5.1.4 py37h39e3cac_0
ipython 7.12.0 py37h5ca1d4c_0
ipython_genutils 0.2.0 py37_0
ipywidgets 7.5.1 py_0
isodate 0.6.0 pypi_0 pypi
isort 4.3.21 py37_0
itsdangerous 1.1.0 py37_0
jdcal 1.4.1 py_0
jedi 0.14.1 py37_0
jinja2 2.11.1 py_0
joblib 0.14.1 py_0
jpeg 9b hb83a4c4_2
json5 0.9.1 py_0
jsonschema 3.2.0 py37_0
jupyter 1.0.0 py37_7
jupyter_client 5.3.4 py37_0
jupyter_console 6.1.0 py_0
jupyter_core 4.6.1 py37_0
jupyterlab 1.2.6 pyhf63ae98_0
jupyterlab_server 1.0.6 py_0
keyring 21.1.0 py37_0
kiwisolver 1.1.0 py37ha925a31_0
krb5 1.17.1 hc04afaa_0
lazy-object-proxy 1.4.3 py37he774522_0
libarchive 3.3.3 h0643e63_5
libboost 1.67.0 hfd51bdf_4
libcurl 7.68.0 h2a8f88b_0
libiconv 1.15 h1df5818_7
liblief 0.9.0 ha925a31_2
libpng 1.6.37 h2a8f88b_0
libprotobuf 3.6.0 h1a1b453_0
libsodium 1.0.16 h9d3ae62_0
libspatialindex 1.9.3 h33f27b4_0
libssh2 1.8.2 h7a1dbc1_0
libtiff 4.1.0 h56a325e_0
libxml2 2.9.9 h464c3ec_0
libxslt 1.1.33 h579f668_0
llvmlite 0.31.0 py37ha925a31_0
locket 0.2.0 py37_1
lxml 4.5.0 py37h1350720_0
lz4-c 1.8.1.2 h2fa13f4_0
lzo 2.10 h6df0209_2
m2w64-gcc-libgfortran 5.3.0 6
m2w64-gcc-libs 5.3.0 7
m2w64-gcc-libs-core 5.3.0 7
m2w64-gmp 6.1.0 2
m2w64-libwinpthread-git 5.0.0.4634.697f757 2
mako 1.1.0 py_0 conda-forge
markupsafe 1.1.1 py37he774522_0
matplotlib 3.1.3 py37_0
matplotlib-base 3.1.3 py37h64f37c6_0
mccabe 0.6.1 py37_1
menuinst 1.4.16 py37he774522_0
mistune 0.8.4 py37he774522_0
mkl 2020.0 166
mkl-service 2.3.0 py37hb782905_0
mkl_fft 1.0.15 py37h14836fe_0
mkl_random 1.1.0 py37h675688f_0
mlflow 1.6.0 pypi_0 pypi
mock 4.0.1 py_0
more-itertools 8.2.0 py_0
mpmath 1.1.0 py37_0
msgpack-python 0.6.1 py37h74a9793_1
msrest 0.6.11 pypi_0 pypi
msys2-conda-epoch 20160418 1
multipledispatch 0.6.0 py37_0
murmurhash 1.0.2 py37h33f27b4_0
navigator-updater 0.2.1 py37_0
nbconvert 5.6.1 py37_0
nbformat 5.0.4 py_0
networkx 2.4 py_0
ninja 1.9.0 py37h74a9793_0
nltk 3.4.5 py37_0
nose 1.3.7 py37_2
notebook 6.0.3 py37_0
numba 0.48.0 py37h47e9c7a_0
numexpr 2.7.1 py37h25d0782_0
numpy 1.18.1 py37h93ca92e_0
numpy-base 1.18.1 py37hc3f5095_1
numpydoc 0.9.2 py_0
nvidia-ml-py3 7.352.0 py_0 fastai
oauthlib 3.1.0 pypi_0 pypi
olefile 0.46 py37_0
openpyxl 3.0.3 py_0
openssl 1.1.1f hfa6e2cd_0 conda-forge
packaging 20.1 py_0
pandas 1.0.1 py37h47e9c7a_0
pandoc 2.2.3.2 0
pandocfilters 1.4.2 py37_1
param 1.9.3 py_0
paramiko 2.6.0 py37_0
parso 0.5.2 py_0
partd 1.1.0 py_0
path 13.1.0 py37_0
path.py 12.4.0 0
pathlib2 2.3.5 py37_0
pathspec 0.7.0 pypi_0 pypi
pathtools 0.1.2 py_1
patsy 0.5.1 py37_0
pbr 5.4.4 pypi_0 pypi
pep8 1.7.1 py37_0
pexpect 4.8.0 py37_0
pickleshare 0.7.5 py37_0
pillow 7.0.0 py37hcc1f983_0
pip 20.0.2 py37_1
pkginfo 1.5.0.1 py37_0
plac 0.9.6 py37_0
pluggy 0.13.1 py37_0
ply 3.11 py37_0
powershell_shortcut 0.0.1 2
preshed 2.0.1 py37h33f27b4_0
prometheus_client 0.7.1 py_0
prometheus_flask_exporter 0.12.2 py_0 conda-forge
prompt_toolkit 3.0.3 py_0
properscoring 0.1 py_0 conda-forge
protobuf 3.6.0 py37he025d50_1 conda-forge
psutil 5.6.7 py37he774522_0
py 1.8.1 py_0
py-lief 0.9.0 py37ha925a31_2
pyarrow 0.13.0 py37ha925a31_0
pycodestyle 2.5.0 py37_0
pycosat 0.6.3 py37he774522_0
pycparser 2.19 py37_0
pycrypto 2.6.1 py37hfa6e2cd_9
pyct 0.4.6 py37_0
pycurl 7.43.0.5 py37h7a1dbc1_0
pydocstyle 4.0.1 py_0
pyflakes 2.1.1 py37_0
pygments 2.5.2 py_0
pyjwt 1.7.1 pypi_0 pypi
pylint 2.4.4 py37_0
pynacl 1.3.0 py37h62dcd97_0
pyodbc 4.0.30 py37ha925a31_0
pyopenssl 19.1.0 py37_0
pyparsing 2.4.6 py_0
pypiwin32 223 pypi_0 pypi
pyqt 5.9.2 py37h6538335_2
pyreadline 2.1 py37_1
pyrsistent 0.15.7 py37he774522_0
pysocks 1.7.1 py37_0
pytables 3.6.1 py37h1da0976_0
pytest 5.3.5 py37_0
pytest-arraydiff 0.3 py37h39e3cac_0
pytest-astropy 0.8.0 py_0
pytest-astropy-header 0.1.2 py_0
pytest-doctestplus 0.5.0 py_0
pytest-openfiles 0.4.0 py_0
pytest-remotedata 0.3.2 py37_0
python 3.7.6 h60c2a47_2
python-dateutil 2.8.1 py_0
python-editor 1.0.4 py_0 conda-forge
python-jsonrpc-server 0.3.4 py_0
python-language-server 0.31.7 py37_0
python-libarchive-c 2.8 py37_13
python-snappy 0.5.4 py37hd25c944_1 conda-forge
python_abi 3.7 1_cp37m conda-forge
pytorch 1.4.0 py3.7_cuda101_cudnn7_0 pytorch
pytz 2019.3 py_0
pyviz_comms 0.7.3 py_0
pywavelets 1.1.1 py37he774522_0
pywin32 227 py37he774522_1
pywin32-ctypes 0.2.0 py37_1000
pywinpty 0.5.7 py37_0
pyyaml 5.3 py37he774522_0
pyzmq 18.1.1 py37ha925a31_0
qdarkstyle 2.8 py_0
qt 5.9.7 vc14h73c81de_0
qtawesome 0.6.1 py_0
qtconsole 4.6.0 py_1
qtpy 1.9.0 py_0
querystring_parser 1.2.4 py_0 conda-forge
re2 2019.08.01 vc14ha925a31_0
regex 2020.1.8 pypi_0 pypi
requests 2.22.0 py37_1
requests-oauthlib 1.3.0 pypi_0 pypi
restructuredtext-lint 1.3.0 pypi_0 pypi
rope 0.16.0 py_0
rtree 0.9.3 py37h21ff451_0
ruamel_yaml 0.15.87 py37he774522_0
scikit-image 0.16.2 py37h47e9c7a_0
scikit-learn 0.22.1 py37h6288b17_0
scipy 1.4.1 py37h9439919_0
seaborn 0.10.0 py_0
send2trash 1.5.0 py37_0
setuptools 45.2.0 py37_0
simplegeneric 0.8.1 py37_2
simplejson 3.17.0 py37hfa6e2cd_0 conda-forge
singledispatch 3.4.0.3 py37_0
sip 4.19.8 py37h6538335_0
six 1.14.0 py37_0
smmap2 2.0.5 py_0 conda-forge
snappy 1.1.7 h777316e_3
snowballstemmer 2.0.0 py_0
sortedcollections 1.1.2 py37_0
sortedcontainers 2.1.0 py37_0
soupsieve 1.9.5 py37_0
spacy 2.1.8 py37he980bc4_0 fastai
sphinx 2.4.0 py_0
sphinxcontrib 1.0 py37_1
sphinxcontrib-applehelp 1.0.1 py_0
sphinxcontrib-devhelp 1.0.1 py_0
sphinxcontrib-htmlhelp 1.0.2 py_0
sphinxcontrib-jsmath 1.0.1 py_0
sphinxcontrib-qthelp 1.0.2 py_0
sphinxcontrib-serializinghtml 1.1.3 py_0
sphinxcontrib-websupport 1.2.0 py_0
spyder 4.0.1 py37_0
spyder-kernels 1.8.1 py37_0
sqlalchemy 1.3.13 py37he774522_0
sqlite 3.31.1 he774522_0
sqlparse 0.3.0 py_0 conda-forge
srsly 0.1.0 py37h6538335_0 fastai
statsmodels 0.11.0 py37he774522_0
stevedore 1.32.0 pypi_0 pypi
sympy 1.5.1 py37_0
tabulate 0.8.6 py_0 conda-forge
tbb 2020.0 h74a9793_0
tblib 1.6.0 py_0
terminado 0.8.3 py37_0
testpath 0.4.4 py_0
thinc 7.0.8 py37he980bc4_0 fastai
thrift 0.11.0 py37h6538335_1001 conda-forge
thrift-cpp 0.11.0 h1ebf3fd_3
tk 8.6.8 hfa6e2cd_0
toml 0.10.0 pypi_0 pypi
toolz 0.10.0 py_0
torchvision 0.5.0 py37_cu101 pytorch
tornado 6.0.3 py37he774522_3
tqdm 4.42.1 py_0
traitlets 4.3.3 py37_0
typed-ast 1.4.1 pypi_0 pypi
ujson 1.35 py37hfa6e2cd_0
unicodecsv 0.14.1 py37_0
urllib3 1.25.8 py37_0
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_1
waitress 1.4.3 py_0 conda-forge
wasabi 0.2.2 py_0 fastai
watchdog 0.10.2 py37_0
wcwidth 0.1.8 py_0
webencodings 0.5.1 py37_1
websocket-client 0.57.0 py37_0 conda-forge
werkzeug 1.0.0 py_0
wheel 0.34.2 py37_0
widgetsnbextension 3.5.1 py37_0
win_inet_pton 1.1.0 py37_0
win_unicode_console 0.5 py37_0
wincertstore 0.2 py37_0
winpty 0.4.3 4
wrapt 1.11.2 py37he774522_0
xarray 0.15.0 py_0 conda-forge
xlrd 1.2.0 py37_0
xlsxwriter 1.2.7 py_0
xlwings 0.17.1 py37_0
xlwt 1.3.0 py37_0
xskillscore 0.0.15 py_0 conda-forge
xz 5.2.4 h2fa13f4_4
yaml 0.1.7 hc54c509_2
yapf 0.28.0 py_0
zeromq 4.3.1 h33f27b4_3
zict 1.0.0 py_0
zipp 2.2.0 py_0
zlib 1.2.11 h62dcd97_3
zstd 1.3.7 h508b16e_0

  • Are you running Dask locally or distributed? If distributed, what version.

distributed (2.10.1) using a LocalCluster.

from dask.distributed import Client
client = Client()
  • Is this parquet file one that was written to abfs with Dask? If no, does a simple read-write operation with another file work, and how was the existing parquet file created? If yes, does a read-write operation to-from CSV work successfully?
  • Have you recreated the problem with a minimal working example (small example dummy dataframe)? If so can you share that example so I can try to re-create your issue?

Good questions. I tackle them both in the MCVE code below.

I get EmptyDataError: No columns to parse from file with the csv files and AzureHttpError: Server encountered an internal error with the parquet file.

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
client = Client()

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)

ddf = dd.from_pandas(df, npartitions=2)

STORAGE_OPTIONS={'account_name': 'ACCOUNT_NAME',
                 'account_key': 'ACCOUNT_KEY'}
# This works fine and I see the files in Microsoft Azure Storage Explorer
dd.to_csv(df=ddf,
          filename='abfs://BLOB/FILE/*.csv',
          storage_options=STORAGE_OPTIONS)

df = dd.read_csv('abfs://tmp/tmp2/*.csv', storage_options=STORAGE_OPTIONS)
---------------------------------------------------------------------------
EmptyDataError                            Traceback (most recent call last)
<ipython-input-33-4ef0af5e9369> in <module>
----> 1 df = dd.read_csv('abfs://tmp/tmp2/*.csv', storage_options=STORAGE_OPTIONS)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    576             storage_options=storage_options,
    577             include_path_column=include_path_column,
--> 578             **kwargs
    579         )
    580 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    442 
    443     # Use sample to infer dtypes and check for presence of include_path_column
--> 444     head = reader(BytesIO(b_sample), **kwargs)
    445     if include_path_column and (include_path_column in head.columns):
    446         raise ValueError(

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    446 
    447     # Create the parser.
--> 448     parser = TextFileReader(fp_or_buf, **kwds)
    449 
    450     if chunksize or iterator:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    878             self.options["has_index_names"] = kwds["has_index_names"]
    879 
--> 880         self._make_engine(self.engine)
    881 
    882     def close(self):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
   1112     def _make_engine(self, engine="c"):
   1113         if engine == "c":
-> 1114             self._engine = CParserWrapper(self.f, **self.options)
   1115         else:
   1116             if engine == "python":

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1889         kwds["usecols"] = self.usecols
   1890 
-> 1891         self._reader = parsers.TextReader(src, **kwds)
   1892         self.unnamed_cols = self._reader.unnamed_cols
   1893 

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

EmptyDataError: No columns to parse from file


# This works and I see it in Microsoft Azure Storage Explorer
dd.to_parquet(df=df,
              path='abfs://BLOB/FILE.parquet',
              storage_options=STORAGE_OPTIONS)

df = dd.read_parquet('abfs://tmp/tmp.parquet',
                     storage_options=STORAGE_OPTIONS)
ERROR:azure.storage.common.storageclient:Client-Request-ID=fe8a8c36-8120-11ea-a33c-a0afbd853445 Retry policy did not allow for a retry: Server-Timestamp=Sat, 18 Apr 2020 03:03:08 GMT, Server-Request-ID=a5160140-d01e-006b-642d-1518c8000000, HTTP status code=500, Exception=Server encountered an internal error. Please try again after some time. ErrorCode: InternalError<?xml version="1.0" encoding="utf-8"?><Error><Code>InternalError</Code><Message>Server encountered an internal error. Please try again after some time.RequestId:a5160140-d01e-006b-642d-1518c8000000Time:2020-04-18T03:03:09.2047334Z</Message></Error>.
AzureHttpError                            Traceback (most recent call last)
<ipython-input-35-0b3e24138208> in <module>
      1 df = dd.read_parquet('abfs://tmp/tmp.parquet',
----> 2                      storage_options=STORAGE_OPTIONS)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\parquet\core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, chunksize, **kwargs)
    231         filters=filters,
    232         split_row_groups=split_row_groups,
--> 233         **kwargs
    234     )
    235     if meta.index.name is not None:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py in read_metadata(fs, paths, categories, index, gather_statistics, filters, **kwargs)
    176         # correspond to a row group (populated below).
    177         parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
--> 178             fs, paths, gather_statistics, **kwargs
    179         )
    180 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py in _determine_pf_parts(fs, paths, gather_statistics, **kwargs)
    127                 open_with=fs.open,
    128                 sep=fs.sep,
--> 129                 **kwargs.get("file", {})
    130             )
    131             if gather_statistics is None:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fastparquet\api.py in __init__(self, fn, verify, open_with, root, sep)
    109                 fn2 = join_path(fn, '_metadata')
    110                 self.fn = fn2
--> 111                 with open_with(fn2, 'rb') as f:
    112                     self._parse_header(f, verify)
    113                 fn = fn2

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fsspec\spec.py in open(self, path, mode, block_size, cache_options, **kwargs)
    722                 autocommit=ac,
    723                 cache_options=cache_options,
--> 724                 **kwargs
    725             )
    726             if not ac:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in _open(self, path, mode, block_size, autocommit, cache_options, **kwargs)
    552             autocommit=autocommit,
    553             cache_options=cache_options,
--> 554             **kwargs,
    555         )
    556 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in __init__(self, fs, path, mode, block_size, autocommit, cache_type, cache_options, **kwargs)
    582             cache_type=cache_type,
    583             cache_options=cache_options,
--> 584             **kwargs,
    585         )
    586 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fsspec\spec.py in __init__(self, fs, path, mode, block_size, autocommit, cache_type, cache_options, **kwargs)
    954         if mode == "rb":
    955             if not hasattr(self, "details"):
--> 956                 self.details = fs.info(path)
    957             self.size = self.details["size"]
    958             self.cache = caches[cache_type](

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fsspec\spec.py in info(self, path, **kwargs)
    499         if out:
    500             return out[0]
--> 501         out = self.ls(path, detail=True, **kwargs)
    502         path = path.rstrip("/")
    503         out1 = [o for o in out if o["name"].rstrip("/") == path]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in ls(self, path, detail, invalidate_cache, delimiter, **kwargs)
    446             # then return the contents
    447             elif self._matches(
--> 448                 container_name, path, as_directory=True, delimiter=delimiter
    449             ):
    450                 logging.debug(f"{path} appears to be a directory")

~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in _matches(self, container_name, path, as_directory, delimiter)
    386             prefix=path,
    387             delimiter=delimiter,
--> 388             num_results=None,
    389         )
    390 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\blob\baseblobservice.py in list_blob_names(self, container_name, prefix, num_results, include, delimiter, marker, timeout)
   1360                   '_context': operation_context,
   1361                   '_converter': _convert_xml_to_blob_name_list}
-> 1362         resp = self._list_blobs(*args, **kwargs)
   1363 
   1364         return ListGenerator(resp, self._list_blobs, args, kwargs)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\blob\baseblobservice.py in _list_blobs(self, container_name, prefix, marker, max_results, include, delimiter, timeout, _context, _converter)
   1435         }
   1436 
-> 1437         return self._perform_request(request, _converter, operation_context=_context)
   1438 
   1439     def get_blob_account_information(self, container_name=None, blob_name=None, timeout=None):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    444                                  status_code,
    445                                  exception_str_in_one_line)
--> 446                     raise ex
    447             finally:
    448                 # If this is a location locked operation and the location is not set,

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    372                 except AzureException as ex:
    373                     retry_context.exception = ex
--> 374                     raise ex
    375                 except Exception as ex:
    376                     retry_context.exception = ex

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    358                         # and raised as an azure http exception
    359                         _http_error_handler(
--> 360                             HTTPError(response.status, response.message, response.headers, response.body))
    361 
    362                     # Parse the response

~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\_error.py in _http_error_handler(http_error)
    113     ex.error_code = error_code
    114 
--> 115     raise ex
    116 
    117 

AzureHttpError: Server encountered an internal error. Please try again after some time. ErrorCode: InternalError
<?xml version="1.0" encoding="utf-8"?><Error><Code>InternalError</Code><Message>Server encountered an internal error. Please try again after some time.
RequestId:a5160140-d01e-006b-642d-1518c8000000
Time:2020-04-18T03:03:09.2047334Z</Message></Error>

@hayesgb
Collaborator

hayesgb commented Apr 18, 2020

I've just attempted to reproduce your example, but it worked on my end. Below is my code and results:

import pandas as pd
import dask.dataframe as dd
from distributed import Client
client = Client()

storage_options = <DEFINED>

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)

ddf = dd.from_pandas(df, npartitions=2)
dd.to_csv(df=ddf,
          filename='abfs://<container>/test_csvfile/*.csv',
          storage_options=storage_options)
df2 = dd.read_csv("abfs://datascience-dev/test_csvfile/*.csv", storage_options=storage_options)
df2.head() <returns successfully in Jupyter Notebook>

dd.to_parquet(ddf,
          'abfs://datascience-dev/testfile.parquet',
          storage_options=storage_options)

df3 = dd.read_parquet("abfs://datascience-dev/testfile.parquet",
                     storage_options=storage_options)
df3.head() <returns successfully in Jupyter Notebook>

This was run on Linux with Anaconda Python (Python v3.6.7). I confirmed it works on my Windows 10 machine as well.

Versions: adlfs, fsspec, azure-storage-blob==2.1.0, azure-common==1.1.24, and azure-datalake-store==0.0.48. I see that you have azure-core installed, which I do not have and which is not a dependency; you may want to try removing it. Looking through other packages that are logical suspects, I also have requests 2.23 rather than 2.22.

I will investigate further later today.
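
To make that comparison concrete, here is a small version-check sketch (the package list is illustrative; importlib.metadata is standard library on Python 3.8+, with the importlib_metadata backport available for 3.7):

from importlib.metadata import PackageNotFoundError, version

# Packages worth comparing between the working and failing environments.
for pkg in ("adlfs", "fsspec", "dask", "distributed", "azure-storage-blob",
            "azure-storage-common", "azure-common", "azure-datalake-store",
            "azure-core", "requests"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")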

@raybellwaves
Contributor Author

Thanks a lot for running that. Regarding the packages, I'll try a new env.

@raybellwaves
Contributor Author

Here's a new environment. Slightly different error message, but it looks like the same thing: file(s) not found.

Create new env:

> conda create -n adlfs python=3.8
> conda activate adlfs
> pip install adlfs
> conda install -c conda-forge dask fastparquet ipython

Check packages:

> conda list:

adal 1.2.2 pypi_0 pypi
adlfs 0.2.0 pypi_0 pypi
azure-common 1.1.25 pypi_0 pypi
azure-datalake-store 0.0.48 pypi_0 pypi
azure-storage-blob 2.1.0 pypi_0 pypi
azure-storage-common 2.1.0 pypi_0 pypi
backcall 0.1.0 py_0 conda-forge
bokeh 2.0.1 py38h32f6830_0 conda-forge
ca-certificates 2020.4.5.1 hecc5488_0 conda-forge
certifi 2020.4.5.1 py38h32f6830_0 conda-forge
cffi 1.14.0 pypi_0 pypi
chardet 3.0.4 pypi_0 pypi
click 7.1.1 pyh8c360ce_0 conda-forge
cloudpickle 1.3.0 py_0 conda-forge
colorama 0.4.3 py_0 conda-forge
cryptography 2.9 pypi_0 pypi
cytoolz 0.10.1 py38hfa6e2cd_0 conda-forge
dask 2.14.0 py_0 conda-forge
dask-core 2.14.0 py_0 conda-forge
decorator 4.4.2 py_0 conda-forge
distributed 2.14.0 py38h32f6830_0 conda-forge
fastparquet 0.3.3 py38hc8d92b1_0 conda-forge
freetype 2.10.1 ha9979f8_0 conda-forge
fsspec 0.7.2 py_0 conda-forge
heapdict 1.0.1 py_0 conda-forge
idna 2.9 pypi_0 pypi
intel-openmp 2020.0 166
ipython 7.13.0 py38h32f6830_2 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.17.0 py38h32f6830_0 conda-forge
jinja2 2.11.2 pyh9f0ad1d_0 conda-forge
jpeg 9c hfa6e2cd_1001 conda-forge
libblas 3.8.0 15_mkl conda-forge
libcblas 3.8.0 15_mkl conda-forge
liblapack 3.8.0 15_mkl conda-forge
libpng 1.6.37 hfe6a214_1 conda-forge
libtiff 4.1.0 h885aae3_6 conda-forge
llvmlite 0.31.0 py38h32f6830_1 conda-forge
locket 0.2.0 py_2 conda-forge
lz4-c 1.9.2 h33f27b4_0 conda-forge
markupsafe 1.1.1 py38h9de7a3e_1 conda-forge
mkl 2020.0 166
msgpack-python 1.0.0 py38heaebd3c_1 conda-forge
numba 0.48.0 py38he350917_0 conda-forge
numpy 1.18.1 py38ha749109_1 conda-forge
olefile 0.46 py_0 conda-forge
openssl 1.1.1f hfa6e2cd_0 conda-forge
packaging 20.1 py_0 conda-forge
pandas 1.0.3 py38he6e81aa_1 conda-forge
parso 0.7.0 pyh9f0ad1d_0 conda-forge
partd 1.1.0 py_0 conda-forge
pickleshare 0.7.5 py38h32f6830_1001 conda-forge
pillow 7.1.1 py38h8103267_0 conda-forge
pip 20.0.2 py38_1
prompt-toolkit 3.0.5 py_0 conda-forge
psutil 5.7.0 py38h9de7a3e_1 conda-forge
pycparser 2.20 pypi_0 pypi
pygments 2.6.1 py_0 conda-forge
pyjwt 1.7.1 pypi_0 pypi
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
python 3.8.2 h5fd99cc_11
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.8 1_cp38 conda-forge
pytz 2019.3 py_0 conda-forge
pyyaml 5.3.1 py38h9de7a3e_0 conda-forge
requests 2.23.0 pypi_0 pypi
setuptools 46.1.3 py38_0
six 1.14.0 py_1 conda-forge
sortedcontainers 2.1.0 py_0 conda-forge
sqlite 3.31.1 he774522_0
tblib 1.6.0 py_0 conda-forge
thrift 0.11.0 py38h6538335_1001 conda-forge
tk 8.6.10 hfa6e2cd_0 conda-forge
toolz 0.10.0 py_0 conda-forge
tornado 6.0.4 py38hfa6e2cd_0 conda-forge
traitlets 4.3.3 py38h32f6830_1 conda-forge
typing_extensions 3.7.4.1 py38h32f6830_3 conda-forge
urllib3 1.25.9 pypi_0 pypi
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_1
wcwidth 0.1.9 pyh9f0ad1d_0 conda-forge
wheel 0.34.2 py38_0
wincertstore 0.2 py38_0
xz 5.2.5 h2fa13f4_0 conda-forge
yaml 0.2.3 he774522_0 conda-forge
zict 2.0.0 py_0 conda-forge
zlib 1.2.11 h2fa13f4_1006 conda-forge
zstd 1.4.4 h9f78265_3 conda-forge

Setup code:

import pandas as pd
import dask.dataframe as dd
from distributed import Client
client = Client()

storage_options = <DEFINED>

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)

ddf = dd.from_pandas(df, npartitions=2)

csv example:

dd.to_csv(df=ddf,
          filename='abfs://<container>/test_csvfile/*.csv',
          storage_options=storage_options)
df2 = dd.read_csv('abfs://<container>/test_csvfile/*.csv',
                  storage_options=storage_options)

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\csv.py", line 566, in read
    return read_pandas(
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\csv.py", line 398, in read_pandas
    b_out = read_bytes(
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\bytes\core.py", line 96, in read_bytes
    raise IOError("%s resolved to no files" % urlpath)
OSError: abfs://<container>/test_csvfile/*.csv resolved to no files

Print a few things using %debug:

ipdb> urlpath
'abfs://tmp/test_csvfile/*.csv'
ipdb> paths
[]
ipdb> b_lineterminator
b'\n'

parquet example:

dd.to_parquet(ddf,
             'abfs://<container>/testfile.parquet',
              storage_options=storage_options)

df3 = dd.read_parquet("abfs://<container>/testfile.parquet",
                      storage_options=storage_options)

Error message:

>>> df3 = dd.read_parquet("abfs://<container>/testfile.parquet", storage_options=storage_options)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\core.py", line 225, in read_parquet
    meta, statistics, parts = engine.read_metadata(
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py", line 202, in read_metadata
    parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py", line 147, in _determine_pf_parts
    base, fns = _analyze_paths(paths, fs)
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\utils.py", line 405, in _analyze_paths
    basepath = path_parts_list[0][:-1]
IndexError: list index out of range

Print a few things using %debug:

ipdb> path_parts_list
[]
ipdb> file_list
[]
ipdb> paths
[]
ipdb> fs
<adlfs.core.AzureBlobFileSystem object at 0x0000019872422C70>
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\core.py(225)read_parquet()
    223         index = [index]
    224
--> 225     meta, statistics, parts = engine.read_metadata(
    226         fs,
    227         paths,

ipdb> fs
<adlfs.core.AzureBlobFileSystem object at 0x0000019872422C70>
ipdb> paths
['tmp/testfile.parquet']
ipdb> gather_statistics
ipdb> 
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py(147)_determine_pf_parts()
    145         # This is a directory, check for _metadata, then _common_metadata
    146         paths = fs.glob(paths[0] + fs.sep + "*")
--> 147         base, fns = _analyze_paths(paths, fs)
    148         if "_metadata" in fns:
    149             # Using _metadata file (best-case scenario)

ipdb> paths
[]
ipdb> u
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py(202)read_metadata()
    200         # then each part will correspond to a file.  Otherwise, each part will
    201         # correspond to a row group (populated below).
--> 202         parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
    203             fs, paths, gather_statistics, **kwargs
    204         )

ipdb> paths
['tmp/testfile.parquet']
ipdb> paths[0]
'tmp/testfile.parquet'

It seems paths moves from ['tmp/testfile.parquet'] to [] at some point, I think around https://github.com/dask/dask/blob/master/dask/dataframe/io/parquet/fastparquet.py#L146
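
One way to test that hypothesis directly is to make the same calls dask makes at that line against the filesystem object itself. A minimal sketch, assuming placeholder credentials and the 'tmp' container seen in the %debug output above:

from adlfs import AzureBlobFileSystem

# Placeholder credentials; the path mirrors the one seen in %debug above.
fs = AzureBlobFileSystem(account_name='ACCOUNT_NAME', account_key='ACCOUNT_KEY')

print(fs.ls('tmp/testfile.parquet'))                   # should list the part files and _metadata
print(fs.glob('tmp/testfile.parquet' + fs.sep + '*'))  # the call dask makes; [] reproduces the failure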

I'll try pyarrow

@raybellwaves
Contributor Author

raybellwaves commented Apr 18, 2020

Create new env:

> conda create -n adlfs-pa python=3.8
> conda activate adlfs-pa
> pip install adlfs
> conda install -c conda-forge dask pyarrow ipython

Check packages:

> conda list:

abseil-cpp 20200225.1 he025d50_2 conda-forge
adal 1.2.2 pypi_0 pypi
adlfs 0.2.0 pypi_0 pypi
arrow-cpp 0.16.0 py38hd3bb158_3 conda-forge
aws-sdk-cpp 1.7.164 vc14h867dc94_1 [vc14] conda-forge
azure-common 1.1.25 pypi_0 pypi
azure-datalake-store 0.0.48 pypi_0 pypi
azure-storage-blob 2.1.0 pypi_0 pypi
azure-storage-common 2.1.0 pypi_0 pypi
backcall 0.1.0 py_0 conda-forge
bokeh 2.0.1 py38h32f6830_0 conda-forge
boost-cpp 1.72.0 h0caebb8_0 conda-forge
brotli 1.0.7 he025d50_1001 conda-forge
bzip2 1.0.8 hfa6e2cd_2 conda-forge
c-ares 1.15.0 h2fa13f4_1001 conda-forge
ca-certificates 2020.4.5.1 hecc5488_0 conda-forge
certifi 2020.4.5.1 py38h32f6830_0 conda-forge
cffi 1.14.0 pypi_0 pypi
chardet 3.0.4 pypi_0 pypi
click 7.1.1 pyh8c360ce_0 conda-forge
cloudpickle 1.3.0 py_0 conda-forge
colorama 0.4.3 py_0 conda-forge
cryptography 2.9 pypi_0 pypi
curl 7.69.1 h1dcc11c_0 conda-forge
cytoolz 0.10.1 py38hfa6e2cd_0 conda-forge
dask 2.14.0 py_0 conda-forge
dask-core 2.14.0 py_0 conda-forge
decorator 4.4.2 py_0 conda-forge
distributed 2.14.0 py38h32f6830_0 conda-forge
freetype 2.10.1 ha9979f8_0 conda-forge
fsspec 0.7.2 py_0 conda-forge
gflags 2.2.2 he025d50_1002 conda-forge
glog 0.4.0 h0174b99_3 conda-forge
grpc-cpp 1.28.1 hb1a2610_1 conda-forge
heapdict 1.0.1 py_0 conda-forge
idna 2.9 pypi_0 pypi
intel-openmp 2020.0 166
ipython 7.13.0 py38h32f6830_2 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.17.0 py38h32f6830_0 conda-forge
jinja2 2.11.2 pyh9f0ad1d_0 conda-forge
jpeg 9c hfa6e2cd_1001 conda-forge
krb5 1.17.1 hdd46e55_0 conda-forge
libblas 3.8.0 15_mkl conda-forge
libcblas 3.8.0 15_mkl conda-forge
libcurl 7.69.1 h1dcc11c_0 conda-forge
liblapack 3.8.0 15_mkl conda-forge
libpng 1.6.37 hfe6a214_1 conda-forge
libprotobuf 3.11.4 h1a1b453_0 conda-forge
libssh2 1.8.2 h642c060_2 conda-forge
libtiff 4.1.0 h885aae3_6 conda-forge
locket 0.2.0 py_2 conda-forge
lz4-c 1.9.2 h33f27b4_0 conda-forge
markupsafe 1.1.1 py38h9de7a3e_1 conda-forge
mkl 2020.0 166
msgpack-python 1.0.0 py38heaebd3c_1 conda-forge
numpy 1.18.1 py38ha749109_1 conda-forge
olefile 0.46 py_0 conda-forge
openssl 1.1.1f hfa6e2cd_0 conda-forge
packaging 20.1 py_0 conda-forge
pandas 1.0.3 py38he6e81aa_1 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.7.0 pyh9f0ad1d_0 conda-forge
partd 1.1.0 py_0 conda-forge
pickleshare 0.7.5 py38h32f6830_1001 conda-forge
pillow 7.1.1 py38h8103267_0 conda-forge
pip 20.0.2 py38_1
prompt-toolkit 3.0.5 py_0 conda-forge
psutil 5.7.0 py38h9de7a3e_1 conda-forge
pyarrow 0.16.0 py38h57df961_2 conda-forge
pycparser 2.20 pypi_0 pypi
pygments 2.6.1 py_0 conda-forge
pyjwt 1.7.1 pypi_0 pypi
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
python 3.8.2 h5fd99cc_11
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.8 1_cp38 conda-forge
pytz 2019.3 py_0 conda-forge
pyyaml 5.3.1 py38h9de7a3e_0 conda-forge
re2 2020.04.01 vc14h6538335_0 [vc14] conda-forge
requests 2.23.0 pypi_0 pypi
setuptools 46.1.3 py38_0
six 1.14.0 py_1 conda-forge
snappy 1.1.8 he025d50_1 conda-forge
sortedcontainers 2.1.0 py_0 conda-forge
sqlite 3.31.1 he774522_0
tblib 1.6.0 py_0 conda-forge
thrift-cpp 0.13.0 h1907cbf_2 conda-forge
tk 8.6.10 hfa6e2cd_0 conda-forge
toolz 0.10.0 py_0 conda-forge
tornado 6.0.4 py38hfa6e2cd_0 conda-forge
traitlets 4.3.3 py38h32f6830_1 conda-forge
typing_extensions 3.7.4.1 py38h32f6830_3 conda-forge
urllib3 1.25.9 pypi_0 pypi
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_1
wcwidth 0.1.9 pyh9f0ad1d_0 conda-forge
wheel 0.34.2 py38_0
wincertstore 0.2 py38_0
xz 5.2.5 h2fa13f4_0 conda-forge
yaml 0.2.3 he774522_0 conda-forge
zict 2.0.0 py_0 conda-forge
zlib 1.2.11 h2fa13f4_1006 conda-forge
zstd 1.4.4 h9f78265_3 conda-forge

Setup code:

import pandas as pd
import dask.dataframe as dd
from distributed import Client
client = Client()

storage_options = <DEFINED>

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)

ddf = dd.from_pandas(df, npartitions=2)

csv example:

dd.to_csv(df=ddf,
          filename='abfs://tmp/test_csvfile/*.csv',
          storage_options=storage_options)
df2 = dd.read_csv('abfs://tmp/test_csvfile/*.csv',
                  storage_options=storage_options)

Same error as above

parquet example:

dd.to_parquet(ddf,
             'abfs://tmp/testfile.parquet',
              storage_options=storage_options)

df3 = dd.read_parquet("abfs://tmp/testfile.parquet",
                      storage_options=storage_options)

Same error as above

Some output of %debug:

> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs-pa\lib\site-packages\dask\dataframe\io\parquet\utils.py(405)_analyze_paths()
    403     path_parts_list = [_join_path(fn).split("/") for fn in file_list]
    404     if root is False:
--> 405         basepath = path_parts_list[0][:-1]
    406         for i, path_parts in enumerate(path_parts_list):
    407             j = len(path_parts) - 1
ipdb> path_parts_list
[]
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs-pa\lib\site-packages\dask\dataframe\io\parquet\arrow.py(129)_determine_dataset_parts()
    127         # This is a directory, check for _metadata, then _common_metadata
    128         allpaths = fs.glob(paths[0] + fs.sep + "*")
--> 129         base, fns = _analyze_paths(allpaths, fs)
    130         if "_metadata" in fns and "validate_schema" not in dataset_kwargs:
    131             dataset_kwargs["validate_schema"] = False
ipdb> allpaths
[]
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs-pa\lib\site-packages\dask\dataframe\io\parquet\arrow.py(220)read_metadata()
    218         # then each part will correspond to a file.  Otherwise, each part will
    219         # correspond to a row group (populated below)
--> 220         parts, dataset = _determine_dataset_parts(
    221             fs, paths, gather_statistics, filters, kwargs.get("dataset", {})
    222         )
ipdb> paths
['tmp/testfile.parquet']
ipdb> parts
*** NameError: name 'parts' is not defined
ipdb> dataset
*** NameError: name 'dataset' is not defined
ipdb> fs
<adlfs.core.AzureBlobFileSystem object at 0x0000020136448D60>
ipdb> gather_statistics
ipdb> filters
ipdb>  

Using pyarrow instead of fastparquet doesn't seem to matter.

@raybellwaves
Contributor Author

Just tested reading the csv file and it worked on my Linux machine, although I got the AzureHttpError for the parquet file. I was also curious about path:

> /home/ray/local/bin/anaconda3/envs/adlfs/lib/python3.8/site-packages/fsspec/spec.py(542)info()
    540         if out:
    541             return out[0]
--> 542         out = self.ls(path, detail=True, **kwargs)
    543         path = path.rstrip("/")
    544         out1 = [o for o in out if o["name"].rstrip("/") == path]

ipdb> path                                                                                                                    
'tmp/testfile.parquet/_metadata/_metadata'
> /home/ray/local/bin/anaconda3/envs/adlfs/lib/python3.8/site-packages/adlfs/core.py(576)__init__()
    574         self.blob = blob
    575 
--> 576         super().__init__(
    577             fs=fs,
    578             path=path,

ipdb> fs                                                                                                                      
<adlfs.core.AzureBlobFileSystem object at 0x7efdfca6fe80>
ipdb> path                                                                                                                    
'tmp/testfile.parquet/_metadata/_metadata'
ipdb>     
> /home/ray/local/bin/anaconda3/envs/adlfs/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py(202)read_metadata()
    200         # then each part will correspond to a file.  Otherwise, each part will
    201         # correspond to a row group (populated below).
--> 202         parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
    203             fs, paths, gather_statistics, **kwargs
    204         )

ipdb> paths                                                                                                                   
['tmp/testfile.parquet']

@raybellwaves
Contributor Author

Reading the csv file worked fine on my Mac. Same AzureHttpError on the parquet file.

I see there are two things here:

  • Try to read the csv file on my Windows machine, as I'm able to on my Linux and Mac
  • Try to read the parquet file.

@hayesgb
Collaborator

hayesgb commented Apr 20, 2020

I've spent some time on this today. I can replicate your issue on my Windows machine, but it works as expected on Ubuntu and my Mac. I've found one compatibility issue with the 0.7.2 release of fsspec, which I will work on fixing tomorrow. Currently comparing package dependencies between Windows and Linux.

@hayesgb
Collaborator

hayesgb commented Apr 20, 2020

I just uploaded v0.2.2. Give it a shot and let me know if it solves your issue. There was an issue with parsing container names on Windows, which should now be fixed. I also found a change in fsspec v0.6.3 that causes adlfs to fail one of its unit tests; I need to verify everything is OK before allowing fsspec >= 0.6.3, so fsspec is pinned to 0.6.0 through 0.6.2.

@raybellwaves
Contributor Author

Thanks. I'll try it tomorrow.

@raybellwaves
Contributor Author

Thanks @hayesgb! I was able to read the csv file on my Windows machine.

I'm going to move the parquet file read to a separate issue.
