
FileNotFoundError #74

Closed
martindut opened this issue Jan 17, 2021 · 6 comments
Labels
binding/python Issues for the Python package storage/azure Azure Blob storage related

Comments

@martindut

Hi. I run the following code to open a Delta table on Azure Data Lake Storage Gen2:

```python
from deltalake import DeltaTable
import os

os.environ['AZURE_STORAGE_ACCOUNT'] = 'xxxxxx'
os.environ['AZURE_STORAGE_KEY'] = 'xxxxxxxxxxxxxxxxx'

dt = DeltaTable("abfss://xxxxxxxx@xxxxxxx.dfs.core.windows.net/delta/silver/rawdata/holdings/taxhld/v1.0")

dt.version()
dt.files()
dt.file_paths()
```
This all works fine and lists all the parquet files in the folders, but when I do

```python
df = dt.to_pyarrow_table()
```

I get this error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/deltalake/__init__.py", line 29, in to_pyarrow_table
    return self.to_pyarrow_dataset().to_table()
  File "/usr/local/lib/python3.8/site-packages/deltalake/__init__.py", line 26, in to_pyarrow_dataset
    return dataset(self._table.file_paths(), format="parquet")
  File "/usr/local/lib/python3.8/site-packages/pyarrow/dataset.py", line 674, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/dataset.py", line 426, in _filesystem_dataset
    fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/dataset.py", line 312, in _ensure_multiple_sources
    raise FileNotFoundError(info.path)
FileNotFoundError: abfss://xxxxx@xxxxx.dfs.core.windows.net/delta/silver/rawdata/holdings/taxhld/v1.0/company_name=xxx/source_db_name=xxx/source_fund_name=02/file_date=2020-01-01/part-00002-ab186831-cb3b-4294-8d2b-c2377e8eea52.c000.snappy.parquet
```
but the file does exist and is listed in `dt.file_paths()`.
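For anyone debugging this: pyarrow's `dataset()` resolves each path string it is given, and its built-in filesystems do not know the `abfss://` scheme, so resolution falls back to the local filesystem and the file is "not found". A minimal stdlib sketch (the function name is mine, not part of deltalake) of how such a URI decomposes, which any Azure-aware workaround needs:

```python
from urllib.parse import urlparse

def parse_abfss(uri: str):
    """Split abfss://<container>@<account>.dfs.core.windows.net/<path>
    into (account, container, path)."""
    parts = urlparse(uri)
    if parts.scheme != "abfss":
        raise ValueError(f"not an abfss URI: {uri}")
    container, host = parts.netloc.split("@", 1)
    account = host.split(".", 1)[0]  # strip ".dfs.core.windows.net"
    return account, container, parts.path.lstrip("/")

print(parse_abfss("abfss://data@myacct.dfs.core.windows.net/delta/v1.0/part-0.parquet"))
# → ('myacct', 'data', 'delta/v1.0/part-0.parquet')
```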

@houqp
Member

houqp commented Jan 17, 2021

Looks like a duplicate of #72; should be fixed by #73.

@houqp houqp added binding/python Issues for the Python package storage/azure Azure Blob storage related labels Jan 17, 2021
@houqp
Member

houqp commented Jan 18, 2021

@martindut the fix has been merged into master; please reopen if you are still experiencing this issue with the 0.2.1 release on PyPI.

@houqp houqp closed this as completed Jan 18, 2021
@martindut
Author

@houqp, now I'm getting this error if I run `df = dt.to_pyarrow_table()`:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/deltalake/__init__.py", line 38, in to_pyarrow_table
    return self.to_pyarrow_dataset().to_table()
  File "/usr/local/lib/python3.8/site-packages/deltalake/__init__.py", line 33, in to_pyarrow_dataset
    return dataset(keys, filesystem=f"{paths[0].scheme}://{paths[0].netloc}")
  File "/usr/local/lib/python3.8/site-packages/pyarrow/dataset.py", line 674, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/dataset.py", line 426, in _filesystem_dataset
    fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/dataset.py", line 298, in _ensure_multiple_sources
    filesystem, is_local = _ensure_fs(filesystem)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/dataset.py", line 232, in _ensure_fs
    filesystem, prefix = FileSystem.from_uri(fs_or_uri)
  File "pyarrow/_fs.pyx", line 347, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: abfss://xxxxxxx@xxxxxx.dfs.core.windows.net
```
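One possible workaround sketch (my own, not a deltalake API): since pyarrow's `FileSystem.from_uri` does not recognize `abfss`, hand `pyarrow.dataset.dataset` an explicit fsspec filesystem, such as adlfs' `AzureBlobFileSystem`, instead of a URI string. This assumes adlfs is installed and that `dt.file_paths()` returns full `abfss://` URIs, as the tracebacks above suggest:

```python
def open_delta_as_dataset(table_uri: str):
    """Sketch: build a pyarrow dataset over a Delta table's files using
    an explicit Azure filesystem, bypassing URI-based resolution.
    Assumes adlfs is installed; names are illustrative."""
    import os
    from urllib.parse import urlparse

    import pyarrow.dataset as ds
    from adlfs import AzureBlobFileSystem
    from deltalake import DeltaTable

    dt = DeltaTable(table_uri)
    fs = AzureBlobFileSystem(
        account_name=os.environ["AZURE_STORAGE_ACCOUNT"],
        account_key=os.environ["AZURE_STORAGE_KEY"],
    )
    # fsspec expects "container/path" strings, not full abfss:// URIs.
    paths = []
    for uri in dt.file_paths():
        p = urlparse(uri)
        container = p.netloc.split("@", 1)[0]
        paths.append(container + p.path)
    return ds.dataset(paths, filesystem=fs, format="parquet")
```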

@rtyler
Member

rtyler commented Jan 19, 2021

@martindut I think our integration tests aren't covering the use of pyarrow and the Azure storage engine. Would you mind opening a new issue with some details?

@samuel100

@houqp I think this is still an issue. I am on:

```
deltalake-0.4.8 numpy-1.20.3 pyarrow-4.0.0
```

When I run:

```python
from deltalake import DeltaTable
import os

os.environ['AZURE_STORAGE_ACCOUNT'] = 'xxx'
os.environ['AZURE_STORAGE_KEY'] = 'xx'

dt = DeltaTable('abfss://xxx@xxx.dfs.core.windows.net/delta_example/')
print(f'table version: {dt.version()}')
print(f'list of files: {dt.file_paths()}')

# convert to data.frame
df = dt.to_pyarrow_table().to_pandas()
```

The `dt.to_pyarrow_table()` call raises the following error:

```
ArrowInvalid: Unrecognized filesystem type in URI: abfss://XXX@XXX.dfs.core.windows.net
```

@houqp
Member

houqp commented May 26, 2021

@samuel100 are you able to read one of the parquet files that's causing the error with pyarrow directly? I don't have an Azure environment to test and debug this at the moment.
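A sketch of the direct-read test suggested above, assuming adlfs is installed (it is not part of deltalake or pyarrow); the function name and the path format are illustrative:

```python
def read_one_parquet(container_path: str):
    """Try reading a single parquet file directly, bypassing deltalake.

    `container_path` is in fsspec form, "container/path/to/file.parquet",
    not a full abfss:// URI.
    """
    import os

    import pyarrow.parquet as pq
    from adlfs import AzureBlobFileSystem  # assumption: adlfs installed

    fs = AzureBlobFileSystem(
        account_name=os.environ["AZURE_STORAGE_ACCOUNT"],
        account_key=os.environ["AZURE_STORAGE_KEY"],
    )
    with fs.open(container_path, "rb") as f:
        return pq.read_table(f)
```

If this call succeeds, the parquet files and credentials are fine and the problem is confined to how deltalake hands paths to pyarrow.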

4 participants