Issue reading parquet using pyarrow #59
When trying to read parquet files using Dask==2.15.0 and adlfs==0.3.0, I got exceptions that I didn't have before. I boiled it down to a small example using Azurite; the complete failing example is reproduced in the thread below. Using fastparquet instead solves the problem, but last I tried, fastparquet didn't handle my filters properly, so I would prefer to be able to use pyarrow.

Comments
This exception seems to follow checks for …
Sure! I've had some issues with …

But there is data in the container: …

Previous behaviour (just verified by rolling back): …

Interestingly, both …

Previous behaviour: …

Which probably is not good 😄
Do you observe this behavior with adlfs<0.3.0? The update to 0.3 migrates to Azure storage v12 from v2.0. It would be helpful to know if it's related to this change.
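For context, a minimal sketch of the SDK change being referenced (an editorial illustration of the azure-storage-blob API shift, not adlfs code; `conn_str` is the Azurite connection string defined in the example below):

```python
# azure-storage-blob v2.x exposed a flat service object:
#   from azure.storage.blob import BlockBlobService
#   svc = BlockBlobService(connection_string=conn_str)
#   blobs = svc.list_blobs("test")
# v12 replaced it with per-resource clients:
from azure.storage.blob import BlobServiceClient

svc = BlobServiceClient.from_connection_string(conn_str)
blobs = list(svc.get_container_client("test").list_blobs())
```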
@hayesgb You're fast 😄 Just updated my reply with equivalent behaviour from v0.2.4.
LOL. I just sat down at my computer and saw this. I just added a branch "isfile_tests" with checks that verify files and directories in the top-level directory are identified properly. These pass, and can be found under test_core.py:
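(A hedged sketch of the kind of assertions described, not the actual test_core.py contents; it assumes the Azurite `STORAGE_OPTIONS` defined later in this thread:)

```python
import adlfs

# Hypothetical checks in the spirit of the "isfile_tests" branch.
fs = adlfs.AzureBlobFileSystem(**STORAGE_OPTIONS)
assert fs.isdir("test/test_group")                  # directory-like prefix
assert fs.isfile("test/test_group/part.0.parquet")  # individual blob
assert not fs.isfile("test/test_group")             # a prefix is not a file
```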
Two questions: 1) any chance you can add a failing example, and 2) can you share the versions of pyarrow, fastparquet, and fsspec you are using now vs. what was working previously?
Sure - a "pure" AzureBlobFileSystem (essentially copying your test) works fine - it seems the issue is with the way dask uses it. So a complete failing example would be (again, using Azurite):

```python
import dask.dataframe as dd
import pandas as pd
import adlfs
from azure.storage.blob import BlobServiceClient

# Azurite's well-known development-account credentials
conn_str = "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey" \
           "=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr" \
           "/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
STORAGE_OPTIONS = {"account_name": "devstoreaccount1",
                   "connection_string": conn_str}

# Create the target container up front
client: BlobServiceClient = BlobServiceClient.from_connection_string(conn_str)
container_client = client.create_container("test")

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": [2, 4, 6, 8],
        "index_key": [1, 1, 2, 2],
        "partition_key": [1, 1, 2, 2],
    }
)

# Writing succeeds...
dask_dataframe = dd.from_pandas(df, npartitions=1)
dask_dataframe.to_parquet(
    "abfs://test/test_group",
    storage_options=STORAGE_OPTIONS,
    engine="pyarrow",
)

# ...but listing the container afterwards raises
fs = adlfs.AzureBlobFileSystem(**STORAGE_OPTIONS)
fs.ls("test")
```

Versions: …
```
Traceback (most recent call last):
  File "/home/anders/.pyenv/versions/feature_env/lib/python3.8/site-packages/adlfs/core.py", line 518, in ls
    elif len(blobs) == 1 and blobs[0]["blob_type"] == "BlockBlob":
  File "/home/anders/.pyenv/versions/feature_env/lib/python3.8/site-packages/azure/storage/blob/_shared/models.py", line 191, in __getitem__
    return self.__dict__[key]
KeyError: 'blob_type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/anders/.pyenv/versions/feature_env/lib/python3.8/site-packages/adlfs/core.py", line 534, in ls
    raise FileNotFoundError(f"File {path} does not exist!!")
FileNotFoundError: File does not exist!!
```

Should this be a Dask issue instead? Seems like …
Can you tell whether files have indeed been created in the blob container?
Yes, I can do the following:

```python
>>> fs.ls("test/test_group")
['test/test_group/_common_metadata', 'test/test_group/_metadata', 'test/test_group/part.0.parquet']
```

I have also "manually" confirmed by checking with the BlobServiceClient directly.
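(A minimal sketch of that manual confirmation, assuming the `container_client` from the failing example above:)

```python
# List blobs under the written prefix directly with the v12 SDK.
names = [b.name for b in container_client.list_blobs(name_starts_with="test_group")]
print(names)  # expect the _common_metadata, _metadata and part.0.parquet entries
```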
I think the issue arises because dask is trying to do …

Digging into the adlfs/fsspec code, this is probably because …

What I don't understand is why …
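For reference, a hedged reconstruction of the read that exercises this path (the exact call was not quoted; it is assumed to mirror the write example above):

```python
import dask.dataframe as dd

# Hypothetical read mirroring the earlier write. Before reading, dask
# probes the target path (isdir/ls-style checks), which is where
# adlfs 0.3.0 raises the FileNotFoundError shown earlier.
ddf = dd.read_parquet(
    "abfs://test/test_group",
    storage_options=STORAGE_OPTIONS,
    engine="pyarrow",
)
```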
Sorry to resurrect, but I just wanted to know if this example was reproducible, or if I'm the only one having the problem?
I'm looking into it.
I believe I have a fix for this. In some instances, Azure Blob Filesystem will return an ItemPaged iterator instead of a BlobPrefix. The scenario you were seeing appears to be one of those instances, so it wasn't being picked up. I'm going to do a push to master. Any chance you can take a look and see if it fixes your issue?
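(A minimal sketch of the kind of defensive type handling described, an editorial illustration rather than the actual adlfs patch:)

```python
from azure.storage.blob import BlobPrefix

def ls_entries(container_client, prefix):
    # Items yielded by walk_blobs may be BlobPrefix (directory-like) or
    # BlobProperties (a real blob); branching on the type avoids the
    # KeyError: 'blob_type' seen in the traceback above.
    entries = []
    for item in container_client.walk_blobs(name_starts_with=prefix):
        if isinstance(item, BlobPrefix):
            entries.append({"name": item.name, "type": "directory"})
        else:
            entries.append({"name": item.name, "type": "file"})
    return entries
```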
@hayesgb That's great news! My test case is working fine now, looks like you found the issue 👍
Sounds great. I've released this in 0.3.1.