ERROR - ... HTTP status code=404, Exception=The specified blob does not exist. ErrorCode: BlobNotFound. #20

Closed
danielsc opened this issue Nov 26, 2019 · 8 comments


@danielsc

I am running the code below and everything works just fine -- it can process the whole dataset and no parts are missing.

import dask.dataframe as dd
from fsspec.registry import known_implementations
known_implementations['abfs'] = {'class': 'adlfs.AzureBlobFileSystem'}
STORAGE_OPTIONS={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}
df = dd.read_csv(f'abfs://{CONTAINER}/nyctaxi/2015/*.csv', 
                 storage_options=STORAGE_OPTIONS,
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

Still I am getting this error message as the code is run:

ERROR - Client-Request-ID=bcfef538-1079-11ea-9010-37c2dc712507 Retry policy did not allow for a retry: Server-Timestamp=Tue, 26 Nov 2019 18:22:42 GMT, Server-Request-ID=3807c197-101e-000b-5c86-a41b61000000, HTTP status code=404, Exception=The specified blob does not exist. ErrorCode: BlobNotFound.

It seems to be inconsequential, but I would like to know if it can be avoided.
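If the message really is inconsequential, one way to keep it out of the output is to raise the level of the logger that emits it. This is a workaround sketch rather than a fix, and it assumes the message comes from the `azure.storage.common.storageclient` logger, which is the prefix shown on the same message in a traceback later in this thread.

```python
import logging

# Assumption: the BlobNotFound message is emitted by this logger
# (name taken from the "ERROR:azure.storage.common.storageclient:" prefix).
AZURE_STORAGE_LOGGER = "azure.storage.common.storageclient"

# Only CRITICAL records from this logger will be emitted from now on.
logging.getLogger(AZURE_STORAGE_LOGGER).setLevel(logging.CRITICAL)
```

Note that this hides every ERROR-level record from that logger, including ones that point at a real problem, so it is probably best limited to the code path that is known to be noisy.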

@hayesgb
Collaborator

hayesgb commented Nov 27, 2019

I've only seen this when I'm referencing a file or container that isn't present. Can you try running:

from adlfs import AzureBlobFileSystem
fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY, container_name=CONTAINER)
files = fs.glob('/nyctaxi/2015/*.csv')

This instantiates the filesystem and should return a list of all files Dask will expect to find. The most likely explanation is that one of the items being returned has a size of 0. You can also try fs.walk(filepath) and fs.info(file) to get more detailed information.
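The size check suggested above can be scripted. The following is a minimal sketch, not part of adlfs: `find_empty_entries` is a hypothetical helper name, and it assumes `fs` is an fsspec-style filesystem whose `glob` returns a list of paths and whose `info` returns a dict with a `size` key.

```python
def find_empty_entries(fs, pattern):
    """Return the glob matches whose reported size is 0 or missing.

    `fs` is assumed to expose fsspec-style `glob(pattern)` and `info(path)`.
    """
    empty = []
    for path in fs.glob(pattern):
        details = fs.info(path)  # expected keys include 'name', 'size', 'type'
        if not details.get("size"):
            empty.append(path)
    return empty
```

With the filesystem created in this comment, `find_empty_entries(fs, 'nyctaxi/2015/*.csv')` would list any zero-byte matches. An empty result would rule out the zero-size explanation.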

@danielsc
Author

the same error occurs when I just run your code above:

from adlfs import AzureBlobFileSystem
fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY, container_name=CONTAINER)
files = fs.glob('/nyctaxi/2015/*.csv')

so the glob is somewhat unhappy. If I look at the files found by glob, they are complete:

['nyctaxi/2015/yellow_tripdata_2015-01.csv',
 'nyctaxi/2015/yellow_tripdata_2015-02.csv',
 'nyctaxi/2015/yellow_tripdata_2015-03.csv',
 'nyctaxi/2015/yellow_tripdata_2015-04.csv',
 'nyctaxi/2015/yellow_tripdata_2015-05.csv',
 'nyctaxi/2015/yellow_tripdata_2015-06.csv',
 'nyctaxi/2015/yellow_tripdata_2015-07.csv',
 'nyctaxi/2015/yellow_tripdata_2015-08.csv',
 'nyctaxi/2015/yellow_tripdata_2015-09.csv',
 'nyctaxi/2015/yellow_tripdata_2015-10.csv',
 'nyctaxi/2015/yellow_tripdata_2015-11.csv',
 'nyctaxi/2015/yellow_tripdata_2015-12.csv']

This matches what I see in the storage explorer (screenshot not preserved).

@martindurant
Member

^ there appears to be a difference in the initial "/"

@danielsc
Author

danielsc commented Nov 28, 2019

I tried files = fs.glob('nyctaxi/2015/*.csv'), but that yields the same error.

It looks like it is happening in info in core.py, when testing whether the directory is a file (screenshot not preserved).

That in turn is called from find in spec.py (screenshot not preserved).

Path is 'nyctaxi/2015/' in the above case, and getting blob properties on a directory path seems to fail, since directories don't really exist as entities in Blob storage (unless created by BlobFUSE, and even then not with a trailing /?).

Since it happens after the files in the directory have been listed, I still get all my results. Somewhat consistent with this, files = fs.glob('/nyctaxi/*/*.csv') also gives me the error, and in that case the returned array is empty.
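To illustrate the point about directories not existing as blobs: a path can be classified from a listing of blob names alone, without requesting blob properties for the directory path. The sketch below is only an illustration of that idea, not the adlfs implementation or the eventual fix; `classify_path` and its arguments are hypothetical names.

```python
def classify_path(blob_names, path):
    """Classify `path` as 'file', 'directory', or None from a blob listing.

    A "directory" is any prefix that at least one blob name starts with,
    so no per-path properties request is needed.
    """
    key = path.strip("/")  # tolerate leading and trailing slashes
    if key in blob_names:
        return "file"
    prefix = key + "/" if key else ""
    if any(name.startswith(prefix) for name in blob_names):
        return "directory"
    return None


blobs = [
    "nyctaxi/2015/yellow_tripdata_2015-01.csv",
    "nyctaxi/2015/yellow_tripdata_2015-02.csv",
]
print(classify_path(blobs, "nyctaxi/2015/"))  # directory
print(classify_path(blobs, "/nyctaxi/2015/yellow_tripdata_2015-01.csv"))  # file
print(classify_path(blobs, "nyctaxi/2016"))  # None
```

The trade-off is that this needs a listing call for the prefix, which may be slower than a single properties request, but it avoids a 404 for paths that are only virtual directories.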

@hayesgb
Collaborator

hayesgb commented Nov 30, 2019

@danielsc -- I've written a few tests that (I think) replicate the source of the problem you're observing, and pushed a branch (blob_not_exist_exception). Any chance you can test that branch and give some feedback?

@AlbertDeFusco
Contributor

I get the same kind of error when reading a partitioned parquet data set. ls returns

['bike.parq/_common_metadata',
 'bike.parq/_metadata',
 'bike.parq/part.0.parquet',
 'bike.parq/part.1.parquet',
 'bike.parq/part.10.parquet',
 'bike.parq/part.2.parquet',
 'bike.parq/part.3.parquet',
 'bike.parq/part.4.parquet',
 'bike.parq/part.5.parquet',
 'bike.parq/part.6.parquet',
 'bike.parq/part.7.parquet',
 'bike.parq/part.8.parquet',
 'bike.parq/part.9.parquet']

Then, when I attempt to read the directory, it fails.

ERROR:azure.storage.common.storageclient:Client-Request-ID=31e51f18-1452-11ea-a105-3af9d3e408b5 Retry policy did not allow for a retry: Server-Timestamp=Sun, 01 Dec 2019 15:49:44 GMT, Server-Request-ID=99d864d0-4a64-42b5-b7ab-120b04143175, HTTP status code=404, Exception=The specified blob does not exist. ErrorCode: BlobNotFound.
---------------------------------------------------------------------------
AzureMissingResourceHttpError             Traceback (most recent call last)
<ipython-input-33-0f207f108c7a> in <module>
----> 1 b = dd.read_parquet('abfs://data/bike.parquet', engine='fastparquet', storage_options=STORAGE_OPTIONS)


...

~/Development/AnacondaPlatform/training-ae5-projects/RemoteDataAzure/envs/default/lib/python3.7/site-packages/azure/storage/common/_error.py in _http_error_handler(http_error)
    113     ex.error_code = error_code
    114 
--> 115     raise ex
    116 
    117 

AzureMissingResourceHttpError: The specified blob does not exist. ErrorCode: BlobNotFound

I will give your branch a run in a few days.

@AlbertDeFusco
Contributor

I've confirmed that combining the two PRs fixes the glob and dask issues.

@hayesgb
Collaborator

hayesgb commented Dec 15, 2019

Thanks for verifying @AlbertDeFusco.

@hayesgb hayesgb closed this as completed Dec 15, 2019
4 participants