-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recursive remove do not work with newest versions #389
Comments
@PeterFogh , what is |
@daavoo - I have tried both direct paths like "container/blob" and protocal paths link "az://container/blob" - but both fail the same way. |
Hi @PeterFogh , I am unable to reproduce. Does the dir have any particular structure? Do you call Apart from the existing Output of `pip list`$ pip list
Package Version
-------------------- ---------
adal 1.2.7
adlfs 2023.1.0
aiohttp 3.8.3
aiosignal 1.3.1
async-timeout 4.0.2
attrs 22.2.0
azure-core 1.26.2
azure-datalake-store 0.0.52
azure-identity 1.12.0
azure-storage-blob 12.14.1
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 2.1.1
cryptography 39.0.0
frozenlist 1.3.3
fsspec 2023.1.0
gitdb 4.0.10
GitPython 3.1.29
idna 3.4
isodate 0.6.1
msal 1.20.0
msal-extensions 1.0.0
msrest 0.7.1
multidict 6.0.4
oauthlib 3.2.2
pip 21.1.1
portalocker 2.7.0
pycparser 2.21
PyJWT 2.6.0
python-dateutil 2.8.2
requests 2.28.2
requests-oauthlib 1.3.1
setuptools 56.0.0
six 1.16.0
smmap 5.0.0
typing-extensions 4.4.0
urllib3 1.26.14
yarl 1.8.2 `test_rm.py`from pathlib import Path
from adlfs import AzureBlobFileSystem
fs = AzureBlobFileSystem()
Path("foofile.txt").write_text("foo")
print("INITIAL LS", fs.ls("test-rm", recursive=True))
fs.mkdir("test-rm/foodir")
fs.put_file("foofile.txt", "test-rm/foodir/foofile.txt")
print("AFTER PUT FILE", fs.ls("test-rm", recursive=True))
fs.rm("test-rm/foodir", recursive=True)
print("AFTER RM foodir", fs.ls("test-rm", recursive=True))
fs.mkdir("test-rm/foo/dir")
fs.put_file("foofile.txt", "test-rm/foo/dir/foofile.txt")
print("AFTER PUT FILE", fs.ls("test-rm", recursive=True))
fs.rm("test-rm/foo/dir", recursive=True)
print("AFTER RM FOO/DIR", fs.ls("test-rm", recursive=True))
fs.mkdir("test-rm/foo/dir")
fs.put_file("foofile.txt", "test-rm/foo/dir/foofile.txt")
print("AFTER PUT FILE", fs.ls("test-rm", recursive=True))
fs.rm("test-rm/foo", recursive=True)
print("AFTER RM FOO", fs.ls("test-rm", recursive=True)) Output of `python test_rm.py`$ python test_rm.py
INITIAL LS []
AFTER PUT FILE ['test-rm/foodir']
AFTER RM foodir []
AFTER PUT FILE ['test-rm/foo']
AFTER RM FOO/DIR []
AFTER PUT FILE ['test-rm/foo']
AFTER RM FOO [] |
@daavoo - I still get the error, but I can see that we differ in fsspec version. because mine is After I forced the fsspec version to Rigth now, I'm solving a new conda environment to see if the versions are compatiable without any versions specifications.
Still solves to missmatching adlfs and fsspec verisons:
|
I think the problem might be in the Changing your file to use only $ conda list | grep -E "adlfs|fsspec|dask"
adlfs 2023.1.0 pyhd8ed1ab_0 conda-forge
dask 2023.1.0 pyhd8ed1ab_0 conda-forge
dask-core 2023.1.0 pyhd8ed1ab_0 conda-forge
fsspec 2023.1.0 pyhd8ed1ab_0 conda-forge |
Hello. I am having the exact same issue. |
Hi @igorng , how did you install |
@daavoo adlfs is installed as part of transitive dep of a package, which is installed wit pip |
Could you share As in my comment , it works for me with latest adlfs and fsspec versions, there might be some package in the dependency list that is causing you to install an older version of fsspec |
For some reason, when I do not put a constraint on the version to use, or when I set it to the latest 2023.1.0, fsspec seems to be taken from conda-forge, while adlfs comes from pypi, and it does not work
When I set version to 2022.11.0, both are from pypi, and it works.
So I downgraded to 2022.11.0 . Edit: I don't have time to investigate why fsspec 2023.1.0 is coming from conda-forge (complex env here), os since 2022.11.0 works, fine by me ;) Thank you! |
Thanks for the details @nosterlu ! |
Bumping as I'm having the same issue. calling
Versions:
Installed from pip. Also it only happens sometimes, usually when I try to remove shortly (eg, < 5min) after creating the files |
For the record, can't reproduce the bug when on |
Still have the same issue also, been working around it to remove all files manually for the time being 😄 |
I'm running into the same problem with version 2023.10.0 (adlfs/fsspec). Conda environment with Python 3.11.5 and azure-storage-blob version 12.18.3. If I repeatedly run the command, it eventually works. It seems to delete some of the nested folders/files each time I run it, but errors before it completes all of them. |
I think this problem might only effect storage accounts with hierarchical namespace enabled (ADLS gen2). I can reproduce it with basically any recursive delete on a hierarchical namespace account but not on a flat namespace account. Is everyone else having issues, also using hierarchical namespace accounts? This would also explain why it can be reproduced in any integration test because azurite does not support hierarchical namespace. The Azure error is also says "non-empty directory" but flat namespace accounts don't have directories so it would be strange to receive that error on a flat namespace account. |
I'm pretty sure #383 is the cause. The commit before this it works correctly but with this commit it fails. I have not properly understood what this PR does but from the description of this PR I think it makes sense why this is happening:
When using the azure blob client hierarchical namespace directories look a lot like blobs. If we asynchronously delete all the relevant blobs its highly likely that we attempt to delete the directory marker blob before we've finished deleting all of its contents. |
Using hierarchical here! |
I think it should be quite straightforward to fix. I'll give it a try |
Probably not the neatest solution but I think it works. Tom-Newton#1 I think ideally we would change the way it does listing so that |
Nice! I played around a little also when trying to understand how your code worked! This could maybe make it a little bit clearer? async def _identify_directory_markers(self, files):
"""
Identify the files and directory markers from the given list of files.
A directory marker is identified if another file starts with the marker's name followed by '/'.
"""
files = sorted(set(files)) # Remove duplicates and sort
directory_markers = []
blobs = []
for i, file in enumerate(files):
if i + 1 < len(files) and files[i + 1].startswith(file + "/"):
# If the next file starts with the current file's name followed by '/',
# consider it a directory marker.
directory_markers.append(file)
else:
# Otherwise, it's a regular file/blob.
blobs.append(file)
return blobs, directory_markers for example for testing if __name__ == "__main__":
def _identify_directory_markers_test(files):
"""
Identify the files and directory markers from the given list of files.
A directory marker is identified if another file starts with the marker's name
followed by '/'.
"""
files = sorted(set(files)) # Remove duplicates and sort
directory_markers = []
blobs = []
for i, file in enumerate(files):
if i + 1 < len(files) and files[i + 1].startswith(file + "/"):
# If the next file starts with the current file's name followed by '/',
# consider it a directory marker.
directory_markers.append(file)
else:
# Otherwise, it's a regular file/blob.
blobs.append(file)
return blobs, directory_markers
files = [
"ptp_parquets/transports_unmapped.parquet",
"ptp_parquets/transports.parquet",
"ptp_parquets/road_transports.parquet",
"ptp_parquets/ptp_parquets/transports_unmapped.parquet",
"ptp_parquets/ptp_parquets/transports.parquet",
"ptp_parquets/ptp_parquets/road_transports.parquet",
"ptp_parquets/ptp_parquets/prod_prc.parquet",
"ptp_parquets/ptp_parquets/packed_yesterday.parquet",
"ptp_parquets/ptp_parquets/packed_transports.parquet",
"ptp_parquets/ptp_parquets/orders.parquet",
"ptp_parquets/ptp_parquets/lines.parquet",
"ptp_parquets/ptp_parquets/flight_transports.parquet",
"ptp_parquets/ptp_parquets/dist_freight.parquet",
"ptp_parquets/ptp_parquets/container_transports.parquet",
"ptp_parquets/ptp_parquets", # folder
"ptp_parquets/ptp_parquets", # folder
"ptp_parquets/prod_prc.parquet",
"ptp_parquets/packed_yesterday.parquet",
"ptp_parquets/packed_transports.parquet",
"ptp_parquets/orders.parquet",
"ptp_parquets/lines.parquet",
"ptp_parquets/flight_transports.parquet",
"ptp_parquets/dist_freight.parquet",
"ptp_parquets/container_transports", # file with no file ending!
"ptp_parquets", # folder
"ptp_parquets", # folder
]
blobs, directory_markers = _identify_directory_markers_test(files)
print("FILES")
for blob in blobs:
print(blob)
print("\nDIRs")
for d in directory_markers:
print(d) if all files are directories like this files = [ they will all end up as |
Hi, my old conda environment:
runs the following without any error
But after updating to these package versions:
it raises this error:
RuntimeError: ('Failed to remove %s for %s', [PATHS], ResourceExistsError('This operation is not permitted on a non-empty directory.\nRequestId:ID\nTime:2023-01-20T09:33:26.1693752Z\nErrorCode:DirectoryIsNotEmpty'))
Bacially, the
recursive=True
do not work as intended.The text was updated successfully, but these errors were encountered: