Open
Description
I need to implement the FUSE getattr (stat) callback. I.e., I need to get at least the file type and size, and possibly name for a given path.
I am failing to do this with the HTTP filesystem implementation because:
- info(path) always returns the file information for the HTML file, i.e., the reported file type is always a file, even for a folder URL. This is already inconsistent with all other fsspec implementations. The same goes for isfile, which always returns true.
- isdir(path) hangs. Looking at my local HTTP server log, or at my network bandwidth when testing with an external server, I see that this call downloads the whole file. This means that currently an ls -la will download all files in the given folder...
Test to reproduce:
import pprint
import time
import fsspec
prefix="https://ash-speed.hetzner.com/"
def timedCall(f, *args):
    t0 = time.time()
    result = f(*args)
    t1 = time.time()
    print(f"{f} took {t1 - t0:.3f} s")
    pprint.pprint(result)
    print()
f = fsspec.open(prefix)
print(f"# Testing {prefix}\n")
timedCall(f.fs.exists, prefix)
timedCall(f.fs.listdir, prefix)
timedCall(f.fs.info, prefix)
timedCall(f.fs.isfile, prefix)
timedCall(f.fs.isdir, prefix)
path = prefix + "100MB.bin"
print(f"# Testing {path}\n")
timedCall(f.fs.exists, path)
timedCall(f.fs.info, path)
timedCall(f.fs.isfile, path)
timedCall(f.fs.isdir, path)

Output:
# Testing https://ash-speed.hetzner.com/
<function HTTPFileSystem._exists at 0x7fb0be244700> took 0.362 s
True
<bound method AbstractFileSystem.listdir of <fsspec.implementations.http.HTTPFileSystem object at 0x7fb0beaca680>> took 0.110 s
[{'name': 'https://ash-speed.hetzner.com/10GB.bin',
'size': None,
'type': 'file'},
{'name': 'https://ash-speed.hetzner.com/100MB.bin',
'size': None,
'type': 'file'},
{'name': 'https://ash-speed.hetzner.com/1GB.bin',
'size': None,
'type': 'file'}]
<function HTTPFileSystem._info at 0x7fb0be244550> took 0.763 s
{'ETag': '"60f52d50-143"',
'mimetype': 'text/html',
'name': 'https://ash-speed.hetzner.com/',
'size': 323,
'type': 'file',
'url': 'https://ash-speed.hetzner.com/'}
<function HTTPFileSystem._isfile at 0x7fb0be2445e0> took 0.108 s
True
<function HTTPFileSystem._isdir at 0x7fb0be244670> took 0.108 s
True
# Testing https://ash-speed.hetzner.com/100MB.bin
<function HTTPFileSystem._exists at 0x7fb0be244700> took 0.216 s
True
<function HTTPFileSystem._info at 0x7fb0be244550> took 1.098 s
{'ETag': '"60c9b8bd-6400000"',
'mimetype': 'application/octet-stream',
'name': 'https://ash-speed.hetzner.com/100MB.bin',
'size': 104857600,
'type': 'file',
'url': 'https://ash-speed.hetzner.com/100MB.bin'}
<function HTTPFileSystem._isfile at 0x7fb0be2445e0> took 0.428 s
True
<function HTTPFileSystem._isdir at 0x7fb0be244670> took 38.450 s
False

Imho, isdir should be implemented via a listdir of the parent if there is no other way. I am also wondering what it actually checks. Is it simply doing a mimetype check for HTML? If so, then the first 1000 or so bytes would suffice. But then, wouldn't it wrongly detect arbitrary HTML files inside a given "folder" as folders?
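To make the listdir idea concrete, here is a minimal sketch of it. The helper name isdir_via_parent_listing is hypothetical and not part of fsspec. It assumes that fs.ls(parent, detail=True) returns dicts with 'name' and 'type' keys, as in the output above, and that folder links are reported with type 'directory'. Treating the server root as a folder is also an assumption.

```python
from urllib.parse import urlsplit


def isdir_via_parent_listing(fs, url):
    """Hypothetical helper: decide whether `url` is a folder by listing its parent.

    `fs` is any object with an fsspec-style ls(path, detail=True) method.
    """
    stripped = url.rstrip("/")
    if urlsplit(stripped).path == "":
        # The server root has no parent to list; treating it as a folder
        # is an assumption of this sketch.
        return True
    parent = stripped.rsplit("/", 1)[0] + "/"
    for entry in fs.ls(parent, detail=True):
        if entry["name"].rstrip("/") == stripped:
            # Assumes the listing marks folders with type "directory".
            return entry["type"] == "directory"
    # Not linked from the parent's listing, so it cannot be classified this way.
    return False
```

This costs one request for the parent's HTML listing instead of a download of the target file. It only works if the parent page actually links to the target.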
My current workaround is to call info first and only call isdir if mimetype is text/html. This logic could also be implemented in HTTPFileSystem if there is no better way.
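The workaround looks roughly like the sketch below. The helper name isdir_guarded is hypothetical. It assumes that info() returns a 'mimetype' key, as in the output above, and that info() itself does not download the file body.

```python
def isdir_guarded(fs, url):
    """Hypothetical helper: only call isdir() when info() reports an HTML page."""
    info = fs.info(url)
    # Drop any parameters such as "; charset=utf-8" before comparing.
    mimetype = str(info.get("mimetype", "")).split(";")[0].strip().lower()
    if mimetype != "text/html":
        # Not HTML, so it cannot be a folder listing; skip the expensive isdir().
        return False
    return fs.isdir(url)
```

With this, the 100MB.bin case from the test never reaches isdir, since its mimetype is application/octet-stream. An ordinary HTML file inside a folder still goes through isdir, so the misdetection concern above remains.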