Add ctime/mtime to list of expected values in info #526

Open
martindurant opened this issue Feb 2, 2021 · 5 comments
Labels: good first issue

Comments

@martindurant (Member)

Created and/or modified times are returned in the file info by most backends. We should endeavour to surface these in the info dict with a common format (datetime.datetime? unix timestamp?) and common key names.

e.g.,

--- a/fsspec/implementations/local.py
+++ b/fsspec/implementations/local.py
@@ -78,6 +78,8 @@ class LocalFileSystem(AbstractFileSystem):
                 result["size"] = out2.st_size
             except IOError:
                 result["size"] = 0
+        result['created'] = datetime.datetime.utcfromtimestamp(result["created"])
+        result['modified'] = datetime.datetime.utcfromtimestamp(result["mtime"])
         return result
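
With a change along those lines, callers could rely on the standardized values (illustrative only; the exact keys and types are what this issue should decide):

import datetime
import fsspec

fs = fsspec.filesystem("file")
info = fs.info("/tmp/somefile")  # placeholder path
# "created" and "modified" would then be datetime.datetime objects rather
# than the raw st_ctime/st_mtime floats:
assert isinstance(info["created"], datetime.datetime)
assert isinstance(info["modified"], datetime.datetime)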
martindurant added the good first issue label on Mar 5, 2021
@martindurant (Member, Author)

Marked as "good first issue" because this should be simple per implementation, but there are quite a few implementations to go through.

@ap-- (Contributor) commented Feb 9, 2024

A list of filesystems and their info keys

I collected some information about the .info() dicts of the different filesystems.
Posting it here in case it might be useful:

AbstractFileSystem

"name", "size", "type"

elif len(out1) > 1 or out:
    return {"name": path, "size": 0, "type": "directory"}

arrow

"name", "size", "type", "mtime" (datetime | float | None)

def _make_entry(self, info):
    from pyarrow.fs import FileType

    if info.type is FileType.Directory:
        kind = "directory"
    elif info.type is FileType.File:
        kind = "file"
    elif info.type is FileType.NotFound:
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), info.path)
    else:
        kind = "other"
    return {
        "name": info.path,
        "size": info.size,
        "type": kind,
        "mtime": info.mtime,
    }

https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileInfo.html#pyarrow.fs.FileInfo

dask

returns whatever the remote fs returns.

def ls(self, *args, **kwargs):
    if self.worker:
        return self.fs.ls(*args, **kwargs)
    else:
        return self.rfs.ls(*args, **kwargs).compute()

data

"name", "size", "type", "mimetype"

def info(self, path, **kwargs):
    pref, name = path.split(",", 1)
    data = self.cat_file(path)
    mime = pref.split(":", 1)[1].split(";", 1)[0]
    return {"name": name, "size": len(data), "type": "file", "mimetype": mime}

dbfs

"name", "size", "type"

out = [
    {
        "name": o["path"],
        "type": "directory" if o["is_dir"] else "file",
        "size": o["file_size"],
    }
    for o in files
]

dirfs

returns whatever the remote fs returns.

async def _ls(self, path, detail=True, **kwargs):
    ret = (await self.fs._ls(self._join(path), detail=detail, **kwargs)).copy()
    if detail:
        out = []
        for entry in ret:
            entry = entry.copy()
            entry["name"] = self._relpath(entry["name"])
            out.append(entry)
        return out

ftp

"name", "size", "type", "modify", "unix.owner", "unix.group", "unix.mode", and other returned via FTP.mlsd()

try:
    out = [
        (fn, details)
        for (fn, details) in self.ftp.mlsd(path)
        if fn not in [".", ".."]
        and details["type"] not in ["pdir", "cdir"]
    ]
except error_perm:
    out = _mlsd2(self.ftp, path)  # Not platform independent
for fn, details in out:
    if path == "/":
        path = ""  # just for forming the names, below
    details["name"] = "/".join([path, fn.lstrip("/")])
    if details["type"] == "file":
        details["size"] = int(details["size"])
    else:
        details["size"] = 0
    if details["type"] == "dir":
        details["type"] = "directory"

this = (
    split_line[-1],
    {
        "modify": " ".join(split_line[5:8]),
        "unix.owner": split_line[2],
        "unix.group": split_line[3],
        "unix.mode": split_line[0],
        "size": split_line[4],
    },
)
if "d" == this[1]["unix.mode"][0]:
    this[1]["type"] = "dir"
else:
    this[1]["type"] = "file"
minfo.append(this)

git

"name", "size", "type", "hex", "mode" # mode is octal str, hex is str?

{
    "type": "file",
    "name": "/".join([path, obj.name]).lstrip("/"),
    "hex": obj.hex,
    "mode": f"{obj.filemode:o}",
    "size": obj.size,
}

github

"name", "size", "type", "sha", "mode" # mode is octal str, sha is str

types = {"blob": "file", "tree": "directory"}
out = [
    {
        "name": path + "/" + f["path"] if path else f["path"],
        "mode": f["mode"],
        "type": types[f["type"]],
        "size": f.get("size", 0),
        "sha": f["sha"],
    }
    for f in r.json()["tree"]
    if f["type"] in types
]

http

"name", "size", "type", "mimetype", "ETag", "Content-MD5", "Digest"

{
    "name": u,
    "size": None,
    "type": "directory" if u.endswith("/") else "file",
}

if "Content-Length" in r.headers:
# Some servers may choose to ignore Accept-Encoding and return
# compressed content, in which case the returned size is unreliable.
if "Content-Encoding" not in r.headers or r.headers["Content-Encoding"] in [
"identity",
"",
]:
info["size"] = int(r.headers["Content-Length"])
elif "Content-Range" in r.headers:
info["size"] = int(r.headers["Content-Range"].split("/")[1])
if "Content-Type" in r.headers:
info["mimetype"] = r.headers["Content-Type"].partition(";")[0]
info["url"] = str(r.url)
for checksum_field in ["ETag", "Content-MD5", "Digest"]:
if r.headers.get(checksum_field):
info[checksum_field] = r.headers[checksum_field]

jupyter

"name", "size", "type", "last_modified", "created", "format", "mimetype", "writable"

out = r.json()
if out["type"] == "directory":
    out = out["content"]
else:
    out = [out]
for o in out:
    o["name"] = o.pop("path")
    o.pop("content")
    if o["type"] == "notebook":
        o["type"] = "file"

example:

{
    "name": "slurm-22382538.out",
    "last_modified": "2024-02-09T13:03:30.773865Z",
    "created": "2024-02-09T13:03:30.773865Z",
    "format": null,
    "mimetype": null,
    "size": 2896,
    "writable": true,
    "type": "file"
}

libarchive

"name", "size", "type", "created", "mode", "uid", "gid", "mtime"

self.dir_cache.update(
    {
        dirname: {"name": dirname, "size": 0, "type": "directory"}
        for dirname in self._all_dirnames(set(entry.name))
    }
)
f = {key: getattr(entry, fields[key]) for key in fields}
f["type"] = "directory" if entry.isdir else "file"

libarchive mappings:

fields = {
    "name": "pathname",
    "size": "size",
    "created": "ctime",
    "mode": "mode",
    "uid": "uid",
    "gid": "gid",
    "mtime": "mtime",
}

local

"name", "size", "type", "created", "isLink", "mode", "uid", "gid", "mtime", "ino", "nlink", "destination"

result = {
    "name": path,
    "size": out.st_size,
    "type": t,
    "created": out.st_ctime,
    "islink": link,
}
for field in ["mode", "uid", "gid", "mtime", "ino", "nlink"]:
    result[field] = getattr(out, f"st_{field}")
if result["islink"]:
    result["destination"] = os.readlink(path)
    try:
        out2 = os.stat(path, follow_symlinks=True)
        result["size"] = out2.st_size
    except OSError:
        result["size"] = 0

memory

"name", "size", "type", "created"

return [
    {
        "name": path,
        "size": self.store[path].size,
        "type": "file",
        "created": self.store[path].created.timestamp(),
    }
]

reference

"name", "size", "type"

fileinfo = [
    {
        "name": name,
        "type": "file",
        "size": len(
            json.dumps(self.zmetadata[name])
            if name in self.zmetadata
            else self._items[name]
        ),
    }
    for name in others
]

sftp

"name", "size", "type", "uid", "gid", "time", "mtime"

out = {
    "name": "",
    "size": stat.st_size,
    "type": t,
    "uid": stat.st_uid,
    "gid": stat.st_gid,
    "time": datetime.datetime.fromtimestamp(
        stat.st_atime, tz=datetime.timezone.utc
    ),
    "mtime": datetime.datetime.fromtimestamp(
        stat.st_mtime, tz=datetime.timezone.utc
    ),
}

smb

"name", "size", "type", "uid", "gid", "time", "mtime"

res = {
    "name": path + "/" if stype == "directory" else path,
    "size": stats.st_size,
    "type": stype,
    "uid": stats.st_uid,
    "gid": stats.st_gid,
    "time": stats.st_atime,
    "mtime": stats.st_mtime,
}

tar

"name", "size", "type", "mode", "uid", "gid", "mtime", "chksum", "linkname", "uname", "gname", "devmajor", "devminor"

for member in self.tar.getmembers():
    info = member.get_info()
    info["name"] = info["name"].rstrip("/")
    info["type"] = typemap.get(info["type"], "file")
    self.dir_cache[info["name"]] = info

example:

_ = {
    'name': 'somefile.md',
    'mode': 420,
    'uid': 501,
    'gid': 20,
    'size': 382,
    'mtime': 1707314187,
    'chksum': 8314,
    'type': 'file',
    'linkname': '',
    'uname': 'andreaspoehlmann',
    'gname': 'staff',
    'devmajor': 0,
    'devminor': 0
}

webhdfs

"name", "size", "type", "accessTime", "blockSize", "group", "modificationTime", "owner", "pathSuffix", "permission", "replication"

def info(self, path):
    out = self._call("GETFILESTATUS", path=path)
    info = out.json()["FileStatus"]
    info["name"] = path
    return self._process_info(info)

https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#FileStatus

zip

"name", "size", "type"

{
    "name": z.filename.rstrip("/"),
    "size": z.file_size,
    "type": ("directory" if z.is_dir() else "file"),
}

adlfs

"name", "size", "type", "metadata", "creation_time", "deleted", "deleted_time", "last_modified", "content_time", "content_settings", "remaining_retention_days", "archive_status", "last_accessed_on", "etag", "tags", "tag_count", "version_id", "is_current_version"

https://github.com/fsspec/adlfs/blob/576fb7a6a53a55375b4458c09e5bb571d945d410/adlfs/spec.py#L49-L67

https://github.com/fsspec/adlfs/blob/576fb7a6a53a55375b4458c09e5bb571d945d410/adlfs/spec.py#L829C13-L846

gcsfs

https://cloud.google.com/storage/docs/json_api/v1/objects#resource

https://github.com/fsspec/gcsfs/blob/f526d96860c1422e7b4599b70b267607dae1af8a/gcsfs/core.py#L465-L477

s3fs

"name", "size", "type", "StorageClass", "VersionId", "ContentType", "ETag", "LastModified"

https://github.com/fsspec/s3fs/blob/74f4d95a62d7339a1af12db4339f22c5f3d73670/s3fs/core.py#L1310-L1319

alluxio

"name", "size", "type", "last_modification_time_ms"

https://github.com/fsspec/alluxiofs/blob/33489bcea618d6e934e5227be77be75b5ca105ff/alluxiofs/core.py#L134-L149

wandb

"name", "size", "type", "md5", "mimetype"

https://github.com/jkulhanek/wandbfs/blob/ccc7e4dceb45070de8c440b44ddee96fdd348057/wandbfs/_wandbfs.py#L63-L68

oci

"name", "size", "type", "etag", "md5", "timeCreated", "timeModified", "storageTier", "archivalState"

https://github.com/oracle/ocifs/blob/f0e1d3b7b26bc1c1b010abb11df6cd06ac318ed3/ocifs/core.py#L498-L509

asynclocal

same as local

gdrive

"name", "size", "type", and other returned via ??? https://developers.google.com/drive/api/reference/rest/v3/files#File

https://github.com/fsspec/gdrivefs/blob/8bbfa457605d60d40d2b09c8c93d493cf543100e/gdrivefs/core.py#L157-L160

dropbox

"name", "size", "type", and all public attr from FileMetadata

https://dropbox-sdk-python.readthedocs.io/en/latest/api/files.html#dropbox.files.FileMetadata

https://github.com/fsspec/dropboxdrivefs/blob/23463258eca49c10d77de33e9d07e4ee5caa090c/dropboxdrivefs/core.py#L163-L176

oss

"name", "size", "type", "LastModified"

https://github.com/fsspec/ossfs/blob/016ccbad6b90fe02cf613582bb8db3bb101f4438/src/ossfs/base.py#L186-L199

webdav

"name", "size", "type" and others returned via

_ = {
    'name': '/',
    'href': '/',
    'size': None,
    'created': datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=tzutc()),
    'modified': datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=datetime.timezone.utc),
    'content_language': None,
    'content_type': None,
    'etag': None,
    'type': 'directory',
    'display_name': 'test_storage_options0'
}

https://github.com/skshetry/webdav4/blob/4c2046e2250f001bdad76541c0e877e4b40c332e/src/webdav4/fsspec.py#L51-L57

https://github.com/skshetry/webdav4/blob/4c2046e2250f001bdad76541c0e877e4b40c332e/src/webdav4/client.py#L54-L65

dvc

"name", "size", "type", "md5", "md5-dos2unix", "dvc_info", "isdvc", "isout", "fs_info", "isexec", "repo"

https://github.com/iterative/dvc/blob/953ae56536f03d915f396cd6cafd89aaa54fafc5/dvc/fs/dvc.py#L41-L69

root

"name", "size", "type"

https://github.com/CoffeaTeam/fsspec-xrootd/blob/f8c57cd7b0361425ee08a77096dd642ddeb1d987/src/fsspec_xrootd/xrootd.py#L320-L338

box

"name", "size", "type", "id", "modified_at", "created_at"

https://github.com/IBM/boxfs/blob/718fb0071d20a7004f44fe2fa0eac26dc9c3d5d5/src/boxfs/boxfs.py#L395-L402

lakefs

"name", "size", "type", "content-type", "checksum", "mtime"

https://github.com/aai-institute/lakefs-spec/blob/f05c5b6c57547e9f169e3b9c4ed5346f2d65bf35/src/lakefs_spec/spec.py#L356-L363

@martindurant (Member, Author) commented Feb 9, 2024

Thank you, @ap--, that is very useful. Also worth adding that some backends that don't really have directories will make fake info dicts for those directories, typically with {"name": "...", "size": 0, "type": "directory"}.

Your list makes it sound like any FS could do with an add_standard_info_fields(info_dict) static method, where we decide what those standard fields are: for example, converting whatever time representation a backend provides to a standard one, which would help rsync() in particular.
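
For illustration, a minimal sketch of such a hook (the method name, alias list, and target representation are all hypothetical; the aliases come from the survey above):

import datetime

MTIME_ALIASES = (
    "mtime", "modified", "last_modified", "LastModified",
    "modificationTime", "modified_at", "timeModified", "modify",
)

def add_standard_info_fields(info):
    """Normalize backend-specific modified-time keys onto a standard "mtime"."""
    for key in MTIME_ALIASES:
        if key not in info:
            continue
        value = info[key]
        if isinstance(value, datetime.datetime):
            info["mtime"] = value
        elif isinstance(value, (int, float)):
            # assumes epoch seconds; webhdfs reports milliseconds, so a real
            # implementation would need per-backend handling
            info["mtime"] = datetime.datetime.fromtimestamp(
                value, tz=datetime.timezone.utc
            )
        elif isinstance(value, str):
            # e.g. jupyter returns ISO-8601 strings
            info["mtime"] = datetime.datetime.fromisoformat(
                value.replace("Z", "+00:00")
            )
        break
    return info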

@ap-- (Contributor) commented Feb 9, 2024

Yes, that would be a great step towards standardizing the info_dict.

AbstractFileSystem could even have a default implementation that tries various aliases for getting mtime (and potentially others), as well as conversions to the standard datatype (like this).

For completeness, I'm cross-referencing barneygale/pathlib-abc#3. I started looking into this because I need to convert info_dicts into an os.stat_result-compatible type for universal_pathlib.
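
For reference, os.stat_result can be built from a plain 10-tuple, so a rough conversion might look like this (a sketch only; the fallbacks are guesses and times are assumed to already be epoch seconds):

import os
import stat

def info_to_stat_result(info):
    """Hypothetical info-dict to os.stat_result conversion."""
    mode = info.get("mode", 0o644)
    if info.get("type") == "directory":
        mode |= stat.S_IFDIR
    else:
        mode |= stat.S_IFREG
    mtime = int(info.get("mtime", 0) or 0)
    # os.stat_result accepts a 10-tuple: (st_mode, st_ino, st_dev, st_nlink,
    # st_uid, st_gid, st_size, st_atime, st_mtime, st_ctime)
    return os.stat_result((
        mode,
        info.get("ino", 0),
        0,                             # st_dev: not exposed by most backends
        info.get("nlink", 1),
        info.get("uid", 0),
        info.get("gid", 0),
        int(info.get("size", 0) or 0),
        mtime,                         # st_atime: fall back to mtime
        mtime,
        int(info.get("created", mtime) or mtime),
    ))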

@dholth (Contributor) commented Mar 22, 2024

While you're at it, nanoseconds instead of float times would be good: https://docs.python.org/3/library/os.html#os.stat_result.st_mtime_ns
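
For example (standard library behaviour; the path is a placeholder):

import os

st = os.stat("somefile.txt")  # placeholder path
st.st_mtime     # float seconds; sub-microsecond precision can be lost
st.st_mtime_ns  # integer nanoseconds; exact as reported by the OS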
