Add ctime/mtime to list of expected values in info #526

Open
martindurant opened this issue Feb 2, 2021 · 5 comments
Labels: good first issue

Comments

@martindurant (Member)

Created and/or modified times are returned in the file info by most backends. We should endeavour to surface these in the info dict with a common format (datetime.datetime? unix timestamp?) and common key names.

e.g.,

--- a/fsspec/implementations/local.py
+++ b/fsspec/implementations/local.py
@@ -78,6 +78,8 @@ class LocalFileSystem(AbstractFileSystem):
                 result["size"] = out2.st_size
             except IOError:
                 result["size"] = 0
+        result['created'] = datetime.datetime.utcfromtimestamp(result["created"])
+        result['modified'] = datetime.datetime.utcfromtimestamp(result["mtime"])
         return result
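
With a change along those lines, callers could rely on the standardized values (illustrative only; the exact keys and types are what this issue should decide):

import datetime
import fsspec

fs = fsspec.filesystem("file")
info = fs.info("/tmp/somefile")  # placeholder path
# "created" and "modified" would then be datetime.datetime objects rather
# than the raw st_ctime/st_mtime floats:
assert isinstance(info["created"], datetime.datetime)
assert isinstance(info["modified"], datetime.datetime)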
martindurant added the good first issue label on Mar 5, 2021
@martindurant (Member, Author)

Marked as "good first issue" because this should be simple per implementation, but there are quite a few implementations to go through.

@ap-- (Contributor) commented Feb 9, 2024

A list of filesystems and their info keys

I collected some information about the .info() dicts of the different filesystems.
Posting it here in case it might be useful:

AbstractFileSystem

"name", "size", "type"

elif len(out1) > 1 or out:
    return {"name": path, "size": 0, "type": "directory"}

arrow

"name", "size", "type", "mtime" (datetime | float | None)

def _make_entry(self, info):
    from pyarrow.fs import FileType

    if info.type is FileType.Directory:
        kind = "directory"
    elif info.type is FileType.File:
        kind = "file"
    elif info.type is FileType.NotFound:
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), info.path)
    else:
        kind = "other"
    return {
        "name": info.path,
        "size": info.size,
        "type": kind,
        "mtime": info.mtime,
    }

https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileInfo.html#pyarrow.fs.FileInfo

dask

returns whatever the remote fs returns.

def ls(self, *args, **kwargs):
    if self.worker:
        return self.fs.ls(*args, **kwargs)
    else:
        return self.rfs.ls(*args, **kwargs).compute()

data

"name", "size", "type", "mimetype"

def info(self, path, **kwargs):
    pref, name = path.split(",", 1)
    data = self.cat_file(path)
    mime = pref.split(":", 1)[1].split(";", 1)[0]
    return {"name": name, "size": len(data), "type": "file", "mimetype": mime}

dbfs

"name", "size", "type"

out = [
    {
        "name": o["path"],
        "type": "directory" if o["is_dir"] else "file",
        "size": o["file_size"],
    }
    for o in files
]

dirfs

returns whatever the remote fs returns.

async def _ls(self, path, detail=True, **kwargs):
    ret = (await self.fs._ls(self._join(path), detail=detail, **kwargs)).copy()
    if detail:
        out = []
        for entry in ret:
            entry = entry.copy()
            entry["name"] = self._relpath(entry["name"])
            out.append(entry)
        return out

ftp

"name", "size", "type", "modify", "unix.owner", "unix.group", "unix.mode", and other returned via FTP.mlsd()

try:
    out = [
        (fn, details)
        for (fn, details) in self.ftp.mlsd(path)
        if fn not in [".", ".."]
        and details["type"] not in ["pdir", "cdir"]
    ]
except error_perm:
    out = _mlsd2(self.ftp, path)  # Not platform independent
for fn, details in out:
    if path == "/":
        path = ""  # just for forming the names, below
    details["name"] = "/".join([path, fn.lstrip("/")])
    if details["type"] == "file":
        details["size"] = int(details["size"])
    else:
        details["size"] = 0
    if details["type"] == "dir":
        details["type"] = "directory"

this = (
    split_line[-1],
    {
        "modify": " ".join(split_line[5:8]),
        "unix.owner": split_line[2],
        "unix.group": split_line[3],
        "unix.mode": split_line[0],
        "size": split_line[4],
    },
)
if "d" == this[1]["unix.mode"][0]:
    this[1]["type"] = "dir"
else:
    this[1]["type"] = "file"
minfo.append(this)

git

"name", "size", "type", "hex", "mode" # mode is octal str, hex is str?

{
    "type": "file",
    "name": "/".join([path, obj.name]).lstrip("/"),
    "hex": obj.hex,
    "mode": f"{obj.filemode:o}",
    "size": obj.size,
}

github

"name", "size", "type", "sha", "mode" # mode is octal str, sha is str

types = {"blob": "file", "tree": "directory"}
out = [
    {
        "name": path + "/" + f["path"] if path else f["path"],
        "mode": f["mode"],
        "type": types[f["type"]],
        "size": f.get("size", 0),
        "sha": f["sha"],
    }
    for f in r.json()["tree"]
    if f["type"] in types
]

http

"name", "size", "type", "mimetype", "ETag", "Content-MD5", "Digest"

{
    "name": u,
    "size": None,
    "type": "directory" if u.endswith("/") else "file",
}

if "Content-Length" in r.headers:
# Some servers may choose to ignore Accept-Encoding and return
# compressed content, in which case the returned size is unreliable.
if "Content-Encoding" not in r.headers or r.headers["Content-Encoding"] in [
"identity",
"",
]:
info["size"] = int(r.headers["Content-Length"])
elif "Content-Range" in r.headers:
info["size"] = int(r.headers["Content-Range"].split("/")[1])
if "Content-Type" in r.headers:
info["mimetype"] = r.headers["Content-Type"].partition(";")[0]
info["url"] = str(r.url)
for checksum_field in ["ETag", "Content-MD5", "Digest"]:
if r.headers.get(checksum_field):
info[checksum_field] = r.headers[checksum_field]

jupyter

"name", "size", "type", "last_modified", "created", "format", "mimetype", "writable"

out = r.json()
if out["type"] == "directory":
    out = out["content"]
else:
    out = [out]
for o in out:
    o["name"] = o.pop("path")
    o.pop("content")
    if o["type"] == "notebook":
        o["type"] = "file"

example:

{
    "name": "slurm-22382538.out",
    "last_modified": "2024-02-09T13:03:30.773865Z",
    "created": "2024-02-09T13:03:30.773865Z",
    "format": null,
    "mimetype": null,
    "size": 2896,
    "writable": true,
    "type": "file"
}

libarchive

"name", "size", "type", "created", "mode", "uid", "gid", "mtime"

self.dir_cache.update(
    {
        dirname: {"name": dirname, "size": 0, "type": "directory"}
        for dirname in self._all_dirnames(set(entry.name))
    }
)
f = {key: getattr(entry, fields[key]) for key in fields}
f["type"] = "directory" if entry.isdir else "file"

libarchive mappings:

fields = {
    "name": "pathname",
    "size": "size",
    "created": "ctime",
    "mode": "mode",
    "uid": "uid",
    "gid": "gid",
    "mtime": "mtime",
}

local

"name", "size", "type", "created", "isLink", "mode", "uid", "gid", "mtime", "ino", "nlink", "destination"

result = {
    "name": path,
    "size": out.st_size,
    "type": t,
    "created": out.st_ctime,
    "islink": link,
}
for field in ["mode", "uid", "gid", "mtime", "ino", "nlink"]:
    result[field] = getattr(out, f"st_{field}")
if result["islink"]:
    result["destination"] = os.readlink(path)
    try:
        out2 = os.stat(path, follow_symlinks=True)
        result["size"] = out2.st_size
    except OSError:
        result["size"] = 0

memory

"name", "size", "type", "created"

return [
    {
        "name": path,
        "size": self.store[path].size,
        "type": "file",
        "created": self.store[path].created.timestamp(),
    }
]

reference

"name", "size", "type"

fileinfo = [
    {
        "name": name,
        "type": "file",
        "size": len(
            json.dumps(self.zmetadata[name])
            if name in self.zmetadata
            else self._items[name]
        ),
    }
    for name in others
]

sftp

"name", "size", "type", "uid", "gid", "time", "mtime"

out = {
    "name": "",
    "size": stat.st_size,
    "type": t,
    "uid": stat.st_uid,
    "gid": stat.st_gid,
    "time": datetime.datetime.fromtimestamp(
        stat.st_atime, tz=datetime.timezone.utc
    ),
    "mtime": datetime.datetime.fromtimestamp(
        stat.st_mtime, tz=datetime.timezone.utc
    ),
}

smb

"name", "size", "type", "uid", "gid", "time", "mtime"

res = {
    "name": path + "/" if stype == "directory" else path,
    "size": stats.st_size,
    "type": stype,
    "uid": stats.st_uid,
    "gid": stats.st_gid,
    "time": stats.st_atime,
    "mtime": stats.st_mtime,
}

tar

"name", "size", "type", "mode", "uid", "gid", "mtime", "chksum", "linkname", "uname", "gname", "devmajor", "devminor"

for member in self.tar.getmembers():
    info = member.get_info()
    info["name"] = info["name"].rstrip("/")
    info["type"] = typemap.get(info["type"], "file")
    self.dir_cache[info["name"]] = info

example:

_ = {
    'name': 'somefile.md',
    'mode': 420,
    'uid': 501,
    'gid': 20,
    'size': 382,
    'mtime': 1707314187,
    'chksum': 8314,
    'type': 'file',
    'linkname': '',
    'uname': 'andreaspoehlmann',
    'gname': 'staff',
    'devmajor': 0,
    'devminor': 0
}

webhdfs

"name", "size", "type", "accessTime", "blockSize", "group", "modificationTime", "owner", "pathSuffix", "permission", "replication"

def info(self, path):
    out = self._call("GETFILESTATUS", path=path)
    info = out.json()["FileStatus"]
    info["name"] = path
    return self._process_info(info)

https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#FileStatus

zip

"name", "size", "type"

{
    "name": z.filename.rstrip("/"),
    "size": z.file_size,
    "type": ("directory" if z.is_dir() else "file"),
}

adlfs

"name", "size", "type", "metadata", "creation_time", "deleted", "deleted_time", "last_modified", "content_time", "content_settings", "remaining_retention_days", "archive_status", "last_accessed_on", "etag", "tags", "tag_count", "version_id", "is_current_version"

https://github.com/fsspec/adlfs/blob/576fb7a6a53a55375b4458c09e5bb571d945d410/adlfs/spec.py#L49-L67

https://github.com/fsspec/adlfs/blob/576fb7a6a53a55375b4458c09e5bb571d945d410/adlfs/spec.py#L829C13-L846

gcsfs

https://cloud.google.com/storage/docs/json_api/v1/objects#resource

https://github.com/fsspec/gcsfs/blob/f526d96860c1422e7b4599b70b267607dae1af8a/gcsfs/core.py#L465-L477

s3fs

"name", "size", "type", "StorageClass", "VersionId", "ContentType", "ETag", "LastModified"

https://github.com/fsspec/s3fs/blob/74f4d95a62d7339a1af12db4339f22c5f3d73670/s3fs/core.py#L1310-L1319

alluxio

"name", "size", "type", "last_modification_time_ms"

https://github.com/fsspec/alluxiofs/blob/33489bcea618d6e934e5227be77be75b5ca105ff/alluxiofs/core.py#L134-L149

wandb

"name", "size", "type", "md5", "mimetype"

https://github.com/jkulhanek/wandbfs/blob/ccc7e4dceb45070de8c440b44ddee96fdd348057/wandbfs/_wandbfs.py#L63-L68

oci

"name", "size", "type", "etag", "md5", "timeCreated", "timeModified", "storageTier", "archivalState"

https://github.com/oracle/ocifs/blob/f0e1d3b7b26bc1c1b010abb11df6cd06ac318ed3/ocifs/core.py#L498-L509

asynclocal

same as local

gdrive

"name", "size", "type", and other returned via ??? https://developers.google.com/drive/api/reference/rest/v3/files#File

https://github.com/fsspec/gdrivefs/blob/8bbfa457605d60d40d2b09c8c93d493cf543100e/gdrivefs/core.py#L157-L160

dropbox

"name", "size", "type", and all public attr from FileMetadata

https://dropbox-sdk-python.readthedocs.io/en/latest/api/files.html#dropbox.files.FileMetadata

https://github.com/fsspec/dropboxdrivefs/blob/23463258eca49c10d77de33e9d07e4ee5caa090c/dropboxdrivefs/core.py#L163-L176

oss

"name", "size", "type", "LastModified"

https://github.com/fsspec/ossfs/blob/016ccbad6b90fe02cf613582bb8db3bb101f4438/src/ossfs/base.py#L186-L199

webdav

"name", "size", "type" and others returned via

_ = {
    'name': '/',
    'href': '/',
    'size': None,
    'created': datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=tzutc()),
    'modified': datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=datetime.timezone.utc),
    'content_language': None,
    'content_type': None,
    'etag': None,
    'type': 'directory',
    'display_name': 'test_storage_options0'
}

https://github.com/skshetry/webdav4/blob/4c2046e2250f001bdad76541c0e877e4b40c332e/src/webdav4/fsspec.py#L51-L57

https://github.com/skshetry/webdav4/blob/4c2046e2250f001bdad76541c0e877e4b40c332e/src/webdav4/client.py#L54-L65

dvc

"name", "size", "type", "md5", "md5-dos2unix", "dvc_info", "isdvc", "isout", "fs_info", "isexec", "repo"

https://github.com/iterative/dvc/blob/953ae56536f03d915f396cd6cafd89aaa54fafc5/dvc/fs/dvc.py#L41-L69

root

"name", "size", "type"

https://github.com/CoffeaTeam/fsspec-xrootd/blob/f8c57cd7b0361425ee08a77096dd642ddeb1d987/src/fsspec_xrootd/xrootd.py#L320-L338

box

"name", "size", "type", "id", "modified_at", "created_at"

https://github.com/IBM/boxfs/blob/718fb0071d20a7004f44fe2fa0eac26dc9c3d5d5/src/boxfs/boxfs.py#L395-L402

lakefs

"name", "size", "type", "content-type", "checksum", "mtime"

https://github.com/aai-institute/lakefs-spec/blob/f05c5b6c57547e9f169e3b9c4ed5346f2d65bf35/src/lakefs_spec/spec.py#L356-L363

@martindurant (Member, Author) commented Feb 9, 2024

Thank you, @ap--, that is very useful. Also worth adding that some backends that don't really have directories will make fake info dicts for those directories, typically with {"name": "...", "size": 0, "type": "directory"}.

Your list makes it sound like any FS could do with an add_standard_info_fields(info_dict) static method, where we decide what those standard fields are: for example, converting whatever time representation a backend provides to a standard one, which would help rsync() in particular.
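
For illustration, a minimal sketch of such a hook (the method name, alias list, and target representation are all hypothetical; the aliases come from the survey above):

import datetime

MTIME_ALIASES = (
    "mtime", "modified", "last_modified", "LastModified",
    "modificationTime", "modified_at", "timeModified", "modify",
)

def add_standard_info_fields(info):
    """Normalize backend-specific modified-time keys onto a standard "mtime"."""
    for key in MTIME_ALIASES:
        if key not in info:
            continue
        value = info[key]
        if isinstance(value, datetime.datetime):
            info["mtime"] = value
        elif isinstance(value, (int, float)):
            # assumes epoch seconds; webhdfs reports milliseconds, so a real
            # implementation would need per-backend handling
            info["mtime"] = datetime.datetime.fromtimestamp(
                value, tz=datetime.timezone.utc
            )
        elif isinstance(value, str):
            # e.g. jupyter returns ISO-8601 strings
            info["mtime"] = datetime.datetime.fromisoformat(
                value.replace("Z", "+00:00")
            )
        break
    return info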

@ap-- (Contributor) commented Feb 9, 2024

Yes, that would be a great step towards standardizing the info_dict.

AbstractFileSystem could even have a default implementation that tries various aliases for getting mtime (and potentially others), as well as conversions to the standard datatype (like this).

For completeness, I'm cross-referencing barneygale/pathlib-abc#3. I started looking into this because I need to convert info_dicts into an os.stat_result-compatible type for universal_pathlib.
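
For reference, os.stat_result can be built from a plain 10-tuple, so a rough conversion might look like this (a sketch only; the fallbacks are guesses and times are assumed to already be epoch seconds):

import os
import stat

def info_to_stat_result(info):
    """Hypothetical info-dict to os.stat_result conversion."""
    mode = info.get("mode", 0o644)
    if info.get("type") == "directory":
        mode |= stat.S_IFDIR
    else:
        mode |= stat.S_IFREG
    mtime = int(info.get("mtime", 0) or 0)
    # os.stat_result accepts a 10-tuple: (st_mode, st_ino, st_dev, st_nlink,
    # st_uid, st_gid, st_size, st_atime, st_mtime, st_ctime)
    return os.stat_result((
        mode,
        info.get("ino", 0),
        0,                             # st_dev: not exposed by most backends
        info.get("nlink", 1),
        info.get("uid", 0),
        info.get("gid", 0),
        int(info.get("size", 0) or 0),
        mtime,                         # st_atime: fall back to mtime
        mtime,
        int(info.get("created", mtime) or mtime),
    ))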

@dholth (Contributor) commented Mar 22, 2024

While you're at it, nanoseconds instead of float times would be good: https://docs.python.org/3/library/os.html#os.stat_result.st_mtime_ns
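
For example (standard library behaviour; the path is a placeholder):

import os

st = os.stat("somefile.txt")  # placeholder path
st.st_mtime     # float seconds; sub-microsecond precision can be lost
st.st_mtime_ns  # integer nanoseconds; exact as reported by the OS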
