idu
analyzes a file system to build a database that
suports incremental re-scanning to support large local and clould based
fileystems. The analysis takes the form of scanning a filesystem, much like
du does, to gather information on file counts and sizes. An important
difference to du is that idu is heavily optimized for concurrent execution
and can easily handle issuing 1000s of simulataneous stat requests and
directory scans. For example it can scan an Apple Silicon macbook in
around 10 minutes and a 14M+ file lustre filesystem in around 50 minutes.
idu
is designed to be extensible to cloud based filesystems as AWS' S3
or GCP's Cloud Storage though this is not yet implemented.
It can report, from the database, aggregate statistics such as total
file counts, disk usage and to generate reports in json, markdown formats.
It is also possible to query the database in a variety of means,
including a per-user basis.
Note that cloud based filesystems generally do not have a concrete
directory structure in the same way that local filesystems do.
S3 filenames for example have slash separated components
but each component is not a directory in the sense that a user
can cd
to it and list files relative to it. Instead, the separators
are purely a convention and S3 filenames can be accessed independently
of the slash separate components. For example, s3:/aa/bb/cc
can be
listed as aws s3 ls s3:/aa/b
or aws s3 ls s3:/aa/bb/
. The former
will list all files starting with the prefix /aa/b
whereas the
latter will only list files whose names start with s3:/aa/bb
. Since
idu
is intended to work with cloud based filesystems the term
prefix
is often used instead of, or along with, directory. Differences
in behaviour for different filesystems will be called out as they
are added. Currently only local filesystems are supported.
idu
is configured using a yaml file, typically $HOME/idu.yml
, but
this can overriden with --config
file. This configuration file
is organized as a list of 'prefix' entries, each of which
specifies a filesystem tree to be used with idu
.
Each prefix entry specifies the tree to be scanned, the location of the database to be used/created and various, optional, configuration parameters. A minimal entry is as follows:
- prefix: /my/home/tree
database: /my/home/database/location
Common options control the degree of concurrency to use when analyzing a prefix. These are:
concurrent_scans: 5000 # scan up to 5000 directories concurrently.
concurrent_stats: 2000 # issue at most 2000 concurrent stat operations.
concurrent_stats_threshold: 10 # issue asynchronous stats if the number of files in a directory exceeds 10.
scan_size: 2000 # scan 2000 items at a time from each directory
Additional options are available to specify exclusions and file system specific otions.
Exclusions
section can be used to exclude directories/prefixes
and/or files that match the supplied regular expression. For MacOS
systems for example it may be desirable to ignore the .DS_Store
file,
and the CloudStorage
directory, which can be achieved as follows:
exclusions:
- '.DS_Store$'
- '^/User/someone/Library/CloudStorage'
It is possible to specify the file system separator (/ for Unix, \ for windows).
separator: \
The layouts
section is used to calculate disk usage by taking into
account file system block sizes, or more complex structures such as RAID.
layout:
type: block
block_size: 4096
Given a valid configuration file (shown below), idu
can be used as outlined below.
- prefix: /projects/yourshared-project/
database: /projects/yourshared-project/.idu/database
Common usage is as follows:
$ idu analyze /projects/yourshared-project/
$ idu errors /projects/yourshared-project/
$ idu stats compute --stats-dir=./stats /projects/yourshared-project/
$ idu stats view ./stats/latest.idustats
$ idu reports generate ./stats/latest.idustats
As idu
runs it will print various statistics that follow its progress. idu
may be safely interrupted and restarted (see Incremental Updates below).
Once complete, it's good practice to see if idu analyze
encountered any errors,
which are also written to the database, by running idu errors
as show above. Note
that errors are common and most often due to permissions problems; idu
records errors and leaves it to the user to decide whether they are relevant or not; for
example is a lot of disk usage behind an inaccessible due to permissions path?
stats compute <prefix>
will compute stats from the database and store
them in a timestamped file in --stats-dir
(it will create a soft-link, latest.idustats
to the file producted). stats view <idustats-file>
can be used to
read the stats from the database and print them to stdout. reports generate <idustats-file>
will generate a markdown report of the stats and write it to stdout.
Per-user or per-group statistics can be viewed as follows:
$ idu stats view --user=<user> <idustats-file>
$ idu stats view --group=<group> <idustats-file>
## Anticipated Changes and Improvements
### Cloud
`idu` was designed with cloud filesystems and support for GCP's Cloud
Storage and AWS S3 will be added in the near future.