Filesets

Andrea Richiardi edited this page Jan 30, 2016 · 9 revisions

Build workflows necessarily involve the filesystem because it's not practical to send file contents around in memory as function arguments, and because we want to be able to leverage existing JVM tooling that generally operates on things in the class path and not on in-memory data structures.

Complex Leiningen project.clj files typically name numerous places on disk where plugins should either emit or expect files for various purposes. Because these places are global to the build process and may be shared by independent, destructive processes, configuring builds this way is brittle.

To aid in task composition, and to alleviate the difficulty inherent in coordinating globally addressable places, boot does things differently.

Overview

Boot is more than just a build tool. Boot is a framework for bootstrapping Clojure applications. Since Clojure programs run on the JVM, there is a lot of bootstrapping to do:

  • Dependencies – fetch JARs from Maven (the immutable classpath).
  • Classpath files – add directories to the classpath (the mutable classpath).
  • Environment – prepare the Clojure environment to run the program.

Managing interactions with the filesystem is a key part of this process.

Bootstrapping Clojure

The bootstrapping process can be expressed in terms of input and output:

$F_{user} \xrightarrow{\text{BOOTSTRAP}} J_{cp} + F_{cp} + F_{asset} + E$

where

$F_{user}$
User's project files (the build.boot file, sources, assets, etc).
$J_{cp}$
JARs on the classpath (the immutable classpath).
$F_{cp}$
Files in directories on the classpath (the mutable classpath).
$F_{asset}$
Files in directories not on the classpath that the program might need.
$E$
The Clojure environment in memory.

Of course, the line between bootstrapping and computing proper isn't clearly delineated. Most of the time the bootstrapping process will be extended (via Tasks) to set things up and launch a specific application or build specific artifacts.

This process of extending the bootstrapping phase to create an artifact for deployment or distribution, or to launch an application can be described as a transformation:

$J_{cp} + F_{cp} + F_{asset} + E \xrightarrow{\text{TASK}} J_{cp} + F'_{cp} + F'_{asset} + E'$

In any case, the JARs on the classpath cannot be modified and the environment is simply the result of evaluating Clojure expressions; Lisp already provides the abstractions we need to manage those. Boot provides a fileset abstraction to frame the $F \to F'$ component of the transformation in a Lispy, functional way.

The Boot Fileset

Boot provides a fileset record type to manage the parts of the process that involve interaction with the filesystem: an in-memory, immutable representation of the state of the files in $F_{cp}$ and $F_{asset}$ at a point in time.

The file-related parts of the process above can be described as:

$F_{cp} + F_{asset} \xrightarrow{\text{FILESET OPERATIONS}} F'_{cp} + F'_{asset}$

or, equivalently, as:

$\mathit{fileset} \xrightarrow{\text{FILESET OPERATIONS}} \mathit{fileset}'$

Operations on fileset values return new values.

Note: The underlying filesystem is not immutable. The fileset protocol provides a commit! method to sync a given immutable fileset with the underlying mutable filesystem. This overwrites the files on disk to reflect the state of the fileset object.
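
The difference between a fileset value and the mutable filesystem can be illustrated with a toy model in plain Clojure. This is a sketch, not Boot's implementation: the "fileset" here is just a map from relative path to contents, and add-file, rm-file, and commit-to! are hypothetical helpers invented for the illustration.

```clojure
(require '[clojure.java.io :as io])

;; Toy model only: a "fileset" is a map of relative path -> file contents.
(defn add-file [fs path contents] (assoc fs path contents))
(defn rm-file  [fs path]          (dissoc fs path))

(defn commit-to!
  "Make dir mirror fs: delete what is there, then write every entry of fs."
  [fs ^java.io.File dir]
  ;; file-seq lists parents before children; reverse so children go first.
  (doseq [f (reverse (rest (file-seq dir)))]
    (.delete f))
  (doseq [[path contents] fs]
    (let [out (io/file dir path)]
      (io/make-parents out)
      (spit out contents)))
  fs)

(def fs0 {})
(def fs1 (add-file fs0 "a/hello.txt" "hello"))
(def fs2 (-> fs1 (add-file "b.txt" "bye") (rm-file "a/hello.txt")))
```

Deriving fs2 leaves fs1 untouched, and nothing reaches the disk until commit-to! is called with one of the values. Either value can be committed at any later time.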

The Task Model

The boot build process is essentially a pipeline of middleware, similar to the Ring servlet architecture, or Clojure transducers. Where Ring handlers take a request map and return a response map, task handlers take and return fileset objects.

The fileset lifecycle goes something like this:

  1. Receive – handler is passed an immutable fileset as its argument.
  2. Query – handler obtains a set of files to process from the fileset.
  3. Work – handler performs some operation, creating files in temp dirs.
  4. Add – add temp files to fileset, obtaining a new immutable value.
  5. Commit – sync the underlying filesystem dirs to the fileset.
  6. Next – call the next handler, passing it the new immutable fileset.

The middleware approach combined with the immutable fileset messaging between handlers provides a basis for creating powerful, composable modules.
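
The shape of this pipeline can be sketched without Boot at all. In the toy model below, filesets are plain maps of path to contents, and upcase-task and add-file-task are hypothetical tasks made up for the example; only the middleware structure (task returns middleware, middleware returns handler, tasks compose with comp) mirrors what Boot does.

```clojure
;; Toy model only: filesets are plain maps of path -> contents.
(defn upcase-task
  "Middleware that upper-cases every file before calling the next handler."
  []
  (fn middleware [next-handler]
    (fn handler [fileset]
      (next-handler
        (into {} (for [[path s] fileset]
                   [path (.toUpperCase ^String s)]))))))

(defn add-file-task
  "Middleware that adds one file before calling the next handler."
  [path contents]
  (fn middleware [next-handler]
    (fn handler [fileset]
      (next-handler (assoc fileset path contents)))))

;; Compose with comp, as with Ring middleware: the leftmost task's handler
;; runs first and passes its new fileset value down the pipeline.
(def pipeline
  (comp (add-file-task "greeting.txt" "hello")
        (upcase-task)))

;; Terminate the pipeline with identity as the final handler.
(def run (pipeline identity))

(run {"readme.txt" "boot"})
;; => {"readme.txt" "BOOT", "greeting.txt" "HELLO"}
```

Because each handler only sees the value it is handed, the two tasks can be reordered or reused in other pipelines without knowing about each other.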

Benefits

The fileset abstraction has a number of desirable characteristics:

  • Filesets are values, not places: they can be anonymous and scoped.
  • Tasks can hold onto a value and commit! at a later time.
  • The flow of files is a succession of values, with occasional commit!s.

These characteristics allow boot to exploit efficiencies such as the use of hard links, structural sharing, and copying without exposing implementation details to the rest of the program.

Definitions

In developing the model for boot's treatment of the side-effecting nature of build tasks, it's helpful to map out the types of tasks and files that comprise a typical build process.

Types of Tasks

Boot build processes usually consist of two main types of tasks:

  • Build
    • compile things, emit code, etc.
    • consume and produce intermediate or source files
  • Package
    • create JAR files, executables, etc.
    • consume intermediate files to produce final artifacts.

Note: tasks may perform activities of one or both types.

Roles of Files

From a task's point of view files in the build set fulfill two principal roles. These roles express the creator's intent with respect to how tasks will use them:

  • Input
    • may be compiled or processed
    • on the build class path
    • consumed by build tasks
    • created by build tasks
  • Output
    • may be incorporated into final artifacts
    • emitted by boot to the target directory
    • consumed by packaging tasks
    • created by build or packaging tasks

Note: these roles are not mutually exclusive; they represent orthogonal concerns, and files may have either or both roles assigned.

Fileset Components

Given that files relevant to the build process can be characterized by the two main roles listed above, we can divide the build fileset into four components, corresponding to the four combinations of input and output roles:

type      input?  output?  example
resource  yes     yes      HTML files, Clojure source (without AOT)
source    yes     no       Java source, Clojure source (with AOT)
asset     no      yes      ??
cache     no      no       Various files needed during build

The relationship between roles and components is one of consumer and producer.

  • Consumers – query the fileset for files of a given role, depending on the type of task.
  • Producers – add files to the given component of the fileset to express their intent with respect to how subsequent tasks in the pipeline will use them.

Note: tasks are normally both consumer and producer; they consume files from the fileset and create artifacts of their own, adding them to the fileset.
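
The mapping from components to roles can be written down as data. The sketch below assumes the assignment in which resources carry both roles, sources only the input role, assets only the output role, and cache files neither. The file names and the files-with-role helper are hypothetical; Boot's own queries are the input-files and output-files functions described later.

```clojure
;; Roles carried by each fileset component (assumed assignment, see lead-in).
(def component-roles
  {:resource #{:input :output}
   :source   #{:input}
   :asset    #{:output}
   :cache    #{}})

;; Toy fileset: relative path -> component (made-up data for illustration).
(def files
  {"index.html"    :resource
   "src/Main.java" :source
   "logo.png"      :asset
   "deps.cache"    :cache})

(defn files-with-role
  "Paths of the files whose component carries the given role."
  [role files]
  (set (for [[path component] files
             :when (contains? (component-roles component) role)]
         path)))

(files-with-role :input files)  ;; => #{"index.html" "src/Main.java"}
(files-with-role :output files) ;; => #{"index.html" "logo.png"}
```

A producer chooses the component when it adds a file; a consumer only ever asks by role, which is why the two sides stay decoupled.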

Temp Directories

An important principle of the boot build process is that tasks do not refer to named places in the filesystem. Tasks may only create files in managed temp directories provided by boot. These temp directories are:

  • Anonymous – tasks do not specify the location of the temp dir.
  • Local – tasks do not pass references to temp dirs to other tasks.
  • Managed – temp dirs are cleaned up by boot as necessary.

In order to communicate files in these temp directories to the rest of the build process they must be added to the fileset object, described below.

Fileset Object

Boot provides a record type, TmpFileSet, that coordinates interaction with the filesystem. The fileset object is:

  • Immutable – operations on the fileset return new values.
  • Snapshot – the fileset models the state of the filesystem at a point in time.
  • Transactional – the filesystem can be synced to the fileset at any time.
  • Overlay – files are identified by unique paths relative to the fileset root.

Fileset & Temp Dirs API

The functions that make up the temp dirs API are all in the boot.core namespace.

Temp Directories

The only place where tasks are allowed to create or modify files is in temp directories provided by boot.

(tmp-dir!)
Returns a boot-managed temporary directory, as a java.io.File.

TmpFiles

The fileset is a tree of TmpFile objects. The underlying files are read-only.

(tmp-path f)
Returns the path of f relative to the fileset root.
(tmp-file f)
Returns the underlying java.io.File object for the temp file f.

Fileset Queries

Obtain sets of temp files from the fileset according to their roles.

(user-files fs)
Returns a set of TmpFile objects corresponding to files in fs that were created by the user as part of the project. These are not the actual files from the project; they are temp files that boot keeps synced with the user's files.
(input-files fs)
Returns a set of TmpFile objects corresponding to files in fs with the input role.
(output-files fs)
Returns a set of TmpFile objects corresponding to files in fs with the output role.

It is also possible to obtain references to the underlying boot-managed temp directories where the fileset is persisted. These directories are read-only.

(user-dirs fs)
Returns a set of java.io.File objects corresponding to the user's source, resource, and asset directories. These are not the actual user directories; they are temp dirs that boot keeps synced with the user's project directories.
(input-dirs fs)
Returns a set of java.io.File objects corresponding to directories in fs containing files with the input role.
(output-dirs fs)
Returns a set of java.io.File objects corresponding to directories in fs containing files with the output role.
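
As a rough illustration of the query functions, the task below prints the relative path and size of every .clj input file. It is a sketch that only runs inside a Boot build (for example from build.boot); the task name list-inputs is made up, and it assumes the with-pre-wrap macro and by-ext filter from boot.core.

```clojure
(require '[boot.core :as c])

(c/deftask list-inputs
  "Print the relative path and size in bytes of every .clj input file."
  []
  (c/with-pre-wrap fileset
    (doseq [f (->> (c/input-files fileset)
                   (c/by-ext [".clj"]))]
      ;; tmp-path is relative to the fileset root; tmp-file is the
      ;; read-only java.io.File that backs the TmpFile.
      (println (c/tmp-path f) (.length (c/tmp-file f))))
    ;; Nothing was added or removed, so pass the same value on.
    fileset))
```

The task reads through tmp-file but never writes to it, in keeping with the rule that the files behind a fileset are read-only.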

Fileset Operations

Fileset operations return new immutable fileset objects. These functions may have hidden side effects. They are not intended to be used in STM transactions.

(add-resource fs ^File dir)
Adds the contents of the dir directory to fs and assigns roles as defined for the resource component above. Paths of added files are relative to dir. Returns a new fileset object.
(add-source fs ^File dir)
Adds the contents of the dir directory to fs and assigns roles as defined for the source component above. Paths of added files are relative to dir. Returns a new fileset object.
(add-asset fs ^File dir)
Adds the contents of the dir directory to fs and assigns roles as defined for the asset component above. Paths of added files are relative to dir. Returns a new fileset object.
(rm fs tmpfiles)
Removes the TmpFiles in tmpfiles from the fileset fs. Returns a new fileset object.
(cp fs ^File src-file ^TmpFile dest-tmpfile)
Replaces the contents of dest-tmpfile with the contents of src-file. Returns a new fileset object.
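
A minimal sketch of a fileset operation in use: a task that removes .lc files so that later tasks do not see them. It only runs inside a Boot build; the task name strip-lc is hypothetical, and it assumes with-pre-wrap and by-ext from boot.core.

```clojure
(require '[boot.core :as c])

(c/deftask strip-lc
  "Remove .lc input files from the fileset passed to later tasks."
  []
  (c/with-pre-wrap fileset
    (let [lc-files (->> (c/input-files fileset)
                        (c/by-ext [".lc"]))]
      ;; rm returns a new fileset value; the fileset bound above is unchanged.
      (-> fileset
          (c/rm lc-files)
          c/commit!))))
```

Since rm works on the fileset value rather than on the user's project directories, the user's own .lc files on disk are not deleted.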

Sync Fileset to Disk

The fileset may be "synced" to the filesystem at any time. This is the only way for tasks to effect mutation of the classpath or communicate files to other tasks in the pipeline.

(commit! fs)
Syncs the underlying managed directories with the immutable fileset object fs, rebuilding the underlying directories according to its internal state. Returns the fileset object.

Examples

To demonstrate how filesets are used, consider a task that compiles files with the .lc extension to .uc by converting all lower case characters to upper case.

(ns acme.boot-lc
  {:boot/export-tasks true}
  (:require
    [boot.core       :as c]
    [clojure.java.io :as io]))

(defn- compile-lc!
  [in-file out-file]
  (doto out-file
    io/make-parents
    (spit (.toUpperCase (slurp in-file)))))

(defn- lc->uc
  [path]
  (.replaceAll path "\\.lc$" ".uc"))

(c/deftask lc
  "Compile .lc files."
  []
  (let [tmp (c/tmp-dir!)]                           ; [1]
    (fn middleware [next-handler]                   ; [2]
      (fn handler [fileset]                         ; [3]
        (c/empty-dir! tmp)                          ; [4]
        (let [in-files (c/input-files fileset)      ; [5]
              lc-files (c/by-ext [".lc"] in-files)] ; [6]
          (doseq [in lc-files]                      ; [7]
            (let [in-file  (c/tmp-file in)          ; [7.i]
                  in-path  (c/tmp-path in)          ; [7.ii]
                  out-path (lc->uc in-path)         ; [7.iii]
                  out-file (io/file tmp out-path)]  ; [7.iv]
              (compile-lc! in-file out-file)))      ; [7.v]
          (-> fileset                               ; [8]
              (c/add-resource tmp)                  ; [9]
              c/commit!                             ; [10]
              next-handler))))))                    ; [11]

The first two functions are just helper functions, representing processes that might be running in Pods in a real-world task. The task definition is where the interesting stuff happens:

  1. First, we obtain a temporary directory in which the task can create files. This is bound locally and closed over by the middleware the task returns, so the task can reuse the temp dir across build iterations.
  2. Tasks return middleware (similar to Ring middleware).
  3. Task middleware returns handlers (similar to Ring handlers).
  4. Inside the handler, the first thing we do is empty the temp dir, ensuring that stale files from previous builds are removed. A more sophisticated implementation could track dependencies and recompile only the source files that have changed, but for simplicity we will just rebuild everything.
  5. We query the fileset, obtaining a set of input files. (This is a build-type task, so we consume files with the input role.) Note that this returns a set of TmpFile objects.
  6. We then filter the input files, keeping only the .lc files, which are the sources we will be compiling. Note that this also returns a set of TmpFile objects.
  7. Then, we compile each of the filtered input files, producing output files in the temp dir.
    i. Get a reference to the underlying source file.
    ii. Get the path of the source file relative to the fileset root.
    iii. Compute the path of the output file relative to the temp dir.
    iv. Create an output file in the temp dir with the computed relative path.
    v. Invoke the compiler to compile the source file.
  8. At this point the temp dir contains the compiled .uc files, but they are not yet incorporated into the fileset object.
  9. We add the contents of the temp dir to the resources component of the fileset, obtaining a new fileset value.
  10. We commit the fileset to disk, returning the fileset object. The output files are now on the classpath.
  11. Finally, we pass the fileset to the next handler, returning the result to the previous task in the build pipeline.
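
For comparison, the same task can be written more compactly with the with-pre-wrap macro from boot.core, which builds the middleware and handler functions and calls the next handler with whatever fileset the body returns. This is a sketch under that assumption, runnable only inside a Boot build; the namespace name is hypothetical and the helpers are inlined.

```clojure
(ns acme.boot-lc-short
  {:boot/export-tasks true}
  (:require
    [boot.core       :as c]
    [clojure.java.io :as io]
    [clojure.string  :as str]))

(c/deftask lc
  "Compile .lc files to upper-cased .uc files (with-pre-wrap variant)."
  []
  (let [tmp (c/tmp-dir!)]
    (c/with-pre-wrap fileset
      ;; Start from a clean temp dir on every build iteration.
      (c/empty-dir! tmp)
      (doseq [in (->> (c/input-files fileset) (c/by-ext [".lc"]))]
        (let [out-path (str/replace (c/tmp-path in) #"\.lc$" ".uc")
              out-file (io/file tmp out-path)]
          (io/make-parents out-file)
          (spit out-file (str/upper-case (slurp (c/tmp-file in))))))
      ;; The value returned here is what the next handler receives.
      (-> fileset
          (c/add-resource tmp)
          c/commit!))))
```

The explicit middleware form is still useful when a task needs to do work after the next handler returns; the macro covers the common case where all the work happens before.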

EDN

Rendered as EDN, a fileset record looks like this (java.io.File values are shown as placeholder keywords):

{:dirs ;; a set of TmpDir
 #{{:dir :real-dir1-java-io-file
    :user true
    :input true
    :output nil}
   {:dir :real-dir2-java-io-file
    :user true
    :input true
    :output nil}}
 :tree ;; a map string -> TmpFile
 {"relative/foo/bar.clj"
  {:dir :real-dir1-java-io-file
   :bdir :cache-dir1-java-io-file
   :path "relative/foo/bar.clj"
   :id "3ff206d57a914e63a61a10369b01f297.1454093818000"
   :hash "3ff206d57a914e63a61a10369b01f297"
   :time 1454093818000
   :metadata-sample1 true
   :metadata-sample2 :sample}
  "relative/foo/baz.cljs"
  {:dir :real-dir2-java-io-file
   :bdir :cache-dir2-java-io-file
   :path "relative/foo/baz.cljs"
   :id "aa0a2cf7eda07f01bc932dfd21b1bca3.1449358564000"
   :hash "aa0a2cf7eda07f01bc932dfd21b1bca3"
   :time 1449358564000
   :metadata-sample3 "I am metadata"}}
 :blob :blob-java-io-file,
 :scratch :scratch-java-io-file}
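
Because the fileset is ordinary Clojure data, it can be inspected with the usual map functions. The sketch below works on a trimmed-down map shaped like the one above, with keywords standing in for the java.io.File values; paths-in-dir and changed-paths are hypothetical helpers, not part of boot.core.

```clojure
;; A trimmed-down map shaped like the example above; values are placeholders.
(def fs
  {:dirs #{{:dir :dir1 :user true :input true :output nil}
           {:dir :dir2 :user true :input true :output nil}}
   :tree {"relative/foo/bar.clj"
          {:dir :dir1 :path "relative/foo/bar.clj"
           :hash "3ff206d57a914e63a61a10369b01f297" :time 1454093818000}
          "relative/foo/baz.cljs"
          {:dir :dir2 :path "relative/foo/baz.cljs"
           :hash "aa0a2cf7eda07f01bc932dfd21b1bca3" :time 1449358564000}}})

(defn paths-in-dir
  "Sorted relative paths of the tree entries that live in dir."
  [fs dir]
  (->> (vals (:tree fs))
       (filter #(= dir (:dir %)))
       (map :path)
       sort))

(defn changed-paths
  "Paths in new whose :hash differs from old, or that are absent from old."
  [old new]
  (set (for [[path f] (:tree new)
             :when (not= (:hash f) (get-in old [:tree path :hash]))]
         path)))

(paths-in-dir fs :dir1) ;; => ("relative/foo/bar.clj")
```

Comparing hashes between two fileset values, as changed-paths does, is one way a task could decide which files need to be reprocessed between build iterations.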