Day 7: No Space Left On Device

Problem statement
Solution code
Alternate (eager) solution

Intro

This was an interesting puzzle today, because there are so many ways to represent the data, and there are a few ways to "walk" the data. I'm documenting my original solution, and may come back later with alternate implementations if I decide to try other options.

Part One

The puzzle input is text representing terminal output, where each line is either a command instruction to change directories or list the files, or describes a subdirectory relative to the current directory, or a file with its size in the current directory. So as always, we start with parsing the data.

Data format

The target structure I chose to represent is a recursive map. My original plan was to represent the data as such:

{:files {"file1.txt" 100, "file2.txt" 200}
 :dirs  {"dir1" {:files {}, :dirs {}},
         "dir2" {:files {"file3.txt" 300}
                 :dirs  {"dir3" {:files {}, :dirs {}}}}}}

Thus a directory was a map with keys :files and :dirs. Every file in the :files map would map the filename to its size, while the :dirs map would be a map from the directory name to the original data structure of :files and :dirs. It turns out that we don't need the filenames in either part, so I ended up simplifying it by giving every directory a ++shallow++ :size attribute. I may go back later and make it a complete size instead.

{:size 300, :dirs {"dir1" {:size   0, :dirs {}}
                   "dir2" {:size 300, :dirs {"dir3" {:size 0, :dirs {}}}}}}

Parsing

To parse the incoming data, we'll need to work our way through the input lines, but at all times we must know what the current directory of the parser is. For this, I want to quickly talk about the current-dir binding we'll use. This is a vector that represents the current location of the file system within the state map. When at the root, it's an empty vector. At every subdirectory, this is a vector of the keyword :dirs and the directory name. Thus, if we were at the dir3 directory in my example above, the value of current-dir would be [:dirs "dir2" :dirs "dir3"]. So this means that whenever we change into a subdirectory, we must append both :dirs and the directory name, and whenever we move up the file system we need to remove the last 2 values from the vector. For this, we'll use the move-up and move-down helper functions.

(defn move-up [current-dir]
  (subvec current-dir 0 (max 0 (- (count current-dir) 2))))

(defn move-down [current-dir sub-dir]
  (conj current-dir :dirs sub-dir))

To parse the incoming data, let's start with the top-level function and work our way down. We'll quickly define an empty-directory constant, and then go straight to parse-input.

(def root-path [])
(def empty-directory {:size 0, :dirs {}})

(defn parse-input [input]
  (loop [lines (str/split-lines input), state empty-directory, current-dir root-path]
    (if-some [line (first lines)]
      (let [[arg0 arg1 arg2] (str/split line #" ")]
        (if (= arg0 "$")
          (recur (rest lines) state (cond (= arg1 "ls") current-dir
                                          (= arg2 "/") root-path
                                          (= arg2 "..") (move-up current-dir)
                                          :else (move-down current-dir arg2)))
          (recur (rest lines)
                 (if (= arg0 "dir")
                   (create-directory state current-dir arg1)
                   (create-file state current-dir (parse-long arg0)))
                 current-dir)))
      state)))

empty-directory is our map of 0 size and no directories. I wrote parse-input a few times, but ultimately decided it was easiest to read as a loop-recur function, since we need to not only preserve the state of the file system, but also the current directory of the parser. Within the loop, if the lines sequence is empty, we're done so we just return the state. Otherwise, we split the line into up to 3 parameters.

If the first argument is a $ string, we'the code is running a command. We don't have to do anything with ls because the outputs will show up in the following lines. Otherwise, we're doing a cd command. (cd "/") means we reset to the root path, and otherwise we use move-up or move-down as described above.

If the first argument is not a $ string, then we're looking at the output of either a file or a directory. This code went through a few iterations, based on whether the first string is the word dir we're going to either delegate to create-directory or create-file functions. They're both very simple now that we've set up a nice vector to use for our current directory.

(defn create-directory [state current-dir dir-name]
  (assoc-in state (move-down current-dir dir-name) empty-directory))

(defn create-file [state current-dir size]
  (update-in state (conj current-dir :size) + size))

Ok, we have parsed data! On to the problem itself.

Puzzle logic

The goal is to figure out the total size of each directory, including all subdirectories. So we need to create a walk-dir-sizes function, which takes in the parse file system and returns a map of every directory to its size.

(defn walk-dir-sizes
  ([fs] (walk-dir-sizes fs root-path))
  ([fs loc]
   (let [child-dir-names (keys (get-in fs (conj loc :dirs)))
         local-file-size (get-in fs (conj loc :size))
         sub-dirs (reduce (fn [acc child] (->> (move-down loc child) (walk-dir-sizes fs) (merge acc)))
                          {}
                          child-dir-names)
         sub-dir-sizes (transduce (map (comp sub-dirs (partial move-down loc))) + child-dir-names)]
     (assoc sub-dirs loc (+ sub-dir-sizes local-file-size)))))

First off, we create a multi-arity function, where the normal business logic would pass in just the file system, but this delegates to the recursive call with the current location, again walking from the root path. For each directory being inspected, we need to calculate the sum of all child directories, and then add them to the current directory's size. The sub-dirs binding is the result of calling reduce on all child names; for each child, we use walk-down to find that node's full path, recursively call walk-dir-sizes to get a resulting map, and then merge it in to a blank map. This results in a map of every subdirectory to its complete size. To calculate this sum of all subdirectory sizes, we use transduce to bind to sub-dir-sizes; in this case, for every subdirectory name, we again move-down and map that value from the sub-dirs map just created, adding all values together. Finally, by adding the sum of all child directory sizes to the current directory size.

Once that's done, we're ready for the part1 function.

(defn part1 [input]
  (->> (parse-input input)
       walk-dir-sizes
       vals
       (filter (partial >= 100000))
       (reduce +)))

It's pretty straightforward now -- parse the data, walk it, and call vals to look at just the sizes without caring about which directory has which size. Then we filter down to the ones with a size at most 100000, and add the values together with the reduce function.

Whew! That was a bunch for part 1. Let's move on.

Part Two

We have already done all of the work in part 1, so this should be a snap.

(defn part2 [input]
  (let [sizes (-> input parse-input walk-dir-sizes)
        total-current-used (sizes root-path)
        target (- total-current-used 40000000)]
    (->> sizes
         vals
         (filter (partial < target))
         (apply min))))

First we need to know the minimum file size that will free up the necessary space, and that's just 40000000 minus the size of the root path, which we can access with (sizes root-path). Then it's just a simple filter and search for the minimum value.

Refactoring

Both parts 1 and 2 start off with similar logic - parse the data, walk through to get the sizes, and eventually throw away the directory names so we can work with just the sizes. So let's see what we can do with an input->sizes function which does just that.

(defn input->sizes [input]
  (-> input parse-input walk-dir-sizes vals))

part1 now just has to call that function, filter by file size, and add the values together. That could be a filter and then a reduction... but that sounds like we've got another transducer on our hands! Let's do it.

; No transducer?  Boooo!
(defn part1 [input]
  (->> (input->sizes input)
       (filter (partial >= 100000))
       (reduce +)))

; With a transducer!  Yay!
(defn part1 [input]
  (transduce (filter (partial >= 100000)) + (input->sizes input)))

Now part2 could be rewritten to use a transducer, but it becomes ugly because we need to implement all three forms of the reducing function. Sometimes a reduce is just what we need.

(defn part2 [input]
  (let [sizes (input->sizes input)
        target (- (reduce max sizes) 40000000)]
    (reduce min (filter (partial < target) sizes))))

Note that we now do two reduce calls. Since input->sizes hides the directory names, we need to call (reduce max sizes) to find it again. Meh, no big deal. Then we call (reduce min ...) on the filtered sizes to get a nice clean answer.

Alternate Implementation

A day or two after finishing my solution, I realized that I didn't like constantly having to make subvectors, and that the walk-dir-sizes wasn't terrible, but also wasn't great. So I created another implementation, placed in the day07-eager namespace, with another way to solve the problem.

There are a few key changes in this implementation:

Directory paths are represented as lists of directory names, where the last element is always the root "/".
Within each directory is a list of directory paths, so a list of lists, rather than a map by name. Note that the list is of complete directory paths, not just the subdirectory names relative to its parent.
The overall structure of the file system is a single map of {directory-path {:size n, :dirs ()}}, where the size property is the total deep size of the directory (its files and those of its subdirectories). So again, this file system is not represented as a tree, but rather a directory of every path by name.
The create-file function adds the file size to its owning directory, and then also to all parent directories going all the way up to the root.
Since the algorithm eagerly calculates every directory's size, there is no need for a walk-dir-sizes function.

At work, this is usually when I would talk to the developers about whether we needed to optimize for reads or writes based on how the data would be used. In this scenario, we do a bunch of writes up front (forming the file system), and then one big read when we're done, so we should be optimized for writing. However, since I keep on updating directory sizes with multiple writes, this algorithm is actually optimized for reading, which is why it is slightly slower than my original solution.

Still, it's nice to see multiple solutions to the same problem, so we can play with different strategies.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

day07.md

day07.md

Day 7: No Space Left On Device

Intro

Part One

Data format

Parsing

Puzzle logic

Part Two

Refactoring

Alternate Implementation

Files

day07.md

Latest commit

History

day07.md

File metadata and controls

Day 7: No Space Left On Device

Intro

Part One

Data format

Parsing

Puzzle logic

Part Two

Refactoring

Alternate Implementation