Diff for Clojure Sequences
Clojure Java
Latest commit 62906b4 May 12, 2011 @brentonashworth Merge pull request #1 from vseloved/master
levenshtein-distance was broken due to improper checking for empty collections in max-or-zero

README.textile

clj-diff

Provides diff and patch functions for Clojure sequences where (diff a b) -> x and (patch a x) -> b. Also provides edit-distance and levenshtein-distance functions for calculating the difference between two sequences.

Usage

user=> (use 'clj-diff.core)
user=> (diff "John went to the movies." "John was in the movies.")
{:+ [[5 \a \s \space \i]], :- [6 8 9 10 11]}
user=> (patch "John went to the movies." *1)
"John was in the movies."
user=> (edit-distance "John went to the movies." "John was in the movies."))
9
user=> (levenshtein-distance "John went to the movies." "John was in the movies.")
8

There is already a Java library which does this well. Why create a Clojure version? So that we can do this:

user=> (def a [{:a 1} {:a 2} {:a 3} {:a 4} {:a 5} {:a 6} {:a 7}])
user=> (def b [{:a 2} {:a 3} {:a 4} {:a 5} {:a 6} {:a 7} {:a 1}])
user=> (diff a b)
{:+ [[6 {:a 1}]], :- [0]}
user=> (patch a *1)
({:a 2} {:a 3} {:a 4} {:a 5} {:a 6} {:a 7} {:a 1})
user=> (edit-distance a b)
2

Notes

The current diff algorithm comes from the paper An O(NP) Sequence Comparison Algorithm by Sun Wu, Udi Manber, Gene Myers and Webb Miller. It is fast and memory efficient. It also makes use of the pre-diff optimizations mentioned in Neil Fraser’s Diff Strategies. The worst-case running time of the algorithm is dependent only on the length of the longest sequence (N) and the number of deletions (P). This is much better than the Myers algorithm which is an O(ND) algorithm where N is the sum of the length of the two sequences and D is the edit distance.

I have created a separate project for comparing the performance of different diff algorithms. The main performance goal of this project is to create a Clojure diff that can outperform Fraser’s Java implementation. One of the most interesting results is show below.

This chart shows the most interesting range of comparison between the three algorithms. The current algorithm being used by clj-diff is labeled “Miller”. For larger sequences, the Miller algorithm outperforms Fraser. To summarize all of the performance test results; the Miller algorithm is never more than 50 milliseconds slower than the Fraser algorithm but the Fraser algorithm can be multiple seconds slower, and will run out of memory more quickly, for large sequences. For more information about performance see the clj-diff-performance project.

Installation

Leiningen

Add [clj-diff "1.0.0-SNAPSHOT"] to your :dependencies in project.clj.

Maven

Add the following dependency:

<dependency>
  <groupId>clj-diff</groupId>
  <artifactId>clj-diff</artifactId>
  <version>1.0.0-SNAPSHOT</version>
</dependency>

…which comes from Clojars…

<repository>
  <id>clojars.org</id>
  <url>http://clojars.org/repo</url>
</repository>

References

In order to understand the above two titles, you will need to know what N, D and P stand for. Let A and B be two sequences with lengths Q and M where Q >= M. Let D be the length of the minimum edit script for A and B. In the title of the first paper N = Q + M. In the second paper N = Q and P = (1/2)D – (1/2)(Q – M). Therefore, the O(NP) version is faster which is visualized in the charts below.

Roadmap

Version 1.0

  1. Text diff visualization.
  2. HTML diff visualization.
  3. Semantic cleanup.
  4. Sequence chunking. Allow line based diff for strings.

Version 2.0

  1. Improve performance by searching the edit graph from both ends at the same time.
  2. Arbitrary Clojure form diff/Nested diffs. Integrate with clojure.data/diff in Clojure 1.3
  3. Set the maximum time to look for an optimal diff.
  4. Add a fast and correct levenshtein-distance function.

License

Copyright © 2010-2011 Brenton Ashworth

Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.