[plumula/diff "0.1.1"] ;; latest release
A fast text diff library for Clojure and ClojureScript.
This is a thin Clojure(Script) wrapper around the Java(Script) versions of the Diff, Match and Patch library by Neil Fraser.
(require '[plumula.diff :as d])
(d/diff "abcd" "bcde")
; => [{::d/operation ::d/delete, ::d/text "a"}
; {::d/operation ::d/equal, ::d/text "bcd"}
; {::d/operation ::d/insert, ::d/text "e"}]
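The result is plain data, so it can be post-processed with ordinary sequence functions. The following is a minimal sketch of rendering such a result as an inline string. The helpers render-op and render-diff and the [-…-] / {+…+} markers are hypothetical and not part of plumula.diff. The sample data assumes that ::d/operation expands to :plumula.diff/operation under the alias used above.

```clojure
;; Sample data in the shape shown above, with the namespaced keywords
;; written out in full so the sketch runs without plumula.diff loaded.
(def sample-diff
  [{:plumula.diff/operation :plumula.diff/delete, :plumula.diff/text "a"}
   {:plumula.diff/operation :plumula.diff/equal,  :plumula.diff/text "bcd"}
   {:plumula.diff/operation :plumula.diff/insert, :plumula.diff/text "e"}])

(defn render-op
  "Hypothetical helper: marks deleted text as [-…-] and inserted text
  as {+…+}; text of any other operation is returned unchanged."
  [op]
  (let [text (:plumula.diff/text op)]
    (case (:plumula.diff/operation op)
      :plumula.diff/delete (str "[-" text "-]")
      :plumula.diff/insert (str "{+" text "+}")
      text)))

(defn render-diff
  "Hypothetical helper: concatenates the rendered operations."
  [ops]
  (apply str (map render-op ops)))

(render-diff sample-diff)
; => "[-a-]bcd{+e+}"
```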
diff can take optional keyword arguments, which are described in the following
sections.
If you are using Clojure(Script) 1.9, feel free to peruse the nice specs in
plumula.diff.spec.
In its default mode, diff will try to produce an output that is minimal, at
the expense of readability.
Set the cleanup option to cleanup-semantic to make the output more readable
to humans, at the expense of minimality. This will (roughly)
- Eliminate small spurious equalities (AKA ‘chaff’) between the documents
- Align the operations to word boundaries if possible
(d/diff document-1 document-2 ::d/cleanup ::d/cleanup-semantic)
In terms of storage and computation, each operation in a diff output is likely to have a cost with a constant component and a component that is proportional to the operation’s length in number of characters.
Because of this constant per-operation cost, a diff output can be made more
efficient by sacrificing minimality (in terms of number of characters edited)
in order to reduce the number of editing operations. This can be achieved by
setting the cleanup option to cleanup-efficiency:
(d/diff document-1 document-2 ::d/cleanup ::d/cleanup-efficiency ::d/edit-cost 4)
The relative cost of an edit operation in terms of edited characters can be
tuned with the edit-cost option. At the default setting of 4, the optimiser
will accept an increase of up to 4 characters in the character count of the
diff in order to save an edit operation.
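To make the trade-off concrete, here is a rough sketch of the cost model described above. It is an illustration only, not the optimiser’s actual algorithm. The function diff-cost is hypothetical, and it assumes that only insert and delete operations count as edits. The two sample diffs both turn "xay" into "1a2".

```clojure
(defn edit?
  "True for operations that change the text (anything but equal)."
  [op]
  (not= :plumula.diff/equal (:plumula.diff/operation op)))

(defn diff-cost
  "Hypothetical cost model: edit-cost per edit operation, plus one per
  edited character."
  [edit-cost ops]
  (let [edits (filter edit? ops)]
    (+ (* edit-cost (count edits))
       (reduce + (map (comp count :plumula.diff/text) edits)))))

;; Minimal in characters: 4 edit operations, 4 edited characters.
(def minimal-diff
  [{:plumula.diff/operation :plumula.diff/delete, :plumula.diff/text "x"}
   {:plumula.diff/operation :plumula.diff/insert, :plumula.diff/text "1"}
   {:plumula.diff/operation :plumula.diff/equal,  :plumula.diff/text "a"}
   {:plumula.diff/operation :plumula.diff/delete, :plumula.diff/text "y"}
   {:plumula.diff/operation :plumula.diff/insert, :plumula.diff/text "2"}])

;; Fewer operations: 2 edit operations, 6 edited characters.
(def merged-diff
  [{:plumula.diff/operation :plumula.diff/delete, :plumula.diff/text "xay"}
   {:plumula.diff/operation :plumula.diff/insert, :plumula.diff/text "1a2"}])

(diff-cost 4 minimal-diff) ; => 20
(diff-cost 4 merged-diff)  ; => 14
(diff-cost 0 minimal-diff) ; => 4
(diff-cost 0 merged-diff)  ; => 6
```

Under this model, the merged form is cheaper at an edit cost of 4, while the character-minimal form wins when edit operations are free.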
If you don’t like the idea of sacrificing minimality, but still want to improve the readability or efficiency of diff’s output, there are options for that. They are poor compromises, though, because the minimality constraint doesn’t leave enough wiggle room for the optimising algorithm to do a great job.
To improve the readability of the output while keeping it minimal, set the
cleanup option to cleanup-semantic-lossless, but remember that in order to
keep the output minimal, the algorithm has to leave the semantic chaff there,
and readability won’t be as great as with cleanup-semantic.
To reduce the number of edit operations in the output while keeping the number
of characters minimal, set the cleanup option to cleanup-merge. This might
make sense if the cost of an edit is negligible vs the cost of a character in
the output. It could be argued that there isn’t much of a point in reducing the
number of edit operations at all in that case though.
To speed up computations, diff will pre-process the texts by diffing them
line-by-line. On long documents with multiple small edits, this can lead to an
order of magnitude improvement in speed and memory consumption. However, there
is a risk that this optimisation may result in a non-minimal diff output.
If you don’t want to take that risk, use the check-lines option to disable
the optimisation:
(d/diff document-1 document-2 ::d/check-lines false)
The diffing process consists of relatively fast pre- and post-processing steps surrounding a potentially long-running diff-optimising step. By default, this optimising step will time out after a run time of 1 second, returning a correct but potentially non-minimal diff output.
The timeout option lets you set another maximum run time for the optimising
step, or disable the timeout behaviour entirely, allowing for an arbitrarily
long optimising step:
; time out after 1.5 seconds
(d/diff document-1 document-2 ::d/timeout 1.5)
; never time out
(d/diff document-1 document-2 ::d/timeout 0)
- The match and patch functions are not currently wrapped
- Depends on both the Java and JavaScript library, even if your project targets only one of Clojure or ClojureScript. The unneeded dependency will get compiled away but it’s still a useless download.
The notable changes to this project are documented in the change log.
Distributed under the Apache License, Version 2.0. Copyright © 2017 Frederic Merizen.
The underlying Diff, Match and Patch Library is distributed under the Apache License, Version 2.0 and copyright © 2006 Google Inc.