<a href="https://colab.research.google.com/github/apresa74/Bootstrap-Portfolio/blob/master/simple_diffing_tool.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Simple Diffing Tool for Editors

This IPython (Jupyter) notebook helps editors find the variants and emendations in an edition of a text by comparing it against a copy text.

The code was written by Dr. Bryan Tarpley at the Center of Digital Humanities Research at Texas A&M University. This notebook may be freely distributed and used without restriction, with the understanding that the code is provided "as is," and that the author may not be able to provide support or answer questions due to time constraints.

## Instructions

To use it, you'll need to have two plain text files (preferably UTF-8 encoded)--one for the copy text (it needs to be named "copy_text.txt") and one for the variant edition (it needs to be named "variant_edition.txt"). Unfortunately, this tool does not support any other file format (no Word documents or PDFs).
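
If you are unsure whether your files are really plain UTF-8 text, a quick check along the following lines may help before you run the main cell. This is only a sketch: the helper name `is_utf8_text` is made up for illustration and is not part of the notebook, and it assumes the two files sit in the current working directory under the names given above.

```python
from pathlib import Path

def is_utf8_text(path):
    """Return True if the file at `path` exists and decodes cleanly as UTF-8."""
    try:
        Path(path).read_text(encoding='utf-8')
    except (FileNotFoundError, UnicodeDecodeError):
        return False
    return True

# Report on the two file names the notebook expects (hypothetical check).
for name in ('copy_text.txt', 'variant_edition.txt'):
    status = 'looks fine' if is_utf8_text(name) else 'is missing or not valid UTF-8'
    print(f'{name} {status}')
```

A Word document or PDF renamed to ".txt" will usually fail this check, because its binary content does not decode as UTF-8.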

Once you have your two files, open the file tray by clicking on the folder icon on the sidebar to the left. You may see the message "Connecting to runtime to enable file browsing"--this message will eventually go away and present you with a single folder called "sample_data." Next, drag-and-drop your two files into the empty space beneath the "sample_data" folder. You may see a notice about how files uploaded to this area will be deleted once the runtime is recycled--click OK. Those files should then upload and appear on the same level as (not inside of) the "sample_data" folder.

Next, scroll down to the code cell below these instructions and hover your mouse over the first line of code. A dark "play" symbol should appear on the upper-left corner of the code cell. Click that button!

Once the code cell finishes running, a third file called "results.html" should appear in your file tray (this may take several seconds). Hover over that file with your mouse, click on the three dots that appear to the right of the file, and click "Download." The "results.html" file should be downloaded to your computer, where you can double-click it to open it in your browser.

That "results.html" file, when opened in your browser, will display two main columns with the lines (or paragraphs) from "copy_text.txt" on the left and from "variant_edition.txt" on the right. Differences between the two texts are highlighted: deleted text in red, added text in green, and changed text in yellow (these are the default colors of Python's difflib, which generates the report). In the gutter between the two columns, a blue link with the letter "n" should appear wherever a difference occurs; clicking on that link should take you to the next difference in the text.

A quick, 30-second video of this procedure can be found [here](https://drive.google.com/file/d/14Wr1hJiByubJMdIBQeAh8PK1egJvC81o/view?usp=sharing).

## Caveats

To improve the signal-to-noise ratio of the reported differences, both texts are modified as follows before they are compared (these modifications happen in memory during code execution and are _not_ saved to the files):

1.   To ignore differences in lineation between the two texts, all single line breaks are removed (paired line breaks, which mark a new paragraph, are kept). This lets the report compare entire paragraphs of text rather than individual lines. If, however, the paragraphs in your text are delimited by a single line break and a tab, or if you're comparing editions of poetry or a playtext, you may want to turn this feature off. To do this, replace the word "True" with the word "False" on line 6 of the code cell below (the line that reads `fix_lineation = True`) before clicking the black play button. You can always click the button again after modifying the code; doing so overwrites the contents of "results.html" with a new report.
2.   To ignore differences in how certain punctuation marks are represented in plain text, paired hyphens are replaced with em-dashes, and "smart" apostrophes/quotes are replaced with their simple counterparts.
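
The effect of these two modifications on a short string can be sketched as follows. Note the assumptions: the function name `normalize` and the sample text are invented for illustration, and the lineation step here uses a regular expression rather than the placeholder substitution in the notebook's code cell. For typical input (single breaks inside paragraphs, paired breaks between them) the result should be the same, but the two approaches can differ on runs of three or more line breaks.

```python
import re

def normalize(text, fix_lineation=True):
    """Apply the punctuation and lineation clean-up described above to a string."""
    punctuation = [
        ('--', '\u2014'),     # paired hyphens -> em-dash
        ('\u2019', "'"),      # right single quotation mark -> apostrophe
        ('\u2018', "'"),      # left single quotation mark -> apostrophe
        ('\u201c', '"'),      # left double quotation mark -> straight quote
        ('\u201d', '"'),      # right double quotation mark -> straight quote
    ]
    for old, new in punctuation:
        text = text.replace(old, new)
    if fix_lineation:
        # A newline with no newline on either side is a soft line break and
        # becomes a space; runs of two or more newlines are left untouched.
        text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    return text

sample = 'It was \u201clate\u201d--too\nlate.\n\nShe didn\u2019t wait.'
print(normalize(sample))
```

Passing `fix_lineation=False` leaves every line break in place, which corresponds to changing "True" to "False" in the code cell.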

In [None]:
import difflib

copy_text_path = 'copy_text.txt'
variant_path = 'variant_edition.txt'
results_path = 'results.html'
fix_lineation = True

replacements = [
  ('--', '—'),      # replace paired hyphens with em-dash
  ('’', "'"),       # replace smart apostrophe with single-quote
  ('‘', "'"),       # replace alt-smart apostrophe with single-quote
  ('“', '"'),       # replace smart quote with quote
  ('”', '"')        # replace alt-smart quote with quote
]

if fix_lineation:
  replacements += [
    ('\n\n', '####'), # replace paired line breaks with 4 hashtags
    ('\n', ' '),      # replace single line breaks with a space
    ('####', '\n\n'), # replace 4 hashtags with paired line breaks
  ]

# Read the copy text as UTF-8, normalize it, and split it into lines
copy_text_lines = []
with open(copy_text_path, 'r', encoding='utf-8') as copy_in:
  copy_text = copy_in.read()
  for old, new in replacements:
    copy_text = copy_text.replace(old, new)
  copy_text_lines = copy_text.split('\n')

# Do the same for the variant edition
variant_lines = []
with open(variant_path, 'r', encoding='utf-8') as variant_in:
  variant_text = variant_in.read()
  for old, new in replacements:
    variant_text = variant_text.replace(old, new)
  variant_lines = variant_text.split('\n')

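# Build a side-by-side HTML report from the two lists of lines
diff_maker = difflib.HtmlDiff()
diff_html = diff_maker.make_file(copy_text_lines, variant_lines, fromdesc='Copy Text', todesc='Variant')
# Drop nowrap attributes and non-breaking spaces so long paragraphs wrap in the browser
diff_html = diff_html.replace('nowrap="nowrap"', '')
diff_html = diff_html.replace('&nbsp;', ' ')
# The generated page declares a UTF-8 charset, so write it out as UTF-8
with open(results_path, 'w', encoding='utf-8') as results_out:
  results_out.write(diff_html)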
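
For a quick plain-text overview of what changed, without opening the HTML report, something like the sketch below might be useful. It is not part of the original tool. The function name `summarize_changes` and the demo lists are assumptions made for illustration, and it relies on `difflib.SequenceMatcher` from the same standard-library module, which may group differences somewhat differently from the HTML report.

```python
import difflib

def summarize_changes(copy_lines, variant_lines):
    """Return (tag, copy_chunk, variant_chunk) for every non-matching block."""
    matcher = difflib.SequenceMatcher(None, copy_lines, variant_lines, autojunk=False)
    return [
        (tag, copy_lines[i1:i2], variant_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'   # keep only 'replace', 'delete', and 'insert' blocks
    ]

# Tiny stand-ins for the line lists built in the code cell above.
copy_demo = ['Call me Ishmael.', '', 'Some years ago.']
variant_demo = ['Call me Ishmael!', '', 'Some years ago.', '', 'An added line.']

for tag, old, new in summarize_changes(copy_demo, variant_demo):
    print(tag, old, new)
```

To run it on your own texts, pass the `copy_text_lines` and `variant_lines` lists from the code cell instead of the demo lists.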