module 3 for hist3907b

Module 3 Wrangling Data

In the previous module, we successfully grabbed lots of data from various online repositories. Some of it was already in well-structured tables; much of it was not. All of it was text, though. Initially, it (or most of it) was just scanned images of documents. At some point, optical character recognition (OCR) was used to distinguish the black dots from the white dots in those images and to recognize the patterns that make up letters, numbers, and punctuation. There are commercial products that can do this (and we have some installed in the Underhill Research Room that you can use), and there are free products that you can install on your computer to do it yourself.

It all looks so neat and tidy. Ian Milligan discusses this 'illusionary order' and its implications for historians:

In this article, I make two arguments. Firstly, online historical databases have profoundly shaped Canadian historiography. In a shift that is rarely – if ever – made explicit, Canadian historians have profoundly reacted to the availability of online databases. Secondly, historians need to understand how OCR works, in order to bring a level of methodological rigor to their work that use these sources.

Just as we saw with Ted Underwood's article on theorizing search, these 'simple' steps in the research process are anything but. They are also profoundly theoretical in how they change what it is we can know. In archaeology, every step of the method, every stage in the process, has a profound impact on the stories we eventually tell about the past. Decisions we make destroy data, and create new data. Historians aren't used to thinking about these kinds of issues!

There are also manual ways of doing the same thing as OCR does - we call these things 'humans', and we organize their work through 'crowdsourcing'. We break the process up into wee manageable steps, and make these available over the net. Sometimes we gamify these steps, to make them more 'fun'. If several people all work on the same piece of text, the thinking is that errors will cancel each other out: a proper transcription will emerge from the work of the crowd. Take a look at these two pieces concerning the Transcribe Bentham project:

While transcriptions might've provided the earliest examples of crowdsourcing research (but see also The HeritageCrowd Project and the subsequent 'How I Lost the Crowd'), other tasks are now finding their way into the crowdsourced world - see the archaeological applications within the MicroPasts platform. These include things like 'masking' artefact photographs in order to develop 3d photogrammetric models.
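The idea that independent errors cancel each other out when several people transcribe the same text can be sketched as a simple word-by-word majority vote. This is only an illustration, not how Transcribe Bentham or any other project actually reconciles transcriptions; the function name `consensus` and the sample lines are made up, and the sketch assumes every volunteer's version has the same number of words (real text would need an alignment step first).

```python
from collections import Counter


def consensus(transcriptions):
    """Word-by-word majority vote across transcriptions of the same line.

    Assumption: every transcription splits into the same number of words.
    """
    token_lists = [t.split() for t in transcriptions]
    lengths = {len(tokens) for tokens in token_lists}
    if len(lengths) != 1:
        raise ValueError("transcriptions must have the same number of words")

    voted = []
    for candidates in zip(*token_lists):
        # Keep the spelling that most volunteers agreed on at this position.
        word, _count = Counter(candidates).most_common(1)[0]
        voted.append(word)
    return " ".join(voted)


# Hypothetical volunteer transcriptions, each with a different slip.
volunteers = [
    "the greatest happiness of the greatest number",
    "the greatest happines of the greatest number",
    "the greatest happiness of the greatcst number",
]
print(consensus(volunteers))
```

Each volunteer makes a different mistake, so no single error wins the vote. If two volunteers made the same mistake in the same place, the vote would preserve it, which is one reason crowdsourced projects still rely on editorial review.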

But often, we don't have a whole crowd. We're just one person, alone, with a computer, at the archive. Or working with someone else's digitized image that we found online. How do we wrangle that data? Let's start with M. H. Beal's account of how she 'xml'd her way to data management' and then we'll consider a few more of the nuts and bolts of her work in OA TEI-XML DH on the WWW; or, My Guide to Acronymic Success.

This kind of work is extraordinarily important! So we're going to try our hand at it too. (Now, if we had a seriously big project where we were transcribing lots of text, we'd invest in a dedicated XML editor like Oxygen - there are plugins available and frameworks for doing historical transcription on this platform. There is a 30-day free trial license if you want to give it a try. But for now, Notepad++, TextWrangler, Komodo Edit, Sublime Text, or any of a number of good text editors will do all that we need to do.) Also, check out the TEI. Take 15 minutes and read through What is XML and Why Should Humanists Care? by David Birnbaum. Keep notes in your notebook!


In the exercises for this week we are going to focus on some bare-bones wrangling of data. First, we are going to do some activities and exercises to get in the right frame of mind. Then, we'll transcribe and mark up some text that has already been digitized (in the sense that there exists a digital image). If you've done HIST2809 with me, this will feel rather familiar - but instead of making a transcription that sits in our notebook, we'll make one that lives online. Since it is online, marking it up with XML tags allows us to do some other quite interesting things. We'll look at M. Beal's Colonial Newspaper Database in more detail. This will introduce us to the 'Text Encoding Initiative' and some of the scholarly work surrounding making scholarly editions online. Some of you might be interested in how all of this ties into linked data, so I'll provide further resources to that world for those who wish to go exploring.
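To give a sense of what those 'interesting things' might look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The letter text is invented, and the element names (`persName`, `placeName`, `date` with a `when` attribute) are borrowed from TEI for illustration only; this is not the markup scheme of the exercise, and a real TEI file would also carry a header and a namespace that this snippet leaves out.

```python
import xml.etree.ElementTree as ET

# A made-up, TEI-flavoured fragment (no TEI header or namespace).
snippet = """
<letter>
  <p>Received this day from <persName>John Smith</persName> at
  <placeName>Bytown</placeName>, a parcel of papers dated
  <date when="1843-05-02">2 May 1843</date>.</p>
</letter>
"""

root = ET.fromstring(snippet.strip())

# Because the names, places, and dates are tagged, we can pull them out
# directly instead of guessing at them from the running text.
people = [el.text for el in root.iter("persName")]
places = [el.text for el in root.iter("placeName")]
dates = [el.get("when") for el in root.iter("date")]

print(people, places, dates)
```

The point is that the tags turn a transcription into something a program can query: every person, place, or date you mark up becomes a piece of data you can list, count, or link.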

Then, we'll switch gears and we'll use regular expressions to search and extract information from the Diplomatic Correspondence of the Republic of Texas, which you'll find at the Internet Archive. If you're on a PC, download Notepad++ - this is a souped-up version of the simple notepad application, and allows us to do very useful things indeed. If you're on a Mac, TextWrangler is probably already installed and is all you need. If you're working in Linux, you can use whatever text editor you're familiar with.
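As a rough preview of the kind of search-and-extract work regular expressions make possible, here is a small sketch in Python's `re` module. The sample lines are invented and only imitate a 'sender to recipient, date' listing; they are not taken from the Texas correspondence, and the pattern is an assumption about one possible layout rather than the expression used in the exercise.

```python
import re

# Invented sample text: two letter entries and one line that is not a letter.
sample = """\
John Doe to Richard Roe, March 4, 1837
Jane Poe to R. T. Moe, May 27, 1839
Calendar of correspondence hitherto printed
"""

# Named groups capture the sender, the recipient, and a 'Month D, YYYY' date.
pattern = re.compile(
    r"^(?P<sender>.+?) to (?P<recipient>.+?), "
    r"(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})$",
    re.MULTILINE,
)

rows = [m.groupdict() for m in pattern.finditer(sample)]
for row in rows:
    print(row["sender"], "|", row["recipient"], "|", row["date"])
```

Lines that do not fit the pattern are simply skipped, so the messy text becomes a tidy three-column table. The same pattern idea carries over to the search-and-replace box in Notepad++ or TextWrangler, though the exact syntax for groups differs a little between tools.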

We'll conclude by using 'Open Refine' to tidy up the information we extracted from the Texan correspondence.
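One kind of tidying that Open Refine offers is clustering values that are probably the same thing spelled differently. The sketch below is a simplified Python approximation of the 'fingerprint' key idea behind that feature (lowercase, strip punctuation, sort the unique words); it is not Open Refine's actual code, and the function names and sample names are made up for illustration.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key: lowercase, no punctuation,
    unique words in alphabetical order."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep groups of two or more."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical messy column of names.
names = ["Sam Houston", "Houston, Sam", "sam houston.", "Anson Jones"]
print(cluster(names))
```

Variants that differ only in case, punctuation, or word order end up with the same key, so they can be reviewed and merged together. Misspellings introduced by OCR (a 'c' read for an 'e', say) will not share a key, which is why Open Refine also offers other, fuzzier clustering methods.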

Things you will learn in this module:

  • the basic concepts of XML and TEI for marking up text
  • the power of regular expressions. 'Search' and 'Replace' in Word just won't cut it any more for you! (Another reason why you should write in Markdown in the first place and then convert to Word for the final typesetting. Indeed, there will be an optional exercise to install and use Pandoc to convert your .md files to .docx.)
  • Open Refine as a powerful engine for tidying up the messiness that is OCR'd text.