LAHP2019: "Working with Texts," part II: 30 May 2019.
Martin Steer (School of Advanced Study)
Jonathan Blaney (Institute of Historical Research)
Christopher Ohge (Institute of English Studies)
Outline for the day
| Session | Time, presenter |
|---|---|
| Introduction to bits, bytes, encoding etc. | 30 mins, Marty |
| Intro to the CLI: principles, navigation | 15 mins, Christopher |
| Regex | 20 mins, Christopher |
| Regex questions | 15 mins |
| Grep and regex | 15 mins, Jonathan |
| Grep questions | 15 mins |
| CLI tool for document conversion: Pandoc | 10 mins, Christopher |
Bits and Bytes
Make sure you have downloaded Git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
- What are command line tools, why do we use them, and which do we use most?
command line tools are based on "standard" Unix commands, but also include other commands from downloadable packages (e.g., Git and Pandoc).
generally, we use CLI tools to hack our system, to have full control ...
the CLI tools we use most are for navigation, conversion, and system debugging...
use Tab to complete arguments or list all available commands.
use ctrl+r to search through command history.
- To locate a file by name in the current directory,
find . -iname 'NAME'
awk (line-by-line processing)
sed (searching and replacing)
debugging is important, but we won't have time to do this justice.
Want to learn more? Enter
man bash
into your Terminal window.
For a very good (and broad) overview of command line tools, see https://github.com/jlevy/the-art-of-command-line.
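The find, awk, and sed commands listed above can be tried safely on a throwaway file. The following is a minimal sketch, not part of the workshop materials: the file name Names.txt and its contents are invented for illustration.

```shell
# Work in a throwaway directory so nothing on your system is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample file, invented for this sketch.
printf 'Billy Budd\nCaptain Vere\nJohn Claggart\n' > Names.txt

# find: -iname matches the file name case-insensitively.
found=$(find . -iname 'names.txt')
echo "$found"

# awk: process line by line, printing the line number (NR) and the second word ($2).
numbered=$(awk '{ print NR ": " $2 }' Names.txt)
echo "$numbered"

# sed: replace "Captain" with "Capt." on every line (output only; the file is unchanged).
replaced=$(sed 's/Captain/Capt./' Names.txt)
echo "$replaced"
```

Note that sed as used here only prints the changed text; the file on disk stays as it was.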
- Exercise: Get a copy of this repository.
Navigate to this repo (https://github.com/cmohge1/LAHP2019).
click on the green button that says "clone or download".
IF you have a GitHub account,
- open your terminal, then use git clone to copy the repository:
git clone https://github.com/cmohge1/LAHP2019.git
IF YOU DO NOT have a GitHub account, download the zip of the repo to your Desktop.
in the Terminal, enter
make a new directory called "gitspace"
navigate to the gitspace directory, then to "LAHP2019".
navigate to the Billy Budd folders, and create a new folder called "diplomatic".
move the file "ch1-leaf.xml" to "diplomatic".
run head on "ch1-leaf.xml": what do you see?
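If you would like to rehearse these steps before touching the real repository, here is a rough sketch that runs them in a temporary sandbox. The folder name "billy-budd" and the sample XML content are placeholders assumed for this sketch; substitute the actual folder name you see in the repo.

```shell
# Build a stand-in for the repo layout in a temporary directory
# (the folder name "billy-budd" is a placeholder, not the repo's actual name).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/gitspace/LAHP2019/billy-budd"
printf '<?xml version="1.0"?>\n<TEI>\n  <text>sample</text>\n</TEI>\n' \
  > "$sandbox/gitspace/LAHP2019/billy-budd/ch1-leaf.xml"

# Navigate into the folder, create "diplomatic", and move the file into it.
cd "$sandbox/gitspace/LAHP2019/billy-budd"
mkdir diplomatic
mv ch1-leaf.xml diplomatic/

# head prints the first 10 lines of a file by default; -n changes the count.
first_lines=$(head -n 2 diplomatic/ch1-leaf.xml)
echo "$first_lines"
```

Use pwd and ls at any point to check where you are and what the current directory contains.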
The best way to learn regex is to experiment with one of the online tools. For now, copy and paste the contents of the xml file from the Billy Budd/diplomatic directory into Regex101.
Make sure to consult this regex cheat sheet.
Find any instance of "don't" or "do'nt".
Find the first word at the beginning of each speech.
What two regexes would you use to get rid of the choice tags and only show the original spellings (contained within the tags)?
Find all attribute values of page image zones (hint: they are the value of @facs attributes).
See the grep basics sheet [here](/grep-basics-sheet.pdf).
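As one possible way of carrying the regex exercises over to grep, the sketch below runs two patterns against a few invented lines. The sample markup is an assumption made for illustration and will not match the real Billy Budd file exactly, so treat the patterns as starting points rather than model answers.

```shell
# Hypothetical sample lines, invented for this sketch; the real file will differ.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<pb facs="#zone-1"/>
<p>"I don't know," said he.</p>
<p>"I do'nt believe it," said another.</p>
<pb facs="#zone-2"/>
EOF

# -E enables extended regex; -o prints only the matching part of each line.
# The alternation (n't|'nt) matches either spelling after "do".
contractions=$(grep -Eo "do(n't|'nt)" "$sample")
echo "$contractions"

# Match facs="...": [^"]* means "any run of characters that are not a double quote".
facs_values=$(grep -Eo 'facs="[^"]*"' "$sample")
echo "$facs_values"

# -c counts matching lines instead of printing them.
facs_count=$(grep -c 'facs=' "$sample")
echo "$facs_count"
```

Without -o, grep prints the whole matching line, which is often more useful when you want to see a match in context.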
Install Pandoc here.
Pandoc is a very handy universal document converter. It is used on the command line following a basic syntax:
pandoc FILE-TO-CONVERT -o CONVERTED-FILE
We invoke pandoc first, then type out (and tab) the file we want to convert, use the -o option (which stands for output), and name the output file. That's it.
Let's go back to our Terminal. Navigate to our git repo, type
If we take this file (README.md), and we want to convert it to a PDF, how would we do that?
pandoc README.md -o LAHP-working-textsII.pdf
Most pandoc operations look like the above, but there are more precise ways to guide the conversion. If you want an html file of this README file,
pandoc README.md -f markdown -t html -s -o README.html
works like a charm, for the most part. The -f option specifies that you are converting from markdown syntax, and -t says that you are converting to html. The -s option specifies that it is a standalone file with a proper html header. In most cases, however, pandoc only needs the file extensions to guide the conversion, which is why the -o option is usually all you need. But sometimes you do need more, such as with PDF conversions (note that you’ll need to have LaTeX installed: see MacTeX on OS X, MiKTeX on Windows, or install the texlive package on Linux).
pandoc README.md -o README.pdf
What happens here?
Try this instead:
pandoc -N README.md --toc -o README.pdf
What has changed?
I have already extolled the virtues of Markdown, and I hope you can see how easily Pandoc converts Markdown files into other formats.
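To see what the -s option discussed above actually changes, you could compare the two kinds of HTML output on a throwaway file. This is a small sketch under stated assumptions: notes.md and its contents are invented here, and the title metadata is added only so that the standalone HTML has a title.

```shell
# Throwaway Markdown file, invented for this sketch.
workdir=$(mktemp -d)
printf '# Sample title\n\nSome *emphasised* text.\n' > "$workdir/notes.md"

# Only run the conversions if pandoc is actually installed.
if command -v pandoc >/dev/null 2>&1; then
  # Formats inferred from the file extensions; the output is an HTML fragment.
  pandoc "$workdir/notes.md" -o "$workdir/fragment.html"
  # Formats stated explicitly; -s wraps the result in a complete HTML document.
  pandoc "$workdir/notes.md" -f markdown -t html -s \
    --metadata title="Sample" -o "$workdir/standalone.html"
  # Compare line counts: the standalone file carries the extra header material.
  wc -l "$workdir/fragment.html" "$workdir/standalone.html"
else
  echo "pandoc not found; install it first"
fi
```

Open both files in a text editor to see the difference: the fragment holds only the converted body, while the standalone file begins with a doctype and a head section.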