TEI Toolbox

A set of scripts used to first create TEI files (with BBEdit), then parse the TEI files, and finally do simple analyses against the results.

Create a TEI file

Given one or more plain text files, here is a recipe for creating a set of well-formed and valid TEI files that have been enhanced with sentence, word, and part-of-speech tags (a minimal driver script is sketched after the list):

  1. Obtain a plain text file for marking up
  2. Optionally, remove line breaks from the text file
  3. Run ./bin/txt2xml.sh to create the most rudimentary of TEI files
  4. Run ./bin/validate.sh to... validate the TEI
  5. Use whatever tools you have at your disposal to continue marking up the TEI; this toolbox includes a set of BBEdit macros which may (or may not) make the process easier, and Oxygen is a good tool too
  6. Go to Step #4 until you are finished (or get tired)
  7. Run ./bin/morphadorn.sh to supplement the TEI with parts-of-speech
  8. Run ./bin/validate.sh against the TEI, just to make sure
  9. Go to Step #1 for any number of files -- build a corpus
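
None of the scripts' exact invocations are documented here; run each script without any input to learn its usage. That said, a minimal driver for Steps #3 through #8 might look like the following sketch, where the idea that each script takes a single file name as its input is exactly that -- an assumption:

  # a hypothetical driver for Steps #3 through #8; the scripts' actual
  # arguments may differ, so run each without input to learn its usage
  import subprocess
  import sys

  def run(script, file):
      """Run one of the toolbox's scripts against a file; halt on failure."""
      subprocess.run([script, file], check=True)

  text = sys.argv[1]                       # a plain text file to mark up
  run('./bin/txt2xml.sh', text)            # Step #3: create rudimentary TEI
  tei = text.replace('.txt', '.xml')       # an assumed output name
  run('./bin/validate.sh', tei)            # Step #4: validate
  # Steps #5 and #6: edit by hand and re-validate until satisfied
  run('./bin/morphadorn.sh', tei)          # Step #7: add parts-of-speech
  run('./bin/validate.sh', tei)            # Step #8: validate once more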

Create a "study carrel"

TEI files are merely containers used to do alchemy; TEI turns data into information. Through TEI, numbers, like 1776, are turned into dates. Words, like "man", are turned into nouns. Lists of words might become sentences or titles. Etc. To analyze stand-alone TEI files (or just about any other flavor of XML), it is necessary to have an extensive knowledge of XPath, an XML-specific technology. Sure, additional XML technologies exist, but they too are specific to XML, and compared to the communities surrounding relational databases, full text indexers, and general-purpose programming languages, the XML community is small. Thus, it behooves the student, researcher, or scholar to transform their TEI documents into other data structures for which the analysis tools are more plentiful. The following recipe does just that; it creates a "study carrel" and transforms valid TEI into a number of other formats (a small XPath sketch follows the recipe, illustrating the sort of query the transformation makes unnecessary):

  1. Run ./bin/carrel-initialize.sh to create a new "study carrel"
  2. Copy your newly created TEI file(s) to the tei directory of your newly created study carrel
  3. Run ./bin/carrel-build.sh to transform the TEI into many other data structures
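
To make the XPath point concrete, here is a sketch of the kind of query the transformation spares you from writing over and over. It uses Python's lxml library, and it assumes a MorphAdorner-style markup where each word is a <w> element carrying a pos attribute (an assumption; your TEI's exact elements and attributes may differ):

  # a sketch: count the nouns in a stand-alone TEI file with raw XPath;
  # assumes each word is a <w> element carrying a pos attribute
  from collections import Counter
  from lxml import etree

  TEI = {'tei': 'http://www.tei-c.org/ns/1.0'}    # the TEI namespace

  tree  = etree.parse('./tei/walden.xml')         # a hypothetical TEI file
  nouns = tree.xpath('//tei:w[starts-with(@pos, "n")]/text()', namespaces=TEI)
  for noun, count in Counter(noun.lower() for noun in nouns).most_common(10):
      print(count, noun)

Every such question requires another namespace-qualified XPath expression; against a relational database, the same question is a single line of SQL.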

The end result will be a directory filled with sets of structured data ready for computer analysis -- "reading". It includes PDF documents suitable for printing (and writing in). It includes HTML files suitable for online reading. It includes plain text files in the form of complete works, chapters, or paragraphs, all of which lend themselves to various forms of text mining and searching. And it includes a relational database and a semantic index for even more fine-grained investigation.

TEI files are cool, but TEI files are really only a means to an end. Study carrels are filled with content transformed from TEI files, and the resulting content is very amenable to analysis -- "reading".

Typical searching

The newly created study carrel supports at least four typical types of searching: 1) querying a relational database, 2) querying a semantic index, 3) concordancing, and 4) full text indexing.

Concordancing

Concordancing is one of the oldest text mining processes, and such a tool is often called a "keyword in context" (KWIC) index. Given a word (or regular expression), the concordance will output matching lines as well as a simple dispersion chart. For example (a minimal KWIC sketch follows the list):

  1. Run ./bin/search-kwik.pl without any input to learn what input the script takes
  2. Run ./bin/search-kwik.pl with a file from the txt directory of your study carrel as input
  3. Go to Step #2 until you get tired
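
The technique itself is simple enough to sketch in a few lines of Python. The following is a minimal, hypothetical KWIC implementation -- not the logic of ./bin/search-kwik.pl -- which prints each match centered in a fixed-width window of surrounding text:

  # a minimal keyword-in-context (KWIC) sketch; not ./bin/search-kwik.pl itself
  import re
  import sys

  def kwic(text, pattern, width=30):
      """Yield each match of pattern with `width` characters of context per side."""
      for match in re.finditer(pattern, text, re.IGNORECASE):
          left  = text[max(0, match.start() - width):match.start()]
          right = text[match.end():match.end() + width]
          yield f'{left:>{width}} {match.group()} {right:<{width}}'

  text = open(sys.argv[1]).read().replace('\n', ' ')   # a file from the txt directory
  for line in kwic(text, sys.argv[2]):
      print(line)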

Database searching

The study carrel includes a relational database file. The database contains a list of each and every word, part-of-speech, and named entity from each and every paragraph of your entire corpus. Given a knowledge of the database's structure as well as the codes used to denote parts-of-speech or named entities, it is possible to query the database not only for individual words but also in terms of grammars -- "Find all the ways the king is described." The included script only outputs counts & tabulations based on words, their lemmas, and types of named entities. For example (a hypothetical SQL sketch follows the list):

  1. Run ./bin/search-db.sh without any input to learn what input the script takes
  2. Run ./bin/search-db.sh with the name of your carrel and additional inputs such as "NOUN love PERSON"
  3. Go to Step #2 until you get tired, and after repeated uses, you will see patterns
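
Because the database's actual structure is not documented here, the following is merely a hypothetical illustration of the kind of grammar-aware query such a database makes possible. The file name, table name, column names, and part-of-speech codes are all assumptions:

  # a hypothetical query against the carrel's database; the real schema,
  # file name, and part-of-speech codes will almost certainly differ
  import sqlite3

  connection = sqlite3.connect('./etc/carrel.db')   # an assumed location
  query = '''
      SELECT lemma, COUNT(*) AS frequency
      FROM tokens                                   -- an assumed table name
      WHERE pos LIKE 'j%'                           -- adjectives, in one common scheme
      AND paragraph IN (SELECT paragraph FROM tokens WHERE lemma = 'king')
      GROUP BY lemma
      ORDER BY frequency DESC
      LIMIT 10;
  '''
  # "find all the ways the king is described", approximately
  for lemma, frequency in connection.execute(query):
      print(frequency, lemma)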

Semantic indexing

The study carrel also includes a semantic index -- a list of words and associated vectors. Given a word, the semantic index will determine what other words have similar vectors. For example:

  1. Run ./bin/search-vec.py sans any input to get an idea of what input is expected
  2. Run ./bin/search-vec.py with the name of your study carrel and a word; the result will be a list of words and associated scores, where higher scores denote higher degrees of similarity
  3. Go to Step #2 until you get tired

Semantic indexing only really works with "large" volumes of text, and things start to get large at 1,000,000 words. Other things can be done with semantic indexes, such as the completion of analogies or the listing of opposites, but the included script only supports similarity. A sketch of the underlying technique follows.
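
That underlying technique -- word embeddings -- can be sketched with the gensim library. Whether ./bin/search-vec.py actually works this way is an assumption, as is the location of the corpus file:

  # a sketch of similarity searching over word vectors with gensim;
  # whether ./bin/search-vec.py works exactly this way is an assumption
  from gensim.models import Word2Vec

  # train a model from the carrel's plain text; one sentence per line is assumed
  sentences = [line.split() for line in open('./txt/corpus.txt')]
  model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)

  # given a word, list the words with the most similar vectors
  for word, score in model.wv.most_similar('love', topn=10):
      print(round(score, 3), word)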

Full text indexing

A study carrel is primed for full text indexing with an indexer called Solr. [INSERT HERE THE LONG & COMPLICATED RECIPE OUTLINING HOW TO CREATE THE FULL TEXT INDEX.]
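
Until that recipe is written, the following generic illustration hints at what is involved. It posts a carrel's plain text files to a running Solr instance with the pysolr library; it is emphatically not the toolbox's recipe, and the URL, core name, field names, and file locations are all assumptions:

  # a generic illustration of indexing a carrel's plain text with pysolr;
  # NOT the toolbox's recipe -- the core, fields, and locations are assumed
  import glob
  import pysolr

  solr = pysolr.Solr('http://localhost:8983/solr/carrel', always_commit=True)

  # index every plain text file in the carrel's txt directory
  documents = []
  for path in glob.glob('./txt/*.txt'):
      documents.append({'id': path, 'text': open(path).read()})
  solr.add(documents)

  # and then search
  for hit in solr.search('love', rows=10):
      print(hit['id'])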


Eric Lease Morgan <emorgan@nd.edu>
February 12, 2020
