# Building a Text Analysis Pipeline for Akkadian with CLTK

This workshop introduces the Classical Language Toolkit (CLTK), an open-source Python framework dedicated to text analysis and natural language processing for historical languages. Participants will be taken through the basic stages of a text analysis pipeline, namely corpus loading, preprocessing, sentence and word tokenization, lemmatization, part-of-speech and morphological tagging, prosody identification, and more. Examples will be given primarily using Akkadian texts (and English translations), though some attention will be given to other languages supported by the project, including Latin and Ancient Greek. Participants are encouraged to bring their laptops; the demonstration can be followed interactively in a web browser without any installation or setup.

Andrew Deloucas  
Perseus/NEH Workshop  
*Digital Editions and Digital Corpora*  
Tufts University  
5/31-6/1/19  
  
*Last updated 5/30/19*

## Setup

In [None]:
import cltk
import os

from pprint import pprint

## Working with CLTK Corpora

### Installing Corpora

In [None]:
## Set up corpora

## You will need the models/datasets that the new lemmatizer uses.
## Note that this will generate an error if this or an older
## version of this corpus is already installed. If that happens,
## back up the old version, move or delete it, and reimport the corpus.
##
## The corpus we'll be using syncs with CDLI's GitHub account,
## which is backed up daily. CDLI is the Cuneiform Digital Library
## Initiative, which focuses on the digital publication of cuneiform writing.
##
## Their website is here: https://cdli.ucla.edu/
##
## To install on your own machine, uncomment the lines below and run this cell.

# from cltk.corpus.utils.importer import CorpusImporter

# corpus_importer = CorpusImporter('akkadian')
# print(corpus_importer.list_corpora)

# corpus_importer.import_corpus('cdli_corpus')

### Loading a CLTK corpus

In [None]:
# Check and ensure that the Cuneiform Digital Library corpus is imported.

parent_directory = os.path.expanduser('~')
file = os.path.join(parent_directory, 'cltk_data', 'akkadian', 
                    'atf', 'cdli_corpus', 'cdliatf_unblocked.atf')
os.path.isfile(file)
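
The check above returns only `True` or `False`. As a minimal sketch, the same test can be wrapped in a helper that reports what it found. The name `corpus_file_status` is hypothetical (it is not part of CLTK), and the path is rebuilt here only so the example runs on its own.

```python
import os

def corpus_file_status(path):
    """Return a short, human-readable status line for a corpus file path."""
    if os.path.isfile(path):
        size_mb = os.path.getsize(path) / 1_000_000
        return f"found ({size_mb:.1f} MB): {path}"
    return f"missing: {path} (uncomment and run the import cell first)"

# Same location as in the cell above, rebuilt so the sketch is self-contained.
atf_path = os.path.join(os.path.expanduser('~'), 'cltk_data', 'akkadian',
                        'atf', 'cdli_corpus', 'cdliatf_unblocked.atf')
print(corpus_file_status(atf_path))
```

A message like this is easier to act on in a workshop setting than a bare `False`.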

In [None]:
# Load your corpus. FileImport reads a .txt file and
# stores its text in memory as .raw_file and .file_lines.
# parse_file then extracts the information for every text
# it finds and keeps each text as its own separate entry.
#
# None of these calls print anything; they only populate
# the objects we use below.

from cltk.corpus.akkadian.file_importer import FileImport
from cltk.corpus.akkadian.cdli_corpus import CDLICorpus

fi = FileImport(file)
cc = CDLICorpus()

fi.read_file()
cc.parse_file(fi.file_lines)

## Corpus Data

In [None]:
## Examining available metadata (unique to Akkadian).
#
# Depending on the information available from CDLI, you can
# list which kinds of data exist for each text. This example
# shows that you can filter for texts that have available
# metadata, transliterations, normalizations, and translations.
# Texts are identified by their unique CDLI number (pnum)
# and edition (publication).
#
# Not all of our corpora have this feature, so it's a nice way to see how
# Assyriologists divvy up information for our texts.
#
# Using remote corpora like this is the most common way the
# CLTK works, but we also have hosted corpora, such as Tesserae,
# a collaborative project between the University at Buffalo, the
# University of Notre Dame, and the University of Geneva that provides
# a web interface for exploring intertextual parallels. The material
# there has proper citation.bib, license.md, etc. files.
#
# Their homepage is here: http://tesserae.caset.buffalo.edu/
# Version control is here: https://github.com/cltk/greek_text_tesserae
#
# Note: I'm using 'normalization' because it pulls the least
# data at once. Other filters include pnum, edition, metadata,
# transliteration, and translation.

cc.print_catalog(catalog_filter=['normalization'])

### Additional tools for Catalog

In [None]:
## Print Catalog
# This is the most memory-intensive option, as it grabs all
# known information from every text: metadata, pnum,
# edition, text, transliteration, normalization, and
# translation.

print('Catalog')
pprint(list(cc.catalog)[:5])
print()

## Table of Contents
# Shortened form of catalog: pnum, edition, and 
# line length of text.

print('Table of Contents')        
pprint(list(cc.toc())[:5])
print()

## List CDLI number / Edition
# The shortest listing: only pnums or only editions.

print('Edition')
pprint(list(cc.list_editions())[:5]) 
print()
print('CDLI Number')
pprint(list(cc.list_pnums())[500:505])
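
To make the idea of filtering by available data concrete, here is a small sketch on a toy catalog. It assumes the catalog behaves like a dictionary keyed by P-number whose entries hold fields such as `transliteration` and `normalization`. The P-numbers, editions, and contents below are made-up placeholders, and `texts_with` is a hypothetical helper, not a CLTK method.

```python
# Toy stand-in for a parsed catalog: hypothetical P-numbers and placeholder content.
toy_catalog = {
    'P000001': {'edition': 'Example 1',
                'transliteration': ['1. a-na'],
                'normalization': [],
                'translation': []},
    'P000002': {'edition': 'Example 2',
                'transliteration': ['1. _lugal_'],
                'normalization': ['1. szarrum'],
                'translation': ['1. king']},
    'P000003': {'edition': 'Example 3',
                'transliteration': [],
                'normalization': [],
                'translation': []},
}

def texts_with(catalog, *fields):
    """Return the sorted P-numbers whose entries have every requested field non-empty."""
    return sorted(pnum for pnum, entry in catalog.items()
                  if all(entry.get(field) for field in fields))

print(texts_with(toy_catalog, 'normalization'))
print(texts_with(toy_catalog, 'transliteration', 'translation'))
```

Filtering on the sparsest field first keeps the result small, which is the same reason the catalog cell above uses `normalization`.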

## Tokenization

### Line, String Tokenizers

In [None]:
## Set up the CLTK Akkadian line tokenizer
#
# The Akkadian tokenizer reads ATF material and converts
# the data into readable, mutable tokens. An option controls
# whether damage markers in the text are preserved. This is
# a feature you'll never deal with in NLTK.
# We'll get into how damage can affect our readings.
#
# For now, we'll keep it off.

from cltk.corpus.akkadian.tokenizer import Tokenizer
line_tokenizer = Tokenizer(preserve_damage=False)


In [None]:
## Select your texts; we'll be choosing two:
#
# 1) a Tuppi Tamgurti (Legal Agreement text) from Nuzi
# 2) Taylor's Prism (Royal Inscription of Sennacherib)

nuzi = cc.catalog['P388524']['transliteration']
sennacherib = cc.catalog['P462830']['transliteration']

# Based on how CDLI structures its documents, we don't
# need line tokenizers all the time; however, the option is 
# there in case you're working with other documentation.

pprint(nuzi[0:8])
print()
pprint(sennacherib[0:8])
print()
print(f'There are {len(nuzi)} lines in Nuzi.')
print(f'There are {len(sennacherib)} lines in Sennacherib.')

# The first things you may notice about our Akkadian lines:
# sz = the /sh/ phoneme, written š (s-caron) in transcription.
# commas = mark emphatic consonants (t, and s,).
# {curly brackets} = determinatives.
# _underscores_ = Sumerian logograms.
# numbers = sign index values (bi, bi2, bi3).
# apostrophes = glottal stops.
# (parentheses) = numbers.
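
The ASCII conventions listed above map mechanically onto the characters Assyriologists use in print. The sketch below is a plain-Python illustration of that mapping for lowercase input only. The function name `atf_to_unicode` is hypothetical, and the sketch covers just `sz`, `s,`, `t,`, and sign index numbers; it is not a full ATF converter.

```python
import re

# ASCII digraphs from the list above and their printed equivalents.
ATF_TO_UNICODE = [('sz', 'š'), ('s,', 'ṣ'), ('t,', 'ṭ')]
SUBSCRIPT_DIGITS = str.maketrans('0123456789', '₀₁₂₃₄₅₆₇₈₉')

def atf_to_unicode(text):
    """Convert a few lowercase ATF ASCII conventions to their printed forms."""
    for ascii_form, unicode_form in ATF_TO_UNICODE:
        text = text.replace(ascii_form, unicode_form)
    # Only digits that directly follow a letter are sign indices (bi2 -> bi₂);
    # a number such as the 1 in 1(disz) is left alone.
    return re.sub(r'(?<=[a-zšṣṭ])\d+',
                  lambda match: match.group().translate(SUBSCRIPT_DIGITS),
                  text)

for sample in ['szar-ru', 'bi2', 's,a-bi3', '1(disz)']:
    print(sample, '->', atf_to_unicode(sample))
```

The lookbehind is the one design choice worth noting: it is what keeps counted numbers in parentheses from being mistaken for sign indices.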

<a id='nuzi-damage-sample'></a>

In [None]:
# Here's an example of damage in lines 19 and 20 of our Nuzi document:
#
# [square brackets] = missing signs
# [...] = unrestorable
# Pound sign = broken, but present

pprint(nuzi[18:20])
print()

# Putting it through our line tokenizer.

nuzi_sample = line_tokenizer.string_tokenizer('\n'.join(nuzi[18:20]))
pprint(nuzi_sample)

# For those curious, these lines read:
# "If the woman Awishnaya ... the man Sheshwaya
# after (lit. 'on the back of') the man Ari-peni ... the woman Shidanka"
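
For readers who want to see what removing damage markers amounts to, here is a rough regular-expression sketch. It is not the CLTK implementation: it simply drops the `[`, `]`, and `#` characters described above and keeps the signs they surround. The function name `strip_damage` and the sample line are hypothetical.

```python
import re

def strip_damage(line):
    """Remove square brackets and '#' damage flags, keeping the signs themselves."""
    # '[a-na]' becomes 'a-na', 'szum#' becomes 'szum', and '[...]' becomes '...'.
    return re.sub(r'[\[\]#]', '', line)

# Hypothetical damaged line, written only to exercise the three marker types.
damaged = 'szum#-ma [a-na] [...]'
print(strip_damage(damaged))
```

Note that this flattens restored and preserved signs into the same plain text, which is exactly the information you give up when damage is not preserved.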

### Word, Sign Tokenizers

In [None]:
# Unique to cuneiform languages, CLTK can additionally
# tokenize texts into words and, further, into
# the cuneiform signs of which the words are composed.

from cltk.tokenize.word import WordTokenizer
word_tokenizer = WordTokenizer('akkadian')

# Word Tokenizers take strings, not lists.

nuzi_words = word_tokenizer.tokenize('\n'.join(nuzi[18:20]))
sennacherib_words = word_tokenizer.tokenize('\n'.join(sennacherib[0:1]))

pprint(nuzi_words)
print()
pprint(sennacherib_words)

In [None]:
# Note that the tokenizers can differentiate between
# phonetic Akkadian and logographic Sumerian writing
# and we can confirm this with the sign tokenizer.

nuzi_sign = [word_tokenizer.tokenize_sign(x) 
             for x in nuzi_words]
sennacherib_sign = [word_tokenizer.tokenize_sign(x) 
                    for x in sennacherib_words]  
  
pprint(nuzi_sign[:5])
print()

# Because textual damage interferes with our sign
# readings, we'll have to strip the damage markers first.

#### Nuzi was already edited [above](#nuzi-damage-sample), so we'll tackle only Sennacherib.

In [None]:
sennacherib_sample = line_tokenizer.string_tokenizer('\n'.join(sennacherib))

nuzi_fixed = word_tokenizer.tokenize('\n'.join(nuzi_sample))
sennacherib_fixed = word_tokenizer.tokenize('\n'.join(sennacherib_sample[0:1]))

nuzi_signs = [word_tokenizer.tokenize_sign(x) for x in nuzi_fixed]
sennacherib_signs = [word_tokenizer.tokenize_sign(x) for x in sennacherib_fixed]  

pprint(nuzi_signs)
print()
pprint(sennacherib_signs)

# Here we can see some data trouble: Sennacherib's name is
# spelled with Sumerian logograms, but the source data does not
# mark them as such, and our program isn't equipped to clean up data.
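
Why unmarked logograms slip through is easy to see in a stripped-down sketch. It assumes, as the conventions above describe, that Sumerian logograms are recognized by their surrounding underscores. The function `tag_language` is a hypothetical simplification; CLTK's own logic may differ.

```python
def tag_language(token):
    """Label a transliterated token by its markup: underscores mean Sumerian."""
    if len(token) > 2 and token.startswith('_') and token.endswith('_'):
        return 'sumerian'
    return 'akkadian'

# The same logogram, once marked with underscores and once typed without them.
for token in ['_lugal_', 'lugal', 'a-na']:
    print(token, '->', tag_language(token))
```

A classifier of this kind can only be as good as the markup it reads: a logogram entered without underscores is indistinguishable from a phonetic spelling.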

## Additional tools

In [None]:
# Further down the pipeline, we have a few smaller tools for normalized Akkadian.

## Syllabifier

from cltk.stem.akkadian.syllabifier import Syllabifier

word = "epištašu"
syll = Syllabifier()
syll.syllabify(word)

In [None]:
## Decliner

from cltk.stem.akkadian.declension import NaiveDecliner

word = 'ilum'
decliner = NaiveDecliner()
decliner.decline_noun(word, 'm')

In [None]:
## Stresser

from cltk.phonology.akkadian.stress import StressFinder
stresser = StressFinder()
word = "šarrātim"
stresser.find_stress(word)

In [None]:
## Stems and Bound Forms

from cltk.stem.akkadian.stem import Stemmer
from cltk.stem.akkadian.bound_form import BoundForm
stemmer = Stemmer()
bound_former = BoundForm()
s_word = "ilātim"
b_word = "kalbim"
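# Use the words defined in this cell (not `word` left over from an earlier cell),
# and print both results so neither is hidden by the notebook's last-expression display.
print(stemmer.get_stem(s_word, 'f'))
print(bound_former.get_bound_form(b_word, 'm'))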

In [None]:
## Consonant Vowel Parser

from cltk.stem.akkadian.cv_pattern import CVPattern
cv_patterner = CVPattern()
word = "iparras"
cv_patterner.get_cv_pattern(word)
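
As a rough illustration of what a consonant/vowel pattern is, the sketch below reduces a normalized word to a string of `C` and `V`. It is a simplification and does not reproduce the output format of CLTK's `CVPattern`. The helper `cv_string` is hypothetical and treats a, e, i, u (with or without macron or circumflex) as vowels.

```python
import unicodedata

def cv_string(word):
    """Reduce a normalized Akkadian word to a string of 'C' and 'V'."""
    pattern = []
    for letter in unicodedata.normalize('NFC', word.lower()):
        # Decompose the letter and keep its base character, so that
        # long vowels such as a-macron count as plain 'a'.
        base = unicodedata.normalize('NFD', letter)[0]
        pattern.append('V' if base in 'aeiu' else 'C')
    return ''.join(pattern)

for sample in ['iparras', 'ilum']:
    print(sample, '->', cv_string(sample))
```

Because the vowel test runs on the decomposed base letter, no separate list of long-vowel characters is needed.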

## Sample Text Analyses & Visualizations

<img src="https://upload.wikimedia.org/wikipedia/commons/d/de/Taylor_Prism-1.jpg" alt="Taylor's Prism from the British Museum" align="left" style="width: 200px;"/>

Public domain photo by David Castor via [Wikimedia](https://upload.wikimedia.org/wikipedia/commons/d/de/Taylor_Prism-1.jpg).

<img src="https://www.britishmuseum.org/collectionimages/AN01015/AN01015592_001_l.jpg" alt="Detail, Taylor's Prism from the British Museum" align="left"/>

Photo via [British Museum](https://www.britishmuseum.org/research/collection_online/collection_object_details/collection_image_gallery.aspx?partid=1&assetid=1015592001&objectid=295077) (BM 91032; CC BY-NC-SA 4.0).

### Sign Frequency Analysis

In [None]:
from cltk.stem.akkadian.atf_converter import ATFConverter
from collections import Counter

all_signs = []

# Tokenize each line of the Nuzi text, skipping lines that come back empty.
lines = [line_tokenizer.string_tokenizer(text, include_blanks=False)
         for text in nuzi]
words = [word_tokenizer.tokenize(line[0]) for line in lines if line]

# Split every word into its signs and collect them in one flat list.
for line_words in words:
    for token in line_words:
        all_signs.extend(word_tokenizer.tokenize_sign(token))

frequency_analysis = Counter(all_signs).most_common(15)
pprint(frequency_analysis)


# We're not going to get too much information out of this, but for those
# curious, dumu (4) means son or citizen, igi (5) can mean many
# things, usually concerning eyes, sight, or witnesses. Munus (10) means
# female.
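
The counting step itself needs nothing beyond the standard library. The sketch below runs the same idea on a toy token list, assuming that signs within a word are separated by hyphens and that logograms carry underscores. The tokens are hypothetical and `split_signs` is not a CLTK function.

```python
from collections import Counter

def split_signs(token):
    """Split a transliterated word into signs on hyphens, dropping underscores."""
    return [sign for sign in token.strip('_').split('-') if sign]

# Hypothetical tokens standing in for one tokenized text.
toy_tokens = ['_dumu_', 'a-na', '_igi_', 'a-na', '_dumu_', 'i-na']

sign_counts = Counter(sign for token in toy_tokens
                      for sign in split_signs(token))
print(sign_counts.most_common(3))
```

On real data, determinatives in curly brackets would need their own handling before a count like this is meaningful.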

### Counting Tokens

In [None]:
from collections import Counter

In [None]:
sennacherib_tokens = word_tokenizer.tokenize(
    '\n'.join(sennacherib_sample))
s_tokens = [word[0] for word in sennacherib_tokens]
word_count = Counter(s_tokens)

total = len(s_tokens)
running = 0

print("Top 25 words in Taylor's Prism:\n")
print(f"{'':>5}  {'TOKEN':<20}      {'COUNT':<12}{'TOKEN %':<12}{'RUNNING %':<12}")
for rank, (token, count) in enumerate(word_count.most_common(25), start=1):
    running += count
    percent = f"{count / total * 100:.2f}%"              # share of all tokens
    running_percent = f"{running / total * 100:.2f}%"    # cumulative share so far
    print(f"{rank:>5}. {token:<20}      {count:<12}{percent:<12}{running_percent:<12}")
    
# The majority of these words are particles or prepositions:
# genitive (1, 2), subjunctive (2), accusative (3),
# conjunctive (4), and negative (5); prepositional (7, 8, 10).
#
# Unsurprisingly for a royal inscription, the top two
# nouns are king (6) and the main deity of the Neo-Assyrian
# Empire, Assur (9).
#
# Other words of note: campaign (11), types of land (12, 18),
# and "I counted, I surrounded" (22, 23).
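
Since function words dominate the top of the list, one common follow-up is to filter them out before counting. The sketch below shows the idea on toy data. The stop list is an illustrative assumption (a handful of very common Akkadian function words in ATF spelling), not a list shipped with CLTK, and `content_word_counts` is a hypothetical helper.

```python
from collections import Counter

# Illustrative stop list only; extend or replace it for real work.
STOP_WORDS = {'sza', 'a-na', 'i-na', 'u3', 'la'}

def content_word_counts(tokens, stop_words=STOP_WORDS):
    """Count tokens after dropping the function words in stop_words."""
    return Counter(token for token in tokens if token not in stop_words)

# Hypothetical token sequence standing in for the tokenized inscription.
toy_tokens = ['_lugal_', 'sza', 'a-na', '_lugal_', 'i-na', 'asz-szur{ki}', 'sza']
print(content_word_counts(toy_tokens).most_common(2))
```

The same function could be applied to `s_tokens` from the cell above to surface nouns and verbs more quickly.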

### KWIC

In [None]:
from nltk import Text

In [None]:
Sennacherib_Text = Text(s_tokens) # Note that Text takes a list of tokens as its input
Sennacherib_Text.concordance('_lugal_')

# The word 'king' appears before several place names:
#
# Lands
#
# Karduniash (Southern Mesopotamia, originally Kassite term)
# Meluhha (Indus Valley)
# Elam (modern Iran)
# Babylonia (Southern Mesopotamia)
#
# Cities
#
# Siduni / Sidon (modern Lebanon)
# Isqaluna / Ashkelon (modern Israel)
# Asdudu / Ashdod (modern Israel)
# Amqaruna / Ekron (modern Israel)
# Hazitu / Gaza (modern Palestine)
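
A keyword-in-context (KWIC) display is simple enough to sketch without NLTK, which can help when you want the context windows as data rather than printed text. The function `kwic` and the toy token list are hypothetical; the token spellings are borrowed from this notebook.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    rows = []
    for index, token in enumerate(tokens):
        if token == keyword:
            left = ' '.join(tokens[max(0, index - width):index])
            right = ' '.join(tokens[index + 1:index + 1 + width])
            rows.append((left, token, right))
    return rows

# Hypothetical token sequence, not a quotation from the inscription.
toy_tokens = ['a-na', '_lugal_', '{kur}elam-ma{ki}', 'u3', '_lugal_', 'asz-szur{ki}']
for left, hit, right in kwic(toy_tokens, '_lugal_', width=1):
    print(f'{left:>20} | {hit} | {right}')
```

Returning tuples rather than printing directly makes it easy to count what follows the keyword, for example with a `Counter` over the right-hand contexts.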

### Dispersion Plot

In [None]:
%matplotlib inline  
import matplotlib.pyplot as plt

In [None]:
Sennacherib_Text

In [None]:
plt.figure(figsize=(20, 5))
Sennacherib_Text.dispersion_plot(['_lugal_', 'asz-szur{ki}', '{kur}elam-ma{ki}'])

### Graphed Frequency Distribution

In [None]:
from nltk.probability import FreqDist

fdist = FreqDist(Sennacherib_Text)

plt.figure(figsize=(20, 10))
fdist.plot(50, cumulative=True)

# The cumulative curve is consistent with Zipf's law, under which the
# frequency of a word is roughly inversely proportional to its rank
# in the frequency table.
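
One way to go beyond eyeballing the plot is to fit a line to log frequency against log rank: under an ideal Zipf distribution the slope is -1. The sketch below is a plain least-squares fit using only the standard library. The function `zipf_slope` is hypothetical, and the demonstration data are synthetic, not taken from the inscription.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Synthetic, perfectly Zipfian frequencies: frequency = 1000 / rank.
ideal = [1000 / rank for rank in range(1, 51)]
print(round(zipf_slope(ideal), 3))
```

The function needs at least two positive frequencies. Passing it the values of a token counter such as `word_count` would give a rough slope for the text at hand, with the usual caveat that short texts fit the law only loosely.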

## Working with other CLTK Languages

### Working with Ancient Greek in CLTK

In [None]:
import re
import requests

response = requests.get('https://raw.githubusercontent.com/tesserae/tesserae/master/texts/grc/homer.iliad/homer.iliad.part.1.tess')
iliad = response.text

In [None]:
print(iliad[:100])

In [None]:
iliad = re.sub(r'<.+?>\t', '', iliad)
print(iliad[:100])
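
The substitution above throws the citation tags away. If you would rather keep them, the same pattern can capture each citation alongside its line. This sketch assumes each line has the shape `<citation>`, a tab, then the verse; the exact tag text in the sample is a guess at the layout, so check it against the first lines printed above.

```python
import re

# Sample in the assumed layout; the tag wording itself is hypothetical.
sample = ('<hom. il. 1.1>\tμῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος\n'
          '<hom. il. 1.2>\tοὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε\n')

# Same pattern as the cell above: delete every tag and its trailing tab.
text_only = re.sub(r'<.+?>\t', '', sample)

# Variant with capture groups: keep (citation, verse) pairs instead.
cited_lines = re.findall(r'<(.+?)>\t(.*)', sample)

print(text_only)
for citation, verse in cited_lines:
    print(citation, '=>', verse)
```

Keeping the pairs makes it possible to report results by book and line later on.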

In [None]:
import unicodedata

iliad = unicodedata.normalize('NFC', iliad)
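
Normalization matters for Greek because the same accented letter can be stored in more than one way, and strings that look identical will then fail to match. A small standard-library sketch of the problem and of what NFC does about it:

```python
import unicodedata

decomposed = '\u03b1\u0301'   # alpha followed by a combining acute accent
oxia = '\u1f71'               # alpha with oxia (Greek Extended block)
tonos = '\u03ac'              # alpha with tonos (basic Greek block)

# The three strings render alike but are different code point sequences.
print(len(decomposed), len(oxia), len(tonos))
print(decomposed == tonos, oxia == tonos)

# After NFC normalization all three become the same single code point.
forms = {unicodedata.normalize('NFC', text) for text in (decomposed, oxia, tonos)}
print(len(forms))
```

Normalizing once, right after loading, means later tokenizing and counting never split one word form across several spellings.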

In [None]:
from cltk.tokenize.word import WordTokenizer

word_tokenizer_greek = WordTokenizer('greek')
tokens = word_tokenizer_greek.tokenize(iliad)
print(tokens[:100])

In [None]:
from cltk.tokenize.line import LineTokenizer

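# Use a Greek-specific name so the Akkadian line_tokenizer defined earlier is not overwritten.
line_tokenizer_greek = LineTokenizer('greek')
lines = line_tokenizer_greek.tokenize(iliad)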
pprint(lines[:5])

In [None]:
# Beta tool: let us know if you'd like to contribute to its development!

from cltk.tag.pos import POSTag
tagger = POSTag('greek')

tagger.tag_ngram_123_backoff(lines[0])