## Homework

1. Read @Brezina2018 [ch. 2, pp. 41--65].
2. Choose 3 books of Pausanias and calculate the most common tokens, types, and lemmata for each. In a paragraph or so, describe your findings relative to the work we have done in class today.
3. Using your findings from 2., write a short (1-page) evaluation of one of the books of Pausanias that you have analyzed. Does your qualitative -- which is not to say "subjective" -- experience of reading the text cohere with your quantitative evaluation?
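
The distinction between tokens and types in item 2 can be sketched on a toy sentence before turning to Pausanias. This is only an illustration: the sentence is invented, and whitespace splitting is an assumed, deliberately naive tokenizer.

```python
from collections import Counter

# Toy sentence, invented for illustration (not taken from Pausanias).
sentence = "the men saw the man and the man saw the men"

# Tokens: every running word, here obtained by naive whitespace splitting.
tokens = sentence.split()

# Types: the distinct word forms; Counter maps each type to its frequency.
type_counts = Counter(tokens)

n_tokens = len(tokens)
n_types = len(type_counts)

print(f"{n_tokens} tokens, {n_types} types")
print(type_counts.most_common(1))
```

Here "men" and "man" still count as two separate types; only lemmatization would fold them together.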

In [None]:
%pip install MyCapytain

In [40]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText

with open("../tei/tlg0525.tlg001.perseus-eng2.xml") as f:
    text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-eng2", resource=f)

In [41]:

from lxml import etree
from MyCapytain.common.constants import Mimetypes

urns = []
raw_xmls = []
unannotated_strings = []

for ref in text.getReffs(level=len(text.citation)):
    urn = f"{text.urn}:{ref}"
    node = text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")

    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)

In [None]:
# install the latest version of numpy 1, instead of pandas' numpy 2
%pip install numpy==1.26.4

%pip install pandas

In [43]:
import pandas as pd

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
pausanias_df = pd.DataFrame(d)

In [None]:
# See https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html for
# pandas' string-splitting utilities; str.split() splits on whitespace by default
pausanias_df['whitespaced_tokens'] = pausanias_df['unannotated_strings'].str.split()

pausanias_df

In [45]:
def get_book_of_pausanias(df: pd.DataFrame, book_n: int) -> pd.DataFrame:
    # The trailing "." keeps book 1 from also matching book 10
    prefix = f"urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:{book_n}."
    return df[df['urn'].str.startswith(prefix)].copy()

In [46]:
book7 = get_book_of_pausanias(pausanias_df, 7)
book1 = get_book_of_pausanias(pausanias_df, 1)
book3 = get_book_of_pausanias(pausanias_df, 3)
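
A prefix filter on CTS references needs a delimiter after the book number, because the bare prefix `...:1` also matches references in book 10. The sketch below shows the difference on made-up URNs; it assumes references have the shape `book.chapter.section`, and `in_book` is a hypothetical helper, not part of MyCapytain.

```python
# Made-up URNs in the same shape as the notebook's, for illustration only.
base = "urn:cts:greekLit:tlg0525.tlg001.perseus-eng2"
urns = [f"{base}:1.1.1", f"{base}:1.2.1", f"{base}:10.1.1"]


def in_book(urn: str, book_n: int) -> bool:
    """Return True if the URN's reference starts with exactly this book number."""
    # Assumption: the book number is followed by "." (book.chapter.section).
    return urn.startswith(f"{base}:{book_n}.")


# Without the delimiter, book 10 is swept in along with book 1.
loose = [u for u in urns if u.startswith(f"{base}:1")]
# With the delimiter, only book 1 matches.
strict = [u for u in urns if in_book(u, 1)]

print(len(loose), len(strict))
```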

In [None]:
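# Jupyter displays only a cell's last expression, so print both counts
book7_tokens = book7['whitespaced_tokens'].explode()
print("Tokens:", book7_tokens.count())
print("Types:", book7_tokens.nunique())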

In [None]:
from collections import Counter

# explode() yields one row per token; Counter then maps each type to its frequency
tokens = book7['whitespaced_tokens'].explode()

type_counts = Counter(tokens)

print(type_counts.most_common(100))

In [None]:
from collections import Counter

# Same count as above, for Book 1
tokens = book1['whitespaced_tokens'].explode()

type_counts = Counter(tokens)

print(type_counts.most_common(100))

In [None]:
from collections import Counter

# Same count as above, for Book 3
tokens = book3['whitespaced_tokens'].explode()

type_counts = Counter(tokens)

print(type_counts.most_common(100))

In [None]:
%pip install spacy

In [None]:
# The "en" shortcut is deprecated in spaCy v3; use the full pipeline name
%run -m spacy download en_core_web_sm

TYPES WITH TOKEN COUNT

In [53]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [54]:
tokenizer = nlp.tokenizer

In [55]:
def commontypes(df: pd.DataFrame):
    from collections import Counter

    # Tokenize each passage with spaCy's tokenizer (no tagging or parsing needed here)
    docs = df['unannotated_strings'].apply(tokenizer)

    # Keep alphabetic, non-stop-word tokens only, so the counts below exclude stop words
    tokens_list = [t.text for doc in docs for t in doc if not t.is_stop and t.is_alpha]

    type_counts = Counter(tokens_list)

    print('Type count: ' + str(len(type_counts)))
    print('Token count: ' + str(sum(type_counts.values())))
    print(type_counts.most_common(100))

In [None]:
commontypes(book7)

In [None]:
commontypes(book1)

In [None]:
commontypes(book3)

LEMMATA

In [59]:
def commonlemmata(df: pd.DataFrame):
    from collections import Counter

    # Lemmata require the full spaCy pipeline, not just the tokenizer
    docs = nlp.pipe(df['unannotated_strings'].tolist(), batch_size=100)
    # Count lemmata of alphabetic, non-stop-word tokens across all passages
    lemmata = [t.lemma_ for doc in docs for t in doc if not t.is_stop and t.is_alpha]

    lemmata_counts = Counter(lemmata)

    return lemmata_counts.most_common(100)

In [None]:
commonlemmata(book7)

In [None]:
commonlemmata(book1)

In [None]:
commonlemmata(book3)

Findings in relation to work done in class:


I believe the progression from tokens to types to lemmata is what makes quantitative textual analysis meaningful. For example, 'men' and 'man' are forms of the same word and carry essentially the same meaning. By combining their counts through lemmatization, we get a more accurate and insightful picture of the text being studied. This is especially helpful in these books, where related words recur in different forms (Athens/Athenians, Greece/Greek), because it makes it easier to relate closely resembling words to one another.
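
How merging counts by lemma can change a frequency ranking is sketched below. Everything here is illustrative: the counts are invented, and `TOY_LEMMAS` and `merge_by_lemma` are hypothetical stand-ins for a real lemmatizer such as the spaCy pipeline used above.

```python
from collections import Counter

# Invented surface-form counts, for illustration only (not results from Pausanias).
form_counts = Counter({"man": 3, "men": 5, "temple": 6, "temples": 1})

# Hypothetical lookup table standing in for a real lemmatizer.
TOY_LEMMAS = {"men": "man", "temples": "temple"}


def merge_by_lemma(counts: Counter, lemma_of: dict) -> Counter:
    """Add up the counts of all forms that share a lemma."""
    merged = Counter()
    for form, n in counts.items():
        # Forms missing from the table are treated as their own lemma.
        merged[lemma_of.get(form, form)] += n
    return merged


lemma_counts = merge_by_lemma(form_counts, TOY_LEMMAS)

print(form_counts.most_common(1))   # most frequent surface form
print(lemma_counts.most_common(1))  # most frequent lemma
```

In this toy data the most frequent form is "temple", but once "man" and "men" are combined the lemma "man" overtakes it.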

Question 3. Summary of Book 1:


In Book 1, Pausanias focuses on Attica, the region of Athens and one of the most famous and culturally significant parts of ancient Greece. 'Athenians' (and 'Athens'), 'Greeks', and 'Apollo' are among the most frequent lemmata in Book 1, which strongly indicates that the book is set in Athens. The setting is easy to discern from the frequency of these words in our quantitative analysis.

Pausanias provides an extensive description of the Acropolis, the citadel of Athens, and its significant temples and monuments. He details the art and sculptures housed in these temples, which we can link directly to our quantitative analysis: 'temple', 'statue', and 'sanctuary' were among the 20 most common words in the book. Pausanias also describes the Athenian Agora, the social and political heart of the city, pointing out its various temples, statues, and stoas. Throughout Athens, he records numerous statues of gods, heroes, and important figures.

Pausanias is particularly interested in the temples of Athens and surrounding Attica. The temple of Athena on the Acropolis, the temple of Olympian Zeus, and the sanctuary at Eleusis are key highlights of his description. Gods like Athena, Zeus, and Apollo are central to the religious life of the Athenians, as Pausanias illustrates by detailing the architecture and religious ceremonies.

Pausanias frequently refers to historical events such as the wars that shaped the history of Athens, including the Battle of Marathon. He also mentions various kings, both mythical and historical, like Theseus, who is closely tied to the legends of Athens. His discussions of war and leadership reflect the Athenians' military prowess and the legacy of their past victories. This is evident in our quantitative analysis, where the words 'war', 'king', and 'kill' appear frequently and together point to conquest. On historical events, then, my quantitative analysis matches my qualitative experience of reading the text.

The frequent references to family relationships, particularly sons and daughters, reflect Pausanias' recounting of Greek myths and genealogies. For example, Athena (in myth, the daughter of Zeus) is revered in Athens, and mythological stories often emphasize the familial connections between gods and heroes. The sons and daughters of kings and gods feature prominently in the stories of the city. That is why 'son' and 'daughter' appear among the most common lemmata in our quantitative analysis: gods and heroes in Greek myth are regularly identified by their parentage.

Overall, Pausanias intersperses his description with historical accounts of wars, political events, and important figures that shaped Athens. He often contrasts Athens' present with its glorious past, lamenting how many of the city's wonders were destroyed or decayed over time. I believe the quantitative analysis gives reasonable insight into the contents of Book 1 even without reading it: by grouping thematically related words together, we can identify the setting, the history, and other important events and landmarks.
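
The idea of grouping thematically related words could be sketched as follows. The theme lexicon is my own assumption, built from the lemmata discussed above, and the counts are placeholders rather than results from the notebook; in practice the input would be the output of `commonlemmata` converted to a dictionary.

```python
from collections import Counter

# Hypothetical theme lexicon, assembled by hand from the lemmata discussed above.
THEMES = {
    "monuments": {"temple", "statue", "sanctuary"},
    "conflict": {"war", "king", "kill"},
    "kinship": {"son", "daughter"},
}


def theme_totals(lemma_counts: dict, themes: dict) -> Counter:
    """Sum the counts of the lemmata that belong to each theme."""
    totals = Counter()
    for theme, lemmas in themes.items():
        totals[theme] = sum(lemma_counts.get(lemma, 0) for lemma in lemmas)
    return totals


# Placeholder counts, invented for illustration (not measured from Book 1).
placeholder_counts = {"temple": 3, "statue": 2, "war": 4, "son": 1, "river": 5}

totals = theme_totals(placeholder_counts, THEMES)
print(totals.most_common())
```

Lemmata outside every theme, like "river" here, are simply ignored, so the choice of lexicon shapes what the totals can show.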