# Preliminaries

## Import statements

These imports give us access to code that isn't available by default: `os` from the standard library, plus the third-party packages pandas and matplotlib.

In [1]:
import os
import pandas as pd
from matplotlib import pyplot as plt

## Load the data

Once again, we're starting from the token table created in notebook 3.

In [2]:
# load the token table
csv_file = os.path.join('data', 'tokens.csv')
token_table = pd.read_csv(csv_file, dtype=str)

# drop punctuation tokens
no_punct = token_table.loc[token_table.upos!='PUNCT'].reset_index(drop=True)

display(no_punct)

,urn,author,title,line,token,lemma,upos,mood,tense,voice,person,number,case,gender
0,urn:cts:latinLit:phi1017.phi007,Seneca,Agamemnon,1,Opaca,Opaca,PROPN,,,,,Sing,Nom,Masc
1,urn:cts:latinLit:phi1017.phi007,Seneca,Agamemnon,1,linquens,linquens,VERB,,,Act,,Sing,Nom,Masc
2,urn:cts:latinLit:phi1017.phi007,Seneca,Agamemnon,1,Ditis,Dis,PROPN,,,,,Sing,Gen,Masc
3,urn:cts:latinLit:phi1017.phi007,Seneca,Agamemnon,1,inferni,infernus,ADJ,,,,,Sing,Gen,Masc
4,urn:cts:latinLit:phi1017.phi007,Seneca,Agamemnon,1,loca,locus,NOUN,,,,,Plur,Acc,Neut
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100104,urn:cts:latinLit:phi1035.phi001:8,Val_Flac,Thebaid_08,467,meruisse,mereo,VERB,Inf,Pres,Act,,,,
100105,urn:cts:latinLit:phi1035.phi001:8,Val_Flac,Thebaid_08,467,putas,puto,VERB,Ind,Pres,Act,2,Sing,,
100106,urn:cts:latinLit:phi1035.phi001:8,Val_Flac,Thebaid_08,467,me,ego,PRON,,,,,Sing,Acc,
100107,urn:cts:latinLit:phi1035.phi001:8,Val_Flac,Thebaid_08,467,talia,talis,DET,,,,,Plur,Acc,Neut


## Exclude very uncommon words

Let's keep only lemmata that occur at least 10 times in the corpus.

In [3]:
# calculate corpus-wide counts for all lemmata
lemma_count = no_punct.lemma.value_counts()

# create the list of lemmata to keep (those with at least 10 occurrences)
lemma_kept = lemma_count.index.values[lemma_count>=10]
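
To make the thresholding step concrete, here is a minimal sketch on made-up data. The toy Series and the threshold of 2 are assumptions for illustration; the notebook itself works on `no_punct.lemma` with a threshold of 10.

```python
import pandas as pd

# toy stand-in for no_punct.lemma (hypothetical data)
toy_lemma = pd.Series(['et', 'qui', 'et', 'sum', 'et', 'qui', 'locus'])

# value_counts() returns counts sorted from most to least frequent
toy_count = toy_lemma.value_counts()

# keep only lemmata at or above the threshold (2 here, 10 in the notebook)
toy_kept = toy_count[toy_count >= 2].index

print(list(toy_kept))
```

Because `value_counts()` sorts by frequency, the kept list comes out in descending order of frequency, which is handy when ordering columns later.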

# Sampling

## Measuring internal variations in style

For this experiment, we are going to calculate a **rolling** style signal—something that changes as we move through each document. In previous examples, each text was represented by a single sample. Here, we're going to create a **sliding window** that moves through the text, sampling as it moves. The samples will have a fixed size (the size of the window) but will overlap at the edges.

That lets us measure internal variability within a document, while hopefully keeping the samples large enough for the signal to be robust. We expect the signal to change relatively smoothly, and for the changes to correspond to meaningful divisions in the text. If we see something weird, we might need to change the size of the window or the features we're looking at.
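
As a rough illustration of the sliding-window idea, the sketch below sums a made-up per-line count over overlapping windows. The data, the window size of 4 and the step of 2 are all assumptions chosen to keep the example small; this is one possible way to build such samples, not necessarily the one used later in this series.

```python
import pandas as pd

# hypothetical per-line counts for one feature, ten lines long
toy_counts = pd.Series([1, 0, 2, 1, 0, 0, 3, 1, 1, 0])

window = 4   # sample size in lines (assumed value)
step = 2     # how far the window advances each time (assumed value)

# rolling sum over `window` lines; the first window-1 results are NaN
# because the window is not yet full, so we start at the first full window
rolled = toy_counts.rolling(window).sum()
samples = rolled.iloc[window - 1::step].astype(int)

print(samples.tolist())
```

Each sample covers the same number of lines, and consecutive samples share `window - step` lines, which is what makes the signal change gradually.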

## From tokens to lines

We're used to talking about locations in Latin poems using **line** numbers, so when I talk about the size of the sliding window I naturally think about it as, say, 50 lines, or 40 lines. The first thing I'm going to do, then, is convert my **token**-based feature table to a **line**-based one.

I'm not going to normalize by dividing by the number of tokens; I'll just sum the feature counts per line. We can convert to frequency per 1000 words later, after we've built the samples. Or we might decide not to normalize, since frequency per sample is already a kind of normalization.
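
If we do decide to normalize later, the arithmetic is simple. The sketch below uses a made-up four-line sample; the column names `lemma_ALL` and `pos_NOUN` mirror the ones built further down, but the numbers are hypothetical.

```python
import pandas as pd

# hypothetical per-line table: total lemmata and one feature count
toy = pd.DataFrame({
    'lemma_ALL': [5, 5, 6, 4],
    'pos_NOUN':  [1, 2, 3, 0],
})

# sum raw counts over the sample (here, all four lines)
sample_totals = toy.sum()

# convert the feature count to a rate per 1000 words
nouns_per_1000 = 1000 * sample_totals['pos_NOUN'] / sample_totals['lemma_ALL']

print(nouns_per_1000)
```

Summing first and dividing afterwards means short and long lines contribute in proportion to their length, rather than each line counting equally.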


### Create unique ids for lines

Right now, the token table has a column **urn** with a unique identifier for the work, and another column **line** with the internal line number on which each token falls. But because line numbers start from 1 in each book, we need to combine these two values to create a line ID that's unique within the corpus.

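I'm taking the time to create IDs that are CTS compliant, so Valerius Flaccus' multi-book epic is treated slightly differently from Seneca's individual tragedies. A tragedy's URN names only the work, so the line number is added after a colon. The epic's URN already ends in a book number after the colon, so the line number is added after a period.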

In [4]:
# start with an empty list
line_urns = []

# iterate over all the rows of the token table
#   - combine work urn with line number
for i, row in no_punct[['urn', 'author', 'line']].iterrows():
    if row['author'] == 'Seneca':
        this_id = row.urn + ':' + row.line
    else:
        this_id = row.urn + '.' + row.line
    line_urns.append(this_id)
    
# check the results
display(line_urns[:15])

['urn:cts:latinLit:phi1017.phi007:1',
 'urn:cts:latinLit:phi1017.phi007:1',
 'urn:cts:latinLit:phi1017.phi007:1',
 'urn:cts:latinLit:phi1017.phi007:1',
 'urn:cts:latinLit:phi1017.phi007:1',
 'urn:cts:latinLit:phi1017.phi007:2',
 'urn:cts:latinLit:phi1017.phi007:2',
 'urn:cts:latinLit:phi1017.phi007:2',
 'urn:cts:latinLit:phi1017.phi007:2',
 'urn:cts:latinLit:phi1017.phi007:2',
 'urn:cts:latinLit:phi1017.phi007:3',
 'urn:cts:latinLit:phi1017.phi007:3',
 'urn:cts:latinLit:phi1017.phi007:3',
 'urn:cts:latinLit:phi1017.phi007:3',
 'urn:cts:latinLit:phi1017.phi007:3']
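
The same IDs could probably be built without an explicit loop, which tends to be faster on large tables. This is a sketch of an alternative on a two-row toy table, not the method used above; it assumes, as in the notebook, that every column was read in as a string.

```python
import numpy as np
import pandas as pd

# toy stand-in for no_punct, with all columns stored as strings
toy = pd.DataFrame({
    'urn': ['urn:cts:latinLit:phi1017.phi007', 'urn:cts:latinLit:phi1035.phi001:8'],
    'author': ['Seneca', 'Val_Flac'],
    'line': ['1', '467'],
})

# choose the separator per row: ':' for Seneca, '.' otherwise
sep = pd.Series(np.where(toy.author == 'Seneca', ':', '.'), index=toy.index)

# string concatenation is applied to whole columns at once
toy_line_urns = (toy.urn + sep + toy.line).tolist()

print(toy_line_urns)
```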

### Give the IDs a custom sorting order

I have a feeling that we're going to need to sort these IDs later. Right now, Python sees them as text, so they will be sorted alphabetically. That would mean, for example, that `urn:cts:latinLit:phi1017.phi007:10` comes before `urn:cts:latinLit:phi1017.phi007:2`. Sorting on the line number alone wouldn't be safe either: where an editor has re-ordered the lines of the ancient text, the line numbers don't necessarily proceed in numerical order.

If I convert the IDs from **strings** (i.e. text) to Pandas' **Categorical** data type, I can specify a custom sort order—the order they're in right now. Then if I have to sort the IDs later, they'll be put back the way they're found in the ancient works.

In [5]:
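# unique IDs, in order of first appearance, define the sort order
# (wrapping the list in a Series avoids a deprecation warning from pd.unique in recent pandas)
line_urns_cat = pd.Categorical(line_urns, categories=pd.Series(line_urns).unique(), ordered=True)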
display(line_urns_cat)

['urn:cts:latinLit:phi1017.phi007:1', 'urn:cts:latinLit:phi1017.phi007:1', 'urn:cts:latinLit:phi1017.phi007:1', 'urn:cts:latinLit:phi1017.phi007:1', 'urn:cts:latinLit:phi1017.phi007:1', ..., 'urn:cts:latinLit:phi1035.phi001:8.467', 'urn:cts:latinLit:phi1035.phi001:8.467', 'urn:cts:latinLit:phi1035.phi001:8.467', 'urn:cts:latinLit:phi1035.phi001:8.467', 'urn:cts:latinLit:phi1035.phi001:8.467']
Length: 100109
Categories (16574, object): ['urn:cts:latinLit:phi1017.phi007:1' < 'urn:cts:latinLit:phi1017.phi007:2' < 'urn:cts:latinLit:phi1017.phi007:3' < 'urn:cts:latinLit:phi1017.phi007:4' ... 'urn:cts:latinLit:phi1035.phi001:8.464' < 'urn:cts:latinLit:phi1035.phi001:8.465' < 'urn:cts:latinLit:phi1035.phi001:8.466' < 'urn:cts:latinLit:phi1035.phi001:8.467']

This time, after the list, we see some extra information about the categories these IDs represent. We can see that even though there are 100109 items in the list (the number of tokens in the corpus), there are only 16574 categories (the number of unique lines). We can also see the sort order listed separately.
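
A small experiment shows the difference between the two kinds of sorting. The shortened IDs below are hypothetical stand-ins for the full URNs.

```python
import pandas as pd

# three line IDs in document order (shortened, hypothetical IDs)
ids = ['work:1', 'work:2', 'work:10']

# as plain strings, sorting is alphabetical
as_text = sorted(ids)

# as an ordered Categorical, sorting follows the order of `categories`
as_cat = pd.Series(pd.Categorical(ids, categories=ids, ordered=True))
shuffled = as_cat.iloc[[2, 0, 1]]
restored = shuffled.sort_values().tolist()

print(as_text)
print(restored)
```

The string sort puts `work:10` ahead of `work:2`, while the categorical sort puts the shuffled IDs back in their original document order.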

### Add the new line IDs to the token table

I'll use the `insert()` method to make the new IDs the first column in the table.

In [6]:
no_punct.insert(0, 'line_id', line_urns_cat)
display(no_punct)

,line_id,urn,author,title,line,token,lemma,upos,mood,tense,voice,person,number,case,gender
0,urn:cts:latinLit:phi1017.phi007:1,urn:cts:latinLit:phi1017.phi007,Seneca,Agamemnon,1,Opaca,Opaca,PROPN,,,,,Sing,Nom,Masc
1,urn:cts:latinLit:phi1017.phi007:1,urn:cts:latinLit:phi1017.phi007,Seneca,Agamemnon,1,linquens,linquens,VERB,,,Act,,Sing,Nom,Masc
2,urn:cts:latinLit:phi1017.phi007:1,urn:cts:latinLit:phi1017.phi007,Seneca,Agamemnon,1,Ditis,Dis,PROPN,,,,,Sing,Gen,Masc
3,urn:cts:latinLit:phi1017.phi007:1,urn:cts:latinLit:phi1017.phi007,Seneca,Agamemnon,1,inferni,infernus,ADJ,,,,,Sing,Gen,Masc
4,urn:cts:latinLit:phi1017.phi007:1,urn:cts:latinLit:phi1017.phi007,Seneca,Agamemnon,1,loca,locus,NOUN,,,,,Plur,Acc,Neut
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100104,urn:cts:latinLit:phi1035.phi001:8.467,urn:cts:latinLit:phi1035.phi001:8,Val_Flac,Thebaid_08,467,meruisse,mereo,VERB,Inf,Pres,Act,,,,
100105,urn:cts:latinLit:phi1035.phi001:8.467,urn:cts:latinLit:phi1035.phi001:8,Val_Flac,Thebaid_08,467,putas,puto,VERB,Ind,Pres,Act,2,Sing,,
100106,urn:cts:latinLit:phi1035.phi001:8.467,urn:cts:latinLit:phi1035.phi001:8,Val_Flac,Thebaid_08,467,me,ego,PRON,,,,,Sing,Acc,
100107,urn:cts:latinLit:phi1035.phi001:8.467,urn:cts:latinLit:phi1035.phi001:8,Val_Flac,Thebaid_08,467,talia,talis,DET,,,,,Plur,Acc,Neut


## Calculate line-based feature counts

### Lemma counts per line

The lemma-based cross-tabulation takes a long time because there are 14000 unique lemmata. But most of these are going to be thrown out immediately because they're not in the `lemma_kept` list. Here we make a **mask** based on which rows fit a criterion (their lemma is in the kept list). Then we use that mask to filter just the rows we want before doing the `crosstab()`. That saves computing a lot of data we don't need.

In [7]:
# identify rows that meet criterion
mask = no_punct.lemma.isin(lemma_kept)

# do cross-tabulation on masked table
lemma_count_line = pd.crosstab(no_punct.line_id, no_punct.lemma.loc[mask])

# reorder columns by frequency
lemma_count_line = lemma_count_line[lemma_kept]

# check the results
display(lemma_count_line)

line_id,que,et,qui,sum,hic,in,tu,non,ego,iam,...,offero,auus,Iove,quodque,redux,aequoreus,advolo,alumnus,magnanimus,patruus
urn:cts:latinLit:phi1017.phi007:1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1017.phi007:2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1017.phi007:3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1017.phi007:4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1017.phi007:5,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
urn:cts:latinLit:phi1035.phi001:8.463,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1035.phi001:8.464,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1035.phi001:8.465,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1035.phi001:8.466,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
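
To check that masking first gives the same counts as tabulating everything and discarding columns afterwards, here is a sketch on a toy table. The data are hypothetical, and for clarity this version applies the mask to both Series, which differs slightly from the cell above.

```python
import pandas as pd

# toy token table (hypothetical data)
toy = pd.DataFrame({
    'line_id': ['l1', 'l1', 'l2', 'l2', 'l3'],
    'lemma':   ['et', 'qui', 'et', 'rarus', 'rarior'],
})
kept = ['et', 'qui']

# approach 1: tabulate everything, then throw columns away
full = pd.crosstab(toy.line_id, toy.lemma)[kept]

# approach 2: filter rows first, then tabulate only what we need
mask = toy.lemma.isin(kept)
masked = pd.crosstab(toy.line_id[mask], toy.lemma[mask])[kept]

# a line with no kept lemma (l3) drops out of the masked table,
# so we restore it with zero counts before comparing
masked = masked.reindex(full.index, fill_value=0)

print((masked.values == full.values).all())
```

The thing to watch for is that a line containing none of the kept lemmata can disappear from the masked table. Any later step that combines tables needs to fill those gaps with zeros.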


### Total lemmata per line

We'll want to know the total number of lemmata on each line, both to check for irregularities and in case we decide to normalize per 1000 words later.

In the previous step we masked the original table, but here we're counting everything. Instead of `crosstab()`, we'll just use `groupby()` and do a count of the **lemma** column. The result is stored in its own variable for now; it gets added to the combined feature table as a new column when we join everything together.

In [8]:
all_lemmas_by_line = no_punct.lemma.groupby(no_punct.line_id).agg('count')
display(all_lemmas_by_line)

line_id
urn:cts:latinLit:phi1017.phi007:1         5
urn:cts:latinLit:phi1017.phi007:2         5
urn:cts:latinLit:phi1017.phi007:3         5
urn:cts:latinLit:phi1017.phi007:4         5
urn:cts:latinLit:phi1017.phi007:5         7
                                         ..
urn:cts:latinLit:phi1035.phi001:8.463     6
urn:cts:latinLit:phi1035.phi001:8.464     8
urn:cts:latinLit:phi1035.phi001:8.465     6
urn:cts:latinLit:phi1035.phi001:8.466     8
urn:cts:latinLit:phi1035.phi001:8.467    11
Name: lemma, Length: 16574, dtype: int64
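
One detail worth keeping in mind is that `count` tallies only non-missing values, whereas `size` tallies rows. The sketch below uses a toy table with one deliberately missing lemma to show the difference; it makes no claim about whether the real token table contains any missing lemmata.

```python
import pandas as pd

# toy table with one missing lemma (hypothetical data)
toy = pd.DataFrame({
    'line_id': ['l1', 'l1', 'l2', 'l2', 'l2'],
    'lemma':   ['et', 'qui', 'sum', None, 'hic'],
})

grouped = toy.lemma.groupby(toy.line_id)

# count() tallies non-missing values only
n_lemmas = grouped.count()

# size() tallies rows, whether or not the lemma is missing
n_tokens = grouped.size()

print(n_lemmas.tolist(), n_tokens.tolist())
```

If the two results ever disagree on the real data, that would be one of the irregularities worth investigating.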

### Calculate line-based counts for part-of-speech tags

In [9]:
# calculate pos counts
pos_count_line = pd.crosstab(no_punct.line_id, no_punct.upos)

# rename columns with a prefix
pos_count_line = pos_count_line.rename(columns = lambda name: 'pos_' + name)

display(pos_count_line)

line_id,pos_ADJ,pos_ADP,pos_ADV,pos_AUX,pos_CCONJ,pos_DET,pos_INTJ,pos_NOUN,pos_NUM,pos_PART,pos_PRON,pos_PROPN,pos_SCONJ,pos_VERB,pos_X
urn:cts:latinLit:phi1017.phi007:1,1,0,0,0,0,0,0,1,0,0,0,2,0,1,0
urn:cts:latinLit:phi1017.phi007:2,1,0,0,0,0,0,0,1,0,0,0,1,0,2,0
urn:cts:latinLit:phi1017.phi007:3,1,0,1,0,0,0,0,2,0,0,0,0,0,1,0
urn:cts:latinLit:phi1017.phi007:4,0,0,0,0,0,0,0,4,0,0,0,0,0,1,0
urn:cts:latinLit:phi1017.phi007:5,0,0,0,0,1,0,1,3,0,0,0,0,0,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
urn:cts:latinLit:phi1035.phi001:8.463,1,0,0,0,0,0,0,3,0,0,0,1,0,1,0
urn:cts:latinLit:phi1035.phi001:8.464,2,0,2,0,1,0,0,2,0,0,0,0,0,1,0
urn:cts:latinLit:phi1035.phi001:8.465,1,0,2,0,0,0,0,1,0,0,0,0,0,2,0
urn:cts:latinLit:phi1035.phi001:8.466,0,0,0,0,2,1,0,0,0,0,1,0,0,4,0


### Calculate line-based counts for morphological features

In [10]:
# a list of columns to process
feature_names = ['mood', 'voice', 'tense', 'person', 'number', 'gender', 'case']

# an empty list to gather the resulting tables
morph_counts = []

# iterate over the columns, using `feat` as a stand-in for the current feature
for feat in feature_names:
    
    # tally raw feature counts per line (no normalization at this stage)
    this_count = pd.crosstab(no_punct.line_id, no_punct[feat], dropna=False)

    # rename columns with a prefix
    this_count = this_count.rename(columns = lambda name: feat + '_' + name.upper())
    
    # add table to the list
    morph_counts.append(this_count)

### Join all the line-based tables together

In [11]:
# join all the tables together
feat_count_line = pos_count_line.join(morph_counts).join(lemma_count_line).fillna(0).astype(int)

# add total lemma counts as first column
feat_count_line.insert(
    loc = 0,
    column = 'lemma_ALL',
    value = all_lemmas_by_line,
)
    
# show results
display(feat_count_line)

line_id,lemma_ALL,pos_ADJ,pos_ADP,pos_ADV,pos_AUX,pos_CCONJ,pos_DET,pos_INTJ,pos_NOUN,pos_NUM,...,offero,auus,Iove,quodque,redux,aequoreus,advolo,alumnus,magnanimus,patruus
urn:cts:latinLit:phi1017.phi007:1,5,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1017.phi007:2,5,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1017.phi007:3,5,1,0,1,0,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1017.phi007:4,5,0,0,0,0,0,0,0,4,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1017.phi007:5,7,0,0,0,0,1,0,1,3,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
urn:cts:latinLit:phi1035.phi001:8.463,6,1,0,0,0,0,0,0,3,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1035.phi001:8.464,8,2,0,2,0,1,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1035.phi001:8.465,6,1,0,2,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
urn:cts:latinLit:phi1035.phi001:8.466,8,0,0,0,0,2,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Export the line-based feature table

I'm going to write the finished table to a CSV file and then continue in a new notebook. We've been having a little trouble getting the feature extraction step to work in GitHub Codespaces; the lemma `crosstab()` call may be taxing our free-tier resources even with the masking.

If you can't get the steps above to work in the environment you're using, you can start with the next notebook and just import the saved data.

In [12]:
feat_count_line.to_csv(os.path.join('data', 'features_by_line.csv'))
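
For reference, reading such a file back in might look like the sketch below. It round-trips a tiny hypothetical table through a temporary folder rather than touching the real `data` directory, and it assumes the line IDs should become the index again.

```python
import os
import tempfile

import pandas as pd

# toy feature table indexed by line ID (hypothetical data)
toy = pd.DataFrame(
    {'lemma_ALL': [5, 7], 'pos_NOUN': [1, 3]},
    index=pd.Index(['work:1', 'work:2'], name='line_id'),
)

# write to a temporary folder, then read back with the first column as index
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'features_by_line.csv')
    toy.to_csv(path)
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded)
```

A CSV file stores the rows in their current order but not the Categorical sort order, so the IDs come back as plain text. If they need to be sorted again later, the Categorical would have to be rebuilt from the row order in the file.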