# Retrieve annotations 

### XML Parsing Function
This function takes an XML file as input and parses it using the `lxml` library. It returns the parsed XML object for further processing.

In [3]:
from lxml import etree as ET

import os
import sys
path=os.getcwd()

def get_xml(xml_input):
    parser = ET.XMLParser(remove_comments=False)
    xml = ET.parse(xml_input, parser=parser)
    return xml 

### Load and Process XMI Files
This section of the notebook loads XMI files and processes the annotations contained within them. Specifically, it extracts and stores token annotations, which will be mapped to unique identifiers for further analysis.


In [4]:
import csv  
from cassis import *  

# Function to load a Common Analysis Structure (CAS) from an XMI file
def load_cas(file_input):
    """
    This function loads a Common Analysis Structure (CAS) from an XMI file.
    
    Arguments:
    file_input : str : Path to the XMI file to be loaded
    
    Returns:
    cassis.cas.Cas : Loaded CAS object
    """
    # Load the type system; the context manager closes the file afterwards
    with open('typesystem.xml', 'rb') as f:
        typesystem = load_typesystem(f)

    # Load the CAS from the XMI file using that type system
    with open(file_input, 'rb') as fxmi:
        cas = load_cas_from_xmi(fxmi, typesystem=typesystem)
    return cas  # Return the loaded CAS

# List of all input XMI files
all_files = ['Caesar, De bello Gallico 1-4.xmi', 'Virgil, Aeneid.xmi']

# Abbreviated names for the input files
files_abbreviated = ['Caes', 'Virg']

id2tok = dict()  # Dictionary to store IDs mapped to tokens
count = 0  # Initialize count variable for tracking iterations

for i in range(len(all_files)):
    file_input = all_files[i]  # Getting the current file name
    print(file_input)  # Printing the current file name
    file_input_abbr = files_abbreviated[i]  # Getting the abbreviated name for the current file
    xml = get_xml(file_input)  # Retrieving XML content from the current file
    cas = load_cas(file_input)  # Loading CAS from the current file
    # Looping through each 'Actionality' annotation in the CAS
    for relation in cas.select('webanno.custom.Actionality'):
        # Looping through each token covered by the 'Actionality' annotation
        for token in cas.select_covered('webanno.custom.Actionality', relation):
            tok = token.get_covered_text()  # Getting the text covered by the current token
            id = token.begin  # Getting the beginning offset of the token
            
            # Option to limit output: print every 100th token
            # To print all tokens, comment out the line below and uncomment the one after
            if count % 100 == 0:  # Print every 100th token (can be removed if you want all tokens)
                print(f"ID: {id}, Token: {tok}")  # Print the token ID and its text
            
            count = count + 1  # Incrementing the count
            id = str(id) + file_input_abbr  # Creating a unique ID for the token based on its offset and file abbreviation
            id2tok[id] = tok  # Storing the token text in the dictionary with the token ID as the key

            #Uncomment the following line to print all tokens if needed (instead of limiting to 100th token)
            #print(f"ID: {id}, Token: {tok}")  # Print every token (this line will show all tokens)

Caesar, De bello Gallico 1-4.xmi
ID: 1498, Token: exirent
ID: 78841, Token: circumventas
Virgil, Aeneid.xmi
ID: 28254, Token: succurrere
ID: 171793, Token: inibo
ID: 331311, Token: egressisque
ID: 436869, Token: succurrere


### Import pandas

Pandas will be needed to create DataFrames.

In [5]:
import pandas as pd

### Creating a DataFrame from the items in id2tok dictionary
This dictionary maps token IDs to their corresponding verb tokens. The code converts the dictionary into a list of `(key, value)` pairs, then creates a DataFrame with two columns: `"ID"` for the token ID and `"VERB TOKEN"` for the corresponding verb string. The resulting `tokenid_df` provides a tabular view of the token mapping.
This DataFrame is the basis for constructing all subsequent annotation DataFrames. By maintaining a consistent `"ID"` field (constructed using token offsets and file abbreviations), multiple layers of linguistic annotation can be merged and analyzed cohesively.




In [14]:
tokenid_df = pd.DataFrame([(k, v) for k, v in id2tok.items()], columns=["ID", "VERB TOKEN"])  # tokenid_df holds each token ID and its verb token

tokenid_df


Unnamed: 0,ID,VERB TOKEN
0,1498Caes,exirent
1,4296Caes,exeant
2,4578Caes,subeunda
3,4854Caes,transierant
4,4978Caes,exire
...,...,...
499,435595Virg,procurrit
500,436869Virg,succurrere
501,438653Virg,occurrere
502,440736Virg,subirent
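
The `"ID"` scheme described above (character offset followed by file abbreviation) can be sketched as a pair of helpers. This is a minimal illustration: `make_token_id` and `split_token_id` are hypothetical names that do not appear in the notebook, and the split assumes that file abbreviations never begin with a digit.

```python
import re

def make_token_id(offset, abbreviation):
    """Build an ID such as '1498Caes' from a character offset and a file abbreviation."""
    return f"{offset}{abbreviation}"

def split_token_id(token_id):
    """Recover (offset, abbreviation) from an ID; assumes the abbreviation starts with a non-digit."""
    match = re.fullmatch(r"(\d+)(\D.*)", token_id)
    if match is None:
        raise ValueError(f"Unexpected ID format: {token_id!r}")
    return int(match.group(1)), match.group(2)

token_id = make_token_id(1498, "Caes")
offset, abbreviation = split_token_id("436869Virg")
print(token_id, offset, abbreviation)
```

Splitting the ID back into its parts may be useful for sorting tokens by position within a file, since the combined IDs are strings and sort lexicographically.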


### Extracting Morphological Features from INCEpTION-Generated Annotations

This code snippet demonstrates how to retrieve a **span-level annotation layer** from INCEpTION—specifically the `MorphologicalFeatures` layer. INCEpTION stores linguistic annotations in a structured format (CAS - Common Analysis Structure), and these layers can be accessed programmatically for further processing and analysis.

- `list_morphological_features` is initialized as an empty list for collecting features.  
- `pred2_morphological_features` is a dictionary that maps each uniquely identified token (`ID`) to its associated morphological features.  
- A loop iterates through each file in `all_files`, using both the full name and an abbreviated label (`file_input_abbr`) to ensure unique token identification across multiple documents.  
- The CAS for each file is loaded and searched for annotations of type:

  ```python
  'de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.morph.MorphologicalFeatures'
  ```

In [11]:
list_morphological_features = []  # Initializing an empty list to store morphological features
pred2_morphological_features = dict()  # Initializing an empty dictionary to store predicted morphological features mapped to IDs
count = 0  # Initializing a count variable to keep track of iterations

# Looping through each file in the list of all_files
for i in range(len(all_files)):
    file_input = all_files[i]  # Getting the current file name
    file_input_abbr = files_abbreviated[i]  # Getting the abbreviated name for the current file
    xml = get_xml(file_input)  # Retrieving XML content from the current file
    cas = load_cas(file_input)  # Loading CAS from the current file
    
    # Looping through each 'MorphologicalFeatures' annotation in the CAS
    for relation in cas.select('de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.morph.MorphologicalFeatures'):
        # Looping through each token covered by the 'MorphologicalFeatures' annotation
        for token in cas.select_covered('de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.morph.MorphologicalFeatures', relation):
            tok = token.get_covered_text()  # Getting the text covered by the current token
            id = token.begin  # Getting the beginning offset of the token
            id = str(id) + file_input_abbr  # Creating a unique ID for the token based on its offset and file abbreviation
            morphological_feature = relation.value  # Getting the morphological feature value
            
            # Checking if the ID already exists in the pred2_morphological_features dictionary
            if id in pred2_morphological_features:
                # If the ID exists, concatenate the new feature string to the one already stored
                list_morphological_features = pred2_morphological_features[id] + morphological_feature
                pred2_morphological_features[id] = list_morphological_features
            else:
                # If the ID doesn't exist, store the morphological feature string directly
                pred2_morphological_features[id] = morphological_feature

print(pred2_morphological_features)


{'1498Caes': 'Mood=Subj|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin|Voice=Act', '4296Caes': 'Mood=Subj|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act', '4578Caes': 'Case=Nom|Gender=Neut|Number=Plur|Tense=Pres|VerbFor=Gdv|Voice=Act', '4854Caes': 'Mood=Ind|Number=Plur|Person=3|Tense=Pqp|VerbForm=Fin|Voice=Act', '4978Caes': 'Tense=Pres|VerbForm=Inf|Voice=Act', '5366Caes': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass', '5732Caes': 'Mood=Subj|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act', '5997Caes': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act', '6831Caes': 'Mood=Subj|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin|Voice=Act', '7012Caes': 'Mood=Ind|Number=Plur|Person=3|Tense=Pqp|VerbForm=Fin|Voice=Act', '7293Caes': 'Tense=Pres|VerbForm=Inf|Voice=Act', '8498Caes': 'Mood=Subj|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act', '9426Caes': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act', '9649Caes': 'Mood=Ind|Number

### Creating the `morphological_features_df` DataFrame

This line transforms the `pred2_morphological_features` dictionary, in which each key is a unique token ID and each value is a string of morphological features, into a structured Pandas DataFrame. The resulting `morphological_features_df` contains two columns:

- `"ID"`: the unique identifier for each token, composed of its character offset and the abbreviated file name  
- `"MORPHOLOGICAL FEATURES"`: the morphological annotation(s) (e.g., number, case, gender, tense) assigned to that token

This DataFrame structure allows for easy inspection, analysis, and merging with other annotation layers (e.g., POS tags, lemmas) based on the shared `"ID"` field.

In [12]:
morphological_features_df = pd.DataFrame([(k,v) for k,v in pred2_morphological_features.items()], columns=["ID", "MORPHOLOGICAL FEATURES"]) #where 'morphological_features_df' is a dataframe containing token IDs and morphological features
morphological_features_df

Unnamed: 0,ID,MORPHOLOGICAL FEATURES
0,1498Caes,Mood=Subj|Number=Plur|Person=3|Tense=Imp|VerbF...
1,4296Caes,Mood=Subj|Number=Plur|Person=3|Tense=Pres|Verb...
2,4578Caes,Case=Nom|Gender=Neut|Number=Plur|Tense=Pres|Ve...
3,4854Caes,Mood=Ind|Number=Plur|Person=3|Tense=Pqp|VerbFo...
4,4978Caes,Tense=Pres|VerbForm=Inf|Voice=Act
...,...,...
499,435595Virg,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbF...
500,436869Virg,Tense=Pres|VerbForm=Inf|Voice=Act
501,438653Virg,Tense=Pres|VerbForm=Inf|Voice=Act
502,440736Virg,Mood=Subj|Number=Plur|Person=3|Tense=Imp|VerbF...
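
The values in the `"MORPHOLOGICAL FEATURES"` column are `Key=Value` pairs separated by `|`. If individual features (e.g., mood or tense) are needed on their own, a string of this shape could be split as in the sketch below. `parse_features` is a hypothetical helper, not part of the notebook, and the example string is taken from the first row of the output above.

```python
def parse_features(feature_string):
    """Split a 'Key=Value|Key=Value' string into a dictionary."""
    features = {}
    for pair in feature_string.split("|"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        key, value = pair.split("=", 1)
        features[key] = value
    return features

example = "Mood=Subj|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin|Voice=Act"
parsed = parse_features(example)
print(parsed["Mood"], parsed["Tense"])
```

Malformed fragments are skipped rather than raising an error, so irregular annotations do not interrupt processing; whether that is the desired behavior depends on how strictly the annotations should be validated.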


### Merging Token IDs with Morphological Features

This line merges the `tokenid_df` (which contains token IDs and their corresponding verb tokens) with `morphological_features_df` (which holds morphological annotations) using a **left join** on the `"ID"` column. The resulting DataFrame, `id_morphological_features_df`, retains all tokens from `tokenid_df` and adds the corresponding morphological features where available.

This step is essential for aligning the base token information with its linguistic annotations, enabling integrated analysis of token form and grammatical properties.

In [15]:
id_morphological_features_df = tokenid_df.merge(morphological_features_df, on='ID', how='left') #where 'id_morphological_features_df' merges 'tokenid_df' and 'morphological_features_df'
id_morphological_features_df

Unnamed: 0,ID,VERB TOKEN,MORPHOLOGICAL FEATURES
0,1498Caes,exirent,Mood=Subj|Number=Plur|Person=3|Tense=Imp|VerbF...
1,4296Caes,exeant,Mood=Subj|Number=Plur|Person=3|Tense=Pres|Verb...
2,4578Caes,subeunda,Case=Nom|Gender=Neut|Number=Plur|Tense=Pres|Ve...
3,4854Caes,transierant,Mood=Ind|Number=Plur|Person=3|Tense=Pqp|VerbFo...
4,4978Caes,exire,Tense=Pres|VerbForm=Inf|Voice=Act
...,...,...,...
499,435595Virg,procurrit,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbF...
500,436869Virg,succurrere,Tense=Pres|VerbForm=Inf|Voice=Act
501,438653Virg,occurrere,Tense=Pres|VerbForm=Inf|Voice=Act
502,440736Virg,subirent,Mood=Subj|Number=Plur|Person=3|Tense=Imp|VerbF...
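
A left join silently leaves `NaN` where no annotation matches, so it can be worth checking which tokens stayed unmatched. The sketch below uses the `indicator=True` option of `DataFrame.merge` on two toy DataFrames; the frame names and the `9999Demo` row are invented for illustration and do not come from the corpus.

```python
import pandas as pd

# Toy stand-ins for tokenid_df and morphological_features_df (illustrative values only)
tokens = pd.DataFrame({
    "ID": ["1498Caes", "4296Caes", "9999Demo"],
    "VERB TOKEN": ["exirent", "exeant", "demo"],
})
features = pd.DataFrame({
    "ID": ["1498Caes", "4296Caes"],
    "MORPHOLOGICAL FEATURES": ["Mood=Subj|Tense=Imp", "Mood=Subj|Tense=Pres"],
})

# indicator=True adds a '_merge' column recording where each row was found
merged = tokens.merge(features, on="ID", how="left", indicator=True)

# Rows marked 'left_only' had no matching annotation in the right-hand frame
unmatched = merged.loc[merged["_merge"] == "left_only", "ID"].tolist()
print(unmatched)
```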


### Extracting Lemmas and Mapping to Token IDs

This chunk of code extracts lemma annotations from the CAS and maps them to token IDs, specifically focusing on tokens covered by the `Lemma` annotation layer.

- The script loops through each file in `all_files` and retrieves the XML and CAS content.
- For each file, it identifies tokens annotated with the `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma` annotation layer.
- The script then extracts the lemma associated with each token and creates a unique ID for the token by combining the token's offset and the file abbreviation.
- If the token's ID already exists in the `pred2_lemmas` dictionary, the new lemma is concatenated to the lemma string already stored under that ID; otherwise, the lemma is stored directly.

**Note:** Unlike the `MorphologicalFeatures` layer, which is only applied to verbs in this study, the `Lemma` layer can be found on various parts of speech (POS), not just verbs. This allows for the extraction of lemmas related to any token in the sentence, including nouns, adjectives, and adverbs. These lemmas are then mapped to unique token IDs and can be used for further analysis, such as in extracting **spatial relation lemmas** (cf. code chunks below).


In [27]:
list_lemmas = []  # Initializing an empty list to store lemmas
pred2_lemmas = dict()  # Initializing an empty dictionary to store predicted lemmas mapped to IDs
count = 0  # Initializing a count variable to keep track of iterations

# Looping through each file in the list of all_files
for i in range(len(all_files)):
    file_input = all_files[i]  # Getting the current file name
    file_input_abbr = files_abbreviated[i]  # Getting the abbreviated name for the current file
    xml = get_xml(file_input)  # Retrieving XML content from the current file
    cas = load_cas(file_input)  # Loading CAS from the current file
    
    # Looping through each 'Lemma' annotation in the CAS
    for relation in cas.select('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma'):
        # Looping through each token covered by the 'Lemma' annotation
        for token in cas.select_covered('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma', relation):
            tok = token.get_covered_text()  # Getting the text covered by the current token
            id = token.begin  # Getting the beginning offset of the token
            id = str(id) + file_input_abbr  # Creating a unique ID for the token based on its offset and file abbreviation
            lemma = relation.value  # Getting the lemma value
            
            # Checking if the ID already exists in the pred2_lemmas dictionary
            if id in pred2_lemmas:
                # If the ID exists, concatenate the new lemma to the string already stored
                list_lemmas = pred2_lemmas[id] + lemma
                pred2_lemmas[id] = list_lemmas
            else:
                # If the ID doesn't exist, store the lemma directly
                pred2_lemmas[id] = lemma

print(pred2_lemmas)



{'1463Caes': 'de', '1466Caes': 'finis', '1498Caes': 'exeo', '4281Caes': 'ex', '4283Caes': 'finis', '4296Caes': 'exeo', '4560Caes': 'ad', '4569Caes': 'periculum', '4578Caes': 'subeo', '4587Caes': 'sum', '4805Caes': 'qui', '4837Caes': 'in', '4840Caes': 'ager', '4854Caes': 'transeo', '4962Caes': 'iter', '4973Caes': 'domus', '4978Caes': 'exeo', '5323Caes': 'Rhodanus', '5366Caes': 'transeo', '5709Caes': 'ad', '5712Caes': 'ripa', '5732Caes': 'convenio', '5986Caes': 'ad', '5989Caes': 'Genava', '5997Caes': 'pervenio', '6806Caes': 'miles', '6831Caes': 'convenio', '6995Caes': 'qui', '6999Caes': 'ex', '7002Caes': 'provincia', '7012Caes': 'convenio', '7293Caes': 'transeo', '8498Caes': 'transeo', '9371Caes': 'in', '9374Caes': 'finis', '9426Caes': 'pervenio', '9630Caes': 'in', '9643Caes': 'finis', '9649Caes': 'pervenio', '10386Caes': 'in', '10389Caes': 'Santoni', '10398Caes': 'Helvetii', '10407Caes': 'pervenio', '10576Caes': 'is', '10579Caes': 'Helvetii', '10617Caes': 'transeo', '10848Caes': 'ad', '
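
Because lemma values are strings, the expression `pred2_lemmas[id] + lemma` joins two lemmas without any separator whenever an ID occurs more than once. If repeated IDs are expected, one alternative is to collect the values in lists and join them explicitly. The sketch below is an assumption about how that could look, not the notebook's method; the repeated ID and the lemma `eo` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (ID, lemma) pairs; the repeated ID is invented for illustration
pairs = [("1498Caes", "exeo"), ("4296Caes", "exeo"), ("1498Caes", "eo")]

# Collect every lemma seen for an ID in a list
id2lemmas = defaultdict(list)
for token_id, lemma in pairs:
    id2lemmas[token_id].append(lemma)

# Join with a visible separator so multiple lemmas remain distinguishable
joined = {token_id: "; ".join(lemmas) for token_id, lemmas in id2lemmas.items()}
print(joined)
```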

## Sentence

### Extracting Custom Annotation Layer: `Includes` (Manually Added in INCEpTION)

This code block retrieves annotations from a **custom relation layer** called `Includes`, which has been **manually added in INCEpTION** (see Farina, 2024). Unlike span layers such as `MorphologicalFeatures`, this layer represents a **relation between two tokens**: a *Governor* and a *Dependent*.

- `list_sentences` is initialized to store all dependent token texts.
- `pred2_sentences` maps unique token IDs to the governor tokens they are associated with—effectively reconstructing parts of annotated sentence structures.

The code iterates through each annotated file:

- Loads the CAS object and selects annotations of type:
  
  ```python
  'webanno.custom.Includes'
  ```

In [16]:
list_sentences = []  # Initializing an empty list to store sentences
pred2_sentences = dict()  # Initializing an empty dictionary to store predicted sentences mapped to IDs
count = 0  # Initializing a count variable to keep track of iterations

# Looping through each file in the list of all_files
for i in range(len(all_files)):
    file_input = all_files[i]  # Getting the current file name
    file_input_abbr = files_abbreviated[i]  # Getting the abbreviated name for the current file
    xml = get_xml(file_input)  # Retrieving XML content from the current file
    cas = load_cas(file_input)  # Loading CAS from the current file
    
    # Looping through each 'Includes' relation in the CAS
    for relation in cas.select('webanno.custom.Includes'):
        dep = relation.Dependent  # Getting the dependent token of the relation
        tokdep = dep.get_covered_text()  # Getting the text covered by the dependent token
        id = str(dep.begin) + file_input_abbr  # Creating a unique ID for the dependent token based on its offset and file abbreviation
        list_sentences.append(str(tokdep))  # Appending the text of the dependent token to the list of sentences
        
        gov = relation.Governor  # Getting the governor token of the relation
        sentence = gov.get_covered_text()  # Getting the text covered by the governor token
        
        # Checking if the ID already exists in the pred2_sentences dictionary
        if id in pred2_sentences:
            # If the ID exists, concatenate the new governor text to the stored string.
            # list_sentences is not reassigned here, so it stays a list and .append() keeps working.
            pred2_sentences[id] = pred2_sentences[id] + sentence
        else:
            # If the ID doesn't exist, store the governor text directly
            pred2_sentences[id] = sentence

print(pred2_sentences)


{'1498Caes': 'Pisone consulibus regni cupiditate inductus coniurationem nobilitatis fecit et civitati persuasit ut de finibus suis cum omnibus copiis exirent', '4296Caes': 'Post eius mortem nihilo minus Helvetii id quod constituerant facere conantur, ut e finibus suis exeant', '4578Caes': 'frumentum omne, praeter quod secum portaturi erant, comburunt, ut domum reditionis spe sublata paratiores ad omnia pericula subeunda essent', '4854Caes': 'Persuadent Rauracis et Tulingis et Latobrigis finitimis, uti eodem usi consilio oppidis suis vicisque exustis una cum iis proficiscantur, Boiosque, qui trans Rhenum incoluerant et in agrum Noricum transierant Noreiamque oppugnabant, receptos ad se socios sibi adsciscunt', '4978Caes': 'Erant omnino itinera duo, quibus itineribus domo exire possent', '5366Caes': 'alterum per provinciam nostram, multo facilius atque expeditius, propterea quod inter fines Helvetiorum et Allobrogum, qui nuper pacati erant, Rhodanus fluit isque non nullis locis vado tran

### Creating the `sentences_df` DataFrame

This line converts the `pred2_sentences` dictionary—where each key is a unique token ID and each value is the sentence or phrase (retrieved from the Governor token) that includes the corresponding verb—into a structured Pandas DataFrame.

The resulting `sentences_df` contains:

- `"ID"`: a unique identifier for each token, composed of the token's character offset and file abbreviation  
- `"SENTENCE"`: the text span (usually a sentence or clause) in which the token appears, based on the custom `Includes` relation

This DataFrame allows for a clear alignment between individual tokens and the broader sentence contexts in which they occur. Like previous annotation layers, it can later be merged with other DataFrames (e.g., `tokenid_df`) to form a comprehensive view of the annotated data.


In [17]:
sentences_df = pd.DataFrame([(k,v) for k,v in pred2_sentences.items()], columns=["ID", "SENTENCE"]) #where 'sentences_df' is a dataframe containing token IDs and the whole sentences including the verb tokens
sentences_df

Unnamed: 0,ID,SENTENCE
0,1498Caes,Pisone consulibus regni cupiditate inductus co...
1,4296Caes,Post eius mortem nihilo minus Helvetii id quod...
2,4578Caes,"frumentum omne, praeter quod secum portaturi e..."
3,4854Caes,Persuadent Rauracis et Tulingis et Latobrigis ...
4,4978Caes,"Erant omnino itinera duo, quibus itineribus do..."
...,...,...
499,435595Virg,"Dum nititur acer et instat, rursus in aurigae ..."
500,436869Virg,"Iuturnam misero, fateor, succurrere fratri sua..."
501,438653Virg,Harum unam celerem demisit ab aethere summo Iu...
502,440736Virg,"Vix illud lecti bis sex cervice subirent, qual..."


### Merging Morphological Features with Sentence Contexts

This line merges `id_morphological_features_df`, which contains token IDs, verb tokens, and their morphological features, with `sentences_df`, which includes the sentence-level context of each token. The merge is performed using a **left join** on the `"ID"` column to ensure that all tokens are retained, even if a sentence annotation is not available.

The resulting DataFrame, `id_sentences_df`, contains:

- Token IDs  
- Verb tokens  
- Morphological features  
- Corresponding sentence contexts (retrieved via the custom `Includes` relation)

This step exemplifies the overall structure of the pipeline:  
You **progressively extract each span or relation layer** (e.g., lemmas, morphological features, POS tags, sentence contexts) and **merge it into a growing base DataFrame** using the common `"ID"` field. This cumulative merging process leads to a **final, comprehensive DataFrame** that aligns all linguistic annotations on a token-by-token basis for further analysis or modeling.

In [18]:
id_sentences_df = id_morphological_features_df.merge(sentences_df, on='ID', how='left') #where 'id_sentences_df' merges 'id_morphological_features_df' and 'sentences_df'
id_sentences_df

Unnamed: 0,ID,VERB TOKEN,MORPHOLOGICAL FEATURES,SENTENCE
0,1498Caes,exirent,Mood=Subj|Number=Plur|Person=3|Tense=Imp|VerbF...,Pisone consulibus regni cupiditate inductus co...
1,4296Caes,exeant,Mood=Subj|Number=Plur|Person=3|Tense=Pres|Verb...,Post eius mortem nihilo minus Helvetii id quod...
2,4578Caes,subeunda,Case=Nom|Gender=Neut|Number=Plur|Tense=Pres|Ve...,"frumentum omne, praeter quod secum portaturi e..."
3,4854Caes,transierant,Mood=Ind|Number=Plur|Person=3|Tense=Pqp|VerbFo...,Persuadent Rauracis et Tulingis et Latobrigis ...
4,4978Caes,exire,Tense=Pres|VerbForm=Inf|Voice=Act,"Erant omnino itinera duo, quibus itineribus do..."
...,...,...,...,...
499,435595Virg,procurrit,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbF...,"Dum nititur acer et instat, rursus in aurigae ..."
500,436869Virg,succurrere,Tense=Pres|VerbForm=Inf|Voice=Act,"Iuturnam misero, fateor, succurrere fratri sua..."
501,438653Virg,occurrere,Tense=Pres|VerbForm=Inf|Voice=Act,Harum unam celerem demisit ab aethere summo Iu...
502,440736Virg,subirent,Mood=Subj|Number=Plur|Person=3|Tense=Imp|VerbF...,"Vix illud lecti bis sex cervice subirent, qual..."
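
The cumulative merging described above can also be written as a single fold over a list of annotation DataFrames. The following is a minimal sketch on toy data, assuming every layer shares the `"ID"` column; the column names `LEMMA` and `PREVERB SEMANTICS` are placeholders chosen for this example.

```python
from functools import reduce

import pandas as pd

# Toy base frame and annotation layers (illustrative values only)
base = pd.DataFrame({"ID": ["1498Caes", "4296Caes"], "VERB TOKEN": ["exirent", "exeant"]})
layers = [
    pd.DataFrame({"ID": ["1498Caes", "4296Caes"], "LEMMA": ["exeo", "exeo"]}),
    pd.DataFrame({"ID": ["1498Caes"], "PREVERB SEMANTICS": ["out"]}),
]

# Left-join each layer onto the growing frame, starting from the base frame
combined = reduce(lambda left, right: left.merge(right, on="ID", how="left"), layers, base)
print(combined.columns.tolist())
```

Merging step by step, as the notebook does, makes each intermediate DataFrame available for inspection; the fold is merely more compact once the individual layers are known to be correct.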


## Preverb semantics

### Extracting Custom Annotation Layer: `SemPrev` (Preverb Semantics)

This code snippet retrieves annotations from the custom span layer `SemPrev`, which was manually added in INCEpTION. The layer encodes **preverb semantic values** for each token and stores them under the `Preverbsemantics` attribute.

- `list_preverb_semantics` is initialized to collect values associated with each token.
- `pred2_preverb_semantics` maps each token's unique `"ID"` (constructed from its character offset and file abbreviation) to its corresponding preverb semantic annotation.

For each file in `all_files`:
- The CAS object is loaded, and the script selects all annotations of type:

  ```python
  'webanno.custom.SemPrev'
  ```

In [19]:
list_preverb_semantics = []  # Initializing an empty list to store preverb semantics
pred2_preverb_semantics = dict()  # Initializing an empty dictionary to store predicted preverb semantics mapped to IDs
count = 0  # Initializing a count variable to keep track of iterations

# Looping through each file in the list of all_files
for i in range(len(all_files)):
    file_input = all_files[i]  # Getting the current file name
    file_input_abbr = files_abbreviated[i]  # Getting the abbreviated name for the current file
    xml = get_xml(file_input)  # Retrieving XML content from the current file
    cas = load_cas(file_input)  # Loading CAS from the current file
    
    # Looping through each 'SemPrev' annotation in the CAS
    for relation in cas.select('webanno.custom.SemPrev'):
        # Looping through each token covered by the 'SemPrev' annotation
        for token in cas.select_covered('webanno.custom.SemPrev', relation):
            tok = token.get_covered_text()  # Getting the text covered by the current token
            id = token.begin  # Getting the beginning offset of the token
            id = str(id) + file_input_abbr  # Creating a unique ID for the token based on its offset and file abbreviation
            preverb_semantics = relation.Preverbsemantics  # Getting the preverb semantics value
            
            # Checking if the ID already exists in the pred2_preverb_semantics dictionary
            if id in pred2_preverb_semantics:
                # If the ID exists, combine the new value with the stored one using `+`
                list_preverb_semantics = pred2_preverb_semantics[id] + preverb_semantics 
                pred2_preverb_semantics[id] = list_preverb_semantics
            else:
                # If the ID doesn't exist, store the preverb semantics value directly
                pred2_preverb_semantics[id] = preverb_semantics

print(pred2_preverb_semantics)


{'1498Caes': uima_cas_StringArray(xmiID=None, elements=['out'], type=Type(name=uima.cas.StringArray)), '4296Caes': uima_cas_StringArray(xmiID=None, elements=['out'], type=Type(name=uima.cas.StringArray)), '4578Caes': uima_cas_StringArray(xmiID=None, elements=['under'], type=Type(name=uima.cas.StringArray)), '4854Caes': uima_cas_StringArray(xmiID=None, elements=['across'], type=Type(name=uima.cas.StringArray)), '4978Caes': uima_cas_StringArray(xmiID=None, elements=['out'], type=Type(name=uima.cas.StringArray)), '5366Caes': uima_cas_StringArray(xmiID=None, elements=['across'], type=Type(name=uima.cas.StringArray)), '5732Caes': uima_cas_StringArray(xmiID=None, elements=['together'], type=Type(name=uima.cas.StringArray)), '5997Caes': uima_cas_StringArray(xmiID=None, elements=['completely'], type=Type(name=uima.cas.StringArray)), '6831Caes': uima_cas_StringArray(xmiID=None, elements=['together'], type=Type(name=uima.cas.StringArray)), '7012Caes': uima_cas_StringArray(xmiID=None, elements=['

### Cleaning and Extracting Relevant Preverb Semantics

This code snippet demonstrates the cleaning process for the **preverb semantics** annotations retrieved in the previous step. The preverb semantics are stored as UIMA `StringArray` objects, which need to be processed to extract the relevant textual data.

- The script iterates through each file in `all_files` and selects the `SemPrev` annotation layer from the CAS.
- For each `SemPrev` annotation:
  - The preverb semantics (`relation.Preverbsemantics`) are retrieved.
  - The `StringArray` elements are **joined into a single string** (`preverb_semantics_str`), effectively removing any unwanted UIMA-specific wrapper characters (e.g., `uima_cas_StringArray`).
  
- The cleaned preverb semantics string is then appended to `list_preverb_semantics`, which now contains only the relevant semantic values (e.g., `"out"`).

The result is a clean list of preverb semantics, ready for inclusion in the final DataFrame or further analysis.


In [20]:
list_preverb_semantics = []  # Initializing an empty list to store preverb semantics

# Looping through each file in the list of all_files
for i in range(len(all_files)):
    file_input = all_files[i]  # Getting the current file name
    file_input_abbr = files_abbreviated[i]  # Getting the abbreviated name for the current file
    xml = get_xml(file_input)  # Retrieving XML content from the current file
    cas = load_cas(file_input)  # Loading CAS from the current file
    
    # Looping through each 'SemPrev' annotation in the CAS
    for relation in cas.select('webanno.custom.SemPrev'):
        preverb_semantics = relation.Preverbsemantics  # Getting the preverb semantics value
        # Append the preverb semantics to the list after removing unwanted characters
        preverb_semantics_str = ', '.join(preverb_semantics.elements)  # Joining the elements of StringArray
        list_preverb_semantics.append(preverb_semantics_str)  # Appending the preverb semantics to the list

print(list_preverb_semantics)


['out', 'out', 'under', 'across', 'out', 'across', 'together', 'completely', 'together', 'together', 'across', 'across', 'completely', 'completely', 'completely', 'across', 'completely', 'across', '(malefactive), to', 'out', 'across', 'across', 'to', '(malefactive), to', 'around', 'completely', 'together with', 'completely', 'completely', 'away', 'across', 'out', 'together', 'across', 'forth', 'across', 'across', 'away', 'against', '(malefactive), together', 'together', '(malefactive), together', 'under', 'across', '(malefactive), together', '(malefactive), together', 'across', 'across', 'away', 'around', 'together', 'onwards', 'completely', 'across', 'into', 'forward', '(idea of destruction/death), across', 'into', 'completely', 'into', 'together', 'to', 'around', 'across', 'across', '(malefactive), to', 'across', '(malefactive), to', 'across', 'across', 'around', 'across', 'onwards', 'together', 'away', 'completely', 'together', 'away', 'forth', 'completely', 'into', 'completely', 'a
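
The join step assumes that every `SemPrev` annotation carries a populated `Preverbsemantics` value. If some annotations might leave the feature empty (an assumption, since the output above does not show such a case), a defensive helper could look like the sketch below. `join_elements` is a hypothetical name, and `SimpleNamespace` merely stands in for an object with an `elements` attribute.

```python
from types import SimpleNamespace

def join_elements(array, separator=", "):
    """Join the elements of an array-like object; return '' if it is missing or empty."""
    elements = getattr(array, "elements", None)
    if not elements:
        return ""
    return separator.join(elements)

# Stand-ins for annotation values (not real cassis objects)
filled = SimpleNamespace(elements=["(malefactive)", "to"])
empty = None

print(join_elements(filled))
print(repr(join_elements(empty)))
```

Returning an empty string keeps the output list the same length as the number of annotations, which matters when the values are later paired with token IDs by position.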

### Extracting Token IDs for `Actionality` Annotations

This code snippet retrieves the token IDs associated with the custom annotation layer `Actionality`. Each token in this layer is identified by its character offset, which is combined with the file abbreviation to form a unique ID.

- The code loops through each file in `all_files`, retrieving the CAS (Common Analysis Structure) for each file.
- It then selects all `Actionality` annotations and iterates over the tokens covered by these annotations.
- For each token, the script extracts the **beginning offset** and creates a unique token ID by concatenating the offset with the abbreviated file name (`file_input_abbr`).
- The unique token IDs are stored in the `list_token_ids` list, which will be used to link token IDs with their corresponding annotations in later steps.

The result is a list of token IDs that represent the tokens covered by the `Actionality` annotations.
Note that any span layer used for **all verbal occurrences** can be used to get a list of token IDs.

In [21]:
list_token_ids = []  # Token IDs, one per annotated token

# Loop over each input file together with its abbreviation
for file_input, file_input_abbr in zip(all_files, files_abbreviated):
    cas = load_cas(file_input)  # Load the CAS for the current file

    # Loop over each 'Actionality' annotation in the CAS
    for relation in cas.select('webanno.custom.Actionality'):
        # Loop over the 'Actionality' annotations covered by the current one
        for token in cas.select_covered('webanno.custom.Actionality', relation):
            # Unique ID: beginning offset followed by the file abbreviation
            token_id = str(token.begin) + file_input_abbr
            list_token_ids.append(token_id)

print(list_token_ids)


['1498Caes', '4296Caes', '4578Caes', '4854Caes', '4978Caes', '5366Caes', '5732Caes', '5997Caes', '6831Caes', '7012Caes', '7293Caes', '8498Caes', '9426Caes', '9649Caes', '10407Caes', '10617Caes', '10862Caes', '10890Caes', '10931Caes', '11152Caes', '11818Caes', '12323Caes', '21567Caes', '24504Caes', '24513Caes', '25610Caes', '26036Caes', '26225Caes', '26268Caes', '26640Caes', '27386Caes', '27831Caes', '28359Caes', '29918Caes', '30767Caes', '31365Caes', '33967Caes', '34210Caes', '34348Caes', '36546Caes', '36804Caes', '37047Caes', '37157Caes', '37517Caes', '38986Caes', '41443Caes', '46744Caes', '46852Caes', '48088Caes', '50312Caes', '52574Caes', '53622Caes', '56162Caes', '56208Caes', '56243Caes', '56345Caes', '56583Caes', '58654Caes', '59161Caes', '60301Caes', '61670Caes', '63933Caes', '64997Caes', '65326Caes', '65391Caes', '65423Caes', '65528Caes', '66243Caes', '66324Caes', '66389Caes', '66411Caes', '66489Caes', '66567Caes', '66805Caes', '67219Caes', '67317Caes', '68853Caes', '69478Caes',
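
The ID scheme used above (offset followed by file abbreviation) can be sketched in isolation. The helper names `make_token_id` and `split_token_id` are hypothetical and not part of the notebook; the sketch also assumes that a file abbreviation never begins with a digit, so the leading digits can be read back as the offset.

```python
import re

def make_token_id(begin, file_abbr):
    """Build an ID such as '1498Caes' from a character offset and a file abbreviation."""
    return f"{begin}{file_abbr}"

def split_token_id(token_id):
    """Recover (offset, abbreviation) from an ID.

    Assumption: the abbreviation does not start with a digit.
    """
    match = re.fullmatch(r"(\d+)(\D.*)", token_id)
    if match is None:
        raise ValueError(f"not a valid token ID: {token_id!r}")
    return int(match.group(1)), match.group(2)

print(make_token_id(1498, "Caes"))
print(split_token_id("435595Virg"))
```

Because offsets are only unique within one file, the abbreviation suffix is what keeps IDs from the two texts apart.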

### Mapping Token IDs to Preverb Semantics

This code snippet creates a dictionary that maps the **token IDs** (extracted in the previous chunk) to their corresponding **preverb semantics** values. This mapping is done by pairing the token IDs from `list_token_ids` with the cleaned preverb semantics in `list_preverb_semantics`.

- A dictionary comprehension pairs `list_token_ids` and `list_preverb_semantics` by position, linking each token ID to its respective preverb semantics. The two lists must therefore have the same length and the same order.
- The resulting dictionary, `pred2_preverb_semantics`, contains the token IDs as keys and their corresponding preverb semantic values as values.

This mapping provides a direct association between each token and its preverb semantics, which will later be used to create a DataFrame.


In [22]:
pred2_preverb_semantics = {list_token_ids[i]: list_preverb_semantics[i] for i in range(len(list_token_ids))}
pred2_preverb_semantics

{'1498Caes': 'out',
 '4296Caes': 'out',
 '4578Caes': 'under',
 '4854Caes': 'across',
 '4978Caes': 'out',
 '5366Caes': 'across',
 '5732Caes': 'together',
 '5997Caes': 'completely',
 '6831Caes': 'together',
 '7012Caes': 'together',
 '7293Caes': 'across',
 '8498Caes': 'across',
 '9426Caes': 'completely',
 '9649Caes': 'completely',
 '10407Caes': 'completely',
 '10617Caes': 'across',
 '10862Caes': 'completely',
 '10890Caes': 'across',
 '10931Caes': '(malefactive), to',
 '11152Caes': 'out',
 '11818Caes': 'across',
 '12323Caes': 'across',
 '21567Caes': 'to',
 '24504Caes': '(malefactive), to',
 '24513Caes': 'around',
 '25610Caes': 'completely',
 '26036Caes': 'together with',
 '26225Caes': 'completely',
 '26268Caes': 'completely',
 '26640Caes': 'away',
 '27386Caes': 'across',
 '27831Caes': 'out',
 '28359Caes': 'together',
 '29918Caes': 'across',
 '30767Caes': 'forth',
 '31365Caes': 'across',
 '33967Caes': 'across',
 '34210Caes': 'away',
 '34348Caes': 'against',
 '36546Caes': '(malefactive), tog
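
Since the mapping relies on the two lists being aligned by position, a defensive variant may be useful. The following is only a sketch: `pair_ids_with_values` is a hypothetical helper, not part of the notebook. It checks the lengths before pairing and reports any repeated IDs, which a plain dictionary would silently overwrite.

```python
def pair_ids_with_values(ids, values):
    """Pair two position-aligned lists into a dict.

    Returns the mapping and a list of IDs that occurred more than once
    (for those, the last value wins, as in a dict comprehension).
    """
    if len(ids) != len(values):
        raise ValueError(f"length mismatch: {len(ids)} IDs vs {len(values)} values")
    mapping = {}
    duplicates = []
    for token_id, value in zip(ids, values):
        if token_id in mapping:
            duplicates.append(token_id)
        mapping[token_id] = value
    return mapping, duplicates

# Toy input taken from the first entries shown in the outputs above
mapping, duplicates = pair_ids_with_values(
    ["1498Caes", "4296Caes", "4578Caes"],
    ["out", "out", "under"],
)
print(mapping)
print(duplicates)
```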

### Creating the `preverb_semantics_df` DataFrame

In this chunk, the code creates a Pandas DataFrame, `preverb_semantics_df`, from the dictionary `pred2_preverb_semantics`, which maps token IDs to their respective preverb semantics.

- A list comprehension is used to convert the dictionary items into a list of tuples, where each tuple contains a token ID and its corresponding preverb semantics.
- The DataFrame is then constructed with two columns: `"ID"`, representing the token IDs, and `"PREVERB SEMANTICS"`, representing the corresponding preverb semantics for each token.

The resulting DataFrame provides a structured view of the token IDs alongside their preverb semantic annotations, which can be used for further analysis or integration with other data layers.


In [23]:
preverb_semantics_df = pd.DataFrame([(k,v) for k,v in pred2_preverb_semantics.items()], columns=["ID", "PREVERB SEMANTICS"]) #where 'preverb_semantics_df' is a dataframe containing token IDs and preverb semantics of the verb tokens
preverb_semantics_df

| | ID | PREVERB SEMANTICS |
|---|---|---|
| 0 | 1498Caes | out |
| 1 | 4296Caes | out |
| 2 | 4578Caes | under |
| 3 | 4854Caes | across |
| 4 | 4978Caes | out |
| ... | ... | ... |
| 499 | 435595Virg | forward |
| 500 | 436869Virg | under |
| 501 | 438653Virg | to |
| 502 | 440736Virg | under |


### Extracting Spatial Relations and Mapping Tokens to Spatial Relation IDs

This code snippet retrieves tokens annotated with the custom layer `Spatiality` and maps them to their corresponding spatial relation identifiers.

- The script loops through each file in `all_files` and selects the `Spatiality` annotation layer from the CAS.
- For each spatial relation annotation:
  - It identifies both the **dependent** and **governor** tokens and extracts their covered text.
  - A unique ID for each token is generated by combining the token's offset and the file abbreviation.
  - The dependent token's covered text is appended to `list_spatial_relations`.
  - The governor token's ID is appended to the list stored under the dependent token's ID in `tok2_spatial_relations`; if no entry exists yet, a new one-element list is created.

The result is a dictionary, `tok2_spatial_relations`, that maps each dependent token ID to the list of IDs of its governor tokens.


In [24]:
# Covered text of every dependent token
list_spatial_relations = []

# Maps each dependent token ID to the list of its governor token IDs
tok2_spatial_relations = dict()

# Loop over each input file together with its abbreviation
for file_input, file_input_abbr in zip(all_files, files_abbreviated):
    cas = load_cas(file_input)  # Load the CAS for the current file

    # Loop over each 'Spatiality' annotation in the CAS
    for relation in cas.select('webanno.custom.Spatiality'):
        dep = relation.Dependent  # Dependent token of the spatial relation
        dep_id = str(dep.begin) + file_input_abbr  # Unique ID: offset + file abbreviation
        list_spatial_relations.append(str(dep.get_covered_text()))  # Covered text of the dependent token

        gov = relation.Governor  # Governor token of the spatial relation
        gov_id = str(gov.begin) + file_input_abbr  # Unique ID: offset + file abbreviation

        # Append to the existing list for this dependent, or start a new one
        tok2_spatial_relations.setdefault(dep_id, []).append(gov_id)

print(tok2_spatial_relations)


{'1498Caes': ['1466Caes'], '4296Caes': ['4283Caes'], '4578Caes': ['4569Caes'], '4854Caes': ['4840Caes'], '4978Caes': ['4973Caes', '4962Caes'], '5732Caes': ['5712Caes'], '5997Caes': ['5989Caes'], '7012Caes': ['7002Caes'], '9426Caes': ['9374Caes'], '9649Caes': ['9643Caes'], '10407Caes': ['10389Caes'], '10617Caes': ['10576Caes'], '10862Caes': ['10855Caes'], '10890Caes': ['10883Caes'], '10931Caes': ['10902Caes'], '11152Caes': ['11147Caes'], '11818Caes': ['11811Caes'], '12323Caes': ['12316Caes'], '21567Caes': ['21563Caes'], '24504Caes': ['24490Caes', '24479Caes'], '25610Caes': ['25584Caes'], '26225Caes': ['26206Caes'], '26268Caes': ['26264Caes'], '26640Caes': ['26620Caes'], '27386Caes': ['27380Caes', '27357Caes'], '27831Caes': ['27826Caes'], '28359Caes': ['28339Caes'], '29918Caes': ['29911Caes'], '30767Caes': ['30758Caes'], '31365Caes': ['31358Caes'], '33967Caes': ['33960Caes'], '34210Caes': ['34199Caes'], '37157Caes': ['37146Caes'], '37517Caes': ['37510Caes'], '46744Caes': ['46737Caes'], '
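
The grouping step can be illustrated without a CAS. This is a minimal sketch on plain `(dependent ID, governor ID)` pairs, using `collections.defaultdict` in place of the explicit membership check; `group_governors` is a hypothetical name, and the sample pairs are taken from the output above.

```python
from collections import defaultdict

def group_governors(pairs):
    """Group governor IDs under their dependent ID, preserving input order."""
    grouped = defaultdict(list)
    for dependent_id, governor_id in pairs:
        grouped[dependent_id].append(governor_id)
    return dict(grouped)

pairs = [
    ("1498Caes", "1466Caes"),
    ("4978Caes", "4973Caes"),
    ("4978Caes", "4962Caes"),  # second relation for the same dependent
]
print(group_governors(pairs))
```

A dependent with several spatial relations, such as `4978Caes`, ends up with one list holding all of its governor IDs.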

### Creating the `tok2_spatial_relations_id_df` DataFrame

This chunk constructs a Pandas DataFrame, `tok2_spatial_relations_id_df`, which stores the mapping of token IDs to the IDs of other tokens annotated with spatial relations.

- The dictionary `tok2_spatial_relations` is converted into a list of tuples, with each tuple containing a token ID and a list of associated spatial relation IDs.
- The DataFrame is created with two columns: `"ID"`, representing the token IDs, and `"SPATIAL RELATION ID"`, representing the IDs of tokens related by the spatial relation.

The resulting DataFrame provides a structured view of token IDs and their associated spatial relation identifiers, which can be further analyzed or integrated with other data.


In [25]:
tok2_spatial_relations_id_df = pd.DataFrame([(k,v) for k,v in tok2_spatial_relations.items()], columns=["ID", "SPATIAL RELATION ID"]) #where 'tok2_spatial_relations_id_df' is a dataframe containing token IDs and IDs of all the tokens annotated with a Spatial relation
tok2_spatial_relations_id_df

| | ID | SPATIAL RELATION ID |
|---|---|---|
| 0 | 1498Caes | [1466Caes] |
| 1 | 4296Caes | [4283Caes] |
| 2 | 4578Caes | [4569Caes] |
| 3 | 4854Caes | [4840Caes] |
| 4 | 4978Caes | [4973Caes, 4962Caes] |
| ... | ... | ... |
| 265 | 424445Virg | [424413Virg] |
| 266 | 428401Virg | [428307Virg] |
| 267 | 436869Virg | [436880Virg] |
| 268 | 438653Virg | [438644Virg] |


### Mapping Spatial Relation IDs to Lemmas

In this code snippet, the script generates a dictionary, `pred2_spatial_relation_lemmas`, that maps predicate token IDs (from `tok2_spatial_relations`) to their associated spatial relation lemmas.

- The code iterates through the keys (predicate IDs) in `tok2_spatial_relations`.
- For each predicate ID, the script retrieves the list of associated spatial relation IDs.
- It then looks up the lemma for each spatial relation ID from the `pred2_lemmas` dictionary and appends it to a list.
- This list of lemmas is stored in the dictionary `pred2_spatial_relation_lemmas`, with the predicate ID as the key.

This results in a dictionary that links each predicate token ID to a list of lemmas for the spatial relations it is associated with.


In [28]:
pred2_spatial_relation_lemmas = dict()  # Maps each predicate ID to the lemmas of its spatial relations

# Iterate over each predicate ID and its list of spatial relation IDs
for pred_id, spatial_relation_ids in tok2_spatial_relations.items():
    # Look up the lemma of each spatial relation ID in the pred2_lemmas dictionary
    spatial_relation_lemmas = [pred2_lemmas[spatial_relation] for spatial_relation in spatial_relation_ids]
    # Store the list once all lemmas for the current predicate ID have been collected
    pred2_spatial_relation_lemmas[pred_id] = spatial_relation_lemmas

print(pred2_spatial_relation_lemmas)  


{'1498Caes': ['finis'], '4296Caes': ['finis'], '4578Caes': ['periculum'], '4854Caes': ['ager'], '4978Caes': ['domus', 'iter'], '5732Caes': ['ripa'], '5997Caes': ['Genava'], '7012Caes': ['provincia'], '9426Caes': ['finis'], '9649Caes': ['finis'], '10407Caes': ['Santoni'], '10617Caes': ['is'], '10862Caes': ['pars'], '10890Caes': ['flumen'], '10931Caes': ['is'], '11152Caes': ['domus'], '11818Caes': ['flumen'], '12323Caes': ['flumen'], '21567Caes': ['is'], '24504Caes': ['latus', 'noster'], '25610Caes': ['finis'], '26225Caes': ['eo'], '26268Caes': ['is'], '26640Caes': ['castrum'], '27386Caes': ['finis', 'finis'], '27831Caes': ['domus'], '28359Caes': ['Caesar'], '29918Caes': ['Rhenus'], '30767Caes': ['civitas'], '31365Caes': ['Rhenus'], '33967Caes': ['Rhenus'], '34210Caes': ['provincia'], '37157Caes': ['tectum'], '37517Caes': ['Rhenus'], '46744Caes': ['Rhenus'], '46852Caes': ['Rhenus'], '48088Caes': ['finis'], '53622Caes': ['castrum'], '56162Caes': ['locus', 'Rhenus'], '56345Caes': ['ea'], '
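
The lookup can be sketched on toy data. Note one deliberate difference from the cell above: the notebook indexes `pred2_lemmas` directly, which raises a `KeyError` for an ID without a lemma, whereas this sketch uses `dict.get` with a placeholder. The function name `lemmas_for_relations` and the `missing` parameter are hypothetical; the sample entries are taken from the outputs above.

```python
def lemmas_for_relations(tok2_relations, id2lemma, missing=None):
    """Replace each related token ID by its lemma; unknown IDs become `missing`."""
    return {
        pred_id: [id2lemma.get(related_id, missing) for related_id in related_ids]
        for pred_id, related_ids in tok2_relations.items()
    }

tok2_relations = {
    "1498Caes": ["1466Caes"],
    "4978Caes": ["4973Caes", "4962Caes"],
}
id2lemma = {"1466Caes": "finis", "4973Caes": "domus", "4962Caes": "iter"}

print(lemmas_for_relations(tok2_relations, id2lemma))
```

Whether a missing lemma should fail loudly or be replaced by a placeholder depends on how complete the lemma layer is expected to be.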

### Creating the `spatial_relation_lemma_df` DataFrame

The final chunk creates a DataFrame, `spatial_relation_lemma_df`, which contains the mapping of token IDs to their corresponding spatial relation lemmas.

- The dictionary `pred2_spatial_relation_lemmas` is converted into a list of tuples, where each tuple consists of a predicate token ID and the associated spatial relation lemmas.
- The DataFrame is then constructed with two columns: `"ID"`, representing the token IDs, and `"SPATIAL RELATION LEMMA"`, representing the lemmas corresponding to the spatial relations for each predicate.

This DataFrame provides a structured representation of the spatial relation lemmas associated with each predicate token, ready for further analysis or integration.


In [29]:
spatial_relation_lemma_df = pd.DataFrame([(k,v) for k,v in pred2_spatial_relation_lemmas.items()], columns=["ID", "SPATIAL RELATION LEMMA"]) #where 'spatial_relation_lemma_df' is a dataframe containing token IDs and the lemmas of all the tokens annotated with a Spatial relation
spatial_relation_lemma_df

| | ID | SPATIAL RELATION LEMMA |
|---|---|---|
| 0 | 1498Caes | [finis] |
| 1 | 4296Caes | [finis] |
| 2 | 4578Caes | [periculum] |
| 3 | 4854Caes | [ager] |
| 4 | 4978Caes | [domus, iter] |
| ... | ... | ... |
| 265 | 424445Virg | [Hyllus] |
| 266 | 428401Virg | [hic] |
| 267 | 436869Virg | [frater] |
| 268 | 438653Virg | [Iuturna] |
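
Because every layer is keyed by the same token ID, the per-layer dictionaries can be combined by ID. The following is a plain-Python sketch of that idea, not the notebook's own integration step; `combine_layers` is a hypothetical helper. It keeps every token that has preverb semantics and gives tokens without a spatial relation an empty list.

```python
def combine_layers(preverb_semantics, spatial_lemmas):
    """Combine two ID-keyed layers into one record per token ID."""
    return [
        {
            "ID": token_id,
            "PREVERB SEMANTICS": semantics,
            # Tokens without a spatial relation get an empty list
            "SPATIAL RELATION LEMMA": spatial_lemmas.get(token_id, []),
        }
        for token_id, semantics in preverb_semantics.items()
    ]

preverb_semantics = {"4978Caes": "out", "5366Caes": "across"}
spatial_lemmas = {"4978Caes": ["domus", "iter"]}

for record in combine_layers(preverb_semantics, spatial_lemmas):
    print(record)
```

In the sample, `5366Caes` has preverb semantics but no entry in the spatial relation mapping, so its lemma list stays empty.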
