# Animacy in German Folktales

This notebook contains the reproducible code examples and analyses for the paper *"Animacy in German Folktales"*, published in the proceedings of CHR 2024: Computational Humanities Research Conference, 2024, Aarhus, Denmark.

**Authors:** Julian Häußler, Janis von Keitz, Evelyn Gius

**Institution:** *fortext lab, Technical University of Darmstadt, Germany*

**Reference:** Häußler, J., von Keitz, J., Gius, E. (2024). *Animacy in German Folktales*. CHR 2024: Computational Humanities Research Conference, December 4 – 6, 2024, Aarhus, Denmark. https://ceur-ws.org/Vol-3834/paper90.pdf.

**GitHub Repository:** https://github.com/forTEXT/Animacy_in_German_Folktales

## Notebook 01: Preprocessing

In [1]:
# Import libraries

import pandas as pd
import os 
import json
import re
import stanza
from tqdm import tqdm
from sklearn.metrics import cohen_kappa_score

# Initialize Stanza Pipeline
nlp = stanza.Pipeline(lang='de', processors='tokenize,mwt,pos,lemma,depparse,ner')

2024-11-24 16:02:56 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json:   0%|   …

2024-11-24 16:02:56 INFO: Downloaded file to C:\Users\jvonk\stanza_resources\resources.json
2024-11-24 16:02:58 INFO: Loading these models for language: de (German):
| Processor | Package      |
----------------------------
| tokenize  | gsd          |
| mwt       | gsd          |
| pos       | gsd_charlm   |
| lemma     | gsd_nocharlm |
| depparse  | gsd_charlm   |
| ner       | germeval2014 |

2024-11-24 16:02:58 INFO: Using device: cpu
2024-11-24 16:02:58 INFO: Loading: tokenize
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-11-24 16:03:00 INFO: Loading: mwt
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-11-24 16:03:00 INFO: Loading: pos
  checkpoint = torch.load(filename, lambda storage, loc: storage)
  data = torch.load(self.filename, lambda storage, loc: storage)
  state = torch.load(filename, lambda storage, loc: storage)
2024-11-24 16:03:00 INFO: Loading: lemma
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2024

In [2]:
# Set input and output folder
input_folder = 'input/typical_animacy_corpus' # Input folder with TXT files
output_folder = 'intermediate/annotations_typical' # Output folder to store annotated texts as JSON files

### Annotations Dataframe
This step is based on a Catma query export. In the Analyze section of Catma, you can filter all your annotations for a specific tag (e.g. animate). The result then needs to be exported as *Flat as CSV*. Place the export in the input folder and rename it to ```catma_export.csv```. After that, you can continue with the following code.

In [3]:
# Read the catma export output as dataframe 
catma_export_path = f'{input_folder}/catma_export.csv'
annotations_df = pd.read_csv(catma_export_path, sep = ';')

# Select the essential columns from the catma export file
annotations_df = annotations_df.iloc[:, [2, 3, 4, 6, 7]]

# Name these columns
annotations_df.columns = ['title', 'chars', 'annotation', 'start', 'end']

# Save df as CSV (create the output folder first if it does not exist)
os.makedirs(output_folder, exist_ok=True)
annotations_df_path = f'{output_folder}/annotations.csv'
annotations_df.to_csv(annotations_df_path, index=False)


To match each TXT file with the correct entries in the Dataframe, the Dataframe needs a column that contains an identifier for the TXT file. In our case, we used the following code to split the ```title``` column into an ```id``` and a ```name``` column, since the names we used for the texts in Catma consist of a number and the name of the story. The TXT files are named according to these numbers.

For example: We named a story *171. Der Zaunkönig* in Catma, so this is the ```title``` in the Dataframe. We split the column so that the ID of the story is *171*. The matching TXT file is named ```171.txt```.

In [4]:
# Optional: Only use to split ID and name of story if your titles are in the format as mentioned above

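# Split column title in columns id and name (split only at the first period)
annotations_df[['id', 'name']] = annotations_df['title'].str.split('.', n=1, expand=True)

# Convert the ID to int so it can be compared with the numeric TXT file names, and trim the name
annotations_df['id'] = annotations_df['id'].astype(int)
annotations_df['name'] = annotations_df['name'].str.strip()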

# Delete column 'title'
annotations_df.drop('title', axis=1, inplace=True)

If you do not have an ID but you have unambiguous titles in Catma, you can also use the ```title``` column to identify the TXT files. Just update the code below accordingly.

In [5]:
# Define the name of the column that you want to identify your txt files with
identifier = 'id' # Change to 'title' or other to use this column to identify txt files

### Find Annotations

Sometimes the indices from Catma do not match the TXT files exactly. This function searches for the correct annotation starting from the index given by Catma, first backwards and then forwards.

In [6]:
# Function to find an annotation in the text, searching backwards first and then forwards (in case of an offset between Catma annotation indices and the TXT file)
def find_annotation(text: str, annotation: str, start_index: int) -> tuple:
    lower_text = text.lower()
    lower_annotation = annotation.lower()
    annotation_length = len(lower_annotation)

    # Search backwards
    for i in range(start_index, -1, -1):
        if lower_text[i:i + annotation_length] == lower_annotation:
            return i, i + annotation_length

    # Search forward
    for i in range(start_index, len(lower_text) - annotation_length + 1):
        if lower_text[i:i + annotation_length] == lower_annotation:
            return i, i + annotation_length

    # If not found
    return -1, -1
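
The same backwards-then-forwards search can also be expressed with Python's built-in `str.rfind` and `str.find`. The following is a minimal sketch and not part of the original pipeline; the name `find_annotation_builtin` and the example sentence are assumptions made for illustration. For non-negative start indices it should return the same positions as the function above.

```python
def find_annotation_builtin(text: str, annotation: str, start_index: int) -> tuple:
    # Hypothetical helper: same search order as find_annotation, using str methods
    lower_text = text.lower()
    lower_annotation = annotation.lower()
    length = len(lower_annotation)

    # Backwards: last match that starts at or before start_index
    pos = lower_text.rfind(lower_annotation, 0, start_index + length)
    if pos == -1:
        # Forwards: first match that starts at or after start_index
        pos = lower_text.find(lower_annotation, start_index)

    return (pos, pos + length) if pos != -1 else (-1, -1)


text = "Der Zaunkönig sprach"
# An index of 6 is two characters too large; the annotation starts at 4
print(find_annotation_builtin(text, "Zaunkönig", 6))
# An index of 2 is too small, so the forward search applies
print(find_annotation_builtin(text, "Zaunkönig", 2))
```

Both calls report the span from index 4 to index 13, which is where the word actually sits in the example sentence.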

In [7]:
# Check if annotations in Dataframe match indices

# Iterate over all TXT files in input folder
for filename in os.listdir(input_folder):
    file_path = os.path.join(input_folder, filename)

    # Skip all files that are not txt
    if not filename.endswith('.txt'): continue

    # Read text from file
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    
    # Filter entries for the current story by ID
    txt_id = int(filename.replace('.txt', '')) # ID of the current story
    filtered_df = annotations_df[annotations_df[identifier] == txt_id]
    
    # Check if the Dataframe has entries for the current text
    if filtered_df.empty: continue
    
    # Iterate over every annotation in the current story
    for index, row in filtered_df.iterrows():
        start_index = int(row['start'])
        end_index = int(row['end'])
        annotation_actual = text[start_index:end_index].strip()
        annotation_check = row['annotation']

        # Check if the annotation extracted via start and end index matches the annotation from the table
        if annotation_check != annotation_actual: 
            
            # raise ValueError(f'The annotations {annotation_actual} and {annotation_check} do not match.')
            start_index, end_index = find_annotation(text, annotation_check, start_index)

            # Warn if the annotation could not be found (indices are then set to -1)
            if start_index == -1:
                print(f'Annotation "{annotation_check}" not found in {filename}')

            # Update indices in annotations_df
            annotations_df.at[index, 'start'] = start_index
            annotations_df.at[index, 'end'] = end_index

# Save df as CSV
annotations_df_path = f'{output_folder}/annotations.csv'
annotations_df.to_csv(annotations_df_path, index=False)

### Create Annotation Files

In this step, we create the annotation files from the indices in ```annotations_df```. To do so, the indices of all annotations are looked up in the TXT files. For each text, we also extract tokens, lemmas, POS tags, dependency relations, and NER tags. Everything is collected in a Dataframe for each text and then stored as a JSON file.

In [8]:
# Load annotation df 
annotations_df_path = f'{output_folder}/annotations.csv'
annotations_df = pd.read_csv(annotations_df_path)

# Create output folder if it does not already exist
os.makedirs(output_folder, exist_ok=True)

# Iterate over all TXT files in the input folder
for filename in tqdm(os.listdir(input_folder)):
    file_path = os.path.join(input_folder, filename)

    # Skip all files that are not txt
    if not filename.endswith('.txt'): continue

    # Filter entries for the current story by ID
    txt_id = int(filename.replace('.txt', '')) # ID of the current story
    filtered_df = annotations_df[annotations_df[identifier] == txt_id]
    
    # Check if annotations exist for the current ID
    if filtered_df.empty:
        continue
        
    # Read text from txt file
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()

    # Create NLP Object and get tokens, lemmas, and further tags
    doc = nlp(text)
    
    tokens = []
    lemmas = []
    pos_tags = []
    dep_tags = []
    ner_tags = []
    start_chars = []
    end_chars = []

    for sent in doc.sentences:
        for token in sent.tokens:
            # Information from token object
            tokens.append(token.text)
            ner_tags.append(token.ner)
            start_chars.append(token.start_char)
            end_chars.append(token.end_char)
            # Information from word object
            lemmas.append(token.words[0].lemma)
            pos_tags.append(token.words[0].upos)
            dep_tags.append(token.words[0].deprel)


    # Create Dataframe with all information for the current text

    df_text_annotated = pd.DataFrame(columns=['tokens','animate_tags','lemmas','pos_tags','dep_tags','ner_tags','start_chars','end_chars'])

    df_text_annotated['tokens']=tokens
    df_text_annotated['lemmas']=lemmas
    df_text_annotated['pos_tags']=pos_tags
    df_text_annotated['dep_tags']=dep_tags
    df_text_annotated['ner_tags']=ner_tags
    df_text_annotated['start_chars']=start_chars
    df_text_annotated['end_chars']=end_chars
    

    # Annotate text
    for index,row in df_text_annotated.iterrows():
        is_animate = False 
        start_pos = row['start_chars']
        end_pos = row['end_chars']
        for ann_start, ann_end in zip(filtered_df['start'], filtered_df['end']):
            if start_pos >= ann_start and end_pos <= ann_end:
                is_animate = True
                break
        annotation_type = 'animate' if is_animate else 'inanimate'
        df_text_annotated.loc[index, 'animate_tags'] = annotation_type

    # Save df with annotations as JSON
    target_path_annotations = os.path.join(output_folder, filename.replace('.txt', '_annotations.json'))
    with open(target_path_annotations, 'w', encoding='utf-8') as json_file:
        df_text_annotated.to_json(json_file, force_ascii=False)

100%|██████████| 7/7 [02:00<00:00, 17.19s/it]
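
The tagging rule used in the cell above (a token counts as animate only if its character span lies completely inside an annotation span) can be illustrated without running Stanza. This is a minimal sketch under stated assumptions: the function name `tag_tokens`, the example sentence, and the hand-written character spans are hypothetical and only stand in for the values that Stanza and the Catma export would provide.

```python
def tag_tokens(token_spans, annotation_spans):
    """Tag a token 'animate' if it lies fully inside any annotation span."""
    tags = []
    for tok_start, tok_end in token_spans:
        inside = any(
            ann_start <= tok_start and tok_end <= ann_end
            for ann_start, ann_end in annotation_spans
        )
        tags.append('animate' if inside else 'inanimate')
    return tags


# Hypothetical example: "Der Zaunkönig sprach"
token_spans = [(0, 3), (4, 13), (14, 20)]  # Der, Zaunkönig, sprach
annotation_spans = [(0, 13)]               # "Der Zaunkönig" annotated as animate

print(tag_tokens(token_spans, annotation_spans))
# ['animate', 'animate', 'inanimate']
```

A token that only partly overlaps an annotation is tagged inanimate under this rule, which is why correct start and end indices matter in the previous step.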


### Inter Annotator Agreement

In the following code, we test the inter-annotator agreement (IAA) on two texts using Cohen's kappa coefficient. For this, we first load the tags of the two annotations of each text into separate lists. We then calculate Cohen's kappa for each text and finally the average over both texts.

In [9]:
# Set input folder with IAA Annotations
input_iaa = 'intermediate/annotations_iaa'

In [10]:
# Function to load tags from JSON file as list
# (assumes a JSON list whose entries hold the tag at position 1)
def load_tags(filepath):
    with open(filepath, encoding='utf-8') as f:
        data = json.load(f)
    return [token[1] for token in data]

In [11]:
# Load tags for each annotation of both texts as lists
tags_6_A1 = load_tags(f'{input_iaa}/A1/6_annotations.json')
tags_6_A2 = load_tags(f'{input_iaa}/A2/6_annotations.json')
tags_10_A1 = load_tags(f'{input_iaa}/A1/10_annotations.json')
tags_10_A2 = load_tags(f'{input_iaa}/A2/10_annotations.json')

In [12]:
# Calculate Cohen's Kappa for both texts
kappa_6 = cohen_kappa_score(tags_6_A1, tags_6_A2)
kappa_10 = cohen_kappa_score(tags_10_A1, tags_10_A2)

# Print individual and Average Kappa Scores
print(f"Kappa for text 6: {kappa_6}")
print(f"Kappa for text 10: {kappa_10}")
print(f"Average Kappa: {(kappa_6 + kappa_10) / 2}")

Kappa for text 6: 0.8839836032587045
Kappa for text 10: 0.8555919798734466
Average Kappa: 0.8697877915660756
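
To make the scores above easier to interpret, the following sketch computes Cohen's kappa by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The function name `cohen_kappa` and the two short tag lists are hypothetical toy data, not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two equally long lists of labels."""
    n = len(tags_a)
    # Observed agreement: share of positions with identical labels
    p_o = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement: product of both annotators' label frequencies, summed over labels
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations (a = animate, i = inanimate)
a1 = ['a', 'a', 'i', 'i', 'a', 'i', 'i', 'i']
a2 = ['a', 'i', 'i', 'i', 'a', 'i', 'a', 'i']

print(round(cohen_kappa(a1, a2), 3))  # 0.467
```

Here the annotators agree on 6 of 8 tokens (p_o = 0.75), but because chance agreement is already about 0.53, kappa is noticeably lower than the raw agreement.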
