# Task 4: Visualizing Annotations with CAS Visualizer

In this notebook, we integrate the CAS Visualizer module to visually explore learner corpus annotations.

You will learn how to:
- Load an XMI file with the dakoda typesystem
- Use the CAS Visualizer to highlight and display annotation layers (e.g., tokens, POS tags)
- Customize which annotation types to show and their colors
- Compare learner text annotations and target hypotheses visually

This notebook requires the `visualizer.py` module from the CAS Visualizer GitHub repository.
Make sure `visualizer.py` is accessible in your notebook's working directory or Python path.
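
One way to make the module importable without hard-coding a machine-specific path is to add a directory to `sys.path` only when it actually contains `visualizer.py`. The helper below is a minimal sketch: the name `add_module_dir` and the example directory are illustrative assumptions, not part of the CAS Visualizer repository.

```python
import sys
from pathlib import Path


def add_module_dir(directory, module_file="visualizer.py"):
    """Append `directory` to sys.path if it contains `module_file`.

    Returns True when the module file was found, False otherwise.
    """
    directory = Path(directory).expanduser().resolve()
    if not (directory / module_file).is_file():
        return False
    path_str = str(directory)
    if path_str not in sys.path:
        sys.path.append(path_str)
    return True


# Hypothetical location of a local clone; adjust to your setup.
found = add_module_dir(Path.home() / "cas-visualizer" / "cas_visualizer")
print("visualizer.py found:", found)
```

If the function returns `False`, the import in the next cell will fail, so checking the result first gives a clearer error than a bare `ModuleNotFoundError`.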


In [1]:
import sys
sys.path.append(r"C:\Users\Vedang Deshmukh\cas-visualizer\cas_visualizer")

import os
import cassis
from visualizer import SpacySpanVisualiser  # local module from the CAS Visualizer repository
from IPython.display import display, HTML  # importing these from IPython.core.display is deprecated

# Change directory to project root if needed
os.chdir(r"C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes")
print("Working directory:", os.getcwd())


Working directory: C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes




### Load typesystem and example XMI file

We will load the typesystem XML and an example learner XMI file.


In [2]:
TYPESYSTEM_PATH = r"data\dakoda_typesystem.xml"
XMI_FILE = r"data\CDLK\learner_xmi\201006ZW063.xmi"

with open(TYPESYSTEM_PATH, "rb") as f:
    typesystem = cassis.load_typesystem(f)

with open(XMI_FILE, "rb") as f:
    cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)

print(f"Loaded XMI file: {XMI_FILE}")
print("Available views (sofa IDs):", [sofa.sofaID for sofa in cas.sofas])




Loaded XMI file: data\CDLK\learner_xmi\201006ZW063.xmi
Available views (sofa IDs): ['_InitialView', 'ctok', 'ctok_linbomb', 'mixtral_th1', 'transgec_th1', 'mbartgec_th1', 'spacy_ctok', 'spacy_ctok_linbomb', 'spacy_mixtral_th1', 'spacy_transgec_th1', 'spacy_mbartgec_th1']


### Select the annotation view to visualize

We will use the `'ctok'` view containing the learner tokens and annotations.

In [3]:
view_name = 'ctok'
view_cas = cas.get_view(view_name)
print(f"Using view: {view_name}")
print(f"Document length: {len(view_cas.sofa_string)} characters")


Using view: ctok
Document length: 1195 characters


### Visualize tokens and POS annotations with highlighting

We build inline HTML that highlights each token with a background color chosen by its POS tag.

In [4]:
from IPython.display import display, HTML

# Wrap token text in a colored <span> (adapted from a Streamlit example)
def spanWrapper(color, token_text):
    return f'<span style="background-color: {color}; border-radius: 5px; padding: 2px 6px;">{token_text}</span>'

# Filter tokens to exclude metadata tokens
def is_normal_token(token, sofa):
    text = sofa[token.begin:token.end]
    return not (text.startswith('<DF-') or text.startswith('<TM-'))

# Index POS annotations by span so each token needs only one lookup
pos_by_span = {
    (pos.begin, pos.end): pos.PosValue
    for pos in view_cas.select('de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS')
}

# Prepare the token list with annotation labels (here: POS tags)
tokens_with_pos = []
for token in view_cas.select('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token'):
    if not is_normal_token(token, view_cas.sofa_string):
        continue
    pos_tag = pos_by_span.get((token.begin, token.end))
    tokens_with_pos.append((token.get_covered_text(), pos_tag))

# Define colors for POS tags (simplified example)
pos_colors = {
    "NN": "lightgreen",
    "VVFIN": "lightblue",
    "PPER": "orange",
    "PUNCT": "lightgray",
    # Add more POS tags/colors as needed
}

# Build the inline HTML with color spans
html_tokens = []
for text, pos_tag in tokens_with_pos:
    color = pos_colors.get(pos_tag, "white")  # default no color
    if color == "white":
        html_tokens.append(text)
    else:
        html_tokens.append(spanWrapper(color, text))

html_text = " ".join(html_tokens)

# Display in notebook
display(HTML(html_text))
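
The cell above inserts token text into HTML unchanged. Any token containing `<`, `>` or `&` that is not caught by the metadata filter would then be read as markup. A possible variant of the wrapper that escapes the text first is sketched below. The name `span_wrapper_escaped` is an assumption for illustration, and unhighlighted tokens would need the same `escape` call before being joined.

```python
from html import escape


def span_wrapper_escaped(color, token_text):
    """Wrap a token in a colored <span>, escaping HTML special characters."""
    safe_text = escape(token_text)
    return (
        f'<span style="background-color: {color}; '
        f'border-radius: 5px; padding: 2px 6px;">{safe_text}</span>'
    )


# A token containing '<' is rendered literally instead of opening a tag.
print(span_wrapper_escaped("lightgreen", "a<b"))
```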


### Interactive Corpus and File Selector with POS Highlighting

Use the dropdown menus below to select a corpus and an XMI file from that corpus.

The text of the selected file will be displayed inline with tokens color-highlighted according to their Part-of-Speech (POS) tags.

**Legend for POS highlights:**

- **NN**: common noun (STTS)
- **VVFIN**: finite full verb (STTS)
- **PPER**: personal pronoun (STTS)
- **PUNCT**: punctuation (a Universal Dependencies tag; STTS itself marks punctuation with `$.`, `$,` and `$(`)

You can add more POS tags and colors by modifying the `pos_colors` dictionary in the code.
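
Instead of extending `pos_colors` by hand, colors could be assigned to whatever tags occur in a file, and the legend generated from the same mapping. The sketch below assumes an arbitrary palette of CSS color names. `assign_colors` and `legend_html` are hypothetical helpers, not functions from the notebook or from CAS Visualizer.

```python
from itertools import cycle

# Illustrative palette of CSS color names; any valid CSS colors would do.
PALETTE = ["lightgreen", "lightblue", "orange", "lightgray", "khaki", "plum"]


def assign_colors(tags, palette=PALETTE):
    """Map each distinct tag to a palette color, reusing colors if needed."""
    colors = cycle(palette)
    return {tag: next(colors) for tag in sorted(set(tags))}


def legend_html(tag_colors, descriptions=None):
    """Build one legend <span> per tag; missing descriptions are omitted."""
    descriptions = descriptions or {}
    parts = []
    for tag, color in tag_colors.items():
        label = tag
        if tag in descriptions:
            label = f"{tag}: {descriptions[tag]}"
        parts.append(
            f'<span style="background-color:{color}; padding:4px 8px; '
            f'margin-right:10px; border-radius:4px;">{label}</span>'
        )
    return "".join(parts)


tag_colors = assign_colors(["NN", "VVFIN", "PPER", "NN", "ART"])
print(tag_colors)
```

Because tags are sorted before colors are assigned, the same set of tags always receives the same colors. With more tags than palette entries, colors repeat.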


In [5]:
from IPython.display import display, HTML
import ipywidgets as widgets
import os
import cassis

# Paths
CDLK_FOLDER = r"C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes\data\CDLK\learner_xmi"
KLP1_FOLDER = r"C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes\data\KLP1\learner_xmi"
TYPESYSTEM_PATH = r"C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes\data\dakoda_typesystem.xml"

# Helper functions
def load_cas(xmi_file, typesystem_path, view_name='ctok'):
    with open(typesystem_path, 'rb') as f:
        typesystem = cassis.load_typesystem(f)
    with open(xmi_file, 'rb') as f:
        cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)
    if view_name not in [sofa.sofaID for sofa in cas.sofas]:
        raise ValueError(f"View {view_name} not found in {xmi_file}")
    return cas.get_view(view_name)

def spanWrapper(color, token_text):
    return f'<span style="background-color: {color}; border-radius: 5px; padding: 2px 6px;">{token_text}</span>'

def is_normal_token(token, sofa):
    text = sofa[token.begin:token.end]
    return not (text.startswith('<DF-') or text.startswith('<TM-'))

# POS tags with colors and full names (NN, VVFIN, PPER follow the STTS tagset)
pos_colors = {
    "NN": "lightgreen",       # Common noun
    "VVFIN": "lightblue",     # Finite full verb
    "PPER": "orange",         # Personal pronoun
    "PUNCT": "lightgray",     # Punctuation
}

pos_full_names = {
    "NN": "Common noun",
    "VVFIN": "Finite full verb",
    "PPER": "Personal pronoun",
    "PUNCT": "Punctuation",
}

def highlight_pos_inline(view_cas):
    # Index POS annotations by span so each token needs only one lookup
    pos_by_span = {
        (pos.begin, pos.end): pos.PosValue
        for pos in view_cas.select('de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS')
    }
    tokens_with_pos = []
    for token in view_cas.select('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token'):
        if not is_normal_token(token, view_cas.sofa_string):
            continue
        pos_tag = pos_by_span.get((token.begin, token.end))
        tokens_with_pos.append((token.get_covered_text(), pos_tag))

    html_tokens = []
    for text, pos_tag in tokens_with_pos:
        color = pos_colors.get(pos_tag, "white")  # default no color
        if color == "white":
            html_tokens.append(text)
        else:
            html_tokens.append(spanWrapper(color, text))

    html_text = " ".join(html_tokens)
    return html_text

# Widgets setup
corpus_folders = {
    'CDLK': CDLK_FOLDER,
    'KLP1': KLP1_FOLDER
}

corpus_selector = widgets.Dropdown(
    options=list(corpus_folders.keys()),
    value='CDLK',
    description='Corpus:',
)

file_selector = widgets.Dropdown(
    options=[],
    description='File:',
    layout=widgets.Layout(width='80%')
)

output = widgets.Output()

def update_file_list(change):
    folder = corpus_folders[change['new']]
    files = sorted([f for f in os.listdir(folder) if f.endswith('.xmi')])
    file_selector.options = files
    file_selector.value = files[0] if files else None

corpus_selector.observe(update_file_list, names='value')
update_file_list({'new': corpus_selector.value})  # initialize

def display_highlighted_text(corpus, filename):
    output.clear_output()
    if filename is None:
        return
    folder = corpus_folders[corpus]
    xmi_path = os.path.join(folder, filename)
    try:
        view_cas = load_cas(xmi_path, TYPESYSTEM_PATH, 'ctok')
        html = highlight_pos_inline(view_cas)
        with output:
            display(HTML(html))
            print("\nLegend for POS highlights:")
            for pos, color in pos_colors.items():
                label = pos_full_names.get(pos, "")  # tolerate tags that have no description
                display(HTML(f'<span style="background-color:{color}; padding:4px 8px; margin-right:10px; border-radius:4px;">{pos}: {label}</span>'))
    except Exception as e:
        with output:
            print(f"Error loading file: {e}")

widgets.interact(display_highlighted_text, corpus=corpus_selector, filename=file_selector)
display(output)


interactive(children=(Dropdown(description='Corpus:', options=('CDLK', 'KLP1'), value='CDLK'), Dropdown(descri…

Output()
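
The callback above re-reads the typesystem and the XMI file on every selection change. One possible way to avoid repeated parsing is to cache loader results by path. The sketch below shows the idea with a stand-in loader so that it runs without any corpus files. `make_cached_loader` and `slow_load` are hypothetical names, and in the notebook the wrapped function would be the one that calls `cassis`.

```python
from functools import lru_cache


def make_cached_loader(load_fn, maxsize=32):
    """Return a version of `load_fn` that remembers results per argument."""
    return lru_cache(maxsize=maxsize)(load_fn)


# Stand-in for an expensive loader such as reading the typesystem XML.
calls = []


def slow_load(path):
    calls.append(path)
    return f"contents of {path}"


cached_load = make_cached_loader(slow_load)
cached_load("data/dakoda_typesystem.xml")
cached_load("data/dakoda_typesystem.xml")  # served from the cache
print("loader ran", len(calls), "time(s)")
```

A cached loader returns the same object on every call, so this only suits objects that are read and never modified, such as a typesystem that is loaded once and reused.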

## List and count annotation types in a corpus view

To understand what types of annotations are present in a corpus file, we list all annotation types found in a specific view, along with their counts.

This helps identify which linguistic features or learner error annotations are available for further exploration.


In [6]:
from collections import Counter

def list_annotation_types(view_cas):
    types_in_cas = [ann.type.name for ann in view_cas.select_all()]
    counts = Counter(types_in_cas)
    # Sort by count descending
    sorted_counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    return sorted_counts

# Example usage: list annotations in 'ctok' view of current CAS

annotation_counts = list_annotation_types(view_cas)
print("Annotation types and counts in 'ctok' view:")
for ann_type, count in annotation_counts:
    print(f"{ann_type}: {count}")


Annotation types and counts in 'ctok' view:
de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS: 198
custom.UniversalDependencies.POS: 198
de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma: 198
custom.syntaxdot.topofield: 198
org.dakoda.syntax.UDependency: 198
de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token: 198
de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence: 16
org.dakoda.Stage: 14
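
The fully qualified type names make this listing hard to scan. A small sketch of shortening them for display follows. It keeps the full name next to the short one because, as the output above shows, two different types end in `POS`. The helper names are assumptions for illustration.

```python
def short_type_name(full_name):
    """Return the part of a dotted type name after the last dot."""
    return full_name.rsplit(".", 1)[-1]


def shorten_counts(counts):
    """Convert (full_name, count) pairs to (short_name, full_name, count)."""
    return [(short_type_name(name), name, count) for name, count in counts]


# A few entries copied from the listing above.
sample = [
    ("de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS", 198),
    ("custom.UniversalDependencies.POS", 198),
    ("org.dakoda.Stage", 14),
]
for short, full, count in shorten_counts(sample):
    print(f"{short:<8} {count:>5}  ({full})")
```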


## Extract and explore additional key annotations

Besides tokens and POS tags, learner corpora often contain other useful annotations such as:

- **Lemmas:** base forms of words  
- **Dependency relations:** syntactic dependencies between tokens  
- **Topological fields:** sentence structure info (specific to syntaxdot parser)  
- **Stage:** annotation stage information  

In this step, we will extract some of these annotations and display sample values for exploration.


In [7]:
def extract_and_display_annotations(view_cas):
    print("Sample Lemmas:")
    lemmas = view_cas.select('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma')
    for lemma in list(lemmas)[:10]:
        print(f"Text span: '{view_cas.sofa_string[lemma.begin:lemma.end]}', Lemma: '{lemma.value}'")

    print("\nSample Dependency Relations (UDependency):")
    deps = list(view_cas.select('org.dakoda.syntax.UDependency'))
    for dep in deps[:10]:
        label = getattr(dep, 'DependencyType', 'N/A')
        gov = getattr(dep, 'Governor', None)
        depd = getattr(dep, 'Dependent', None)

        # Get governor covered text span
        if gov:
            gov_text = gov.get_covered_text() if hasattr(gov, 'get_covered_text') else str(gov)
        else:
            gov_text = "N/A"

        # Get dependent covered text span
        if depd:
            dep_text = depd.get_covered_text() if hasattr(depd, 'get_covered_text') else str(depd)
        else:
            dep_text = "N/A"

        print(f"Dependency: '{label}', Governor text: '{gov_text}', Dependent text: '{dep_text}'")

    print("\nSample Topological Fields:")
    topo_fields = view_cas.select('custom.syntaxdot.topofield')
    for tf in list(topo_fields)[:10]:
        print(f"Topological Field: '{tf.value}'")

    print("\nSample Stages:")
    stages = view_cas.select('org.dakoda.Stage')
    for stage in list(stages)[:5]:
        try:
            val = stage.value('value')
        except Exception as e:
            val = f"<Error getting value: {e}>"
        print(f"Stage: '{val}'")

# Run the function
extract_and_display_annotations(view_cas)


Sample Lemmas:
Text span: '<DF-8-47.74><TM-2>', Lemma: '<DF-8-47.74><TM-2>'
Text span: 'Zusammenfassung', Lemma: 'Zusammenfassung'
Text span: 'des', Lemma: 'd'
Text span: 'Artikel', Lemma: 'Artikel'
Text span: '„', Lemma: '„'
Text span: 'Jugendherberge', Lemma: 'Jugendherberge'
Text span: '–', Lemma: '–'
Text span: 'ein', Lemma: 'e'
Text span: '‚', Lemma: '‚'
Text span: 'Hotel', Lemma: 'Hotel'

Sample Dependency Relations (UDependency):
Dependency: 'punct', Governor text: 'bechreibt', Dependent text: '.'
Dependency: 'root', Governor text: 'bechreibt', Dependent text: 'bechreibt'
Dependency: 'root', Governor text: '<DF-8-47.74><TM-2>', Dependent text: '<DF-8-47.74><TM-2>'
Dependency: 'punct', Governor text: 'Jugendherberge', Dependent text: '“'
Dependency: 'amod', Governor text: 'Person', Dependent text: 'unbekannten'
Dependency: 'det', Governor text: 'Artikel', Dependent text: 'des'
Dependency: 'det', Governor text: 'Person', Dependent text: 'ein'
Dependency: 'punct', Governor text: 'Z

## Compare Learner Text with Target Hypothesis Views

The learner corpora provide multiple annotation views:

- **`ctok`**: The original learner-produced text with tokens and annotations  
- **Target Hypothesis (TH) views** such as `transgec_th1` and `mixtral_th1`: automatically generated corrected versions (target hypotheses) of the learner's text

In this step, we will:
- Load both learner and target hypothesis views from the same XMI file  
- Explore differences in token counts  
- Investigate token alignments using `org.dakoda.crosslayermap` annotations  
- Understand how tokens correspond across views


In [8]:
def compare_learner_and_th_views(xmi_file, typesystem_path, learner_view='ctok', th_view='transgec_th1'):
    with open(typesystem_path, 'rb') as f:
        typesystem = cassis.load_typesystem(f)
    with open(xmi_file, 'rb') as f:
        cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)
    
    views = [sofa.sofaID for sofa in cas.sofas]
    if learner_view not in views or th_view not in views:
        raise ValueError(f"One or both views {learner_view}, {th_view} not found in XMI.")
    
    learner_cas = cas.get_view(learner_view)
    th_cas = cas.get_view(th_view)
    
    # Token counts
    token_type = typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token')
    learner_tokens = list(learner_cas.select(token_type.name))
    th_tokens = list(th_cas.select(token_type.name))
    
    print(f"Tokens in learner view '{learner_view}': {len(learner_tokens)}")
    print(f"Tokens in target hypothesis view '{th_view}': {len(th_tokens)}")
    
    # Cross-layer mappings
    crossmap_type = typesystem.get_type('org.dakoda.crosslayermap')
    cross_maps = list(th_cas.select(crossmap_type.name))
    print(f"Number of cross-layer token mappings in TH view: {len(cross_maps)}")
    
    # Show example cross-layer mappings
    print("\nExample cross-layer token mappings (TH token span => learner token):")
    for i, cm in enumerate(cross_maps[:10]):
        print(f"TH token span: ({cm.begin}, {cm.end}), refers to learner token index: {cm.index}, sentence nth: {cm.nth}")


In [9]:
example_file = r"data\CDLK\learner_xmi\201006ZW063.xmi"
compare_learner_and_th_views(example_file, TYPESYSTEM_PATH)


Tokens in learner view 'ctok': 198
Tokens in target hypothesis view 'transgec_th1': 199
Number of cross-layer token mappings in TH view: 199

Example cross-layer token mappings (TH token span => learner token):
TH token span: (0, 15), refers to learner token index: 2_1, sentence nth: 1
TH token span: (16, 19), refers to learner token index: 2_2, sentence nth: 2
TH token span: (20, 28), refers to learner token index: 2_3, sentence nth: 3
TH token span: (29, 30), refers to learner token index: 2_4, sentence nth: 4
TH token span: (31, 45), refers to learner token index: 2_5, sentence nth: 5
TH token span: (46, 47), refers to learner token index: 2_6, sentence nth: 6
TH token span: (48, 51), refers to learner token index: 2_7, sentence nth: 7
TH token span: (52, 53), refers to learner token index: 2_8, sentence nth: 8
TH token span: (54, 59), refers to learner token index: 2_9, sentence nth: 9
TH token span: (60, 61), refers to learner token index: 2_10, sentence nth: 10
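
The `index` values are strings, so sorting them directly puts `2_10` before `2_2`. The sketch below splits each index into integers for numeric ordering. It assumes every index consists of underscore-separated integers, as in the ten examples above. What the individual parts denote is not stated in this notebook, so the code does not interpret them.

```python
def index_key(index):
    """Split an index such as '2_10' into a tuple of integers: (2, 10)."""
    return tuple(int(part) for part in index.split("_"))


indices = ["2_10", "2_2", "2_1", "3_1"]
print(sorted(indices))                 # plain string sort
print(sorted(indices, key=index_key))  # numeric sort
```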


## Flexible View Selection and Visualization

Learner corpora contain multiple annotation views (e.g., `ctok`, `transgec_th1`, `spacy_ctok`, etc.).

In this step, you will select any view dynamically and visualize annotations in that view.

This allows exploration of different parser outputs and corrected target hypotheses.


In [10]:
# List available views in the loaded CAS
available_views = [sofa.sofaID for sofa in cas.sofas]
print("Available views:", available_views)

view_selector = widgets.Dropdown(
    options=available_views,
    value='ctok',
    description='View:',
)

output = widgets.Output()

def display_view_annotations(view_name):
    output.clear_output()
    view_cas = cas.get_view(view_name)
    html = highlight_pos_inline(view_cas)  # reuse your inline POS highlighting function from earlier
    with output:
        display(HTML(html))
        print(f"Displayed view: {view_name}")

widgets.interact(display_view_annotations, view_name=view_selector)
display(output)


Available views: ['_InitialView', 'ctok', 'ctok_linbomb', 'mixtral_th1', 'transgec_th1', 'mbartgec_th1', 'spacy_ctok', 'spacy_ctok_linbomb', 'spacy_mixtral_th1', 'spacy_transgec_th1', 'spacy_mbartgec_th1']


interactive(children=(Dropdown(description='View:', index=1, options=('_InitialView', 'ctok', 'ctok_linbomb', …

Output()
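
With eleven views in the dropdown, grouping them can make the selector easier to navigate. The sketch below groups names using the pattern visible in the list above. Treating a `_th1` suffix as a target hypothesis, a `ctok` prefix as learner text, and `spacy_` as a prefix to ignore are all assumptions inferred from the names, not documented conventions.

```python
def group_views(view_names):
    """Group view names by the naming pattern seen in the output above."""
    groups = {"learner": [], "target_hypothesis": [], "other": []}
    for name in view_names:
        # Assumption: a 'spacy_' prefix marks a variant of the same base view.
        base = name[len("spacy_"):] if name.startswith("spacy_") else name
        if base.endswith("_th1"):
            groups["target_hypothesis"].append(name)
        elif base.startswith("ctok"):
            groups["learner"].append(name)
        else:
            groups["other"].append(name)
    return groups


views = ["_InitialView", "ctok", "ctok_linbomb", "mixtral_th1",
         "transgec_th1", "mbartgec_th1", "spacy_ctok", "spacy_transgec_th1"]
for group, names in group_views(views).items():
    print(f"{group}: {names}")
```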

## Side-by-Side Token and POS Comparison

Select two views from the same XMI file to compare their tokenization and POS tagging.

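The table below shows tokens and POS tags from both views side by side. Rows are paired by position, not by alignment, so if one view has an extra or missing token, every row after that point is offset.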


In [11]:
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, HTML

# Reuses CDLK_FOLDER, KLP1_FOLDER and TYPESYSTEM_PATH defined in an earlier cell

def get_tokens_pos(view_cas):
    tokens = list(view_cas.select('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token'))
    pos_annots = list(view_cas.select('de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS'))
    pos_dict = {(pos.begin, pos.end): pos.PosValue for pos in pos_annots}

    rows = []
    for token in tokens:
        pos_val = pos_dict.get((token.begin, token.end), "N/A")
        rows.append({"Token": token.get_covered_text(), "POS": pos_val})
    return pd.DataFrame(rows)

def compare_views(xmi_path, typesystem_path, view1_name, view2_name):
    with open(typesystem_path, 'rb') as f:
        typesystem = cassis.load_typesystem(f)
    with open(xmi_path, 'rb') as f:
        cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)

    views = [sofa.sofaID for sofa in cas.sofas]
    if view1_name not in views or view2_name not in views:
        print(f"One or both views not found. Available views: {views}")
        return

    view1 = cas.get_view(view1_name)
    view2 = cas.get_view(view2_name)

    df1 = get_tokens_pos(view1)
    df2 = get_tokens_pos(view2)

    # Combine side by side with suffixes
    combined = pd.concat([df1.add_suffix(f" ({view1_name})"), df2.add_suffix(f" ({view2_name})")], axis=1)
    display(combined)

# Widget setup

corpus_folders = {
    'CDLK': CDLK_FOLDER,
    'KLP1': KLP1_FOLDER
}

def get_files_for_corpus(corpus):
    folder = corpus_folders[corpus]
    return sorted([f for f in os.listdir(folder) if f.endswith('.xmi')])

corpus_selector = widgets.Dropdown(options=list(corpus_folders.keys()), value='CDLK', description='Corpus:')
file_selector = widgets.Dropdown(description='File:', options=get_files_for_corpus(corpus_selector.value))
view_selector_1 = widgets.Dropdown(description='View 1:', options=[])
view_selector_2 = widgets.Dropdown(description='View 2:', options=[])

output = widgets.Output()

def update_file_list(change):
    files = get_files_for_corpus(change['new'])
    file_selector.options = files
    file_selector.value = files[0] if files else None

def update_view_options(*args):
    if file_selector.value is None:
        view_selector_1.options = []
        view_selector_2.options = []
        return
    xmi_path = os.path.join(corpus_folders[corpus_selector.value], file_selector.value)
    with open(TYPESYSTEM_PATH, 'rb') as f:
        typesystem = cassis.load_typesystem(f)
    with open(xmi_path, 'rb') as f:
        cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)
    views = [sofa.sofaID for sofa in cas.sofas]
    view_selector_1.options = views
    view_selector_2.options = views
    if len(views) >= 2:
        view_selector_1.value = views[0]
        view_selector_2.value = views[1]
    elif len(views) == 1:
        view_selector_1.value = views[0]
        view_selector_2.value = views[0]

def compare_and_display(corpus, filename, view1, view2):
    output.clear_output()
    if None in [corpus, filename, view1, view2]:
        return
    xmi_path = os.path.join(corpus_folders[corpus], filename)
    with output:
        try:
            compare_views(xmi_path, TYPESYSTEM_PATH, view1, view2)
        except Exception as e:
            print(f"Error: {e}")

# Setup observers to update widgets dynamically
corpus_selector.observe(update_file_list, names='value')
file_selector.observe(update_view_options, names='value')

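update_file_list({'new': corpus_selector.value})
# Populate the view dropdowns explicitly: the observer does not fire
# when the selected file is unchanged after the file list is set.
update_view_options()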

# Display interactive widgets and output
ui = widgets.VBox([corpus_selector, file_selector, view_selector_1, view_selector_2])
widgets.interact(compare_and_display, corpus=corpus_selector, filename=file_selector, view1=view_selector_1, view2=view_selector_2)
display(ui, output)


interactive(children=(Dropdown(description='Corpus:', options=('CDLK', 'KLP1'), value='CDLK'), Dropdown(descri…

VBox(children=(Dropdown(description='Corpus:', options=('CDLK', 'KLP1'), value='CDLK'), Dropdown(description='…

Output()
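
Because the table pairs rows by position, an inserted or deleted token shifts everything after it. A generic sequence alignment can keep matching tokens on the same row. The sketch below uses `difflib.SequenceMatcher` from the standard library on plain token lists. It matches tokens by exact string equality and is independent of the corpus's own `org.dakoda.crosslayermap` annotations. The function name and the example sentences are invented for illustration.

```python
from difflib import SequenceMatcher


def align_tokens(tokens_a, tokens_b):
    """Pair tokens from two sequences, padding gaps with None."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    rows = []
    for _tag, a1, a2, b1, b2 in matcher.get_opcodes():
        span_a = list(tokens_a[a1:a2])
        span_b = list(tokens_b[b1:b2])
        # Pad the shorter side so each row holds one token per view.
        width = max(len(span_a), len(span_b))
        span_a += [None] * (width - len(span_a))
        span_b += [None] * (width - len(span_b))
        rows.extend(zip(span_a, span_b))
    return rows


# Invented example sentences, not taken from the corpus.
learner = ["Der", "Artikel", "bechreibt", "Hotel", "."]
target = ["Der", "Artikel", "beschreibt", "ein", "Hotel", "."]
for left, right in align_tokens(learner, target):
    print(f"{left!s:<12} {right!s:<12}")
```

The resulting pairs could be passed to `pd.DataFrame` in place of the positional `pd.concat`, with `None` marking tokens that have no counterpart in the other view.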

## Visualize Dependency Relations in a Selected View

Dependency relations represent syntactic links between words, such as subject, object, etc.

In this step, we'll extract dependency annotations from a selected view and display sample governor-dependent relations and their labels.


In [12]:
def extract_dependency_relations_fixed(view_cas, dep_type='org.dakoda.syntax.UDependency', max_examples=10):
    deps = list(view_cas.select(dep_type))
    print(f"Total {dep_type} annotations found: {len(deps)}")
    print(f"Showing first {max_examples} dependency relations:\n")

    for dep in deps[:max_examples]:
        label = getattr(dep, 'DependencyType', 
                        getattr(dep, 'dep', 'N/A'))  # Try common attribute names
        
        gov_text = "N/A"
        dep_text = "N/A"

        try:
            # Try govBegin/govEnd span
            if hasattr(dep, 'govBegin') and hasattr(dep, 'govEnd'):
                gov_text = view_cas.sofa_string[dep.govBegin:dep.govEnd]
            # Else try Governor feature with get_covered_text method
            elif hasattr(dep, 'Governor') and dep.Governor is not None:
                gov_text = dep.Governor.get_covered_text()
        except Exception as e:
            gov_text = f"Error: {e}"

        try:
            # Try depBegin/depEnd span
            if hasattr(dep, 'depBegin') and hasattr(dep, 'depEnd'):
                dep_text = view_cas.sofa_string[dep.depBegin:dep.depEnd]
            # Else try Dependent feature with get_covered_text method
            elif hasattr(dep, 'Dependent') and dep.Dependent is not None:
                dep_text = dep.Dependent.get_covered_text()
        except Exception as e:
            dep_text = f"Error: {e}"

        print(f"Dependency: '{label}', Governor: '{gov_text}', Dependent: '{dep_text}'")

# Example usage
view_name = 'ctok'
view_cas = cas.get_view(view_name)
extract_dependency_relations_fixed(view_cas)


Total org.dakoda.syntax.UDependency annotations found: 198
Showing first 10 dependency relations:

Dependency: 'punct', Governor: 'bechreibt', Dependent: '.'
Dependency: 'root', Governor: 'bechreibt', Dependent: 'bechreibt'
Dependency: 'root', Governor: '<DF-8-47.74><TM-2>', Dependent: '<DF-8-47.74><TM-2>'
Dependency: 'punct', Governor: 'Jugendherberge', Dependent: '“'
Dependency: 'amod', Governor: 'Person', Dependent: 'unbekannten'
Dependency: 'det', Governor: 'Artikel', Dependent: 'des'
Dependency: 'det', Governor: 'Person', Dependent: 'ein'
Dependency: 'punct', Governor: 'Zusammenfassung', Dependent: '‘'
Dependency: 'nmod', Governor: 'Hotel', Dependent: 'Leute'
Dependency: 'amod', Governor: 'Leute', Dependent: 'junge'
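
A list of individual relations is easier to interpret alongside a frequency summary of the labels. The sketch below counts labels from plain `(label, governor, dependent)` tuples. In the notebook these tuples would be collected inside the loop above instead of being printed. The helper name `label_frequencies` is an assumption for illustration.

```python
from collections import Counter


def label_frequencies(relations):
    """Count how often each dependency label occurs.

    `relations` is an iterable of (label, governor_text, dependent_text).
    """
    return Counter(label for label, _gov, _dep in relations)


# Some relations from the output above, as plain tuples.
relations = [
    ("punct", "bechreibt", "."),
    ("root", "bechreibt", "bechreibt"),
    ("amod", "Person", "unbekannten"),
    ("det", "Artikel", "des"),
    ("det", "Person", "ein"),
    ("amod", "Leute", "junge"),
]
for label, count in label_frequencies(relations).most_common():
    print(f"{label}: {count}")
```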


## Interactive Dependency Relation Explorer

Select corpus, file, and annotation view below to explore syntactic dependencies interactively.

You can also specify how many dependency relations to display.


In [13]:
import ipywidgets as widgets
from IPython.display import display, clear_output

# Corpus folders (reuse your paths)
corpus_folders = {
    'CDLK': CDLK_FOLDER,
    'KLP1': KLP1_FOLDER
}

# Function to get available views in the selected file
def get_views(xmi_path, typesystem_path):
    with open(typesystem_path, 'rb') as f:
        typesystem = cassis.load_typesystem(f)
    with open(xmi_path, 'rb') as f:
        cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)
    return [sofa.sofaID for sofa in cas.sofas]

# Widget definitions
corpus_selector = widgets.Dropdown(options=list(corpus_folders.keys()), description="Corpus:")
file_selector = widgets.Dropdown(description="File:")
view_selector = widgets.Dropdown(description="View:")
num_deps_selector = widgets.IntSlider(value=10, min=1, max=50, step=1, description='Max Deps:')

output = widgets.Output()

# Update file list based on corpus
def update_files(change):
    folder = corpus_folders[change['new']]
    files = sorted([f for f in os.listdir(folder) if f.endswith('.xmi')])
    file_selector.options = files
    if files:
        file_selector.value = files[0]

# Update views based on selected file
def update_views(change):
    if change['new'] is None:
        view_selector.options = []
        return
    corpus = corpus_selector.value
    folder = corpus_folders[corpus]
    xmi_path = os.path.join(folder, change['new'])
    try:
        views = get_views(xmi_path, TYPESYSTEM_PATH)
        view_selector.options = views
        if views:
            view_selector.value = views[0]
    except Exception as e:
        view_selector.options = []
        with output:
            clear_output()
            print(f"Error loading views: {e}")

# Display dependencies based on selection
def display_dependencies(corpus, filename, view_name, max_deps):
    output.clear_output()
    if None in [corpus, filename, view_name]:
        return
    folder = corpus_folders[corpus]
    xmi_path = os.path.join(folder, filename)
    with output:
        try:
            with open(TYPESYSTEM_PATH, 'rb') as f:
                typesystem = cassis.load_typesystem(f)
            with open(xmi_path, 'rb') as f:
                cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)
            view_cas = cas.get_view(view_name)

            # Reuse extraction function
            extract_dependency_relations_fixed(view_cas, max_examples=max_deps)
        except Exception as e:
            print(f"Error: {e}")

# Observers
corpus_selector.observe(update_files, names='value')
file_selector.observe(update_views, names='value')

update_files({'new': corpus_selector.value})

# Display the widgets and output
ui = widgets.VBox([corpus_selector, file_selector, view_selector, num_deps_selector])
display(ui, output)

# Interactive linkage
widgets.interact(display_dependencies,
                 corpus=corpus_selector,
                 filename=file_selector,
                 view_name=view_selector,
                 max_deps=num_deps_selector)


VBox(children=(Dropdown(description='Corpus:', options=('CDLK', 'KLP1'), value='CDLK'), Dropdown(description='…

Output()

interactive(children=(Dropdown(description='Corpus:', options=('CDLK', 'KLP1'), value='CDLK'), Dropdown(descri…

<function __main__.display_dependencies(corpus, filename, view_name, max_deps)>

# Summary of Annotation Exploration and Visualization

In this notebook, we completed the following tasks to explore and visualize learner corpus annotations:

1. **Loaded learner corpus XMI files with the correct typesystem.**  
   - Demonstrated how to load and access multiple annotation views within a CAS.

2. **Listed all annotation types present in a corpus view along with their counts.**  
   - Provided insight into the variety and distribution of linguistic and learner-specific annotations.

3. **Extracted and displayed key annotations including:**  
   - Lemmas (base forms of words)  
   - Dependency relations (syntactic links between tokens)  
   - Topological fields (sentence structure info)  
   - Annotation stages

4. **Compared learner text (`ctok`) with target hypothesis views (`transgec_th1`)**  
   - Showed token count differences and demonstrated cross-layer token alignment mappings.

5. **Implemented flexible view selection with dynamic annotation visualization.**  
   - Allowed users to switch between different corpus views and visualize POS annotations inline with token color highlighting.

6. **Created an interactive POS highlighting widget**  
   - Enabled corpus and file selection with inline token text highlighted by part-of-speech tags, including a clear legend.

7. **Built an interactive dependency relation explorer**  
   - Users can select corpus, file, view, and number of dependencies to display detailed dependency relations with governor and dependent tokens.

---

This notebook provides a foundation for in-depth annotation analysis and visualization, paving the way for further linguistic and learner-specific research.

Next steps could include interactive filtering, dependency graph visualization, or learner error analysis.

