# Task 1: Basic Corpus Data Loading and Exploration

This notebook demonstrates how to load learner corpus XMI files, extract basic metadata such as text length, token counts, and sentence counts, and prepare summary tables for further analysis.

You will learn how to:
- Load XMI files using the `dkpro-cassis` library and a given typesystem
- Inspect available annotation layers (views)
- Extract basic annotation statistics from the `ctok` view
- Process multiple files and aggregate metadata using both pandas and polars

Let's get started!


In [None]:
import os

import cassis
import pandas as pd
import polars as pl

# Define your paths (adjust if needed)
CDLK_FOLDER = r"C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes\data\CDLK\learner_xmi"
KLP1_FOLDER = r"C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes\data\KLP1\learner_xmi"
TYPESYSTEM_PATH = r"C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes\data\dakoda_typesystem.xml"
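
The paths above are tied to one machine. As a minimal sketch of a more portable setup, the three locations can be derived from a single base directory. The environment variable `DAKODA_DATA_DIR` and the helper `resolve_corpus_paths` are hypothetical names introduced here for illustration; the folder layout is assumed to match the paths above.

```python
import os
from pathlib import Path


def resolve_corpus_paths(data_dir=None):
    """Build the corpus paths from one base directory.

    DAKODA_DATA_DIR is a hypothetical variable name chosen for this sketch;
    it is not defined by the original notebook.
    """
    base = Path(data_dir or os.environ.get("DAKODA_DATA_DIR", "data"))
    return {
        "cdlk": base / "CDLK" / "learner_xmi",
        "klp1": base / "KLP1" / "learner_xmi",
        "typesystem": base / "dakoda_typesystem.xml",
    }


paths = resolve_corpus_paths("data")
for name, path in paths.items():
    # Report missing locations early instead of failing later during loading
    status = "found" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
```

Converting an entry with `str(paths["cdlk"])` gives a value that can be passed wherever the notebook expects `CDLK_FOLDER`.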


### Function to extract basic metadata from one XMI file

The following function loads a single XMI file, selects a view (default `ctok`), and returns a dictionary of basic metadata, or `None` if the requested view is not present in the file. The dictionary contains:

- Filename
- Text length (number of characters)
- Token count
- Sentence count


In [12]:

def extract_metadata_xmi(xmi_file, typesystem_path, view_name='ctok'):
    """
    Load one XMI file, extract basic metadata (text length, token count, sentence count)
    from a specified view.
    """
    # Load the typesystem XML
    with open(typesystem_path, "rb") as f:
        typesystem = cassis.load_typesystem(f)
        
    # Load the XMI file
    with open(xmi_file, "rb") as f:
        cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)
        
    # Check if the requested view exists
    views = [sofa.sofaID for sofa in cas.sofas]
    if view_name not in views:
        print(f"View '{view_name}' not found in {os.path.basename(xmi_file)}. Available views: {views}")
        return None
    
    # Select the view
    view_cas = cas.get_view(view_name)
    
    # Extract text length
    text_length = len(view_cas.sofa_string)
    
    # Extract tokens
    token_type = typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token')
    tokens = view_cas.select(token_type.name)
    
    # Extract sentences
    sentence_type = typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence')
    sentences = view_cas.select(sentence_type.name)
    
    # Return metadata as a dictionary
    return {
        "filename": os.path.basename(xmi_file),
        "text_length": text_length,
        "token_count": len(tokens),
        "sentence_count": len(sentences)
    }
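
To make the notion of a view more concrete: in a UIMA XMI file, the text of each view is stored in a `cas:Sofa` element whose `sofaID` attribute carries the view name. The sketch below lists those names with the standard library only, so it runs without `cassis`. The XMI string is hand-written for illustration and is not taken from the corpus; `list_view_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written miniature XMI document, for illustration only. Real corpus
# files contain annotation elements and much longer sofaString values.
SAMPLE_XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmi:version="2.0">
  <cas:NULL xmi:id="0"/>
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text/plain" sofaString="Ich lerne Deutsch."/>
  <cas:Sofa xmi:id="2" sofaNum="2" sofaID="ctok" mimeType="text/plain" sofaString="Ich lerne Deutsch ."/>
  <cas:View sofa="1" members=""/>
  <cas:View sofa="2" members=""/>
</xmi:XMI>
"""

CAS_NS = "http:///uima/cas.ecore"


def list_view_names(xmi_text):
    """Return the sofaID of every cas:Sofa element in document order."""
    root = ET.fromstring(xmi_text)
    return [sofa.get("sofaID") for sofa in root.iter(f"{{{CAS_NS}}}Sofa")]


print(list_view_names(SAMPLE_XMI))  # ['_InitialView', 'ctok']
```

Inside the notebook, the equivalent information comes from `[sofa.sofaID for sofa in cas.sofas]`, as used in `extract_metadata_xmi`.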


In [13]:
# Example: extract metadata from one sample file (TYPESYSTEM_PATH is defined in the first cell)
XMI_FILE = r"C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes\data\CDLK\learner_xmi\201006ZW063.xmi"

metadata = extract_metadata_xmi(XMI_FILE, TYPESYSTEM_PATH, view_name='ctok')

print(metadata)



{'filename': '201006ZW063.xmi', 'text_length': 1195, 'token_count': 198, 'sentence_count': 16}
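
The raw counts become easier to compare across texts once they are turned into ratios. The following sketch derives two of them from the sample output above. `add_ratios` is a hypothetical helper, and note that `text_length` counts every character in the view, including whitespace, so `chars_per_token` is only a rough indicator of word length.

```python
# Values copied from the sample output above
metadata = {
    "filename": "201006ZW063.xmi",
    "text_length": 1195,
    "token_count": 198,
    "sentence_count": 16,
}


def add_ratios(meta):
    """Return a copy of meta with two derived ratios (None when a count is zero)."""
    tokens = meta["token_count"]
    sentences = meta["sentence_count"]
    enriched = dict(meta)
    enriched["tokens_per_sentence"] = tokens / sentences if sentences else None
    enriched["chars_per_token"] = meta["text_length"] / tokens if tokens else None
    return enriched


enriched = add_ratios(metadata)
print(enriched["tokens_per_sentence"])        # 12.375
print(round(enriched["chars_per_token"], 2))  # 6.04
```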


### Process all XMI files in a folder and aggregate metadata into a pandas DataFrame

In [14]:
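def process_corpus_folder_pandas(folder_path, typesystem_path, view_name='ctok'):
    """Extract metadata from every .xmi file in a folder into a pandas DataFrame."""
    # Sort the listing so the row order does not depend on the file system
    files = sorted(os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith(".xmi"))
    records = []
    for f in files:
        meta = extract_metadata_xmi(f, typesystem_path, view_name)
        if meta is not None:  # skip files that lack the requested view
            records.append(meta)
    return pd.DataFrame(records)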


In [15]:
cdlk_df = process_corpus_folder_pandas(CDLK_FOLDER, TYPESYSTEM_PATH)
print("CDLK corpus summary:")
display(cdlk_df.head())


CDLK corpus summary:


|   | filename | text_length | token_count | sentence_count |
|---|---|---|---|---|
| 0 | 201006ZW005.xmi | 1106 | 203 | 11 |
| 1 | 201006ZW012.xmi | 1327 | 226 | 16 |
| 2 | 201006ZW019.xmi | 869 | 155 | 9 |
| 3 | 201006ZW021.xmi | 1354 | 234 | 11 |
| 4 | 201006ZW022.xmi | 1076 | 204 | 13 |
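
As a small worked example of summarizing such a table, the sketch below computes basic statistics for one column using only the five rows displayed above and the standard library. `summarize_column` is a hypothetical helper; on the full DataFrame, `cdlk_df.describe()` would give a comparable overview.

```python
from statistics import mean, median

# The five rows shown in the output above
rows = [
    {"filename": "201006ZW005.xmi", "text_length": 1106, "token_count": 203, "sentence_count": 11},
    {"filename": "201006ZW012.xmi", "text_length": 1327, "token_count": 226, "sentence_count": 16},
    {"filename": "201006ZW019.xmi", "text_length": 869, "token_count": 155, "sentence_count": 9},
    {"filename": "201006ZW021.xmi", "text_length": 1354, "token_count": 234, "sentence_count": 11},
    {"filename": "201006ZW022.xmi", "text_length": 1076, "token_count": 204, "sentence_count": 13},
]


def summarize_column(records, column):
    """Return min, max, mean, and median of one numeric column."""
    values = [record[column] for record in records]
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


print(summarize_column(rows, "token_count"))
# {'min': 155, 'max': 234, 'mean': 204.4, 'median': 204}
```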


### Similar batch processing using polars DataFrame


In [16]:
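def process_corpus_folder_polars(folder_path, typesystem_path, view_name='ctok'):
    """Extract metadata from every .xmi file in a folder into a polars DataFrame."""
    # Sort the listing so the row order does not depend on the file system
    files = sorted(os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith(".xmi"))
    records = []
    for f in files:
        meta = extract_metadata_xmi(f, typesystem_path, view_name)
        if meta is not None:  # skip files that lack the requested view
            records.append(meta)
    return pl.DataFrame(records)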


In [17]:
cdlk_pl = process_corpus_folder_polars(CDLK_FOLDER, TYPESYSTEM_PATH)
print("CDLK corpus summary (polars):")
print(cdlk_pl.head())




CDLK corpus summary (polars):
shape: (5, 4)
┌─────────────────┬─────────────┬─────────────┬────────────────┐
│ filename        ┆ text_length ┆ token_count ┆ sentence_count │
│ ---             ┆ ---         ┆ ---         ┆ ---            │
│ str             ┆ i64         ┆ i64         ┆ i64            │
╞═════════════════╪═════════════╪═════════════╪════════════════╡
│ 201006ZW005.xmi ┆ 1106        ┆ 203         ┆ 11             │
│ 201006ZW012.xmi ┆ 1327        ┆ 226         ┆ 16             │
│ 201006ZW019.xmi ┆ 869         ┆ 155         ┆ 9              │
│ 201006ZW021.xmi ┆ 1354        ┆ 234         ┆ 11             │
│ 201006ZW022.xmi ┆ 1076        ┆ 204         ┆ 13             │
└─────────────────┴─────────────┴─────────────┴────────────────┘
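
The pandas and polars batch functions differ only in their final line. One possible way to avoid the duplicated loop is to collect the records once and build either DataFrame from the same list. In the sketch below, `collect_records` and `fake_extract` are hypothetical names; the stand-in extractor and temporary files exist only so that the example runs without corpus data.

```python
import os
import tempfile


def collect_records(folder_path, extract, suffix=".xmi"):
    """Apply `extract` to every matching file in a folder, skipping None results."""
    records = []
    for name in sorted(os.listdir(folder_path)):
        if not name.endswith(suffix):
            continue
        meta = extract(os.path.join(folder_path, name))
        if meta is not None:
            records.append(meta)
    return records


def fake_extract(path):
    """Stand-in for extract_metadata_xmi: returns None for empty files."""
    size = os.path.getsize(path)
    return {"filename": os.path.basename(path), "size": size} if size else None


with tempfile.TemporaryDirectory() as tmp:
    for name, content in [("b.xmi", "xx"), ("a.xmi", "x"), ("empty.xmi", ""), ("notes.txt", "x")]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write(content)
    demo_records = collect_records(tmp, fake_extract)

print(demo_records)
# [{'filename': 'a.xmi', 'size': 1}, {'filename': 'b.xmi', 'size': 2}]
```

With the notebook's own function, the call would look like `collect_records(CDLK_FOLDER, lambda p: extract_metadata_xmi(p, TYPESYSTEM_PATH))`, and the resulting list can be passed to `pd.DataFrame` or `pl.DataFrame`.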


## Save metadata CSV files for both corpora using pandas and polars

The following cell saves the processed corpus metadata as CSV files in the `../Outputs/metadata_csvs` folder (relative to the notebook's working directory) for use in subsequent tasks.


In [None]:
import os

# Create output folder if it does not exist
os.makedirs("../Outputs/metadata_csvs", exist_ok=True)

# Process corpora with pandas batch function
cdlk_meta_pd = process_corpus_folder_pandas(CDLK_FOLDER, TYPESYSTEM_PATH, 'ctok')
klp1_meta_pd = process_corpus_folder_pandas(KLP1_FOLDER, TYPESYSTEM_PATH, 'ctok')

# Save pandas metadata CSVs
cdlk_meta_pd.to_csv("../Outputs/metadata_csvs/cdlk_metadata_pandas.csv", index=False)
klp1_meta_pd.to_csv("../Outputs/metadata_csvs/klp1_metadata_pandas.csv", index=False)

print("Pandas metadata CSVs saved successfully.")

# Process corpora with polars batch function
cdlk_meta_pl = process_corpus_folder_polars(CDLK_FOLDER, TYPESYSTEM_PATH, 'ctok')
klp1_meta_pl = process_corpus_folder_polars(KLP1_FOLDER, TYPESYSTEM_PATH, 'ctok')

# Save polars metadata CSVs
cdlk_meta_pl.write_csv("../Outputs/metadata_csvs/cdlk_metadata_polars.csv")
klp1_meta_pl.write_csv("../Outputs/metadata_csvs/klp1_metadata_polars.csv")

print("Polars metadata CSVs saved successfully.")




Pandas metadata CSVs saved successfully.
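
Since the same metadata is written twice, once by pandas and once by polars, a quick consistency check is to read both files back and compare them row by row. The sketch below does this with the standard `csv` module; `read_rows` and `same_table` are hypothetical helpers, and the two temporary files stand in for the real outputs.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ["filename", "text_length", "token_count", "sentence_count"]


def read_rows(csv_path):
    """Read a CSV file into a list of dictionaries (all values as strings)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def same_table(path_a, path_b):
    """Return True if both CSV files contain the same rows in the same order."""
    return read_rows(path_a) == read_rows(path_b)


# Demonstration with two stand-in files that differ only in line endings
lines = [",".join(EXPECTED_COLUMNS), "201006ZW005.xmi,1106,203,11"]
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.csv")
    path_b = os.path.join(tmp, "b.csv")
    with open(path_a, "w", newline="", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    with open(path_b, "w", newline="", encoding="utf-8") as f:
        f.write("\r\n".join(lines) + "\r\n")
    tables_match = same_table(path_a, path_b)
    columns = list(read_rows(path_a)[0].keys())

print(tables_match)                  # True
print(columns == EXPECTED_COLUMNS)   # True
```

For the real outputs, the two arguments would be the `cdlk_metadata_pandas.csv` and `cdlk_metadata_polars.csv` paths used in the cell above.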


### Congratulations!

You have now loaded learner corpora, extracted and summarized basic metadata using both pandas and polars.

Next, you can proceed to build interactive browsing tools, comparative analyses, and explore annotation layers in more detail.

---

**Note:** Please ensure the folder paths match your environment before running the notebook.
