# Task 3: Corpus Comparison and Statistical Analysis

In this notebook, we compare the CDLK and KLP1 learner corpora using the per-text metadata (text length, token count, sentence count) extracted in Task 1.

We will:

- Load metadata CSV files for both corpora.
- Compute grouped summary statistics.
- Visualize distributions with side-by-side boxplots.
- Demonstrate equivalent analysis using both pandas and polars.


### Paths to metadata CSV files
Make sure these CSV files exist in the `Outputs/metadata_csvs` folder, generated in Task 1.


In [3]:
import pandas as pd
import polars as pl
import plotly.express as px

import os
# Change the working directory to the project root folder.
# NOTE: this path is machine-specific; replace it with the location of your own copy.
os.chdir(r"C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes")
print("Working directory changed to:", os.getcwd())


CDLK_CSV = r"Outputs/metadata_csvs/cdlk_metadata_pandas.csv"
KLP1_CSV = r"Outputs/metadata_csvs/klp1_metadata_pandas.csv"


Working directory changed to: C:\Users\Vedang Deshmukh\Desktop\dakoda-recipes
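
Before loading, it can help to confirm that both CSV files are actually present. The following is a minimal sketch: the helper name `missing_files` is hypothetical and not part of the original notebook, and the two paths simply repeat the ones defined in the cell above.

```python
from pathlib import Path

def missing_files(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [Path(p) for p in paths if not Path(p).is_file()]

# Same relative paths as in the cell above; they resolve against the
# current working directory.
csv_paths = [
    "Outputs/metadata_csvs/cdlk_metadata_pandas.csv",
    "Outputs/metadata_csvs/klp1_metadata_pandas.csv",
]

missing = missing_files(csv_paths)
if missing:
    print("Missing metadata files (re-run Task 1?):", [str(p) for p in missing])
else:
    print("All metadata files found.")
```

Running a check like this first gives a clearer message than the `FileNotFoundError` that `read_csv` would otherwise raise.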


### Load metadata into pandas and polars DataFrames

In [4]:
cdlk_pd = pd.read_csv(CDLK_CSV)
klp1_pd = pd.read_csv(KLP1_CSV)

cdlk_pl = pl.read_csv(CDLK_CSV)
klp1_pl = pl.read_csv(KLP1_CSV)

print("CDLK metadata loaded (pandas):")
display(cdlk_pd.head())

print("KLP1 metadata loaded (pandas):")
display(klp1_pd.head())


CDLK metadata loaded (pandas):


,filename,text_length,token_count,sentence_count
0,201006ZW005.xmi,1106,203,11
1,201006ZW012.xmi,1327,226,16
2,201006ZW019.xmi,869,155,9
3,201006ZW021.xmi,1354,234,11
4,201006ZW022.xmi,1076,204,13


KLP1 metadata loaded (pandas):


,filename,text_length,token_count,sentence_count
0,3360_1.xmi,408,84,11
1,3360_2.xmi,394,85,9
2,3361_1.xmi,601,112,11
3,3361_2.xmi,377,82,9
4,3362_1.xmi,69,16,4


### Combine CDLK and KLP1 metadata with a corpus label for grouped analysis

In [5]:
cdlk_pd['corpus'] = 'CDLK'
klp1_pd['corpus'] = 'KLP1'

combined_pd = pd.concat([cdlk_pd, klp1_pd], ignore_index=True)
display(combined_pd.head())

# Same for polars
cdlk_pl = cdlk_pl.with_columns(pl.lit('CDLK').alias('corpus'))
klp1_pl = klp1_pl.with_columns(pl.lit('KLP1').alias('corpus'))

combined_pl = pl.concat([cdlk_pl, klp1_pl])
print(combined_pl.head())


,filename,text_length,token_count,sentence_count,corpus
0,201006ZW005.xmi,1106,203,11,CDLK
1,201006ZW012.xmi,1327,226,16,CDLK
2,201006ZW019.xmi,869,155,9,CDLK
3,201006ZW021.xmi,1354,234,11,CDLK
4,201006ZW022.xmi,1076,204,13,CDLK


shape: (5, 5)
┌─────────────────┬─────────────┬─────────────┬────────────────┬────────┐
│ filename        ┆ text_length ┆ token_count ┆ sentence_count ┆ corpus │
│ ---             ┆ ---         ┆ ---         ┆ ---            ┆ ---    │
│ str             ┆ i64         ┆ i64         ┆ i64            ┆ str    │
╞═════════════════╪═════════════╪═════════════╪════════════════╪════════╡
│ 201006ZW005.xmi ┆ 1106        ┆ 203         ┆ 11             ┆ CDLK   │
│ 201006ZW012.xmi ┆ 1327        ┆ 226         ┆ 16             ┆ CDLK   │
│ 201006ZW019.xmi ┆ 869         ┆ 155         ┆ 9              ┆ CDLK   │
│ 201006ZW021.xmi ┆ 1354        ┆ 234         ┆ 11             ┆ CDLK   │
│ 201006ZW022.xmi ┆ 1076        ┆ 204         ┆ 13             ┆ CDLK   │
└─────────────────┴─────────────┴─────────────┴────────────────┴────────┘
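
Because the statistics in the next step are computed per corpus, it is worth checking how many texts each label covers after concatenation. A small sketch follows; the `toy_*` frames are illustrative stand-ins built from a few rows of the previews above, not the real `cdlk_pd` and `klp1_pd`.

```python
import pandas as pd

# Toy stand-ins for cdlk_pd and klp1_pd (values copied from the previews above)
toy_cdlk = pd.DataFrame({
    "filename": ["201006ZW005.xmi", "201006ZW012.xmi"],
    "token_count": [203, 226],
})
toy_klp1 = pd.DataFrame({"filename": ["3360_1.xmi"], "token_count": [84]})

# Label and combine exactly as in the cell above
toy_cdlk["corpus"] = "CDLK"
toy_klp1["corpus"] = "KLP1"
toy_combined = pd.concat([toy_cdlk, toy_klp1], ignore_index=True)

# Number of texts per corpus; on the real data, use combined_pd["corpus"]
sizes = toy_combined["corpus"].value_counts()
print(sizes)
```

If the two corpora differ a lot in size, keep that in mind when comparing minima, maxima, and standard deviations.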


### Compute summary statistics grouped by corpus

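For token count, sentence count, and text length, we compute the mean, median, standard deviation (std), minimum, and maximum per corpus. Both pandas and polars default to the sample standard deviation (`ddof=1`), so the two sets of results should agree.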


In [19]:
# Pandas summary
summary_pd = combined_pd.groupby('corpus').agg({
    'token_count': ['mean', 'median', 'std', 'min', 'max'],
    'sentence_count': ['mean', 'median', 'std', 'min', 'max'],
    'text_length': ['mean', 'median', 'std', 'min', 'max']
})

display(summary_pd)

# Polars summary
# Sanity check: this will print <class 'polars.dataframe.frame.DataFrame'>
print(type(combined_pl))

summary_pl = (
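    # maintain_order=True keeps groups in first-seen order (CDLK, then KLP1);
    # without it, polars does not guarantee the row order of the result.
    combined_pl.group_by("corpus", maintain_order=True)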
    .agg([
        pl.col("token_count").mean().alias("token_mean"),
        pl.col("token_count").median().alias("token_median"),
        pl.col("token_count").std().alias("token_std"),
        pl.col("token_count").min().alias("token_min"),
        pl.col("token_count").max().alias("token_max"),
        pl.col("sentence_count").mean().alias("sent_mean"),
        pl.col("sentence_count").median().alias("sent_median"),
        pl.col("sentence_count").std().alias("sent_std"),
        pl.col("sentence_count").min().alias("sent_min"),
        pl.col("sentence_count").max().alias("sent_max"),
        pl.col("text_length").mean().alias("length_mean"),
        pl.col("text_length").median().alias("length_median"),
        pl.col("text_length").std().alias("length_std"),
        pl.col("text_length").min().alias("length_min"),
        pl.col("text_length").max().alias("length_max"),
    ])
)

print(summary_pl)



,token_count,token_count,token_count,token_count,token_count,sentence_count,sentence_count,sentence_count,sentence_count,sentence_count,text_length,text_length,text_length,text_length,text_length
corpus,mean,median,std,min,max,mean,median,std,min,max,mean,median,std,min,max
CDLK,207.935,204.0,45.420213,66,363,14.44,14.0,3.183879,7,24,1219.38,1193.5,274.091804,391,2151
KLP1,164.305927,167.0,56.041567,4,410,14.963671,15.0,5.437261,2,41,847.25239,859.0,297.096572,18,2240


<class 'polars.dataframe.frame.DataFrame'>
shape: (2, 16)
┌────────┬────────────┬────────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ corpus ┆ token_mean ┆ token_medi ┆ token_std ┆ … ┆ length_me ┆ length_st ┆ length_mi ┆ length_ma │
│ ---    ┆ ---        ┆ an         ┆ ---       ┆   ┆ dian      ┆ d         ┆ n         ┆ x         │
│ str    ┆ f64        ┆ ---        ┆ f64       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│        ┆            ┆ f64        ┆           ┆   ┆ f64       ┆ f64       ┆ i64       ┆ i64       │
╞════════╪════════════╪════════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ CDLK   ┆ 207.935    ┆ 204.0      ┆ 45.420213 ┆ … ┆ 1193.5    ┆ 274.09180 ┆ 391       ┆ 2151      │
│        ┆            ┆            ┆           ┆   ┆           ┆ 4         ┆           ┆           │
│ KLP1   ┆ 164.305927 ┆ 167.0      ┆ 56.041567 ┆ … ┆ 859.0     ┆ 297.09657 ┆ 18        ┆ 2240      │
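│        ┆            ┆            ┆           ┆   ┆           ┆ 2         ┆           ┆           │
└────────┴────────────┴────────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘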
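
The pandas result has two-level column labels, while the polars result uses flat names such as `token_mean`. One way to get the same flat layout from pandas is named aggregation. This is only a sketch on a small made-up frame (`toy`), not the notebook's actual data; the aliases mirror the polars ones and cover just a few of the statistics.

```python
import pandas as pd

# Illustrative data only: two texts per corpus
toy = pd.DataFrame({
    "corpus": ["CDLK", "CDLK", "KLP1", "KLP1"],
    "token_count": [203, 226, 84, 85],
    "sentence_count": [11, 16, 11, 9],
})

# Named aggregation: output_name=(input_column, aggregation)
flat_summary = toy.groupby("corpus").agg(
    token_mean=("token_count", "mean"),
    token_median=("token_count", "median"),
    token_std=("token_count", "std"),
    sent_mean=("sentence_count", "mean"),
    sent_max=("sentence_count", "max"),
)
print(flat_summary)
```

With flat column names, the pandas and polars summaries can be compared column by column without handling a `MultiIndex`.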

### Compare distributions with side-by-side boxplots

In [20]:
import plotly.graph_objects as go

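def boxplot_comparison(df, column, title):
    """Draw one box per corpus for `column` in a single figure."""
    fig = go.Figure()
    for corpus in df['corpus'].unique():
        # Select this corpus's values for the requested column
        fig.add_trace(go.Box(
            y=df.loc[df['corpus'] == corpus, column],
            name=corpus
        ))
    fig.update_layout(title=title, yaxis_title=column)
    fig.show()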

boxplot_comparison(combined_pd, 'token_count', 'Token Count Distribution by Corpus')
boxplot_comparison(combined_pd, 'sentence_count', 'Sentence Count Distribution by Corpus')
boxplot_comparison(combined_pd, 'text_length', 'Text Length Distribution by Corpus')


### Summary

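This notebook demonstrated how to compare basic metadata statistics between two learner corpora using pandas and polars. In this data, CDLK texts are longer on average than KLP1 texts (mean of about 208 vs. 164 tokens), while the mean sentence counts are similar (about 14.4 vs. 15.0).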

You saw how to:
- Load metadata for multiple corpora
- Combine and label them for grouped analyses
- Compute grouped summary statistics
- Visualize distribution differences side-by-side

Next, you can compare the linguistic annotations of the two corpora in more detail or move on to interactive visualization dashboards.
