## Evaluating Vabamorf's syllabification functionality

This notebook contains code for evaluating [Vabamorf's syllabification](https://github.com/estnltk/estnltk/blob/main/tutorials/nlp_pipeline/B_morphology/syllabification.ipynb). 
We use the same evaluation resources as in the paper "Looking into Estonian Syllabification" by [Kaalep (2022)](https://www.cl.ut.ee/yllitised/Kaalep_HLT2022.pdf).

In [1]:
import os

from datetime import datetime
from pandas import DataFrame
from IPython.display import display

from estnltk.vabamorf.morf import syllabify_word

### Evaluate on IEL's syllabified words

The Institute of the Estonian Language (IEL) provides a [syllabification software module](https://www.eki.ee/tarkvara/silbitus/) and a list of 221k words syllabified with it. 

Please download the list from here: [https://www.eki.ee/tarkvara/wordlist/silbitus.dic](https://www.eki.ee/tarkvara/wordlist/silbitus.dic)

In [2]:
in_file = 'silbitus.dic'
assert os.path.exists(in_file), f'(!) Missing evaluation file {in_file!r}'

In [3]:
def test_on_silbitus_dic(in_file='silbitus.dic', encoding='ISO-8859-13', mismatches_as_df=True):
    '''Tests estnltk's syllabify_word on IEL's "silbitus.dic". 
       Returns tuple (word_count, match_count, mismatches). 
       If mismatches_as_df == True (default), then converts mismatches to DataFrame.
    '''
    assert os.path.isfile(in_file), '(!) Missing input file {!r}'.format(in_file)
    word_count = 0
    match_count = 0
    mismatch = []
    with open(in_file, mode='r', encoding=encoding) as in_f:
        for line in in_f:
            line = line.strip()
            if len(line) > 0:
                word_count += 1
                #
                # 1) Normalize compound word boundaries:
                # aad-li+me-he ==> aad-li-me-he
                #
                line = line.replace('+', '-')
                amb_variants = []
                if '?' in line:
                    #
                    # 2) Assume ? marks ambiguity and generate all variants:
                    # 'a-kor-di?o-ni' ==> 'a-kor-dio-ni', 'a-kor-di-o-ni'
                    # 'a?b?c' ==> 'a-b-c', 'a-bc', 'ab-c', 'abc'
                    all_variants = []
                    parts = line.split('?')
                    for pid, part in enumerate(parts):
                        if pid < len(parts) - 1:
                            # branching variants
                            part1 = part + '-'
                            part2 = part
                            if not all_variants:
                                all_variants = [part1, part2]
                            else:
                                new_all_variants = []
                                for var in all_variants:
                                    new_all_variants.append(var+part1)
                                    new_all_variants.append(var+part2)
                                all_variants = new_all_variants
                        else:
                            new_all_variants = []
                            for var in all_variants:
                                new_all_variants.append(var+part)
                            all_variants = new_all_variants
                    assert len(all_variants) == 2 ** (len(parts)-1)
                    amb_variants = all_variants
                word_clean = line.replace('-', '').replace('?', '')
                syllabified = syllabify_word( word_clean )
                syllabified_norm = '-'.join( [s["syllable"] for s in syllabified] )
                if '?' not in line and line == syllabified_norm:
                    match_count += 1
                elif '?' in line and syllabified_norm in amb_variants:
                    match_count += 1
                else:
                    mismatch.append({'estnltk_vabamorf': syllabified_norm, 'eli': line})
    if mismatches_as_df:
        mismatch = DataFrame.from_dict(mismatch)
    return word_count, match_count, mismatch
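
The variant expansion in step 2 can also be written more compactly. The following is a minimal sketch, not part of the original evaluation code: `expand_ambiguous` is a hypothetical helper name, and it assumes, as the function above does, that each `?` marks an optional syllable boundary.

```python
from itertools import product

def expand_ambiguous(marked):
    '''Returns all readings of a string in which every '?' is either
       a syllable boundary ('-') or no boundary at all (hypothetical helper).
    '''
    parts = marked.split('?')
    variants = []
    # One choice per '?': insert a hyphen, or join the parts directly
    for seps in product(('-', ''), repeat=len(parts) - 1):
        pieces = [parts[0]]
        for sep, part in zip(seps, parts[1:]):
            pieces.append(sep + part)
        variants.append(''.join(pieces))
    return variants

print(expand_ambiguous('a-kor-di?o-ni'))
# ['a-kor-di-o-ni', 'a-kor-dio-ni']
```

A string with n question marks yields 2 ** n variants, which is the same invariant the assertion in the function above checks.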

In [4]:
def percent(items, items_total):
    '''Formats items / items_total as a percentage string, e.g. '(97.97%)'.'''
    ratio = (items/items_total)*100.0
    return '({:.2f}%)'.format(ratio)

In [5]:
total_start_time = datetime.now()
word_count, match_count, mismatch = test_on_silbitus_dic(in_file=in_file)
print('  Total time: {}'.format(datetime.now() - total_start_time))
print('  Identically syllabified words:    {} / {} {}'.format(match_count, word_count, percent(match_count, word_count)))
print(f'  Differently syllabified words:    {len(mismatch)}')
display(mismatch.head(10))
print('  ', '...')
print()

  Total time: 0:00:26.947429
  Identically syllabified words:    216827 / 221328 (97.97%)
  Differently syllabified words:    4501


| | estnltk_vabamorf | eli |
|---|---|---|
| 0 | aa-da-maü-li-kon-nas | aa-da-ma-ü-li-kon-nas |
| 1 | aar-and | aa-rand |
| 2 | aas-taa-jast | aas-ta-a-jast |
| 3 | aas-ta-ist | aas-taist |
| 4 | aa-žio | aa-ži-o |
| 5 | aa-to-mi-ka-tast-roof | aa-to-mi-ka-ta-stroof |
| 6 | aa-to-mi-ka-tast-roo-fi | aa-to-mi-ka-ta-stroo-fi |
| 7 | aa-to-mis-pio-naa-ži-a-last | aa-to-mi-spio-naa-ži-a-last |
| 8 | a-biot-si-valt | a-bi-ot-si-valt |
| 9 | a-bi-po-lit-se-ist | a-bi-po-lit-seist |


   ...



### Evaluate on the syllabification corpus

[Kaalep (2022)](https://www.cl.ut.ee/yllitised/Kaalep_HLT2022.pdf) introduces a new [syllabification tool SYLL](https://gitlab.com/tilluteenused/docker_elg_syllabifier), and provides an evaluation corpus processed with the tool. 
The corpus consists of five subcorpora: fiction texts (`ilu`), newspaper texts (`aja`), conversations (`kone`), chatroom texts (`jututoad`) and child-caregiver speech (`childes`).

Please download and unpack the corpus from here: https://cl.ut.ee/korpused/silbikorpus/korpused.zip

In [6]:
in_dir = 'korpus'
assert os.path.isdir(in_dir), f'(!) Missing evaluation dir {in_dir!r}'

In [7]:
def test_on_silp_file(in_file, encoding='utf-8', type_wise=True, mismatches_as_df=True):
    '''Tests estnltk's syllabify_word on .silp file from the silbikorpus. 
       Returns tuple (word_count, match_count, mismatches). 
       If type_wise=True, then evaluates only on word types (unique words). 
       Otherwise, evaluates on all words. 
       If mismatches_as_df == True (default), then converts mismatches to 
       a DataFrame.
    '''
    assert os.path.isfile(in_file), '(!) Missing input file {!r}'.format(in_file)
    word_count = 0
    match_count = 0
    mismatch = []
    seen_types = set()
    with open(in_file, mode='r', encoding=encoding) as in_f:
        for line in in_f:
            line = line.strip()
            if len(line) > 0:
                #
                # 0) Normalize syllable boundaries:
                # üt.le.sid ==> üt-le-sid
                #                
                line = line.replace('.', '-')
                #
                # 1) Normalize compound word boundaries:
                # le-vi_piir-kon-nast ==> le-vi-piir-kon-nast
                #
                line = line.replace('_', '-')
                #
                # 2) Strip surrounding hyphens:
                # kü-si- => kü-si, vä- => vä 
                # 
                line = line.strip('-')
                word_clean = line.replace('-', '')
                if type_wise and word_clean in seen_types:
                    # In case of a type wise evaluation, skip seen types
                    continue
                if len(word_clean) > 0:
                    syllabified = syllabify_word( word_clean )
                    syllabified_norm = '-'.join( [s["syllable"] for s in syllabified] )
                    if line == syllabified_norm:
                        match_count += 1
                    else:
                        mismatch.append({'estnltk_vabamorf': syllabified_norm, 'hfst-xfst_silbita': line})
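                seen_types.add(word_clean)
                # Note: a line that is empty after normalization is counted here
                # but never evaluated, so word_count may exceed
                # match_count + len(mismatch)
                word_count += 1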
    if mismatches_as_df:
        mismatch = DataFrame.from_dict(mismatch)
    return word_count, match_count, mismatch
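
The three normalization steps above (0 to 2) can be checked in isolation. The following is a minimal sketch under the same assumptions as the function above; `normalize_silp_line` is a hypothetical helper name, not part of the original notebook or of EstNLTK.

```python
def normalize_silp_line(line):
    '''Converts one line of a .silp file to the hyphen-separated form
       used for comparison (hypothetical helper).
    '''
    line = line.strip()
    line = line.replace('.', '-')   # syllable boundaries: üt.le.sid ==> üt-le-sid
    line = line.replace('_', '-')   # compound boundaries: le-vi_piir-kon-nast ==> le-vi-piir-kon-nast
    return line.strip('-')          # surrounding hyphens: kü-si- ==> kü-si

for raw in ['üt.le.sid', 'le-vi_piir-kon-nast', 'kü-si-']:
    normalized = normalize_silp_line(raw)
    # The unsyllabified word form is obtained by dropping the hyphens
    print(raw, '==>', normalized, '==>', normalized.replace('-', ''))
```

Removing the hyphens from the normalized line gives the plain word form that is passed to `syllabify_word`.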

In [8]:
total_start_time = datetime.now()
for fname in sorted(os.listdir(in_dir)):
    if fname.endswith('.silp'):
        print(f' Evaluating on {fname!r} ...')
        word_count, match_count, mismatch = test_on_silp_file(os.path.join(in_dir, fname), type_wise=True)
        print('  Identically syllabified types:    {} / {} {}'.format(match_count, word_count, percent(match_count, word_count)))
        print(f'  Differently syllabified types:    {len(mismatch)}')
        display(mismatch.head(10))
        print('  ', '...')
        print()
print()
print('  Total time: {}'.format(datetime.now() - total_start_time))

 Evaluating on 'aja.silp' ...
  Identically syllabified types:    32655 / 33397 (97.78%)
  Differently syllabified types:    741


| | estnltk_vabamorf | hfst-xfst_silbita |
|---|---|---|
| 0 | pea-mi-nist-ri | pea-mi-nis-tri |
| 1 | re-konst-ru-ee-ri-mi-seks | re-kons-tru-ee-ri-mi-seks |
| 2 | re-konst-ru-ee-ri-mi-se | re-kons-tru-ee-ri-mi-se |
| 3 | ra-han-dus-mi-nis-tee-ri-u-mi-le | ra-han-dus-mi-nis-tee-riu-mi-le |
| 4 | väärt-pa-be-ri-spet-si-a-lis-ti-de | väärt-pa-be-ri-spet-sia-lis-ti-de |
| 5 | char-lie | char-li-e |
| 6 | kesk-de-po-si-too-ri-um | kesk-de-po-si-too-rium |
| 7 | in-te-rag-ro | in-ter-ag-ro |
| 8 | re-gist-ree-ri-tud | re-gis-tree-ri-tud |
| 9 | ä-ri-re-gist-ris | ä-ri-re-gis-tris |


   ...

 Evaluating on 'childes.silp' ...
  Identically syllabified types:    20316 / 21056 (96.49%)
  Differently syllabified types:    739


| | estnltk_vabamorf | hfst-xfst_silbita |
|---|---|---|
| 0 | mi-nua-rust | mi-nu-a-rust |
| 1 | mis-as-jad | mi-sas-jad |
| 2 | e-lekt-ri-pos-ti | e-lek-tri-pos-ti |
| 3 | pe-daaal | pe-daa-al |
| 4 | ah-haaa | ah-haa-a |
| 5 | öäk | ö-äk |
| 6 | õõõõh | õõ-õõh |
| 7 | õõõ | õõ-õ |
| 8 | jaaah | jaa-ah |
| 9 | aaahkkkk | aa-ahkkkk |


   ...

 Evaluating on 'ilu.silp' ...
  Identically syllabified types:    25993 / 26243 (99.05%)
  Differently syllabified types:    249


| | estnltk_vabamorf | hfst-xfst_silbita |
|---|---|---|
| 0 | se-be-de-u-se | se-be-deu-se |
| 1 | õr-rõr-rõrr | õrr-õrr-õrr |
| 2 | e-lekt-ri-lii-ni | e-lek-tri-lii-ni |
| 3 | sot-si-a-lis-mi | sot-sia-lis-mi |
| 4 | e-baõn-nes-tu-nud | e-ba-õn-nes-tu-nud |
| 5 | püs-to-lip-eet-ri-test | püs-to-li-peet-ri-test |
| 6 | söö-du-meist-ri-test | söö-du-meis-tri-test |
| 7 | konst-ru-ee-ri-tud | kons-tru-ee-ri-tud |
| 8 | tree-nin-gui-su | tree-nin-gu-i-su |
| 9 | pa-haa-i-ma-ma-tult | pa-ha-ai-ma-ma-tult |


   ...

 Evaluating on 'jututoad.silp' ...
  Identically syllabified types:    16128 / 17022 (94.75%)
  Differently syllabified types:    894


| | estnltk_vabamorf | hfst-xfst_silbita |
|---|---|---|
| 0 | päi-päää | päi-pää-ä |
| 1 | me-ga-pei-päää | me-ga-pei-pää-ä |
| 2 | dä-räää | dä-rää-ä |
| 3 | li-nu-xi-spet-si-a-list | li-nu-xis-pet-sia-list |
| 4 | eo-i | e-oi |
| 5 | öäk | ö-äk |
| 6 | njaaa | njaa-a |
| 7 | put-siiii | put-sii-ii |
| 8 | kao-zot-sibn-aist | kaoz-ot-sib-naist |
| 9 | teee | tee-e |


   ...

 Evaluating on 'kone.silp' ...
  Identically syllabified types:    12476 / 12698 (98.25%)
  Differently syllabified types:    221


| | estnltk_vabamorf | hfst-xfst_silbita |
|---|---|---|
| 0 | iksp-leind | iks-pleind |
| 1 | i-see-ne-sest | i-se-e-ne-sest |
| 2 | niö-el-da | ni-ö-el-da |
| 3 | füs-saõps | füs-sa-õps |
| 4 | va-naa | va-na-a |
| 5 | nõu-ko-gu-dee | nõu-ko-gu-de-e |
| 6 | ü-hea-eg-selt | ü-he-aeg-selt |
| 7 | jaa-nuar | jaa-nu-ar |
| 8 | mmis | m-mis |
| 9 | raim-ond | rai-mond |


   ...


  Total time: 0:00:16.702215
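
For a single summary figure, the per-subcorpus counts printed above can be pooled. This is only a rough sketch: the counts are copied by hand from the output above, the variable names are illustrative, and because the same word type can occur in several subcorpora, the pooled value is not an accuracy over the unique types of the whole corpus.

```python
# (match_count, word_count) per subcorpus, copied from the output above
results = {
    'aja':      (32655, 33397),
    'childes':  (20316, 21056),
    'ilu':      (25993, 26243),
    'jututoad': (16128, 17022),
    'kone':     (12476, 12698),
}

total_match = sum(match for match, _ in results.values())
total_words = sum(total for _, total in results.values())
pooled = 100.0 * total_match / total_words

print('Pooled over subcorpora: {} / {} ({:.2f}%)'.format(total_match, total_words, pooled))
# Pooled over subcorpora: 107568 / 110416 (97.42%)
```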


---