# Evaluation of Syllabifier 1.0

*Visual evaluation of the automatic syllabifier, using Altair*

In [224]:
import pandas as pd
import numpy as np
import altair as alt

We load the original training set of 20139 syllabified words from Bouma and Hermans. This list is sorted alphabetically and runs up to the word 'kerstauonde', so words starting with a letter in the range L-Z are not part of the training data!

In [225]:
bouma_hermans = {w.replace('-','').strip() for w in open('../data/crm.txt')}

Next, we load the new data set, a sample of 2000 words taken from the Corpus of Middle Dutch rhymed texts. This sample serves to evaluate the predictions made by the syllabifier. We expect the syllabifier to make mistakes on words in the range L-Z, since that range is absent from the training data.

In [226]:
new = {w.replace('-', '').strip() for w in open('../data/syllabified_sample_words_1.txt')}

Before evaluating the model, we want to know the overlap (or: intersection) of words between the two data sets.

In [227]:
intersection = bouma_hermans.intersection(new)
print('The overlap is', len(intersection), "words.")

The overlap is 117 words.
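
To put the overlap in perspective, it can help to express it as a share of the evaluation sample and to separate the words the model has already seen from the unseen ones. The following is a minimal sketch on toy data; the word lists and the names `train_words`, `sample_words`, `unseen` and `overlap_share` are illustrative assumptions, not the contents of the real files.

```python
# Toy stand-ins for the two word sets (hypothetical data, not the real files).
train_words = {"abdie", "blide", "coninc", "dach"}
sample_words = {"blide", "dach", "lant", "minne", "zee"}

# Words present in both sets, and sample words the model never saw in training.
overlap = train_words & sample_words
unseen = sample_words - train_words

# Share of the (deduplicated) sample that also occurs in the training set.
overlap_share = len(overlap) / len(sample_words)

print(sorted(overlap))
print(f"{overlap_share:.1%} of the sample also occurs in the training set")
```

Because both data sets are loaded as Python sets, duplicates are removed first, so the share is computed over unique word forms rather than over the raw number of sampled words.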


After manually checking the predictions for the 2000-word sample, we get an idea of how well the syllabifier performs on new, unseen data. The table below shows, for each starting letter, how many predictions are correct (True) and how many are incorrect (False).

In [228]:
df = pd.read_excel('../data/evaluation_syllabifier.xlsx')
df['correction'] = (df['syllabifier'] == df['manual_syllabification'])
df['alphabet'] = df['word'].astype(str).str[0]

x = df.groupby('alphabet').correction.value_counts().to_frame()
x.columns = ['counts']
x = x.reset_index()
x

Unnamed: 0,alphabet,correction,counts
0,a,True,93
1,a,False,6
2,b,True,134
3,b,False,3
4,c,True,101
5,c,False,9
6,d,True,107
7,d,False,7
8,e,True,48
9,e,False,2


In [229]:
color_scale = alt.Scale(
            domain=[False, True],
            range=["#e23b3b", "#8dc456"]
        )

chart = alt.Chart(x).mark_bar().encode(
    x='alphabet',
    y='counts',
    color=alt.Color(
        'correction',
        scale = color_scale,
    ))
chart

<VegaLite 2 object>

Since some starting letters occur more frequently than others, it is advisable to calculate the proportion of correct predictions for each letter. E.g. for the letter 'a', 93.94% of the syllabifier's predictions are correct (93 out of 99).

In [230]:
# normalize=True returns per-letter proportions instead of raw counts
x_percentages = df.groupby('alphabet').correction.value_counts(normalize=True).to_frame()
x_percentages.columns = ['counts']
x_percentages = x_percentages.reset_index()
x_percentages

Unnamed: 0,alphabet,correction,counts
0,a,True,0.939394
1,a,False,0.060606
2,b,True,0.978102
3,b,False,0.021898
4,c,True,0.918182
5,c,False,0.081818
6,d,True,0.938596
7,d,False,0.061404
8,e,True,0.96
9,e,False,0.04
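
As a dependency-free cross-check of the arithmetic, the proportions can be recomputed directly from the raw counts shown earlier. This is only a sketch: the `counts` dictionary copies the rows displayed for the letters a-e, and the pooled figure therefore covers those five letters only, not the whole sample.

```python
# (correct, incorrect) counts per starting letter, copied from the rows shown above.
counts = {
    "a": (93, 6),
    "b": (134, 3),
    "c": (101, 9),
    "d": (107, 7),
    "e": (48, 2),
}

# Per-letter accuracy: correct predictions divided by all predictions for that letter.
accuracy = {
    letter: correct / (correct + wrong)
    for letter, (correct, wrong) in counts.items()
}

# Pooled accuracy over the letters listed here (not the full 2000-word sample).
total_correct = sum(correct for correct, _ in counts.values())
total = sum(correct + wrong for correct, wrong in counts.values())
pooled = total_correct / total

for letter, acc in accuracy.items():
    print(f"{letter}: {acc:.2%}")
print(f"pooled (a-e only): {pooled:.2%}")
```

The per-letter values agree with the normalized output above, e.g. 93 / 99 for 'a' and 48 / 50 for 'e'.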


In [231]:
color_scale = alt.Scale(
            domain=[False, True],
            range=["#e23b3b", "#8dc456"]
        )

chart = alt.Chart(x_percentages).mark_bar().encode(
    x='alphabet',
    y='counts',
    color=alt.Color(
        'correction',
        scale = color_scale,
    ))
chart

<VegaLite 2 object>

Below, we list all the mistakes made by the syllabifier, along with their manual corrections.
Some of the things that the model still has to learn are:
* A single *-s-* or *-t-* can never be a syllable.
* For Latin names, the ending *-ius* has to be split up into two syllables.
* The plural ending *-ien* (today: *-iën*) has to be split up into two syllables. We know this because in Middle Dutch rhymed texts, *abdien* rhymes with *vrijen*, *partien*, *lijen*, *marien*, *prophetien*, etc.
**In Bouma & Hermans' training material, this is not the case! This needs to be corrected.**
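
The first rule lends itself to a simple automatic sanity check on the syllabifier's output. The sketch below flags any hyphen-separated syllabification that contains a syllable consisting solely of *s* or *t*; the function name `has_lone_consonant_syllable` is a hypothetical helper, not part of the syllabifier itself, and the example strings are taken from the mistakes listed in this notebook.

```python
def has_lone_consonant_syllable(syllabified, consonants=("s", "t")):
    """Return True if any hyphen-separated syllable is just a single 's' or 't'."""
    return any(syllable in consonants for syllable in syllabified.split("-"))


# Predictions of the syllabifier versus the manual corrections.
examples = {
    "aert-se-bi-s-cop": "aert-se-bis-cop",
    "be-s-met-tet": "be-smet-tet",
}

flags = {}
for predicted, manual in examples.items():
    flags[predicted] = has_lone_consonant_syllable(predicted)
    flags[manual] = has_lone_consonant_syllable(manual)
    print(predicted, flags[predicted], "|", manual, flags[manual])
```

A check like this only detects the violation; deciding which neighbouring syllable the stray consonant belongs to still requires the model or a manual correction.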

In [250]:
x = df.set_index('correction')
mistakes = x.loc[False]
mistakes

correction,word,syllabifier,manual_syllabification,alphabet
False,absolutie,ab-solu-tie,ab-so-lu-tie,a
False,aertsebiscop,aert-se-bi-s-cop,aert-se-bis-cop,a
False,amye,a-mye,a-my-e,a
False,anchanius,an-cha-nius,an-cha-ni-us,a
False,aneblic,a-neblic,a-ne-blic,a
False,arrianc,ar-rianc,ar-ri-anc,a
False,besmettet,be-s-met-tet,be-smet-tet,b
False,blavyen,bla-vyen,bla-vy-en,b
False,bolyoen,boly-oen,bo-ly-oen,b
False,caldea,cal-dea,cal-de-a,c


-----------------------------------------------------------------------------------------------------------------