This notebook can be run in Google Colab by clicking the "Open in Colab" button below and setting `colab = True` in the "Setup" code block.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/frithureiks/Computational-Detection-of-Syllable-Boundaries/blob/main/NeuralNet_model/Paper_Experiments_from_Saved_Models.ipynb)


# Setup

If running in Google Colab, set `colab = True`. The setup block will then download the repository and dataset into Colab's file system and set the appropriate file paths.

If running locally, ensure you have already cloned the GitHub repository and downloaded the data from Zenodo, then set the corresponding file paths to the folders containing each.

In [None]:
%%capture
colab = True

if colab:

    # Clone this repository into Colab
    !git clone https://github.com/frithureiks/Computational-Detection-of-Syllable-Boundaries.git

    # Download annotated data
    !pip install zenodo_get
    !zenodo_get https://doi.org/10.5281/zenodo.17418250
    !unzip '/content/Data_and_Models.zip' -d '/content'

    # Set paths to folders containing code and data
    code_folder = '/content/Computational-Detection-of-Syllable-Boundaries'
    data_folder = '/content/Data'
    results_folder = '/content'
    model_folder ='/content/Model_Weights'

else:
    # Set paths to folders containing code and data
    code_folder = 'Path/To/Local/Cloned/Github/Repository'
    data_folder = 'Path/To/Downloaded/Data/Folder'
    results_folder = 'Path/To/Folder/To/Save/Figures'
    model_folder = 'Path/To/Downloaded/Models/Folder'

# Imports

In [None]:
import os
import json
import pickle
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.patheffects as pe

from IPython.display import display
from sklearn.metrics import matthews_corrcoef, precision_recall_fscore_support

import torch
from torch.utils.data import DataLoader

os.chdir(code_folder)
from NeuralNet_model import NgramCNN, WordDataset, EqualLengthsBatchSampler, format_k, syllabify, match_syllables

# Hyperparameters and Random Seed

We set the batch size (64), number of epochs (10), and random seed (100) to the values used in the paper. Models are loaded onto the GPU if one is available, and onto the CPU otherwise.

In [None]:
batch_size = 64
epochs = 10
seed = 100
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Languages and Data

In [None]:
languages = ['cs','nl','en','fr','de','el','it','ko','no','es','sv','tr']

names = ['Czech','Dutch','English','French','German','Greek',
         'Italian','Korean','Norwegian','Spanish','Swedish','Turkish']

with open(os.path.join(data_folder, 'Word_Dataset_Train.pickle'), 'rb') as f:
    train_ds = pickle.load(f)

with open(os.path.join(data_folder, 'Word_Dataset_Val.pickle'), 'rb') as f:
    val_ds = pickle.load(f)

with open(os.path.join(data_folder, 'Word_Dataset_Test.pickle'), 'rb') as f:
    test_ds = pickle.load(f)

# Load CNN

The following code block loads the trained CNN weights for each language into a dictionary keyed by language code.

In [None]:
models = dict()

for lang, name in zip(languages, names):
    model = NgramCNN(n_gram=7, n_filters=15, n_layers=10).to(device)
    model.load_state_dict(torch.load(os.path.join(model_folder, f'LOO_{name}_CNN.pt'),
                                              map_location=device, weights_only=True))
    models[lang] = model

# Run Predictions on Test Set

Runs each saved model on its own language's portion of the test set and collects the predictions in a single DataFrame. Alternatively, you can skip this block and load the previously saved predictions in the following code block, which should be identical.
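
As a minimal sketch of the thresholding step, the snippet below turns probabilities into binary boundary predictions. The threshold values and the helper name `apply_threshold` are hypothetical and chosen for illustration; the real thresholds are read from `Prediction_Thresholds.json`.

```python
# Hypothetical per-language thresholds (illustrative values only).
example_thresholds = {'en': 0.5, 'tr': 0.35}

def apply_threshold(probabilities, threshold):
    """Return 1.0 where the probability meets the threshold, else 0.0."""
    return [float(p >= threshold) for p in probabilities]

# Toy probabilities for four segments.
probs = [0.10, 0.35, 0.50, 0.90]
en_preds = apply_threshold(probs, example_thresholds['en'])
tr_preds = apply_threshold(probs, example_thresholds['tr'])

print(en_preds)  # [0.0, 0.0, 1.0, 1.0]
print(tr_preds)  # [0.0, 1.0, 1.0, 1.0]
```

The comparison is inclusive (`>=`), matching the notebook, so a probability exactly at the threshold counts as a boundary.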

In [None]:
outputs = dict()
loo_outputs = []

for lang in languages:
    # Select model
    model = models[lang]

    # Set Test Dataset and Dataloader
    word_idx, loo_test_ds = zip(*[(i,entry) for i,entry in enumerate(test_ds) if entry[2] == lang])
    test_dl = DataLoader(loo_test_ds, batch_size=1, shuffle=False)

    # Run predictions on Test Dataloader
    probs, truths = model.predict(test_dl)
    ids = [[id]*len(entry[0]) for id, entry in zip(word_idx, loo_test_ds)]
    ids = [id for id_list in ids for id in id_list]
    positions = [i+1 for entry in loo_test_ds for i in range(len(entry[0]))]
    segments = [segment for entry in loo_test_ds for segment in entry[3]]
    surprisals = [surprisal for entry in loo_test_ds for surprisal in entry[0]]

    outputs[lang] = pd.DataFrame({'Probabilities':probs, 'Ground Truths':truths, 'Languages':lang, 'Word Indices':ids,
                                  'Positions':positions, 'Segments':segments, 'Surprisals':surprisals})

    # Load prediction thresholds
    with open(os.path.join(data_folder, 'Prediction_Thresholds.json')) as f:
        model_thresholds = json.load(f)
    threshold = model_thresholds['CNN'][lang]

    # Add threshold and binary predictions to output
    outputs[lang]['Predictions'] = (outputs[lang]['Probabilities'] >= threshold).astype(float)
    outputs[lang]['Thresholds'] = threshold

    outputs[lang] = outputs[lang][['Word Indices','Positions','Segments','Predictions','Ground Truths',
                                   'Probabilities','Thresholds','Surprisals','Languages']]
    loo_outputs.append(outputs[lang])

loo_outputs = pd.concat(loo_outputs, ignore_index=True)

# Load Test Set Predictions

Loads previously saved model predictions on the test set. If you ran the previous code block to generate new predictions, running this block will overwrite them, though the results should be identical.

In [None]:
# Load predictions
loo_outputs = pd.read_csv(os.path.join(data_folder, 'LOO_Outputs.csv'))

# Originally positions were 0-indexed, but to calculate metrics below they must be 1-indexed
loo_outputs['Positions'] = loo_outputs['Positions'] + 1

# Additional formatting
loo_outputs = loo_outputs[loo_outputs['Model']=='CNN']
loo_outputs = loo_outputs.rename(columns={'Codes':'Languages'})
loo_outputs[['Word Indices','Positions']] = loo_outputs[['Word Indices','Positions']].astype(int)
loo_outputs = loo_outputs[['Word Indices','Positions','Segments','Predictions','Ground Truths',
                           'Probabilities','Thresholds','Surprisals','Languages']]

# Random Baseline
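
The baseline guesses a boundary label uniformly at random for every segment, forces a boundary at the first position of each word, and leaves that first position out of the overall scores. A rough sketch of the idea on toy data follows. It swaps in the standard library's `random.Random` for NumPy's `default_rng`, so the actual draws will differ from the notebook's, and the helper name `random_baseline` is hypothetical.

```python
import random

def random_baseline(positions, seed=100):
    """Guess 0 or 1 for every segment, forcing a boundary at position 1."""
    rng = random.Random(seed)  # stdlib stand-in for np.random.default_rng
    guesses = [rng.randint(0, 1) for _ in positions]
    return [1 if pos == 1 else guess for pos, guess in zip(positions, guesses)]

# Two toy words of lengths 3 and 2, with 1-indexed segment positions.
positions = [1, 2, 3, 1, 2]
preds = random_baseline(positions)

# Word-initial segments are dropped before computing the overall scores.
scored = [(pos, p) for pos, p in zip(positions, preds) if pos != 1]
print(preds, scored)
```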

In [None]:
rng = np.random.default_rng(seed=seed)
random_results = loo_outputs.copy()
random_results['Predictions'] = rng.integers(0,2,len(random_results))
idx = random_results['Positions']==1
random_results.loc[idx, 'Predictions'] = 1
random_nohead = random_results[random_results['Positions']!=1]   # Remove first position from random_results

random_metrics = dict()
precision, recall, f1, support = precision_recall_fscore_support(
    random_nohead['Ground Truths'],random_nohead['Predictions'],average='binary')
random_metrics['Precision'] = precision
random_metrics['Recall'] = recall
random_metrics['F1'] = f1

counts = random_results['Positions'].value_counts()
positions = counts.index
segments = counts.values
sylls, pred_sylls, entropy = [],[],[]
precision, recall, f1 = [],[],[]
for pos in positions:
    pos_df = random_results[random_results['Positions']==pos]
    sylls.append(pos_df['Ground Truths'].sum())
    pred_sylls.append(pos_df['Predictions'].sum())
    entropy.append(pos_df['Surprisals'].mean())
    p, r, f, _ = precision_recall_fscore_support(pos_df['Ground Truths'], pos_df['Predictions'], average='binary', zero_division=0)
    precision.append(p)
    recall.append(r)
    f1.append(f)

random_pos_metrics = pd.DataFrame([entropy, segments, sylls, pred_sylls, precision, recall, f1],
                    index=['Entropy', 'Segments', 'Syllables', 'Predicted', 'Precision', 'Recall', 'F1'],
                    columns=positions)   # Positions are already 1-indexed
random_pos_metrics = random_pos_metrics.T.astype({'Segments':int, 'Syllables':int, 'Predicted':int})

# Segment Metrics
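
For reference, the binary precision, recall, and F1 reported here can be sketched by hand from the counts of true positives, false positives, and false negatives. The helper `binary_prf` below is a hypothetical illustration on toy labels, not a replacement for `precision_recall_fscore_support`.

```python
def binary_prf(truths, preds):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(truths, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truths, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truths, preds) if t == 1 and p == 0)
    # Fall back to 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
truths = [1, 0, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 0, 0]
precision, recall, f1 = binary_prf(truths, preds)
print(f'{precision:.2f} {recall:.2f} {f1:.2f}')  # 0.67 0.67 0.67
```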

In [None]:
segment_metrics = dict()
df = loo_outputs.copy()
df = df[df['Positions']!=1]         # Remove first position from segment_metrics
precision, recall, f1, support = precision_recall_fscore_support(df['Ground Truths'],df['Predictions'],average='binary')
mcc = matthews_corrcoef(df['Ground Truths'],df['Predictions'])
segment_metrics['Precision'] = precision
segment_metrics['Recall'] = recall
segment_metrics['F1'] = f1

fig, ax = plt.subplots(layout='constrained', figsize=(3.5,3.5))
x = np.arange(len(segment_metrics))
bars = ax.bar(x, segment_metrics.values(), width=0.8, color='tab:blue', edgecolor='black', label='CNN')
for m, metric in enumerate(random_metrics.values()):
  ax.plot([m-0.36, m+0.37], [metric, metric], color='tab:orange', linewidth=4, label='Random',
          path_effects=[pe.Stroke(linewidth=6, foreground='black'), pe.Normal()])
ax.bar_label(bars, fmt='{:.0%}', padding=10)
ax.set_ylim(0,1)
ax.set_xticks(x, segment_metrics.keys())
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[:1:-1], labels[:1:-1])
plt.savefig(os.path.join(results_folder, 'CNN - Segment Metrics.pdf'), dpi=500)
plt.show()

# Segment Metrics by Position

In [None]:
df = loo_outputs.copy()
counts = df['Positions'].value_counts()
positions = counts.index
segments = counts.values
frequency = segments / segments[0]
sylls = [df[df['Positions']==pos]['Ground Truths'].sum() for pos in positions]
pred_sylls = [df[df['Positions']==pos]['Predictions'].sum() for pos in positions]
entropy = [df[df['Positions']==pos]['Surprisals'].mean() for pos in positions]

metrics = [precision_recall_fscore_support(
    df[df['Positions']==pos]['Ground Truths'], df[df['Positions']==pos]['Predictions'], average='binary', zero_division=0
    ) for pos in positions]
precision, recall, f1, _ = list(zip(*metrics))

loo_pos_metrics = pd.DataFrame([entropy, segments, sylls, pred_sylls, precision, recall, f1, frequency],
                    index=['Entropy', 'Segments', 'Syllables', 'Predicted', 'Precision', 'Recall', 'F1', 'Freq'],
                    columns=positions)
loo_pos_metrics = loo_pos_metrics.T.astype({'Segments':int, 'Syllables':int, 'Predicted':int})

fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(9,3), sharex=True)
df = loo_pos_metrics
positions = df.index
precision = df['Precision']
recall = df['Recall']
rand_df = random_pos_metrics
frequencies = df['Freq']

k_formatter = ticker.FuncFormatter(format_k)
ax[0].bar(positions, df['Segments'], width=1, color='tab:blue', edgecolor='black')
ax[0].set_ylabel('Num Segments')
ax[0].yaxis.set_major_formatter(k_formatter)
ax[0].set_xlabel('Position')

for j, (name, metric) in enumerate([('Precision',precision), ('Recall',recall)]):
  j += 1
  ax[j].bar(positions, metric, color='white', edgecolor='black', width=1)
  bars = ax[j].bar(positions, metric, color='tab:blue', edgecolor='black', width=1)
  for k, bar in enumerate(bars):
    bar.set_alpha(df.loc[k+1,'Freq'])
  ax[j].scatter(positions, rand_df[name], s=25, color='white', edgecolor='black')
  points = ax[j].scatter(positions, rand_df[name], s=25, color=[(255/255,127/255,14/255,freq) for freq in frequencies],
                         edgecolor='black', label='Random')
  ax[j].set_xlabel('Position')
  ax[j].set_ylabel(name)
  ax[j].set_xticks(ticks=positions, labels=[str(i) if i % 5 == 0 else '' for i in positions])
  ax[j].set_xlim(positions[0]-1, positions[-1]+1)
  ax[j].legend(loc='upper center')

plt.tight_layout()
plt.savefig(os.path.join(results_folder, 'CNN - Segment Metrics by Position.pdf'), dpi=500)
plt.show()

# Predicted Syllables
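
To illustrate how per-segment boundary labels become syllable strings, here is a small sketch. It assumes a label of 1 marks the first segment of a syllable. The helper `split_syllables` is a hypothetical stand-in and is not the repository's `syllabify`, whose exact behavior may differ.

```python
def split_syllables(segments, boundaries):
    """Group segments into syllables; a flag of 1 starts a new syllable."""
    syllables = []
    for segment, flag in zip(segments, boundaries):
        if flag == 1 or not syllables:
            syllables.append(segment)   # open a new syllable
        else:
            syllables[-1] += segment    # extend the current syllable
    return syllables

# Toy word with made-up ground-truth and predicted labels.
segments = list('banana')
truth = [1, 0, 1, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]

true_sylls = split_syllables(segments, truth)
pred_sylls = split_syllables(segments, pred)
print(true_sylls)  # ['ba', 'na', 'na']
print(pred_sylls)  # ['ban', 'a', 'na']
```

In this toy case only the final predicted syllable lines up with a true syllable, which is the kind of token-level match the following sections aggregate.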

In [None]:
df = loo_outputs.copy()
df['Pred Segments'] = df.apply(syllabify, syllables='Predictions', segments='Segments', axis=1)
df['True Segments'] = df.apply(syllabify, syllables='Ground Truths', segments='Segments', axis=1)
word_groups = df.groupby(by=['Word Indices'])
loo_words = pd.DataFrame()
loo_words['Word'] = word_groups['Segments'].apply(lambda x: ''.join(x))
loo_words['Languages'] = word_groups['Languages'].first()
loo_words['True Syllables'] = word_groups['True Segments'].apply(lambda x: ''.join(x).split('^'))
loo_words['Pred Syllables'] = word_groups['Pred Segments'].apply(lambda x: ''.join(x).split('^'))
loo_words['Ground Truths'] = word_groups['Ground Truths'].apply(list)
loo_words['Predictions'] = word_groups['Predictions'].apply(list)
loo_words['True Matches'] = loo_words.apply(match_syllables, sylls='Ground Truths', ref_sylls='Predictions', axis=1)
loo_words['Pred Matches'] = loo_words.apply(match_syllables, sylls='Predictions', ref_sylls='Ground Truths', axis=1)
loo_words = loo_words[['Word', 'Languages', 'True Syllables', 'Pred Syllables', 'True Matches', 'Pred Matches']]

preds = [pred for row in loo_words['Pred Syllables'] for pred in row]
pred_len = [len(pred) for pred in preds]
pred_pos = [i+1 for row in loo_words['Pred Syllables'] for i,syll in enumerate(row)]
pred_ids = [[idx]*len(loo_words.loc[idx,'Pred Syllables']) for idx in loo_words.index]
pred_ids = [id for idx in pred_ids for id in idx]
pred_langs = [[loo_words.loc[idx, 'Languages']]*len(loo_words.loc[idx,'Pred Syllables']) for idx in loo_words.index]
pred_langs = [lang for entry in pred_langs for lang in entry]
pred_match = [pred for row in loo_words['Pred Matches'] for pred in row]
loo_preds = pd.DataFrame({'Positions':pred_pos,'Syllables':preds,'Matches':pred_match, 'Lengths':pred_len,
                          'Word Indices':pred_ids, 'Languages':pred_langs}).set_index('Positions')

truths = [true for row in loo_words['True Syllables'] for true in row]
true_len = [len(true) for true in truths]
true_pos = [i+1 for row in loo_words['True Syllables'] for i,syll in enumerate(row)]
true_ids = [[idx]*len(loo_words.loc[idx,'True Syllables']) for idx in loo_words.index]
true_ids = [id for idx in true_ids for id in idx]
true_langs = [[loo_words.loc[idx, 'Languages']]*len(loo_words.loc[idx,'True Syllables']) for idx in loo_words.index]
true_langs = [lang for entry in true_langs for lang in entry]
true_match = [true for row in loo_words['True Matches'] for true in row]
loo_trues = pd.DataFrame({'Positions':true_pos,'Syllables':truths,'Matches':true_match, 'Lengths':true_len,
                          'Word Indices':true_ids, 'Languages':true_langs}).set_index('Positions')

# Random Syllables

In [None]:
df = random_results.copy()
df['Pred Segments'] = df.apply(syllabify, syllables='Predictions', segments='Segments', axis=1)
df['True Segments'] = df.apply(syllabify, syllables='Ground Truths', segments='Segments', axis=1)
word_groups = df.groupby(by=['Word Indices'])
random_words = pd.DataFrame()
random_words['Word'] = word_groups['Segments'].apply(lambda x: ''.join(x))
random_words['Languages'] = word_groups['Languages'].first()
random_words['True Syllables'] = word_groups['True Segments'].apply(lambda x: ''.join(x).split('^'))
random_words['Pred Syllables'] = word_groups['Pred Segments'].apply(lambda x: ''.join(x).split('^'))
random_words['Ground Truths'] = word_groups['Ground Truths'].apply(list)
random_words['Predictions'] = word_groups['Predictions'].apply(list)
random_words['True Matches'] = random_words.apply(match_syllables, sylls='Ground Truths', ref_sylls='Predictions', axis=1)
random_words['Pred Matches'] = random_words.apply(match_syllables, sylls='Predictions', ref_sylls='Ground Truths', axis=1)
random_words = random_words[['Word', 'Languages', 'True Syllables', 'Pred Syllables', 'True Matches', 'Pred Matches']]

preds = [pred for row in random_words['Pred Syllables'] for pred in row]
pred_len = [len(pred) for pred in preds]
pred_pos = [i+1 for row in random_words['Pred Syllables'] for i,syll in enumerate(row)]
pred_ids = [[idx]*len(random_words.loc[idx,'Pred Syllables']) for idx in random_words.index]
pred_ids = [id for idx in pred_ids for id in idx]
pred_langs = [[random_words.loc[idx, 'Languages']]*len(random_words.loc[idx,'Pred Syllables']) for idx in random_words.index]
pred_langs = [lang for entry in pred_langs for lang in entry]
pred_match = [pred for row in random_words['Pred Matches'] for pred in row]
random_preds = pd.DataFrame({'Positions':pred_pos,'Syllables':preds,'Matches':pred_match, 'Lengths':pred_len,
                             'Word Indices':pred_ids, 'Languages':pred_langs}).set_index('Positions')

truths = [true for row in random_words['True Syllables'] for true in row]
true_len = [len(true) for true in truths]
true_pos = [i+1 for row in random_words['True Syllables'] for i,syll in enumerate(row)]
true_ids = [[idx]*len(random_words.loc[idx,'True Syllables']) for idx in random_words.index]
true_ids = [id for idx in true_ids for id in idx]
true_langs = [[random_words.loc[idx, 'Languages']]*len(random_words.loc[idx,'True Syllables']) for idx in random_words.index]
true_langs = [lang for entry in true_langs for lang in entry]
true_match = [true for row in random_words['True Matches'] for true in row]
random_trues = pd.DataFrame({'Positions':true_pos,'Syllables':truths,'Matches':true_match, 'Lengths':true_len,
                             'Word Indices':true_ids, 'Languages':true_langs}).set_index('Positions')

# Syllable Metrics
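
These metrics are computed over unique syllable types rather than tokens: predicted syllables are grouped by language, position, and string, and a type counts as correct if at least one of its tokens matched a true syllable. A rough sketch of that aggregation on toy rows follows; the helper name and the data are hypothetical.

```python
from collections import defaultdict

def unique_type_precision(rows):
    """rows: (language, position, syllable, matched) per predicted token."""
    matched_by_type = defaultdict(bool)
    for language, position, syllable, matched in rows:
        key = (language, position, syllable)
        # A type is correct as soon as any of its tokens matched.
        matched_by_type[key] = matched_by_type[key] or bool(matched)
    return sum(matched_by_type.values()) / len(matched_by_type)

# Toy tokens: 'ba' at position 1 matched once, 'ban' never, 'na' once.
rows = [
    ('en', 1, 'ba', 1),
    ('en', 1, 'ba', 0),
    ('en', 1, 'ban', 0),
    ('en', 2, 'na', 1),
]
type_precision = unique_type_precision(rows)
print(round(type_precision, 2))  # 0.67
```

Recall is the mirror image, computed over unique true syllable types that were predicted at least once.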

In [None]:
pred_list = []
true_list = []
for pos in loo_preds.index.unique():
    pred_df = loo_preds.loc[[pos],['Syllables','Lengths','Matches','Languages']].reset_index()
    pred_df['Count'] = pred_df['Syllables']
    pred_df = pred_df.groupby(by=['Languages','Syllables'],as_index=False).agg(
        {'Positions':'max', 'Lengths':'max', 'Matches':'sum', 'Count':'count'})
    pred_df['True'] = pred_df['Matches'] > 0
    pred_df = pred_df.set_index('Positions')
    pred_list.append(pred_df)

for pos in loo_trues.index.unique():
    true_df = loo_trues.loc[[pos],['Syllables','Lengths','Matches','Languages']].reset_index()
    true_df['Count'] = true_df['Syllables']
    true_df = true_df.groupby(by=['Languages','Syllables'],as_index=False).agg(
        {'Positions':'max', 'Lengths':'max', 'Matches':'sum', 'Count':'count'})
    true_df['Predicted'] = true_df['Matches'] > 0
    true_df = true_df.set_index('Positions')
    true_list.append(true_df)

loo_unique_preds = pd.concat(pred_list)
loo_unique_trues = pd.concat(true_list)

pred_list = []
true_list = []
for pos in random_preds.index.unique():
    pred_df = random_preds.loc[[pos],['Syllables','Lengths','Matches','Languages']].reset_index()
    pred_df['Count'] = pred_df['Syllables']
    pred_df = pred_df.groupby(by=['Languages','Syllables'],as_index=False).agg(
        {'Positions':'max', 'Lengths':'max', 'Matches':'sum', 'Count':'count'})
    pred_df['True'] = pred_df['Matches'] > 0
    #pred_df = pred_df[pred_df['Count']>10]
    pred_df = pred_df.set_index('Positions')
    pred_list.append(pred_df)

for pos in random_trues.index.unique():
    true_df = random_trues.loc[[pos],['Syllables','Lengths','Matches','Languages']].reset_index()
    true_df['Count'] = true_df['Syllables']
    true_df = true_df.groupby(by=['Languages','Syllables'],as_index=False).agg(
        {'Positions':'max', 'Lengths':'max', 'Matches':'sum', 'Count':'count'})
    true_df['Predicted'] = true_df['Matches'] > 0
    true_df = true_df.set_index('Positions')
    true_list.append(true_df)

random_unique_preds = pd.concat(pred_list)
random_unique_trues = pd.concat(true_list)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(3.5,3.5))
precision = loo_unique_preds['True'].mean()
recall = loo_unique_trues['Predicted'].mean()
f1 = 2 * precision * recall / (precision + recall)
rand_prec = random_unique_preds['True'].mean()
rand_recall = random_unique_trues['Predicted'].mean()
rand_f1 = 2 * rand_prec * rand_recall / (rand_prec + rand_recall)
x = [0,1,2]
bars = ax.bar(x, [precision, recall, f1], color='tab:blue', edgecolor='black', label='CNN')
for i, metric in enumerate([rand_prec, rand_recall, rand_f1]):
    ax.plot([x[i]-0.36,x[i]+0.37],[metric, metric], color='tab:orange', linewidth=4, label='Random',
            path_effects=[pe.Stroke(linewidth=6, foreground='black'), pe.Normal()])
ax.bar_label(bars, fmt='{:.0%}', padding=3)
ax.set_ylim(0,1)
ax.set_xticks(x, ['Precision', 'Recall', 'F1'])
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[:1:-1], labels[:1:-1])
plt.savefig(os.path.join(results_folder, 'CNN - Unique Syllable Metrics.pdf'), dpi=500)
plt.show()

# Syllable Metrics by Position

In [None]:
max_pos = 0
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(8.5,3))
preds = loo_unique_preds
trues = loo_unique_trues
pred_pos = preds.index.unique()
true_pos = trues.index.unique()
positions = np.array(range(1, max(pred_pos.max(),true_pos.max())+1))
if positions.max() > max_pos:
    max_pos = positions.max()

pred_all = [len(preds[preds.index==pos]) for pos in positions]
true_all = [len(trues[trues.index==pos]) for pos in positions]
pred_max = max(pred_all)
true_max = max(true_all)

precision = np.array([preds[preds.index==pos]['True'].mean() for pos in positions])
recall = np.array([trues[trues.index==pos]['Predicted'].mean() for pos in positions])

ax[0].bar(positions[:11]-0.2, pred_all[:11], width=0.4, color='tab:blue', edgecolor='black', label='Pred Syllables')
ax[0].bar(positions[:11]+0.2, true_all[:11], width=0.4, color='tab:orange', edgecolor='black', label='True Syllables')
ax[0].set_ylabel('Unique Syllables')
ax[0].yaxis.set_major_formatter(k_formatter)
ax[0].set_xticks(ticks=range(12), labels=['','','2','','4','','6','','8','','10',''])
ax[0].set_xlabel('Position')

rand_prec = np.array([random_unique_preds[random_unique_preds.index==pos]['True'].mean() for pos in positions])
rand_recall = np.array([random_unique_trues[random_unique_trues.index==pos]['Predicted'].mean() for pos in positions])

ax[1].bar(positions, precision, color='white', edgecolor='black', width=1)
bars = ax[1].bar(positions, precision, color='tab:blue', edgecolor='black', width=1)
for k, bar in enumerate(bars):
    bar.set_alpha(pred_all[k]/pred_max)
ax[1].scatter(positions, rand_prec, s=36, color='white', edgecolor='black')
ax[1].scatter(positions, rand_prec, s=36, color=[(255/255,127/255,14/255,pred/pred_max) for pred in pred_all],
              edgecolor='black', label='Random')
ax[1].set_xticks(ticks=np.arange(1,max_pos+1), labels=[i if i%5==0 else '' for i in range(1,max_pos+1)])
ax[1].set_yticks(ticks=np.arange(0,1.1,0.1), labels=[f'{i/10:.1f}' if i%2==0 else '' for i in range(0,11)])
ax[1].set_xlabel('Position')
ax[1].set_ylabel('Precision')
ax[1].set_xlim(0,22)
ax[1].legend(loc='upper left')

ax[2].bar(positions, recall, color='white', edgecolor='black', width=1)
bars = ax[2].bar(positions, recall, color='tab:blue', edgecolor='black', width=1)
for k, bar in enumerate(bars):
    bar.set_alpha(true_all[k]/true_max)   # Recall is over true syllables, so fade bars by their counts
ax[2].scatter(positions, rand_recall, s=36, color='white', edgecolor='black')
ax[2].scatter(positions, rand_recall, s=36, color=[(255/255,127/255,14/255,true/true_max) for true in true_all],
              edgecolor='black', label='Random')
ax[2].set_xticks(ticks=np.arange(1,max_pos+1), labels=[i if i%5==0 else '' for i in range(1,max_pos+1)])
ax[2].set_yticks(ticks=np.arange(0,1.1,0.1), labels=[f'{i/10:.1f}' if i%2==0 else '' for i in range(0,11)])
ax[2].set_xlabel('Position')
ax[2].set_ylabel('Recall')
ax[2].set_xlim(0,12)
ax[2].legend(loc='upper left')

plt.tight_layout()
plt.savefig(os.path.join(results_folder, 'CNN - Unique Syllable Metrics by Position.pdf'), dpi=500)
plt.show()

# UDHR Common First Syllables

Runs the CNN model trained on all languages except English on the English text of the Universal Declaration of Human Rights (UDHR), and displays every word-initial syllable that was predicted more than 4 times. Alternatively, you can run the second block to load the presaved results, which should be identical.
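
The final filtering step amounts to counting word-initial syllables and keeping the frequent ones. A minimal sketch with `collections.Counter` is shown below. The word list is invented toy data, not actual UDHR output, and the helper name is hypothetical.

```python
from collections import Counter

def common_first_syllables(words, min_count=4):
    """Count word-initial syllables and keep those seen more than min_count times."""
    counts = Counter(syllables[0] for syllables in words if syllables)
    return {syll: n for syll, n in counts.items() if n > min_count}

# Each word is a list of predicted syllables (toy data).
words = [['the']] * 6 + [['hu', 'man']] * 5 + [['rights']] * 2
common = common_first_syllables(words)
print(common)  # {'the': 6, 'hu': 5}
```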

In [None]:
# Load UDHR Data
udhr_surprisals = pd.read_csv(os.path.join(data_folder, 'UDHR_Dataset.csv'))
udhr_ds = WordDataset(udhr_surprisals)
udhr_dl = DataLoader(udhr_ds, batch_size=1, shuffle=False)

# Predict Syllable Breaks
probs, truths = models['en'].predict(udhr_dl)
ids = [[i+1]*len(entry[0]) for i, entry in enumerate(udhr_ds)]
ids = [id for id_list in ids for id in id_list]
positions = [i+1 for entry in udhr_ds for i in range(len(entry[0]))]
segments = [segment for entry in udhr_ds for segment in entry[3]]
surprisals = [surprisal for entry in udhr_ds for surprisal in entry[0]]

udhr_outputs = pd.DataFrame({'Probabilities':probs, 'Ground Truths':truths, 'Languages':'en', 'Word Indices':ids,
                              'Positions':positions, 'Segments':segments, 'Surprisals':surprisals})

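# Load the threshold for the English model explicitly (the leftover loop variable `lang` would select the last language)
with open(os.path.join(data_folder, 'Prediction_Thresholds.json')) as f:
    model_thresholds = json.load(f)
threshold = model_thresholds['CNN']['en']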
udhr_outputs['Predictions'] = (udhr_outputs['Probabilities'] >= threshold).astype(float)
udhr_outputs['Thresholds'] = threshold

udhr_outputs = udhr_outputs[['Word Indices','Positions','Segments','Predictions','Ground Truths',
                              'Probabilities','Thresholds','Surprisals','Languages']]

# Extract Syllables
df = udhr_outputs.copy()
df['Pred Segments'] = df.apply(syllabify, syllables='Predictions', segments='Segments', axis=1)
df['True Segments'] = df.apply(syllabify, syllables='Ground Truths', segments='Segments', axis=1)
word_groups = df.groupby(by=['Word Indices'])
udhr_words = pd.DataFrame()
udhr_words['Word'] = word_groups['Segments'].apply(lambda x: ''.join(x))
udhr_words['Languages'] = word_groups['Languages'].first()
udhr_words['True Syllables'] = word_groups['True Segments'].apply(lambda x: ''.join(x).split('^'))
udhr_words['Pred Syllables'] = word_groups['Pred Segments'].apply(lambda x: ''.join(x).split('^'))
udhr_words['Ground Truths'] = word_groups['Ground Truths'].apply(list)
udhr_words['Predictions'] = word_groups['Predictions'].apply(list)
udhr_words['True Matches'] = udhr_words.apply(match_syllables, sylls='Ground Truths', ref_sylls='Predictions', axis=1)
udhr_words['Pred Matches'] = udhr_words.apply(match_syllables, sylls='Predictions', ref_sylls='Ground Truths', axis=1)
udhr_words = udhr_words[['Word', 'Languages', 'True Syllables', 'Pred Syllables', 'True Matches', 'Pred Matches']]

preds = [pred for row in udhr_words['Pred Syllables'] for pred in row]
pred_len = [len(pred) for pred in preds]
pred_pos = [i+1 for row in udhr_words['Pred Syllables'] for i,syll in enumerate(row)]
pred_ids = [[idx]*len(udhr_words.loc[idx,'Pred Syllables']) for idx in udhr_words.index]
pred_ids = [id for idx in pred_ids for id in idx]
pred_langs = [[udhr_words.loc[idx, 'Languages']]*len(udhr_words.loc[idx,'Pred Syllables']) for idx in udhr_words.index]
pred_langs = [lang for entry in pred_langs for lang in entry]
pred_match = [pred for row in udhr_words['Pred Matches'] for pred in row]
udhr_preds = pd.DataFrame({'Positions':pred_pos,'Syllables':preds,'Matches':pred_match, 'Lengths':pred_len,
                          'Word Indices':pred_ids, 'Languages':pred_langs}).set_index('Positions')

# Display beginning syllables predicted more than 4 times by the model
udhr_syllables = udhr_preds.loc[[1],['Syllables','Lengths','Matches']].reset_index()
udhr_syllables['Count'] = udhr_syllables['Syllables']
udhr_syllables = udhr_syllables.groupby(by='Syllables',as_index=True).agg(
    {'Lengths':'max', 'Matches':'mean', 'Count':'count'})
udhr_syllables['True'] = udhr_syllables['Matches'] > 0
udhr_syllables = udhr_syllables[udhr_syllables['Count']>4].reset_index()

display(udhr_syllables)

# Load UDHR Predicted Syllables

Loads the saved syllable predictions for the Universal Declaration of Human Rights. These should be identical to the results above.

In [None]:
pred_df = pd.read_csv(os.path.join(data_folder, 'UDHR_Predicted_Syllables.csv'), index_col=0)
display(pred_df)