# Syllabification experiments using Transformers

In this notebook we show our first experiments using the Transformer architecture to build a syllabifier. Training and hyperparameters are not optimal (only 10 epochs, no hyperparameter sweep), but the results were good enough to encourage us to keep working on this architecture. Other experiments can be found in the `Char2Char` and `Word2Char` notebooks.

In [194]:
import io
import os
import re
import time
import unicodedata

import matplotlib.pyplot as plt
import math
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers.experimental import preprocessing

import wandb
from deepcomedy.models.transformer import *
from deepcomedy.preprocessing import *
from deepcomedy.utils import *
from deepcomedy.metrics import *

import tqdm

%load_ext autoreload
%autoreload 2


## 1. Data loading and preprocessing

In [2]:
raw_text = open("./data/divina_textonly.txt", "rb").read().decode(encoding="utf-8")
raw_syll_text = (
    open("./data/divina_syll_textonly.txt", "rb").read().decode(encoding="utf-8")
)
syll_text = preprocess_text(raw_syll_text, end_of_tercet="")
text = preprocess_text(raw_text, end_of_tercet="")

Split preprocessed text into verses

In [3]:
sep = "<EOV>"
input_verses = [x + sep for x in text.split(sep)][:-1]
target_verses = [x + sep for x in syll_text.split(sep)][:-1]
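
To see why the trailing `[:-1]` is needed, here is a minimal sketch on a toy string (the sample text is made up for illustration; only the `<EOV>` separator comes from the notebook). When the text ends with the separator, `split` yields a final empty chunk, which would otherwise become a spurious verse containing only `<EOV>`.

```python
sep = "<EOV>"
toy_text = "<GO> a b <EOV><GO> c d <EOV>"  # hypothetical sample, not real data

chunks = toy_text.split(sep)
# The text ends with the separator, so the last chunk is an empty string.
verses = [chunk + sep for chunk in chunks][:-1]

print(chunks)  # ['<GO> a b ', '<GO> c d ', '']
print(verses)  # ['<GO> a b <EOV>', '<GO> c d <EOV>']
```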

Encode with tokenizer

In [4]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    char_level=False, filters="", lower=False
)
tokenizer.fit_on_texts(target_verses)
enc_input_verses = tokenizer.texts_to_sequences(input_verses)
enc_target_verses = tokenizer.texts_to_sequences(target_verses)
vocab_size = len(tokenizer.word_index) + 1

Pad sequences

In [5]:
input_text = tf.keras.preprocessing.sequence.pad_sequences(
    enc_input_verses, padding="post"
)
target_text = tf.keras.preprocessing.sequence.pad_sequences(
    enc_target_verses, padding="post"
)
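
As a rough illustration of what `padding="post"` does, the following pure-Python sketch right-pads toy sequences with zeros. The helper `pad_post` and the token ids are hypothetical and this is not the Keras implementation. Since the Keras `Tokenizer` assigns word indices starting from 1, index 0 is free to act as the padding value, which is why `vocab_size` above is `len(tokenizer.word_index) + 1`.

```python
def pad_post(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Hypothetical token ids standing in for encoded verses of different lengths.
toy_sequences = [[3, 7, 2], [5, 2], [4, 9, 8, 2]]
padded = pad_post(toy_sequences)

print(padded)  # [[3, 7, 2, 0], [5, 2, 0, 0], [4, 9, 8, 2]]
```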

In [6]:
input_train, input_test, target_train, target_test = train_test_split(
    input_text, target_text
)

In [7]:
batch_size = 32
dataset = make_dataset(input_train, target_train, batch_size=batch_size)
validation_dataset = make_dataset(input_test, target_test, batch_size=batch_size)

## 2. Training

In [161]:
best_config = {"num_layers": 4, "d_model": 256, "num_heads": 4, "dff": 1024}

transformer, transformer_trainer = make_transformer_model(
    best_config, vocab_size, vocab_size, checkpoint_save_path=None
)

In [162]:
transformer_trainer.train(dataset, 10, validation_dataset=validation_dataset, validation_every=1)

Epoch 1 Batch 0 Loss 5.1930 Accuracy 0.0071
Epoch 1 Batch 50 Loss 4.0188 Accuracy 0.1295
Epoch 1 Batch 100 Loss 3.5211 Accuracy 0.1728
Epoch 1 Batch 150 Loss 3.2827 Accuracy 0.1966
Epoch 1 Batch 200 Loss 3.0484 Accuracy 0.2309
Epoch 1 Batch 250 Loss 2.8551 Accuracy 0.2611
Epoch 1 Batch 300 Loss 2.7064 Accuracy 0.2849
Epoch 1 Batch 0 Validation Loss 1.7240 Validation Accuracy 0.4466
Epoch 1 Batch 50 Validation Loss 1.7252 Validation Accuracy 0.4591
Epoch 1 Batch 100 Validation Loss 1.7278 Validation Accuracy 0.4567
Epoch 1 Loss 2.6256 Accuracy 0.2976
Time taken for 1 epoch: 42.77 secs

Epoch 2 Batch 0 Loss 1.8372 Accuracy 0.4202
Epoch 2 Batch 50 Loss 1.8110 Accuracy 0.4314
Epoch 2 Batch 100 Loss 1.7727 Accuracy 0.4409
Epoch 2 Batch 150 Loss 1.7371 Accuracy 0.4488
Epoch 2 Batch 200 Loss 1.7048 Accuracy 0.4569
Epoch 2 Batch 250 Loss 1.6710 Accuracy 0.4657
Epoch 2 Batch 300 Loss 1.6272 Accuracy 0.4786
Epoch 2 Batch 0 Validation Loss 0.8765 Validation Accuracy 0.7110
Epoch 2 Batch 50 Valida

## 3. Syllabification

### 3.1. Syllabification example
Here we syllabify the first 100 verses of the test set. The `evaluate` function passes the input to the model and generates the output autoregressively, one token at a time, in a loop.
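
The actual `evaluate` function lives in `deepcomedy` and is not shown here. The following is only a minimal single-sequence sketch of the autoregressive idea, assuming greedy decoding; `greedy_decode` and `toy_next_token` are hypothetical names, and the toy "model" simply copies its input before emitting the stop symbol.

```python
def greedy_decode(next_token_fn, encoder_input, start_symbol, stop_symbol, max_len=20):
    """Generate one token at a time, feeding the growing output back in."""
    output = [start_symbol]
    for _ in range(max_len):
        token = next_token_fn(encoder_input, output)
        output.append(token)
        if token == stop_symbol:
            break
    return output


def toy_next_token(encoder_input, decoded_so_far, stop_symbol=2):
    """Hypothetical stand-in for a trained model: copy the input, then stop."""
    position = len(decoded_so_far) - 1  # tokens produced so far, excluding start
    if position < len(encoder_input):
        return encoder_input[position]
    return stop_symbol


decoded = greedy_decode(toy_next_token, [5, 6, 7], start_symbol=1, stop_symbol=2)
print(decoded)  # [1, 5, 6, 7, 2]
```

A real transformer would replace `toy_next_token` with a forward pass followed by a choice of the next token from the predicted distribution.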

In [163]:
start_symbol = tokenizer.word_index['<GO>']
stop_symbol = tokenizer.word_index['<EOV>']

In [179]:
encoder_input = tf.convert_to_tensor(input_test[:100])
decoder_input = tf.repeat([[start_symbol]], repeats=encoder_input.shape[0], axis=0)

In [180]:
output = evaluate(transformer, encoder_input, decoder_input,  stop_symbol, stopping_condition=stop_after_stop_symbol)

In [95]:
# Only take output before the first end of verse
stripped_output = list(map(lambda x: x.split('<EOV>')[0], tokenizer.sequences_to_texts(output.numpy())))

In [183]:
stripped_output = list(map(strip_tokens, stripped_output))

In [184]:
stripped_output

['|lu|cen|te |più |as|sai |di |quel |ch’ el|l’ e|ra.',
 '|che |si |sta|va|no a |l’ om|bra |die|tro al |sas|so',
 '|Poi, |ral|lar|ga|ti |per |la |stra|da |so|la,',
 '|Po|scia |ch’ io |v’ eb|bi al|cun |ri|co|no|sciu|to,',
 '|e |co|me |quel |ch’ è |pa|sto |la |ri|mi|ra;',
 '|con |le |quai |la |tua |E|ti|ca |per|trat|ta',
 '|ma |noi |siam |pe|re|grin |co|me |voi |sie|te.',
 '|La |lin|gua |ch’ io |par|lai |fu |tut|ta |spen|ta',
 '|che |guar|da ’l |pon|te, |che |Fio|ren|za |fes|se',
 '« |Io |sa|rò |pri|mo, e |tu |sa|rai |se|con|do».',
 '|por|re un |uom |per |lo |po|po|lo a’ |mar|tì|ri.',
 '|pri|ma |che |pos|sa |tut|ta in |sé |mu|tar|si;',
 '|con|tra ’l |di|sio, |fo |ben |ch’ io |non |dio|man|do”.',
 '|se |non |co|me |tri|sti|zia o |se|te o |fa|me:',
 '|vie |più |lu|cen|do, |co|min|cia|ron |can|ti',
 '|E |se |più |fu |lo |suo |par|lar |dif|fu|so,',
 '|a |Ce|pe|ran, |là |do|ve |fu |bu|giar|do',
 '|al |mio |di|sio |cer|ti|fi|ca|to |fer|mi.',
 '|non |fos|se |sta|ta a |Ce|sa|re |no|ver|ca,',
 '|p

In [99]:
correct_syll = target_test[:100]
correct_syll = ' '.join(tokenizer.sequences_to_texts(correct_syll))
correct_syll = strip_tokens(correct_syll)
correct_syll = correct_syll.split('\n')

In [110]:
exact_matches, similarities = zip(*validate_syllabification(stripped_output, correct_syll))

In [111]:
accuracy = sum(exact_matches) / len(exact_matches)
avg_similarities = np.mean(similarities)

In [109]:
print('Syllabification exact matches: {:.2f}%'.format(accuracy * 100))

Syllabification exact matches: 76.00%


In [112]:
print('Average similarity: {:.2f}'.format(avg_similarities))

Average similarity: 0.99


In [121]:
stripped_output = np.array(stripped_output)
correct_syll = np.array(correct_syll)
error_mask = ~np.array(exact_matches)

errors_output = stripped_output[error_mask]
errors_correct = correct_syll[error_mask]

In [126]:
errors_correct[1]

'|Po|scia |ch’ io |v’ eb|bi al|cun |ri|co|no|sciu|to,'

In [127]:
errors_output[1]

'|Po|scia |ch’ io |v’ eb|bi al|cun |ri|co|no|sciu|sco,'

### 3.2. Syllabification of the validation set

The `evaluate` function can handle many syllabification tasks in parallel, generating all output sentences simultaneously until every output contains at least one `<EOV>` token. This is faster than handling one sentence at a time, but feeding the whole test set at once runs out of GPU memory, so we process it in windows, which is a good trade-off between parallelism and memory consumption.

We split the test set into batches of 100 verses and call `evaluate` on one batch at a time, passing the appropriate stopping condition.

To see the difference empirically, try a `window_size` of 1: the ETA grows to almost 4 hours (3:48 in our run, before we interrupted it), while the whole process took about 20 minutes with a `window_size` of 100.
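
The two ideas described above, stopping once every sequence has produced the stop symbol and processing the test set in fixed-size windows, can be sketched in plain Python. Both helpers are hypothetical illustrations and not the `deepcomedy` implementation of `stop_after_stop_symbol`.

```python
def all_rows_contain(batch_output, stop_symbol):
    """True once every sequence in the batch has emitted the stop symbol."""
    return all(stop_symbol in row for row in batch_output)


def windows(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


stop = 2
print(all_rows_contain([[1, 5, 2], [1, 6, 7]], stop))        # False: second row unfinished
print(all_rows_contain([[1, 5, 2, 0], [1, 6, 7, 2]], stop))  # True: both rows finished

# 3559 test verses in windows of 100 give 36 calls, the last one smaller.
n_windows = len(list(windows(range(3559), 100)))
print(n_windows)  # 36
```

Because slicing never reads past the end of a sequence, the last window is simply shorter than the others.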

In [195]:
window_size = 100

result = []

for i in tqdm.tqdm(range(math.ceil(len(input_test) / window_size))):
    window = input_test[i*window_size:min((i + 1)*window_size, len(input_test))]
    
    encoder_input = tf.convert_to_tensor(window)
    decoder_input = tf.repeat([[start_symbol]], repeats=encoder_input.shape[0], axis=0)
    
    output = evaluate(transformer, encoder_input, decoder_input,  stop_symbol, stopping_condition=stop_after_stop_symbol)
    
    # Only take output before the first end of verse
    stripped_output = list(map(lambda x: x.split('<EOV>')[0], tokenizer.sequences_to_texts(output.numpy())))
    stripped_output = list(map(strip_tokens, stripped_output))
    
    result += stripped_output

100%|██████████| 36/36 [20:38<00:00, 34.41s/it]


In [228]:
window_size = 1

result = []

for i in tqdm.tqdm(range(math.ceil(len(input_test) / window_size))):
    window = input_test[i*window_size:min((i + 1)*window_size, len(input_test))]
    
    encoder_input = tf.convert_to_tensor(window)
    decoder_input = tf.repeat([[start_symbol]], repeats=encoder_input.shape[0], axis=0)
    
    output = evaluate(transformer, encoder_input, decoder_input,  stop_symbol, stopping_condition=stop_after_stop_symbol)
    
    # Only take output before the first end of verse
    stripped_output = list(map(lambda x: x.split('<EOV>')[0], tokenizer.sequences_to_texts(output.numpy())))
    stripped_output = list(map(strip_tokens, stripped_output))
    
    result += stripped_output

  0%|          | 7/3559 [00:26<3:48:15,  3.86s/it]


KeyboardInterrupt: 

Now we compare the model's syllabification with the correct one. For each verse, the `validate_syllabification` function returns whether it was syllabified exactly and its Levenshtein similarity to the correct syllabification (1 - edit distance, with the distance presumably normalized by verse length, since the reported values lie in [0, 1]).
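
For readers unfamiliar with the metric, here is a self-contained sketch of a Levenshtein similarity. The internals of `validate_syllabification` are not shown in this notebook, so the normalization by the longer string is an assumption, and `levenshtein` and `similarity` are hypothetical helper names. The example strings are the endings of the mismatched verse from Section 3.1.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """1 - edit distance, normalized by the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


predicted = "|ri|co|no|sciu|sco,"
expected = "|ri|co|no|sciu|to,"
print(levenshtein(predicted, expected))  # 2
print(round(similarity(predicted, expected), 2))
```

A single wrong syllable costs only a couple of edits, which is why the average similarity stays close to 1 even when the exact-match rate is noticeably lower.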

In [209]:
correct_syll = target_test
correct_syll = ' '.join(tokenizer.sequences_to_texts(correct_syll))
correct_syll = strip_tokens(correct_syll)
correct_syll = correct_syll.split('\n')

In [216]:
exact_matches, similarities = zip(*validate_syllabification(result, correct_syll))

In [217]:
accuracy = sum(exact_matches) / len(exact_matches)
avg_similarities = np.mean(similarities)

In [218]:
print('Syllabification exact matches: {:.2f}%'.format(accuracy * 100))

Syllabification exact matches: 86.40%


In [219]:
print('Average similarity: {:.2f}'.format(avg_similarities))

Average similarity: 0.99


In [221]:
stripped_output = np.array(result)
correct_syll = np.array(correct_syll)
error_mask = ~np.array(exact_matches)

errors_output = stripped_output[error_mask]
errors_correct = correct_syll[error_mask]

### 3.3. Syllabification of other "poetry"

We thought it would be a fun experiment to see if the model could syllabify other poetry, not just hendecasyllabic verses. To stay true to our Roman roots we chose a classic Roman folk song, which incidentally contains quite a few synalephas.
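
Since every syllable in the model's output starts with `|`, the metrical length of a verse can be checked by counting that marker. The sketch below uses a hypothetical helper, `count_syllables`, on one verse of the Divine Comedy from Section 3.1 and on the song's refrain as the model syllabifies it later in this section.

```python
def count_syllables(syllabified_verse):
    """Count syllables in a verse where each syllable starts with '|'."""
    return syllabified_verse.count("|")


dante = "|Po|scia |ch’ io |v’ eb|bi al|cun |ri|co|no|sciu|to,"
song = "|Tan|to |pe’ |can|tà"

print(count_syllables(dante))  # 11: a hendecasyllable
print(count_syllables(song))   # 5: a much shorter line
```

Syllables that span two words, such as `bi al`, are counted once, which is how a synalepha affects the metrical count.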

In [204]:
arbitrary_verses = """
È una canzone senza titolo
Tanto pe’ cantà
Pe’ fa quarche cosa
Non è gnente de straordinario
È robba der paese nostro
Che se po’ cantà pure senza voce
Basta ’a salute
Quanno c'è 'a salute c'è tutto
Basta ’a salute e un par de scarpe nove
Poi girà tutto er monno
E m’a accompagno da me
Pe’ fa la vita meno amara
Me so’ comprato 'sta chitara
E quanno er sole scenne e more
Me sento ’n core cantatore
La voce e’ poca ma ’ntonata
Nun serve a fa ’na serenata
Ma solamente a fa 'n maniera
De famme ’n sogno a prima sera
Tanto pe’ cantà
Perché me sento un friccico ner core
Tanto pe’ sognà
Perché ner petto me ce naschi ’n fiore
Fiore de lillà
Che m'ariporti verso er primo amore
Che sospirava le canzoni mie
E m’aritontoniva de bucie
Canzoni belle e appassionate
Che Roma mia m’aricordate
Cantate solo pe’ dispetto
Ma co’ ’na smania dentro ar petto
Io nun ve canto a voce piena
Ma tutta l’anima è serena
E quanno er cielo se scolora
De me nessuna se ’nnamora
Tanto pe’ cantà
Perché me sento un friccico ner core
Tanto pe’ sognà
Perché ner petto me ce naschi un fiore
Fiore de lillà
Che m’ariporti verso er primo amore
Che sospirava le canzoni mie
E m’aritontoniva de bucie
"""

arbitrary_verses = preprocess_text(arbitrary_verses)
arbitrary_verses = [verse.strip() + ' <EOV>' for verse in arbitrary_verses.split('<EOV>')]
arbitrary_verses

['<GO> È <SEP> u n a <SEP> c a n z o n e <SEP> s e n z a <SEP> t i t o l o <EOV>',
 '<GO> T a n t o <SEP> p e ’ <SEP> c a n t à <EOV>',
 '<GO> P e ’ <SEP> f a <SEP> q u a r c h e <SEP> c o s a <EOV>',
 '<GO> N o n <SEP> è <SEP> g n e n t e <SEP> d e <SEP> s t r a o r d i n a r i o <EOV>',
 '<GO> È <SEP> r o b b a <SEP> d e r <SEP> p a e s e <SEP> n o s t r o <EOV>',
 '<GO> C h e <SEP> s e <SEP> p o ’ <SEP> c a n t à <SEP> p u r e <SEP> s e n z a <SEP> v o c e <EOV>',
 '<GO> B a s t a <SEP> ’ a <SEP> s a l u t e <EOV>',
 "<GO> Q u a n n o <SEP> c ' è <SEP> ' a <SEP> s a l u t e <SEP> c ' è <SEP> t u t t o <EOV>",
 '<GO> B a s t a <SEP> ’ a <SEP> s a l u t e <SEP> e <SEP> u n <SEP> p a r <SEP> d e <SEP> s c a r p e <SEP> n o v e <EOV>',
 '<GO> P o i <SEP> g i r à <SEP> t u t t o <SEP> e r <SEP> m o n n o <EOV>',
 '<GO> E <SEP> m ’ a <SEP> a c c o m p a g n o <SEP> d a <SEP> m e <EOV>',
 '<GO> P e ’ <SEP> f a <SEP> l a <SEP> v i t a <SEP> m e n o <SEP> a m a r a <EOV>',
 "<GO> M e <SEP> s

In [205]:
encoded_verses = tokenizer.texts_to_sequences(arbitrary_verses)
padded_verses = tf.keras.preprocessing.sequence.pad_sequences(
    encoded_verses, padding="post"
)

In [206]:

encoder_input = tf.convert_to_tensor(padded_verses)
decoder_input = tf.repeat([[start_symbol]], repeats=encoder_input.shape[0], axis=0)

output = evaluate(transformer, encoder_input, decoder_input,  stop_symbol, stopping_condition=stop_after_stop_symbol)


In [207]:
# Only take output before the first end of verse
stripped_output = list(map(lambda x: x.split('<EOV>')[0], tokenizer.sequences_to_texts(output.numpy())))
stripped_output = list(map(strip_tokens, stripped_output))

In [208]:
stripped_output

['|T u|na |can|zo|ne |sen|za |ti|to|lo',
 '|Tan|to |pe’ |can|tà',
 '|Pe’ |fa |quar|che |co|sa',
 '|Non |è |gnen|te |de |stra|or|di|na|rio',
 '|T |rob|ba |der |pa|e|se |no|stro',
 '|Che |se |po’ |can|tà |pu|re |sen|za |vo|ce',
 '|Ba|sta ’a |sa|lu|te',
 '|Quan|no |cè |a |sa|lu|te |cè |tut|to',
 '|Ba|sta ’a |sa|lu|te e |un |par |de |scar|pe |no|ve',
 '|Poi |gi|rà |tut|to er |mon|no',
 '|E |m’ ac|ca|com|pa|pno |da |me',
 '|Pe’ |fa |la |vi|ta |me|no a|ma|ra',
 '|Me |so’ |com|pra|to |sta |chi|ta|ra|ra',
 '|E |quan|no er |so|le |scen|ne e |e |mo|re',
 '|Me |sen|to ’n |co|re |can|tan|to|re',
 '|La |vo|ce e’’ |po|ca |ma ’n|to|na|ta|ta',
 '|Nun |ser|ve a |fa |fa|na |se|re|na|ta',
 '|Ma |so|la|men|te a |fa |n |ma|na|ra',
 '|De |fam|me ’n |so|gno a |pri|ma |se|ra',
 '|Tan|to |pe’ |can|tà',
 '|Per|ché |me |sen|to un |fric|ci|co |ner |co|re',
 '|Tan|to |pe’ |so|gnà',
 '|Per|ché |ner |pet|to |me |ce |na|schi ’n |fio|re',
 '|Fio|re |de |lil|là',
 '|Che |ma|ri|por|ti |ver|so er |pri|mo a|mo|re',
 '|Che