# seq2seq Workshop Exercise 2: Transliteration

In this exercise we will train a seq2seq model to transliterate Hebrew text into Latin characters, without any prior knowledge of Hebrew.

## Part 1: Hebrew Unicode

For our purposes it will be useful to know a bit about how text in Hebrew is encoded in Python strings.

Recall that in Python a string is made up of **characters** that can be accessed with square brackets. The length of the string is the number of characters it contains:


In [1]:
print("hello"[1], "hello"[4], len("hello"))

e o 5


In Python 3, a string is a sequence of **Unicode code points**, or unique numeric identifiers for each character. Python lets us see the Unicode code point for a character by using the built-in function *ord*:

In [2]:
print("Unicode code points for characters in 'hello':", *[ord(char) for char in "hello"])

Unicode code points for characters in 'hello': 104 101 108 108 111


**Questions**
  1. What are the Unicode code points for each character in the word "naivete"? What about when it is written "naïveté"?
  2. Use the built-in Python function *hex* to get the hexadecimal (base-16) values for these code points. What are they?
  3. Use the [Show Unicode Character](http://qaz.wtf/u/show.cgi) tool to look at the Unicode characters in each of these two words. Where can we see the code point values? What about the names of the Unicode characters?
  4. What is the difference between the words "naïveté" and "naïveté"? What is the length of each as a Python string?

In [11]:
# 1
print("Unicode code points for characters in 'naivete':", *[ord(char) for char in "naivete"])
print("Unicode code points for characters in 'naïveté':", *[ord(char) for char in "naïveté"])
# ï (239) and é (233) have their own code points, unrelated to those of plain i (105) and e (101)

# 2
print("\nHex values of Unicode characters in 'naïveté':", *[hex(ord(char)) for char in "naïveté"])

# 3
# The Show Unicode Character tool displays, for each character, its Unicode value, followed by the hex value, the character itself, and a plain-English name

# 4
print("\nAlthough they look identical, len('naïveté')={} and len('naïveté')={}.".format(len('naïveté'), len('naïveté')))
print("This is not magic, but one is using the combining character feature of Unicode where the other uses one character that renders identically")

Unicode code points for characters in 'naivete': 110 97 105 118 101 116 101
Unicode code points for characters in 'naïveté': 110 97 239 118 101 116 233

Hex values of Unicode characters in 'naïveté': 0x6e 0x61 0xef 0x76 0x65 0x74 0xe9

Although they look identical, len('naïveté')=7 and len('naïveté')=9.
This is not magic, but one is using the combining character feature of Unicode where the other uses one character that renders identically
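
The two spellings differ in their Unicode normalization form. As a hedged sketch (it uses the standard-library `unicodedata` module, which the exercise itself does not rely on, and writes the word with escape sequences so the code points are unambiguous), the precomposed and decomposed forms can be converted into one another:

```python
import unicodedata

# Precomposed form (NFC): ï and é are single code points
composed = "na\u00efvet\u00e9"
# Decomposed form (NFD): plain i and e, each followed by a combining mark
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # 7 9
# Names of the combining marks that NFD split off
print([unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)])
# Normalizing back to NFC recovers the 7-character string
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Normalizing all text to one form before building a character set avoids treating visually identical strings as different inputs.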


Hebrew words can be written either without vowels, or with vowel symbols called **nikkud**. Let's consider how these are represented in Python and in Unicode.

**Questions:**
  5. What are the first and last letters in the Python string for the Hebrew word בלשנות? What are their hexadecimal Unicode code points?
  6. How many characters does the Hebrew string בַּלְשָׁנוּת have? Why this number?
  7. What are the second and third characters of יִשְׂרָאֵל? What are their hexadecimal Unicode code points?
 

In [19]:
# 5
print("First and last letters of בלשנות are respectively: {} and {}".format('בלשנות'[0],'בלשנות'[-1]))
print("This shows that right-to-left is only a display feature; in memory the characters are stored in reading order")

# 6
print("\nThe word 'בַּלְשָׁנוּת' has a length equal to {}. This is because each nikkud is a combining character to its letter.".format(len('בַּלְשָׁנוּת')))
print("In the word 'בַּלְשָׁנוּת', we can count 6 nikkudim in addition to the 6 letters of the word giving a count of 12.")

# 7
print("\nThe second and third characters of 'יִשְׂרָאֵל' are respectively: {} and {}, with hex values: {} and {}.".format('יִשְׂרָאֵל'[1], 'יִשְׂרָאֵל'[2], hex(ord('יִשְׂרָאֵל'[1])), hex(ord('יִשְׂרָאֵל'[2]))))

First and last letters of בלשנות are respectively: ב and ת
This shows that right-to-left is only a display feature; in memory the characters are stored in reading order

The word 'בַּלְשָׁנוּת' has a length equal to 12. This is because each nikkud is a combining character to its letter.
In the word 'בַּלְשָׁנוּת', we can count 6 nikkudim in addition to the 6 letters of the word giving a count of 12.

The second and third characters of 'יִשְׂרָאֵל' are respectively: ִ and ש, with hex values: 0x5b4 and 0x5e9.
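
Since each nikkud is a separate combining code point, the unvocalized spelling can be recovered by filtering those code points out. The sketch below is illustrative only: `strip_nikkud` is a hypothetical helper, not part of the exercise, and the order of the marks in the escaped string is an assumption that may differ from the string used above.

```python
import unicodedata

# The word from question 6, written with escapes (6 letters + 6 marks).
# Assumption: this mark order may not match the exercise's string exactly.
word_with_nikkud = (
    "\u05d1\u05b7\u05bc"   # bet + patah + dagesh
    "\u05dc\u05b0"         # lamed + shva
    "\u05e9\u05b8\u05c1"   # shin + qamats + shin dot
    "\u05e0"               # nun
    "\u05d5\u05bc"         # vav + dagesh
    "\u05ea"               # tav
)

def strip_nikkud(text):
    """Hypothetical helper: drop nonspacing marks (category Mn), keep base letters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

letters = strip_nikkud(word_with_nikkud)
marks = [ch for ch in word_with_nikkud if unicodedata.category(ch) == "Mn"]
print(len(word_with_nikkud), len(letters), len(marks))  # 12 6 6
```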


## Part 2: Data processing

We'll be using the data in the attached file *nikkud_seq2seq_data.csv* to train and test our model. This contains Hebrew words without nikkud (vowels), the words with nikkud, and their transliterations (pronunciation written in Latin characters), scraped from articles on the [Hebrew-language Wiktionary](https://he.wiktionary.org/wiki/%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99).

**Questions:**
  8. Load the data into a Pandas DataFrame variable *df*. How many entries does df contain? Looking at some sample entries, do the transliterations look correct?
  9. See if you can find where the transliterations were taken from in Wiktionary. (Follow the link above and search for the given words.)

In [33]:
# 8
import pandas as pd

df = pd.read_csv('nikkud_seq2seq_data.csv')
df.head()
# df contains 15490 entries. The sample transliterations look pretty good.

# 9
# The transliterations are available for each word page, in the 'ניתוח דקדוקי' table, on the 'הגייה' row
# An example is eugenika in the following page: https://he.wiktionary.org/wiki/%D7%90%D7%90%D7%95%D7%92%D7%A0%D7%99%D7%A7%D7%94

Unnamed: 0,nikkud,transliteration,word
0,פְּרוֹסְתֵטִית,prostetit,פרוסתטית
1,אֵאוּגֶנִיקָה,eugenika,אאוגניקה
2,אֵאוֹזִינוֹפִיל,e'ozinofil,אאוזינופיל
3,אָאוּטִינְג,auting,אאוטינג
4,אָב,av,אב


Our model will be simpler if we pad all words to be the same length, and add start- and end-of-word characters. 

**Questions:**
  10. Define variables *nikkud_maxlen* and *translit_maxlen* as the length of the longest word in the *nikkud* and *transliteration* columns, respectively. What are these lengths?
  11. Define the function *pad_word* as shown in the comments below, to add start- and end-of-word characters to a word and pad it to a given length.

In [45]:
nikkud_maxlen = df.nikkud.apply(len).max()
translit_maxlen = df.transliteration.apply(len).max()
print("The longest entries in the 'nikkud' and 'transliteration' columns have lengths {} and {}, respectively".format(nikkud_maxlen, translit_maxlen))

def pad_word(word, pad_length):
  #### add code here so the function adds ^ to the beginning of the word, spaces after the word, and $ at the end
  #### so that the output string is of length pad_length
  #### example: pad_word("hello", 12) should return the string "^hello     $" which is of length 12
  # Number of spaces between the word and '$'; clamped at 0 so overlong words are not truncated
  padding = max(pad_length - len(word) - 2, 0)
  return '^{}{}$'.format(word, ' ' * padding)

print("\nTesting pad_word('hello', 12) = {}".format(pad_word('hello', 12)))

The longest entries in the 'nikkud' and 'transliteration' columns have lengths 31 and 25, respectively

Testing pad_word('hello', 12) = ^hello     $


Now we define strings containing all characters used in our words, along with starting, padding, and ending tokens:

In [0]:
nikkud_charset = '^$ ' + ''.join(sorted(set(''.join(df.nikkud))))
translit_charset = '^$ ' + ''.join(sorted(set(''.join(df.transliteration))))

**Questions:**
  12. How many characters are used in words with nikkud? In transliterations?
  13. Try printing out these character sets. Do you see anything strange in the output? Why?

In [67]:
# 12
print("{} characters are used in the column 'nikkud' and {} in 'transliterations'".format(len(nikkud_charset), len(translit_charset)))

# 13
print("\nNikkud character set:\n{}".format(nikkud_charset))
print("\nTransliteration character set:\n{}".format(translit_charset))

print("\nThe tav 'ת' character had all the nikkudim merge with it, because these characters are special characters that attach to the preceding letter! 😂")

46 characters are used in the column 'nikkud' and 31 in 'transliterations'

Nikkud character set:
^$ "'ְֱֲֳִֵֶַָֹֻּׁׂאבגדהוזחטיךכלםמןנסעףפץצקרשת

Transliteration character set:
^$ "'abcdefghijklmnopqrstuvwxyz

The tav 'ת' character had all the nikkudim merge with it, because these characters are special characters that attach to the preceding letter! 😂
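
One way to see why the printed set looks odd is to inspect the Unicode category of each character in sorted order. This is a small sketch on a hand-picked sample (not the full character set), using the standard-library `unicodedata` module: the nikkud are nonspacing marks (category `Mn`) whose code points sort after ASCII punctuation and before the Hebrew letters, so in the sorted string they sit next to each other with no base letter of their own, and the renderer draws them on a neighbouring character.

```python
import unicodedata

# A small sample: shin, apostrophe, qamats, alef, shin dot (deliberately unsorted)
sample = ["\u05e9", "'", "\u05b8", "\u05d0", "\u05c1"]

# Sorting orders the characters by code point, as sorted(set(...)) does above
for ch in sorted(sample):
    print(hex(ord(ch)), unicodedata.category(ch), unicodedata.name(ch))

categories = [unicodedata.category(ch) for ch in sorted(sample)]
print(categories)  # ['Po', 'Mn', 'Mn', 'Lo', 'Lo']
```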


Now let's define functions to produce sequence vectors from words with nikkud or transliterations:

In [0]:
def nikkud2sequence(nikkud):
  return [nikkud_charset.index(c) for c in pad_word(nikkud, nikkud_maxlen + 2)]
def translit2sequence(translit):
  return [translit_charset.index(c) for c in pad_word(translit, translit_maxlen + 2)]

**Questions:**
  14. What are the feature vectors for "שָׁלוֹם" and "shalom"? What do the numbers in the vectors mean?
  15. Add code to the comment below, to define functions *nikkud2onehot* and *translit2onehot*. These should take in strings (either a Hebrew word with nikkud, or a transliteration) and return a matrix where each character is one-hot encoded. Hint: Use *tf.keras.utils.to_categorical*, with attribute *num_classes = (number of characters in the character set)*.
  16. If you implemented those functions correctly, nikkud2onehot('שָׁלוֹם').shape should equal (33, 46) and translit2onehot('shalom').shape should equal (27, 31). What do these dimensions mean?

In [71]:
# 14
print("Vectors for 'שָׁלוֹם' and 'shalom' are respectively:\n{} \nand \n{}".format(nikkud2sequence('שָׁלוֹם'), translit2sequence('shalom')))
      
# 15
from tensorflow.keras.utils import to_categorical

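def nikkud2onehot(word):
  # One row per character of the padded word, one column per character in the charset
  return to_categorical(nikkud2sequence(word), num_classes=len(nikkud_charset))

def translit2onehot(word):
  return to_categorical(translit2sequence(word), num_classes=len(translit_charset))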
  

# 16
print("\nThe shapes of nikkud2onehot('שָׁלוֹם') and translit2onehot('shalom') are {} and {}.".format(nikkud2onehot('שָׁלוֹם').shape, translit2onehot('shalom').shape))

Vectors for 'שָׁלוֹם' and 'shalom' are respectively:
[0, 44, 13, 17, 31, 24, 14, 32, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1] 
and 
[0, 23, 12, 5, 16, 19, 17, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]

The shapes of nikkud2onehot('שָׁלוֹם') and translit2onehot('shalom') are (33, 46) and (27, 31).
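
To make the two dimensions concrete, here is a minimal NumPy-only sketch on a toy character set. The names `charset`, `max_len`, `to_onehot` and `from_onehot` are hypothetical and chosen for illustration; indexing `np.eye` is used here as a stand-in for `to_categorical`. Rows correspond to positions in the padded word, columns to entries in the character set, so taking the argmax of each row recovers the padded string.

```python
import numpy as np

charset = "^$ " + "abcdefghijklmnopqrstuvwxyz"  # toy charset (assumption): 29 symbols
max_len = 8                                      # toy maximum word length (assumption)

def pad(word, length):
    # '^' + word + spaces + '$', with total length `length`
    return "^" + word + " " * (length - len(word) - 2) + "$"

def to_onehot(word):
    indices = [charset.index(c) for c in pad(word, max_len + 2)]
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(len(charset), dtype="float32")[indices]

def from_onehot(matrix):
    # The position of the 1 in each row is the character's index in the charset
    return "".join(charset[i] for i in matrix.argmax(axis=1))

m = to_onehot("shalom")
print(m.shape)               # (10, 29): padded length x charset size
print(repr(from_onehot(m)))  # '^shalom  $'
```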


Now let's combine the matrices for all the words together into tensors:

In [0]:
import numpy as np
X = np.array([nikkud2onehot(nikkud) for nikkud in df.nikkud])
Y = np.array([translit2onehot(translit) for translit in df.transliteration])

Notice that the first dimension of each tensor is the sample size (number of words):

In [73]:
X.shape, Y.shape

((15490, 33, 46), (15490, 27, 31))

In the seq2seq model that we will train, we will try to predict the next character in the transliteration from the characters already generated and from the given nikkud. Since Y contains the encoding for the characters in the transliteration, we want to shift it by one to represent the next character that needs to be predicted.  This is simple with the numpy function *np.roll*. We save this in the tensor Z which will be predicted by the model given X (nikkud) and Y (transliteration):

In [0]:
Z = np.roll(Y, -1, axis = 1)
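
As a small illustration of the shift (on index sequences rather than one-hot tensors, with made-up values), `np.roll` with a shift of -1 moves every element one position to the left along the chosen axis. Note that the first element wraps around to the end, so the final target position holds the start token rather than anything meaningful:

```python
import numpy as np

# One toy sequence: start token 0, three characters, padding 2, end token 1
decoder_input = np.array([[0, 23, 12, 5, 2, 1]])

# Position t of the target holds the input character at position t + 1
target = np.roll(decoder_input, -1, axis=1)
print(target)  # [[23 12  5  2  1  0]]
```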

## Part 3: Seq2seq with LSTMs

We'll now build a seq2seq model with Keras to predict transliteration from nikkud. First let's build and train our model:



In [106]:
import tensorflow as tf
latent_dim = 256

encoder_inputs = tf.keras.layers.Input(shape = (None, len(nikkud_charset))) ## BONUS
encoder = tf.keras.layers.LSTM(latent_dim, return_state = True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = tf.keras.layers.Input(shape = (None, len(translit_charset))) ## BONUS
decoder_lstm = tf.keras.layers.LSTM(latent_dim, return_sequences = True, return_state = True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state = encoder_states)
decoder_dense = tf.keras.layers.Dense(len(translit_charset), activation = 'softmax') ## BONUS
decoder_outputs = decoder_dense(decoder_outputs)

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy')
model.fit([X, Y], Z, batch_size = 256, epochs = 100, validation_split = 0.2)

encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)

decoder_state_input_h = tf.keras.layers.Input(shape = (latent_dim,))
decoder_state_input_c = tf.keras.layers.Input(shape = (latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs,
                                    initial_state = decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = tf.keras.models.Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

Train on 12392 samples, validate on 3098 samples
[Per-epoch training log: Epoch 1/100 through Epoch 73/100, no metrics preserved; the output is cut off at epoch 74]

Based on this model, we can decode transliteration from nikkud one character at a time, at each step taking the most likely next character predicted by the model. The function *nikkud2translit* takes in a nikkud string and returns the predicted transliteration:

In [0]:
def decode_sequence(input_text, input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = np.zeros((1, 1, len(translit_charset))) ## BONUS
    target_seq[0, 0, 0] = 1.
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)
        char_probabilities = {
            c: p for c, p in zip(translit_charset, output_tokens[0, -1, :]) ## BONUS
        }
        sampled_char = max(translit_charset, key = lambda c: char_probabilities[c]) ## BONUS
        sampled_token_index = translit_charset.index(sampled_char) ## BONUS
        decoded_sentence += sampled_char
        if (sampled_char == '$' or
           len(decoded_sentence) > translit_maxlen): ## BONUS
            stop_condition = True
        target_seq = np.zeros((1, 1, len(translit_charset))) ## BONUS
        target_seq[0, 0, sampled_token_index] = 1.
        states_value = [h, c]
    return decoded_sentence

def nikkud2translit(nikkud):
  tensor = nikkud2onehot(nikkud)[None] ## BONUS
  return decode_sequence(nikkud, tensor).replace('$', '').strip()
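
The control flow of greedy decoding can be seen in isolation with a stand-in for the trained decoder. In this sketch `toy_step` is a hypothetical function that plays the role of `decoder_model.predict` and always spells "ab"; only the loop structure (feed back the most likely character, stop at `$` or at a length limit) mirrors the function above.

```python
import numpy as np

charset = "^$ ab"   # toy charset (assumption)
max_len = 6         # toy length limit (assumption)

def toy_step(prev_index, step):
    """Hypothetical stand-in for the decoder: returns next-character probabilities."""
    script = "ab$"  # this toy 'model' ignores prev_index and spells 'ab', then stops
    probs = np.full(len(charset), 0.05)
    probs[charset.index(script[min(step, len(script) - 1)])] = 0.8
    return probs / probs.sum()

def greedy_decode(step_fn):
    prev = charset.index("^")           # decoding starts from the start token
    decoded = ""
    for step in range(max_len + 1):     # hard limit in case '$' is never predicted
        probs = step_fn(prev, step)
        prev = int(np.argmax(probs))    # greedy choice: most likely next character
        if charset[prev] == "$":
            break
        decoded += charset[prev]
    return decoded

result = greedy_decode(toy_step)
print(result)  # ab
```

Greedy decoding commits to one character per step; alternatives such as beam search keep several candidates, at the cost of more calls to the decoder.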

**Questions:**
  18. Make a new dataframe *df2* containing 100 random samples from *df*. Add a new column *predicted_translit* to the dataframe *df2* with the model's predicted transliteration of the given nikkud. How often does this equal the actual transliteration? What kinds of errors do you see in the output?
  17. Change the value of *epochs =* above to train the model on more epochs. How does this affect the loss? How about the observed results?

**Bonus:** Modify the problem so that we are instead predicting Hebrew text with nikkud from a transliteration. You will have to switch X and Y, and change code where the comment ## BONUS is written above.

In [91]:
# 18
df2 = df.sample(100)
df2['predicted_translit'] = df2.nikkud.apply(nikkud2translit)
print("The model's transliteration is equal to the actual transliteration {}/100 times\n".format((df2.transliteration == df2.predicted_translit).sum()))
df2.head(20)

# The model almost never produces the exact transliteration. Often, though, the first letter of the prediction matches,
# which suggests the LSTM was not trained long enough to predict the following letters well.

The model's transliteration is equal to the actual transliteration 0/100 times



Unnamed: 0,nikkud,transliteration,word,predicted_translit
8424,מַלְטָה,malta,מלטה,ma'ara
10869,עֲבָדִים,'avadim,עבדים,arakha
5659,הַעֹגֶן,ha'ogen,העגן,ma'ara
2994,גַּל,gal,גל,bara
1586,אֶצְבָּעוֹן,etsba'on,אצבעון,arika
3811,דֶּשֶׁא,deshe,דשא,beret
11727,פּוֹפְּקוֹרְן,popkorn,פופקורן,kharat
11962,פֶן,fen,פן,khara
410,אוֹקוּלְטִיזְם,okultizm,אוקולטיזם,arakha
11850,פִּילִינְג,piling,פילינג,khilut


In [93]:
# 17
df2 = df.sample(100)
df2['predicted_translit'] = df2.nikkud.apply(nikkud2translit)
print("The new model's transliteration is correct {}/100 times".format((df2.transliteration == df2.predicted_translit).sum()))
df2.head(20)


The new model's transliteration is correct 90/100 times


Unnamed: 0,nikkud,transliteration,word,predicted_translit
12613,צַמֶּרֶת,tzameret,צמרת,tsameret
12898,קוֹמְבִּינָטוֹרִיקָה,kombinatorika,קומבינטוריקה,komyofiya
2926,גִּזָּרוֹן,gizaron,גזרון,gizaron
11761,פַּח,pakh,פח,pakh
15030,תּוֹרַשְׁתִּי,torashti,תורשתי,torashti
160,אַגְנוֹסְטִי,agnosti,אגנוסטי,agnosti
9411,מְשֻׁשֶּׁה,meshushe,משושה,meshushe
8961,הֲגָאִים,haga'im,הגאים,haga'im
13626,פָּשׁוּט,pashut,פשוט,pashut
5600,חַשְׁמוֹנַאי,khashmonay,חשמונאי,khashmonay
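
Exact-match accuracy treats a near miss such as tzameret / tsameret the same as a completely wrong prediction. A character-level edit distance is one possible complementary measure; the sketch below is a plain-Python Levenshtein distance, with `edit_distance` being a hypothetical helper rather than part of the exercise.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

print(edit_distance("tzameret", "tsameret"))  # 1
print(edit_distance("gizaron", "gizaron"))    # 0
```

Applied to a dataframe like *df2*, the distance could be averaged over rows, or divided by the length of the reference to get a character error rate.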


In [107]:
# Extra words not in the training set:
print("Transliteration of '{}' returned {}".format('נִסְמַכְתִּי', nikkud2translit('נִסְמַכְתִּי')))
print("Transliteration of '{}' returned {}".format('תְהִלָּתִי', nikkud2translit('תְהִלָּתִי')))
print("Transliteration of '{}' returned {}".format('לַמְנַצֵּחַ', nikkud2translit('לַמְנַצֵּחַ')))

# Although the results on words sampled from the dataset are pretty impressive,
# we can see here that generalization to new words is weaker, although the predictions are not too far off

Transliteration of 'נִסְמַכְתִּי' returned nismakhmit
Transliteration of 'תְהִלָּתִי' returned tila'it
Transliteration of 'לַמְנַצֵּחַ' returned lamatstana


In [99]:
# BONUS QUESTION
Z = np.roll(X, -1, axis = 1)
latent_dim = 256

encoder_inputs = tf.keras.layers.Input(shape = (None, len(translit_charset)))
encoder = tf.keras.layers.LSTM(latent_dim, return_state = True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = tf.keras.layers.Input(shape = (None, len(nikkud_charset)))
decoder_lstm = tf.keras.layers.LSTM(latent_dim, return_sequences = True, return_state = True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state = encoder_states)
decoder_dense = tf.keras.layers.Dense(len(nikkud_charset), activation = 'softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy')
model.fit([Y, X], Z, batch_size = 256, epochs = 100, validation_split = 0.2)

encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)

decoder_state_input_h = tf.keras.layers.Input(shape = (latent_dim,))
decoder_state_input_c = tf.keras.layers.Input(shape = (latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs,
                                    initial_state = decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = tf.keras.models.Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

def decode_sequence_to_heb(input_text, input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = np.zeros((1, 1, len(nikkud_charset)))
    target_seq[0, 0, 0] = 1.
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)
        char_probabilities = {
            c: p for c, p in zip(nikkud_charset, output_tokens[0, -1, :])
        }
        sampled_char = max(nikkud_charset, key = lambda c: char_probabilities[c])
        sampled_token_index = nikkud_charset.index(sampled_char)
        decoded_sentence += sampled_char
        if (sampled_char == '$' or
           len(decoded_sentence) > nikkud_maxlen):
            stop_condition = True
        target_seq = np.zeros((1, 1, len(nikkud_charset)))
        target_seq[0, 0, sampled_token_index] = 1.
        states_value = [h, c]
    return decoded_sentence

def translit2nikkud(translit):
  tensor = translit2onehot(translit)[None]
  return decode_sequence_to_heb(translit, tensor).replace('$', '').strip()

Train on 12392 samples, validate on 3098 samples
[Per-epoch training log: Epoch 1/100 through Epoch 73/100, no metrics preserved; the output is cut off at epoch 74]

In [100]:
# BONUS - TESTING NEW MODEL
df2 = df.sample(100)
df2['predicted_nikkud'] = df2.transliteration.apply(translit2nikkud)
print("The transliteration to nikkud model was correct {}/100 times".format((df2.nikkud == df2.predicted_nikkud).sum()))
df2.head(20)

# Interestingly, most errors are close to the expected word.
# For example, tsaar was predicted with an aleph instead of an ayin, and a kamats instead of a patah.
# In this example and many others, the predicted spelling preserves the correct pronunciation.

# Since these predictions are generated rather than looked up in a dictionary, we could argue
# that a prediction that preserves the pronunciation is valid, which would raise our accuracy considerably.

The transliteration to nikkud model was correct 71/100 times


Unnamed: 0,nikkud,transliteration,word,predicted_nikkud
12653,צַעַר,tsaar,צער,צַאָר
4075,הַפָּתוּחַ,hapatu'akh,הפתוח,הַפָּתוּחַ
8466,מַמְזֵר,mamzer,ממזר,מַמְזֵר
1548,אֶפְעֶה,ef'e,אפעה,אֶפְעֶה
13432,קָשֶׁה,kashe,קשה,קָשֵׁה
11459,עֵקֶל,ekel,עקל,אֶכֶל
12799,חֲזִיר,khazir,חזיר,חֲזִיר
5810,טִלְטוּל,tiltul,טלטול,טִלְטוּל
4159,הָלַךְ,halakh,הלך,הָלַךְ
906,אָלֶלוֹפַּתְיָה,alelopatya,אללופתיה,אָלֶלוֹפַּתְיָה


In [104]:
# Extra words not in the training set:
print("Transliteration of '{}' returned {}".format('trumpeldor', translit2nikkud('trumpeldor')))
print("Transliteration of '{}' returned {}".format('jeremy', translit2nikkud('jeremy')))
print("Transliteration of '{}' returned {}".format("raanana", translit2nikkud("raanana")))
print("Transliteration of '{}' returned {}".format("titkhadesh", translit2nikkud("titkhadesh")))

# The output for these words is interesting, but less accurate than on words sampled from the dataset.
# This suggests that the model is overfitting to the training data.

Transliteration of 'trumpeldor' returned טְרוּמְדֶּנְטֶרְי
Transliteration of 'jeremy' returned גֶ'רֶמֶה
Transliteration of 'raanana' returned אֲרָנָלוֹ
Transliteration of 'titkhadesh' returned טִיְחַדְךְ
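
One caveat about the accuracy figures above: `df.sample(100)` draws from all rows, most of which were used for training. With array inputs, Keras's `validation_split` holds out the last fraction of the samples (taken before shuffling), so a stricter check could evaluate only those rows. The sketch below assumes that behavior; `validation_slice` is a hypothetical helper, not a Keras function.

```python
import math

def validation_slice(n_samples, validation_split):
    """Hypothetical helper: rows held out when the last fraction is used for validation."""
    split_at = math.floor(n_samples * (1.0 - validation_split))
    return slice(split_at, n_samples)

held_out = validation_slice(15490, 0.2)
n_train = held_out.start
n_val = held_out.stop - held_out.start
print(n_train, n_val)  # 12392 3098, matching the training log above

# Possible use (not run here): df.iloc[held_out].sample(100)
```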
