# NMT Workshop Exercise 2: Transliteration

In this exercise we will train a seq2seq model to transliterate Hebrew text into Latin characters, without any prior knowledge of Hebrew.

## Part 1: Hebrew Unicode

For our purposes it will be useful to know a bit about how text in Hebrew is encoded in Python strings.

Recall that in Python a string is made up of **characters** that can be accessed with square brackets. The length of the string is the number of characters it contains:


In [1]:
print("hello"[1], "hello"[4], len("hello"))

e o 5


In Python 3, a string is a sequence of **Unicode code points**, or unique numeric identifiers for each character. Python lets us see the Unicode code point for a character by using the built-in function *ord*:

In [2]:
print("Unicode code points for characters in 'hello':", *[ord(char) for char in "hello"])

Unicode code points for characters in 'hello': 104 101 108 108 111


**Questions**
  1. What are the Unicode code points for each character in the word "naivete"? What about when it is written "naïveté"?
  2. Use the built-in Python function *hex* to get the hexadecimal (base-16) values for these code points. What are they?
  3. Use the [Show Unicode Character](http://qaz.wtf/u/show.cgi) tool to look at the Unicode characters in each of these two words. Where can we see the code point values? What about the names of the Unicode characters?
  4. What is the difference between the words "naïveté" and "naïveté"? What is the length of each as a Python string?

In [16]:
print("Unicode code points for characters in 'naiveté':", *[ord(char) for char in "naiveté"])

Unicode code points for characters in 'naiveté': 110 97 105 118 101 116 101 769


In [15]:
print("Unicode code points for characters in 'naïveté':", *[ord(char) for char in "naïveté"])

Unicode code points for characters in 'naïveté': 110 97 105 776 118 101 116 101 769


In [14]:
print("hexadecimal (base-16) points for characters in 'naiveté':", *[hex(ord(char)) for char in "naiveté"])

hexadecimal (base-16) points for characters in 'naiveté': 0x6e 0x61 0x69 0x76 0x65 0x74 0x65 0x301


In [13]:
print("hexadecimal (base-16) points for characters in 'naïveté':", *[hex(ord(char)) for char in "naïveté"])

hexadecimal (base-16) points for characters in 'naïveté': 0x6e 0x61 0x69 0x308 0x76 0x65 0x74 0x65 0x301


3.

Info for string "naïveté"

110     006E     n     LATIN SMALL LETTER N

97     0061     a     LATIN SMALL LETTER A

105     0069     i     LATIN SMALL LETTER I

776     0308     ̈     COMBINING DIAERESIS

118     0076     v     LATIN SMALL LETTER V

101     0065     e     LATIN SMALL LETTER E

116     0074     t     LATIN SMALL LETTER T

101     0065     e     LATIN SMALL LETTER E

769     0301     ́     COMBINING ACUTE ACCENT

Info for string "naiveté"

110     006E     n     LATIN SMALL LETTER N

97     0061     a     LATIN SMALL LETTER A

105     0069     i     LATIN SMALL LETTER I

118     0076     v     LATIN SMALL LETTER V

101     0065     e     LATIN SMALL LETTER E

116     0074     t     LATIN SMALL LETTER T

101     0065     e     LATIN SMALL LETTER E

769     0301     ́     COMBINING ACUTE ACCENT

The code point values are on the far left while the names are on the far right.

The difference between the two words is the combining diaeresis:

776 0308 ̈ COMBINING DIAERESIS

It is stored as its own code point after the letter "i", so Python counts it as a separate character.

In [17]:
print("length of naiveté: {}".format(len("naiveté")))

length of naiveté: 8


In [18]:
print("length of naïveté: {}".format(len("naïveté")))

length of naïveté: 9
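
The length difference comes from combining characters. As a minimal sketch (the strings are built from explicit escapes so that the code points are unambiguous, and the variable names are illustrative), the standard-library `unicodedata` module can name each code point and convert between the decomposed and precomposed spellings:

```python
import unicodedata

# Decomposed spelling: base letters followed by combining marks.
decomposed = "nai\u0308vete\u0301"

# NFC normalization merges a base letter with its combining mark wherever a
# precomposed code point exists (here U+00EF and U+00E9).
composed = unicodedata.normalize("NFC", decomposed)

for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(len(decomposed), len(composed))  # 9 7
print(decomposed == composed)          # False
```

The two strings usually render identically, yet they have different lengths and compare unequal, which is worth keeping in mind when building character sets from scraped text.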


Hebrew words can be written either without vowels, or with vowel symbols called **nikkud**. Let's consider how these are represented in Python and in Unicode.

**Questions:**
  5. What are the first and last letters in the Python string for the Hebrew word בלשנות? What are their hexadecimal Unicode code points?
  6. How many characters does the Hebrew string בַּלְשָׁנוּת have? Why this number?
  7. What are the second and third characters of יִשְׂרָאֵל? What are their hexadecimal Unicode code points?
 

In [20]:
my_string = "בלשנות"
my_string[0]
my_string[-1]
print("The first letter is: {}".format(my_string[0]))
print("The last letter is: {}".format(my_string[-1]))

The first letter is: ב
The last letter is: ת


In [29]:
print("hexadecimal (base-16) points for characters in בלשנות: \n", *[hex(ord(char)) for char in my_string])

hexadecimal (base-16) points for characters in בלשנות: 
 0x5d1 0x5dc 0x5e9 0x5e0 0x5d5 0x5ea


In [22]:
my_string2 = "בַּלְשָׁנוּת"
print("the length is: {}".format(len(my_string2)))

the length is: 12


The string contains 6 letters and 6 nikkud marks. Each nikkud mark is a separate Unicode code point that follows the letter it belongs to, so the length is 6 + 6 = 12.
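
One way to check this count is to group the code points by Unicode category. The sketch below is only an illustration: the word is written with explicit escapes, and the order of the marks on each letter is an assumption that may differ from the order in the dataset.

```python
import unicodedata

# Assumed code point order for the word with nikkud; real data may order the
# marks on a single letter differently.
word = "\u05d1\u05b7\u05bc\u05dc\u05b0\u05e9\u05c1\u05b8\u05e0\u05d5\u05bc\u05ea"

# Hebrew letters have category "Lo"; nikkud marks are nonspacing marks ("Mn").
letters = [ch for ch in word if unicodedata.category(ch) == "Lo"]
marks = [ch for ch in word if unicodedata.category(ch) == "Mn"]
print(len(word), len(letters), len(marks))

# Dropping the marks leaves the spelling without nikkud.
stripped = "".join(letters)
print(stripped)
```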

In [36]:
my_string3 = "יִשְׂרָאֵל"
print("The 2nd character is: {}  and the hexadecimal code is: {}".format(my_string3[1], hex(ord(my_string3[1]))))
print("The 3rd character is: {}  and the hexadecimal code is: {}".format(my_string3[2], hex(ord(my_string3[2]))))

The 2nd character is: ִ  and the hexadecimal code is: 0x5b4
The 3rd character is: ש  and the hexadecimal code is: 0x5e9


## Part 2: Data processing

We'll be using the data in the attached file *nikkud_seq2seq_data.csv* to train and test our model. This contains Hebrew words without nikkud (vowels), the words with nikkud, and their transliterations (pronunciation written in Latin characters), scraped from articles on the [Hebrew-language Wiktionary](https://he.wiktionary.org/wiki/%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99).

**Questions:**
  8. Load the data into a Pandas DataFrame variable *df*. How many entries does df contain? Looking at some sample entries, do the transliterations look correct?
  9. See if you can find where the transliterations were taken from in Wiktionary. (follow the link above and search for the given words.)

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('nikkud_seq2seq_data.csv')

In [3]:
df.shape

(15490, 3)

In [5]:
df.sample(n=4)

Unnamed: 0,nikkud,transliteration,word
3882,הֵד,hed,הד
3412,חַי,khai,חי
4347,הִצְטַנְּנוּת,hitztanenut,הצטננות
680,אֵין,ein,אין


Yes, the sampled transliterations look correct.

9.

The transliterations can be found on the Wiktionary pages for the sampled words:

https://he.wiktionary.org/wiki/%D7%94%D7%A6%D7%98%D7%A0%D7%A0%D7%95%D7%AA#%D7%94%D6%B4%D7%A6%D6%B0%D7%98%D6%B7%D7%A0%D6%B0%D6%BC%D7%A0%D7%95%D6%BC%D7%AA

https://he.wiktionary.org/wiki/%D7%90%D7%99%D7%9F

https://he.wiktionary.org/wiki/%D7%97%D7%99

https://he.wiktionary.org/wiki/%D7%94%D7%93

Our model will be simpler if we pad all words to be the same length, and add start- and end-of-word characters. 

**Questions:**
  10. Define variables *nikkud_maxlen* and *translit_maxlen* as the length of the longest word in the *nikkud* and *transliteration* columns, respectively. What are these lengths?
  11. Define the function *pad_word* as shown in the comments below, to add start- and end-of-word characters to a word and pad it to a given length.

In [10]:
#10.
nikkud_maxlen = df['nikkud'].apply(len).max()
translit_maxlen = df['transliteration'].apply(len).max()

In [12]:
nikkud_maxlen

31

In [14]:
translit_maxlen

25

In [15]:
# answer to question 11
def pad_word(word, pad_length):       
#### add code here so the function adds ^ to the beginning of the word, spaces  after the word, and $ at the end
#### so that the output string is of length pad_length
#### example: pad_word("hello", 12) should return the string "^hello     $" which is of length 12
    num_spaces = pad_length - len(word) - 2
    return '^' + word + ' '*num_spaces + '$'
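
As a quick check of the padding logic, here is a self-contained sketch of the same function. The length check is an addition for illustration only: without it, a word that is too long makes `' ' * num_spaces` an empty string, and the function silently returns a result of the wrong length.

```python
def pad_word(word, pad_length):
    """Wrap word in ^...$ and pad with spaces to exactly pad_length characters."""
    num_spaces = pad_length - len(word) - 2
    if num_spaces < 0:
        # Added for this sketch: fail loudly instead of returning a string
        # that is longer than pad_length.
        raise ValueError(f"{word!r} does not fit in {pad_length} characters")
    return "^" + word + " " * num_spaces + "$"


print(repr(pad_word("hello", 12)))  # '^hello     $'
print(len(pad_word("hed", 27)))     # 27
```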


Now we define strings containing all characters used in our words, along with starting, padding, and ending tokens:

In [19]:
nikkud_charset = '^$ ' + ''.join(sorted(set(''.join(df.nikkud))))
translit_charset = '^$ ' + ''.join(sorted(set(''.join(df.transliteration))))

**Questions:**
  12. How many characters are used in words with nikkud? In transliterations?
  13. Try printing out these character sets. Do you see anything strange in the output? Why?

In [26]:
len(nikkud_charset) - len('^$ ')

43

In [27]:
len(translit_charset) - len('^$ ')

28

In [24]:
nikkud_charset

'^$ "\'ְֱֲֳִֵֶַָֹֻּׁׂאבגדהוזחטיךכלםמןנסעףפץצקרשת'

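In the nikkud output, the nikkud marks do not appear as separate glyphs. They are Unicode combining characters, so when the string is displayed they are all stacked onto a neighboring character instead of being shown one by one.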
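
To see every character individually, the character set can be printed one code point at a time. This is a small sketch using a hypothetical one-word corpus (written with escapes) rather than the real DataFrame:

```python
import unicodedata

# Hypothetical one-word corpus: "shalom" with nikkud, written with escapes.
sample_words = ["\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"]
charset = "^$ " + "".join(sorted(set("".join(sample_words))))

for ch in charset:
    kind = "combining mark" if unicodedata.category(ch) == "Mn" else "base character"
    print(f"U+{ord(ch):04X}", kind, unicodedata.name(ch))

# ascii() escapes every non-ASCII character, so no mark can hide on a neighbor.
print(ascii(charset))
```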

In [23]:
translit_charset

'^$ "\'abcdefghijklmnopqrstuvwxyz'

Now let's define functions to produce sequence vectors from words with nikkud or transliterations:

In [25]:
def nikkud2sequence(nikkud):
  return [nikkud_charset.index(c) for c in pad_word(nikkud, nikkud_maxlen + 2)]
def translit2sequence(translit):
  return [translit_charset.index(c) for c in pad_word(translit, translit_maxlen + 2)]

**Questions:**
  14. What are the feature vectors for "שָׁלוֹם" and "shalom"? What do the numbers in the vectors mean?
  15. Add code to the comment below, to define functions *nikkud2onehot* and *translit2onehot*. These should take in strings (either a Hebrew word with nikkud, or a transliteration) and return a matrix where each character is one-hot encoded. Hint: Use *keras.utils.to_categorical*, with attribute *num_classes = (number of characters in the character set)*.
  16. If you implemented those functions correctly, nikkud2onehot('שָׁלוֹם').shape should equal (33, 46) and translit2onehot('shalom').shape should equal (27, 31). What do these dimensions mean?

In [28]:
#14
nikkud2sequence("שָׁלוֹם")

[0,
 44,
 13,
 17,
 31,
 24,
 14,
 32,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 1]

In [29]:
translit2sequence("shalom")

[0,
 23,
 12,
 5,
 16,
 19,
 17,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 1]

Each number is the index, in the relevant character set, of the corresponding character of the padded word: 0 is the start token ^, 1 is the end token $, and 2 is the padding space.
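
The mapping can be reproduced and inverted without the dataset. This sketch rebuilds the transliteration character set from the output shown earlier and uses the maximum length of 25 reported above. The helper name `sequence2translit` is hypothetical.

```python
import string

# Same characters as the translit_charset printed above.
translit_charset = "^$ \"'" + string.ascii_lowercase
translit_maxlen = 25


def pad_word(word, pad_length):
    return "^" + word + " " * (pad_length - len(word) - 2) + "$"


def translit2sequence(translit):
    return [translit_charset.index(c) for c in pad_word(translit, translit_maxlen + 2)]


def sequence2translit(seq):
    # Hypothetical inverse: look each index up in the character set again.
    return "".join(translit_charset[i] for i in seq)


seq = translit2sequence("shalom")
print(seq[:8], "...", seq[-1])
print(repr(sequence2translit(seq)))
```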

In [31]:
from keras.utils import to_categorical

Using TensorFlow backend.


In [37]:
def nikkud2onehot(nikkud):
    return to_categorical(nikkud2sequence(nikkud), num_classes=len(nikkud_charset))
def translit2onehot(translit):
    return to_categorical(translit2sequence(translit), num_classes=len(translit_charset))

In [38]:
nikkud2onehot('שָׁלוֹם').shape

(33, 46)

In [39]:
translit2onehot('shalom').shape

(27, 31)

In [40]:
nikkud2onehot('שָׁלוֹם')

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]], dtype=float32)

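The first dimension is the number of characters in the padded word (the maximum word length plus the start and end tokens: 31 + 2 = 33 for nikkud and 25 + 2 = 27 for transliterations). The second dimension is the length of each one-hot vector, which is the size of the corresponding character set (46 and 31).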
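
One-hot encoding itself is just row selection from an identity matrix. The sketch below uses NumPy instead of `to_categorical`, on the assumption that both give the same matrix for integer indices. The helper name `one_hot` is hypothetical.

```python
import numpy as np


def one_hot(indices, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[indices]


# Index sequence for "shalom" as printed above: ^, six letters, padding, $.
seq = [0, 23, 12, 5, 16, 19, 17] + [2] * 19 + [1]
matrix = one_hot(seq, num_classes=31)

print(matrix.shape)           # (27, 31)
print(matrix.argmax(axis=1))  # recovers the original indices
```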

Now let's combine the matrices for all the words together into tensors:

In [41]:
import numpy as np
X = np.array([nikkud2onehot(nikkud) for nikkud in df.nikkud])
Y = np.array([translit2onehot(translit) for translit in df.transliteration])

Notice that the first dimension of each tensor is the sample size (number of words):

In [42]:
X.shape, Y.shape

((15490, 33, 46), (15490, 27, 31))

In the seq2seq model that we will train, we will try to predict the next character in the transliteration from the characters already generated and from the given nikkud. Since Y contains the encoding for the characters in the transliteration, we want to shift it by one to represent the next character that needs to be predicted.  This is simple with the numpy function *np.roll*. We save this in the tensor Z which will be predicted by the model given X (nikkud) and Y (transliteration):

In [43]:
Z = np.roll(Y, -1, axis = 1)
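
A toy tensor makes the effect of the shift visible. In this sketch the four-symbol alphabet and the names `Y_demo` and `Z_demo` are illustrative only. Note that `np.roll` wraps around, so the start token ends up in the last position of the target.

```python
import numpy as np

# Toy alphabet of 4 symbols: 0 = ^, 1 = $, 2 = padding, 3 = a letter.
tokens = [0, 3, 2, 1]
Y_demo = np.eye(4, dtype="float32")[np.array([tokens])]  # shape (1, 4, 4)

# Shift by one step along the time axis (axis 1), as done for Z above.
Z_demo = np.roll(Y_demo, -1, axis=1)

print(Y_demo.argmax(axis=-1))  # [[0 3 2 1]]
print(Z_demo.argmax(axis=-1))  # [[3 2 1 0]]
```

At every time step the target in `Z_demo` is the character that follows the one in `Y_demo`, which is the pairing the decoder is trained on.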

## Part 3: Seq2seq with LSTMs:

We'll now build a seq2seq model with Keras to predict transliteration from nikkud. First let's build and train our model:



In [44]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model, Sequential

In [45]:
latent_dim = 256

encoder_inputs = Input(shape = (None, len(nikkud_charset))) ## BONUS
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_state = True)(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape = (None, len(translit_charset))) ## BONUS
decoder_lstm = LSTM(latent_dim, return_sequences = True, return_state = True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state = encoder_states)
decoder_dense = Dense(len(translit_charset), activation = 'softmax') ## BONUS
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy')
model.fit([X, Y], Z, batch_size = 256, epochs = 50, validation_split = 0.2)

encoder_model = Model(encoder_inputs, encoder_states)

decoder_states_inputs = [
    Input(shape = (latent_dim,)),
    Input(shape = (latent_dim,))
]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs,
                                    initial_state = decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)


Train on 12392 samples, validate on 3098 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


**Question:**

18. Check the input and output shapes of this model. What do these dimensions correspond to?

In [47]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 46)     0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 31)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, 256), (None, 310272      input_1[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, None, 256),  294912      input_2[0][0]                    
                                                                 lstm_1[0][1]                     
          

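The model has two inputs, with shapes (None, None, 46) and (None, None, 31). In each, the first dimension is the batch size, the second is the sequence length (left variable), and the third is the one-hot size, which is the size of the nikkud and transliteration character sets respectively. The output has the same form as the decoder input: for each position it holds a probability distribution over the 31 transliteration characters.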
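
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias. The helper name `lstm_params` is hypothetical.

```python
def lstm_params(input_dim, units):
    # Four gates, each with (input_dim + units) * units weights plus units biases.
    return 4 * ((input_dim + units) * units + units)


latent_dim = 256
print(lstm_params(46, latent_dim))  # encoder LSTM: 310272
print(lstm_params(31, latent_dim))  # decoder LSTM: 294912
```

Both values match the `Param #` column for `lstm_1` and `lstm_2` above.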

Based on this model, we can decode transliteration from nikkud one character at a time, at each step taking the most likely next character predicted by the model. The function *nikkud2translit* takes in a nikkud string and returns the predicted transliteration:

In [48]:
def decode_sequence(input_text, input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = np.zeros((1, 1, len(translit_charset))) ## BONUS
    target_seq[0, 0, 0] = 1.
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)
        char_probabilities = {
            c: p for c, p in zip(translit_charset, output_tokens[0, -1, :]) ## BONUS
        }
        sampled_char = max(translit_charset, key = lambda c: char_probabilities[c]) ## BONUS
        sampled_token_index = translit_charset.index(sampled_char) ## BONUS
        decoded_sentence += sampled_char
        if (sampled_char == '$' or
           len(decoded_sentence) > translit_maxlen): ## BONUS
            stop_condition = True
        target_seq = np.zeros((1, 1, len(translit_charset))) ## BONUS
        target_seq[0, 0, sampled_token_index] = 1.
        states_value = [h, c]
    return decoded_sentence

def nikkud2translit(nikkud):
  tensor = nikkud2onehot(nikkud)[None] ## BONUS
  return decode_sequence(nikkud, tensor).replace('$', '').strip()
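
The greedy loop in `decode_sequence` can be illustrated without Keras. In the sketch below, `toy_step` is a hypothetical stand-in for `decoder_model.predict`: it returns a probability for every character and an updated state, and it is hard-coded to spell "hed". The loop itself follows the same steps as the function above: pick the most likely character, stop at `$` or at the maximum length, and otherwise feed the choice back in.

```python
CHARSET = "^$ \"'abcdefghijklmnopqrstuvwxyz"


def toy_step(prev_char, state):
    """Hypothetical stand-in for the decoder: returns (probabilities, new_state)."""
    target = "hed$"
    pos = 0 if state is None else state
    want = target[min(pos, len(target) - 1)]
    probs = [0.9 if c == want else 0.1 / (len(CHARSET) - 1) for c in CHARSET]
    return probs, pos + 1


def greedy_decode(step_fn, max_len):
    prev_char, state, decoded = "^", None, ""
    while True:
        probs, state = step_fn(prev_char, state)
        # Greedy choice: take the single most probable next character.
        best_index = max(range(len(CHARSET)), key=lambda i: probs[i])
        prev_char = CHARSET[best_index]
        decoded += prev_char
        if prev_char == "$" or len(decoded) > max_len:
            break
    return decoded.replace("$", "").strip()


print(greedy_decode(toy_step, max_len=25))  # hed
```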

**Questions:**
  19. Make a new dataframe *df2* containing 100 random samples from *df*. Add a new column *predicted_translit* to the dataframe *df2* with the model's predicted transliteration of the given nikkud. How often does this equal the actual transliteration? What kinds of errors do you see in the output?
  20. Change the value of *epochs =* above to train the model on more epochs. How does this affect the loss? How about the observed results?

**Bonus:** Modify the problem so that we are instead predicting Hebrew text with nikkud from a transliteration. You will have to switch X and Y, and change code where the comment ## BONUS is written above.

In [50]:
#19
df2 = df.sample(n=100)

In [53]:
df2['predicted_translit'] = df2['nikkud'].apply(nikkud2translit)

In [58]:
acc = (df2['predicted_translit'] == df2['transliteration']).sum()/df2.shape[0]*100
print("The prediction equals the actual transliteration {}% of the time".format(acc))

The prediction equals the actual transliteration 37.0% of the time


In [59]:
df2.head()

Unnamed: 0,nikkud,transliteration,word,predicted_translit
3567,דִּיסְטוֹפְּיָה,distopya,דיסטופיה,distopot
14734,שְׂפַת,sfat,שפת,sfat
10704,סְמַרְטוּט,smartut,סמרטוט,starmut
13342,קָרַחַת,karachat,קרחת,karkata
4097,הִיפֶּרְגְלִיקֶמְיָה,hiperglikemya,היפרגליקמיה,hiprekholit


Some errors are minor, such as "ch" being predicted as "k" (karachat → karkata) or letters being swapped (smartut → starmut). Other predictions start correctly and then diverge substantially from the target, for example hiperglikemya → hiprekholit.
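
Exact match treats a one-letter slip the same as a completely wrong prediction. One way to separate the two, offered here as a sketch and not as part of the exercise, is to compute the Levenshtein edit distance between each prediction and its target. The pairs below are copied from the table above.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


pairs = [
    ("distopya", "distopot"),
    ("sfat", "sfat"),
    ("smartut", "starmut"),
    ("karachat", "karkata"),
    ("hiperglikemya", "hiprekholit"),
]
for target, predicted in pairs:
    print(target, predicted, edit_distance(target, predicted))
```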

In [60]:
#20
latent_dim = 256

encoder_inputs = Input(shape = (None, len(nikkud_charset))) ## BONUS
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_state = True)(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape = (None, len(translit_charset))) ## BONUS
decoder_lstm = LSTM(latent_dim, return_sequences = True, return_state = True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state = encoder_states)
decoder_dense = Dense(len(translit_charset), activation = 'softmax') ## BONUS
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy')
model.fit([X, Y], Z, batch_size = 256, epochs = 70, validation_split = 0.2)

encoder_model = Model(encoder_inputs, encoder_states)

decoder_states_inputs = [
    Input(shape = (latent_dim,)),
    Input(shape = (latent_dim,))
]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs,
                                    initial_state = decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

Train on 12392 samples, validate on 3098 samples
Epoch 1/70
...
Epoch 70/70


The loss drops further with more epochs. After 50 epochs the model was still underfitting: both the training and the validation loss were still decreasing, and there was no sign of overfitting. The predictions are correspondingly more accurate, as the new sample below shows.

In [61]:
df3 = df.sample(n=100)
df3['predicted_translit'] = df3['nikkud'].apply(nikkud2translit)
acc = (df3['predicted_translit'] == df3['transliteration']).sum()/df3.shape[0]*100
print("The prediction equals the actual transliteration {}% of the time".format(acc))

The prediction equals the actual transliteration 54.0% of the time
