### English to Hindi Transliteration Tutorial

### Setup
- On Google Colab, select a Python 3 runtime with GPU acceleration before running the code

#### GPU

In [2]:
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0
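
Outside a notebook, the `%env` magic is not available. A minimal sketch of the plain-Python equivalent, assuming it runs before TensorFlow initializes the GPU:

```python
import os

# Expose only the first GPU (device index 0) to this process.
# Assumption: this runs before TensorFlow touches the GPU, otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(os.environ["CUDA_VISIBLE_DEVICES"])
```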


#### Download Data

In [3]:
!mkdir -p data
!wget -N "https://raw.githubusercontent.com/bsantraigi/tensorflow-seq2seq-hindi/master/data/Hindi%20-%20Word%20Transliteration%20Pairs%201.txt" -P data/

--2019-08-10 19:42:28--  https://raw.githubusercontent.com/bsantraigi/tensorflow-seq2seq-hindi/master/data/Hindi%20-%20Word%20Transliteration%20Pairs%201.txt
Connecting to 172.16.2.30:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 773211 (755K) [text/plain]
Last-modified header missing -- time-stamps turned off.
--2019-08-10 19:42:29--  https://raw.githubusercontent.com/bsantraigi/tensorflow-seq2seq-hindi/master/data/Hindi%20-%20Word%20Transliteration%20Pairs%201.txt
Connecting to 172.16.2.30:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 773211 (755K) [text/plain]
Saving to: ‘data/Hindi - Word Transliteration Pairs 1.txt’


2019-08-10 19:42:31 (793 KB/s) - ‘data/Hindi - Word Transliteration Pairs 1.txt’ saved [773211/773211]



#### Imports

In [4]:
import nltk
from collections import Counter
from tqdm import tqdm_notebook
import numpy as np
import tensorflow as tf
from tensorflow.contrib import seq2seq
from tensorflow.contrib.rnn import DropoutWrapper
import random

In [5]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/bishal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### Global Parameters

In [6]:
MAX_SEQ_LEN = 20
BATCH_SIZE = 64

### Language Vocabulary 
* The vocabulary here is a set of characters (i.e. an alphabet), not words

In [7]:
class Lang:
    def __init__(self, counter, vocab_size):
        self.word2id = {}
        self.id2word = {}
        self.pad = "<PAD>"
        self.sos = "<SOS>"
        self.eos = "<EOS>"
        self.unk = "<UNK>"
        
        self.ipad = 0
        self.isos = 1
        self.ieos = 2
        self.iunk = 3
        
        self.word2id[self.pad] = 0
        self.word2id[self.sos] = 1
        self.word2id[self.eos] = 2
        self.word2id[self.unk] = 3
        
        self.id2word[0] = self.pad
        self.id2word[1] = self.sos
        self.id2word[2] = self.eos
        self.id2word[3] = self.unk
        
        curr_id = 4
        for w, c in counter.most_common(vocab_size):
            self.word2id[w] = curr_id
            self.id2word[curr_id] = w
            curr_id += 1
            
    def encodeSentence(self, s, max_len=-1):
        wseq = s.lower().strip()
        if max_len == -1:
            return [self.word2id[w] if w in self.word2id else self.iunk for w in wseq]
        else:
            return ([self.word2id[w] if w in self.word2id else self.iunk for w in wseq] + [self.ieos] + [self.ipad]*max_len)[:max_len]
        
    def encodeSentence2(self, s, max_len=-1):
        # Returns (effective length including EOS, ids padded/truncated to max_len)
        wseq = s.lower().strip()
        return min(max_len, len(wseq)+1), \
            ([self.word2id[w] if w in self.word2id else self.iunk for w in wseq] + \
                [self.ieos] + [self.ipad]*max_len)[:max_len]
    
    def decodeSentence(self, id_seq):
        id_seq = np.array(id_seq + [self.ieos])
        j = np.argmax(id_seq==self.ieos)
        s = ''.join([self.id2word[x] for x in id_seq[:j]])
        s = s.replace(self.unk, "UNK")
        return s
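
To make the padding and length logic of `encodeSentence2` concrete, here is a small standalone sketch. It uses the same reserved ids as `Lang` (0 = PAD, 1 = SOS, 2 = EOS, 3 = UNK), but the three-character vocabulary `toy_word2id` and the helper name `encode_with_length` are made up for illustration.

```python
# Same reserved ids as the Lang class above.
PAD, SOS, EOS, UNK = 0, 1, 2, 3
toy_word2id = {"a": 4, "r": 5, "m": 6}  # hypothetical character ids

def encode_with_length(word, max_len):
    """Mirror of encodeSentence2: returns (effective length, padded ids)."""
    chars = word.lower().strip()
    ids = [toy_word2id.get(c, UNK) for c in chars]
    # Append EOS, pad generously, then cut to exactly max_len entries.
    padded = (ids + [EOS] + [PAD] * max_len)[:max_len]
    # The length counts the EOS symbol but never exceeds max_len.
    return min(max_len, len(chars) + 1), padded

length, ids = encode_with_length("Ram", 6)
print(length, ids)  # 4 [5, 4, 6, 2, 0, 0]
```

Note that when a word has `max_len` or more characters, the slice drops the EOS symbol, so very long inputs reach the model without an end marker.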

In [8]:
# Total number of samples to read
N = 30823

### Reading the data files
- Each line contains one Hindi word written in both Latin (English) and Devanagari script, separated by a tab
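
As a rough illustration of that line format, the sketch below parses two made-up lines and counts characters per script with `Counter`. The sample lines are hypothetical and are not taken from the dataset file.

```python
from collections import Counter

# Two hypothetical lines in the "<latin>\t<devanagari>" format.
sample_lines = ["ram\tराम\n", "saya\tसाया\n"]

pairs = []
for line in sample_lines:
    en, hi = line.strip().split("\t")
    pairs.append((en, hi))

# Character frequencies per script; these become the two alphabets.
en_chars = Counter(ch for en, _ in pairs for ch in en)
hi_chars = Counter(ch for _, hi in pairs for ch in hi)

print(pairs)
print(en_chars.most_common(1))  # [('a', 3)]
print(len(hi_chars), "distinct Devanagari characters")
```

Vowel signs such as `ा` are separate code points, so they are counted as characters in their own right.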

In [9]:
hi_counter = Counter()
hi_sentences=[]
en_counter = Counter()
en_sentences=[]
with open("data/Hindi - Word Transliteration Pairs 1.txt", encoding="utf-8") as f:
    for line in tqdm_notebook(f, total=N, desc="Reading file:"):
        en, hi = line.strip().split("\t")
        hi_sentences.append(hi)
        en_sentences.append(en)
    for line in tqdm_notebook(hi_sentences, desc="Processing inputs:"):
        for w in line.strip():
            hi_counter[w] += 1
    for line in tqdm_notebook(en_sentences, desc="Processing inputs:"):
        for w in line.strip():
            en_counter[w] += 1


In [10]:
# A few sample hindi characters
print("Most common hi characters in dataset:\n", hi_counter.most_common(5))

print("\nTotal (hi)characters gathered from dataset:",len(hi_counter))

# A few sample english characters
print("\nMost common en characters in dataset:\n", en_counter.most_common(5))

print("\nTotal (en)characters gathered from dataset:", len(en_counter))

Most common hi characters in dataset:
 [('ा', 21123), ('र', 9205), ('े', 8100), ('न', 7225), ('ी', 6546)]

Total (hi)characters gathered from dataset: 66

Most common en characters in dataset:
 [('a', 57220), ('n', 15015), ('i', 14015), ('h', 13805), ('e', 12264)]

Total (en)characters gathered from dataset: 27


In [11]:
en_lang = Lang(en_counter, len(en_counter))
hi_lang = Lang(hi_counter, len(hi_counter))

In [12]:
print("Test en encoding:", en_lang.encodeSentence("Shukriya"))

print("Test en decoding:", en_lang.decodeSentence(en_lang.encodeSentence("Shukriya", 10)))

print("Test hindi encoding:", hi_lang.encodeSentence("शुक्रिया", 10))

print("Test hindi decoding:", hi_lang.decodeSentence((hi_lang.encodeSentence("शुक्रिया", 10))))

Test en encoding: [15, 7, 10, 13, 9, 6, 20, 4]
Test en decoding: shukriya
Test hindi encoding: [35, 19, 15, 22, 5, 12, 21, 4, 2, 0]
Test hindi decoding: शुक्रिया


In [13]:
VE = len(en_lang.word2id)
VH = len(hi_lang.word2id)

### The Seq2Seq architecture
- We will implement a seq2seq architecture for transliteration in TensorFlow r1.13.1 / r1.14
- Debugging tip: always keep track of tensor dimensions!
- **TensorFlow Computation Graph** - We build a TensorFlow computation graph first. This is the representation TensorFlow uses for any neural network architecture. Once the graph is built, you can feed data to it for training or inference

#### Character Embedding Matrix

In [14]:
en_word_emb_matrix = tf.get_variable("en_word_emb_matrix", (VE, 300), dtype=tf.float32)
hi_word_emb_matrix = tf.get_variable("hi_word_emb_matrix", (VH, 300), dtype=tf.float32)

Instructions for updating:
Colocations handled automatically by placer.


#### Placeholders
- Inputs to a TensorFlow graph are supplied through placeholders, which are filled with actual data at run time via `feed_dict`

In [15]:
keep_prob = tf.placeholder(tf.float32)

input_ids = tf.placeholder(tf.int32, (None, MAX_SEQ_LEN))
input_lens = tf.placeholder(tf.int32, (None, ))

ph_target_ids = tf.placeholder(tf.int32, (None, MAX_SEQ_LEN))
target_lens = tf.placeholder(tf.int32, (None, ))

In [16]:
# Add SOS or GO symbol
target_ids = tf.concat([tf.fill([BATCH_SIZE,1], hi_lang.isos), ph_target_ids], -1)

#### Building the computation graph

In [17]:
input_emb = tf.nn.embedding_lookup(en_word_emb_matrix, input_ids)
target_emb = tf.nn.embedding_lookup(hi_word_emb_matrix, target_ids[:, :-1])
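
The SOS prefix and the `[:, :-1]` slice implement teacher forcing: at each step the decoder receives the previous correct character and is trained to predict the next one. A plain-Python sketch for a single row, using made-up character ids and a short stand-in for `MAX_SEQ_LEN`:

```python
SOS, EOS, PAD = 1, 2, 0
MAX_LEN = 6  # small stand-in for MAX_SEQ_LEN

# One padded target row as it would be fed to ph_target_ids (hypothetical ids).
ph_target_row = [7, 8, 9, EOS, PAD, PAD]

# Prepending SOS gives a row of length MAX_LEN + 1, like target_ids.
target_row = [SOS] + ph_target_row

decoder_input = target_row[:-1]  # what target_ids[:, :-1] selects
labels = target_row[1:]          # what the loss later compares against

for step, (x, y) in enumerate(zip(decoder_input, labels)):
    print(f"step {step}: input={x} -> predict={y}")
```

The labels are just the original placeholder row again, which is why the loss can slice `target_ids[:, 1:]` to recover them.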

In [18]:
input_emb.shape

TensorShape([Dimension(None), Dimension(20), Dimension(300)])

#### Encoder - RNN based sequence encoder

In [19]:
encoder_cell = tf.nn.rnn_cell.GRUCell(128) # 128 is the dimension of hidden state
encoder_cell = DropoutWrapper(encoder_cell, output_keep_prob=keep_prob) # Adding Dropout for regularization

Instructions for updating:
This class is equivalent as tf.keras.layers.GRUCell, and will be replaced by that in Tensorflow 2.0.


In [20]:
enc_outputs, enc_state = tf.nn.dynamic_rnn(
    encoder_cell, # The encoder GRU cell
    input_emb, # Embedded input sequence
    sequence_length=input_lens, # Sequence lengths of individual inputs in a batch
    initial_state=encoder_cell.zero_state(BATCH_SIZE, dtype=tf.float32)
)

Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [21]:
# Confirm the shape of the final hidden state
enc_state.shape

TensorShape([Dimension(64), Dimension(128)])

#### Decoder

In [23]:
decoder_cell = tf.nn.rnn_cell.GRUCell(128)
decoder_cell = DropoutWrapper(decoder_cell, output_keep_prob=keep_prob)

#### Decoder to Output Vocab Projection Layer

In [24]:
output_projection = tf.layers.Dense(len(hi_lang.word2id))

#### Decoder Training Helper

In [25]:
helper = seq2seq.TrainingHelper(target_emb, target_lens)
decoder = seq2seq.BasicDecoder(decoder_cell, helper, enc_state, output_projection)
outputs, _, outputs_lens = seq2seq.dynamic_decode(decoder, maximum_iterations=MAX_SEQ_LEN, 
                                                  impute_finished=False, swap_memory=True)
output_max_len = tf.reduce_max(outputs_lens)

#### And Decoder Inference Helper

In [26]:
# Reusing decoder_cell (and its weights); dropout is disabled at inference by feeding keep_prob = 1.0.
infer_helper = seq2seq.GreedyEmbeddingHelper(hi_word_emb_matrix, tf.fill([BATCH_SIZE, ], hi_lang.isos), hi_lang.ieos)
infer_decoder = seq2seq.BasicDecoder(decoder_cell, infer_helper, enc_state, output_projection)
infer_output = seq2seq.dynamic_decode(infer_decoder, maximum_iterations=MAX_SEQ_LEN, swap_memory=True)
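
Conceptually, greedy decoding starts from SOS, picks the highest-scoring character at each step, and feeds it back in until EOS appears or the step limit is reached. The sketch below shows that loop in plain Python. `toy_step` is a hypothetical stand-in for the decoder cell plus projection layer, and unlike the real decoder output this simplified loop drops the EOS symbol itself.

```python
SOS, EOS = 1, 2

def greedy_decode(step_fn, state, max_steps):
    """Feed back the argmax token at each step until EOS or max_steps."""
    token, output = SOS, []
    for _ in range(max_steps):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == EOS:
            break
        output.append(token)
    return output

def toy_step(token, state):
    """Hypothetical decoder step: emits ids 4, 5, then EOS, ignoring its input."""
    script = [4, 5, EOS]
    scores = [0.0] * 6
    scores[script[state]] = 1.0
    return scores, min(state + 1, len(script) - 1)

print(greedy_decode(toy_step, 0, max_steps=20))  # [4, 5]
```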

#### Loss and Optimizers

In [27]:
# Sequence mask:
# Ensures we don't back-propagate error from padded positions beyond each target's length
masks = tf.sequence_mask(target_lens, output_max_len, dtype=tf.float32, name='masks')

# Loss function - weighted softmax cross entropy
cost = seq2seq.sequence_loss(
    outputs[0],
    target_ids[:, 1:(output_max_len + 1)],
    masks)

# Optimizer
optimizer = tf.train.AdamOptimizer(0.0001)
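
The effect of the mask can be sketched without TensorFlow. The version below assumes the loss is averaged over all unmasked positions in the batch, which is my reading of the default `sequence_loss` settings. The probabilities are made up, and `masked_mean_loss` is an illustrative helper rather than a library function.

```python
import math

def sequence_mask(lengths, max_len):
    """1.0 for real positions, 0.0 for padding, one row per example."""
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]

def masked_mean_loss(step_probs, lengths):
    """Average -log p(correct char) over unmasked positions only.

    step_probs[b][t] is the probability the model assigned to the correct
    character at step t of example b.
    """
    max_len = len(step_probs[0])
    mask = sequence_mask(lengths, max_len)
    total = sum(-math.log(p) * m
                for probs, row in zip(step_probs, mask)
                for p, m in zip(probs, row))
    return total / sum(sum(row) for row in mask)

# The 0.01 sits in a padded position of the first example and is ignored.
probs = [[0.5, 0.5, 0.01], [0.5, 0.5, 0.5]]
print(masked_mean_loss(probs, lengths=[2, 3]))
```

Without the mask, the poor prediction at the padded position would dominate the average even though it carries no information.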

In [28]:
train_op = optimizer.minimize(cost)

In [29]:
init = tf.global_variables_initializer()

#### Tensorflow Sessions

In [30]:
sess_config = tf.ConfigProto()
sess_config.gpu_options.allow_growth = True

In [31]:
sess = tf.InteractiveSession(config=sess_config)
sess.run(init)

#### Minibatch Training + Validation
- Performance Evaluation using BLEU scores

In [32]:
random.seed(41)

In [33]:
parallel = list(zip(en_sentences, hi_sentences))

In [34]:
random.shuffle(parallel)

In [35]:
parallel[1000]

('hazaarii', 'हज़ारी')

In [36]:
train_n = int(0.95*N)
valid_n = N - train_n

In [37]:
train_pairs = parallel[:train_n].copy()
valid_pairs = parallel[train_n:]

In [38]:
def small_test():
    all_bleu = []
    smoothing = nltk.translate.bleu_score.SmoothingFunction().method7
    for m in range(0, valid_n, BATCH_SIZE):
        # print(f"Status: {m}/{N}", end='\r')
        n = m + BATCH_SIZE
        if n > valid_n:
            # print("Epoch Complete...")
            break

        input_batch = np.zeros((BATCH_SIZE, MAX_SEQ_LEN), dtype=np.int32)
        input_lens_batch = np.zeros((BATCH_SIZE,), dtype=np.int32)
        for i in range(m, n):
            b,a = en_lang.encodeSentence2(valid_pairs[i][0], MAX_SEQ_LEN)
            input_batch[i-m,:] = a
            input_lens_batch[i-m] = b

    #     target_batch = np.zeros((BATCH_SIZE, MAX_SEQ_LEN), dtype=np.int32)
    #     target_lens_batch = np.zeros((BATCH_SIZE,), dtype=np.int32)
    #     for i in range(m, n):
    #         b,a = hi_lang.encodeSentence2(valid_pairs[i][1], MAX_SEQ_LEN)
    #         target_batch[i-m,:] = a
    #         target_lens_batch[i-m] = b

        feed_dict={
            input_ids: input_batch,
            input_lens: input_lens_batch,
            #target_ids: target_batch,
            #target_lens: target_lens_batch,
            keep_prob: 1.0
        }
        pred_batch = sess.run(infer_output[0].sample_id, feed_dict=feed_dict)
        for k, pred_ in enumerate(pred_batch):
            pred_s = hi_lang.decodeSentence(list(pred_))
            ref = valid_pairs[m+k][1]
            try:
                _bx = nltk.translate.bleu_score.sentence_bleu(
                    [ref],
                    pred_s,
                    weights=[1/4]*4,
                    smoothing_function=smoothing)
            except ZeroDivisionError:
                _bx = 0
            all_bleu.append(_bx)

    print(f"\n\nBLEU Score: {np.mean(all_bleu)}\n")
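
Because `small_test` passes plain strings to `sentence_bleu`, NLTK treats each string as a sequence of characters, so the score is a character-level BLEU. The sketch below computes only the clipped n-gram precision, one ingredient of BLEU, to show what is being compared. It is a simplification and not a replacement for the NLTK function, and the reference spelling used here is an assumed example.

```python
from collections import Counter

def char_ngram_precision(reference, prediction, n):
    """Clipped n-gram precision over characters (one ingredient of BLEU)."""
    ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
    pred = Counter(prediction[i:i + n] for i in range(len(prediction) - n + 1))
    total = sum(pred.values())
    if total == 0:
        return 0.0  # prediction is shorter than n characters
    # Each predicted n-gram counts at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in pred.items())
    return clipped / total

# Hypothetical reference vs. a prediction that misses the conjunct.
for n in (1, 2):
    print(n, char_ngram_precision("लक्ष", "लकश", n))
```

BLEU combines such precisions for n = 1 to 4 (the `weights=[1/4]*4` argument) with a brevity penalty, so words shorter than four characters depend heavily on the smoothing function.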

In [39]:
for _e in range(20):
    # Mix things up a bit.
    random.shuffle(train_pairs)
    for m in range(0, train_n, BATCH_SIZE):
        n = m + BATCH_SIZE
        if n > train_n:
            print("\nEpoch Complete...")
            break

        input_batch = np.zeros((BATCH_SIZE, MAX_SEQ_LEN), dtype=np.int32)
        input_lens_batch = np.zeros((BATCH_SIZE,), dtype=np.int32)
        for i in range(m, n):
            b,a = en_lang.encodeSentence2(train_pairs[i][0], MAX_SEQ_LEN)
            input_batch[i-m,:] = a
            input_lens_batch[i-m] = b

        target_batch = np.zeros((BATCH_SIZE, MAX_SEQ_LEN), dtype=np.int32)
        target_lens_batch = np.zeros((BATCH_SIZE,), dtype=np.int32)
        for i in range(m, n):
            b,a = hi_lang.encodeSentence2(train_pairs[i][1], MAX_SEQ_LEN)
            target_batch[i-m,:] = a
            target_lens_batch[i-m] = b

        feed_dict={
            input_ids: input_batch,
            input_lens: input_lens_batch,
            ph_target_ids: target_batch,
            target_lens: target_lens_batch,
            keep_prob: 0.8 
        }
        # Run the update and fetch the loss in a single pass over the batch
        _, batch_loss = sess.run([train_op, cost], feed_dict=feed_dict)
        print(f"Epoch: {_e} >> Status: {n}/{train_n} >> Loss: {batch_loss}", end="\r")
        if (1 + n//BATCH_SIZE) % 100 == 0:
            small_test()

Epoch: 0 >> Status: 6336/29281 >> Loss: 3.5062947273254395

BLEU Score: 0.029901360008871503

Epoch: 0 >> Status: 12736/29281 >> Loss: 3.1823196411132812

BLEU Score: 0.07751817844441643

Epoch: 0 >> Status: 19136/29281 >> Loss: 2.9157474040985107

BLEU Score: 0.09985992977826874

Epoch: 0 >> Status: 25536/29281 >> Loss: 2.7789411544799805

BLEU Score: 0.09933714196500155

Epoch: 0 >> Status: 29248/29281 >> Loss: 2.7930457592010555
Epoch Complete...
Epoch: 1 >> Status: 6336/29281 >> Loss: 2.7485635280609134

BLEU Score: 0.17597957174813808

Epoch: 1 >> Status: 12736/29281 >> Loss: 2.6061580181121826

BLEU Score: 0.19867908983711735

Epoch: 1 >> Status: 19136/29281 >> Loss: 2.4363019466400146

BLEU Score: 0.22377641272294765

Epoch: 1 >> Status: 25536/29281 >> Loss: 2.2718882560729984

BLEU Score: 0.2377829636667604

Epoch: 1 >> Status: 29248/29281 >> Loss: 2.3076264858245857
Epoch Complete...
Epoch: 2 >> Status: 6336/29281 >> Loss: 2.2798261642456055

BLEU Score: 0.2594323559726662

Ep

Epoch: 18 >> Status: 6336/29281 >> Loss: 0.6174688339233398

BLEU Score: 0.5901467749447065

Epoch: 18 >> Status: 12736/29281 >> Loss: 0.5997928380966187

BLEU Score: 0.6029722149401584

Epoch: 18 >> Status: 19136/29281 >> Loss: 0.6832786202430725

BLEU Score: 0.5994256454117576

Epoch: 18 >> Status: 25536/29281 >> Loss: 0.5212688446044922

BLEU Score: 0.6030822408168152

Epoch: 18 >> Status: 29248/29281 >> Loss: 0.65234178304672247
Epoch Complete...
Epoch: 19 >> Status: 6336/29281 >> Loss: 0.5704870820045471

BLEU Score: 0.6047980393493926

Epoch: 19 >> Status: 12736/29281 >> Loss: 0.6026765108108526

BLEU Score: 0.6111215935007538

Epoch: 19 >> Status: 19136/29281 >> Loss: 0.65075635910034187

BLEU Score: 0.6098337010866531

Epoch: 19 >> Status: 25536/29281 >> Loss: 0.5079979300498962

BLEU Score: 0.61390327612183

Epoch: 19 >> Status: 29248/29281 >> Loss: 0.63342958688735965
Epoch Complete...
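
The status counter in the log stops at 29248 rather than 29281 because the loop skips the final partial batch, and the BLEU evaluations land at fixed positions within each epoch. The arithmetic can be checked with the constants from the tutorial:

```python
N = 30823
BATCH_SIZE = 64

train_n = int(0.95 * N)
valid_n = N - train_n

full_batches = train_n // BATCH_SIZE     # the partial final batch is skipped
last_status = full_batches * BATCH_SIZE  # final "Status" value in an epoch
dropped = train_n - last_status          # training pairs unused in an epoch

# small_test() runs when (1 + n // BATCH_SIZE) % 100 == 0, with n = k * BATCH_SIZE.
eval_points = [k * BATCH_SIZE for k in range(1, full_batches + 1)
               if (1 + k) % 100 == 0]

print(train_n, valid_n, full_batches, last_status, dropped)
print(eval_points)  # [6336, 12736, 19136, 25536]
```

Since `train_pairs` is reshuffled at the start of every epoch, a different set of leftover pairs is skipped each time.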


### Let's see some real transliteration examples now!

In [44]:
def transliterate(s):
    input_batch = np.zeros((BATCH_SIZE, MAX_SEQ_LEN), dtype=np.int32)
    input_lens_batch = np.zeros((BATCH_SIZE,), dtype=np.int32)
    b,a = en_lang.encodeSentence2(s, MAX_SEQ_LEN)
    input_batch[0, :] = a
    input_lens_batch[0] = b
    
    feed_dict={
        input_ids: input_batch,
        input_lens: input_lens_batch,
        #target_ids: target_batch,
        #target_lens: target_lens_batch,
        keep_prob: 1.0
    }
    pred_batch = sess.run(infer_output[0].sample_id, feed_dict=feed_dict)
    pred_ = pred_batch[0]
    pred_s = hi_lang.decodeSentence(list(pred_))
    # ref = valid_pairs[m+k][1]
    return pred_s
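
`transliterate` allocates a full batch for a single word because the graph was built with a fixed `BATCH_SIZE` (in the encoder's zero state and the SOS fill). Unused rows stay all-padding with length 0. A plain-Python sketch of that packing step, with small stand-in sizes and a hypothetical helper name `pack_into_batch`:

```python
BATCH_SIZE, MAX_SEQ_LEN = 4, 5  # small stand-ins for 64 and 20

def pack_into_batch(rows):
    """rows: list of (length, padded_ids) pairs, at most BATCH_SIZE of them."""
    if len(rows) > BATCH_SIZE:
        raise ValueError("too many rows for one batch")
    ids = [[0] * MAX_SEQ_LEN for _ in range(BATCH_SIZE)]  # all PAD
    lens = [0] * BATCH_SIZE
    for i, (length, padded) in enumerate(rows):
        ids[i] = list(padded)
        lens[i] = length
    return ids, lens

# One encoded word (hypothetical ids) occupying row 0 of the batch.
ids, lens = pack_into_batch([(3, [5, 4, 2, 0, 0])])
print(lens)
print(ids[0])
```

The same idea would let several words be transliterated in one `sess.run` call by filling more than one row.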

In [45]:
transliterate("saya")

'साया'

In [46]:
transliterate("ram")

'राम'

In [47]:
transliterate("laksh")

'लकश'