Neural Machine Translation:
===
The aim of this notebook is to illustrate how to create and train a neural machine translation pipeline. Machine translation is the task of automatically converting source text in one language into text in another language. In this notebook, we will build an end-to-end pipeline to translate sentences from English to Hindi.

The notebook is divided into two parts. In Part I, I walk through the step-by-step process of building an end-to-end pipeline from scratch to translate sentences from English to Hindi. In Part II, instead of building a pipeline from scratch, I show how to use an AutoML framework to create a neural machine translation system.





Part I:
---
The step-by-step process of building an end-to-end pipeline from scratch is divided into the following parts:

  * Pre-process our data (Data Preprocessing)
    * Convert to lowercase
    * Remove all the special characters
    * Remove all the numbers from text
  * Tokenization
    * Sentence tokenization
    * Remove extra spaces (leading and trailing)
    * START and END tokens to sentences
  * Model Architecture
    * Choose a sequence model
    * Architecture and layers explanation
    * Attention and explanation
  * Model training and Hyperparameters optimization
    * Epochs, Batch Size, Layers
    * Sample sentence prediction
  * Evaluation metric: Calculate the BLEU score for the translation.

**Note: if you are running this notebook in Google Colaboratory, you have to download the dataset**

### Data Download:

In [0]:
!wget --header="Host: doc-08-a0-docs.googleusercontent.com" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-US,en;q=0.9" --header="Referer: https://mail.google.com/mail/u/0/" --header="Cookie: AUTH_h2usbqe61t3p2javmc9f8vev3k7aj2re=09663297046543970124|1590855225000|vnbtokaapd9nf3iroaoqmb4dpinng0fa; NID=202=LrP4uDlw-tpJIrOsSSVGjEUAYFBvMgE7pYsHllDi2RvAiohnhILxLsaostbt-xc33InPBNV05z_jAtAw4Km239HIJCcIFMoWS8Yh1FmMW-xpd8qowQuNbmhfho6LYG4h7_JheMnumByEDYwFL3O_tCzX30sFKeeVfNE_XxYUTwM" --header="Connection: keep-alive" "https://doc-08-a0-docs.googleusercontent.com/docs/securesc/035r11cge54f61vkg6q5351jddbf1r9g/i9m3eas59qd0vc9tku6hdm61d6k9a150/1590855225000/09367493494736841528/09663297046543970124/15yNTXYb65oz43nn9PIu6kyp792rr_yZ5?e=download&authuser=0" -c -O 'Hi-En-Parallel_Corpus.xlsx'

--2020-05-30 16:16:07--  https://doc-08-a0-docs.googleusercontent.com/docs/securesc/035r11cge54f61vkg6q5351jddbf1r9g/i9m3eas59qd0vc9tku6hdm61d6k9a150/1590855225000/09367493494736841528/09663297046543970124/15yNTXYb65oz43nn9PIu6kyp792rr_yZ5?e=download&authuser=0
Resolving doc-08-a0-docs.googleusercontent.com (doc-08-a0-docs.googleusercontent.com)... 64.233.189.132, 2404:6800:4008:c07::84
Connecting to doc-08-a0-docs.googleusercontent.com (doc-08-a0-docs.googleusercontent.com)|64.233.189.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘Hi-En-Parallel_Corpus.xlsx’

          Hi-En-Par     [<=>                 ]       0  --.-KB/s               Hi-En-Parallel_Corp     [ <=>                ]  16.83M   107MB/s    in 0.2s    

2020-05-30 16:16:08 (107 MB/s) - ‘Hi-En-Parallel_Corpus.xlsx’ saved [17653110]



In [0]:
# check if file is downloaded successfully
!ls

Hi-En-Parallel_Corpus.xlsx  sample_data


### Read the file:

In [0]:
# import libraries for processing
import pandas as pd
import numpy as np
import string
from string import digits
import matplotlib.pyplot as plt
%matplotlib inline
import re
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from keras.layers import Input, LSTM, Embedding, Dense
from keras.models import Model

In [0]:
lines = pd.read_excel('Hi-En-Parallel_Corpus.xlsx')
# a look to data
lines.head()

### Data Preprocessing:

As we can see, our text contains both uppercase and lowercase words (e.g. "The", "the"), **numbers** (e.g. 0.001, 200), **punctuation** (e.g. ",", ".", ";", "--"), etc.

These observations suggest that before we start building the model, we need to perform data cleaning and preparation.

First, we will **lowercase** the entire text so that we don't have to worry about the **case sensitivity** of the text.

After that, we will **remove special characters** (e.g. "--", "."), since they add little meaning to the text for our purposes. In the code below, punctuation is filtered out with Python's built-in `string.punctuation` set. A RegEx, or **regular expression, is a sequence of characters that defines a search pattern**; we use Python's built-in `re` module to keep only the useful text, for example when removing Devanagari digits.

Similarly, we will remove numeric characters from our text.
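To see the combined effect of these steps on a single string, here is a minimal sketch. The helper name `clean_text`, the sample strings, and the final whitespace-collapsing step are assumptions made for illustration; the notebook applies the steps column by column in the cells that follow. Note that `string.punctuation` only covers ASCII punctuation, so a Devanagari danda ("।") is left untouched.

```python
import re
import string

def clean_text(text):
    """Lowercase, drop ASCII punctuation and digits (hypothetical helper)."""
    text = text.lower()
    # string.punctuation contains ASCII punctuation only
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove both ASCII digits and Devanagari digits
    text = re.sub(r"[0-9०-९]", "", text)
    # Collapse the gaps left behind by the removed characters
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("The price rose by 200 -- or 0.001%, they said."))
# the price rose by or they said
```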

In [0]:
# before preprocessing we make sure our data is in string format 
lines.english_sentence=lines.english_sentence.astype(str)
lines.hindi_sentence=lines.hindi_sentence.astype(str)
# a look to data to observe changes
lines.head()

In [0]:
# Lowercase all characters
lines.english_sentence=lines.english_sentence.apply(lambda x: x.lower())
lines.hindi_sentence=lines.hindi_sentence.apply(lambda x: x.lower())
# a look to data to observe changes
lines.head()

In [0]:
exclude = set(string.punctuation) # Set of all special characters
# Remove all the special characters
lines.english_sentence=lines.english_sentence.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
lines.hindi_sentence=lines.hindi_sentence.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
# a look to data to observe changes
lines.tail()

In [0]:
# Remove all numbers from text
remove_digits = str.maketrans('', '', digits)
lines.english_sentence=lines.english_sentence.apply(lambda x: x.translate(remove_digits))
lines.hindi_sentence = lines.hindi_sentence.apply(lambda x: re.sub("[२३०८१५७९४६]", "", x))
# a look to data to observe changes
lines.head()

### Tokenization:

Once we have pre-processed our text, we can focus on **splitting the text into tokens, such as a list of sentences or a list of words**. A token is a part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph. The former is known as word tokenization and the latter as sentence tokenization. We will now tokenize the text.
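As a small illustration of the two granularities, the sketch below splits a made-up paragraph into sentences and then into words using only the standard library. The sample text and the naive split-on-period rule are assumptions for illustration; the notebook itself only relies on whitespace splitting of each sentence.

```python
import re

# Hypothetical sample paragraph, not taken from the dataset
paragraph = "the weather is nice. we will go outside. bring water"

# Sentence tokenization: naive split on a period followed by whitespace
sentences = [s.strip() for s in re.split(r"\.\s+", paragraph) if s.strip()]

# Word tokenization: split each sentence on whitespace
words = [sentence.split() for sentence in sentences]

print(sentences)  # ['the weather is nice', 'we will go outside', 'bring water']
print(words[0])   # ['the', 'weather', 'is', 'nice']
```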

In [0]:
# Strip leading and trailing whitespace from each sentence
lines.english_sentence=lines.english_sentence.apply(lambda x: x.strip())
lines.hindi_sentence=lines.hindi_sentence.apply(lambda x: x.strip())
# a look to data to observe changes
lines.tail()

In [0]:
# Remove extra spaces
lines.english_sentence=lines.english_sentence.apply(lambda x: x.lstrip())
lines.hindi_sentence=lines.hindi_sentence.apply(lambda x: x.lstrip())
lines.english_sentence=lines.english_sentence.apply(lambda x: x.rstrip())
lines.hindi_sentence=lines.hindi_sentence.apply(lambda x: x.rstrip())
# a look to data to observe changes
lines.tail()

In [0]:
# Add start and end tokens to target sequences
lines.hindi_sentence = lines.hindi_sentence.apply(lambda x : 'START_ '+ x + ' _END')

In [0]:
lines = shuffle(lines)
lines.sample(10)

The dataset consists of observations with a maximum sentence length of . Therefore, we will run the experiment on a small subset of the dataset with only 33725 observations.

In [0]:
# Randomly sample 33725 observations
lines = lines.sample(n=33725)
lines.shape

We will define the problem so that all input sequences share one fixed length and all output sequences share one fixed length, padding the sequences with "0" values as needed.
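The padding idea can be sketched in a few lines of plain Python. The helper `pad_with_zeros` and the token ids are made up for illustration; the notebook achieves the same effect later by writing token ids into zero-initialised NumPy arrays.

```python
def pad_with_zeros(sequence, length):
    """Right-pad a list of token ids with 0 up to `length` (hypothetical helper)."""
    if len(sequence) > length:
        raise ValueError("sequence is longer than the target length")
    return sequence + [0] * (length - len(sequence))

# Made-up token id sequences of different lengths
batch = [[4, 9, 2], [7], [3, 5]]
max_len = max(len(seq) for seq in batch)
padded = [pad_with_zeros(seq, max_len) for seq in batch]
print(padded)  # [[4, 9, 2], [7, 0, 0], [3, 5, 0]]
```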

In [0]:
# Vocabulary of English
all_eng_words=set()
for eng in lines.english_sentence:
    for word in eng.split():
        if word not in all_eng_words:
            all_eng_words.add(word)

# Vocabulary of Hindi
all_hindi_words=set()
for mar in lines.hindi_sentence:
    for word in mar.split():
        if word not in all_hindi_words:
            all_hindi_words.add(word)

We then identify the length of the longest sentence by iterating through all the available input sequences.

In [0]:
# Max Length of source sequence
lenght_list=[]
for l in lines.english_sentence:
    lenght_list.append(len(l.split(' ')))
max_length_src = np.max(lenght_list)
max_length_src

In [0]:
# Max Length of target sequence
lenght_list=[]
for l in lines.hindi_sentence:
    lenght_list.append(len(l.split(' ')))
max_length_tar = np.max(lenght_list)
max_length_tar

In [0]:
input_words = sorted(list(all_eng_words))
target_words = sorted(list(all_hindi_words))
num_encoder_tokens = len(all_eng_words)
num_decoder_tokens = len(all_hindi_words)
num_encoder_tokens, num_decoder_tokens

In [0]:
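# Index 0 is reserved for zero padding and word indices start at 1,
# so both vocabularies need one extra slot in their embedding layers
num_encoder_tokens += 1
num_decoder_tokens += 1
num_encoder_tokens, num_decoder_tokens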

Next, we map every word in each vocabulary to an integer index and build the reverse mappings from index back to word. Indices start at 1 because 0 is reserved for padding.

In [0]:
input_token_index = dict([(word, i+1) for i, word in enumerate(input_words)])
target_token_index = dict([(word, i+1) for i, word in enumerate(target_words)])

In [0]:
reverse_input_char_index = dict((i, word) for word, i in input_token_index.items())
reverse_target_char_index = dict((i, word) for word, i in target_token_index.items())
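To make the two mappings concrete, here is a toy round trip from words to ids and back. The three-word vocabulary and the names `word_to_id` and `id_to_word` are assumptions for illustration; they mirror `input_token_index` and `reverse_input_char_index` above.

```python
# Toy vocabulary; the real one is built from the dataset
toy_words = sorted({"the", "cat", "sat"})
# Index 0 is left free for padding, so ids start at 1
word_to_id = {word: i + 1 for i, word in enumerate(toy_words)}
id_to_word = {i: word for word, i in word_to_id.items()}

encoded = [word_to_id[w] for w in "the cat sat".split()]
decoded = " ".join(id_to_word[i] for i in encoded)
print(encoded)  # [3, 1, 2]
print(decoded)  # the cat sat
```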

In [0]:
# Train - Test Split
X, y = lines.english_sentence, lines.hindi_sentence
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1)
X_train.shape, X_test.shape

#### Save the train and test dataframes for reproducing the results later, as they are shuffled.

In [0]:
X_train.to_pickle('X_train.pkl')
X_test.to_pickle('X_test.pkl')

In [0]:
def generate_batch(X = X_train, y = y_train, batch_size = 128):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X), batch_size):
            encoder_input_data = np.zeros((batch_size, max_length_src),dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar),dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens),dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word] # encoder input seq
                for t, word in enumerate(target_text.split()):
                    if t<len(target_text.split())-1:
                        decoder_input_data[i, t] = target_token_index[word] # decoder input seq
                    if t>0:
                        # decoder target sequence (one hot encoded)
                        # does not include the START_ token
                        # Offset by one timestep
                        decoder_target_data[i, t - 1, target_token_index[word]] = 1.
            yield([encoder_input_data, decoder_input_data], decoder_target_data)
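The one-step offset between the decoder input and the decoder target is easier to see on a single toy sentence. This is only a sketch: the four-token vocabulary and the romanised words are made up, and the loop body is copied from the generator above.

```python
import numpy as np

# Hypothetical target vocabulary; 0 is reserved for padding
token_index = {"START_": 1, "_END": 2, "namaste": 3, "duniya": 4}
num_tokens = len(token_index) + 1
max_len = 5

tokens = "START_ namaste duniya _END".split()

decoder_input = np.zeros((max_len,), dtype="float32")
decoder_target = np.zeros((max_len, num_tokens), dtype="float32")
for t, word in enumerate(tokens):
    if t < len(tokens) - 1:
        decoder_input[t] = token_index[word]            # everything except _END
    if t > 0:
        decoder_target[t - 1, token_index[word]] = 1.0  # shifted left by one step

print(decoder_input)                  # [1. 3. 4. 0. 0.]
print(decoder_target.argmax(axis=1))  # [3 4 2 0 0]
```

At every time step the decoder sees the current word and is trained to predict the next one; the all-zero rows at the end are padding.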

### Model Architecture:

In this part, we define a neural network for machine translation. Neural machine translation uses a sequence-to-sequence (seq-2-seq) model, which we can think of as having two key parts: the encoder and the decoder. We can develop a simple encoder-decoder model in Keras by running the source sentence through an encoder LSTM, keeping its final hidden and cell states, and using them as the initial state of a decoder LSTM that predicts the output sequence one token at a time.

In [0]:
latent_dim = 50 # size of the word embeddings and of the LSTM hidden state

In [0]:
# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb =  Embedding(num_encoder_tokens, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

### Embedding layer:

A word-embedding layer is used to represent the words fed to the encoder. This is a distributed representation in which each word is mapped to a fixed-length vector of continuous values. The advantage of using this layer is that different words with similar meanings can end up with similar representations.

This distributed representation is learned while fitting the model on the training data. The embedding size defines the length of the vectors used to represent words; in this case it is `latent_dim` (50).
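Conceptually, an embedding layer is a lookup table with one row per token id. The NumPy sketch below illustrates only the lookup; the sizes are toy values and the matrix is random, whereas in the model the rows are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 4   # toy sizes; the notebook uses latent_dim = 50
# One row per token id; a trained model would hold learned values here
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

token_ids = np.array([3, 1, 0, 0])     # a padded sequence of token ids
vectors = embedding_matrix[token_ids]  # the lookup is plain row indexing
print(vectors.shape)  # (4, 4)
```

Both padded positions map to the same row (id 0), which is why the Keras layers in this notebook are created with `mask_zero = True` so that those positions are skipped.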

In [0]:
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(num_decoder_tokens, latent_dim, mask_zero = True)
dec_emb = dec_emb_layer(decoder_inputs)
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

### Recurrent neural network:

There are three types of recurrent neural network cells that are commonly used:
* Simple RNN
* Long Short-Term Memory or LSTM
* Gated Recurrent Unit or GRU

The LSTM was developed to address the vanishing gradient problem of the simple RNN, which limited the training of deep RNNs. We will use LSTM cells to build the seq-2-seq model for neural machine translation.
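For intuition, a single LSTM time step can be written out in NumPy. This is a sketch of the standard LSTM equations, not the Keras implementation: the gate ordering, the weight shapes, and the random inputs are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. Gates are stacked as [input, forget, candidate, output]."""
    z = W @ x + U @ h_prev + b
    i, f, g, o = np.split(z, 4)
    # The cell state is updated additively, which helps gradients survive long sequences
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 3, 2   # toy sizes, unrelated to the notebook's latent_dim
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x in rng.normal(size=(5, input_dim)):   # a made-up sequence of 5 input vectors
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (2,) (2,)
```

The final `h` and `c` play the role of `state_h` and `state_c`, the two encoder states that are handed to the decoder.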

In [0]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

We will train the model with **categorical cross-entropy loss** using the **RMSprop adaptive optimization algorithm** for **4 epochs**, since one epoch takes **4 hours on a Tesla P100 GPU**. We use a **batch size of 4** so that each batch fits in GPU memory.
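To show what the loss measures, here is categorical cross-entropy for a single time step written in plain Python. The helper name and the probability values are made up for illustration; Keras computes the same quantity over every time step and sample in a batch.

```python
import math

def categorical_cross_entropy(one_hot_target, predicted_probs):
    """Cross-entropy for a single time step (hypothetical helper)."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, predicted_probs)
        if t > 0
    )

target = [0.0, 1.0, 0.0]          # the correct word is at index 1
confident = [0.05, 0.90, 0.05]    # most probability mass on the correct word
unsure = [0.40, 0.30, 0.30]       # probability mass spread over the vocabulary

print(categorical_cross_entropy(target, confident))
print(categorical_cross_entropy(target, unsure))
```

The loss is small when the model assigns high probability to the correct word and grows as that probability shrinks.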

In [0]:
train_samples = len(X_train)
val_samples = len(X_test)
batch_size = 4
epochs = 4

In [0]:
model.fit_generator(generator = generate_batch(X_train, y_train, batch_size = batch_size),
                    steps_per_epoch = train_samples//batch_size,
                    epochs=epochs,
                    validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
                    validation_steps =
                    val_samples//batch_size)  

### Save the weights
  

In [0]:
model.save_weights('nmt_weights.h5')

### Load the weights, if you close the application

In [0]:
model.load_weights('nmt_weights.h5')

### Inference Setup

In [0]:
# Encode the input sequence to get the "thought vectors"
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder setup
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2= dec_emb_layer(decoder_inputs) # Get the embeddings of the decoder sequence

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2) # A dense softmax layer to generate prob dist. over the target vocabulary

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

### Decode sample sequences

In [0]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    # Populate the first token of the target sequence with the start token.
    target_seq[0, 0] = target_token_index['START_']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' '+sampled_char

        # Exit condition: either hit the length limit
        # (measured in characters of the decoded string) or find the stop token.
        if (sampled_char == '_END' or
           len(decoded_sentence) > 50):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence
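The loop in `decode_sequence` is a greedy decoder: at each step it keeps only the most probable next word. The sketch below isolates that idea with a stand-in for the decoder model. The probability table, the romanised tokens, and the helper names are all made up, and unlike the function above this sketch caps the output length in tokens rather than characters.

```python
def greedy_decode(step_fn, start_token, end_token, max_tokens=10):
    """Repeatedly pick the most probable next token (hypothetical helper)."""
    tokens, current, state = [], start_token, None
    while len(tokens) < max_tokens:
        probs, state = step_fn(current, state)
        current = max(probs, key=probs.get)  # argmax over the vocabulary
        if current == end_token:
            break
        tokens.append(current)
    return tokens

# Stand-in for the decoder model: a fixed table of next-token probabilities
TABLE = {
    "START_": {"yah": 0.7, "ek": 0.2, "_END": 0.1},
    "yah": {"ek": 0.6, "hai": 0.3, "_END": 0.1},
    "ek": {"hai": 0.8, "_END": 0.2},
    "hai": {"_END": 0.9, "ek": 0.1},
}

def toy_step(token, state):
    # A real decoder would also update the LSTM state here
    return TABLE[token], state

print(greedy_decode(toy_step, "START_", "_END"))  # ['yah', 'ek', 'hai']
```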

### Sample Sentence Prediction

In [0]:
train_gen = generate_batch(X_train, y_train, batch_size = 1)
k=-1

### Evaluating Prediction

In [0]:
from nltk.translate.bleu_score import corpus_bleu

actual, predicted = list(), list()
for k, x in enumerate(train_gen):
  # with batch_size = 1, the k-th batch corresponds to the k-th row of X_train / y_train
  (input_seq, actual_output), _ = x
  decoded_sentence = decode_sequence(input_seq)
  reference = y_train[k:k+1].values[0][6:-4]  # drop the START_ and _END markers
  hypothesis = decoded_sentence[:-4]          # drop the trailing _END marker
  print('Input English sentence:', X_train[k:k+1].values[0])
  print('Actual Hindi Translation:', reference)
  print('Predicted Hindi Translation:', hypothesis)
  # corpus_bleu expects token lists: a list of references for each hypothesis
  actual.append([reference.split()])
  predicted.append(hypothesis.split())
  if k>=10:
      # calculate BLEU score
      print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
      print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
      print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
      print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
      break

Input English sentence: encouraged by thishitler mounted attack on sudetenland which was the western part of czechoslovakia where ethnic german population was highwas mounted attack on it
Actual Hindi Translation:  इस बात से उत्साहित होकर हिटलर ने सदतेनलैंड जो की चेकोस्लोवाकिया का पश्चिमी हिस्सा है और जहाँ जर्मन भाषा बोलने वालों की ज्यादा तादात थी वहां पर हमला बोल दिया । 
Predicted Hindi Translation:  यह एक है कि वे एक और एक एक है 
Input English sentence: there is a sequentialactivation and repression of genes through the three phases of life development reproduction and senescence
Actual Hindi Translation:  जीवन के तीनों कालोंविकास प्रजनन और जराजन्यताके दौरान सक्रिय सक्रियता और निग्रह होता है 
Predicted Hindi Translation:  लिए यह है कि वे अपने लिए यह है कि वह अपने लिए भी 
Input English sentence: he had lost his wife and two children of the three surviving ones the eldest daughter lived with her husband outside bengal the eldest son had been sent to the united states the previo
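BLEU is built from clipped (modified) n-gram precisions computed over tokens. The sketch below computes that one ingredient for a single sentence pair in plain Python; it leaves out the brevity penalty and the geometric mean over n-gram orders that the full score uses, and the helper name and example sentences are made up.

```python
from collections import Counter

def clipped_precision(reference, hypothesis, n=1):
    """Modified n-gram precision for one sentence pair (hypothetical helper)."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hyp_counts = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

reference = "the cat is on the mat".split()
hypothesis = "the the the cat".split()
print(clipped_precision(reference, hypothesis, n=1))  # 0.75
print(clipped_precision(reference, hypothesis, n=2))
```

Clipping is what stops a degenerate prediction that repeats a common word from scoring well: "the" appears three times in the hypothesis but is only credited twice.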

Part II: Using AutoML framework - Uber Ludwig
---
Ludwig is a toolbox built on top of TensorFlow that allows users to train and test deep learning models without the need to write code.

All you need to provide is a CSV file containing your data, plus a YAML file listing the columns to use as inputs and the columns to use as outputs; Ludwig will do the rest. Simple commands can be used to train models both locally and in a distributed way, and to use them to predict on new data.

### Clone the repository

In [0]:
!git clone https://github.com/uber/ludwig.git

Cloning into 'ludwig'...
remote: Enumerating objects: 166, done.
remote: Counting objects: 100% (166/166), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 9530 (delta 97), reused 93 (delta 49), pack-reused 9364
Receiving objects: 100% (9530/9530), 13.39 MiB | 7.57 MiB/s, done.
Resolving deltas: 100% (6801/6801), done.


### Change requirements.txt to train on GPU

In [0]:
%%writefile ludwig/requirements.txt
Cython>=0.25
h5py>=2.6
numpy>=1.15
pandas>=0.19
scipy>=0.18
tabulate>=0.7
scikit-learn
tqdm
tensorflow-gpu==1.15.2
PyYAML>=3.12
absl-py

Overwriting ludwig/requirements.txt


### Install the library using bash commands

In [0]:
%%bash
cd ludwig
pip install -r requirements.txt
python setup.py install

Collecting tensorflow-gpu==1.15.2
  Downloading https://files.pythonhosted.org/packages/32/ca/58e40e5077fa2a92004f398d705a288e958434f123938f4ce75ffe25b64b/tensorflow_gpu-1.15.2-cp36-cp36m-manylinux2010_x86_64.whl (411.0MB)
Collecting tensorflow-estimator==1.15.1
  Downloading https://files.pythonhosted.org/packages/de/62/2ee9cd74c9fa2fa450877847ba560b260f5d0fb70ee0595203082dafcc9d/tensorflow_estimator-1.15.1-py2.py3-none-any.whl (503kB)
Collecting tensorboard<1.16.0,>=1.15.0
  Downloading https://files.pythonhosted.org/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl (3.8MB)
Collecting gast==0.2.2
  Downloading https://files.pythonhosted.org/packages/4e/35/11749bf99b2d4e3cceb4d55ca22590b0d7c2c62b9de38ac4a4a7f4687421/gast-0.2.2.tar.gz
Building wheels for collected packages: gast
  Building wheel for gast (setup.py): started
  Building wheel for gast (setup.py): finished with status 'done'
  Created wheel for gast: filename=g

ERROR: tensorflow 2.2.0 has requirement gast==0.3.3, but you'll have gast 0.2.2 which is incompatible.
ERROR: tensorflow 2.2.0 has requirement tensorboard<2.3.0,>=2.2.0, but you'll have tensorboard 1.15.0 which is incompatible.
ERROR: tensorflow 2.2.0 has requirement tensorflow-estimator<2.3.0,>=2.2.0, but you'll have tensorflow-estimator 1.15.1 which is incompatible.
ERROR: tensorflow-probability 0.10.0 has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible.
zip_safe flag not set; analyzing archive contents...
ludwig.__pycache__.neuropod_export.cpython-36: module references __file__


### Download the dataset

In [0]:
!wget --header="Host: doc-08-a0-docs.googleusercontent.com" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-US,en;q=0.9" --header="Referer: https://mail.google.com/mail/u/0/" --header="Cookie: AUTH_h2usbqe61t3p2javmc9f8vev3k7aj2re=09663297046543970124|1590855225000|vnbtokaapd9nf3iroaoqmb4dpinng0fa; AUTH_h2usbqe61t3p2javmc9f8vev3k7aj2re_nonce=jegbp49t4g4c6; NID=202=LrP4uDlw-tpJIrOsSSVGjEUAYFBvMgE7pYsHllDi2RvAiohnhILxLsaostbt-xc33InPBNV05z_jAtAw4Km239HIJCcIFMoWS8Yh1FmMW-xpd8qowQuNbmhfho6LYG4h7_JheMnumByEDYwFL3O_tCzX30sFKeeVfNE_XxYUTwM" --header="Connection: keep-alive" "https://doc-08-a0-docs.googleusercontent.com/docs/securesc/035r11cge54f61vkg6q5351jddbf1r9g/v4neda48nf159ukdu6sibi9d1aq1f1l8/1590855525000/09367493494736841528/09663297046543970124/15yNTXYb65oz43nn9PIu6kyp792rr_yZ5?e=download&authuser=0&nonce=jegbp49t4g4c6&user=09663297046543970124&hash=r0obp839dgutqnrohet7ej0oq9dhmcbg" -c -O 'Hi-En-Parallel_Corpus.xlsx'

--2020-05-30 16:20:08--  https://doc-08-a0-docs.googleusercontent.com/docs/securesc/035r11cge54f61vkg6q5351jddbf1r9g/v4neda48nf159ukdu6sibi9d1aq1f1l8/1590855525000/09367493494736841528/09663297046543970124/15yNTXYb65oz43nn9PIu6kyp792rr_yZ5?e=download&authuser=0&nonce=jegbp49t4g4c6&user=09663297046543970124&hash=r0obp839dgutqnrohet7ej0oq9dhmcbg
Resolving doc-08-a0-docs.googleusercontent.com (doc-08-a0-docs.googleusercontent.com)... 64.233.189.132, 2404:6800:4008:c07::84
Connecting to doc-08-a0-docs.googleusercontent.com (doc-08-a0-docs.googleusercontent.com)|64.233.189.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘Hi-En-Parallel_Corpus.xlsx’

          Hi-En-Par     [<=>                 ]       0  --.-KB/s               Hi-En-Parallel_Corp     [ <=>                ]  16.83M   101MB/s    in 0.2s    

2020-05-30 16:20:08 (101 MB/s) - ‘Hi-En-Parallel_Corpus.xlsx’ sa

In [0]:
!ls

Hi-En-Parallel_Corpus.xlsx  ludwig  sample_data


### Pre-process the dataset into csv file

In [0]:
# import libraries for processing
import pandas as pd
import numpy as np
import string
from string import digits
import matplotlib.pyplot as plt
%matplotlib inline
import re
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from keras.layers import Input, LSTM, Embedding, Dense
from keras.models import Model

Using TensorFlow backend.


In [0]:
lines = pd.read_excel('Hi-En-Parallel_Corpus.xlsx')
lines.head()

Unnamed: 0,english_sentence,hindi_sentence
0,politicians do not have permission to do what ...,"राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह कर..."
1,"I'd like to tell you about one such child,",मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहू...
2,This percentage is even greater than the perce...,यह प्रतिशत भारत में हिन्दुओं प्रतिशत से अधिक है।
3,what we really mean is that they're bad at not...,हम ये नहीं कहना चाहते कि वो ध्यान नहीं दे पाते
4,.The ending portion of these Vedas is called U...,इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।


In [0]:
lines.english_sentence=lines.english_sentence.astype(str)
lines.hindi_sentence=lines.hindi_sentence.astype(str)
lines.head()

Unnamed: 0,english_sentence,hindi_sentence
0,politicians do not have permission to do what ...,"राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह कर..."
1,"I'd like to tell you about one such child,",मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहू...
2,This percentage is even greater than the perce...,यह प्रतिशत भारत में हिन्दुओं प्रतिशत से अधिक है।
3,what we really mean is that they're bad at not...,हम ये नहीं कहना चाहते कि वो ध्यान नहीं दे पाते
4,.The ending portion of these Vedas is called U...,इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।


In [0]:
lines.to_csv('translation.csv', index=None)

In [0]:
!head translation.csv

english_sentence,hindi_sentence
politicians do not have permission to do what needs to be done.,"राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है ."
"I'd like to tell you about one such child,","मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी,"
This percentage is even greater than the percentage in India.,यह प्रतिशत भारत में हिन्दुओं प्रतिशत से अधिक है।
what we really mean is that they're bad at not paying attention.,हम ये नहीं कहना चाहते कि वो ध्यान नहीं दे पाते
.The ending portion of these Vedas is called Upanishad.,इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।
"The then Governor of Kashmir resisted transfer , but was finally reduced to subjection with the aid of British .","कश्मीर के तत्कालीन गवर्नर ने इस हस्तांतरण का विरोध किया था , लेकिन अंग्रेजों की सहायता से उनकी आवाज दबा दी गयी ."
In this lies the circumstances of people before you.,इसमें तुमसे पूर्व गुज़रे हुए लोगों के हालात हैं।
"And who are we to say, even, that they are wrong",और हम होते कौन हैं यह कहने भी

### Create model_definition_file.yaml specifying column names and a model architecture with an attention mechanism

In [0]:
%%writefile model_definition_file.yaml
input_features:
    -
        name: english_sentence
        type: text
        level: word
        encoder: rnn
        cell_type: lstm
        reduce_output: null
        preprocessing:
          word_format: english_tokenize

output_features:
    -
        name: hindi_sentence
        type: text
        level: word
        decoder: generator
        cell_type: lstm
        attention: bahdanau
        loss:
            type: sampled_softmax_cross_entropy
        preprocessing:
          word_format: hindi_tokenize

training:
    batch_size: 96
    epochs: 8

Writing model_definition_file.yaml


### Check GPU is available

In [0]:
!nvidia-smi

Sat May 30 16:30:26 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

### Train the model

In [0]:
!ludwig experiment \
  --data_csv translation.csv \
  --model_definition_file model_definition_file.yaml

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

███████████████████████
█ █ █ █  ▜█ █ █ █ █   █
█ █ █ █ █ █ █ █ █ █ ███
█ █   █ █ █ █ █ █ █ ▌ █
█ █████ █ █ █ █ █ █ █ █
█     █  ▟█     █ █   █
███████████████████████
ludwig v0.2.2.7 - Experiment

Experiment name: experiment
Model name: run
Output path: results/experiment_run


ludwig_version: '0.2.2.7'
command: ('/usr/local/bin/ludwig experiment --data_csv translation.csv '
 '--model_definition_file model_definition_file.yaml')
random_seed: 42
input_data: 'translation.csv'
model_definition: {   'combiner': {'type': 'concat'},
    'input_features': [   {   'cell_type': 'lstm',
                              'encoder': 'rnn',
   

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Streaming output truncated to the last 5000 lines.
    'markedness': 0.0,
    'matthews_correlation_coefficient': 0,
    'miss_rate': 1.0,
    'negative_predictive_value': 1.0,
    'positive_predictive_value': 0,
    'precision': 0,
    'recall': 0,
    'sensitivity': 0,
    'specificity': 1.0,
    'true_negative_rate': 1.0,
    'true_negatives': 25449,
    'true_positive_rate': 0,
    'true_positives': 0},
  nauksaana: {   'accuracy': 1.0,
    'f1_score': 0,
    'fall_out': 0.0,
    'false_discovery_rate': 1.0,
    'false_negative_rate': 1.0,
    'false_negatives': 0,
    'false_omission_rate': 0.0,
    'false_positive_rate': 0.0,
    'false_positives': 0,
    'hit_rate': 0,
    'informedness': 0.0,
    'markedness': 0.0,
    'matthews_correlation_coefficient': 0,
    'miss_rate': 1.0,
    'negative_predictive_value': 1.0,
    'positive_predictive_value': 0,
    'precision': 0,
    'recall': 0,
    'sensitivity': 0,
    'specificity': 1.0,
    'true_negative_rate': 1.0,





perplexity: 3.329976799811374
token_accuracy: 0.06756778391330276

Finished: experiment_run
Saved to: results/experiment_run


In [0]:
!ls

Hi-En-Parallel_Corpus.xlsx  results	     translation.hdf5
ludwig			    sample_data      translation.json
model_definition_file.yaml  translation.csv


### Predicting Sample Sentence, specify model path and csv file

In [0]:
!ludwig predict --data_csv translation.csv --model_path results/experiment_run/model

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

███████████████████████
█ █ █ █  ▜█ █ █ █ █   █
█ █ █ █ █ █ █ █ █ █ ███
█ █   █ █ █ █ █ █ █ ▌ █
█ █████ █ █ █ █ █ █ █ █
█     █  ▟█     █ █   █
███████████████████████
ludwig v0.2.2.7 - Predict

Dataset path: translation.csv
Model path: results/experiment_run/model

Found hdf5 with the same filename of the csv, using it instead
Loading metadata from: results/experiment_run/model/train_set_metadata.json
Loading data from: translation.hdf5

╒═══════════════╕
│ LOADING MODEL │
╘═══════════════╛

Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
Instructi