# AIS MACHINE TRANSLATION WORKSHOP

*Using a Recurrent Neural Network (RNN) for Natural Language Processing to translate text from French to English.*


By Michael Le, Maitreyee Mhasakar

Content and other contributions by Janam Parikh, Arshdeep Singh, Rama Narayan Lakshmanan

GitHub link to resources: [https://github.com/aisutd/Fall19_Workshop2_Machine_Translation](https://github.com/aisutd/Fall19_Workshop2_Machine_Translation)







## What is Natural Language Processing?

Natural Language Processing, usually shortened to NLP, is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

The ultimate objective of NLP is to read, decipher, understand, and make sense of the human languages in a manner that is valuable. Most NLP techniques rely on machine learning to derive meaning from human languages.

<img src='https://res.cloudinary.com/rsmglobal/image/fetch/t_default%2Cf_auto%2Cq_auto/https://www.rsm.global/singapore/sites/default/files/media/Publications/Our%20Expert%20Insights/rsm-tmt-nlp.jpg' height="500" width="600"/>





## What is Machine Translation?

Machine translation (MT) refers to fully automated software that can translate source content into target languages. 
Humans may use MT to help them render text and speech into another language, or the MT software may operate without human intervention.


Main approaches to machine translation:

*   **First-generation rule-based (RbMT) systems** : Based on Grammar, Syntax, Phraseology

*   **Statistical systems (SMT)** : Based on search and big data. As large volumes of parallel texts became available, SMT developers learned to pattern-match reference texts to find the translations that are statistically most likely to be suitable. These systems can be built faster than RbMT systems, provided there is enough existing language material to reference.
 
*   **Neural MT (NMT)** : Uses machine learning to teach software how to produce the best result. This process consumes large amounts of processing power, which is why it is often run on graphics processing units (GPUs). NMT started gaining visibility in 2016. Many MT providers are now switching to this technology.






In [0]:
#Importing required libraries

import string
import re
import math
import io
import numpy as np
from numpy import array, argmax, random, take

import pandas as pd

from sklearn.model_selection import train_test_split

from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Input, RepeatVector, TimeDistributed, GRU
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model, Model
from keras.utils import to_categorical
from keras import optimizers


import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
import tensorflow as tf
tf.__version__


In [0]:
#Upload dataset
from google.colab import files
uploaded = files.upload()

In [0]:
#Import English sentence data from file into a dataframe
english_df = pd.read_csv('small_vocab_en', sep='\n', header=None, names=['English'])
print(english_df.head())
english_df.shape

In [0]:
#Import French sentence data from file into a dataframe
french_df = pd.read_csv('small_vocab_fr', sep='\n', header=None, names=['French'])
print(french_df.head())
french_df.shape

In [0]:
#Final dataset dataframe
df = pd.concat([english_df, french_df], axis=1, join='inner')
df.info()
print(df.head())
df.shape

In [0]:
#Remove missing and blank records from data
"""
df['English'].replace('', np.nan, inplace=True)
df['French'].replace('', np.nan, inplace=True)
df.dropna(subset=['English'], inplace=True)
df.dropna(subset=['French'], inplace=True)
print(df.shape)
"""

In [0]:
#Lowercase english sentences as part of preprocessing
df1=df.copy()
df1["English"] = df1["English"].str.lower()
print(df1.head())
print(df1.shape)

## What are Neural Networks?

A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. 
In this sense, neural networks refer to systems of neurons, either organic or artificial in nature. 

Neural networks can adapt to changing input; so, the network generates the best possible result without needing to redesign the output criteria. 

The concept of neural networks, which has its roots in artificial intelligence, is swiftly gaining popularity in the development of trading systems.

<img src='https://miro.medium.com/max/1592/1*yGMk1GSKKbyKr_cMarlWnA.jpeg'>



**Three fundamental components** of neural networks:

1. **Structure** - what the neural network looks like, including all the mathematical functions involved, the number of inputs and outputs, and the parameters, called **weights** that the network has to learn.
    
2. **Loss Function** - a metric that tells us how good or bad the network's predictions are. 
3. **Optimizer** - the algorithm used for **learning the weights** that give the network the best predictions.


### The Simplest Neural Network - The Perceptron
The perceptron, arguably the simplest neural network, was invented by psychologist Frank Rosenblatt in 1957 and looks something like this:
![perceptron](https://docs.google.com/uc?export=download&id=1SbHK9XPrP1PSO9T-lh9uG9CTCNjdXhU1)

(image source: http://ataspinar.com/2016/12/22/the-perceptron/)

A perceptron is basically a neural network with a single **artificial neuron**. Similar to the biological neuron, a perceptron has the following characteristics:

- **inputs** - the perceptron receives a given number of real-valued inputs (the inputs are numbers).
- **weights** - the perceptron has a weight $ w_i $ associated with each input $ x_i $. These weighted connections are like synapses and they are parameters that the perceptron must "learn".
- **weighted sum (basically a dot product)** - the inputs are multiplied by the weights and the results are added together to produce a weighted sum.
- **activation function** - the perceptron has an activation function called the unit-step function that produces an output of 1 if the weighted sum is greater than some threshold $\theta$ and -1 otherwise.
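
The four characteristics above can be put together in a few lines. This is a minimal sketch rather than part of the workshop notebook: the function name `perceptron_output`, the example weights, and the threshold value are all illustrative assumptions, and the weights are set by hand instead of being learned.

```python
import numpy as np

def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum exceeds the threshold, otherwise -1."""
    # Weighted sum: multiply each input by its weight and add the results
    weighted_sum = np.dot(inputs, weights)
    # Step activation as described in the text above
    return 1 if weighted_sum > threshold else -1

# Hypothetical weights and threshold, chosen by hand for illustration
weights = np.array([0.5, 0.5])
threshold = 0.7

fires = perceptron_output(np.array([1.0, 1.0]), weights, threshold)   # weighted sum is 1.0
silent = perceptron_output(np.array([1.0, 0.0]), weights, threshold)  # weighted sum is 0.5
print(fires, silent)
```

With these hand-picked values the perceptron behaves like a logical AND of its two inputs: it only fires when both are active.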
  

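*tanh*:
tanh is similar to the logistic sigmoid, but its output is centered on zero, which often makes training easier. The range of the tanh function is (-1, 1), whereas the sigmoid ranges over (0, 1). tanh is also sigmoidal (S-shaped).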

![tanh](https://miro.medium.com/max/744/1*f9erByySVjTjohfFdNkJYQ.jpeg)
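
To make the relationship between tanh and the logistic sigmoid concrete, here is a small sketch (the helper `sigmoid` and the sample inputs are assumptions made for illustration). It relies on the identity $\tanh(x) = 2\,\sigma(2x) - 1$, which says tanh is a rescaled and shifted sigmoid.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])

tanh_values = np.tanh(x)        # outputs in (-1, 1), zero at x = 0
sigmoid_values = sigmoid(x)     # outputs in (0, 1), 0.5 at x = 0

# tanh written as a rescaled, shifted sigmoid
rescaled = 2.0 * sigmoid(2.0 * x) - 1.0

print(tanh_values)
print(sigmoid_values)
print(np.allclose(tanh_values, rescaled))
```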

*Softmax*: 
The softmax function takes a vector as input and produces a vector of the same shape as output, so it acts on an entire layer rather than on a single neuron. It converts a vector of real values into a probability distribution (non-negative values that sum to 1), which makes it useful for representing the probabilities of different classes.
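
A minimal NumPy sketch of softmax is shown here; the function name `softmax` and the example scores are assumptions for illustration. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Convert a 1-D vector of real-valued scores into a probability distribution."""
    shifted = scores - np.max(scores)   # avoids overflow in exp for large scores
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw scores for three classes
scores = np.array([2.0, 1.0, 0.1])
probabilities = softmax(scores)

print(probabilities)
print(probabilities.sum())
```

The largest score receives the largest probability, and the outputs always sum to 1.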




**Hidden layers** : Layers of neurons that sit between the input and output layers.

**Dropout** : Technique to reduce overfitting in neural networks by temporarily switching off randomly chosen neurons at each training step.
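
As a rough illustration of dropout, the sketch below applies a random mask to a layer's activations. It shows one common variant (often called inverted dropout), in which the surviving activations are scaled up so their expected value is unchanged. The function name `dropout` and the drop rate are assumptions for this example, not something taken from the workshop model.

```python
import numpy as np

def dropout(activations, drop_rate, rng):
    """Zero out a random subset of activations and rescale the rest."""
    keep_prob = 1.0 - drop_rate
    # Each neuron is kept independently with probability keep_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(10)
dropped = dropout(activations, drop_rate=0.5, rng=rng)

print(dropped)  # every entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

Dropout is applied only while training; at prediction time all neurons are used.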

**Loss function** : Method of evaluating how well a model fits the given data. If the predictions deviate too much from the actual results, the loss function returns a very large number. With the help of an optimization algorithm, the network gradually adjusts its weights to reduce this loss, and with it the error in prediction.

*Cross-entropy loss*: measures the performance of a classification model whose output is a probability value between 0 and 1.
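
For a single example with an integer class label, cross-entropy reduces to the negative log of the probability the model assigned to the correct class. The sketch below illustrates this; the function name `cross_entropy` and the probability vectors are made up for the example.

```python
import numpy as np

def cross_entropy(predicted_probs, true_class):
    """Cross-entropy loss for one example with an integer class label."""
    return -np.log(predicted_probs[true_class])

# The true class is 0 in both cases; only the prediction changes
confident_right = cross_entropy(np.array([0.90, 0.05, 0.05]), true_class=0)
confident_wrong = cross_entropy(np.array([0.05, 0.90, 0.05]), true_class=0)

print(confident_right)   # small loss: high probability on the correct class
print(confident_wrong)   # large loss: low probability on the correct class
```

This integer-label form is the idea behind the `sparse_categorical_crossentropy` loss used when the model is compiled later in this notebook.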


**Forward Pass**: The forward pass is the process of computing the values of the output layer from the input data. It traverses all neurons from the first layer to the last.

**Backpropagation**:
The backward pass is the process of computing how the weights should change, using the gradient descent algorithm (or a similar one). The computation runs from the last layer backward to the first layer.
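
The forward and backward passes can be traced by hand on a very small model. The sketch below uses a single linear neuron with a squared-error loss and one gradient descent update. The data, starting weights, and learning rate are all made-up values for illustration, and a real network repeats the same idea layer by layer.

```python
import numpy as np

# Toy example: one input with two features and a target value (made up)
x = np.array([1.0, 2.0])
target = 1.0

weights = np.array([0.1, -0.2])   # arbitrary starting weights
learning_rate = 0.1

# Forward pass: inputs -> prediction -> loss
prediction = np.dot(weights, x)
loss_before = 0.5 * (prediction - target) ** 2

# Backward pass: gradient of the loss with respect to each weight (chain rule)
grad_weights = (prediction - target) * x

# Gradient descent update: move the weights against the gradient
weights = weights - learning_rate * grad_weights

# A second forward pass shows that the loss went down
loss_after = 0.5 * (np.dot(weights, x) - target) ** 2
print(loss_before, loss_after)
```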


In [0]:
#Tokenization
def tokenization(sentences):
      tokenizer = Tokenizer(lower=False)
      tokenizer.fit_on_texts(sentences)
      return tokenizer

In [0]:
#English Tokenization and Unique word/Vocabulary count
eng_tokenizer = tokenization(df1["English"].astype('str'))

eng_vocab_size = len(eng_tokenizer.word_index) + 1

print('English Vocabulary Size: %d' % eng_vocab_size)

In [0]:
#French Tokenization and Unique word/Vocabulary count

fren_tokenizer = tokenization(df1["French"].astype('str'))

print(f'French Vocabulary Size: {len(fren_tokenizer.word_index) + 1}')

In [0]:
#Convert text to integer sequences for English
english_sequences = eng_tokenizer.texts_to_sequences(df1["English"].values)
print(english_sequences[0])
print(df1["English"].values[0])
print(eng_tokenizer.word_index)


In [0]:
#Convert text to integer sequences for French

french_sequences = fren_tokenizer.texts_to_sequences(df1["French"].values)
print(french_sequences[0])
print(df1["French"].values[0])
print(fren_tokenizer.word_index)

In [0]:
#Pad sequences with zeros to make them of equal length for processing
english_sequences = pad_sequences(english_sequences, padding='post')
french_sequences = pad_sequences(french_sequences, padding='post')
print(english_sequences.shape)
print(english_sequences[0])
print(french_sequences.shape)
print(french_sequences[0])

In [0]:
#Split data into train and test data
train_french_input, test_french_input, train_english_output, test_english_output = train_test_split(french_sequences, 
                                                    english_sequences, 
                                                    test_size=0.2, 
                                                    random_state=1)

num_train_samples = train_french_input.shape[0]
num_test_samples = test_french_input.shape[0]
print(f'Number of training samples: {num_train_samples}')
print(f'Number of testing samples:  {num_test_samples}')
print()

max_english_sentence_length = train_english_output.shape[1]
max_french_sentence_length = train_french_input.shape[1]
print(f'Max english sentence length:    {max_english_sentence_length}')
print(f'Max french sentence length:     {max_french_sentence_length}')
print()

train_french_input = train_french_input.reshape(num_train_samples, max_french_sentence_length, 1)
train_english_output = pad_sequences(train_english_output, maxlen=max_french_sentence_length, padding='post')
train_english_output = train_english_output.reshape(num_train_samples, max_french_sentence_length, 1)

test_french_input = test_french_input.reshape(num_test_samples, max_french_sentence_length, 1)
test_english_output = pad_sequences(test_english_output, maxlen=max_french_sentence_length, padding='post')
test_english_output = test_english_output.reshape(num_test_samples, max_french_sentence_length, 1)

print(f'Train French:   {train_french_input.shape}')
print(f'Test French:    {test_french_input.shape}')
print(f'Train English:  {train_english_output.shape}')
print(f'Test English:   {test_english_output.shape}')

## What are Recurrent Neural Networks?
A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.


RNNs are designed to take sequences of text as inputs or return sequences of text as outputs, or both. 

They're called recurrent because the network's hidden layers have a loop in which the output from one time step becomes an input at the next time step. This recurrence serves as a form of memory. 

It allows contextual information to flow through the network so that relevant outputs from previous time steps can be applied to network operations at the current time step. 
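
The loop described above can be sketched in NumPy. This is a simplified illustration of a plain (non-LSTM) recurrent layer, not the model trained in this workshop. The names `rnn_step`, `W_x`, `W_h`, and `b`, and the layer sizes, are assumptions, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# Randomly initialised, untrained parameters (illustrative only)
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # previous hidden -> hidden
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

sequence = rng.normal(size=(5, input_size))  # 5 time steps of 3 features each
h = np.zeros(hidden_size)                    # initial hidden state ("empty memory")
hidden_states = []
for x_t in sequence:
    h = rnn_step(x_t, h)          # the output of this step feeds the next step
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)
print(hidden_states.shape)        # one hidden state per time step
```

Keeping the hidden state from every time step, as done here, corresponds to what `return_sequences=True` requests from a Keras recurrent layer.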

<img src="https://qph.fs.quoracdn.net/main-qimg-6eced51767f5bcd94b32bbe50da438e9">

## Vanishing Gradient Problem

As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train.

The sigmoid function, for example, squashes a large input range into the interval (0, 1), so a large change in its input causes only a small change in its output. Hence, its derivative is small, and multiplying many such small derivatives together during backpropagation shrinks the gradient further with every layer.

A small gradient means that the weights and biases of the initial layers will not be updated effectively at each training step. Since these initial layers are often crucial to recognizing the core elements of the input data, this can lead to overall inaccuracy of the whole network.
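
A quick numerical check shows how fast this shrinkage happens. The sigmoid derivative is at most 0.25, so a product of one such factor per layer gets small quickly. The sketch below ignores the weights entirely and only looks at this best-case activation factor, which is a simplifying assumption. The helper names are made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0, where it equals 0.25
max_derivative = sigmoid_derivative(0.0)

# Best-case product of derivatives when backpropagating through `depth` sigmoid layers
for depth in (1, 5, 10, 20):
    print(depth, max_derivative ** depth)

upper_bound_20_layers = max_derivative ** 20
```

Even in this best case, the factor contributed by 20 sigmoid layers is below one in a hundred billion, which is why the early layers barely learn.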



## What are LSTMs (Long short-term memory)?

Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. It can not only process single data points (such as images), but also entire sequences of data (such as speech or video). For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.


A common LSTM unit is composed of a **cell**, an **input gate**, an **output gate** and a **forget gate**. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.

RNNs using LSTM units partially solve the vanishing gradient problem, because LSTM units allow gradients to also flow unchanged.



<img src='https://blog.keras.io/img/seq2seq/seq2seq-teacher-forcing.png'>



**The cell** : Responsible for keeping track of the dependencies between the elements in the input sequence. 

**The input gate** : Controls the extent to which a new value flows into the cell.

**The forget gate** : Controls the extent to which a value remains in the cell.

**The output gate** : Controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit.

The activation function of the LSTM gates is often the logistic sigmoid function.
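
The way the cell and the three gates interact can be sketched as a single LSTM time step in NumPy. This follows the commonly used textbook formulation and is meant as an illustration only. It is not how Keras lays out its weights internally, and the names `lstm_step` and `params`, the layer sizes, and the random weights are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state and the new cell state."""
    z = np.concatenate([h_prev, x_t])                 # previous output joined with current input
    f = sigmoid(params["W_f"] @ z + params["b_f"])    # forget gate: how much of the old cell to keep
    i = sigmoid(params["W_i"] @ z + params["b_i"])    # input gate: how much new information to let in
    o = sigmoid(params["W_o"] @ z + params["b_o"])    # output gate: how much of the cell to expose
    g = np.tanh(params["W_g"] @ z + params["b_g"])    # candidate values for the cell
    c = f * c_prev + i * g                            # updated cell state
    h = o * np.tanh(c)                                # output activation of the unit
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# Random, untrained parameters for the three gates and the candidate values
params = {}
for gate in ("f", "i", "o", "g"):
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):          # run over a 5-step sequence
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)
```

Because the cell state is updated by scaling and adding (`f * c_prev + i * g`) rather than by passing through another squashing layer, gradients can flow back through it with less shrinkage.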

<img src='https://miro.medium.com/max/2840/1*0f8r3Vd-i4ueYND1CUrhMA.png'>



In [0]:
english_vocab_size = len(eng_tokenizer.word_index) + 1

#Create and Build the RNN model
model = Sequential()
#model.add(Embedding(input_dim=len(fren_tokenizer.word_index) + 1, output_dim=128, mask_zero=True))


# return sequences is to get the output of the LSTM for each time step to pass
#   to the next layer in the model
model.add(LSTM(256, input_shape=train_french_input.shape[1:], return_sequences=True)) # Layer 1 (Input Layer)

model.add(TimeDistributed(Dense(512, activation='tanh'))) # Layer 2 (Only hidden layer)

# model outputs probabilities over the English vocabulary at each time step
model.add(TimeDistributed(Dense(english_vocab_size, activation='softmax'))) # Final (Output) Layer

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

In [0]:
#Run the model on training data

model.fit(train_french_input, train_english_output, batch_size=1024, epochs=50)

In [0]:
model.save('50_epochs_model.h5')

In [0]:
#Load the saved model
model = load_model('50_epochs_model.h5')




In [0]:
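#Predict on unseen data
# predict_classes is not available in recent Keras releases; taking the argmax
# over the softmax output gives the same class indices
sen_prediction = np.argmax(model.predict(test_french_input), axis=-1)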

In [0]:
def get_word(n, tokenizer):
      for word, index in tokenizer.word_index.items():
          if index == n:
              return word
      return None



In [0]:
preds_text=[]
for i in sen_prediction:
    temp = []
    for j in range(len(i)):
      t = get_word(i[j], eng_tokenizer)
      # Blank out padding/unknown indices and immediate repeats of the previous word
      if t is None or (j > 0 and t == get_word(i[j-1], eng_tokenizer)):
        temp.append('')
      else:
        temp.append(t)

    preds_text.append(' '.join(temp))

In [0]:
#Original Output and Predictions 

print(f"Original Sentence:     {' '.join(fren_tokenizer.sequences_to_texts(test_french_input[50]))}")


print(f"Expected Sentence:     {' '.join(eng_tokenizer.sequences_to_texts(test_english_output[50]))}")

print("Predicted Sentence:   ",preds_text[50])

In [0]:

print(fren_tokenizer.sequences_to_texts(test_french_input[2]))
print(eng_tokenizer.sequences_to_texts(test_english_output[2]))

In [0]:
print(eng_tokenizer.word_index['autumn'])
print(eng_tokenizer.word_index['fall'])
