# Seq2Seq: Translation Chatbot with Attention

### Machine Translation – A Brief History
Most of us were introduced to machine translation when Google came up with the service. But the concept has been around since the middle of last century.

Research work in Machine Translation (MT) started as early as 1950’s, primarily in the United States. These early systems relied on huge bilingual dictionaries, hand-coded rules, and universal principles underlying natural language.

In 1954, IBM held a first ever public demonstration of a machine translation. The system had a pretty small vocabulary of only 250 words and it could translate only 49 hand-picked Russian sentences to English. The number seems minuscule now but the system is widely regarded as an important milestone in the progress of machine translation.

Soon, two schools of thought emerged:

- Empirical trial-and-error approaches, using statistical methods, and
- Theoretical approaches involving fundamental linguistic research

In 1964, the Automatic Language Processing Advisory Committee (ALPAC) was established by the United States government to evaluate the progress in Machine Translation. ALPAC did a little prodding around and published a report in November 1966 on the state of MT. Below are the key highlights from that report:

- It raised serious questions on the feasibility of machine translation and termed it hopeless
- Funding was discouraged for MT research
- It was quite a depressing report for the researchers working in this field
- Most of them left the field and started new careers

Not exactly a glowing recommendation!

A long dry period followed this miserable report. Finally, in 1981, a new system called the METEO System was deployed in Canada for translation of weather forecasts issued in French into English. It was quite a successful project which stayed in operation until 2001.

The world’s first web translation tool, Babel Fish, was launched by the AltaVista search engine in 1997.

And then came the breakthrough we are all familiar with now – Google Translate. It has since changed the way we work (and even learn) with different languages.

### Introduction to Sequence-to-Sequence (Seq2Seq) Modeling
Sequence-to-Sequence (seq2seq) models are used for a variety of NLP tasks, such as text summarization, speech recognition, DNA sequence modeling, among others. Our aim is to translate given sentences from one language to another.

Here, both the input and output are sentences. In other words, these sentences are a sequence of words going in and out of a model. This is the basic idea of Sequence-to-Sequence modeling. 

A typical seq2seq model has 2 major components –

- an encoder
- a decoder

Both these parts are essentially two different recurrent neural network (RNN) models combined into one giant network

Use cases of Sequence-to-Sequence modeling below (apart from Machine Translation, of course):

- Speech Recognition
- Name Entity/Subject Extraction to identify the main subject from a body of text
- Relation Classification to tag relationships between various entities tagged in the above step
- Chatbot skills to have conversational ability and engage with customers
- Text Summarization to generate a concise summary of a large amount of text
- Question Answering systems
 

### Business example

For our example implementation, we will use a dataset of pairs of English sentences and their German translation.

We will implement a character-level sequence-to-sequence model, processing the input character-by-character and generating the output character-by-character. Another option would be a word-level model, which tends to be more common for machine translation. 

#### Here's a summary of our process:

- 1) Turn the sentences into 3 Numpy arrays, encoder_input_data, decoder_input_data, decoder_target_data:
    - encoder_input_data is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) containing a one-hot vectorization of the English sentences.
    - decoder_input_data is a 3D array of shape (num_pairs, max_french_sentence_length, num_german_characters) containg a one-hot vectorization of the French sentences.
    - decoder_target_data is the same as decoder_input_data but offset by one timestep. decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :].
- 2) Train a basic LSTM-based Seq2Seq model to predict decoder_target_data given encoder_input_data and decoder_input_data. Our model uses teacher forcing.
- 3) Decode some sentences to check that the model is working (i.e. turn samples from encoder_input_data into corresponding samples from decoder_target_data).
Because the training process and inference process (decoding sentences) are quite different, we use different models for both, albeit they all leverage the same inner layers.

This is our training model. It leverages three key features of Keras RNNs:

- The return_state contructor argument, configuring a RNN layer to return a list where the first entry is the outputs and the next entries are the internal RNN states. This is used to recover the states of the encoder.
- The inital_state call argument, specifying the initial state(s) of a RNN. This is used to pass the encoder states to the decoder as initial states.
- The return_sequences constructor argument, configuring a RNN to return its full sequence of outputs (instead of just the last output, which the defaults behavior). This is used in the decoder.

**Summary of the algorithm**
- We start with input sequences from a domain (e.g. English sentences)
    and corresponding target sequences from another domain
    (e.g. German sentences).
- An encoder LSTM turns input sequences to 2 state vectors
    (we keep the last LSTM state and discard the outputs).
- A decoder LSTM is trained to turn the target sequences into
    the same sequence but offset by one timestep in the future,
    a training process called "teacher forcing" in this context.
    It uses as initial state the state vectors from the encoder.
    Effectively, the decoder learns to generate `targets[t+1...]`
    given `targets[...t]`, conditioned on the input sequence.
- In inference mode, when we want to decode unknown input sequences, we:
    - Encode the input sequence into state vectors
    - Start with a target sequence of size 1
        (just the start-of-sequence character)
    - Feed the state vectors and 1-char target sequence
        to the decoder to produce predictions for the next character
    - Sample the next character using these predictions
        (we simply use argmax).
    - Append the sampled character to the target sequence
    - Repeat until we generate the end-of-sequence character or we
        hit the character limit.

![](https://cdn-images-1.medium.com/max/2600/1*1I2tTjCkMHlQ-r73eRn4ZQ.png)

## Encoder-Decoder
1. Sentence Data（encoder_input_data、decoder_input_data、decoder_target_data).
  - encoder_input_data（num_pairs、max_english_sentence_length、num_english_characters）
  - decoder_input_data（num_pairs、max_french_sentence_length、num_french_characters）
  - decoder_target_data (decoder_input_data 1 Time Step).
2. encoder_input_data, decoder_input_data, decoder_target_data.

In [56]:
ls

 Volume in drive C has no label.
 Volume Serial Number is 809A-5B06

 Directory of C:\Users\USER\Desktop\Text Mining\seq2seq_translate_slackbot-master

10/05/2019  09:02 AM    <DIR>          .
10/05/2019  09:02 AM    <DIR>          ..
03/18/2019  10:31 AM             6,148 .DS_Store
09/26/2019  08:06 PM    <DIR>          .ipynb_checkpoints
10/05/2019  09:02 AM            93,146 0315_seq2seq_translater.ipynb
03/18/2019  10:31 AM               725 attention.py
03/18/2019  10:31 AM             1,689 explain_mention.py
07/30/2018  04:21 PM        10,200,869 fra.txt
03/18/2019  10:31 AM         3,574,873 jpn.txt
04/14/2019  08:14 AM           104,442 Neural machine translation - Encoder-Decoder seq2seq model-New.ipynb
03/18/2019  10:31 AM    <DIR>          plugins
03/18/2019  10:31 AM             2,228 README.md
03/18/2019  10:31 AM               156 run.py
03/18/2019  10:31 AM               726 seq2seq_attention.py
03/18/2019  10:31 AM            12,388 seq2seq_translate.py
03/18/2019  10:

In [57]:
import numpy as np
import pandas as pd
import string
import pickle
import operator
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Preprocess feeding data

In [58]:
### Importing input file

In [133]:
with open("jpn.txt", 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

In [134]:
len(lines)

43954

In [135]:
lines[0]

'Go.\t行け。'

In [136]:
lines = lines[:10000]

In [137]:
len(lines)

10000

In [138]:
### Creating input & target data

In [139]:
input_texts = []
target_texts = []
bos = "<BOS> "
eos = " <EOS>"

for line in lines:
    if len(line.split("\t")) == 2:
        input_text, target_text = line.split("\t")[0], line.split("\t")[1]
        target_text = bos + target_text + eos
        input_texts.append(input_text)
        target_texts.append(target_text)

len(input_texts)

10000

In [140]:
input_texts

['Go.',
 'Go.',
 'Hi.',
 'Hi.',
 'Run.',
 'Run.',
 'Who?',
 'Wow!',
 'Wow!',
 'Wow!',
 'Fire!',
 'Fire!',
 'Help!',
 'Jump!',
 'Jump!',
 'Jump!',
 'Jump!',
 'Jump!',
 'Jump.',
 'Jump.',
 'Jump.',
 'Stop!',
 'Stop!',
 'Wait!',
 'Go on.',
 'Go on.',
 'Go on.',
 'Go on.',
 'Hello!',
 'Hello!',
 'Hello!',
 'Hurry!',
 'I see.',
 'I see.',
 'I see.',
 'I see.',
 'I see.',
 'I see.',
 'I see.',
 'I try.',
 'I try.',
 'I try.',
 'I try.',
 'I try.',
 'I won!',
 'I won!',
 'I won!',
 'I won!',
 'I won!',
 'Oh no!',
 'Oh no!',
 'Oh no!',
 'Oh no!',
 'Oh no!',
 'Oh no!',
 'Relax.',
 'Relax.',
 'Relax.',
 'Shoot!',
 'Smile.',
 'Smile.',
 'Cheers!',
 'Freeze!',
 'Get up.',
 'Get up.',
 'Get up.',
 'Go now.',
 'Got it!',
 'Got it!',
 'He ran.',
 'He ran.',
 'Hop in.',
 'Hop in.',
 'Hug me.',
 'Hug me.',
 'I know.',
 'I know.',
 'I left.',
 'I lost.',
 'I quit.',
 'I quit.',
 "I'm 19.",
 "I'm OK.",
 "I'm OK.",
 "I'm up.",
 'Listen.',
 'Listen.',
 'No way!',
 'No way!',
 'No way!',
 'No way!',
 'Reall

In [141]:
target_texts

['<BOS> 行け。 <EOS>',
 '<BOS> 行きなさい。 <EOS>',
 '<BOS> やっほー。 <EOS>',
 '<BOS> こんにちは！ <EOS>',
 '<BOS> 走れ。 <EOS>',
 '<BOS> 走って！ <EOS>',
 '<BOS> 誰？ <EOS>',
 '<BOS> すごい！ <EOS>',
 '<BOS> ワォ！ <EOS>',
 '<BOS> わぉ！ <EOS>',
 '<BOS> 火事だ！ <EOS>',
 '<BOS> 火事！ <EOS>',
 '<BOS> 助けて！ <EOS>',
 '<BOS> 飛び越えろ！ <EOS>',
 '<BOS> 跳べ！ <EOS>',
 '<BOS> 飛び降りろ！ <EOS>',
 '<BOS> 飛び跳ねて！ <EOS>',
 '<BOS> ジャンプして！ <EOS>',
 '<BOS> 跳べ！ <EOS>',
 '<BOS> 飛び跳ねて！ <EOS>',
 '<BOS> ジャンプして！ <EOS>',
 '<BOS> やめろ！ <EOS>',
 '<BOS> 止まれ！ <EOS>',
 '<BOS> 待って！ <EOS>',
 '<BOS> 続けて。 <EOS>',
 '<BOS> 進んで。 <EOS>',
 '<BOS> 進め。 <EOS>',
 '<BOS> 続けろ。 <EOS>',
 '<BOS> こんにちは。 <EOS>',
 '<BOS> もしもし。 <EOS>',
 '<BOS> こんにちは！ <EOS>',
 '<BOS> 急げ！ <EOS>',
 '<BOS> なるほど。 <EOS>',
 '<BOS> なるほどね。 <EOS>',
 '<BOS> わかった。 <EOS>',
 '<BOS> わかりました。 <EOS>',
 '<BOS> そうですか。 <EOS>',
 '<BOS> そうなんだ。 <EOS>',
 '<BOS> そっか。 <EOS>',
 '<BOS> 頑張ってみる。 <EOS>',
 '<BOS> やってみる。 <EOS>',
 '<BOS> 試してみる。 <EOS>',
 '<BOS> やってみよう！ <EOS>',
 '<BOS> トライしてみる。 <EOS>',
 '<BOS> 俺の勝ちー！ <EOS>',
 '<BOS> 勝ったぁ！ <

In [142]:
num_samples = len(target_texts)
num_samples

10000

In [143]:
input_texts[9000]

'This bird cannot fly.'

In [144]:
target_texts[9000]

'<BOS> この鳥は飛ぶことができない。 <EOS>'

### 2-2. Tokenization

In [145]:
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

# Process english and french sentences
for line in range(len(lines)):
    
    input_line = str(lines[line]).split('\t')[0]
    
    # Append '\t' for start of the sentence and '\n' to signify end of the sentence
    target_line = '\t' + str(lines[line]).split('\t')[1] + '\n'
    input_texts.append(input_line)
    target_texts.append(target_line)
    
    for ch in input_line:
        if (ch not in input_characters):
            input_characters.add(ch)
            
    for ch in target_line:
        if (ch not in target_characters):
            target_characters.add(ch)

In [146]:
# Get Maximum Sentense Length
english_maxlen = max([ len(text) for text in input_texts ])
japanese_maxlen = max([ len(text) for text in target_texts ])

In [170]:
english_char_size = len(input_characters)

japanese_char_size = len(target_characters)

### 2-3. Vectorization & Dictionary Creation

In [171]:
input_char_index = dict([(char, i) for i, char in enumerate(input_characters)])

In [172]:
target_char_index = dict([(char, i) for i, char in enumerate(target_characters)])

In [173]:
input_index_char = dict((i, char) for char, i in input_char_index.items())

In [174]:
target_index_char = dict((i, char) for char, i in target_char_index.items())

In [175]:
# 3D tensor (number of samples, MAX_LEN, dictionary size)
encoder_input_data = np.zeros((len(input_texts), english_maxlen, english_char_size), dtype="float32")

decoder_input_data = np.zeros((len(input_texts), japanese_maxlen, japanese_char_size), dtype="float32")

decoder_output_data = np.zeros((len(input_texts), japanese_maxlen, japanese_char_size), dtype="float32")

In [176]:
encoder_input_data.shape

(10000, 22, 72)

In [177]:
decoder_input_data.shape

(10000, 32, 1476)

In [178]:
decoder_output_data.shape

(10000, 32, 1476)

In [179]:
for i, text in enumerate(input_texts):
    char = char.lower()
    for j, char in enumerate(text):
        encoder_input_data[i][j][input_char_index[char]] = 1.

In [180]:
for i, text in enumerate(target_texts):
    for j, char in enumerate(text):
        decoder_input_data[i][j][target_char_index[char]] = 1.
        # Decoder output shifts by one step
        if j > 0:
            decoder_output_data[i, j-1, target_char_index[char]] = 1.

## 2. Modeling Seq2Seq

## Process
1. Encode the input sequence into state vectors.
2. Start with a target sequence of size 1 (just the start-of-sequence character).
3. Feed the state vectors and 1-char target sequence to the decoder to produce predictions for the next character.
4. Sample the next character using these predictions (we simply use argmax).
5. Append the sampled character to the target sequence
6. Repeat until we generate the end-of-sequence character or we hit the character limit.

- Encoder input data: id word sequence (English)
- Encoder input layer: None × 2D matrix of word sequence length
- Encoder LSTM layer: number of hidden layer units

- Decoder input data: id word sequence (Japanese)
- Decoder input layer: None × 2D matrix of word sequence length
- Decoder LSTM layer: number of hidden layer units

In [181]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

In [182]:
HIDDEN_DIM = 256
NUM_ENCODER_TOKENS = english_char_size
NUM_DECODER_TOKENS = japanese_char_size

In [183]:
encoder_inputs = Input(shape=(None, NUM_ENCODER_TOKENS))
encoder_LSTM = LSTM(units=HIDDEN_DIM, return_state=True) # this is the key
encoder_outputs, state_h, state_c = encoder_LSTM(encoder_inputs) # Ignore encoder output, save only memory cell state
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, NUM_DECODER_TOKENS))
decoder_LSTM = LSTM(units=HIDDEN_DIM, return_sequences=True, return_state=True) # this is the key
decoder_outputs, _h, _c = decoder_LSTM(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(units=NUM_DECODER_TOKENS, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [184]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

In [185]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, None, 72)     0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, None, 1476)   0                                            
__________________________________________________________________________________________________
lstm_3 (LSTM)                   [(None, 256), (None, 336896      input_3[0][0]                    
__________________________________________________________________________________________________
lstm_4 (LSTM)                   [(None, None, 256),  1774592     input_4[0][0]                    
                                                                 lstm_3[0][1]                     
          

## LSTM Layer works as encoder and decoder

- Encoder: Processes the input sequence and returns its own internal state, which serves as the `` context '' and `` state '' of the decoder in the next step.
- Decoder: If there is a previous character, it is trained to predict the next character in the target sequence.
- Thought Vector: Encoder uses the state vector from Encoder as the initial state. This is a method to get information about what Decoder generates

- return_state constructor argument. 
- Configure the RNN layer to return a list where the first entry is output and the next entry is an internal RNN state. 
- This is used to recover the encoder state.

- An inital_state call argument that specifies the initial state of the RNN. 
- This is used to pass the encoder state to the decoder as the initial state.

- return_sequences constructor argument. 
- Set the RNN to return its complete output sequence (not just the last output, which is the default behavior). 
- This is used in the decoder.

## 3. Fitting and Visualization

In [187]:
BATCH_SIZE = 32
EPOCHS = 5

In [188]:
model.fit([encoder_input_data, decoder_input_data], decoder_output_data,
          batch_size=BATCH_SIZE,
          epochs=EPOCHS,
          validation_split=0.2)

model.save("seq2seq_translation.h5")

Train on 8000 samples, validate on 2000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


  '. They will not be included '


In [192]:
from keras.models import load_model
model = load_model('seq2seq_translation.h5')

In [None]:
#The below step is how to tune the parameters (Dont run the below step)

In [None]:
BATCH_SIZE = [32, 64]
EPOCHS = [5, 10]

for i in range(2):
    BATCH_SIZE = BATCH_SIZE[i]
    for e in range(2):
        EPOCHS = EPOCHS[e]
        print("------------------------------------")
        print("BATCH_SIZE: ", BATCH_SIZE)
        print("EPOCHS: ", EPOCHS)
        history = model.fit([encoder_input_data, decoder_input_data], decoder_output_data,
                            batch_size=BATCH_SIZE,
                            epochs=EPOCHS,
                            validation_split=0.2)

## 4. Inference

In [198]:
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(HIDDEN_DIM,))
decoder_state_input_c = Input(shape=(HIDDEN_DIM,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, decoder_h,decoder_c = decoder_LSTM(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [decoder_h, decoder_c ]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(inputs = [decoder_inputs] + decoder_states_inputs, outputs=[decoder_outputs] + decoder_states)

In [199]:
def inference_translater(english_input):
    
    encoder_states = encoder_model.predict(english_input)

    japanese_seq = np.zeros((1, 1, NUM_DECODER_TOKENS))
    japanese_seq[0, 0, target_char_index[bos]] = 1.
    
    stop_condition = False
    japanese_output = ""
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([japanese_seq] + encoder_states)
        # Softmax returns the id of the largest char np.argmax
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        
        # Convert to Japanese word
        sampled_char = target_index_char[sampled_token_index]
        japanese_output += sampled_char

        # Exit condition: either hit max length or find stop character.
        if (sampled_char == eos or len(japanese_output) > japanese_maxlen):
            stop_condition = True

        japanese_seq = np.zeros((1, 1, NUM_DECODER_TOKENS))
        japanese_seq[0, 0, sampled_token_index] = 1.

        encoder_states = [_h, _c]

    return japanese_output

In [216]:
english_text = "I am so tired and I feel bored to go back to my home."

In [219]:
inp_seq = encoder_input_data[1:2]

In [220]:
inp_seq

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]], dtype=float32)

In [221]:
inference_translater(inp_seq)

KeyError: '<BOS> '

In [218]:
for seq_index in range(10):
    inp_seq = encoder_input_data[seq_index:seq_index+1]
    translated_sent = inference_translater(inp_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', translated_sent)

KeyError: '<BOS> '

In [207]:
decode_seq(english_text)

AttributeError: 'str' object has no attribute 'ndim'

In [201]:
inference_translater(english_text)

AttributeError: 'str' object has no attribute 'ndim'