                                                18S01ACS011

The first step is to import all the libraries needed for loading the dataset and training the model.

In this notebook I use the encoder-decoder architecture for neural machine translation (NMT). It is flexible, it lets a single end-to-end model be trained directly on source and target sentences, and it handles variable-length input and output sequences of text, which is what my dataset consists of.

The model 'WarwinuTranslates' has been trained on sentence pairs, with English sentences as the source language and the corresponding Gikuyu sentences as the target language.

In [398]:
import numpy as np 
import pandas as pd 
import re
import os
import tensorflow as tf


In [399]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input,LSTM,Dense

myBatchSize=28
nduati_epochs=30
latent_dim=100
samples=1350

data_path = "C:/Users/user/Desktop/clean.txt"

The next step is vectorizing the data. Here, vectorizing means converting the input data from its raw text format into numeric arrays, since a neural network operates on numbers rather than on characters. (The same word is also used for rewriting loops as array operations so that code runs faster; that is not the sense meant here.)

In machine learning it is a form of feature extraction: the text is converted to numeric vectors that the model can train on. In this notebook each character is one-hot encoded.
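
As a small illustration of what vectorizing text looks like, the sketch below one-hot encodes a short string at the character level with NumPy. The helper name `one_hot_chars` and the sample string are assumptions made for this example; they are not part of the notebook's pipeline.

```python
import numpy as np

def one_hot_chars(text, alphabet):
    """Return a (len(text), len(alphabet)) matrix with a single 1 per row."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = np.zeros((len(text), len(alphabet)), dtype="float32")
    for t, ch in enumerate(text):
        matrix[t, index[ch]] = 1.0
    return matrix

sample = "aba"                  # hypothetical input text
alphabet = sorted(set(sample))  # ['a', 'b']
encoded = one_hot_chars(sample, alphabet)
print(encoded)
```

Each row corresponds to one character position and each column to one character of the alphabet, which is the same layout the notebook uses for its encoder and decoder arrays (with an extra leading axis for the sample index).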

In [400]:
# Vectorize the data.
nduatiInputTxts = []
nduatiTargetTxts = []
nduatiInputChars = set()
nduatiTargetChars = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(samples, len(lines) - 1)]:
    unpack = re.split(r'\t+', line)
    
    #input_text, target_text = line.split('\t')
    
    nduatiInputTxt, nduatiTargetTxt = unpack
    
    # I use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    
    nduatiTargetTxt = '\t' + nduatiTargetTxt + '\n'
    nduatiInputTxts.append(nduatiInputTxt)
    nduatiTargetTxts.append(nduatiTargetTxt)
    for val in nduatiInputTxt:
        if val not in nduatiInputChars:
            nduatiInputChars.add(val)
    for val in nduatiTargetTxt:
        if val not in nduatiTargetChars:
            nduatiTargetChars.add(val)

In [401]:
nduatiInputChars=sorted(list(nduatiInputChars))
nduatiTargetChars=sorted(list(nduatiTargetChars))

my_encoder_tokens=len(nduatiInputChars)
my_decoder_tokens=len(nduatiTargetChars)

maxSeqEncoderLength=max([len(txt) for txt in nduatiInputTxts])
maxSeqDecoderLength=max([len(txt) for txt in nduatiTargetTxts])

In [402]:
print('Nduati dataset samples:', len(nduatiInputTxts))
print('Unique input tokens:', my_encoder_tokens)
print('Unique output tokens:', my_decoder_tokens)
print('Max sequence length for inputs:', maxSeqEncoderLength)
print('Max sequence length for outputs:', maxSeqDecoderLength)

Nduati dataset samples: 1348
Unique input tokens: 43
Unique output tokens: 44
Max sequence length for inputs: 481
Max sequence length for outputs: 543


In [403]:
nduatiInputTokenIndex=dict(
    [(val,i) for i, val in enumerate(nduatiInputChars)])
nduatiTargetTokenIndex=dict(
    [(val,i) for i, val in enumerate(nduatiTargetChars)])

In [404]:
nduatiEncoderInputData = np.zeros(
    (len(nduatiInputTxts), maxSeqEncoderLength, my_encoder_tokens),
    dtype='float32')
nduatiDecoderInputData = np.zeros(
    (len(nduatiInputTxts), maxSeqDecoderLength, my_decoder_tokens),
    dtype='float32')
nduatiDecoderTargetData = np.zeros(
    (len(nduatiInputTxts), maxSeqDecoderLength, my_decoder_tokens),
    dtype='float32')

In [405]:
for i, (nduatiInputTxt, nduatiTargetTxt) in enumerate(zip(nduatiInputTxts, nduatiTargetTxts)):
    for t, val in enumerate(nduatiInputTxt):
        nduatiEncoderInputData[i,t, nduatiInputTokenIndex[val]] = 1.
    nduatiEncoderInputData[i, t + 1:, nduatiInputTokenIndex[' ']] = 1.
    for t, val in enumerate(nduatiTargetTxt):
        
        # nduatiDecoderTargetData is ahead of nduatiDecoderInputData by one timestep
        
        nduatiDecoderInputData[i, t, nduatiTargetTokenIndex[val]] = 1.
        if t > 0:
            # nduatiDecoderTargetData will be ahead by one timestep
            
            # and will not include the start character.
            nduatiDecoderTargetData[i, t - 1, nduatiTargetTokenIndex[val]] = 1.
    nduatiDecoderInputData[i, t + 1:, nduatiTargetTokenIndex[' ']] = 1.
    nduatiDecoderTargetData[i, t:, nduatiTargetTokenIndex[' ']] = 1.

In [406]:
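# Define sampling models.
# Note: this cell uses nduatiEncoderInputs, myEncoderStates, myDecoderLstm,
# nduatiDecoderInputs and decoderDense, which are defined in the model-definition
# cell further down. Run it after that cell so that the sampling models share
# their layers (and trained weights) with the model that is trained.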
encoderModel = Model(nduatiEncoderInputs, myEncoderStates)

decoderStateInput_h = Input(shape=(latent_dim,))
decoderStateInput_c = Input(shape=(latent_dim,))
decoderStateInputs = [decoderStateInput_h, decoderStateInput_c]
myDecoderOutputs, state_h, state_c = myDecoderLstm(
    nduatiDecoderInputs, initial_state=decoderStateInputs)
decoderStates = [state_h, state_c]
myDecoderOutputs = decoderDense(myDecoderOutputs)
decoderModel = Model(
    [nduatiDecoderInputs] + decoderStateInputs,
    [myDecoderOutputs] + decoderStates)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverseInputCharIndex = dict(
    (i, char) for char, i in nduatiInputTokenIndex.items())
reverseTargetCharIndex = dict(
    (i, char) for char, i in nduatiTargetTokenIndex.items())

The next important step is defining and using an encoder and a decoder. In this notebook I use the encoder-decoder model, which is a way of applying recurrent neural networks to sequence-to-sequence prediction problems.

The encoder encodes the input sequence, and the decoder decodes the encoded input into the target sequence. The sequence-to-sequence model consists of three parts:

1. The encoder, which accepts a single element of the input sequence at each time step, processes it, collects information for that element and propagates it forward.
2. The intermediate vector, which is the final internal state produced by the encoder part of the model. It carries information about the entire input sequence to help the decoder make accurate predictions.
3. The decoder, which, given the encoded sentence, predicts an output at each time step.
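
During training, the decoder is fed the target sentence and asked to predict the same sentence shifted one step ahead. The sketch below shows that offset on plain strings, using the tab and newline start and end characters that this notebook uses. The function name `make_decoder_pair` and the placeholder sentence are assumptions for illustration, and the sketch is simplified: the notebook's arrays are one-hot encoded and padded.

```python
def make_decoder_pair(target_text, start="\t", end="\n"):
    """Build the decoder input and the decoder target for one sentence."""
    wrapped = start + target_text + end
    decoder_input = wrapped[:-1]   # what the decoder sees at each step
    decoder_target = wrapped[1:]   # what it should predict at that step
    return decoder_input, decoder_target

dec_in, dec_out = make_decoder_pair("abc")  # "abc" is a placeholder sentence
for seen, expected in zip(dec_in, dec_out):
    print(repr(seen), "->", repr(expected))
```

At every position the target is the character that follows the one the decoder has just been given, so the start character never appears in the target.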

In [407]:
# Defining an input sequence and processing it .

nduatiEncoderInputs = Input(shape=(None, my_encoder_tokens))
myEncoder = LSTM(latent_dim, return_state=True)
myEncoderOutputs, state_h, state_c = myEncoder(nduatiEncoderInputs)
# We discard myEncoderOutputs and keep only the states.

myEncoderStates = [state_h, state_c]

# Set up the decoder, using myEncoderStates as the  initial state.

nduatiDecoderInputs = Input(shape=(None, my_decoder_tokens))
# Then we set up our decoder to return full output sequences,
# and to return internal states as well. 
# We don't use the returned states in the training model, but we will use them in inference.

myDecoderLstm = LSTM(latent_dim,return_sequences=True, return_state=True)
myDecoderOutputs, _, _ = myDecoderLstm(nduatiDecoderInputs,
                                     initial_state=myEncoderStates)
decoderDense = Dense(my_decoder_tokens, activation='softmax')
myDecoderOutputs = decoderDense(myDecoderOutputs)

The next important thing I will use is early stopping, which helps reduce overfitting during training and so improves accuracy on data that was not in the training set.

Early stopping halts training once the model's validation accuracy has stopped improving. This prevents the model from memorizing the training set and then performing worse on new inputs that were not included in the dataset. Here the callback waits for 10 epochs without improvement (patience=10) before stopping.
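
The idea of patience can be shown without any deep learning library. The following is a simplified sketch of patience-based stopping on a list of validation scores. The function name `stopping_epoch` and the accuracy values are made up for illustration, and the sketch does not claim to reproduce every detail of the Keras callback.

```python
def stopping_epoch(val_scores, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    waited = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            # Improvement: remember it and reset the counter.
            best = score
            waited = 0
        else:
            # No improvement: count the epoch and stop once patience runs out.
            waited += 1
            if waited >= patience:
                return epoch
    return None

history = [0.50, 0.60, 0.62, 0.61, 0.62, 0.60]  # made-up validation accuracies
print(stopping_epoch(history, patience=3))
```

With these values the best score is reached at epoch 3, and the three following epochs do not beat it, so the sketch reports epoch 6.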

In [408]:
from tensorflow.keras.callbacks import EarlyStopping
myEarlyStopping = EarlyStopping(monitor='val_accuracy' , patience=10, verbose=1)

The next process is training the model, which uses the RNN-LSTM algorithm.

A recurrent neural network (RNN) takes a sequence of text as input, returns a sequence of text as output, or both. The hidden layer of an RNN has a loop in which the output and state from each time step become an input to the next time step, so the recurrence serves as a form of memory.

Long Short-Term Memory (LSTM) is a modified version of the RNN that makes it easier to retain past data in memory. This addresses the long-term dependency problem and so improves model performance.
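
To make the recurrence concrete, here is a minimal NumPy sketch of a single LSTM time step using the standard gate equations, applied in a loop over a short random sequence. The weight names `W`, `U`, `b`, the toy sizes and the random inputs are assumptions for illustration; the Keras `LSTM` layer used in this notebook manages its own weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    units = h_prev.shape[-1]
    z = x @ W + h_prev @ U + b
    i = sigmoid(z[..., :units])                 # input gate
    f = sigmoid(z[..., units:2 * units])        # forget gate
    g = np.tanh(z[..., 2 * units:3 * units])    # candidate cell values
    o = sigmoid(z[..., 3 * units:])             # output gate
    c = f * c_prev + i * g                      # new cell state (long-term memory)
    h = o * np.tanh(c)                          # new hidden state (output)
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 4, 3                         # toy sizes
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)
h = np.zeros(units)
c = np.zeros(units)
for x in rng.normal(size=(5, input_dim)):       # a sequence of 5 time steps
    h, c = lstm_step(x, h, c, W, U, b)          # state is carried to the next step
print(h.shape, c.shape)
```

The pair `(h, c)` after the last step is the kind of state that the encoder passes to the decoder as its initial state.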

In [409]:
# Define the model that will turn
# nduatiEncoderInputData & nduatiDecoderInputData into nduatiDecoderTargetData
model = Model([nduatiEncoderInputs, nduatiDecoderInputs], myDecoderOutputs)

# Run training
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit([nduatiEncoderInputData, nduatiDecoderInputData], nduatiDecoderTargetData,
          batch_size=myBatchSize,
          epochs=nduati_epochs,
          validation_split=0.2,
          callbacks=[myEarlyStopping])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x235aa40c760>

In [410]:
#model.save('WarwinuTranslates.h5')

The next step is to use the model to decode an English input sentence into a Kikuyu output sentence using a function.
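
The decoding function follows a greedy loop: start from the start character, ask the decoder for the next character, feed that character back in, and stop at the end character or at a length limit. The sketch below shows only that control flow. `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is a stand-in for the real decoder model, which would return the most probable next character and an updated state.

```python
def greedy_decode(step_fn, state, start="\t", end="\n", max_len=20):
    """Repeatedly feed the last predicted character back into step_fn."""
    current = start
    decoded = ""
    while len(decoded) < max_len:
        current, state = step_fn(current, state)
        if current == end:
            break
        decoded += current
    return decoded

def toy_step(prev_char, state):
    """Hypothetical stand-in for the decoder: spells out a fixed word."""
    word = "abc\n"
    return word[state], state + 1

print(greedy_decode(toy_step, state=0))
```

The important detail is that `current` is updated on every iteration. If the decoder kept receiving the start character, only the state would change between steps.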

In [414]:
from tensorflow import keras
from keras.models import load_model
from keras.layers import Input, LSTM, Dense
from keras.models import Model

def nduatiDecodeSequence(myInputSeq):
    
    # Encode inputs as state vectors.
    nduatiStatesValue = encoderModel.predict(myInputSeq)

    # Generate empty target sequence of length 1.
    myTargetSeq = np.zeros((1, 1, my_decoder_tokens))
    
    # Populate the first character of target sequence with the start character.
    myTargetSeq[0, 0, nduatiTargetTokenIndex['\t']] = 1.

    # Sampling loop for a batch of sequences (to simplify, here we assume a batch of size 1).
    
    stopCondition = False
    decodedSentence = ''
    while not stopCondition:
        outputTokens, h, c = decoderModel.predict(
            [myTargetSeq] + nduatiStatesValue)
        
        # Sample a token: pick the one with the highest probability.

        sampledTokenIndex = np.argmax(outputTokens[0, -1, :])
        sampledChar = reverseTargetCharIndex[sampledTokenIndex]
        decodedSentence += sampledChar

        # Exit condition: either hit max length or find stop character.
        
        if (sampledChar == '\n' or
           len(decodedSentence) > maxSeqDecoderLength):
            stopCondition = True

        # Update the target sequence (of length 1) so that the next step
        # is fed the character that was just sampled.
        myTargetSeq = np.zeros((1, 1, my_decoder_tokens))
        myTargetSeq[0, 0, sampledTokenIndex] = 1.

        # Update states
        nduatiStatesValue = [h, c]

    return decodedSentence

In [415]:
print(model)

<keras.engine.functional.Functional object at 0x00000235AA7023A0>


In [416]:
class nduatiTranslator:
  exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")
  
  #Method to start the translator
  def start(self):
    user_response = input("Input an English sentence. :) \n")
    self.translate(user_response)
  
  #Method to handle the conversation
  def translate(self, reply):
    while not self.make_exit(reply):
      reply = input(self.generate_response(reply)+"\n")

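  #Method to convert user input into a one-hot matrix, one character per timestep
  def string_to_matrix(self, user_input):
    # The model is character-level, so encode characters rather than words,
    # truncating to the longest input sequence seen in training.
    chars = user_input[:maxSeqEncoderLength]
    user_input_matrix = np.zeros(
      (1, maxSeqEncoderLength, my_encoder_tokens),
      dtype='float32')
    for timestep, char in enumerate(chars):
      if char in nduatiInputTokenIndex:
        user_input_matrix[0, timestep, nduatiInputTokenIndex[char]] = 1.
    # Pad the remaining timesteps with spaces, as in the training data.
    user_input_matrix[0, len(chars):, nduatiInputTokenIndex[' ']] = 1.
    return user_input_matrix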
  
  #Method that will create a response using seq2seq model we built
  def generate_response(self, user_input):
    input_matrix = self.string_to_matrix(user_input)
    chatbot_response = nduatiDecodeSequence(input_matrix)
    #Remove the start ('\t') and end ('\n') characters from chatbot_response
    chatbot_response = chatbot_response.replace('\t','')
    chatbot_response = chatbot_response.replace('\n','')
    return chatbot_response

  #Method to check for exit commands
  def make_exit(self, reply):
    for exit_command in self.exit_commands:
      if exit_command in reply:
        print("Have a Great Day!")
        return True
    return False
  
WarwinuTranslates = nduatiTranslator()
  
 
        

The next step is the translator, which provides a simple 'user interface': the user enters an English sentence and the translator prints the corresponding Kikuyu output. The translator also recognizes keywords that the user can type to stop the program:
1. Quit
2. Pause
3. Exit
4. Goodbye
5. Bye
6. Later
7. Stop
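
The `make_exit` method checks whether a keyword occurs anywhere in the reply, so a sentence containing a word such as "stopped" would also end the session. One possible alternative, sketched below, matches whole words only and ignores case. The names `EXIT_COMMANDS` and `wants_exit` are hypothetical and are not used elsewhere in this notebook.

```python
import re

EXIT_COMMANDS = {"quit", "pause", "exit", "goodbye", "bye", "later", "stop"}

def wants_exit(reply):
    """True if any whole word in reply is an exit command (case-insensitive)."""
    words = re.findall(r"[a-z]+", reply.lower())
    return any(word in EXIT_COMMANDS for word in words)

print(wants_exit("Bye"))
print(wants_exit("the bus stopped"))
```

Whether whole-word matching is preferable depends on the sentences users are expected to translate, since any sentence that contains one of these words on its own would still end the session.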

In [419]:
WarwinuTranslates.start()

Input an English sentence. :) 
person
 * — + + + + x x x ‘ ‘ f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’ + + + + x x x x ‘ ‘ f f f f ’ ’
exit
Have a Great Day!


Model training completed successfully, but the translation is not behaving as expected. I'm currently working on extending the dataset to evaluate whether performance improves.