<h2>Normalizing informal language using Deep Learning</h2>

1 Overview:

1.1. Introduction:

The project's objective is to convert informal text into a more structured and normalized format using Natural Language Processing (NLP) techniques, which can have a variety of applications and benefits in areas such as translation, information retrieval, and accessibility. In this work, we will aim to tackle the problem by proposing an Encoder-Decoder sequence-to-sequence (seq2seq) architecture with attention mechanism.


1.2. Business Problem:

The problem we are trying to address is the difficulty of processing and understanding text
written in an informal or unstructured way. Informal language can be challenging to interpret for humans and machines, leading to misinterpretations or misunderstandings of the text's meaning. Normalizing the text allows us to produce a more structured representation that is easier to process and understand. This can help to improve the accuracy of machine translation systems, assist in information retrieval tasks, and make the text more accessible to individuals with language processing difficulties.

Informal input : U wan me to chop seat 4 u nt?

Formal input : Do you want me to reserve seat for you or not?


1.3. Dataset:

We are going to use the NUS Social Media Text Normalization and Translation Corpus dataset for the project. The corpus was created for social media text normalization and translation by randomly selecting 2,000 messages from the NUS English SMS corpus.

2 Loading data & Data Preprocessing

2.1 Loading data<br />
2.2 Data Augmentation<br />
2.3 Adding Beginning of the Sentence token and End of the Sentence token<br />
2.4 Visualization of distribution of length of Encoder Input, Decoder Input and Decoder Output using distance plot<br />
2.5 Splitting the data into training, validation and test sets<br />
2.6 Tokenizing data<br />
2.7 Padding Data<br />

In [None]:
# Import the required file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
import pickle
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.translate.bleu_score import sentence_bleu
import plotly.figure_factory as ff
import time

import warnings
warnings.filterwarnings("ignore")

In [332]:
# Read the train, validation and test files & tokenizer models
train = pd.read_csv('../data/processed/train.csv', index_col=[0])
validation = pd.read_csv('../data/processed/validation.csv', index_col=[0])
test = pd.read_csv('../data/processed/test.csv', index_col=[0])

with open('../model/tokenizer.pkl', 'rb') as file:
    tokenizer = pickle.load(file)

3 Designing the Data Pipeline:

As the model expects tuples of length batch size of preprocessed data at runtime, we will construct a data pipeline before we can train the model. The data will be padded into sequences as we load the source and target tokenizers. After that, feed it based on the batch size.

3.1. Preprocessing the Data:

We will first convert sentences into sequences. First we will tokenize the samples and perform data padding to make sure all the vectors are of length 100.

In [334]:
class Dataset:
    def __init__(self, data, tokenizer, max_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __padding__(self, sequence):
        return pad_sequences(sequence, maxlen = self.max_length, dtype = 'int32', padding = 'post')
    
    def __getitem__(self, i):
        self.encoder_input_sequence = self.tokenizer['informal'].texts_to_sequences([self.data['encoder_input'].values[i]])
        self.decoder_input_sequence = self.tokenizer['normalized'].texts_to_sequences([self.data['decoder_input'].values[i]])
        self.decoder_output_sequence = self.tokenizer['normalized'].texts_to_sequences([self.data['decoder_output'].values[i]])
        return self.__padding__(self.encoder_input_sequence), self.__padding__(self.decoder_input_sequence), self.__padding__(self.decoder_output_sequence)
        

    def __len__(self):
        return len(self.encoder_input_sequence)

3.2. Designing Dataloader: The dataloader returns the tuple of form ([[encoder_inp], [decoder_inp]], decoder_out) at the runtime

In [335]:
class Dataloader(tf.keras.utils.Sequence):    
    def __init__(self, dataset, batch_size = 1):
        self.dataset = dataset
        self.batch_size = batch_size
        self.indexes = np.arange(len(self.dataset.data['encoder_input'].values))

    def __getitem__(self, i):
        data = [self.dataset[idx] for idx in range(i * self.batch_size, (i + 1) * self.batch_size)]
        batch = [np.squeeze(np.stack(samples, axis = 1), axis = 0) for samples in zip(*data)]
        return tuple([[batch[0],batch[1]],batch[2]])

    def __len__(self):
        return len(self.indexes) // self.batch_size

    def on_epoch_end(self):
        self.indexes = np.random.permutation(self.indexes)

3.3 Defining Parameters and Creating Dataloader: Define batch size, max length & load training and validation data.

In [None]:
BATCH_SIZE = 128
MAX_LEN = 200
train_dataset = Dataset(train, tokenizer, MAX_LEN)
validation_dataset  = Dataset(validation, tokenizer, MAX_LEN)
train_dataloader = Dataloader(train_dataset, batch_size = BATCH_SIZE)
validation_dataloader = Dataloader(validation_dataset, batch_size = BATCH_SIZE)

4 Designing the Attention based Encoder Decoder Model:

4.1. Designing Encoder: At each time step, the encoder will receive a sequence of word embeddings from the source sentences and encode the information in an encoded vector using the current state and LSTM hidden state. As a result, the encoder produces an encoded vector of the source sentence, also known as a vector of latent information.

In [336]:
class Encoder(tf.keras.Model):
    
    def __init__(self, inp_vocab_size, embedding_dim, lstm_size, input_length):
        super().__init__()
        self.lstm_size = lstm_size
        embedding_params = {'input_dim': inp_vocab_size,
                            'output_dim': embedding_dim,
                            'embeddings_initializer' : tf.keras.initializers.RandomNormal(mean = 0, stddev = 1, seed = 42),
                            'input_length' : input_length, 
                            'mask_zero' : True}
        
        lstm_params = {'units':self.lstm_size, 
                      'return_state' : True, 
                      'return_sequences' : True,
                      'kernel_initializer' : tf.keras.initializers.glorot_uniform(seed = 42),
                      'recurrent_initializer' : tf.keras.initializers.orthogonal(seed = 42)}
        
  
        self.embedding = Embedding(**embedding_params)
        self.lstm1 = LSTM(**lstm_params)
        self.lstm2 = LSTM(**lstm_params)

    def call(self, input):
        self.encoder_output, self.hidden_state, self.current_state = self.lstm1(self.embedding(input[0]), initial_state = input[1])
        return self.lstm2(self.encoder_output, [self.hidden_state, self.current_state])
    
    def initialize_states(self, batch_size):
      return tf.zeros([batch_size, self.lstm_size]), tf.zeros([batch_size, self.lstm_size])
      

4.2 Designing Attention Model: The decoder hidden state of the previous timestep and the encoder output serve as the two inputs for the attention model, which then calculates attention weights.

In [337]:
class Attention(tf.keras.Model):

    def __init__(self, lstm_size, scoring_function):
        super(Attention, self).__init__()
        self.scoring_function = scoring_function
       
        self.V = tf.keras.layers.Dense(1)
        self.W = tf.keras.layers.Dense(lstm_size)   
        self.W1 = tf.keras.layers.Dense(lstm_size)
        self.W2 = tf.keras.layers.Dense(lstm_size)
        self.V1 = tf.keras.layers.Dense(1)
    
    def call(self,input):        
        score = self.V(tf.linalg.matmul(input[1], tf.expand_dims(input[0], 1), transpose_b=True)) if self.scoring_function == 'dot' else (
            tf.keras.layers.Dot(axes=(2, 1))([self.W(input[1]), tf.expand_dims(input[0], axis = 2)]) if self.scoring_function == 'general' else 
            self.V1(tf.nn.tanh(self.W1(tf.expand_dims(input[0], 1)) + self.W2(input[1])))
        )
        return tf.reduce_sum(tf.nn.softmax(score, axis=1) * input[1], axis=1), tf.nn.softmax(score, axis=1)


4.3 Designing Step Decoder: Step decoder will perform a concatenation operation for each time step using the output from the previous step's decoder and the attention weights calculated by the attention model.

In [338]:
class Step_Decoder(tf.keras.Model):

    def __init__(self, out_vocab_size, embedding_dim, input_length, lstm_size, scoring_function, embedding_matrix = None):

        super().__init__()
        self.attention = Attention(lstm_size, scoring_function)
        
        embedding_params = {'input_dim' : out_vocab_size, 'output_dim' : embedding_dim,
                                       'embeddings_initializer' : tf.keras.initializers.RandomNormal(mean = 0, stddev = 1, seed = 42),
                                       'input_length' : input_length, 'mask_zero' : True}
        lstm_params = {'units':lstm_size, 'return_state' : True, 'return_sequences' : True, 
                            'kernel_initializer' : tf.keras.initializers.glorot_uniform(seed = 42), 
                            'recurrent_initializer' : tf.keras.initializers.orthogonal(seed = 42)}
        
        if embedding_matrix:
            embedding_params['embeddings_initializer'] = tf.keras.initializers.Constant(embedding_matrix)
            embedding_params['trainable'] = False
        
        self.embedding = Embedding(**embedding_params)
        self.lstm1 = LSTM(**lstm_params)
        self.lstm2 = LSTM(**lstm_params)
        self.dense = Dense(out_vocab_size)


    def call(self, input):

        encoder_hidden = input[2]
        encoder_current = input[3]
        dec_output, encoder_hidden, encoder_current = self.lstm1(tf.concat([tf.expand_dims(self.attention([encoder_hidden, input[1]])[0], 1), 
                                                                            self.embedding(input[0])], axis = -1), [encoder_hidden, encoder_current])
        dec_output, encoder_hidden, encoder_current = self.lstm2(dec_output, [encoder_hidden, encoder_current])
        output = self.dense(tf.reshape(dec_output, (-1, dec_output.shape[2])))
        
        return output, encoder_hidden, encoder_current

4.4 At each timestep, the decoder model invokes Step decoder and produces the final output tokens.

In [339]:
class Decoder(tf.keras.Model):

    def __init__(self, out_vocab_size, embedding_dim, input_length, lstm_size, scoring_function, embedding_matrix = None):
        super().__init__()
        self.timestepdecoder = Step_Decoder(out_vocab_size, embedding_dim, input_length,
                                                lstm_size, scoring_function, embedding_matrix)
        
    
    @tf.function
    def call(self, input):
        outputs = tf.TensorArray(tf.float32, size = tf.shape(input[0])[1])
        for timestep in range(tf.shape(input[0])[1]):
            outputs = outputs.write(timestep, self.timestepdecoder([input[0][:, timestep:timestep+1], input[1], input[2], input[3]])[0])
        
        return tf.transpose(outputs.stack(), [1,0,2])

4.5 Designing Final Model Architechture: The tuple of input sequences is provided to the Attention based Encoder Decoder model, which then uses the subclassing API to implement the Encoder, Attention, Step Decoder, and Decoder models.

In [340]:
class Encoder_Decoder(tf.keras.Model):
    
    def __init__(self, input_length, inp_vocab_size, out_vocab_size, lstm_size, scoring_function, batch_size, embedding_dim, embedding_matrix = None):
    
        super().__init__()
    
        encoder_args = {'inp_vocab_size' : inp_vocab_size + 1, 'embedding_dim' : embedding_dim, 'lstm_size' : lstm_size, 'input_length' : input_length}
        decoder_args = {'out_vocab_size' : out_vocab_size + 1, 'embedding_dim' : embedding_dim, 'lstm_size' : lstm_size,
                               'scoring_function' : scoring_function, 'input_length' : input_length, 'embedding_matrix' : embedding_matrix}
        self.batch_size = batch_size
        self.encoder = Encoder(**encoder_args)
        self.decoder = Decoder(**decoder_args)
    
    def call(self, data):
        encoder_output, encoder_hidden, encoder_current = self.encoder([data[0], self.encoder.initialize_states(self.batch_size)])
        return self.decoder([data[1], encoder_output, encoder_hidden, encoder_current])

5 Designing the Pipeline:

5.1. Designing Loss Function: In order to calculate losses more accurately, we will now design a loss function that will hide the padded zeros.

In [341]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True, reduction = 'none')

@tf.function
def loss_function(real, pred):
    # Refer https://www.tensorflow.org/tutorials/text/nmt_with_attention
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)

5.2. Creating Tensorboard Callback: We will create a Tensorboard callback by providing log directory in order to keep track of loss while training the model.

In [342]:
def create_tensorboard_cb(model):
    root_logdir = os.path.join(os.curdir, model)
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    logdir = os.path.join(root_logdir, run_id)
    return tf.keras.callbacks.TensorBoard(logdir, histogram_freq = 1)

5.3. Creating Predict Function: The predict function will accept an informal input sentence and a model instance to use as input and return the output as a prediction.

In [344]:

def predict(input_sentence, model):
    inputs = tf.convert_to_tensor(tf.keras.preprocessing.sequence.pad_sequences([[tokenizer['informal'].word_index.get(i, 0) 
                                                                                  for i in input_sentence]], maxlen = MAX_LEN, padding = 'post'))
    sentence = ''
    enc_out, state_h, state_c = model.encoder([inputs, (tf.zeros([1, UNITS]), tf.zeros([1, UNITS]))])
    dec_input = tf.expand_dims([tokenizer['normalized'].word_index['<']], 0)
    for _ in range(MAX_LEN):
        output, state_h, state_c = model.decoder.timestepdecoder([dec_input, enc_out, state_h, state_c])
        character = tokenizer['normalized'].index_word.get(tf.argmax(output[0]).numpy(), '')
        if character == '>':
            break
        else:
            sentence += character
            dec_input = tf.expand_dims([tf.argmax(output[0]).numpy()], 0)
    return sentence

6 Training the Model

6.1. Compiling and Fitting the model using dot scoring technique: We can now train the model by using model fit method.

In [8]:
UNITS = 200
EPOCHS = 50
TRAIN_STEPS = train.shape[0]//BATCH_SIZE
VALID_STEPS = validation.shape[0]//BATCH_SIZE

cb_params = {'monitor': 'val_loss', 'factor': 0.5, 'verbose': 1, 'patience': 1, 'min_lr': 0.0001}
cb_stopper_cb = {'monitor': 'val_loss', 'patience': 3, 'verbose': 1, 'restore_best_weights': True}

model_dot  = Encoder_Decoder(input_length = MAX_LEN, inp_vocab_size = len(tokenizer['informal'].word_index.keys()),
                                            out_vocab_size = len(tokenizer['normalized'].word_index.keys()), lstm_size = UNITS,
                                            scoring_function = 'dot', batch_size = BATCH_SIZE,
                                            embedding_dim = len(tokenizer['normalized'].word_index.keys()), embedding_matrix = None)

optimizer = tf.keras.optimizers.Adam(0.01)
model_dot.compile(optimizer = optimizer, loss = loss_function)

learning_rate_cb = tf.keras.callbacks.ReduceLROnPlateau(**cb_params)
tensorboard_cb = create_tensorboard_cb("Model_Dot_logs")
stopper_cb = tf.keras.callbacks.EarlyStopping(**cb_stopper_cb)

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("Model_Dot",
                                                    save_best_only = True, save_weights_only = False)
model_dot.fit(train_dataloader, steps_per_epoch = TRAIN_STEPS, epochs = EPOCHS,
              callbacks = [learning_rate_cb, tensorboard_cb, stopper_cb, checkpoint_cb],
              validation_data = validation_dataloader, validation_steps = VALID_STEPS)

model_dot.summary()

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50

Epoch 00016: ReduceLROnPlateau reducing learning rate to 0.004999999888241291.
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50

Epoch 00020: ReduceLROnPlateau reducing learning rate to 0.0024999999441206455.
Epoch 21/50
Epoch 22/50

Epoch 00022: ReduceLROnPlateau reducing learning rate to 0.0012499999720603228.
Epoch 23/50
Epoch 24/50

Epoch 00024: ReduceLROnPlateau reducing learning rate to 0.0006249999860301614.
Epoch 25/50

Epoch 00025: ReduceLROnPlateau reducing learning rate to 0.0003124999930150807.
Restoring model weights from the end of the best epoch.
Epoch 00025: early stopping
Model: "attention__based__encoder__decoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
encoder (Encoder)            multiple        

6.2 Calculate BLEU score (for dot function): To quantify the model performance, we will clculate the BLEU score.

In [1]:
def post_processing(s):
    if s.startswith('<'):
        s = s[len('<'):]
    if s.endswith('>'):
        s = s[:-len('>')]
    return s

def predictor(s):
    return predict(s, model_dot)

def convert_formals(s):
    return [s.split()]

def convert_predictions(s):
    return s.split()

test['informals'] = test['encoder_input'].apply(post_processing)
test['formals'] = test['decoder_input'].apply(post_processing)
test['predictions'] = test['informals'].apply(predictor)
test['formals'] = test['formals'].apply(convert_formals)
test['predictions'] = test['predictions'].apply(convert_predictions)

bleu_scores = []
i = 0

while i < (len(test)):
    bleu_scores.append(sentence_bleu(test['formals'].iloc[i], test['predictions'].iloc[i]))
    i = i + 1

print('Average BLEU score for the predictions:', np.mean(bleu_scores))

Average BLEU score for the predictions: 0.5293823417436993


6.3 Distribution of BLEU score for dot scoring function

In [18]:
fig = ff.create_distplot([bleu_scores], ['Count'])
fig.update_layout(title= 'BLEU Score distribution (Dot scoring function)', autosize=False,
    width=750,
    height=500,)
fig.show()

6.4. Compiling and Fitting the model using general scoring technique: We can now train the model by using model fit method.

In [1]:
model_general  = Encoder_Decoder(input_length = MAX_LEN, inp_vocab_size = len(tokenizer['informal'].word_index.keys()),
                                            out_vocab_size = len(tokenizer['normalized'].word_index.keys()), lstm_size = UNITS,
                                            scoring_function = 'general', batch_size = BATCH_SIZE,
                                            embedding_dim = len(tokenizer['normalized'].word_index.keys()), embedding_matrix = None)


optimizer = tf.keras.optimizers.Adam(0.01)
model_general.compile(optimizer = optimizer, loss = loss_function)

learning_rate_cb = tf.keras.callbacks.ReduceLROnPlateau(**cb_params)
tensorboard_cb = create_tensorboard_cb("Model_General_logs")
stopper_cb = tf.keras.callbacks.EarlyStopping(**cb_stopper_cb)
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("Model_General",
                                                    save_best_only = True, save_weights_only = False)

model_general.fit(train_dataloader, steps_per_epoch = TRAIN_STEPS, epochs = EPOCHS,
            callbacks = [learning_rate_cb, tensorboard_cb, stopper_cb, checkpoint_cb],
            validation_data = validation_dataloader, validation_steps = VALID_STEPS)
model_general.summary()

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50

Epoch 00018: ReduceLROnPlateau reducing learning rate to 0.004999999888241291.
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50

Epoch 00024: ReduceLROnPlateau reducing learning rate to 0.0024999999441206455.
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50

Epoch 00035: ReduceLROnPlateau reducing learning rate to 0.0012499999720603228.
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

Epoch 00050: ReduceLROnPlateau reducing learning rate to 0.0006249999860301614.
Model: "attention__based__encoder__decoder_4"
________________________________________

6.5 Calculate BLEU score (for general function): To quantify the model performance, we will clculate the BLEU score.

In [2]:
def predictor(s):
    return predict(s, model_general)

test['informals'] = test['encoder_input'].apply(post_processing)
test['formals'] = test['decoder_input'].apply(post_processing)
test['predictions'] = test['informals'].apply(predictor)
test['formals'] = test['formals'].apply(convert_formals)
test['predictions'] = test['predictions'].apply(convert_predictions)

bleu_scores = []
i = 0

while i < (len(test)):
    bleu_scores.append(sentence_bleu(test['formals'].iloc[i], test['predictions'].iloc[i]))
    i = i + 1

print('Average BLEU score for the predictions:', np.mean(bleu_scores))

Average BLEU score for the predictions: 0.5472367499821276


6.6. Distribution of BLEU score for General scoring function

In [26]:
fig = ff.create_distplot([bleu_scores], ['Count'])
fig.update_layout(title= 'BLEU Score distribution (General scoring function)')
fig.show()


6.7. Compiling and Fitting the model using concat scoring technique: We can now train the model by using model fit method.

In [2]:
model_concat  = Encoder_Decoder(input_length = MAX_LEN, inp_vocab_size = len(tokenizer['informal'].word_index.keys()),
                                            out_vocab_size = len(tokenizer['normalized'].word_index.keys()), lstm_size = UNITS,
                                            scoring_function = 'concat', batch_size = BATCH_SIZE,
                                            embedding_dim = len(tokenizer['normalized'].word_index.keys()), embedding_matrix = None)


optimizer = tf.keras.optimizers.Adam(0.01)
model_concat.compile(optimizer = optimizer, loss = loss_function)

learning_rate_cb = tf.keras.callbacks.ReduceLROnPlateau(**cb_params)
tensorboard_cb = create_tensorboard_cb("Model_Concat_logs")
stopper_cb = tf.keras.callbacks.EarlyStopping(**cb_stopper_cb)
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("Model_Concat",
                                                    save_best_only = True, save_weights_only = False)

model_concat.fit(train_dataloader, steps_per_epoch = TRAIN_STEPS, epochs = EPOCHS,
            callbacks = [learning_rate_cb, tensorboard_cb, stopper_cb, checkpoint_cb],
            validation_data = validation_dataloader, validation_steps = VALID_STEPS)
model_concat.summary()

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.0003124999930150807.
Epoch 6/50
Epoch 7/50
Epoch 8/50

Epoch 00008: ReduceLROnPlateau reducing learning rate to 0.00015624999650754035.
Epoch 9/50
Restoring model weights from the end of the best epoch.
Epoch 00009: early stopping
Model: "attention__based__encoder__decoder_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
encoder_4 (Encoder)          multiple                  566148    
_________________________________________________________________
decoder_4 (Decoder)          multiple                  782649    
Total params: 1,348,797
Trainable params: 1,348,797
Non-trainable params: 0
_________________________________________________________________
The model achieves the loss of 0.2196 on validation set which is better than the model using dot scoring function.


6.8. Calculate BLEU score (for Concat function): To quantify the model performance, we will clculate the BLEU score.

In [11]:
def predictor(s):
    return predict(s, model_concat)

test['informals'] = test['encoder_input'].apply(post_processing)
test['formals'] = test['decoder_input'].apply(post_processing)
test['predictions'] = test['informals'].apply(predictor)
test['formals'] = test['formals'].apply(convert_formals)
test['predictions'] = test['predictions'].apply(convert_predictions)

bleu_scores = []
i = 0

while i < (len(test)):
    bleu_scores.append(sentence_bleu(test['formals'].iloc[i], test['predictions'].iloc[i]))
    i = i + 1

print('Average BLEU score for the predictions:', np.mean(bleu_scores))

Average BLEU score for the predictions: 0.5821354723670114


6.9. Distribution of BLEU score for Concat scoring function

In [30]:
new_var = fig = ff.create_distplot([bleu_scores], ['Count'])
fig.update_layout(title= 'BLEU Score distribution (Concat scoring function)')
fig.show()

9 Error Analysis:
Out of the three models, the concat scoring function model has shown the best performance. We will now examine the model's performance on the test dataset by comparing its best and worst predictions. To do that, we must first sort the model's test set bleu scores before printing the related forecasts.

In [None]:
scores = np.array(bleu_scores)
indices = (np.argsort(scores)).tolist()
worst = indices[0]
best = indices[-1]

print('Best Predictions:')
print('Informal Input: ',test['informals'].iloc[indices[-1]])
print('Expected Output: ',test['formals'].iloc[indices[-1]][0])
print('Predicted Output: ',test['predictions'].iloc[indices[-1]])
print('Bleu Score of Prediction: ',scores[indices[-1]])
print("\n")

print('Worst Predictions:')
print('Informal Input: ',test['informals'].iloc[indices[0]])
print('Expected Output: ',test['formals'].iloc[indices[0]][0])
print('Predicted Output: ',test['predictions'].iloc[indices[0]])
print('Bleu Score of Prediction: ',scores[indices[0]])
print("\n")

Best Predictions:
Informal Input: How you doing?
Expected Output: How you doing?
Predicted Output:  How you doing?
Bleu Score of Prediction : 1.0


Worst Predictions:
Informal Input : Kid's shop selling clothes izit...
Expected Output : Kid's shop is selling clothes, is it?
Predicted Output : I'm still to some to see you all not.
Bleu Score of Prediction : 0.00
