This notebook is part of the YouTube series Deep Learning with PyTorch: From Zero to GNN.

Deep Learning with Pytorch Youtube series: https://www.youtube.com/playlist?list=PLOrU905yPYXJsJSHJsiE779KfcrRCgz4v

#### Data Preprocessing 

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os , string
# we will be using pytorch to build the model , hence importing required torch essentials
import torch
from torch.utils.data import DataLoader , TensorDataset
from torch import nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

### Dataset Details:

The dataset contains English sentences paired with their Korean translations.

Our machine translation model takes an English sentence and translates it into a Korean sentence.

The data comes split into a train set and a test set.

In [2]:
# file paths
eng_train = '/kaggle/input/englishkorean-multitarget-ted-talks-task-mttt/multitarget-ted/en-ko/raw/ted_train_en-ko.raw.en'
ko_train = '/kaggle/input/englishkorean-multitarget-ted-talks-task-mttt/multitarget-ted/en-ko/raw/ted_train_en-ko.raw.ko'

eng_test = '/kaggle/input/englishkorean-multitarget-ted-talks-task-mttt/multitarget-ted/en-ko/raw/ted_test1_en-ko.raw.en'
ko_test = '/kaggle/input/englishkorean-multitarget-ted-talks-task-mttt/multitarget-ted/en-ko/raw/ted_test1_en-ko.raw.ko'

In [3]:
# function to read a raw text file into a list of lines
def read_text(filename):
    # the context manager closes the file even if reading fails
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all lines (each line keeps its trailing newline)
        return file.readlines()

In [4]:
# reading the files
df_eng_train = read_text(eng_train)
df_ko_train = read_text(ko_train)

df_eng_test = read_text(eng_test)
df_ko_test = read_text(ko_test)

In [5]:
# looking at the length of the different datasets

len(df_eng_train) , len(df_ko_train) , len(df_eng_test) , len(df_ko_test)

(166215, 166215, 1982, 1982)

### Data Preprocessing

1. Remove punctuation
2. Convert to lower case
3. Remove newline characters
4. Split into individual words
5. Build the vocabulary for English and Korean
     - index2word
     - word2index
6. Encode and pad each sentence in both languages
7. Prepare train/test batches of data
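
As a minimal sketch of steps 1 to 4, the snippet below applies the same operations to a single made-up sentence. The helper name `preprocess` and the sample text are illustrative assumptions, not part of the notebook.

```python
import string

# translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess(line):
    """Hypothetical helper: strip punctuation, lowercase, drop the newline, split on whitespace."""
    return line.translate(PUNCT_TABLE).lower().rstrip("\n").split()

tokens = preprocess("And we're going to tell you some stories.\n")
print(tokens)
```

Note that `string.punctuation` contains only ASCII punctuation, so any non-ASCII punctuation marks in the text would pass through this step unchanged.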

In [6]:
# Remove punctuation
df_eng_train = [s.translate(str.maketrans('', '', string.punctuation)) for s in df_eng_train]
df_ko_train = [s.translate(str.maketrans('', '', string.punctuation)) for s in df_ko_train]
df_eng_test = [s.translate(str.maketrans('', '', string.punctuation)) for s in df_eng_test]
df_ko_test = [s.translate(str.maketrans('', '', string.punctuation)) for s in df_ko_test]

In [7]:
# looking at one of the sentences
df_eng_train[1]

'And were going to tell you some stories from the sea here in video \n'

In [8]:
# looking at 5 sentence pairs from both languages

for i in range(5):
    print("English: {} \n Korean: {} \n".format(df_eng_train[i].strip(), df_ko_train[i].strip()))

English: Applause David Gallo This is Bill Lange Im Dave Gallo 
 Korean: 박수 이쪽은 Bill Lange 이고 저는 David Gallo입니다 

English: And were going to tell you some stories from the sea here in video 
 Korean: 우리는 여러분에게 바닷속 이야기를 영상과 함께 들려주고자 합니다 

English: Weve got some of the most incredible video of Titanic thats ever been seen and were not going to show you any of it 
 Korean: 저희는 끝내주는 타이타닉 비디오도 있긴 합니다만 뭐여기서는 눈꼽만큼도 보여줄 생각이없습니다 

English: Laughter The truth of the matter is that the Titanic  even though its breaking all sorts of box office records  its not the most exciting story from the sea 
 Korean: 웃음 비록 타이타닉이 박스오피스에서 굉장한 실적을 거두긴 했지만 바다가 들려주는 이야기 중 가장 재밌는 것은 아닙니다 

English: And the problem I think is that we take the ocean for granted 
 Korean: 문제라면 우리는 우리가 바다를 이미 알고있다고 믿는거죠 



In [9]:
# lowercase the words, remove newline characters, and split each sentence into tokens
for i in range(len(df_eng_train)):
    df_eng_train[i] = df_eng_train[i].lower().rstrip("\n").split()
    df_ko_train[i] = df_ko_train[i].lower().rstrip("\n").split()
    
for i in range(len(df_eng_test)):
    df_eng_test[i] = df_eng_test[i].lower().rstrip("\n").split()
    df_ko_test[i] = df_ko_test[i].lower().rstrip("\n").split()

In [10]:
# looking at one of the sentences after the initial preprocessing
df_eng_train[1]

['and',
 'were',
 'going',
 'to',
 'tell',
 'you',
 'some',
 'stories',
 'from',
 'the',
 'sea',
 'here',
 'in',
 'video']

In [11]:
# # building the vocabulary for the English and Korean languages
# # first we build the index-to-word mapping


# en_index2word = ["<PAD>", "<SOS>", "<EOS>"] # <PAD> will be the zero'th index
# ko_index2word = ["<PAD>", "<SOS>", "<EOS>"]

# for ds in [df_eng_train , df_eng_test]:
#     for sentence in ds:
#         for token in sentence:
#             if token not in en_index2word:
#                 en_index2word.append(token)

# print('English index to word done')                
                
# for ds in [df_ko_train , df_ko_test]:
#     for sentence in ds:
#         for token in sentence:
#             if token not in ko_index2word:
#                 ko_index2word.append(token)

# print('Korean index to word done')
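
The commented-out loops above test `token not in en_index2word` against a list, which scans the whole list for every token. One possible alternative, sketched here on toy data, tracks seen tokens in a dictionary so that each lookup takes constant time on average while the first-seen ordering stays the same. The function name `build_vocab` and the toy sentences are assumptions for illustration.

```python
def build_vocab(datasets):
    """Hypothetical helper: return (index2word, word2index) with special tokens first."""
    index2word = ["<PAD>", "<SOS>", "<EOS>"]  # <PAD> stays at index 0
    word2index = {tok: i for i, tok in enumerate(index2word)}
    for ds in datasets:
        for sentence in ds:
            for token in sentence:
                if token not in word2index:  # dict lookup instead of a list scan
                    word2index[token] = len(index2word)
                    index2word.append(token)
    return index2word, word2index

toy_train = [["the", "sea"], ["the", "ocean"]]
toy_test = [["sea", "stories"]]
index2word, word2index = build_vocab([toy_train, toy_test])
print(index2word)
```

Building both mappings in one pass also removes the need for a separate `enumerate` step to derive word2index afterwards.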

In [12]:
# # # dumping into pickle files for future use
# from pickle import dump

# dump(en_index2word, open('en_index2word.pkl', 'wb'))
# dump(ko_index2word, open('ko_index2word.pkl', 'wb'))

In [13]:
from pickle import load

en_index2word = load(open('../input/nlpasgn/en_index2word.pkl', 'rb'))
ko_index2word = load(open('../input/nlpasgn/ko_index2word.pkl', 'rb'))

# building the reverse word2index mapping from the index2word lists

en_word2idx = {token : idx for idx , token in enumerate(en_index2word)}
ko_word2idx = {token : idx for idx , token in enumerate(ko_index2word)}

In [14]:
# looking at the mappings in both directions

In [15]:
ko_index2word[7] , ko_word2idx['이고']

('이고', 7)

In [16]:
en_index2word[5] , en_word2idx['gallo']

('gallo', 5)

In [17]:
# Average sentence length in Training dataset for both the languages

print('English :',sum([len(sent) for sent in df_eng_train])/len(df_eng_train))

print('Korean :',sum([len(sent) for sent in df_ko_train])/len(df_ko_train))

English : 17.39973528261589
Korean : 12.162939566224468


In [18]:
# given the average sentence lengths above, cap every sequence at 25 tokens (this budget includes <SOS> and <EOS>)
seq_length = 25
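
The average length alone does not say how many sentences get cut off. A quick check, sketched here with a hypothetical helper `fraction_fitting` and toy data, counts the share of sentences that fit in the `seq_length - 2` tokens left after reserving room for `<SOS>` and `<EOS>`.

```python
def fraction_fitting(sentences, seq_length):
    """Hypothetical helper: share of tokenized sentences that fit without truncation."""
    budget = seq_length - 2  # two slots are reserved for <SOS> and <EOS>
    fitting = sum(1 for sent in sentences if len(sent) <= budget)
    return fitting / len(sentences)

# toy sentences with 5, 12, 23, 24 and 40 tokens
toy = [["w"] * n for n in (5, 12, 23, 24, 40)]
print(fraction_fitting(toy, 25))
```

Running the same function on `df_eng_train` and `df_ko_train` would show how much text a given `seq_length` discards.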

In [19]:
# Encode each sentence with the word2index mapping, then pad (or truncate) it to max_sent_length

def encoding_padding(vocab , sentence , max_sent_length):
    
    SOS = [vocab["<SOS>"]] # we will add start of sentence and end of sentence token at each sentence
    EOS = [vocab["<EOS>"]]
    PAD = [vocab["<PAD>"]]
    
    if len(sentence) < (max_sent_length - 2): # -2 is for SOS and EOS
        pads = ((max_sent_length - 2) - len(sentence)) * PAD
        encoding = [vocab[word] for word in sentence]
        return SOS + encoding + EOS + pads
    else:
        encoding = [vocab[word] for word in sentence[:(max_sent_length - 2)]]
        return SOS + encoding + EOS
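
To make the two branches concrete, here is a small self-contained sketch with a toy vocabulary. `encode_and_pad` is a condensed restatement of `encoding_padding` written for this example, and the vocabulary is made up.

```python
def encode_and_pad(vocab, sentence, max_len):
    """Condensed restatement of encoding_padding for illustration."""
    body = [vocab[w] for w in sentence[:max_len - 2]]    # truncate, leaving room for <SOS>/<EOS>
    pads = [vocab["<PAD>"]] * (max_len - 2 - len(body))  # empty list when nothing needs padding
    return [vocab["<SOS>"]] + body + [vocab["<EOS>"]] + pads

toy_vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "the": 3, "sea": 4, "is": 5, "deep": 6}

short = encode_and_pad(toy_vocab, ["the", "sea"], 6)
long = encode_and_pad(toy_vocab, ["the", "sea", "is", "deep", "the", "sea"], 6)
print(short)  # [1, 3, 4, 2, 0, 0]
print(long)   # [1, 3, 4, 5, 6, 2]
```

Both results have exactly `max_len` entries, which is what allows the encoded sentences to be stacked into a single NumPy array.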

#### Preparing training and test sets

In [20]:
# encoding every sentence in train and test
encoded_train_en = [encoding_padding(en_word2idx,sent,seq_length) for sent in df_eng_train]
encoded_train_ko = [encoding_padding(ko_word2idx,sent,seq_length) for sent in df_ko_train]
encoded_test_en = [encoding_padding(en_word2idx,sent,seq_length) for sent in df_eng_test]
encoded_test_ko = [encoding_padding(ko_word2idx,sent,seq_length) for sent in df_ko_test]

# creating numpy array for train and test
train_x = np.array(encoded_train_en)
train_y = np.array(encoded_train_ko)

test_x = np.array(encoded_test_en)
test_y = np.array(encoded_test_ko)

In [21]:
# #creating pickles for future use

# dump(train_x, open('train_x.pkl', 'wb'))
# dump(train_y, open('train_y.pkl', 'wb'))
# dump(test_x, open('test_x.pkl', 'wb'))
# dump(test_y, open('test_y.pkl', 'wb'))

In [22]:
# # loading the train test numpy array from pickles 
# train_x = load(open('../input/nlpasgn/train_x.pkl', 'rb'))
# train_y = load(open('../input/nlpasgn/train_y.pkl', 'rb'))
# test_x = load(open('../input/nlpasgn/test_x.pkl', 'rb'))
# test_y = load(open('../input/nlpasgn/test_y.pkl', 'rb'))

In [23]:
# checking the GPU availability and setting the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [24]:
# Creating the Torch tensor dataloaders

train_data = TensorDataset(torch.from_numpy(train_x) , torch.from_numpy(train_y))
test_data = TensorDataset(torch.from_numpy(test_x) , torch.from_numpy(test_y))

batch_size = 32

train_dl = DataLoader(train_data,shuffle=True,batch_size=batch_size,drop_last=True)
test_dl = DataLoader(test_data,shuffle=True,batch_size=batch_size,drop_last=True)

# drop_last=True drops the final incomplete batch (when the dataset size is not divisible by batch_size).
# This matters here because EncoderRNN.initHidden() assumes every batch has exactly batch_size rows.
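
The effect of `drop_last=True` on the dataset sizes reported earlier can be worked out with plain arithmetic. The helper `batches_per_epoch` below is a hypothetical name used only for this sketch.

```python
def batches_per_epoch(n_examples, batch_size, drop_last=True):
    """Return (number of batches, number of examples skipped) for one pass over the data."""
    full, remainder = divmod(n_examples, batch_size)
    if drop_last:
        return full, remainder
    # without drop_last the leftover examples form one smaller final batch
    return full + (1 if remainder else 0), 0

print(batches_per_epoch(166215, 32))  # training set -> (5194, 7)
print(batches_per_epoch(1982, 32))    # test set     -> (61, 30)
```

Because the loaders shuffle, the skipped examples differ from one epoch to the next.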

In [25]:
input_length = target_length = seq_length # setting the sizes

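# <SOS> and <EOS> sit at the same indices in both vocabularies,
# so the English mapping can be used to look them up for the Korean decoder as well
SOS = en_word2idx["<SOS>"]
EOS = en_word2idx["<EOS>"]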

### Building the neural Network
We will create the encoder and decoder networks; each one uses a recurrent neural network (RNN) internally.

In [26]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        # Embedding layer
        self.embedding = nn.Embedding(input_size, hidden_size, padding_idx=0) # as we specified padding as zero
        
        # RNN layer. The input and output are both of the same size 
        #  since embedding size = hidden size in this example
        self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)

    def forward(self, input, hidden):
        # The inputs are first transformed into embeddings
        embedded = self.embedding(input)
        output = embedded

        # As in any RNN, the new input and the previous hidden states are fed
        #  into the model at each time step 
        output, hidden = self.rnn(output, hidden)
        return output, hidden

    def initHidden(self):
        # This method is used to create the initial hidden states for the encoder
        return torch.zeros(1, batch_size, self.hidden_size)

In [27]:
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, query, keys):
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))
        scores = scores.squeeze(2).unsqueeze(1)

        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights, keys)

        return context, weights

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attention = BahdanauAttention(hidden_size)
        self.rnn = nn.RNN(2 * hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS)
        decoder_hidden = encoder_hidden
        decoder_outputs = []
        attentions = []

        for i in range(seq_length):
            decoder_output, decoder_hidden, attn_weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )
            decoder_outputs.append(decoder_output)
            attentions.append(attn_weights)

            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1) # Teacher forcing
            else:
                # Without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        attentions = torch.cat(attentions, dim=1)

        return decoder_outputs, decoder_hidden, attentions


    def forward_step(self, input, hidden, encoder_outputs):
        embedded =  self.dropout(self.embedding(input))

        query = hidden.permute(1, 0, 2)
        context, attn_weights = self.attention(query, encoder_outputs)
        rnn_input = torch.cat((embedded, context), dim=2)

        output, hidden = self.rnn(rnn_input, hidden)
        output = self.out(output)

        return output, hidden, attn_weights
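
The computation in `BahdanauAttention.forward` corresponds to the usual additive attention equations. In the sketch below the symbols are notation chosen for this note: $s$ is the decoder hidden state (the query), $h_j$ is the encoder output at position $j$ (the keys), $T$ is the source sequence length, and the bias terms of the `nn.Linear` layers are omitted for readability.

```latex
\begin{aligned}
e_{j} &= v_a^{\top} \tanh\!\left(W_a s + U_a h_j\right), \\
\alpha_{j} &= \frac{\exp(e_{j})}{\sum_{k=1}^{T} \exp(e_{k})}, \\
c &= \sum_{j=1}^{T} \alpha_{j} h_j
\end{aligned}
```

Here $W_a$, $U_a$ and $v_a$ correspond to `Wa`, `Ua` and `Va`, the weights $\alpha_j$ to `weights`, and the context vector $c$ to `context`, which `forward_step` concatenates with the token embedding before the RNN.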

In [28]:
#initializing the networks

hidden_size = 256
encoder = EncoderRNN(len(en_index2word), hidden_size).to(device)
decoder = AttnDecoderRNN(hidden_size, len(ko_index2word)).to(device)

In [29]:
# looking at the networks
print(encoder)

EncoderRNN(
  (embedding): Embedding(53838, 256, padding_idx=0)
  (rnn): RNN(256, 256, batch_first=True)
)


In [30]:
print(decoder)

AttnDecoderRNN(
  (embedding): Embedding(295179, 256)
  (attention): BahdanauAttention(
    (Wa): Linear(in_features=256, out_features=256, bias=True)
    (Ua): Linear(in_features=256, out_features=256, bias=True)
    (Va): Linear(in_features=256, out_features=1, bias=True)
  )
  (rnn): RNN(512, 256, batch_first=True)
  (out): Linear(in_features=256, out_features=295179, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)


In [31]:
criterion = nn.NLLLoss() # loss function
enc_optimizer = torch.optim.Adam(encoder.parameters(), lr = 3e-3) #encoder optimizer with models params and Learning rate
dec_optimizer = torch.optim.Adam(decoder.parameters(), lr = 3e-3) #decoder optimizer with models params and Learning rate
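
`nn.NLLLoss` expects log-probabilities, which is why the decoder ends with `F.log_softmax`. The plain-Python sketch below, using made-up logits, shows the quantity that gets averaged: the negative log-probability assigned to the correct token. The helper names are illustrative only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll(log_probs, target_index):
    """Negative log-likelihood of the target class."""
    return -log_probs[target_index]

logits = [2.0, 1.0, 0.1]  # made-up scores over a 3-word vocabulary
loss = nll(log_softmax(logits), target_index=0)
print(round(loss, 4))
```

As a reference point, a uniform guess over the 295,179-entry Korean vocabulary would give a loss of ln(295179), roughly 12.6.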

#### Model Training for the Encoder and Decoder

In [32]:
teacher_forcing_ratio = 0.5
import random

def train_epoch(train_dl, encoder, decoder, enc_optimizer,
          dec_optimizer, criterion):

    total_loss = 0
    for batch in train_dl:
        # Assigning the input and sending to device
        input_tensor = batch[0].to(device)

        # Assigning the output and sending to device
        target_tensor = batch[1].to(device)
        
        # Creating initial hidden states for the encoder
        encoder_hidden = encoder.initHidden()

        # Sending to device 
        encoder_hidden = encoder_hidden.to(device)

        enc_optimizer.zero_grad()
        dec_optimizer.zero_grad()

        encoder_outputs, encoder_hidden = encoder(input_tensor , encoder_hidden)
        
        # Teacher forcing: with probability teacher_forcing_ratio, feed the
        # ground-truth target tokens to the decoder for this whole batch
        use_teacher_forcing = random.random() < teacher_forcing_ratio
        if use_teacher_forcing:
            decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)
        else:
            decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, None)

        loss = criterion(
            decoder_outputs.view(-1, decoder_outputs.size(-1)),
            target_tensor.view(-1)
        )
        loss.backward()

        enc_optimizer.step()
        dec_optimizer.step()

        total_loss += loss.item()

    return total_loss / len(train_dl)
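
In `train_epoch` the teacher-forcing decision is made once per batch, not once per time step. The simulation below is a rough sketch of what `teacher_forcing_ratio = 0.5` means over one epoch; the helper name and the seed are assumptions, and the batch count is 166215 examples at batch size 32 with the last partial batch dropped.

```python
import random

def count_teacher_forced(n_batches, ratio, seed=0):
    """Hypothetical helper: simulate the per-batch coin flip and count teacher-forced batches."""
    rng = random.Random(seed)  # local generator so the sketch is reproducible
    return sum(rng.random() < ratio for _ in range(n_batches))

n = 5194  # full batches per epoch: 166215 // 32
forced = count_teacher_forced(n, 0.5)
print(forced / n)  # close to 0.5
```

Setting the ratio to 1.0 always feeds the ground truth, while 0.0 always feeds the decoder its own predictions.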

In [33]:
import time
import math

def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

import matplotlib.ticker as ticker

# the non-interactive 'agg' backend renders figures off-screen
plt.switch_backend('agg')

def showPlot(points):
    # plt.subplots() already creates the figure, so no separate plt.figure() call is needed
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    ax.plot(points)
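
The remaining-time estimate in `timeSince` assumes that every epoch takes about as long as the ones already finished. A worked example with made-up numbers, using snake_case stand-ins for the helpers above:

```python
import math

def as_minutes(seconds):
    """Format a duration in seconds as 'Xm Ys' (same arithmetic as asMinutes)."""
    minutes = math.floor(seconds / 60)
    return '%dm %ds' % (minutes, seconds - minutes * 60)

def remaining(elapsed, fraction_done):
    """Estimated seconds left, assuming a constant time per epoch."""
    estimated_total = elapsed / fraction_done
    return estimated_total - elapsed

elapsed = 600.0  # made-up: 10 minutes spent so far
print(as_minutes(elapsed))                   # 10m 0s
print(as_minutes(remaining(elapsed, 0.25)))  # 30m 0s
```

With a quarter of the work done in 10 minutes, the projected total is 40 minutes, leaving 30.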

In [34]:
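def train(train_dataloader, encoder, decoder, n_epochs, learning_rate=0.001,
          print_every=100, plot_every=100):
    # learning_rate is unused here: enc_optimizer, dec_optimizer and criterion
    # are read from the notebook's global scope, where they were created above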
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every


    for epoch in range(1, n_epochs + 1):
        loss = train_epoch(train_dataloader, encoder, decoder, enc_optimizer, dec_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if epoch % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, epoch / n_epochs),
                                        epoch, epoch / n_epochs * 100, print_loss_avg))

        if epoch % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)

In [35]:
%%time
train(train_dl, encoder, decoder, 10, print_every=2, plot_every=2)

47m 26s (- 189m 46s) (2 20%) 5.3472
94m 45s (- 142m 8s) (4 40%) 5.3824
142m 5s (- 94m 43s) (6 60%) 5.3798
189m 28s (- 47m 22s) (8 80%) 5.5199
236m 51s (- 0m 0s) (10 100%) 5.4914
CPU times: user 3h 29min 58s, sys: 26min 54s, total: 3h 56min 53s
Wall time: 3h 56min 51s


In [36]:
# # saving the models for future use

# encoder_path = "encoder.pth"
# torch.save(encoder, encoder_path)

# decoder_path = "decoder.pth"
# torch.save(decoder, decoder_path)

In [37]:
# # # load the models
# encoder = torch.load('../input/nlpasgn/encoder.pth')
# decoder = torch.load('../input/nlpasgn/dencoder.pth')

### Making Predictions on the test set

In [38]:
def make_predictions(test_sentence):
    with torch.no_grad():
        # Tokenizing, Encoding, transforming to Tensor
        test_sentence = torch.tensor(encoding_padding(en_word2idx, test_sentence, seq_length)).unsqueeze(dim=0)
    
        encoder_hidden = torch.zeros(1, 1, hidden_size) # initial hidden state for a batch of one sentence
        encoder_hidden = encoder_hidden.to(device) # taking to the device

        input_tensor = test_sentence.to(device) # taking to the device

        encoder_outputs, encoder_hidden = encoder(input_tensor,encoder_hidden)
        decoder_outputs, decoder_hidden, decoder_attn = decoder(encoder_outputs, encoder_hidden)

        _, topi = decoder_outputs.topk(1)
        decoded_ids = topi.squeeze()

        decoded_words = []
        for idx in decoded_ids:
            if idx.item() == EOS:
                break
            decoded_words.append(ko_index2word[idx.to('cpu').item()])
    return decoded_words[1:] # not including SOS
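
The last step of `make_predictions`, turning predicted ids back into words, can be tried in isolation. The sketch below uses a toy vocabulary and a hypothetical helper `ids_to_words`; unlike the notebook's `decoded_words[1:]`, it filters `<SOS>` by value, which is a design choice of this sketch.

```python
def ids_to_words(ids, index2word, sos=1, eos=2):
    """Hypothetical helper: map ids to words, stopping at <EOS> and skipping <SOS>."""
    words = []
    for idx in ids:
        if idx == eos:
            break
        if idx != sos:
            words.append(index2word[idx])
    return words

toy_index2word = ["<PAD>", "<SOS>", "<EOS>", "바다", "이야기"]
print(ids_to_words([1, 3, 4, 2, 0, 0], toy_index2word))  # ['바다', '이야기']
```

Stopping at the first `<EOS>` keeps the trailing `<PAD>` positions out of the translation.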

In [39]:
for i in range(5):
    print('Sentence Number in test set: ----------', i, '---------------\n')
    print('Actual English Sentence : '," ".join(df_eng_test[i]))
    print('Translated Predictions:' , make_predictions(df_eng_test[i]))
    print('Actual Sentence:' , df_ko_test[i])
    print('\n')

Sentence Number in test set: ---------- 0 ---------------

Actual English Sentence :  allison hunt my three minutes hasnt started yet has it
Translated Predictions: ['그리고', '이', '킹스']
Actual Sentence: ['아직', '3분', '시작된', '건', '아니죠', '그렇죠']


Sentence Number in test set: ---------- 1 ---------------

Actual English Sentence :  chris anderson no you cant start the three minutes
Translated Predictions: ['그리고', '이', '킹스']
Actual Sentence: ['크리스', '앤더슨네', '맘대로', '시작하실', '수', '없습니다']


Sentence Number in test set: ---------- 2 ---------------

Actual English Sentence :  reset the three minutes thats just not fair
Translated Predictions: ['그리고', '이', '킹스']
Actual Sentence: ['3분', '다시', '설정해주세요', '이건', '반칙입니다']


Sentence Number in test set: ---------- 3 ---------------

Actual English Sentence :  ah oh my god its harsh up here
Translated Predictions: ['그리고', '저는']
Actual Sentence: ['앨리슨', '헌트', '어머나', '여기', '참', '냉정하네요']


Sentence Number in test set: ---------- 4 ---------------

Actual Engl

In [40]:
from torchtext.data.metrics import bleu_score

# https://pytorch.org/text/stable/data_metrics.html

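def bleu(src_data, tar_data):
    targets = []
    outputs = []
    for src, trg in zip(src_data, tar_data):
        prediction = make_predictions(src)
        # bleu_score expects a list of reference translations per candidate,
        # so the single reference is wrapped in its own list
        targets.append([[str(ko_word2idx[word]) for word in trg]])
        outputs.append([str(ko_word2idx[word]) for word in prediction])
    # max_n=1 with weights=[1] scores unigram overlap only
    return bleu_score(outputs, targets, max_n=1, weights=[1])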
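
To show what a unigram BLEU score measures, here is a simplified pure-Python sketch. It is not torchtext's implementation: it assumes exactly one reference per candidate, and the function name and toy tokens are made up.

```python
import math
from collections import Counter

def unigram_bleu(candidates, references):
    """Corpus-level BLEU-1 with one reference per candidate (simplified sketch)."""
    matched = cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        overlap = Counter(cand) & Counter(ref)  # clipped unigram matches
        matched += sum(overlap.values())
        cand_len += len(cand)
        ref_len += len(ref)
    if matched == 0:
        return 0.0
    precision = matched / cand_len
    # the brevity penalty punishes candidates shorter than their references
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * precision

cands = [["a", "b", "c"]]
refs = [["a", "b", "d", "e"]]
print(round(unigram_bleu(cands, refs), 4))  # 0.4777
```

Two of the three candidate tokens appear in the reference (precision 2/3), and the candidate is shorter than the reference, so the brevity penalty exp(1 - 4/3) lowers the score further. Very short predictions are penalised on both counts.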

In [41]:
# Bleu score on test dataset
score = bleu(df_eng_test,df_ko_test)
print(f"Bleu score {score*100:.2f}")

Bleu score 4.33
