# **Sentence Tokenization**

Sentence tokenization is the process of splitting a text into individual sentences. It is a crucial step in natural language processing (NLP), because many algorithms and models operate on sentences. More generally, tokenization means breaking a text into smaller units called tokens; in sentence tokenization, each sentence is one token.

Sentence tokenization can be done with regular expressions that match sentence-ending punctuation, or with machine learning models trained on large datasets. Once the text has been split into sentences, each sentence can be preprocessed further, for example by removing stop words, stemming, or lemmatizing.
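As a rough illustration of the regular-expression approach, the sketch below splits after `.`, `!` or `?` when whitespace follows. The helper name `split_sentences` and the pattern are assumptions made for this example and are not used elsewhere in this notebook; a rule this simple will mis-split abbreviations such as "Dr.".

```python
import re

# Hypothetical pattern: a sentence boundary is whitespace preceded by '.', '!' or '?'.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

def split_sentences(text):
    """Naive rule-based splitter; it does not handle abbreviations or quotes."""
    return [part for part in _SENTENCE_BOUNDARY.split(text.strip()) if part]

print(split_sentences("Are you doing online transactions? Gas lasers. Then the job."))
# ['Are you doing online transactions?', 'Gas lasers.', 'Then the job.']
```

The dataset used below already stores one sentence per line, so no such splitting is needed there.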

Sentence tokenization is a preprocessing step in many NLP tasks, such as machine translation, text summarization, and sentiment analysis. It segments the text into smaller units that are easier to process, and for tasks such as translation it lets a model consume and produce one complete sentence at a time.

# Importing Libraries

In [1]:
import torch
import numpy as np
import torch.nn as nn
import math

# Setting the Paths to the English and Corresponding Gujarati Sentence Files

In [2]:
english_file = '/kaggle/input/english-to-gujarati-machine-translation-dataset/en-gu/train.en'
gujarati_file = '/kaggle/input/english-to-gujarati-machine-translation-dataset/en-gu/train.gu'

# **Creating the Alphasyllabary/Vocabulary for Both Languages**

In [3]:
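START_TOKEN = '<s>'
# NOTE: the string values of the next two tokens look swapped relative to their names
# ('</s>' usually marks the end of a sentence and '<pad>' marks padding). The code below
# refers to them only by variable name, so the behavior is unaffected.
PADDING_TOKEN = '</s>'
END_TOKEN = '<pad>'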

gujarati_vocabulary = [START_TOKEN, ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/',
                       '૦', '૧', '૨', '૩', '૪', '૫', '૬', '૭', '૮', '૯', ':', '<', '=', '>', '?', '@',
                       'અ', 'આ', 'ઇ', 'ઈ', 'ઉ', 'ઊ', 'ઋ', 'ૠ', 'ઌ', 'ૡ', 'ઍ', 'એ', 'ઐ', 'ઑ', 'ઓ', 'ઔ',
                       'ક', 'ખ', 'ગ', 'ઘ', 'ઙ',
                       'ચ', 'છ', 'જ', 'ઝ', 'ઞ',                        
                       'ટ', 'ઠ', 'ડ', 'ઢ', 'ણ',                        
                       'ત', 'થ', 'દ', 'ધ', 'ન',                        
                       'પ', 'ફ', 'બ', 'ભ', 'મ',                        
                       'ય', 'ર', 'લ', 'વ', 'શ', 'ષ', 'સ', 'હ', '઼', 'ા', 'િ', 'ી', 'ુ', 'ૂ', 'ૃ', 'ૄ', 'ૅ', 'ે', 'ૈ', 'ૉ', 'ો', 'ૌ', '્', 'ૐ', 'ૠ', 'ૡ', 'ં', 'ઃ',
                       PADDING_TOKEN, END_TOKEN]


english_vocabulary = [START_TOKEN, ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', 
                        '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                        ':', '<', '=', '>', '?', '@', 
                        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 
                        'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 
                        'Y', 'Z',
                        "[", "\\", "]", "^", "_", "`", 
                        'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
                        'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 
                        'y', 'z', 
                        '{', '|', '}', '~', PADDING_TOKEN, END_TOKEN]

# **Mapping Each Character to an Index and Back**

In [4]:
index_to_gujarati = {k:v for k,v in enumerate(gujarati_vocabulary)}
gujarati_to_index = {v:k for k,v in enumerate(gujarati_vocabulary)}
index_to_english = {k:v for k,v in enumerate(english_vocabulary)}
english_to_index = {v:k for k,v in enumerate(english_vocabulary)}
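To make the two mapping directions concrete, here is a minimal round-trip sketch on a toy vocabulary. The names `toy_vocabulary`, `encode` and `decode` are hypothetical and exist only for this illustration; the notebook's real vocabularies are much larger.

```python
# Toy vocabulary for illustration only.
toy_vocabulary = ['<s>', ' ', 'a', 'b', 'c', '</s>']

index_to_char = dict(enumerate(toy_vocabulary))
char_to_index = {char: index for index, char in enumerate(toy_vocabulary)}

def encode(text, char_to_index):
    """Map every character to its vocabulary index."""
    return [char_to_index[char] for char in text]

def decode(indices, index_to_char):
    """Map indices back to characters and join them into a string."""
    return ''.join(index_to_char[i] for i in indices)

encoded = encode('a cab', char_to_index)
print(encoded)                         # [2, 1, 4, 2, 3]
print(decode(encoded, index_to_char))  # a cab
```

If a character is listed twice in the vocabulary, the character-to-index dictionary keeps the last position. This is why `'ૠ'` and `'ૡ'` map to 98 and 99 in the output below, and why indices 40 and 42 do not appear there.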

In [5]:
gujarati_to_index

{'<s>': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 '#': 4,
 '$': 5,
 '%': 6,
 '&': 7,
 "'": 8,
 '(': 9,
 ')': 10,
 '*': 11,
 '+': 12,
 ',': 13,
 '-': 14,
 '.': 15,
 '/': 16,
 '૦': 17,
 '૧': 18,
 '૨': 19,
 '૩': 20,
 '૪': 21,
 '૫': 22,
 '૬': 23,
 '૭': 24,
 '૮': 25,
 '૯': 26,
 ':': 27,
 '<': 28,
 '=': 29,
 '>': 30,
 '?': 31,
 '@': 32,
 'અ': 33,
 'આ': 34,
 'ઇ': 35,
 'ઈ': 36,
 'ઉ': 37,
 'ઊ': 38,
 'ઋ': 39,
 'ૠ': 98,
 'ઌ': 41,
 'ૡ': 99,
 'ઍ': 43,
 'એ': 44,
 'ઐ': 45,
 'ઑ': 46,
 'ઓ': 47,
 'ઔ': 48,
 'ક': 49,
 'ખ': 50,
 'ગ': 51,
 'ઘ': 52,
 'ઙ': 53,
 'ચ': 54,
 'છ': 55,
 'જ': 56,
 'ઝ': 57,
 'ઞ': 58,
 'ટ': 59,
 'ઠ': 60,
 'ડ': 61,
 'ઢ': 62,
 'ણ': 63,
 'ત': 64,
 'થ': 65,
 'દ': 66,
 'ધ': 67,
 'ન': 68,
 'પ': 69,
 'ફ': 70,
 'બ': 71,
 'ભ': 72,
 'મ': 73,
 'ય': 74,
 'ર': 75,
 'લ': 76,
 'વ': 77,
 'શ': 78,
 'ષ': 79,
 'સ': 80,
 'હ': 81,
 '઼': 82,
 'ા': 83,
 'િ': 84,
 'ી': 85,
 'ુ': 86,
 'ૂ': 87,
 'ૃ': 88,
 'ૄ': 89,
 'ૅ': 90,
 'ે': 91,
 'ૈ': 92,
 'ૉ': 93,
 'ો': 94,
 'ૌ': 95,
 '્': 96,
 'ૐ': 97,
 'ં': 100,
 'ઃ': 101,
 '</s>

# Reading Lines from File

In [6]:
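# Read both files explicitly as UTF-8 so the Gujarati text decodes the same way on every platform.
with open(english_file, 'r', encoding='utf-8') as file:
    english_sentences = file.readlines()
with open(gujarati_file, 'r', encoding='utf-8') as file:
    gujarati_sentences = file.readlines()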

In [7]:
# Limit Number of sentences
TOTAL_SENTENCES = 200000
english_sentences = english_sentences[:TOTAL_SENTENCES]
gujarati_sentences = gujarati_sentences[:TOTAL_SENTENCES]
english_sentences = [sentence.rstrip('\n') for sentence in english_sentences]
gujarati_sentences = [sentence.rstrip('\n') for sentence in gujarati_sentences]

In [8]:
english_sentences[:10]

['Are you doing online transactions?',
 'Kunwar explains:',
 'A passenger train is sitting at a station.',
 'heavy snow shower',
 'It was plain that their intensive study of the Scriptures over their five months of training had reached their heart and motivated them to share with others what they had learned.',
 'Jesus Christ is overseeing the greatest preaching campaign in history',
 'He had gained victory by a margin of 67,000 votes.',
 'The Moskals immediately included the reading of the Harp book in their regular Bible - reading sessions.',
 'Gas lasers.',
 'Effective December 2 midnight, petrol, diesel and gas outlets will be removed from the exempt category for receipt of old Rs 500 notes']

In [9]:
gujarati_sentences[:10]

['ઓનલાઈન ટ્રાન્ઝેક્શન કરી શકાય?',
 'કુરાન તે વર્ણવે છે:',
 'એક પેસેન્જર ટ્રેન સ્ટેશન પર બેઠેલું છે.',
 'ભારે બરફના ટૂકડાweather forecast',
 'પાંચ મહિનાના કોર્સમાં પોતે જે કંઈ શીખ્યો, એ એક વિદ્યાર્થીએ પોતાના નાના ભાઈને જણાવ્યું.',
 'આજે પૃથ્વી પર થઈ રહેલા મહાન પ્રચાર કાર્યની ઈસુ દેખરેખ રાખે છે',
 'આમ, તેઓ 67,000થી વધુ મતથી જીતી ગયા છે.',
 'મૉસ્કેલ કુટુંબે બાઇબલ સાથે સાથે એ પુસ્તક પણ વાંચવાનું શરૂ કરી દીધું.',
 'ગેસ લેસર્સ.',
 '10 ડિસેમ્બરથી 500 રુપિયની જૂની નોટ રેલવે, મેટ્રો અને બસમાં ચાલવાનું બંધ થઇ જશે']

In [10]:
max(len(x) for x in gujarati_sentences), max(len(x) for x in english_sentences),

(1182, 1004)

# Checking the 99th Percentile of Sentence Lengths

In [11]:
PERCENTILE = 99
print( f"{PERCENTILE}th percentile length Gujarati: {np.percentile([len(x) for x in gujarati_sentences], PERCENTILE)}" )
print( f"{PERCENTILE}th percentile length English: {np.percentile([len(x) for x in english_sentences], PERCENTILE)}" )

99th percentile length Gujarati: 227.0
99th percentile length English: 245.0
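For readers who want to see what the percentile call computes, the sketch below reproduces the calculation in plain Python, assuming the default linear-interpolation behavior of `np.percentile`. The function name `percentile_linear` and the toy lengths are hypothetical.

```python
import math

def percentile_linear(values, q):
    """q-th percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100      # fractional position in the sorted list
    lower, upper = math.floor(rank), math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

toy_lengths = [10, 20, 30, 40, 50]           # made-up sentence lengths
print(percentile_linear(toy_lengths, 50))    # 30.0
```

Here the 99th percentile lengths (227 and 245 characters) are far below the maximum lengths (1182 and 1004), so a cap near the percentile discards only a small tail of very long sentences.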


# Keeping Sentences That Use Only Vocabulary Characters and Are Shorter Than max_sequence_length

In [12]:
max_sequence_length = 300

In [13]:
def is_valid_tokens(sentence, vocab):
    # Every distinct character of the sentence must be in the vocabulary.
    return all(token in vocab for token in set(sentence))

def is_valid_length(sentence, max_sequence_length):
    # Strict '<' caps the length at max_sequence_length - 2, leaving room for the start and end tokens.
    return len(sentence) < (max_sequence_length - 1)

valid_sentence_indicies = []
for index in range(len(gujarati_sentences)):
    gujarati_sentence, english_sentence = gujarati_sentences[index], english_sentences[index]
    # NOTE: only the Gujarati side is checked against its vocabulary; an English sentence with
    # an out-of-vocabulary character would raise a KeyError when it is tokenized later.
    if is_valid_length(gujarati_sentence, max_sequence_length) \
      and is_valid_length(english_sentence, max_sequence_length) \
      and is_valid_tokens(gujarati_sentence, gujarati_vocabulary):
        valid_sentence_indicies.append(index)

print(f"Number of sentences: {len(gujarati_sentences)}")
print(f"Number of valid sentences: {len(valid_sentence_indicies)}")

Number of sentences: 200000
Number of valid sentences: 139624


In [14]:
gujarati_sentences = [gujarati_sentences[i] for i in valid_sentence_indicies]
english_sentences = [english_sentences[i] for i in valid_sentence_indicies]
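The filter above checks only the Gujarati sentence against its vocabulary. A possible variant that validates both sides of each pair is sketched below on toy data; `is_valid_pair` and the toy vocabularies are hypothetical, and applying such a check to the real data would change the number of valid sentences reported above.

```python
def is_valid_pair(source, target, source_vocab, target_vocab, max_len):
    """Hypothetical helper: both sentences must fit and use only known characters."""
    fits = len(source) < max_len - 1 and len(target) < max_len - 1
    known = set(source) <= set(source_vocab) and set(target) <= set(target_vocab)
    return fits and known

toy_source_vocab = list('abc ')
toy_target_vocab = list('xyz ')
pairs = [('a b', 'x y'),   # kept
         ('a b', 'x q'),   # dropped: 'q' is not in the target vocabulary
         ('a d', 'x y')]   # dropped: 'd' is not in the source vocabulary

kept = [pair for pair in pairs
        if is_valid_pair(*pair, toy_source_vocab, toy_target_vocab, max_len=10)]
print(kept)  # [('a b', 'x y')]
```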

# Making a Dataset out of the Final Chosen English and Gujarati Sentences

In [15]:
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):

    def __init__(self, english_sentences, gujarati_sentences):
        self.english_sentences = english_sentences
        self.gujarati_sentences = gujarati_sentences

    def __len__(self):
        return len(self.english_sentences)

    def __getitem__(self, idx):
        return self.english_sentences[idx], self.gujarati_sentences[idx]

In [16]:
dataset = TextDataset(english_sentences, gujarati_sentences)

In [17]:
dataset[3]

('It was plain that their intensive study of the Scriptures over their five months of training had reached their heart and motivated them to share with others what they had learned.',
 'પાંચ મહિનાના કોર્સમાં પોતે જે કંઈ શીખ્યો, એ એક વિદ્યાર્થીએ પોતાના નાના ભાઈને જણાવ્યું.')

In [18]:
batch_size = 10
train_loader = DataLoader(dataset, batch_size)
iterator = iter(train_loader)

for batch_num, batch in enumerate(iterator):
    print(batch)
    if batch_num > 3:
        break

[('Are you doing online transactions?', 'Kunwar explains:', 'A passenger train is sitting at a station.', 'It was plain that their intensive study of the Scriptures over their five months of training had reached their heart and motivated them to share with others what they had learned.', 'Jesus Christ is overseeing the greatest preaching campaign in history', 'The Moskals immediately included the reading of the Harp book in their regular Bible - reading sessions.', 'Gas lasers.', 'Then the job.', 'Australia announce ODI squad for India series', 'This was another topping.'), ('ઓનલાઈન ટ્રાન્ઝેક્શન કરી શકાય?', 'કુરાન તે વર્ણવે છે:', 'એક પેસેન્જર ટ્રેન સ્ટેશન પર બેઠેલું છે.', 'પાંચ મહિનાના કોર્સમાં પોતે જે કંઈ શીખ્યો, એ એક વિદ્યાર્થીએ પોતાના નાના ભાઈને જણાવ્યું.', 'આજે પૃથ્વી પર થઈ રહેલા મહાન પ્રચાર કાર્યની ઈસુ દેખરેખ રાખે છે', 'મૉસ્કેલ કુટુંબે બાઇબલ સાથે સાથે એ પુસ્તક પણ વાંચવાનું શરૂ કરી દીધું.', 'ગેસ લેસર્સ.', 'પછી તો કામ જ કામ છે.', 'ઓસ્ટ્રેલિયા સામેની વનડે શ્રેણી માટે ટીમ ઈન્ડિયાની જા

# **Tokenizing Sentences**

In [19]:
def tokenize(sentence, language_to_index, start_token=True, end_token=True):
    # Character-level encoding: every character becomes its vocabulary index.
    sentence_char_indices = [language_to_index[token] for token in list(sentence)]
    if start_token:
        sentence_char_indices.insert(0, language_to_index[START_TOKEN])
    if end_token:
        sentence_char_indices.append(language_to_index[END_TOKEN])
    # Pad on the right up to the global max_sequence_length so the tensors can be stacked.
    for _ in range(len(sentence_char_indices), max_sequence_length):
        sentence_char_indices.append(language_to_index[PADDING_TOKEN])
    return torch.tensor(sentence_char_indices)
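The same encode-and-pad steps can be traced on a toy vocabulary with plain Python lists instead of tensors. Everything in this sketch (`toy_vocab`, `encode_padded`, `decode_padded`, the token strings and `max_len=8`) is a hypothetical stand-in chosen to keep the example small; the decoding helper is not part of the notebook.

```python
START, END, PAD = '<s>', '</s>', '<pad>'
toy_vocab = [START, ' ', 'a', 'b', 'c', END, PAD]
to_index = {char: i for i, char in enumerate(toy_vocab)}
to_char = dict(enumerate(toy_vocab))

def encode_padded(sentence, to_index, max_len, add_start=True, add_end=True):
    """Character-level encoding, optional start/end tokens, right-padding to max_len."""
    ids = [to_index[char] for char in sentence]
    if add_start:
        ids.insert(0, to_index[START])
    if add_end:
        ids.append(to_index[END])
    ids.extend([to_index[PAD]] * (max_len - len(ids)))
    return ids

def decode_padded(ids, to_char):
    """Inverse mapping that drops the start, end and padding tokens."""
    chars = [to_char[i] for i in ids]
    return ''.join(c for c in chars if c not in (START, END, PAD))

ids = encode_padded('ab c', to_index, max_len=8)
print(ids)                          # [0, 2, 3, 1, 4, 5, 6, 6]
print(decode_padded(ids, to_char))  # ab c
```

Padding every sequence to the same length is what allows the per-sentence results to be stacked into a single batch tensor.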

In [20]:
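# `batch` is the last batch produced by the loop above: (English sentences, Gujarati sentences).
eng_tokenized, gu_tokenized = [], []
for sentence_num in range(batch_size):
    eng_sentence, gu_sentence = batch[0][sentence_num], batch[1][sentence_num]
    # English sentences get no start/end tokens; Gujarati sentences get both.
    eng_tokenized.append( tokenize(eng_sentence, english_to_index, start_token=False, end_token=False) )
    gu_tokenized.append( tokenize(gu_sentence, gujarati_to_index, start_token=True, end_token=True) )
# Stack the per-sentence tensors into tensors of shape (batch_size, max_sequence_length).
eng_tokenized = torch.stack(eng_tokenized)
gu_tokenized = torch.stack(gu_tokenized)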

In [21]:
gu_tokenized

tensor([[  0,  74,  94,  ..., 102, 102, 102],
        [  0,  34,  80,  ..., 102, 102, 102],
        [  0,  64,  91,  ..., 102, 102, 102],
        ...,
        [  0,  69,  96,  ..., 102, 102, 102],
        [  0,  49,  91,  ..., 102, 102, 102],
        [  0,  44,  77,  ..., 102, 102, 102]])

In [22]:
eng_tokenized

tensor([[50, 73, 71,  ..., 95, 95, 95],
        [44, 73, 77,  ..., 95, 95, 95],
        [51, 79, 13,  ..., 95, 95, 95],
        ...,
        [48, 82, 73,  ..., 95, 95, 95],
        [55, 72, 89,  ..., 95, 95, 95],
        [40, 79, 87,  ..., 95, 95, 95]])