**Downloading the [Zip file](http://www.manythings.org/anki/fra-eng.zip)**

In [1]:
!wget http://www.manythings.org/anki/fra-eng.zip

--2025-01-29 17:01:48--  http://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7943074 (7.6M) [application/zip]
Saving to: ‘fra-eng.zip’


2025-01-29 17:01:49 (13.2 MB/s) - ‘fra-eng.zip’ saved [7943074/7943074]



**Extracting the Zip file**

In [2]:
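import zipfile

# extract the archive into the current directory
# (the variable is not called `zip`, which would shadow the built-in function)
with zipfile.ZipFile('fra-eng.zip') as archive:
	archive.extractall()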

**Dependencies**

In [3]:
import string,re
from unicodedata import normalize
from numpy import array,argmax
from pickle import load,dump
from numpy.random import rand,shuffle

In [4]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed

**Loading the file and reading the content of the file**

In [5]:
# load file into memory
def load_file(filename):
	# open the file as read only; the with-block closes it even if reading fails
	with open(filename, mode='rt', encoding='utf-8') as file:
		# read all text
		text = file.read()
	return text

**Splitting the sentence into pairs**

In [6]:
# split a loaded document into sentences
def splitting_sentence(doc):
	sentences = doc.strip().split('\n')
	pairs = [sentence.split('\t') for sentence in  sentences]
	return pairs
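
To make the split concrete, here is a small self-contained sketch on a made-up two-line sample. It assumes each line holds tab-separated fields in the order English, French, attribution; the attribution strings below are invented for illustration and do not come from `fra.txt`.

```python
# Hypothetical sample in a tab-separated layout (English, French, attribution).
sample_doc = "Go.\tVa !\tattribution-1\nHi.\tSalut !\tattribution-2\n"

def splitting_sentence(doc):
    # one entry per line, then one field per tab
    sentences = doc.strip().split('\n')
    return [sentence.split('\t') for sentence in sentences]

sample_pairs = splitting_sentence(sample_doc)
print(sample_pairs[0])     # fields of the first line
print(len(sample_pairs))   # number of lines
```

Each row keeps every tab-separated field, so any extra field after the French sentence ends up as an additional column.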

**Cleaning the pairs**

In [7]:
# cleaning a list of sentences and creating pairs

def clean_pairs(sentences):
	cleaned = list()
 
	# preparing regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))

	# preparing translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)

	# iterating over each pair
	for pair in sentences:
		clean_pair = list()
  
		for sentence in pair:
			# normalizing unicode characters
			sentence = normalize('NFD', sentence).encode('ascii', 'ignore')
			sentence = sentence.decode('UTF-8')
			# tokenizing on white space
			sentence = sentence.split()
			# converting to lowercase
			sentence = [word.lower() for word in sentence]
			# removing punctuation from each token
			sentence = [word.translate(table) for word in sentence]
			# removing non-printable chars from each token
			sentence = [re_print.sub('', w) for w in sentence]
			# keeping only alphabetic tokens (drops tokens containing digits and empty tokens)
			sentence = [word for word in sentence if word.isalpha()]
			# storing as string
			clean_pair.append(' '.join(sentence))
		cleaned.append(clean_pair)
	return array(cleaned)
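
The effect of these steps is easiest to see on a single sentence. The sketch below applies the same steps to one string at a time; `clean_sentence` is a hypothetical helper written for this illustration, not part of the notebook.

```python
import re
import string
from unicodedata import normalize

def clean_sentence(sentence):
    # hypothetical single-sentence helper mirroring the steps in clean_pairs
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    table = str.maketrans('', '', string.punctuation)
    # decompose accented characters, then drop the non-ASCII combining marks
    sentence = normalize('NFD', sentence).encode('ascii', 'ignore').decode('UTF-8')
    words = [word.lower() for word in sentence.split()]
    words = [word.translate(table) for word in words]
    words = [re_print.sub('', word) for word in words]
    # keep alphabetic tokens only
    words = [word for word in words if word.isalpha()]
    return ' '.join(words)

print(clean_sentence("Ça alors !"))   # prints: ca alors
print(clean_sentence("I'm 25."))      # prints: im
```

Note that accents and apostrophes are removed rather than replaced, which is why forms such as `im` and `ca` appear in the cleaned data.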

**Saving the Cleaned data**

In [8]:
def saving_clean_data(sentences, filename):
	# the with-block flushes and closes the file once the dump is done
	with open(filename, 'wb') as f:
		dump(sentences, f)
	print(filename,': Saved')

**Saving data in .pkl format**

In [9]:
# load dataset

filename = 'fra.txt'
doc = load_file(filename)

# split into english-french pairs
pairs = splitting_sentence(doc)

# clean sentences
# (stored under a new name so the clean_pairs function is not overwritten
# and the cell can be re-run)
cleaned_pairs = clean_pairs(pairs)

# save clean pairs to file
saving_clean_data(cleaned_pairs, 'english-french.pkl')

print('English','-->',"French")
# spot check
for i in range(25):
	print(cleaned_pairs[i,0],'-->',cleaned_pairs[i,1])

english-french.pkl : Saved
English --> French
go --> va
go --> marche
go --> en route
go --> bouge
hi --> salut
hi --> salut
run --> cours
run --> courez
run --> prenez vos jambes a vos cous
run --> file
run --> filez
run --> cours
run --> fuyez
run --> fuyons
run --> cours
run --> courez
run --> prenez vos jambes a vos cous
run --> file
run --> filez
run --> cours
run --> fuyez
run --> fuyons
who --> qui
wow --> ca alors
wow --> waouh


**Loading the cleaned data**

In [10]:
# load a clean dataset
def loading_cleaned_data(filename):
	# the with-block closes the file after unpickling
	with open(filename, 'rb') as f:
		return load(f)

In [11]:
# load dataset
data = loading_cleaned_data('english-french.pkl')
print(data.shape) 

(232736, 3)


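**Reducing the dataset size**

Only the first 20000 of the 232736 cleaned pairs are kept, and these are split 90%/10% into training and testing data.

**Size**

1. Dataset - 20000

2. Training - 18000

3. Testing - 2000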



In [12]:
# reducing the dataset to its first 20000 pairs

new_data_size = 20000
dataset = data[:new_data_size, :]

# randomly shuffling the dataset to get proper training and testing data
shuffle(dataset)

# splitting into training and testing (90%-10%)
train, test = dataset[:18000], dataset[18000:]

# saving the cleaned data,train data and test data 
saving_clean_data(dataset, 'english-french-both.pkl')
saving_clean_data(train, 'english-french-train.pkl')
saving_clean_data(test, 'english-french-test.pkl')

english-french-both.pkl : Saved
english-french-train.pkl : Saved
english-french-test.pkl : Saved
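
The shuffle above is not seeded, so the train/test split differs from run to run. The following is a minimal sketch of a reproducible 90%/10% split using only the standard library; `shuffled_split`, the seed value, and the sample rows are assumptions made for this illustration and are not what the notebook uses.

```python
import random

def shuffled_split(rows, train_fraction=0.9, seed=42):
    # copy first so the caller's list is left untouched
    rows = list(rows)
    # a fixed seed gives the same order on every run
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = [("en%d" % i, "fr%d" % i) for i in range(20)]   # made-up pairs
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))   # prints: 18 2
```

With NumPy, calling `numpy.random.seed` before `shuffle` should have a similar effect.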


In [13]:
# loading datasets and saving it into variables
dataset = loading_cleaned_data('english-french-both.pkl')
train = loading_cleaned_data('english-french-train.pkl')
test = loading_cleaned_data('english-french-test.pkl')

**Creating a tokenizer for the lines and finding the maximum length phrase**

In [14]:
# fit a tokenizer
def create_tokenizer(lines):
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# max sentence length
def max_length(lines):
	return max(len(line.split()) for line in lines)

**Size of English & French vocabulary and their max phrase length**

In [15]:
# preparing the english tokenizer

eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])

print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))

# preparing the french tokenizer

fra_tokenizer = create_tokenizer(dataset[:, 1])
fra_vocab_size = len(fra_tokenizer.word_index) + 1
fra_length = max_length(dataset[:, 1])
print('French Vocabulary Size: %d' % fra_vocab_size)
print('French Max Length: %d' % (fra_length))


English Vocabulary Size: 3316
English Max Length: 5
French Vocabulary Size: 6875
French Max Length: 11
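
The `+ 1` in the vocabulary size accounts for index 0, which is reserved for padding and never assigned to a word. The sketch below imitates that numbering in plain Python; `build_word_index` is a hypothetical stand-in for what `fit_on_texts` produces, not the Keras implementation, and the sample lines are made up.

```python
from collections import Counter

def build_word_index(lines):
    # hypothetical stand-in for Tokenizer.fit_on_texts: rank words by frequency,
    # numbering from 1 so that 0 stays free for padding
    counts = Counter(word for line in lines for word in line.split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i + 1 for i, word in enumerate(ranked)}

lines = ["go away", "go home", "run"]
word_index = build_word_index(lines)
vocab_size = len(word_index) + 1                      # +1 for the padding index 0
max_len = max(len(line.split()) for line in lines)    # longest phrase in words
print(word_index["go"], vocab_size, max_len)          # prints: 1 5 2
```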


**Encoding to integers and padding to the maximum phrase length**

In [16]:
# Input and Output sequence must be encoded to integers and padded to the maximum phrase length
def encode_sequences(tokenizer, length, lines):
	# integer encode sequences
	x = tokenizer.texts_to_sequences(lines)
	# pad sequences with 0 values
	x = pad_sequences(x, maxlen=length, padding='post')
	return x

# One hot encoding to max phrase length
def one_hot_encoding(sequences, vocab_size):
	y_1 = list()
	for sequence in sequences:
		encoded = to_categorical(sequence, num_classes=vocab_size)
		y_1.append(encoded)
	y = array(y_1)
	y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
	return y
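
As a rough illustration of what these two functions produce, the sketch below pads and one-hot encodes a single short sequence in plain Python. `pad_post` and `one_hot` are hypothetical helpers that only mimic `pad_sequences(..., padding='post')` and `to_categorical` for sequences no longer than the target length.

```python
def pad_post(sequence, length, value=0):
    # right-pad with `value`; sequences are assumed to be no longer than `length`
    return sequence + [value] * (length - len(sequence))

def one_hot(sequence, vocab_size):
    # one row per time step, with a single 1 at the token's index
    return [[1 if i == token else 0 for i in range(vocab_size)] for token in sequence]

encoded = pad_post([3, 1], length=4)
target = one_hot(encoded, vocab_size=5)
print(encoded)                        # prints: [3, 1, 0, 0]
print(len(target), len(target[0]))    # prints: 4 5
```

Padded positions become one-hot vectors for class 0, so the model is also trained to predict padding at the end of short phrases.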

**Training and Testing Data**

In [17]:
# preparing training data
trainX = encode_sequences(fra_tokenizer, fra_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = one_hot_encoding(trainY, eng_vocab_size)

# prepare testing data
testX = encode_sequences(fra_tokenizer, fra_length, test[:, 1])
testY = encode_sequences(eng_tokenizer,eng_length, test[:, 0])
testY = one_hot_encoding(testY, eng_vocab_size)

In [18]:
print('training size:',trainX.shape,trainY.shape)
print('testing size:',testX.shape,testY.shape)

training size: (18000, 11) (18000, 5, 3316)
testing size: (2000, 11) (2000, 5, 3316)


**Building the model**

In [19]:
def model_building(source_vocab, target_vocab, source_len, target_len, units):
	model = Sequential()
	model.add(Embedding(source_vocab, units, input_length=source_len, mask_zero=True))
	model.add(LSTM(units))
	model.add(RepeatVector(target_len))
	model.add(LSTM(units, return_sequences=True))
	model.add(TimeDistributed(Dense(target_vocab, activation='softmax')))
	return model
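
It can help to follow the tensor shapes through this encoder-decoder stack. The sketch below only does the bookkeeping and builds no model; `trace_shapes` is a hypothetical helper, and the numbers passed to it are the lengths, vocabulary size, and unit count that appear elsewhere in this notebook.

```python
def trace_shapes(source_len, target_len, units, target_vocab, batch="batch"):
    # hypothetical helper: lists the output shape after each layer
    return [
        ("Embedding", (batch, source_len, units)),
        ("LSTM (encoder)", (batch, units)),
        ("RepeatVector", (batch, target_len, units)),
        ("LSTM (decoder)", (batch, target_len, units)),
        ("TimeDistributed(Dense)", (batch, target_len, target_vocab)),
    ]

shapes = trace_shapes(source_len=11, target_len=5, units=512, target_vocab=3316)
for name, shape in shapes:
    print(name, shape)
```

The encoder LSTM compresses the whole French phrase into one vector, `RepeatVector` copies it once per English output position, and the final layer produces one probability distribution over the English vocabulary per position.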

**Defining and Compiling the model**

In [20]:
model = model_building(fra_vocab_size, eng_vocab_size, fra_length, eng_length, 512)
model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['acc'])



**Model Summary**

In [21]:
# summary() prints the table itself and returns None, so it is not wrapped in print()
model.summary()


In [22]:
# Stop training early when the validation accuracy stops improving.
# min_delta=0.01: a change in val_acc smaller than 0.01 does not count as an improvement.
# patience=5: training stops after 5 consecutive epochs without such an improvement.
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_acc',patience= 5,min_delta=0.01)
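
The interplay of `patience` and `min_delta` can be sketched in a few lines of plain Python. This is a simplified model of the stopping rule for a metric that should increase, not the Keras implementation, and the accuracy values are made up.

```python
def stopping_epoch(val_acc_history, patience=5, min_delta=0.01):
    # returns the 0-based epoch at which training would stop, or None
    best = float("-inf")
    wait = 0
    for epoch, current in enumerate(val_acc_history):
        if current - min_delta > best:
            best = current   # improvement large enough to count
            wait = 0
        else:
            wait += 1        # one more epoch without a real improvement
            if wait >= patience:
                return epoch
    return None

history = [0.50, 0.60, 0.605, 0.606, 0.607, 0.608, 0.609, 0.70]   # made-up values
print(stopping_epoch(history))   # prints: 6
```

In this example the small gains after the second epoch never exceed `min_delta`, so training stops before the last value is ever reached.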

**Fitting the model**

1. Epochs = 50

2. Batch_size = 25

In [23]:
# fit model
model.fit(trainX, trainY, epochs= 50, batch_size=25, validation_data=(testX, testY), verbose=2,callbacks=[es])

Epoch 1/50
720/720 - 17s - 23ms/step - acc: 0.4768 - loss: 3.4779 - val_acc: 0.5437 - val_loss: 2.9136
Epoch 2/50
720/720 - 9s - 12ms/step - acc: 0.5710 - loss: 2.5870 - val_acc: 0.6071 - val_loss: 2.3889
Epoch 3/50
720/720 - 9s - 12ms/step - acc: 0.6268 - loss: 2.0455 - val_acc: 0.6329 - val_loss: 2.1138
Epoch 4/50
720/720 - 9s - 12ms/step - acc: 0.6750 - loss: 1.6283 - val_acc: 0.6617 - val_loss: 1.8967
Epoch 5/50
720/720 - 9s - 12ms/step - acc: 0.7187 - loss: 1.2831 - val_acc: 0.6839 - val_loss: 1.7369
Epoch 6/50
720/720 - 9s - 12ms/step - acc: 0.7682 - loss: 0.9846 - val_acc: 0.7007 - val_loss: 1.6391
Epoch 7/50
720/720 - 9s - 12ms/step - acc: 0.8115 - loss: 0.7503 - val_acc: 0.7124 - val_loss: 1.5802
Epoch 8/50
720/720 - 9s - 12ms/step - acc: 0.8478 - loss: 0.5696 - val_acc: 0.7277 - val_loss: 1.5394
Epoch 9/50
720/720 - 9s - 12ms/step - acc: 0.8775 - loss: 0.4429 - val_acc: 0.7270 - val_loss: 1.5403
Epoch 10/50
720/720 - 9s - 12ms/step - acc: 0.8982 - loss: 0.3540 - val_acc: 0.73

<keras.src.callbacks.history.History at 0x7f37e4fe0af0>

**Evaluating model and calculating BLEU Score**

Evaluation involves two steps:

1. Generating a translated output sequence for a single input, and

2. Repeating this for many input examples and summarizing the skill of the model across all of them with a BLEU score.

In [24]:
# mapping integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None
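
The lookup above scans the whole vocabulary for every word. A possible alternative, sketched here on a made-up word index, is to invert the mapping once and then use direct dictionary access; `word_for_id_fast` is a hypothetical name.

```python
# hypothetical word index in the same word -> integer layout as tokenizer.word_index
word_index = {"go": 1, "away": 2, "home": 3}

# invert once, then each lookup is a single dictionary access
index_to_word = {index: word for word, index in word_index.items()}

def word_for_id_fast(integer, index_to_word):
    # returns None for unknown ids such as the padding value 0
    return index_to_word.get(integer)

print(word_for_id_fast(2, index_to_word))   # prints: away
print(word_for_id_fast(0, index_to_word))   # prints: None
```

The Keras `Tokenizer` appears to expose a similar mapping as `index_word`; that is worth checking against the installed version before relying on it.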

In [25]:
# generating target given source sequence
def predict_sequence(model, tokenizer, source):
	prediction = model.predict(source, verbose=0)[0]
	integers = [argmax(vector) for vector in prediction]
	target = list()
	for i in integers:
		word = word_for_id(i, tokenizer)
		if word is None:
			break
		target.append(word)
	return ' '.join(target)
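
This is greedy decoding: at each output position the most probable word is taken, and decoding stops at the first position that maps to no word (index 0, the padding value). A plain-Python sketch of the same idea on made-up probabilities, with `greedy_decode` as a hypothetical name:

```python
def greedy_decode(prediction, index_to_word):
    # prediction: one probability vector per output position
    words = []
    for vector in prediction:
        best = max(range(len(vector)), key=vector.__getitem__)   # argmax
        word = index_to_word.get(best)
        if word is None:      # index 0 (padding) has no word: stop here
            break
        words.append(word)
    return ' '.join(words)

index_to_word = {1: "go", 2: "away"}   # made-up vocabulary
prediction = [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.8, 0.1, 0.1]]
print(greedy_decode(prediction, index_to_word))   # prints: go away
```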

In [26]:

# evaluating the skill of the model
def evaluate_model(model, tokenizer, sources, raw_dataset):

  # reference phrases (English targets) and predicted phrases (English output)
  actual,predicted = list(),list()
  for i,source in enumerate(sources):

    # reshaping to a batch holding a single sequence
    source = source.reshape((1, source.shape[0]))

    # decoding the prediction with the tokenizer passed in (the English one)
    translation = predict_sequence(model, tokenizer, source)

    # column 0 holds the English target, column 1 the French source
    raw_src,raw_target = raw_dataset[i][1],raw_dataset[i][0]

    # printing the first 11 predictions (i = 0..10)
    if i <= 10:
      print('source = ',raw_src,'<--->', ' target = ',raw_target,'<--->','  predicted = ',translation)

    # corpus_bleu expects a list of reference translations per sentence
    actual.append([raw_target.split()])
    predicted.append(translation.split())

  # calculating BLEU score
  # (corpus_bleu and smoothie are defined in the cell below, which must run before this function is called)
  print('-------------------------------------------')
  print('BLEU Score :')
  print('BLEU score-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0),smoothing_function=smoothie,auto_reweigh=False))
  print('BLEU score-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0),smoothing_function=smoothie,auto_reweigh=False))
  # equal weights of 1/3 so that the three weights sum to 1
  print('BLEU score-3: %f' % corpus_bleu(actual, predicted, weights=(1/3, 1/3, 1/3, 0),smoothing_function=smoothie,auto_reweigh=False))
  print('BLEU score-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25),smoothing_function=smoothie,auto_reweigh=False))

In [28]:
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

smoothie = SmoothingFunction().method4  # Define smoothing function
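
To give a feel for what BLEU-1 measures, the sketch below computes clipped unigram precision for one sentence pair in plain Python. It is a deliberately reduced illustration: it leaves out the brevity penalty, higher-order n-grams, and smoothing, so it is not what `corpus_bleu` computes. `clipped_unigram_precision` is a hypothetical name.

```python
from collections import Counter

def clipped_unigram_precision(reference, candidate):
    # count each candidate word at most as often as it appears in the reference
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

print(clipped_unigram_precision("get on the horse", "get your the horse"))   # prints: 0.75
```

Three of the four predicted words occur in the reference, which gives 0.75; the higher-order scores additionally reward getting the word order right.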

**Evaluating Model on training data**

In [29]:
evaluate_model(model,eng_tokenizer,trainX,train)

source =  cela pourrait etre tom <--->  target =  it could be tom <--->   predicted =  it could be tom
source =  nous avons rompu <--->  target =  we broke up <--->   predicted =  we broke up
source =  bouge <--->  target =  get out <--->   predicted =  go away
source =  tom est serieux <--->  target =  tom is serious <--->   predicted =  tom means
source =  estu enceinte <--->  target =  are you pregnant <--->   predicted =  are you pregnant
source =  tu as le cancer <--->  target =  you have cancer <--->   predicted =  you have cancer
source =  aidemoi a me lever <--->  target =  help me get up <--->   predicted =  help me get up
source =  je suis toubib <--->  target =  im a doctor <--->   predicted =  im a doctor
source =  tu es mon patron <--->  target =  youre my boss <--->   predicted =  youre my boss
source =  dites ce que vous pensez <--->  target =  speak your mind <--->   predicted =  speak your mind
source =  il le mentionna <--->  target =  he mentioned it <--->   predicte

**Evaluating Model on testing data**

In [33]:
evaluate_model(model, eng_tokenizer, testX, test)

source =  pas grave <--->  target =  skip it <--->   predicted =  lets try
source =  cest ma voiture <--->  target =  this cars mine <--->   predicted =  this is my car
source =  pensestu la meme chose <--->  target =  do you think so <--->   predicted =  thanks like dream
source =  monte le cheval <--->  target =  get on the horse <--->   predicted =  get your the horse
source =  nous sommes impliques <--->  target =  were involved <--->   predicted =  were involved
source =  nous sommes ennemis <--->  target =  were enemies <--->   predicted =  were enemies
source =  laije emporte <--->  target =  did i win <--->   predicted =  did i win
source =  je ne peux pas admettre ma defaite <--->  target =  i cant give up <--->   predicted =  i cant give up
source =  il faut que je sois aveugle <--->  target =  i must be blind <--->   predicted =  i must to hurry
source =  je deteste les tomates <--->  target =  i hate tomatoes <--->   predicted =  i hate liars
source =  je suis tres triste <