In this initial implementation, we utilized a dataset comprising over 20,000 English-Arabic sentence pairs. This dataset served as the foundation for training our Transformer-based machine translation model. The model achieved an average BLEU score of approximately 0.23, indicating that it could capture basic translation patterns but had limitations in handling more complex or nuanced sentences.

A significant challenge was the dataset's limited size and diversity. While 20,000 sentences provide a starting point, it is relatively small compared to the extensive corpora used in state-of-the-art machine translation systems, which often involve millions of sentence pairs. This limitation restricts the model's exposure to a wide range of vocabulary, idiomatic expressions, and varied syntactic structures. Such exposure is crucial, especially when translating between English and Arabic, due to their linguistic differences and the morphological richness of Arabic.

To enhance the model's performance, it is essential to acquire more extensive and diverse datasets. Additional high-quality data would enable the model to learn a broader vocabulary and better understand context, leading to more accurate and fluent translations. Moreover, increasing computational resources would allow for training deeper models with larger embedding dimensions and more Transformer layers, which could capture the complexities of both languages more effectively.

Investing in these areas—expanding the dataset and utilizing more computational power—would address the current limitations. It would significantly improve the model's ability to handle complex sentence structures, idiomatic expressions, and the nuanced linguistic patterns necessary for high-quality machine translation between English and Arabic.

In [1]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)


Using device: cuda


Step 1: Data Preparation
Tokenization:

Tokenize the English and Arabic sentences separately.
Consider using subword tokenization (e.g., Byte-Pair Encoding or WordPiece) to handle out-of-vocabulary words. This is especially useful for Arabic, as it has complex morphology.
Vocabulary Creation:

Create separate vocabularies for English and Arabic tokens.
Map each token to a unique integer for use in the model.
Padding and Batching:

Pad sequences to a fixed length to handle variable-length inputs and outputs.
Group similar-length sentences in batches to optimize training speed and memory use.

In [2]:
english_sentences = []
arabic_sentences = []

with open('ara_eng.txt', 'r', encoding='utf-8') as file:
    for line in file:
        eng, ara = line.strip().split('\t')
        english_sentences.append(eng)
        arabic_sentences.append(ara)

print("Sample English Sentence:", english_sentences[0])
print("Sample Arabic Sentence:", arabic_sentences[0])


Sample English Sentence: Hi.
Sample Arabic Sentence: مرحبًا.


In [3]:
with open('english_sentences.txt', 'w', encoding='utf-8') as eng_file:
    for sentence in english_sentences:
        eng_file.write(sentence + '\n')

with open('arabic_sentences.txt', 'w', encoding='utf-8') as ara_file:
    for sentence in arabic_sentences:
        ara_file.write(sentence + '\n')


In [4]:
import sentencepiece as spm

spm.SentencePieceTrainer.Train(input='english_sentences.txt', model_prefix='eng_tokenizer', vocab_size=8000)

spm.SentencePieceTrainer.Train(input='arabic_sentences.txt', model_prefix='ara_tokenizer', vocab_size=8000)


In [5]:
sp_eng = spm.SentencePieceProcessor(model_file='eng_tokenizer.model')
sp_ara = spm.SentencePieceProcessor(model_file='ara_tokenizer.model')

eng_tokens = [sp_eng.encode(sentence, out_type=int) for sentence in english_sentences]
ara_tokens = [sp_ara.encode(sentence, out_type=int) for sentence in arabic_sentences]

print("Tokenized English:", eng_tokens[:2])
print("Tokenized Arabic:", ara_tokens[:2])


Tokenized English: [[5434, 4], [13, 0, 890, 288]]
Tokenized Arabic: [[6541, 362, 8], [37, 3650, 293]]


In [6]:
eng_vocab_size = sp_eng.get_piece_size()
ara_vocab_size = sp_ara.get_piece_size()

eng_id_to_token = {i: sp_eng.id_to_piece(i) for i in range(eng_vocab_size)}
ara_id_to_token = {i: sp_ara.id_to_piece(i) for i in range(ara_vocab_size)}

eng_token_to_id = {v: k for k, v in eng_id_to_token.items()}
ara_token_to_id = {v: k for k, v in ara_id_to_token.items()}


In [7]:
import torch
from torch.nn.utils.rnn import pad_sequence

eng_tensors = [torch.tensor(tokens) for tokens in eng_tokens]
ara_tensors = [torch.tensor(tokens) for tokens in ara_tokens]

eng_padded = pad_sequence(eng_tensors, batch_first=True, padding_value=0)
ara_padded = pad_sequence(ara_tensors, batch_first=True, padding_value=0)

print("Padded English Sentences:\n", eng_padded)
print("Padded Arabic Sentences:\n", ara_padded)


Padded English Sentences:
 tensor([[5434,    4,    0,  ...,    0,    0,    0],
        [  13,    0,  890,  ...,    0,    0,    0],
        [ 102,  153,  185,  ...,    0,    0,    0],
        ...,
        [ 236,  172,  125,  ...,    0,    0,    0],
        [   6,  151,  152,  ...,    0,    0,    0],
        [  21,   54,   78,  ...,    0,    0,    0]])
Padded Arabic Sentences:
 tensor([[6541,  362,    8,  ...,    0,    0,    0],
        [  37, 3650,  293,  ...,    0,    0,    0],
        [7852,  293,    0,  ...,    0,    0,    0],
        ...,
        [ 600,   30,   99,  ...,    0,    0,    0],
        [2715,    6, 7124,  ...,    0,    0,    0],
        [ 144,  729,  108,  ...,    0,    0,    0]])


In [8]:
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(eng_padded, ara_padded)
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)

for eng_batch, ara_batch in data_loader:
    print("English Batch:\n", eng_batch)
    print("Arabic Batch:\n", ara_batch)
    break


English Batch:
 tensor([[ 203,  194,   21,  ...,    0,    0,    0],
        [  73, 3973,   15,  ...,    0,    0,    0],
        [1069,   19,  159,  ...,    0,    0,    0],
        ...,
        [ 146,  259,    5,  ...,    0,    0,    0],
        [ 271,    6,    3,  ...,    0,    0,    0],
        [ 316,    3,  736,  ...,    0,    0,    0]])
Arabic Batch:
 tensor([[ 248, 2114,  125,  ...,    0,    0,    0],
        [ 579, 1751,  808,  ...,    0,    0,    0],
        [2670,   13,   56,  ...,    0,    0,    0],
        ...,
        [  83, 3360, 1884,  ...,    0,    0,    0],
        [  10,  504,   34,  ...,    0,    0,    0],
        [ 288, 1845, 1553,  ...,    0,    0,    0]])


In [9]:
!pip install sentencepiece
!pip install nltk


Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [10]:
!pip install torch



Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting torch
  Downloading torch-2.5.1-cp312-cp312-win_amd64.whl.metadata (28 kB)
Downloading torch-2.5.1-cp312-cp312-win_amd64.whl (203.0 MB)
   ---------------------------------------- 0.0/203.0 MB ? eta -:--:--
   ---------------------------------------- 0.0/203.0 MB ? eta -:--:--
   ---------------------------------------- 0.8/203.0 MB 4.2 MB/s eta 0:00:49
   ---------------------------------------- 2.1/203.0 MB 5.3 MB/s eta 0:00:38
    --------------------------------------- 3.4/203.0 MB 5.8 MB/s eta 0:00:35
    --------------------------------------- 4.7/203.0 MB 5.7 MB/s eta 0:00:35
   - -------------------------------------- 6.0/203.0 MB 6.1 MB/s eta 0:00:33
   - -------------------------------------- 7.3/203.0 MB 6.1 MB/s eta 0:00:32
   - -------------------------------------- 8.9/203.0 MB 6.2 MB/s eta 0:00:32
   -- --------

In [11]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Besher\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:

import torch
import torch.nn as nn
import torch.optim as optim
import math
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence
import sentencepiece as spm
import random
import numpy as np
import gc
from nltk.translate.bleu_score import sentence_bleu

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

english_sentences = []
arabic_sentences = []

with open('ara_eng.txt', 'r', encoding='utf-8') as file:
    for line in file:
        eng, ara = line.strip().split('\t')
        english_sentences.append(eng)
        arabic_sentences.append(ara)

print("Sample English Sentence:", english_sentences[0])
print("Sample Arabic Sentence:", arabic_sentences[0])


with open('english_sentences.txt', 'w', encoding='utf-8') as eng_file:
    for sentence in english_sentences:
        eng_file.write(sentence + '\n')

with open('arabic_sentences.txt', 'w', encoding='utf-8') as ara_file:
    for sentence in arabic_sentences:
        ara_file.write(sentence + '\n')

spm.SentencePieceTrainer.Train(input='english_sentences.txt', model_prefix='eng_tokenizer', vocab_size=8000)
spm.SentencePieceTrainer.Train(input='arabic_sentences.txt', model_prefix='ara_tokenizer', vocab_size=8000)

sp_eng = spm.SentencePieceProcessor(model_file='eng_tokenizer.model')
sp_ara = spm.SentencePieceProcessor(model_file='ara_tokenizer.model')

del english_sentences, arabic_sentences
gc.collect()


max_seq_length = 100

class TranslationDataset(Dataset):
    def __init__(self, src_file, tgt_file, tokenizer_src, tokenizer_tgt, max_length):
        self.src_sentences = open(src_file, 'r', encoding='utf-8').readlines()
        self.tgt_sentences = open(tgt_file, 'r', encoding='utf-8').readlines()
        self.tokenizer_src = tokenizer_src
        self.tokenizer_tgt = tokenizer_tgt
        self.max_length = max_length

    def __len__(self):
        return len(self.src_sentences)

    def __getitem__(self, idx):
        src_sentence = self.src_sentences[idx].strip()
        tgt_sentence = self.tgt_sentences[idx].strip()
        src_tokens = self.tokenizer_src.encode(src_sentence, out_type=int)[:self.max_length]
        tgt_tokens = self.tokenizer_tgt.encode(tgt_sentence, out_type=int)[:self.max_length]
        return torch.tensor(src_tokens), torch.tensor(tgt_tokens)

def collate_fn(batch):
    src_batch, tgt_batch = zip(*batch)
    src_batch = pad_sequence(src_batch, batch_first=True, padding_value=0)
    tgt_batch = pad_sequence(tgt_batch, batch_first=True, padding_value=0)
    return src_batch, tgt_batch

dataset = TranslationDataset('english_sentences.txt', 'arabic_sentences.txt', sp_eng, sp_ara, max_seq_length)

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

batch_size = 16  # Adjust as needed
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate_fn)


class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(EmbeddingLayer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)

    def forward(self, x):
        return self.embedding(x)

class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(dropout)

        pe = torch.zeros((max_len, embed_size))
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)


class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, embed_size=256, num_heads=4, num_layers=2, dropout=0.1):
        super(TransformerModel, self).__init__()

        self.src_embedding = EmbeddingLayer(src_vocab_size, embed_size)
        self.tgt_embedding = EmbeddingLayer(tgt_vocab_size, embed_size)
        self.positional_encoding = PositionalEncoding(embed_size, dropout)

        self.transformer = nn.Transformer(
            d_model=embed_size,
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dropout=dropout,
            batch_first=True
        )

        self.fc_out = nn.Linear(embed_size, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def create_padding_mask(self, sequences, pad_idx=0):
        return (sequences == pad_idx)

    def forward(self, src, tgt):
        src_padding_mask = self.create_padding_mask(src)
        tgt_padding_mask = self.create_padding_mask(tgt)
        memory_key_padding_mask = src_padding_mask.clone()

        src = self.positional_encoding(self.src_embedding(src))
        tgt = self.positional_encoding(self.tgt_embedding(tgt))

        output = self.transformer(
            src,
            tgt,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask,
            memory_key_padding_mask=memory_key_padding_mask
        )
        output = self.fc_out(output)
        return output


src_vocab_size = sp_eng.get_piece_size()
tgt_vocab_size = sp_ara.get_piece_size()
model = TransformerModel(src_vocab_size, tgt_vocab_size, embed_size=256, num_heads=4, num_layers=2, dropout=0.1)
model.to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(model.parameters(), lr=0.0005)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)


def train(model, data_loader, criterion, optimizer, scheduler, num_epochs=10, pad_idx=0):
    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0
        for src_batch, tgt_batch in data_loader:
            src_batch = src_batch.to(device)
            tgt_batch = tgt_batch.to(device)

            tgt_input = tgt_batch[:, :-1]
            tgt_output = tgt_batch[:, 1:]

            output = model(src_batch, tgt_input)

            output = output.reshape(-1, output.shape[-1])
            tgt_output = tgt_output.reshape(-1)

            loss = criterion(output, tgt_output)

            optimizer.zero_grad()
            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()

            epoch_loss += loss.item()

        scheduler.step()
        avg_epoch_loss = epoch_loss / len(data_loader)
        print(f'Epoch {epoch + 1}, Loss: {avg_epoch_loss:.4f}')

def evaluate(model, data_loader, pad_idx=0):
    model.eval()
    total_bleu_score = 0
    total_sentences = 0

    with torch.no_grad():
        for src_batch, tgt_batch in data_loader:
            src_batch = src_batch.to(device)
            tgt_batch = tgt_batch.to(device)

            tgt_input = tgt_batch[:, :-1]
            tgt_output = tgt_batch[:, 1:]

            output = model(src_batch, tgt_input)
            output = output.argmax(dim=-1)

            for i in range(tgt_output.size(0)):
                reference = [tgt_output[i].tolist()]
                candidate = output[i].tolist()
                # Remove padding tokens
                reference = [[token for token in ref if token != pad_idx] for ref in reference]
                candidate = [token for token in candidate if token != pad_idx]
                total_bleu_score += sentence_bleu(reference, candidate)
                total_sentences += 1

    avg_bleu_score = total_bleu_score / total_sentences
    print(f'Average BLEU Score: {avg_bleu_score:.4f}')


train(model, train_loader, criterion, optimizer, scheduler, num_epochs=10, pad_idx=0)


evaluate(model, val_loader, pad_idx=0)


torch.save(model.state_dict(), 'transformer_model.pth')


Using device: cuda
Sample English Sentence: Hi.
Sample Arabic Sentence: مرحبًا.
Epoch 1, Loss: 5.8666
Epoch 2, Loss: 3.2910
Epoch 3, Loss: 1.7649
Epoch 4, Loss: 1.1056
Epoch 5, Loss: 0.7678
Epoch 6, Loss: 0.5619
Epoch 7, Loss: 0.4288
Epoch 8, Loss: 0.3348
Epoch 9, Loss: 0.2653
Epoch 10, Loss: 0.2155


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  output = torch._nested_tensor_from_mask(
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Average BLEU Score: 0.2378


# Average BLEU Score: 0.2378


In [15]:
from bert_score import score

def compute_bert_score(model, data_loader, sp_src, sp_tgt, pad_idx=0):
    """Compute BERTScore for translations."""
    model.eval()
    references = []
    candidates = []

    with torch.no_grad():
        for src_batch, tgt_batch in data_loader:
            src_batch = src_batch.to(device)
            tgt_batch = tgt_batch.to(device)

            tgt_input = tgt_batch[:, :-1]
            tgt_output = tgt_batch[:, 1:]

            output = model(src_batch, tgt_input)
            output = output.argmax(dim=-1)

            for i in range(tgt_output.size(0)):
                # Decode reference and candidate sentences
                ref_sentence = sp_tgt.decode([token for token in tgt_output[i].tolist() if token != pad_idx])
                candidate_sentence = sp_tgt.decode([token for token in output[i].tolist() if token != pad_idx])

                references.append(ref_sentence)
                candidates.append(candidate_sentence)

    # Compute BERTScore
    P, R, F1 = score(candidates, references, lang="ar")  # Set "lang" to Arabic ("ar")
    avg_f1 = F1.mean().item()

    print(f"Average BERTScore (F1): {avg_f1:.4f}")
    return P, R, F1


In [16]:
P, R, F1 = compute_bert_score(model, val_loader, sp_eng, sp_ara, pad_idx=0)



Average BERTScore (F1): 0.7342


# Average BERTScore (F1): 0.7342

In [13]:
import numpy as np
from nltk.translate.bleu_score import sentence_bleu

def compute_bleu_scores(model, data_loader, pad_idx=0):
    """Compute BLEU scores for all samples in the validation dataset."""
    model.eval()
    bleu_scores = []

    with torch.no_grad():
        for src_batch, tgt_batch in data_loader:
            src_batch = src_batch.to(device)
            tgt_batch = tgt_batch.to(device)

            tgt_input = tgt_batch[:, :-1]
            tgt_output = tgt_batch[:, 1:]

            output = model(src_batch, tgt_input)
            output = output.argmax(dim=-1)

            for i in range(tgt_output.size(0)):
                reference = [tgt_output[i].tolist()]
                candidate = output[i].tolist()
                # Remove padding tokens
                reference = [[token for token in ref if token != pad_idx] for ref in reference]
                candidate = [token for token in candidate if token != pad_idx]
                bleu_score = sentence_bleu(reference, candidate)
                bleu_scores.append(bleu_score)

    return np.array(bleu_scores)

def bootstrap_confidence_interval(data, num_samples=1000, confidence_level=0.95):
    """Compute the confidence interval for the mean using bootstrap sampling."""
    bootstrapped_means = []
    n = len(data)

    for _ in range(num_samples):
        sample = np.random.choice(data, size=n, replace=True)
        bootstrapped_means.append(np.mean(sample))

    lower_bound = np.percentile(bootstrapped_means, (1 - confidence_level) / 2 * 100)
    upper_bound = np.percentile(bootstrapped_means, (1 + confidence_level) / 2 * 100)
    return lower_bound, upper_bound

# Compute BLEU scores
bleu_scores = compute_bleu_scores(model, val_loader)

# Compute 95% confidence interval
lower, upper = bootstrap_confidence_interval(bleu_scores)
print(f"Average BLEU Score: {np.mean(bleu_scores):.4f}")
print(f"95% Confidence Interval: [{lower:.4f}, {upper:.4f}]")


Average BLEU Score: 0.2378
95% Confidence Interval: [0.2294, 0.2450]


# 95% Confidence Interval: [0.2294, 0.2450]