In this initial implementation, we utilized a dataset comprising over 20,000 English-Arabic sentence pairs. This dataset served as the foundation for training our Transformer-based machine translation model. The model achieved an average BLEU score of approximately 0.23, indicating that it could capture basic translation patterns but had limitations in handling more complex or nuanced sentences.

A significant challenge was the dataset's limited size and diversity. While 20,000 sentences provide a starting point, it is relatively small compared to the extensive corpora used in state-of-the-art machine translation systems, which often involve millions of sentence pairs. This limitation restricts the model's exposure to a wide range of vocabulary, idiomatic expressions, and varied syntactic structures. Such exposure is crucial, especially when translating between English and Arabic, due to their linguistic differences and the morphological richness of Arabic.

To enhance the model's performance, it is essential to acquire more extensive and diverse datasets. Additional high-quality data would enable the model to learn a broader vocabulary and better understand context, leading to more accurate and fluent translations. Moreover, increasing computational resources would allow for training deeper models with larger embedding dimensions and more Transformer layers, which could capture the complexities of both languages more effectively.

Investing in these areas—expanding the dataset and utilizing more computational power—would address the current limitations. It would significantly improve the model's ability to handle complex sentence structures, idiomatic expressions, and the nuanced linguistic patterns necessary for high-quality machine translation between English and Arabic.

In [None]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)


Using device: cuda


Step 1: Data Preparation
Tokenization:

Tokenize the English and Arabic sentences separately.
Consider using subword tokenization (e.g., Byte-Pair Encoding or WordPiece) to handle out-of-vocabulary words. This is especially useful for Arabic, as it has complex morphology.
Vocabulary Creation:

Create separate vocabularies for English and Arabic tokens.
Map each token to a unique integer for use in the model.
Padding and Batching:

Pad sequences to a fixed length to handle variable-length inputs and outputs.
Group similar-length sentences in batches to optimize training speed and memory use.

In [None]:
# Reading data from ara_eng.txt
english_sentences = []
arabic_sentences = []

with open('ara_eng.txt', 'r', encoding='utf-8') as file:
    for line in file:
        # Assuming sentences are separated by a tab (adjust if needed)
        eng, ara = line.strip().split('\t')
        english_sentences.append(eng)
        arabic_sentences.append(ara)

print("Sample English Sentence:", english_sentences[0])
print("Sample Arabic Sentence:", arabic_sentences[0])


Sample English Sentence: Hi.
Sample Arabic Sentence: مرحبًا.


In [None]:
# Save English and Arabic sentences to separate temporary files
with open('english_sentences.txt', 'w', encoding='utf-8') as eng_file:
    for sentence in english_sentences:
        eng_file.write(sentence + '\n')

with open('arabic_sentences.txt', 'w', encoding='utf-8') as ara_file:
    for sentence in arabic_sentences:
        ara_file.write(sentence + '\n')


In [None]:
import sentencepiece as spm

# Train English tokenizer
spm.SentencePieceTrainer.Train(input='english_sentences.txt', model_prefix='eng_tokenizer', vocab_size=8000)

# Train Arabic tokenizer
spm.SentencePieceTrainer.Train(input='arabic_sentences.txt', model_prefix='ara_tokenizer', vocab_size=8000)


In [None]:
# Load the trained tokenizers
sp_eng = spm.SentencePieceProcessor(model_file='eng_tokenizer.model')
sp_ara = spm.SentencePieceProcessor(model_file='ara_tokenizer.model')

# Tokenize the first few sentences to verify
eng_tokens = [sp_eng.encode(sentence, out_type=int) for sentence in english_sentences]
ara_tokens = [sp_ara.encode(sentence, out_type=int) for sentence in arabic_sentences]

print("Tokenized English:", eng_tokens[:2])
print("Tokenized Arabic:", ara_tokens[:2])


Tokenized English: [[5434, 4], [13, 0, 890, 288]]
Tokenized Arabic: [[6541, 362, 8], [37, 3650, 293]]


In [None]:
# Get vocabulary sizes
eng_vocab_size = sp_eng.get_piece_size()
ara_vocab_size = sp_ara.get_piece_size()

# Create dictionaries for each language
eng_id_to_token = {i: sp_eng.id_to_piece(i) for i in range(eng_vocab_size)}
ara_id_to_token = {i: sp_ara.id_to_piece(i) for i in range(ara_vocab_size)}

# Reverse mappings
eng_token_to_id = {v: k for k, v in eng_id_to_token.items()}
ara_token_to_id = {v: k for k, v in ara_id_to_token.items()}


In [None]:
import torch
from torch.nn.utils.rnn import pad_sequence

# Convert lists of tokens to PyTorch tensors
eng_tensors = [torch.tensor(tokens) for tokens in eng_tokens]
ara_tensors = [torch.tensor(tokens) for tokens in ara_tokens]

# Pad sequences
eng_padded = pad_sequence(eng_tensors, batch_first=True, padding_value=0)
ara_padded = pad_sequence(ara_tensors, batch_first=True, padding_value=0)

print("Padded English Sentences:\n", eng_padded)
print("Padded Arabic Sentences:\n", ara_padded)


Padded English Sentences:
 tensor([[5434,    4,    0,  ...,    0,    0,    0],
        [  13,    0,  890,  ...,    0,    0,    0],
        [ 102,  153,  185,  ...,    0,    0,    0],
        ...,
        [ 236,  172,  125,  ...,    0,    0,    0],
        [   6,  151,  152,  ...,    0,    0,    0],
        [  21,   54,   78,  ...,    0,    0,    0]])
Padded Arabic Sentences:
 tensor([[6541,  362,    8,  ...,    0,    0,    0],
        [  37, 3650,  293,  ...,    0,    0,    0],
        [7852,  293,    0,  ...,    0,    0,    0],
        ...,
        [ 600,   30,   99,  ...,    0,    0,    0],
        [2715,    6, 7124,  ...,    0,    0,    0],
        [ 144,  729,  108,  ...,    0,    0,    0]])


In [None]:
from torch.utils.data import DataLoader, TensorDataset

# Create a dataset and dataloader
dataset = TensorDataset(eng_padded, ara_padded)
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Sample batch
for eng_batch, ara_batch in data_loader:
    print("English Batch:\n", eng_batch)
    print("Arabic Batch:\n", ara_batch)
    break


English Batch:
 tensor([[  63,   54,  131,  ...,    0,    0,    0],
        [ 641,    3,  354,  ...,    0,    0,    0],
        [ 739, 1218,   46,  ...,    0,    0,    0],
        ...,
        [ 102, 4131,   11,  ...,    0,    0,    0],
        [2501, 1279, 6443,  ...,    0,    0,    0],
        [  35,  152,    7,  ...,    0,    0,    0]])
Arabic Batch:
 tensor([[7054,   55,   67,  ...,    0,    0,    0],
        [ 510,    3, 4227,  ...,    0,    0,    0],
        [  59,   53,   30,  ...,    0,    0,    0],
        ...,
        [   3, 4741, 5973,  ...,    0,    0,    0],
        [ 471, 1260, 4488,  ...,    0,    0,    0],
        [  88,  560,    7,  ...,    0,    0,    0]])


In [None]:
!pip install sentencepiece
!pip install nltk




In [None]:
!pip install torch





In [None]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Full code from beginning to end

# 1. Imports and Setup
import torch
import torch.nn as nn
import torch.optim as optim
import math
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence
import sentencepiece as spm
import random
import numpy as np
import gc
from nltk.translate.bleu_score import sentence_bleu

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

# Check if CUDA is available and set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

# 2. Data Preparation

# Reading data from ara_eng.txt
english_sentences = []
arabic_sentences = []

with open('ara_eng.txt', 'r', encoding='utf-8') as file:
    for line in file:
        # Assuming sentences are separated by a tab
        eng, ara = line.strip().split('\t')
        english_sentences.append(eng)
        arabic_sentences.append(ara)

print("Sample English Sentence:", english_sentences[0])
print("Sample Arabic Sentence:", arabic_sentences[0])

# Optional: Limit dataset size for testing (uncomment if needed)
# english_sentences = english_sentences[:10000]
# arabic_sentences = arabic_sentences[:10000]

# 3. Tokenization with SentencePiece

# Save English and Arabic sentences to separate files
with open('english_sentences.txt', 'w', encoding='utf-8') as eng_file:
    for sentence in english_sentences:
        eng_file.write(sentence + '\n')

with open('arabic_sentences.txt', 'w', encoding='utf-8') as ara_file:
    for sentence in arabic_sentences:
        ara_file.write(sentence + '\n')

# Train SentencePiece tokenizers
spm.SentencePieceTrainer.Train(input='english_sentences.txt', model_prefix='eng_tokenizer', vocab_size=8000)
spm.SentencePieceTrainer.Train(input='arabic_sentences.txt', model_prefix='ara_tokenizer', vocab_size=8000)

# Load the trained tokenizers
sp_eng = spm.SentencePieceProcessor(model_file='eng_tokenizer.model')
sp_ara = spm.SentencePieceProcessor(model_file='ara_tokenizer.model')

# Clean up to save memory
del english_sentences, arabic_sentences
gc.collect()

# 4. Dataset and DataLoader

# Define maximum sequence length
max_seq_length = 100  # Adjust as needed

class TranslationDataset(Dataset):
    def __init__(self, src_file, tgt_file, tokenizer_src, tokenizer_tgt, max_length):
        self.src_sentences = open(src_file, 'r', encoding='utf-8').readlines()
        self.tgt_sentences = open(tgt_file, 'r', encoding='utf-8').readlines()
        self.tokenizer_src = tokenizer_src
        self.tokenizer_tgt = tokenizer_tgt
        self.max_length = max_length

    def __len__(self):
        return len(self.src_sentences)

    def __getitem__(self, idx):
        src_sentence = self.src_sentences[idx].strip()
        tgt_sentence = self.tgt_sentences[idx].strip()
        src_tokens = self.tokenizer_src.encode(src_sentence, out_type=int)[:self.max_length]
        tgt_tokens = self.tokenizer_tgt.encode(tgt_sentence, out_type=int)[:self.max_length]
        return torch.tensor(src_tokens), torch.tensor(tgt_tokens)

def collate_fn(batch):
    src_batch, tgt_batch = zip(*batch)
    src_batch = pad_sequence(src_batch, batch_first=True, padding_value=0)
    tgt_batch = pad_sequence(tgt_batch, batch_first=True, padding_value=0)
    return src_batch, tgt_batch

# Create dataset and data loaders
dataset = TranslationDataset('english_sentences.txt', 'arabic_sentences.txt', sp_eng, sp_ara, max_seq_length)

# Split dataset into training and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

# Create data loaders with reduced batch size
batch_size = 16  # Adjust as needed
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate_fn)

# 5. Model Definition

# Embedding and Positional Encoding Layers
class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(EmbeddingLayer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)

    def forward(self, x):
        return self.embedding(x)

class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(dropout)

        pe = torch.zeros((max_len, embed_size))
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # Shape: [1, max_len, embed_size]
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

# Transformer Model Definition
class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, embed_size=256, num_heads=4, num_layers=2, dropout=0.1):
        super(TransformerModel, self).__init__()

        self.src_embedding = EmbeddingLayer(src_vocab_size, embed_size)
        self.tgt_embedding = EmbeddingLayer(tgt_vocab_size, embed_size)
        self.positional_encoding = PositionalEncoding(embed_size, dropout)

        self.transformer = nn.Transformer(
            d_model=embed_size,
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dropout=dropout,
            batch_first=True  # Set batch_first to True
        )

        self.fc_out = nn.Linear(embed_size, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def create_padding_mask(self, sequences, pad_idx=0):
        return (sequences == pad_idx)

    def forward(self, src, tgt):
        src_padding_mask = self.create_padding_mask(src)
        tgt_padding_mask = self.create_padding_mask(tgt)
        memory_key_padding_mask = src_padding_mask.clone()

        src = self.positional_encoding(self.src_embedding(src))
        tgt = self.positional_encoding(self.tgt_embedding(tgt))

        output = self.transformer(
            src,
            tgt,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask,
            memory_key_padding_mask=memory_key_padding_mask
        )
        output = self.fc_out(output)
        return output

# 6. Training and Evaluation Functions

# Initialize the model
src_vocab_size = sp_eng.get_piece_size()
tgt_vocab_size = sp_ara.get_piece_size()
model = TransformerModel(src_vocab_size, tgt_vocab_size, embed_size=256, num_heads=4, num_layers=2, dropout=0.1)
model.to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=0)  # Ignore padding index during loss calculation
optimizer = optim.Adam(model.parameters(), lr=0.0005)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

# Training function
def train(model, data_loader, criterion, optimizer, scheduler, num_epochs=10, pad_idx=0):
    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0
        for src_batch, tgt_batch in data_loader:
            src_batch = src_batch.to(device)
            tgt_batch = tgt_batch.to(device)

            tgt_input = tgt_batch[:, :-1]
            tgt_output = tgt_batch[:, 1:]

            # Forward pass
            output = model(src_batch, tgt_input)

            # Reshape output and target for loss calculation
            output = output.reshape(-1, output.shape[-1])
            tgt_output = tgt_output.reshape(-1)

            loss = criterion(output, tgt_output)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()

            # Accumulate loss
            epoch_loss += loss.item()

        scheduler.step()
        avg_epoch_loss = epoch_loss / len(data_loader)
        print(f'Epoch {epoch + 1}, Loss: {avg_epoch_loss:.4f}')

# Evaluation function
def evaluate(model, data_loader, pad_idx=0):
    model.eval()
    total_bleu_score = 0
    total_sentences = 0

    with torch.no_grad():
        for src_batch, tgt_batch in data_loader:
            src_batch = src_batch.to(device)
            tgt_batch = tgt_batch.to(device)

            tgt_input = tgt_batch[:, :-1]
            tgt_output = tgt_batch[:, 1:]

            # Generate prediction
            output = model(src_batch, tgt_input)
            output = output.argmax(dim=-1)

            # Calculate BLEU for each sentence
            for i in range(tgt_output.size(0)):
                reference = [tgt_output[i].tolist()]
                candidate = output[i].tolist()
                # Remove padding tokens
                reference = [[token for token in ref if token != pad_idx] for ref in reference]
                candidate = [token for token in candidate if token != pad_idx]
                total_bleu_score += sentence_bleu(reference, candidate)
                total_sentences += 1

    avg_bleu_score = total_bleu_score / total_sentences
    print(f'Average BLEU Score: {avg_bleu_score:.4f}')

# 7. Training Execution

# Train the model
train(model, train_loader, criterion, optimizer, scheduler, num_epochs=10, pad_idx=0)

# Evaluate the model
evaluate(model, val_loader, pad_idx=0)

# Optional: Save the model
torch.save(model.state_dict(), 'transformer_model.pth')


Using device: cuda
Sample English Sentence: Hi.
Sample Arabic Sentence: مرحبًا.
Epoch 1, Loss: 5.8839
Epoch 2, Loss: 3.3133
Epoch 3, Loss: 1.7921
Epoch 4, Loss: 1.1230
Epoch 5, Loss: 0.7994
Epoch 6, Loss: 0.5994
Epoch 7, Loss: 0.4718
Epoch 8, Loss: 0.3761
Epoch 9, Loss: 0.3071
Epoch 10, Loss: 0.2537


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  output = torch._nested_tensor_from_mask(
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Average BLEU Score: 0.2313
